
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Published: 01 October 2002

Abstract

Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy--while using less silicon area--by 13%, and comes within 13% of an ideal minimal hit latency solution.
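
To make the migration mechanism concrete, the sketch below models one bank set of a dynamic NUCA in which a hit swaps the line one bank closer to the processor, so frequently used lines gravitate toward the fastest banks. The bank count, latency values, one-step swap-on-hit promotion, and fill-into-the-farthest-bank miss handling are illustrative assumptions for this sketch, not the specific mapping, search, or promotion policies the paper evaluates.

```python
# Minimal sketch of one bank set in a dynamic NUCA (D-NUCA).
# Assumptions (not the paper's exact design): 8 banks per set, hit latencies
# that grow with distance from the processor, one-step promotion on a hit,
# and misses filling the farthest (slowest) bank.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BankSet:
    # Hit latency (in cycles) of each bank, nearest bank first.
    latencies: List[int] = field(default_factory=lambda: [4, 6, 8, 10, 13, 17, 26, 41])
    # One cached tag per bank slot; None means the slot is empty.
    lines: List[Optional[int]] = field(init=False)

    def __post_init__(self) -> None:
        self.lines = [None] * len(self.latencies)

    def access(self, tag: int) -> int:
        """Return the latency of accessing `tag`, promoting it one bank on a hit."""
        for i, resident in enumerate(self.lines):
            if resident == tag:
                latency = self.latencies[i]
                if i > 0:  # swap one bank closer to the processor
                    self.lines[i - 1], self.lines[i] = self.lines[i], self.lines[i - 1]
                return latency
        # Miss: fetch from the next level (assumed 200-cycle penalty) and
        # fill the line into the farthest bank, evicting its occupant.
        miss_penalty = 200
        self.lines[-1] = tag
        return miss_penalty + self.latencies[-1]


if __name__ == "__main__":
    s = BankSet()
    for _ in range(5):
        print(s.access(0x7F))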
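```

Running the example, the first access misses, and each subsequent hit moves the line one bank closer, so its latency falls (here from 41 to 26 to 17 to 13 cycles) with continued reuse.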





    Published In

    ACM SIGPLAN Notices, Volume 37, Issue 10
    October 2002
    296 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/605432

    ASPLOS X: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems
    October 2002
    318 pages
    ISBN: 1581135742
    DOI: 10.1145/605397
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 October 2002
    Published in SIGPLAN Volume 37, Issue 10


    Qualifiers

    • Article


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 155
    • Downloads (last 6 weeks): 10
    Reflects downloads up to 10 Nov 2024


    Citations

    Cited By

    • (2024) MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 186-203. DOI: 10.1109/HPCA57654.2024.00024. Online publication date: 2-Mar-2024.
    • (2024) Cache Memory and On-Chip Cache Architecture: A Survey. Advanced Computing, Machine Learning, Robotics and Internet Technologies, pages 126-138. DOI: 10.1007/978-3-031-47221-3_12. Online publication date: 16-Apr-2024.
    • (2019) A Layer-Adaptable Cache Hierarchy by a Multiple-layer Bypass Mechanism. Proceedings of the 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, pages 1-6. DOI: 10.1145/3337801.3337820. Online publication date: 6-Jun-2019.
    • (2019) Make the Most out of Last Level Cache in Intel Processors. Proceedings of the Fourteenth EuroSys Conference 2019, pages 1-17. DOI: 10.1145/3302424.3303977. Online publication date: 25-Mar-2019.
    • (2019) Power Reduction and BTI Mitigation of Data-Cache Memory Based on the Storage Management of Narrow-Width Values. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(7):1675-1684. DOI: 10.1109/TVLSI.2019.2909488. Online publication date: Jul-2019.
    • (2019) NVDL-Cache: Narrow-Width Value Aware Variable Delay Low-Power Data Cache. 2019 IEEE 37th International Conference on Computer Design (ICCD), pages 264-272. DOI: 10.1109/ICCD46524.2019.00040. Online publication date: Nov-2019.
    • (2019) Data Shepherding: A Last Level Cache Design for Large Scale Chips. 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 1920-1927. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00265. Online publication date: Aug-2019.
    • (2019) Load-Balanced Link Distribution in Mesh-Based Many-Core Systems. 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 1028-1034. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00147. Online publication date: Aug-2019.
    • (2018) Enhancing computation-to-core assignment with physical location information. ACM SIGPLAN Notices, 53(4):312-327. DOI: 10.1145/3296979.3192386. Online publication date: 11-Jun-2018.
    • (2018) Enhancing computation-to-core assignment with physical location information. Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 312-327. DOI: 10.1145/3192366.3192386. Online publication date: 11-Jun-2018.
