DOI: 10.1109/MICRO.2014.51

Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache

Published: 13 December 2014

Abstract

Recent research advocates large die-stacked DRAM caches in many-core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches.
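To make the tag-storage concern concrete, the sketch below estimates the metadata footprint of an 8GB cache managed at block and at page granularity. The per-entry metadata sizes (8 bytes per 64B block, 24 bytes per 4KB page) are illustrative assumptions rather than figures from the paper, and the helper names are hypothetical.

```python
# Back-of-the-envelope tag-storage estimate.
# All per-entry metadata sizes are illustrative assumptions,
# not figures taken from the paper.

GIB = 1024 ** 3
MIB = 1024 ** 2


def num_entries(cache_bytes: int, unit_bytes: int) -> int:
    """Number of allocation units (blocks or pages) the cache holds."""
    return cache_bytes // unit_bytes


def tag_storage_bytes(cache_bytes: int, unit_bytes: int, meta_bytes: int) -> int:
    """Total metadata size for an assumed per-entry metadata size."""
    return num_entries(cache_bytes, unit_bytes) * meta_bytes


if __name__ == "__main__":
    cache = 8 * GIB
    # Assumed: 8 B of tag metadata per 64 B block (block-based design).
    block_tags = tag_storage_bytes(cache, 64, 8)
    # Assumed: 24 B per 4 KB page (tag plus per-block valid/dirty bit vectors).
    page_tags = tag_storage_bytes(cache, 4096, 24)
    print(f"block-based: {block_tags / MIB:.0f} MiB")
    print(f"page-based:  {page_tags / MIB:.0f} MiB")
```

Under these assumptions the page-based tags come to 48 MiB, in line with the tens of MBs mentioned above, while per-block tags would reach 1 GiB.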
We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduce tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.
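The idea of allocating at page granularity while fetching only the predicted blocks can be illustrated with a toy model. This is a minimal sketch and not the Unison Cache design: the direct-mapped organization, the class name ToyFootprintCache, and the predictor (which simply replays the last footprint observed for a page) are all simplifying assumptions made for illustration, since the abstract does not describe the actual predictor or tag layout.

```python
# Toy model only: a direct-mapped, page-granularity cache that fetches a
# predicted subset of blocks on a page miss. The predictor (last observed
# footprint per page) is an assumption made for illustration.

BLOCKS_PER_PAGE = 64  # 4 KB page / 64 B block


class ToyFootprintCache:
    def __init__(self, num_frames: int):
        self.num_frames = num_frames
        self.tags = [None] * num_frames   # page number held by each frame
        self.valid = [0] * num_frames     # bitmask of blocks present
        self.touched = [0] * num_frames   # bitmask of blocks demanded so far
        self.history = {}                 # page -> last observed footprint
        self.hits = 0
        self.blocks_fetched = 0           # off-chip traffic, counted in blocks

    def access(self, block_addr: int) -> bool:
        """Access one block address; return True on a cache hit."""
        page, offset = divmod(block_addr, BLOCKS_PER_PAGE)
        frame = page % self.num_frames
        bit = 1 << offset

        if self.tags[frame] != page:
            # Page miss: record the victim's footprint, then allocate.
            if self.tags[frame] is not None:
                self.history[self.tags[frame]] = self.touched[frame]
            predicted = self.history.get(page, 0) | bit
            self.tags[frame] = page
            self.valid[frame] = predicted
            self.touched[frame] = bit
            self.blocks_fetched += bin(predicted).count("1")
            return False

        self.touched[frame] |= bit
        if self.valid[frame] & bit:
            self.hits += 1
            return True

        # Page present but block was not predicted: fetch just that block.
        self.valid[frame] |= bit
        self.blocks_fetched += 1
        return False


if __name__ == "__main__":
    cache = ToyFootprintCache(num_frames=1)
    # Touch three blocks of page 0, evict it with page 1, then revisit page 0.
    for addr in [0, 1, 2, 64, 0, 1, 2]:
        cache.access(addr)
    print(cache.hits, cache.blocks_fetched)
```

On this short trace the toy cache records 2 hits and fetches 7 blocks in total, whereas fetching whole pages on each of the three page misses would move 192 blocks. The numbers only illustrate the traffic trade-off and say nothing about the performance of the real design.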

    Published In

    MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2014
    697 pages
    ISBN: 9781479969982

    Publisher

    IEEE Computer Society

    United States

    Author Tags

    1. 3D die stacking
    2. DRAM
    3. caches
    4. memory
    5. servers

    Qualifiers

    • Tutorial
    • Research
    • Refereed limited

    Conference

    MICRO-47
    Acceptance Rates

    MICRO-47 Paper Acceptance Rate 53 of 279 submissions, 19%;
    Overall Acceptance Rate 484 of 2,242 submissions, 22%

    Cited By

    • (2024) Trimma: Trimming Metadata Storage and Latency for Hybrid Memory Systems. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 108-120. DOI: 10.1145/3656019.3689612. Online publication date: 14-Oct-2024
    • (2024) HMComp: Extending Near-Memory Capacity using Compression in Hybrid Memory. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 74-84. DOI: 10.1145/3650200.3656612. Online publication date: 30-May-2024
    • (2023) MC-ELMM: Multi-Chip Endurance-Limited Memory Management. Proceedings of the International Symposium on Memory Systems, pp. 1-16. DOI: 10.1145/3631882.3631905. Online publication date: 2-Oct-2023
    • (2022) A Practical Shared Optical Cache With Hybrid MWSR/R-SWMR NoC for Multicore Processors. ACM Journal on Emerging Technologies in Computing Systems 18:4, pp. 1-28. DOI: 10.1145/3531012. Online publication date: 13-Oct-2022
    • (2022) Software Hint-Driven Data Management for Hybrid Memory in Mobile Systems. ACM Transactions on Embedded Computing Systems 21:1, pp. 1-18. DOI: 10.1145/3494536. Online publication date: 14-Jan-2022
    • (2022) An Energy-Efficient DRAM Cache Architecture for Mobile Platforms With PCM-Based Main Memory. ACM Transactions on Embedded Computing Systems 21:1, pp. 1-22. DOI: 10.1145/3451995. Online publication date: 14-Jan-2022
    • (2021) Offline and Online Algorithms for SSD Management. Proceedings of the ACM on Measurement and Analysis of Computing Systems 5:3, pp. 1-28. DOI: 10.1145/3491045. Online publication date: 15-Dec-2021
    • (2021) Reliability-aware Garbage Collection for Hybrid HBM-DRAM Memories. ACM Transactions on Architecture and Code Optimization 18:1, pp. 1-25. DOI: 10.1145/3431803. Online publication date: 20-Jan-2021
    • (2021) Rebooting virtual memory with midgard. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 512-525. DOI: 10.1109/ISCA52012.2021.00047. Online publication date: 14-Jun-2021
    • (2020) On-the-fly Page Migration and Address Reconciliation for Heterogeneous Memory Systems. ACM Journal on Emerging Technologies in Computing Systems 16:1, pp. 1-27. DOI: 10.1145/3364179. Online publication date: 9-Jan-2020
