Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache

Authors:

Djordje Jevdjic,

Gabriel H. Loh,

Cansu Kaynak,

Babak Falsafi

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 25 - 37

https://doi.org/10.1109/MICRO.2014.51

Published: 13 December 2014

Abstract

Recent research advocates large die-stacked DRAM caches in many-core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches.
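The page-granularity, footprint-fetching idea described above can be illustrated with a toy model. This is a minimal sketch, not the authors' design: the class name `ToyFootprintCache`, the `predict_footprint` hook, the unbounded capacity (no eviction), and counting traffic in blocks are all assumptions made for illustration. Only the 64B block and 4KB page sizes come from the abstract.

```python
from typing import Callable, Dict

BLOCK_BYTES = 64                               # block size quoted in the abstract
PAGE_BYTES = 4096                              # page size quoted in the abstract
BLOCKS_PER_PAGE = PAGE_BYTES // BLOCK_BYTES    # 64 blocks per page


class ToyFootprintCache:
    """Toy model: allocate whole pages, but fetch only predicted blocks.

    Hypothetical illustration only. It has no capacity limit or eviction,
    and the predictor is an arbitrary callable supplied by the caller.
    """

    def __init__(self, predict_footprint: Callable[[int], int]):
        # predict_footprint(page_number) returns a bitmask of blocks to fetch.
        self.predict_footprint = predict_footprint
        self.valid: Dict[int, int] = {}   # page number -> bitmask of present blocks
        self.blocks_fetched = 0           # off-chip traffic, counted in blocks

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss (fetching as a side effect)."""
        page = addr // PAGE_BYTES
        bit = 1 << ((addr % PAGE_BYTES) // BLOCK_BYTES)
        if page not in self.valid:
            # Page miss: allocate the page and fetch the predicted blocks,
            # always including the block that was actually requested.
            mask = self.predict_footprint(page) | bit
            self.valid[page] = mask
            self.blocks_fetched += bin(mask).count("1")
            return False
        if self.valid[page] & bit:
            return True
        # Page is allocated but this block was not predicted: fetch it alone.
        self.valid[page] |= bit
        self.blocks_fetched += 1
        return False


# Usage: a fixed predictor that always guesses the first four blocks of a page.
cache = ToyFootprintCache(lambda page: 0b1111)
results = [cache.access(a) for a in (0, 64, 192, 256, 256)]
```

In this run the first access misses and brings in four blocks, the next two hit, the access to the unpredicted fifth block misses and fetches one more block, and the repeat access hits. Five blocks cross the off-chip interface, where fetching the whole page would have moved 64.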

We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and a reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.
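The scaling argument can be checked with some back-of-the-envelope arithmetic. The sketch below uses the abstract's figures (64B blocks, 4KB pages, an 8GB cache, roughly 50MB of SRAM tags) and assumes binary units (1KB = 1024 bytes); the helper names are hypothetical, and the per-entry metadata size is derived from those figures, not stated in the abstract.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30   # assumption: binary units


def num_entries(cache_bytes: int, unit_bytes: int) -> int:
    """Number of allocation units, and hence tag entries, in the cache."""
    return cache_bytes // unit_bytes


def tag_storage_bytes(cache_bytes: int, unit_bytes: int,
                      metadata_bytes_per_entry: int) -> int:
    """Total tag storage if every entry carries the given metadata size."""
    return num_entries(cache_bytes, unit_bytes) * metadata_bytes_per_entry


cache = 8 * GIB
pages = num_entries(cache, 4 * KIB)    # tags needed at 4KB page granularity
blocks = num_entries(cache, 64)        # tags needed at 64B block granularity

# Per-page metadata budget implied by "around 50MB" of SRAM tags at 8GB.
implied_bytes_per_page = (50 * MIB) / pages
```

Under these assumptions an 8GB cache holds 2,097,152 pages versus 134,217,728 blocks, so page granularity needs 64 times fewer tag entries. Spreading about 50MB over those pages works out to roughly 25 bytes of metadata per page, and since the entry count grows linearly with capacity, so does on-chip tag storage.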
