Efficient management of last-level caches in graphics processors for 3D scene rendering workloads

Authors: Jayesh Gaur, Raghuram Srinivasan, Sreenivas Subramoney, Mainak Chaudhuri

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 395 - 407

https://doi.org/10.1145/2540708.2540742

Published: 07 December 2013

Abstract

Three-dimensional (3D) scene rendering is implemented as a pipeline in graphics processing units (GPUs). Different stages of the pipeline access different types of data, including vertex, depth, stencil, render target (same as pixel color), and texture sampler data. GPUs traditionally include small caches for vertex, render target, depth, and stencil data as well as multi-level caches for the texture sampler units. The recent introduction of reasonably large last-level caches (LLCs) shared among these data streams in discrete as well as integrated graphics hardware architectures has opened up new opportunities for improving 3D rendering. GPUs equipped with such large LLCs can exploit far-flung intra- and inter-stream reuses. However, there is no comprehensive study that can help graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams.

In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC) that dynamically learns the reuse probabilities and accordingly manages the LLC of the GPU. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC saves up to 29.6% and on average 13.1% LLC misses across 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in the LLC misses result in a speedup of up to 18.2% and on average 8.0%. On a 16 MB LLC, the average speedup achieved by GSPC further improves to 11.8% compared to DRRIP.

