DOI: 10.1145/2540708.2540742
research-article

Efficient management of last-level caches in graphics processors for 3D scene rendering workloads

Published: 07 December 2013

Abstract

Three-dimensional (3D) scene rendering is implemented as a pipeline in graphics processing units (GPUs). Different stages of the pipeline access different types of data, including vertex, depth, stencil, render target (same as pixel color), and texture sampler data. GPUs traditionally include small caches for vertex, render target, depth, and stencil data, as well as multi-level caches for the texture sampler units. The recent introduction of reasonably large last-level caches (LLCs) shared among these data streams, in discrete as well as integrated graphics hardware architectures, has opened up new opportunities for improving 3D rendering. GPUs equipped with such large LLCs can exploit far-flung intra- and inter-stream reuses. However, there is no comprehensive study that can help graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams.
In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC), which dynamically learns the reuse probabilities and manages the LLC of the GPU accordingly. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC saves up to 29.6% and on average 13.1% of LLC misses across the 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in LLC misses result in a speedup of up to 18.2% and on average 8.0%. On a 16 MB LLC, the average speedup achieved by GSPC over DRRIP further improves to 11.8%.
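
The abstract does not spell out the GSPC mechanism, so the following is only a minimal Python sketch of the general idea it names: learn a reuse probability per graphics stream and use it to pick the insertion age in a cache set managed with two-bit re-reference prediction values. The class `StreamAwareRRIPSet`, the `threshold` and `min_samples` parameters, and the stream labels are hypothetical illustrations, not the paper's design, and the set dueling that DRRIP relies on is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass

RRPV_MAX = 3  # two-bit re-reference prediction value; 3 = reuse predicted far away


@dataclass
class Line:
    rrpv: int
    stream: str
    reused: bool = False


class StreamAwareRRIPSet:
    """One cache set with RRIP aging and per-stream reuse counters (illustrative)."""

    def __init__(self, ways, threshold=0.25, min_samples=8):
        self.ways = ways
        self.threshold = threshold      # assumed cut-off, not taken from the paper
        self.min_samples = min_samples  # fills to observe before trusting the estimate
        self.lines = {}                 # tag -> Line
        self.fills = defaultdict(int)   # blocks filled, per stream
        self.reused = defaultdict(int)  # filled blocks that saw at least one hit

    def reuse_probability(self, stream):
        n = self.fills[stream]
        if n == 0 or n < self.min_samples:
            return 1.0  # optimistic until enough fills have been observed
        return self.reused[stream] / n

    def _evict(self):
        while True:
            victim = next(
                (t for t, l in self.lines.items() if l.rrpv == RRPV_MAX), None
            )
            if victim is not None:
                del self.lines[victim]
                return
            for line in self.lines.values():  # no candidate yet: age every line
                line.rrpv += 1

    def access(self, tag, stream):
        """Return True on a hit, False on a miss (the block is then filled)."""
        line = self.lines.get(tag)
        if line is not None:
            if not line.reused:  # credit the stream that filled the block
                line.reused = True
                self.reused[line.stream] += 1
            line.rrpv = 0        # promote on hit
            return True
        if len(self.lines) >= self.ways:
            self._evict()
        likely = self.reuse_probability(stream) >= self.threshold
        # Streams with low learned reuse are inserted as immediate eviction candidates.
        self.lines[tag] = Line(RRPV_MAX - 1 if likely else RRPV_MAX, stream)
        self.fills[stream] += 1
        return False
```

In this sketch, a stream that keeps filling blocks that are never hit sees its `reuse_probability` fall to 0.0, so its later fills enter at the distant age and are evicted first, while streams with observed reuse keep the longer insertion age. Crediting a hit to the stream that filled the block is one simple way to fold inter-stream reuse into the same counter; a real design could track it separately.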

    Published In

    MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2013
    498 pages
    ISBN:9781450326384
    DOI:10.1145/2540708
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. 3D scene rendering
    2. caches
    3. graphics processing units

    Acceptance Rates

    MICRO-46 Paper Acceptance Rate 39 of 239 submissions, 16%;
    Overall Acceptance Rate 484 of 2,242 submissions, 22%

    Cited By

    • (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, pages 124-136. DOI: 10.1109/PACT58117.2023.00019. Online publication date: 21-Oct-2023
    • (2023) Optimization strategies for GPUs: an overview of architectural approaches. International Journal of Parallel, Emergent and Distributed Systems, 38:2, pages 140-154. DOI: 10.1080/17445760.2023.2173752. Online publication date: 5-Feb-2023
    • (2022) TCOR: A Tile Cache with Optimal Replacement. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 662-675. DOI: 10.1109/HPCA53966.2022.00055. Online publication date: Apr-2022
    • (2018) CUDAAdvisor: LLVM-based runtime profiling for modern GPUs. Proceedings of the 2018 International Symposium on Code Generation and Optimization - CGO 2018, pages 214-227. DOI: 10.1145/3179541.3168831. Online publication date: 2018
    • (2018) CUDAAdvisor: LLVM-based runtime profiling for modern GPUs. Proceedings of the 2018 International Symposium on Code Generation and Optimization, pages 214-227. DOI: 10.1145/3168831. Online publication date: 24-Feb-2018
    • (2018) The locality descriptor. Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 829-842. DOI: 10.1109/ISCA.2018.00074. Online publication date: 2-Jun-2018
    • (2018) Criticality aware tiered cache hierarchy. Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 96-109. DOI: 10.1109/ISCA.2018.00019. Online publication date: 2-Jun-2018
    • (2018) Tail-PASS: Resource-Based Cache Management for Tiled Graphics Rendering Hardware. 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), pages 55-63. DOI: 10.1109/BDCloud.2018.00022. Online publication date: Dec-2018
    • (2017) Locality-Aware CTA Clustering for Modern GPUs. ACM SIGARCH Computer Architecture News, 45:1, pages 297-311. DOI: 10.1145/3093337.3037709. Online publication date: 4-Apr-2017
    • (2017) Locality-Aware CTA Clustering for Modern GPUs. ACM SIGPLAN Notices, 52:4, pages 297-311. DOI: 10.1145/3093336.3037709. Online publication date: 4-Apr-2017
