DOI: 10.1145/2304576.2304582

Characterizing and improving the use of demand-fetched caches in GPUs

Published: 25 June 2012

Abstract

Initially introduced as special-purpose accelerators for games and graphics code, graphics processing units (GPUs) have emerged as widely used high-performance parallel computing platforms. GPUs traditionally provided only software-managed local memories (or scratchpads) instead of demand-fetched caches. Increasingly, however, GPUs are being used in broader application domains where memory access patterns are both harder to analyze and harder to manage in software-controlled caches. In response, GPU vendors have included sizable demand-fetched caches in recent chip designs. Nonetheless, several problems remain. First, because these hardware caches are quite new and highly configurable, it can be difficult to know when and how to use them; they sometimes degrade performance instead of improving it. Second, because GPU programming is quite distinct from general-purpose programming, application programmers do not yet have solid intuition about which memory reference patterns are amenable to demand-fetched caches.
To address these problems, this paper characterizes application performance on GPUs with caches and provides a taxonomy for reasoning about different types of access patterns and locality. Based on this taxonomy, we present an algorithm that can be automated and applied at compile time to identify an application's memory access patterns and to use that information to configure cache usage so that application performance improves. Experiments on real GPU systems show that our algorithm reliably predicts when GPU caches will help or hurt performance. Compared to always passively turning caches on, our method can increase the average benefit of caches from 5.8% to 18.0% for applications that have significant performance sensitivity to caching.
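As a rough illustration of why the access pattern decides whether a demand-fetched cache helps, the sketch below replays three synthetic address streams through a tiny LRU cache and reports the hit rate of each. It is not the paper's taxonomy or its compile-time algorithm, which the abstract does not spell out. The function name `hit_rate`, the fully associative LRU policy, the 4-line capacity, and the 128-byte line size are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, num_lines=4, line_bytes=128):
    """Fraction of accesses that hit in a small, fully associative LRU cache.

    Hypothetical toy model: real GPU caches are set associative and are
    shared by many concurrent threads, which this sketch ignores.
    """
    cache = OrderedDict()  # cache-line tags, ordered least to most recently used
    hits = 0
    for addr in addresses:
        tag = addr // line_bytes
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)  # mark as most recently used
        else:
            cache[tag] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


# Streaming: every access touches a new line, so nothing is ever reused.
streaming = [i * 128 for i in range(64)]

# Reuse: the working set (4 lines) fits in the cache.
reuse = [(i % 4) * 128 for i in range(64)]

# Thrashing: the working set (5 lines) is one line larger than the cache,
# so a cyclic sweep evicts each line just before it is needed again.
thrashing = [(i % 5) * 128 for i in range(64)]

for name, stream in [("streaming", streaming), ("reuse", reuse), ("thrashing", thrashing)]:
    print(f"{name:10s} hit rate = {hit_rate(stream):.4f}")
```

Only the reuse stream benefits from the cache in this model. The streaming and thrashing streams pay the cost of filling cache lines without ever hitting, which is one plausible way a cache can fail to improve performance.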

    Published In

    ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
    June 2012
    400 pages
    ISBN:9781450313162
    DOI:10.1145/2304576
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CUDA
    2. GPGPU
    3. GPU cache
    4. compiler optimization

    Qualifiers

    • Research-article

    Conference

    ICS'12: International Conference on Supercomputing
    June 25 - 29, 2012
    San Servolo Island, Venice, Italy

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Cited By

    • (2024) A Survey of Caching Techniques for General Purpose Graphics Processing Units. 2024 3rd International Conference for Innovation in Technology (INOCON), pp. 1-6. DOI: 10.1109/INOCON60754.2024.10512116. Online publication date: 1-Mar-2024
    • (2023) L2 Cache Access Pattern Analysis using Static Profiling of an Application. 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 97-102. DOI: 10.1109/COMPSAC57700.2023.00022. Online publication date: Jun-2023
    • (2023) Optimization strategies for GPUs: an overview of architectural approaches. International Journal of Parallel, Emergent and Distributed Systems, 38(2), pp. 140-154. DOI: 10.1080/17445760.2023.2173752. Online publication date: 5-Feb-2023
    • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029. Online publication date: Oct-2022
    • (2022) A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads. International Journal of Parallel Programming, 50(2), pp. 189-216. DOI: 10.1007/s10766-022-00729-2. Online publication date: 5-Apr-2022
    • (2021) Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU. Micromachines, 12(10), article 1262. DOI: 10.3390/mi12101262. Online publication date: 17-Oct-2021
    • (2020) GEVO. ACM Transactions on Architecture and Code Optimization, 17(4), pp. 1-28. DOI: 10.1145/3418055. Online publication date: 25-Nov-2020
    • (2020) Selective Caching: Avoiding Performance Valleys in Massively Parallel Architectures. 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 290-298. DOI: 10.1109/PDP50117.2020.00051. Online publication date: Mar-2020
    • (2020) A Quantitative Study of Locality in GPU Caches. Embedded Computer Systems: Architectures, Modeling, and Simulation, pp. 228-242. DOI: 10.1007/978-3-030-60939-9_16. Online publication date: 7-Oct-2020
    • (2019) Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3337821.3337886. Online publication date: 5-Aug-2019
