research-article

Characterizing and improving the use of demand-fetched caches in GPUs

Authors:

Margaret Martonosi

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing

Pages 15 - 24

https://doi.org/10.1145/2304576.2304582

Published: 25 June 2012

Abstract

Initially introduced as special-purpose accelerators for games and graphics code, graphics processing units (GPUs) have emerged as widely used high-performance parallel computing platforms. GPUs traditionally provided only software-managed local memories (or scratchpads) instead of demand-fetched caches. Increasingly, however, GPUs are being used in broader application domains where memory access patterns are both harder to analyze and harder to manage in software-controlled caches. In response, GPU vendors have included sizable demand-fetched caches in recent chip designs. Nonetheless, several problems remain. First, since these hardware caches are quite new and highly configurable, it can be difficult to know when and how to use them; they sometimes degrade performance instead of improving it. Second, since GPU programming is quite distinct from general-purpose programming, application programmers do not yet have solid intuition about which memory reference patterns are amenable to demand-fetched caches.

In response, this paper characterizes application performance on GPUs with caches and provides a taxonomy for reasoning about different types of access patterns and locality. Based on this taxonomy, we present an algorithm that can be automated and applied at compile time to identify an application's memory access patterns and to use that information to configure cache usage so that application performance improves. Experiments on real GPU systems show that our algorithm reliably predicts when GPU caches will help or hurt performance. Compared to always passively turning caches on, our method can increase the average benefit of caches from 5.8% to 18.0% for applications that have significant performance sensitivity to caching.
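The abstract does not describe the algorithm itself, so the following is only an illustrative sketch of the general idea of summarizing access patterns to decide whether caching is likely to help. It is not the authors' method. The function names, the 128-byte line size, the `reuse_ratio` metric, and the `min_reuse` threshold are all assumptions made for this example.

```python
from collections import Counter

# Assumed cache-line size in bytes; a hypothetical parameter for this sketch.
LINE_BYTES = 128


def lines_touched(addresses, line_bytes=LINE_BYTES):
    """Return the set of cache-line indices covered by one warp-wide access."""
    return {addr // line_bytes for addr in addresses}


def classify_accesses(warp_accesses, line_bytes=LINE_BYTES):
    """Summarize a sequence of warp-wide accesses.

    warp_accesses is a list of lists; each inner list holds the byte
    addresses requested by the threads of one warp for one load.
    """
    line_counts = Counter()
    total_lines = 0
    for addresses in warp_accesses:
        lines = lines_touched(addresses, line_bytes)
        total_lines += len(lines)
        line_counts.update(lines)
    if total_lines == 0:
        # No accesses recorded: report neutral values instead of dividing by zero.
        return {"avg_lines_per_access": 0.0, "reuse_ratio": 0.0}
    return {
        # Lines fetched per warp access: 1 is fully coalesced, higher is scattered.
        "avg_lines_per_access": total_lines / len(warp_accesses),
        # Fraction of line requests that revisit a line seen elsewhere in the trace.
        "reuse_ratio": 1.0 - len(line_counts) / total_lines,
    }


def should_cache(summary, min_reuse=0.25):
    """Hypothetical rule: enable caching only when line reuse is high enough."""
    return summary["reuse_ratio"] >= min_reuse


# Streaming pattern: each warp reads its own line once, so nothing is reused.
streaming = [[w * LINE_BYTES + 4 * t for t in range(32)] for w in range(4)]
# Shared pattern: every warp reads the same line, so reuse is high.
shared = [[4 * t for t in range(32)] for _ in range(4)]

print(should_cache(classify_accesses(streaming)))  # False
print(should_cache(classify_accesses(shared)))     # True
```

A real compile-time analysis would derive such summaries from index expressions rather than from recorded addresses. The trace-based form is used here only to keep the example small and runnable.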



Index Terms

Characterizing and improving the use of demand-fetched caches in GPUs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data


Published In

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing

June 2012

400 pages

ISBN:9781450313162

DOI:10.1145/2304576

General Chairs: Utpal Banerjee (University of California at Irvine, USA), Kyle A. Gallivan (Florida State University, USA)

Program Chairs: Gianfranco Bilardi (Università degli Studi di Padova, Italy), Manolis G.H. Katevenis (FORTH and University of Crete, Greece)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS'12

Sponsor:

SIGARCH

ICS'12: International Conference on Supercomputing

June 25 - 29, 2012

San Servolo Island, Venice, Italy

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

