Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Cooperative Caching for GPUs

Published: 12 December 2016 Publication History

Abstract

The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad factors, such as cache thrashing and extensive multithreading. Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in the L2 access path therefore results in high memory access latencies. In memory-intensive applications, these latencies get exposed due to a lack of active compute threads to mask such high latencies.
In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by an average of 29%, freeing up the bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. In doing so, the CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement with our proposed CCN compares favorably to the performance improvement achieved by simply doubling the number of L2 banks by up to 34%.

References

[1]
Manuel E. Acacio, José González, José M. García, and José Duato. 2002. Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (SC’02). IEEE, Los Alamitos, CA, 1--12. http://dl.acm.org/citation.cfm?id=762761.762762
[2]
Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen, and John Wickerson. 2015. GPU concurrency: Weak behaviours and programming assumptions. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15). ACM, New York, NY, 577--591.
[3]
Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’09). IEEE, Los Alamitos, CA, 163--174.
[4]
L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. 2000. Piranha: A scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th International Symposium on Computer Architecture. 282--293.
[5]
Bradford M. Beckmann and David A. Wood. 2004. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-37). IEEE, Los Alamitos, CA, 319--330.
[6]
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David Brooks. 2014. HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. In Proceeding of the 41st Annual International Symposium on Computer Architecture (ISCA’14). IEEE, Los Alamitos, CA, 217--228. http://dl.acm.org/citation.cfm?id=2665671.2665705
[7]
Jichuan Chang and Gurindar S. Sohi. 2006. Cooperative caching for chip multiprocessors. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA’06). IEEE, Los Alamitos, CA, 264--276.
[8]
Jichuan Chang and Gurindar S. Sohi. 2007. Cooperative cache partitioning for chip multiprocessors. In Proceedings of the 21st Annual International Conference on Supercomputing (ICS’07). ACM, New York, NY, 242--252.
[9]
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC’09). IEEE, Los Alamitos, CA, 44--54.
[10]
Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu. 2014. Adaptive cache management for energy-efficient GPU computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE, Los Alamitos, CA, 343--355.
[11]
Hyojin Choi, Jaewoo Ahn, and Wonyong Sung. 2012. Reducing off-chip memory traffic by selective cache management scheme in GPGPUs. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5). ACM, New York, NY, 110--119.
[12]
George Chrysos. 2012. Intel Xeon Phi Coprocessor—The Architecture. Technical Report. Intel Corporation. https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner.
[13]
Saumay Dublish, Vijay Nagarajan, and Nigel Topham. 2016. Characterizing memory bottlenecks in GPGPU workloads. In Proceedings of the 2016 IEEE International Symposium on Workload Characterization (IISWC’16).
[14]
Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero, and Alexander V. Veidenbaum. 2012. Improving cache management policies using dynamic reuse distances. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE, Los Alamitos, CA, 389--400.
[15]
Jayesh Gaur, Mainak Chaudhuri, and Sreenivas Subramoney. 2011. Bypass and insertion algorithms for exclusive last-level caches. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA’11). ACM, New York, NY, 81--92.
[16]
Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. 2008. Mars: A MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT’08). ACM, New York, NY, 260--269.
[17]
Enric Herrero, José González, and Ramon Canal. 2008. Distributed cooperative caching. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT’08). ACM, New York, NY, 134--143.
[18]
Mark A. Holliday and Michael Stumm. 1994. Performance evaluation of hierarchical ring-based shared memory multiprocessors. IEEE Transactions on Computers 43, 1, 52--67.
[19]
Wenhao Jia, Kelly A. Shaw, and Margaret Martonosi. 2012. Characterizing and improving the use of demand-fetched caches in GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 15--24.
[20]
Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). ACM, New York, NY, 395--406.
[21]
Adwait Jog, Onur Kayiran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2016. Exploiting core criticality for enhanced GPU performance. In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science. 351--363.
[22]
Stefanos Kaxiras and Georgios Keramidas. 2010. SARC coherence: Scaling directory cache coherence in performance and power. IEEE Micro 30, 5, 54--65.
[23]
Mohammad Mahdi Keshtegar, Hajar Falahati, and Shaahin Hessabi. 2015. Cluster-based approach for improving graphics processing unit performance by inter streaming multiprocessors locality. IET Computers and Digital Techniques 9, 5, 275--282. http://digital-library.theiet.org/content/journals/10.1049/iet-cdt.2014.0092.
[24]
P. Kongetira, K. Aingaran, and K. Olukotun. 2005. Niagara: A 32-way multithreaded sparc processor. IEEE Micro 25, 2, 21--29.
[25]
Alvin R. Lebeck and David A. Wood. 1995. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA’95). ACM, New York, NY, 48--59.
[26]
Jaekyu Lee and Hyesoon Kim. 2012. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In Proceedings of the 2012 IEEE 18th International Symposium on High Performance Computer Architecture (HPCA’12). IEEE, Los Alamitos, CA, 1--12.
[27]
Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA’13). ACM, New York, NY, 487--498.
[28]
Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-driven dynamic GPU cache bypassing. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS’15). ACM, New York, NY, 67--77.
[29]
Lingda Li, Ari B. Hayes, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Tag-split cache for efficient GPGPU cache utilization. In Proceedings of the 2016 International Conference on Supercomputing (ICS’16). ACM, New York, NY, Article No. 43.
[30]
Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. 2012. Why on-chip cache coherence is here to stay. Communications of the ACM 55, 7, 78--89.
[31]
Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). ACM, New York, NY, 308--317.
[32]
NVIDIA. 2009. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. Technical Report. NVIDIA Corporation. http://www.nvidia.co.uk/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[33]
NVIDIA. 2012. NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110. Technical Report. NVIDIA Corporation. http://www.nvidia.co.uk/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[34]
NVIDIA. 2014. CUDA by Example—Errata Page. Retrieved October 27, 2016, from http://developer.nvidia.com/cuda-example-errata-page.
[35]
NVIDIA. 2016. Parallel Thread Execution ISA, Version 5.0. Retrieved October 27, 2016, from http://docs.nvidia.com/cuda/parallel-thread-execution.
[36]
Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M. Beckmann, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. 2013. Heterogeneous system coherence for integrated CPU-GPU systems. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, 457--467.
[37]
Govindan Ravindran and Michael Stumm. 1997. A performance comparison of hierarchical ring- and mesh-connected multiprocessor networks. In Proceedings of the 3rd IEEE Symposium on High Performance Computer Architecture (HPCA’97). IEEE, Los Alamitos, CA, 58. http://dl.acm.org/citation.cfm?id=548716.822685
[38]
Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. 2013. A locality-aware memory hierarchy for energy-efficient GPU architectures. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, 86--98.
[39]
Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE, Los Alamitos, CA, 72--83.
[40]
Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt. 2013. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, 99--110.
[41]
Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, et al. 2008. Larrabee: A many-core x86 architecture for visual computing. In ACM SIGGRAPH 2008 Papers (SIGGRAPH’08). ACM, New York, NY, Article No. 18.
[42]
Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O’Connor, and Tor M. Aamodt. 2013. Cache coherence for GPU architectures. In Proceedings of the 2013 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’13). 578--590.
[43]
John A. Stratton, Christopher Rodrigrues, I.-Jui Sung, Nady Obeid, Liwen Chang, Geng Liu, and Wen-Mei W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report IMPACT-12-01. University of Illinois at Urbana-Champaign, Urbana.
[44]
David Tarjan, Jiayuan Meng, and Kevin Skadron. 2009. Increasing memory miss tolerance for SIMD cores. In Proceedings of the Conference on High Performance Computing Networking, Storage, and Analysis (SC’09). ACM, New York, NY, Article No. 22.
[45]
David Tarjan and Kevin Skadron. 2010. The sharing tracker: Using ideas from cache coherence hardware to reduce off-chip memory traffic with non-coherent caches. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’10). IEEE, Los Alamitos, CA, 1--10.
[46]
Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. 2012. CPU-assisted GPGPU on fused CPU-GPU architectures. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture (HPCA’12). IEEE, Los Alamitos, CA, 1--12.
[47]
A. Yazdanbakhsh, B. Thwaites, H. Esmaeilzadeh, G. Pekhimenko, O. Mutlu, and T. C. Mowry. 2016. Mitigating the memory bottleneck with approximate load value prediction. IEEE Design and Test 33, 1, 32--42.

Cited By

View all
  • (2024)Cross-Core Data Sharing for Energy-Efficient GPUsACM Transactions on Architecture and Code Optimization10.1145/3653019Online publication date: 18-Mar-2024
  • (2024)A Survey of Caching Techniques for General Purpose Graphics Processing Units2024 3rd International Conference for Innovation in Technology (INOCON)10.1109/INOCON60754.2024.10512116(1-6)Online publication date: 1-Mar-2024
  • (2024)Collaborative Coalescing of Redundant Memory Access for GPU System2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC)10.1109/ASP-DAC58780.2024.10473837(195-200)Online publication date: 22-Jan-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 13, Issue 4
December 2016
648 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3012405
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 December 2016
Accepted: 01 September 2016
Revised: 01 August 2016
Received: 01 June 2016
Published in TACO Volume 13, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPGPU
  2. bandwidth bottlenecks
  3. intercore reuse

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)141
  • Downloads (Last 6 weeks)27
Reflects downloads up to 19 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Cross-Core Data Sharing for Energy-Efficient GPUsACM Transactions on Architecture and Code Optimization10.1145/3653019Online publication date: 18-Mar-2024
  • (2024)A Survey of Caching Techniques for General Purpose Graphics Processing Units2024 3rd International Conference for Innovation in Technology (INOCON)10.1109/INOCON60754.2024.10512116(1-6)Online publication date: 1-Mar-2024
  • (2024)Collaborative Coalescing of Redundant Memory Access for GPU System2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC)10.1109/ASP-DAC58780.2024.10473837(195-200)Online publication date: 22-Jan-2024
  • (2023)COLABProceedings of the 28th Asia and South Pacific Design Automation Conference10.1145/3566097.3567838(314-319)Online publication date: 16-Jan-2023
  • (2023)ATA-Cache: Contention Mitigation for GPU Shared L1 Cache With Aggregated Tag ArrayIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.333719243:5(1429-1441)Online publication date: 29-Nov-2023
  • (2023)Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT)10.1109/PACT58117.2023.00019(124-136)Online publication date: 21-Oct-2023
  • (2022)Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU SystemsProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1145/3559009.3569649(304-316)Online publication date: 8-Oct-2022
  • (2022)NURAProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/35080366:1(1-27)Online publication date: 28-Feb-2022
  • (2022)Future Scaling of Memory Hierarchy for Tensor Cores and Eliminating Redundant Shared Memory Traffic Using Inter-Warp MulticastingIEEE Transactions on Computers10.1109/TC.2022.3207134(1-12)Online publication date: 2022
  • (2022)Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO56248.2022.00029(228-244)Online publication date: Oct-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media