research-article
DOI: 10.1145/2716282.2716283

Adaptive GPU cache bypassing

Published: 07 February 2015

Abstract

Modern graphics processing units (GPUs) include hardware-controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general purpose GPU (GPGPU) computing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized caches, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches between many threads. Caching these streaming data structures needlessly burns power while evicting data that may otherwise fit into the cache.

We propose a GPU cache management technique to improve the efficiency of small GPU caches while further reducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoiding needless insertions and evictions while avoiding cache pollution, resulting in better performance. We show that, with a 16 KB L1 data cache, dynamic bypassing achieves similar performance to a double-sized L1 cache while reducing energy consumption by 25% and power by 18%.

The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.
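The bypass policy the abstract describes can be sketched in miniature. The Python model below is our own illustration, not the paper's actual hardware predictor: it pairs a small fully associative LRU cache with a per-address saturating counter that learns which blocks get evicted without ever being reused, and routes such "dead-on-fill" blocks around the cache on their next miss. The class name, counter scheme, and threshold are all assumptions made for the sketch.

```python
from collections import OrderedDict

class BypassingCache:
    """Toy fully associative LRU cache with adaptive bypassing.

    A 2-bit saturating counter per block address tracks how often that
    block was evicted without reuse; once the counter reaches the
    threshold, misses on that address bypass the cache entirely.
    Illustrative only -- real designs predict per-PC/per-signature
    in hardware, not per-address in a dictionary.
    """

    def __init__(self, num_lines, threshold=2):
        self.num_lines = num_lines
        self.threshold = threshold
        self.lines = OrderedDict()   # addr -> reused-since-fill flag, in LRU order
        self.dead_ctr = {}           # addr -> 2-bit saturating "dead block" counter
        self.hits = self.misses = self.bypasses = 0

    def access(self, addr):
        if addr in self.lines:
            self.hits += 1
            self.lines[addr] = True              # block proved useful after fill
            self.lines.move_to_end(addr)         # LRU promotion
            self.dead_ctr[addr] = max(0, self.dead_ctr.get(addr, 0) - 1)
            return "hit"
        self.misses += 1
        if self.dead_ctr.get(addr, 0) >= self.threshold:
            self.bypasses += 1                   # predicted dead: skip insertion
            return "bypass"
        if len(self.lines) >= self.num_lines:
            victim, reused = self.lines.popitem(last=False)
            if not reused:                       # evicted untouched: train toward "dead"
                self.dead_ctr[victim] = min(3, self.dead_ctr.get(victim, 0) + 1)
        self.lines[addr] = False                 # insert, not yet reused
        return "fill"
```

On a cyclic trace larger than the cache (for example, three passes over 8 addresses with 4 lines), plain LRU thrashes and misses on every access; in this model the counters saturate after two passes, the first half of the working set starts bypassing, and the blocks left resident begin to hit, which is the effect the paper's technique targets.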




Published In

GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
February 2015, 120 pages
ISBN: 9781450334075
DOI: 10.1145/2716282

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. bypassing
2. graphics processing unit cache
3. prediction

Qualifiers

• Research-article

Conference

GPGPU-8

Acceptance Rates

Overall Acceptance Rate: 57 of 129 submissions, 44%

Cited By

• (2024) Cross-Core Data Sharing for Energy-Efficient GPUs. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3653019. Online publication date: 18-Mar-2024.
• (2024) GPU Performance Optimization via Intergroup Cache Cooperation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(11):4142-4153. DOI: 10.1109/TCAD.2024.3443707. Online publication date: Nov-2024.
• (2024) Nearest data processing in GPU. Sustainable Computing: Informatics and Systems, 44:101047. DOI: 10.1016/j.suscom.2024.101047. Online publication date: Dec-2024.
• (2023) Incorporating Structural Information into Legal Case Retrieval. ACM Transactions on Information Systems, 42(2):1-28. DOI: 10.1145/3609796. Online publication date: 8-Nov-2023.
• (2023) The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead. ACM Transactions on Architecture and Code Optimization, 20(3):1-25. DOI: 10.1145/3600089. Online publication date: 19-Jul-2023.
• (2023) Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters. ACM Transactions on Architecture and Code Optimization, 20(3):1-24. DOI: 10.1145/3593055. Online publication date: 19-Jul-2023.
• (2023) Turn-based Spatiotemporal Coherence for GPUs. ACM Transactions on Architecture and Code Optimization, 20(3):1-27. DOI: 10.1145/3593054. Online publication date: 19-Jul-2023.
• (2023) ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes. ACM Transactions on Architecture and Code Optimization, 20(3):1-24. DOI: 10.1145/3587480. Online publication date: 19-Jul-2023.
• (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 124-136. DOI: 10.1109/PACT58117.2023.00019. Online publication date: 21-Oct-2023.
• (2022) A Case for Fine-grain Coherence Specialization in Heterogeneous Systems. ACM Transactions on Architecture and Code Optimization, 19(3):1-26. DOI: 10.1145/3530819. Online publication date: 22-Aug-2022.
