research-article
DOI: 10.1145/2716282.2716283

Adaptive GPU cache bypassing

Published: 07 February 2015

Abstract

Modern graphics processing units (GPUs) include hardware-controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are inefficient for general purpose GPU (GPGPU) computing. GPGPU workloads tend to include data structures that would not fit in any reasonably sized caches, leading to very low cache hit rates. This problem is exacerbated by the design of current GPUs, which share small caches between many threads. Caching these streaming data structures needlessly burns power while evicting data that may otherwise fit into the cache.

We propose a GPU cache management technique to improve the efficiency of small GPU caches while further reducing their power consumption. It adaptively bypasses the GPU cache for blocks that are unlikely to be referenced again before being evicted. This technique saves energy by avoiding needless insertions and evictions while avoiding cache pollution, resulting in better performance. We show that, with a 16 KB L1 data cache, dynamic bypassing achieves similar performance to a double-sized L1 cache while reducing energy consumption by 25% and power by 18%.

The technique is especially interesting for programs that do not use programmer-managed scratchpad memories. We give a case study to demonstrate the inefficiency of current GPU caches compared to programmer-managed scratchpad memories and show the extent to which cache bypassing can make up for the potential performance loss where the effort to program scratchpad memories is impractical.
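The bypass policy the abstract describes can be sketched in miniature. The Python model below is our own illustration, not the paper's actual hardware predictor: it pairs a small fully associative LRU cache with a per-address saturating counter that learns which blocks get evicted without ever being reused, and routes such "dead-on-fill" blocks around the cache on their next miss. The class name, counter scheme, and threshold are all assumptions made for the sketch.

```python
from collections import OrderedDict

class BypassingCache:
    """Toy fully associative LRU cache with adaptive bypassing.

    A 2-bit saturating counter per block address tracks how often that
    block was evicted without reuse; once the counter reaches the
    threshold, misses on that address bypass the cache entirely.
    Illustrative only -- real designs predict per-PC/per-signature
    in hardware, not per-address in a dictionary.
    """

    def __init__(self, num_lines, threshold=2):
        self.num_lines = num_lines
        self.threshold = threshold
        self.lines = OrderedDict()   # addr -> reused-since-fill flag, in LRU order
        self.dead_ctr = {}           # addr -> 2-bit saturating "dead block" counter
        self.hits = self.misses = self.bypasses = 0

    def access(self, addr):
        if addr in self.lines:
            self.hits += 1
            self.lines[addr] = True              # block proved useful after fill
            self.lines.move_to_end(addr)         # LRU promotion
            self.dead_ctr[addr] = max(0, self.dead_ctr.get(addr, 0) - 1)
            return "hit"
        self.misses += 1
        if self.dead_ctr.get(addr, 0) >= self.threshold:
            self.bypasses += 1                   # predicted dead: skip insertion
            return "bypass"
        if len(self.lines) >= self.num_lines:
            victim, reused = self.lines.popitem(last=False)
            if not reused:                       # evicted untouched: train toward "dead"
                self.dead_ctr[victim] = min(3, self.dead_ctr.get(victim, 0) + 1)
        self.lines[addr] = False                 # insert, not yet reused
        return "fill"
```

On a cyclic trace larger than the cache (for example, three passes over 8 addresses with 4 lines), plain LRU thrashes and misses on every access; in this model the counters saturate after two passes, the first half of the working set starts bypassing, and the blocks left resident begin to hit, which is the effect the paper's technique targets.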




Published In

GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
February 2015, 120 pages
ISBN: 9781450334075
DOI: 10.1145/2716282

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. bypassing
2. graphics processing unit cache
3. prediction

Qualifiers

• Research-article

Conference

GPGPU-8

Acceptance Rates

Overall Acceptance Rate: 57 of 129 submissions, 44%

Cited By

• (2024) Cross-Core Data Sharing for Energy-Efficient GPUs. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3653019. Online publication date: 18-Mar-2024.
• (2024) GPU Performance Optimization via Intergroup Cache Cooperation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(11):4142-4153. DOI: 10.1109/TCAD.2024.3443707. Online publication date: Nov-2024.
• (2024) Nearest data processing in GPU. Sustainable Computing: Informatics and Systems, 44:101047. DOI: 10.1016/j.suscom.2024.101047. Online publication date: Dec-2024.
• (2023) Incorporating Structural Information into Legal Case Retrieval. ACM Transactions on Information Systems, 42(2):1-28. DOI: 10.1145/3609796. Online publication date: 8-Nov-2023.
• (2023) The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead. ACM Transactions on Architecture and Code Optimization, 20(3):1-25. DOI: 10.1145/3600089. Online publication date: 19-Jul-2023.
• (2023) Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters. ACM Transactions on Architecture and Code Optimization, 20(3):1-24. DOI: 10.1145/3593055. Online publication date: 19-Jul-2023.
• (2023) Turn-based Spatiotemporal Coherence for GPUs. ACM Transactions on Architecture and Code Optimization, 20(3):1-27. DOI: 10.1145/3593054. Online publication date: 19-Jul-2023.
• (2023) ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes. ACM Transactions on Architecture and Code Optimization, 20(3):1-24. DOI: 10.1145/3587480. Online publication date: 19-Jul-2023.
• (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 124-136. DOI: 10.1109/PACT58117.2023.00019. Online publication date: 21-Oct-2023.
• (2022) A Case for Fine-grain Coherence Specialization in Heterogeneous Systems. ACM Transactions on Architecture and Code Optimization, 19(3):1-26. DOI: 10.1145/3530819. Online publication date: 22-Aug-2022.
