An Efficient Compiler Framework for Cache Bypassing on GPUs

Published: 01 October 2015

Abstract

Graphics processing units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Early GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in recent generations of GPUs. GPU caches are highly configurable: the programmer or compiler can explicitly direct each global load instruction to access or bypass the cache. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance of general purpose GPU applications. To achieve this goal, we first characterize GPU cache utilization and develop performance metrics that estimate cache reuse and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we present techniques to explore the unified cache and shared memory design space. We integrate these techniques into an automatic compiler framework that leverages the parallel thread execution (PTX) instruction set architecture to enable cache bypassing for GPUs. Experimental evaluation on an NVIDIA GTX680 using a variety of applications demonstrates that, compared to cache-all and bypass-all solutions, our techniques improve performance by 4.6% to 13.1% for a 16 KB L1 cache.
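The per-load control the abstract refers to is exposed at the PTX level through load cache operators: ld.global.ca caches a load at all levels (L1 and L2), while ld.global.cg bypasses L1 and caches only at the global (L2) level. The sketch below is a minimal illustration of these documented PTX semantics, not the paper's framework itself; the kernel saxpy_mixed and its reuse assumptions are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Load through L1 and L2 (PTX "ca" cache operator: cache at all levels).
__device__ __forceinline__ float load_cached(const float* addr) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

// Bypass L1 and cache at L2 only (PTX "cg" cache operator).
__device__ __forceinline__ float load_bypass(const float* addr) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

// Hypothetical kernel: loads of 'a' (assumed to have cache reuse) go
// through L1, while loads of 'b' (assumed streaming) bypass L1 so that
// streaming traffic does not evict reusable data.
__global__ void saxpy_mixed(float* out, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * load_cached(&a[i]) + load_bypass(&b[i]);
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // On Kepler-class GPUs the unified L1/shared-memory split can also be
    // tuned per kernel; preferring L1 enlarges the cache for kernels that
    // benefit from it (the design space the paper's last step explores).
    cudaFuncSetCacheConfig(saxpy_mixed, cudaFuncCachePreferL1);

    saxpy_mixed<<<(n + 255) / 256, 256>>>(out, a, b, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect 2*1.0 + 2.0 = 4.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Coarser-grained control also exists: compiling with -Xptxas -dlcm=cg makes all global loads bypass L1, which corresponds to the bypass-all baseline the paper compares against.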




          Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 34, Issue 10
          Oct. 2015
          163 pages

          Publisher

          IEEE Press


          Author Tags

1. performance
2. cache bypassing
3. compiler
4. graphics processing unit (GPU)

          Qualifiers

          • Research-article


          Cited By

• (2023) ATA-Cache: Contention Mitigation for GPU Shared L1 Cache With Aggregated Tag Array. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 5, pp. 1429–1441. DOI: 10.1109/TCAD.2023.3337192
• (2022) A Case for Fine-grain Coherence Specialization in Heterogeneous Systems. ACM Transactions on Architecture and Code Optimization, vol. 19, no. 3, pp. 1–26. DOI: 10.1145/3530819
• (2022) Criticality-aware priority to accelerate GPU memory access. The Journal of Supercomputing, vol. 79, no. 1, pp. 188–213. DOI: 10.1007/s11227-022-04657-3
• (2020) Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles. Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1–12. DOI: 10.1145/3392717.3392761
• (2018) Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory. ACM Transactions on Embedded Computing Systems, vol. 17, no. 4, pp. 1–25. DOI: 10.1145/3230643
