An Efficient Compiler Framework for Cache Bypassing on GPUs

Published: 01 October 2015

Abstract

Graphics processing units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Early GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in recent generations of GPUs. GPU caches are highly configurable: the programmer or compiler can explicitly direct each global load instruction to access or bypass the cache. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance of general purpose GPU applications. To achieve this goal, we first characterize GPU cache utilization and develop performance metrics that estimate cache reuse and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we present techniques to explore the unified cache and shared memory design space. We integrate these techniques into an automatic compiler framework that leverages the parallel thread execution (PTX) instruction set architecture to enable cache bypassing for GPUs. Experimental evaluation on an NVIDIA GTX680 using a variety of applications demonstrates that, compared to cache-all and bypass-all solutions, our techniques improve performance by 4.6% to 13.1% for a 16 KB L1 cache.
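The per-load control the abstract refers to is exposed at the PTX level through load cache operators: ld.global.ca caches a load at all levels (L1 and L2), while ld.global.cg bypasses L1 and caches only at the global (L2) level. The sketch below is a minimal illustration of these documented PTX semantics, not the paper's framework itself; the kernel saxpy_mixed and its reuse assumptions are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Load through L1 and L2 (PTX "ca" cache operator: cache at all levels).
__device__ __forceinline__ float load_cached(const float* addr) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

// Bypass L1 and cache at L2 only (PTX "cg" cache operator).
__device__ __forceinline__ float load_bypass(const float* addr) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

// Hypothetical kernel: loads of 'a' (assumed to have cache reuse) go
// through L1, while loads of 'b' (assumed streaming) bypass L1 so that
// streaming traffic does not evict reusable data.
__global__ void saxpy_mixed(float* out, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * load_cached(&a[i]) + load_bypass(&b[i]);
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // On Kepler-class GPUs the unified L1/shared-memory split can also be
    // tuned per kernel; preferring L1 enlarges the cache for kernels that
    // benefit from it (the design space the paper's last step explores).
    cudaFuncSetCacheConfig(saxpy_mixed, cudaFuncCachePreferL1);

    saxpy_mixed<<<(n + 255) / 256, 256>>>(out, a, b, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect 2*1.0 + 2.0 = 4.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Coarser-grained control also exists: compiling with -Xptxas -dlcm=cg makes all global loads bypass L1, which corresponds to the bypass-all baseline the paper compares against.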




          Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 34, Issue 10
          Oct. 2015
          163 pages

          Publisher

          IEEE Press


          Author Tags

1. performance
2. cache bypassing
3. compiler
4. graphics processing unit (GPU)

          Qualifiers

          • Research-article


          Cited By

• (2023) ATA-Cache: Contention Mitigation for GPU Shared L1 Cache With Aggregated Tag Array. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 5, pp. 1429–1441. DOI: 10.1109/TCAD.2023.3337192
• (2022) A Case for Fine-grain Coherence Specialization in Heterogeneous Systems. ACM Transactions on Architecture and Code Optimization, vol. 19, no. 3, pp. 1–26. DOI: 10.1145/3530819
• (2022) Criticality-aware priority to accelerate GPU memory access. The Journal of Supercomputing, vol. 79, no. 1, pp. 188–213. DOI: 10.1007/s11227-022-04657-3
• (2020) Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles. Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1–12. DOI: 10.1145/3392717.3392761
• (2018) Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory. ACM Transactions on Embedded Computing Systems, vol. 17, no. 4, pp. 1–25. DOI: 10.1145/3230643
