Automatic data placement into GPU on-chip memory resources

Authors:

Huiyang Zhou

CGO '15: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization

Pages 23 - 33

Published: 07 February 2015

Abstract

Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge.

In this paper, we tackle this problem with compiler-driven automatic data placement. We focus on programs that have already been reasonably optimized, either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across the different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Across the 12 benchmarks in our study, our proposed compiler algorithms improve performance by 1.76x on average on the NVIDIA GTX480 and by 1.61x on average on the GTX680.
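
To make the portability concern concrete, the following is a minimal sketch (not the paper's algorithm) of how a data-placement choice can change the number of threads resident on one multiprocessor. All names (`SmLimits`, `resident_blocks`, `occupancy_table`), the two sets of resource limits, and the per-placement register and shared-memory figures are illustrative assumptions. The model ignores allocation granularity and cache effects, and resident thread count is only a rough proxy for performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    """Per-multiprocessor limits (illustrative assumptions, not vendor data)."""
    registers: int      # 32-bit registers available per multiprocessor
    shared_bytes: int   # shared memory bytes per multiprocessor
    max_threads: int    # maximum resident threads per multiprocessor
    max_blocks: int     # maximum resident thread blocks per multiprocessor


def resident_blocks(sm, threads_per_block, regs_per_thread, shared_per_block):
    """Return how many thread blocks fit on one multiprocessor at once."""
    by_threads = sm.max_threads // threads_per_block
    by_regs = sm.registers // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    if shared_per_block:
        by_shared = sm.shared_bytes // shared_per_block
    else:
        by_shared = sm.max_blocks
    return min(sm.max_blocks, by_threads, by_regs, by_shared)


# Hypothetical limits for an older and a newer GPU generation.
GEN_A = SmLimits(registers=32768, shared_bytes=48 * 1024,
                 max_threads=1536, max_blocks=8)
GEN_B = SmLimits(registers=65536, shared_bytes=48 * 1024,
                 max_threads=2048, max_blocks=16)

# Two hypothetical placements of the same working set:
#   "regs":   data kept in registers      -> (60 regs/thread, 0 B shared)
#   "shared": data moved to shared memory -> (24 regs/thread, 12 KiB shared)
PLACEMENTS = {"regs": (60, 0), "shared": (24, 12 * 1024)}


def occupancy_table(threads_per_block=256):
    """Map (generation, placement) to resident threads per multiprocessor."""
    table = {}
    for gen_name, sm in (("gen_a", GEN_A), ("gen_b", GEN_B)):
        for name, (regs, shm) in PLACEMENTS.items():
            blocks = resident_blocks(sm, threads_per_block, regs, shm)
            table[(gen_name, name)] = blocks * threads_per_block
    return table


if __name__ == "__main__":
    for key, threads in sorted(occupancy_table().items()):
        print(key, threads)
```

Under these assumed numbers, moving data to shared memory doubles the resident threads on the older configuration, while on the newer one both placements reach the same count. This is the kind of generation-dependent trade-off that motivates revising placement automatically.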

Index Terms

Automatic data placement into GPU on-chip memory resources
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

CGO '15: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 2015

280 pages

ISBN:9781479981618

General Chairs: Kunle Olukotun (Stanford University), Aaron Smith (Microsoft Research)

Program Chairs: Robert Hundt (Google), Jason Mars (University of Michigan)

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
ACM: Association for Computing Machinery
IEEE Computer Society TC-uARCH
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS\DATC: IEEE Computer Society

Publisher

IEEE Computer Society

United States

Qualifiers

Research-article

Conference

CGO '15

CGO '15: 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 7 - 11, 2015

San Francisco, California

Acceptance Rates

CGO '15 Paper Acceptance Rate 24 of 88 submissions, 27%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Cited By

Kim H, Hong S, Lee H, Seo E, Han H (2019). Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. Online publication date: 5-Aug-2019.
https://dl.acm.org/doi/10.1145/3337821.3337886
Goens A, Brauckmann A, Ertel S, Cummins C, Leather H, Castrillon J, Mattson T, Muzahid A, Solar-Lezama A (2019). A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 38-46. Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3315508.3329976
Kloosterman J, Beaumont J, Jamshidi D, Bailey J, Mudge T, Mahlke S, Hunter H, Moreno J, Emer J, Sanchez D (2017). Regless. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 151-164. Online publication date: 14-Oct-2017.
https://dl.acm.org/doi/10.1145/3123939.3123974
Weber N, Goesele M (2017). MATOG. ACM Transactions on Architecture and Code Optimization 14:3, pp. 1-26. Online publication date: 30-Aug-2017.
https://dl.acm.org/doi/10.1145/3106341
Jatala V, Anantpur J, Karkare A (2017). Scratchpad Sharing in GPUs. ACM Transactions on Architecture and Code Optimization 14:2, pp. 1-29. Online publication date: 26-May-2017.
https://dl.acm.org/doi/10.1145/3075619
Vijaykumar N, Hsieh K, Pekhimenko G, Khan S, Shrestha A, Ghose S, Jog A, Gibbons P, Mutlu O, Hsu W, Yang C, Lipasti M, Lee H (2016). Zorua. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1-14. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195656
Shen D, Liu X, Lin F (2016). Characterizing emerging heterogeneous memory. ACM SIGPLAN Notices 51:11, pp. 13-23. Online publication date: 14-Jun-2016.
https://dl.acm.org/doi/10.1145/3241624.2926702
Shen D, Liu X, Lin F, Flood C, Zhang Z (2016). Characterizing emerging heterogeneous memory. Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management, pp. 13-23. Online publication date: 14-Jun-2016.
https://dl.acm.org/doi/10.1145/2926697.2926702
Xie X, Liang Y, Li X, Wu Y, Sun G, Wang T, Fan D, Prvulovic M (2015). Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. Proceedings of the 48th International Symposium on Microarchitecture, pp. 395-406. Online publication date: 5-Dec-2015.
https://dl.acm.org/doi/10.1145/2830772.2830813
Wu B, Chen G, Li D, Shen X, Vetter J, Bhuyan L, Chong F, Sarkar V (2015). Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations. Proceedings of the 29th ACM on International Conference on Supercomputing, pp. 119-130. Online publication date: 8-Jun-2015.
https://dl.acm.org/doi/10.1145/2751205.2751213
