DOI: 10.5555/2738600.2738604

Automatic data placement into GPU on-chip memory resources

Published: 07 February 2015

Abstract

Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge.
In this paper, we tackle this problem with compiler-driven automatic data placement. We focus on programs that have already been reasonably optimized either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Among 12 benchmarks in our study, our proposed compiler algorithm improves the performance by 1.76x on average on Nvidia GTX480, and by 1.61x on average on GTX680.

Published In

CGO '15: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization
February 2015
280 pages
ISBN: 9781479981618

Publisher

IEEE Computer Society

United States

Qualifiers

  • Research-article

Conference

CGO '15

Acceptance Rates

CGO '15 Paper Acceptance Rate 24 of 88 submissions, 27%;
Overall Acceptance Rate 312 of 1,061 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 16 Feb 2025

Cited By

  • (2019) Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention. Proceedings of the 48th International Conference on Parallel Processing, pp. 1-10. DOI: 10.1145/3337821.3337886. Online publication date: 5-Aug-2019.
  • (2019) A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 38-46. DOI: 10.1145/3315508.3329976. Online publication date: 22-Jun-2019.
  • (2017) Regless. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 151-164. DOI: 10.1145/3123939.3123974. Online publication date: 14-Oct-2017.
  • (2017) MATOG. ACM Transactions on Architecture and Code Optimization 14:3, pp. 1-26. DOI: 10.1145/3106341. Online publication date: 30-Aug-2017.
  • (2017) Scratchpad Sharing in GPUs. ACM Transactions on Architecture and Code Optimization 14:2, pp. 1-29. DOI: 10.1145/3075619. Online publication date: 26-May-2017.
  • (2016) Zorua. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1-14. DOI: 10.5555/3195638.3195656. Online publication date: 15-Oct-2016.
  • (2016) Characterizing emerging heterogeneous memory. ACM SIGPLAN Notices 51:11, pp. 13-23. DOI: 10.1145/3241624.2926702. Online publication date: 14-Jun-2016.
  • (2016) Characterizing emerging heterogeneous memory. Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management, pp. 13-23. DOI: 10.1145/2926697.2926702. Online publication date: 14-Jun-2016.
  • (2015) Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. Proceedings of the 48th International Symposium on Microarchitecture, pp. 395-406. DOI: 10.1145/2830772.2830813. Online publication date: 5-Dec-2015.
  • (2015) Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations. Proceedings of the 29th ACM on International Conference on Supercomputing, pp. 119-130. DOI: 10.1145/2751205.2751213. Online publication date: 8-Jun-2015.
