
DOI:10.1145/2884045.2884049
Research article

Multi-stage programming for GPUs in C++ using PACXX

Published: 12 March 2016

Abstract

Writing and optimizing programs for high performance on systems with Graphics Processing Units (GPUs) remains a challenging task even for expert programmers. A promising optimization technique is multi-stage programming -- evaluating parts of the program upfront on the CPU and embedding the computed values in the GPU code, thus allowing for more aggressive compiler optimizations. Unfortunately, such optimizations are not possible in CUDA, whereas to apply them in OpenCL, programmers are forced to manipulate the GPU source code as plain strings, which is error-prone and type-unsafe.
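To make the string-manipulation pitfall concrete, here is a small illustrative sketch (not taken from the paper): specializing an OpenCL kernel on a value known only at runtime means splicing that value into the kernel source text on the host, where neither the spelling nor the type of the spliced expression is checked by the host compiler.

    // Illustrative only: staging a runtime value into an OpenCL kernel by
    // string pasting. The loop bound becomes a literal in the generated
    // source, so the OpenCL compiler can unroll the loop -- but the host
    // compiler cannot check the resulting kernel for type or syntax errors.
    #include <string>

    std::string make_vadd_source(int size) {
        const std::string n = std::to_string(size);
        return "__kernel void vadd(__global const float* a,\n"
               "                   __global const float* b,\n"
               "                   __global float* c) {\n"
               "  int gid = get_global_id(0);\n"
               "  for (int i = 0; i < " + n + "; ++i)\n"
               "    c[gid * " + n + " + i] = a[gid * " + n + " + i]"
               " + b[gid * " + n + " + i];\n"
               "}\n";
    }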
In this paper, we describe PACXX -- our approach to GPU programming in C++, with the convenient features of the modern C++14 standard: type deduction, lambda expressions, and algorithms from the Standard Template Library (STL). Using PACXX, a GPU program is written as a single C++ program, rather than two distinct host and kernel programs. We extend PACXX with an easy-to-use and type-safe API for multi-stage programming avoiding the pitfalls of string manipulation. Using just-in-time compilation techniques, PACXX generates efficient GPU code at runtime.
Our evaluation shows that PACXX makes writing multi-stage code easier and safer than is currently possible in CUDA or OpenCL. With two application studies we demonstrate that multi-stage programs can significantly outperform equivalent non-staged versions. Furthermore, we show that PACXX generates high-performance code, comparable to that of industrial-strength OpenCL compilers.
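
As a rough, self-contained illustration of the staging idea, the sketch below stubs two hypothetical primitives, stage and launch (the names are illustrative assumptions, not the verbatim PACXX API). In a real multi-stage setting the lambda would be JIT-compiled for the GPU and the staged value embedded into the kernel as a constant, letting the compiler fully unroll the inner loop; here both primitives run on the CPU so the example is executable as plain C++.

    // Hypothetical sketch of multi-stage specialization in plain C++14.
    // `stage` and `launch` are stand-ins: on a staged GPU system the lambda
    // would be JIT-compiled and the staged value baked in as a literal.
    #include <cstddef>
    #include <vector>

    template <typename T>
    T stage(T host_value) { return host_value; }      // stand-in for first-stage evaluation

    template <typename F>
    void launch(F&& kernel, std::size_t threads) {     // stand-in: one "thread" per index
        for (std::size_t gid = 0; gid < threads; ++gid)
            kernel(gid);
    }

    int main() {
        const int size = 4;                            // known before the kernel runs
        std::vector<float> a(16, 1.0f), b(16, 2.0f), c(16);

        launch([&](std::size_t gid) {
            const int n = stage(size);                 // would become a literal in GPU code,
            for (int i = 0; i < n; ++i)                // enabling full unrolling of this loop
                c[gid * n + i] = a[gid * n + i] + b[gid * n + i];
        }, a.size() / size);

        return c[0] == 3.0f ? 0 : 1;                   // sanity check: 1.0f + 2.0f
    }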



Published In

GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit
March 2016
107 pages
ISBN:9781450341950
DOI:10.1145/2884045

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPUs
  2. modern C++
  3. multi-stage programming
  4. runtime code generation
  5. runtime optimization

Qualifiers

  • Research-article

Conference

GPGPU '16 (co-located with PPoPP '16)

Acceptance Rates

GPGPU '16 Paper Acceptance Rate: 9 of 23 submissions (39%)
Overall Acceptance Rate: 57 of 129 submissions (44%)



