
DOI:10.1145/2884045.2884049
Research article

Multi-stage programming for GPUs in C++ using PACXX

Published: 12 March 2016

Abstract

Writing and optimizing programs for high performance on systems with Graphics Processing Units (GPUs) remains a challenging task even for expert programmers. A promising optimization technique is multi-stage programming -- evaluating parts of the program upfront on the CPU and embedding the computed values in the GPU code, thus allowing for more aggressive compiler optimizations. Unfortunately, such optimizations are not possible in CUDA, whereas to apply them in OpenCL, programmers are forced to manipulate the GPU source code as plain strings, which is error-prone and type-unsafe.
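To make the string-manipulation pitfall concrete, here is a small illustrative sketch (not taken from the paper): specializing an OpenCL kernel on a value known only at runtime means splicing that value into the kernel source text on the host, where neither the spelling nor the type of the spliced expression is checked by the host compiler.

    // Illustrative only: staging a runtime value into an OpenCL kernel by
    // string pasting. The loop bound becomes a literal in the generated
    // source, so the OpenCL compiler can unroll the loop -- but the host
    // compiler cannot check the resulting kernel for type or syntax errors.
    #include <string>

    std::string make_vadd_source(int size) {
        const std::string n = std::to_string(size);
        return "__kernel void vadd(__global const float* a,\n"
               "                   __global const float* b,\n"
               "                   __global float* c) {\n"
               "  int gid = get_global_id(0);\n"
               "  for (int i = 0; i < " + n + "; ++i)\n"
               "    c[gid * " + n + " + i] = a[gid * " + n + " + i]"
               " + b[gid * " + n + " + i];\n"
               "}\n";
    }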
In this paper, we describe PACXX -- our approach to GPU programming in C++, with the convenient features of the modern C++14 standard: type deduction, lambda expressions, and algorithms from the Standard Template Library (STL). Using PACXX, a GPU program is written as a single C++ program, rather than two distinct host and kernel programs. We extend PACXX with an easy-to-use and type-safe API for multi-stage programming avoiding the pitfalls of string manipulation. Using just-in-time compilation techniques, PACXX generates efficient GPU code at runtime.
Our evaluation shows that PACXX makes writing multi-stage code easier and safer than is currently possible in CUDA or OpenCL. With two application studies we demonstrate that multi-stage programs can significantly outperform equivalent non-staged versions. Furthermore, we show that PACXX generates high-performance code, comparable to that of industrial-strength OpenCL compilers.
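
As a rough, self-contained illustration of the staging idea, the sketch below stubs two hypothetical primitives, stage and launch (the names are illustrative assumptions, not the verbatim PACXX API). In a real multi-stage setting the lambda would be JIT-compiled for the GPU and the staged value embedded into the kernel as a constant, letting the compiler fully unroll the inner loop; here both primitives run on the CPU so the example is executable as plain C++.

    // Hypothetical sketch of multi-stage specialization in plain C++14.
    // `stage` and `launch` are stand-ins: on a staged GPU system the lambda
    // would be JIT-compiled and the staged value baked in as a literal.
    #include <cstddef>
    #include <vector>

    template <typename T>
    T stage(T host_value) { return host_value; }      // stand-in for first-stage evaluation

    template <typename F>
    void launch(F&& kernel, std::size_t threads) {     // stand-in: one "thread" per index
        for (std::size_t gid = 0; gid < threads; ++gid)
            kernel(gid);
    }

    int main() {
        const int size = 4;                            // known before the kernel runs
        std::vector<float> a(16, 1.0f), b(16, 2.0f), c(16);

        launch([&](std::size_t gid) {
            const int n = stage(size);                 // would become a literal in GPU code,
            for (int i = 0; i < n; ++i)                // enabling full unrolling of this loop
                c[gid * n + i] = a[gid * n + i] + b[gid * n + i];
        }, a.size() / size);

        return c[0] == 3.0f ? 0 : 1;                   // sanity check: 1.0f + 2.0f
    }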



Published In

GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit
March 2016
107 pages
ISBN:9781450341950
DOI:10.1145/2884045

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPUs
  2. modern C++
  3. multi-stage programming
  4. runtime code generation
  5. runtime optimization

Qualifiers

  • Research-article

Conference

GPGPU '16 (co-located with PPoPP '16)

Acceptance Rates

GPGPU '16 Paper Acceptance Rate: 9 of 23 submissions (39%)
Overall Acceptance Rate: 57 of 129 submissions (44%)



