Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views

Authors: Michel Steuwer, Tim Humernbrum, Sergei Gorlatch

PMAM'17: Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores

Pages 58 - 67

https://doi.org/10.1145/3026937.3026942

Published: 04 February 2017

Abstract

In this paper, we advocate a composable approach to programming systems with Graphics Processing Units (GPUs): programs are developed as compositions of generic, reusable patterns. Current GPU programming approaches either rely on low-level, monolithic code without patterns (CUDA and OpenCL), which achieves high performance at the cost of cumbersome and error-prone programming, or they improve programmability by using pattern-based abstractions (e.g., Thrust) but pay a performance penalty due to inefficient implementations of pattern composition.

We develop an API for GPU programming based on C++ with STL-style patterns, together with its compiler-based implementation. Our API gives application developers native C++ means (views and actions) to specify precisely which pattern compositions should be automatically fused during code generation into a single efficient GPU kernel, thereby ensuring high target performance. We implement our approach by extending the range-v3 library, which is currently being developed for the forthcoming C++ standards. Composable programming in our approach is done exclusively in standard C++14, with STL algorithms used as patterns, which we re-implemented in parallel for GPUs. Our compiler implementation is based on the LLVM and Clang frameworks, and we use advanced multi-stage programming techniques for aggressive runtime optimizations.
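
To illustrate the distinction between eager and lazy composition that the abstract draws, the following is a minimal CPU-only sketch in standard C++14. It is not the paper's API and does not use range-v3 or generate GPU code; the names `ZipWithView`, `zip_with`, `reduce_sum`, `sum_of_products_eager`, and `sum_of_products_fused` are hypothetical and exist only for this example. The eager variant materializes an intermediate result between two patterns, while the lazy variant wraps the inputs in a view so that both patterns run in one loop, which is the kind of fusion the abstract describes.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Eager style: each pattern runs immediately and stores its result.
// This costs two passes over memory and one temporary vector.
inline float sum_of_products_eager(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        tmp[i] = a[i] * b[i];                              // pass 1: element-wise product
    }
    return std::accumulate(tmp.begin(), tmp.end(), 0.0f);  // pass 2: sum
}

// Hypothetical lazy view: stores references to the inputs plus a function,
// and computes element i only when it is requested.
template <typename F>
struct ZipWithView {
    const std::vector<float>& a;
    const std::vector<float>& b;
    F f;
    std::size_t size() const { return a.size(); }
    float operator[](std::size_t i) const { return f(a[i], b[i]); }
};

// Builds the view without touching any element.
template <typename F>
ZipWithView<F> zip_with(const std::vector<float>& a,
                        const std::vector<float>& b, F f) {
    assert(a.size() == b.size());
    return ZipWithView<F>{a, b, f};
}

// Consumes any view-like object; the product is evaluated inside this loop,
// so the two patterns are fused into a single pass with no temporary.
template <typename View>
float reduce_sum(const View& v) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        acc += v[i];
    }
    return acc;
}

inline float sum_of_products_fused(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    return reduce_sum(zip_with(a, b, [](float x, float y) { return x * y; }));
}
```

Both functions compute the same value for equal-length inputs; the difference is only in how many passes and temporaries are needed. In this sketch the fusion happens through ordinary inlining on the CPU, whereas the abstract describes fusion into a single GPU kernel during code generation.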

We experimentally evaluate our approach using a set of benchmark applications and a real-world case study from the area of image processing. Our codes achieve performance competitive with monolithic CUDA implementations, and we outperform pattern-based codes written using Nvidia's Thrust.

Cited By

Brock B, Cohn R, Bakshi S, Karna T, Kim J, Nowak M, Ślusarczyk Ł, Stefanski K, Mattson T (2024). Distributed Ranges: A Model for Distributed Data Structures, Algorithms, and Views. Proceedings of the 38th ACM International Conference on Supercomputing, pages 236-246. DOI: 10.1145/3650200.3656632. Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3650200.3656632
Brown G, Di Bella C, Haidl M, Remmelg T, Reyes R, Steuwer M (2018). Introducing Parallelism to the Ranges TS. Proceedings of the International Workshop on OpenCL, pages 1-5. DOI: 10.1145/3204919.3204936. Online publication date: 14-May-2018.
https://dl.acm.org/doi/10.1145/3204919.3204936

1. Software and its engineering
  1. Software notations and tools

Published In

PMAM'17: Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores

February 2017

84 pages

ISBN:9781450348836

DOI:10.1145/3026937

Editors:
Quan Chen (Shanghai Jiao Tong University, China),
Zhiyi Huang (University of Otago, New Zealand)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

PPoPP '17

Sponsor:

SIGPLAN

PPoPP '17: 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 4 - 8, 2017

Austin, TX, USA

Acceptance Rates

PMAM'17 Paper Acceptance Rate 7 of 14 submissions, 50%;

Overall Acceptance Rate 53 of 97 submissions, 55%
