DOI: 10.1145/3026937.3026942
research-article

Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views

Published: 04 February 2017

Abstract

In this paper, we advocate a composable approach to programming systems with Graphics Processing Units (GPUs): programs are developed as compositions of generic, reusable patterns. Current GPU programming approaches either rely on low-level, monolithic code without patterns (CUDA and OpenCL), which achieves high performance at the cost of cumbersome and error-prone programming, or they improve programmability by using pattern-based abstractions (e.g., Thrust) but pay a performance penalty due to inefficient implementations of pattern composition.
We develop an API for GPU programming based on C++ with STL-style patterns, together with its compiler-based implementation. Our API gives application developers native C++ means (views and actions) to specify precisely which pattern compositions should be automatically fused during code generation into a single efficient GPU kernel, thereby ensuring high target performance. We implement our approach by extending the range-v3 library, which is currently being developed for the forthcoming C++ standards. Composable programming in our approach is done exclusively in standard C++14, with STL algorithms used as patterns, which we re-implemented in parallel for GPUs. Our compiler implementation is based on the LLVM and Clang frameworks, and we use advanced multi-stage programming techniques for aggressive runtime optimizations.
We experimentally evaluate our approach using a set of benchmark applications and a real-world case study from the area of image processing. Our code achieves performance competitive with monolithic CUDA implementations, and it outperforms pattern-based code written using Nvidia's Thrust.
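The distinction between eager actions and lazy views drawn in the abstract can be illustrated with a small CPU-only sketch in standard C++14. This is not the paper's API and does not use range-v3: the names `VectorRef`, `TransformView`, `eager_transform`, `lazy_transform`, `sum`, `pipeline_eager`, and `pipeline_lazy` are hypothetical stand-ins invented here. The sketch only assumes that an eager step materializes its result immediately, while a lazy view defers work until a consumer iterates, so that a chain of views collapses into one loop (the CPU analogue of fusing a composition into a single kernel).

```cpp
// Illustrative sketch only: hand-rolled stand-ins, not the paper's API or range-v3.
#include <cassert>
#include <cstddef>
#include <vector>

// Eager "action" (assumed semantics): runs at once and allocates a result buffer.
template <typename F>
std::vector<int> eager_transform(const std::vector<int>& in, F f) {
    std::vector<int> out;
    out.reserve(in.size());
    for (int x : in) out.push_back(f(x));
    return out;
}

// Non-owning view over a vector; cheap to copy.
struct VectorRef {
    const std::vector<int>* data;
    std::size_t size() const { return data->size(); }
    int operator[](std::size_t i) const { return (*data)[i]; }
};

// Lazy "view" (assumed semantics): stores the source view and the function,
// and computes each element only when it is requested.
template <typename Source, typename F>
struct TransformView {
    Source src;  // held by value, so nested views never dangle
    F f;
    std::size_t size() const { return src.size(); }
    auto operator[](std::size_t i) const { return f(src[i]); }
};

template <typename Source, typename F>
TransformView<Source, F> lazy_transform(Source src, F f) {
    return TransformView<Source, F>{src, f};
}

// Consumer: one loop over whatever range-like object it is given.
template <typename Range>
long sum(const Range& r) {
    long acc = 0;
    for (std::size_t i = 0; i < r.size(); ++i) acc += r[i];
    return acc;
}

// Eager composition: three passes over the data and two temporary buffers.
long pipeline_eager(const std::vector<int>& v) {
    auto doubled = eager_transform(v, [](int x) { return x * 2; });
    auto shifted = eager_transform(doubled, [](int x) { return x + 1; });
    return sum(VectorRef{&shifted});
}

// Lazy composition: both transforms are applied inside the single loop in
// sum(), with no intermediate buffers.
long pipeline_lazy(const std::vector<int>& v) {
    auto view = lazy_transform(
        lazy_transform(VectorRef{&v}, [](int x) { return x * 2; }),
        [](int x) { return x + 1; });
    return sum(view);
}
```

Both pipelines compute the same value; for the input `{1, 2, 3, 4}` each returns 24. The difference lies only in how many passes and temporaries are involved, which is the kind of overhead the abstract attributes to inefficient pattern composition.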


Cited By

  • (2024) Distributed Ranges: A Model for Distributed Data Structures, Algorithms, and Views. Proceedings of the 38th ACM International Conference on Supercomputing, pages 236-246. DOI: 10.1145/3650200.3656632. Online publication date: 30-May-2024
  • (2018) Introducing Parallelism to the Ranges TS. Proceedings of the International Workshop on OpenCL, pages 1-5. DOI: 10.1145/3204919.3204936. Online publication date: 14-May-2018

    Published In

    PMAM'17: Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
    February 2017
    84 pages
    ISBN: 9781450348836
    DOI: 10.1145/3026937

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    PPoPP '17
    Acceptance Rates

    PMAM'17 Paper Acceptance Rate 7 of 14 submissions, 50%;
    Overall Acceptance Rate 53 of 97 submissions, 55%
