DOI: 10.1145/3582514.3582517
Research article · Open access

Harmonic CUDA: Asynchronous Programming on GPUs

Published: 25 February 2023

Abstract

We introduce Harmonic CUDA, a dataflow programming model for GPUs that allows programmers to describe algorithms as a dependency graph of producers and consumers, where data flows continuously through the graph for the duration of the kernel. This makes it easier for programmers to exploit asynchrony, warp specialization, and hardware acceleration. Using Harmonic CUDA, we implement two example applications: matrix multiplication and GraphSage. The matrix multiplication kernel demonstrates how a key kernel can be decomposed into more granular building blocks; it achieves a geometric mean of 80% of cuBLAS performance, and up to 92% when small matrices are omitted, and we analyze how its performance can be improved in the future. GraphSage shows how asynchrony and warp specialization can provide significant performance improvements while reusing the same building blocks as the matrix multiplication kernel: a warp-specialized version runs 34% faster than a bulk-synchronous implementation. This paper evaluates the strengths and weaknesses of Harmonic CUDA based on these test cases and suggests future work to improve the programming model.
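The dataflow structure the abstract describes, producer and consumer stages connected by buffers with data streaming through for the lifetime of the kernel, can be illustrated with a CPU-side analogy. This is a minimal sketch of the general producer/consumer pattern, not the paper's actual API; all names here (`run_pipeline`, the tile loop, the buffer) are hypothetical stand-ins:

```cpp
// CPU analogy of a producer/consumer dataflow stage pair (hedged sketch,
// not Harmonic CUDA's API): a "load" stage streams tiles into a staging
// buffer while a "compute" stage consumes them, mirroring how a
// warp-specialized kernel overlaps data movement with math.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

int run_pipeline() {
    std::queue<int> buffer;            // stand-in for a shared-memory staging buffer
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    long long acc = 0;

    std::thread producer([&] {         // "load" stage: streams tiles in
        for (int tile = 1; tile <= 9; ++tile) {
            std::lock_guard<std::mutex> lk(m);
            buffer.push(tile);
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    std::thread consumer([&] {         // "compute" stage: consumes tiles
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [&] { return !buffer.empty() || done; });
            if (buffer.empty() && done) break;
            int tile = buffer.front();
            buffer.pop();
            acc += tile * tile;        // placeholder for the per-tile work
        }
    });

    producer.join();
    consumer.join();
    return static_cast<int>(acc);      // sum of squares of tiles 1..9
}
```

In Harmonic CUDA itself the stages would presumably be warps within a single kernel communicating through on-chip buffers and hardware synchronization, rather than OS threads and a mutex; the analogy only shows the producer/consumer decomposition that the model exposes.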


Cited By

  • (2024) An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA. Parallel Computing 120, 103084. DOI: 10.1016/j.parco.2024.103084. Online publication date: Jun 2024.


    Published In

    PMAM'23: Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores
    February 2023
    73 pages
    ISBN: 9798400701153
    DOI: 10.1145/3582514
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. CUDA
    3. programming model
    4. asynchronous
    5. GEMM
    6. GraphSage

    Qualifiers

    • Research-article

    Conference

    PMAM'23

    Acceptance Rates

    Overall acceptance rate: 53 of 97 submissions (55%)

    Article Metrics

    • Downloads (last 12 months): 631
    • Downloads (last 6 weeks): 52

    Reflects downloads up to 02 Oct 2024
