research-article

Open access

Modeling and analyzing evaluation cost of CUDA kernels

Authors:

Stefan K. Muller,

Jan HoffmannAuthors Info & Claims

Proceedings of the ACM on Programming Languages, Volume 5, Issue POPL

Article No.: 25, Pages 1 - 31

https://doi.org/10.1145/3434306

Published: 04 January 2021 Publication History

Abstract

General-purpose programming on GPUs (GPGPU) is becoming increasingly in vogue as applications such as machine learning and scientific computing demand high throughput in vector-parallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers.

This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.

References

[1]

Elvira Albert, Puri Arenas, Samir Genaim, Miguel Gómez-Zamalloa, and German Puebla. 2011. Cost Analysis of Concurrent OO Programs. In 9th Asian Symp. on Prog. Langs. and Systems (APLAS'11). https://doi.org/10.1007/978-3-642-25318-8_19

Digital Library

[2]

Ugo Dal Lago and Marco Gaboardi. 2011. Linear Dependent Types and Relative Completeness. In 26th IEEE Symp. on Logic in Computer Science (LICS'11). https://doi.org/10.1109/LICS. 2011.22

Digital Library

[3]

Guodong Li and Ganesh Gopalakrishnan. 2010. SMT-Based Verification of GPU Kernel Functions. In International Symposium on the Foundations of Software Engineering (FSE) (FSE '10). https://doi.org/10.1145/1882291.1882320

Digital Library

[4]

Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakrishnan, Indradeep Ghosh, and Sreeranga P. Rajan. 2012. GKLEE: Concolic verification and test generation for GPUs. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). https://doi.org/10.1145/2370036.2145844

Digital Library

[5]

Stefan K. Muller and Umut A. Acar. 2016. Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '16). ACM, New York, NY, USA, 71-82. https://doi.org/10.1145/2935764.2935793

Digital Library

[6]

Van Chan Ngo, Quentin Carbonneaux, and Jan Hofmann. 2018. Bounded Expectations: Resource Analysis for Probabilistic Programs. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018 ). ACM, New York, NY, USA, 496-512. https://doi.org/10.1145/3192366.3192394

Digital Library

[7]

NVIDIA Corporation. 2019. CUDA C Programming Guide v. 10.1.168.

[8]

Yuanfeng Peng, Vinod Grover, and Joseph Devietti. 2018. CURD: A Dynamic CUDA Race Detector. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018 ). ACM, New York, NY, USA, 390-403. https://doi.org/10.1145/3192366.3192368

Digital Library

[9]

Phillipe Pereira, Higo Albuquerque, Hendrio Marques, Isabela Silva, Celso Carvalho, Lucas Cordeiro, Vanessa Santos, and Ricardo Ferreira. 2016. Verifying CUDA Programs Using SMT-based Context-bounded Model Checking. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 1648-1653. https: //doi.org/10.1145/2851613.2851830

Digital Library

[10]

Ivan Radiček, Gilles Barthe, Marco Gaboardi, Deepak Garg, and Florian Zuleger. 2017. Monadic Refinements for Relational Cost Analysis. Proc. ACM Program. Lang. 2, POPL ( 2017 ). https://doi.org/10.1145/3158124

Digital Library

[11]

Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. 2020. Sponge Examples: Energy-Latency Attacks on Neural Networks. arXiv:cs.LG/ 2006.03463

[12]

Nimit Singhania. 2018. Static Analysis for GPU Program Performance. Ph.D. Dissertation. Computer and Information Science, University of Pennsylvania.

[13]

Moritz Sinn, Florian Zuleger, and Helmut Veith. 2014. A Simple and Scalable Approach to Bound Analysis and Amortized Complexity Analysis. In Computer Aided Verification-26th Int. Conf. (CAV'14). https://doi.org/10.1007/978-3-319-08867-9_50

Digital Library

[14]

Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, and Yuqun Zhang. 2019. Characterizing and Detecting CUDA Program Bugs. CoRR abs/ 1905. 01833 ( 2019 ). arXiv: 1905. 01833 http://arxiv.org/abs/ 1905.01833

[15]

Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. 2011. GRace: A Low-overhead Mechanism for Detecting Data Races in GPU Programs. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '11). ACM, New York, NY, USA, 135-146. https://doi.org/10.1145/1941553.1941574

Digital Library

Cited By

Liew DCogumbreiro TLange J(2024)Sound and Partially-Complete Static Analysis of Data-Races in GPU ProgramsProceedings of the ACM on Programming Languages10.1145/36897978:OOPSLA2(2434-2461)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689797
Rasch A(2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
https://dl.acm.org/doi/10.1145/3665643
Lou MMuller S(2024)Automatic Static Analysis-Guided Optimization of CUDA KernelsProceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3649169.3649249(11-21)Online publication date: 3-Mar-2024
https://dl.acm.org/doi/10.1145/3649169.3649249
Show More Cited By

Index Terms

Modeling and analyzing evaluation cost of CUDA kernels
1. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software performance
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models
  2. Semantics and reasoning
    1. Program reasoning
      1. Program analysis

Recommendations

Modeling and Analyzing Evaluation Cost of CUDA Kernels
Motivated by the increasing importance of general-purpose Graphic Processing Units (GPGPU) programming, exemplified by NVIDIA’s CUDA framework, as well as the difficulty, especially for novice programmers, of reasoning about performance in GPGPU kernels, ...
Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: Programming Productivity, Performance, and Energy Consumption
ARMS-CC '17: Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing

Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general purpose CPUs and accelerators (such as, GPU, or Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. ...
A performance study of general-purpose applications on graphics processors using CUDA

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages

Proceedings of the ACM on Programming Languages Volume 5, Issue POPL

January 2021

1789 pages

EISSN:2475-1421

DOI:10.1145/3445980

Issue’s Table of Contents

Copyright © 2021 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 January 2021

Published in PACMPL Volume 5, Issue POPL

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Badges

Author Tags

Qualifiers

Research-article

Funding Sources

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
1,440
Total Downloads

Downloads (Last 12 months)354
Downloads (Last 6 weeks)40

Reflects downloads up to 13 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Liew DCogumbreiro TLange J(2024)Sound and Partially-Complete Static Analysis of Data-Races in GPU ProgramsProceedings of the ACM on Programming Languages10.1145/36897978:OOPSLA2(2434-2461)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689797
Rasch A(2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
https://dl.acm.org/doi/10.1145/3665643
Lou MMuller S(2024)Automatic Static Analysis-Guided Optimization of CUDA KernelsProceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3649169.3649249(11-21)Online publication date: 3-Mar-2024
https://dl.acm.org/doi/10.1145/3649169.3649249
Muller SHoffmann J(2024)Modeling and Analyzing Evaluation Cost of CUDA KernelsACM Transactions on Parallel Computing10.1145/363940311:1(1-53)Online publication date: 12-Jan-2024
https://dl.acm.org/doi/10.1145/3639403
Tauro BSuchy BCampanoni SDinda PHale KTsafrir DMUSUVATHI MGupta RAbu-Ghazaleh N(2024)TrackFM: Far-out Compiler Support for a Far Memory WorldProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3617232.3624856(401-419)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3617232.3624856
Cogumbreiro TLange JLiew DZicarelli H(2024)Memory access protocols: certified data-race freedom for GPU kernelsFormal Methods in System Design10.1007/s10703-023-00415-063:1-3(134-171)Online publication date: 1-Oct-2024
https://dl.acm.org/doi/10.1007/s10703-023-00415-0
Jin FJacobson JPollard SSarkar V(2022)MiniKokkos: A Calculus of Portable Parallelism2022 IEEE/ACM Sixth International Workshop on Software Correctness for HPC Applications (Correctness)10.1109/Correctness56720.2022.00010(37-44)Online publication date: Nov-2022
https://doi.org/10.1109/Correctness56720.2022.00010
Cogumbreiro TLange JRong DZicarelli H(2021)Checking Data-Race Freedom of GPU Kernels, CompositionallyComputer Aided Verification10.1007/978-3-030-81685-8_19(403-426)Online publication date: 15-Jul-2021
https://doi.org/10.1007/978-3-030-81685-8_19

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Figures

Tables

Media

View Issue’s Table of Contents