Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Modeling and analyzing evaluation cost of CUDA kernels

Published: 04 January 2021 Publication History

Abstract

General-purpose programming on GPUs (GPGPU) is becoming increasingly in vogue as applications such as machine learning and scientific computing demand high throughput in vector-parallel applications. NVIDIA's CUDA toolkit seeks to make GPGPU programming accessible by allowing programmers to write GPU functions, called kernels, in a small extension of C/C++. However, due to CUDA's complex execution model, the performance characteristics of CUDA kernels are difficult to predict, especially for novice programmers.
This paper introduces a novel quantitative program logic for CUDA kernels, which allows programmers to reason about both functional correctness and resource usage of CUDA kernels, paying particular attention to a set of common but CUDA-specific performance bottlenecks. The logic is proved sound with respect to a novel operational cost semantics for CUDA kernels. The semantics, logic and soundness proofs are formalized in Coq. An inference algorithm based on LP solving automatically synthesizes symbolic resource bounds by generating derivations in the logic. This algorithm is the basis of RaCuda, an end-to-end resource-analysis tool for kernels, which has been implemented using an existing resource-analysis tool for imperative programs. An experimental evaluation on a suite of CUDA benchmarks shows that the analysis is effective in aiding the detection of performance bugs in CUDA kernels.

References

[1]
Elvira Albert, Puri Arenas, Samir Genaim, Miguel Gómez-Zamalloa, and German Puebla. 2011. Cost Analysis of Concurrent OO Programs. In 9th Asian Symp. on Prog. Langs. and Systems (APLAS'11). https://doi.org/10.1007/978-3-642-25318-8_19
[2]
Ugo Dal Lago and Marco Gaboardi. 2011. Linear Dependent Types and Relative Completeness. In 26th IEEE Symp. on Logic in Computer Science (LICS'11). https://doi.org/10.1109/LICS. 2011.22
[3]
Guodong Li and Ganesh Gopalakrishnan. 2010. SMT-Based Verification of GPU Kernel Functions. In International Symposium on the Foundations of Software Engineering (FSE) (FSE '10). https://doi.org/10.1145/1882291.1882320
[4]
Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakrishnan, Indradeep Ghosh, and Sreeranga P. Rajan. 2012. GKLEE: Concolic verification and test generation for GPUs. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). https://doi.org/10.1145/2370036.2145844
[5]
Stefan K. Muller and Umut A. Acar. 2016. Latency-Hiding Work Stealing: Scheduling Interacting Parallel Computations with Work Stealing. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '16). ACM, New York, NY, USA, 71-82. https://doi.org/10.1145/2935764.2935793
[6]
Van Chan Ngo, Quentin Carbonneaux, and Jan Hofmann. 2018. Bounded Expectations: Resource Analysis for Probabilistic Programs. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018 ). ACM, New York, NY, USA, 496-512. https://doi.org/10.1145/3192366.3192394
[7]
NVIDIA Corporation. 2019. CUDA C Programming Guide v. 10.1.168.
[8]
Yuanfeng Peng, Vinod Grover, and Joseph Devietti. 2018. CURD: A Dynamic CUDA Race Detector. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018 ). ACM, New York, NY, USA, 390-403. https://doi.org/10.1145/3192366.3192368
[9]
Phillipe Pereira, Higo Albuquerque, Hendrio Marques, Isabela Silva, Celso Carvalho, Lucas Cordeiro, Vanessa Santos, and Ricardo Ferreira. 2016. Verifying CUDA Programs Using SMT-based Context-bounded Model Checking. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 1648-1653. https: //doi.org/10.1145/2851613.2851830
[10]
Ivan Radiček, Gilles Barthe, Marco Gaboardi, Deepak Garg, and Florian Zuleger. 2017. Monadic Refinements for Relational Cost Analysis. Proc. ACM Program. Lang. 2, POPL ( 2017 ). https://doi.org/10.1145/3158124
[11]
Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. 2020. Sponge Examples: Energy-Latency Attacks on Neural Networks. arXiv:cs.LG/ 2006.03463
[12]
Nimit Singhania. 2018. Static Analysis for GPU Program Performance. Ph.D. Dissertation. Computer and Information Science, University of Pennsylvania.
[13]
Moritz Sinn, Florian Zuleger, and Helmut Veith. 2014. A Simple and Scalable Approach to Bound Analysis and Amortized Complexity Analysis. In Computer Aided Verification-26th Int. Conf. (CAV'14). https://doi.org/10.1007/978-3-319-08867-9_50
[14]
Mingyuan Wu, Husheng Zhou, Lingming Zhang, Cong Liu, and Yuqun Zhang. 2019. Characterizing and Detecting CUDA Program Bugs. CoRR abs/ 1905. 01833 ( 2019 ). arXiv: 1905. 01833 http://arxiv.org/abs/ 1905.01833
[15]
Mai Zheng, Vignesh T. Ravi, Feng Qin, and Gagan Agrawal. 2011. GRace: A Low-overhead Mechanism for Detecting Data Races in GPU Programs. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '11). ACM, New York, NY, USA, 135-146. https://doi.org/10.1145/1941553.1941574

Cited By

View all
  • (2024)Sound and Partially-Complete Static Analysis of Data-Races in GPU ProgramsProceedings of the ACM on Programming Languages10.1145/36897978:OOPSLA2(2434-2461)Online publication date: 8-Oct-2024
  • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
  • (2024)Automatic Static Analysis-Guided Optimization of CUDA KernelsProceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3649169.3649249(11-21)Online publication date: 3-Mar-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages
Proceedings of the ACM on Programming Languages  Volume 5, Issue POPL
January 2021
1789 pages
EISSN:2475-1421
DOI:10.1145/3445980
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 January 2021
Published in PACMPL Volume 5, Issue POPL

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. CUDA
  2. performance analysis
  3. program logics
  4. resource-aware type system
  5. thread-level parallelism

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)354
  • Downloads (Last 6 weeks)40
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Sound and Partially-Complete Static Analysis of Data-Races in GPU ProgramsProceedings of the ACM on Programming Languages10.1145/36897978:OOPSLA2(2434-2461)Online publication date: 8-Oct-2024
  • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
  • (2024)Automatic Static Analysis-Guided Optimization of CUDA KernelsProceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3649169.3649249(11-21)Online publication date: 3-Mar-2024
  • (2024)Modeling and Analyzing Evaluation Cost of CUDA KernelsACM Transactions on Parallel Computing10.1145/363940311:1(1-53)Online publication date: 12-Jan-2024
  • (2024)TrackFM: Far-out Compiler Support for a Far Memory WorldProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3617232.3624856(401-419)Online publication date: 27-Apr-2024
  • (2024)Memory access protocols: certified data-race freedom for GPU kernelsFormal Methods in System Design10.1007/s10703-023-00415-063:1-3(134-171)Online publication date: 1-Oct-2024
  • (2022)MiniKokkos: A Calculus of Portable Parallelism2022 IEEE/ACM Sixth International Workshop on Software Correctness for HPC Applications (Correctness)10.1109/Correctness56720.2022.00010(37-44)Online publication date: Nov-2022
  • (2021)Checking Data-Race Freedom of GPU Kernels, CompositionallyComputer Aided Verification10.1007/978-3-030-81685-8_19(403-426)Online publication date: 15-Jul-2021

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media