Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Specifying and testing GPU workgroup progress models

Published: 15 October 2021 Publication History

Abstract

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model.
This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test -- i.e. whether it is guaranteed to eventually terminate -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.

Supplementary Material

Auxiliary Presentation Video (oopsla21main-p143-p-video.mp4)
As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks.

References

[1]
Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen, and John Wickerson. 2015. GPU Concurrency: Weak Behaviours and Programming Assumptions. In Architectural Support for Programming Languages and Operating Systems, ASPLOS. ACM. https://doi.org/10.1145/2694344.2694391
[2]
AMD. 2019. RDNA Architecture. Whitepaper, https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
[3]
Apple. 2020. Metal Shading Language Specification v2.3. https://developer.apple.com/metal/
[4]
Christel Baier and Joost-Pieter Katoen. 2008. Principles of model checking. MIT Press. isbn:978-0-262-02649-9
[5]
Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. CoRR, abs/1508.03619 (2015), arXiv:1508.03619. arxiv:1508.03619
[6]
Adam Betts, Nathan Chong, Alastair F. Donaldson, Jeroen Ketema, Shaz Qadeer, Paul Thomson, and John Wickerson. 2015. The Design and Implementation of a Verification Technique for GPU Kernels. ACM Trans. Program. Lang. Syst., 37, 3 (2015), 10:1–10:49. https://doi.org/10.1145/2743017
[7]
Adam Betts, Nathan Chong, Alastair F. Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: a verifier for GPU kernels. In Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA. ACM. https://doi.org/10.1145/2384616.2384625
[8]
G. E. Blelloch. 1989. Scans As Primitive Parallel Operations. IEEE Trans. Comput., 38, 11 (1989), 1526–1538. https://doi.org/10.1109/12.42122
[9]
Daniel Cederman and Philippas Tsigas. 2008. On Dynamic Load Balancing on Graphics Processors. In SIGGRAPH. Eurographics Association, 57–64. https://doi.org/10.5555/1413957.1413967
[10]
David Champelovier, Xavier Clerc, Hubert Garavel, Yves Guerte, Christine McKinty, Vincent Powazny, Frédéric Lang, Wendelin Serwe, and Gideon Smeding. 2021. Reference Manual of the LNT to LOTOS Translator (Version 7.0). Aug., https://cadp.inria.fr/ftp/publications/cadp/Champelovier-Clerc-Garavel-et-al-10.pdf INRIA, Grenoble, France.
[11]
Nathan Chong, Tyler Sorensen, and John Wickerson. 2018. The semantics of transactions and weak memory in x86, Power, ARM, and C++. In PLDI. ACM, 211–225. https://doi.org/10.1145/3192366.3192373
[12]
Peter Collingbourne, Alastair F. Donaldson, Jeroen Ketema, and Shaz Qadeer. 2013. Interleaving and Lock-Step Semantics for Analysis and Verification of GPU Kernels. In ESOP, Matthias Felleisen and Philippa Gardner (Eds.) (Lecture Notes in Computer Science, Vol. 7792). Springer, 270–289. https://doi.org/10.1007/978-3-642-37036-6_16
[13]
Alastair F. Donaldson, Hugues Evrard, Andrei Lascu, and Paul Thomson. 2017. Automated testing of graphics shader compilers. Proc. ACM Program. Lang., 1, OOPSLA (2017), https://doi.org/10.1145/3133917
[14]
Alastair F. Donaldson, Hugues Evrard, and Paul Thomson. 2020. Putting Randomized Compiler Testing into Production (Experience Report). In 34th European Conference on Object-Oriented Programming, ECOOP 2020, November 15-17, 2020, Berlin, Germany (Virtual Conference), Robert Hirschfeld and Tobias Pape (Eds.) (LIPIcs, Vol. 166). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 22:1–22:29. https://doi.org/10.4230/LIPIcs.ECOOP.2020.22
[15]
Alexandru Dutu, Matthew D. Sinclair, Bradford M. Beckmann, David A. Wood, and Marcus Chow. 2020. Independent Forward Progress of Work-groups. In International Symposium on Computer Architecture, ISCA. IEEE. https://doi.org/10.1109/ISCA45697.2020.00087
[16]
Ahmed ElTantawy and Tor M. Aamodt. 2016. MIMD synchronization on SIMT architectures. In 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO. IEEE Computer Society. https://doi.org/10.1109/MICRO.2016.7783714
[17]
Dov M. Gabbay, Amir Pnueli, Saharon Shelah, and Jonathan Stavi. 1980. On the Temporal Basis of Fairness. In Symposium on Principles of Programming Languages, Paul W. Abrahams, Richard J. Lipton, and Stephen R. Bourne (Eds.). ACM Press, 163–173. https://doi.org/10.1145/567446.567462
[18]
Hubert Garavel, Frédéric Lang, Radu Mateescu, and Wendelin Serwe. 2013. CADP 2011: a toolbox for the construction and analysis of distributed processes. International Journal on Software Tools for Technology Transfer, 15, 2 (2013), 89–107. https://doi.org/10.1007/s10009-012-0244-z
[19]
Google. 2020. Amber. https://github.com/google/amber
[20]
Kshitij Gupta, Jeff Stuart, and John D. Owens. 2012. A Study of Persistent Threads Style GPU Programming for GPGPU Workloads. In InPar. 1–14. https://doi.org/10.1109/InPar.2012.6339596
[21]
Axel Habermaier and Alexander Knapp. 2012. On the Correctness of the SIMT Execution Model of GPUs. In ESOP, Helmut Seidl (Ed.) (Lecture Notes in Computer Science, Vol. 7211). Springer, 316–335. https://doi.org/10.1007/978-3-642-28869-2_16
[22]
HSA Foundation. 2017. HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer, and Object Format (BRIG). (rev 1.1.1).
[23]
2015. Heterogeneous System Architecture: A New Compute Platform Infrastructure, Wen-mei W. Hwu (Ed.). Morgan Kaufmann.
[24]
Intel. 2021. oneAPI GPU Optimization Guide. https://software.intel.com/content/dam/develop/external/us/en/documents/oneapi-gpu-optimization-guide.pdf
[25]
ISO/IEC. 2017. Working Draft, Standard for Programming Language C++ v. N4659. https://doi.org/jtc1/sc22/wg21/docs/papers/2017/n4659.pdf
[26]
Daniel Jackson. 2012. Software Abstractions: Logic, Language, and Analysis. MIT Press. isbn:0262017156
[27]
Khronos Group. 2020. Khronos Group Releases OpenCL 3.0. https://khr.io/ocl3pressrelease
[28]
Khronos Group. 2020. Vulkan 1.2.174 - A Specification. https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html
[29]
Yonghae Kim and Hyesoon Kim. 2019. Translating CUDA to OpenCL for Hardware Generation using Neural Machine Translation. In International Symposium on Code Generation and Optimization, CGO. IEEE. https://doi.org/10.1109/CGO.2019.8661172
[30]
Dexter Kozen. 1983. Results on the Propositional mu-Calculus. Theor. Comput. Sci., 27 (1983), 333–354. https://doi.org/10.1016/0304-3975(82)90125-6
[31]
Ori Lahav, Egor Namakonov, Jonas Oberhauser, Anton Podkopaev, and Viktor Vafeiadis. 2021. Making Weak Memory Models Fair. Proc. ACM Program. Lang., 6, OOPSLA (2021), https://doi.org/10.1145/3485475
[32]
Raph Levien. 2020. Prefix sum on Vulkan. https://raphlinus.github.io/gpu/2020/04/30/prefix-sum.html
[33]
Guodong Li, Peng Li, Geoffrey Sawaya, Ganesh Gopalakrishnan, Indradeep Ghosh, and Sreeranga P. Rajan. 2012. GKLEE: concolic verification and test generation for GPUs. In Principles and Practice of Parallel Programming, PPOPP. ACM. https://doi.org/10.1145/2145816.2145844
[34]
Daniel Lustig, Andrew Wright, Alexandros Papakonstantinou, and Olivier Giroux. 2017. Automated Synthesis of Comprehensive Memory Model Litmus Test Suites. In ASPLOS. ACM, 661–675. https://doi.org/10.1145/3037697.3037723
[35]
Sepideh Maleki, Annie Yang, and Martin Burtscher. 2016. Higher-order and Tuple-based Massively-parallel Prefix Sums. In PLDI. ACM, 539–552. https://doi.org/10.1145/2908080.2908089
[36]
Gabriel Martinez, Mark K. Gardner, and Wu-chun Feng. 2011. CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures. In International Conference on Parallel and Distributed Systems, ICPADS. IEEE Computer Society. https://doi.org/10.1109/ICPADS.2011.48
[37]
Radu Mateescu and Wendelin Serwe. 2010. A Study of Shared-Memory Mutual Exclusion Protocols Using CADP. In Formal Methods for Industrial Critical Systems FMICS, Stefan Kowalewski and Marco Roveri (Eds.) (Lecture Notes in Computer Science, Vol. 6371). Springer, 180–197. https://doi.org/10.1007/978-3-642-15898-8_12
[38]
Radu Mateescu and Damien Thivolle. 2008. A Model Checking Language for Concurrent Value-Passing Systems. In International Symposium on Formal Methods (FM), Jorge Cuéllar, T. S. E. Maibaum, and Kaisa Sere (Eds.) (Lecture Notes in Computer Science, Vol. 5014). Springer, 148–164. https://doi.org/10.1007/978-3-540-68237-0_12
[39]
MCL. 2008. MCL manual page. http://cadp.inria.fr/man/mcl4.html
[40]
Bruce Merry. 2015. A Performance Comparison of Sort and Scan Libraries for GPUs. Parallel Process. Lett., 25, 4 (2015), 1550007:1–1550007:8. https://doi.org/10.1142/S0129626415500073
[41]
Jacob Nelson and Roberto Palmieri. 2019. Don’t Forget About Synchronization!: A Case Study of K-Means on GPU. In International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM. ACM. https://doi.org/10.1145/3303084.3309488
[42]
Nvidia. 2017. Nvidia Tesla V100 GPU ARCHITECTURE. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Whitepaper WP-08608-001_v1.1.
[43]
Nvidia. 2021. CUB v1.12.0. https://github.com/NVlabs/cub
[44]
Nvidia. 2021. CUDA C++ Programming Guide, Version 11.2.1. https://docs.nvidia.com/cuda/archive/11.2.1/cuda-c-programming-guide/
[45]
Tyler Sorensen and Alastair F. Donaldson. 2016. Exposing errors related to weak memory in GPU applications. In Programming Language Design and Implementation, PLDI, Chandra Krintz and Emery Berger (Eds.). ACM, 100–113. https://doi.org/10.1145/2908080.2908114
[46]
Tyler Sorensen, Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan, and Zvonimir Rakamaric. 2016. Portable inter-workgroup barrier synchronisation for GPUs. In Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA. ACM. https://doi.org/10.1145/3022671.2984032
[47]
Tyler Sorensen, Hugues Evrard, and Alastair F. Donaldson. 2018. GPU Schedulers: How Fair Is Fair Enough? In 29th International Conference on Concurrency Theory, CONCUR 2018, September 4-7, 2018, Beijing, China, Sven Schewe and Lijun Zhang (Eds.) (LIPIcs, Vol. 118). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 23:1–23:17. https://doi.org/10.4230/LIPIcs.CONCUR.2018.23
[48]
Tyler Sorensen, Sreepathi Pai, and Alastair F. Donaldson. 2019. One Size Doesn’t Fit All: Quantifying Performance Portability of Graph Applications on GPUs. In IEEE International Symposium on Workload Characterization, IISWC. IEEE. https://doi.org/10.1109/IISWC47752.2019.9042139
[49]
Tyler Sorensen, Lucas F. Salvador, Harmit Raval, Hugues Evrard, John Wickerson, Margaret Martonosi, and Alastair F. Donaldson. 2021. Artifact for "Specifying and Testing GPU Workgroup Progress Models" (OOPSLA 2021). https://doi.org/10.5281/zenodo.5501522 also available at:
[50]
Tyler Sorensen, Lucas F. Salvador, Harmit Raval, Hugues Evrard, John Wickerson, Margaret Martonosi, and Alastair F. Donaldson. 2021. Specifying and Testing GPU Workgroup Progress Models (Extended Version). CoRR, abs/2109.06132 (2021), https://doi.org/10.1145/3485508 arXiv:2109.06132.
[51]
Tuan Ta, Xianwei Zhang, Anthony Gutierrez, and Bradford M. Beckmann. 2019. Autonomous Data-Race-Free GPU Testing. In International Symposium on Workload Characterization, IISWC. IEEE. https://doi.org/10.1109/IISWC47752.2019.9042019
[52]
Stanley Tzeng, Anjul Patney, and John D. Owens. 2010. Task Management for Irregular-Parallel Workloads on the GPU. In HPG. 29–37. https://doi.org/10.5555/1921479.1921485
[53]
Yangzihao Wang, Andrew A. Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2016. Gunrock: a high-performance graph processing library on the GPU. In Principles and Practice of Parallel Programming (PPoPP). ACM. https://doi.org/10.1145/2851141.2851145
[54]
John Wickerson, Mark Batty, Tyler Sorensen, and George A. Constantinides. 2017. Automatically comparing memory consistency models. In Principles of Programming Languages, POPL. ACM. https://doi.org/10.1145/3093333.3009838
[55]
Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. 2010. Demystifying GPU microarchitecture through microbenchmarking. In International Symposium on Performance Analysis of Systems and Software, ISPASS. IEEE. https://doi.org/10.1109/ISPASS.2010.5452013
[56]
Shucai Xiao and Wu-chun Feng. 2010. Inter-block GPU communication via fast barrier synchronization. In IPDPS. 1–12. https://doi.org/10.1109/IPDPS.2010.5470477

Cited By

View all
  • (2023)Towards Alignment of Parallelism in SYCL and ISO C++Proceedings of the 2023 International Workshop on OpenCL10.1145/3585341.3585371(1-9)Online publication date: 18-Apr-2023
  • (2023)Taking Back Control in an Intermediate Representation for GPU ComputingProceedings of the ACM on Programming Languages10.1145/35712537:POPL(1740-1769)Online publication date: 11-Jan-2023

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages
Proceedings of the ACM on Programming Languages  Volume 5, Issue OOPSLA
October 2021
2001 pages
EISSN:2475-1421
DOI:10.1145/3492349
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2021
Published in PACMPL Volume 5, Issue OOPSLA

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. GPU
  2. liveness
  3. model checking
  4. semantics
  5. test case synthesis

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)491
  • Downloads (Last 6 weeks)41
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Towards Alignment of Parallelism in SYCL and ISO C++Proceedings of the 2023 International Workshop on OpenCL10.1145/3585341.3585371(1-9)Online publication date: 18-Apr-2023
  • (2023)Taking Back Control in an Intermediate Representation for GPU ComputingProceedings of the ACM on Programming Languages10.1145/35712537:POPL(1740-1769)Online publication date: 11-Jan-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media