research-article

Public Access

A Formal Analysis of the NVIDIA PTX Memory Consistency Model

Authors:

Sameer Sahasrabuddhe,

Olivier GirouxAuthors Info & Claims

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 257 - 270

https://doi.org/10.1145/3297858.3304043

Published: 04 April 2019 Publication History

Abstract

This paper presents the first formal analysis of the official memory consistency model for the NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly ordered but provides scoped synchronization primitives that enable GPU program threads to communicate through memory. However, unlike some competing GPU memory models, PTX does not require data race freedom, and this results in PTX using a fundamentally different (and more complicated) set of rules in its memory model. As such, PTX has a clear need for a rigorous and reliable memory model testing and analysis infrastructure. We break our formal analysis of the PTX memory model into multiple steps that collectively demonstrate its rigor and validity. First, we adapt the English language specification from the public PTX documentation into a formal axiomatic model. Second, we derive an up-to-date presentation of an OpenCL-like scoped C++ model and develop a mapping from the synchronization primitives of that scoped C++ model onto PTX. Third, we use the Alloy relational modeling tool to empirically test the correctness of the mapping. Finally, we compile the model and mapping into Coq and build a full machine-checked proof that the mapping is sound for programs of any size. Our analysis demonstrates that in spite of issues in previous generations, the new NVIDIA PTX memory model is suitable as a sound compilation target for GPU programming languages such as CUDA.

References

[1]

2017. CppMem: Interactive C/C++ Memory Model. http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem.

[2]

2017. diy, Release Seven.http://diy.inria.fr/.

[3]

2017. Lem, a Tool for Lightweight Executable Mathematics.http://www.cl.cam.ac.uk/~pes20/lem.

[4]

2017. The Coq Proof Assistant. https://coq.inria.fr.

[5]

2017. The HOL Interactive Theorem Prover. https://hol-theorem-prover.org.

[6]

Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen, and John Wickerson. 2015. GPU Concurrency: Weak Behaviours and Programming Assumptions. In 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

Digital Library

[7]

Jade Alglave and Luc Maranget. 2016. Towards a Formalization of the HSA Memory Model in the cat Language. Technical Report. HSA Foundation Specification Version 1.1. URL: http://www.hsafoundation.com/?ddownload=5381.

[8]

Jade Alglave, Luc Maranget, Susmit Sarkar, and Peter Sewell. 2010. Fences in Weak Memory Models. In 22nd International Conference on Computer Aided Verification (CAV).

Digital Library

[9]

Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Transanctions on Programming Languages and Systems (TOPLAS) 36, 2 (July 2014), 7:1--7:74.

Digital Library

[10]

Konstantine Arkoudas, Sarfraz Khurshid, Darko Marinov, and Martin Rinard. 2003. Integrating Model Checking and Theorem Proving for Relational Reasoning. International Conference on Relational Methods in Computer Science (RelMiCS).

[11]

ARM. 2011. Cortex-A9 MPCore?, Programmer Advice Notice, Read-after-Read Hazards. Technical Report. URL: http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf.

[12]

ARM Holdings. 2016. ARM Architecture Reference Manuals. Technical Report. URL: http://infocenter.arm.com/help/topic/com.arm.doc.set.architecture.

[13]

Mark Batty, Alastair F. Donaldson, and John Wickerson. 2016. Over-hauling SC Atomics in C11 and OpenCL. In 43rd Annual Symposiumon Principles of Programming Languages (POPL).

Digital Library

[14]

Mark Batty, Kayvan Memarian, Scott Owens, Susmit Sarkar, and Peter Sewell. 2012. Clarifying and Compiling C/C++ Concurrency: from C++11 to POWER.39th Symposium on Principles of Programming Languages (POPL) (2012).

Digital Library

[15]

Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. 38th Symposium on Principles of Programming Languages (POPL)(2011).

Digital Library

[16]

Jasmin Christian Blanchette and Tobias Nipkow. 2010. Nitpick: A Counterexample Generator for Higher-Order Logic Based on a Relational Model Finder. 23rd International Conference on Interactive Theorem Proving (ITP).

Digital Library

[17]

Jasmin Christian Blanchette, Tjark Weber, Mark Batty, Scott Owens,and Susmit Sarkar. 2011. Nitpicking C++ Concurrency. 13th International Symposium on Principles and Practice of Declarative Programming (PPDP).

Digital Library

[18]

Hans-J. Boehm and Brian Demsky. 2014. Outlawing Ghosts: Avoiding Out-of-thin-air Results. In Workshop on Memory Systems Performance and Correctness (MSPC).

Digital Library

[19]

Hans-J. Boehm, Olivier Giroux, Viktor Vafeiadis, and with in-put from Will Deacon, Doug Lea, Daniel Lustig, Paul McKenney and others {sic}. 2018. P0668R3: Revising the C++ memory model. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r3.html.ISO/JTC1/SC22/WG21(2018).

[20]

Adam Chlipala. 2013.Certified Programming with Dependent Types: a Pragmatic Introduction to the Coq Proof Assistant. MIT Press.

Digital Library

[21]

Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc Maranget, Kathryn E Gray, Ali Sezgin, Mark Batty, and Peter Sewell. 2017. Mixed-size Concurrency: ARM, POWER, C/C++11, and SC. (2017).

[22]

HSA Foundation. 2017. Heterogeneous System Architecture. http://www.hsafoundation.com/standards.

[23]

Aboubakr Achraf El Ghazi and Mana Taghdiri. 2011. Relational Reasoning via SMT Solving. International Symposium on Formal Methods (FM).

Digital Library

[24]

Khronos Group. 2017. OpenCL 2.2.https://www.khronos.org/opencl.

[25]

Mark Harris. 2017. Unified Memory for CUDA Beginners.NVIDIA Developer Blog (2017). https://devblogs.nvidia.com/unified-memory-cuda-beginners.

[26]

Mark Harris and Kyrylo Perelygin. 2017. Cooperative Groups: Flexible CUDA Thread Programming. NVIDIA Developer Blog (2017). https://devblogs.nvidia.com/cooperative-groups.

[27]

Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. 2014. Heterogeneous-race-free Memory Models. 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

Digital Library

[28]

International Organization for Standardization (ISO). 2011. Information technology -- Programming languages -- C++, ISO/IEC 14882:2011. Technical Report.

[29]

International Organization for Standardization (ISO). 2011. Information technology -- Programming languages -- C, ISO/IEC 9899: 2011. Technical Report.

[30]

Daniel Jackson. 2002. Alloy: A Lightweight Object Modelling Notation. In ACM Transactions on Software Engineering and Methodology(TOSEM), Vol. 11. URL: http://alloy.mit.edu.

Digital Library

[31]

Khronos Group. 2015. The OpenCL Specification, Version 2.2. Technical Report. URL: https://www.khronos.org/registry/cl/specs/opencl-2.2.pdf.

[32]

Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing Sequential Consistency in C/C++11. (2017).

[33]

Daniel Lustig, Michael Pellauer, and Margaret Martonosi. 2014. PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models. 47th Annual International Symposium on Microarchitecture (MICRO) (2014).

Digital Library

[34]

Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2017. Supplemental material. https://github.com/nvlabs/ptxmemorymodel.

[35]

Daniel Lustig, Andrew Wright, Alexandros Papakonstantinou, andOlivier Giroux. 2017. Automated Synthesis of Comprehensive Memory Model Litmus Test Suites. 22nd International Conference on Architectural Support for Programming Languages and Operating Systems(ASPLOS) (2017).

Digital Library

[36]

Yatin A. Manerkar, Caroline Trippel, Daniel Lustig, Michael Pellauer,and Margaret Martonosi. 2016. Counterexamples and Proof Loop-hole for the C/C++ to POWER and ARMv7 Trailing-sync Compiler Mappings. arXiv1611.01507v2 (Nov 2016).

[37]

Baoluo Meng, Andrew Reynolds, Cesare Tinelli, and Clark Barrett. 2017. Relational Constraint Solving in SMT. 26th International Conference on Automated Deduction (CADE).

[38]

Mariano M. Moscato, Carlos G. Lopez Pombo, and Marcelo F. Frias. 2014. Dynamite: a Tool for the Verification of Alloy Models based on PVS. ACM Transactions on Software Engineering and Methodology (TOSEM) 23. Issue 2.

Digital Library

[39]

NVIDIA. 2017. CUDA C Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide.

[40]

NVIDIA. 2017. Parallel Thread Execution ISA Version 6.0. http://docs.nvidia.com/cuda/parallel-thread-execution/index.html.

[41]

NVIDIA. 2018. NVIDIA Tesla V100 GPU Architecture: The World's MostAdvanced Data Center GPU. Technical Report.

[42]

Oracle. 2017. Java Language and Virtual Machine Specifications. https://docs.oracle.com/javase/specs/.

[43]

Peizhao Ou and Brian Demsky. 2018. Towards Understanding the Costs of Avoiding Out-of-Thin-Air Results. Object-Oriented Programming, Systems, Languages & Applications Conference (OOPSLA) (2018).

[44]

Scott Owens, Susmit Sarkar, and Peter Sewell. 2009. A Better x86Memory Model: x86-TSO. In 22nd International Conference on Theorem Proving in Higher Order Logics (TPHOLs).

Digital Library

[45]

Gustavo Petri, Jan Vitek, and Suresh Jagannathan. 2015. Cookingthe Books: Formalizing JMM Implementation Recipes. 29th European Conference on Object-Oriented Programming (ECOOP)(2015).

[46]

Power.org. 2013. Power ISA? Version 2.07. Technical Report. URL:https://www.power.org/wp-content/uploads/2013/05/PowerISA_V2.07_PUBLIC.pdf.

[47]

Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM Concurrency: Multicopy-Atomic Axiomatic and Operational Models for ARMv8. 45th Symposium on Principles of Programming Languages (POPL).

Digital Library

[48]

Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. 2011. Understanding POWER Multiprocessors. In 32nd Conference on Programming Language Design and Implementation (PLDI).

Digital Library

[49]

Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models. In 48th Annual International Symposium on Microarchitecture (MICRO).

Digital Library

[50]

Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2017. Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems. 44th Annual International Symposium on Computer Architecture (ISCA) (2017).

Digital Library

[51]

Tyler Sorensen and Alastair F. Donaldson. 2016. Exposing Errors Related to Weak Memory in GPU Applications. In 37th Conference on Programming Language Design and Implementation (PLDI).

Digital Library

[52]

Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2011. A Primeron Memory Consistency and Cache Coherence. Morgan & Claypool Publishers.

Digital Library

[53]

SPARC International. 1993. The SPARC Architecture Manual, Version 9. Technical Report.

[54]

Emina Torlak and Daniel Jackson. 2007. Kodkod: A Relational Model Finder. In 13th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS).

Digital Library

[55]

Caroline Trippel, Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. 2017. TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA. 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)(2017).

Digital Library

[56]

Mattias Ulbrich, Ulrich Geilmann, Aboubakr Achraf El Ghazi, and Mana Taghdiri. 2012. A Proof Assistant for Alloy Specifications. International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS).

Digital Library

[57]

Viktor Vafeiadis, Thibaut Balabonski, Soham Chakraborty, Robin Morisset, and Francesco Zappa Nardelli. 2015. Common Compiler Optimisations Are Invalid in the C11 Memory Model and What We Can Do About It. In 42nd Symposium on Principles of Programming Languages (POPL).

Digital Library

[58]

John Wickerson, Mark Batty, Tyler Sorensen, and George A. Constantinides. 2017. Automatically Comparing Memory Consistency Models. 44th Symposium on Principles of Programming Languages(POPL) (2017).

Digital Library

Cited By

Chaturvedi IGodala BWu YXu ZIliakis KEleftherakis PXydis SSoudris DSorensen TCampanoni SAamodt TAugust D(2024)GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00011(1-16)Online publication date: 29-Jun-2024
https://doi.org/10.1109/ISCA59077.2024.00011
Drechsler RMetz CPlump C(2024)Energy-Efficient CNN Inferencing on GPUs with Dynamic Frequency ScalingInnovations in Data Analytics10.1007/978-981-97-3466-5_28(375-389)Online publication date: 21-Jul-2024
https://doi.org/10.1007/978-981-97-3466-5_28
Goens AChakraborty SSarkar SAgarwal SOswald NNagarajan V(2023)Compound Memory ModelsProceedings of the ACM on Programming Languages10.1145/35912677:PLDI(1145-1168)Online publication date: 6-Jun-2023
https://dl.acm.org/doi/10.1145/3591267
Show More Cited By

Index Terms

A Formal Analysis of the NVIDIA PTX Memory Consistency Model
1. Hardware
  1. Hardware validation
    1. Functional verification
      1. Theorem proving and SAT solving
2. Software and its engineering
  1. Software organization and properties
    1. Software functional properties
      1. Correctness
        Consistency

Recommendations

Mixed-proxy extensions for the NVIDIA PTX memory consistency model: industrial product
ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture

In recent years, there has been a trend towards the use of accelerators and architectural specialization to continue scaling performance in spite of a slowing of Moore's Law. GPUs have always relied on dedicated hardware for graphics workloads, but ...
A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code
CC 2023: Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction

Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU's parallel architecture. Directive-based programming models, such as OpenACC, are one such method ...
Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race Is On
High Performance Computing
Abstract
Efficiently processing sparse matrices is a central and performance-critical part of many scientific simulation codes. Recognizing the adoption of manycore accelerators in HPC, we evaluate in this paper the performance of the currently best ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems

April 2019

1126 pages

ISBN:9781450362405

DOI:10.1145/3297858

General Chairs:
Iris Bahar
Brown University
,
Maurice Herlihy
Brown University
,
Program Chairs:
Emmett Witchel
University of Texas, Austin
,
Alvin Lebeck
Duke University

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 April 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Conference

ASPLOS '19

Sponsor:

ASPLOS '19: Architectural Support for Programming Languages and Operating Systems

April 13 - 17, 2019

RI, Providence, USA

Acceptance Rates

ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

32
Total Citations
View Citations
3,158
Total Downloads

Downloads (Last 12 months)1,218
Downloads (Last 6 weeks)177

Reflects downloads up to 01 Oct 2024

Other Metrics

View Author Metrics

Citations

Cited By

Chaturvedi IGodala BWu YXu ZIliakis KEleftherakis PXydis SSoudris DSorensen TCampanoni SAamodt TAugust D(2024)GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00011(1-16)Online publication date: 29-Jun-2024
https://doi.org/10.1109/ISCA59077.2024.00011
Drechsler RMetz CPlump C(2024)Energy-Efficient CNN Inferencing on GPUs with Dynamic Frequency ScalingInnovations in Data Analytics10.1007/978-981-97-3466-5_28(375-389)Online publication date: 21-Jul-2024
https://doi.org/10.1007/978-981-97-3466-5_28
Goens AChakraborty SSarkar SAgarwal SOswald NNagarajan V(2023)Compound Memory ModelsProceedings of the ACM on Programming Languages10.1145/35912677:PLDI(1145-1168)Online publication date: 6-Jun-2023
https://dl.acm.org/doi/10.1145/3591267
Levine RGuo TCho MBaker ALevien RNeto DQuinn ASorensen TAamodt TJerger NSwift M(2023)MC Mutants: Evaluating and Improving Testing for Memory Consistency SpecificationsProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3575693.3575750(473-488)Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3575750
Pandey SKamath ABasu AAamodt TJerger NSwift M(2023)Scoped Buffered Persistency Model for GPUsProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3575693.3575749(688-701)Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3575749
Klimis VClark JBaker ANeto DWickerson JDonaldson A(2023)Taking Back Control in an Intermediate Representation for GPU ComputingProceedings of the ACM on Programming Languages10.1145/35712537:POPL(1740-1769)Online publication date: 11-Jan-2023
https://dl.acm.org/doi/10.1145/3571253
Dalmia PMahapatra RIntan JNegrut DSinclair M(2023)Improving the Scalability of GPU Synchronization PrimitivesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.321850834:1(275-290)Online publication date: 1-Jan-2023
https://doi.org/10.1109/TPDS.2022.3218508
Awad MPorumbescu SOwens JKloeckner AMoreira J(2022)A GPU Multiversion B-TreeProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1145/3559009.3569681(481-493)Online publication date: 8-Oct-2022
https://dl.acm.org/doi/10.1145/3559009.3569681
Pandey SKamath ABasu AFalsafi BFerdman MLu SWenisch T(2022)GPM: leveraging persistent memory from a GPUProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3503222.3507758(142-156)Online publication date: 28-Feb-2022
https://dl.acm.org/doi/10.1145/3503222.3507758
Lustig DCooksey SGiroux OSalapura VZahran MChong FTang L(2022)Mixed-proxy extensions for the NVIDIA PTX memory consistency modelProceedings of the 49th Annual International Symposium on Computer Architecture10.1145/3470496.3533045(1058-1070)Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3533045
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents