Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

GPU Concurrency: Weak Behaviours and Programming Assumptions

Published: 14 March 2015 Publication History

Abstract

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software.
To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming---often supported by official tutorials---as false.
As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.

References

[1]
Online companion material. http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/.
[2]
GPUBench, June 2014. http://graphics.stanford.edu/projects/gpubench.
[3]
J. Alglave. A Shared Memory Poetics. PhD thesis, Université Paris 7, 2010.
[4]
J. Alglave. A formal hierarchy of weak memory models. Formal Methods in System Design (FMSD), 41(2):178--210, 2012.
[5]
J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Litmus: Running tests against hardware. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 41--44, 2011.
[6]
J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Fences in weak memory models (extended version). Formal Methods in System Design (FMSD), 40(2):170--205, 2012.
[7]
J. Alglave, L. Maranget, and M. Tautschnig. Herding cats: Modelling, simulation, testing, and data mining for weak memory. ACM Transactions on Programming Languages and Systems (TOPLAS), 36(2):7, 2014.
[8]
AMD. AMD intermediate language (IL), version 2.4, Oct. 2011.
[9]
AMD. Evergreen family instruction set architecture: Instructions and microcode, revision 1.1a, Nov. 2011.
[10]
AMD. Southern Islands series instruction set architecture, revision 1.1, Dec. 2012.
[11]
AMD. AMD accelerated parallel processing OpenCL programming guide, Nov. 2013.
[12]
ARM. Cortex-A9 MPCore, programmer advice notice, read-after-read hazards ARM reference 761319, Sept. 2011.
[13]
N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 119--129, 1998.
[14]
H. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In Programming Language Design and Implementation (PLDI), pages 68--78, 2008.
[15]
D. Cederman and P. Tsigas. On dynamic load balancing on graphics processors. In SIGGRAPH/Eurographics, pages 57--64, 2008.
[16]
D. Cederman and P. Tsigas. Dynamic load balancing on graphics processors, Feb. 2014. http://www.cse.chalmers.se/research/group/dcs/gpuloadbal.html.
[17]
W. Collier. Reasoning About Parallel Architectures. Prentice-Hall, 1992.
[18]
E. Eide and J. Regehr. Volatiles are miscompiled, and what to do about it. In Embedded software (EMSOFT), pages 255--264, 2008.
[19]
W. Feng and S. Xiao. To GPU synchronize or not GPU synchronize? In International Symposium on Circuits and Systems (ISCAS), pages 3801--3804, 2010.
[20]
A. Habermaier and A. Knapp. On the correctness of the SIMT execution model of GPUs. In European Symposium on Programming (ESOP), pages 316--335, 2012.
[21]
T. Härder and A. Reuter. Principles of transaction-oriented database recovery. Computing Survey (CSUR), 15(4):287--317, 1983.
[22]
B. He and J. X. Yu. High-throughput transaction executions on graphics processors. The Proceedings of the VLDB Endowment (PVLDB), 4(5):314--325, 2011.
[23]
B. A. Hechtman and D. J. Sorin. Exploring memory consistency for massively-threaded throughput-oriented processors. In International Symposium on Circuits and Systems (ISCA), pages 201--212, 2013.
[24]
D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Sequential consistency for heterogeneous-race-free. In Memory Systems Performance and Correctness (MSPC), 2013.
[25]
D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous- race-free memory models. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 427--440, 2014.
[26]
W.-m. W. Hwu. GPU Computing Gems Jade Edition. Morgan Kaufmann Publishers Inc., 2011.
[27]
Khronos OpenCL Working Group. The OpenCL specification (version 1.2, revision 19), Nov. 2012.
[28]
L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690--691, 1979.
[29]
S. Lee, Y. Kim, J. Kim, and J. Kim. Stealing webpages rendered on your browser by exploiting GPU vulnerabilities. In Symposium on Security and Privacy (SP), pages 19--33, 2014.
[30]
P. Misra and M. Chaudhuri. Performance evaluation of concurrent lock-free data structures on GPUs. In International Conference on Parallel and Distributed Systems, (ICPADS), pages 53--60, 2012.
[31]
R. Morisset, P. Pawan, and F. Z. Nardelli. Compiler testing via a theory of sound optimisations in the C11/C++11 memory model. In Programming Language Design and Implementation (PLDI), pages 187--196, 2013.
[32]
Nvidia. CUDA C programming guide, version 5.5, July 2013.
[33]
Nvidia. CUDA by example -- errata, June 2014. http://developer.nvidia.com/cuda-example-errata-page.
[34]
Nvidia. CUDA C programming guide, version 6, July 2014.
[35]
Nvidia. CUDA binary utilities, Aug. 2014. http://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf.
[36]
Nvidia. Parallel thread execution ISA: Version 4.0, Feb. 2014.
[37]
S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-tso. In Theorem Proving in Higher Order Logics (TPHOLs), pages 391--407, 2009.
[38]
J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
[39]
S. Sarkar, P. Sewell, J. Alglave, L. Maranget, and D. Williams. Understanding POWER multiprocessors. In Programming Language Design and Implementation (PLDI), pages 175--186, 2011.
[40]
T. Sorensen. Towards shared memory consistency models for GPUs. Bachelor's thesis, University of Utah, 2013.
[41]
T. Sorensen, G. Gopalakrishnan, and V. Grover. Towards shared memory consistency models for GPUs. In International Conference on Supercomputing (ICS), pages 489--490, 2013.
[42]
J. A. Stuart and J. D. Owens. Efficient synchronization primitives for GPUs. Computing Research Repository (CoRR), abs/1110.4623, 2011. http://arxiv.org/pdf/1110.4623.pdf.
[43]
D. L. Weaver and T. Germond. The SPARC Architecture Manual Version 9. SPARC International Inc., 1994.
[44]
H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In Performance Analysis of Systems Software (ISPASS), pages 235--246, 2010.
[45]
S. Xiao and W. Feng. Inter-block GPU communication via fast barrier synchronization. In International Symposium on Parallel and Distributed Processing (IPDPS), pages 1--12, 2010.

Cited By

View all
  • (2022)CAAT: consistency as a theoryProceedings of the ACM on Programming Languages10.1145/35632926:OOPSLA2(114-144)Online publication date: 31-Oct-2022
  • (2022)Consistency and Coherence for Heterogeneous SystemsA Primer on Memory Consistency and Cache Coherence10.1007/978-3-031-01764-3_10(211-251)Online publication date: 28-Mar-2022
  • (2021)KVCGProceedings of the 14th ACM International Conference on Systems and Storage10.1145/3456727.3463779(1-12)Online publication date: 14-Jun-2021
  • Show More Cited By

Index Terms

  1. GPU Concurrency: Weak Behaviours and Programming Assumptions

    Recommendations

    Reviews

    Karthik S Murthy

    A memory consistency model (MCM) is a specification that describes the value(s) that a memory location should hold based on the causal history of operations that may or may not be associated with that location. The MCM specification is important because the correctness of a program depends on knowing what a thread reads/writes to a memory location in the presence of concurrent accesses by other threads. This specification exists at multiple levels: between a programmer and a compiler for a language, between a compiler and the processor/hardware on which it runs, between multiple hardware components, and so on. Considering this sandwich of specifications and the number of optimizations that many compilers/hardware do today, writing deterministic and performant parallel programs is a challenge for an expert or beginner programmer. Efforts to thoroughly understand and find bugs in MCMs have spanned multiple decades [1,2,3]. This paper plays a crucial role in understanding one such specification: the one between a programmer and the general-purpose graphics processing unit (GPGPU) hardware. In the paper, the authors describe that the observed consistency model on the GPU hardware is that of weak behavior, that is, a relaxed consistency model where operations can be reordered during execution leading to a nonsequentially consistent execution unless appropriate steps are taken such as using fences. They justify this stand by building a set of tests-litmus tests-that help capture this behavior. They then suggest fixes to obtain the required behavior when using such code snippets. They also produce a tool, optcheck, which can inspect GPU assembly code for reorderings that might change the behavior of the litmus tests. Finally, they present a formal model that can help build a simulation tool, proposed not implemented, which would predict possible behaviors of PTX fragments. I do have a few gripes about the paper. The authors do not indicate whether any out-of-thin-air reads were observed in the CoRR experiments. Furthermore, they justify the necessity of fences in Figure 6, but an argument about sufficiency is absent. Overall, the paper is very well written and easily understandable. Of particular mention is the way the authors defend the lack of clarity in vendor specifications via excerpts from such specifications that contradict each other. Finally, Table 2 depicting the behaviors identified by their study is a must-read for any GPU programmer. Online Computing Reviews Service

    Access critical reviews of Computing literature here

    Become a reviewer for Computing Reviews.

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 50, Issue 4
    ASPLOS '15
    April 2015
    676 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2775054
    • Editor:
    • Andy Gill
    Issue’s Table of Contents
    • cover image ACM Conferences
      ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2015
      720 pages
      ISBN:9781450328357
      DOI:10.1145/2694344
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 March 2015
    Published in SIGPLAN Volume 50, Issue 4

    Check for updates

    Author Tags

    1. GPU
    2. Nvidia PTX
    3. formal model
    4. litmus testing
    5. memory consistency
    6. openCL
    7. test generation

    Qualifiers

    • Research-article

    Funding Sources

    • NSF CCF
    • SRC
    • EPSRC
    • EU FP7 project CARP

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)123
    • Downloads (Last 6 weeks)27
    Reflects downloads up to 20 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)CAAT: consistency as a theoryProceedings of the ACM on Programming Languages10.1145/35632926:OOPSLA2(114-144)Online publication date: 31-Oct-2022
    • (2022)Consistency and Coherence for Heterogeneous SystemsA Primer on Memory Consistency and Cache Coherence10.1007/978-3-031-01764-3_10(211-251)Online publication date: 28-Mar-2022
    • (2021)KVCGProceedings of the 14th ACM International Conference on Systems and Storage10.1145/3456727.3463779(1-12)Online publication date: 14-Jun-2021
    • (2021)Systems-on-Chip with Strong OrderingACM Transactions on Architecture and Code Optimization10.1145/342815318:1(1-27)Online publication date: 20-Jan-2021
    • (2020)A Primer on Memory Consistency and Cache Coherence, Second EditionSynthesis Lectures on Computer Architecture10.2200/S00962ED2V01Y201910CAC04915:1(1-294)Online publication date: 4-Feb-2020
    • (2020)Safely Preventing Unbounded Delays During Bus Transactions in FPGA-based SoC2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)10.1109/FCCM48280.2020.00026(129-137)Online publication date: May-2020
    • (2019)A lock-free algorithm of tree-based reduction for large scale clustering on GPGPUProceedings of the 2nd International Conference on Artificial Intelligence and Pattern Recognition10.1145/3357254.3357271(129-133)Online publication date: 16-Aug-2019
    • (2019)Don't Forget About Synchronization!Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3303084.3309488(11-20)Online publication date: 17-Feb-2019
    • (2018)An Implementation of Fast memset() Using Hardware AcceleratorsProceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers10.1145/3217189.3217192(1-6)Online publication date: 12-Jun-2018
    • (2017)Compositional relaxed concurrencyPhilosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences10.1098/rsta.2015.0406375:2104(20150406)Online publication date: 4-Sep-2017
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media