Heterogeneous-race-free memory models

Published: 24 February 2014

Abstract

Commodity heterogeneous systems (e.g., integrated CPUs and GPUs) now support a unified, shared memory address space for all components. Because the latency of global communication in a heterogeneous system can be prohibitively high, heterogeneous systems (unlike homogeneous CPU systems) provide synchronization mechanisms that only guarantee ordering among a subset of threads, which we call a scope. Unfortunately, the consequences and semantics of these scoped operations are not yet well understood. Without a formal and approachable model to reason about the behavior of these operations, we risk an array of portability and performance issues.
In this paper, we embrace scoped synchronization with a new class of memory consistency models that add scoped synchronization to data-race-free models like those of C++ and Java. Called sequential consistency for heterogeneous-race-free (SC for HRF), the new models guarantee SC for programs with "sufficient" synchronization (no data races) of "sufficient" scope. We discuss two such models. The first, HRF-direct, works well for programs with highly regular parallelism. The second, HRF-indirect, builds on HRF-direct by allowing synchronization using different scopes in some cases involving transitive communication. We quantitatively show that HRF-indirect encourages forward-looking programs with irregular parallelism by showing up to a 10% performance increase in a task runtime for GPUs.
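The difference between the two models can be illustrated with a toy check. The following is a minimal sketch, not the paper's formal definition: it assumes a hypothetical two-level scope hierarchy (`WORK_GROUP` and `DEVICE`, names chosen here for illustration), a single device, and it reduces synchronization to release/acquire pairs. The helper names `direct_ok` and `indirect_ok` are likewise assumptions. Under these assumptions, a pair counts as sufficiently scoped in the HRF-direct style only when both sides name the same scope instance, while the HRF-indirect style also accepts a chain of such pairs that passes through a common thread.

```python
from dataclasses import dataclass

# Hypothetical scope names for illustration only (not taken from the paper).
WORK_GROUP = "work_group"
DEVICE = "device"


@dataclass(frozen=True)
class Thread:
    tid: int
    work_group: int  # index of the work-group this thread belongs to


@dataclass(frozen=True)
class Sync:
    thread: Thread
    scope: str  # WORK_GROUP or DEVICE


def scope_instance(sync):
    """Identify the concrete group of threads that a sync operation covers."""
    if sync.scope == WORK_GROUP:
        return (WORK_GROUP, sync.thread.work_group)
    return (DEVICE, 0)  # assumption: a single device


def direct_ok(release, acquire):
    """HRF-direct-style check: both sides use the same scope instance."""
    return scope_instance(release) == scope_instance(acquire)


def indirect_ok(chain):
    """HRF-indirect-style check on a list of (release, acquire) pairs.

    Each pair must be directly well scoped, and each acquiring thread must be
    the thread that performs the next release, so ordering carries through.
    """
    for i, (release, acquire) in enumerate(chain):
        if not direct_ok(release, acquire):
            return False
        if i + 1 < len(chain) and chain[i + 1][0].thread != acquire.thread:
            return False
    return True


# t0 and t1 share work-group 0; t2 sits in work-group 1.
t0, t1, t2 = Thread(0, 0), Thread(1, 0), Thread(2, 1)

# t0 -> t1 at work-group scope, then t1 -> t2 at device scope.
chain = [
    (Sync(t0, WORK_GROUP), Sync(t1, WORK_GROUP)),
    (Sync(t1, DEVICE), Sync(t2, DEVICE)),
]
print("direct t0->t2 at work-group scope:",
      direct_ok(Sync(t0, WORK_GROUP), Sync(t2, WORK_GROUP)))
print("indirect t0->t1->t2:", indirect_ok(chain))
```

In this sketch, t0 and t2 never synchronize with each other using a common scope, so the direct-style check rejects them, while the chain through t1 is accepted by the indirect-style check. That is the kind of transitive communication the abstract refers to.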

Published In

ACM SIGPLAN Notices, Volume 49, Issue 4
ASPLOS '14
April 2014
729 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2644865

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-race-free
  2. heterogeneous systems
  3. memory consistency model
  4. task runtime

Qualifiers

  • Research-article
