research-article

Synchronization Using Remote-Scope Promotion

Authors:

Bradford M. Beckmann,

David A. Wood

ACM SIGARCH Computer Architecture News, Volume 43, Issue 1

Pages 73 - 86

https://doi.org/10.1145/2786763.2694350

Published: 14 March 2015

Abstract

Heterogeneous system architecture (HSA) and OpenCL define scoped synchronization to facilitate low-overhead communication across a subset of threads. Scoped synchronization works well for static sharing patterns, where consumer threads are known a priori. It works poorly for dynamic sharing patterns (e.g., work stealing), where programmers cannot use a faster small scope because of the rare possibility that the work is stolen by a thread in a distant, slower scope. This puts programmers in a conundrum: optimize the common case by synchronizing at a faster small scope, or use work stealing at a slower large scope. In this paper, we propose to extend scoped synchronization with remote-scope promotion. This allows the most frequent sharers to synchronize through a small scope. Infrequent sharers synchronize by promoting that remote small scope to a larger shared scope. Synchronization using remote-scope promotion provides performance robustness for dynamic workloads, where the benefits provided by scoped synchronization and work stealing are hard to anticipate. Compared to a naïve baseline, static scoped synchronization alone achieves a 1.07x speedup on average and dynamic work stealing alone achieves a 1.18x speedup on average. In contrast, synchronization using remote-scope promotion achieves a robust 1.25x speedup on average across a diverse set of graph benchmarks and inputs.
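The trade-off described in the abstract can be made concrete with a toy cost model. The Python sketch below is an illustration only, not the paper's mechanism or its evaluation method: the function name `total_sync_cost`, the strategy labels, and all three cost constants are hypothetical and chosen purely for the example. It counts synchronization cost for a single task queue and does not model the load-balancing benefit of stealing.

```python
# Toy cost model. All constants are made-up units (assumptions),
# not measurements or parameters taken from the paper.
LOCAL_SYNC = 1    # hypothetical cost of synchronizing at a small, fast scope
GLOBAL_SYNC = 10  # hypothetical cost of synchronizing at a large, slow scope
PROMOTION = 15    # hypothetical cost for a thief to promote a remote small scope


def total_sync_cost(owner_ops, steals, strategy):
    """Return the total synchronization cost for one task queue.

    owner_ops: queue operations by the queue's owner (the frequent sharer)
    steals:    queue operations by remote threads (the infrequent sharers)
    strategy:  one of the three hypothetical strategy labels below
    """
    if strategy == "global_scope":
        # Every operation synchronizes at the large scope so steals are safe.
        return (owner_ops + steals) * GLOBAL_SYNC
    if strategy == "local_scope_no_stealing":
        # The owner uses the small scope; remote threads may not steal.
        if steals:
            raise ValueError("stealing is not allowed with small-scope-only sync")
        return owner_ops * LOCAL_SYNC
    if strategy == "remote_scope_promotion":
        # The owner stays at the small scope; each steal pays for promotion.
        return owner_ops * LOCAL_SYNC + steals * PROMOTION
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("global_scope", "remote_scope_promotion"):
        print(name, total_sync_cost(owner_ops=1000, steals=10, strategy=name))
```

Under these assumed costs, 1000 owner operations and 10 steals cost 10100 units when everything synchronizes at the large scope and 1150 units when only the thieves pay for promotion. The point of the sketch is the shape of the trade-off: when steals are rare, charging the extra cost to the infrequent sharer keeps the common case cheap.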





Published In


ACM SIGARCH Computer Architecture News Volume 43, Issue 1

ASPLOS'15

March 2015

676 pages

ISSN:0163-5964

DOI:10.1145/2786763

Editor: Doug DeGroot


ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN:9781450328357
DOI:10.1145/2694344
General Chairs: Ozcan Ozturk (Bilkent University, Turkey), Kemal Ebcioglu (Global Supercomputing, USA)
Program Chair: Sandhya Dwarkadas (University of Rochester, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States



