DOI: 10.5555/3195638.3195669
research-article

Lazy release consistency for GPUs

Published: 15 October 2016

Abstract

The heterogeneous-race-free (HRF) memory model has been embraced by the Heterogeneous System Architecture (HSA) Foundation and OpenCL because it clearly and precisely defines the behavior of current GPUs. However, compared to the simpler SC for DRF memory model, HRF has two shortcomings. The first is that HRF requires programmers to label atomic memory operations with the correct scope of synchronization. This explicit labeling can save significant coherence overhead when synchronization is local, but it is tedious and error-prone. The second shortcoming is that HRF restricts important dynamic data sharing patterns like work stealing. Prior work on remote-scope promotion (RSP) attempted to resolve the second shortcoming. However, RSP further complicates the memory model and no scalable implementation of RSP has been proposed. For example, we found that the previously proposed RSP implementation actually results in slowdowns of up to 30% on large GPUs, compared to a naïve baseline system that forgoes work stealing and scopes. Meanwhile, DeNovo has been shown to offer efficient synchronization with an SC for DRF memory model, performing on average 21% better than our baseline system, but it introduces additional overheads to maintain ownership of all modified data.
To resolve these deficiencies, we propose to adapt lazy release consistency, previously proposed only for homogeneous CPU systems, to a heterogeneous system. Our approach, called hLRC, uses a DeNovo-like mechanism to track ownership of synchronization variables, lazily performing coherence actions only when a synchronization variable changes locations. hLRC allows GPU programmers to use the simpler SC for DRF memory model without tracking ownership for all modified data. Our evaluation shows that lazy release consistency provides robust performance improvement across a set of work-stealing graph analysis applications: 29% on average versus the baseline system.
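The lazy behavior described above can be illustrated with a small software model. The following Python sketch is a toy under stated assumptions, not the hLRC hardware protocol from the paper: the names `Cache`, `ToyLazySystem`, `sync`, and `coherence_actions` are hypothetical, each compute unit (CU) is assumed to have one private cache, and a single `sync` call stands in for both acquire and release. It shows only the core idea that a flush and an invalidation occur when a synchronization variable changes owner, while repeated synchronization by the same owner costs nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Toy private cache of one compute unit (CU); hypothetical structure."""
    dirty: dict = field(default_factory=dict)  # addr -> value not yet written back
    clean: dict = field(default_factory=dict)  # addr -> possibly stale local copy


class ToyLazySystem:
    """Illustrative model: coherence actions happen only when a sync variable moves."""

    def __init__(self, num_cus):
        self.memory = {}                        # shared backing store
        self.caches = [Cache() for _ in range(num_cus)]
        self.sync_owner = {}                    # sync variable -> owning CU index
        self.coherence_actions = 0              # flushes plus invalidations so far

    def store(self, cu, addr, value):
        cache = self.caches[cu]
        cache.clean.pop(addr, None)
        cache.dirty[addr] = value               # stays local until a flush

    def load(self, cu, addr):
        cache = self.caches[cu]
        if addr in cache.dirty:
            return cache.dirty[addr]
        if addr not in cache.clean:             # miss: fetch from shared memory
            cache.clean[addr] = self.memory.get(addr, 0)
        return cache.clean[addr]

    def _flush(self, cu):
        # Write back every dirty line of this CU to shared memory.
        self.memory.update(self.caches[cu].dirty)
        self.caches[cu].dirty.clear()
        self.coherence_actions += 1

    def _invalidate(self, cu):
        # Drop possibly stale clean copies; dirty lines are kept.
        self.caches[cu].clean.clear()
        self.coherence_actions += 1

    def sync(self, cu, var):
        """Synchronize on `var`; return True if ownership had to move."""
        owner = self.sync_owner.get(var)
        if owner == cu:
            return False                        # local sync: no coherence work
        if owner is not None:
            self._flush(owner)                  # publish the previous owner's writes
        self._invalidate(cu)                    # discard stale data at the new owner
        self.sync_owner[var] = cu
        return True
```

In this model, a CU that repeatedly synchronizes on a variable it already owns never triggers a flush or invalidation. When a different CU synchronizes on the same variable (for example, a thief in a work-stealing queue), the previous owner's writes are flushed and the new owner's stale copies are dropped before ownership transfers.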

Published In

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture
October 2016
816 pages

Publisher

IEEE Press

Author Tags

  1. graphics processing unit (GPU)
  2. lazy release consistency
  3. memory model
  4. scope promotion
  5. scoped synchronization
  6. work stealing

Qualifiers

  • Research-article

Conference

MICRO-49
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2023) HeteroGen: Automatic Synthesis of Heterogeneous Cache Coherence Protocols. IEEE Micro 43:4 (62-70). DOI: 10.1109/MM.2023.3274993. Online publication date: 1-Jul-2023
  • (2019) CoNDA. Proceedings of the 46th International Symposium on Computer Architecture (629-642). DOI: 10.1145/3307650.3322266. Online publication date: 22-Jun-2019
  • (2019) Fast Fine-Grained Global Synchronization on GPUs. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (793-806). DOI: 10.1145/3297858.3304055. Online publication date: 4-Apr-2019
  • (2018) Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (339-351). DOI: 10.1109/MICRO.2018.00035. Online publication date: 20-Oct-2018
  • (2018) Spandex. Proceedings of the 45th Annual International Symposium on Computer Architecture (261-274). DOI: 10.1109/ISCA.2018.00031. Online publication date: 2-Jun-2018