research-article

An efficient sequential consistency implementation with dynamic race detection for GPUs

Published: 16 May 2024

Abstract

As GPUs are increasingly used for general-purpose computation, applications with different memory access requirements have emerged. Despite the growing demand, only a few GPU coherence protocols and memory models have been explored in research, and even fewer have been implemented in products. In the CPU domain, by contrast, a diverse range of memory models for parallel programming has been proposed, exploring the interplay between performance and programmability.
Sequential consistency (SC) is one of the strict memory models. It provides the most programmer-intuitive execution of memory operations, but it imposes strict ordering restrictions on those operations, which cause performance overhead. Hence, implementing and supporting SC is one of the most challenging tasks on any computing platform, and GPUs are no exception. In this paper, we propose a GPU architecture that implements the SC memory model with minimal performance and power overhead. We achieve this goal by designing a mechanism that detects races between different streaming multiprocessors (SMs) dynamically at runtime. Races are detected using a signature-based mechanism that keeps track of the set of unseen updates for each SM, which significantly reduces the hardware implementation cost at the price of a small increase in invalidation traffic. Our experiments show that dynamic race detection can be used to implement sequential consistency with a 5% performance overhead.
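
The abstract does not spell out how the signatures are built, so the following is only an illustrative sketch of the general idea, not the paper's design. It assumes a Bloom-filter-style signature per SM that conservatively records addresses written by other SMs and not yet observed locally. The names `Signature`, `RaceTracker`, `record_store`, `load_may_race`, and `acknowledge_all`, as well as the bit-vector size and hash count, are hypothetical.

```python
import hashlib


class Signature:
    """Fixed-size Bloom-filter-style signature over memory addresses.

    Assumption: sizes and hash functions are illustrative only.
    """

    def __init__(self, num_bits=1024, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored as a Python int

    def _positions(self, addr):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                f"{i}:{addr}".encode(), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def insert(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_contain(self, addr):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.bits = 0


class RaceTracker:
    """Hypothetical tracker: one signature of unseen updates per SM."""

    def __init__(self, num_sms):
        self.unseen = [Signature() for _ in range(num_sms)]

    def record_store(self, writer_sm, addr):
        # A store by one SM is an unseen update for every other SM.
        for sm, sig in enumerate(self.unseen):
            if sm != writer_sm:
                sig.insert(addr)

    def load_may_race(self, reader_sm, addr):
        # A hit means the reader may hold stale data for this address.
        return self.unseen[reader_sm].may_contain(addr)

    def acknowledge_all(self, sm):
        # Assumed step: after the SM invalidates stale lines, reset its set.
        self.unseen[sm].clear()
```

In this sketch a false positive only triggers an unnecessary invalidation, which is consistent with the abstract's trade-off of lower hardware cost for slightly more invalidation traffic.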



Information & Contributors

Information

Published In

Journal of Parallel and Distributed Computing, Volume 187, Issue C
May 2024
181 pages

Publisher

Academic Press, Inc.

United States


Author Tags

  1. Computer architecture
  2. GPU
  3. Memory coherence
  4. Sequential consistency

Qualifiers

  • Research-article
