research-article

An efficient sequential consistency implementation with dynamic race detection for GPUs

Authors:

Abdulaziz Tabbakh,

Murali Annavaram

Volume 187, Issue C

https://doi.org/10.1016/j.jpdc.2023.104836

Published: 16 May 2024

Abstract

As GPUs are increasingly used for general-purpose computation, applications with diverse memory access requirements have emerged. Despite this growing demand, only a few GPU coherence protocols and memory models have been explored in research, and even fewer have been implemented in products. In the CPU domain, by contrast, a diverse range of memory models for parallel programming has been proposed, exploring the interplay between performance and programmability.

Sequential consistency (SC) is a strict memory model. It provides the most programmer-intuitive execution of memory operations, but it imposes strict ordering restrictions on those operations, which causes performance overhead. Hence, implementing and supporting SC is one of the most challenging tasks on any computing platform, and GPUs are no exception. In this paper, we propose a GPU architecture that implements the SC memory model with minimal performance and power overhead. We achieve this goal by designing a mechanism that detects races between different streaming multiprocessors (SMs) dynamically at runtime. Races are detected using a signature-based mechanism that keeps track of the set of unseen updates for each SM, which significantly reduces the hardware implementation cost at the price of a small increase in invalidation traffic. Our experiments show that dynamic race detection can be used to implement sequential consistency with 5% performance overhead.
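The abstract does not spell out the hardware design, so the following is only a minimal software sketch of the general idea of tracking unseen updates with signatures. It assumes a Bloom-filter-style signature per SM; the names `Signature` and `RaceDetector`, the signature size, the hash function, and the clear-on-hit response are all hypothetical choices made for illustration and are not taken from the paper.

```python
import hashlib


class Signature:
    """Fixed-size, Bloom-filter-style summary of a set of addresses (hypothetical)."""

    def __init__(self, bits=256, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.word = 0  # bit vector stored as a Python int

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address; the hash choice is arbitrary.
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def insert(self, addr):
        for p in self._positions(addr):
            self.word |= 1 << p

    def may_contain(self, addr):
        # False means definitely absent; True may be a false positive.
        return all((self.word >> p) & 1 for p in self._positions(addr))

    def clear(self):
        self.word = 0


class RaceDetector:
    """Tracks, per SM, the writes by other SMs that it has not yet observed."""

    def __init__(self, num_sms):
        self.unseen = [Signature() for _ in range(num_sms)]

    def on_write(self, writer_sm, line_addr):
        # A write by one SM becomes an unseen update for every other SM.
        for sm, sig in enumerate(self.unseen):
            if sm != writer_sm:
                sig.insert(line_addr)

    def on_access(self, sm, line_addr):
        # A signature hit conservatively flags a possible race with a remote write.
        if self.unseen[sm].may_contain(line_addr):
            # Assumed response: the SM invalidates stale data and resets its signature.
            self.unseen[sm].clear()
            return True
        return False
```

In this sketch a signature never misses a real conflict but can report a spurious one, which would trigger an unnecessary invalidation. That is one plausible reading of the trade-off the abstract describes between lower hardware cost and slightly higher invalidation traffic.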

Published In

Journal of Parallel and Distributed Computing Volume 187, Issue C

May 2024

181 pages

Elsevier Inc.

Publisher

Academic Press, Inc.

United States
