research-article

MIMD synchronization on SIMT architectures

Authors:

Ahmed ElTantawy,

Tor M. AamodtAuthors Info & Claims

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture

Article No.: 11, Pages 1 - 14

Published: 15 October 2016 Publication History

Abstract

In the single-instruction multiple-threads (SIMT) execution model, small groups of scalar threads operate in lockstep. Within each group, current SIMT hardware implementations serialize the execution of threads that follow different paths, and to ensure efficiency, revert to lockstep execution as soon as possible. These constraints must be considered when adapting algorithms that employ synchronization. A deadlock-free program on a multiple-instruction multiple-data (MIMD) architecture may deadlock on a SIMT machine. To avoid this, programmers need to restructure control flow with SIMT scheduling constraints in mind. This requires programmers to be familiar with the underlying SIMT hardware.

In this paper, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. Both the analysis and the transformation algorithms are implemented as LLVM compiler passes. Finally, we propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG, but which can leverage our compiler analysis to gain efficiency. The static detection has a false detection rate of only 4%--5%. The automated transformation has an average performance overhead of 8.2%--10.9% compared to manual transformation. Our hardware approach performs on par with the compiler transformation, however, it avoids synchronization scope limitations, static instruction and register overheads, and debuggability challenges that are present in the compiler only solution.

References

[1]

NVIDIA, CUDA, "NVIDIA CUDA Programming Guide," 2011.

[2]

Intel Corporation, "The ISPC Parallel Execution Model," 2016.

[3]

AMD, "Accelerated Parallel Processing: OpenCL Programming Guide," 2013.

[4]

A. Levinthal and T. Porter, "Chap --- A SIMD Graphics Processor," Proc. ACM Conf. on Comp. Grap. and Interactive Tech. (SIGGRAPH), 1984.

Digital Library

[5]

B. Coon and J. Lindholm, "System and method for managing divergent threads in a simd architecture," 2008. US Patent 7,353,369.

[6]

B. Beylin and R. S. Glanville, "Insertion of multithreaded execution synchronization points in a software program," 2013. US Patent 8,381,203.

[7]

AMD Corporation, "Southern Islands Series Instruction Set Architecture," 2012.

[8]

M. HOUSTON, B. Gaster, L. HOWES, M. Mantor, and D. Behr, "Method and System for Synchronization of Workitems with Divergent Control Flow," 2013. WO Patent App. PCT/US2013/043,394.

[9]

A. Habermaier and A. Knapp, "On the Correctness of the SIMT Execution Model of GPUs," in Programming Languages and Systems, pp. 316--335, Springer, 2012.

Digital Library

[10]

S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun, "Accelerating CUDA Graph Algorithms at Maximum Warp," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 267--276, 2011.

Digital Library

[11]

M. Burtscher, R. Nasre, and K. Pingali, "A Quantitative Study of Irregular Programs on GPUs," in Proc. IEEE Symp. on Workload Characterization (IISWC), 2012.

Digital Library

[12]

D. Merrill, M. Garland, and A. Grimshaw, "Scalable GPU Graph Traversal," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 117--128, 2012.

Digital Library

[13]

S. Lee, S.-J. Min, and R. Eigenmann, "OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 101--110, 2009.

Digital Library

[14]

G. Noaje, C. Jaillet, and M. Krajecki, "Source-to-source Code Translator: OpenMP C to CUDA," in IEEE Int'l Conf. on High Performance Computing and Communications (HPCC), 2011.

Digital Library

[15]

C. Bertolli, S. F. Antao, A. E. Eichenberger, K. O'Brien, Z. Sura, A. C. Jacob, T. Chen, and O. Sallenave, "Coordinating GPU Threads for OpenMP 4.0 in LLVM," in Proc. LLVM Compiler Infrastructure in HPC, 2014.

Digital Library

[16]

S. Antao, C. Bertolli, A. Bokhanko, A. Eichenberger, H. Finkel, S. Ostanevich, E. Stotzer, and G. Zhang, "OpenMP Offload Infrastructure in LLVM," tech. rep.

[17]

OpenMP Clang Frontend, "OpenMP Clang Frontend Documentation." https://github.com/clang-omp, 2015.

[18]

X. Tian and B. R. de Supins, "Explicit Vector Programming with OpenMP 4.0 SIMD Extension," Primeur Magazine 2014, 2014.

[19]

A. ElTantawy, "SSDE and AWARE codes." https://github.com/ElTantawy/mimd_to_simt/, 2016.

[20]

W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 407--420, 2007.

Digital Library

[21]

J. Meng, D. Tarjan, and K. Skadron, "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 235--246, 2010.

Digital Library

[22]

NVIDIA Forums, "atomicCAS does NOT seem to work." http://forums.nvidia.com/index.php?showtopic=98444, 2009.

[23]

NVIDIA Forums, "atomic locks." https://devtalk.nvidia.com/default/topic/512038/atomic-locks/, 2012.

[24]

A. Ramamurthy, "Towards Scalar Synchronization in SIMT Architectures," Master's thesis, The University of British Columbia, 2011.

[25]

W. W. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 296--307, 2011.

Digital Library

[26]

Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian, "Software Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), p. 1, 2014.

Digital Library

[27]

H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU Microarchitecture Through Microbenchmarking," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 235--246, 2010.

[28]

A. Betts, N. Chong, A. Donaldson, S. Qadeer, and P. Thomson, "GPUVerify: a Verifier for GPU Kernels," in Proc. ACM Int'l Conf. on Object oriented programming systems languages and applications, pp. 113--132, 2012.

Digital Library

[29]

G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan, "GKLEE: Concolic Verification and Test Generation for GPUs," in PPoPP, 2012.

Digital Library

[30]

R. Sharma, M. Bauer, and A. Aiken, "Verification of Producer-Consumer Synchronization in GPU Programs," in Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), pp. 88--98, 2015.

Digital Library

[31]

"How does the OpenACC API relate to the OpenMP API?." OpenACC.org.

[32]

LLVM Compiler, "LLVN 3.6 Release Information." http://llvm.org/releases/3.6.0/, 2015.

[33]

M. Villmow, "AMD OpenCL Compiler." LLVM Developers Conference, 2010. Presentation.

[34]

NVIDIA, "CUDA LLVM Compiler." https://developer.nvidia.com/cuda-llvm-compiler, 2015.

[35]

Intel Developer Zone, "Weird behaviour of atomic functions." https://software.intel.com/en-us/forums/opencl/topic/278350, 2012.

[36]

NVIDIA Forums, "GLSL Spinlock." https://devtalk.nvidia.com/default/topic/768115/opengl/glsl-spinlock/, 2014.

[37]

M. Burtscher and K. Pingali, "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm," GPU computing Gems Emerald edition, 2011.

[38]

A. ElTantawy and T. M. Aamodt, "Correctness Discussion of a SIMT-induced Deadlock Elimination Algorithm," tech. rep., University of British Columbia, 2016.

[39]

NVIDIA, "CUDA Binary Utilities," http://docs.nvidia.com/cuda/cuda-binary-utilities/, 2015.

[40]

S. Horwitz, T. Reps, and D. Binkley, "Interprocedural Slicing Using Dependence Graphs," in PLDI, 1988.

Digital Library

[41]

M. Burke and R. Cytron, "Interprocedural Dependence Analysis and Parallelization," in Proc. ACM SIGPLAN Symp. on Compiler Construction, 1986.

Digital Library

[42]

U. Hölzle, C. Chambers, and D. Ungar, "Debugging Optimized Code with Dynamic Deoptimization," in Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), 1992.

Digital Library

[43]

J. Hennessy, "Symbolic Debugging of Optimized Code," ACM Trans. on Prog. Lang. and Sys.(TOPLAS), vol. 4, no. 3, pp. 323--344, 1982.

Digital Library

[44]

A. ElTantawy, J. W. Ma, M. O'Connor, and T. M. Aamodt, "A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2014.

[45]

B. W. Coon, P. C. Mills, S. F. Oberman, and M. Y. Siu., "Tracking Register Usage during Multithreaded Processing Using a Scoreboard having Separate Memory Regions and Storing Sequential Register Size Indicators. US Patent 7,434,032," 2008.

[46]

T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 72--83, 2012.

Digital Library

[47]

A. Brownsword, "Cloth in OpenCL," tech. rep., Khronos Group, 2009.

[48]

J. Coplin and M. Burtscher, "Effects of Source-Code Optimizations on GPU Performance and Energy Consumption," in Proc. ACM Workshop on General Purpose Processing on Graphics Processing Units, 2015.

Digital Library

[49]

J. M. Bull, "Measuring synchronisation and scheduling overheads in openmp," in Proc. European Workshop on OpenMP, vol. 8, p. 49, 1999.

[50]

Microsoft. https://msdn.microsoft.com/en-us/library/b38674ky.aspx, 2016.

[51]

LLVM Compiler, "LLVM Alias Analysis Infrastructure." http://llvm.org/docs/AliasAnalysis.html, 2015.

[52]

V. C. Sreedhar, G. R. Gao, and Y.-F. Lee, "Identifying loops using dj graphs," ACM Trans. on Prog. Lang. and Sys. (TOPLAS), 1996.

Digital Library

[53]

A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 163--174, 2009.

[54]

T. M. Aamodt et al., GPGPU-Sim 3.x Manual. University of British Columbia, 2013.

[55]

LLVM Compiler, "Clang Front End." http://clang.llvm.org/, 2015.

[56]

LLVM Compiler, "LIBCLC Library." http://libclc.llvm.org/, 2015.

[57]

Dmitry Mikushin, "CUDA to LLVM-IR." https://github.com/apc-llc/nvcc-llvm-ir, 2015.

[58]

NVIDIA, "LibNVVM Library." http://docs.nvidia.com/cuda/libnvvm-api/, 2015.

[59]

NVIDIA, "CUDA SDK 3.2," September 2013.

[60]

S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in Proc. IEEE Symp. on Workload Characterization (IISWC), pp. 44--54, 2009.

Digital Library

[61]

Jeff Larkin, NVIDIA, "OpenMP and NVIDIA." http://openmp.org/sc13/SC13_OpenMP_and_NVIDIA.pdf, 2013.

[62]

Michael Wong, Alexey Bataev, "OpenMP GPU/Accelerators Coming of Age in Clang." http://llvm.org/devmtg/2015-10/slides/WongBataev-OpenMPGPUAcceleratorsComingOfAgeInClang.pdf, 2015.

[63]

C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Integrating GPU support for OpenMP offloading directives into Clang," in Proc. ACM Int'l Workshop on the LLVM Compiler Infrastructure in HPC, 2015.

Digital Library

[64]

G.-T. Bercea, C. Bertolli, S. F. Antao, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Performance Analysis of OpenMp on a GPU Using a Coral Proxy Application," in Proc. ACM Int'l Workshop on Perf. Modeling, Benchmarking, and Simulation of High Perf. Computing Sys., 2015.

Digital Library

[65]

M. Rhu and M. Erez, "The Dual-Path Execution Model for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 235--246, 2013.

Digital Library

[66]

G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili, "SIMD Re-convergence at Thread Frontiers," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 477--488, 2011.

Digital Library

[67]

W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 25--36, 2011.

Digital Library

[68]

V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 308--317, 2011.

Digital Library

[69]

M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 61--71, 2012.

Digital Library

[70]

Y. Lee, R. Krashinsky, V. Grover, S. Keckler, and K. Asanovic, "Convergence and Scalarization for Data-Parallel Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), pp. 1--11, 2013.

Digital Library

[71]

M. Zheng, V. T. Ravi, F. Qin, and G. Agrawal, "GRace: a Low-overhead Mechanism for Detecting Data Races in GPU Programs," in ACM SIGPLAN Notices, vol. 46, pp. 135--146, ACM, 2011.

Digital Library

[72]

C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), 2004.

Digital Library

[73]

J. A. Stratton, S. S. Stone, and W. H. Wen-mei, "MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs," in Languages and Compilers for Parallel Computing, Springer, 2008.

Digital Library

[74]

A. Yilmazer and D. Kaeli, "HQL: A Scalable Synchronization Mechanism for GPUs," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, 2013.

Digital Library

[75]

A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar, "Fine-grained Synchronizations and Dataflow Programming on GPUs," 2015.

[76]

Y. Xu, L. Gao, R. Wang, Z. Luan, W. Wu, and D. Qian, "Lock-based Synchronization for GPU Architectures," in Proc. Int'l Conf. on Computing Frontiers, 2016.

Digital Library

Cited By

Nelson JPalmieri R(2019)Don't Forget About Synchronization!Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3303084.3309488(11-20)Online publication date: 17-Feb-2019
https://dl.acm.org/doi/10.1145/3303084.3309488
Wang KFussell DLin CBahar IHerlihy MWitchel ELebeck A(2019)Fast Fine-Grained Global Synchronization on GPUsProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304055(793-806)Online publication date: 4-Apr-2019
https://dl.acm.org/doi/10.1145/3297858.3304055

Recommendations

SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores

This work introduces Single Instruction Multi-Thread Express (SIMT-X), a general-purpose Central Processing Unit (CPU) microarchitecture that enables Graphics Processing Units (GPUs)-style SIMT execution across multiple threads of the same program for ...
Lock-based synchronization for GPU architectures
CF '16: Proceedings of the ACM International Conference on Computing Frontiers

Modern GPUs have shown promising results in accelerating compute-intensive and numerical workloads with limited data sharing. However, emerging GPU applications manifest ample amount of data sharing among concurrently executing threads. Often data ...
Enabling SIMT Execution Model on Homogeneous Multi-Core System

Single-instruction multiple-thread (SIMT) machine emerges as a primary computing device in high-perfor-mance computing, since the SIMT execution paradigm can exploit data-level parallelism effectively. This article explores the SIMT execution potential ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture

October 2016

816 pages

General Chairs:
Wei-Chung Hsu
NTU, Taiwan
,
Chia-Lin Yang
NTU, Taiwan
,
Program Chairs:
Mikko Lipasti
Univ. Wisconsin
,
Hsien-Hsin Lee
TSMC, Taiwan

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS\DATC: IEEE Computer Society

Publisher

IEEE Press

Publication History

Published: 15 October 2016

Check for updates

Qualifiers

Research-article

Conference

MICRO-49

Sponsor:

SIGMICRO
IEEE-CS\DATC

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture

October 15 - 19, 2016

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
103
Total Downloads

Downloads (Last 12 months)3
Downloads (Last 6 weeks)0

Reflects downloads up to 22 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Nelson JPalmieri R(2019)Don't Forget About Synchronization!Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3303084.3309488(11-20)Online publication date: 17-Feb-2019
https://dl.acm.org/doi/10.1145/3303084.3309488
Wang KFussell DLin CBahar IHerlihy MWitchel ELebeck A(2019)Fast Fine-Grained Global Synchronization on GPUsProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304055(793-806)Online publication date: 4-Apr-2019
https://dl.acm.org/doi/10.1145/3297858.3304055

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents