Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/3195638.3195652acmconferencesArticle/Chapter ViewAbstractPublication PagesmicroConference Proceedingsconference-collections
research-article

MIMD synchronization on SIMT architectures

Published: 15 October 2016 Publication History

Abstract

In the single-instruction multiple-threads (SIMT) execution model, small groups of scalar threads operate in lockstep. Within each group, current SIMT hardware implementations serialize the execution of threads that follow different paths, and to ensure efficiency, revert to lockstep execution as soon as possible. These constraints must be considered when adapting algorithms that employ synchronization. A deadlock-free program on a multiple-instruction multiple-data (MIMD) architecture may deadlock on a SIMT machine. To avoid this, programmers need to restructure control flow with SIMT scheduling constraints in mind. This requires programmers to be familiar with the underlying SIMT hardware.
In this paper, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. Both the analysis and the transformation algorithms are implemented as LLVM compiler passes. Finally, we propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG, but which can leverage our compiler analysis to gain efficiency. The static detection has a false detection rate of only 4%--5%. The automated transformation has an average performance overhead of 8.2%--10.9% compared to manual transformation. Our hardware approach performs on par with the compiler transformation, however, it avoids synchronization scope limitations, static instruction and register overheads, and debuggability challenges that are present in the compiler only solution.

References

[1]
NVIDIA, CUDA, "NVIDIA CUDA Programming Guide," 2011.
[2]
Intel Corporation, "The ISPC Parallel Execution Model," 2016.
[3]
AMD, "Accelerated Parallel Processing: OpenCL Programming Guide," 2013.
[4]
A. Levinthal and T. Porter, "Chap --- A SIMD Graphics Processor," Proc. ACM Conf. on Comp. Grap. and Interactive Tech. (SIGGRAPH), 1984.
[5]
B. Coon and J. Lindholm, "System and method for managing divergent threads in a simd architecture," 2008. US Patent 7,353,369.
[6]
B. Beylin and R. S. Glanville, "Insertion of multithreaded execution synchronization points in a software program," 2013. US Patent 8,381,203.
[7]
AMD Corporation, "Southern Islands Series Instruction Set Architecture," 2012.
[8]
M. HOUSTON, B. Gaster, L. HOWES, M. Mantor, and D. Behr, "Method and System for Synchronization of Workitems with Divergent Control Flow," 2013. WO Patent App. PCT/US2013/043,394.
[9]
A. Habermaier and A. Knapp, "On the Correctness of the SIMT Execution Model of GPUs," in Programming Languages and Systems, pp. 316--335, Springer, 2012.
[10]
S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun, "Accelerating CUDA Graph Algorithms at Maximum Warp," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 267--276, 2011.
[11]
M. Burtscher, R. Nasre, and K. Pingali, "A Quantitative Study of Irregular Programs on GPUs," in Proc. IEEE Symp. on Workload Characterization (IISWC), 2012.
[12]
D. Merrill, M. Garland, and A. Grimshaw, "Scalable GPU Graph Traversal," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 117--128, 2012.
[13]
S. Lee, S.-J. Min, and R. Eigenmann, "OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 101--110, 2009.
[14]
G. Noaje, C. Jaillet, and M. Krajecki, "Source-to-source Code Translator: OpenMP C to CUDA," in IEEE Int'l Conf. on High Performance Computing and Communications (HPCC), 2011.
[15]
C. Bertolli, S. F. Antao, A. E. Eichenberger, K. O'Brien, Z. Sura, A. C. Jacob, T. Chen, and O. Sallenave, "Coordinating GPU Threads for OpenMP 4.0 in LLVM," in Proc. LLVM Compiler Infrastructure in HPC, 2014.
[16]
S. Antao, C. Bertolli, A. Bokhanko, A. Eichenberger, H. Finkel, S. Ostanevich, E. Stotzer, and G. Zhang, "OpenMP Offload Infrastructure in LLVM," tech. rep.
[17]
OpenMP Clang Frontend, "OpenMP Clang Frontend Documentation." https://github.com/clang-omp, 2015.
[18]
X. Tian and B. R. de Supins, "Explicit Vector Programming with OpenMP 4.0 SIMD Extension," Primeur Magazine 2014, 2014.
[19]
A. ElTantawy, "SSDE and AWARE codes." https://github.com/ElTantawy/mimd_to_simt/, 2016.
[20]
W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 407--420, 2007.
[21]
J. Meng, D. Tarjan, and K. Skadron, "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 235--246, 2010.
[22]
NVIDIA Forums, "atomicCAS does NOT seem to work." http://forums.nvidia.com/index.php?showtopic=98444, 2009.
[23]
NVIDIA Forums, "atomic locks." https://devtalk.nvidia.com/default/topic/512038/atomic-locks/, 2012.
[24]
A. Ramamurthy, "Towards Scalar Synchronization in SIMT Architectures," Master's thesis, The University of British Columbia, 2011.
[25]
W. W. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 296--307, 2011.
[26]
Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian, "Software Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), p. 1, 2014.
[27]
H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU Microarchitecture Through Microbenchmarking," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 235--246, 2010.
[28]
A. Betts, N. Chong, A. Donaldson, S. Qadeer, and P. Thomson, "GPUVerify: a Verifier for GPU Kernels," in Proc. ACM Int'l Conf. on Object oriented programming systems languages and applications, pp. 113--132, 2012.
[29]
G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan, "GKLEE: Concolic Verification and Test Generation for GPUs," in PPoPP, 2012.
[30]
R. Sharma, M. Bauer, and A. Aiken, "Verification of Producer-Consumer Synchronization in GPU Programs," in Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), pp. 88--98, 2015.
[31]
"How does the OpenACC API relate to the OpenMP API?." OpenACC.org.
[32]
LLVM Compiler, "LLVN 3.6 Release Information." http://llvm.org/releases/3.6.0/, 2015.
[33]
M. Villmow, "AMD OpenCL Compiler." LLVM Developers Conference, 2010. Presentation.
[34]
NVIDIA, "CUDA LLVM Compiler." https://developer.nvidia.com/cuda-llvm-compiler, 2015.
[35]
Intel Developer Zone, "Weird behaviour of atomic functions." https://software.intel.com/en-us/forums/opencl/topic/278350, 2012.
[36]
NVIDIA Forums, "GLSL Spinlock." https://devtalk.nvidia.com/default/topic/768115/opengl/glsl-spinlock/, 2014.
[37]
M. Burtscher and K. Pingali, "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm," GPU computing Gems Emerald edition, 2011.
[38]
A. ElTantawy and T. M. Aamodt, "Correctness Discussion of a SIMT-induced Deadlock Elimination Algorithm," tech. rep., University of British Columbia, 2016.
[39]
NVIDIA, "CUDA Binary Utilities," http://docs.nvidia.com/cuda/cuda-binary-utilities/, 2015.
[40]
S. Horwitz, T. Reps, and D. Binkley, "Interprocedural Slicing Using Dependence Graphs," in PLDI, 1988.
[41]
M. Burke and R. Cytron, "Interprocedural Dependence Analysis and Parallelization," in Proc. ACM SIGPLAN Symp. on Compiler Construction, 1986.
[42]
U. Hölzle, C. Chambers, and D. Ungar, "Debugging Optimized Code with Dynamic Deoptimization," in Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), 1992.
[43]
J. Hennessy, "Symbolic Debugging of Optimized Code," ACM Trans. on Prog. Lang. and Sys.(TOPLAS), vol. 4, no. 3, pp. 323--344, 1982.
[44]
A. ElTantawy, J. W. Ma, M. O'Connor, and T. M. Aamodt, "A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2014.
[45]
B. W. Coon, P. C. Mills, S. F. Oberman, and M. Y. Siu., "Tracking Register Usage during Multithreaded Processing Using a Scoreboard having Separate Memory Regions and Storing Sequential Register Size Indicators. US Patent 7,434,032," 2008.
[46]
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 72--83, 2012.
[47]
A. Brownsword, "Cloth in OpenCL," tech. rep., Khronos Group, 2009.
[48]
J. Coplin and M. Burtscher, "Effects of Source-Code Optimizations on GPU Performance and Energy Consumption," in Proc. ACM Workshop on General Purpose Processing on Graphics Processing Units, 2015.
[49]
J. M. Bull, "Measuring synchronisation and scheduling overheads in openmp," in Proc. European Workshop on OpenMP, vol. 8, p. 49, 1999.
[50]
Microsoft. https://msdn.microsoft.com/en-us/library/b38674ky.aspx, 2016.
[51]
LLVM Compiler, "LLVM Alias Analysis Infrastructure." http://llvm.org/docs/AliasAnalysis.html, 2015.
[52]
V. C. Sreedhar, G. R. Gao, and Y.-F. Lee, "Identifying loops using dj graphs," ACM Trans. on Prog. Lang. and Sys. (TOPLAS), 1996.
[53]
A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 163--174, 2009.
[54]
T. M. Aamodt et al., GPGPU-Sim 3.x Manual. University of British Columbia, 2013.
[55]
LLVM Compiler, "Clang Front End." http://clang.llvm.org/, 2015.
[56]
LLVM Compiler, "LIBCLC Library." http://libclc.llvm.org/, 2015.
[57]
Dmitry Mikushin, "CUDA to LLVM-IR." https://github.com/apc-llc/nvcc-llvm-ir, 2015.
[58]
NVIDIA, "LibNVVM Library." http://docs.nvidia.com/cuda/libnvvm-api/, 2015.
[59]
NVIDIA, "CUDA SDK 3.2," September 2013.
[60]
S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in Proc. IEEE Symp. on Workload Characterization (IISWC), pp. 44--54, 2009.
[61]
Jeff Larkin, NVIDIA, "OpenMP and NVIDIA." http://openmp.org/sc13/SC13_OpenMP_and_NVIDIA.pdf, 2013.
[62]
Michael Wong, Alexey Bataev, "OpenMP GPU/Accelerators Coming of Age in Clang." http://llvm.org/devmtg/2015-10/slides/WongBataev-OpenMPGPUAcceleratorsComingOfAgeInClang.pdf, 2015.
[63]
C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Integrating GPU support for OpenMP offloading directives into Clang," in Proc. ACM Int'l Workshop on the LLVM Compiler Infrastructure in HPC, 2015.
[64]
G.-T. Bercea, C. Bertolli, S. F. Antao, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Performance Analysis of OpenMp on a GPU Using a Coral Proxy Application," in Proc. ACM Int'l Workshop on Perf. Modeling, Benchmarking, and Simulation of High Perf. Computing Sys., 2015.
[65]
M. Rhu and M. Erez, "The Dual-Path Execution Model for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 235--246, 2013.
[66]
G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili, "SIMD Re-convergence at Thread Frontiers," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 477--488, 2011.
[67]
W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 25--36, 2011.
[68]
V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 308--317, 2011.
[69]
M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 61--71, 2012.
[70]
Y. Lee, R. Krashinsky, V. Grover, S. Keckler, and K. Asanovic, "Convergence and Scalarization for Data-Parallel Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), pp. 1--11, 2013.
[71]
M. Zheng, V. T. Ravi, F. Qin, and G. Agrawal, "GRace: a Low-overhead Mechanism for Detecting Data Races in GPU Programs," in ACM SIGPLAN Notices, vol. 46, pp. 135--146, ACM, 2011.
[72]
C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), 2004.
[73]
J. A. Stratton, S. S. Stone, and W. H. Wen-mei, "MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs," in Languages and Compilers for Parallel Computing, Springer, 2008.
[74]
A. Yilmazer and D. Kaeli, "HQL: A Scalable Synchronization Mechanism for GPUs," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, 2013.
[75]
A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar, "Fine-grained Synchronizations and Dataflow Programming on GPUs," 2015.
[76]
Y. Xu, L. Gao, R. Wang, Z. Luan, W. Wu, and D. Qian, "Lock-based Synchronization for GPU Architectures," in Proc. Int'l Conf. on Computing Frontiers, 2016.

Cited By

View all
  • (2019)Don't Forget About Synchronization!Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3303084.3309488(11-20)Online publication date: 17-Feb-2019
  • (2019)Fast Fine-Grained Global Synchronization on GPUsProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304055(793-806)Online publication date: 4-Apr-2019

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture
October 2016
816 pages

Sponsors

Publisher

IEEE Press

Publication History

Published: 15 October 2016

Check for updates

Qualifiers

  • Research-article

Conference

MICRO-49
Sponsor:

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 22 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2019)Don't Forget About Synchronization!Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores10.1145/3303084.3309488(11-20)Online publication date: 17-Feb-2019
  • (2019)Fast Fine-Grained Global Synchronization on GPUsProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304055(793-806)Online publication date: 4-Apr-2019

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media