OpenMP is a shared-memory programming model that supports offloading target regions to accelerators such as NVIDIA GPUs. The implementation in Clang/LLVM aims to deliver a generic GPU compilation toolchain that supports both the native CUDA C/C++ and the OpenMP device offloading models. In some situations the semantics of OpenMP and CUDA diverge. One example is the policy for implicitly handling local variables. In CUDA, local variables are implicitly mapped to thread-local memory and are therefore private to a CUDA thread. In OpenMP, because regions executed by different numbers of threads can be nested, local variables must be implicitly shared among the threads of a contention group.
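The divergence described above can be illustrated with a minimal sketch. The function name `scale_on_device` and the variable `factor` are hypothetical and do not come from the paper. The local `factor` is declared in the sequential part of a target region and then read inside a nested parallel region, so under OpenMP's default data-sharing rules it is shared by all threads of that region, whereas a local in a CUDA kernel would be private to each thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the paper): scales `data` in place by a
// factor that is computed once, before the parallel region starts.
void scale_on_device(std::vector<double>& data, double base) {
    double* p = data.data();
    const int n = static_cast<int>(data.size());
    assert(p != nullptr || n == 0);

    #pragma omp target map(tofrom: p[0:n])
    {
        // Declared in the sequential part of the target region.
        double factor = base * 2.0;

        // Referenced by every thread of the nested parallel region.
        // OpenMP treats `factor` as shared here, so on a GPU it cannot
        // simply live in one thread's private registers.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            p[i] *= factor;
        }
    }
}
```

Compiled without OpenMP support the pragmas are ignored and the function runs sequentially with the same result, which makes the sketch easy to try on a host without a GPU.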

In this paper we introduce a redesign of the OpenMP device data sharing infrastructure in the Clang/LLVM toolchain, which is responsible for the implicit sharing of local variables. The new infrastructure lowers implicitly shared variables to the shared memory of the GPU.
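As a rough intuition for what lowering a variable to shared memory means, the following host-side emulation may help. It is only a sketch under stated assumptions: `TeamSharedFrame`, `push`, and `sum_scaled` are hypothetical names, the 256-byte buffer is an arbitrary size, and none of this reflects the actual Clang/LLVM runtime interface. The idea shown is that the variable is placed in a team-visible buffer rather than in thread-private storage, and that workers reach it through a pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for one team's slice of GPU shared memory.
// Names and sizes are illustrative, not taken from the Clang/LLVM runtime.
struct TeamSharedFrame {
    alignas(std::max_align_t) unsigned char bytes[256];
    std::size_t used = 0;

    // Reserve storage for one trivially destructible T in the frame.
    // Returns nullptr if the value does not fit.
    template <typename T>
    T* push(const T& value) {
        const std::size_t align = alignof(T);
        const std::size_t start = (used + align - 1) / align * align;
        if (start + sizeof(T) > sizeof(bytes)) return nullptr;
        used = start + sizeof(T);
        return new (bytes + start) T(value);
    }
};

// Emulates a target region whose local `factor` is read by several workers.
double sum_scaled(const double* in, int n, double base, int num_threads) {
    assert(num_threads > 0);
    TeamSharedFrame frame;

    // "Main thread": instead of keeping `factor` in private storage,
    // place it in the team-visible frame and keep its address.
    double* factor = frame.push(base * 2.0);
    assert(factor != nullptr);

    // "Worker threads" (emulated sequentially): each one reads the same
    // slot through the pointer.
    double total = 0.0;
    for (int tid = 0; tid < num_threads; ++tid) {
        for (int i = tid; i < n; i += num_threads) {
            total += in[i] * (*factor);
        }
    }
    return total;
}
```

For example, calling `sum_scaled` on the array `{1, 2, 3, 4}` with `base = 0.5` and two emulated workers multiplies every element by the single shared `factor` of 1.0 and returns their sum.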

We measure the amount of shared memory used by our scheme in cases that involve scalar variables and statically allocated arrays. The evaluation is carried out by offloading to NVIDIA K40 and P100 GPUs. For scalar variables the pressure on shared memory is relatively low: utilization stays under 26% of shared memory on the K40 and does not negatively impact occupancy. The limiting factor for occupancy in that case is register pressure. The data sharing scheme offers users a simple memory model for controlling the implicit allocation of device shared memory.
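The relationship between shared memory utilization and occupancy can be made concrete with a small calculation. The helper names below are hypothetical, and the 48 KiB budget and 12 KiB per-team usage mentioned in the usage note are assumed example values, not measurements from the paper.

```cpp
#include <cassert>
#include <cstddef>

// Percentage of a shared memory budget consumed by one team's variables.
double shared_mem_utilization_pct(std::size_t bytes_used,
                                  std::size_t capacity_bytes) {
    assert(capacity_bytes > 0);
    return 100.0 * static_cast<double>(bytes_used) /
           static_cast<double>(capacity_bytes);
}

// Upper bound on concurrently resident teams imposed by shared memory alone.
// Other resources, such as registers, may impose a tighter bound.
std::size_t teams_limited_by_shared_mem(std::size_t bytes_per_team,
                                        std::size_t capacity_bytes) {
    assert(bytes_per_team > 0);
    return capacity_bytes / bytes_per_team;
}
```

With an assumed budget of 49152 bytes (48 KiB) and 12288 bytes used per team, the first helper gives 25% utilization and the second allows four teams. If register usage permits fewer than four teams, registers rather than shared memory limit occupancy, which is the situation the abstract reports for scalar variables.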



Published In

LLVM-HPC'17: Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC

November 2017

106 pages

ISBN:9781450355650

DOI:10.1145/3148173

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 November 2017

Qualifiers

Research-article
Research
Refereed limited

Conference

SC '17

Sponsor:

SIGHPC

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, CO, USA

Acceptance Rates

LLVM-HPC'17 Paper Acceptance Rate 9 of 10 submissions, 90%;

Overall Acceptance Rate 16 of 22 submissions, 73%
