Abstract
On modern heterogeneous HPC systems, the most popular way to realize distributed computation is the hybrid MPI+X programming model (X being OpenMP, CUDA, etc.), which has proven to perform well across a wide range of scientific applications. However, application developers prefer a single coherent programming model over a hybrid one, since maintainability and portability decrease with each additional model. Recent work [14] has shown that the OpenMP device offloading model can be used to program distributed accelerator-based HPC systems with minimal changes to the application.
In this paper, we improve the performance of OpenMP remote offloading through a set of runtime optimizations guided by a detailed overhead analysis. We evaluate our work using an industrial-grade seismic modeling code, Minimod, as well as two proxy applications, XSBench and RSBench. Results show that, compared to the baseline version, our optimizations reduce offloading latencies by up to 92% and raise application parallel efficiency by at least 25.2% when running on 16 GPUs. We then examine why strong scaling remains difficult with OpenMP remote offloading and propose further runtime improvements to increase scalability.
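For orientation: the appeal of remote offloading is that a remote GPU is addressed through the same device-number mechanism as a local one, which is why [14] reports minimal application changes. The sketch below is illustrative and ours, not code from the paper; it shows an ordinary OpenMP target region that a remote-offloading plugin could transparently ship to an accelerator on another node (the kernel and device-selection policy are assumptions for illustration).

#include <omp.h>
#include <stdio.h>

#define N (1 << 20)

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* Pick the last visible device. With a remote-offloading plugin
       loaded by the OpenMP runtime, this device number can refer to a
       GPU on another node; the directive itself is unchanged. */
    int dev = omp_get_num_devices() - 1;
    if (dev < 0) dev = omp_get_initial_device(); /* no device: fall back to host */

    #pragma omp target teams distribute parallel for device(dev) \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] += 2.0 * x[i];

    printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    return 0;
}

Because the target construct is unchanged from local offloading, the runtime optimizations studied in the paper (and the offloading latencies they reduce) live entirely below this line of the programming interface.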
References
Acun, B., et al.: Parallel programming with migratable objects: Charm++ in practice. In: SC 2014: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 647–658 (2014). https://doi.org/10.1109/SC.2014.58
Bachan, J., et al.: UPC++: a high-performance communication framework for asynchronous computation. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 963–973 (2019). https://doi.org/10.1109/IPDPS.2019.00104
Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and independence with logical regions. In: SC 2012: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11, November 2012. https://doi.org/10.1109/SC.2012.71
Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the Chapel language. Int. J. High Perform. Comput. Appl. 21(3), 291–312 (2007). https://doi.org/10.1177/1094342007078442
gRPC Community: gRPC. https://grpc.io/about/
Hsu, C.H., Imam, N., Langer, A., Potluri, S., Newburn, C.J.: An initial assessment of NVSHMEM for high performance computing. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1–10 (2020). https://doi.org/10.1109/IPDPSW50202.2020.00104
Kale, V., Lu, W., Curtis, A., Malik, A.M., Chapman, B., Hernandez, O.: Toward supporting multi-GPU targets via taskloop and user-defined schedules. In: Milfeld, K., de Supinski, B.R., Koesterke, L., Klinkenberg, J. (eds.) IWOMP 2020. LNCS, vol. 12295, pp. 295–309. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58144-2_19
Kokkos: Kokkos remote spaces. https://github.com/kokkos/kokkos-remote-spaces
Lu, W., Curtis, T., Chapman, B.: Enabling low-overhead communication in multi-threaded OpenSHMEM applications using contexts. In: 2019 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), pp. 47–57 (2019). https://doi.org/10.1109/PAW-ATM49560.2019.00010
Meng, J., Atle, A., Calandra, H., Araya-Polo, M.: Minimod: a finite difference solver for seismic modeling (2020). https://arxiv.org/abs/2007.06048
NVIDIA: GDRCopy. https://github.com/NVIDIA/gdrcopy
NVIDIA: NVIDIA CUDA GPUDirect RDMA. https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
OpenMP Architecture Review Board: OpenMP application programming interface, version 5.0, November 2018. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf
Patel, A., Doerfert, J.: Remote OpenMP offloading. In: Varbanescu, A.L., Bhatele, A., Luszczek, P., Baboulin, M. (eds.) High Performance Computing, pp. 315–333. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-031-07312-0_16
Qawasmeh, A., Hugues, M.R., Calandra, H., Chapman, B.M.: Performance portability in reverse time migration and seismic modelling via OpenACC. Int. J. High Perform. Comput. Appl. 31(5), 422–440 (2017). https://doi.org/10.1177/1094342016675678
Raut, E., Anderson, J., Araya-Polo, M., Meng, J.: Evaluation of distributed tasks in stencil-based application on GPUs. In: 2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 45–52 (2021). https://doi.org/10.1109/ESPM254806.2021.00011
Raut, E., Anderson, J., Araya-Polo, M., Meng, J.: Porting and evaluation of a distributed task-driven stencil-based application. In: Proceedings of the 12th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 21–30. PMAM 2021. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3448290.3448559
Raut, E., Meng, J., Araya-Polo, M., Chapman, B.: Evaluating performance of OpenMP tasks in a seismic stencil application. In: Milfeld, K., de Supinski, B.R., Koesterke, L., Klinkenberg, J. (eds.) IWOMP 2020. LNCS, vol. 12295, pp. 67–81. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58144-2_5
Reaño, C., Silla, F., Shainer, G., Schultz, S.: Local and remote GPUs perform similar with EDR 100G InfiniBand. In: Proceedings of the Industrial Track of the 16th International Middleware Conference. Middleware Industry 2015. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2830013.2830015
Romano, P.K., Forget, B.: The OpenMC Monte Carlo particle transport code. Ann. Nucl. Energy 51, 274–281 (2013). https://doi.org/10.1016/j.anucene.2012.06.040
Sai, R., Mellor-Crummey, J., Meng, X., Araya-Polo, M., Meng, J.: Accelerating high-order stencils on GPUs. In: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 86–108 (2020). https://doi.org/10.1109/PMBS51919.2020.00014
Shamis, P., et al.: UCX: an open source framework for HPC network APIs and beyond. In: 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 40–43 (2015). https://doi.org/10.1109/HOTI.2015.13
Terboven, C., Mey, D., Schmidl, D., Wagner, M.: First experiences with Intel Cluster OpenMP. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 48–59. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79561-2_5
Tian, S., Doerfert, J., Chapman, B.: Concurrent execution of deferred OpenMP target tasks with hidden helper threads. In: Chapman, B., Moreira, J. (eds.) Languages and Compilers for Parallel Computing, pp. 41–56. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-030-95953-1_4
Tramm, J.R., Siegel, A.R., Forget, B., Josey, C.: Performance analysis of a reduced data movement algorithm for neutron cross section data in Monte Carlo simulations. In: Markidis, S., Laure, E. (eds.) EASC 2014. LNCS, vol. 8759, pp. 39–56. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15976-8_3
Tramm, J.R., Siegel, A.R., Islam, T., Schulz, M.: XSBench - the development and verification of a performance abstraction for Monte Carlo reactor analysis. In: PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future, Kyoto (2014). https://www.mcs.anl.gov/papers/P5064-0114.pdf
Trott, C.R., et al.: Kokkos 3: programming model extensions for the exascale era. IEEE Trans. Parallel Distrib. Syst. 33(4), 805–817 (2022). https://doi.org/10.1109/TPDS.2021.3097283
Yan, Y., Lin, P.H., Liao, C., de Supinski, B.R., Quinlan, D.J.: Supporting multiple accelerators in high-level programming models. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 170–180. PMAM 2015. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2712386.2712405
Zimmer, C., et al.: An evaluation of the CORAL interconnects. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC 2019. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3295500.3356166
Acknowledgements
We would like to thank TotalEnergies EP Research and Technologies for their support of this work. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, in particular its subproject on Scaling OpenMP with LLVM for Exascale performance and portability (SOLLVE).
This research was also funded in part by the United States Department of Defense, and was supported by resources at Los Alamos National Laboratory, operated by Triad National Security, LLC under Contract No. 89233218CNA000001.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lu, W. et al. (2022). Towards Efficient Remote OpenMP Offloading. In: Klemm, M., de Supinski, B.R., Klinkenberg, J., Neth, B. (eds) OpenMP in a Modern World: From Multi-device Support to Meta Programming. IWOMP 2022. Lecture Notes in Computer Science, vol 13527. Springer, Cham. https://doi.org/10.1007/978-3-031-15922-0_2
DOI: https://doi.org/10.1007/978-3-031-15922-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15921-3
Online ISBN: 978-3-031-15922-0
eBook Packages: Computer Science, Computer Science (R0)