Towards Efficient Remote OpenMP Offloading

  • Conference paper
  • In: OpenMP in a Modern World: From Multi-device Support to Meta Programming (IWOMP 2022)

Abstract

On modern heterogeneous HPC systems, the most popular way to realize distributed computation is the hybrid programming model of MPI+X (X being OpenMP, CUDA, etc.), as it has been proven to perform well with various scientific applications. However, application developers prefer a single coherent programming model over a hybrid one, since maintainability and portability degrade with each additional model. Recent work [14] has shown that the OpenMP device offloading model can be used to program distributed accelerator-based HPC systems with minimal changes to the application.
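
As a minimal sketch of this programming model (an illustration of our own, not code from the paper), the C program below offloads a vector addition with a standard OpenMP target region. Under remote offloading [14], the runtime can transparently dispatch the same region to a GPU on another node, so the application source contains no explicit communication calls.

    #include <stdio.h>

    #define N (1 << 20)

    int main(void) {
        static float a[N], b[N];

        for (int i = 0; i < N; ++i) {
            a[i] = 1.0f;
            b[i] = 2.0f;
        }

        /* Offload the loop to an OpenMP device. With remote offloading,
         * the selected device may be a GPU on a different node; the
         * directive and data-mapping clauses are unchanged. */
        #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:N]) map(to: b[0:N])
        for (int i = 0; i < N; ++i)
            a[i] += b[i];

        printf("a[0] = %.1f\n", a[0]); /* expected: 3.0 */
        return 0;
    }

If no device is available, OpenMP falls back to host execution by default, which is part of what makes this single-model approach portable.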

In this paper, we improve the performance of OpenMP remote offloading through various runtime optimizations, guided by a detailed overhead analysis. Evaluation of our work is conducted using an industrial-level seismic modeling code, Minimod, as well as two proxy-apps, XSBench and RSBench. Results show that, compared to the baseline version, our optimizations can reduce offloading latencies by up to 92%, and raise application parallel efficiency by at least 25.2% when running with 16 GPUs. We then point out why strong scaling is still difficult with OpenMP remote offloading, and propose further improvements to the runtime to increase scalability.

References

  1. Acun, B., et al.: Parallel programming with migratable objects: Charm++ in practice. In: SC 2014: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 647–658 (2014). https://doi.org/10.1109/SC.2014.58

  2. Bachan, J., et al.: UPC++: a high-performance communication framework for asynchronous computation. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 963–973 (2019). https://doi.org/10.1109/IPDPS.2019.00104

  3. Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and independence with logical regions. In: SC 2012: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11, November 2012. https://doi.org/10.1109/SC.2012.71

  4. Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the Chapel language. Int. J. High Perform. Comput. Appl. 21(3), 291–312 (2007). https://doi.org/10.1177/1094342007078442

  5. gRPC community: gRPC. https://grpc.io/about/

  6. Hsu, C.H., Imam, N., Langer, A., Potluri, S., Newburn, C.J.: An initial assessment of NVSHMEM for high performance computing. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1–10 (2020). https://doi.org/10.1109/IPDPSW50202.2020.00104

  7. Kale, V., Lu, W., Curtis, A., Malik, A.M., Chapman, B., Hernandez, O.: Toward supporting multi-GPU Targets via taskloop and user-defined schedules. In: Milfeld, K., de Supinski, B.R., Koesterke, L., Klinkenberg, J. (eds.) IWOMP 2020. LNCS, vol. 12295, pp. 295–309. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58144-2_19

  8. Kokkos: Kokkos Remote Spaces. https://github.com/kokkos/kokkos-remote-spaces

  9. Lu, W., Curtis, T., Chapman, B.: Enabling low-overhead communication in multi-threaded OpenSHMEM applications using contexts. In: 2019 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), pp. 47–57 (2019). https://doi.org/10.1109/PAW-ATM49560.2019.00010

  10. Meng, J., Atle, A., Calandra, H., Araya-Polo, M.: Minimod: a finite difference solver for seismic modeling (2020). https://arxiv.org/abs/2007.06048

  11. NVIDIA: GDRCopy. https://github.com/NVIDIA/gdrcopy

  12. NVIDIA: CUDA GPUDirect RDMA. https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

  13. OpenMP Architecture Review Board: OpenMP Application Programming Interface, Version 5.0, November 2018. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf

  14. Patel, A., Doerfert, J.: Remote OpenMP offloading. In: Varbanescu, A.L., Bhatele, A., Luszczek, P., Baboulin, M. (eds.) High Performance Computing (ISC 2022), pp. 315–333. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-07312-0_16

  15. Qawasmeh, A., Hugues, M.R., Calandra, H., Chapman, B.M.: Performance portability in reverse time migration and seismic modelling via OpenACC. Int. J. High Perform. Comput. Appl. 31(5), 422–440 (2017). https://doi.org/10.1177/1094342016675678

  16. Raut, E., Anderson, J., Araya-Polo, M., Meng, J.: Evaluation of distributed tasks in stencil-based application on GPUs. In: 2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 45–52 (2021). https://doi.org/10.1109/ESPM254806.2021.00011

  17. Raut, E., Anderson, J., Araya-Polo, M., Meng, J.: Porting and evaluation of a distributed task-driven stencil-based application. In: Proceedings of the 12th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 21–30. PMAM 2021. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3448290.3448559

  18. Raut, E., Meng, J., Araya-Polo, M., Chapman, B.: Evaluating performance of OpenMP tasks in a seismic stencil application. In: Milfeld, K., de Supinski, B.R., Koesterke, L., Klinkenberg, J. (eds.) IWOMP 2020. LNCS, vol. 12295, pp. 67–81. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58144-2_5

  19. Reaño, C., Silla, F., Shainer, G., Schultz, S.: Local and remote GPUs perform similar with EDR 100g InfiniBand. In: Proceedings of the Industrial Track of the 16th International Middleware Conference. Middleware Industry 2015. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2830013.2830015

  20. Romano, P.K., Forget, B.: The OpenMC Monte Carlo particle transport code. Ann. Nucl. Energy 51, 274–281 (2013). https://doi.org/10.1016/j.anucene.2012.06.040

  21. Sai, R., Mellor-Crummey, J., Meng, X., Araya-Polo, M., Meng, J.: Accelerating high-order stencils on GPUs. In: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 86–108 (2020). https://doi.org/10.1109/PMBS51919.2020.00014

  22. Shamis, P., et al.: UCX: an open source framework for HPC network APIs and beyond. In: 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 40–43 (2015). https://doi.org/10.1109/HOTI.2015.13

  23. Terboven, C., an Mey, D., Schmidl, D., Wagner, M.: First experiences with Intel Cluster OpenMP. In: Eigenmann, R., de Supinski, B.R. (eds.) IWOMP 2008. LNCS, vol. 5004, pp. 48–59. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79561-2_5

  24. Tian, S., Doerfert, J., Chapman, B.: Concurrent execution of deferred OpenMP target tasks with hidden helper threads. In: Chapman, B., Moreira, J. (eds.) Languages and Compilers for Parallel Computing, pp. 41–56. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-030-95953-1_4

  25. Tramm, J.R., Siegel, A.R., Forget, B., Josey, C.: Performance analysis of a reduced data movement algorithm for neutron cross section data in Monte Carlo simulations. In: Markidis, S., Laure, E. (eds.) EASC 2014. LNCS, vol. 8759, pp. 39–56. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15976-8_3

  26. Tramm, J.R., Siegel, A.R., Islam, T., Schulz, M.: XSBench - the development and verification of a performance abstraction for Monte Carlo reactor analysis. In: PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future, Kyoto (2014). https://www.mcs.anl.gov/papers/P5064-0114.pdf

  27. Trott, C.R., et al.: Kokkos 3: programming model extensions for the exascale era. IEEE Trans. Parallel Distrib. Syst. 33(4), 805–817 (2022). https://doi.org/10.1109/TPDS.2021.3097283

  28. Yan, Y., Lin, P.H., Liao, C., de Supinski, B.R., Quinlan, D.J.: Supporting multiple accelerators in high-level programming models. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 170–180. PMAM 2015. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2712386.2712405

  29. Zimmer, C., et al.: An evaluation of the CORAL interconnects. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC 2019. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3295500.3356166

Acknowledgements

We would like to thank TotalEnergies EP Research and Technologies for their support of this work. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, in particular its subproject on Scaling OpenMP with LLVM for Exascale performance and portability (SOLLVE).

This research was also funded in part by the United States Department of Defense, and was supported by resources at Los Alamos National Laboratory, operated by Triad National Security, LLC under Contract No. 89233218CNA000001.

Author information

Correspondence to Wenbin Lu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lu, W. et al. (2022). Towards Efficient Remote OpenMP Offloading. In: Klemm, M., de Supinski, B.R., Klinkenberg, J., Neth, B. (eds) OpenMP in a Modern World: From Multi-device Support to Meta Programming. IWOMP 2022. Lecture Notes in Computer Science, vol 13527. Springer, Cham. https://doi.org/10.1007/978-3-031-15922-0_2

  • DOI: https://doi.org/10.1007/978-3-031-15922-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15921-3

  • Online ISBN: 978-3-031-15922-0

  • eBook Packages: Computer Science, Computer Science (R0)
