Abstract
Accelerators have become a cornerstone of parallel computing, spanning domains from scientific computing to artificial intelligence. At the application level, accelerators are controlled by submitting work to a stream, from which the hardware executes it in order. Vendor-specific communication libraries such as NCCL and RCCL integrate support for submitting communication operations onto a stream, enabling the ordering of communication with compute work on streams. It is safe to assume that stream-based computing will remain relevant for the foreseeable future. MPI has yet to catch up to this reality, and prior proposals involved extensions that would add significantly to the API.
In this work, we explore alternatives that require only minor additions to the standard to enable the integration of MPI operations with compute streams. Our additions include i) associating streams with communication objects, ii) blocking streams until completion, and iii) synchronizing streams while progressing MPI operations. Our API is agnostic of the type of stream, reuses existing communication procedures and semantics, and enables integration with graph capturing. We provide a proof-of-concept implementation and show that stream integration of MPI operations can be beneficial.
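The three additions listed in the abstract can be illustrated with a short sketch. Note that this is C-flavored pseudocode only: the `MPIX_*` names below are hypothetical placeholders invented for illustration, not the actual API proposed in the paper, and the stream is kept as an opaque handle to reflect the stated stream-type agnosticism.

```c
/* Hypothetical sketch of the three proposed additions.
 * MPIX_Comm_set_stream, MPIX_Stream_block, and MPIX_Stream_sync are
 * placeholder names, NOT the paper's actual API. */
#include <mpi.h>

void exchange_on_stream(void *stream /* e.g. a cudaStream_t */,
                        double *sendbuf, double *recvbuf,
                        int count, int peer, MPI_Comm comm)
{
    MPI_Request req;

    /* i) associate the compute stream with a communication object */
    MPIX_Comm_set_stream(comm, stream);              /* hypothetical */

    /* Communication reuses the existing nonblocking procedures
     * unchanged; the stream association orders the operation after
     * work already submitted to the stream. */
    MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, comm, &req);

    /* ii) block the stream until the operation completes, so kernels
     * launched on the stream afterwards see the transferred data */
    MPIX_Stream_block(stream, &req);                 /* hypothetical */

    /* iii) alternatively, synchronize the stream from the host while
     * progressing outstanding MPI operations */
    MPIX_Stream_sync(stream);                        /* hypothetical */
}
```

The key design point the abstract emphasizes is that the existing communication procedures (`MPI_Isend` here) keep their signatures and semantics; only the stream association and the two synchronization points are new.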
Notes
- 1.
Our PoC is available at https://github.com/devreal/mpix-streams-pub.
- 2.
Available at https://github.com/devreal/ompi/tree/mpi-continue-master.
Acknowledgements
This research was supported partly by NSF awards #1931384 and #1931387. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Schuchart, J., Gabriel, E. (2025). Stream Support in MPI Without the Churn. In: Blaas-Schenner, C., Niethammer, C., Haas, T. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2024. Lecture Notes in Computer Science, vol 15267. Springer, Cham. https://doi.org/10.1007/978-3-031-73370-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73369-7
Online ISBN: 978-3-031-73370-3