Abstract
Message Passing Interface (MPI) is the de facto standard for parallel programming, and MPI collective operations are widely used by scientific applications; their efficiency strongly affects the performance of parallel applications. With the increasing scale and heterogeneity of HPC systems, the network environment has become more complex: network states vary widely and dynamically between node pairs, which makes it harder to design efficient collective communication algorithms. In this paper, we propose a method to optimize collective operations using network states measured in real time, focusing on the binomial tree algorithm. Our approach measures the network states with low overhead and constructs a low-latency binomial tree from the measurement results. We also account for the differences between the two underlying MPI point-to-point communication protocols, eager and rendezvous, and design a tailored binomial tree construction algorithm for each. We have implemented hierarchical MPI_Bcast, MPI_Reduce, MPI_Gather, and MPI_Scatter that use our network states-aware binomial tree algorithm at the inter-node level. Benchmark results show that our algorithm improves performance for small and medium messages compared with the default binomial tree algorithm in Open MPI. Specifically, MPI_Bcast improves by over 15.5% on average for messages smaller than 64KB, MPI_Reduce by over 12.1% on average for messages below 2KB, and MPI_Gather by over 10% on average for messages from 64B to 512B. For MPI_Scatter, our algorithm improves performance only for certain message sizes.
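As background for the abstract above, the classic binomial broadcast tree (the default inter-node pattern that the network states-aware algorithm replaces) can be sketched as follows. This is a simplified illustration of the textbook bit-manipulation scheme, not Open MPI's actual implementation:

```python
def binomial_parent(rank, p, root=0):
    """Parent of `rank` in a binomial tree over p ranks rooted at `root`.

    Ranks are relabeled so the root is virtual rank 0; a virtual rank's
    parent is obtained by clearing its lowest set bit.
    """
    r = (rank - root) % p
    if r == 0:
        return None                    # the root has no parent
    return ((r & (r - 1)) + root) % p  # clear the lowest set bit

def binomial_children(rank, p, root=0):
    """Children of `rank`: virtual ranks r | mask for each power-of-two
    mask below r's lowest set bit (all masks when r == 0)."""
    r = (rank - root) % p
    children = []
    mask = 1
    while mask < p:
        if r & mask:                   # lowest set bit reached: no more children
            break
        child = r | mask
        if child < p:
            children.append((child + root) % p)
        mask <<= 1
    return children
```

With 8 ranks rooted at 0, rank 0 forwards to ranks 1, 2, and 4; rank 4 forwards to 5 and 6; leaves such as rank 5 forward to no one, so the broadcast completes in ceil(log2(p)) communication rounds.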
Data Availability
Enquiries about data availability should be directed to the authors.
Funding
This work is supported by the National Key R&D Program of China, No. 2022YFB4501701.
Author information
Contributions
JW: methodology, software, validation, investigation, writing—original draft. TZ: software, investigation, supervision, writing—review and editing. YW: conceptualization, resources, investigation, writing—review and editing, supervision.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Comparison of different binomial trees for MPI_Bcast
We constructed four binomial trees, as depicted in Fig. 13. The experimental environment is the same as in Section 5, with processes 0–3, 4–7, 8–11, and 12–15 located within the same rack. The default binomial tree, referred to as bt0, was constructed accordingly. Additionally, bt1 and bt2 were constructed using depth-first and breadth-first methods, respectively. Finally, bt3 is a binomial tree with poor communication performance, in which all communication takes place between nodes in different racks.
Figure 14 shows the latency of MPI_Bcast over the four binomial trees, along with a comparison against our algorithm. The performance differences among the tree structures emerge primarily when broadcasting small and medium-sized messages.
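One way to quantify why a tree like bt3 performs worst is to count the cross-rack edges in a tree's communication schedule. A minimal sketch, assuming the rack layout above (16 ranks, four consecutive ranks per rack) and the classic binomial tree rooted at rank 0 as a stand-in for the default bt0:

```python
def tree_edges(p):
    """Parent -> child edges of the classic binomial tree rooted at rank 0:
    a rank's parent is obtained by clearing its lowest set bit."""
    return [(r & (r - 1), r) for r in range(1, p)]

def cross_rack_edges(edges, rack_size=4):
    """Count edges whose endpoints sit in different racks, assuming each
    rack holds `rack_size` consecutive ranks (as in the setup above)."""
    return sum(1 for a, b in edges if a // rack_size != b // rack_size)

edges = tree_edges(16)
print(len(edges), cross_rack_edges(edges))  # a 16-rank tree always has 15 edges
```

Under this mapping only a few of the 15 edges cross racks, whereas a tree like bt3 routes all 15 edges across racks; since inter-rack links have higher latency, this gap is most visible for small and medium messages, where per-hop latency dominates transfer time.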
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, J., Zhao, T. & Wang, Y. Network states-aware collective communication optimization. Cluster Comput 27, 6869–6887 (2024). https://doi.org/10.1007/s10586-024-04330-9