Article

Collective communication on architectures that support simultaneous communication over multiple links

Authors:

Robert van de Geijn,

Rajeev Thakur

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 2 - 11

https://doi.org/10.1145/1122971.1122975

Published: 29 March 2006

Abstract

Traditional collective communication algorithms are designed with the assumption that a node can communicate with only one other node at a time. On new parallel architectures such as the IBM Blue Gene/L, a node can communicate with multiple nodes simultaneously. We have redesigned and reimplemented many of the MPI collective communication algorithms to take advantage of this ability to send simultaneously, including broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter, and allreduce. We show that these new algorithms have lower expected costs than the previously known lower bounds based on old models of parallel computation. Results are included comparing their performance to the default implementations in IBM's MPI.
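To illustrate why simultaneous sends can beat bounds derived under the one-message-at-a-time assumption, the following minimal sketch counts broadcast steps in an idealized model. It is not the authors' algorithm: the function name `broadcast_steps`, the `links` parameter, and the assumptions that any node can reach any other node and that message size and link contention are ignored are all illustrative.

```python
def broadcast_steps(num_nodes: int, links: int = 1) -> int:
    """Count steps needed to broadcast one message to num_nodes nodes.

    Idealized assumption (not from the paper): in each step, every node
    that already holds the message sends it to `links` distinct nodes
    that do not have it yet. Topology and contention are ignored.
    """
    if num_nodes < 1 or links < 1:
        raise ValueError("num_nodes and links must be positive")
    informed = 1  # only the root holds the message at the start
    steps = 0
    while informed < num_nodes:
        # Each informed node adds `links` newly informed nodes, so the
        # informed set grows by a factor of (links + 1) per step.
        informed *= links + 1
        steps += 1
    return steps


if __name__ == "__main__":
    for p in (64, 512, 4096):
        single = broadcast_steps(p, links=1)
        multi = broadcast_steps(p, links=6)  # six links chosen for illustration
        print(f"p={p}: one link -> {single} steps, six links -> {multi} steps")
```

With one link per node the step count is the familiar ceil(log2 p); with k links it drops to ceil(log_(k+1) p) in this model, which is the kind of gap the redesigned algorithms aim to exploit.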

Index Terms

Collective communication on architectures that support simultaneous communication over multiple links
1. Software and its engineering

Information & Contributors

Information

Published In

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

March 2006

258 pages

ISBN:1595931899

DOI:10.1145/1122971

General Chair:
Josep Torrellas
University of Illinois
,
Program Chair:
Siddhartha Chatterjee
IBM Research

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

PPoPP06

Sponsor:

PPoPP06: ACM SIGPLAN 2006 Symposium on Principles and Practice of Parallel Programming 2006

March 29 - 31, 2006

New York, New York, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
