DOI:10.1145/1122971.1122975

Collective communication on architectures that support simultaneous communication over multiple links

Published: 29 March 2006

Abstract

Traditional collective communication algorithms are designed with the assumption that a node can communicate with only one other node at a time. On new parallel architectures such as the IBM Blue Gene/L, a node can communicate with multiple nodes simultaneously. We have redesigned and reimplemented many of the MPI collective communication algorithms to take advantage of this ability to send simultaneously, including broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter, and allreduce. We show that these new algorithms have lower expected costs than the previously known lower bounds based on old models of parallel computation. Results are included comparing their performance to the default implementations in IBM's MPI.
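
To illustrate why simultaneous sends can beat bounds derived for one-partner-at-a-time models, the following is a minimal sketch, not the authors' algorithm. It counts broadcast rounds under a deliberately simplified model in which every node that already holds the message can forward it to `links` distinct uninformed nodes per round, ignoring network topology, message size, and contention. The function name `broadcast_rounds` and the model itself are assumptions made for illustration; the value six reflects the six nearest-neighbor links of the Blue Gene/L torus network.

```python
def broadcast_rounds(num_nodes: int, links: int = 1) -> int:
    """Count rounds needed to inform `num_nodes` nodes, starting from one root.

    Simplifying assumption (not from the paper): in each round, every
    informed node sends to `links` uninformed nodes at once, so the
    informed set grows by a factor of (links + 1) per round.
    """
    if num_nodes < 1 or links < 1:
        raise ValueError("num_nodes and links must be positive")
    informed = 1  # only the root holds the message initially
    rounds = 0
    while informed < num_nodes:
        informed += informed * links  # each informed node reaches `links` new nodes
        rounds += 1
    return rounds


if __name__ == "__main__":
    # One partner at a time: the familiar ceil(log2(p)) rounds.
    print(broadcast_rounds(512, links=1))  # 9
    # Six simultaneous sends per node: ceil(log7(p)) rounds.
    print(broadcast_rounds(512, links=6))  # 4
```

Under this model the round count falls from ceil(log2 p) to ceil(log_{k+1} p) for k links, which is the kind of gap a single-port cost model cannot express.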

    Published In

    PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
    March 2006
    258 pages
    ISBN:1595931899
    DOI:10.1145/1122971
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Article

    Conference

    PPoPP06

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Cited By

    • (2023) Parallel intelligent computing: development and challenges. SCIENTIA SINICA Informationis 53:8 (1441). Online publication date: 17-Aug-2023. DOI:10.1360/SSI-2023-0051
    • (2023) ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (283-294). Online publication date: Apr-2023. DOI:10.1109/ISPASS57527.2023.00035
    • (2023) Logical/Physical Topology-Aware Collective Communication in Deep Learning Training. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (56-68). Online publication date: Feb-2023. DOI:10.1109/HPCA56546.2023.10071117
    • (2023) Accelerating communication with multi‐HCA aware collectives in MPI. Concurrency and Computation: Practice and Experience 36:1. Online publication date: 9-Aug-2023. DOI:10.1002/cpe.7879
    • (2022) Themis. Proceedings of the 49th Annual International Symposium on Computer Architecture (581-596). Online publication date: 18-Jun-2022. DOI:10.1145/3470496.3527382
    • (2021) Enabling compute-communication overlap in distributed deep learning training platforms. Proceedings of the 48th Annual International Symposium on Computer Architecture (540-553). Online publication date: 14-Jun-2021. DOI:10.1109/ISCA52012.2021.00049
    • (2020) Communication Optimization Technology Based on Network Dynamic Performance Model. Mathematical Problems in Engineering 2020 (1-13). Online publication date: 15-Oct-2020. DOI:10.1155/2020/8890721
    • (2020) ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (81-92). Online publication date: Aug-2020. DOI:10.1109/ISPASS48437.2020.00018
    • (2019) On Optimal Trees for Irregular Gather and Scatter Collectives. IEEE Transactions on Parallel and Distributed Systems (1-1). Online publication date: 2019. DOI:10.1109/TPDS.2019.2899843
    • (2019) A Disaggregated Memory System for Deep Learning. IEEE Micro 39:5 (82-90). Online publication date: 1-Sep-2019. DOI:10.1109/MM.2019.2929165
