DOI: 10.1145/3437801.3441620

Synthesizing optimal collective algorithms

Published: 17 February 2021

Abstract

Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's Law bottleneck of data-parallel training.
This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto-frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode the synthesis problem as a quantifier-free SMT formula which can be discharged to a theorem prover. We show how our carefully built encoding enables SCCL to scale.
We synthesize novel latency- and bandwidth-optimal algorithms, not previously seen in the literature, for two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate competitive performance with hand-optimized collective communication libraries.
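To make the approach concrete, the sketch below (not SCCL's actual encoding) shows how the feasibility of a collective schedule can be posed as a quantifier-free SMT problem: Boolean variables record which node holds which chunk at each step, per-link constraints bound how much data moves per step, and the smallest satisfiable step count sits at the latency end of the Pareto frontier. It uses Z3's Python bindings as one widely available SMT solver; the allgather collective, the 4-GPU unidirectional ring, and the names `has`, `send`, and `synthesize` are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: pose "does an allgather schedule of a given length exist on
# this topology?" as a quantifier-free SMT query. Illustrative only.
from z3 import Bool, Solver, Or, Implies, Sum, If, sat

N = 4                                        # GPUs on a unidirectional ring
links = [(i, (i + 1) % N) for i in range(N)]  # ring edges i -> i+1

def synthesize(steps):
    """Return a model describing an allgather schedule finishing in `steps`
    rounds on the ring, or None if no such schedule exists."""
    s = Solver()
    # has[n][c][t]: node n holds chunk c at the start of step t
    has = [[[Bool(f"has_{n}_{c}_{t}") for t in range(steps + 1)]
            for c in range(N)] for n in range(N)]
    # send[(u, v)][c][t]: chunk c traverses link u -> v during step t
    send = {(u, v): [[Bool(f"send_{u}_{v}_{c}_{t}") for t in range(steps)]
                     for c in range(N)] for (u, v) in links}

    for n in range(N):
        for c in range(N):
            s.add(has[n][c][0] == (n == c))   # each node starts with its own chunk
            s.add(has[n][c][steps])           # every node ends with every chunk

    for t in range(steps):
        for (u, v) in links:
            for c in range(N):
                # a chunk can only be sent if the source already holds it
                s.add(Implies(send[(u, v)][c][t], has[u][c][t]))
            # bandwidth constraint: at most one chunk per link per step
            s.add(Sum([If(send[(u, v)][c][t], 1, 0) for c in range(N)]) <= 1)
        for n in range(N):
            for c in range(N):
                arrivals = [send[(u, v)][c][t] for (u, v) in links if v == n]
                # a node holds a chunk next step iff it held it or just received it
                s.add(has[n][c][t + 1] == Or(has[n][c][t], *arrivals))

    return s.model() if s.check() == sat else None

# Probe the latency end of the frontier: the fewest rounds for which a
# schedule exists (3 rounds for allgather on a 4-node unidirectional ring).
for steps in range(1, 2 * N):
    if synthesize(steps) is not None:
        print(f"allgather feasible in {steps} steps on a {N}-node ring")
        break
```

In the same style, sweeping the per-link bandwidth bound or the number of chunks per node, rather than the step count, would probe the bandwidth end of the frontier; the paper's actual encoding and lowering are more involved than this toy.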





Published In

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2021
507 pages
ISBN:9781450382946
DOI:10.1145/3437801
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 February 2021

Author Tags

  1. GPU
  2. collective communication
  3. interconnection
  4. network
  5. synthesis

Qualifiers

  • Research-article

Conference

PPoPP '21

Acceptance Rates

PPoPP '21 paper acceptance rate: 31 of 150 submissions (21%).
Overall acceptance rate: 230 of 1,014 submissions (23%).

Article Metrics

  • Downloads (last 12 months): 498
  • Downloads (last 6 weeks): 101
Reflects downloads up to 02 Oct 2024

Cited By

  • (2024) Network Load Balancing with Parallel Flowlets for AI Training Clusters. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 18-25. DOI: 10.1145/3672198.3673794. Online publication date: 4-Aug-2024.
  • (2024) Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem. Proceedings of the ACM SIGCOMM 2024 Conference, 16-37. DOI: 10.1145/3651890.3672249. Online publication date: 4-Aug-2024.
  • (2024) Thorough Characterization and Analysis of Large Transformer Model Training At-Scale. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:1, 1-25. DOI: 10.1145/3639034. Online publication date: 21-Feb-2024.
  • (2024) Efficient all-to-all Collective Communication Schedules for Direct-connect Topologies. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 28-41. DOI: 10.1145/3625549.3658656. Online publication date: 3-Jun-2024.
  • (2024) Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 178-191. DOI: 10.1145/3620666.3651379. Online publication date: 27-Apr-2024.
  • (2024) TCCL: Discovering Better Communication Paths for PCIe GPU Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 999-1015. DOI: 10.1145/3620666.3651362. Online publication date: 27-Apr-2024.
  • (2024) T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 1146-1164. DOI: 10.1145/3620665.3640410. Online publication date: 27-Apr-2024.
  • (2024) PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 245-260. DOI: 10.1109/ISCA59077.2024.00027. Online publication date: 29-Jun-2024.
  • (2024) AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive. 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), 25-35. DOI: 10.1109/ICDCS60910.2024.00012. Online publication date: 23-Jul-2024.
  • (2024) Network states-aware collective communication optimization. Cluster Computing 27:5, 6869-6887. DOI: 10.1007/s10586-024-04330-9. Online publication date: 1-Aug-2024.
