DOI: 10.1145/3437801.3441620

Synthesizing optimal collective algorithms

Published: 17 February 2021

Abstract

Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's Law bottleneck of data-parallel training.
This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto-frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode the synthesis problem as a quantifier-free SMT formula which can be discharged to a theorem prover. We show how our carefully built encoding enables SCCL to scale.
We synthesize novel latency- and bandwidth-optimal algorithms, not previously seen in the literature, for two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate competitive performance with hand-optimized collective communication libraries.
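To make the approach concrete, the sketch below (not SCCL's actual encoding) shows how the feasibility of a collective schedule can be posed as a quantifier-free SMT problem: Boolean variables record which node holds which chunk at each step, per-link constraints bound how much data moves per step, and the smallest satisfiable step count sits at the latency end of the Pareto frontier. It uses Z3's Python bindings as one widely available SMT solver; the allgather collective, the 4-GPU unidirectional ring, and the names `has`, `send`, and `synthesize` are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: pose "does an allgather schedule of a given length exist on
# this topology?" as a quantifier-free SMT query. Illustrative only.
from z3 import Bool, Solver, Or, Implies, Sum, If, sat

N = 4                                        # GPUs on a unidirectional ring
links = [(i, (i + 1) % N) for i in range(N)]  # ring edges i -> i+1

def synthesize(steps):
    """Return a model describing an allgather schedule finishing in `steps`
    rounds on the ring, or None if no such schedule exists."""
    s = Solver()
    # has[n][c][t]: node n holds chunk c at the start of step t
    has = [[[Bool(f"has_{n}_{c}_{t}") for t in range(steps + 1)]
            for c in range(N)] for n in range(N)]
    # send[(u, v)][c][t]: chunk c traverses link u -> v during step t
    send = {(u, v): [[Bool(f"send_{u}_{v}_{c}_{t}") for t in range(steps)]
                     for c in range(N)] for (u, v) in links}

    for n in range(N):
        for c in range(N):
            s.add(has[n][c][0] == (n == c))   # each node starts with its own chunk
            s.add(has[n][c][steps])           # every node ends with every chunk

    for t in range(steps):
        for (u, v) in links:
            for c in range(N):
                # a chunk can only be sent if the source already holds it
                s.add(Implies(send[(u, v)][c][t], has[u][c][t]))
            # bandwidth constraint: at most one chunk per link per step
            s.add(Sum([If(send[(u, v)][c][t], 1, 0) for c in range(N)]) <= 1)
        for n in range(N):
            for c in range(N):
                arrivals = [send[(u, v)][c][t] for (u, v) in links if v == n]
                # a node holds a chunk next step iff it held it or just received it
                s.add(has[n][c][t + 1] == Or(has[n][c][t], *arrivals))

    return s.model() if s.check() == sat else None

# Probe the latency end of the frontier: the fewest rounds for which a
# schedule exists (3 rounds for allgather on a 4-node unidirectional ring).
for steps in range(1, 2 * N):
    if synthesize(steps) is not None:
        print(f"allgather feasible in {steps} steps on a {N}-node ring")
        break
```

In the same style, sweeping the per-link bandwidth bound or the number of chunks per node, rather than the step count, would probe the bandwidth end of the frontier; the paper's actual encoding and lowering are more involved than this toy.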





Published In

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2021
507 pages
ISBN:9781450382946
DOI:10.1145/3437801
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 February 2021

Author Tags

  1. GPU
  2. collective communication
  3. interconnection
  4. network
  5. synthesis

Qualifiers

  • Research-article

Conference

PPoPP '21

Acceptance Rates

PPoPP '21 paper acceptance rate: 31 of 150 submissions (21%).
Overall acceptance rate: 230 of 1,014 submissions (23%).

Article Metrics

  • Downloads (last 12 months): 498
  • Downloads (last 6 weeks): 101
Reflects downloads up to 02 Oct 2024

Cited By

  • (2024) Network Load Balancing with Parallel Flowlets for AI Training Clusters. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 18-25. DOI: 10.1145/3672198.3673794. Online publication date: 4-Aug-2024.
  • (2024) Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem. Proceedings of the ACM SIGCOMM 2024 Conference, 16-37. DOI: 10.1145/3651890.3672249. Online publication date: 4-Aug-2024.
  • (2024) Thorough Characterization and Analysis of Large Transformer Model Training At-Scale. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:1, 1-25. DOI: 10.1145/3639034. Online publication date: 21-Feb-2024.
  • (2024) Efficient all-to-all Collective Communication Schedules for Direct-connect Topologies. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 28-41. DOI: 10.1145/3625549.3658656. Online publication date: 3-Jun-2024.
  • (2024) Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 178-191. DOI: 10.1145/3620666.3651379. Online publication date: 27-Apr-2024.
  • (2024) TCCL: Discovering Better Communication Paths for PCIe GPU Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 999-1015. DOI: 10.1145/3620666.3651362. Online publication date: 27-Apr-2024.
  • (2024) T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 1146-1164. DOI: 10.1145/3620665.3640410. Online publication date: 27-Apr-2024.
  • (2024) PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 245-260. DOI: 10.1109/ISCA59077.2024.00027. Online publication date: 29-Jun-2024.
  • (2024) AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive. 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), 25-35. DOI: 10.1109/ICDCS60910.2024.00012. Online publication date: 23-Jul-2024.
  • (2024) Network states-aware collective communication optimization. Cluster Computing 27:5, 6869-6887. DOI: 10.1007/s10586-024-04330-9. Online publication date: 1-Aug-2024.
