ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library

Published: 01 September 2021
Publisher: IEEE Computer Society Press, Washington, DC, United States

Abstract

Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that make full use of heterogeneous interconnects simultaneously, and our experimental results show significant performance improvements.
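
For readers unfamiliar with collective communication, the sketch below simulates a ring all-reduce, the classic bandwidth-optimal collective that gradient synchronization in distributed training typically relies on. It is a minimal single-process illustration under assumed simplifications (Python lists standing in for per-GPU buffers, in-process loops standing in for network transfers); it is not ACCL's API or implementation, whose algorithms additionally exploit heterogeneous intra- and inter-node interconnects.

# Minimal, hypothetical single-process simulation of a ring all-reduce.
# Illustration only: plain Python lists stand in for per-GPU gradient
# buffers and in-process loops stand in for network transfers.
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Sum-reduce equal-length buffers across n simulated ranks, in place.

    Phase 1 (reduce-scatter): after n-1 steps rank r holds the complete sum
    for chunk (r+1) mod n.  Phase 2 (all-gather): after n-1 more steps every
    rank holds the fully reduced buffer.  Each rank transfers about
    2*(n-1)/n of the buffer in total, independent of n, which is why the
    ring is the textbook bandwidth-optimal all-reduce for large messages.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "buffers must be equal-sized"
    assert length % n == 0, "buffer length must be divisible by the rank count"
    chunk = length // n

    def copy_chunk(rank: int, idx: int) -> List[float]:
        return list(buffers[rank][idx * chunk:(idx + 1) * chunk])

    # Reduce-scatter: at step t, rank r sends chunk (r - t) mod n to rank r+1,
    # which accumulates it.  Sends are snapshotted first so the sequential
    # loop behaves like the simultaneous exchanges of a real ring.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, copy_chunk(r, (r - step) % n)) for r in range(n)]
        for r, idx, data in sends:
            dst, base = (r + 1) % n, idx * chunk
            for i, v in enumerate(data):
                buffers[dst][base + i] += v

    # All-gather: at step t, rank r forwards its completed chunk
    # (r + 1 - t) mod n to rank r+1, which overwrites its stale copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, copy_chunk(r, (r + 1 - step) % n)) for r in range(n)]
        for r, idx, data in sends:
            dst, base = (r + 1) % n, idx * chunk
            for i, v in enumerate(data):
                buffers[dst][base + i] = v

    return buffers


if __name__ == "__main__":
    # Four simulated ranks, each contributing a gradient buffer of length 8.
    ranks = [[float(r)] * 8 for r in range(4)]
    result = ring_allreduce(ranks)
    assert all(b == [6.0] * 8 for b in result)  # 0 + 1 + 2 + 3 = 6 everywhere
    print(result[0])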


Cited By

  • (2024) TCCL: Co-optimizing Collective Communication and Traffic Routing for GPU-centric Clusters. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, pp. 48-53. DOI: 10.1145/3672198.3673799. Online publication date: 4-Aug-2024.
  • (2024) Network Load Balancing with Parallel Flowlets for AI Training Clusters. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, pp. 18-25. DOI: 10.1145/3672198.3673794. Online publication date: 4-Aug-2024.
  • (2024) Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 1-15. DOI: 10.1145/3651890.3672239. Online publication date: 4-Aug-2024.
  • (2023) xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology, 38(1), pp. 166-195. DOI: 10.1007/s11390-023-2894-6. Online publication date: 1-Feb-2023.
