ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library

Published: 01 September 2021
Publisher: IEEE Computer Society Press, Washington, DC, United States

Abstract

Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that make full use of heterogeneous interconnects simultaneously, and our experimental results show significant performance improvements.
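
For readers unfamiliar with collective communication, the sketch below simulates a ring all-reduce, the classic bandwidth-optimal collective that gradient synchronization in distributed training typically relies on. It is a minimal single-process illustration under assumed simplifications (Python lists standing in for per-GPU buffers, in-process loops standing in for network transfers); it is not ACCL's API or implementation, whose algorithms additionally exploit heterogeneous intra- and inter-node interconnects.

# Minimal, hypothetical single-process simulation of a ring all-reduce.
# Illustration only: plain Python lists stand in for per-GPU gradient
# buffers and in-process loops stand in for network transfers.
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Sum-reduce equal-length buffers across n simulated ranks, in place.

    Phase 1 (reduce-scatter): after n-1 steps rank r holds the complete sum
    for chunk (r+1) mod n.  Phase 2 (all-gather): after n-1 more steps every
    rank holds the fully reduced buffer.  Each rank transfers about
    2*(n-1)/n of the buffer in total, independent of n, which is why the
    ring is the textbook bandwidth-optimal all-reduce for large messages.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "buffers must be equal-sized"
    assert length % n == 0, "buffer length must be divisible by the rank count"
    chunk = length // n

    def copy_chunk(rank: int, idx: int) -> List[float]:
        return list(buffers[rank][idx * chunk:(idx + 1) * chunk])

    # Reduce-scatter: at step t, rank r sends chunk (r - t) mod n to rank r+1,
    # which accumulates it.  Sends are snapshotted first so the sequential
    # loop behaves like the simultaneous exchanges of a real ring.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, copy_chunk(r, (r - step) % n)) for r in range(n)]
        for r, idx, data in sends:
            dst, base = (r + 1) % n, idx * chunk
            for i, v in enumerate(data):
                buffers[dst][base + i] += v

    # All-gather: at step t, rank r forwards its completed chunk
    # (r + 1 - t) mod n to rank r+1, which overwrites its stale copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, copy_chunk(r, (r + 1 - step) % n)) for r in range(n)]
        for r, idx, data in sends:
            dst, base = (r + 1) % n, idx * chunk
            for i, v in enumerate(data):
                buffers[dst][base + i] = v

    return buffers


if __name__ == "__main__":
    # Four simulated ranks, each contributing a gradient buffer of length 8.
    ranks = [[float(r)] * 8 for r in range(4)]
    result = ring_allreduce(ranks)
    assert all(b == [6.0] * 8 for b in result)  # 0 + 1 + 2 + 3 = 6 everywhere
    print(result[0])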


Cited By

  • (2024) TCCL: Co-optimizing Collective Communication and Traffic Routing for GPU-centric Clusters. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, pp. 48-53. DOI: 10.1145/3672198.3673799. Online publication date: 4-Aug-2024.
  • (2024) Network Load Balancing with Parallel Flowlets for AI Training Clusters. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, pp. 18-25. DOI: 10.1145/3672198.3673794. Online publication date: 4-Aug-2024.
  • (2024) Crux: GPU-Efficient Communication Scheduling for Deep Learning Training. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 1-15. DOI: 10.1145/3651890.3672239. Online publication date: 4-Aug-2024.
  • (2023) xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology, 38(1), pp. 166-195. DOI: 10.1007/s11390-023-2894-6. Online publication date: 1-Feb-2023.
