DOI: 10.1145/3472456.3472467
Research Article

Prophet: Speeding up Distributed DNN Training with Predictable Communication Scheduling

Published: 05 October 2021

Abstract

Optimizing the performance of Distributed Deep Neural Network (DDNN) training has become increasingly compelling as DNN models grow more complex and training datasets become larger. While existing work on communication scheduling mostly focuses on overlapping computation and communication to improve DDNN training performance, GPU and network resources remain under-utilized in DDNN training clusters. To tackle this issue, we design and implement a predictable communication scheduling strategy named Prophet that schedules gradient transfers in a suitable order, with the aim of maximizing GPU and network resource utilization. Leveraging the stepwise pattern we observe in gradient transfer start times, Prophet first uses the monitored network bandwidth and the profiled time intervals between gradients to predict how many gradients should be grouped into each block. These gradient blocks are then transferred one by one to keep GPU and network resources highly utilized while preserving the priority of gradient transfers (i.e., low-priority gradients cannot preempt high-priority gradients on the network). Prophet thereby lets forward propagation start as early as possible, greedily reducing the waiting (idle) time of GPU resources during DDNN training. Prototype experiments with representative DNN models trained on Amazon EC2 demonstrate that Prophet improves DDNN training performance by up to 40% over state-of-the-art priority-based communication scheduling strategies, with negligible runtime overhead.
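
The abstract describes the block-grouping step only at a high level. A minimal sketch of the idea, assuming gradient sizes, ready times, and priorities have been profiled in earlier iterations, might look like the following; the class names, the bandwidth parameter, and the block-closing rule are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' implementation): group gradients into
# blocks whose predicted transfer time covers the gap until the next gradient
# becomes ready, so the network stays busy without letting a low-priority
# gradient preempt a high-priority one.

from dataclasses import dataclass
from typing import List

@dataclass
class Gradient:
    name: str
    size_bytes: int       # profiled gradient size
    ready_time_s: float   # profiled time at which backpropagation produces it
    priority: int         # smaller value = higher transfer priority

def group_into_blocks(grads: List[Gradient], bandwidth_bps: float) -> List[List[Gradient]]:
    """Greedily merge gradients (in priority order) into a block until the
    block's predicted transfer time reaches the interval before the next
    gradient is ready; bandwidth_bps is assumed to be monitored at runtime."""
    grads = sorted(grads, key=lambda g: g.priority)
    blocks: List[List[Gradient]] = []
    current: List[Gradient] = []
    current_bytes = 0
    for i, g in enumerate(grads):
        current.append(g)
        current_bytes += g.size_bytes
        predicted_transfer_s = current_bytes * 8 / bandwidth_bps
        # Gap until the next gradient in priority order becomes available.
        gap_s = (grads[i + 1].ready_time_s - g.ready_time_s) if i + 1 < len(grads) else 0.0
        if predicted_transfer_s >= gap_s:
            blocks.append(current)
            current, current_bytes = [], 0
    if current:
        blocks.append(current)
    return blocks

In a real system the bandwidth estimate would be refreshed each iteration and the ready times re-profiled as training progresses; the point of the sketch is only that block boundaries fall out of a comparison between predicted transfer time and the profiled inter-gradient interval, which is the prediction the abstract describes.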


Cited By

  • Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey. IEEE Transactions on Parallel and Distributed Systems 34(12): 3294–3308 (Dec 2023). DOI: 10.1109/TPDS.2023.3323282



Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2021


Author Tags

  1. communication scheduling
  2. distributed DNN training
  3. gradient transfer
  4. resource utilization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
