DOI: 10.1145/3341301.3359642
Research article

A generic communication scheduler for distributed DNN training acceleration

Published: 27 October 2019

Abstract

We present ByteScheduler, a generic communication scheduler for distributed DNN training acceleration. ByteScheduler is based on our principled analysis showing that partitioning and rearranging tensor transmissions can achieve optimal results in theory and good performance in practice, even with scheduling overhead. To make ByteScheduler work across various DNN training frameworks, we introduce a unified abstraction and a Dependency Proxy mechanism that enable communication scheduling without breaking the original dependencies in framework engines. We further introduce a Bayesian Optimization approach to auto-tune the tensor partition size and other parameters for different training models under various networking conditions. ByteScheduler currently supports TensorFlow, PyTorch, and MXNet without modifying their source code, and works with both Parameter Server (PS) and all-reduce architectures for gradient synchronization, over either TCP or RDMA. Our experiments show that ByteScheduler accelerates training with all evaluated system configurations and DNN models, by up to 196% (i.e., 2.96x the original speed).
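To illustrate the core scheduling idea, here is a minimal sketch, assuming a simplified setting rather than the actual ByteScheduler API: each gradient tensor that becomes ready during backpropagation is partitioned into fixed-size chunks and pushed into a priority queue, and chunks are drained in priority order so that tensors needed earliest in the next forward pass (lower layer index) go out first. The Scheduler class, the send_chunk callback, and the hard-coded PARTITION_SIZE are hypothetical placeholders; in ByteScheduler the partition size is among the parameters auto-tuned via Bayesian Optimization.

```python
# Minimal sketch (not the ByteScheduler API): partition ready gradient tensors
# into fixed-size chunks and transmit them in priority order, so that tensors
# needed earliest in the next forward pass (lower layer index) are sent first.

import heapq
import itertools
from dataclasses import dataclass, field
from typing import List

PARTITION_SIZE = 4_000_000  # bytes; hypothetical value, auto-tuned in the paper

@dataclass(order=True)
class Chunk:
    priority: int                       # lower value = sent earlier
    seq: int                            # tie-breaker: FIFO within a tensor
    tensor_id: str = field(compare=False)
    offset: int = field(compare=False)
    length: int = field(compare=False)

class Scheduler:
    def __init__(self):
        self._queue: List[Chunk] = []
        self._seq = itertools.count()

    def enqueue_tensor(self, tensor_id: str, nbytes: int, layer_index: int):
        """Split a ready gradient tensor into chunks and enqueue them.
        Priority = layer_index, so layer 0 (needed first in the next
        forward pass) drains before deeper layers."""
        for offset in range(0, nbytes, PARTITION_SIZE):
            length = min(PARTITION_SIZE, nbytes - offset)
            heapq.heappush(self._queue,
                           Chunk(layer_index, next(self._seq),
                                 tensor_id, offset, length))

    def drain(self, send_chunk):
        """Transmit chunks in priority order. `send_chunk` is a hypothetical
        callback standing in for the real PS push / all-reduce call."""
        while self._queue:
            c = heapq.heappop(self._queue)
            send_chunk(c.tensor_id, c.offset, c.length)

# Example: gradients become ready back-to-front during backpropagation,
# yet the scheduler still emits the layer-0 chunks first.
if __name__ == "__main__":
    sched = Scheduler()
    sched.enqueue_tensor("fc.weight", 9_000_000, layer_index=3)
    sched.enqueue_tensor("conv1.weight", 2_000_000, layer_index=0)
    sched.drain(lambda t, off, n: print(f"send {t} [{off}:{off + n}]"))
```

Assigning priority by layer index lets communication for front layers overlap with backward computation of deeper layers, and allows the next iteration's forward pass to start before all gradients have finished synchronizing.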

Published In

SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles
October 2019
615 pages
ISBN: 9781450368735
DOI: 10.1145/3341301

In-Cooperation

• USENIX Association

Publisher

Association for Computing Machinery
New York, NY, United States


Author Tags

1. ML frameworks
2. communication scheduling


Conference

SOSP '19: ACM SIGOPS 27th Symposium on Operating Systems Principles
October 27 - 30, 2019
Huntsville, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate: 174 of 961 submissions, 18%
