DOI: 10.1145/3302424.3303957

Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks

Published: 25 March 2019

Abstract

The employment of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in deep learning (DL). DL frameworks such as TensorFlow, MXNet, and Caffe2 have emerged to help DL researchers train their models in a distributed manner. Although current DL frameworks scale well for image classification models, there remain opportunities for scalable distributed training of natural language processing (NLP) models. We found that current frameworks show relatively low scalability when training NLP models because they do not account for differences in the sparsity of model parameters. In this paper, we propose Parallax, a framework that optimizes data parallel training by exploiting the sparsity of model parameters. Parallax introduces a hybrid approach that combines the Parameter Server and AllReduce architectures to optimize the amount of data transferred according to parameter sparsity. Experiments show that Parallax, built atop TensorFlow, achieves scalable training throughput on both dense and sparse models while requiring little effort from its users. With 48 GPUs, Parallax achieves speedups of up to 2.8x and 6.02x on NLP models over TensorFlow and Horovod, respectively. For image classification models, Parallax's training speed is equal to Horovod's and 1.53x faster than TensorFlow's.
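
To make the hybrid aggregation idea above concrete, the Python sketch below (a minimal illustration, not Parallax's actual API) shows one way parameters could be split by sparsity: dense parameters are aggregated with AllReduce, while sparse parameters such as embedding tables, of which only a small fraction of rows is updated per step, are handled through a Parameter Server. The names ParamInfo, allreduce_dense, push_pull_sparse, and the 1% density threshold are hypothetical placeholders.

    # Hedged sketch of sparsity-aware routing; the transport functions below are
    # placeholders, not calls into Parallax, TensorFlow, or Horovod.
    from dataclasses import dataclass
    from typing import Dict, List


    @dataclass
    class ParamInfo:
        name: str
        num_elements: int          # total number of elements in the parameter
        avg_updated_elements: int  # elements actually touched per step (profiled)


    def is_sparse(p: ParamInfo, density_threshold: float = 0.01) -> bool:
        # Treat a parameter as sparse if only a small fraction of its elements
        # receives a gradient each step (typical for embedding tables).
        return p.avg_updated_elements / p.num_elements < density_threshold


    def partition_params(params: List[ParamInfo]) -> Dict[str, List[ParamInfo]]:
        # Split parameters into the two aggregation groups.
        groups: Dict[str, List[ParamInfo]] = {"allreduce": [], "parameter_server": []}
        for p in params:
            groups["parameter_server" if is_sparse(p) else "allreduce"].append(p)
        return groups


    def allreduce_dense(name: str) -> None:
        # Placeholder for collective (e.g. ring AllReduce) gradient aggregation.
        print(f"AllReduce aggregation for dense gradient: {name}")


    def push_pull_sparse(name: str) -> None:
        # Placeholder for parameter-server push/pull of sparse updates.
        print(f"Parameter-server push/pull for sparse gradient: {name}")


    if __name__ == "__main__":
        model = [
            ParamInfo("embedding/weights", 800_000_000, 2_000_000),  # sparsely updated
            ParamInfo("lstm/kernel", 67_000_000, 67_000_000),        # fully dense
        ]
        for group, members in partition_params(model).items():
            for p in members:
                (push_pull_sparse if group == "parameter_server" else allreduce_dense)(p.name)

Routing sparse updates through a Parameter Server transfers only the rows that actually change each step, while dense gradients keep the bandwidth advantages of AllReduce; this division is the intuition behind the speedups reported in the abstract.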





Published In

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
March 2019
714 pages
ISBN:9781450362818
DOI:10.1145/3302424


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. deep learning framework
  2. graph transformation
  3. sparsity-aware data parallel training

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroSys '19
Sponsor:
EuroSys '19: Fourteenth EuroSys Conference 2019
March 25 - 28, 2019
Dresden, Germany

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%



Cited By

  • (2024) POSTER: Pattern-Aware Sparse Communication for Scalable Recommendation Model Training. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 466-468. DOI: 10.1145/3627535.3638481. Online publication date: 2-Mar-2024.
  • (2024) A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training. IEEE Transactions on Parallel and Distributed Systems, 35(8), 1415-1428. DOI: 10.1109/TPDS.2024.3406420. Online publication date: Aug-2024.
  • (2024) HCEC: An efficient geo-distributed deep learning training strategy based on wait-free back-propagation. Journal of Systems Architecture, 148, 103070. DOI: 10.1016/j.sysarc.2024.103070. Online publication date: Mar-2024.
  • (2024) A review on label cleaning techniques for learning with noisy labels. ICT Express. DOI: 10.1016/j.icte.2024.09.007. Online publication date: Sep-2024.
  • (2023) From distributed machine to distributed deep learning: a comprehensive survey. Journal of Big Data, 10(1). DOI: 10.1186/s40537-023-00829-x. Online publication date: 13-Oct-2023.
  • (2023) FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication. Proceedings of the ACM on Management of Data, 1(2), 1-21. DOI: 10.1145/3589310. Online publication date: 20-Jun-2023.
  • (2023) Good Intentions: Adaptive Parameter Management via Intent Signaling. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2156-2166. DOI: 10.1145/3583780.3614895. Online publication date: 21-Oct-2023.
  • (2023) Compressed Collective Sparse-Sketch for Distributed Data-Parallel Training of Deep Learning Models. IEEE Journal on Selected Areas in Communications, 41(4), 941-963. DOI: 10.1109/JSAC.2023.3242733. Online publication date: Apr-2023.
  • (2023) Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 1-10. DOI: 10.1109/INFOCOM53939.2023.10228922. Online publication date: 17-May-2023.
  • (2023) PipePar. Neurocomputing, 555(C). DOI: 10.1016/j.neucom.2023.126661. Online publication date: 17-Oct-2023.
