DOI: 10.1145/3302424.3303957

Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks

Published: 25 March 2019

Abstract

The employment of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in deep learning (DL). DL frameworks such as TensorFlow, MXNet, and Caffe2 have emerged to help DL researchers train their models in a distributed manner. Although current DL frameworks scale well for image classification models, there remain opportunities for scalable distributed training of natural language processing (NLP) models. We found that current frameworks show relatively low scalability when training NLP models because they do not account for differences in the sparsity of model parameters. In this paper, we propose Parallax, a framework that optimizes data parallel training by exploiting the sparsity of model parameters. Parallax introduces a hybrid approach that combines the Parameter Server and AllReduce architectures to optimize the amount of data transferred according to parameter sparsity. Experiments show that Parallax, built atop TensorFlow, achieves scalable training throughput on both dense and sparse models while requiring little effort from its users. With 48 GPUs, Parallax achieves speedups of up to 2.8x and 6.02x on NLP models over TensorFlow and Horovod, respectively. For image classification models, Parallax's training speed is equal to Horovod's and 1.53x faster than TensorFlow's.
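
To make the hybrid aggregation idea above concrete, the Python sketch below (a minimal illustration, not Parallax's actual API) shows one way parameters could be split by sparsity: dense parameters are aggregated with AllReduce, while sparse parameters such as embedding tables, of which only a small fraction of rows is updated per step, are handled through a Parameter Server. The names ParamInfo, allreduce_dense, push_pull_sparse, and the 1% density threshold are hypothetical placeholders.

    # Hedged sketch of sparsity-aware routing; the transport functions below are
    # placeholders, not calls into Parallax, TensorFlow, or Horovod.
    from dataclasses import dataclass
    from typing import Dict, List


    @dataclass
    class ParamInfo:
        name: str
        num_elements: int          # total number of elements in the parameter
        avg_updated_elements: int  # elements actually touched per step (profiled)


    def is_sparse(p: ParamInfo, density_threshold: float = 0.01) -> bool:
        # Treat a parameter as sparse if only a small fraction of its elements
        # receives a gradient each step (typical for embedding tables).
        return p.avg_updated_elements / p.num_elements < density_threshold


    def partition_params(params: List[ParamInfo]) -> Dict[str, List[ParamInfo]]:
        # Split parameters into the two aggregation groups.
        groups: Dict[str, List[ParamInfo]] = {"allreduce": [], "parameter_server": []}
        for p in params:
            groups["parameter_server" if is_sparse(p) else "allreduce"].append(p)
        return groups


    def allreduce_dense(name: str) -> None:
        # Placeholder for collective (e.g. ring AllReduce) gradient aggregation.
        print(f"AllReduce aggregation for dense gradient: {name}")


    def push_pull_sparse(name: str) -> None:
        # Placeholder for parameter-server push/pull of sparse updates.
        print(f"Parameter-server push/pull for sparse gradient: {name}")


    if __name__ == "__main__":
        model = [
            ParamInfo("embedding/weights", 800_000_000, 2_000_000),  # sparsely updated
            ParamInfo("lstm/kernel", 67_000_000, 67_000_000),        # fully dense
        ]
        for group, members in partition_params(model).items():
            for p in members:
                (push_pull_sparse if group == "parameter_server" else allreduce_dense)(p.name)

Routing sparse updates through a Parameter Server transfers only the rows that actually change each step, while dense gradients keep the bandwidth advantages of AllReduce; this division is the intuition behind the speedups reported in the abstract.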





Published In

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
March 2019
714 pages
ISBN:9781450362818
DOI:10.1145/3302424


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. deep learning framework
  2. graph transformation
  3. sparsity-aware data parallel training

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroSys '19
Sponsor:
EuroSys '19: Fourteenth EuroSys Conference 2019
March 25 - 28, 2019
Dresden, Germany

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%



Cited By

  • (2024) POSTER: Pattern-Aware Sparse Communication for Scalable Recommendation Model Training. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 466-468. DOI: 10.1145/3627535.3638481. Online publication date: 2-Mar-2024.
  • (2024) A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training. IEEE Transactions on Parallel and Distributed Systems, 35(8), 1415-1428. DOI: 10.1109/TPDS.2024.3406420. Online publication date: Aug-2024.
  • (2024) HCEC: An efficient geo-distributed deep learning training strategy based on wait-free back-propagation. Journal of Systems Architecture, 148, 103070. DOI: 10.1016/j.sysarc.2024.103070. Online publication date: Mar-2024.
  • (2024) A review on label cleaning techniques for learning with noisy labels. ICT Express. DOI: 10.1016/j.icte.2024.09.007. Online publication date: Sep-2024.
  • (2023) From distributed machine to distributed deep learning: a comprehensive survey. Journal of Big Data, 10(1). DOI: 10.1186/s40537-023-00829-x. Online publication date: 13-Oct-2023.
  • (2023) FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication. Proceedings of the ACM on Management of Data, 1(2), 1-21. DOI: 10.1145/3589310. Online publication date: 20-Jun-2023.
  • (2023) Good Intentions: Adaptive Parameter Management via Intent Signaling. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2156-2166. DOI: 10.1145/3583780.3614895. Online publication date: 21-Oct-2023.
  • (2023) Compressed Collective Sparse-Sketch for Distributed Data-Parallel Training of Deep Learning Models. IEEE Journal on Selected Areas in Communications, 41(4), 941-963. DOI: 10.1109/JSAC.2023.3242733. Online publication date: Apr-2023.
  • (2023) Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, 1-10. DOI: 10.1109/INFOCOM53939.2023.10228922. Online publication date: 17-May-2023.
  • (2023) PipePar. Neurocomputing, 555(C). DOI: 10.1016/j.neucom.2023.126661. Online publication date: 17-Oct-2023.
