Abstract
The scale of data and the scale of computing infrastructure together enable the current deep-learning renaissance. However, training large-scale deep architectures demands both algorithmic improvements and careful system configuration. In this paper, we focus on the system approach to speeding up large-scale training. Taking both algorithmic and system aspects into consideration, we develop a procedure for setting the mini-batch size and choosing computation algorithms. We also derive lemmas for determining the quantities of key components, such as the numbers of GPUs and parameter servers. Experiments and examples show that these guidelines help to effectively speed up large-scale deep-learning training.
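The paper's lemmas are not reproduced here; purely as an illustration of the kind of provisioning question the abstract refers to, the sketch below estimates how many parameter-server shards are needed so that pushing gradients and pulling parameters can be hidden behind each iteration's computation. The function name, the bandwidth-balance model, and all numeric values are assumptions for illustration, not the paper's derivation.

```python
import math

def min_parameter_servers(model_bytes: float, num_workers: int,
                          ps_bandwidth_bytes_per_s: float,
                          compute_time_s: float) -> int:
    """Smallest number of PS shards whose aggregate bandwidth can carry the
    gradient push and parameter pull of every worker within one iteration's
    compute time, so communication overlaps with computation.
    (Illustrative model only, not the lemmas derived in the paper.)"""
    traffic_per_iter = 2 * model_bytes * num_workers      # push + pull, all workers
    needed_bandwidth = traffic_per_iter / compute_time_s  # bytes/s the PS tier must sustain
    return max(1, math.ceil(needed_bandwidth / ps_bandwidth_bytes_per_s))

# Example with assumed numbers: 244 MB of parameters (AlexNet-scale), 8 workers,
# 10 Gb/s NICs on each parameter server, 0.5 s of computation per iteration.
print(min_parameter_servers(244e6, 8, 10e9 / 8, 0.5))  # -> 7
```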
Notes
- 1.
GPU instances on Google Compute Engine (GCE) do not support GPU peer-to-peer access; we therefore defer our GCE experiments until such support becomes available.
- 2.
For each training instance in a mini-batch, we need to store the gradients of all model parameters; the aggregated gradients of all model parameters for that batch are also required. A back-of-the-envelope sketch of this storage requirement appears after these notes.
- 3.
AlexNet achieved an 18.2% top-5 error rate in the ILSVRC-2012 competition, whereas we obtained 21% in our experiments because we did not apply all of the data-augmentation and fine-tuning tricks. We chose 25% as the termination criterion to demonstrate convergence behavior under different mini-batch sizes.
- 4.
nvprof profiles only GPU activities, so CPU activities cannot be analyzed with it.
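The sketch below illustrates the storage accounting described in note 2: one gradient copy per training instance in the mini-batch plus one aggregated gradient for the whole batch. The parameter count, 4-byte floats, and the function itself are illustrative assumptions, not figures or code from the paper.

```python
def gradient_memory_gb(num_params: int, mini_batch_size: int,
                       bytes_per_value: int = 4) -> float:
    """Estimated gradient storage in GB for one mini-batch: per-instance
    gradients for every training instance plus one aggregated gradient."""
    per_instance = num_params * bytes_per_value * mini_batch_size  # one copy per instance
    aggregated = num_params * bytes_per_value                      # summed gradients for the batch
    return (per_instance + aggregated) / 1024 ** 3

if __name__ == "__main__":
    ALEXNET_PARAMS = 61_000_000  # roughly AlexNet-scale; an assumed, not measured, value
    for batch in (32, 64, 128, 256):
        print(f"batch {batch:4d}: ~{gradient_memory_gb(ALEXNET_PARAMS, batch):.1f} GB of gradients")
```

As the numbers grow quickly with mini-batch size, this accounting shows why available device memory constrains how large a mini-batch can be.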
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Zou, S.-X., et al. (2017). Distributed Training Large-Scale Deep Architectures. In: Cong, G., Peng, W.-C., Zhang, W., Li, C., Sun, A. (eds.) Advanced Data Mining and Applications (ADMA 2017). Lecture Notes in Computer Science, vol. 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_2
DOI: https://doi.org/10.1007/978-3-319-69179-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69178-7
Online ISBN: 978-3-319-69179-4
eBook Packages: Computer Science (R0)