
DOI: 10.1145/3225058.3225069
Research Article

ImageNet Training in Minutes

Published: 13 August 2018

Abstract

In this paper, we investigate the capability of large-scale computers to speed up deep neural network (DNN) training. Our approach is to use a large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, to make efficient use of massive computing resources. The approach is generic: we empirically evaluate its effectiveness on two neural networks, AlexNet and ResNet-50, trained on the ImageNet-1k dataset, while preserving state-of-the-art test accuracy. Compared to the baseline of a previous study by researchers at Facebook, our approach achieves higher test accuracy at batch sizes larger than 16K. Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. Using 2,048 Intel Xeon Phi 7250 processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe v1.0.7.
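
The key ingredient named in the abstract is the LARS rule, which gives each layer its own effective learning rate proportional to the ratio of the layer's weight norm to its gradient norm, so that a single large global batch size does not destabilize layers whose gradients are disproportionately large. The sketch below is a minimal NumPy illustration of that layer-wise update rule, not the authors' Intel Caffe implementation; the function name lars_step and the default hyperparameter values (trust coefficient, weight decay, momentum) are illustrative assumptions.

    import numpy as np

    def lars_step(weights, grads, velocities, global_lr,
                  trust_coef=0.001, weight_decay=0.0005, momentum=0.9):
        # One synchronous SGD step with layer-wise adaptive rate scaling (LARS).
        # 'weights', 'grads', and 'velocities' are parallel lists of per-layer arrays.
        # Hypothetical helper, sketched from the LARS rule, not the paper's code.
        for i, (w, g) in enumerate(zip(weights, grads)):
            w_norm = np.linalg.norm(w)
            g_norm = np.linalg.norm(g)
            # Layer-wise local learning rate: eta * ||w|| / (||grad|| + wd * ||w||).
            local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
            # Momentum update scaled by both the global and the local learning rate,
            # with weight decay folded into the gradient.
            velocities[i] = momentum * velocities[i] \
                + global_lr * local_lr * (g + weight_decay * w)
            weights[i] = w - velocities[i]
        return weights, velocities

In the large-batch setting the global learning rate is typically scaled up with the batch size and warmed up over the first few epochs; LARS then keeps each layer's update magnitude proportionate to its weights as that global rate grows.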

References

[1]
Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. 2017. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. arXiv preprint arXiv:1711.04325 (2017).
[2]
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. 2015. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595 (2015).
[3]
Carol Reiley. 2016. Deep Driving. (2016). https://www.technologyreview.com/s/602600/deep-driving/.
[4]
Bryan Catanzaro. 2013. Deep learning with COTS HPC systems. (2013).
[5]
Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016).
[6]
Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. 2016. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709 (2016).
[7]
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Advances in neural information processing systems. 1223--1231.
[8]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248--255.
[9]
Jack Dongarra, Martin Meuer, Horst Simon, and Erich Strohmaier. 2017. Top500 supercomputer ranking. (2017). https://www.top500.org/lists/2017/06/
[10]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017).
[11]
Hayit Greenspan, Bram van Ginneken, and Ronald M Summers. 2016. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging 35, 5 (2016), 1153--1159.
[12]
William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. 1996. A high-performance, portable implementation of the MPI message passing interface standard. Parallel computing 22, 6 (1996), 789--828.
[13]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
[14]
JB Heaton, NG Polson, and JH Witte. 2016. Deep learning in finance. arXiv preprint arXiv:1602.06561 (2016).
[15]
Júlio Hoffimann, Youli Mao, Avinash Wesley, and Aimee Taylor. 2017. Sequence Mining and Pattern Analysis in Drilling Reports with Deep Natural Language Processing. arXiv preprint arXiv:1712.01476 (2017).
[16]
Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. 2015. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. CoRR abs/1511.00175 (2015). http://arxiv.org/abs/1511.00175
[17]
Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. 2016. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2592--2600.
[18]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, 675--678.
[19]
Peter H Jin, Qiaochu Yuan, Forrest Iandola, and Kurt Keutzer. 2016. How to scale distributed deep learning? arXiv preprint arXiv:1611.04581 (2016).
[20]
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016).
[21]
Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).
[22]
Quoc V Le. 2013. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 8595--8598.
[23]
Mu Li. 2017. Scaling Distributed Machine Learning with System and Algorithm Co-design. Ph.D. Dissertation. Carnegie Mellon University.
[24]
Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and Christopher Ré. 2016. Asynchrony begets momentum, with an application to deep learning. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 997--1004.
[25]
Rolf Rabenseifner. 2004. Optimization of collective reduction operations. In International Conference on Computational Science. Springer, 1--9.
[26]
Rolf Rabenseifner and Jesper Larsson Träff. 2004. More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems. In PVM/MPI. Springer, 36--46.
[27]
Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693--701.
[28]
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech. 1058--1062.
[29]
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. On parallelizability of stochastic gradient descent for speech DNNs. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 235--239.
[30]
Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19, 1 (2005), 49--66.
[31]
Robert A Vandegeijn. 1994. On global combine operations. J. Parallel and Distrib. Comput. 22, 2 (1994), 324--328.
[32]
Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling SGD Batch Size to 32K for ImageNet Training. (2017).
[33]
Yang You, Zhao Zhang, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2017. ImageNet Training in 24 Minutes. arXiv preprint arXiv:1709.05011 (2017).
[34]
Sixin Zhang, Anna E Choromanska, and Yann LeCun. 2015. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems. 685--693.


Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Author Tags

  1. Distributed Machine Learning
  2. Fast Deep Neural Networks Training

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate: 91 of 313 submissions, 29%
Overall Acceptance Rate: 91 of 313 submissions, 29%

Cited By

  • (2024) Deep Learning for Automatic Classification of Fruits and Vegetables: Evaluation from the Perspectives of Efficiency and Accuracy. Türkiye Teknoloji ve Uygulamalı Bilimler Dergisi 5:2, 151-171. DOI: 10.70562/tubid.1520357. Online publication date: 28-Oct-2024.
  • (2024) On Efficient Training of Large-Scale Deep Learning Models. ACM Computing Surveys 57:3, 1-36. DOI: 10.1145/3700439. Online publication date: 11-Nov-2024.
  • (2024) Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters. Proceedings of the 25th International Middleware Conference, 299-312. DOI: 10.1145/3652892.3700767. Online publication date: 2-Dec-2024.
  • (2024) Layer-Wise Adaptive Gradient Norm Penalizing Method for Efficient and Accurate Deep Learning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1518-1529. DOI: 10.1145/3637528.3671728. Online publication date: 25-Aug-2024.
  • (2024) MalleTrain: Deep Neural Networks Training on Unfillable Supercomputer Nodes. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, 190-200. DOI: 10.1145/3629526.3645035. Online publication date: 7-May-2024.
  • (2024) A Survey of Dataset Refinement for Problems in Computer Vision Datasets. ACM Computing Surveys 56:7, 1-34. DOI: 10.1145/3627157. Online publication date: 9-Apr-2024.
  • (2024) UniSched: A Unified Scheduler for Deep Learning Training Jobs With Different User Demands. IEEE Transactions on Computers 73:6, 1500-1515. DOI: 10.1109/TC.2024.3371794. Online publication date: Jun-2024.
  • (2024) ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments. IEEE Transactions on Computers 73:1, 30-43. DOI: 10.1109/TC.2023.3315847. Online publication date: Jan-2024.
  • (2024) Training and Serving System of Foundation Models: A Comprehensive Survey. IEEE Open Journal of the Computer Society 5, 107-119. DOI: 10.1109/OJCS.2024.3380828. Online publication date: 2024.
  • (2024) Industrial Internet of Things Intelligence Empowering Smart Manufacturing: A Literature Review. IEEE Internet of Things Journal 11:11, 19143-19167. DOI: 10.1109/JIOT.2024.3367692. Online publication date: 1-Jun-2024.
