
DOI: 10.1145/3225058.3225069
Research Article

ImageNet Training in Minutes

Published: 13 August 2018

Abstract

In this paper, we investigate the capability of large-scale computers to speed up deep neural network (DNN) training. Our approach is to use a large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, to make efficient use of massive computing resources. The approach is generic: we empirically evaluate its effectiveness on two neural networks, AlexNet and ResNet-50, trained on the ImageNet-1k dataset, while preserving state-of-the-art test accuracy. Compared to the baseline of a previous study by researchers at Facebook, our approach achieves higher test accuracy at batch sizes larger than 16K. Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. Using 2,048 Intel Xeon Phi 7250 processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe v1.0.7.
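
The key ingredient named in the abstract is the LARS rule, which gives each layer its own effective learning rate proportional to the ratio of the layer's weight norm to its gradient norm, so that a single large global batch size does not destabilize layers whose gradients are disproportionately large. The sketch below is a minimal NumPy illustration of that layer-wise update rule, not the authors' Intel Caffe implementation; the function name lars_step and the default hyperparameter values (trust coefficient, weight decay, momentum) are illustrative assumptions.

    import numpy as np

    def lars_step(weights, grads, velocities, global_lr,
                  trust_coef=0.001, weight_decay=0.0005, momentum=0.9):
        # One synchronous SGD step with layer-wise adaptive rate scaling (LARS).
        # 'weights', 'grads', and 'velocities' are parallel lists of per-layer arrays.
        # Hypothetical helper, sketched from the LARS rule, not the paper's code.
        for i, (w, g) in enumerate(zip(weights, grads)):
            w_norm = np.linalg.norm(w)
            g_norm = np.linalg.norm(g)
            # Layer-wise local learning rate: eta * ||w|| / (||grad|| + wd * ||w||).
            local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
            # Momentum update scaled by both the global and the local learning rate,
            # with weight decay folded into the gradient.
            velocities[i] = momentum * velocities[i] \
                + global_lr * local_lr * (g + weight_decay * w)
            weights[i] = w - velocities[i]
        return weights, velocities

In the large-batch setting the global learning rate is typically scaled up with the batch size and warmed up over the first few epochs; LARS then keeps each layer's update magnitude proportionate to its weights as that global rate grows.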

References

[1]
Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. 2017. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. arXiv preprint arXiv:1711.04325 (2017).
[2]
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. 2015. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595 (2015).
[3]
Carol Reiley. 2016. Deep Driving. (2016). https://www.technologyreview.com/s/602600/deep-driving/.
[4]
Bryan Catanzaro. 2013. Deep learning with COTS HPC systems. (2013).
[5]
Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016).
[6]
Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. 2016. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709 (2016).
[7]
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Advances in neural information processing systems. 1223--1231.
[8]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248--255.
[9]
Jack Dongarra, Martin Meuer, Horst Simon, and Erich Strohmaier. 2017. Top500 supercomputer ranking. (2017). https://www.top500.org/lists/2017/06/
[10]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017).
[11]
Hayit Greenspan, Bram van Ginneken, and Ronald M Summers. 2016. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging 35, 5 (2016), 1153--1159.
[12]
William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. 1996. A high-performance, portable implementation of the MPI message passing interface standard. Parallel computing 22, 6 (1996), 789--828.
[13]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
[14]
JB Heaton, NG Polson, and JH Witte. 2016. Deep learning in finance. arXiv preprint arXiv:1602.06561 (2016).
[15]
Júlio Hoffimann, Youli Mao, Avinash Wesley, and Aimee Taylor. 2017. Sequence Mining and Pattern Analysis in Drilling Reports with Deep Natural Language Processing. arXiv preprint arXiv:1712.01476 (2017).
[16]
Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. 2015. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. CoRR abs/1511.00175 (2015). http://arxiv.org/abs/1511.00175
[17]
Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. 2016. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2592--2600.
[18]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, 675--678.
[19]
Peter H Jin, Qiaochu Yuan, Forrest Iandola, and Kurt Keutzer. 2016. How to scale distributed deep learning? arXiv preprint arXiv:1611.04581 (2016).
[20]
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016).
[21]
Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).
[22]
Quoc V Le. 2013. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 8595--8598.
[23]
Mu Li. 2017. Scaling Distributed Machine Learning with System and Algorithm Co-design. Ph.D. Dissertation. Carnegie Mellon University.
[24]
Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and Christopher Ré. 2016. Asynchrony begets momentum, with an application to deep learning. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 997--1004.
[25]
Rolf Rabenseifner. 2004. Optimization of collective reduction operations. In International Conference on Computational Science. Springer, 1--9.
[26]
Rolf Rabenseifner and Jesper Larsson Träff. 2004. More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems. In PVM/MPI. Springer, 36--46.
[27]
Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693--701.
[28]
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech. 1058--1062.
[29]
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. On parallelizability of stochastic gradient descent for speech DNNs. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 235--239.
[30]
Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19, 1 (2005), 49--66.
[31]
Robert A Vandegeijn. 1994. On global combine operations. J. Parallel and Distrib. Comput. 22, 2 (1994), 324--328.
[32]
Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling SGD Batch Size to 32K for ImageNet Training. (2017).
[33]
Yang You, Zhao Zhang, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2017. ImageNet Training in 24 Minutes. arXiv preprint arXiv:1709.05011 (2017).
[34]
Sixin Zhang, Anna E Choromanska, and Yann LeCun. 2015. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems. 685--693.


Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Author Tags

  1. Distributed Machine Learning
  2. Fast Deep Neural Networks Training

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate: 91 of 313 submissions, 29%
Overall Acceptance Rate: 91 of 313 submissions, 29%

Cited By

  • (2024) Deep Learning for Automatic Classification of Fruits and Vegetables: Evaluation from the Perspectives of Efficiency and Accuracy. Türkiye Teknoloji ve Uygulamalı Bilimler Dergisi 5:2, 151-171. DOI: 10.70562/tubid.1520357. Online publication date: 28-Oct-2024.
  • (2024) On Efficient Training of Large-Scale Deep Learning Models. ACM Computing Surveys 57:3, 1-36. DOI: 10.1145/3700439. Online publication date: 11-Nov-2024.
  • (2024) Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters. Proceedings of the 25th International Middleware Conference, 299-312. DOI: 10.1145/3652892.3700767. Online publication date: 2-Dec-2024.
  • (2024) Layer-Wise Adaptive Gradient Norm Penalizing Method for Efficient and Accurate Deep Learning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1518-1529. DOI: 10.1145/3637528.3671728. Online publication date: 25-Aug-2024.
  • (2024) MalleTrain: Deep Neural Networks Training on Unfillable Supercomputer Nodes. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, 190-200. DOI: 10.1145/3629526.3645035. Online publication date: 7-May-2024.
  • (2024) A Survey of Dataset Refinement for Problems in Computer Vision Datasets. ACM Computing Surveys 56:7, 1-34. DOI: 10.1145/3627157. Online publication date: 9-Apr-2024.
  • (2024) UniSched: A Unified Scheduler for Deep Learning Training Jobs With Different User Demands. IEEE Transactions on Computers 73:6, 1500-1515. DOI: 10.1109/TC.2024.3371794. Online publication date: Jun-2024.
  • (2024) ASHL: An Adaptive Multi-Stage Distributed Deep Learning Training Scheme for Heterogeneous Environments. IEEE Transactions on Computers 73:1, 30-43. DOI: 10.1109/TC.2023.3315847. Online publication date: Jan-2024.
  • (2024) Training and Serving System of Foundation Models: A Comprehensive Survey. IEEE Open Journal of the Computer Society 5, 107-119. DOI: 10.1109/OJCS.2024.3380828. Online publication date: 2024.
  • (2024) Industrial Internet of Things Intelligence Empowering Smart Manufacturing: A Literature Review. IEEE Internet of Things Journal 11:11, 19143-19167. DOI: 10.1109/JIOT.2024.3367692. Online publication date: 1-Jun-2024.
