Batch normalization: accelerating deep network training by reducing internal covariate shift

Published: 06 July 2015

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
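To make the transform the abstract describes concrete, here is a minimal NumPy sketch of training-time batch normalization over a fully-connected layer's inputs. The function name, the epsilon constant, and the toy data below are illustrative choices, not taken from the paper's text: each feature is normalized to zero mean and unit variance using mini-batch statistics, then scaled and shifted by learned parameters so the layer retains its representational power.

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (batch_size, num_features) mini-batch of layer inputs
        # gamma, beta: learned per-feature scale and shift, shape (num_features,)
        mu = x.mean(axis=0)                    # per-feature mini-batch mean
        var = x.var(axis=0)                    # per-feature mini-batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
        return gamma * x_hat + beta            # learned scale and shift

    # Toy usage: 4 examples, 3 features, deliberately off-center inputs
    x = np.random.randn(4, 3) * 5.0 + 2.0
    y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
    print(y.mean(axis=0))  # approximately 0 per feature
    print(y.std(axis=0))   # approximately 1 per feature

This sketch covers only the training-time transform; at inference the paper replaces the mini-batch statistics with population estimates accumulated during training, so the output depends deterministically on the input.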



Published In

ICML'15: Proceedings of the 32nd International Conference on Machine Learning - Volume 37
July 2015
2558 pages

Publisher

JMLR.org
