Abstract
Optimization in Deep Learning is mainly guided by vague intuitions and strong assumptions, with a limited understanding of how and why these work in practice. To shed more light on this, our work provides a deeper understanding of how SGD behaves by empirically analyzing the trajectory taken by SGD from a line search perspective. Specifically, we perform a costly quantitative analysis of the full-batch loss along SGD trajectories of commonly used models trained on a subset of CIFAR-10. Our core results include that the full-batch loss along lines in update step direction is highly parabolic. Furthermore, we show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss. Finally, we provide a different perspective on why increasing the batch size has almost the same effect as decreasing the learning rate by the same factor.
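To make the line search perspective concrete, the following is a minimal sketch (not the authors' released code) of how one might sample the full-batch loss along the direction of an SGD update step and fit a parabola to it. The names `model`, `loss_fn`, `full_loader`, `direction`, and `steps` are placeholders assumed for illustration.

```python
import copy

import numpy as np
import torch


def full_batch_loss_along_direction(model, loss_fn, full_loader, direction, steps, device="cpu"):
    """Evaluate the full-batch loss at parameters w + s * d for each step size s in `steps`."""
    base_state = copy.deepcopy(model.state_dict())
    losses = []
    model.eval()
    for s in steps:
        # Reset to the original parameters w, then move to w + s * d.
        model.load_state_dict(base_state)
        with torch.no_grad():
            for p, d in zip(model.parameters(), direction):
                p.add_(d, alpha=s)
        # Average the loss over the entire dataset (the "full batch").
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in full_loader:
                x, y = x.to(device), y.to(device)
                total += loss_fn(model(x), y).item() * x.size(0)
                n += x.size(0)
        losses.append(total / n)
    model.load_state_dict(base_state)  # restore the original parameters w
    return np.asarray(losses)


def parabola_fit(steps, losses):
    """Fit l(s) ~ a*s^2 + b*s + c; an exact line search on this parabola steps to
    s* = -b / (2a), valid when a > 0."""
    a, b, c = np.polyfit(steps, losses, deg=2)
    return a, b, c, -b / (2.0 * a)
```

In this sketch, `direction` would be the SGD update step (e.g., the negative mini-batch gradient scaled by the learning rate) and `steps` a small grid of scalars around 1. If the sampled curve is close to parabolic, as the paper reports for the full-batch loss, then `s*` indicates where an exact line search along that direction would step.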
Notes
1. Better performance does not imply that the assumptions used are correct.
2. Image classification on MNIST, SVHN, CIFAR-10, CIFAR-100, and ImageNet.
3. See the GitHub link in Sect. 7 for further analyses and code. We are aware that our analysis of a small set of problems provides only limited evidence; nevertheless, we consider it to be guiding. With the code published with this paper, it is simple to run our experiments on further problems.
4. Cropping, horizontal flipping, and normalization with mean and standard deviation.
5. Best performing \(\lambda\) chosen from a grid search over \(\{10^{-i} \mid i \in \{0, 1, 1.3, 2, 3, 4\}\}\).
6.