DOI: 10.5555/3540261.3542351

Towards understanding why lookahead generalizes better than SGD and beyond

Published: 10 June 2024

Abstract

To train networks, the lookahead algorithm [1] updates its fast weights k times via an inner-loop optimizer before updating its slow weights once using the latest fast weights. Any optimizer, e.g., SGD, can serve as the inner-loop optimizer, and the resulting lookahead variant generally enjoys a remarkable test-performance improvement over the vanilla optimizer. However, a theoretical understanding of this test-performance improvement is still lacking. To address this issue, we theoretically justify the advantages of lookahead in terms of the excess risk error, which measures test performance. Specifically, we prove that lookahead with SGD as its inner-loop optimizer can better balance the optimization error and the generalization error, and thus achieves a smaller excess risk error than vanilla SGD, on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed or proved to hold in neural networks. Moreover, we show that the stagewise optimization strategy [2], which decays the learning rate several times during training, also benefits lookahead by improving its optimization and generalization errors on strongly convex problems. Finally, we propose a stagewise locally-regularized lookahead (SLRLA) algorithm that, at each stage, minimizes the sum of the vanilla objective and a local regularizer, and provably improves optimization and generalization over conventional (stagewise) lookahead. Experimental results on CIFAR10/100 and ImageNet confirm its advantages.
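
To make the update rule concrete, the following is a minimal NumPy sketch of lookahead with SGD as the inner-loop optimizer, applied to a toy least-squares problem. The quadratic objective, the data sizes, and the hyperparameters eta (inner step size), k (number of fast-weight steps), and alpha (slow-weight interpolation) are illustrative assumptions chosen for exposition, not the settings analyzed in the paper.

    import numpy as np

    # Toy data for a noisy least-squares objective (illustrative assumption,
    # not the paper's experimental setup).
    rng = np.random.default_rng(0)
    n, d = 256, 10
    A = rng.normal(size=(n, d))
    x_star = rng.normal(size=d)
    b = A @ x_star + 0.1 * rng.normal(size=n)

    def stochastic_grad(x, batch=16):
        # Minibatch gradient of the least-squares loss, i.e. one SGD step direction.
        idx = rng.integers(0, n, size=batch)
        Ai, bi = A[idx], b[idx]
        return Ai.T @ (Ai @ x - bi) / batch

    def lookahead_sgd(T=200, k=5, eta=0.05, alpha=0.5):
        slow = np.zeros(d)                     # slow weights
        for _ in range(T):
            fast = slow.copy()                 # fast weights start from the slow weights
            for _ in range(k):                 # k inner-loop SGD updates of the fast weights
                fast -= eta * stochastic_grad(fast)
            slow += alpha * (fast - slow)      # one slow update toward the latest fast weights
        return slow

    x_hat = lookahead_sgd()
    print("distance to ground truth:", np.linalg.norm(x_hat - x_star))

Setting alpha = 1 collapses the scheme to plain SGD run for T*k steps; the intermediate interpolation 0 < alpha < 1 is what distinguishes lookahead from vanilla SGD.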

Supplementary Material

Additional material (3540261.3542351_supp.pdf)
Supplemental material.

References

[1]
M. Zhang, J. Lucas, G. Hinton, and J. Ba. Lookahead optimizer: k steps forward, 1 step back. In Proc. Conf. Neural Information Processing Systems, pages 9597–9608, 2019.
[2]
E. Barshan and P. Fieguth. Stage-wise training: An improved feature learning strategy for deep models. In Feature Extraction: Modern Questions and Challenges, pages 49–59. PMLR, 2015.
[3]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1–9, 2015.
[4]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 770–778, 2016.
[5]
P. Zhou, C. Xiong, R. Socher, and S. Hoi. Theory-inspired path-regularized differential network architecture search. In Proc. Conf. Neural Information Processing Systems, 2020.
[6]
J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. H. Hoi. Prototypical contrastive learning of unsupervised representations. In Int'l Conf. Learning Representations, 2021.
[7]
P. Zhou, C. Xiong, X. Yuan, and S. Hoi. A theory-driven self-labeling refinement method for contrastive representation learning. In Proc. Conf. Neural Information Processing Systems, 2021.
[8]
P. Zhou, Y. Zou, X. Yuan, J. Feng, C. Xiong, and S. Hoi. Task similarity aware meta learning: Theory-inspired improvement on MAML. In Conf. Uncertainty in Artificial Intelligence, 2021.
[9]
T. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran. Deep convolutional neural networks for LVCSR. In Int'l Conf. Acoustics, Speech and Signal Processing, pages 8614–8618. IEEE, 2013.
[10]
O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. IEEE Trans. on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.
[11]
G. Zheng, Y. Xiao, K. Gong, P. Zhou, X. Liang, and L. Lin. Wav-BERT: Cooperative acoustic and linguistic representation learning for low-resource speech recognition. In Conf. Empirical Methods in Natural Language Processing, 2021.
[12]
D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[13]
N. Brown and T. Sandholm. Safe and nested subgame solving for imperfect-information games. arXiv preprint arXiv:1705.02955, 2017.
[14]
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. of Machine Learning Research, 14(2), 2013.
[15]
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Neural Information Processing Systems, pages 315–323, 2013.
[16]
P. Zhou and X. Yuan. Hybrid stochastic-deterministic minibatch proximal gradient: Less-than-single-pass optimization with nearly optimal generalization. In Proc. Int'l Conf. Machine Learning, 2020.
[17]
P. Zhou, X. Yuan, and J. Feng. New insight into hybrid stochastic gradient descent: Beyond with-replacement sampling and convexity. In Proc. Conf. Neural Information Processing Systems, 2018.
[18]
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Int'l Conf. Learning Representations, 2014.
[19]
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. of Machine Learning Research, 12(7), 2011.
[20]
H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[21]
D. Saad. Online algorithms and stochastic approximations. Online Learning, 5:6–3, 1998.
[22]
P. Zhou, J. Feng, C. Ma, C. Xiong, S. Hoi, and W. E. Towards theoretically understanding why SGD generalizes better than Adam in deep learning. In Proc. Conf. Neural Information Processing Systems, 2020.
[23]
L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
[24]
L. Wright. Ranger - a synergistic optimizer. https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer, 2019.
[25]
Z. Tang, F. Jiang, J. Song, M. Gong, H. Li, F. Yu, Z. Wang, and M. Wang. AsymptoticNG: A regularized natural gradient optimization algorithm with look-ahead strategy. arXiv preprint arXiv:2012.13077, 2020.
[26]
T. Wei, D. Chen, W. Zhou, J. Liao, W. Zhang, L. Yuan, G. Hua, and N. Yu. A simple baseline for StyleGAN inversion. arXiv preprint arXiv:2104.07661, 2021.
[27]
G. Huang, S. Huang, L. Huangfu, and D. Yang. Weakly supervised patch label inference network with image pyramid for pavement diseases recognition in the wild. In Int'l Conf. Acoustics, Speech and Signal Processing, pages 7978–7982, 2021.
[28]
D. Samuel, A. Ganeshan, and J. Naradowsky. Meta-learning extractors for music source separation. In Int'l Conf. Acoustics, Speech and Signal Processing, pages 816–820, 2020.
[29]
T. Chavdarova, M. Pagliardini, S. Stich, F. Fleuret, and M. Jaggi. Taming GANs with Lookahead-Minmax. arXiv preprint arXiv:2006.14567, 2020.
[30]
M. Hardt and T. Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
[31]
B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. In Int'l Conf. Artificial Intelligence and Statistics, pages 1216–1224. PMLR, 2017.
[32]
Z. Li and Y. Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Proc. Conf. Neural Information Processing Systems, 2017.
[33]
Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In Proc. Int'l Conf. Machine Learning, pages 745–754. PMLR, 2018.
[34]
Y. Zhou and Y. Liang. Characterization of gradient dominance and regularity conditions for neural networks. In Int'l Conf. Learning Representations, 2018.
[35]
P. Zhou, X. Yuan, and J. Feng. Efficient stochastic gradient hard thresholding. In Proc. Conf. Neural Information Processing Systems, 2018.
[36]
P. Zhou, X. Yuan, S. Yan, and J. Feng. Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. 2019.
[37]
N. Keskar and R. Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
[38]
A. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Proc. Conf. Neural Information Processing Systems, pages 4148–4158, 2017.
[39]
S. Merity, N. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
[40]
H. He, G. Huang, and Y. Yuan. Asymmetric valleys: Beyond sharp and flat local minima. In Proc. Conf. Neural Information Processing Systems, 2019.
[41]
U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Proc. Int'l Conf. Machine Learning, 2019.
[42]
J. Wang, V. Tantia, N. Ballas, and M. Rabbat. Lookahead converges to stationary points of smooth non-convex functions. In Int'l Conf. Acoustics, Speech and Signal Processing, pages 8604–8608. IEEE, 2020.
[43]
O. Bousquet and A. Elisseeff. Stability and generalization. J. of Machine Learning Research, 2:499–526, 2002.
[44]
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proc. Int'l Conf. Machine Learning, pages 1225–1234. PMLR, 2016.
[45]
Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of SGD: Tightness analysis and improved bounds. arXiv preprint arXiv:2102.05274, 2021.
[46]
Z. Yuan, Y. Yan, R. Jin, and T. Yang. Stagewise training accelerates convergence of testing error over SGD. In Proc. Conf. Neural Information Processing Systems, 2018.
[47]
O. Shamir. Making gradient descent optimal for strongly convex stochastic optimization. CoRR abs/1109.5647, 2011.
[48]
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Neural Information Processing Systems, pages 315–323, 2013.
[49]
P. Zhou and J. Feng. Empirical risk landscape analysis for understanding deep neural networks. In Int'l Conf. Learning Representations, 2018.
[50]
P. Zhou and J. Feng. Understanding generalization and optimization performance of deep CNNs. In Proc. Int'l Conf. Machine Learning, 2018.
[51]
P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng. Efficient meta learning via minibatch proximal update. In Proc. Conf. Neural Information Processing Systems, 2019.
[52]
A. Zhu, Y. Meng, and C. Zhang. An improved Adam algorithm using look-ahead. In Int'l Conf. Deep Learning Technologies, pages 19–22, 2017.
[53]
A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proc. Int'l Conf. Machine Learning, pages 1571–1578, 2012.
[54]
L. Lei and M. Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.
[55]
S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proc. Int'l Conf. Machine Learning, pages 314–323. PMLR, 2016.
[56]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[57]
S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[58]
A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[59]
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 248–255, 2009.
[60]
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[61]
J. Chen, D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018.
[62]
J. Zhuang, T. Tang, Y. Ding, S. Tatikonda, N. Dvornek, X. Papademetris, and J. Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv preprint arXiv:2010.07468, 2020.
[63]
L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
[64]
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.


Published In

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems
December 2021
30517 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited
