
ADINE: an adaptive momentum method for stochastic gradient descent

Published: 11 January 2018

Abstract

Momentum-based methods are among the most successful learning algorithms in both convex and non-convex optimization. Two momentum-based techniques that have achieved tremendous success in gradient-based optimization are Polyak's heavy ball method and Nesterov's accelerated gradient. A crucial step in all momentum-based methods is the choice of the momentum parameter m, which is always set to less than 1. Although the choice of m < 1 is justified only under very strong theoretical assumptions, it works well in practice. In this paper we propose a new momentum-based method, ADINE, which relaxes the constraint m < 1 and allows the learning algorithm to use adaptive higher momentum. We motivate our relaxation on m by experimentally verifying that a higher momentum (≥ 1) can help escape saddle points much faster. ADINE uses this intuition to weigh the previous updates more heavily, inherently setting the momentum parameter of the optimization method to a larger value. To the best of our knowledge, the idea of increased momentum is novel and the first of its kind. We evaluate ADINE on deep neural networks and show that it helps the learning algorithm converge much faster without compromising the generalization error.
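For context, Polyak's heavy ball method maintains a velocity v_{t+1} = m·v_t - η·∇f(w_t) and then takes the step w_{t+1} = w_t + v_{t+1}, with the momentum parameter m set below 1. The minimal sketch below shows this classical update together with an optional hook that scales the momentum coefficient per step, purely to illustrate the idea of weighing previous updates more (momentum ≥ 1). The function names, signatures, and the toy objective are assumptions for illustration; this is not the ADINE update rule, which is not reproduced on this page.

```python
# Heavy-ball momentum SGD (Polyak): v_{t+1} = m*v_t - lr*grad(w_t); w_{t+1} = w_t + v_{t+1}.
# The `adaptive_scale` hook is a hypothetical illustration of momentum >= 1,
# NOT the ADINE schedule proposed in the paper.
import numpy as np

def heavy_ball_sgd(grad_fn, w0, lr=0.01, m=0.9, steps=100, adaptive_scale=None):
    """grad_fn(w) returns a (stochastic) gradient at w;
    adaptive_scale(t), if given, returns a multiplier applied to m (values >= 1 allowed)."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for t in range(steps):
        g = grad_fn(w)
        m_t = m * adaptive_scale(t) if adaptive_scale is not None else m
        v = m_t * v - lr * g   # accumulate (possibly amplified) velocity
        w = w + v              # heavy-ball step
    return w

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
if __name__ == "__main__":
    w_final = heavy_ball_sgd(lambda w: w, [5.0, -3.0], lr=0.1, m=0.9, steps=200)
    print(w_final)  # approaches the minimizer at the origin
```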

References

[1] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems. 2933--2941.
[2] John Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Technical Report UCB/EECS-2010-24. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-24.html
[3] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. 2015. Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition. In Proceedings of the 28th Conference on Learning Theory (Proceedings of Machine Learning Research), Vol. 40. PMLR, 797--842.
[4] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Vol. 9. PMLR, Chia Laguna Resort, Sardinia, Italy.
[5] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
[6] Yurii Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, Vol. 27. 372--376.
[7] Boris Teodorovich Polyak. 1964. Some methods of speeding up the convergence of iteration methods. U.S.S.R. Comput. Math. and Math. Phys. 4, 5 (1964), 1--17.
[8] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. 2016. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning. 314--323.
[9] M. Riedmiller and H. Braun. 1993. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, Vol. 1. 586--591.
[10] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems. 2663--2671.
[11] Shai Shalev-Shwartz and Tong Zhang. 2013. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research 14, Feb (2013), 567--599.
[12] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the Importance of Initialization and Momentum in Deep Learning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). JMLR.org, III-1139--III-1147. http://dl.acm.org/citation.cfm?id=3042817.3043064
[13] T. Tieleman and G. Hinton. 2012. Lecture 6.5---RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012).
[14] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide Residual Networks. In BMVC.
[15] Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).


Published In

CODS-COMAD '18: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data
January 2018
379 pages
ISBN:9781450363419
DOI:10.1145/3152494
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 January 2018

Author Tags

  1. deep learning
  2. momentum
  3. neural networks
  4. non-convex optimization

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Human Resource Development, Govt of India
  • Intel India
  • Microsoft Research India

Conference

CoDS-COMAD '18

Acceptance Rates

CoDS-COMAD '18 paper acceptance rate: 50 of 150 submissions (33%).
Overall acceptance rate: 197 of 680 submissions (29%).

Cited By

  • (2022) A Neural Network Algorithm of Learning Rate Adaptive Optimization and Its Application in Emitter Recognition. Simulation Tools and Techniques, 390-402. DOI: 10.1007/978-3-030-97124-3_29. Online publication date: 31-Mar-2022.
  • (2020) Research and analysis on defect detection of semi-conductive layer of high voltage cable. E3S Web of Conferences 185, 01057. DOI: 10.1051/e3sconf/202018501057. Online publication date: 1-Sep-2020.
  • (2020) Electrical performance analysis of 110 kV GIS terminal extension conducting rod. E3S Web of Conferences 185, 01020. DOI: 10.1051/e3sconf/202018501020. Online publication date: 1-Sep-2020.
  • (2019) A Group Recommendation Approach Based on Neural Network Collaborative Filtering. 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), 148-154. DOI: 10.1109/ICDEW.2019.00-18. Online publication date: Apr-2019.
