
A Stochastic Quasi-Newton Method for Large-Scale Optimization

Published: 01 January 2016

Abstract

The question of how to incorporate curvature information into stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust, and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at spaced intervals. One way to do this is through (subsampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients at every iteration, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.
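The abstract describes the mechanism only at a high level. Below is a minimal sketch in Python/NumPy of the kind of iteration it outlines: SGD-style steps scaled by a limited-memory BFGS matrix whose curvature pairs are refreshed only at spaced intervals from a (subsampled) Hessian-vector product, rather than from gradient differences at every step. This is not the authors' reference implementation; the callables `stoch_grad` and `subsampled_hess_vec`, the step-size schedule, and the constants are illustrative placeholders.

```python
# Sketch (not the authors' reference code) of a stochastic limited-memory BFGS
# iteration in the spirit of the abstract: curvature pairs are collected
# pointwise, every L iterations, via a subsampled Hessian-vector product.
import numpy as np


def lbfgs_two_loop(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion: returns an approximation of H * grad."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y
    # Scale by gamma = s'y / y'y from the most recent pair (identity if no pairs yet).
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1]) if y_list else 1.0
    r = gamma * q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return r


def stochastic_qn(stoch_grad, subsampled_hess_vec, x0,
                  n_iters=1000, L=10, memory=10, alpha0=0.1):
    """stoch_grad(x): mini-batch gradient at x.
    subsampled_hess_vec(x, v): product of a subsampled Hessian at x with v."""
    x = np.asarray(x0, dtype=float).copy()
    s_list, y_list = [], []
    x_avg = np.zeros_like(x)
    x_avg_prev = None
    for t in range(1, n_iters + 1):
        g = stoch_grad(x)
        d = lbfgs_two_loop(g, s_list, y_list) if s_list else g
        x -= (alpha0 / t) * d          # diminishing step size (placeholder schedule)
        x_avg += x / L                 # running average of the last L iterates
        if t % L == 0:                 # collect curvature pointwise, at spaced intervals
            if x_avg_prev is not None:
                s = x_avg - x_avg_prev
                y = subsampled_hess_vec(x_avg, s)   # (subsampled) Hessian-vector product
                if np.dot(s, y) > 1e-10:            # keep the pair only if curvature is positive
                    s_list.append(s)
                    y_list.append(y)
                    if len(s_list) > memory:        # limited-memory window
                        s_list.pop(0)
                        y_list.pop(0)
            x_avg_prev, x_avg = x_avg, np.zeros_like(x)
    return x
```

Because the curvature pairs are refreshed only every L iterations, the cost of the (subsampled) Hessian-vector product is amortized over many cheap stochastic gradient steps, which is what the abstract means by collecting curvature information pointwise and at spaced intervals.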



Published In

SIAM Journal on Optimization, Volume 26, Issue 2
2016
519 pages
ISSN: 1052-6234
DOI: 10.1137/sjope8.26.2

Publisher

Society for Industrial and Applied Mathematics

United States

Author Tags

  1. stochastic optimization
  2. quasi-Newton
  3. subsampling
  4. large-scale optimization

MSC Codes

  1. 65K05
  2. 90C06
  3. 90C30
  4. 90C55

Qualifiers

  • Research-article


Cited By

  • (2024) PETScML: Second-Order Solvers for Training Regression Problems in Scientific Machine Learning, Proceedings of the Platform for Advanced Scientific Computing Conference, pp. 1-12. DOI: 10.1145/3659914.3659931. Online publication date: 3-Jun-2024.
  • (2024) Convergence Analysis of an Adaptively Regularized Natural Gradient Method, IEEE Transactions on Signal Processing, 72, pp. 2527-2542. DOI: 10.1109/TSP.2024.3398496. Online publication date: 9-May-2024.
  • (2024) Stochastic Steffensen method, Computational Optimization and Applications, 89:1, pp. 1-32. DOI: 10.1007/s10589-024-00583-7. Online publication date: 7-Jun-2024.
  • (2024) The continuous stochastic gradient method: part I–convergence theory, Computational Optimization and Applications, 87:3, pp. 935-976. DOI: 10.1007/s10589-023-00542-8. Online publication date: 1-Apr-2024.
  • (2023) Searching for optimal per-coordinate step-sizes with multidimensional backtracking, Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 2725-2767. DOI: 10.5555/3666122.3666244. Online publication date: 10-Dec-2023.
  • (2023) Variance-Reduced Stochastic Quasi-Newton Methods for Decentralized Learning, IEEE Transactions on Signal Processing, 71, pp. 311-326. DOI: 10.1109/TSP.2023.3240652. Online publication date: 1-Jan-2023.
  • (2023) An Efficient Fisher Matrix Approximation Method for Large-Scale Neural Network Optimization, IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:5, pp. 5391-5403. DOI: 10.1109/TPAMI.2022.3213654. Online publication date: 1-May-2023.
  • (2023) Adaptive step size rules for stochastic optimization in large-scale learning, Statistics and Computing, 33:2. DOI: 10.1007/s11222-023-10218-2. Online publication date: 20-Feb-2023.
  • (2023) Secant penalized BFGS: a noise robust quasi-Newton method via penalizing the secant condition, Computational Optimization and Applications, 84:3, pp. 651-702. DOI: 10.1007/s10589-022-00448-x. Online publication date: 9-Jan-2023.
  • (2023) Resource-Adaptive Newton's Method for Distributed Learning, Computing and Combinatorics, pp. 335-346. DOI: 10.1007/978-3-031-49190-0_24. Online publication date: 15-Dec-2023.
