Research Article | Public Access

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

Published: 28 February 2022

Abstract

Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are workhorses of modern machine learning. Their constant stepsize variants are preferred in practice due to their fast convergence behavior. However, constant stepsize SA algorithms do not converge to the optimal solution; instead, they have a stationary distribution, which in general cannot be analytically characterized. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution, in the limit as the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) the SGD algorithm with a smooth and strongly convex objective, (2) the linear SA algorithm involving a Hurwitz matrix, and (3) the nonlinear SA algorithm involving a contractive operator. When the iterate is scaled by 1/√α, where α is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an implicit equation. Under a uniqueness assumption on this equation (which can be removed in certain settings), we further characterize the limiting distribution as a Gaussian distribution whose covariance matrix is the unique solution of a suitable Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that, unlike central limit theorem type results, (1) the scaling factor need not be 1/√α, and (2) the limiting distribution need not be Gaussian. Based on this numerical study, we propose a heuristic formula for determining the right scaling factor, and we draw an insightful connection to the Euler-Maruyama discretization scheme for approximating stochastic differential equations.





    Published In

    Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 6, Issue 1
    March 2022, 695 pages
    EISSN: 2476-1249
    DOI: 10.1145/3522731

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 28 February 2022
    Published in POMACS Volume 6, Issue 1


    Author Tags

    1. asymptotic analysis
    2. stationary distribution
    3. stochastic approximation
    4. stochastic gradient descent

    Qualifiers

    • Research-article

    Funding Sources

    • NSF
    • Raytheon Technologies
