Research Article | Public Access

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

Published: 28 February 2022

Abstract

Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are workhorses of modern machine learning. Their constant stepsize variants are preferred in practice due to their fast convergence behavior. However, constant stepsize SA algorithms do not converge to the optimal solution; instead, they have a stationary distribution, which in general cannot be analytically characterized. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution, in the limit as the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) the SGD algorithm with a smooth and strongly convex objective, (2) the linear SA algorithm involving a Hurwitz matrix, and (3) the nonlinear SA algorithm involving a contractive operator. When the iterate is scaled by 1/√α, where α is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an implicit equation. Under a uniqueness assumption on this equation (which can be removed in certain settings), we further characterize the limiting distribution as a Gaussian distribution whose covariance matrix is the unique solution of a suitable Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that, unlike central limit theorem type results, (1) the scaling factor need not be 1/√α, and (2) the limiting distribution need not be Gaussian. Based on this numerical study, we propose a heuristic formula for determining the right scaling factor, and we draw an insightful connection to the Euler-Maruyama discretization scheme for approximating stochastic differential equations.





    Published In

    Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 6, Issue 1
    March 2022, 695 pages
    EISSN: 2476-1249
    DOI: 10.1145/3522731

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 28 February 2022
    Published in POMACS Volume 6, Issue 1


    Author Tags

    1. asymptotic analysis
    2. stationary distribution
    3. stochastic approximation
    4. stochastic gradient descent

    Qualifiers

    • Research-article

    Funding Sources

    • NSF
    • Raytheon Technologies
