
Research article · Free access

Stochastic nested variance reduction for nonconvex optimization

Published: 01 January 2020

Abstract

We study nonconvex optimization problems, where the objective function is either an average of n nonconvex functions or the expectation of some stochastic function. We propose a new stochastic gradient descent algorithm based on nested variance reduction, namely, Stochastic Nested Variance-Reduced Gradient descent (SNVRG). Compared with the conventional stochastic variance reduced gradient (SVRG) algorithm, which uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses K + 1 nested reference points to build a semi-stochastic gradient whose variance is further reduced in each iteration. For smooth nonconvex functions, SNVRG converges to an ε-approximate first-order stationary point within Õ(n ∧ ε^{-2} + ε^{-3} ∧ n^{1/2}ε^{-2}) stochastic gradient evaluations, where Õ hides logarithmic factors and a ∧ b denotes min{a, b}. This improves the best known gradient complexity of SVRG, O(n + n^{2/3}ε^{-2}), and that of SCSG, O(n ∧ ε^{-2} + ε^{-10/3} ∧ n^{2/3}ε^{-2}). For gradient dominated functions, SNVRG also achieves better gradient complexity than the state-of-the-art algorithms.
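To make the first-order bounds above concrete, here is a rough back-of-the-envelope comparison that plugs an illustrative problem size n and accuracy ε into the three complexity expressions, ignoring constants and logarithmic factors; the values n = 10^6 and ε = 10^-3 are hypothetical choices, not taken from the paper.

# Rough comparison of the first-order gradient-complexity bounds quoted above,
# ignoring constants and logarithmic factors. n and eps are illustrative only.
n, eps = 1e6, 1e-3
svrg  = n + n ** (2 / 3) * eps ** -2                                    # O(n + n^{2/3} eps^{-2})
scsg  = min(n, eps ** -2) + min(eps ** (-10 / 3), n ** (2 / 3) * eps ** -2)
snvrg = min(n, eps ** -2) + min(eps ** -3, n ** (1 / 2) * eps ** -2)
print(f"SVRG  ~ {svrg:.1e}")   # ~1.0e+10
print(f"SCSG  ~ {scsg:.1e}")   # ~1.0e+10
print(f"SNVRG ~ {snvrg:.1e}")  # ~1.0e+09, about an order of magnitude fewer evaluations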
Based on SNVRG, we further propose two algorithms that can find local minima faster than state-of-the-art algorithms in both finite-sum and general stochastic (online) nonconvex optimization. In particular, for finite-sum optimization problems, the proposed SNVRG + Neon2^finite algorithm achieves Õ(n^{1/2}ε^{-2} + nε_H^{-3} + n^{3/4}ε_H^{-7/2}) gradient complexity to converge to an (ε, ε_H)-second-order stationary point, which outperforms SVRG + Neon2^finite (Allen-Zhu and Li, 2018), the best existing algorithm, in a wide regime. For general stochastic optimization problems, the proposed SNVRG + Neon2^online achieves Õ(ε^{-3} + ε_H^{-5} + ε^{-2}ε_H^{-3}) gradient complexity, which is better than both SVRG + Neon2^online (Allen-Zhu and Li, 2018) and Natasha2 (Allen-Zhu, 2018a) in certain regimes. Thorough experimental results on different nonconvex optimization problems back up our theory.
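The following minimal Python sketch illustrates the nested variance-reduction construction described above. It is an illustration only, not the paper's algorithm: the toy least-squares objective, the geometric refresh schedule, the batch sizes, and the step size are assumptions of this sketch.

# Minimal sketch of nested variance reduction in the spirit of SNVRG.
# NOT the paper's algorithm or parameters: the least-squares objective,
# refresh schedule, batch sizes, and step size are assumptions of this sketch.
import numpy as np

def stoch_grad(w, A, b, idx):
    # Mini-batch gradient of f(w) = (1/2m) * ||A[idx] w - b[idx]||^2.
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ w - bi) / len(idx)

def snvrg_sketch(A, b, K=2, interval=5, T=50, eta=0.1, seed=0):
    n, d = A.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    x = [w.copy() for _ in range(K + 1)]      # K + 1 nested reference points
    g = [np.zeros(d) for _ in range(K + 1)]   # per-level gradient corrections
    # Level l is refreshed every interval**(K - l) iterations: the outermost
    # level changes rarely (large batch), the innermost level every step.
    period = [interval ** (K - l) for l in range(K + 1)]
    batch = [max(8, (n // 4) // interval ** l) for l in range(K + 1)]
    for t in range(T):
        for l in range(K + 1):
            if t % period[l] == 0:
                x[l] = w.copy()
                idx = rng.choice(n, size=min(batch[l], n), replace=False)
                if l == 0:
                    g[0] = stoch_grad(x[0], A, b, idx)  # base gradient estimate
                else:
                    # Correction term: estimates grad f(x[l]) - grad f(x[l-1]).
                    g[l] = stoch_grad(x[l], A, b, idx) - stoch_grad(x[l - 1], A, b, idx)
        v = sum(g)          # semi-stochastic gradient built from the nested levels
        w = w - eta * v
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2048, 20))
    w_true = rng.standard_normal(20)
    b = A @ w_true + 0.01 * rng.standard_normal(2048)
    w_hat = snvrg_sketch(A, b)
    print("relative error:", np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true))

In the paper, the loop lengths and batch sizes of the K + 1 levels are chosen jointly so that the variance of the semi-stochastic gradient v keeps shrinking; the geometric schedule above is only a stand-in for that coupling.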

References

[1]
Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195-1199. ACM, 2017.
[2]
Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200-1205. ACM, 2017.
[3]
Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2676-2687, 2018a.
[4]
Zeyuan Allen-Zhu. Katyusha x: Simple momentum method for stochastic sum-of-nonconvex optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 179-185. PMLR, 2018b.
[5]
Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699-707, 2016.
[6]
Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3720-3730, 2018.
[7]
Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Conference on Learning Theory, pages 81-102, 2016.
[8]
Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873-3881, 2016.
[9]
Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. In International Conference on Machine Learning, pages 654-663, 2017.
[10]
Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751-1772, 2018.
[11]
Xi Chen, Simon S Du, and Xin T Tong. On stationary-point hitting time and ergodicity of stochastic gradient langevin dynamics. Journal of Machine Learning Research, 21(68): 1-41, 2020.
[12]
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192-204, 2015.
[13]
Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. Mathematical Programming, 162(1-2):1-32, 2017.
[14]
Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1155-1164. PMLR, 2018.
[15]
Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, 2014.
[16]
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646-1654, 2014a.
[17]
Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pages 1125-1133, 2014b.
[18]
Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal nonconvex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 686-696, 2018.
[19]
Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex sgd escaping from saddle points. In Conference on Learning Theory, pages 1192-1234, 2019.
[20]
Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.
[21]
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797-842, 2015.
[22]
Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973-2981, 2016.
[23]
Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[24]
Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59-99, 2016.
[25]
Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.
[26]
Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251-2259, 2015.
[27]
Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.
[28]
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724-1732, 2017.
[29]
Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 1042-1085. PMLR, 2018.
[30]
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.
[31]
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811. Springer, 2016.
[32]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[33]
Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for nonconvex optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1895-1904. PMLR, 2017.
[34]
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[35]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[36]
Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pages 1246-1257, 2016.
[37]
Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical programming, 176(1-2):311-337, 2019.
[38]
Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2348-2358, 2017.
[39]
Kfir Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.
[40]
Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384-3392, 2015.
[41]
Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[42]
Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
[43]
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011.
[44]
Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2613-2621. JMLR.org, 2017a.
[45]
Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017b.
[46]
Lam M Nguyen, Marten van Dijk, Dzung T Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R Kalagnanam. Finite-sum smooth optimization with SARAH. arXiv preprint arXiv:1901.07648, 2019.
[47]
Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267-273, 1982.
[48]
Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.
[49]
Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145-151, 1999.
[50]
Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674-1703, 2017.
[51]
Sashank Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alex Smola. A generic approach for escaping saddle points. In International Conference on Artificial Intelligence and Statistics, pages 1233-1242, 2018.
[52]
Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016a.
[53]
Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 1971-1977. IEEE, 2016b.
[54]
Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400-407, 1951.
[55]
Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663-2671, 2012.
[56]
Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
[57]
Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747-754, 2016.
[58]
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.
[59]
Quoc Tran-Dinh, Nhan H Pham, Dzung T Phan, and Lam M Nguyen. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, 2019.
[60]
Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[61]
Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. In Advances in Neural Information Processing Systems, pages 2403-2413, 2019.
[62]
Martin Weiser, Peter Deuflhard, and Bodo Erdmann. Affine conjugate adaptive Newton methods for nonlinear elastomechanics. Optimization Methods and Software, 22(3):413-431, 2007.
[63]
Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.
[64]
Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3122-3133, 2018a.
[65]
Peng Xu, Fred Roosta, and Michael W Mahoney. Newton-type methods for nonconvex optimization under inexact hessian information. Mathematical Programming, pages 1-36, 2019.
[66]
Yi Xu, Jing Rong, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5531-5541, 2018b.
[67]
Yaodong Yu, Difan Zou, and Quanquan Gu. Saving gradient and negative curvature computations: Finding local minima more efficiently. arXiv preprint arXiv:1712.03950, 2017.
[68]
Yaodong Yu, Pan Xu, and Quanquan Gu. Third-order smoothness helps: Faster stochastic optimization algorithms for finding local minima. In Advances in Neural Information Processing Systems, pages 4526-4536, 2018.
[69]
Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu. A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery. In International conference on machine learning, 2018.
[70]
Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory, pages 1980-2022, 2017.
[71]
Dongruo Zhou and Quanquan Gu. Lower bounds for smooth nonconvex finite-sum optimization. In International Conference on Machine Learning, pages 7574-7583, 2019.
[72]
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized Newton methods. In Proceedings of the 35th International Conference on Machine Learning, pages 5990-5999. PMLR, 2018a.
[73]
Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3922-3933, 2018b.




      Published In

The Journal of Machine Learning Research, Volume 21, Issue 1
January 2020
10260 pages
ISSN: 1532-4435
EISSN: 1533-7928
License: CC BY 4.0

      Publisher

      JMLR.org

      Publication History

Received: 01 July 2018
Revised: 01 April 2020
Accepted: 01 May 2020
Published: 01 January 2020
      Published in JMLR Volume 21, Issue 1

      Author Tags

      1. nonconvex optimization
      2. finding local minima
      3. variance reduction

      Qualifiers

      • Research-article
