On optimal universal first-order methods for minimizing heterogeneous sums

  • Original Paper
  • Optimization Letters

Abstract

This work considers minimizing a sum of convex functions, each with potentially different structure, ranging from nonsmooth to smooth and from Lipschitz to non-Lipschitz. Nesterov’s universal fast gradient method (Nesterov, Math Program 152:381–404, 2015) provides an optimal black-box first-order method for minimizing a single function that takes advantage of any continuity structure present without requiring prior knowledge. In this paper, we show that this landmark method (without modification) further adapts to heterogeneous sums. For example, it minimizes the sum of a nonsmooth M-Lipschitz function and an L-smooth function at a rate of \( O(M^2/\epsilon^2 + \sqrt{L/\epsilon}) \) without knowledge of M, L, or even that the objective is a sum of two terms. This rate is precisely the sum of the optimal convergence rates for each term’s individual complexity class. More generally, we show that sums of varied Hölder smooth functions introduce no new complexities and require at most as many iterations as are needed to minimize each summand separately. Extensions to strongly convex settings and to growth/error bound conditions are also provided.
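
As a concrete illustration of the black-box setting described above, the following is a minimal sketch of Nesterov's universal fast gradient method [27], specialized to unconstrained problems with the Euclidean prox term \(d(x)=\tfrac12\Vert x-x_0\Vert^2\). The function names, the doubling/halving backtracking on the smoothness estimate, and the fixed iteration budget are illustrative choices under those assumptions, not code taken from the paper.

```python
import numpy as np

def universal_fast_gradient(f, grad, x0, eps, L0=1.0, max_iters=1000):
    """Sketch of Nesterov's universal fast gradient method (unconstrained,
    Euclidean prox d(x) = 0.5*||x - x0||^2). `f` and `grad` are value and
    (sub)gradient oracles for the whole objective F; `eps` is the target
    accuracy entering the relaxed line-search test."""
    x0 = np.asarray(x0, dtype=float)
    y = x0.copy()          # y_k: point carrying the objective guarantee
    s = np.zeros_like(x0)  # running sum of a_i * grad(x_i); argmin of psi_k is x0 - s
    A = 0.0                # running sum of step weights a_i
    L = L0                 # current estimate of the smoothness constant
    for _ in range(max_iters):
        v = x0 - s                      # v_k = argmin psi_k
        while True:                     # backtracking on L
            # a solves L*a^2 = A + a
            a = (1.0 + np.sqrt(1.0 + 4.0 * L * A)) / (2.0 * L)
            A_next = A + a
            tau = a / A_next
            x = tau * v + (1.0 - tau) * y
            fx, gx = f(x), grad(x)
            x_hat = x0 - (s + a * gx)   # argmin of psi_k plus the new linearization
            y_next = tau * x_hat + (1.0 - tau) * y
            d = y_next - x
            # inexact descent condition, relaxed by eps*tau/2
            if f(y_next) <= fx + gx.dot(d) + 0.5 * L * d.dot(d) + 0.5 * eps * tau:
                break
            L *= 2.0
        s += a * gx
        y, A = y_next, A_next
        L /= 2.0                        # optimistically relax the estimate
    return y
```

Only value and (sub)gradient oracles for the total objective are queried; no constants M, L, or Hölder exponents enter the method, which is exactly the universality the paper exploits.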


Notes

  1. In particular, the considered methods require an oracle that, once per iteration, returns the function value and one (sub)gradient of F at the current iterate (an illustrative such oracle for a heterogeneous sum is sketched after these notes).

  2. Namely, supposing all \((M_j,v_j) = (M/|\mathcal {J}|, v)\), the defining equation for K simplifies to

    $$\begin{aligned} |\mathcal {J}|\left( \frac{|\mathcal {J}|^{\frac{1-v}{1+v}}(M/|\mathcal {J}|)^{\frac{2}{1+v}}\xi (x_0,x^*)}{\epsilon ^{\frac{2}{1+v}}}K^{\frac{-1-3v}{1+v}}\right) = \frac{M^{\frac{2}{1+v}}\xi (x_0,x^*)}{\epsilon ^{\frac{2}{1+v}}}K^{\frac{-1-3v}{1+v}} = 1, \end{aligned}$$

    solved by \(K = \varTheta \left( \left( \frac{M}{\epsilon }\right) ^{\frac{2}{1+3v}}\xi (x_0,x^*)^{\frac{1+v}{1+3v}}\right) \), matching the optimal rate for (Mv)-Hölder smooth minimization.
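
    Spelled out, this final solve gives (an algebra step added here for readability)

    $$\begin{aligned} K^{\frac{1+3v}{1+v}} = \frac{M^{\frac{2}{1+v}}\xi(x_0,x^*)}{\epsilon^{\frac{2}{1+v}}} \quad \Longrightarrow \quad K = \left(\frac{M}{\epsilon}\right)^{\frac{2}{1+3v}}\xi(x_0,x^*)^{\frac{1+v}{1+3v}}, \end{aligned}$$

    which is the \(\varTheta\) expression stated above.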

  3. Note this comparison is somewhat superficial as we assume a more expensive oracle giving (sub)gradients of the whole objective. Given our oracle, considering all the terms as one \(L=\sum L_i\)-smooth term gives a faster \(O\left( \frac{\sqrt{\sum _i L_i}}{\sqrt{\sum _i\mu _i}} \log (1/\epsilon )\right) \) rate.

  4. Such spectrahedra occur as the dual feasible region of MAX-CUT SDP relaxations [14].

  5. The original source for such a lower bound is difficult to identify in the literature, leaving this optimality somewhat as folklore. The theory of Grimmer [17] allows these bounds to be concluded from the classic lower bounds without growth.
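
As referenced in note 1, a heterogeneous-sum oracle of this kind (one call returning F(x) and a single subgradient of the whole sum) might look as follows; the synthetic data A, b, C, d and the particular objective, a nonsmooth least-absolute-deviations term plus a smooth least-squares term, are illustrative choices and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)   # nonsmooth, Lipschitz part
C, d = rng.standard_normal((30, 20)), rng.standard_normal(30)   # smooth part

def F(x):
    # F(x) = ||Ax - b||_1 + 0.5*||Cx - d||^2
    return np.abs(A @ x - b).sum() + 0.5 * np.sum((C @ x - d) ** 2)

def subgrad_F(x):
    # one valid subgradient of the sum: sign(.) is a subgradient selection for |.|
    return A.T @ np.sign(A @ x - b) + C.T @ (C @ x - d)

# Used with the universal fast gradient sketch above, no constants are supplied:
# x_sol = universal_fast_gradient(F, subgrad_F, np.zeros(20), eps=1e-3)
```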

References

  1. Antonakopoulos, K., Pethick, T., Kavis, A., Mertikopoulos, P., Cevher, V.: Sifting through the noise: universal first-order methods for stochastic variational inequalities. In: Advances in Neural Information Processing Systems, vol. 34, pp. 13099–13111. Curran Associates, Inc. (2021)

  2. Arjevani, Y., Daniely, A., Jegelka, S., Lin, H.: On the complexity of minimizing convex finite sums without using the indices of the individual functions (2020). arXiv:2002.03273

  3. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007). https://doi.org/10.1137/050644641

  4. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017). https://doi.org/10.1007/s10107-016-1091-6

  5. Chang, C.C., Lin, C.J.: Libsvm data: classification (binary class). https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Accessed 06 Jan 2023

  6. Chen, X., Lin, Q., Pena, J.: Optimal regularized dual averaging methods for stochastic optimization. In: Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc. (2012)

  7. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. (2011). https://doi.org/10.1145/2049662.2049663

  8. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc.: Ser. B (Methodol.) 39(1), 1–38 (1977)

  9. Díaz, M., Grimmer, B.: Optimal convergence rates for the proximal bundle method. SIAM J. Optim. 33(2), 424–454 (2023). https://doi.org/10.1137/21M1428601

  10. Ding, L., Yurtsever, A., Cevher, V., Tropp, J.A., Udell, M.: An optimal-storage approach to semidefinite programming using approximate complementarity. SIAM J. Optim. 31(4), 2695–2725 (2021). https://doi.org/10.1137/19m1244603

  11. Doikov, N., Mishchenko, K., Nesterov, Y.: Super-universal regularized Newton method (2022). arXiv:2208.05888

  12. Efron, B.: Double exponential families and their use in generalized linear regression. J. Am. Stat. Assoc. 81(395), 709–721 (1986). https://doi.org/10.1080/01621459.1986.10478327

  13. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012). https://doi.org/10.1137/110848864

  14. Goemans, M.X., Williamson, D.P.: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42(6), 1115–1145 (1995). https://doi.org/10.1145/227683.227684

  15. Grimmer, B., Li, D.: Some primal-dual theory for subgradient methods for strongly convex optimization (2023)

  16. Grimmer, B.: Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM J. Optim. 29(2), 1350–1365 (2019). https://doi.org/10.1137/18M117306X

  17. Grimmer, B.: General Hölder smooth convergence rates follow from specialized rates assuming growth bounds. J. Optim. Theory Appl. 197, 51–70 (2021)

  18. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48(3), 769–783 (1998)

  19. Lacoste-Julien, S., Schmidt, M., Bach, F.R.: A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method (2012). arXiv:1212.2002

  20. Lan, G.: Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Math. Program. 149(1–2), 1–45 (2015). https://doi.org/10.1007/s10107-013-0737-x

  21. Llorente, A., Suárez, A.: Critical sample size for the LP-norm estimator in linear regression models. In: 2013 Winter Simulations Conference (WSC), pp. 1047–1056 (2013)

  22. Łojasiewicz, S.: Sur la géométrie semi- et sous-analytique. In: Annales de l’institut Fourier, vol. 43, pp. 1575–1595 (1993)

  23. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les Éq. aux Dérivées Part. 117, 87–89 (1963)

  24. Meng, X.L., van Dyk, D.: The EM algorithm-an old folk-song sung to a fast new tune. J. R. Stat. Soc.: Ser. B (Methodol.) 59(3), 511–567 (1997)

  25. Money, A., Affleck-Graves, J.F., Hart, M.L., Barr, G.D.I.: The linear regression model: Lp norm estimation and the choice of p. Commun. Stat.—Simul. Comput. 11, 89–109 (1982)

  26. Nemirovskii, A.S., Nesterov, Y.E.: Optimal methods of smooth convex minimization. USSR Comput. Math. Math. Phys. 25(3–4), 21–30 (1986)

  27. Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2015)

  28. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)

  29. Nyquist, H.: Recent studies on LP-norm estimation. Ph.D. Thesis, Umeå Universitet (1980)

  30. Renegar, J., Grimmer, B.: A simple nearly optimal restart scheme for speeding up first-order methods. Found. Comput. Math. 22(1), 211–256 (2022). https://doi.org/10.1007/s10208-021-09502-2

  31. Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. SIAM J. Optim. 30(1), 262–289 (2020). https://doi.org/10.1137/18M1224568

  32. Wang, N., Zhang, S.: A gradient complexity analysis for minimizing the sum of strongly convex functions with varying condition numbers (2022). arXiv:2208.06524

  33. Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, vol. 22. Curran Associates, Inc. (2009)

  34. Yang, T., Lin, Q.: RSG: beating subgradient method without smoothness and strong convexity. J. Mach. Learn. Res. 19(6), 1–33 (2018)

Author information

Correspondence to Benjamin Grimmer.


Cite this article

Grimmer, B. On optimal universal first-order methods for minimizing heterogeneous sums. Optim Lett 18, 427–445 (2024). https://doi.org/10.1007/s11590-023-02060-2
