Abstract
This work considers minimizing a sum of convex functions, each with potentially different structure, ranging from nonsmooth to smooth and from Lipschitz to non-Lipschitz. Nesterov’s universal fast gradient method (Nesterov, Math Program 152:381–404, 2015) is an optimal black-box first-order method for minimizing a single function that takes advantage of any continuity structure present without requiring prior knowledge of it. In this paper, we show that this landmark method (without modification) further adapts to heterogeneous sums. For example, it minimizes the sum of a nonsmooth M-Lipschitz function and an L-smooth function at a rate of \( O(M^2/\epsilon ^2 + \sqrt{L/\epsilon }) \) without knowledge of M, L, or even that the objective is a sum of two terms. This rate is precisely the sum of the optimal convergence rates for each term’s individual complexity class. More generally, we show that sums of varied Hölder smooth functions introduce no new complexities, requiring at most as many iterations as are needed to minimize each summand separately. Extensions to strongly convex settings and to growth/error bound conditions are also provided.
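To make the black-box claim concrete, the following is a minimal sketch (not taken from the paper) of the universal fast gradient method in its Euclidean, unconstrained form, run on a small heterogeneous sum. The step-size equation, the inexact acceptance test with slack \(\epsilon \tau _k/2\), and all names (`universal_fgm`, `oracle`, `L0`) are paraphrased from the standard description of Nesterov’s method [27] and are illustrative rather than a transcription of the paper’s algorithm statement.

```python
import numpy as np

def universal_fgm(oracle, x0, eps, L0=1.0, iters=500):
    """Sketch of a universal fast gradient method (Euclidean, unconstrained).

    `oracle(x)` returns (F(x), one (sub)gradient of F at x); no smoothness or
    Lipschitz constants are supplied, only a target accuracy eps."""
    x0 = np.asarray(x0, dtype=float)
    y, A, L = x0.copy(), 0.0, L0
    g_acc = np.zeros_like(x0)            # running sum of a_{k+1} * gradient terms
    for _ in range(iters):
        v = x0 - g_acc                   # v_k = argmin of the model phi_k(z) = 0.5||z-x0||^2 + <g_acc, z>
        i = 0
        while True:                      # backtracking on the current smoothness guess
            M = (2.0 ** i) * L
            a = (1.0 + np.sqrt(1.0 + 4.0 * M * A)) / (2.0 * M)   # positive root of M a^2 = A + a
            tau = a / (A + a)
            x = tau * v + (1.0 - tau) * y
            fx, gx = oracle(x)
            xhat = x0 - (g_acc + a * gx) # argmin of phi_k(z) + a <gx, z>
            y_trial = tau * xhat + (1.0 - tau) * y
            d = y_trial - x
            f_trial, _ = oracle(y_trial)
            # accept once the quadratic upper model holds up to slack eps*tau/2
            if f_trial <= fx + gx @ d + 0.5 * M * (d @ d) + 0.5 * eps * tau:
                break
            i += 1
        y, A, L = y_trial, A + a, (2.0 ** (i - 1)) * L
        g_acc = g_acc + a * gx
    return y

# Toy heterogeneous sum: F(x) = ||x - c||_1 + 0.5 ||x||^2 (Lipschitz + smooth),
# exposed only through an aggregate value/(sub)gradient oracle.
c = np.array([2.0, -3.0, 1.0])
oracle = lambda x: (np.abs(x - c).sum() + 0.5 * x @ x, np.sign(x - c) + x)
print(universal_fgm(oracle, np.zeros(3), eps=1e-3))
```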
Notes
In particular, the considered methods require an oracle, called once per iteration, that produces the function value and one (sub)gradient of F at the current iterate.
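As an illustration (not from the paper), a heterogeneous sum such as \(F(x) = \Vert Ax-b\Vert _1 + \frac{\lambda }{2}\Vert x\Vert ^2\) is easily packaged behind exactly this kind of oracle; the method only ever sees the aggregate value and one aggregate (sub)gradient, never the individual terms or their constants. A minimal Python sketch, with all names (`make_oracle`, `A`, `b`, `lam`) purely illustrative:

```python
import numpy as np

def make_oracle(A, b, lam):
    """Black-box oracle for F(x) = ||Ax - b||_1 + (lam/2)||x||^2.

    Returns a function mapping x to (F(x), one subgradient of F at x);
    the caller never learns M, L, or that F is a sum of two terms."""
    def oracle(x):
        r = A @ x - b
        value = np.abs(r).sum() + 0.5 * lam * x @ x
        subgrad = A.T @ np.sign(r) + lam * x
        return value, subgrad
    return oracle

# Usage on random data:
rng = np.random.default_rng(0)
oracle = make_oracle(rng.standard_normal((40, 20)), rng.standard_normal(40), 0.1)
value, g = oracle(np.zeros(20))
```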
Namely, supposing all \((M_j,v_j) = (M/|\mathcal {J}|, v)\), the defining equation for K simplifies to
$$\begin{aligned} |\mathcal {J}|\left( \frac{|\mathcal {J}|^{\frac{1-v}{1+v}}(M/|\mathcal {J}|)^{\frac{2}{1+v}}\xi (x_0,x^*)}{\epsilon ^{\frac{2}{1+v}}}K^{\frac{-1-3v}{1+v}}\right) = \frac{M^{\frac{2}{1+v}}\xi (x_0,x^*)}{\epsilon ^{\frac{2}{1+v}}}K^{\frac{-1-3v}{1+v}} = 1, \end{aligned}$$

which is solved by \(K = \varTheta \left( \left( \frac{M}{\epsilon }\right) ^{\frac{2}{1+3v}}\xi (x_0,x^*)^{\frac{1+v}{1+3v}}\right) \), matching the optimal rate for \((M, v)\)-Hölder smooth minimization.
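For completeness, the exponent bookkeeping behind the first equality (the cancellation of the \(|\mathcal {J}|\) factors) is

$$\begin{aligned} |\mathcal {J}|\cdot |\mathcal {J}|^{\frac{1-v}{1+v}}\left( \frac{M}{|\mathcal {J}|}\right) ^{\frac{2}{1+v}} = |\mathcal {J}|^{\,1+\frac{1-v}{1+v}-\frac{2}{1+v}}\,M^{\frac{2}{1+v}} = |\mathcal {J}|^{\frac{(1+v)+(1-v)-2}{1+v}}\,M^{\frac{2}{1+v}} = M^{\frac{2}{1+v}}, \end{aligned}$$

since the exponent on \(|\mathcal {J}|\) is zero.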
Note that this comparison is somewhat superficial, as we assume a more expensive oracle giving (sub)gradients of the whole objective. Given our oracle, treating all the terms as a single \(L=\sum _i L_i\)-smooth term gives a faster \(O\left( \frac{\sqrt{\sum _i L_i}}{\sqrt{\sum _i\mu _i}} \log (1/\epsilon )\right) \) rate.
Such spectrahedra occur as the dual feasible regions of MAX-CUT SDP relaxations [14].
The original source for such a lower bound is difficult to identify in the literature, leaving this optimality somewhat as folklore. The theory of Grimmer [17] allows these bounds to be concluded from the classic lower bounds without growth.
References
Antonakopoulos, K., Pethick, T., Kavis, A., Mertikopoulos, P., Cevher, V.: Sifting through the noise: universal first-order methods for stochastic variational inequalities. In: Advances in Neural Information Processing Systems, vol. 34, pp. 13099–13111. Curran Associates, Inc. (2021)
Arjevani, Y., Daniely, A., Jegelka, S., Lin, H.: On the complexity of minimizing convex finite sums without using the indices of the individual functions (2020). arXiv:2002.03273
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007). https://doi.org/10.1137/050644641
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017). https://doi.org/10.1007/s10107-016-1091-6
Chang, C.C., Lin, C.J.: LIBSVM data: classification (binary class). https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Accessed 06 Jan 2023
Chen, X., Lin, Q., Pena, J.: Optimal regularized dual averaging methods for stochastic optimization. In: Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc. (2012)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. (2011). https://doi.org/10.1145/2049662.2049663
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc.: Ser. B (Methodol.) 39(1), 1–38 (1977)
Díaz, M., Grimmer, B.: Optimal convergence rates for the proximal bundle method. SIAM J. Optim. 33(2), 424–454 (2023). https://doi.org/10.1137/21M1428601
Ding, L., Yurtsever, A., Cevher, V., Tropp, J.A., Udell, M.: An optimal-storage approach to semidefinite programming using approximate complementarity. SIAM J. Optim. 31(4), 2695–2725 (2021). https://doi.org/10.1137/19m1244603
Doikov, N., Mishchenko, K., Nesterov, Y.: Super-universal regularized Newton method (2022). arXiv:2208.05888
Efron, B.: Double exponential families and their use in generalized linear regression. J. Am. Stat. Assoc. 81(395), 709–721 (1986). https://doi.org/10.1080/01621459.1986.10478327
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012). https://doi.org/10.1137/110848864
Goemans, M.X., Williamson, D.P.: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42(6), 1115–1145 (1995). https://doi.org/10.1145/227683.227684
Grimmer, B., Li, D.: Some primal-dual theory for subgradient methods for strongly convex optimization (2023)
Grimmer, B.: Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM J. Optim. 29(2), 1350–1365 (2019). https://doi.org/10.1137/18M117306X
Grimmer, B.: General Hölder smooth convergence rates follow from specialized rates assuming growth bounds. J. Optim. Theory Appl. 197, 51–70 (2021)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48(3), 769–783 (1998)
Lacoste-Julien, S., Schmidt, M., Bach, F.R.: A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method (2012). arXiv:1212.2002
Lan, G.: Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Math. Program. 149(1–2), 1–45 (2015). https://doi.org/10.1007/s10107-013-0737-x
Llorente, A., Suárez, A.: Critical sample size for the LP-norm estimator in linear regression models. In: 2013 Winter Simulations Conference (WSC), pp. 1047–1056 (2013)
Łojasiewicz, S.: Sur la géométrie semi- et sous-analytique. Ann. Inst. Fourier 43, 1575–1595 (1993)
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les Éq. aux Dérivées Part. 117, 87–89 (1963)
Meng, X.L., van Dyk, D.: The EM algorithm-an old folk-song sung to a fast new tune. J. R. Stat. Soc.: Ser. B (Methodol.) 59(3), 511–567 (1997)
Money, A., Affleck-Graves, J.F., Hart, M.L., Barr, G.D.I.: The linear regression model: Lp norm estimation and the choice of p. Commun. Stat.—Simul. Comput. 11, 89–109 (1982)
Nemirovskii, A.S., Nesterov, Y.E.: Optimal methods of smooth convex minimization. USSR Comput. Math. Math. Phys. 25(3–4), 21–30 (1986)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2015)
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)
Nyquist, H.: Recent studies on LP-norm estimation. Ph.D. Thesis, Umeå Universitet (1980)
Renegar, J., Grimmer, B.: A simple nearly optimal restart scheme for speeding up first-order methods. Found. Comput. Math. 22(1), 211–256 (2022). https://doi.org/10.1007/s10208-021-09502-2
Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. SIAM J. Optim. 30(1), 262–289 (2020). https://doi.org/10.1137/18M1224568
Wang, N., Zhang, S.: A gradient complexity analysis for minimizing the sum of strongly convex functions with varying condition numbers (2022). arXiv:2208.06524
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, vol. 22. Curran Associates, Inc. (2009)
Yang, T., Lin, Q.: RSG: beating subgradient method without smoothness and strong convexity. J. Mach. Learn. Res. 19(6), 1–33 (2018)