Abstract
This work considers minimizing a sum of convex functions, each with potentially different structure, ranging from nonsmooth to smooth and from Lipschitz to non-Lipschitz. Nesterov’s universal fast gradient method (Nesterov, Math Program 152:381–404, 2015) is an optimal black-box first-order method for minimizing a single function that takes advantage of any continuity structure present without requiring prior knowledge of it. In this paper, we show that this landmark method (without modification) further adapts to heterogeneous sums. For example, it minimizes the sum of a nonsmooth M-Lipschitz function and an L-smooth function at a rate of \( O(M^2/\epsilon ^2 + \sqrt{L/\epsilon }) \) without knowledge of M, L, or even that the objective is a sum of two terms. This rate is precisely the sum of the optimal convergence rates for each term’s individual complexity class. More generally, we show that sums of varied Hölder smooth functions introduce no new complexities, requiring at most as many iterations as are needed to minimize each summand separately. Extensions to strongly convex settings and to growth/error bound conditions are also provided.
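To make the black-box claim concrete, the following is a minimal sketch (not taken from the paper) of the universal fast gradient method in its Euclidean, unconstrained form, run on a small heterogeneous sum. The step-size equation, the inexact acceptance test with slack \(\epsilon \tau _k/2\), and all names (`universal_fgm`, `oracle`, `L0`) are paraphrased from the standard description of Nesterov’s method [27] and are illustrative rather than a transcription of the paper’s algorithm statement.

```python
import numpy as np

def universal_fgm(oracle, x0, eps, L0=1.0, iters=500):
    """Sketch of a universal fast gradient method (Euclidean, unconstrained).

    `oracle(x)` returns (F(x), one (sub)gradient of F at x); no smoothness or
    Lipschitz constants are supplied, only a target accuracy eps."""
    x0 = np.asarray(x0, dtype=float)
    y, A, L = x0.copy(), 0.0, L0
    g_acc = np.zeros_like(x0)            # running sum of a_{k+1} * gradient terms
    for _ in range(iters):
        v = x0 - g_acc                   # v_k = argmin of the model phi_k(z) = 0.5||z-x0||^2 + <g_acc, z>
        i = 0
        while True:                      # backtracking on the current smoothness guess
            M = (2.0 ** i) * L
            a = (1.0 + np.sqrt(1.0 + 4.0 * M * A)) / (2.0 * M)   # positive root of M a^2 = A + a
            tau = a / (A + a)
            x = tau * v + (1.0 - tau) * y
            fx, gx = oracle(x)
            xhat = x0 - (g_acc + a * gx) # argmin of phi_k(z) + a <gx, z>
            y_trial = tau * xhat + (1.0 - tau) * y
            d = y_trial - x
            f_trial, _ = oracle(y_trial)
            # accept once the quadratic upper model holds up to slack eps*tau/2
            if f_trial <= fx + gx @ d + 0.5 * M * (d @ d) + 0.5 * eps * tau:
                break
            i += 1
        y, A, L = y_trial, A + a, (2.0 ** (i - 1)) * L
        g_acc = g_acc + a * gx
    return y

# Toy heterogeneous sum: F(x) = ||x - c||_1 + 0.5 ||x||^2 (Lipschitz + smooth),
# exposed only through an aggregate value/(sub)gradient oracle.
c = np.array([2.0, -3.0, 1.0])
oracle = lambda x: (np.abs(x - c).sum() + 0.5 * x @ x, np.sign(x - c) + x)
print(universal_fgm(oracle, np.zeros(3), eps=1e-3))
```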
Notes
In particular, the considered methods require an oracle, called once per iteration, that produces the function value and one (sub)gradient of F at the current iterate.
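As an illustration (not from the paper), a heterogeneous sum such as \(F(x) = \Vert Ax-b\Vert _1 + \frac{\lambda }{2}\Vert x\Vert ^2\) is easily packaged behind exactly this kind of oracle; the method only ever sees the aggregate value and one aggregate (sub)gradient, never the individual terms or their constants. A minimal Python sketch, with all names (`make_oracle`, `A`, `b`, `lam`) purely illustrative:

```python
import numpy as np

def make_oracle(A, b, lam):
    """Black-box oracle for F(x) = ||Ax - b||_1 + (lam/2)||x||^2.

    Returns a function mapping x to (F(x), one subgradient of F at x);
    the caller never learns M, L, or that F is a sum of two terms."""
    def oracle(x):
        r = A @ x - b
        value = np.abs(r).sum() + 0.5 * lam * x @ x
        subgrad = A.T @ np.sign(r) + lam * x
        return value, subgrad
    return oracle

# Usage on random data:
rng = np.random.default_rng(0)
oracle = make_oracle(rng.standard_normal((40, 20)), rng.standard_normal(40), 0.1)
value, g = oracle(np.zeros(20))
```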
Namely, supposing all \((M_j,v_j) = (M/|\mathcal {J}|, v)\), the defining equation for K simplifies to
$$\begin{aligned} |\mathcal {J}|\left( \frac{|\mathcal {J}|^{\frac{1-v}{1+v}}(M/|\mathcal {J}|)^{\frac{2}{1+v}}\xi (x_0,x^*)}{\epsilon ^{\frac{2}{1+v}}}K^{\frac{-1-3v}{1+v}}\right) = \frac{M^{\frac{2}{1+v}}\xi (x_0,x^*)}{\epsilon ^{\frac{2}{1+v}}}K^{\frac{-1-3v}{1+v}} = 1, \end{aligned}$$

which is solved by \(K = \varTheta \left( \left( \frac{M}{\epsilon }\right) ^{\frac{2}{1+3v}}\xi (x_0,x^*)^{\frac{1+v}{1+3v}}\right) \), matching the optimal rate for \((M, v)\)-Hölder smooth minimization.
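For completeness, the exponent bookkeeping behind the first equality (the cancellation of the \(|\mathcal {J}|\) factors) is

$$\begin{aligned} |\mathcal {J}|\cdot |\mathcal {J}|^{\frac{1-v}{1+v}}\left( \frac{M}{|\mathcal {J}|}\right) ^{\frac{2}{1+v}} = |\mathcal {J}|^{\,1+\frac{1-v}{1+v}-\frac{2}{1+v}}\,M^{\frac{2}{1+v}} = |\mathcal {J}|^{\frac{(1+v)+(1-v)-2}{1+v}}\,M^{\frac{2}{1+v}} = M^{\frac{2}{1+v}}, \end{aligned}$$

since the exponent on \(|\mathcal {J}|\) is zero.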
Note that this comparison is somewhat superficial, as we assume a more expensive oracle giving (sub)gradients of the whole objective. Given our oracle, treating all the terms as a single \(L=\sum _i L_i\)-smooth term gives a faster \(O\left( \frac{\sqrt{\sum _i L_i}}{\sqrt{\sum _i\mu _i}} \log (1/\epsilon )\right) \) rate.
Such spectrahedra occur as the dual feasible regions of MAX-CUT SDP relaxations [14].
The original source for such a lower bound is difficult to identify in the literature, leaving this optimality somewhat as folklore. The theory of Grimmer [17] allows these bounds to be concluded from the classic lower bounds without growth.
References
Antonakopoulos, K., Pethick, T., Kavis, A., Mertikopoulos, P., Cevher, V.: Sifting through the noise: universal first-order methods for stochastic variational inequalities. In: Advances in Neural Information Processing Systems, vol. 34, pp. 13099–13111. Curran Associates, Inc. (2021)
Arjevani, Y., Daniely, A., Jegelka, S., Lin, H.: On the complexity of minimizing convex finite sums without using the indices of the individual functions (2020). arXiv:2002.03273
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007). https://doi.org/10.1137/050644641
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017). https://doi.org/10.1007/s10107-016-1091-6
Chang, C.C., Lin, C.J.: LIBSVM data: classification (binary class). https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Accessed 06 Jan 2023
Chen, X., Lin, Q., Pena, J.: Optimal regularized dual averaging methods for stochastic optimization. In: Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc. (2012)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. (2011). https://doi.org/10.1145/2049662.2049663
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc.: Ser. B (Methodol.) 39(1), 1–38 (1977)
Díaz, M., Grimmer, B.: Optimal convergence rates for the proximal bundle method. SIAM J. Optim. 33(2), 424–454 (2023). https://doi.org/10.1137/21M1428601
Ding, L., Yurtsever, A., Cevher, V., Tropp, J.A., Udell, M.: An optimal-storage approach to semidefinite programming using approximate complementarity. SIAM J. Optim. 31(4), 2695–2725 (2021). https://doi.org/10.1137/19m1244603
Doikov, N., Mishchenko, K., Nesterov, Y.: Super-universal regularized Newton method (2022). arXiv:2208.05888
Efron, B.: Double exponential families and their use in generalized linear regression. J. Am. Stat. Assoc. 81(395), 709–721 (1986). https://doi.org/10.1080/01621459.1986.10478327
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012). https://doi.org/10.1137/110848864
Goemans, M.X., Williamson, D.P.: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42(6), 1115–1145 (1995). https://doi.org/10.1145/227683.227684
Grimmer, B., Li, D.: Some primal-dual theory for subgradient methods for strongly convex optimization (2023)
Grimmer, B.: Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM J. Optim. 29(2), 1350–1365 (2019). https://doi.org/10.1137/18M117306X
Grimmer, B.: General Hölder smooth convergence rates follow from specialized rates assuming growth bounds. J. Optim. Theory Appl. 197, 51–70 (2021)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48(3), 769–783 (1998)
Lacoste-Julien, S., Schmidt, M., Bach, F.R.: A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method (2012). arXiv:1212.2002
Lan, G.: Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Math. Program. 149(1–2), 1–45 (2015). https://doi.org/10.1007/s10107-013-0737-x
Llorente, A., Suárez, A.: Critical sample size for the LP-norm estimator in linear regression models. In: 2013 Winter Simulations Conference (WSC), pp. 1047–1056 (2013)
Łojasiewicz, S.: Sur la géométrie semi- et sous-analytique. Ann. Inst. Fourier 43, 1575–1595 (1993)
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les Éq. aux Dérivées Part. 117, 87–89 (1963)
Meng, X.L., van Dyk, D.: The EM algorithm-an old folk-song sung to a fast new tune. J. R. Stat. Soc.: Ser. B (Methodol.) 59(3), 511–567 (1997)
Money, A., Affleck-Graves, J.F., Hart, M.L., Barr, G.D.I.: The linear regression model: Lp norm estimation and the choice of p. Commun. Stat.—Simul. Comput. 11, 89–109 (1982)
Nemirovskii, A.S., Nesterov, Y.E.: Optimal methods of smooth convex minimization. USSR Comput. Math. Math. Phys. 25(3–4), 21–30 (1986)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152, 381–404 (2015)
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)
Nyquist, H.: Recent studies on LP-norm estimation. Ph.D. Thesis, Umeå Universitet (1980)
Renegar, J., Grimmer, B.: A simple nearly optimal restart scheme for speeding up first-order methods. Found. Comput. Math. 22(1), 211–256 (2022). https://doi.org/10.1007/s10208-021-09502-2
Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. SIAM J. Optim. 30(1), 262–289 (2020). https://doi.org/10.1137/18M1224568
Wang, N., Zhang, S.: A gradient complexity analysis for minimizing the sum of strongly convex functions with varying condition numbers (2022). arXiv:2208.06524
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems, vol. 22. Curran Associates, Inc. (2009)
Yang, T., Lin, Q.: RSG: beating subgradient method without smoothness and strong convexity. J. Mach. Learn. Res. 19(6), 1–33 (2018)