

A Simple Nearly Optimal Restart Scheme For Speeding Up First-Order Methods

Foundations of Computational Mathematics

Abstract

We present a simple scheme for restarting first-order methods for convex optimization problems. Restarts are made based only on achieving specified decreases in objective values, the specified amounts being the same for all optimization problems. Unlike existing restart schemes, the scheme makes no attempt to learn parameter values characterizing the structure of an optimization problem, nor does it require any special information that would not be available in practice (unless the first-order method chosen to be employed in the scheme itself requires special information). As immediate corollaries to the main theorems, we show that when some well-known first-order methods are employed in the scheme, the resulting complexity bounds are nearly optimal for particular—yet quite general—classes of problems.


Notes

  1. Recall that a vector g is a subgradient of f at x if for all y we have \( f(y) \ge f(x) + \langle g, y-x \rangle \). If f is differentiable at x, then g is precisely the gradient.

  2. \( \mathtt {accel}\) (a Python sketch of this method appears at the end of these notes):

     (0) Input: \( x_0 \in Q \). Initialize: \( y_0 = x_0 \), \( \theta _0 = 1 \), \( k = 0 \).

     (1) \( x_{k+1} = P_Q\big ( y_k - \frac{1}{L} \nabla f(y_k) \big ) \)

     (2) \( \theta _{k+1} = \frac{1}{2} \big ( 1 + \sqrt{1 + 4 \theta _k^2} \big ) \)

     (3) \( y_{k+1} = x_{k+1} + \frac{\theta _k - 1}{\theta _{k+1}} (x_{k+1} - x_k) \).

  3. We avoid digressing to discuss the principles underlying proximal methods. For readers unfamiliar with the topic, the paper introducing FISTA [3] is a good place to start.

  4. In the case of \( \mathtt {subgrad}_n \), restarting at x simply means computing a subgradient step at x, i.e., \( x_+ = x - \frac{2^n \varepsilon }{\Vert g_k \Vert ^2} g_k \). For \( \mathtt {accel}_n \), restarting is slightly more involved, in part due to two sequences of points being computed, \( \{ x_{n,k} \} \) and \( \{ y_{n,k} \} \). The primary sequence is \( \{ x_{n,k} \} \). The method \( \mathtt {accel}_n \) restarts at a point x from a primary sequence (either from \( {\mathtt {fom}}_n \)’s own primary sequence, or a point sent to \( {\mathtt {fom}}_n \) from the primary sequence for \( {\mathtt {fom}}_{n+1} \)). In restarting, \( \mathtt {accel}_n \) begins by “reinitializing,” setting \( x_{n,0} = x \), \( y_{n,0} = x \), \( \theta _0 = 1 \), \( k = 0 \), and then begins iterating.

  5. This is Example 4.5 in [4], although there the Euclidean norm is not used, unlike here. To verify correctness of the value \( \alpha = \max _i \Vert a_i \Vert ^2 \), it suffices to show for each x that the largest eigenvalue of the positive semidefinite matrix \( \nabla ^2 f_{\eta }(x) \) does not exceed \( \max _i \Vert a_i \Vert ^2/ \eta \). However, \( \nabla ^2 f_{\eta }(x) = - \frac{1}{\eta } \nabla f_{\eta }(x) \nabla f_{\eta }(x)^T + \frac{1}{ \eta \sum _i \exp ((a_i^T x + b_i)/ \eta ) } \sum _i \exp ((a_i^T x + b_i)/\eta ) \, a_i a_i^T \), and hence the largest eigenvalue of \( \nabla ^2 f_{\eta }(x) \) is no greater than \( 1/ \eta \) times the largest eigenvalue of the matrix \( \frac{1}{\sum _i \exp ((a_i^T x + b_i)/ \eta ) } \sum _i \exp ((a_i^T x + b_i)/\eta ) \, a_i a_i^T \), which is a convex combination of the matrices \( a_i a_i^T \). Consequently, \( \alpha = \max _i \Vert a_i \Vert ^2 \) is indeed an appropriate choice.

  6. In particular, analogous results can be obtained under the weaker assumption that Hölderian growth holds on a smaller set \( \{ x \in Q \mid \mathrm {dist}(x,X^*) \le r \} \) for some \( r > 0 \). The key is that due to convexity of f, for \( {\hat{f}} \ge f^* \) we would then have

    $$\begin{aligned} D( {\hat{f}}) \le {\left\{ \begin{array}{ll} \big ( ({\hat{f}} - f^*)/\mu \big )^{1/p} &{} \text {if } {\hat{f}} - f^* \le \mu \, r^p\\ ({\hat{f}} - f^*)/(\mu \, r^{p-1}) &{} \text {if }{\hat{f}} - f^* \ge \mu \, r^p. \end{array}\right. } \end{aligned}$$
  7. The choice of values \( 2^n \varepsilon \) (\( n = -1, 0, \ldots , N \)) is slightly arbitrary. Indeed, given any scalar \( \gamma > 1 \), everywhere replacing \( 2^n \varepsilon \) by \( \gamma ^n \varepsilon \) leads to similar theoretical results. Our analysis shows that relying on powers of 2 suffices to give nearly optimal guarantees, although a more refined analysis, in the spirit of [33], might reveal that values of \( \gamma \) other than 2 are preferable, depending on the setting.

  8. The synchronous parallel scheme is inappropriate for first-order methods that have wide variation in the amount of work done in iterations. Instead, the asynchronous parallel scheme (Sect. 7) is appropriate.

  9. The sequence of contiguous pauses is finite, because (1) \( {\mathtt {fom}}_{n+1} \) sends a message only when it has obtained x satisfying \( f(x) \le f(x_{n+1,0}) - 2^{n+1} \varepsilon \), and (2) we assume \( f^* \) is finite.

  10. We assume the listed chores are accomplished instantly by \( {\mathtt {fom}}_n \). Assuming positive time is required would have negligible effect on the complexity results, but would make notation more cumbersome.

  11. For that setting, see [32, Appendix A] for a slight extension of Nesterov’s arguments that suffices to remove the assumption.

  12. https://github.com/bgrimmer/Parallel-Restarting-Scheme.

  13. This work can easily be found online, although it was never published (presumably a consequence of Tseng’s disappearance).

  14. In the proof of [3, Theorem 4.4], \( x^* \) designates any optimal solution. However, no properties of optimality are relied upon other than \( f(x^*) = f^* \) (nor are any relied upon in [3, Lemma 4.1], to which the proof refers). Instead, choose \( x^* \) to be the point in \( X(\varepsilon /2) \) which is nearest to \( x_0 \), and everywhere replace \( f^* \) by \( f^* + \varepsilon / 2 \).

    Also make the trivial observation that the conclusion to [3, Lemma 4.2] remains valid even if the scalars \( \alpha _i \) are not sign restricted, so long as all of the scalars \( \beta _i \) are nonnegative (whereas the lemma as stated in the paper starts with the assumption that all of these scalars are positive).

    With these changes, the proof of [3, Theorem 4.4] shows that the iterates \( x_k \) of \( \mathtt {fista}\) satisfy

    $$\begin{aligned} f(x_k) - (f^* + \varepsilon /2) \le \frac{2 L \mathrm {dist}(x_0, X(\varepsilon /2))^2 }{(k+1)^2} . \end{aligned}$$

    Thus, to obtain an \( \varepsilon \)-optimal solution, (56) iterations suffice.

  15. The number of messages sent by \( {\mathtt {fom}}_{n+1} \) is finite because (1) \( {\mathtt {fom}}_{n+1} \) sends a message only when it has obtained x satisfying \( f(x) \le f(x_{n+1,0}) - 2^{n+1} \varepsilon \), where \( x_{n+1,0} \) is the most recent (re)start point, and (2) we assume \( f^* \) is finite.
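
The following Python sketch, referenced in Note 2, illustrates one way \( \mathtt {accel}\) might be implemented. It is only an illustration under our own conventions: grad_f and proj_Q stand for a user-supplied gradient oracle for f and the Euclidean projection onto Q, and the iteration count is supplied by the caller.

```python
import numpy as np

def accel(x0, grad_f, proj_Q, L, num_iters):
    """Projected accelerated gradient method, following steps (0)-(3) of Note 2.

    x0        -- starting point, assumed to lie in Q
    grad_f    -- callable returning the gradient of f at a point
    proj_Q    -- callable returning the Euclidean projection onto Q
    L         -- Lipschitz constant of the gradient of f
    num_iters -- number of iterations to perform
    """
    x, y, theta = x0, x0, 1.0                                        # step (0)
    for _ in range(num_iters):
        x_new = proj_Q(y - grad_f(y) / L)                            # step (1)
        theta_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))    # step (2)
        y = x_new + ((theta - 1.0) / theta_new) * (x_new - x)        # step (3)
        x, theta = x_new, theta_new
    return x
```

When Q is all of \( \mathbb {R}^n \), proj_Q can simply be the identity map; restarting \( \mathtt {accel}_n \) at a point x, as described in Note 4, amounts to calling this routine afresh with x0 = x.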

References

  1. Bauschke, H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research 42(2), 330–348 (2017). https://doi.org/10.1287/moor.2016.0817

  2. Bauschke, H., Borwein, J., Combettes, P.: Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization 42(2), 596–636 (2003)

  3. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)

  4. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM Journal on Optimization 22(2), 557–580 (2012)

  5. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization 17(4), 1205–1223 (2007)

  6. Bolte, J., Nguyen, T., Peypouquet, J., Suter, B.: From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165(2), 471–507 (2017)

  7. Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358 (2016)

  8. Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA Journal of Numerical Analysis 39(4), 2069–2095 (2019)

  9. Gilpin, A., Pena, J., Sandholm, T.: First-order algorithm with \( O(\ln (1/\epsilon )) \) convergence for \( \epsilon \)-equilibrium in two-person zero-sum games. Mathematical Programming 133, 279–298 (2010)

  10. Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: 53rd IEEE Conference on Decision and Control (CDC), pp. 5058–5063. IEEE (2014)

  11. Goffin, J.: On convergence rates of subgradient optimization methods. Mathematical Programming 13(1), 329–347 (1977)

  12. Iouditski, A., Nesterov, Y.: Primal-dual subgradient methods for minimizing uniformly convex functions. arXiv preprint arXiv:1401.1792 (2014)

  13. Johnstone, P., Moulin, P.: Faster subgradient methods for functions with Hölderian growth. Mathematical Programming 180(1), 417–450 (2020)

  14. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)

  15. Lin, Q., Xiao, L.: An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. In: International Conference on Machine Learning, pp. 73–81 (2014)

  16. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles 117, 87–89 (1963)

  17. Łojasiewicz, S.: Sur la géométrie semi- et sous-analytique. Annales de l'Institut Fourier 43, 1575–1595 (1993)

  18. Lu, H., Freund, R., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization 28(1), 333–354 (2018). https://doi.org/10.1137/16M1099546

  19. Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pp. 1–39 (2016)

  20. Nemirovski, A., Nesterov, Y.: Optimal methods of smooth convex minimization. U.S.S.R. Comput. Math. Math. Phys. 25(2), 21–30 (1985)

  21. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley (1983)

  22. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \( O(1/k^2) \). Soviet Mathematics Doklady 27(2), 372–376 (1983)

  23. Nesterov, Y.: Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127–152 (2005)

  24. Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Mathematical Programming 110(2), 245–259 (2007)

  25. Nesterov, Y.: Gradient methods for minimizing composite functions. Mathematical Programming 140(1), 125–161 (2013)

  26. Nesterov, Y.: Universal gradient methods for convex optimization problems. Mathematical Programming 152(1–2), 381–404 (2015)

  27. O'Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics 15(3), 715–732 (2015)

  28. Polyak, B.: Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics 3(4), 864–878 (1963)

  29. Polyak, B.: Subgradient methods: a survey of Soviet research. In: Nonsmooth Optimization: Proceedings of the IIASA Workshop, pp. 5–30 (1977)

  30. Polyak, B.: Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software (1987)

  31. Renegar, J.: “Efficient” subgradient methods for general convex optimization. SIAM Journal on Optimization 26, 2649–2676 (2016)

  32. Renegar, J.: Accelerated first-order methods for hyperbolic programming. Mathematical Programming, pp. 1–35 (2017)

  33. Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. SIAM Journal on Optimization 30(1), 262–289 (2020)

  34. Shor, N.: Minimization Methods for Non-Differentiable Functions. Springer (1985)

  35. Teboulle, M.: A simplified view of first order methods for optimization. Mathematical Programming 170(1), 67–96 (2018)

  36. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization (2008)

  37. Yang, T.: Adaptive accelerated gradient converging methods under Hölderian error bound condition. In: 31st Conference on Neural Information Processing Systems (2017)

  38. Yang, T., Lin, Q.: RSG: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research 19(1), 236–268 (2018)


Acknowledgements

We are indebted to the reviewers, whose comments and suggestions led to a wide range of substantial improvements in the presentation.

Author information

Correspondence to James Renegar.

Additional information

Communicated by Michael Overton.


Research supported in part by NSF grants CCF-1552518 and DMS-1812904, and by the NSF Graduate Research Fellowship under grant DGE-1650441. A portion of the initial development was done while the authors were visiting the Simons Institute for the Theory of Computing, and was partially supported by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF Grant CCF-1740425.

Appendices

Complexity of Subgradient Method

Here, we record the simple proof of the complexity bound for \( \mathtt {subgrad}\) introduced in Sect. 2.1, namely, that if f is M-Lipschitz on Q, then for the function

$$\begin{aligned} K_{\mathtt {subgrad}}(\delta , \varepsilon ) := (M \delta / \varepsilon )^2 , \end{aligned}$$
(54)

the iterates \( x_{k+1} = P_Q ( x_k - \frac{\varepsilon }{ \Vert g_k \Vert ^2} g_k ) \) satisfy

$$\begin{aligned} \mathrm {dist}(x_0, X^*) \le \delta \quad \Rightarrow \quad \min \{ f(x_k) \mid 0 \le k \le K_{\mathtt {subgrad}}(\delta , \varepsilon ) \} \, \le \, f^* + \varepsilon . \end{aligned}$$
(55)

Indeed, letting \( x^* \) be the closest point in \( X^* \) to \( x_k \), and defining \( \alpha _k = \varepsilon /\Vert g_k \Vert ^2 \), we have

$$\begin{aligned} \mathrm {dist}(x_{k+1}, X^*)^2&\le \Vert x_{k+1} - x^* \Vert ^2 = \Vert P( x_k - \alpha _k g_k) - x^* \Vert ^2 \le \Vert (x_k - \alpha _k g_k) - x^* \Vert ^2 \\&= \Vert x_k - x^* \Vert ^2 - 2 \alpha _k \langle g_k, x_k - x^* \rangle + \alpha _k^2 \Vert g_k \Vert ^2 \\&\le \Vert x_k - x^* \Vert ^2 - 2 \alpha _k (f(x_k) - f^* ) \\&\quad + \alpha _k^2 \Vert g_k \Vert ^2 \quad (\text {because } g_k \text { is a subgradient at } x_k)\\&\le \mathrm {dist}(x_k, X^*)^2 - ( \, 2 ( f(x_k) - f^* ) - \varepsilon \, ) \, \varepsilon /M^2 , \end{aligned}$$

the final inequality due to \( \alpha _k = \varepsilon / \Vert g_k \Vert ^2 \), \( \Vert g_k \Vert \le M \) and the definition of \( x^* \). Thus,

$$\begin{aligned} f(x_k) - f^* > \varepsilon \, \Rightarrow \, \mathrm {dist}(x_{k+1}, X^*)^2 < \mathrm {dist}(x_k, X^*)^2 - (\varepsilon /M)^2 . \end{aligned}$$

Consequently, by induction, if \( x_{k+1} \) is the first iterate satisfying \( f(x_{k+1}) - f^* \le \varepsilon \), then

$$\begin{aligned} 0 \le \mathrm {dist}(x_{k+1},X^*)^2 < \mathrm {dist}(x_0,X^*)^2 - (k+1) (\varepsilon /M)^2 , \end{aligned}$$

implying the function \( K_{\mathtt {subgrad}} \) defined by (54) satisfies (55).
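
For concreteness, a minimal Python sketch of the iteration just analyzed is given below. It assumes user-supplied oracles f, subgrad_f, and proj_Q (the objective value, a subgradient, and the Euclidean projection onto Q), and it simply performs \( K_{\mathtt {subgrad}}(\delta , \varepsilon ) = (M \delta /\varepsilon )^2 \) iterations, returning the best iterate found, since the stopping test of (55) involving \( f^* \) is not available in practice.

```python
import numpy as np

def subgrad(x0, f, subgrad_f, proj_Q, M, delta, eps):
    """Subgradient method with step size eps / ||g_k||^2, as in (54)-(55).

    Runs K_subgrad(delta, eps) = (M * delta / eps)**2 iterations; by (55),
    the best iterate is eps-optimal whenever dist(x0, X*) <= delta and
    f is M-Lipschitz on Q.
    """
    K = int(np.ceil((M * delta / eps) ** 2))
    x = x0
    best_x, best_val = x0, f(x0)
    for _ in range(K):
        g = subgrad_f(x)
        if np.dot(g, g) == 0.0:                 # zero subgradient: x is optimal
            return x, f(x)
        x = proj_Q(x - (eps / np.dot(g, g)) * g)
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```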

Smoothing

Here, we present the first-order method \( \mathtt {smooth}\) which is applicable when the objective function has a smoothing. We also justify the claim that \( K_{\mathtt {smooth}}(\delta , \varepsilon ) \), as defined by (10), provides an appropriate bound for the number of iterations.

We begin with an observation regarding the function \( K_{{\mathtt {fom}}}(\delta , \varepsilon ) \) for bounding the number of iterations of \( {\mathtt {fom}}\) in terms of the desired accuracy \( \varepsilon \) and the distance from the initial iterate to \( X^* \), the set of optimal solutions: Occasionally in analyses of first-order methods, the value \( K_{{\mathtt {fom}}}( \delta _{\varepsilon }, \varepsilon ) \) is shown to be an upper bound, where \( \delta _{\varepsilon } \) is the distance from the initial iterate to \( X(\varepsilon ) := \{ x \in Q \mid f(x) \le f^* + \varepsilon \} \). Since \( \delta _{\varepsilon } \le \delta \), the latter upper bound is tighter. Particularly notable for focusing on \( \mathrm {dist}(x_0,X(\varepsilon )) \) rather than \( \mathrm {dist}(x_0,X^*) \) is the masterpiece [36] of Paul Tseng.Footnote 13

In many other cases, even though the analysis is done in terms of \( \mathrm {dist}(x_0,X^*) \), easy modifications yield an iteration bound in terms of \( \varepsilon \) (the desired accuracy) and \( \mathrm {dist}(x_0, X( \varepsilon ')) \) for \( \varepsilon ' \) satisfying \( 0< \varepsilon ' < \varepsilon \). This happens to be the case for the analysis of \( \mathtt {fista}\) in [3] (a special case being \( \mathtt {accel}\), which we rely upon for \( \mathtt {smooth}\)).

In particular, small modifications to the proof of [3, Theorem 4.4] show that \( \mathtt {fista}\) computes an \( \varepsilon \)-optimal solution within a number of iterations not exceeding

$$\begin{aligned} 2 \, \mathrm {dist}(x_0,X(\varepsilon /2)) \sqrt{L/ \varepsilon } . \end{aligned}$$
(56)

We provide specifics in a footnote.Footnote 14

Recall the following definition:

An “\( (\alpha , \beta ) \)-smoothing of f” is a family of functions \( f_{\eta } \) parameterized by \( \eta > 0 \), where for each \( \eta \), the function \( f_{\eta } \) is smooth on a neighborhood of Q and satisfies

  1. \( \Vert \nabla f_{\eta }(x) - \nabla f_{\eta }(y) \Vert \le \frac{\alpha }{\eta } \Vert x - y \Vert \quad \forall x,y \in Q \),

  2. \( f(x) \le f_{\eta }(x) \le f(x) + \beta \eta \quad \forall x \in Q \).

To make use of an \( (\alpha , \beta ) \)-smoothing of a nonsmooth objective f when the goal is to obtain an \( \varepsilon \)-optimal solution (for f), we rely on \( \mathtt {accel}\) (or, more generally, \( \mathtt {fista}\)) applied to \( f_{\eta } \) with \( \eta \) chosen appropriately. In particular, let the algorithm \( \mathtt {smooth}(\varepsilon ) \) for obtaining an \( \varepsilon \)-optimal solution of f be precisely the algorithm \( \mathtt {accel}(2\varepsilon /3) \) applied to obtaining a \( (2\varepsilon /3) \)-optimal solution of \( f_{\varepsilon /3\beta } \) (i.e., \( \eta = \varepsilon /3\beta \)). Note that the Lipschitz constant \( L = \alpha /\eta \) is available for \( \mathtt {accel}\) to use in computing iterates (assuming \( \alpha \) is known, as it is for our example in Sect. 2.1, along with many other examples [4]).
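
As an illustration, the choices just described can be organized as the following Python sketch, which wraps an implementation of \( \mathtt {accel}\) such as the one sketched at the end of the Notes. The helper make_smoothing (returning a gradient oracle for \( f_{\eta } \)) and the use of the iteration bound \( K_{\mathtt {smooth}}(\delta , \varepsilon ) \) derived below are our own conventions, not part of the paper's formal development.

```python
import numpy as np

def smooth(x0, make_smoothing, proj_Q, alpha, beta, delta, eps):
    """Run accel on the smoothed function f_eta with eta = eps / (3 * beta).

    make_smoothing(eta) is assumed to return a gradient oracle for f_eta;
    alpha, beta are the smoothing constants, delta bounds dist(x0, X*),
    and eps is the target accuracy for the original objective f.
    """
    eta = eps / (3.0 * beta)        # smoothing parameter
    L = alpha / eta                 # Lipschitz constant of the gradient of f_eta
    grad_f_eta = make_smoothing(eta)
    # iteration bound K_smooth(delta, eps) = 3 * delta * sqrt(2*alpha*beta) / eps
    K = int(np.ceil(3.0 * delta * np.sqrt(2.0 * alpha * beta) / eps))
    return accel(x0, grad_f_eta, proj_Q, L, K)
```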

To understand these choices of parameters, first observe that by definition, \( \mathtt {accel}(2\varepsilon /3 ) \) is guaranteed—for whatever smooth function it is applied to—to compute an iterate which is a \( (2\varepsilon /3 ) \)-optimal solution. Let \( x_k \) be such an iterate when the function is \( f_{\varepsilon /3 \beta } \), that is,

$$\begin{aligned} f_{\varepsilon /3 \beta }(x_k) - f^*_{\varepsilon /3 \beta } \le 2 \varepsilon /3 . \end{aligned}$$
(57)

Now observe that letting \( x^* \) be a minimizer of f, we have

$$\begin{aligned} f_{\varepsilon /3 \beta }^* \le f_{\varepsilon /3 \beta }(x^*) \le f(x^*) + \beta ( \varepsilon /3 \beta ) = f^* + \varepsilon /3 . \end{aligned}$$

Substituting this and \( f(x_k) \le f_{\varepsilon /3 \beta }(x_k) \) into (57) gives, after rearranging, \( f(x_k) - f^* \le \varepsilon \), that is, the method \( \mathtt {smooth}(\varepsilon ) \) will have computed an \( \varepsilon \)-optimal solution for f.
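
In detail, the chain of inequalities being used is

$$\begin{aligned} f(x_k) \le f_{\varepsilon /3 \beta }(x_k) \le f^*_{\varepsilon /3 \beta } + 2 \varepsilon /3 \le \big ( f^* + \varepsilon /3 \big ) + 2 \varepsilon /3 = f^* + \varepsilon . \end{aligned}$$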

Consequently, the number of iterations required by \( \mathtt {smooth}\) to obtain an \( \varepsilon \)-optimal solution for f is no greater than the number of iterations for \( \mathtt {accel}\) to compute a \( (2\varepsilon /3) \)-optimal solution of \( f_{\varepsilon /3 \beta } \), and hence, per the discussion giving the iteration bound (56), does not exceed

$$\begin{aligned} 2 \, \mathrm {dist}(x_0, X_{\varepsilon /3 \beta }( \varepsilon /3 )) \sqrt{ (3 \alpha \beta /\varepsilon ) \big / (2 \varepsilon / 3)} , \end{aligned}$$
(58)

where \( X_{\varepsilon /3 \beta }(\varepsilon /3) = \{ x \in Q \mid f_{\varepsilon /3 \beta }(x) - f^*_{ \varepsilon /3 \beta } \le \varepsilon /3 \} \), and where we have used the fact that the Lipschitz constant of \( \nabla f_{\varepsilon /3 \beta } \) is \( \alpha / ( \varepsilon /3 \beta ) \). We claim, however, \( X^* \subseteq X_{\varepsilon /3 \beta }( \varepsilon /3) \), which with (58) results in the iteration bound

$$\begin{aligned} K_{\mathtt {smooth}}(\delta , \varepsilon ) = 3 \delta \sqrt{2\alpha \beta } / \varepsilon \end{aligned}$$

for \( \mathtt {smooth}\) to compute an \( \varepsilon \)-optimal solution for f, where \( \delta \) is the distance from the initial iterate to \( X^* \), the set of optimal solutions for f.
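
Indeed, using \( \mathrm {dist}(x_0, X_{\varepsilon /3 \beta }(\varepsilon /3)) \le \mathrm {dist}(x_0, X^*) = \delta \), the bound (58) simplifies as

$$\begin{aligned} 2 \, \delta \, \sqrt{ \frac{3 \alpha \beta / \varepsilon }{2 \varepsilon /3} } = 2 \, \delta \, \sqrt{ \frac{9 \alpha \beta }{2 \varepsilon ^2} } = \frac{3 \delta \sqrt{2 \alpha \beta }}{\varepsilon } . \end{aligned}$$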

Finally, to verify the claim, letting \( x^* \) be any optimal point for f, we have

$$\begin{aligned} f_{\varepsilon /3 \beta }(x^*) \le f(x^*) + \beta (\varepsilon /3 \beta ) = f^* + \varepsilon /3 \le f_{\varepsilon /3 \beta }^* + \varepsilon /3 , \end{aligned}$$

as desired.

Proof of Theorem 3

The proof proceeds through a sequence of results analogous to the sequence in the proof of Theorem 2, although the bookkeeping is now more involved. As in the proof of Theorem 2, we refer to \( {\mathtt {fom}}_n \) “updating” at a point x. The starting point \( \mathbf {x_0} \) is considered to be the first update point for every \( {\mathtt {fom}}_n \). Once \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) has started, for \( n < N \), updating at x means the same as restarting at x, whereas for \( {\mathtt {fom}}_{N} \), updating at x means having computed an iterate x satisfying the current task of \( {\mathtt {fom}}_{N} \) (in which case the point is sent to \( {\mathtt {fom}}_{N-1} \), even though \( {\mathtt {fom}}_{N} \) does not restart).

Keep in mind that when a new epoch for \( {\mathtt {fom}}_n \) occurs, if a message from \( {\mathtt {fom}}_{n+1} \) arrives in the inbox exactly when the epoch begins, then the restart point will not be decided immediately, due to \( {\mathtt {fom}}_n \) being made to pause, possibly contiguously. In any case, the restart point is decided after only a finite amount of time, due to \( {\mathtt {fom}}_{n+1} \) sending at most finitely many messages in total.Footnote 15
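
Before the formal bookkeeping, the following deliberately simplified Python sketch may help fix ideas about the hierarchy being analyzed: each level n maintains its own copy of a first-order method with target decrease \( 2^n \varepsilon \), updates whenever it meets that target, and passes the achieving point down to level \( n-1 \), which restarts at the received point when doing so improves sufficiently on its own start point. This is only a schematic, sequential round-robin simulation under our own simplifying assumptions; in particular, it ignores the pauses, inboxes, and message transit times that \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) manages explicitly, and the helper make_fom is ours, not the paper's.

```python
def run_restart_hierarchy(x0, f, make_fom, N, eps, max_rounds):
    """Schematic round-robin simulation of the levels fom_{-1}, ..., fom_N.

    make_fom(x) is assumed to return a fresh (unending) iterator of a
    first-order method (re)started at x, yielding one iterate per step.
    Level n's task: from its current update point start[n], find x with
    f(x) <= f(start[n]) - 2**n * eps, then hand x down to level n - 1.
    """
    levels = list(range(-1, N + 1))
    start = {n: x0 for n in levels}              # most recent update point of each level
    iters = {n: make_fom(x0) for n in levels}
    best = x0
    for _ in range(max_rounds):
        for n in reversed(levels):               # higher levels feed lower ones
            x = next(iters[n])
            if f(x) < f(best):
                best = x
            if f(x) <= f(start[n]) - (2.0 ** n) * eps:
                start[n] = x                     # level n updates at x ...
                if n < N:
                    iters[n] = make_fom(x)       # ... and restarts (level N never restarts)
                m = n - 1
                if m >= -1 and f(x) <= f(start[m]) - (2.0 ** m) * eps:
                    start[m] = x                 # level n - 1 restarts at the received point
                    iters[m] = make_fom(x)
    return best
```

For instance, make_fom could be built from the \( \mathtt {accel}\) sketch given after the Notes by turning it into a generator that yields each iterate \( x_{k+1} \), reinitializing \( y_0 = x \) and \( \theta _0 = 1 \) at each restart as described in Note 4.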

Proposition 2

Assume that at time t, \( {\mathtt {fom}}_n \) updates at x satisfying \( f(x) - f^* \ge 2 \cdot 2^n \varepsilon \). Then \( {\mathtt {fom}}_n \) updates again and does so no later than time

$$\begin{aligned} t + m \, t_{\mathrm {pause}}+ T_{{\mathtt {fom}}}( D(f(x)), 2^n \varepsilon ) , \end{aligned}$$
(59)

where m is the number of messages received by \( {\mathtt {fom}}_n \) between the two updates.

Proof

Let \( {\widetilde{m}} \) be the total number of messages received by \( {\mathtt {fom}}_n \) from time zero onward. We know \( {\widetilde{m}} \) is finite. Consequently, at time

$$\begin{aligned} t + {\widetilde{m}} \, t_{\mathrm {pause}}+ T_{{\mathtt {fom}}}( D(f(x)), 2^n \varepsilon ) , \end{aligned}$$
(60)

\( {\mathtt {fom}}_n \) will have devoted at least \( T_{{\mathtt {fom}}}( D(f(x)), 2^n \varepsilon ) \) units of time to computing iterations. Thus, if \( {\mathtt {fom}}_n \) has not updated before time (60), then at that instant, \( {\mathtt {fom}}_n \) will not be in a pause, and will have obtained \( {\bar{x}} \) satisfying

$$\begin{aligned} f( {\bar{x}}) - f^* \le 2^n \varepsilon , \quad \text {and thus } f({\bar{x}}) \le f(x) - 2^n \varepsilon \quad (\text {because } f(x) - f^* \ge 2 \cdot 2^n \varepsilon ). \end{aligned}$$
(61)

Consequently, if \( {\mathtt {fom}}_n \) has not updated by time (60), it will update at that time, at \( {\bar{x}} \).

Having proved that after updating at x, \( {\mathtt {fom}}_n \) will update again, it remains to prove the next update will occur no later than time (59). Of course, the next update occurs at (or soon after) the start of a new epoch. Let \( m_1 \) be the number of messages received by \( {\mathtt {fom}}_n \) after updating at x and before the new epoch begins, and let \( m_2 \) be the number of messages received in the new epoch and before \( {\mathtt {fom}}_n \) updates (i.e., before the restart point is decided). (Thus, from the beginning of the new epoch until the update, \( {\mathtt {fom}}_n \) is contiguously paused for an amount of time not exceeding \( m_2 \, t_{\mathrm {pause}}\).) Clearly, \( m_1 + m_2 = m \).

In view of the preceding observations, to establish the time bound (59) it suffices to show the new epoch begins no later than time

$$\begin{aligned} t + m_1 \, t_{\mathrm {pause}}+ T_{{\mathtt {fom}}}( D(f(x)), 2^n \varepsilon ) . \end{aligned}$$
(62)

However, if the new epoch has not begun prior to time (62) then at that time, \( {\mathtt {fom}}_n \) is not in a pause, and has spent enough time computing iterations so as to have a point \( {\bar{x}} \) satisfying (61), causing a new epoch to begin instantly. \(\square \)

Proposition 3

Assume that at time t, \( {\mathtt {fom}}_n \) updates at x satisfying \( f(x) - f^* \ge 2 \cdot 2^n \varepsilon \). Let \( {\mathbf {j}} := \lfloor \frac{ f(x) - f^* }{ 2^n \varepsilon } \rfloor - 2 \). Then \( {\mathtt {fom}}_n \) updates at a point \( {\bar{x}} \) satisfying \( f( {\bar{x}}) - f^* < 2 \cdot 2^n \varepsilon \) no later than time

$$\begin{aligned} t + m \, t_{\mathrm {pause}}+ \sum _{j=0}^{{\mathbf {j}}} T_{{\mathtt {fom}}}( D( f(x) - j \cdot 2^n \varepsilon ), 2^n \varepsilon ) , \end{aligned}$$

where m is the total number of messages received by \( {\mathtt {fom}}_n \) between the updates at x and \( {\bar{x}} \).

Proof

The proof is essentially identical to the proof of Proposition 1, but relying on Proposition 2 rather than on Lemma 1. \(\square \)

Corollary 11

Assume that at time t, \( {\mathtt {fom}}_n \) updates at x satisfying \( f(x) - f^* < {\mathbf {i}} \cdot 2^n \varepsilon \), where \( {\mathbf {i}} \) is an integer and \( {\mathbf {i}} \ge 3 \). Then \( {\mathtt {fom}}_n \) updates at a point \( {\bar{x}} \) satisfying \( f( {\bar{x}}) - f^* < 2 \cdot 2^n \varepsilon \) no later than time

$$\begin{aligned}&t + m \, t_{\mathrm {pause}}+ \sum _{i=3}^{ {\mathbf {i}}} T_{{\mathtt {fom}}}(D_{n,i}, 2^n \varepsilon ) \nonumber \\&\qquad \text {where} \, \, D_{n,i} = \min \{ D(f^* + i \cdot 2^n \varepsilon ), D(f(x) ) \} , \end{aligned}$$
(63)

and where m is the number of messages received by \( {\mathtt {fom}}_n \) between the updates at x and \( {\bar{x}} \).

Moreover, if \( n > -1 \), then after sending the message to \( {\mathtt {fom}}_{n-1} \) containing the point \( {\bar{x}} \), \( {\mathtt {fom}}_n \) will send at most one further message.

Proof

Except for the final assertion, the proof is essentially identical to the proof of Corollary 6, but relying on Proposition 3 rather than on Proposition 1.

For the final assertion, note that if, after sending \( {\bar{x}} \), \( {\mathtt {fom}}_n \) sent two further points, say first \( x' \) and then \( x'' \), then we would have

$$\begin{aligned} f(x'') \le f(x') - 2^n \varepsilon \le (f( {\bar{x}}) - 2^n \varepsilon ) - 2^n \varepsilon < f^* , \end{aligned}$$

a contradiction. \(\square \)

Corollary 12

Let \( n > -1 \). Assume that at time t, \( {\mathtt {fom}}_{n} \) updates at x satisfying \( f(x) - f^* < 5 \cdot 2^n \varepsilon \), and assume from time t onward, \( {\mathtt {fom}}_n \) receives at most \( {\hat{m}} \) messages. Then for either \( {\hat{m}}' = 0 \) or \( {\hat{m}}' = 1 \), \( {\mathtt {fom}}_{n-1} \) updates at \( x' \) satisfying \( f( x') - f^* < 5 \cdot 2^{n-1} \varepsilon \) no later than time

$$\begin{aligned} t + t_{\mathrm {transit}}+ ( {\hat{m}} + 2 - {\hat{m}}' ) t_{\mathrm {pause}}+ \sum _{i=3}^5 T_{{\mathtt {fom}}}( D_{n,i}, 2^n \varepsilon ) \end{aligned}$$

(where \( D_{n,i} \) is given by (63)), and from that time onward, \( {\mathtt {fom}}_{n-1} \) receives at most \( {\hat{m}}' \) messages.

Proof

By Corollary 11, no later than time

$$\begin{aligned} t + {\hat{m}} t_{\mathrm {pause}}+ \sum _{i=3}^5 T_{{\mathtt {fom}}}( D_{n,i}, 2^n \varepsilon ) , \end{aligned}$$

\( {\mathtt {fom}}_n \) updates at \( {\bar{x}} \) satisfying \( f({\bar{x}}) - f^* < 2 \cdot 2^n \varepsilon \). When \( {\mathtt {fom}}_n \) updates at \( {\bar{x}} \), the point is sent to the inbox of \( {\mathtt {fom}}_{n-1} \), where it arrives no more than \( t_{\mathrm {transit}}\) time units later, causing a pause of \( {\mathtt {fom}}_{n-1} \) to begin immediately. Moreover, following the arrival of \( {\bar{x}} \), at most one additional message will be received by \( {\mathtt {fom}}_{n-1} \) (per the last assertion of Corollary 11).

If during the pause, \( {\mathtt {fom}}_{n-1} \) does not receive an additional message, then at the end of the pause, \( {\mathtt {fom}}_{n-1} \) either updates at \( {\bar{x}} \) or continues its current epoch. The decision on whether to update at \( {\bar{x}} \) is then enacted no later than time

$$\begin{aligned} t + t_{\mathrm {transit}}+ ({\hat{m}} + 1) t_{\mathrm {pause}}+ \sum _{i=3}^5 T_{{\mathtt {fom}}}( D_{n,i}, 2^n \varepsilon ) , \end{aligned}$$
(64)

after which \( {\mathtt {fom}}_{n-1} \) receives at most one message.

On the other hand, if during the pause, an additional message is received, then a second pause immediately begins, and \( {\bar{x}} \) is overwritten by a point \( \bar{{\bar{x}}} \) satisfying \( f(\bar{{\bar{x}}}) \le f( {\bar{x}}) - 2^n \varepsilon < f^* + 2^n \varepsilon \). Here, the decision on whether to update at \( \bar{{\bar{x}}} \) is enacted no later than time

$$\begin{aligned} t + t_{\mathrm {transit}}+ ( {\hat{m}} + 2) t_{\mathrm {pause}}+ \sum _{i=3}^5 T_{{\mathtt {fom}}}( D_{n,i}, 2^n \varepsilon ) , \end{aligned}$$
(65)

after which \( {\mathtt {fom}}_{n-1} \) receives no messages.

If \( {\mathtt {fom}}_{n-1} \) chooses to update at \( {\bar{x}} \) (or \( \bar{{\bar{x}}} \)), the corollary follows immediately from (64) and (65), as the value of f at the update point is then less than \( f^* + 4 \cdot 2^{n-1} \varepsilon \). On the other hand, if \( {\mathtt {fom}}_{n-1} \) chooses not to update, it is only because its most recent update point satisfies \( f(x_{n-1,0}) < f( {\bar{x}}) + 2^{n-1} \varepsilon \) (resp., \( f(x_{n-1,0}) < f( \bar{{\bar{x}}} ) + 2^{n-1} \varepsilon \)), and hence \( f(x_{n-1,0}) < f^* + 5 \cdot 2^{n-1} \varepsilon \). Thus, here as well, the corollary is established. \(\square \)

Corollary 13

For any \( {\bar{N}} \in \{ -1, \ldots , N \} \), assume that at time t, \( {\mathtt {fom}}_{{\bar{N}}} \) updates at x satisfying \( f(x) - f^* < 5 \cdot 2^{{\bar{N}}} \varepsilon \), and assume from time t onward, \( {\mathtt {fom}}_{{\bar{N}}} \) receives at most \( {\hat{m}} \) messages. Then no later than time

$$\begin{aligned} t + ({\bar{N}} +1) t_{\mathrm {transit}}+ ( {\hat{m}} + 2 {\bar{N}} + 2) t_{\mathrm {pause}}+ \sum _{n = -1}^{{\bar{N}}} \sum _{i=3}^5 T_{{\mathtt {fom}}}(D_{n,i}, 2^n \varepsilon ) , \end{aligned}$$

\( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) has computed an \( \varepsilon \)-optimal solution.

Proof

If \( {\bar{N}} = -1 \), Corollary 11 implies that after updating at x, the additional time required by \( {\mathtt {fom}}_{-1} \) to compute a \( (2 \cdot 2^{-1} \varepsilon ) \)-optimal solution \( {\bar{x}} \) is bounded from above by

$$\begin{aligned} {\hat{m}} \, t_{\mathrm {pause}}+ \sum _{i=3}^5 T_{{\mathtt {fom}}}( D_{-1,i}, 2^{-1} \varepsilon ) . \end{aligned}$$
(66)

The present corollary is thus immediate for the case \( {\bar{N}} = -1 \).

On the other hand, if \( {\bar{N}} > - 1 \), induction using Corollary 12 shows that for either \( {\hat{m}}' = 0 \) or \( {\hat{m}}' = 1 \), no later than time

$$\begin{aligned} t + ({\bar{N}}+1) t_{\mathrm {transit}}+ ({\hat{m}} + 2{\bar{N}} + 2 - {\hat{m}}' ) t_{\mathrm {pause}}+ \sum _{n=0}^{{\bar{N}}} \sum _{i=3}^5 T_{{\mathtt {fom}}}( D_{n,i}, 2^n \varepsilon ) , \end{aligned}$$

\( {\mathtt {fom}}_{-1} \) has restarted at x satisfying \( f(x) - f^* < 5 \cdot 2^{-1} \varepsilon \), and from then onwards, \( {\mathtt {fom}}_{-1} \) receives at most \( {\hat{m}}' \) messages. Then by Corollary 11, the additional time required by \( {\mathtt {fom}}_{-1} \) to compute a \( (2 \cdot 2^{-1} \varepsilon ) \)-optimal solution does not exceed (66) with \( {\hat{m}} \) replaced by \( {\hat{m}}' \). The present corollary follows. \(\square \)

We are now in position to prove Theorem 3, which we restate as a corollary.

Corollary 14

If \( f(\mathbf {x_0}) - f^* < 5 \cdot 2^N \varepsilon \), then \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) computes an \( \varepsilon \)-optimal solution within time

$$\begin{aligned}&({\bar{N}}+1) t_{\mathrm {transit}}+ 2({\bar{N}}+2) t_{\mathrm {pause}}+ 3 \sum _{n=-1}^{{\bar{N}}} T_{{\mathtt {fom}}}\left( D_n , 2^n \varepsilon \right) \nonumber \\&\qquad \text {with } \, D_n := \min \{ D(f^* + 5 \cdot 2^n \varepsilon ), D( f( \mathbf {x_0})) \} , \end{aligned}$$
(67)

where \( {\bar{N}} \) is the smallest integer satisfying both \( f(\mathbf {x_0}) - f^* < 5 \cdot 2^{{\bar{N}}} \varepsilon \) and \( {\bar{N}} \ge -1 \).

In any case, \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) computes an \( \varepsilon \)-optimal solution within time

$$\begin{aligned} {\mathbf {T}}_N + T_{{\mathtt {fom}}}\big ( \mathrm {dist}( \mathbf {x_0}, X^*), 2^{N} \varepsilon \big ) , \end{aligned}$$

where \( {\mathbf {T}}_N \) is the quantity obtained by substituting N for \( {\bar{N}} \) in (67).

Proof

Consider the case \( f( \mathbf {x_0}) - f^* < 5 \cdot 2^N \varepsilon \), and let \( {\bar{N}} \) be as in the statement of the theorem, that is, the smallest integer satisfying both \( f( \mathbf {x_0}) - f^* < 5 \cdot 2^{{\bar{N}}} \varepsilon \) and \( {\bar{N}} \ge -1 \).

If \( {\bar{N}} = N \), then \( {\mathtt {fom}}_{{\bar{N}}} \) never receives messages, in which case the time bound (67) is immediate from Corollary 13 with \( t = 0 \) and \( {\hat{m}} = 0 \). On the other hand, if \( {\bar{N}} < N \), then since \( f( \mathbf {x_0}) - f^* < \frac{5}{2} \cdot 2^{{\bar{N}}+1} \varepsilon \), \( {\mathtt {fom}}_{{\bar{N}}+1} \) sends at most two messages from time zero onward. Again (67) is immediate from Corollary 13.

Regardless of whether \( f( \mathbf {x_0}) - f^* < 5 \cdot 2^N \varepsilon \), we know \( {\mathtt {fom}}_{N} \)—which never restarts—computes x satisfying \( f(x) - f^* \le 2^N \varepsilon \) within time

$$\begin{aligned} T_{{\mathtt {fom}}}( \mathrm {dist}( \mathbf {x_0}, X^*), 2^{N}\varepsilon ) . \end{aligned}$$
(68)

If x is not an update point for \( {\mathtt {fom}}_{N} \), then the most recent update point \( {\hat{x}} \) satisfies \( f({\hat{x}}) < f(x) + 2^{N} \varepsilon \le f^* + 2 \cdot 2^N \varepsilon \). Hence, irrespective of whether x is an update point, by the time \( {\mathtt {fom}}_{N} \) has computed x, it has obtained an update point \( x' \) (either \( x' = {\hat{x}} \) or \( x' = x \)) satisfying \( f(x') - f^* < 5 \cdot 2^{N} \varepsilon \). Consequently, the time required for \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) to compute an \( \varepsilon \)-optimal solution does not exceed (68) plus the value (67), where in (67), N is substituted for \( {\bar{N}} \). \(\square \)


About this article


Cite this article

Renegar, J., Grimmer, B. A Simple Nearly Optimal Restart Scheme For Speeding Up First-Order Methods. Found Comput Math 22, 211–256 (2022). https://doi.org/10.1007/s10208-021-09502-2

