Abstract
We present a simple scheme for restarting first-order methods for convex optimization problems. Restarts are made based only on achieving specified decreases in objective values, the specified amounts being the same for all optimization problems. Unlike existing restart schemes, the scheme makes no attempt to learn parameter values characterizing the structure of an optimization problem, nor does it require any special information that would not be available in practice (unless the first-order method chosen to be employed in the scheme itself requires special information). As immediate corollaries to the main theorems, we show that when some well-known first-order methods are employed in the scheme, the resulting complexity bounds are nearly optimal for particular—yet quite general—classes of problems.
Notes
Recall that a vector g is a subgradient of f at x if for all y we have \( f(y) \ge f(x) + \langle g, y-x \rangle \). If f is differentiable at x, then g is precisely the gradient.
\( \mathtt{accel} \):

(0) Input: \( x_0 \in Q \). Initialize: \( y_0 = x_0 \), \( \theta_0 = 1 \), \( k = 0 \).

(1) \( x_{k+1} = P_Q\big( y_k - \frac{1}{L} \nabla f(y_k) \big) \)

(2) \( \theta_{k+1} = \frac{1}{2} \big( 1 + \sqrt{1 + 4 \theta_k^2} \big) \)

(3) \( y_{k+1} = x_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}} (x_{k+1} - x_k) \).
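For concreteness, here is a minimal NumPy sketch of steps (0)–(3). The callables grad_f (returning \( \nabla f \)) and proj_Q (the Euclidean projection onto Q), as well as the iteration budget num_iters, are our own illustrative names, not part of the paper.

```python
import numpy as np

def accel(grad_f, proj_Q, L, x0, num_iters):
    """Sketch of the accel updates (0)-(3): projected accelerated gradient."""
    x = np.array(x0, dtype=float)      # x_k
    y = x.copy()                       # y_k
    theta = 1.0                        # theta_k
    for _ in range(num_iters):
        x_next = proj_Q(y - grad_f(y) / L)                          # step (1)
        theta_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))  # step (2)
        y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)    # step (3)
        x, theta = x_next, theta_next
    return x
```

For unconstrained problems, proj_Q can simply be the identity map.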
We avoid digressing to discuss the principles underlying proximal methods. For readers unfamiliar with the topic, the paper introducing FISTA [3] is a good place to start.
In the case of \( \mathtt {subgrad}_n \), restarting at x simply means computing a subgradient step at x, i.e., \( x_+ = x - \frac{2^n \varepsilon }{\Vert g_k \Vert ^2} g_k \). For \( \mathtt {accel}_n \), restarting is slightly more involved, in part because two sequences of points are computed, \( \{ x_{n,k} \} \) and \( \{ y_{n,k} \} \); the primary sequence is \( \{ x_{n,k} \} \). The method \( \mathtt {accel}_n \) restarts at a point x from a primary sequence (either from \( {\mathtt {fom}}_n \)’s own primary sequence, or a point sent to \( {\mathtt {fom}}_n \) from the primary sequence for \( {\mathtt {fom}}_{n+1} \)). In restarting, \( \mathtt {accel}_n \) begins by “reinitializing,” setting \( x_{n,0} = x \), \( y_{n,0} = x \), \( \theta _0 = 1 \), \( k = 0 \), and then begins iterating.
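As a small illustration of the two restart rules just described (the state container and helper names below are ours, not the paper's):

```python
import numpy as np

def restart_subgrad(x, g, n, eps):
    """subgrad_n: restarting at x is a single subgradient step with step size
    2^n * eps / ||g||^2, where g is a subgradient at x (in the constrained case,
    a projection onto Q would follow, as in the method's usual iteration)."""
    return x - (2.0 ** n * eps / np.dot(g, g)) * g

def restart_accel(state, x):
    """accel_n: reinitialize at x, i.e., x_{n,0} = y_{n,0} = x, theta_0 = 1, k = 0,
    then resume iterating."""
    state["x"] = np.array(x, dtype=float)
    state["y"] = state["x"].copy()
    state["theta"] = 1.0
    state["k"] = 0
```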
This is Example 4.5 in [4], except that here we use the Euclidean norm. To verify correctness of the value \( \alpha = \max _i \Vert a_i \Vert ^2 \), it suffices to show for each x that the largest eigenvalue of the positive semidefinite matrix \( \nabla ^2 f_{\eta }(x) \) does not exceed \( \max _i \Vert a_i \Vert ^2/ \eta \). However, \( \nabla ^2 f_{\eta }(x) = - \frac{1}{\eta } \nabla f_{\eta }(x) \nabla f_{\eta }(x)^T + \frac{1}{ \eta \sum _i \exp ((a_i^T x + b_i)/ \eta ) } \sum _i \exp ((a_i^T x + b_i)/\eta ) \, a_i a_i^T \), and hence the largest eigenvalue of \( \nabla ^2 f_{\eta }(x) \) is no greater than \( 1/ \eta \) times the largest eigenvalue of the matrix \( \frac{1}{\sum _i \exp ((a_i^T x + b_i)/ \eta ) } \sum _i \exp ((a_i^T x + b_i)/\eta ) \, a_i a_i^T \), which is a convex combination of the matrices \( a_i a_i^T \). Consequently, \( \alpha = \max _i \Vert a_i \Vert ^2 \) is indeed an appropriate choice.
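A small numerical sanity check of this eigenvalue bound, under the assumption that \( f_\eta \) is the usual log-sum-exp smoothing \( f_\eta (x) = \eta \ln \sum_i \exp ((a_i^T x + b_i)/\eta ) \) of \( \max_i (a_i^T x + b_i) \), which is the function the displayed Hessian corresponds to; all names and the random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eta = 5, 3, 0.1
A = rng.standard_normal((m, d))   # rows a_i
b = rng.standard_normal(m)
x = rng.standard_normal(d)

z = (A @ x + b) / eta
p = np.exp(z - z.max())
p /= p.sum()                                  # softmax weights (numerically stable)
grad = A.T @ p                                # gradient of f_eta at x
hess = (A.T @ (p[:, None] * A) - np.outer(grad, grad)) / eta   # Hessian formula above

lam_max = np.linalg.eigvalsh(hess).max()
alpha = np.sum(A ** 2, axis=1).max()          # alpha = max_i ||a_i||^2
assert lam_max <= alpha / eta + 1e-9          # the claimed bound holds
```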
In particular, analogous results can be obtained under the weaker assumption that Hölderian growth holds on a smaller set \( \{ x \in Q \mid \mathrm {dist}(x,X^*) \le r \} \) for some \( r > 0 \). The key is that due to convexity of f, for \( {\hat{f}} \ge f^* \) we would then have
$$\begin{aligned} D( {\hat{f}}) \le {\left\{ \begin{array}{ll} \big ( ({\hat{f}} - f^*)/\mu \big )^{1/p} &{} \text {if } {\hat{f}} - f^* \le \mu \, r^p,\\ ({\hat{f}} - f^*)/(\mu \, r^{p-1}) &{} \text {if } {\hat{f}} - f^* \ge \mu \, r^p. \end{array}\right. } \end{aligned}$$

The choice of values \( 2^n \varepsilon \) (\( n = -1, 0, \ldots , N \)) is somewhat arbitrary. Indeed, given any scalar \( \gamma > 1 \), everywhere replacing \( 2^n \varepsilon \) by \( \gamma ^n \varepsilon \) leads to similar theoretical results. Our analysis shows that relying on powers of 2 suffices to give nearly optimal guarantees, although a more refined analysis, in the spirit of [33], might reveal the desirability of relying on values of \( \gamma \) other than 2, depending on the setting.
The synchronous parallel scheme is inappropriate for first-order methods whose iterations vary widely in the amount of work they require; for such methods, the asynchronous parallel scheme (Sect. 7) is appropriate.
The sequence of contiguous pauses is finite, because (1) \( {\mathtt {fom}}_{n+1} \) sends a message only when it has obtained x satisfying \( f(x) \le f(x_{n+1,0}) - 2^{n+1} \varepsilon \), and (2) we assume \( f^* \) is finite.
We assume the listed chores are accomplished instantly by \( {\mathtt {fom}}_n \). Assuming positive time is required would have negligible effect on the complexity results, but would make notation more cumbersome.
For that setting, see [32, Appendix A] for a slight extension to Nesterov’s arguments that suffice to remove the assumption.
This work can easily be found online, although it was never published (presumably a consequence of Tseng’s disappearance).
In the proof of [3, Theorem 4.4], \( x^* \) designates any optimal solution. However, no properties of optimality are relied upon other than \( f(x^*) = f^* \) (including not being relied upon in [3, Lemma 4.1], to which the proof refers). Instead choose \( x^* \) to be the point in \( X(\varepsilon /2) \) which is nearest to \( x_0 \), and everywhere replace \( f^* \) by \( f^* + \varepsilon / 2 \).
Also make the trivial observation that the conclusion to [3, Lemma 4.2] remains valid even if the scalars \( \alpha _i \) are not sign restricted, so long as all of the scalars \( \beta _i \) are nonnegative (whereas the lemma as stated in the paper starts with the assumption that all of these scalars are positive).
With these changes, the proof of [3, Theorem 4.4] shows that the iterates \( x_k \) of \( \mathtt {fista}\) satisfy
$$\begin{aligned} f(x_k) - (f^* + \varepsilon /2) \le \frac{2 L \, \mathrm {dist}(x_0, X(\varepsilon /2))^2 }{(k+1)^2} . \end{aligned}$$

Thus, to obtain an \( \varepsilon \)-optimal solution, (56) iterations suffice.
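(For reference, making the right-hand side at most \( \varepsilon /2 \) requires \( (k+1)^2 \ge 4 L \, \mathrm {dist}(x_0, X(\varepsilon /2))^2 / \varepsilon \), i.e., \( k \ge 2 \sqrt{L/\varepsilon } \, \mathrm {dist}(x_0, X(\varepsilon /2)) - 1 \), consistent with the order of the iteration bound (56).)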
The number of messages sent by \( {\mathtt {fom}}_{n+1} \) is finite because (1) \( {\mathtt {fom}}_{n+1} \) sends a message only when it has obtained x satisfying \( f(x) \le f(x_{n+1,0}) - 2^{n+1} \varepsilon \), where \( x_{n+1,0} \) is the most recent (re)start point, and (2) we assume \( f^* \) is finite.
References
1. Bauschke, H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research 42(2), 330–348 (2017). https://doi.org/10.1287/moor.2016.0817
2. Bauschke, H., Borwein, J., Combettes, P.: Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization 42(2), 596–636 (2003)
3. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
4. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM Journal on Optimization 22(2), 557–580 (2012)
5. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization 17(4), 1205–1223 (2007)
6. Bolte, J., Nguyen, T., Peypouquet, J., Suter, B.: From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165(2), 471–507 (2017)
7. Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate. arXiv preprint arXiv:1609.07358 (2016)
8. Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA Journal of Numerical Analysis 39(4), 2069–2095 (2019)
9. Gilpin, A., Peña, J., Sandholm, T.: First-order algorithm with \( \mathcal {O}(\ln (1/\epsilon )) \) convergence for \( \epsilon \)-equilibrium in two-person zero-sum games. Mathematical Programming 133, 279–298 (2010)
10. Giselsson, P., Boyd, S.: Monotonicity and restart in fast gradient methods. In: 53rd IEEE Conference on Decision and Control (CDC), pp. 5058–5063. IEEE (2014)
11. Goffin, J.: On convergence rates of subgradient optimization methods. Mathematical Programming 13(1), 329–347 (1977)
12. Iouditski, A., Nesterov, Y.: Primal-dual subgradient methods for minimizing uniformly convex functions. arXiv preprint arXiv:1401.1792 (2014)
13. Johnstone, P., Moulin, P.: Faster subgradient methods for functions with Hölderian growth. Mathematical Programming 180(1), 417–450 (2020)
14. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer (2016)
15. Lin, Q., Xiao, L.: An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. In: International Conference on Machine Learning, pp. 73–81 (2014)
16. Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles 117, 87–89 (1963)
17. Łojasiewicz, S.: Sur la géométrie semi- et sous-analytique. Annales de l'institut Fourier 43, 1575–1595 (1993)
18. Lu, H., Freund, R., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization 28(1), 333–354 (2018). https://doi.org/10.1137/16M1099546
19. Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, pp. 1–39 (2016)
20. Nemirovski, A., Nesterov, Y.: Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics 25(2), 21–30 (1985)
21. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley (1983)
22. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \( O(1/k^2) \). Soviet Mathematics Doklady 27(2), 372–376 (1983)
23. Nesterov, Y.: Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127–152 (2005)
24. Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Mathematical Programming 110(2), 245–259 (2007)
25. Nesterov, Y.: Gradient methods for minimizing composite functions. Mathematical Programming 140(1), 125–161 (2013)
26. Nesterov, Y.: Universal gradient methods for convex optimization problems. Mathematical Programming 152(1–2), 381–404 (2015)
27. O'Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics 15(3), 715–732 (2015)
28. Polyak, B.: Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics 3(4), 864–878 (1963)
29. Polyak, B.: Subgradient methods: a survey of Soviet research. In: Nonsmooth Optimization: Proceedings of the IIASA Workshop, pp. 5–30 (1977)
30. Polyak, B.: Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software (1987)
31. Renegar, J.: “Efficient” subgradient methods for general convex optimization. SIAM Journal on Optimization 26, 2649–2676 (2016)
32. Renegar, J.: Accelerated first-order methods for hyperbolic programming. Mathematical Programming, pp. 1–35 (2017)
33. Roulet, V., d'Aspremont, A.: Sharpness, restart, and acceleration. SIAM Journal on Optimization 30(1), 262–289 (2020)
34. Shor, N.: Minimization Methods for Non-Differentiable Functions. Springer (1985)
35. Teboulle, M.: A simplified view of first order methods for optimization. Mathematical Programming 170(1), 67–96 (2018)
36. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Unpublished manuscript, submitted to SIAM Journal on Optimization (2008)
37. Yang, T.: Adaptive accelerated gradient converging methods under Hölderian error bound condition. In: 31st Conference on Neural Information Processing Systems (2017)
38. Yang, T., Lin, Q.: RSG: Beating subgradient method without smoothness and strong convexity. Journal of Machine Learning Research 19(1), 236–268 (2018)
Acknowledgements
We are indebted to the reviewers, whose comments and suggestions led to a wide range of substantial improvements in the presentation.
Communicated by Michael Overton.
Research supported in part by NSF grants CCF-1552518 and DMS-1812904, and by the NSF Graduate Research Fellowship under grant DGE-1650441. A portion of the initial development was done while the authors were visiting the Simons Institute for the Theory of Computing, and was partially supported by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF Grant CCF-1740425.
Appendices
Complexity of Subgradient Method
Here, we record the simple proof of the complexity bound for \( \mathtt {subgrad}\) introduced in Sect. 2.1, namely, that if f is M-Lipschitz on Q, then for the function
the iterates \( x_{k+1} = P_Q ( x_k - \frac{\varepsilon }{ \Vert g_k \Vert ^2} g_k ) \) satisfy
Indeed, letting \( x^* \) be the closest point in \( X^* \) to \( x_k \), and defining \( \alpha _k = \varepsilon /\Vert g_k \Vert ^2 \), we have
the final inequality due to \( \alpha _k = \varepsilon / \Vert g_k \Vert ^2 \), \( \Vert g_k \Vert \le M \) and the definition of \( x^* \). Thus,
Consequently, by induction, if \( x_{k+1} \) is the first iterate satisfying \( f(x_{k+1}) - f^* \le \varepsilon \), then
implying the function \( K_{\mathtt {subgrad}} \) defined by (54) satisfies (55).
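A minimal sketch of this iteration; the callables subgradient_f and proj_Q, and the iteration budget num_iters, are our own illustrative names.

```python
import numpy as np

def subgrad(subgradient_f, proj_Q, x0, eps, num_iters):
    """Projected subgradient method with step size eps / ||g_k||^2, as analyzed above."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        g = subgradient_f(x)            # any subgradient g_k of f at x_k
        gg = float(np.dot(g, g))
        if gg == 0.0:                   # zero subgradient: x is already optimal
            break
        x = proj_Q(x - (eps / gg) * g)  # x_{k+1} = P_Q(x_k - (eps/||g_k||^2) g_k)
    return x
```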
Smoothing
Here, we present the first-order method \( \mathtt {smooth}\) which is applicable when the objective function has a smoothing. We also justify the claim that \( K_{\mathtt {smooth}}(\delta , \varepsilon ) \), as defined by (10), provides an appropriate bound for the number of iterations.
We begin with an observation regarding the function \( K_{{\mathtt {fom}}}(\delta , \varepsilon ) \) for bounding the number of iterations of \( {\mathtt {fom}}\) in terms of the desired accuracy \( \varepsilon \) and the distance from the initial iterate to \( X^* \), the set of optimal solutions: Occasionally in analyses of first-order methods, the value \( K_{{\mathtt {fom}}}( \delta _{\varepsilon }, \varepsilon ) \) is shown to be an upper bound, where \( \delta _{\varepsilon } \) is the distance from the initial iterate to \( X(\varepsilon ) := \{ x \in Q \mid f(x) \le f^* + \varepsilon \} \). Since \( \delta _{\varepsilon } < \delta \), the latter upper bound is tighter. Particularly notable in focusing on \( \mathrm {dist}(x_0,X(\varepsilon )) \) rather than on \( \mathrm {dist}(x_0,X^*) \) is the masterpiece [36] of Paul Tseng.
In many other cases, even though the analysis is done in terms of \( \mathrm {dist}(x_0,X^*) \), easy modifications yield an iteration bound in terms of \( \varepsilon \) (the desired accuracy) and \( \mathrm {dist}(x_0, X( \varepsilon ')) \) for \( \varepsilon ' \) satisfying \( 0< \varepsilon ' < \varepsilon \). This happens to be the case for the analysis of \( \mathtt {fista}\) in [3] (a special case of which is \( \mathtt {accel}\), which we rely upon for \( \mathtt {smooth}\)).
In particular, small modifications to the proof of [3, Theorem 4.4] show that \( \mathtt {fista}\) computes an \( \varepsilon \)-optimal solution within a number of iterations not exceeding
We provide specifics in a footnote.
Recall the following definition:
An “\( (\alpha , \beta ) \)-smoothing of f” is a family of functions \( f_{\eta } \) parameterized by \( \eta > 0 \), where for each \( \eta \), the function \( f_{\eta } \) is smooth on a neighborhood of Q and satisfies

1. \( \Vert \nabla f_{\eta }(x) - \nabla f_{\eta }(y) \Vert \le \frac{\alpha }{\eta } \Vert x - y \Vert , \quad \forall x,y \in Q \);
2. \( f(x) \le f_{\eta }(x) \le f(x) + \beta \eta , \quad \forall x \in Q \).
To make use of an \( (\alpha , \beta ) \)-smoothing of a nonsmooth objective f when the goal is to obtain an \( \varepsilon \)-optimal solution (for f), we rely on \( \mathtt {accel}\) (or, more generally, \( \mathtt {fista}\)) applied to \( f_{\eta } \) with \( \eta \) chosen appropriately. In particular, let the algorithm \( \mathtt {smooth}(\varepsilon ) \) for obtaining an \( \varepsilon \)-optimal solution of f be precisely the algorithm \( \mathtt {accel}(2\varepsilon /3) \) applied to obtaining a \( (2\varepsilon /3) \)-optimal solution of \( f_{\varepsilon /3\beta } \) (i.e., \( \eta = \varepsilon /(3\beta ) \)). Note that the Lipschitz constant \( L = \alpha /\eta \) is available for \( \mathtt {accel}\) to use in computing iterates (assuming \( \alpha \) is known, as it is for our example in Sect. 2.1, along with many other examples [4]).
To understand these choices of parameters, first observe that by definition, \( \mathtt {accel}(2\varepsilon /3 ) \) is guaranteed—for whatever smooth function it is applied to—to compute an iterate which is a \( (2\varepsilon /3 ) \)-optimal solution. Let \( x_k \) be such an iterate when the function is \( f_{\varepsilon /3 \beta } \), that is,
Now observe that letting \( x^* \) be a minimizer of f, we have
Substituting this and \( f(x_k) \le f_{\varepsilon /3 \beta }(x_k) \) into (57) gives, after rearranging, \( f(x_k) - f^* \le \varepsilon \), that is, the method \( \mathtt {smooth}(\varepsilon ) \) will have computed an \( \varepsilon \)-optimal solution for f.
Consequently, the number of iterations required by \( \mathtt {smooth}\) to obtain an \( \varepsilon \)-optimal solution for f is no greater than the number of iterations for \( \mathtt {accel}\) to compute a \( (2\varepsilon /3) \)-optimal solution of \( f_{\varepsilon /3 \beta } \), and hence, per the discussion giving the iteration bound (56), does not exceed
where \( X_{\varepsilon /3 \beta }(\varepsilon /3) = \{ x \in Q \mid f_{\varepsilon /3 \beta }(x) - f^*_{ \varepsilon /3 \beta } \le \varepsilon /3 \} \), and where we have used the fact that the Lipschitz constant of \( \nabla f_{\varepsilon /3 \beta } \) is \( \alpha / ( \varepsilon /(3 \beta ) ) = 3 \alpha \beta / \varepsilon \). We claim, however, that \( X^* \subseteq X_{\varepsilon /3 \beta }( \varepsilon /3) \), which with (58) results in the iteration bound
for \( \mathtt {smooth}\) to compute an \( \varepsilon \)-optimal solution for f, where \( \delta \) is the distance from the initial iterate to \( X^* \), the set of optimal solutions for f.
Finally, to verify the claim, letting \( x^* \) be any optimal point for f, we have
as desired.
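To summarize the construction concretely, here is a minimal sketch of \( \mathtt {smooth}(\varepsilon ) \), assuming a callable grad_smoothed(x, eta) that returns \( \nabla f_{\eta }(x) \) and the accel sketch given earlier; the names and the fixed iteration budget are illustrative choices of ours.

```python
def smooth(grad_smoothed, proj_Q, alpha, beta, x0, eps, num_iters):
    """smooth(eps): run accel on f_eta with eta = eps/(3*beta), using the Lipschitz
    constant L = alpha/eta = 3*alpha*beta/eps, to seek a (2*eps/3)-optimal solution
    of f_eta, which is then eps-optimal for f per the argument above."""
    eta = eps / (3.0 * beta)
    L = alpha / eta
    return accel(lambda y: grad_smoothed(y, eta), proj_Q, L, x0, num_iters)
```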
Proof of Theorem 3
The proof proceeds through a sequence of results analogous to the sequence in the proof of Theorem 2, although the bookkeeping is now more pronounced. As in the proof of Theorem 2, we refer to \( {\mathtt {fom}}_n \) “updating” at a point x. The starting point \( \mathbf {x_0} \) is considered to be the first update point for every \( {\mathtt {fom}}_n \). After \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) has started, then for \( n < N \), updating at x means the same as restarting at x, whereas for \( {\mathtt {fom}}_{N} \), updating at x means having computed an iterate x satisfying the current task of \( {\mathtt {fom}}_{N} \) (in which case the point is sent to \( {\mathtt {fom}}_{N-1} \), even though \( {\mathtt {fom}}_{N} \) does not restart).
Keep in mind that when a new epoch for \( {\mathtt {fom}}_n \) begins, if a message from \( {\mathtt {fom}}_{n+1} \) arrives in the inbox exactly at that moment, then the restart point will not be decided immediately, because \( {\mathtt {fom}}_n \) is made to pause, possibly contiguously. In any case, the restart point is decided after only a finite amount of time, because \( {\mathtt {fom}}_{n+1} \) sends at most finitely many messages in total.
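To make the bookkeeping easier to track, the following sketch records the two decision rules used throughout this appendix, as we read them from the statements above (restart at a received point only if it improves on the most recent update point by \( 2^n \varepsilon \); update, and send to \( {\mathtt {fom}}_{n-1} \) when \( n > -1 \), upon achieving that decrease). This is only an illustration, not the full specification of \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\); the attribute and helper names are ours.

```python
def on_message(fom, x_bar, f):
    """fom_n (n < N), after the t_pause pause triggered by x_bar arriving from fom_{n+1}:
    restart only if x_bar improves on the most recent update point x_{n,0} by 2^n * eps."""
    if f(x_bar) <= f(fom.update_point) - 2 ** fom.n * fom.eps:
        fom.restart(x_bar)        # x_bar becomes the new x_{n,0}
    # otherwise the current epoch continues unchanged

def on_own_iterate(fom, x, f):
    """fom_n, upon computing an iterate x achieving the target decrease from x_{n,0}:
    it updates at x (restarting, unless n = N) and, if n > -1, sends x to fom_{n-1}."""
    if f(x) <= f(fom.update_point) - 2 ** fom.n * fom.eps:
        fom.update_point = x
        if fom.n < fom.N:
            fom.restart(x)
        if fom.n > -1:
            fom.send_to(fom.n - 1, x)
```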
Proposition 2
Assume that at time t, \( {\mathtt {fom}}_n \) updates at x satisfying \( f(x) - f^* \ge 2 \cdot 2^n \varepsilon \). Then \( {\mathtt {fom}}_n \) updates again and does so no later than time
where m is the number of messages received by \( {\mathtt {fom}}_n \) between the two updates.
Proof
Let \( {\widetilde{m}} \) be the total number of messages received by \( {\mathtt {fom}}_n \) from time zero onward. We know \( {\widetilde{m}} \) is finite. Consequently, at time
\( {\mathtt {fom}}_n \) will have devoted at least \( T_{{\mathtt {fom}}}( D(f(x)), 2^n \varepsilon ) \) units of time to computing iterations. Thus, if \( {\mathtt {fom}}_n \) has not updated before time (60), then at that instant, \( {\mathtt {fom}}_n \) will not be in a pause, and will have obtained \( {\bar{x}} \) satisfying
Consequently, if \( {\mathtt {fom}}_n \) has not updated by time (60), it will update at that time, at \( {\bar{x}} \).
Having proved that after updating at x, \( {\mathtt {fom}}_n \) will update again, it remains to prove the next update will occur no later than time (59). Of course, the next update occurs at (or soon after) the start of a new epoch. Let \( m_1 \) be the number of messages received by \( {\mathtt {fom}}_n \) after updating at x and before the new epoch begins, and let \( m_2 \) be the number of messages received in the new epoch and before \( {\mathtt {fom}}_n \) updates (i.e., before the restart point is decided). (Thus, from the beginning of the new epoch until the update, \( {\mathtt {fom}}_n \) is contiguously paused for an amount of time not exceeding \( m_2 \, t_{\mathrm {pause}}\).) Clearly, \( m_1 + m_2 = m \).
In view of the preceding observations, to establish the time bound (59) it suffices to show the new epoch begins no later than time
However, if the new epoch has not begun prior to time (62) then at that time, \( {\mathtt {fom}}_n \) is not in a pause, and has spent enough time computing iterations so as to have a point \( {\bar{x}} \) satisfying (61), causing a new epoch to begin instantly. \(\square \)
Proposition 3
Assume that at time t, \( {\mathtt {fom}}_n \) updates at x satisfying \( f(x) - f^* \ge 2 \cdot 2^n \varepsilon \). Let \( {\mathbf {j}} := \lfloor \frac{ f(x) - f^* }{ 2^n \varepsilon } \rfloor - 2 \). Then \( {\mathtt {fom}}_n \) updates at a point \( {\bar{x}} \) satisfying \( f( {\bar{x}}) - f^* < 2 \cdot 2^n \varepsilon \) no later than time
where m is the total number of messages received by \( {\mathtt {fom}}_n \) between the updates at x and \( {\bar{x}} \).
Proof
The proof is essentially identical to the proof of Proposition 1, but relying on Proposition 2 rather than on Lemma 1. \(\square \)
Corollary 11
Assume that at time t, \( {\mathtt {fom}}_n \) updates at x satisfying \( f(x) - f^* < {\mathbf {i}} \cdot 2^n \varepsilon \), where \( {\mathbf {i}} \) is an integer and \( {\mathbf {i}} \ge 3 \). Then \( {\mathtt {fom}}_n \) updates at a point \( {\bar{x}} \) satisfying \( f( {\bar{x}}) - f^* < 2 \cdot 2^n \varepsilon \) no later than time
and where m is the number of messages received by \( {\mathtt {fom}}_n \) between the updates at x and \( {\bar{x}} \).
Moreover, if \( n > -1 \), then after sending the message to \( {\mathtt {fom}}_{n-1} \) containing the point \( {\bar{x}} \), \( {\mathtt {fom}}_n \) will send at most one further message.
Proof
Except for the final assertion, the proof is essentially identical to the proof of Corollary 6, but relying on Proposition 3 rather than on Proposition 1.
For the final assertion, note that if \( {\mathtt {fom}}_n \) sent two points after sending \( {\bar{x}} \), say first \( x' \) and then \( x'' \), we would have
a contradiction. \(\square \)
Corollary 12
Let \( n > -1 \). Assume that at time t, \( {\mathtt {fom}}_{n} \) updates at x satisfying \( f(x) - f^* < 5 \cdot 2^n \varepsilon \), and assume from time t onward, \( {\mathtt {fom}}_n \) receives at most \( {\hat{m}} \) messages. Then for either \( {\hat{m}}' = 0 \) or \( {\hat{m}}' = 1 \), \( {\mathtt {fom}}_{n-1} \) updates at \( x' \) satisfying \( f( x') - f^* < 5 \cdot 2^{n-1} \varepsilon \) no later than time
(where \( D_{n,i} \) is given by (63)), and from that time onward, \( {\mathtt {fom}}_{n-1} \) receives at most \( {\hat{m}}' \) messages.
Proof
By Corollary 11, no later than time
\( {\mathtt {fom}}_n \) updates at \( {\bar{x}} \) satisfying \( f({\bar{x}}) - f^* < 2 \cdot 2^n \varepsilon \). When \( {\mathtt {fom}}_n \) updates at \( {\bar{x}} \), the point is sent to the inbox of \( {\mathtt {fom}}_{n-1} \), where it arrives no more than \( t_{\mathrm {transit}}\) time units later, causing a pause of \( {\mathtt {fom}}_{n-1} \) to begin immediately. Moreover, following the arrival of \( {\bar{x}} \), at most one additional message will be received by \( {\mathtt {fom}}_{n-1} \) (per the last assertion of Corollary 11).
If during the pause, \( {\mathtt {fom}}_{n-1} \) does not receive an additional message, then at the end of the pause, \( {\mathtt {fom}}_{n-1} \) either updates at \( {\bar{x}} \) or continues its current epoch. The decision on whether to update at \( {\bar{x}} \) is then enacted no later than time
after which \( {\mathtt {fom}}_{n-1} \) receives at most one message.
On the other hand, if during the pause, an additional message is received, then a second pause immediately begins, and \( {\bar{x}} \) is overwritten by a point \( \bar{{\bar{x}}} \) satisfying \( f(\bar{{\bar{x}}}) \le f( {\bar{x}}) - 2^n \varepsilon < f^* + 2^n \varepsilon \). Here, the decision on whether to update at \( \bar{{\bar{x}}} \) is enacted no later than time
after which \( {\mathtt {fom}}_{n-1} \) receives no messages.
If \( {\mathtt {fom}}_{n-1} \) chooses to update at \( {\bar{x}} \) (or \( \bar{{\bar{x}}} \)), the corollary follows immediately from (64) and (65), as the value of f at the update point is then less than \( f^* + 4 \cdot 2^{n-1} \varepsilon \). On the other hand, if \( {\mathtt {fom}}_{n-1} \) chooses not to update, it is only because its most recent update point satisfies \( f(x_{n-1,0}) < f( {\bar{x}}) + 2^{n-1} \varepsilon \) (resp., \( f(x_{n-1,0}) < f( \bar{{\bar{x}}} ) + 2^{n-1} \varepsilon \)), and hence \( f(x_{n-1,0}) < f^* + 5 \cdot 2^{n-1} \varepsilon \). Thus, here as well, the corollary is established. \(\square \)
Corollary 13
For any \( {\bar{N}} \in \{ -1, \ldots , N \} \), assume that at time t, \( {\mathtt {fom}}_{{\bar{N}}} \) updates at x satisfying \( f(x) - f^* < 5 \cdot 2^{{\bar{N}}} \varepsilon \), and assume from time t onward, \( {\mathtt {fom}}_{{\bar{N}}} \) receives at most \( {\hat{m}} \) messages. Then no later than time
\( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) has computed an \( \varepsilon \)-optimal solution.
Proof
If \( {\bar{N}} = -1 \), Corollary 11 implies that after updating at x, the additional time required by \( {\mathtt {fom}}_{-1} \) to compute a \( (2 \cdot 2^{-1} \varepsilon ) \)-optimal solution \( {\bar{x}} \) is bounded from above by
The present corollary is thus immediate for the case \( {\bar{N}} = -1 \).
On the other hand, if \( {\bar{N}} > - 1 \), induction using Corollary 12 shows that for either \( {\hat{m}}' = 0 \) or \( {\hat{m}}' = 1 \), no later than time
\( {\mathtt {fom}}_{-1} \) has restarted at x satisfying \( f(x) - f^* < 5 \cdot 2^{-1} \varepsilon \), and from then onwards, \( {\mathtt {fom}}_{-1} \) receives at most \( {\hat{m}}' \) messages. Then by Corollary 11, the additional time required by \( {\mathtt {fom}}_{-1} \) to compute a \( (2 \cdot 2^{-1} \varepsilon ) \)-optimal solution does not exceed (66) with \( {\hat{m}} \) replaced by \( {\hat{m}}' \). The present corollary follows. \(\square \)
We are now in position to prove Theorem 3, which we restate as a corollary.
Corollary 14
If \( f(\mathbf {x_0}) - f^* < 5 \cdot 2^N \varepsilon \), then \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) computes an \( \varepsilon \)-optimal solution within time
where \( {\bar{N}} \) is the smallest integer satisfying both \( f(\mathbf {x_0}) - f^* < 5 \cdot 2^{{\bar{N}}} \varepsilon \) and \( {\bar{N}} \ge -1 \).
In any case, \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) computes an \( \varepsilon \)-optimal solution within time
where \( {\mathbf {T}}_N \) is the quantity obtained by substituting N for \( {\bar{N}} \) in (67).
Proof
Consider the case \( f( \mathbf {x_0}) - f^* < 5 \cdot 2^N \varepsilon \), and let \( {\bar{N}} \) be as in the statement of the theorem, that is, the smallest integer satisfying both \( f( \mathbf {x_0}) - f^* < 5 \cdot 2^{{\bar{N}}} \varepsilon \) and \( {\bar{N}} \ge -1 \).
If \( {\bar{N}} = N \), then \( {\mathtt {fom}}_{{\bar{N}}} \) never receives messages, in which case the time bound (67) is immediate from Corollary 13 with \( t = 0 \) and \( {\hat{m}} = 0 \). On the other hand, if \( {\bar{N}} < N \), then since \( f( \mathbf {x_0}) - f^* < \frac{5}{2} \cdot 2^{{\bar{N}}+1} \varepsilon \), \( {\mathtt {fom}}_{{\bar{N}}+1} \) sends at most two messages from time zero onward. Again (67) is immediate from Corollary 13.
Regardless of whether \( f( \mathbf {x_0}) - f^* < 5 \cdot 2^N \varepsilon \), we know \( {\mathtt {fom}}_{N} \)—which never restarts—computes x satisfying \( f(x) - f^* \le 2^N \varepsilon \) within time
If x is not an update point for \( {\mathtt {fom}}_{N} \), then the most recent update point \( {\hat{x}} \) satisfies \( f({\hat{x}}) < f(x) + 2^{N} \varepsilon \le f^* + 2 \cdot 2^N \varepsilon \). Hence, irrespective of whether x is an update point, by the time \( {\mathtt {fom}}_{N} \) has computed x, it has obtained an update point \( x' \) (either \( x' = {\hat{x}} \) or \( x' = x \)) satisfying \( f(x') - f^* < 5 \cdot 2^{N} \varepsilon \). Consequently, the time required for \( \mathtt {Async} \negthinspace \negthinspace \parallel \negthinspace \negthinspace \mathtt {FOM}\) to compute an \( \varepsilon \)-optimal solution does not exceed (68) plus the value (67), where in (67), N is substituted for \( {\bar{N}} \). \(\square \)