Abstract
We propose new restarting strategies for the accelerated coordinate descent method. Our main contribution is to show that, for a well chosen sequence of restarting times, the restarted method has a nearly geometric rate of convergence. A major feature of the method is that it can exploit the local quadratic error bound of the objective function without knowing the actual value of the error bound. We also show that under the more restrictive assumption that the objective function is strongly convex, any fixed restart period leads to a geometric rate of convergence. Finally, we illustrate the properties of the algorithm on a regularized logistic regression problem and on a Lasso problem.
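As an illustration of the simplest variant mentioned above (a fixed restart period, which under strong convexity already yields a geometric rate), here is a minimal sketch of the generic restart template. The routine accelerated_cd is a hypothetical black box standing in for the accelerated coordinate descent method; the paper's adaptive choice of restarting times is not reproduced here.

    def restarted_acd(F, accelerated_cd, x0, restart_period, n_restarts):
        """Fixed-period restart: run the accelerated method for a fixed
        number of iterations, restart it from the point it reached
        (resetting the momentum), and repeat."""
        x = x0
        for _ in range(n_restarts):
            # each call starts from x with freshly reset extrapolation/momentum
            x = accelerated_cd(F, x, restart_period)
        return x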
Notes
In [3], the \(L^1\) norm is rewritten using a matrix with \(2^n\) rows.
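Presumably this refers to the polyhedral representation of the \(L^1\) norm as a maximum over all sign vectors,
\[
\Vert x\Vert _1 = \max _{s \in \{-1,1\}^n} \langle s, x\rangle ,
\]
i.e. \(\Vert x\Vert _1 = \max _i (Ax)_i\) where the rows of \(A\) are the \(2^n\) sign vectors.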
References
Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM (2017)
Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In: International Conference on Machine Learning, pp. 1080–1089 (2016)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017)
Beck, A., Shtern, S.: Linearly convergent away-step conditional gradient for non-strongly convex functions. Math. Program. 164(1), 1–27 (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)
Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)
Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv preprint arXiv:1609.07358
Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition (2017). arXiv preprint arXiv:1709.02300
Fercoq, O., Richtárik, P.: Smooth minimization of nonsmooth functions by parallel coordinate descent (2013). Preprint arXiv:1309.5885
Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Gasnikov, A.V., Dvurechensky, P.E.: Stochastic intermediate gradient method for convex optimization problems. Dokl. Math. 93(2), 148–151 (2016)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4), 2061–2089 (2013)
Johnson, T., Guestrin, C.: Blitz: a principled meta-algorithm for scaling sparse optimization. In: International Conference on Machine Learning, pp. 1171–1179 (2015)
Klatte, D., Thiere, G.: Error bounds for solutions of linear equations and inequalities. Z. Oper. Res. 41(2), 191–214 (1995)
Lin, Q., Lu, Z., Xiao, L.: An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4), 2244–2273 (2015)
Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 3384–3392. Curran Associates, Inc., Red Hook (2015)
Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS ’13, pp. 147–156, Washington, DC, USA. IEEE Computer Society (2013)
Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993)
Lin, Q., Xiao, L.: An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Comput. Optim. Appl. 60(3), 633–674 (2015)
Liu, M., Yang, T.: Adaptive accelerated gradient converging method under holderian error bound condition. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 3104–3114. Curran Associates, Inc., Red Hook (2017)
Murata, T., Suzuki, T.: Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In: Advances in Neural Information Processing Systems, pp. 608–617 (2017)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \({O}(1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175, 69–107 (2018)
O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)
Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optim. Methods Softw. 31(5), 858–884 (2016)
Roulet, V., D’Aspremont, A.: Sharpness, restart and acceleration. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 1119–1129. Curran Associates, Inc., Red Hook (2017)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)
Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. Technical report, University of Washington, Seattle (2008)
Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)
Additional information
The first author’s work was supported by the EPSRC Grant EP/K02325X/1 Accelerated Coordinate Descent Methods for Big Data Optimization, the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council), the Orange/Telecom ParisTech think tank Phi-TAB, the ANR Grant ANR-11-LABX-0056-LMH, LabEx LMH as part of the Investissement d’avenir project. The second author’s work was supported by Hong Kong Research Grant Council 27302016. This research was partially conducted using the HKU Information Technology Services research computing facilities that are supported in part by the Hong Kong UGC Special Equipment Grant (SEG HKU09).
Proof of Lemma 1 and Theorem 1
Proof of Lemma 1
Equation (11) holds because \(\theta _{k+1}\) is the unique positive root of the polynomial \(P(X) = X^2 + \theta _k^2 X - \theta _k^2\).
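Indeed, \(P(\theta _{k+1}) = 0\) rearranges to
\[
\theta _{k+1}^2 = \theta _k^2\,(1-\theta _{k+1}), \qquad \text{equivalently}\qquad \frac{1-\theta _{k+1}}{\theta _{k+1}^2} = \frac{1}{\theta _k^2},
\]
which is presumably the content of (11); it is the relation \(\theta _{l-1}^2 = \theta _l^2/(1-\theta _l)\) invoked repeatedly in the proof of Theorem 1 below.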
Let us prove (10) by induction. It is clear that \(\theta _0 \le \frac{2}{0+2/\theta _0}\). Assume that \(\theta _k \le \frac{2}{k+2/\theta _0}\). We know that \(P(\theta _{k+1}) = 0\) and that \(P\) is an increasing function on \([0, +\infty )\). So we just need to show that \(P\big (\frac{2}{k+1+2/\theta _0}\big )\ge 0\).
As \(\theta _k \le \frac{2}{k+2/\theta _0}\) and \(\frac{2}{k+1+2/\theta _0}-1 \le 0\), this nonnegativity follows from the computation sketched below.
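A sketch of this computation, writing \(a = k+2/\theta _0\) and \(t = \frac{2}{k+1+2/\theta _0} = \frac{2}{a+1}\), and using only \(t-1\le 0\) together with the induction hypothesis \(\theta _k \le 2/a\):
\[
P(t) = t^2 + (t-1)\,\theta _k^2 \;\ge \; t^2 + (t-1)\Big (\frac{2}{a}\Big )^2 = \frac{4}{(a+1)^2} + \frac{1-a}{a+1}\cdot \frac{4}{a^2} = \frac{4}{a^2(a+1)^2} \;\ge \; 0 .
\]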
For the other inequality, note first that \(\frac{2-\theta _0}{0+(2-\theta _0)/\theta _0} \le \theta _0\). We now assume that \(\theta _k \ge \frac{2-\theta _0}{k+(2-\theta _0)/\theta _0}\) but that \(\theta _{k+1} < \frac{2-\theta _0}{k+1+(2-\theta _0)/\theta _0}\). Remark that \(x \mapsto (1-x)/x^2\) is strictly decreasing on \((0,2)\). Then, using (11) and the induction hypothesis, we arrive at an inequality which obviously does not hold for any \(k \ge 0\) (the computation is sketched below). This contradiction shows that \(\theta _{k+1} \ge \frac{2-\theta _0}{k+1+(2-\theta _0)/\theta _0}\).
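A sketch of the omitted chain of inequalities, writing \(b = (2-\theta _0)/\theta _0\) and \(c = \frac{2-\theta _0}{k+1+b}\), and assuming (11) is the recursion \(\theta _k^2 = \theta _{k+1}^2/(1-\theta _{k+1})\): since \(\theta _{k+1} < c\) and \(x\mapsto (1-x)/x^2\) is strictly decreasing on \((0,2)\),
\[
\frac{1}{\theta _k^2} = \frac{1-\theta _{k+1}}{\theta _{k+1}^2} > \frac{1-c}{c^2} = \frac{(k+1+b)(k+b-1+\theta _0)}{(2-\theta _0)^2},
\qquad \text{while}\qquad \frac{1}{\theta _k^2} \le \frac{(k+b)^2}{(2-\theta _0)^2}
\]
by the induction hypothesis. Combining the two displays and expanding gives \((k+1+b)(k+b-1+\theta _0) < (k+b)^2\), that is \(\theta _0 (k+b+1) < 1\), i.e. \(\theta _0 k < -1\), which indeed fails for every \(k \ge 0\).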
Proof of Theorem 1
The proof is organised in four steps. First, we derive a one-iteration inequality; second, we identify conditions under which this inequality is a supermartingale inequality; third, we exhibit a solution to the resulting set of inequalities; finally, we bound the rate we obtain by a simpler formula.
Step 1: Derive a one-iteration inequality.
By Lemma 2 in [12], there exist nonnegative numbers \(\gamma _k^l\) such that for all k, \(\sum _{l=0}^k \gamma _k^l = 1\) and \(x_k = \sum _{l=0}^k \gamma _k^l z_l\). Let us denote \({\hat{F}}_k = f(x_k) + \sum _{l=0}^k \gamma _k^l \psi (z_l)\). We have \({\hat{F}}_k \ge F(x_k)\), where the inequality follows from the convexity of \(\psi \), and \({\hat{F}}_0 = F(x_0)\). By [12], for any \(x_* \in {\mathcal {X}}^*\),
Let us denote by \(x_*\) the unique element of \({\mathcal {X}}^*\). Using the equality
which is valid for any vectors a, b and \(\lambda \in {\mathbb {R}}\), we get,
Since
we simplify \(\big (1-\frac{\theta _k}{\theta _0}\big ) (x_{k+1} - x_*) + \frac{\theta _k}{\theta _0} (x_{k+1} - z_{k+1}) = (1-\theta _k)x_k - \frac{\theta _k}{\theta _0}(1-\theta _0) z_k - \big (1-\frac{\theta _k}{\theta _0}\big ) x_*\).
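The equality invoked above is presumably the convex-combination identity, valid for any vectors \(a\), \(b\) and any \(\lambda \in {\mathbb {R}}\) (here with the weighted norm \(\Vert \cdot \Vert _v\)),
\[
\Vert \lambda a + (1-\lambda )b\Vert _v^2 = \lambda \Vert a\Vert _v^2 + (1-\lambda )\Vert b\Vert _v^2 - \lambda (1-\lambda )\Vert a-b\Vert _v^2 ,
\]
applied with \(\lambda = 1-\frac{\theta _k}{\theta _0}\), \(a = x_{k+1}-x_*\) and \(b = x_{k+1}-z_{k+1}\), so that \(a-b = z_{k+1}-x_*\); the simplification displayed above is consistent with the updates \(y_k = (1-\theta _k)x_k + \theta _k z_k\) and \(x_{k+1} = y_k + \frac{\theta _k}{\theta _0}(z_{k+1}-z_k)\) used in [12].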
We then reorganize (30) and use again (29) to get
When \(k=0\), the equalities \(x_0 = z_0\) and \(z_1 = x_1\) imply that the inequality still holds with the convention \(0/0 = 0\).
Let us consider nonnegative sequences \((a_k)\), \((b_k)\), \((c_k)\), \((d_k)\) and \((\sigma ^K_k)\) and study the quantity
By the strong convexity condition of F, we have:
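Presumably this is the quadratic growth consequence of strong convexity with respect to \(\Vert \cdot \Vert _v\),
\[
F(x) - F(x_*) \ge \frac{\mu }{2}\Vert x - x_*\Vert _v^2 \qquad \text{for all } x,
\]
which allows a fraction \(\sigma _{k+1}^K\) of the distance term to be exchanged against a function-value term.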
Hence, for any \(\sigma _{k+1}^K \in [0,1]\) (the usefulness of the superscript \(K\) will become clear later), we get
Applying (28) we get:
Step 2: Conditions to get a supermartingale.
In order to have the supermartingale inequality \({\mathbf {E}}_k[ C_{k+1}]\le C_k\), we need the following constraints on the parameters for \(k \in \{0, \ldots , K-1\}\):
(as \(x_0-z_0 = 0\) and \(x_1 = z_1\), there is no constraint on \(d_0\) and \(d_1\)). We also add the constraints
in order to get \(C_0 = \frac{1-\theta _0}{\theta _0^2} {\hat{F}}_0 + \frac{1}{2\theta _0^2}\left\| x_0 - x_* \right\| ^2_v + \frac{d_0}{2} \left\| x_0 - x_0 \right\| ^2_v = \varDelta _0\) and
If we can find a set of nonnegative sequences satisfying (32)–(40), then we have
Step 3: Explicit solution of the inequalities.
By analogy to the gradient case, we are looking for such sequences saturating (32)–(35). For \(k \in \{1,\dots , K-1\}\), equality in (32)–(35) can be fulfilled only if
Indeed, the saturation requirement leads to
We get the formula for \(\sigma ^K_k\) by removing \(a_k\) on both sides thanks to \(\theta _{k-1}^2 = \frac{\theta _k^2}{1-\theta _k}\) and by dividing by \(b_k\). Note that we have \(\sigma _k^K \le 1 - \frac{\theta _k}{\theta _{k-1}}\frac{1-\theta _0}{1-\theta _k} \le 1 - \frac{1-\theta _0}{1-\theta _k} \le 1 - (1-\theta _0) \le \theta _0\).
For \(k = K\), we would like to ensure (39) and (40). This can be done by taking
Here also, \(\sigma _K^K \le \theta _0\).
Having found \((\sigma ^K_k)\), we can get \((b_k)\) as a function of \(b_0\) through the equality in (33)
and \((a_k)\) as a function of \(b_0\) through the equality in (32)
Using equality in (34) and (35), we get the relation
and so
This gives us the opportunity to calculate \(b_0\) because
Hence, we get two equalities:
and so since \(\theta _{l-1}^2 = \theta _l^2/(1-\theta _l) \)
We then need to check that there is a feasible sequence \((d_k)\). Indeed, such a sequence exists because the upper and lower bounds satisfy
which is true for all \(k \ge 2\) because \(\sigma ^K_k \le \theta _0\).
Up to now, we have constructed sequences \((\sigma _k^K)\), \((a_k)\), \((b_k)\), \((c_k)\) and \((d_k)\) that clearly satisfy (32) to (39). Moreover, \(\sigma _k^K \in [0,1]\), \(b_k \ge 0\), \(c_k \ge 0\) and \(d_k \ge 0\).
We can then easily check that \(c_K = 0\) and (42) implies that \(a_{K} = (1-\theta _0) b_K\). Hence, \(a_K \ge 0\). By recursively looking at (32), we deduce that \(a_k \ge 0\) for all \(k \le K\).
Now that we have the sequences, using the relation (41), we can compute the rate as
Hence, we obtain that
where \(\theta _{-1}^2 = \frac{\theta _0^2}{1-\theta _0}\) and
Step 4: Bound \(\rho _K\) by induction.
Let us now assume that \(\rho _K \le \frac{1 + (1-\theta _0)\mu }{1+\frac{\theta _0^2\mu }{2\theta _{K-1}^2}}\).
We now use the induction hypothesis.
The last inequality comes from the fact that
and both terms in the brackets are nonpositive for all K: they are respectively equal to \(-\theta _0(1-\theta _0)\) and \(-\frac{\theta _0^4}{2 \theta _K^3}(1-\theta _K)\).