Restarting the accelerated coordinate descent method with a rough strong convexity estimate

Abstract

We propose new restarting strategies for the accelerated coordinate descent method. Our main contribution is to show that, for a well-chosen sequence of restarting times, the restarted method has a nearly geometric rate of convergence. A major feature of the method is that it can exploit the local quadratic error bound of the objective function without knowing the actual value of the error bound. We also show that, under the more restrictive assumption that the objective function is strongly convex, any fixed restart period leads to a geometric rate of convergence. Finally, we illustrate the properties of the algorithm on a regularized logistic regression problem and on a Lasso problem.
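
To fix ideas, here is a minimal sketch (ours, not the algorithm analyzed in the paper) of the fixed restart period scheme mentioned above, with a basic accelerated gradient method on a toy least-squares problem standing in for the accelerated coordinate descent method; all function names and parameter values are illustrative only.

```python
# Illustrative sketch (ours): restarting an accelerated method with a fixed
# restart period.  Plain accelerated full-gradient descent on a least-squares
# toy problem stands in for accelerated coordinate descent.
import numpy as np

def accelerated_gradient(grad, L, x0, num_iters):
    """FISTA-style accelerated gradient descent for num_iters iterations."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(num_iters):
        x = y - grad(y) / L
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + (t - 1.0) / t_next * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

def restarted(grad, L, x0, period, num_restarts):
    """Restart every `period` iterations: warm-start at the last point and
    reset the momentum.  A rough strong convexity estimate would only enter
    through the choice of `period`."""
    x = x0.copy()
    for _ in range(num_restarts):
        x = accelerated_gradient(grad, L, x, period)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)
    grad = lambda x: A.T @ (A @ x - b)       # gradient of 0.5*||Ax - b||^2
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = restarted(grad, L, np.zeros(50), period=50, num_restarts=8)
    print("final objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```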

Notes

  1. In [3], the \(L^1\) norm is rewritten using a matrix with \(2^n\) rows.

References

  1. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM (2017)

  2. Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In: International Conference on Machine Learning, pp. 1080–1089 (2016)

  3. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017)

  4. Beck, A., Shtern, S.: Linearly convergent away-step conditional gradient for non-strongly convex functions. Math. Program. 164(1), 1–27 (2017)

  5. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  6. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)

  7. Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)

  8. Friedman, J., Hastie, T., Höfling, H., Tibshirani, R., et al.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)

  9. Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv preprint arXiv:1609.07358

  10. Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition (2017). arXiv preprint arXiv:1709.02300

  11. Fercoq, O., Richtárik, P.: Smooth minimization of nonsmooth functions by parallel coordinate descent (2013). Preprint arXiv:1309.5885

  12. Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  13. Gasnikov, A.V., Dvurechensky, P.E.: Stochastic intermediate gradient method for convex optimization problems. Dokl. Math. 93(2), 148–151 (2016)

  14. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4), 2061–2089 (2013)

  15. Johnson, T., Guestrin, C.: Blitz: a principled meta-algorithm for scaling sparse optimization. In: International Conference on Machine Learning, pp. 1171–1179 (2015)

  16. Klatte, D., Thiere, G.: Error bounds for solutions of linear equations and inequalities. Z. Oper. Res. 41(2), 191–214 (1995)

  17. Lin, Q., Lu, Z., Xiao, L.: An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4), 2244–2273 (2015)

  18. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 3384–3392. Curran Associates, Inc., Red Hook (2015)

  19. Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS ’13, pp. 147–156, Washington, DC, USA. IEEE Computer Society (2013)

  20. Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993)

  21. Lin, Q., Xiao, L.: An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Comput. Optim. Appl. 60(3), 633–674 (2015)

  22. Liu, M., Yang, T.: Adaptive accelerated gradient converging method under holderian error bound condition. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 3104–3114. Curran Associates, Inc., Red Hook (2017)

  23. Murata, T., Suzuki, T.: Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In: Advances in Neural Information Processing Systems, pp. 608–617 (2017)

  24. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \({O}(1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)

  25. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  26. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  27. Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175, 69–107 (2018)

  28. O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15, 715–732 (2012)

  29. Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optim. Methods Softw. 31(5), 858–884 (2016)

  30. Roulet, V., D’Aspremont, A.: Sharpness, restart and acceleration. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 1119–1129. Curran Associates, Inc., Red Hook (2017)

  31. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. (2014). https://doi.org/10.1007/s10107-012-0614-z

  32. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)

  33. Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. SIAM J. Optim. 2, 3 (2008)

  34. Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)

Author information

Corresponding author

Correspondence to Olivier Fercoq.

Additional information

The first author’s work was supported by the EPSRC Grant EP/K02325X/1 Accelerated Coordinate Descent Methods for Big Data Optimization, the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council), the Orange/Telecom ParisTech think tank Phi-TAB, the ANR Grant ANR-11-LABX-0056-LMH, LabEx LMH as part of the Investissement d’avenir project. The second author’s work was supported by Hong Kong Research Grant Council 27302016. This research was partially conducted using the HKU Information Technology Services research computing facilities that are supported in part by the Hong Kong UGC Special Equipment Grant (SEG HKU09).

Proof of Lemma 1 and Theorem 1

Proof of Lemma 1

Equation (11) holds because \(\theta _{k+1}\) is the unique positive root of the polynomial \(P(X) = X^2 + \theta _k^2 X - \theta _k^2\).
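
For reference, solving \(P(X) = 0\) for its positive root gives the explicit update

$$\begin{aligned} \theta _{k+1} = \frac{-\theta _k^2 + \sqrt{\theta _k^4 + 4\theta _k^2}}{2} = \frac{\theta _k}{2}\Big (\sqrt{\theta _k^2+4} - \theta _k\Big ). \end{aligned}$$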

Let us prove (10) by induction. It is clear that \(\theta _0 \le \frac{2}{0+2/\theta _0}\). Assume that \(\theta _k \le \frac{2}{k+2/\theta _0}\). We know that \(P(\theta _{k+1}) = 0\) and that P is an increasing function on \([0 , +\infty )\). So we just need to show that \(P\big (\frac{2}{k+1+2/\theta _0}\big )\ge 0\).

$$\begin{aligned} P\Big (\frac{2}{k+1+2/\theta _0}\Big ) = \frac{4}{(k+1+2/\theta _0)^2} + \frac{2}{k+1+2/\theta _0}\theta _k^2 - \theta _k^2 \end{aligned}$$

As \(\theta _k \le \frac{2}{k+2/\theta _0}\) and \(\frac{2}{k+1+2/\theta _0}-1 \le 0\),

$$\begin{aligned} P\Big (\frac{2}{k+1+2/\theta _0}\Big )&\ge \frac{4}{(k+1+2/\theta _0)^2} + \Big (\frac{2}{k+1+2/\theta _0} - 1\Big ) \frac{4}{(k+2/\theta _0)^2} \\&= \frac{4}{(k+1+2/\theta _0)^2 (k+2/\theta _0)^2} \ge 0. \end{aligned}$$

For the other inequality, note first that \(\frac{(2-\theta _0)}{0+(2-\theta _0)/\theta _0} \le \theta _0\), so the bound holds for \(k=0\). We now assume that \(\theta _k \ge \frac{(2-\theta _0)}{k+(2-\theta _0)/\theta _0}\) but that \(\theta _{k+1} < \frac{(2-\theta _0)}{k+1+(2-\theta _0)/\theta _0}\), and derive a contradiction. Remark that \(x \mapsto (1-x)/x^2\) is strictly decreasing for \(x \in (0,2)\). Then, using (11), we have

$$\begin{aligned}&\frac{(k+1+(2-\theta _0)/\theta _0)^2}{(2-\theta _0)^2} -\frac{k+1+(2-\theta _0)/\theta _0}{2-\theta _0}\\&\quad < \frac{1-\theta _{k+1}}{\theta _{k+1}^2} \overset{(11)}{=} \frac{1}{\theta _k^2} \le \frac{(k+(2-\theta _0)/\theta _0)^2}{(2-\theta _0)^2}. \end{aligned}$$

This is equivalent to

$$\begin{aligned} (2 - (2- \theta _0)) (k+(2-\theta _0)/\theta _0) + 1 < (2-\theta _0) \end{aligned}$$

which obviously does not hold for any \(k \ge 0\). So \(\theta _{k+1} \ge \frac{(2-\theta _0)}{k+1+(2-\theta _0)/\theta _0}\).
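
As a quick numerical sanity check (ours, not part of the proof), one can iterate the recursion (11) and verify that the bounds (10) hold along the sequence:

```python
# Sanity check of Lemma 1: iterate theta_{k+1} = positive root of
# X^2 + theta_k^2 X - theta_k^2 (Eq. (11)) and test the bounds of (10).
import math

theta0 = 0.3                      # illustrative value in (0, 1)
theta = theta0
for k in range(1000):
    upper = 2.0 / (k + 2.0 / theta0)
    lower = (2.0 - theta0) / (k + (2.0 - theta0) / theta0)
    assert lower - 1e-12 <= theta <= upper + 1e-12, (k, lower, theta, upper)
    theta = 0.5 * (-theta**2 + math.sqrt(theta**4 + 4 * theta**2))
print("bounds (10) hold for the first 1000 iterations")
```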

Proof of Theorem 1

The proof is organised in four steps. First, we derive a one-iteration inequality; second, we identify conditions under which this inequality is a supermartingale inequality; third, we give a solution to the resulting set of inequalities; finally, we bound the rate we obtain by a simpler formula.

Step 1: Derive a one-iteration inequality.

By Lemma 2 in [12], there exist nonnegative numbers \(\gamma _k^l\) such that for all k, \(\sum _{l=0}^k \gamma _k^l = 1\) and \(x_k = \sum _{l=0}^k \gamma _k^l z_l\). Let us denote \({\hat{F}}_k = f(x_k) + \sum _{l=0}^k \gamma _k^l \psi (z_l)\). We have \({\hat{F}}_k \ge F(x_k)\), where the inequality follows from the convexity of \(\psi \), and \({\hat{F}}_0 = F(x_0)\). By [12], for any \(x_* \in {\mathcal {X}}^*\),

$$\begin{aligned} {\mathbf {E}}_k[{\hat{F}}_{k+1}-F^*] \le (1-\theta _k)({\hat{F}}_k-F^*) + \frac{\theta _k^2}{2 \theta _0^2}\big (\left\| z_k - x_* \right\| _v^2 - {\mathbf {E}}_k[\left\| z_{k+1}-x_* \right\| _v^2]\big ) \end{aligned}$$
(28)

Let us denote by \(x_*\) the unique element of \({\mathcal {X}}^*\). Using the equality

$$\begin{aligned} \left\| (1-\lambda )a + \lambda b \right\| ^2 = (1-\lambda )\left\| a \right\| ^2 + \lambda \left\| b \right\| ^2 - \lambda (1-\lambda ) \left\| a-b \right\| ^2 \end{aligned}$$
(29)

which is valid for any vectors a, b and \(\lambda \in {\mathbb {R}}\), we get,

$$\begin{aligned}&\left( 1-\frac{\theta _k}{\theta _0}\right) \left\| x_{k+1} - x_* \right\| ^2 + \frac{\theta _k}{\theta _0} \left\| x_{k+1} - z_{k+1} \right\| ^2 \nonumber \\&\quad = \left\| \left( 1-\frac{\theta _k}{\theta _0}\right) (x_{k+1} - x_*) + \frac{\theta _k}{\theta _0} (x_{k+1} - z_{k+1}) \right\| ^2 + \left( 1-\frac{\theta _k}{\theta _0}\right) \frac{\theta _k}{\theta _0}\left\| z_{k+1} - x_* \right\| ^2\nonumber \\ \end{aligned}$$
(30)

Since

$$\begin{aligned} x_{k+1} = (1-\theta _k)x_k + \frac{\theta _k}{\theta _0}z_{k+1} - \frac{\theta _k}{\theta _0}(1-\theta _0) z_k, \end{aligned}$$

we simplify \(\big (1-\frac{\theta _k}{\theta _0}\big ) (x_{k+1} - x_*) + \frac{\theta _k}{\theta _0} (x_{k+1} - z_{k+1}) = (1-\theta _k)x_k - \frac{\theta _k}{\theta _0}(1-\theta _0) z_k - \big (1-\frac{\theta _k}{\theta _0}\big ) x_*\).

We then reorganize (30) and use (29) again to get

$$\begin{aligned}&\left\| x_{k+1}-x_* \right\| _v^2 = \frac{1}{1-\theta _k/\theta _0} \left\| (1-\theta _k)x_k - \frac{\theta _k}{\theta _0}(1-\theta _0) z_k - (1-\theta _k/\theta _0)x_* \right\| _v^2 \nonumber \\&\qquad + \frac{\theta _k}{\theta _0}\left\| z_{k+1}-x_* \right\| _v^2 - \frac{\theta _k/\theta _0}{1-\theta _k/\theta _0}\left\| x_{k+1}- z_{k+1} \right\| _v^2 \nonumber \\&\quad = \frac{(1-\theta _k/\theta _0)^2}{1-\theta _k/\theta _0} \left\| \frac{1-\theta _k}{1-\theta _k/\theta _0}(x_k - x_*) - \frac{\theta _k(1-\theta _0)}{\theta _0(1-\theta _k/\theta _0)}(z_k - x_*) \right\| _v^2 \nonumber \\&\qquad + \frac{\theta _k}{\theta _0}\left\| z_{k+1}-x_* \right\| _v^2 - \frac{\theta _k/\theta _0}{1-\theta _k/\theta _0}\left\| x_{k+1}- z_{k+1} \right\| _v^2 \nonumber \\&\quad = (1-\theta _k)\left\| x_k-x_* \right\| _v^2 + \frac{\theta _k}{\theta _0}\left\| z_{k+1}-x_* \right\| _v^2 - \frac{\theta _k}{\theta _0}(1-\theta _0)\left\| z_k - x_* \right\| _v^2 \nonumber \\&\qquad + \frac{(1-\theta _k)\theta _k/\theta _0(1-\theta _0)}{1-\theta _k/\theta _0}\left\| x_k-z_k \right\| _v^2 - \frac{\theta _k/\theta _0}{1-\theta _k/\theta _0}\left\| x_{k+1}- z_{k+1} \right\| _v^2 \end{aligned}$$
(31)

When \(k=0\), the relations \(x_0 = z_0\) and \(z_1 = x_1\) imply that (31) still holds with the convention \(0/0 = 0\).
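
As a quick numerical sanity check (ours), identity (31) can be verified on random data, assuming the update \(x_{k+1} = (1-\theta _k)x_k + \frac{\theta _k}{\theta _0}z_{k+1} - \frac{\theta _k}{\theta _0}(1-\theta _0) z_k\) written above and a diagonally weighted norm \(\left\| \cdot \right\| _v\):

```python
# Check identity (31) with a weighted norm ||u||_v^2 = sum_i v_i u_i^2
# (random positive weights) and random vectors x_k, z_k, z_{k+1}, x_*.
import numpy as np

rng = np.random.default_rng(0)
n = 5
v = rng.uniform(0.5, 2.0, n)              # positive weights of the v-norm
sq = lambda u: float(np.sum(v * u * u))   # ||u||_v^2

theta0, thetak = 0.5, 0.2                 # any 0 < theta_k < theta_0 <= 1
xk, zk, zk1, xstar = (rng.standard_normal(n) for _ in range(4))
r = thetak / theta0
xk1 = (1 - thetak) * xk + r * zk1 - r * (1 - theta0) * zk

lhs = sq(xk1 - xstar)
rhs = ((1 - thetak) * sq(xk - xstar) + r * sq(zk1 - xstar)
       - r * (1 - theta0) * sq(zk - xstar)
       + (1 - thetak) * r * (1 - theta0) / (1 - r) * sq(xk - zk)
       - r / (1 - r) * sq(xk1 - zk1))
assert abs(lhs - rhs) < 1e-10
print("identity (31) verified on random data")
```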

Let us consider nonnegative sequences \((a_k)\), \((b_k)\), \((c_k)\), \((d_k)\) and \((\sigma ^K_k)\) and study the quantity

$$\begin{aligned} C_{k+1}= & {} a_{k+1}({\hat{F}}_{k+1} - F^*) + \frac{b_{k+1}}{2} \left\| x_{k+1} - x_* \right\| _v^2 + \frac{c_{k+1}}{2} \left\| z_{k+1} - x_* \right\| _v^2\\&+ \frac{d_{k+1}}{2} \left\| x_{k+1}-z_{k+1} \right\| _v^2 \end{aligned}$$

By the strong convexity assumption on F, we have:

$$\begin{aligned} \frac{1}{2} \Vert x_{k+1}-x_*\Vert _v^2 \le \frac{1}{\mu }(F(x_{k+1}) - F^*). \end{aligned}$$

Hence, for any \(\sigma _{k+1}^K \in [0,1]\) (the usefulness of the superscript K will become clear later), we get

$$\begin{aligned}&{\mathbf {E}}_k[C_{k+1}] \overset{(31)}{\le } {\mathbf {E}}_k\left[ \left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) ({\hat{F}}_{k+1} - F^*) \right. \\&\quad + \frac{b_{k+1}}{2}(1-\sigma ^K_{k+1})(1-\theta _k)\left\| x_k-x_* \right\| _v^2 \\&\quad + \left( c_{k+1} + b_{k+1}(1-\sigma ^K_{k+1})\frac{\theta _k}{\theta _0} \right) \frac{1}{2} \left\| z_{k+1}-x_* \right\| _v^2 \\&\quad -\frac{b_{k+1}}{2}(1-\sigma ^K_{k+1}) \frac{\theta _k}{\theta _0}(1-\theta _0)\left\| z_k - x_* \right\| _v^2 \\&\quad + \frac{b_{k+1}}{2}(1-\sigma ^K_{k+1})\frac{(1-\theta _k)\theta _k/\theta _0(1-\theta _0)}{1-\theta _k/\theta _0}\left\| x_k-z_k \right\| _v^2 \\&\quad \left. + \left( d_{k+1} - b_{k+1}(1-\sigma ^K_{k+1})\frac{\theta _k/\theta _0}{1-\theta _k/\theta _0}\right) \frac{1}{2}\left\| x_{k+1}- z_{k+1} \right\| _v^2 \right] \end{aligned}$$

Applying (28) we get:

$$\begin{aligned} {\mathbf {E}}_k[C_{k+1}]&\le (1-\theta _k)\left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) ({\hat{F}}_{k} - F^*) \\&\quad + \frac{b_{k+1}}{2}(1-\sigma ^K_{k+1})(1-\theta _k)\left\| x_k-x_* \right\| _v^2 \\&\quad + \left( c_{k+1} + b_{k+1}(1-\sigma ^K_{k+1})\frac{\theta _k}{\theta _0}\right. \\&\quad \left. -\left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) \frac{\theta _k^2}{\theta _0^2} \right) \frac{1}{2} {\mathbf {E}}_k[\left\| z_{k+1}-x_* \right\| _v^2]\\&\quad + \left( \left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) \frac{\theta _k^2}{\theta _0^2}\right. \\&\quad \left. -b_{k+1}(1-\sigma ^K_{k+1}) \frac{\theta _k}{\theta _0}(1-\theta _0)\right) \frac{1}{2} \left\| z_k - x_* \right\| _v^2 \\&\quad + \frac{b_{k+1}}{2}(1-\sigma ^K_{k+1})\frac{(1-\theta _k)\theta _k/\theta _0(1-\theta _0)}{1-\theta _k/\theta _0}\left\| x_k-z_k \right\| _v^2 \\&\quad + \left( d_{k+1} - b_{k+1}(1-\sigma ^K_{k+1})\frac{\theta _k/\theta _0}{1-\theta _k/\theta _0}\right) \frac{1}{2}{\mathbf {E}}_k[\left\| x_{k+1}- z_{k+1} \right\| _v^2] \end{aligned}$$

Step 2: Conditions to get a supermartingale.

In order to have the supermartingale inequality \({\mathbf {E}}_k[ C_{k+1}]\le C_k\), we need the following constraints on the parameters for \(k \in \{0, \ldots , K-1\}\):

$$\begin{aligned} a_k&\ge (1-\theta _k)\left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) \end{aligned}$$
(32)
$$\begin{aligned} b_k&\ge b_{k+1}(1-\sigma ^K_{k+1})(1-\theta _k) \end{aligned}$$
(33)
$$\begin{aligned} c_{k+1}&\le \left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) \frac{\theta _k^2}{\theta _0^2} - b_{k+1}(1-\sigma ^K_{k+1})\frac{\theta _k}{\theta _0} \end{aligned}$$
(34)
$$\begin{aligned} c_k&\ge \left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) \frac{\theta _k^2}{\theta _0^2}-b_{k+1}(1-\sigma ^K_{k+1}) \frac{\theta _k}{\theta _0}(1-\theta _0) \end{aligned}$$
(35)
$$\begin{aligned} d_{k+1}&\le b_{k+1}(1-\sigma ^K_{k+1})\frac{\theta _k/\theta _0}{1-\theta _k/\theta _0} \end{aligned}$$
(36)
$$\begin{aligned} d_k&\ge b_{k+1}(1-\sigma ^K_{k+1})\frac{(1-\theta _k)\theta _k/\theta _0(1-\theta _0)}{1-\theta _k/\theta _0} \end{aligned}$$
(37)

(as \(x_0-z_0 = 0\) and \(x_1 = z_1\), there is no constraint on \(d_0\) and \(d_1\)). We also add the constraints

$$\begin{aligned} a_0 = (1-\theta _0)(b_0 + c_0) = \frac{1-\theta _0}{\theta _0^2} \end{aligned}$$
(38)

in order to get \(C_0 = \frac{1-\theta _0}{\theta _0^2} ({\hat{F}}_0 - F^*) + \frac{1}{2\theta _0^2}\left\| x_0 - x_* \right\| ^2_v + \frac{d_0}{2} \left\| x_0 - x_0 \right\| ^2_v = \varDelta _0\) and

$$\begin{aligned} c_K&= 0 \end{aligned}$$
(39)
$$\begin{aligned} a_K&= (1-\theta _0)b_K \;. \end{aligned}$$
(40)

If we can find a set of nonnegative sequences satisfying (32)–(40), then we have

$$\begin{aligned} b_K \varDelta (x_K)= & {} \frac{a_K}{\theta _0^2} (F(x_K) - F^*) + \frac{b_K}{2\theta _0^2} \left\| x_K - x_* \right\| _v^2 \nonumber \\\le & {} \frac{a_0}{\theta _0^2} (F(x_0) - F^*) + \frac{b_0+c_0}{2\theta _0^2} \left\| x_0 - x_* \right\| _v^2 = (b_0+c_0)\varDelta (x_0)\;. \end{aligned}$$
(41)

Step 3: Explicit solution of the inequalities.

By analogy to the gradient case, we are looking for such sequences saturating (32)–(35). For \(k \in \{1,\dots , K-1\}\), equality in (32)–(35) can be fulfilled only if

$$\begin{aligned} \sigma ^K_k = \frac{1 - \frac{\theta _k}{\theta _{k-1}}\frac{1-\theta _0}{1-\theta _k}}{1 + \frac{\theta _{k-1}}{\theta _0 \mu }}, \qquad k \in \{1,\dots , K-1\} \;. \end{aligned}$$

Indeed, the saturation requirement leads to

$$\begin{aligned}&c_k \overset{eq. in (34)}{=} \left( a_{k}+\frac{b_{k}\sigma ^K_{k}}{\mu }\right) \frac{\theta _{k-1}^2}{\theta _0^2} - b_{k}(1-\sigma ^K_{k})\frac{\theta _{k-1}}{\theta _0} \\&\quad \overset{eq. in (35)}{=} \left( a_{k+1}+\frac{b_{k+1}\sigma ^K_{k+1}}{\mu }\right) \frac{\theta _k^2}{\theta _0^2}-b_{k+1}(1-\sigma ^K_{k+1}) \frac{\theta _k}{\theta _0}(1-\theta _0) \\&\quad \overset{eq. in (32)+(33)}{=} \frac{a_k}{1-\theta _k} \frac{\theta _k^2}{\theta _0^2} - \frac{b_k}{1-\theta _k} \frac{\theta _k}{\theta _0}(1-\theta _0) \end{aligned}$$

We get the formula for \(\sigma ^K_k\) by cancelling \(a_k\) on both sides thanks to \(\theta _{k-1}^2 = \frac{\theta _k^2}{1-\theta _k}\) and by dividing by \(b_k\). Note that, since \(\frac{\theta _k}{\theta _{k-1}} = \sqrt{1-\theta _k}\), we have \(\sigma _k^K \le 1 - \frac{\theta _k}{\theta _{k-1}}\frac{1-\theta _0}{1-\theta _k} = 1 - \frac{1-\theta _0}{\sqrt{1-\theta _k}} \le 1 - (1-\theta _0) = \theta _0\).
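
Spelling out the cancellation described above: since \(\frac{\theta _{k-1}^2}{\theta _0^2} = \frac{\theta _k^2}{(1-\theta _k)\theta _0^2}\), the terms involving \(a_k\) in the first and last expressions coincide, and dividing what remains by \(b_k \theta _{k-1}/\theta _0\) yields

$$\begin{aligned} \frac{\sigma ^K_k}{\mu }\frac{\theta _{k-1}}{\theta _0} - (1-\sigma ^K_k) = -\frac{\theta _k}{\theta _{k-1}}\frac{1-\theta _0}{1-\theta _k}, \end{aligned}$$

that is, \(\sigma ^K_k \big (1 + \frac{\theta _{k-1}}{\theta _0 \mu }\big ) = 1 - \frac{\theta _k}{\theta _{k-1}}\frac{1-\theta _0}{1-\theta _k}\), which is the stated formula.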

For \(k = K\), we would like to ensure (39) and (40). This can be done by taking

$$\begin{aligned} \sigma ^K_K = \frac{1 - \frac{\theta _{K-1}}{\theta _0}(1-\theta _0)}{1 + \frac{\theta _{K-1}}{\theta _0\mu }} \;. \end{aligned}$$
(42)

Here also, \(\sigma _K^K \le \theta _0\).

Having found \((\sigma ^K_k)\), we can get \((b_k)\) as a function of \(b_0\) through the equality in (33)

$$\begin{aligned} b_k = \Big ( \prod _{l=0}^{k-1} \frac{1}{(1-\theta _l)(1-\sigma ^K_{l+1})} \Big )b_0 = \frac{\theta _{-1}^2}{\theta _{k-1}^2} \Big (\prod _{l=1}^{k} \frac{1}{1-\sigma ^K_{l}}\Big )b_0 \end{aligned}$$

and \((a_k)\) as a function of \(b_0\) through the equality in (32)

$$\begin{aligned} a_k&= \frac{1}{1-\theta _{k-1}} a_{k-1} - \frac{b_k \sigma _k^K}{\mu }= \frac{\theta _{k-2}^2}{\theta _{k-1}^2} a_{k-1} - \frac{b_k \sigma _k^K}{\mu } = \frac{\theta _{-1}^2}{\theta _{k-1}^2} a_0 - \sum _{l=1}^k \frac{\sigma ^K_l}{\mu }\frac{\theta _{l-1}^2}{\theta _{k-1}^2}b_l \\&= \frac{\theta _{-1}^2}{\theta _{k-1}^2} a_0 - \frac{\theta _{-1}^2}{\theta _{k-1}^2}\sum _{l=1}^k \frac{\sigma ^K_l}{\mu }\Big (\prod _{j=1}^{l} \frac{1}{1-\sigma ^K_{j}}\Big )b_0 . \end{aligned}$$

Using equality in (34) and (35), we get the relation

$$\begin{aligned} c_{k+1} = c_k - \frac{\theta _k}{1-\theta _k}b_k \end{aligned}$$

and so

$$\begin{aligned} c_k = c_0 - \sum _{l=0}^{k-1} \frac{\theta _l}{1-\theta _l}b_l \end{aligned}$$

This allows us to compute \(b_0\), since

$$\begin{aligned} c_K = 0 = c_0 - \sum _{l=0}^{K-1} \frac{\theta _l}{1-\theta _l}b_l \end{aligned}$$

Hence, we get two equalities:

$$\begin{aligned} c_0 + b_0&= \frac{1}{\theta _0^2} \\ c_0&= \sum _{l=0}^{K-1} \frac{\theta _l}{1-\theta _l} \frac{\theta _{-1}^2}{\theta _{l-1}^2}\prod _{j=1}^{l} \frac{1}{1-\sigma ^K_{j}} b_0 \end{aligned}$$

and so since \(\theta _{l-1}^2 = \theta _l^2/(1-\theta _l) \)

$$\begin{aligned} b_0 = \frac{1}{\theta _0^2} \frac{1}{1 + \sum _{l=0}^{K-1} \frac{\theta _{-1}^2}{\theta _{l}}\prod _{j=1}^{l} \frac{1}{1-\sigma ^K_{j}}}\;. \end{aligned}$$

We then need to check that there is a feasible sequence \((d_k)\). Indeed, such a sequence exists because the upper and lower bounds satisfy

$$\begin{aligned}&b_{k}(1-\sigma ^K_{k})\frac{\theta _{k-1}/\theta _0}{1-\theta _{k-1}/\theta _0} \ge d_k \ge b_{k+1}(1-\sigma ^K_{k+1})\frac{(1-\theta _k)\theta _k/\theta _0(1-\theta _0)}{1-\theta _k/\theta _0} \\&\quad \overset{eq. in (33)}{\Leftrightarrow } \sigma ^K_{k} \le 1 - (1-\theta _0)\frac{1/\theta _{k-1} - 1/\theta _0}{1/\theta _k - 1/\theta _0} \end{aligned}$$

which is true for all \(k \ge 2\) because \(\sigma ^K_k \le \theta _0\).

Up to now, we have constructed sequences \((\sigma _k^K)\), \((a_k)\), \((b_k)\), \((c_k)\) and \((d_k)\) that clearly satisfy (32) to (39). Moreover, \(\sigma _k^K \in [0,1]\), \(b_k \ge 0\), \(c_k \ge 0\) and \(d_k \ge 0\).

We can then easily check that \(c_K = 0\) and (42) implies that \(a_{K} = (1-\theta _0) b_K\). Hence, \(a_K \ge 0\). By recursively looking at (32), we deduce that \(a_k \ge 0\) for all \(k \le K\).
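
The construction can also be checked numerically. The following sketch (ours, not part of the proof) builds \((\sigma _k^K)\), \((a_k)\), \((b_k)\), \((c_k)\) from the formulas of this step for one arbitrary choice of \(\theta _0\), \(\mu \) and K, and verifies that (32)–(35) hold with equality, together with \(c_K = 0\), \(a_K = (1-\theta _0)b_K\) and nonnegativity:

```python
# Numerical verification of the Step 3 construction for one parameter choice.
import math

theta0, mu, K = 0.4, 0.5, 30          # illustrative parameters

# theta_{-1}, theta_0, ..., theta_K via recursion (11); th(k) returns theta_k
theta = [theta0 / math.sqrt(1 - theta0), theta0]
for _ in range(K):
    t = theta[-1]
    theta.append(0.5 * (-t**2 + math.sqrt(t**4 + 4 * t**2)))
th = lambda k: theta[k + 1]

# sigma_k^K for k = 1..K-1 (generic formula) and k = K (Eq. (42))
sigma = [None] + [
    (1 - th(k) / th(k - 1) * (1 - theta0) / (1 - th(k)))
    / (1 + th(k - 1) / (theta0 * mu))
    for k in range(1, K)
] + [(1 - th(K - 1) / theta0 * (1 - theta0)) / (1 + th(K - 1) / (theta0 * mu))]

# b_0 from its closed formula, then b, a, c by the recursions of Step 3
prod = [1.0]                           # prod[l] = prod_{j=1..l} 1/(1-sigma_j^K)
for l in range(1, K):
    prod.append(prod[-1] / (1 - sigma[l]))
b0 = 1 / theta0**2 / (1 + sum(th(-1)**2 / th(l) * prod[l] for l in range(K)))
c0 = 1 / theta0**2 - b0

b, a, c = [b0], [(1 - theta0) / theta0**2], [c0]
for k in range(K):
    b.append(b[k] / ((1 - th(k)) * (1 - sigma[k + 1])))          # equality in (33)
    a.append(a[k] / (1 - th(k)) - b[k + 1] * sigma[k + 1] / mu)  # equality in (32)
    c.append(c[k] - th(k) / (1 - th(k)) * b[k])                  # from (34) and (35)

eq = lambda x, y: abs(x - y) < 1e-9 * max(1.0, abs(x), abs(y))
for k in range(K):
    s, ak1, bk1 = sigma[k + 1], a[k + 1], b[k + 1]
    rhs34 = (ak1 + bk1 * s / mu) * th(k)**2 / theta0**2 - bk1 * (1 - s) * th(k) / theta0
    rhs35 = (ak1 + bk1 * s / mu) * th(k)**2 / theta0**2 \
            - bk1 * (1 - s) * th(k) / theta0 * (1 - theta0)
    assert eq(c[k + 1], rhs34) and eq(c[k], rhs35)               # (34), (35) tight
assert eq(c[K], 0.0) and eq(a[K], (1 - theta0) * b[K])           # (39), (40)
assert all(x >= -1e-12 for x in a + b + c)                       # nonnegativity
print("constraints verified; rho_K = (b_0 + c_0)/b_K =", (b0 + c0) / b[K])
```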

Now that we have the sequences, using the relation (41), we can compute the rate as

$$\begin{aligned} \rho _K&= \frac{b_0 + c_0}{b_K} = \frac{1}{\theta _0^2} \frac{\theta _{K-1}^2}{\theta _{-1}^2}\Big (\prod _{l'=1}^{K} (1-\sigma ^K_{l'})\Big ) \theta _0^2 \Big (1 + \sum _{l=0}^{K-1} \frac{\theta _{-1}^2}{\theta _{l}}\prod _{j=1}^{l} \frac{1}{1-\sigma ^K_{j}}\Big )\\&=\frac{\theta _{K-1}^2}{\theta _{-1}^2} \Big (\prod _{l'=1}^{K} (1-\sigma ^K_{l'}) + \sum _{l=0}^{K-1} \frac{\theta _{-1}^2}{\theta _{l}}\prod _{j=l+1}^{K} (1-\sigma ^K_{j})\Big ) \end{aligned}$$

Hence, we obtain that

$$\begin{aligned} {\mathbf {E}}[\varDelta (x_K)] \le \frac{\theta _{K-1}^2}{\theta _{-1}^2} \Big (\prod _{l'=1}^{K} (1-\sigma ^K_{l'}) + \sum _{l=0}^{K-1} \frac{\theta _{-1}^2}{\theta _{l}}\prod _{j=l+1}^{K} (1-\sigma ^K_{j})\Big ) \varDelta (x_0) = \rho _K \varDelta (x_0) \end{aligned}$$

where \(\theta _{-1}^2 = \frac{\theta _0^2}{1-\theta _0}\) and

$$\begin{aligned} \sigma ^K_k&= \frac{1 - \frac{\theta _k}{\theta _{k-1}}\frac{1-\theta _0}{1-\theta _k}}{1 + \frac{\theta _{k-1}}{\theta _0 \mu }}, \qquad k \in \{1, \dots , K-1\} \\ \sigma ^K_K&= \frac{1 - \frac{\theta _{K-1}}{\theta _0}(1-\theta _0)}{1 + \frac{\theta _{K-1}}{\theta _0\mu }}. \end{aligned}$$

Step 4: Bound \(\rho _K\) by induction.

$$\begin{aligned} \rho _1&= \frac{\theta _{0}^2}{\theta _{-1}^2} \Big ((1-\sigma _{1}^1) + \frac{\theta _{-1}^2}{\theta _{0}}(1-\sigma _{1}^1)\Big ) = (1-\theta _0)\frac{\frac{\theta _0}{\theta _0\mu } + \frac{\theta _0}{\theta _0}(1-\theta _0)}{1 + \frac{\theta _0}{\theta _0 \mu }}\left( 1+\frac{\theta _0}{1-\theta _0}\right) \\&= \frac{1 + (1-\theta _0)\mu }{1+\mu } \le \frac{1 + (1-\theta _0)\mu }{1+\frac{\theta _0^2}{2\theta _0^2}\mu } \end{aligned}$$

Let us now assume that \(\rho _K \le \frac{1 + (1-\theta _0)\mu }{1+\frac{\theta _0^2\mu }{2\theta _{K-1}^2}}\).

$$\begin{aligned} \rho _{K+1}&= \frac{\theta _{K}^2}{\theta _{-1}^2} \Big (\prod _{l'=1}^{K+1} (1-\sigma ^{K+1}_{l'}) + \sum _{l=0}^{K} \frac{\theta _{-1}^2}{\theta _{l}}\prod _{j=l+1}^{K+1} (1-\sigma ^{K+1}_{j})\Big ) \\&= \frac{\theta _{K}^2}{\theta _{K-1}^2}\frac{\theta _{K-1}^2}{\theta _{-1}^2}(1-\sigma _{K+1}^{K+1})\frac{1-\sigma _{K}^{K+1}}{1-\sigma _K^K} \left( \prod _{l'=1}^{K} (1-\sigma ^{K}_{l'})\right. \\&\quad + \sum _{l=0}^{K-1} \frac{\theta _{-1}^2}{\theta _{l}}\prod _{j=l+1}^{K} (1-\sigma ^{K}_{j}) \\&\quad \left. + \frac{\theta _{-1}^2}{\theta _{K}} \frac{1-\sigma ^{K}_{K}}{1-\sigma _K^{K+1}} \right) \\&= (1-\theta _K) (1-\sigma _{K+1}^{K+1})\frac{1-\sigma _{K}^{K+1}}{1-\sigma _K^K} \left( \rho _K + \frac{\theta _{K-1}^2}{\theta _K}\frac{1-\sigma ^{K}_{K}}{1-\sigma _K^{K+1}} \right) \\&= (1-\theta _K) \frac{1+ \frac{\theta _0 \mu }{\theta _{K}}(1-\theta _0)}{1+\frac{\theta _0 \mu }{\theta _{K}}} \rho _K + \frac{1 + (1-\theta _0) \mu }{1+ \frac{\theta _0 \mu }{\theta _K}}\theta _K \end{aligned}$$

We now use the induction hypothesis.

$$\begin{aligned} \rho _{K+1}&\le (1-\theta _K) \frac{1+ \frac{\theta _0 \mu }{\theta _{K}}(1-\theta _0)}{1+\frac{\theta _0 \mu }{\theta _{K}}} \frac{1 + (1-\theta _0)\mu }{1+\frac{\theta _0^2}{2\theta _{K-1}^2}\mu }+ \frac{1 + (1-\theta _0) \mu }{1+ \frac{\theta _0 \mu }{\theta _K}}\theta _K \\&\le \frac{1 + (1-\theta _0)\mu }{1+\frac{\theta _0^2\mu }{2\theta _{K}^2}} \end{aligned}$$

The last inequality comes from the fact that

$$\begin{aligned}&\frac{1+ \frac{\theta _0 \mu }{\theta _{K}}(1-\theta _0)}{1+\frac{\theta _0 \mu }{\theta _{K}}} \frac{1-\theta _K}{1+\frac{\theta _0^2\mu }{2\theta _{K-1}^2}}+ \frac{\theta _K}{1+ \frac{\theta _0 \mu }{\theta _K}} \le \frac{1}{1+\frac{\theta _0^2\mu }{2\theta _{K}^2}} \\&\quad \Leftrightarrow (1-\theta _K) \left( 1+ \frac{\theta _0 \mu }{\theta _{K}}(1-\theta _0)\right) \left( 1+\frac{\theta _0^2\mu }{2\theta _{K}^2}\right) + \theta _K \left( 1+\frac{\theta _0^2\mu }{2\theta _{K}^2}\right) \left( 1+\frac{\theta _0^2\mu }{2\theta _{K-1}^2}\right) \\&\quad \le \left( 1+\frac{\theta _0^2\mu }{2\theta _{K-1}^2}\right) \left( 1+\frac{\theta _0 \mu }{\theta _{K}}\right) \\&\quad \Leftrightarrow 0 + \mu \left[ (1-\theta _K)\frac{\theta _0(1-\theta _0)}{\theta _{K}} + \frac{(1-\theta _K)\theta _0^2}{2\theta _{K}^2} + \frac{\theta _K\theta _0^2}{2\theta _{K-1}^2} \right. \\&\qquad \left. + \frac{\theta _K\theta _0^2}{2 \theta _{K}^2} - \frac{\theta _0^2}{2 \theta _{K-1}^2} - \frac{\theta _0}{\theta _K}\right] \\&\qquad + \mu ^2 \left[ \frac{(1-\theta _K)\theta _0^3(1-\theta _0)}{2 \theta _K^2 \theta _{K}} + \frac{\theta _K\theta _0^4}{4 \theta _{K-1}^2 \theta _K^2} - \frac{\theta _0^3}{2 \theta _{K-1}^2\theta _K} \right] \le 0 \end{aligned}$$

and both terms in the brackets are nonpositive for all K: they are respectively equal to \(-\theta _0(1-\theta _0)\) and \(-\frac{\theta _0^4}{2 \theta _K^3}(1-\theta _K)\).
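
As a numerical illustration (ours) of Steps 3 and 4, the closed-form rate \(\rho _K\) can be compared with the bound \(\frac{1 + (1-\theta _0)\mu }{1+\frac{\theta _0^2\mu }{2\theta _{K-1}^2}}\) for a range of values of K:

```python
# Compare the closed-form rho_K of Step 3 with the Step 4 bound.
import math

def rho_and_bound(theta0, mu, K):
    theta = [theta0]                                  # theta_0, ..., theta_{K-1}
    for _ in range(K - 1):
        t = theta[-1]
        theta.append(0.5 * (-t**2 + math.sqrt(t**4 + 4 * t**2)))
    th_m1_sq = theta0**2 / (1 - theta0)               # theta_{-1}^2

    def sig(k):                                       # sigma_k^K, k = 1..K
        if k < K:
            return (1 - theta[k] / theta[k - 1] * (1 - theta0) / (1 - theta[k])) \
                   / (1 + theta[k - 1] / (theta0 * mu))
        return (1 - theta[K - 1] / theta0 * (1 - theta0)) \
               / (1 + theta[K - 1] / (theta0 * mu))

    def tail(l):                                      # prod_{j=l+1..K} (1 - sigma_j^K)
        p = 1.0
        for j in range(l + 1, K + 1):
            p *= 1 - sig(j)
        return p

    rho = theta[K - 1]**2 / th_m1_sq \
          * (tail(0) + sum(th_m1_sq / theta[l] * tail(l) for l in range(K)))
    bound = (1 + (1 - theta0) * mu) / (1 + theta0**2 * mu / (2 * theta[K - 1]**2))
    return rho, bound

for K in range(1, 31):
    rho, bound = rho_and_bound(theta0=0.3, mu=0.8, K=K)
    assert rho <= bound + 1e-12
print("rho_K stays below the Step 4 bound for K = 1,...,30")
```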

Cite this article

Fercoq, O., Qu, Z. Restarting the accelerated coordinate descent method with a rough strong convexity estimate. Comput Optim Appl 75, 63–91 (2020). https://doi.org/10.1007/s10589-019-00137-2
