Abstract
We propose new restarting strategies for the accelerated coordinate descent method. Our main contribution is to show that, for a well chosen sequence of restarting times, the restarted method has a nearly geometric rate of convergence. A major feature of the method is that it can exploit the local quadratic error bound of the objective function without knowing the actual value of the error bound. We also show that under the more restrictive assumption that the objective function is strongly convex, any fixed restart period leads to a geometric rate of convergence. Finally, we illustrate the properties of the algorithm on a regularized logistic regression problem and on a Lasso problem.
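As an illustration of the simplest variant mentioned above (a fixed restart period, which under strong convexity already yields a geometric rate), here is a minimal sketch of the generic restart template. The routine accelerated_cd is a hypothetical black box standing in for the accelerated coordinate descent method; the paper's adaptive choice of restarting times is not reproduced here.

    def restarted_acd(F, accelerated_cd, x0, restart_period, n_restarts):
        """Fixed-period restart: run the accelerated method for a fixed
        number of iterations, restart it from the point it reached
        (resetting the momentum), and repeat."""
        x = x0
        for _ in range(n_restarts):
            # each call starts from x with freshly reset extrapolation/momentum
            x = accelerated_cd(F, x, restart_period)
        return x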
Notes
In [3], the \(L^1\) norm is rewritten using a matrix with \(2^n\) rows.
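Presumably this refers to the polyhedral representation of the \(L^1\) norm as a maximum over all sign vectors,
\[
\Vert x\Vert _1 = \max _{s \in \{-1,1\}^n} \langle s, x\rangle ,
\]
i.e. \(\Vert x\Vert _1 = \max _i (Ax)_i\) where the rows of \(A\) are the \(2^n\) sign vectors.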
References
Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM (2017)
Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In: International Conference on Machine Learning, pp. 1080–1089 (2016)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017)
Beck, A., Shtern, S.: Linearly convergent away-step conditional gradient for non-strongly convex functions. Math. Program. 164(1), 1–27 (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)
Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)
Fercoq, O., Qu, Z.: Restarting accelerated gradient methods with a rough strong convexity estimate (2016). arXiv preprint arXiv:1609.07358
Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition (2017). arXiv preprint arXiv:1709.02300
Fercoq, O., Richtárik, P.: Smooth minimization of nonsmooth functions by parallel coordinate descent (2013). Preprint arXiv:1309.5885
Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Gasnikov, A.V., Dvurechensky, P.E.: Stochastic intermediate gradient method for convex optimization problems. Dokl. Math. 93(2), 148–151 (2016)
Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4), 2061–2089 (2013)
Johnson, T., Guestrin, C.: Blitz: a principled meta-algorithm for scaling sparse optimization. In: International Conference on Machine Learning, pp. 1171–1179 (2015)
Klatte, D., Thiere, G.: Error bounds for solutions of linear equations and inequalities. Z. Oper. Res. 41(2), 191–214 (1995)
Lin, Q., Lu, Z., Xiao, L.: An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4), 2244–2273 (2015)
Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 3384–3392. Curran Associates, Inc., Red Hook (2015)
Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS ’13, pp. 147–156, Washington, DC, USA. IEEE Computer Society (2013)
Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993)
Lin, Q., Xiao, L.: An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Comput. Optim. Appl. 60(3), 633–674 (2015)
Liu, M., Yang, T.: Adaptive accelerated gradient converging method under holderian error bound condition. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 3104–3114. Curran Associates, Inc., Red Hook (2017)
Murata, T., Suzuki, T.: Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In: Advances in Neural Information Processing Systems, pp. 608–617 (2017)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \({O}(1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175, 69–107 (2018)
O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)
Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optim. Methods Softw. 31(5), 858–884 (2016)
Roulet, V., D’Aspremont, A.: Sharpness, restart and acceleration. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 1119–1129. Curran Associates, Inc., Red Hook (2017)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)
Tseng, P.: On accelerated proximal gradient methods for convex–concave optimization. Technical report, University of Washington, Seattle (2008)
Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)
Additional information
The first author’s work was supported by the EPSRC Grant EP/K02325X/1 Accelerated Coordinate Descent Methods for Big Data Optimization, the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council), the Orange/Telecom ParisTech think tank Phi-TAB, the ANR Grant ANR-11-LABX-0056-LMH, LabEx LMH as part of the Investissement d’avenir project. The second author’s work was supported by Hong Kong Research Grant Council 27302016. This research was partially conducted using the HKU Information Technology Services research computing facilities that are supported in part by the Hong Kong UGC Special Equipment Grant (SEG HKU09).
Proof of Lemma 1 and Theorem 1
Proof of Lemma 1
Equation (11) holds because \(\theta _{k+1}\) is the unique positive root of the polynomial \(P(X) = X^2 + \theta _k^2 X - \theta _k^2\).
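Indeed, \(P(\theta _{k+1}) = 0\) rearranges to
\[
\theta _{k+1}^2 = \theta _k^2\,(1-\theta _{k+1}), \qquad \text{equivalently}\qquad \frac{1-\theta _{k+1}}{\theta _{k+1}^2} = \frac{1}{\theta _k^2},
\]
which is presumably the content of (11); it is the relation \(\theta _{l-1}^2 = \theta _l^2/(1-\theta _l)\) invoked repeatedly in the proof of Theorem 1 below.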
Let us prove (10) by induction. It is clear that \(\theta _0 \le \frac{2}{0+2/\theta _0}\). Assume that \(\theta _k \le \frac{2}{k+2/\theta _0}\). We know that \(P(\theta _{k+1}) = 0\) and that \(P\) is an increasing function on \([0, +\infty )\). So we just need to show that \(P\big (\frac{2}{k+1+2/\theta _0}\big )\ge 0\).
As \(\theta _k \le \frac{2}{k+2/\theta _0}\) and \(\frac{2}{k+1+2/\theta _0}-1 \le 0\), this nonnegativity follows from the computation sketched below.
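A sketch of this computation, writing \(a = k+2/\theta _0\) and \(t = \frac{2}{k+1+2/\theta _0} = \frac{2}{a+1}\), and using only \(t-1\le 0\) together with the induction hypothesis \(\theta _k \le 2/a\):
\[
P(t) = t^2 + (t-1)\,\theta _k^2 \;\ge \; t^2 + (t-1)\Big (\frac{2}{a}\Big )^2 = \frac{4}{(a+1)^2} + \frac{1-a}{a+1}\cdot \frac{4}{a^2} = \frac{4}{a^2(a+1)^2} \;\ge \; 0 .
\]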
For the other inequality, note first that \(\frac{2-\theta _0}{0+(2-\theta _0)/\theta _0} \le \theta _0\). We now assume that \(\theta _k \ge \frac{2-\theta _0}{k+(2-\theta _0)/\theta _0}\) but that \(\theta _{k+1} < \frac{2-\theta _0}{k+1+(2-\theta _0)/\theta _0}\). Remark that \(x \mapsto (1-x)/x^2\) is strictly decreasing on \((0,2)\). Then, using (11) and the induction hypothesis, we arrive at an inequality which obviously does not hold for any \(k \ge 0\) (the computation is sketched below). This contradiction shows that \(\theta _{k+1} \ge \frac{2-\theta _0}{k+1+(2-\theta _0)/\theta _0}\).
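A sketch of the omitted chain of inequalities, writing \(b = (2-\theta _0)/\theta _0\) and \(c = \frac{2-\theta _0}{k+1+b}\), and assuming (11) is the recursion \(\theta _k^2 = \theta _{k+1}^2/(1-\theta _{k+1})\): since \(\theta _{k+1} < c\) and \(x\mapsto (1-x)/x^2\) is strictly decreasing on \((0,2)\),
\[
\frac{1}{\theta _k^2} = \frac{1-\theta _{k+1}}{\theta _{k+1}^2} > \frac{1-c}{c^2} = \frac{(k+1+b)(k+b-1+\theta _0)}{(2-\theta _0)^2},
\qquad \text{while}\qquad \frac{1}{\theta _k^2} \le \frac{(k+b)^2}{(2-\theta _0)^2}
\]
by the induction hypothesis. Combining the two displays and expanding gives \((k+1+b)(k+b-1+\theta _0) < (k+b)^2\), that is \(\theta _0 (k+b+1) < 1\), i.e. \(\theta _0 k < -1\), which indeed fails for every \(k \ge 0\).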
Proof of Theorem 1
The proof is organised in four steps. First, we derive a one-iteration inequality; second, we identify conditions under which this inequality is a supermartingale inequality; third, we exhibit a solution to the resulting set of inequalities; finally, we bound the rate we obtain by a simpler formula.
Step 1: Derive a one-iteration inequality.
By Lemma 2 in [12], there exist nonnegative numbers \(\gamma _k^l\) such that for all k, \(\sum _{l=0}^k \gamma _k^l = 1\) and \(x_k = \sum _{l=0}^k \gamma _k^l z_l\). Let us denote \({\hat{F}}_k = f(x_k) + \sum _{l=0}^k \gamma _k^l \psi (z_l)\). We have \({\hat{F}}_k \ge F(x_k)\), where the inequality follows from the convexity of \(\psi \), and \({\hat{F}}_0 = F(x_0)\). By [12], for any \(x_* \in {\mathcal {X}}^*\),
Let us denote by \(x_*\) the unique element of \({\mathcal {X}}^*\). Using the equality
which is valid for any vectors a, b and \(\lambda \in {\mathbb {R}}\), we get,
Since
we simplify \(\big (1-\frac{\theta _k}{\theta _0}\big ) (x_{k+1} - x_*) + \frac{\theta _k}{\theta _0} (x_{k+1} - z_{k+1}) = (1-\theta _k)x_k - \frac{\theta _k}{\theta _0}(1-\theta _0) z_k - \big (1-\frac{\theta _k}{\theta _0}\big ) x_*\).
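The equality invoked above is presumably the convex-combination identity, valid for any vectors \(a\), \(b\) and any \(\lambda \in {\mathbb {R}}\) (here with the weighted norm \(\Vert \cdot \Vert _v\)),
\[
\Vert \lambda a + (1-\lambda )b\Vert _v^2 = \lambda \Vert a\Vert _v^2 + (1-\lambda )\Vert b\Vert _v^2 - \lambda (1-\lambda )\Vert a-b\Vert _v^2 ,
\]
applied with \(\lambda = 1-\frac{\theta _k}{\theta _0}\), \(a = x_{k+1}-x_*\) and \(b = x_{k+1}-z_{k+1}\), so that \(a-b = z_{k+1}-x_*\); the simplification displayed above is consistent with the updates \(y_k = (1-\theta _k)x_k + \theta _k z_k\) and \(x_{k+1} = y_k + \frac{\theta _k}{\theta _0}(z_{k+1}-z_k)\) used in [12].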
We then reorganize (30) and use again (29) to get
When \(k=0\), the equalities \(x_0 = z_0\) and \(z_1 = x_1\) imply that the inequality still holds with the convention \(0/0 = 0\).
Let us consider nonnegative sequences \((a_k)\), \((b_k)\), \((c_k)\), \((d_k)\) and \((\sigma ^K_k)\) and study the quantity
By the strong convexity condition of F, we have:
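Presumably this is the quadratic growth consequence of strong convexity with respect to \(\Vert \cdot \Vert _v\),
\[
F(x) - F(x_*) \ge \frac{\mu }{2}\Vert x - x_*\Vert _v^2 \qquad \text{for all } x,
\]
which allows a fraction \(\sigma _{k+1}^K\) of the distance term to be exchanged against a function-value term.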
Hence, for any \(\sigma _{k+1}^K \in [0,1]\) (the usefulness of the superscript \(K\) will become clear later), we get
Applying (28) we get:
Step 2: Conditions to get a supermartingale.
In order to have the supermartingale inequality \({\mathbf {E}}_k[ C_{k+1}]\le C_k\), we need the following constraints on the parameters for \(k \in \{0, \ldots , K-1\}\):
(as \(x_0-z_0 = 0\) and \(x_1 = z_1\), there is no constraint on \(d_0\) and \(d_1\)). We also add the constraints
in order to get \(C_0 = \frac{1-\theta _0}{\theta _0^2} {\hat{F}}_0 + \frac{1}{2\theta _0^2}\left\| x_0 - x_* \right\| ^2_v + \frac{d_0}{2} \left\| x_0 - x_0 \right\| ^2_v = \varDelta _0\) and
If we can find a set of nonnegative sequences satisfying (32)–(40), then we have
Step 3: Explicit solution of the inequalities.
By analogy to the gradient case, we are looking for such sequences saturating (32)–(35). For \(k \in \{1,\dots , K-1\}\), equality in (32)–(35) can be fulfilled only if
Indeed, the saturation requirement leads to
We get the formula for \(\sigma ^K_k\) by removing \(a_k\) on both sides thanks to \(\theta _{k-1}^2 = \frac{\theta _k^2}{1-\theta _k}\) and by dividing by \(b_k\). Note that we have \(\sigma _k^K \le 1 - \frac{\theta _k}{\theta _{k-1}}\frac{1-\theta _0}{1-\theta _k} \le 1 - \frac{1-\theta _0}{1-\theta _k} \le 1 - (1-\theta _0) \le \theta _0\).
For \(k = K\), we would like to ensure (39) and (40). This can be done by taking
Here also, \(\sigma _K^K \le \theta _0\).
Having found \((\sigma ^K_k)\), we can get \((b_k)\) as a function of \(b_0\) through the equality in (33)
and \((a_k)\) as a function of \(b_0\) through the equality in (32)
Using equality in (34) and (35), we get the relation
and so
This gives us the opportunity to calculate \(b_0\) because
Hence, we get two equalities:
and so since \(\theta _{l-1}^2 = \theta _l^2/(1-\theta _l) \)
We then need to check that there is a feasible sequence \((d_k)\). Indeed, such a sequence exists because the upper and lower bounds satisfy
which is true for all \(k \ge 2\) because \(\sigma ^K_k \le \theta _0\).
Up to now, we have constructed sequences \((\sigma _k^K)\), \((a_k)\), \((b_k)\), \((c_k)\) and \((d_k)\) that clearly satisfy (32) to (39). Moreover, \(\sigma _k^K \in [0,1]\), \(b_k \ge 0\), \(c_k \ge 0\) and \(d_k \ge 0\).
We can then easily check that \(c_K = 0\) and (42) implies that \(a_{K} = (1-\theta _0) b_K\). Hence, \(a_K \ge 0\). By recursively looking at (32), we deduce that \(a_k \ge 0\) for all \(k \le K\).
Now that we have the sequences, using the relation (41), we can compute the rate as
Hence, we obtain that
where \(\theta _{-1}^2 = \frac{\theta _0^2}{1-\theta _0}\) and
Step 4: Bound \(\rho _K\) by induction.
Let us now assume that \(\rho _K \le \frac{1 + (1-\theta _0)\mu }{1+\frac{\theta _0^2\mu }{2\theta _{K-1}^2}}\).
We now use the induction hypothesis.
The last inequality comes from the fact that
and both terms in the brackets are nonpositive for all K: they are respectively equal to \(-\theta _0(1-\theta _0)\) and \(-\frac{\theta _0^4}{2 \theta _K^3}(1-\theta _K)\).