Abstract
This paper concerns the worst-case complexity of cyclic coordinate descent (C-CD) for minimizing a convex quadratic function, a setting in which C-CD is equivalent to the Gauss–Seidel method, the Kaczmarz method and projection onto convex sets (POCS). We observe that the known provable complexity of C-CD can be \(\mathcal {O}(n^2)\) times slower than that of randomized coordinate descent (R-CD), but no example had been proven to exhibit such a large gap. In this paper we show that the gap indeed exists. We prove that there exists an example for which C-CD takes at least \(\mathcal {O}(n^4 \kappa _{\text {CD}} \log \frac{1}{\epsilon })\) operations, where \(\kappa _{\text {CD}}\) is related to Demmel's condition number and determines the convergence rate of R-CD. This implies that in the worst case C-CD can indeed be \(\mathcal {O}(n^2)\) times slower than R-CD, which has complexity \(\mathcal {O}( n^2 \kappa _{\text {CD}} \log \frac{1}{\epsilon })\). Note that for this example, the gap exists for any fixed update order, not just a particular order. An immediate consequence is that for the Gauss–Seidel method, the Kaczmarz method and POCS, there is also an \(\mathcal {O}(n^2) \) gap between the cyclic versions and the randomized versions (for solving linear systems). One difficulty with the analysis is that the spectral radius of a non-symmetric iteration matrix does not necessarily constitute a lower bound for the convergence rate. Finally, we design numerical experiments which show that the size of the off-diagonal entries is an important indicator of the practical performance of C-CD.
Notes
These works often analyze cyclic BCGD (block coordinate gradient descent). For minimizing convex quadratic functions, cyclic CGD with a special stepsize is the same as cyclic CD (i.e., exactly minimizing each subproblem).
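To make this concrete, consider the quadratic convention \(f(x) = x^T A x - 2 b^T x\) used in Appendix A, for which the i-th coordinate-wise gradient Lipschitz constant is \(2 A_{ii}\); a short verification (a sketch, using only these definitions) shows that the "special stepsize" is \(1/(2A_{ii})\):
\( 0 = \frac{d}{dt} f(x + t e_i) = \nabla _i f(x) + 2 A_{ii} t \; \Longrightarrow \; t^* = - \frac{ \nabla _i f(x) }{ 2 A_{ii} } , \)
so the exact coordinate minimization \( x \leftarrow x + t^* e_i \) coincides with the coordinate gradient step with stepsize \(1/(2A_{ii})\).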
The reference [45] discussed the Gauss–Seidel method and seems to be unknown to optimization researchers. We first derived this bound by applying the analysis of the paper [38] (which studies non-strongly convex problems) to strongly convex quadratic problems. While preparing this paper, we became aware of the earlier reference [45], which obtained a similar bound.
Note that G–S, POCS and CD are not equivalent in more general settings; in fact, G–S can be used to solve non-symmetric linear systems, POCS can be used to find a point in the intersection of arbitrary closed convex sets, and CD can be used to solve non-quadratic non-smooth problems.
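In the quadratic setting of this paper, the equivalence is immediate from the sweep formula: writing \(\Gamma \) for the lower triangular part of A including the diagonal (as in Appendix A), one full C-CD pass over \(f(x) = x^T A x - 2 b^T x\) reads
\( x^{k+1} = x^k - \Gamma ^{-1} ( A x^k - b ) , \quad \text {i.e.,} \quad \Gamma x^{k+1} = b - (A - \Gamma ) x^k , \)
which is exactly one Gauss–Seidel sweep for the linear system \(A x = b\).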
After we posted the first version of this paper in April 2016, Stephen Wright pointed out to us that he had proposed the example analyzed in this paper in a talk in Paris in July 2015 and in a talk at NYU in December 2015. He had also noticed the large gap between C-CD and R-CD for this example, although no theoretical analysis was publicly available.
When the matrix is sparse, the time is actually \(O(\text {nnz}(A))\); to simplify the discussion, we do not consider sparsity in this work.
Rigorously speaking, the upper bounds on the convergence rate can be transformed into upper bounds on the complexity, but the lower bounds require a bit more work. See the arXiv version for a more rigorous treatment of the lower bounds on the complexity.
References
Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, pp. 408–415. ACM (2008)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for l1-regularized loss minimization. arXiv preprint arXiv:1105.5379 (2011)
Mazumder, R., Friedman, J.H., Hastie, T.: SparseNet: coordinate descent with nonconvex penalties. J. Am. Stat. Assoc. 106(495), 1125–1138 (2011)
Razaviyayn, M., Hong, M., Luo, Z.-Q.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)
Baligh, H., Hong, M., Liao, W.-C., Luo, Z.-Q., Razaviyayn, M., Sanjabi, M., Sun, R.: Cross-layer provision of future cellular networks: a WMMSE-based approach. IEEE Signal Process. Mag. 31(6), 56–68 (2014)
Sun, R., Baligh, H., Luo, Z.-Q.: Long-term transmit point association for coordinated multipoint transmission by stochastic optimization. In: 2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 330–334. IEEE (2013)
Hong, M., Sun, R., Baligh, H., Luo, Z.-Q.: Joint base station clustering and beamformer design for partial coordinated transmission in heterogeneous networks. IEEE J. Sel. Areas Commun. 31(2), 226–240 (2013)
Canutescu, A.A., Dunbrack, R.L.: Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci. 12(5), 963–972 (2003)
Bouman, C.A., Sauer, K.: A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Process. 5(3), 480–492 (1996)
Greenbaum, A.: Iterative Methods for Solving Linear Systems, vol. 17. SIAM, Philadelphia (1997)
Wen, Z., Goldfarb, D., Scheinberg, K.: Block coordinate descent methods for semidefinite programming. In: Anjos, M., Lasserre, J. (eds.) Handbook on Semidefinite, Conic and Polynomial Optimization. International Series in Operations Research & Management Science, vol. 166. Springer, Boston, MA
Sun, R., Luo, Z.-Q.: Guaranteed matrix completion via nonconvex factorization. In: 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pp. 270–289. IEEE (2015)
Powell, M.J.D.: On search directions for minimization algorithms. Math. Program. 4(1), 193–201 (1973)
Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)
Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Oper. Res. Lett. 26, 127–136 (2000)
Luo, Z.-Q., Tseng, P.: On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl. 72(1), 7–35 (1992)
Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35(3), 641–654 (2010)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144, 1–38 (2014)
Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152(1–2), 615–642 (2015)
Qu, Z., Richtárik, P., Zhang, T.: Randomized dual coordinate ascent with arbitrary sampling. arXiv preprint arXiv:1411.5873 (2014)
Lin, Q., Lu, Z., Xiao, L.: An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. arXiv preprint arXiv:1407.1296 (2014)
Zhang, Y., Xiao, L.: Stochastic primal-dual coordinate method for regularized empirical risk minimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 353–361 (2015)
Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)
Patrascu, A., Necoara, I.: Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. J. Glob. Optim. 61(1), 19–46 (2015)
Hsieh, C.-J., Yu, H.-F., Dhillon, I.S.: PASSCoDe: parallel asynchronous stochastic dual co-ordinate descent. arXiv preprint arXiv:1504.01365 (2015)
Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156. IEEE (2013)
Allen-Zhu, Z., Orecchia, L.: Nearly-linear time positive LP solver with faster convergence rate. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 229–236. ACM (2015)
Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)
Sun, R., Luo, Z.-Q., Ye, Y.: On the expected convergence of randomly permuted ADMM. arXiv preprint arXiv:1503.06387 (2015)
Sun, R., Hong, M.: Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. In: Advances in Neural Information Processing Systems, pp. 1306–1314 (2015)
Lee, C.-P., Wright, S.J.: Random permutations fix a worst case for cyclic coordinate descent. arXiv preprint arXiv:1607.08320 (2016)
Xiao, L., Yu, A.W., Lin, Q., Chen, W.: DSCOVR: randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. arXiv preprint arXiv:1710.05080 (2017)
Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)
Beck, A.: On the convergence of alternating minimization with applications to iteratively reweighted least squares and decomposition schemes. SIAM J. Optim. 25(1), 185–209 (2015)
Saha, A., Tewari, A.: On the nonasymptotic convergence of cyclic coordinate descent method. SIAM J. Optim. 23(1), 576–601 (2013)
Hong, M., Wang, X., Razaviyayn, M., Luo, Z.-Q.: Iteration complexity analysis of block coordinate descent methods. Preprint, arXiv:1310.6957 (2013)
Oswald, P.: On the convergence rate of SOR: a worst case estimate. Computing 52(3), 245–255 (1994)
Deutsch, F.: The method of alternating orthogonal projections. In: Singh, S.P. (eds.) Approximation Theory, Spline Functions and Applications. NATO ASI Series (Series C: Mathematical and Physical Sciences), vol. 356. Springer, Dordrecht
Von Neumann, J.: Functional Operators. Volume II, the Geometry of Orthogonal Spaces (1950). This is a reprint of mimeographed lecture notes first distributed in 1933
Smith, K.T., Solmon, D.C., Wagner, S.L.: Practical and mathematical aspects of the problem of reconstructing objects from radiographs. Bull. Am. Math. Soc. 83(6), 1227–1270 (1977)
Kayalar, S., Weinert, H.L.: Error bounds for the method of alternating projections. Math. Control Signals Syst. (MCSS) 1(1), 43–59 (1988)
Deutsch, F., Hundal, H.: The rate of convergence for the method of alternating projections, II. J. Math. Anal. Appl. 205(2), 381–405 (1997)
Bauschke, H.H., Borwein, J.M., Lewis, A.S.: The method of cyclic projections for closed convex sets in Hilbert space. Contemp. Math. 204, 1–38 (1997)
Escalante, R., Raydan, M.: Alternating Projection Methods. SIAM, Philadelphia (2011)
Galántai, A.: Projectors and Projection Methods, vol. 6. Springer, Berlin (2013)
Kaczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres 35, 355–357 (1937)
Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)
Von Neumann, J.: On rings of operators. Reduction theory. Ann. Math. 50(2), 401–485 (1949)
Halperin, I.: The product of projection operators. Acta Sci. Math. (Szeged) 23(1–2), 96–99 (1962)
Young, D.: Iterative methods for solving partial difference equations of elliptic type. Trans. Am. Math. Soc. 76(1), 92–111 (1954)
Widlund, O.B.: On the effects of scaling of the Peaceman–Rachford method. Math. Comput. 25(113), 33–41 (1971)
Galántai, A.: On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl. 310(1), 30–44 (2005)
Necoara, I., Richtarik, P., Patrascu, A.: Randomized projection methods for convex feasibility problems: conditioning and convergence rates. arXiv preprint arXiv:1801.04873 (2018)
Angelos, J.R., Cowen, C.C., Narayan, S.K.: Triangular truncation and finding the norm of a Hadamard multiplier. Linear Algebra Appl. 170, 117–135 (1992)
Uherka, D.J., Sergott, A.M.: On the continuous dependence of the roots of a polynomial on its coefficients. Am. Math. Mon. 84(5), 368–370 (1977)
Appendix
A Proofs of upper bounds
A.1 Proof of Proposition 3.1
Without loss of generality, we can assume \( b = 0 \). In fact, minimizing \(f(x) = x^T A x - 2 b^T x\) is equivalent to minimizing \( (x - x^*)^T A (x - x^*) \), where \(x^* = A^{\dag }b \); here we use the fact that \( A x^* = A A^{\dag }b = b \) when \(b \in \mathcal {R}(A)\). Via the linear transformation \(z = x-x^*\), C-CD for minimizing \( (x - x^*)^T A (x - x^*) \) starting from \(x^0\) is equivalent to C-CD for minimizing \( z^T A z \) starting from \(z^0 = x^0-x^*\). Thus we can assume \(x^* = 0\), or equivalently, \(b = 0\).
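For completeness, the identity behind this reduction is the following (using \( A x^* = b \)):
\( x^T A x - 2 b^T x = x^T A x - 2 (x^*)^T A x = (x - x^*)^T A (x - x^*) - (x^*)^T A x^* , \)
so the two objectives differ only by a constant and have the same minimizers up to the shift \(z = x - x^*\).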
The update equation of C-CD now becomes
\( x^{k+1} = x^k - \Gamma ^{-1} A x^k = ( I - \Gamma ^{-1} A ) x^k , \)
where \(\Gamma \) is the lower triangular part of A with diagonal entries, i.e., \(\Gamma _{ij} = A_{ij}, 1 \le j \le i \le n \), and \(d^k = \Gamma ^{-1}A x^k \) is the moving direction. This implies
\( A x^k = \Gamma d^k . \)  (47)
We first assume A is positive definite and will show how to extend to the PSD case in the end.
The proof consists of two main claims. The first claim relates the convergence rate of C-CD with the spectral radius of a certain matrix.
Claim A.1
Let \(D_A = \text {diag}(A_{11}, \ldots , A_{nn})\) be a diagonal matrix with entries \(A_{ii}\)'s. Then
\( f(x^{k+1}) \le \left( 1 - \frac{1}{ \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } \right) f(x^k) . \)  (48)
First Proof of Claim A.1
(Optimization Perspective) Following the proof framework of [38], we bound the descent amount and the cost yet to be minimized (cost-to-go) respectively. Suppose \(w^0 = x^k, w^n = x^{k+1}\) and \(w^1,\ldots , w^{n-1}\) are the \(n-1\) intermediate iterates. Since \(w^i\) is obtained by minimizing f over the i-th coordinate with other variables fixed, it is easy to verify
\( f(w^{i-1}) - f(w^{i}) = \frac{ | \nabla _i f(w^{i-1}) |^2 }{ 2 ( 2 A_{ii} ) } , \quad i = 1, \ldots , n. \)  (50)
In the above expression, \( 2 A_{ii} \) can be viewed as the i-th coordinate-wise Lipschitz constant of \(\nabla f\) from an optimization perspective. We have
\( w^{i} = w^{i-1} - d_i^k e_i , \quad d_i^k = \frac{ \nabla _i f(w^{i-1}) }{ 2 A_{ii} } , \)
where \(e_i\) is the i-th unit vector. Then
\( f(w^{i-1}) - f(w^{i}) = A_{ii} ( d_i^k )^2 . \)
Therefore, the descent amount \( f(x^k) - f(x^{k+1}) \) can be bounded in terms of \(d^k\) as
\( f(x^k) - f(x^{k+1}) = \sum _{i=1}^{n} A_{ii} ( d_i^k )^2 = (d^k)^T D_A d^k . \)  (51)
The cost-to-go estimate is simply
\( f(x^k) = (x^k)^T A x^k \overset{ (47) }{=} (d^k)^T \Gamma ^T A^{-1} \Gamma d^k \)  (52)
\( \le \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert \, (d^k)^T D_A d^k . \)  (53)
Combining with (51), we obtain
\( f(x^k) - f(x^{k+1}) = (d^k)^T D_A d^k \ge \frac{ f(x^k) }{ \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } , \)
which implies (48). \(\square \)
Second Proof of Claim A.1
(Matrix Recursion Perspective) One natural idea is to prove \(f(x^{k+1}) = M_f f(x^k) \) or \(f(x^{k+1}) \le \Vert M_f\Vert f(x^k)\) for a certain matrix \(M_f\), based on the update equation of the iterates \(x^{k+1} = (I - \Gamma ^{-1} A )x^k\). We can write down the expression of \(f(x^{k+1})\) in terms of \(x^k\) as \(f(x^{k+1}) = (x^k)^T (I - \Gamma ^{-1} A )^T A (I - \Gamma ^{-1} A ) x^k \). However, it is not clear how this expression is related to \(f(x^k) = (x^k)^T A x^k \). A simple trick to resolve this issue is to express everything in terms of \(d^k\). More specifically, we have
\( f(x^{k+1}) = (x^k - d^k)^T A (x^k - d^k) \overset{ (47) }{=} f(x^k) - 2 (d^k)^T \Gamma d^k + (d^k)^T A d^k = f(x^k) - (d^k)^T D_A d^k , \)  (54)
where the last step is because \( \Gamma + \Gamma ^T = A + D_A \). Equation (54) is equivalent to (51) derived earlier using another approach. The rest is the same as the first proof. \(\square \)
Remark Although the second proof seems simpler, for people who are familiar with optimization the first proof is probably easier to understand: equation (50) is just the classical descent lemma (applied to each coordinate), thus (51) is straightforward to derive. In the proof of [38], one crucial step is to bound the cost-to-go in terms of \(d^k\); here for the quadratic case the cost-to-go has a closed form expression given by (53). The second proof is cleaner to write, but it is specifically tailored for the quadratic problem; in contrast, the first proof can be extended to non-quadratic problems as done in [38] ((51) and (53) will become inequalities).
Claim A.2
Let \(D_A = \text {diag}(A_{11}, \ldots , A_{nn})\) be a diagonal matrix with entries \(A_{ii}\)’s. Then
Proof of Claim A.2
Denote \(\Gamma _{\mathrm {unit}}\) as the \(n \times n\) lower triangular matrix of all ones, i.e., \( (\Gamma _{\mathrm {unit}})_{ij} = 1 \) for \( 1 \le j \le i \le n \) and \( (\Gamma _{\mathrm {unit}})_{ij} = 0 \) otherwise;
then \( \Gamma = \Gamma _{\mathrm {unit} } \circ A\), where \(\circ \) denotes the Hadamard product. According to the classical result on the operator norm of the triangular truncation operator [62, Theorem 1], we have
Thus we have
which proves the second part of (55).
We can bound \( \Vert \Gamma \Vert ^2 \) in another way (denote \(\lambda _i\)'s as the eigenvalues of A):
\( \Vert \Gamma \Vert ^2 \le \Vert \Gamma \Vert _F^2 \le \Vert A \Vert _F^2 = \sum _i \lambda _i^2 \le \lambda _{\max } \sum _i \lambda _i \overset{ (i) }{=} \lambda _{\max } \sum _i L_i , \)
where (i) is because \( \sum _i \lambda _i = \text {tr}(A) = \sum _i A_{ii} \) and \(A_{ii} = L_i \). Thus
\( \Vert \Gamma \Vert ^2 \le \Vert A \Vert \sum _i L_i = n L_{\mathrm {avg}} \lambda _{\max } , \)
which proves the first part of (55). \(\square \)
Finally, according to the fact \( \Vert D_A^{-1/2} B D_A^{-1/2} \Vert \le \frac{1}{\min _i L_i } \Vert B \Vert = \frac{1}{L_{\min } } \Vert B \Vert \) for any positive definite matrix B, we have
\( \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert \le \frac{1}{L_{\min }} \Vert \Gamma ^T A^{-1} \Gamma \Vert \le \frac{ \Vert \Gamma \Vert ^2 }{ L_{\min } \lambda _{\min } } . \)
Plugging this inequality into (48) and replacing \(\sum _i L_i\) by \(n L_{\mathrm {avg}}\), we obtain (6a).
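Spelled out, this plug-in step gives the following chain (a sketch; (6a) states this bound in the paper's notation):
\( f(x^{k+1}) \le \left( 1 - \frac{ L_{\min } \lambda _{\min } }{ \Vert \Gamma \Vert ^2 } \right) f(x^k) \le \left( 1 - \frac{ L_{\min } \lambda _{\min } }{ n L_{\mathrm {avg}} \lambda _{\max } } \right) f(x^k) , \)
where the second inequality uses the first bound of Claim A.2.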
Now we show how to modify the above proof to the case that A is PSD. From (47) we have
\( (x^k)^T A x^k = (x^k)^T A A^{\dag } A x^k = (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k . \)
Then (52) is slightly modified to \( (x^k)^T A x^k = (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k. \) We still have (51) since its proof does not require A to be positive definite. Now we modify (53) to
\( f(x^k) = (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k \overset{ (i) }{\le } \frac{1}{\lambda _{\min }} \Vert \Gamma d^k \Vert ^2 \le \frac{ \Vert \Gamma \Vert ^2 }{ L_{\min } \lambda _{\min } } (d^k)^T D_A d^k , \)  (57)
where (i) is because \(\Vert A^{\dag }\Vert = 1/\lambda _{\min }\), where \(\lambda _{\min }\) is the minimum non-zero eigenvalue of A. The rest is almost the same as the proof for the PD case: obtaining the bounds of \( \Gamma ^T \Gamma \) as in Claim A.2 and plugging them into (57) immediately leads to (6a).
The first bound of result (6b) is a direct corollary of (6a) because \( \kappa \le n \kappa _{\mathrm CD}\) (which is because \(\lambda _{\max }(A) \le \mathrm {tr}(A) = n L_{\mathrm {avg}}\)). The second bound of (6b) is the same as the second bound of (6a) because
This finishes the proof of Proposition 3.1.
A.2 Proof of Proposition 3.2
This proof is a slight modification of the proof of Proposition 3.1.
We first consider the case that A is positive definite. The insight is to rewrite the relation proved in Claim A.1,
\( f(x^{k+1}) \le \left( 1 - \frac{1}{ \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } \right) f(x^k) , \)
as
\( f(x^{k+1}) \le \left( 1 - \frac{1}{ \Vert {\hat{\Gamma }}^T {\hat{A}}^{-1} {\hat{\Gamma }} \Vert } \right) f(x^k) , \)  (59)
where \( {\hat{\Gamma }} = D_A^{-1/2} \Gamma D_A^{-1/2} \) and \( \hat{A} = D_A^{-1/2} A D_A^{-1/2} \). Note that \({\hat{\Gamma }}\) is still the lower-triangular part (with diagonal entries) of the Jacobi-preconditioned matrix \(\hat{A}\). The diagonal entries of \({\hat{\Gamma }}\) and \(\hat{A}\) are all 1, so \(\hat{L}_i = 1, \ \forall i\).
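The rewriting is justified by a direct computation from the definitions of \({\hat{\Gamma }}\) and \(\hat{A}\):
\( {\hat{\Gamma }}^T {\hat{A}}^{-1} {\hat{\Gamma }} = ( D_A^{-1/2} \Gamma ^T D_A^{-1/2} ) ( D_A^{1/2} A^{-1} D_A^{1/2} ) ( D_A^{-1/2} \Gamma D_A^{-1/2} ) = D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} . \)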
Applying Claim A.2 we have
Plugging the above relation into (59) we obtain (10a). Similar to Proposition 3.1, the second bound (10b) follows directly from (10a).
The case that A is PSD can be handled in a similar way to the proof of Proposition 3.1.
B Supplemental Proofs for Theorem 3.1
B.1 Proof of Lemma 5.1
Suppose \(\lambda \) is an eigenvalue of \( Z = \Gamma ^{-1}A \) and \(v = (v_1; v_2; \ldots ; v_n) \in {\mathbb {C}}^{n \times 1}\) is the corresponding eigenvector. Then we have
\( A v = \lambda \Gamma v , \quad \text {i.e.,} \quad v_i + c \sum _{j \ne i} v_j = \lambda \Big( v_i + c \sum _{j < i} v_j \Big) , \quad i = 1, \ldots , n . \)  (60)
Without loss of generality, we can assume
\( \sum _{j=1}^{n} v_j = 1 . \)  (61)
Let \(\hat{c} = 1 - c\). Then (60) becomes
\( \hat{c} v_i + c = \lambda \Big( v_i + c \sum _{j < i} v_j \Big) , \quad i = 1, \ldots , n . \)  (62)
The first equation implies \( v_1 = \frac{c}{ \lambda - \hat{c}}\). Plugging into the second equation, we get
\( v_2 = \frac{ c \hat{c} ( \lambda - 1 ) }{ ( \lambda - \hat{c} )^2 } . \)
Plugging the expression of \(v_1, v_2\) into the third equation, we get
\( v_3 = \frac{ c \hat{c}^2 ( \lambda - 1 )^2 }{ ( \lambda - \hat{c} )^3 } . \)
In general, we can prove by induction that
\( v_k = \frac{ c }{ \lambda - \hat{c} } \, q^{k-1} , \quad k = 1, \ldots , n , \)  (63)
where
\( q = \frac{ \hat{c} ( \lambda - 1 ) }{ \lambda - \hat{c} } . \)  (64)
We can also express \(\lambda \) in terms of q as
\( \lambda = \frac{ \hat{c} - \hat{c} q }{ \hat{c} - q } . \)  (65)
Note that the expression of \(v_k\) given by (63) satisfies (62) for any \(\lambda \), but our goal is to compute \(\lambda \). To do this, we need to utilize the normalization assumption (61). In particular, we have (when \(q \ne 1\))
\( 1 = \sum _{k=1}^{n} v_k = \frac{ c }{ \lambda - \hat{c} } \cdot \frac{ q^{n} - 1 }{ q - 1 } , \)
which, after substituting (65) and simplifying, is equivalent to
\( q^{n} ( q - \hat{c} ) = c q . \)  (66)
The above procedure is reversible, i.e. suppose \(q \ne 1\) is a root of \( q^{n} (q - \hat{c}) = c q \), then \(\lambda = \frac{ \hat{c} - \hat{c}q }{\hat{c}-q} \) is an eigenvalue of Z. Suppose the \( n + 1 \) roots of \( q^{n} (q - \hat{c}) = c q \) are \(q_0 = 0, q_1, \ldots , q_{n-1}, q_n = 1\) (\(q = 0\) and \(q=1\) are always roots), then \( \lambda _k = \frac{ \hat{c} - \hat{c} q_k }{ \hat{c}- q_k } \overset{ (66)}{=} 1 - q_k^n , k= 0, \ldots , n - 1\) are the n eigenvalues of Z.
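For completeness, the step marked (66) in the last display can be verified directly: for any root \(q_k \ne 0, \hat{c}\) of (66) we have \( q_k^{n} = c q_k / ( q_k - \hat{c} ) \), hence
\( 1 - q_k^{n} = \frac{ q_k - \hat{c} - c q_k }{ q_k - \hat{c} } = \frac{ \hat{c} q_k - \hat{c} }{ q_k - \hat{c} } = \frac{ \hat{c} - \hat{c} q_k }{ \hat{c} - q_k } = \lambda _k , \)
using \( 1 - c = \hat{c} \).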
B.2 Proof of Lemma 5.2
It is well known that the roots of a polynomial depend continuously on its coefficients, i.e., if the coefficients of a family of polynomials f converge to the coefficients of a polynomial \(f^*\), then the roots of f converge to the roots of \(f^*\). More precisely, if \( x^* \) is a root of \(f^*\) of multiplicity m, then for any \(\epsilon > 0\), if the coefficients of f are close enough to those of \(f^*\), then f has at least m roots within \(\epsilon \) distance of \(x^*\) (here, a root with multiplicity k is counted as k roots); see, e.g., [63, Theorem]. If one chooses \(\epsilon \) to be smaller than the minimal distance between the distinct roots of \(f^*\), then f has exactly m roots (counting multiplicity) within \(\epsilon \) distance of \(x^*\).
Below, we provide a rigorous proof of Lemma 5.2 using Rouché’s theorem in complex analysis.
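For reference, we use Rouché's theorem in the following standard form: if F and G are holomorphic on an open set containing a closed ball B and its boundary \(\partial B\), and
\( | G(z) | < | F(z) | , \quad \forall z \in \partial B , \)
then F and \(F + G\) have the same number of zeros (counting multiplicity) inside B.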
When \(n=1 \), the only solution of \( q^{n-1} (q - 1 + c) = c \) is \(q = 1\), thus the conclusion holds. From now on, we assume \(n \ge 2\). Let \(p = 1/q\); then the equation \( q^{n-1} (q - 1 + c) = c \) becomes
\( p^{n} - 1 + \Big( \frac{1}{c} - 1 \Big) ( p - 1 ) = 0 . \)
This equation can be written as \(F(p) + G_c(p) = 0\), where F and \(G_c\) are defined as
\( F(p) = p^{n} - 1 , \quad G_c(p) = \Big( \frac{1}{c} - 1 \Big) ( p - 1 ) . \)
Lemma B.1
Suppose \(n \ge 2\). For any \( 0< \epsilon < \sin ( \pi /n) \), there exists some \(\delta > 0 \) such that for any \(c \in (1 - \delta , 1 )\), \(F(p) + G_c(p) \) has exactly one root \(p_k\) in the ball \( B( e^{- i 2k \pi /n}, \epsilon ) \triangleq \{ z \mid | z - e^{- i 2k \pi /n} | \le \epsilon \} \), \(k = 0, 1, \ldots , n - 1\).
Clearly, the function F has n roots \( \eta _k \triangleq e^{- i 2k \pi /n}, k = 0, 1, \ldots , n - 1 \). The distance between two adjacent roots is \( | 1 - e^{-2 i \pi /n} | = 2 \sin ( \pi /n) . \) For any \(0< \epsilon < \sin ( \pi /n) \), consider the n balls
\( B ( \eta _k , \epsilon ) = \{ z \mid | z - \eta _k | \le \epsilon \} , \quad k = 0, 1, \ldots , n - 1 . \)
Any two such balls have no intersection since \( 2 \epsilon < 2 \sin ( \pi /n) = \min _{ j \ne k } |\eta _j - \eta _k| \).
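The adjacent-root distance above follows from the elementary identity, valid for any real \(\theta \):
\( | 1 - e^{ i \theta } | = \sqrt{ ( 1 - \cos \theta )^2 + \sin ^2 \theta } = \sqrt{ 2 - 2 \cos \theta } = 2 \Big| \sin \frac{ \theta }{2} \Big| , \)
applied with \(\theta = 2 \pi /n\).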
The boundary of the ball \(B ( \eta _k , \epsilon )\) is \( \partial B ( \eta _k , \epsilon ) = \{ z \mid | z - \eta _k | = \epsilon \}. \) Define
\( v_k ( \epsilon ) \triangleq \min _{ z \in \partial B ( \eta _k , \epsilon ) } | z^{n} - 1 | . \)
This minimum can be achieved because \(v_k(\epsilon ) \) is the minimal value of a continuous function on a compact set. It is positive since otherwise there exists some \(z \in \partial B ( \eta _k , \epsilon ) \) such that \(z^n = 1\) which means \(z \in \{ \eta _0, \ldots , \eta _{n-1} \}\). This contradicts the fact that any two balls \(B ( \eta _j , \epsilon ), B ( \eta _k , \epsilon )\) have no intersection.
Define
\( v ( \epsilon ) \triangleq \min _{ 0 \le k \le n - 1 } v_k ( \epsilon ) > 0 . \)
For any \( z \in \partial B ( \eta _k , \epsilon ) \), we have
\( | F(z) | = | z^{n} - 1 | \ge v_k ( \epsilon ) \ge v ( \epsilon ) . \)  (67)
For any \( z \in \partial B ( \eta _k , \epsilon ) \) and any \(c > \frac{3 }{ 3 + v(\epsilon )} \), we have
\( | G_c(z) | = \Big( \frac{1}{c} - 1 \Big) | z - 1 | \le \Big( \frac{1}{c} - 1 \Big) ( | z - \eta _k | + | \eta _k | + 1 ) \le 3 \Big( \frac{1}{c} - 1 \Big) < v ( \epsilon ) , \)  (68)
where the second inequality is due to \(|\eta _k | = 1\) and \(\epsilon < \sin ( \pi /n) \le 1\).
Combining the two bounds (67) and (68), we obtain that
\( | G_c(z) | < v ( \epsilon ) \le | F(z) | , \quad \forall z \in \partial B ( \eta _k , \epsilon ) . \)
According to Rouché’s theorem, F and \(F + G_c \) have the same number of zeros inside \(B ( \eta _k , \epsilon ) \). Since F has exactly one root inside \(B ( \eta _k , \epsilon ) \) which is \(\eta _k\), we obtain that \(F + G_c \) has exactly one root \(p_k\) inside \(B ( \eta _k , \epsilon ) \). \(\Box \)
We first let \( \epsilon _0 = \sin ( \pi /n)/ 4 \), which implies that \(B ( \eta _k , \epsilon _0 ), k=0, 1, \ldots , n-1\) are n disjoint balls. For any \(c \in ( 3/(3 + v( \epsilon _0 )) , 1 ) \), Lemma B.1 implies that \(F (p) + G_c (p) \) has exactly one root inside each ball. We denote \( p_0(c), p_1(c), \ldots , p_{n-1}(c) \) to be the roots of \(F (p) + G_c (p) \) such that \(p_k(c) \in B ( \eta _k , \epsilon _0 ) , \forall k \). Since \(F + G_c \) has exactly n complex roots, the \(p_k(c)\)'s are all of its roots. Lemma B.1 implies that for any \(\epsilon > 0 \), there exists some \( \delta \) such that whenever \( c > 1- \delta \), we have \( | p_k( c) - \eta _k | < \epsilon , \; \forall k. \) This means
\( \lim _{ c \rightarrow 1^{-} } p_k(c) = \eta _k = e^{ - i 2k \pi /n } , \quad k = 0, 1, \ldots , n - 1 . \)
Since there is a one-to-one mapping between the roots of \( q^{n-1} (q - 1 + c) - c \) and the roots of \( F(p) + G_c (p) = p^n -1 + (1/c - 1) (p - 1) \) via the inverse transformation \( p = 1/q\), and \( |1/ q_k(c) - e^{ - i 2\pi k/n } | < \sin ( \pi /n)/4 \) implies \( | q_k(c) - e^{ i 2\pi k/n } | < \sin (\pi / n)/2 \), we obtain the following result: for any \(c \in ( 3/(3 + v( \epsilon _0 )) , 1 ) \), the equation \( q^{n-1} (q - 1 + c) - c = 0 \) has exactly one root \(q_k(c)\) such that \( | q_k(c) - e^{ i 2\pi k/n } | < \sin ( \pi /n)/2 \), for \(k = 0, 1, \ldots , n-1\); moreover,
\( \lim _{ c \rightarrow 1^{-} } q_k(c) = e^{ i 2 \pi k /n } , \quad k = 0, 1, \ldots , n - 1 . \)
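The passage from the \(p\)-neighborhood to the \(q\)-neighborhood used above can be checked directly (a sketch; write \(\eta = e^{ i 2\pi k/n }\), so \(1/\eta = e^{ - i 2\pi k/n }\) and \(s = \sin ( \pi /n) \le 1\)): from \( | 1/q_k(c) - 1/\eta | < s/4 \) we get \( | 1/q_k(c) | > 1 - s/4 \ge 3/4 \), i.e., \( | q_k(c) | < 4/3 \), hence
\( | q_k(c) - \eta | = | q_k(c) \, \eta | \cdot \Big| \frac{1}{\eta } - \frac{1}{q_k(c)} \Big| < \frac{4}{3} \cdot \frac{s}{4} < \frac{s}{2} . \)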
B.3 Proof of Claim 5.1
Since \(2 \sin (n\phi /2) \cos ( z + (n+1)\phi /2 ) = \sin ( z + (n+ 1/2) \phi ) - \sin ( z + \phi /2 )\), the desired equation (28) is equivalent to
\( \sin \Big( z + \Big( n + \frac{1}{2} \Big) \phi \Big) - \sin \Big( z + \frac{ \phi }{2} \Big) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n } \cos ( z + j \phi ) . \)  (69)
We prove (69) by induction. When \( n = 1\), it holds because \( \sin ( z + 1.5 \phi ) - \sin (z + 0.5 \phi ) = 2 \sin (\phi /2) \cos ( z + \phi ) \). Suppose (69) holds for \(n-1\), i.e.,
\( \sin \Big( z + \Big( n - \frac{1}{2} \Big) \phi \Big) - \sin \Big( z + \frac{ \phi }{2} \Big) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n - 1 } \cos ( z + j \phi ) . \)
Note that \( 2 \cos (z + n \phi ) \sin (\phi /2 ) = \sin (z + (n + 1/2) \phi ) - \sin ( z + (n - 1/2) \phi ) \), therefore
\( \sin \Big( z + \Big( n + \frac{1}{2} \Big) \phi \Big) - \sin \Big( z + \frac{ \phi }{2} \Big) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n - 1 } \cos ( z + j \phi ) + 2 \sin \Big( \frac{ \phi }{2} \Big) \cos ( z + n \phi ) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n } \cos ( z + j \phi ) . \)
This completes the induction step, and thus (69) holds. \(\square \)