Worst-case complexity of cyclic coordinate descent: \(O(n^2)\) gap with randomized version

Full Length Paper · Series A · Mathematical Programming

Abstract

This paper concerns the worst-case complexity of cyclic coordinate descent (C-CD) for minimizing a convex quadratic function, which is equivalent to the Gauss–Seidel method, the Kaczmarz method and projection onto convex sets (POCS) in this simple setting. We observe that the best known complexity bound of C-CD can be \(\mathcal {O}(n^2)\) times larger than that of randomized coordinate descent (R-CD), but no example was proven to exhibit such a large gap. In this paper we show that the gap indeed exists. We prove that there exists an example for which C-CD takes at least \(\mathcal {O}(n^4 \kappa _{\text {CD}} \log \frac{1}{\epsilon })\) operations, where \(\kappa _{\text {CD}}\) is related to Demmel’s condition number and determines the convergence rate of R-CD. This implies that in the worst case C-CD can indeed be \(\mathcal {O}(n^2)\) times slower than R-CD, which has complexity \(\mathcal {O}( n^2 \kappa _{\text {CD}} \log \frac{1}{\epsilon })\). Note that for this example, the gap exists for any fixed update order, not just a particular order. An immediate consequence is that for the Gauss–Seidel method, the Kaczmarz method and POCS, there is also an \(\mathcal {O}(n^2) \) gap between the cyclic and randomized versions (for solving linear systems). One difficulty with the analysis is that the spectral radius of a non-symmetric iteration matrix does not necessarily constitute a lower bound for the convergence rate. Finally, we design some numerical experiments to show that the size of the off-diagonal entries is an important indicator of the practical performance of C-CD.

Notes

  1. These works often analyze cyclic BCGD (Block Coordinate Gradient Descent). For minimizing convex quadratic functions, cyclic CGD with a special stepsize is the same as cyclic CD (i.e. exactly minimizing each subproblem).

  2. The reference [45] discussed the Gauss–Seidel method and seems to be little known among optimization researchers. We first derived this bound by applying the analysis of the paper [38] (which studies non-strongly convex problems) to strongly convex quadratic problems. While preparing this paper we became aware of the earlier reference [45], which obtained a similar bound.

  3. Note that G–S, POCS and CD are not equivalent in more general settings; in fact, G–S can be used to solve non-symmetric linear systems, POCS can be used to find intersection of any closed convex sets, and CD can be used to solve non-quadratic non-smooth problems.

  4. After we posted the first version of this paper in April 2016, Steven Wright pointed out to us that he had proposed the example analyzed in this paper in a talk in Paris in July 2015 and in a talk at NYU in December 2015. He had also noticed the large gap between C-CD and R-CD for this example, although no theoretical analysis was publicly available.

  5. When the matrix is sparse, the time is actually \(O(\text {nnz}(A))\), but to simplify the discussions, we do not consider the sparsity in this work.

  6. Rigorously speaking, the upper bounds on the convergence rate can be transformed into upper bounds on the complexity, but the lower bounds require a bit more work. See the arXiv version for a more rigorous treatment of the lower bounds on the complexity.

References

  1. Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

  2. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)

  3. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)

  4. Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, pp. 408–415. ACM (2008)

  5. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)

  6. Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for l1-regularized loss minimization. arXiv preprint arXiv:1105.5379 (2011)

  7. Mazumder, R., Friedman, J.H., Hastie, T.: SparseNet: coordinate descent with nonconvex penalties. J. Am. Stat. Assoc. 106(495), 1125–1138 (2011)

  8. Razaviyayn, M., Hong, M., Luo, Z.-Q.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)

  9. Baligh, H., Hong, M., Liao, W.-C., Luo, Z.-Q., Razaviyayn, M., Sanjabi, M., Sun, R.: Cross-layer provision of future cellular networks: a WMMSE-based approach. IEEE Signal Process. Mag. 31(6), 56–68 (2014)

  10. Sun, R., Baligh, H., Luo, Z.-Q.: Long-term transmit point association for coordinated multipoint transmission by stochastic optimization. In: 2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 330–334. IEEE (2013)

  11. Hong, M., Sun, R., Baligh, H., Luo, Z.-Q.: Joint base station clustering and beamformer design for partial coordinated transmission in heterogeneous networks. IEEE J. Sel. Areas Commun. 31(2), 226–240 (2013)

  12. Canutescu, A.A., Dunbrack, R.L.: Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci. 12(5), 963–972 (2003)

  13. Bouman, C.A., Sauer, K.: A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Process. 5(3), 480–492 (1996)

  14. Greenbaum, A.: Iterative Methods for Solving Linear Systems, vol. 17. SIAM, Philadelphia (1997)

  15. Wen, Z., Goldfarb, D., Scheinberg, K.: Block coordinate descent methods for semidefinite programming. In: Anjos, M., Lasserre, J. (eds.) Handbook on Semidefinite, Conic and Polynomial Optimization. International Series in Operations Research & Management Science, vol. 166. Springer, Boston, MA

  16. Sun, R., Luo, Z.-Q.: Guaranteed matrix completion via nonconvex factorization. In: 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pp. 270–289. IEEE (2015)

  17. Powell, M.J.D.: On search directions for minimization algorithms. Math. Program. 4(1), 193–201 (1973)

  18. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)

  19. Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 103(9), 475–494 (2001)

  20. Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Oper. Res. Lett. 26, 127–136 (2000)

  21. Luo, Z.-Q., Tseng, P.: On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl. 72(1), 7–35 (1992)

  22. Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35(3), 641–654 (2010)

  23. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  24. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)

  25. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144, 1–38 (2014)

  26. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152(1–2), 615–642 (2015)

  27. Qu, Z., Richtárik, P., Zhang, T.: Randomized dual coordinate ascent with arbitrary sampling. arXiv preprint arXiv:1411.5873 (2014)

  28. Lin, Q., Lu, Z., Xiao, L.: An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. arXiv preprint arXiv:1407.1296 (2014)

  29. Zhang, Y., Xiao, L.: Stochastic primal-dual coordinate method for regularized empirical risk minimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 353–361 (2015)

  30. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  31. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)

  32. Patrascu, A., Necoara, I.: Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. J. Glob. Optim. 61(1), 19–46 (2015)

  33. Hsieh, C.-J., Yu, H.-F., Dhillon, I.S.: PASSCoDe: parallel asynchronous stochastic dual co-ordinate descent. arXiv preprint arXiv:1504.01365 (2015)

  34. Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156. IEEE (2013)

  35. Allen-Zhu, Z., Orecchia, L.: Nearly-linear time positive LP solver with faster convergence rate. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 229–236. ACM (2015)

  36. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)

  37. Sun, R., Luo, Z.-Q., Ye, Y.: On the expected convergence of randomly permuted ADMM. arXiv preprint arXiv:1503.06387 (2015)

  38. Sun, R., Hong, M.: Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. In: Advances in Neural Information Processing Systems, pp. 1306–1314 (2015)

  39. Lee, C.-P., Wright, S.J.: Random permutations fix a worst case for cyclic coordinate descent. arXiv preprint arXiv:1607.08320 (2016)

  40. Xiao, L., Yu, A.W., Lin, Q., Chen, W.: DSCOVR: Randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. arXiv preprint arXiv:1710.05080 (2017)

  41. Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)

  42. Beck, A.: On the convergence of alternating minimization with applications to iteratively reweighted least squares and decomposition schemes. SIAM J. Optim. 25(1), 185–209 (2015)

  43. Saha, A., Tewari, A.: On the nonasymptotic convergence of cyclic coordinate descent method. SIAM J. Optim. 23(1), 576–601 (2013)

  44. Hong, M., Wang, X., Razaviyayn, M., Luo, Z.-Q.: Iteration complexity analysis of block coordinate descent methods. Preprint, arXiv:1310.6957 (2013)

  45. Oswald, P.: On the convergence rate of SOR: a worst case estimate. Computing 52(3), 245–255 (1994)

  46. Deutsch, F.: The method of alternating orthogonal projections. In: Singh, S.P. (eds.) Approximation Theory, Spline Functions and Applications. NATO ASI Series (Series C: Mathematical and Physical Sciences), vol. 356. Springer, Dordrecht

  47. Von Neumann, J.: Functional Operators. Volume II, the Geometry of Orthogonal Spaces (1950). This is a reprint of mimeographed lecture notes first distributed in (1933)

  48. Smith, K.T., Solmon, D.C., Wagner, S.L.: Practical and mathematical aspects of the problem of reconstructing objects from radiographs. Bull. Am. Math. Soc. 83(6), 1227–1270 (1977)

  49. Kayalar, S., Weinert, H.L.: Error bounds for the method of alternating projections. Math. Control Signals Syst. (MCSS) 1(1), 43–59 (1988)

  50. Deutsch, F., Hundal, H.: The rate of convergence for the method of alternating projections, II. J. Math. Anal. Appl. 205(2), 381–405 (1997)

  51. Bauschke, H.H., Borwein, J.M., Lewis, A.S.: The method of cyclic projections for closed convex sets in Hilbert space. Contemp. Math. 204, 1–38 (1997)

  52. Escalante, R., Raydan, M.: Alternating Projection Methods. SIAM, Philadelphia (2011)

  53. Galántai, A.: Projectors and Projection Methods, vol. 6. Springer, Berlin (2013)

  54. Kaczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l’Académie Polonaise des Sciences et des Lettres 35, 355–357 (1937)

  55. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)

  56. Von Neumann, J.: On rings of operators. Reduction theory. Ann. Math. 50(2), 401–485 (1949)

  57. Halperin, I.: The product of projection operators. Acta Sci. Math. (Szeged) 23(1–2), 96–99 (1962)

  58. Young, D.: Iterative methods for solving partial difference equations of elliptic type. Trans. Am. Math. Soc. 76(1), 92–111 (1954)

  59. Widlund, O.B.: On the effects of scaling of the Peaceman–Rachford method. Math. Comput. 25(113), 33–41 (1971)

  60. Galántai, A.: On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl. 310(1), 30–44 (2005)

  61. Necoara, I., Richtarik, P., Patrascu, A.: Randomized projection methods for convex feasibility problems: conditioning and convergence rates. arXiv preprint arXiv:1801.04873 (2018)

  62. Angelos, J.R., Cowen, C.C., Narayan, S.K.: Triangular truncation and finding the norm of a Hadamard multiplier. Linear Algebra Appl. 170, 117–135 (1992)

  63. Uherka, D.J., Sergott, A.M.: On the continuous dependence of the roots of a polynomial on its coefficients. Am. Math. Mon. 84(5), 368–370 (1977)

Author information

Correspondence to Ruoyu Sun.

Appendices

A Proofs of upper bounds

A.1 Proof of Proposition 3.1

Without loss of generality, we can assume \( b = 0 \). In fact, minimizing \(f(x) = x^T A x - 2 b^T x\) is equivalent to minimizing \( f(x) = (x - x^*)^T A (x - x^*) \) where \(x^* = A^{\dag }b \); here we use the fact that \( A x^* = A A^{\dag }b = b \) when \(b \in \mathcal {R}(A)\). By a linear transformation \(z = x-x^*\), C-CD for minimizing \( (x - x^*)^T A (x - x^*) \) starting from \(x^0\) is equivalent to C-CD for minimizing \( z^T A z \) starting from \(z^0 = x^0-x^*\). Thus we can assume \(x^* = 0\), or equivalently, \(b = 0\).

The update equation of C-CD now becomes

$$\begin{aligned} x^{k+1} = (I - \Gamma ^{-1}A) x^k = x^k - d^k, \end{aligned}$$
(46)

where \(\Gamma \) is the lower triangular part of A with diagonal entries, i.e., \(\Gamma _{ij} = A_{ij}, 1 \le j \le i \le n \), and \(d^k = \Gamma ^{-1}A x^k \) is the moving direction. This implies

$$\begin{aligned} \Gamma d^k = A x^k. \end{aligned}$$
(47)
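For concreteness, here is a minimal numerical sketch (not from the paper; the test matrix, random seed and helper name `ccd_sweep` are illustrative) verifying that one sweep of exact cyclic coordinate minimization of \(f(x) = x^T A x\) coincides with the matrix update (46):

```python
import numpy as np

def ccd_sweep(A, x):
    """One cycle of exact coordinate minimization of f(x) = x^T A x."""
    x = x.copy()
    for i in range(len(x)):
        # minimizing over x_i with the other coordinates fixed gives
        # x_i = -(sum_{j != i} A_ij x_j) / A_ii
        x[i] = -(A[i] @ x - A[i, i] * x[i]) / A[i, i]
    return x

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # a positive definite test matrix
Gamma = np.tril(A)                      # lower-triangular part of A, diagonal included

x0 = rng.standard_normal(n)
x1_loop = ccd_sweep(A, x0)
x1_matrix = (np.eye(n) - np.linalg.solve(Gamma, A)) @ x0    # update (46)
print(np.allclose(x1_loop, x1_matrix))  # True: one C-CD sweep is x <- (I - Gamma^{-1} A) x
```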

We first assume that A is positive definite and show how to extend the argument to the PSD case at the end.

The proof consists of two main claims. The first claim relates the convergence rate of C-CD with the spectral radius of a certain matrix.

Claim A.1

Let \(D_A = \text {diag}(A_{11}, \ldots , A_{nn})\) be a diagonal matrix with entries \(A_{ii}\)’s. Then

$$\begin{aligned} f(x^{k+1}) - f(x^*) \le \left( 1- \frac{1}{\Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } \right) (f(x^k) - f(x^*)) . \end{aligned}$$
(48)

First Proof of Claim A.1

(Optimization Perspective) Following the proof framework of [38], we bound the descent amount and the cost yet to be minimized (cost-to-go) respectively. Suppose \(w^0 = x^k, w^n = x^{k+1}\) and \(w^1,\ldots , w^{n-1}\) are the \(n-1\) intermediate iterates. Since \(w^i\) is obtained by minimizing f over the i-th coordinate with other variables fixed, it is easy to verify

$$\begin{aligned} d^k_i =\frac{1}{2 A_{ii}} \nabla _i f(w^{i-1}) . \end{aligned}$$
(49)

In the above expression, \( 2 A_{ii} \) can be viewed as the i-th coordinate-wise Lipschitz constant of \(\nabla f\) from an optimization perspective. We have

$$\begin{aligned} w^1&= w^0 - d^k_1 e_1 = w^0 - \frac{1}{2 A_{11}} \nabla _1 f(w^0), \quad \ldots , \\ w^n&= w^{n-1}- d^k_n e_n = w^{n-1} - \frac{1}{2 A_{nn}} \nabla _n f(w^{n-1} ) , \end{aligned}$$

where \(e_i\) is the i-th unit vector. Then

$$\begin{aligned} f(w^{i-1}) - f(w^{i})&= (w^{i-1})^T A w^{i-1} - ( w^{i-1} - d^k_i e_i)^T A (w^{i-1} - d^k_i e_i) \\&= - d^k_i e_i^T A e_i d^k_i + 2 (w^{i-1})^T A e_i d^k_i \\&= - A_{ii} (d^k_i)^2 + \nabla _i f(w^{i-1}) d^k_i \overset{ (49)}{=} - A_{ii} (d^k_i)^2 + 2 A_{ii} (d^k_i)^2 = A_{ii} (d^k_i)^2. \end{aligned}$$
(50)

Therefore, the descent amount \( f(x^k) - f(x^{k+1}) \) can be bounded in terms of \(d^k\) as

$$\begin{aligned} f(x^k) - f(x^{k+1}) = \sum _{i=1}^n f(w^{i-1}) - f(w^{i}) = \sum _{i=1}^n A_{ii} (d^k_i)^2 = (d^k)^T D_A d^k. \end{aligned}$$
(51)

The cost-to-go estimate is simply

$$\begin{aligned} f(x^k) - f(x^*) = f(x^k) = (x^k)^T A x^k \overset{(47)}{=} (d^k)^T \Gamma ^T A^{-1} \Gamma d^k . \end{aligned}$$
(52)

Combining with (51), we obtain

$$\begin{aligned} \frac{f(x^k) - f(x^*)}{ f(x^k) - f(x^{k+1})} = \frac{ (d^k)^T \Gamma ^T A^{-1} \Gamma d^k }{ (d^k)^T D_A d^k } \le \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert , \end{aligned}$$
(53)

which implies (48). \(\square \)

Second Proof of Claim A.1

(Matrix Recursion Perspective) One natural idea is to prove \(f(x^{k+1}) = M_f f(x^k) \) or \(f(x^{k+1}) \le \Vert M_f\Vert f(x^k)\) for a certain matrix \(M_f\), based on the update equation of the iterates \(x^{k+1} = (I - \Gamma ^{-1} A )x^k\). We can write down the expression of \(f(x^{k+1})\) in terms of \(x^k\) as \(f(x^{k+1}) = (x^k)^T (I - \Gamma ^{-1} A )^T A (I - \Gamma ^{-1} A ) x^k \). However, it is not clear how this expression is related to \(f(x^k) = (x^k)^T A x^k \). A simple trick to resolve this issue is to express everything in terms of \(d^k\). More specifically, we have

$$\begin{aligned} f(x^k) - f(x^{k+1})&= ( x^k)^T A x^k - (x^k - d^k)^T A (x^k - d^k) = 2(d^k)^T A x^k - (d^k)^T A d^k \\&= 2 (d^k)^T \Gamma d^k - (d^k)^T A d^k = (d^k)^T (\Gamma + \Gamma ^T) d^k - (d^k)^T A d^k = (d^k)^T D_A d^k, \end{aligned}$$
(54)

where the last step is because \( \Gamma + \Gamma ^T = A + D_A \). Equation (54) is equivalent to (51) derived earlier using another approach. The rest is the same as the first proof. \(\square \)

Remark Although the second proof seems simpler, for readers who are familiar with optimization the first proof is probably easier to understand: equation (50) is just the classical descent lemma (applied to each coordinate), thus (51) is straightforward to derive. In the proof of [38], one crucial step is to bound the cost-to-go in terms of \(d^k\); here, for the quadratic case, the cost-to-go has the closed-form expression (52). The second proof is cleaner to write, but it is specifically tailored to the quadratic problem; in contrast, the first proof can be extended to non-quadratic problems as done in [38] ((51) and (53) will become inequalities).
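As a sanity check (again illustrative, not part of the paper), the per-sweep contraction (48) can be verified numerically on a random positive definite instance with \(b = 0\), so that \(f(x^*) = 0\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                    # positive definite; b = 0, so f(x*) = 0
Gamma = np.tril(A)
Dih = np.diag(1.0 / np.sqrt(np.diag(A)))   # D_A^{-1/2}

# contraction factor from Claim A.1
rate = 1.0 - 1.0 / np.linalg.norm(Dih @ Gamma.T @ np.linalg.inv(A) @ Gamma @ Dih, 2)

f = lambda x: x @ A @ x
x = rng.standard_normal(n)
for _ in range(50):
    x_next = (np.eye(n) - np.linalg.solve(Gamma, A)) @ x          # one C-CD sweep, update (46)
    assert f(x_next) <= rate * f(x) * (1 + 1e-9) + 1e-12          # bound (48), up to roundoff
    x = x_next
print("bound (48) held for all 50 sweeps")
```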

Claim A.2

Let \(D_A = \text {diag}(A_{11}, \ldots , A_{nn})\) be a diagonal matrix with entries \(A_{ii}\)’s. Then

$$\begin{aligned} \Vert \Gamma ^T A^{-1} \Gamma \Vert \le \kappa \cdot \min \left\{ \sum _i L_i, (2 + \frac{1}{\pi } \log n)^2 L \right\} . \end{aligned}$$
(55)

Proof of Claim A.2

Denote

$$\begin{aligned} \Gamma _{\mathrm {unit} } = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 1 & 1 & 0 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \dots & 0 \\ 1 & 1 & 1 & \dots & 1 \end{bmatrix} , \end{aligned}$$

then \( \Gamma = \Gamma _{\mathrm {unit} } \circ A\), where \(\circ \) denotes the Hadamard product. According to the classical result on the operator norm of the triangular truncation operator [62, Theorem 1], we have

$$\begin{aligned} \Vert \Gamma \Vert = \Vert \Gamma _{\mathrm {unit}} \circ A \Vert \le \left( 1 + \frac{1}{\pi } + \frac{1}{\pi } \log n\right) \Vert A \Vert \le \left( 2 + \frac{1}{\pi } \log n\right) \Vert A \Vert . \end{aligned}$$

Thus we have

$$\begin{aligned} \Vert \Gamma ^T A^{-1} \Gamma \Vert&\le \Vert \Gamma ^T \Gamma \Vert \, \Vert A^{-1} \Vert = \Vert \Gamma \Vert ^2 \frac{1}{\lambda _{\min }(A) } \\&\le \left( 2 +\frac{1}{\pi } \log n\right) ^2 \frac{\Vert A\Vert ^2}{\lambda _{\min }(A)} = \left( 2 + \frac{1}{\pi } \log n\right) ^2 \kappa L , \end{aligned}$$

which proves the second part of (55).

We can bound \( \Vert \Gamma \Vert ^2 \) in another way (denote \(\lambda _i\)’s as the eigenvalues of A):

$$\begin{aligned} \Vert \Gamma \Vert ^2 \le \Vert \Gamma \Vert _F^2&= \frac{1}{2}\left( \Vert A \Vert _F^2 + \sum _i A_{ii}^2\right) = \frac{1}{2} \left( \sum _i \lambda _i^2 + \sum _i A_{ii}^2 \right) \nonumber \\&\le \frac{1}{2} \left( \left( \sum _i \lambda _i\right) \lambda _{\max } + \max _j A_{jj} \sum _i A_{ii} \right) \nonumber \\&\overset{\text {(i)}}{=} \frac{1}{2} \left( L + \max _j A_{jj}\right) \sum _i L_i \le L \sum _i L_i. \end{aligned}$$
(56)

where (i) is because \( \sum _i \lambda _i = \text {tr}(A) = \sum _i A_{ii} \) and \(A_{ii} = L_i \). Thus

$$\begin{aligned} \Vert \Gamma ^T A^{-1} \Gamma \Vert \le \Vert \Gamma \Vert ^2 \frac{1}{\lambda _{\min }(A) } \overset{(56)}{\le } \frac{L}{\lambda _{\min }} \sum _i L_i = \kappa \sum _i L_i. \end{aligned}$$

which proves the first part of (55). \(\square \)
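Both parts of (55), as well as the triangular-truncation bound used above, are easy to spot-check numerically; the instance below is illustrative and not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)
Gamma = np.tril(A)

eigs = np.linalg.eigvalsh(A)
lam_min, L = eigs[0], eigs[-1]          # smallest / largest eigenvalue of A
kappa = L / lam_min
L_i = np.diag(A)                        # L_i = A_ii

# triangular truncation bound from [62]
print(np.linalg.norm(Gamma, 2) <= (2 + np.log(n) / np.pi) * np.linalg.norm(A, 2))

# both parts of (55)
lhs = np.linalg.norm(Gamma.T @ np.linalg.inv(A) @ Gamma, 2)
print(lhs <= kappa * L_i.sum())
print(lhs <= kappa * (2 + np.log(n) / np.pi) ** 2 * L)
```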

Finally, according to the fact \( \Vert D_A^{-1/2} B D_A^{-1/2} \Vert \le \frac{1}{\min _i L_i } \Vert B \Vert = \frac{1}{L_{\min } } \Vert B \Vert \) for any positive definite matrix B, we have

$$\begin{aligned} \begin{aligned}&\Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert \le \frac{1}{L_{\min } } \Vert \Gamma ^T A^{-1} \Gamma \Vert \\&\quad \overset{(55)}{\le } \frac{1}{L_{\min } } \kappa \cdot \min \left\{ \sum _i L_i, \left( 2 + \frac{1}{\pi } \log n\right) ^2 L \right\} \end{aligned} \end{aligned}$$

Plugging this inequality into (48) and replacing \(\sum _i L_i\) by \(n L_{\mathrm {avg}}\), we obtain (6a).

Now we show how to modify the above proof to the case that A is PSD. From (47) we have

$$\begin{aligned} x^k = A^{\dag } \Gamma d^k. \end{aligned}$$

Then (52) is slightly modified to \( (x^k)^T A x^k = (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k. \) We still have (51) since its proof does not require A to be positive definite. Now we modify (53) to

$$\begin{aligned} \frac{f(x^k) - f(x^*)}{ f(x^k) - f(x^{k+1})}&= \frac{ (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k }{ (d^k)^T D_A d^k } \le \frac{ (d^k)^T \Gamma ^T \Gamma d^k \Vert A^{\dag }\Vert }{ (d^k)^T D_A d^k } \nonumber \\&\quad \overset{\text {(i)}}{=} \frac{1}{\lambda _{\min } } \frac{ (d^k)^T \Gamma ^T \Gamma d^k }{ (d^k)^T D_A d^k } \le \frac{1}{\lambda _{\min } } \Vert D_A^{-1/2} \Gamma ^T \Gamma D_A^{-1/2} \Vert \nonumber \\&\quad \le \frac{1}{ \lambda _{\min } L_{\min } } \Vert \Gamma ^T \Gamma \Vert . \end{aligned}$$
(57)

where (i) is because \(\Vert A^{\dag }\Vert = 1/\lambda _{\min }\) where \(\lambda _{\min }\) is the minimum non-zero eigenvalue of A. The rest is almost the same as the proof for the PD case: obtaining the bounds of \( \Gamma ^T \Gamma \) as in Claim A.2 and plugging them into (57) immediately leads to (6a).

The first bound of result (6b) is a direct corollary of (6a) because \( \kappa \le n \kappa _{\mathrm CD}\) (which is because \(\lambda _{\max }(A) \le \mathrm {tr}(A) = n L_{\mathrm {avg}}\)). The second bound of (6b) is the same as the second bound of (6a) because

$$\begin{aligned} \frac{ \kappa L }{L_{\min }} = \frac{ L^2 }{ \lambda _{\min } L_{\min } } = \frac{L^2 }{ L_{\mathrm {avg}} L_{\min } } \frac{ L_{\mathrm {avg}} }{\lambda _{\min }} = \frac{L^2 }{ L_{\mathrm {avg}} L_{\min } } \kappa _{\mathrm CD}. \end{aligned}$$

This finishes the proof of Proposition 3.1.

A.2 Proof of Proposition 3.2

This proof is a slight modification of the proof of Proposition 3.1.

We first consider the case that A is positive definite. The insight is to rewrite the relation proved in Claim A.1

$$\begin{aligned} f(x^{k+1}) - f(x^*) \le \left( 1- \frac{1}{\Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } \right) (f(x^k) - f(x^*)) \end{aligned}$$
(58)

as

$$\begin{aligned} f(x^{k+1}) - f(x^*) \le \left( 1- \frac{1}{\Vert {\hat{\Gamma }}^T \hat{A}^{-1} {\hat{\Gamma }} \Vert } \right) (f(x^k) - f(x^*)) , \end{aligned}$$
(59)

where \( {\hat{\Gamma }} = D_A^{-1/2} \Gamma D_A^{-1/2} \) and \( \hat{A} = D_A^{-1/2} A D_A^{-1/2} \). Note that \({\hat{\Gamma }}\) is still the lower-triangular part (with diagonal entries) of the Jacobi-preconditioned matrix \(\hat{A}\). The diagonal entries of \({\hat{\Gamma }}\) and \(\hat{A}\) are all 1, so \(\hat{L}_i = 1, \ \forall i\).

Applying Claim A.2 we have

$$\begin{aligned} \Vert {\hat{\Gamma }}^T \hat{A}^{-1} {\hat{\Gamma }} \Vert&\le \hat{ \kappa } \cdot \min \left\{ \sum _i \hat{ L}_i, \left( 2 + \frac{1}{\pi } \log n\right) ^2 \hat{ L} \right\} \\&= {\hat{\kappa }} \cdot \min \left\{ n , \left( 2 + \frac{1}{\pi } \log n\right) ^2 \hat{ L} \right\} . \end{aligned}$$

Plugging the above relation into (59) we obtain (10a). Similar to Proposition 3.1, the second bound (10b) follows directly from (10a).

The case that A is PSD can be handled in a similar way, as in the proof of Proposition 3.1.
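The effect of the Jacobi preconditioning used in this proof can be illustrated with a short sketch (parameters illustrative, not from the paper): after rescaling, the matrix has unit diagonal and \({\hat{\Gamma }}\) is exactly its lower-triangular part.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)

Dih = np.diag(1.0 / np.sqrt(np.diag(A)))
A_hat = Dih @ A @ Dih                     # \hat{A} = D_A^{-1/2} A D_A^{-1/2}
Gamma_hat = Dih @ np.tril(A) @ Dih        # \hat{\Gamma} = D_A^{-1/2} \Gamma D_A^{-1/2}

print(np.allclose(np.diag(A_hat), 1.0))           # unit diagonal, so every \hat{L}_i = 1
print(np.allclose(Gamma_hat, np.tril(A_hat)))     # still the lower-triangular part of \hat{A}
```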

B Supplemental Proofs for Theorem 3.1

B.1 Proof of Lemma 5.1

Suppose \(\lambda \) is an eigenvalue of \( Z = \Gamma ^{-1}A \) and \(v = (v_1; v_2; \ldots ; v_n) \in {\mathbb {C}}^{n \times 1}\) is the corresponding eigenvector. Then we have

$$\begin{aligned} \Gamma ^{-1} A v = \lambda v \;\Longrightarrow \; A v = \lambda \Gamma v \;\Longrightarrow \; {\left\{ \begin{array}{ll} v_1 + c \sum _{j \ne 1} v_j = \lambda v_1, \\ v_2 + c \sum _{j \ne 2 } v_j = \lambda (c v_1 + v_2 ), \\ \dots \\ v_k + c \sum _{j \ne k} v_j = \lambda (c v_1 + \cdots + c v_{k-1} + v_k), \\ \dots \\ v_n + c \sum _{j \ne n} v_j = \lambda (c v_1 + \cdots + c v_{n-1} + v_n ). \end{array}\right. } \end{aligned}$$
(60)

Without loss of generality, we can assume

$$\begin{aligned} \sum _{j=1}^n v_j = 1 . \end{aligned}$$
(61)

Let \(\hat{c} = 1 - c\). Then (60) becomes

$$\begin{aligned} {\left\{ \begin{array}{ll} \hat{c} v_1 + c = \lambda v_1 \\ \hat{c} v_2 + c = \lambda (c v_1 + v_2 ) \\ \dots \\ \hat{c} v_k + c = \lambda (c v_1 + \cdots + c v_{k-1} + v_k), \\ \dots \\ \hat{c} v_n + c = \lambda (c v_1 + \cdots + c v_{n-1} + v_n ). \end{array}\right. } \end{aligned}$$
(62)

The first equation implies \( v_1 = \frac{c}{ \lambda - \hat{c}}\). Plugging into the second equation, we get

$$\begin{aligned} v_2 = \frac{c (1 - \lambda v_1 )}{\lambda - \hat{c} } = \frac{c( \lambda - \hat{c} - \lambda c )}{ (\lambda - \hat{c})^2 } = \frac{c \hat{c} (\lambda - 1)}{ ( \lambda - \hat{c})^2} . \end{aligned}$$

Plugging the expression of \(v_1, v_2\) into the third equation, we get

$$\begin{aligned} v_3 = \frac{ c(1 - \lambda v_1 - \lambda v_2) }{ \lambda - \hat{c} } = \frac{c (\hat{c})^2 (\lambda - 1)^2 }{ (\lambda - \hat{c})^3 }. \end{aligned}$$

In general, we can prove by induction that

$$\begin{aligned} v_k = \frac{c (\hat{c})^{k-1} ( \lambda -1 )^{k-1} }{ (\lambda - \hat{c})^k } = \frac{c}{\lambda - \hat{c}} q^{k-1}, \end{aligned}$$
(63)

where

$$\begin{aligned} q = \frac{\hat{c}(\lambda - 1)}{\lambda - \hat{c}} . \end{aligned}$$
(64)

We can also express \(\lambda \) in terms of q as

$$\begin{aligned} \lambda = \frac{ \hat{c} - \hat{c}q }{\hat{c}-q} . \end{aligned}$$
(65)

Note that the expression of \(v_k\) given by (63) satisfies (62) for any \(\lambda \), but our goal is to compute \(\lambda \). To do this, we need to utilize the normalization assumption (61). In particular, we have (when \(q \ne 1\))

$$\begin{aligned}&1 = \sum _k v_k = \left( \sum _{k=1}^n q^{k-1}\right) \frac{c }{\lambda - \hat{c}} = \frac{1 - q^n }{1 - q} \frac{c }{ \lambda - \hat{c} } \nonumber \\&\quad \Longrightarrow (1-q ) (\lambda - \hat{c}) = c(1-q^n) \nonumber \\&\quad \overset{(64) }{\Longrightarrow } c \lambda = c (1 - q^n) \nonumber \\&\quad \overset{ }{\Longrightarrow } q^n = 1 - \lambda \overset{ (65) }{=} 1 - \frac{ \hat{c} - \hat{c}q }{\hat{c}-q} \nonumber \\&\quad \overset{ }{\Longrightarrow } q^n = \frac{qc}{q - \hat{c}} \nonumber \\&\quad \overset{ }{\Longrightarrow } q^{n} (q - \hat{c}) = c q. \end{aligned}$$
(66)

The above procedure is reversible: if \(q \ne 1\) is a root of \( q^{n} (q - \hat{c}) = c q \), then \(\lambda = \frac{ \hat{c} - \hat{c}q }{\hat{c}-q} \) is an eigenvalue of Z. Suppose the \( n + 1 \) roots of \( q^{n} (q - \hat{c}) = c q \) are \(q_0 = 0, q_1, \ldots , q_{n-1}, q_n = 1\) (\(q = 0\) and \(q=1\) are always roots); then \( \lambda _k = \frac{ \hat{c} - \hat{c} q_k }{ \hat{c}- q_k } \overset{ (66)}{=} 1 - q_k^n , \; k= 0, \ldots , n - 1,\) are the n eigenvalues of Z.
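The conclusion of Lemma 5.1 can be confirmed numerically on a small instance (an illustrative sketch, not from the paper): build the matrix with unit diagonal and off-diagonal entries c, form \(Z = \Gamma ^{-1}A\), and compare its eigenvalues with \(1 - q^n\) over the roots \(q \ne 1\) of \( q^{n} (q - \hat{c}) = c q \).

```python
import numpy as np

n, c = 8, 0.9
c_hat = 1.0 - c
A = c_hat * np.eye(n) + c * np.ones((n, n))      # unit diagonal, off-diagonal entries c
Gamma = np.tril(A)
Z = np.linalg.solve(Gamma, A)

# roots of q^{n+1} - c_hat * q^n - c * q = 0, i.e. of q^n (q - c_hat) = c q
coeffs = np.zeros(n + 2)
coeffs[0], coeffs[1], coeffs[-2] = 1.0, -c_hat, -c
q = np.roots(coeffs)
q = np.delete(q, np.argmin(np.abs(q - 1.0)))     # drop the root q = 1 (always present)

eig_from_roots = 1.0 - q ** n                    # Lemma 5.1: lambda_k = 1 - q_k^n
eig_direct = np.linalg.eigvals(Z)
match = all(np.min(np.abs(eig_direct - lam)) < 1e-8 for lam in eig_from_roots)
print(match)                                     # True: the two spectra coincide
```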

B.2 Proof of Lemma 5.2

It is well known that the roots of a polynomial depend continuously on its coefficients, i.e., if the coefficients of a polynomial f converge to the coefficients of a polynomial \(f^*\), then the roots of f converge to the roots of \(f^*\). More precisely, if \( x^* \) is a root of \(f^*\) of multiplicity m, then for any \(\epsilon > 0\), if the coefficients of f are close enough to those of \(f^*\), then f has at least m roots within \(\epsilon \) distance of \(x^*\) (here a root of multiplicity k is counted as k roots); see, e.g., [63, Theorem]. If one chooses \(\epsilon \) to be smaller than the minimal distance between the distinct roots of \(f^*\), then f has exactly m roots (counting multiplicity) within \(\epsilon \) distance of \(x^*\).

Below, we provide a rigorous proof of Lemma 5.2 using Rouché’s theorem in complex analysis.

When \(n=1 \), the only solution of \( q^{n-1} (q - 1 + c) = c \) is \(q = 1\), thus the conclusion holds. From now on, we assume \(n \ge 2\). Let \(p = 1/q\), then the equation \( q^{n-1} (q - 1 + c) = c \) becomes

$$\begin{aligned} p^{-1} -1 + c = c p^{n-1} \Longleftrightarrow p^n -1 + (1/c - 1) (p - 1) = 0. \end{aligned}$$

This equation can be written as \(F(p) + G_c(p) = 0\), where F and \(G_c\) are defined as

$$\begin{aligned} F (p) = p^n -1, \; G_c(p) = (1/c - 1) (p - 1) . \end{aligned}$$

Lemma B.1

Suppose \(n \ge 2\). For any \( 0< \epsilon < \sin ( \pi /n) \), there exists some \(\delta > 0 \) such that for any \(c \in (1 - \delta , 1 )\), \(F(p) + G_c(p) \) has exactly one root \(p_k\) in the ball \( B( e^{- i 2k \pi /n}, \epsilon ) \triangleq \{ z \mid | z - e^{- i 2k \pi /n} | \le \epsilon \} \), \(k = 0, 1, \ldots , n - 1\).

Clearly, the function F has n roots \( \eta _k \triangleq e^{- i 2k \pi /n}, k=0, 1, \ldots , n-1 \). The distance between two adjacent roots is \( | 1 - e^{-2 i \pi /n} | = 2 \sin ( \pi /n) . \) For any \(0< \epsilon < \sin ( \pi /n) \), consider the n balls

$$\begin{aligned} B ( \eta _k , \epsilon ) = \{ z \mid | z - \eta _k | \le \epsilon \}, k= 0, 1, \ldots , n-1. \end{aligned}$$

Any two such balls are disjoint since \(2 \epsilon < 2 \sin ( \pi /n) = \min _{j \ne k} |\eta _j - \eta _k| \).

The boundary of the ball \(B ( \eta _k , \epsilon )\) is \( \partial B ( \eta _k , \epsilon ) = \{ z \mid | z - \eta _k | = \epsilon \}. \) Define

$$\begin{aligned} v_k(\epsilon ) \triangleq \inf _{ z \in \partial B ( \eta _k , \epsilon ) } |F(z)| = \min _{ z \in \partial B ( \eta _k , \epsilon ) } | z^n - 1 | > 0 . \end{aligned}$$

This minimum is achieved because \(v_k(\epsilon ) \) is the minimal value of a continuous function on a compact set. It is positive since otherwise there exists some \(z \in \partial B ( \eta _k , \epsilon ) \) such that \(z^n = 1\), i.e., \(z \in \{ \eta _0, \ldots , \eta _{n-1} \}\); this is impossible because \(\eta _k\) itself (the center) does not lie on the boundary, and any other \(\eta _j\) is at distance at least \(2 \sin (\pi /n) > \epsilon \) from \(\eta _k\).

Define

$$\begin{aligned} v(\epsilon ) = \min _{ 0 \le k \le n-1 } v_k(\epsilon ) > 0. \end{aligned}$$

For any \( z \in \partial B ( \eta _k , \epsilon ) \), we have

$$\begin{aligned} | F (z) | = | z^n - 1 | \ge v( \epsilon ). \end{aligned}$$
(67)

For any \( z \in \partial B ( \eta _k , \epsilon ) \) and any \(c > \frac{3 }{ 3 + v(\epsilon )} \), we have

$$\begin{aligned} | G_c (z) | = | (1/c - 1) ( z - 1) | \le |1/c -1| ( | \eta _k | + \epsilon + 1 ) \le 3 |1/c -1| < v( \epsilon ), \end{aligned}$$
(68)

where the second inequality is due to \(|\eta _k | = 1\) and \(\epsilon < \sin ( \pi /n) \le 1\).

Combining the two bounds (67) and (68), we obtain that

$$\begin{aligned} | F(z) | > | G_c (z) | , \forall z \in \partial B ( \eta _k , \epsilon ). \end{aligned}$$

According to Rouché’s theorem, F and \(F + G_c \) have the same number of zeros inside \(B ( \eta _k , \epsilon ) \). Since F has exactly one root inside \(B ( \eta _k , \epsilon ) \) which is \(\eta _k\), we obtain that \(F + G_c \) has exactly one root \(p_k\) inside \(B ( \eta _k , \epsilon ) \). \(\Box \)

Now let \( \epsilon _0 = \sin ( \pi /n)/ 4 \), so that \(B ( \eta _k , \epsilon _0 ), k=0, 1, \ldots , n-1,\) are n disjoint balls. For any \(c \in ( 3/(3 + v( \epsilon _0 )) , 1 ) \), Lemma B.1 implies that \(F (p) + G_c (p) \) has exactly one root inside each ball. We denote by \( p_0(c), p_1(c), \ldots , p_{n-1}(c) \) the roots of \(F (p) + G_c (p) \) such that \(p_k(c) \in B ( \eta _k , \epsilon _0 ) , \forall k \). Since \(F + G_c \) has exactly n complex roots, the \(p_k(c)\)'s are all of its roots. Lemma B.1 implies that for any \(\epsilon > 0 \), there exists some \( \delta \) such that whenever \( c > 1- \delta \), we have \( | p_k( c) - \eta _k | < \epsilon , \; \forall k. \) This means

$$\begin{aligned} \lim _{c \rightarrow 1} p_k(c) = \eta _k, \; \forall k. \end{aligned}$$

Since the transformation \( p = 1/q\) gives a one-to-one correspondence between the roots of \( q^{n-1} (q - 1 + c) - c \) and the roots of \( F(p) + G_c (p) = p^n -1 + (1/c - 1) (p - 1) \), and since \( |1/ q_k(c) - e^{ - i 2\pi k/n } | < \sin ( \pi /n)/4 \) implies \( | q_k(c) - e^{ i 2\pi k/n } | < \sin (\pi / n)/2 \), we obtain the following result: for any \(c \in ( 3/(3 + v( \epsilon _0 )) , 1 ) \), the equation \( q^{n-1} (q - 1 + c) = c \) has exactly one root \(q_k(c)\) such that \( |1/ q_k(c) - e^{ - i 2\pi k/n } | < \sin ( \pi /n)/2 \) for each \(k = 0, 1, \ldots , n-1\); moreover,

$$\begin{aligned} \lim _{c \rightarrow 1} q_k(c) = e^{ i 2\pi k/n }, \; \forall k. \end{aligned}$$
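Lemma 5.2 can likewise be observed numerically (an illustrative sketch, not from the paper): as \(c \rightarrow 1\), the roots of \( q^{n-1} (q - 1 + c) = c \) approach the n-th roots of unity.

```python
import numpy as np

n = 5
roots_of_unity = np.exp(2j * np.pi * np.arange(n) / n)
for c in (0.9, 0.99, 0.999):
    c_hat = 1.0 - c
    # q^{n-1}(q - 1 + c) - c = q^n - c_hat * q^{n-1} - c
    coeffs = np.zeros(n + 1)
    coeffs[0], coeffs[1], coeffs[-1] = 1.0, -c_hat, -c
    q = np.roots(coeffs)
    dist = max(np.min(np.abs(q - w)) for w in roots_of_unity)
    print(c, dist)    # the maximal distance to the nearest root shrinks as c -> 1
```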

B.3 Proof of Claim 5.1

Since \(2 \sin (n\phi /2) \cos ( z + (n+1)\phi /2 ) = \sin ( z + (n+ 1/2) \phi ) - \sin ( z + \phi /2 )\), the desired equation (28) is equivalent to

$$\begin{aligned} \sum _{j=1}^n \cos ( z + j \phi ) = \frac{ \sin ( z + (n+ 1/2) \phi ) - \sin ( z + \phi /2 ) }{2 \sin (\phi /2 )}. \end{aligned}$$
(69)

We prove (69) by induction. When \( n = 1\), it holds because \( \sin ( z + 1.5 \phi ) - \sin (z + 0.5 \phi ) = 2 \sin (\phi /2) \cos ( z + \phi ) \). Suppose (69) holds for \(n-1\), i.e.

$$\begin{aligned} \sum _{j=1}^{n-1} \cos ( z + j \phi ) = \frac{ \sin ( z + (n - 1/2) \phi ) - \sin ( z + \phi /2 ) }{2 \sin (\phi /2 )} . \end{aligned}$$

Note that \( 2 \cos (z + n \phi ) \sin (\phi /2 ) = \sin (z + (n+1/2) \phi ) - \sin ( z + (n - 1/2) \phi ) \), therefore

$$\begin{aligned} \sum _{j=1}^{n} \cos ( z + j \phi )&= \frac{ \sin ( z + (n - 1/2) \phi ) - \sin ( z + \phi /2 ) }{2 \sin (\phi /2 )} + \cos (z + n \phi ) \\&= \frac{ \sin ( z + (n - 1/2) \phi ) - \sin ( z + \phi /2 ) + \sin (z + (n+1/2) \phi ) - \sin ( z + (n - 1/2) \phi ) }{2 \sin (\phi /2 )} \\&= \frac{ \sin ( z + (n+ 1/2) \phi ) - \sin ( z + \phi /2 ) }{2 \sin (\phi /2 )}. \end{aligned}$$
(70)

This completes the induction step, and thus (69) holds. \(\square \)
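A quick numerical spot-check of identity (69) for a few values of n and random \(z, \phi \) (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
for n in (1, 2, 7, 20):
    z = rng.uniform(0.0, 2 * np.pi)
    phi = rng.uniform(0.3, 5.5)         # keep sin(phi/2) away from zero
    lhs = sum(np.cos(z + j * phi) for j in range(1, n + 1))
    rhs = (np.sin(z + (n + 0.5) * phi) - np.sin(z + 0.5 * phi)) / (2 * np.sin(0.5 * phi))
    assert abs(lhs - rhs) < 1e-10
print("identity (69) verified for several n")
```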

Cite this article

Sun, R., Ye, Y. Worst-case complexity of cyclic coordinate descent: \(O(n^2)\) gap with randomized version. Math. Program. 185, 487–520 (2021). https://doi.org/10.1007/s10107-019-01437-5
