Abstract
This paper concerns the worst-case complexity of cyclic coordinate descent (C-CD) for minimizing a convex quadratic function, a setting in which C-CD is equivalent to the Gauss–Seidel method, the Kaczmarz method and projection onto convex sets (POCS). We observe that the known provable complexity of C-CD can be \(\mathcal {O}(n^2)\) times slower than that of randomized coordinate descent (R-CD), but no example had been proven to exhibit such a large gap. In this paper we show that the gap indeed exists. We prove that there exists an example for which C-CD takes at least \(\mathcal {O}(n^4 \kappa _{\text {CD}} \log \frac{1}{\epsilon })\) operations, where \(\kappa _{\text {CD}}\) is related to Demmel's condition number and determines the convergence rate of R-CD. This implies that in the worst case C-CD can indeed be \(\mathcal {O}(n^2)\) times slower than R-CD, which has complexity \(\mathcal {O}( n^2 \kappa _{\text {CD}} \log \frac{1}{\epsilon })\). Note that for this example, the gap exists for any fixed update order, not just a particular order. An immediate consequence is that for the Gauss–Seidel method, the Kaczmarz method and POCS, there is also an \(\mathcal {O}(n^2) \) gap between the cyclic versions and the randomized versions (for solving linear systems). One difficulty with the analysis is that the spectral radius of a non-symmetric iteration matrix does not necessarily constitute a lower bound for the convergence rate. Finally, we design numerical experiments which show that the size of the off-diagonal entries is an important indicator of the practical performance of C-CD.
Notes
These works often analyze cyclic BCGD (block coordinate gradient descent). For minimizing convex quadratic functions, cyclic CGD with a special stepsize is the same as cyclic CD (i.e., exactly minimizing each subproblem).
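To make this concrete, consider the quadratic convention \(f(x) = x^T A x - 2 b^T x\) used in Appendix A, for which the i-th coordinate-wise gradient Lipschitz constant is \(2 A_{ii}\); a short verification (a sketch, using only these definitions) shows that the "special stepsize" is \(1/(2A_{ii})\):
\( 0 = \frac{d}{dt} f(x + t e_i) = \nabla _i f(x) + 2 A_{ii} t \; \Longrightarrow \; t^* = - \frac{ \nabla _i f(x) }{ 2 A_{ii} } , \)
so the exact coordinate minimization \( x \leftarrow x + t^* e_i \) coincides with the coordinate gradient step with stepsize \(1/(2A_{ii})\).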
The reference [45] discussed the Gauss–Seidel method and seems to be unknown to optimization researchers. We first derived this bound by applying the analysis of the paper [38] (which studies non-strongly convex problems) to strongly convex quadratic problems. While preparing this paper, we became aware of the earlier reference [45], which obtained a similar bound.
Note that G–S, POCS and CD are not equivalent in more general settings; in fact, G–S can be used to solve non-symmetric linear systems, POCS can be used to find a point in the intersection of arbitrary closed convex sets, and CD can be used to solve non-quadratic non-smooth problems.
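In the quadratic setting of this paper, the equivalence is immediate from the sweep formula: writing \(\Gamma \) for the lower triangular part of A including the diagonal (as in Appendix A), one full C-CD pass over \(f(x) = x^T A x - 2 b^T x\) reads
\( x^{k+1} = x^k - \Gamma ^{-1} ( A x^k - b ) , \quad \text {i.e.,} \quad \Gamma x^{k+1} = b - (A - \Gamma ) x^k , \)
which is exactly one Gauss–Seidel sweep for the linear system \(A x = b\).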
After we posted the first version of this paper in April 2016, Stephen Wright pointed out to us that he had proposed the example analyzed in this paper in a talk in Paris in July 2015 and in a talk at NYU in December 2015. He had also noticed the large gap between C-CD and R-CD for this example, although no theoretical analysis was publicly available.
When the matrix is sparse, the time is actually \(O(\text {nnz}(A))\); to simplify the discussion, we do not consider sparsity in this work.
Rigorously speaking, the upper bounds on the convergence rate can be transformed into upper bounds on the complexity, but the lower bounds require a bit more work. See the arXiv version for a more rigorous treatment of the lower bounds on the complexity.
References
Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, pp. 408–415. ACM (2008)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for l1-regularized loss minimization. arXiv preprint arXiv:1105.5379 (2011)
Mazumder, R., Friedman, J.H., Hastie, T.: SparseNet: coordinate descent with nonconvex penalties. J. Am. Stat. Assoc. 106(495), 1125–1138 (2011)
Razaviyayn, M., Hong, M., Luo, Z.-Q.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)
Baligh, H., Hong, M., Liao, W.-C., Luo, Z.-Q., Razaviyayn, M., Sanjabi, M., Sun, R.: Cross-layer provision of future cellular networks: a WMMSE-based approach. IEEE Signal Process. Mag. 31(6), 56–68 (2014)
Sun, R., Baligh, H., Luo, Z.-Q.: Long-term transmit point association for coordinated multipoint transmission by stochastic optimization. In: 2013 IEEE 14th Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 330–334. IEEE (2013)
Hong, M., Sun, R., Baligh, H., Luo, Z.-Q.: Joint base station clustering and beamformer design for partial coordinated transmission in heterogeneous networks. IEEE J. Sel. Areas Commun. 31(2), 226–240 (2013)
Canutescu, A.A., Dunbrack, R.L.: Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci. 12(5), 963–972 (2003)
Bouman, C.A., Sauer, K.: A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Process. 5(3), 480–492 (1996)
Greenbaum, A.: Iterative Methods for Solving Linear Systems, vol. 17. SIAM, Philadelphia (1997)
Wen, Z., Goldfarb, D., Scheinberg, K.: Block coordinate descent methods for semidefinite programming. In: Anjos, M., Lasserre, J. (eds.) Handbook on Semidefinite, Conic and Polynomial Optimization. International Series in Operations Research & Management Science, vol. 166. Springer, Boston, MA
Sun, R., Luo, Z.-Q.: Guaranteed matrix completion via nonconvex factorization. In: 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pp. 270–289. IEEE (2015)
Powell, M.J.D.: On search directions for minimization algorithms. Math. Program. 4(1), 193–201 (1973)
Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)
Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Oper. Res. Lett. 26, 127–136 (2000)
Luo, Z.-Q., Tseng, P.: On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl. 72(1), 7–35 (1992)
Leventhal, D., Lewis, A.S.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35(3), 641–654 (2010)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144, 1–38 (2014)
Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152(1–2), 615–642 (2015)
Qu, Z., Richtárik, P., Zhang, T.: Randomized dual coordinate ascent with arbitrary sampling. arXiv preprint arXiv:1411.5873 (2014)
Lin, Q., Lu, Z., Xiao, L.: An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. arXiv preprint arXiv:1407.1296 (2014)
Zhang, Y., Xiao, L.: Stochastic primal-dual coordinate method for regularized empirical risk minimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 353–361 (2015)
Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)
Patrascu, A., Necoara, I.: Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. J. Glob. Optim. 61(1), 19–46 (2015)
Hsieh, C.-J., Yu, H.-F., Dhillon, I.S.: PASSCoDe: parallel asynchronous stochastic dual co-ordinate descent. arXiv preprint arXiv:1504.01365 (2015)
Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156. IEEE (2013)
Allen-Zhu, Z., Orecchia, L.: Nearly-linear time positive LP solver with faster convergence rate. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 229–236. ACM (2015)
Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)
Sun, R., Luo, Z.-Q., Ye, Y.: On the expected convergence of randomly permuted ADMM. arXiv preprint arXiv:1503.06387 (2015)
Sun, R., Hong, M.: Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. In: Advances in Neural Information Processing Systems, pp. 1306–1314 (2015)
Lee, C.-P., Wright, S.J.: Random permutations fix a worst case for cyclic coordinate descent. arXiv preprint arXiv:1607.08320 (2016)
Xiao, L., Yu, A.W., Lin, Q., Chen, W.: DSCOVR: randomized primal-dual block coordinate algorithms for asynchronous distributed optimization. arXiv preprint arXiv:1710.05080 (2017)
Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)
Beck, A.: On the convergence of alternating minimization with applications to iteratively reweighted least squares and decomposition schemes. SIAM J. Optim. 25(1), 185–209 (2015)
Saha, A., Tewari, A.: On the nonasymptotic convergence of cyclic coordinate descent method. SIAM J. Optim. 23(1), 576–601 (2013)
Hong, M., Wang, X., Razaviyayn, M., Luo, Z.-Q.: Iteration complexity analysis of block coordinate descent methods. Preprint, arXiv:1310.6957 (2013)
Oswald, P.: On the convergence rate of SOR: a worst case estimate. Computing 52(3), 245–255 (1994)
Deutsch, F.: The method of alternating orthogonal projections. In: Singh, S.P. (eds.) Approximation Theory, Spline Functions and Applications. NATO ASI Series (Series C: Mathematical and Physical Sciences), vol. 356. Springer, Dordrecht
Von Neumann, J.: Functional Operators. Volume II, the Geometry of Orthogonal Spaces (1950). This is a reprint of mimeographed lecture notes first distributed in 1933
Smith, K.T., Solmon, D.C., Wagner, S.L.: Practical and mathematical aspects of the problem of reconstructing objects from radiographs. Bull. Am. Math. Soc. 83(6), 1227–1270 (1977)
Kayalar, S., Weinert, H.L.: Error bounds for the method of alternating projections. Math. Control Signals Syst. (MCSS) 1(1), 43–59 (1988)
Deutsch, F., Hundal, H.: The rate of convergence for the method of alternating projections, II. J. Math. Anal. Appl. 205(2), 381–405 (1997)
Bauschke, H.H., Borwein, J.M., Lewis, A.S.: The method of cyclic projections for closed convex sets in Hilbert space. Contemp. Math. 204, 1–38 (1997)
Escalante, R., Raydan, M.: Alternating Projection Methods. SIAM, Philadelphia (2011)
Galántai, A.: Projectors and Projection Methods, vol. 6. Springer, Berlin (2013)
Kaczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres 35, 355–357 (1937)
Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)
Von Neumann, J.: On rings of operators. Reduction theory. Ann. Math. 50(2), 401–485 (1949)
Halperin, I.: The product of projection operators. Acta Sci. Math. (Szeged) 23(1–2), 96–99 (1962)
Young, D.: Iterative methods for solving partial difference equations of elliptic type. Trans. Am. Math. Soc. 76(1), 92–111 (1954)
Widlund, O.B.: On the effects of scaling of the Peaceman–Rachford method. Math. Comput. 25(113), 33–41 (1971)
Galántai, A.: On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl. 310(1), 30–44 (2005)
Necoara, I., Richtarik, P., Patrascu, A.: Randomized projection methods for convex feasibility problems: conditioning and convergence rates. arXiv preprint arXiv:1801.04873 (2018)
Angelos, J.R., Cowen, C.C., Narayan, S.K.: Triangular truncation and finding the norm of a Hadamard multiplier. Linear Algebra Appl. 170, 117–135 (1992)
Uherka, D.J., Sergott, A.M.: On the continuous dependence of the roots of a polynomial on its coefficients. Am. Math. Mon. 84(5), 368–370 (1977)
Appendix
A Proofs of upper bounds
A.1 Proof of Proposition 3.1
Without loss of generality, we can assume \( b = 0 \). In fact, minimizing \(f(x) = x^T A x - 2 b^T x\) is equivalent to minimizing \( (x - x^*)^T A (x - x^*) \), where \(x^* = A^{\dag }b \); here we use the fact that \( A x^* = A A^{\dag }b = b \) when \(b \in \mathcal {R}(A)\). Via the linear transformation \(z = x-x^*\), C-CD for minimizing \( (x - x^*)^T A (x - x^*) \) starting from \(x^0\) is equivalent to C-CD for minimizing \( z^T A z \) starting from \(z^0 = x^0-x^*\). Thus we can assume \(x^* = 0\), or equivalently, \(b = 0\).
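For completeness, the identity behind this reduction is the following (using \( A x^* = b \)):
\( x^T A x - 2 b^T x = x^T A x - 2 (x^*)^T A x = (x - x^*)^T A (x - x^*) - (x^*)^T A x^* , \)
so the two objectives differ only by a constant and have the same minimizers up to the shift \(z = x - x^*\).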
The update equation of C-CD now becomes
\( x^{k+1} = x^k - \Gamma ^{-1} A x^k = ( I - \Gamma ^{-1} A ) x^k , \)
where \(\Gamma \) is the lower triangular part of A with diagonal entries, i.e., \(\Gamma _{ij} = A_{ij}, 1 \le j \le i \le n \), and \(d^k = \Gamma ^{-1}A x^k \) is the moving direction. This implies
\( A x^k = \Gamma d^k . \)  (47)
We first assume A is positive definite and will show how to extend to the PSD case in the end.
The proof consists of two main claims. The first claim relates the convergence rate of C-CD with the spectral radius of a certain matrix.
Claim A.1
Let \(D_A = \text {diag}(A_{11}, \ldots , A_{nn})\) be a diagonal matrix with entries \(A_{ii}\)'s. Then
\( f(x^{k+1}) \le \left( 1 - \frac{1}{ \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } \right) f(x^k) . \)  (48)
First Proof of Claim A.1
(Optimization Perspective) Following the proof framework of [38], we bound the descent amount and the cost yet to be minimized (cost-to-go) respectively. Suppose \(w^0 = x^k, w^n = x^{k+1}\) and \(w^1,\ldots , w^{n-1}\) are the \(n-1\) intermediate iterates. Since \(w^i\) is obtained by minimizing f over the i-th coordinate with other variables fixed, it is easy to verify
\( f(w^{i-1}) - f(w^{i}) = \frac{ | \nabla _i f(w^{i-1}) |^2 }{ 2 ( 2 A_{ii} ) } , \quad i = 1, \ldots , n. \)  (50)
In the above expression, \( 2 A_{ii} \) can be viewed as the i-th coordinate-wise Lipschitz constant of \(\nabla f\) from an optimization perspective. We have
\( w^{i} = w^{i-1} - d_i^k e_i , \quad d_i^k = \frac{ \nabla _i f(w^{i-1}) }{ 2 A_{ii} } , \)
where \(e_i\) is the i-th unit vector. Then
\( f(w^{i-1}) - f(w^{i}) = A_{ii} ( d_i^k )^2 . \)
Therefore, the descent amount \( f(x^k) - f(x^{k+1}) \) can be bounded in terms of \(d^k\) as
\( f(x^k) - f(x^{k+1}) = \sum _{i=1}^{n} A_{ii} ( d_i^k )^2 = (d^k)^T D_A d^k . \)  (51)
The cost-to-go estimate is simply
\( f(x^k) = (x^k)^T A x^k \overset{ (47) }{=} (d^k)^T \Gamma ^T A^{-1} \Gamma d^k \)  (52)
\( \le \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert \, (d^k)^T D_A d^k . \)  (53)
Combining with (51), we obtain
\( f(x^k) - f(x^{k+1}) = (d^k)^T D_A d^k \ge \frac{ f(x^k) }{ \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } , \)
which implies (48). \(\square \)
Second Proof of Claim A.1
(Matrix Recursion Perspective) One natural idea is to prove \(f(x^{k+1}) = M_f f(x^k) \) or \(f(x^{k+1}) \le \Vert M_f\Vert f(x^k)\) for a certain matrix \(M_f\), based on the update equation of the iterates \(x^{k+1} = (I - \Gamma ^{-1} A )x^k\). We can write down the expression of \(f(x^{k+1})\) in terms of \(x^k\) as \(f(x^{k+1}) = (x^k)^T (I - \Gamma ^{-1} A )^T A (I - \Gamma ^{-1} A ) x^k \). However, it is not clear how this expression is related to \(f(x^k) = (x^k)^T A x^k \). A simple trick to resolve this issue is to express everything in terms of \(d^k\). More specifically, we have
\( f(x^{k+1}) = (x^k - d^k)^T A (x^k - d^k) \overset{ (47) }{=} f(x^k) - 2 (d^k)^T \Gamma d^k + (d^k)^T A d^k = f(x^k) - (d^k)^T D_A d^k , \)  (54)
where the last step is because \( \Gamma + \Gamma ^T = A + D_A \). Equation (54) is equivalent to (51) derived earlier using another approach. The rest is the same as the first proof. \(\square \)
Remark Although the second proof seems simpler, for people who are familiar with optimization the first proof is probably easier to understand: equation (50) is just the classical descent lemma (applied to each coordinate), thus (51) is straightforward to derive. In the proof of [38], one crucial step is to bound the cost-to-go in terms of \(d^k\); here for the quadratic case the cost-to-go has a closed form expression given by (53). The second proof is cleaner to write, but it is specifically tailored for the quadratic problem; in contrast, the first proof can be extended to non-quadratic problems as done in [38] ((51) and (53) will become inequalities).
Claim A.2
Let \(D_A = \text {diag}(A_{11}, \ldots , A_{nn})\) be a diagonal matrix with entries \(A_{ii}\)’s. Then
Proof of Claim A.2
Denote \(\Gamma _{\mathrm {unit}}\) as the \(n \times n\) lower triangular matrix of all ones, i.e., \( (\Gamma _{\mathrm {unit}})_{ij} = 1 \) for \( 1 \le j \le i \le n \) and \( (\Gamma _{\mathrm {unit}})_{ij} = 0 \) otherwise;
then \( \Gamma = \Gamma _{\mathrm {unit} } \circ A\), where \(\circ \) denotes the Hadamard product. According to the classical result on the operator norm of the triangular truncation operator [62, Theorem 1], we have
Thus we have
which proves the second part of (55).
We can bound \( \Vert \Gamma \Vert ^2 \) in another way (denote \(\lambda _i\)'s as the eigenvalues of A):
\( \Vert \Gamma \Vert ^2 \le \Vert \Gamma \Vert _F^2 \le \Vert A \Vert _F^2 = \sum _i \lambda _i^2 \le \lambda _{\max } \sum _i \lambda _i \overset{ (i) }{=} \lambda _{\max } \sum _i L_i , \)
where (i) is because \( \sum _i \lambda _i = \text {tr}(A) = \sum _i A_{ii} \) and \(A_{ii} = L_i \). Thus
\( \Vert \Gamma \Vert ^2 \le \Vert A \Vert \sum _i L_i = n L_{\mathrm {avg}} \lambda _{\max } , \)
which proves the first part of (55). \(\square \)
Finally, according to the fact \( \Vert D_A^{-1/2} B D_A^{-1/2} \Vert \le \frac{1}{\min _i L_i } \Vert B \Vert = \frac{1}{L_{\min } } \Vert B \Vert \) for any positive definite matrix B, we have
\( \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert \le \frac{1}{L_{\min }} \Vert \Gamma ^T A^{-1} \Gamma \Vert \le \frac{ \Vert \Gamma \Vert ^2 }{ L_{\min } \lambda _{\min } } . \)
Plugging this inequality into (48) and replacing \(\sum _i L_i\) by \(n L_{\mathrm {avg}}\), we obtain (6a).
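Spelled out, this plug-in step gives the following chain (a sketch; (6a) states this bound in the paper's notation):
\( f(x^{k+1}) \le \left( 1 - \frac{ L_{\min } \lambda _{\min } }{ \Vert \Gamma \Vert ^2 } \right) f(x^k) \le \left( 1 - \frac{ L_{\min } \lambda _{\min } }{ n L_{\mathrm {avg}} \lambda _{\max } } \right) f(x^k) , \)
where the second inequality uses the first bound of Claim A.2.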
Now we show how to modify the above proof to the case that A is PSD. From (47) we have
\( (x^k)^T A x^k = (x^k)^T A A^{\dag } A x^k = (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k . \)
Then (52) is slightly modified to \( (x^k)^T A x^k = (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k. \) We still have (51) since its proof does not require A to be positive definite. Now we modify (53) to
\( f(x^k) = (d^k)^T \Gamma ^T A^{\dag } \Gamma d^k \overset{ (i) }{\le } \frac{1}{\lambda _{\min }} \Vert \Gamma d^k \Vert ^2 \le \frac{ \Vert \Gamma \Vert ^2 }{ L_{\min } \lambda _{\min } } (d^k)^T D_A d^k , \)  (57)
where (i) is because \(\Vert A^{\dag }\Vert = 1/\lambda _{\min }\), where \(\lambda _{\min }\) is the minimum non-zero eigenvalue of A. The rest is almost the same as the proof for the PD case: obtaining the bounds of \( \Gamma ^T \Gamma \) as in Claim A.2 and plugging them into (57) immediately leads to (6a).
The first bound of result (6b) is a direct corollary of (6a) because \( \kappa \le n \kappa _{\mathrm CD}\) (which is because \(\lambda _{\max }(A) \le \mathrm {tr}(A) = n L_{\mathrm {avg}}\)). The second bound of (6b) is the same as the second bound of (6a) because
This finishes the proof of Proposition 3.1.
A.2 Proof of Proposition 3.2
This proof is a slight modification of the proof of Proposition 3.1.
We first consider the case that A is positive definite. The insight is to rewrite the relation proved in Claim A.1,
\( f(x^{k+1}) \le \left( 1 - \frac{1}{ \Vert D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} \Vert } \right) f(x^k) , \)
as
\( f(x^{k+1}) \le \left( 1 - \frac{1}{ \Vert {\hat{\Gamma }}^T {\hat{A}}^{-1} {\hat{\Gamma }} \Vert } \right) f(x^k) , \)  (59)
where \( {\hat{\Gamma }} = D_A^{-1/2} \Gamma D_A^{-1/2} \) and \( \hat{A} = D_A^{-1/2} A D_A^{-1/2} \). Note that \({\hat{\Gamma }}\) is still the lower-triangular part (with diagonal entries) of the Jacobi-preconditioned matrix \(\hat{A}\). The diagonal entries of \({\hat{\Gamma }}\) and \(\hat{A}\) are all 1, so \(\hat{L}_i = 1, \ \forall i\).
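The rewriting is justified by a direct computation from the definitions of \({\hat{\Gamma }}\) and \(\hat{A}\):
\( {\hat{\Gamma }}^T {\hat{A}}^{-1} {\hat{\Gamma }} = ( D_A^{-1/2} \Gamma ^T D_A^{-1/2} ) ( D_A^{1/2} A^{-1} D_A^{1/2} ) ( D_A^{-1/2} \Gamma D_A^{-1/2} ) = D_A^{-1/2} \Gamma ^T A^{-1} \Gamma D_A^{-1/2} . \)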
Applying Claim A.2 we have
Plugging the above relation into (59) we obtain (10a). Similar to Proposition 3.1, the second bound (10b) follows directly from (10a).
The case that A is PSD can be handled in a similar way to the proof of Proposition 3.1.
B Supplemental Proofs for Theorem 3.1
B.1 Proof of Lemma 5.1
Suppose \(\lambda \) is an eigenvalue of \( Z = \Gamma ^{-1}A \) and \(v = (v_1; v_2; \ldots ; v_n) \in {\mathbb {C}}^{n \times 1}\) is the corresponding eigenvector. Then we have
\( A v = \lambda \Gamma v , \quad \text {i.e.,} \quad v_i + c \sum _{j \ne i} v_j = \lambda \Big( v_i + c \sum _{j < i} v_j \Big) , \quad i = 1, \ldots , n . \)  (60)
Without loss of generality, we can assume
\( \sum _{j=1}^{n} v_j = 1 . \)  (61)
Let \(\hat{c} = 1 - c\). Then (60) becomes
\( \hat{c} v_i + c = \lambda \Big( v_i + c \sum _{j < i} v_j \Big) , \quad i = 1, \ldots , n . \)  (62)
The first equation implies \( v_1 = \frac{c}{ \lambda - \hat{c}}\). Plugging into the second equation, we get
\( v_2 = \frac{ c \hat{c} ( \lambda - 1 ) }{ ( \lambda - \hat{c} )^2 } . \)
Plugging the expression of \(v_1, v_2\) into the third equation, we get
\( v_3 = \frac{ c \hat{c}^2 ( \lambda - 1 )^2 }{ ( \lambda - \hat{c} )^3 } . \)
In general, we can prove by induction that
\( v_k = \frac{ c }{ \lambda - \hat{c} } \, q^{k-1} , \quad k = 1, \ldots , n , \)  (63)
where
\( q = \frac{ \hat{c} ( \lambda - 1 ) }{ \lambda - \hat{c} } . \)  (64)
We can also express \(\lambda \) in terms of q as
\( \lambda = \frac{ \hat{c} - \hat{c} q }{ \hat{c} - q } . \)  (65)
Note that the expression of \(v_k\) given by (63) satisfies (62) for any \(\lambda \), but our goal is to compute \(\lambda \). To do this, we need to utilize the normalization assumption (61). In particular, we have (when \(q \ne 1\))
\( 1 = \sum _{k=1}^{n} v_k = \frac{ c }{ \lambda - \hat{c} } \cdot \frac{ q^{n} - 1 }{ q - 1 } , \)
which, after substituting (65) and simplifying, is equivalent to
\( q^{n} ( q - \hat{c} ) = c q . \)  (66)
The above procedure is reversible, i.e. suppose \(q \ne 1\) is a root of \( q^{n} (q - \hat{c}) = c q \), then \(\lambda = \frac{ \hat{c} - \hat{c}q }{\hat{c}-q} \) is an eigenvalue of Z. Suppose the \( n + 1 \) roots of \( q^{n} (q - \hat{c}) = c q \) are \(q_0 = 0, q_1, \ldots , q_{n-1}, q_n = 1\) (\(q = 0\) and \(q=1\) are always roots), then \( \lambda _k = \frac{ \hat{c} - \hat{c} q_k }{ \hat{c}- q_k } \overset{ (66)}{=} 1 - q_k^n , k= 0, \ldots , n - 1\) are the n eigenvalues of Z.
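For completeness, the step marked (66) in the last display can be verified directly: for any root \(q_k \ne 0, \hat{c}\) of (66) we have \( q_k^{n} = c q_k / ( q_k - \hat{c} ) \), hence
\( 1 - q_k^{n} = \frac{ q_k - \hat{c} - c q_k }{ q_k - \hat{c} } = \frac{ \hat{c} q_k - \hat{c} }{ q_k - \hat{c} } = \frac{ \hat{c} - \hat{c} q_k }{ \hat{c} - q_k } = \lambda _k , \)
using \( 1 - c = \hat{c} \).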
B.2 Proof of Lemma 5.2
It is well known that the roots of a polynomial depend continuously on its coefficients, i.e., if the coefficients of a family of polynomials f converge to the coefficients of a polynomial \(f^*\), then the roots of f converge to the roots of \(f^*\). More precisely, if \( x^* \) is a root of \(f^*\) of multiplicity m, then for any \(\epsilon > 0\), if the coefficients of f are close enough to those of \(f^*\), then f has at least m roots within \(\epsilon \) distance of \(x^*\) (here, a root with multiplicity k is counted as k roots); see, e.g., [63, Theorem]. If one chooses \(\epsilon \) to be smaller than the minimal distance between the distinct roots of \(f^*\), then f has exactly m roots (counting multiplicity) within \(\epsilon \) distance of \(x^*\).
Below, we provide a rigorous proof of Lemma 5.2 using Rouché’s theorem in complex analysis.
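For reference, we use Rouché's theorem in the following standard form: if F and G are holomorphic on an open set containing a closed ball B and its boundary \(\partial B\), and
\( | G(z) | < | F(z) | , \quad \forall z \in \partial B , \)
then F and \(F + G\) have the same number of zeros (counting multiplicity) inside B.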
When \(n=1 \), the only solution of \( q^{n-1} (q - 1 + c) = c \) is \(q = 1\), thus the conclusion holds. From now on, we assume \(n \ge 2\). Let \(p = 1/q\); then the equation \( q^{n-1} (q - 1 + c) = c \) becomes
\( p^{n} - 1 + \Big( \frac{1}{c} - 1 \Big) ( p - 1 ) = 0 . \)
This equation can be written as \(F(p) + G_c(p) = 0\), where F and \(G_c\) are defined as
\( F(p) = p^{n} - 1 , \quad G_c(p) = \Big( \frac{1}{c} - 1 \Big) ( p - 1 ) . \)
Lemma B.1
Suppose \(n \ge 2\). For any \( 0< \epsilon < \sin ( \pi /n) \), there exists some \(\delta > 0 \) such that for any \(c \in (1 - \delta , 1 )\), \(F(p) + G_c(p) \) has exactly one root \(p_k\) in the ball \( B( e^{- i 2k \pi /n}, \epsilon ) \triangleq \{ z \mid | z - e^{- i 2k \pi /n} | \le \epsilon \} \), \(k = 0, 1, \ldots , n - 1\).
Clearly, the function F has n roots \( \eta _k \triangleq e^{- i 2k \pi /n}, k = 0, 1, \ldots , n - 1 \). The distance between two adjacent roots is \( | 1 - e^{-2 i \pi /n} | = 2 \sin ( \pi /n) . \) For any \(0< \epsilon < \sin ( \pi /n) \), consider the n balls
\( B ( \eta _k , \epsilon ) = \{ z \mid | z - \eta _k | \le \epsilon \} , \quad k = 0, 1, \ldots , n - 1 . \)
Any two such balls have no intersection since \( 2 \epsilon < 2 \sin ( \pi /n) = \min _{ j \ne k } |\eta _j - \eta _k| \).
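The adjacent-root distance above follows from the elementary identity, valid for any real \(\theta \):
\( | 1 - e^{ i \theta } | = \sqrt{ ( 1 - \cos \theta )^2 + \sin ^2 \theta } = \sqrt{ 2 - 2 \cos \theta } = 2 \Big| \sin \frac{ \theta }{2} \Big| , \)
applied with \(\theta = 2 \pi /n\).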
The boundary of the ball \(B ( \eta _k , \epsilon )\) is \( \partial B ( \eta _k , \epsilon ) = \{ z \mid | z - \eta _k | = \epsilon \}. \) Define
\( v_k ( \epsilon ) \triangleq \min _{ z \in \partial B ( \eta _k , \epsilon ) } | z^{n} - 1 | . \)
This minimum can be achieved because \(v_k(\epsilon ) \) is the minimal value of a continuous function on a compact set. It is positive since otherwise there exists some \(z \in \partial B ( \eta _k , \epsilon ) \) such that \(z^n = 1\) which means \(z \in \{ \eta _0, \ldots , \eta _{n-1} \}\). This contradicts the fact that any two balls \(B ( \eta _j , \epsilon ), B ( \eta _k , \epsilon )\) have no intersection.
Define
\( v ( \epsilon ) \triangleq \min _{ 0 \le k \le n - 1 } v_k ( \epsilon ) > 0 . \)
For any \( z \in \partial B ( \eta _k , \epsilon ) \), we have
\( | F(z) | = | z^{n} - 1 | \ge v_k ( \epsilon ) \ge v ( \epsilon ) . \)  (67)
For any \( z \in \partial B ( \eta _k , \epsilon ) \) and any \(c > \frac{3 }{ 3 + v(\epsilon )} \), we have
\( | G_c(z) | = \Big( \frac{1}{c} - 1 \Big) | z - 1 | \le \Big( \frac{1}{c} - 1 \Big) ( | z - \eta _k | + | \eta _k | + 1 ) \le 3 \Big( \frac{1}{c} - 1 \Big) < v ( \epsilon ) , \)  (68)
where the second inequality is due to \(|\eta _k | = 1\) and \(\epsilon < \sin ( \pi /n) \le 1\).
Combining the two bounds (67) and (68), we obtain that
\( | G_c(z) | < v ( \epsilon ) \le | F(z) | , \quad \forall z \in \partial B ( \eta _k , \epsilon ) . \)
According to Rouché’s theorem, F and \(F + G_c \) have the same number of zeros inside \(B ( \eta _k , \epsilon ) \). Since F has exactly one root inside \(B ( \eta _k , \epsilon ) \) which is \(\eta _k\), we obtain that \(F + G_c \) has exactly one root \(p_k\) inside \(B ( \eta _k , \epsilon ) \). \(\Box \)
We first let \( \epsilon _0 = \sin ( \pi /n)/ 4 \), which implies that \(B ( \eta _k , \epsilon _0 ), k=0, 1, \ldots , n-1\) are n disjoint balls. For any \(c \in ( 3/(3 + v( \epsilon _0 )) , 1 ) \), Lemma B.1 implies that \(F (p) + G_c (p) \) has exactly one root inside each ball. We denote \( p_0(c), p_1(c), \ldots , p_{n-1}(c) \) to be the roots of \(F (p) + G_c (p) \) such that \(p_k(c) \in B ( \eta _k , \epsilon _0 ) , \forall k \). Since \(F + G_c \) has exactly n complex roots, the \(p_k(c)\)'s are all of its roots. Lemma B.1 implies that for any \(\epsilon > 0 \), there exists some \( \delta \) such that whenever \( c > 1- \delta \), we have \( | p_k( c) - \eta _k | < \epsilon , \; \forall k. \) This means
\( \lim _{ c \rightarrow 1^{-} } p_k(c) = \eta _k = e^{ - i 2k \pi /n } , \quad k = 0, 1, \ldots , n - 1 . \)
Since there is a one-to-one mapping between the roots of \( q^{n-1} (q - 1 + c) - c \) and the roots of \( F(p) + G_c (p) = p^n -1 + (1/c - 1) (p - 1) \) via the inverse transformation \( p = 1/q\), and \( |1/ q_k(c) - e^{ - i 2\pi k/n } | < \sin ( \pi /n)/4 \) implies \( | q_k(c) - e^{ i 2\pi k/n } | < \sin (\pi / n)/2 \), we obtain the following result: for any \(c \in ( 3/(3 + v( \epsilon _0 )) , 1 ) \), the equation \( q^{n-1} (q - 1 + c) - c = 0 \) has exactly one root \(q_k(c)\) such that \( | q_k(c) - e^{ i 2\pi k/n } | < \sin ( \pi /n)/2 \), for \(k = 0, 1, \ldots , n-1\); moreover,
\( \lim _{ c \rightarrow 1^{-} } q_k(c) = e^{ i 2 \pi k /n } , \quad k = 0, 1, \ldots , n - 1 . \)
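The passage from the \(p\)-neighborhood to the \(q\)-neighborhood used above can be checked directly (a sketch; write \(\eta = e^{ i 2\pi k/n }\), so \(1/\eta = e^{ - i 2\pi k/n }\) and \(s = \sin ( \pi /n) \le 1\)): from \( | 1/q_k(c) - 1/\eta | < s/4 \) we get \( | 1/q_k(c) | > 1 - s/4 \ge 3/4 \), i.e., \( | q_k(c) | < 4/3 \), hence
\( | q_k(c) - \eta | = | q_k(c) \, \eta | \cdot \Big| \frac{1}{\eta } - \frac{1}{q_k(c)} \Big| < \frac{4}{3} \cdot \frac{s}{4} < \frac{s}{2} . \)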
B.3 Proof of Claim 5.1
Since \(2 \sin (n\phi /2) \cos ( z + (n+1)\phi /2 ) = \sin ( z + (n+ 1/2) \phi ) - \sin ( z + \phi /2 )\), the desired equation (28) is equivalent to
\( \sin \Big( z + \Big( n + \frac{1}{2} \Big) \phi \Big) - \sin \Big( z + \frac{ \phi }{2} \Big) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n } \cos ( z + j \phi ) . \)  (69)
We prove (69) by induction. When \( n = 1\), it holds because \( \sin ( z + 1.5 \phi ) - \sin (z + 0.5 \phi ) = 2 \sin (\phi /2) \cos ( z + \phi ) \). Suppose (69) holds for \(n-1\), i.e.,
\( \sin \Big( z + \Big( n - \frac{1}{2} \Big) \phi \Big) - \sin \Big( z + \frac{ \phi }{2} \Big) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n - 1 } \cos ( z + j \phi ) . \)
Note that \( 2 \cos (z + n \phi ) \sin (\phi /2 ) = \sin (z + (n + 1/2) \phi ) - \sin ( z + (n - 1/2) \phi ) \), therefore
\( \sin \Big( z + \Big( n + \frac{1}{2} \Big) \phi \Big) - \sin \Big( z + \frac{ \phi }{2} \Big) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n - 1 } \cos ( z + j \phi ) + 2 \sin \Big( \frac{ \phi }{2} \Big) \cos ( z + n \phi ) = 2 \sin \Big( \frac{ \phi }{2} \Big) \sum _{ j = 1 }^{ n } \cos ( z + j \phi ) . \)
This completes the induction step, and thus (69) holds. \(\square \)