Abstract
We consider minimization of a smooth nonconvex objective function using an iterative algorithm based on Newton’s method and the linear conjugate gradient algorithm, with explicit detection and use of negative curvature directions for the Hessian of the objective function. The algorithm closely tracks Newton-conjugate gradient procedures developed in the 1980s, but includes enhancements that allow worst-case complexity results to be proved for convergence to points that satisfy approximate first-order and second-order optimality conditions. The complexity results match the best known results in the literature for second-order methods.
Notes
Interestingly, as we show in Appendix A, the ratios on the left-hand side of (6) can be calculated without knowledge of \(y_i\) for \(i=0,1,\cdots ,j-1\), provided that we store the scalar quantities \(\alpha _i\) and \(\Vert r_i\Vert ^2\) for \(i=0,1,\cdots ,j-1\).
References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017), PMLR (2017)
Allen-Zhu, Z., Li, Y.: NEON2: finding local minima via first-order oracles. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (2018)
Birgin, E.G., Martínez, J.M.: The use of quadratic regularization with a cubic descent condition for unconstrained optimization. SIAM J. Optim. 27, 1049–1074 (2017)
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8, 231–357 (2015)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: “Convex until proven guilty”: dimension-free acceleration of gradient descent on non-convex functions. In: International Conference on Machine Learning, vol. 70, 6–11 August 2017, International Convention Centre, Sydney, Australia, PMLR, pp. 654–663 (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for non-convex optimization. SIAM J. Optim. 28, 1751–1772 (2018)
Cartis, C., Gould, N.I.M., Toint, P.L.: On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization. SIAM J. Optim. 20, 2833–2852 (2010)
Cartis, C., Gould, N.I.M., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127, 245–295 (2011)
Cartis, C., Gould, N.I.M., Toint, P.L.: Optimal Newton-type methods for nonconvex optimization. Technical Report naXys-17-2011, Department of Mathematics, FUNDP, Namur (B) (2011)
Cartis, C., Gould, N.I.M., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28, 93–108 (2012)
Cartis, C., Gould, N.I.M., Toint, P.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. arXiv:1709.07180 (2017)
Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust-Region Methods. MPS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2000)
Curtis, F.E., Robinson, D.P., Samadi, M.: A trust region algorithm with a worst-case iteration complexity of \(\cal{O}\left(\epsilon ^{-3/2}\right)\) for nonconvex optimization. Math. Program. 162, 1–32 (2017)
Curtis, F.E., Robinson, D.P., Samadi, M.: An inexact regularized Newton framework with a worst-case iteration complexity of \(\cal{O}(\epsilon ^{-3/2})\) for nonconvex optimization. IMA J. Numer. Anal. (2018). https://doi.org/10.1093/imanum/dry022
Dembo, R.S., Steihaug, T.: Truncated-Newton algorithms for large-scale unconstrained optimization. Math. Program. 26, 190–212 (1983)
Fasano, G., Lucidi, S.: A nonmonotone truncated Newton–Krylov method exploiting negative curvature directions, for large-scale unconstrained optimization. Optim. Lett. 3, 521–535 (2009)
Gould, N.I.M., Lucidi, S., Roma, M., Toint, P.L.: Exploiting negative curvature directions in linesearch methods for unconstrained optimization. Optim. Methods Softw. 14, 75–98 (2000)
Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Frontiers in Applied Mathematics, 2nd edn. SIAM, Philadelphia (2008)
Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. In: Proceedings of the 31st Conference on Learning Theory, PMLR, pp. 1042–1085 (2018)
Karimi, S., Vavasis, S.A.: A unified convergence bound for conjugate gradient and accelerated gradient. arXiv:1605.00320 (2016)
Karimi, S., Vavasis, S.A.: A single potential governing convergence of conjugate gradient, accelerated gradient and geometric descent. arXiv:1712.09498 (2017)
Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13, 1094–1122 (1992)
Martínez, J.M., Raydan, M.: Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J. Glob. Optim. 68, 367–385 (2017)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108, 177–205 (2006)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006)
Royer, C.W., Wright, S.J.: Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J. Optim. 28, 1448–1477 (2018)
Steihaug, T.: The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal. 20, 626–637 (1983)
Xu, Y., Jin, R., Yang, T.: First-order stochastic algorithms for escaping from saddle points in almost linear time. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (2018)
Acknowledgements
We sincerely thank the associate editor and two referees, whose comments led us to improve the presentation and to derive stronger results.
Funding
Funding was provided by National Science Foundation (Grant Nos. 1447449, 1628384, 1634597, 1740707), Air Force Office of Scientific Research (Grant No. FA9550-13-1-0138) and Argonne National Laboratory (Grant Nos. 3F-30222, 8F-30039) and DARPA (Grant No. N660011824020).
Appendices
A Linear conjugate gradient: relevant properties
In this appendix, we provide useful results for the classical CG algorithm that also apply to the “standard CG” operations within Algorithm 1. To this end, and for the sake of discussion in Appendix B, we sketch the standard CG method in Algorithm 4, reusing the notation of Algorithm 1.
Here and below, we refer often to the following quadratic function:
$$\begin{aligned} q(y) := \frac{1}{2} y^\top {\bar{H}}y + g^\top y, \end{aligned}$$(38)
where \({\bar{H}}\) and g are the matrix and vector parameters of Algorithms 1 or 4. When \({\bar{H}}\) is positive definite, the minimizer of q is identical to the unique solution of \({\bar{H}}y = -g\). CG can be viewed either as an algorithm to solve \({\bar{H}}y = -g\) or as an algorithm to minimize q.
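For concreteness, the following is a minimal sketch of the standard CG loop in the notation of this appendix. Algorithm 4 itself is not reproduced here, so the function below (including its stopping tolerance and return values) is our own illustrative simplification, not the paper's exact pseudocode.

```python
import numpy as np

def standard_cg(H_bar, g, tol=1e-10, max_iter=None):
    """Minimal CG sketch for q(y) = 0.5 * y' H_bar y + g' y (equivalently, H_bar y = -g).

    Follows the notation y_j, r_j, p_j, alpha_j used in this appendix and stops
    early if a direction with p_j' H_bar p_j <= 0 is encountered.
    """
    n = g.shape[0]
    max_iter = n if max_iter is None else max_iter
    y = np.zeros(n)
    r = g.copy()                      # r_0 = H_bar y_0 + g = g, since y_0 = 0
    p = -r                            # p_0 = -r_0
    for j in range(max_iter):
        Hp = H_bar @ p
        curv = p @ Hp                 # p_j' H_bar p_j
        if curv <= 0:
            return y, p, j, "nonpositive_curvature"
        alpha = (r @ r) / curv        # alpha_j = ||r_j||^2 / (p_j' H_bar p_j)
        y = y + alpha * p             # y_{j+1} = y_j + alpha_j p_j
        r_next = r + alpha * Hp       # r_{j+1} = r_j + alpha_j H_bar p_j
        if np.linalg.norm(r_next) <= tol:
            return y, p, j + 1, "small_residual"
        beta = (r_next @ r_next) / (r @ r)
        p = -r_next + beta * p        # p_{j+1} = -r_{j+1} + beta_{j+1} p_j
        r = r_next
    return y, p, max_iter, "max_iterations"

# Positive definite example: CG solves H_bar y = -g.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 30))
H_bar = A @ A.T + 0.5 * np.eye(30)
g = rng.standard_normal(30)
y, _, iters, status = standard_cg(H_bar, g)
print(status, iters, np.linalg.norm(H_bar @ y + g))
```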
The next lemma details several properties of the conjugate gradient method to be used in the upcoming proofs.
Lemma 7
Suppose that j iterations of the CG loop are performed in Algorithm 1 or 4. Then, we have
Moreover, the following properties hold.
- 1.
\(y_i \in {{\,\mathrm{span}\,}}\left\{ p_0,\cdots ,p_{i-1}\right\} \), \(i=1,2,\cdots ,j\).
- 2.
\(r_i \in {{\,\mathrm{span}\,}}\left\{ p_0,\cdots ,p_i\right\} \) for all \(i=1,2,\cdots ,j\), and
$$\begin{aligned} r_i^\top v = 0, \quad \text{ for } \text{ all } v \in {{\,\mathrm{span}\,}}\left\{ p_0,\cdots ,p_{i-1}\right\} \text{ and } \text{ all } i=1,2,\cdots ,j. \end{aligned}$$(In particular, \(r_i^\top r_l = 0\) if \(0 \le l<i \le j\). If \(j=n\), then \(r_n=0\).)
- 3.
\(\Vert r_i\Vert \le \Vert p_i\Vert \), \(i=0,1,\cdots ,j\).
- 4.
\(r_i^\top p_i = -\Vert r_i\Vert ^2\), \(i=0,1,\cdots ,j\).
- 5.
\(p_i^\top {\bar{H}}p_k=0\) for all \(i,k =0,1,\cdots ,j\) with \(k \ne i\).
- 6.
\(p_i = -\sum _{k=0}^i ({\Vert r_i\Vert ^2}/{\Vert r_k\Vert ^2}) r_k\), \(i=0,1,\cdots ,j\).
- 7.
$$\begin{aligned} q(y_{i+1}) = q(y_i) - \frac{\Vert r_i\Vert ^4}{2 p_i^\top {\bar{H}}p_i}, \quad i=0,1,\cdots ,j-1. \end{aligned}$$
- 8.
\(r_i^\top {\bar{H}}r_i \ge p_i^\top {\bar{H}}p_i\), \(i=0,1,\cdots ,j\).
Proof
Since CG has not terminated prior to iteration j, (39) clearly holds. All properties then follow from the definition of the CG process, and most are proved in standard texts (see, for example, [25, Chapter 5]). Property 8 is less commonly used, so we provide a proof here.
The case \(i=0\) is immediate since \(r_0=-p_0\) and there is equality. When \(i \ge 1\), we have
$$\begin{aligned} r_i = -p_i + \frac{\Vert r_i\Vert ^2}{\Vert r_{i-1}\Vert ^2}\, p_{i-1}. \end{aligned}$$
(Note that if iteration i is reached, we cannot have \(\Vert r_{i-1}\Vert =0\).) It follows that
$$\begin{aligned} r_i^\top {\bar{H}}r_i = p_i^\top {\bar{H}}p_i - 2 \frac{\Vert r_i\Vert ^2}{\Vert r_{i-1}\Vert ^2}\, p_i^\top {\bar{H}}p_{i-1} + \frac{\Vert r_i\Vert ^4}{\Vert r_{i-1}\Vert ^4}\, p_{i-1}^\top {\bar{H}}p_{i-1} = p_i^\top {\bar{H}}p_i + \frac{\Vert r_i\Vert ^4}{\Vert r_{i-1}\Vert ^4}\, p_{i-1}^\top {\bar{H}}p_{i-1}, \end{aligned}$$
as \(p_i^\top {\bar{H}}p_{i-1} = 0\) by Property 5 above. Since iteration i has been reached, \(p_{i-1}\) is a direction of positive curvature, and we obtain \(r_i^\top {\bar{H}}r_i \ge p_i^\top {\bar{H}}p_i\), as required. \(\square \)
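These properties are easy to verify numerically. The following self-contained script (our own check, not part of the paper) runs CG on a random positive definite matrix, stores all residuals and directions, and confirms Properties 2, 5, 7 and 8 to within roundoff.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 12
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)            # positive definite, so CG can run n full steps
g = rng.standard_normal(n)

def q(y):                          # the quadratic (38) with H_bar = H
    return 0.5 * y @ H @ y + g @ y

y, r, p = np.zeros(n), g.copy(), -g.copy()
ys, rs, ps = [y], [r], [p]
for _ in range(n):
    alpha = (r @ r) / (p @ H @ p)
    y = y + alpha * p
    r = r + alpha * (H @ p)
    beta = (r @ r) / (rs[-1] @ rs[-1])
    p = -r + beta * p
    ys.append(y); rs.append(r); ps.append(p)

R, P = np.array(rs[:n]), np.array(ps[:n])          # rows r_0..r_{n-1} and p_0..p_{n-1}
assert np.allclose(R @ R.T, np.diag(np.diag(R @ R.T)), atol=1e-8)          # Property 2
assert np.allclose(P @ H @ P.T, np.diag(np.diag(P @ H @ P.T)), atol=1e-6)  # Property 5
for i in range(n):
    if i < n - 1:                                                          # Property 7
        dec = (rs[i] @ rs[i]) ** 2 / (2 * ps[i] @ H @ ps[i])
        assert np.isclose(q(ys[i + 1]), q(ys[i]) - dec)
    assert rs[i] @ H @ rs[i] >= ps[i] @ H @ ps[i] - 1e-8                   # Property 8
print("Properties 2, 5, 7 and 8 verified on this instance.")
```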
We next address an important technical point about Algorithm 1: the test (6) to identify a direction of negative curvature for H after an insufficiently rapid rate of reduction in the residual norm \(\Vert r_j\Vert \) has been observed. As written, the formula (6) suggests both that previous iterates \(y_i\), \(i=1,2,\cdots ,j-1\) must be stored (or regenerated) and that additional matrix-vector multiplications (specifically, \({\bar{H}}(y_{j+1}-y_i)\), \(i=0,1,\cdots \)) must be performed. We show here that in fact (6) can be evaluated at essentially no cost, provided we store two extra scalars at each iteration of CG: the quantities \(\alpha _k\) and \(\Vert r_k\Vert ^2\), for \(k=0,1,\cdots ,j\).
Lemma 8
Suppose that Algorithm 1 computes iterates up to iteration \(j+1\). Then, for any \(i \in \{0,\cdots ,j\}\), we can compute (6) as
Proof
By definition, \(y_{j+1}-y_i = \sum _{k=i}^{j} \alpha _k p_k\). By conjugacy of the \(p_k\) vectors, we have
where we used the definition of \(\alpha _k\) to obtain the last equality. Now we turn our attention to the denominator. Using Property 6 of Lemma 7, we have that
By rearranging the terms in the sum, we obtain
Using the fact that the residuals \(\{ r_{\ell } \}_{\ell =0,1,\cdots ,j}\) form an orthogonal set (by Property 2 of Lemma 7), we have that
Combining this with (40) gives the desired result. \(\square \)
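Although the displayed formula of Lemma 8 is not reproduced above, the proof pins down its two ingredients: by conjugacy and the definition of \(\alpha _k\), \((y_{j+1}-y_i)^\top {\bar{H}}(y_{j+1}-y_i) = \sum _{k=i}^{j} \alpha _k \Vert r_k\Vert ^2\), while Property 6 and the orthogonality of the residuals give \(\Vert y_{j+1}-y_i\Vert ^2 = \sum _{\ell =0}^{j} \bigl ( \sum _{k=\max (i,\ell )}^{j} \alpha _k \Vert r_k\Vert ^2 \bigr )^2 / \Vert r_\ell \Vert ^2\). The script below (our own numerical check; the paper's typeset arrangement of the formula may differ) confirms that the resulting scalar-only ratio matches a direct matrix evaluation of the curvature ratio.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 15
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)                  # positive definite, so CG does not stop early
g = rng.standard_normal(n)

# Run j+1 CG iterations, storing only the scalars alpha_k and ||r_k||^2.
# (The iterates y_i are kept here solely to form the reference value.)
j = 8
y, r, p = np.zeros(n), g.copy(), -g.copy()
ys, alphas, rsq = [y.copy()], [], []
for _ in range(j + 1):
    rsq.append(r @ r)
    alpha = rsq[-1] / (p @ H @ p)
    alphas.append(alpha)
    y = y + alpha * p
    r_next = r + alpha * (H @ p)
    p = -r_next + (r_next @ r_next) / (r @ r) * p
    r = r_next
    ys.append(y.copy())

for i in range(j + 1):
    d = ys[j + 1] - ys[i]
    direct = (d @ H @ d) / (d @ d)                         # matrix-based evaluation
    num = sum(alphas[k] * rsq[k] for k in range(i, j + 1))
    den = sum(sum(alphas[k] * rsq[k] for k in range(max(i, l), j + 1)) ** 2 / rsq[l]
              for l in range(j + 1))
    assert np.isclose(direct, num / den), (i, direct, num / den)
print("Scalar-only evaluation matches the direct curvature ratio for all i.")
```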
B Implementing Procedure 2 via Lanczos and conjugate gradient
In the first part of this appendix (Appendix B.1) we outline the randomized Lanczos approach and describe some salient convergence properties. The second part (Appendix B.2) analyzes the CG method (Algorithm 4) applied to a (possibly nonconvex) quadratic function with a random linear term. We show that the number of iterations required by CG to detect nonpositive curvature in an indefinite matrix is the same as the number required by Lanczos, when the two approaches are initialized in a consistent way, thereby proving Theorem 1. As a result, both techniques are implementations of Procedure 2 that satisfy Assumption 3, provided that an upper bound M on \(\Vert H\Vert \) is known. In the third part (Appendix B.3), we deal with the case in which a bound on \(\Vert H\Vert \) is not known a priori, and describe a version of the randomized Lanczos scheme that obtains an overestimate of this quantity (with high probability) during its first phase of execution. The complexity of this version differs by only a modest multiple from that of the original method, and it still satisfies Assumption 3.
B.1 Randomized Lanczos
Consider first the Lanczos algorithm applied to a symmetric n-by-n matrix \({\bar{H}}\) and a starting vector \(b \in {\mathbb {R}}^n\) with \(\Vert b\Vert =1\). After \(t+1\) iterations, Lanczos constructs a basis of the t-th Krylov subspace defined by
$$\begin{aligned} \mathcal{K}_t(b,{\bar{H}}) := {{\,\mathrm{span}\,}}\left\{ b, {\bar{H}}b, \cdots , {\bar{H}}^t b \right\} . \end{aligned}$$(41)
The Lanczos method can compute estimates of the minimum and maximum eigenvalues of \({\bar{H}}\). For \(t=0,1,\cdots \), those values are given by the extreme Rayleigh quotients over the Krylov subspace:
$$\begin{aligned} \xi _{\mathrm{{min}}}({\bar{H}},b,t) := \min _{z \in \mathcal{K}_t(b,{\bar{H}}), \, z \ne 0} \frac{z^\top {\bar{H}}z}{z^\top z}, \end{aligned}$$(42a)
$$\begin{aligned} \xi _{\mathrm{{max}}}({\bar{H}},b,t) := \max _{z \in \mathcal{K}_t(b,{\bar{H}}), \, z \ne 0} \frac{z^\top {\bar{H}}z}{z^\top z}. \end{aligned}$$(42b)
The Krylov subspaces satisfy a shift invariance property, that is, for any \(\hat{H}=a_1 I + a_2 H\) with \(a_1 \in {\mathbb {R}}\) and \(a_2 \ne 0\), we have that
$$\begin{aligned} \mathcal{K}_t(b,\hat{H}) = \mathcal{K}_t(b,H), \quad t=0,1,2,\cdots . \end{aligned}$$(43)
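With the Rayleigh-quotient characterization (42), \(\xi _{\mathrm{{min}}}\) and \(\xi _{\mathrm{{max}}}\) can be computed by a Rayleigh-Ritz step on any orthonormal basis of the Krylov subspace. The sketch below is our own illustration: it uses an explicitly orthonormalized power basis rather than the three-term Lanczos recurrence, computes both quantities, and checks the shift invariance (43) numerically.

```python
import numpy as np

def krylov_orthobasis(H, b, t):
    """Orthonormal basis of K_t(b, H) = span{b, Hb, ..., H^t b}."""
    K = np.empty((b.size, t + 1))
    v = b / np.linalg.norm(b)
    for i in range(t + 1):
        K[:, i] = v
        v = H @ v
        v = v / np.linalg.norm(v)     # rescaling does not change the span
    Q, _ = np.linalg.qr(K)
    return Q

def xi_min_max(H, b, t):
    """Extreme Rayleigh quotients of H over K_t(b, H), via Rayleigh-Ritz."""
    Q = krylov_orthobasis(H, b, t)
    ritz = np.linalg.eigvalsh(Q.T @ H @ Q)
    return ritz[0], ritz[-1]

rng = np.random.default_rng(3)
n, t = 40, 6
A = rng.standard_normal((n, n))
H_bar = 0.5 * (A + A.T)                               # symmetric test matrix
b = rng.standard_normal(n); b /= np.linalg.norm(b)

print("xi_min, xi_max:", xi_min_max(H_bar, b, t))
print("lambda_min, lambda_max:", np.linalg.eigvalsh(H_bar)[[0, -1]])

# Shift invariance (43): K_t(b, a1*I + a2*H_bar) = K_t(b, H_bar) whenever a2 != 0.
Q1 = krylov_orthobasis(H_bar, b, t)
Q2 = krylov_orthobasis(2.3 * np.eye(n) - 1.7 * H_bar, b, t)
print("same subspace:", np.allclose(Q1 @ Q1.T, Q2 @ Q2.T, atol=1e-6))
```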
Properties of the randomized Lanczos procedure are explored in [22]. The following key result is a direct consequence of Theorem 4.2(a) from the cited paper, along with the shift invariance property mentioned above.
Lemma 9
Let \({\bar{H}}\) be an \(n \times n\) symmetric matrix, let b be chosen from a uniform distribution over the sphere \(\Vert b\Vert =1\), and suppose that \(\bar{\epsilon } \in [0,1)\) and \(\delta \in (0,1)\) are given. Suppose that \(\xi _{\mathrm{{min}}}({\bar{H}},b,k)\) and \(\xi _{\mathrm{{max}}}({\bar{H}},b,k)\) are defined as in (42). Then after k iterations of randomized Lanczos, the following convergence condition holds:
provided k satisfies
A sufficient condition for (44) thus is
Similarly, we have that
for k satisfying the same conditions (45) or (46).
B.2 Lanczos and conjugate gradient as minimum eigenvalue oracles
Lemma 2 implies that using the Lanczos algorithm to estimate the minimum eigenvalue of \({\bar{H}}\) via (42a) yields an instance of Procedure 2 satisfying Assumption 3. The sequence of iterates \(\{z_t\}\) given by \(z_0=b\) and
eventually yields a direction of sufficient negative curvature, when such a direction exists.
Consider now Algorithm 4 applied to \({\bar{H}}\), and \(g=-b\). By Property 2 of Lemma 7, we can see that if Algorithm 4 does not terminate with \(j \le t\), for some given index t, then \(y_{t+1}\), \(r_{t+1}\), and \(p_{t+1}\) are computed, and we have
because \(\{r_{\ell }\}_{\ell =0}^{i}\) is a set of \(i+1\) orthogonal vectors in \(\mathcal{K}_{i}(b,{\bar{H}})\). Thus \(\{p_0,\cdots ,p_i\}\), \(\{r_0,\cdots ,r_i\}\), and \(\{ b, {\bar{H}}b, \cdots , {\bar{H}}^i b \}\) are all bases for \(\mathcal{K}_{i}(b,{\bar{H}})\), \(i=0,1,\cdots ,t\). As long as they are computed, the iterates of Algorithm 4 satisfy
The sequences defined by (48) (for Lanczos) and (50) (for CG) are related via the Krylov subspaces. We have the following result about the number of iterations required by CG to detect non-positive-definiteness.
Theorem 5
Consider applying Algorithm 4 to the quadratic function (38), with \(g=-b\) for some b with \(\Vert b\Vert = 1\). Let \(J\) be the smallest value of \(t \ge 0\) such that \(\mathcal{K}_t(b,{\bar{H}})\) contains a direction of nonpositive curvature, so that \(J\) is also the smallest index \(t \ge 0\) such that \(z_{t+1}^\top {\bar{H}}z_{t+1} \le 0\), where \(\{z_j \}\) are the Lanczos iterates from (48). Then Algorithm 4 terminates with \(j=J\), with \(p_J^\top {\bar{H}}p_J\le 0\).
Proof
We consider first the case of \(J=0\). Then \(z_1 = b / \Vert b\Vert \) and \(b^\top {\bar{H}}b \le 0\); since \(p_0=-r_0=b\), we have \(p_0^\top {\bar{H}}p_0 \le 0\), so the result holds in this case. We assume that \(J\ge 1\) for the remainder of the proof.
Suppose first that Algorithm 4 terminates with \(j=t\), for some t satisfying \(1 \le t \le J\), because of a zero residual (\(r_t=0\)), without having encountered nonpositive curvature. In that case, we can show that \({\bar{H}}^t b \in {{\,\mathrm{span}\,}}\{b,\cdots ,{\bar{H}}^{t-1}b\}\), as follows.
We can invoke (49) with t replaced by \(t-1\) since Algorithm 4 has not terminated at iteration \(t-1\). By the recursive definition of \(r_{t-1}\) within Algorithm 4, there exist coefficients \(\tau _i\) and \(\sigma _i\) such that
Since \(r_t=0\), we have again from Algorithm 4 that
The coefficient \(\alpha _{t-1} \sigma _{t-1}\) must be nonzero, because otherwise this expression would represent a nontrivial linear combination of the basis elements \(\{ b, {\bar{H}}b, \cdots , {\bar{H}}^{t-1} b \}\) of \(\mathcal{K}_{t-1}(b,{\bar{H}})\) that is equal to zero. It follows from this observation and (51) that \({\bar{H}}^t b \in {{\,\mathrm{span}\,}}\{b, {\bar{H}}b, \cdots , {\bar{H}}^{t-1} b \} = \mathcal{K}_{t-1}(b,{\bar{H}})\), as required.
Consequently,
By using a recursive argument on the definition of \(\mathcal{K}_i(b,{\bar{H}})\) for \(i=t,\cdots ,J\), we arrive at \(\mathcal{K}_{t-1}(b,{\bar{H}}) = \mathcal{K}_{J}(b,{\bar{H}})\). Thus there is a value of t smaller than \(J\) such that \(\mathcal{K}_{t}(b,{\bar{H}})\) contains a direction of nonpositive curvature, contradicting our definition of \(J\). Thus we cannot have termination of Algorithm 4 with \(j \le J\) unless \(p_j^\top {\bar{H}}p_j \le 0\).
Suppose next that CG terminates with \(j=t\) for some \(t >J\). It follows that \(p_j^\top {\bar{H}}p_j > 0\) for all \(j=0,1,\cdots ,J\). By definition of \(J\), there is a nonzero vector \(z \in \mathcal{K}_{J}(b,{\bar{H}})\) such that \(z^\top {\bar{H}}z \le 0\). On the other hand, we have \(\mathcal{K}_{J}(b,{\bar{H}}) = {{\,\mathrm{span}\,}}\{p_0,p_1,\cdots ,p_{J} \}\) by (49), thus we can write \(z = \sum _{j=0}^{J} \gamma _j p_j\), for some scalars \(\gamma _j\), \(j=0,1,\cdots ,J\). By Property 5 of Lemma 7, we then have
Since \(p_{j}^\top {\bar{H}}p_{j}>0\) for every \(j=0,1,\cdots ,J\), and not all \(\gamma _j\) can be zero (because \(z \ne 0\)), the final summation is strictly positive, a contradiction.
Suppose now that CG terminates at some \(j<J\). Then \(p_j^\top {\bar{H}}p_j \le 0\), and since \(p_j \in \mathcal{K}_j(b,{\bar{H}})\), it follows that \(\mathcal{K}_j(b,{\bar{H}})\) contains a direction of nonpositive curvature, contradicting the definition of \(J\).
We conclude that Algorithm 4 must terminate with \(j=J\) and \(p_J^\top {\bar{H}}p_J\le 0\), as claimed. \(\square \)
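Theorem 5 can be observed numerically. In the script below (our own test harness, not part of the paper), an indefinite \({\bar{H}}\) is built with a single, well-separated negative eigenvalue; the first CG iteration j at which \(p_j^\top {\bar{H}}p_j \le 0\) is compared with the smallest t for which \(\mathcal{K}_t(b,{\bar{H}})\) contains a direction of nonpositive curvature, the latter detected by Rayleigh-Ritz over an orthonormal Krylov basis. In exact arithmetic, the two indices coincide.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
# Indefinite matrix: eigenvalues in [1, 2] except for one negative eigenvalue.
Qmat, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = rng.uniform(1.0, 2.0, n); eigs[0] = -0.2
H_bar = (Qmat * eigs) @ Qmat.T
b = rng.standard_normal(n); b /= np.linalg.norm(b)

# First CG iteration (Algorithm 4 with g = -b) at which nonpositive curvature appears.
r, p = -b, b.copy()                       # r_0 = g = -b, p_0 = -r_0 = b
j_cg = None
for j in range(n):
    curv = p @ H_bar @ p
    if curv <= 0:
        j_cg = j
        break
    alpha = (r @ r) / curv
    r_next = r + alpha * (H_bar @ p)
    p = -r_next + (r_next @ r_next) / (r @ r) * p
    r = r_next

# Smallest t such that K_t(b, H_bar) contains a direction of nonpositive curvature.
J = None
K = (b / np.linalg.norm(b)).reshape(-1, 1)
for t in range(n):
    Q, _ = np.linalg.qr(K)
    if np.linalg.eigvalsh(Q.T @ H_bar @ Q)[0] <= 0:
        J = t
        break
    v = H_bar @ K[:, -1]
    K = np.column_stack([K, v / np.linalg.norm(v)])

print("CG detection index j =", j_cg, " Krylov index J =", J)   # expected: equal
```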
Theorem 5 is a generic result that does not require b to be chosen randomly. It does not guarantee that Lanczos will detect nonpositive curvature in \({\bar{H}}\) whenever it is present: b could be orthogonal to the invariant subspace associated with the nonpositive curvature, in which case the Krylov subspaces never intersect that subspace. When b is chosen randomly from a uniform distribution over the unit ball, however, we can certify the performance of Lanczos, as we have shown in Lemma 2 on the basis of Lemma 9 above. We can exploit Theorem 5 to obtain the same performance guarantee for CG, as stated in Theorem 1. We restate this result as a corollary, and prove it now.
Corollary 3
Let b be distributed uniformly on the unit ball and H be a symmetric n-by-n matrix, with \(\Vert H \Vert \le M\). Given \(\delta \in [0,1)\), define
Consider applying Algorithm 4 with \({\bar{H}}:= H + \frac{1}{2} \epsilon I\) and \(g=-b\). Then, the following properties hold:
- (a)
If \(\lambda _{\mathrm{{min}}}(H) < -\epsilon \), then with probability at least \(1-\delta \), there is some index \(j \le {\bar{J}}\) such that Algorithm 4 terminates with a direction \(p_j\) such that \(p_j^\top H p_j \le -(\epsilon /2) \Vert p_j\Vert ^2\).
- (b)
If Algorithm 4 runs for \({\bar{J}}\) iterations without terminating, then with probability at least \(1-\delta \), we have that \(\lambda _{\mathrm{{min}}}(H) \ge -\epsilon \).
Proof
We will again exploit the invariance of the Krylov subspaces to linear shifts given by (43). This allows us to make inferences about the behavior of Algorithm 4 applied to \({\bar{H}}\) from the behavior of the Lanczos method applied to H, which has been described in Lemma 2.
Suppose first that \(\lambda _{\mathrm{{min}}}(H) < - \epsilon \). By Lemma 2, we know that with probability at least \(1-\delta \), the Lanczos procedure returns a vector v such that \(\Vert v\Vert =1\) and \(v^\top Hv \le -(\epsilon /2)\) after at most \({\bar{J}}\) iterations. Thus, for some \(j \le {\bar{J}}\), we have \(v \in \mathcal{K}_j(b,H) = \mathcal{K}_j(b,{\bar{H}})\), and moreover \(v^\top {\bar{H}}v \le 0\) by definition of \({\bar{H}}\), so the Krylov subspace \(\mathcal{K}_j(b,{\bar{H}})\) contains directions of nonpositive curvature, for some \(j \le {\bar{J}}\). It then follows from Theorem 5 that Algorithm 4 terminates with some \(p_j\) such that \(p_j^\top {\bar{H}}p_j \le 0\), for some \(j \le {\bar{J}}\). Since \({\bar{H}} = H + \frac{1}{2} \epsilon I\), this is equivalent to \(p_j^\top H p_j \le -(\epsilon /2) \Vert p_j\Vert ^2\). To summarize: if \(\lambda _{\mathrm{{min}}}(H) < - \epsilon \), then with probability at least \(1-\delta \), Algorithm 4 applied to \({\bar{H}}\) with \(g=-b\) terminates within \({\bar{J}}\) iterations with a direction \(p_j\) satisfying \(p_j^\top H p_j \le -(\epsilon /2)\Vert p_j\Vert ^2\). The proof of (a) is complete.
Suppose now that Algorithm 4 applied to \({\bar{H}}\) and \(g=-b\) runs for \({\bar{J}}\) iterations without terminating, that is \(p_j^\top {\bar{H}}p_j >0\) for \(j=0,1,\cdots ,{\bar{J}}\). It follows from the logic of Theorem 5 that \(\mathcal{K}_{{\bar{J}}}(b,{\bar{H}})\) contains no directions of nonpositive curvature for \({\bar{H}}\). Equivalently, there is no direction of curvature less than \(-\epsilon /2\) for H in \(\mathcal{K}_{{\bar{J}}}(b,H)\). By Lemma 2, this certifies with probability at least \(1-\delta \) that \(\lambda _{\mathrm{{min}}}(H) \ge -\epsilon \), establishing (b). \(\square \)
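In practice, Corollary 3 is used as a minimum eigenvalue oracle: form \({\bar{H}}=H+\frac{1}{2}\epsilon I\), run Algorithm 4 with \(g=-b\) for a random unit vector b, and either return a direction certifying \(p^\top H p \le -(\epsilon /2)\Vert p\Vert ^2\) or, after \({\bar{J}}\) iterations, declare \(\lambda _{\mathrm{{min}}}(H) \ge -\epsilon \) with high probability. The wrapper below is our own sketch; since the expression (52) for \({\bar{J}}\) is not reproduced above, the iteration cap is simply an input.

```python
import numpy as np

def meo_via_cg(H, eps, max_iters, rng=None):
    """Minimum-eigenvalue-oracle sketch in the spirit of Corollary 3.

    Runs CG on H_bar = H + (eps/2) I with g = -b, b uniform on the unit sphere.
    Returns ("negative_curvature", p) with p' H p <= -(eps/2)||p||^2, or
    ("certificate", None), meaning lambda_min(H) >= -eps should hold with high
    probability when max_iters is chosen as in (52).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = H.shape[0]
    b = rng.standard_normal(n); b /= np.linalg.norm(b)
    H_bar = H + 0.5 * eps * np.eye(n)
    r, p = -b, b.copy()                       # r_0 = g = -b, p_0 = -r_0 = b
    for _ in range(min(max_iters, n)):
        curv = p @ H_bar @ p
        if curv <= 0:                         # implies p' H p <= -(eps/2) ||p||^2
            return "negative_curvature", p
        alpha = (r @ r) / curv
        r_next = r + alpha * (H_bar @ p)
        if np.linalg.norm(r_next) < 1e-14:    # exact solve, no negative curvature met
            break
        p = -r_next + (r_next @ r_next) / (r @ r) * p
        r = r_next
    return "certificate", None

# Example: lambda_min(H) = -0.2 < -eps = -0.1, so a direction should be returned.
rng = np.random.default_rng(5)
Qm, _ = np.linalg.qr(rng.standard_normal((60, 60)))
eigs = rng.uniform(0.5, 1.5, 60); eigs[0] = -0.2
H = (Qm * eigs) @ Qm.T
status, p = meo_via_cg(H, eps=0.1, max_iters=60, rng=rng)
print(status, None if p is None else (p @ H @ p) / (p @ p))   # quotient about -0.05 or less
```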
B.3 Randomized Lanczos with internal estimation of a bound on \(\Vert H\Vert \)
The methods discussed in Appendix B.2 assume knowledge of an upper bound M on \(\Vert H\Vert \). When no such bound is known, we show here that it is possible to estimate one within the Lanczos procedure itself. Algorithm 5 details the method; we show that it can be used as an instance of Procedure 2 satisfying Assumption 3 when the optional parameter M is not provided.
Algorithm 5 applies the Lanczos method to H, starting with a random vector b. We first run Lanczos for \(j_M\) iterations, where \(j_M\) does not depend on any estimate of the minimum or maximum eigenvalue and instead targets a fixed accuracy. After this initial phase of \(j_M\) iterations, we have approximations of the extreme eigenvalues \(\xi _{\mathrm{{max}}}(H, b, j_M)\) and \(\xi _{\mathrm{{min}}}(H, b, j_M)\) from (42). An estimate M of \(\Vert H\Vert \) is then given by
$$\begin{aligned} M := 2 \max \left\{ |\xi _{\mathrm{{max}}}(H, b, j_M)|, \, |\xi _{\mathrm{{min}}}(H, b, j_M)| \right\} . \end{aligned}$$(53)
We show below that \(\Vert H\Vert \le M \le 2\Vert H\Vert \) with high probability. This value can then be used, together with the tolerance \(\epsilon \), to define a new iteration limit for the Lanczos method. Once this new iteration limit is reached, we can either produce a direction of curvature at most \(-\epsilon /2\) or certify with high probability that \(\lambda _{\min }(H) \ge -\epsilon \), which are the desired outcomes for Procedure 2.
Algorithm 5 could be terminated earlier, in fewer than \(j_{\mathrm {total}}\) iterations, if a direction of sufficiently negative curvature is encountered. For simplicity, we do not consider this feature, but observe that it would not affect the guarantees of the method described in Lemma 10 below.
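The two-phase structure can be sketched as follows. This is our own simplification: Algorithm 5 itself, and the paper's formulas for \(j_M\) and \(j_{\mathrm {total}}\), are not reproduced here, so both quantities are supplied as inputs, and a Rayleigh-Ritz computation over a Krylov basis stands in for the Lanczos recurrence.

```python
import numpy as np

def two_phase_lanczos_sketch(H, eps, j_M, j_total_rule, rng=None):
    """Two-phase sketch in the spirit of Algorithm 5 (not the paper's exact pseudocode).

    Phase 1: j_M Krylov iterations give xi_min, xi_max and the norm estimate
             M = 2 * max(|xi_max|, |xi_min|), cf. (53).
    Phase 2: j_total = j_total_rule(M, eps) sets the final iteration limit
             (the paper's rule is not reproduced, so it is passed in).
    Returns lambda = xi_min(H, b, j_total) and the estimate M.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = H.shape[0]
    b = rng.standard_normal(n); b /= np.linalg.norm(b)

    def ritz_extremes(t):
        k = min(t, n - 1) + 1                  # dimension of the Krylov basis
        K = np.empty((n, k)); v = b.copy()
        for i in range(k):
            K[:, i] = v
            v = H @ v; v /= np.linalg.norm(v)  # rescaling keeps the same span
        Q, _ = np.linalg.qr(K)
        vals = np.linalg.eigvalsh(Q.T @ H @ Q)
        return vals[0], vals[-1]

    xi_lo, xi_hi = ritz_extremes(j_M)
    M = 2.0 * max(abs(xi_lo), abs(xi_hi))      # estimate of ||H||, cf. (53)
    lam, _ = ritz_extremes(j_total_rule(M, eps))
    return lam, M

rng = np.random.default_rng(6)
A = rng.standard_normal((80, 80)); H = 0.5 * (A + A.T)
lam, M = two_phase_lanczos_sketch(H, eps=0.1, j_M=10,
                                  j_total_rule=lambda M, e: int(np.sqrt(M / e)) + 1,
                                  rng=rng)
print("lambda estimate:", lam, " M:", M, " true lambda_min:", np.linalg.eigvalsh(H)[0])
```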
Lemma 10
Consider Algorithm 5 with input parameters H and \(\epsilon \), and internal parameters \(\delta \) and b. The method outputs a value \(\lambda \) such that
in at most
matrix-vector multiplications by H, with probability at least \(1 - \delta \).
Proof
We begin by showing that the first phase of Algorithm 5 yields an accurate estimate of \(\Vert H\Vert \) with high probability. We assume that \(\Vert H\Vert >0\), as the result is trivially true otherwise. By setting \(\delta \leftarrow \delta /3\) and \({\bar{\epsilon }} = \frac{1}{4}\) in Lemma 9, we obtain that the following inequalities hold, each with probability at least \(1-\delta /3\):
$$\begin{aligned} \xi _{\mathrm{{max}}}(H,b,j_M)&\ge \lambda _{\mathrm{{max}}}(H) - \tfrac{1}{4}\left[ \lambda _{\mathrm{{max}}}(H) - \lambda _{\mathrm{{min}}}(H) \right] , \end{aligned}$$(56a)
$$\begin{aligned} \xi _{\mathrm{{min}}}(H,b,j_M)&\le \lambda _{\mathrm{{min}}}(H) + \tfrac{1}{4}\left[ \lambda _{\mathrm{{max}}}(H) - \lambda _{\mathrm{{min}}}(H) \right] . \end{aligned}$$(56b)
We consider the various possibilities for \(\lambda _{\mathrm{{min}}}(H)\) and \(\lambda _{\mathrm{{max}}}(H)\) separately, showing in each case that M defined by (53) has \(\Vert H\Vert \le M \le 2 \Vert H\Vert \).
When \(\lambda _{\mathrm{{max}}}(H) \ge \lambda _{\mathrm{{min}}}(H) \ge 0\), we have \(\xi _{\mathrm{{max}}}(H,b,j_M) \ge \frac{3}{4} \lambda _{\mathrm{{max}}}(H)\) and \(0 \le \xi _{\mathrm{{min}}}(H,b,j_M) \le \xi _{\mathrm{{max}}}(H,b,j_M)\), so that
$$\begin{aligned} M&= 2 \xi _{\mathrm{{max}}}(H,b,j_M) \ge \frac{3}{2} \lambda _{\mathrm{{max}}}(H) = \frac{3}{2}\Vert H\Vert , \\ M&= 2 \xi _{\mathrm{{max}}}(H,b,j_M) \le 2 \lambda _{\mathrm{{max}}}(H) = 2 \Vert H\Vert , \end{aligned}$$as required.
When \(\lambda _{\mathrm{{min}}}(H) \le \lambda _{\mathrm{{max}}}(H) \le 0\), we have \(\xi _{\mathrm{{min}}}(H,b,j_M) \le \frac{3}{4} \lambda _{\mathrm{{min}}}(H) \le 0\) and \(\xi _{\mathrm{{min}}}(H,b,j_M) \le \xi _{\mathrm{{max}}}(H,b,j_M) \le 0\), so that
$$\begin{aligned} M&= 2 | \xi _{\mathrm{{min}}}(H,b,j_M)| \ge \frac{3}{2} | \lambda _{\mathrm{{min}}}(H)| = \frac{3}{2} \Vert H\Vert , \\ M&= 2 | \xi _{\mathrm{{min}}}(H,b,j_M)| \le 2 | \lambda _{\mathrm{{min}}}(H)| = 2 \Vert H\Vert , \end{aligned}$$as required.
When \(\lambda _{\mathrm{{min}}}(H) \le 0 \le \lambda _{\mathrm{{max}}}(H)\) and \(-\lambda _{\mathrm{{min}}}(H) \le \lambda _{\mathrm{{max}}}(H)\), we have \(\lambda _{\mathrm{{max}}}(H)-\lambda _{\mathrm{{min}}}(H) \le 2 \lambda _{\mathrm{{max}}}(H)\), so from (56a), it follows that \(\xi _{\mathrm{{max}}}(H,b,j_M) \ge \frac{1}{2} \lambda _{\mathrm{{max}}}(H) = \frac{1}{2} \Vert H\Vert \), and so
$$\begin{aligned} M&\ge 2 \xi _{\mathrm{{max}}}(H,b,j_M) \ge \Vert H\Vert , \\ M&= 2 \max \left\{ |\xi _{\mathrm{{max}}}(H, b, j_M)|, |\xi _{\mathrm{{min}}}(H, b, j_M)| \right\} \\&\le 2 \max \left\{ |\lambda _{\mathrm{{max}}}(H)|,|\lambda _{\mathrm{{min}}}(H)| \right\} = 2 \Vert H\Vert , \end{aligned}$$as required.
When \(\lambda _{\mathrm{{min}}}(H) \le 0 \le \lambda _{\mathrm{{max}}}(H)\) and \(-\lambda _{\mathrm{{min}}}(H) \ge \lambda _{\mathrm{{max}}}(H)\), we have \(\lambda _{\mathrm{{max}}}(H) - \lambda _{\mathrm{{min}}}(H) \le -2 \lambda _{\mathrm{{min}}}(H)\), so from (56b), it follows that \(\xi _{\mathrm{{min}}}(H,b,j_M) \le \frac{1}{2} \lambda _{\mathrm{{min}}}(H) \le 0\), and so
$$\begin{aligned} M&\ge 2 |\xi _{\mathrm{{min}}}(H,b,j_M)| \ge | \lambda _{\mathrm{{min}}}(H)| = \Vert H\Vert , \\ M&=2 \max \left\{ |\xi _{\mathrm{{max}}}(H, b, j_M)|, |\xi _{\mathrm{{min}}}(H, b, j_M)| \right\} \\&\le 2 \max \left\{ |\lambda _{\mathrm{{max}}}(H)|,|\lambda _{\mathrm{{min}}}(H)| \right\} = 2 \Vert H\Vert , \end{aligned}$$as required.
Since each of the bounds in (56) holds with probability at least \(1-\delta /3\), both hold with probability at least \(1-2\delta /3\), by a union bound argument.
We finally consider the complete run of Algorithm 5, which requires \(j_{\mathrm {total}}\) iterations of Lanczos. If our estimate M is accurate, we have by setting \(\delta \leftarrow \delta /3\) and \(M \leftarrow 2 \Vert H\Vert \) in Lemma 2 that \(\lambda =\xi _{\mathrm{{min}}}(H,b,j_{\mathrm {total}})\) satisfies (54) with probability \(1-\delta /3\). By using a union bound to combine this probability with the probabilities of estimating M appropriately, we obtain the probability of at least \(1-\delta \).
In conclusion, Algorithm 5 runs \(j_{\mathrm {total}}\) iterations of Lanczos (each requiring one matrix-vector multiplication by H) and terminates correctly with probability at least \(1-\delta \). \(\square \)
The lemma above shows that Algorithm 5 is an instance of Procedure 2 that does not require an a priori bound on \(\Vert H\Vert \). Assuming \(\Vert H\Vert \le U_H\), we further observe that Algorithm 5 satisfies the conditions of Assumption 3 with \(\mathcal {C}_{\mathrm {meo}}= \frac{\ln (25n/\delta ^2)}{\sqrt{2}}\sqrt{U_H}\), which is within a modest constant multiple of the one obtained for the Lanczos method with knowledge of \(\Vert H\Vert \) or \(U_H\).
C Proof of Theorem 2
Proof
The proof proceeds by contradiction: If we assume that all conditions specified in the statement of the theorem hold, and in addition that
then we must have
contradicting (11). Note that
by assumption; therefore, we can consider (57) to hold for \(i=0,1,\cdots ,{\hat{J}}\). Moreover, recalling the definition (38) of the quadratic function q, we have for any \(i=0,1,\cdots ,{\hat{J}}\) that
Thus, (57) can be reformulated as
where we used \(\nabla q(y_i)=r_i\) and the definitions (57), (59) of strong convexity along the directions \(y_{{\hat{J}}+1}-y_i\). In the remainder of the proof, and similarly to [5, Proof of Proposition 1], we will show that (60) leads to the contradiction (58), thus proving that (12) holds.
We define the sequence of functions \(\varPhi _j\), \(j=0,1,\cdots ,{\hat{J}}\), as follows:
and for \(j=0,\cdots ,{\hat{J}}-1\):
Since each \(\varPhi _j\) is a quadratic function with Hessian \(\epsilon I\), it can be written as follows:
$$\begin{aligned} \varPhi _j(z) = \varPhi _j^* + \frac{\epsilon }{2} \Vert z - v_j \Vert ^2, \end{aligned}$$(62)
where \(v_j\) is the unique minimizer of \(\varPhi _j\), and \(\varPhi _j^* = \varPhi _j(v_j)\) is the minimum value of \(\varPhi _j\). (Note that \(v_0=y_0=0\) and \(\varPhi ^*_0 = q(y_0) = 0\).)
Defining
we give a short inductive argument to establish that
For \(j=0\), (64) holds because \(\varPhi _0(y) = q(y) + \psi (y)\) by definition. Assuming that (64) holds for some index \(j \ge 0\), we find by first applying (61) (with \(z=y_{{\hat{J}}+1}\)) and then (60) (with \(i=j\)) that
Thus, we have
which proves (64) for \(j+1\), and thus completes the inductive argument.
We next prove another technical fact about the relationship between \(q(y_j)\) and \(\varPhi _j^*\), namely,
We establish this result by an inductive argument that is quite lengthy and technical; we indicate clearly below where this phase of the proof ends.
This result trivially holds (with equality) at \(j=0\). Supposing that it holds for some \(j =0,1,\cdots , {\hat{J}}-1\), we will prove that it also holds for \(j+1\).
By using Properties 7 and 8 of Lemma 7, and also \(\Vert {\bar{H}}r_j\Vert \le (M+2\epsilon )\Vert r_j\Vert \), we have
It follows from the induction hypothesis \(q(y_j) \le \varPhi _j^*\) that
By taking the derivative in (61), and using (62), we obtain
By rearranging the above relation (and noting that the z terms cancel out), we obtain:
It follows from this expression together with Properties 1 and 2 of Lemma 7 that
(The result holds for \(j=1\) because, from (67), we have \(v_1 \in {{\,\mathrm{span}\,}}\{v_0, r_0, y_0\} = {{\,\mathrm{span}\,}}\{r_0\} = {{\,\mathrm{span}\,}}\{p_0 \}\); an induction based on (67) establishes (68) for the other values of j.) By combining the expressions (62) for \(\varPhi _j\), \(\varPhi _{j+1}\) with the recurrence formula (61) for \(\varPhi _{j+1}\), we obtain
and therefore
On the other hand, we have by (67) that
where the last relation comes from \(r_j \perp {{\,\mathrm{span}\,}}\{p_0,\cdots ,p_{j-1}\}\) (Property 2 of Lemma 7) and (68) in the case \(j \ge 1\), and immediately in the case \(j=0\), since \(y_0=v_0=0\). By combining (69) and (70), we arrive at:
where the last inequality comes from (66). By using the definitions of \(\tau \) and \(\kappa \) in Algorithm 1, we have
It therefore follows from (71) that \(q(y_{j+1}) \le \varPhi _{j+1}^*\). At this point, we have shown that when \(q(y_j) \le \varPhi _j^*\) for \(j=0,1,\cdots ,{\hat{J}}-1\), it follows that \(q(y_{j+1}) \le \varPhi _{j+1}^*\), establishing the inductive step. As a result, our proof of (65) is complete.
By substituting \(j={\hat{J}}\) into (65), we obtain \(q(y_{{\hat{J}}}) \le \varPhi _{{\hat{J}}}^*\), which in combination with (64) with \(j={\hat{J}}\), and the definition (62), yields
By substitution from (63), we obtain
We now depart from the analysis of [5], and complete the proof of this result by expressing (73) in terms of residual norms. On the left-hand side, we have
because \(r_{{\hat{J}}+1}^\top (y_{{\hat{J}}}-y_{{\hat{J}}+1}) = r_{{\hat{J}}+1}^\top (\alpha _{{\hat{J}}} p_{{\hat{J}}}) = 0\) by Lemma 7, Property 2. We thus have from (59) that
On the right-hand side of (73), because of the strong convexity condition (60) at \(i=0\), we have
Moreover, we have
where the last relation follows from the definition of \(\alpha _i\). By combining these last two bounds, and using Property 3 of Lemma 7, we obtain
because \(p_j^\top {\bar{H}}p_j \ge \epsilon \Vert p_j\Vert ^2\) for \(j=0,1,\cdots ,{\hat{J}}\) by assumption.
To bound the sum in (75), we recall that since Algorithm 1 did not terminate until iteration \({\hat{J}}\), the residual norms \(\Vert r_i \Vert \) at all iterations \(i=0,1,\cdots ,{\hat{J}}-1\) must have decreased at the expected convergence rate. In particular, we have \(\Vert r_i\Vert \le \sqrt{T} \tau ^{i/2} \Vert r_0\Vert \) for the possibly smaller versions of \(\sqrt{T}\) and \(\tau \) that prevailed at iteration i, so certainly \(\Vert r_i\Vert \le \sqrt{T} \tau ^{i/2} \Vert r_0\Vert \) for the final values of these parameters. Thus for \(i=0,1,\cdots ,{\hat{J}}-1\), we have
where we used \(\Vert r_{{\hat{J}}}\Vert \ge \sqrt{T} \tau ^{{\hat{J}}/2} \Vert r_0\Vert \) (from (11)) for the last inequality. Observing that this bound also holds (trivially) for \(i={\hat{J}}\), we obtain by substituting in (75) that
Applying successively (74), (73) and (76) finally yields:
After rearranging this inequality and dividing by \(\Vert r_{{\hat{J}}}\Vert > 0\), we obtain
We have thus established (58) which, as we noted earlier, contradicts (11). Thus (57) cannot be true, so we have established (12), as required. \(\square \)
Keywords
- Smooth nonconvex optimization
- Newton’s method
- Conjugate gradient method
- Optimality conditions
- Worst-case complexity