A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization

Full Length Paper, Series A, Mathematical Programming

Abstract

We consider minimization of a smooth nonconvex objective function using an iterative algorithm based on Newton’s method and the linear conjugate gradient algorithm, with explicit detection and use of negative curvature directions for the Hessian of the objective function. The algorithm tracks Newton-conjugate gradient procedures developed in the 1980s closely, but includes enhancements that allow worst-case complexity results to be proved for convergence to points that satisfy approximate first-order and second-order optimality conditions. The complexity results match the best known results in the literature for second-order methods.


Notes

  1. Interestingly, as we show in Appendix A, the ratios on the left-hand side of (6) can be calculated without knowledge of \(y_i\) for \(i=0,1,\cdots ,j-1\), provided that we store the scalar quantities \(\alpha _i\) and \(\Vert r_i\Vert ^2\) for \(i=0,1,\cdots ,j-1\).

References

  1. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017), PMLR (2017)

  2. Allen-Zhu, Z., Li, Y.: NEON2: finding local minima via first-order oracles. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (2018)

  3. Birgin, E.G., Martínez, J.M.: The use of quadratic regularization with a cubic descent condition for unconstrained optimization. SIAM J. Optim. 27, 1049–1074 (2017)

  4. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8, 231–357 (2015)

  5. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: “Convex until proven guilty”: dimension-free acceleration of gradient descent on non-convex functions. In: International Conference on Machine Learning, vol. 70, 6–11 August 2017, International Convention Centre, Sydney, Australia, PMLR, pp. 654–663 (2017)

  6. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for non-convex optimization. SIAM J. Optim. 28, 1751–1772 (2018)

  7. Cartis, C., Gould, N.I.M., Toint, P.L.: On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization. SIAM J. Optim. 20, 2833–2852 (2010)

  8. Cartis, C., Gould, N.I.M., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127, 245–295 (2011)

  9. Cartis, C., Gould, N.I.M., Toint, P.L.: Optimal Newton-type methods for nonconvex optimization. Technical Report naXys-17-2011, Department of Mathematics, FUNDP, Namur (B) (2011)

  10. Cartis, C., Gould, N.I.M., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28, 93–108 (2012)

  11. Cartis, C., Gould, N.I.M., Toint, P.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. arXiv:1709.07180 (2017)

  12. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust-Region Methods. MPS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2000)

  13. Curtis, F.E., Robinson, D.P., Samadi, M.: A trust region algorithm with a worst-case iteration complexity of \(\cal{O}\left(\epsilon ^{-3/2}\right)\) for nonconvex optimization. Math. Program. 162, 1–32 (2017)

  14. Curtis, F.E., Robinson, D.P., Samadi, M.: An inexact regularized Newton framework with a worst-case iteration complexity of \(\cal{O}(\epsilon ^{-3/2})\) for nonconvex optimization. IMA J. Numer. Anal. (2018). https://doi.org/10.1093/imanum/dry022

  15. Dembo, R.S., Steihaug, T.: Truncated-Newton algorithms for large-scale unconstrained optimization. Math. Program. 26, 190–212 (1983)

  16. Fasano, G., Lucidi, S.: A nonmonotone truncated Newton–Krylov method exploiting negative curvature directions, for large-scale unconstrained optimization. Optim. Lett. 3, 521–535 (2009)

  17. Gould, N.I.M., Lucidi, S., Roma, M., Toint, P.L.: Exploiting negative curvature directions in linesearch methods for unconstrained optimization. Optim. Methods Softw. 14, 75–98 (2000)

  18. Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Frontiers in Applied Mathematics, 2nd edn. SIAM, Philadelphia (2008)

  19. Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. In: Proceedings of the 31st Conference on Learning Theory, PMLR, pp. 1042–1085 (2018)

  20. Karimi, S., Vavasis, S.A.: A unified convergence bound for conjugate gradient and accelerated gradient. arXiv:1605.00320 (2016)

  21. Karimi, S., Vavasis, S.A.: A single potential governing convergence of conjugate gradient, accelerated gradient and geometric descent. arXiv:1712.09498 (2017)

  22. Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13, 1094–1122 (1992)

  23. Martínez, J.M., Raydan, M.: Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J. Glob. Optim. 68, 367–385 (2017)

  24. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108, 177–205 (2006)

  25. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006)

  26. Royer, C.W., Wright, S.J.: Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J. Optim. 28, 1448–1477 (2018)

  27. Steihaug, T.: The conjugate gradient method and trust regions in large scale optimization. SIAM J. Numer. Anal. 20, 626–637 (1983)

  28. Xu, Y., Jin, R., Yang, T.: First-order stochastic algorithms for escaping from saddle points in almost linear time. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (2018)

Acknowledgements

We thank sincerely the associate editor and two referees, whose comments led us to improve the presentation and to derive stronger results.

Funding

Funding was provided by National Science Foundation (Grant Nos. 1447449, 1628384, 1634597, 1740707), Air Force Office of Scientific Research (Grant No. FA9550-13-1-0138) and Argonne National Laboratory (Grant Nos. 3F-30222, 8F-30039) and DARPA (Grant No. N660011824020).

Author information

Corresponding author

Correspondence to Stephen J. Wright.

Additional information

Research supported by NSF Awards IIS-1447449, 1628384, 1634597, and 1740707; AFOSR Award FA9550-13-1-0138; Subcontracts 3F-30222 and 8F-30039 from Argonne National Laboratory; and Award N660011824020 from the DARPA Lagrange Program.

Appendices

A Linear conjugate gradient: relevant properties

In this appendix, we provide useful results for the classical CG algorithm that also apply to the “standard CG” operations within Algorithm 1. To this end, and for the sake of the discussion in Appendix B, we sketch the standard CG method as Algorithm 4, reusing the notation of Algorithm 1.

[Algorithm 4: the standard CG method, stated in the notation of Algorithm 1 (pseudocode figure)]
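The following is a minimal sketch of the standard CG loop described above, in Python: starting from \(y_0=0\), it works toward solving \({\bar{H}}y=-g\) and stops when a direction of nonpositive curvature is encountered or the residual is small. The function name, interface, and tolerance are illustrative and do not reproduce the precise tests of Algorithms 1 and 4.

```python
import numpy as np

def cg_standard(Hbar, g, rtol=1e-8, max_iter=None):
    """Minimal sketch of standard CG for minimizing q(y) = 0.5 y^T Hbar y + g^T y
    (equivalently, solving Hbar y = -g), with a nonpositive-curvature test."""
    n = g.shape[0]
    max_iter = n if max_iter is None else max_iter
    y = np.zeros(n)
    r = g.copy()                # r_j = g + Hbar y_j, so r_0 = g
    p = -r                      # p_0 = -r_0
    for j in range(max_iter):
        Hp = Hbar @ p
        curv = p @ Hp
        if curv <= 0:
            return y, p, "nonpositive curvature"        # p_j^T Hbar p_j <= 0
        alpha = (r @ r) / curv                          # alpha_j = ||r_j||^2 / (p_j^T Hbar p_j)
        y = y + alpha * p                               # y_{j+1} = y_j + alpha_j p_j
        r_new = r + alpha * Hp                          # r_{j+1} = r_j + alpha_j Hbar p_j
        if np.linalg.norm(r_new) <= rtol * np.linalg.norm(g):
            return y, None, "converged"
        p = -r_new + ((r_new @ r_new) / (r @ r)) * p    # p_{j+1} = -r_{j+1} + beta_{j+1} p_j
        r = r_new
    return y, None, "iteration limit"
```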

Here and below, we refer often to the following quadratic function:

$$\begin{aligned} q(y) := \frac{1}{2}y^\top {\bar{H}}y + g^\top y, \end{aligned}$$
(38)

where \({\bar{H}}\) and g are the matrix and vector parameters of Algorithms 1 or 4. When \({\bar{H}}\) is positive definite, the minimizer of q is identical to the unique solution of \({\bar{H}}y = -g\). CG can be viewed either as an algorithm to solve \({\bar{H}}y = -g\) or as an algorithm to minimize q.
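As a small numerical illustration of this equivalence (the matrix, vector, and perturbation size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Hbar = A @ A.T + np.eye(5)                       # a positive definite stand-in for Hbar
g = rng.standard_normal(5)

q = lambda y: 0.5 * y @ Hbar @ y + g @ y         # the quadratic (38)
y_star = np.linalg.solve(Hbar, -g)               # unique solution of Hbar y = -g

# The gradient of q vanishes at y_star, so y_star is also the (unique) minimizer of q.
assert np.allclose(Hbar @ y_star + g, 0.0)
assert all(q(y_star) <= q(y_star + 0.01 * rng.standard_normal(5)) for _ in range(20))
```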

The next lemma details several properties of the conjugate gradient method to be used in the upcoming proofs.

Lemma 7

Suppose that j iterations of the CG loop are performed in Algorithm 1 or 4. Then, we have

$$\begin{aligned} \frac{p_i^\top {\bar{H}}p_i}{\Vert p_i\Vert ^2} > 0 \quad \text{ for } \text{ all } i=0,1,\cdots ,j-1. \end{aligned}$$
(39)

Moreover, the following properties hold.

  1. 1.

    \(y_i \in {{\,\mathrm{span}\,}}\left\{ p_0,\cdots ,p_{i-1}\right\} \), \(i=1,2,\cdots ,j\).

  2. 2.

    \(r_i \in {{\,\mathrm{span}\,}}\left\{ p_0,\cdots ,p_i\right\} \) for all \(i=1,2,\cdots ,j\), and

    $$\begin{aligned} r_i^\top v = 0, \quad \text{ for } \text{ all } v \in {{\,\mathrm{span}\,}}\left\{ p_0,\cdots ,p_{i-1}\right\} \text{ and } \text{ all } i=1,2,\cdots ,j. \end{aligned}$$

    (In particular, \(r_i^\top r_l = 0\) if \(0 \le l<i \le j\). If \(j=n\), then \(r_n=0\).)

  3. 3.

    \(\Vert r_i\Vert \le \Vert p_i\Vert \), \(i=0,1,\cdots ,j\).

  4. 4.

    \(r_i^\top p_i = -\Vert r_i\Vert ^2\), \(i=0,1,\cdots ,j\).

  5. 5.

    \(p_i^\top {\bar{H}}p_k=0\) for all \(i,k =0,1,\cdots ,j\) with \(k \ne i\).

  6. 6.

    \(p_i = -\sum _{k=0}^i ({\Vert r_i\Vert ^2}/{\Vert r_k\Vert ^2}) r_k\), \(i=0,1,\cdots ,j\).

  7. 7.
    $$\begin{aligned} q(y_{i+1}) = q(y_i) - \frac{\Vert r_i\Vert ^4}{2 p_i^\top {\bar{H}}p_i}, \quad i=0,1,\cdots ,j-1. \end{aligned}$$
  8. 8.

    \(r_i^\top {\bar{H}}r_i \ge p_i^\top {\bar{H}}p_i\), \(i=0,1,\cdots ,j\).

Proof

Since CG has not terminated prior to iteration j, (39) clearly holds. All properties then follow from the definition of the CG process, and most are proved in standard texts (see, for example, [25, Chapter 5]). Property 8 is less commonly used, so we provide a proof here.

The case \(i=0\) is immediate since \(r_0=-p_0\) and there is equality. When \(i \ge 1\), we have:

$$\begin{aligned} p_i = -r_i + \frac{\Vert r_i\Vert ^2}{\Vert r_{i-1}\Vert ^2}p_{i-1} \; \Leftrightarrow \; r_i = -p_i + \frac{\Vert r_i\Vert ^2}{\Vert r_{i-1}\Vert ^2}p_{i-1}. \end{aligned}$$

(Note that if iteration i is reached, we cannot have \(\Vert r_{i-1}\Vert =0\).) It follows that

$$\begin{aligned} r_i^\top {\bar{H}}r_i= & {} p_i^\top {\bar{H}}p_i -2\frac{\Vert r_i\Vert ^2}{\Vert r_{i-1}\Vert ^2} p_i^\top {\bar{H}}p_{i-1} + \frac{\Vert r_i\Vert ^4}{\Vert r_{i-1}\Vert ^4}p_{i-1}^\top {\bar{H}}p_{i-1} \\= & {} p_i^\top {\bar{H}}p_i + \frac{\Vert r_i\Vert ^4}{\Vert r_{i-1}\Vert ^4}p_{i-1}^\top {\bar{H}}p_{i-1}, \end{aligned}$$

as \(p_i^\top {\bar{H}}p_{i-1} = 0\) by Property 5 above. Since iteration i has been reached, \(p_{i-1}\) is a direction of positive curvature, and we obtain \(r_i^\top {\bar{H}}r_i \ge p_i^\top {\bar{H}}p_i\), as required. \(\square \)

We next address an important technical point about Algorithm 1: the test (6) to identify a direction of negative curvature for H after an insufficiently rapid rate of reduction in the residual norm \(\Vert r_j\Vert \) has been observed. As written, the formula (6) suggests both that previous iterates \(y_i\), \(i=1,2,\cdots ,j-1\) must be stored (or regenerated) and that additional matrix-vector multiplications (specifically, \({\bar{H}}(y_{j+1}-y_i)\), \(i=0,1,\cdots \)) must be performed. We show here that in fact (6) can be evaluated at essentially no cost, provided we store two extra scalars at each iteration of CG: the quantities \(\alpha _k\) and \(\Vert r_k\Vert ^2\), for \(k=0,1,\cdots ,j\).

Lemma 8

Suppose that Algorithm 1 computes iterates up to iteration \(j+1\). Then, for any \(i \in \{0,\cdots ,j\}\), we can compute (6) as

$$\begin{aligned} \frac{(y_{j+1}-y_i)^\top {\bar{H}}(y_{j+1}-y_i)}{\Vert y_{j+1}-y_i\Vert ^2} = \frac{\sum _{k=i}^{j} \alpha _k \Vert r_k\Vert ^2}{\sum _{\ell =0}^j \left[ \sum _{k=\max \{\ell ,i\}}^j \alpha _k \Vert r_k\Vert ^2 \right] ^2/\Vert r_{\ell }\Vert ^2}. \end{aligned}$$

Proof

By definition, \(y_{j+1}-y_i = \sum _{k=i}^{j} \alpha _k p_k\). By conjugacy of the \(p_k\) vectors, we have

$$\begin{aligned} (y_{j+1}-y_i)^\top {\bar{H}}(y_{j+1}-y_i) = \sum _{k=i}^{j} \alpha _k^2 p_k^\top {\bar{H}}p_k = \sum _{k=i}^{j} \alpha _k \Vert r_k\Vert ^2, \end{aligned}$$
(40)

where we used the definition of \(\alpha _k\) to obtain the last equality. Now we turn our attention to the denominator. Using Property 6 of Lemma 7, we have that

$$\begin{aligned} y_{j+1}-y_i = \sum _{k=i}^{j} \alpha _k p_k = \sum _{k=i}^{j} \alpha _k \left( -\sum _{\ell =0}^k \frac{\Vert r_k\Vert ^2}{\Vert r_\ell \Vert ^2} r_\ell \right) , \end{aligned}$$

By rearranging the terms in the sum, we obtain

$$\begin{aligned} y_{j+1}-y_i = -\sum _{k=i}^{j} \sum _{\ell =0}^k \alpha _k\Vert r_k\Vert ^2 \frac{r_\ell }{\Vert r_{\ell }\Vert ^2} = -\sum _{\ell =0}^j \left[ \sum _{k=\max \{\ell ,i\}}^j \alpha _k \Vert r_k\Vert ^2 \right] \frac{r_\ell }{\Vert r_\ell \Vert ^2}. \end{aligned}$$

Using the fact that the residuals \(\{ r_{\ell } \}_{\ell =0,1,\cdots ,j}\) form an orthogonal set (by Property 2 of Lemma 7), we have that

$$\begin{aligned} \Vert y_{j+1}-y_i\Vert ^2 = \sum _{\ell =0}^j \frac{1}{\Vert r_{\ell } \Vert ^2} \left[ \sum _{k=\max \{\ell ,i\}}^j \alpha _k \Vert r_k\Vert ^2 \right] ^2. \end{aligned}$$

Combining this with (40) gives the desired result. \(\square \)
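The formula of Lemma 8 translates directly into code; in the sketch below, `alphas[k]` and `rnorm2[k]` hold the stored scalars \(\alpha _k\) and \(\Vert r_k\Vert ^2\) for \(k=0,1,\cdots ,j\) (the function and argument names are illustrative).

```python
def curvature_ratio(alphas, rnorm2, i):
    """Left-hand side of (6): the Rayleigh quotient of Hbar along y_{j+1} - y_i,
    computed from the stored scalars alpha_k and ||r_k||^2 only (Lemma 8)."""
    j = len(alphas) - 1
    numer = sum(alphas[k] * rnorm2[k] for k in range(i, j + 1))
    denom = sum(
        sum(alphas[k] * rnorm2[k] for k in range(max(l, i), j + 1)) ** 2 / rnorm2[l]
        for l in range(j + 1)
    )
    return numer / denom
```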

B Implementing Procedure 2 via Lanczos and conjugate gradient

In the first part of this appendix (Appendix B.1) we outline the randomized Lanczos approach and describe some salient convergence properties. The second part (Appendix B.2) analyzes the CG method (Algorithm 4) applied to a (possibly nonconvex) quadratic function with a random linear term. We show that the number of iterations required by CG to detect nonpositive curvature in an indefinite matrix is the same as the number required by Lanczos, when the two approaches are initialized in a consistent way, thereby proving Theorem 1. As a result, both techniques are implementations of Procedure 2 that satisfy Assumption 3, provided that an upper bound M on \(\Vert H\Vert \) is known. In the third part (Appendix B.3), we deal with the case in which a bound on \(\Vert H\Vert \) is not known a priori, and describe a version of the randomized Lanczos scheme which obtains an overestimate of this quantity (with high probability) during its first phase of execution. The complexity of this version differs by only a modest multiple from the complexity of the original method, and still satisfies Assumption 3.

B.1 Randomized Lanczos

Consider first the Lanczos algorithm applied to a symmetric, n-by-n matrix \({\bar{H}}\) and a starting vector \(b \in {\mathbb {R}}^n\) with \(\Vert b\Vert =1\). After \(t+1\) iterations, Lanczos constructs a basis of the t-th Krylov subspace defined by

$$\begin{aligned} \mathcal{K}_t(b, {\bar{H}}) = {{\,\mathrm{span}\,}}\{ b, {\bar{H}}b, \cdots , {\bar{H}}^t b \}. \end{aligned}$$
(41)

The Lanczos method can compute estimates of the minimum and maximum eigenvalues of \({\bar{H}}\). For \(t=0,1,\cdots \), those values are given by

$$\begin{aligned} \xi _{\mathrm{{min}}}({\bar{H}},b,t)&= \min _z \, z^\top {\bar{H}}z \quad \hbox {subject to }\Vert z\Vert _2=1, z \in \mathcal{K}_t(b,{\bar{H}}), \end{aligned}$$
(42a)
$$\begin{aligned} \xi _{\mathrm{{max}}}({\bar{H}},b,t)&= \max _z \, z^\top {\bar{H}}z \quad \hbox {subject to }\Vert z\Vert _2=1, z \in \mathcal{K}_t(b,{\bar{H}}). \end{aligned}$$
(42b)

The Krylov subspaces satisfy a shift invariance property, that is, for any \(\hat{H}=a_1 I + a_2 H\) with \((a_1,a_2) \in {\mathbb {R}}^2\), we have that

$$\begin{aligned} \mathcal{K}_t(b, \hat{H}) = \mathcal{K}_t(b,H) \quad \text{ for } t=0,1,\cdots \end{aligned}$$
(43)

Properties of the randomized Lanczos procedure are explored in [22]. The following key result is a direct consequence of Theorem 4.2(a) from the cited paper, along with the shift invariance property mentioned above.

Lemma 9

Let \({\bar{H}}\) be an \(n \times n\) symmetric matrix, let b be chosen from a uniform distribution over the sphere \(\Vert b\Vert =1\), and suppose that \(\bar{\epsilon } \in [0,1)\) and \(\delta \in (0,1)\) are given. Suppose that \(\xi _{\mathrm{{min}}}({\bar{H}},b,k)\) and \(\xi _{\mathrm{{max}}}({\bar{H}},b,k)\) are defined as in (42). Then after k iterations of randomized Lanczos, the following convergence condition holds:

$$\begin{aligned} \lambda _{\mathrm{{max}}}({\bar{H}}) - \xi _{\mathrm{{max}}}({\bar{H}},b,k)\le & {} \bar{\epsilon }(\lambda _{\mathrm{{max}}}({\bar{H}})-\lambda _{\mathrm{{min}}}({\bar{H}})) \nonumber \\&\quad \text{ with } \text{ probability } \text{ at } \text{ least } 1-\delta , \end{aligned}$$
(44)

provided k satisfies

$$\begin{aligned} k=n \quad \text{ or } \quad 1.648 \sqrt{n} \exp \left( -\sqrt{\bar{\epsilon }} (2k-1) \right) \le \delta . \end{aligned}$$
(45)

A sufficient condition for (44) thus is

$$\begin{aligned} k \ge \min \left\{ n,1+\left\lceil \frac{1}{4 \sqrt{\bar{\epsilon }}} \ln (2.75 n/\delta ^2) \right\rceil \right\} . \end{aligned}$$
(46)

Similarly, we have that

$$\begin{aligned} \xi _{\mathrm{{min}}}({\bar{H}},b,k)-\lambda _{\mathrm{{min}}}({\bar{H}})\le & {} \bar{\epsilon }(\lambda _{\mathrm{{max}}}({\bar{H}})-\lambda _{\mathrm{{min}}}({\bar{H}}))\nonumber \\&\quad \text{ with } \text{ probability } \text{ at } \text{ least } 1-\delta \end{aligned}$$
(47)

for k satisfying the same conditions (45) or (46).
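For concreteness, the sufficient iteration count (46) can be computed directly (the function name is illustrative; the formula is a transcription of (46)):

```python
import math

def lanczos_iteration_bound(n, eps_bar, delta):
    """Number of randomized-Lanczos iterations sufficient for (44) and (47), per (46)."""
    return min(n, 1 + math.ceil(math.log(2.75 * n / delta**2) / (4.0 * math.sqrt(eps_bar))))

# Example: even for n = 10**6, eps_bar = 1/4 and delta = 0.01 give a bound far below n.
print(lanczos_iteration_bound(10**6, 0.25, 0.01))
```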

B.2 Lanczos and conjugate gradient as minimum eigenvalue oracles

Lemma 2 implies that using the Lanczos algorithm to estimate the minimum eigenvalue of \({\bar{H}}\) via (42a) represents an instance of Procedure 2 satisfying Assumption 3. The sequence of iterates \(\{z_t\}\) given by \(z_0=b\) and

$$\begin{aligned} z_{t+1}\, {\in }\, \arg \min _z \, \frac{1}{2} z^\top {\bar{H}}z \quad \text{ subject } \text{ to } \Vert z\Vert _2=1, z \in \mathcal{K}_t(b,{\bar{H}}),\hbox { for } t=0,1,\cdots \qquad \end{aligned}$$
(48)

eventually yields a direction of sufficient negative curvature, when such a direction exists.

Consider now Algorithm 4 applied to \({\bar{H}}\), and \(g=-b\). By Property 2 of Lemma 7, we can see that if Algorithm 4 does not terminate with \(j \le t\), for some given index t, then \(y_{t+1}\), \(r_{t+1}\), and \(p_{t+1}\) are computed, and we have

$$\begin{aligned} {{\,\mathrm{span}\,}}\{p_0,\cdots ,p_i\} = {{\,\mathrm{span}\,}}\{r_0,\cdots ,r_i\} = \mathcal{K}_{i}(b,{\bar{H}}), \quad \text{ for } i=0,1,\cdots ,t, \end{aligned}$$
(49)

because \(\{r_{\ell }\}_{\ell =0}^{i}\) is a set of \(i+1\) orthogonal vectors in \(\mathcal{K}_{i}(b,{\bar{H}})\). Thus \(\{p_0,\cdots ,p_i\}\), \(\{r_0,\cdots ,r_i\}\), and \(\{ b, {\bar{H}}b, \cdots , {\bar{H}}^i b \}\) are all bases for \(\mathcal{K}_{i}(b,{\bar{H}})\), \(i=0,1,\cdots ,t\). As long as they are computed, the iterates of Algorithm 4 satisfy

$$\begin{aligned} y_{t+1} := \arg \min _y \, \frac{1}{2} y^\top {\bar{H}}y - b^\top y \quad \text{ subject } \text{ to } y \in \mathcal{K}_{t}(b,{\bar{H}}),\hbox { for }t=0,1,\cdots .\qquad \end{aligned}$$
(50)

The sequences defined by (48) (for Lanczos) and (50) (for CG) are related via the Krylov subspaces. We have the following result about the number of iterations required by CG to detect non-positive-definiteness.

Theorem 5

Consider applying Algorithm 4 to the quadratic function (38), with \(g=-b\) for some b with \(\Vert b\Vert = 1\). Let \(J\) be the smallest value of \(t \ge 0\) such that \(\mathcal{K}_t(b,{\bar{H}})\) contains a direction of nonpositive curvature, so that \(J\) is also the smallest index \(t \ge 0\) such that \(z_{t+1}^\top {\bar{H}}z_{t+1} \le 0\), where \(\{z_j \}\) are the Lanczos iterates from (48). Then Algorithm 4 terminates with \(j=J\), with \(p_J^\top {\bar{H}}p_J\le 0\).

Proof

We consider first the case of \(J=0\). Then \(z_1 = b / \Vert b\Vert \) and \(b^\top {\bar{H}}b \le 0\), so since \(p_0=-r_0=b\), we have \(p_0^\top {\bar{H}}p_0 \le 0\), so the result holds in this case. We assume that \(J\ge 1\) for the remainder of the proof.

Suppose first that Algorithm 4 terminates with \(j=t\), for some t satisfying \(1 \le t \le J\), because of a zero residual (\(r_t=0\)), without having encountered nonpositive curvature. In that case, we can show that \({\bar{H}}^t b \in {{\,\mathrm{span}\,}}\{b,\cdots ,{\bar{H}}^{t-1}b\}\).

We can invoke (49) with t replaced by \(t-1\) since Algorithm 4 has not terminated at iteration \(t-1\). By the recursive definition of \(r_{t-1}\) within Algorithm 4, there exist coefficients \(\tau _i\) and \(\sigma _i\) such that

$$\begin{aligned} r_{t-1} = -b + \sum _{i=1}^{t-1} \tau _i {\bar{H}}^i b, \quad p_{t-1} = \sum _{i=0}^{t-1} \sigma _i {\bar{H}}^i b. \end{aligned}$$

Since \(r_t=0\), we have again from Algorithm 4 that

$$\begin{aligned} 0 = r_{t-1} + \alpha _{t-1} {\bar{H}}p_{t-1} = -b + \sum _{i=1}^{t-1} (\tau _i + \alpha _{t-1} \sigma _{i-1}) {\bar{H}}^i b + \alpha _{t-1} \sigma _{t-1} {\bar{H}}^t b.\qquad \end{aligned}$$
(51)

The coefficient \(\alpha _{t-1} \sigma _{t-1}\) must be nonzero, because otherwise this expression would represent a nontrivial linear combination of the basis elements \(\{ b, {\bar{H}}b, \cdots , {\bar{H}}^{t-1} b \}\) of \(\mathcal{K}_{t-1}(b,{\bar{H}})\) that is equal to zero. It follows from this observation and (51) that \({\bar{H}}^t b \in {{\,\mathrm{span}\,}}\{b, {\bar{H}}b, \cdots , {\bar{H}}^{t-1} b \} = \mathcal{K}_{t-1}(b,{\bar{H}})\), as required.

Consequently,

$$\begin{aligned} \mathcal{K}_t(b,{\bar{H}}) = {{\,\mathrm{span}\,}}\{b,{\bar{H}}b,\cdots ,{\bar{H}}^t b\} = {{\,\mathrm{span}\,}}\{b,\cdots ,{\bar{H}}^{t-1}b\} = \mathcal{K}_{t-1}(b,{\bar{H}}). \end{aligned}$$

By using a recursive argument on the definition of \(\mathcal{K}_i(b,{\bar{H}})\) for \(i=t,\cdots ,J\), we arrive at \(\mathcal{K}_{t-1}(b,{\bar{H}}) = \mathcal{K}_{J}(b,{\bar{H}})\). Thus there is a value of t smaller than \(J\) such that \(\mathcal{K}_{t}(b,{\bar{H}})\) contains a direction of nonpositive curvature, contradicting our definition of \(J\). Thus we cannot have termination of Algorithm 4 with \(j \le J\) unless \(p_j^\top {\bar{H}}p_j \le 0\).

Suppose next that CG terminates with \(j=t\) for some \(t >J\). It follows that \(p_j^\top {\bar{H}}p_j > 0\) for all \(j=0,1,\cdots ,J\). By definition of \(J\), there is a nonzero vector \(z \in \mathcal{K}_{J}(b,{\bar{H}})\) such that \(z^\top {\bar{H}}z \le 0\). On the other hand, we have \(\mathcal{K}_{J}(b,{\bar{H}}) = {{\,\mathrm{span}\,}}\{p_0,p_1,\cdots ,p_{J} \}\) by (49), thus we can write \(z = \sum _{j=0}^{J} \gamma _j p_j\), for some scalars \(\gamma _j\), \(j=0,1,\cdots ,J\). By Property 5 of Lemma 7, we then have

$$\begin{aligned} 0 \ge z^\top {\bar{H}}z = \sum _{j=0}^{J} \gamma _j^2 p_j^\top {\bar{H}}p_j. \end{aligned}$$

Since \(p_{j}^\top {\bar{H}}p_{j}>0\) for every \(j=0,1,\cdots ,J\), and not all \(\gamma _j\) can be zero (because \(z \ne 0\)), the final summation is strictly positive, a contradiction.

Suppose now that CG terminates at some \(j<J\). Then \(p_j^\top {\bar{H}}p_j \le 0\), and since \(p_j \in \mathcal{K}_j(b,{\bar{H}})\), it follows that \(\mathcal{K}_j(b,{\bar{H}})\) contains a direction of nonpositive curvature, contradicting the definition of \(J\).

We conclude that Algorithm 4 must terminate with \(j=J\) and \(p_J^\top {\bar{H}}p_J\le 0\), as claimed. \(\square \)

Theorem 5 is a generic result that does not require b to be chosen randomly. It does not guarantee that Lanczos will detect nonpositive curvature in \({\bar{H}}\) whenever it is present, because b could be orthogonal to the invariant subspace associated with the nonpositive curvature, in which case the Krylov subspaces never intersect that subspace. When b is chosen randomly from a uniform distribution over the unit ball, however, we can certify the performance of Lanczos, as we have shown in Lemma 2 based on Lemma 9 above. We can exploit Theorem 5 to obtain the same performance for CG, as stated in Theorem 1. We restate this result as a corollary, and prove it now.

Corollary 3

Let b be distributed uniformly on the unit ball and H be a symmetric n-by-n matrix, with \(\Vert H \Vert \le M\). Given \(\delta \in [0,1)\), define

$$\begin{aligned} {\bar{J}}:= \min \left\{ n, 1+\left\lceil \frac{\ln (2.75n/\delta ^2)}{2} \sqrt{\frac{M}{\epsilon }} \right\rceil \right\} . \end{aligned}$$
(52)

Consider applying Algorithm 4 with \({\bar{H}}:= H + \frac{1}{2} \epsilon I\) and \(g=-b\). Then, the following properties hold:

  1. (a)

    If \(\lambda _{\mathrm{{min}}}(H) < -\epsilon \), then with probability at least \(1-\delta \), there is some index \(j \le {\bar{J}}\) such that Algorithm 4 terminates with a direction \(p_j\) such that \(p_j^\top H p_j \le -(\epsilon /2) \Vert p_j\Vert ^2\).

  2. (b)

    If Algorithm 4 runs for \({\bar{J}}\) iterations without terminating, then with probability at least \(1-\delta \), we have that \(\lambda _{\mathrm{{min}}}(H) \ge -\epsilon \).

Proof

We will again exploit the invariance of the Krylov subspaces to linear shifts given by (43). This allows us to make inferences about the behavior of Algorithm 4 applied to \({\bar{H}}\) from the behavior of the Lanczos method applied to H, which has been described in Lemma 2.

Suppose first that \(\lambda _{\mathrm{{min}}}(H) < - \epsilon \). By Lemma 2, we know that with probability at least \(1-\delta \), the Lanczos procedure returns a vector v such that \(\Vert v\Vert =1\) and \(v^\top Hv \le -(\epsilon /2)\) after at most \({\bar{J}}\) iterations. Thus, for some \(j \le {\bar{J}}\), we have \(v \in \mathcal{K}_j(b,H) = \mathcal{K}_j(b,{\bar{H}})\), and moreover \(v^\top {\bar{H}}v \le 0\) by definition of \({\bar{H}}\), so the Krylov subspace \(\mathcal{K}_j(b,{\bar{H}})\) contains directions of nonpositive curvature, for some \(j \le {\bar{J}}\). It then follows from Theorem 5 that \(p_j^\top {\bar{H}}p_j \le 0\) for some \(j \le {\bar{J}}\). To summarize: If \(\lambda _{\mathrm{{min}}}(H) < - \epsilon \), then with probability at least \(1-\delta \), Algorithm 4 applied to \({\bar{H}}\) and \(g=-b\) will terminate with some \(p_j\) such that \(p_j^\top {\bar{H}}p_j \le 0\) for some j with \(j \le {\bar{J}}\). Since \({\bar{H}}= H + \frac{1}{2} \epsilon I\), this condition is equivalent to \(p_j^\top H p_j \le -(\epsilon /2) \Vert p_j\Vert ^2\). The proof of (a) is complete.

Suppose now that Algorithm 4 applied to \({\bar{H}}\) and \(g=-b\) runs for \({\bar{J}}\) iterations without terminating, that is, \(p_j^\top {\bar{H}}p_j >0\) for \(j=0,1,\cdots ,{\bar{J}}\). It follows from the logic of Theorem 5 that \(\mathcal{K}_{{\bar{J}}}(b,{\bar{H}})\) contains no directions of nonpositive curvature for \({\bar{H}}\). Equivalently, there is no direction of curvature \(-\epsilon /2\) or less for H in \(\mathcal{K}_{{\bar{J}}}(b,H)\). By Lemma 2, this certifies with probability at least \(1-\delta \) that \(\lambda _{\mathrm{{min}}}(H) \ge -\epsilon \), establishing (b). \(\square \)
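The following is a compact sketch of this CG-based instance of Procedure 2, following the corollary (the function name, the dense-matrix interface, and the fallback estimate of M are illustrative; in practice \({\bar{H}}\) would be accessed only through matrix-vector products):

```python
import numpy as np

def meo_via_cg(H, eps, delta, M=None, rng=None):
    """Sketch of Procedure 2 via CG (Corollary 3): return a unit direction d with
    d^T H d <= -(eps/2), or None to certify lambda_min(H) >= -eps with prob. >= 1 - delta."""
    rng = np.random.default_rng() if rng is None else rng
    n = H.shape[0]
    if M is None:
        M = np.linalg.norm(H, 2)                     # stand-in for a known bound M >= ||H||
    Jbar = min(n, 1 + int(np.ceil(0.5 * np.log(2.75 * n / delta**2) * np.sqrt(M / eps))))
    b = rng.standard_normal(n)
    b /= np.linalg.norm(b)                           # uniform direction on the unit sphere
    Hbar = H + 0.5 * eps * np.eye(n)                 # the shifted matrix of Corollary 3
    y, r, p = np.zeros(n), -b, b.copy()              # g = -b, so r_0 = -b and p_0 = b
    for _ in range(Jbar + 1):
        Hp = Hbar @ p
        curv = p @ Hp
        if curv <= 0:                                # p^T Hbar p <= 0, i.e. p^T H p <= -(eps/2)||p||^2
            return p / np.linalg.norm(p)
        alpha = (r @ r) / curv
        y = y + alpha * p
        r_new = r + alpha * Hp
        if np.linalg.norm(r_new) < 1e-14:            # exact solve: no nonpositive curvature found
            break
        p = -r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return None                                      # certify lambda_min(H) >= -eps (w.p. >= 1 - delta)
```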

B.3 Randomized Lanczos with internal estimation of a bound on \(\Vert H\Vert \)

The methods discussed in Appendix B.2 assume knowledge of an upper bound M on the norm \(\Vert H\Vert \) of the matrix under consideration. When no such bound is known, we show here that it is possible to estimate one within the Lanczos procedure itself. Algorithm 5 details the method; we show that it can be used as an instance of Procedure 2 satisfying Assumption 3 when the optional parameter M is not provided.

Algorithm 5 consists of applying the Lanczos method to H, starting from a random vector b. We first run Lanczos for \(j_M\) iterations, where \(j_M\) does not depend on any estimate of the minimum or maximum eigenvalue and instead targets a fixed accuracy. After this initial phase of \(j_M\) iterations, we have approximations of the extreme eigenvalues \(\xi _{\mathrm{{max}}}(H, b, j_M)\) and \(\xi _{\mathrm{{min}}}(H, b, j_M)\) from (42). An estimate M of \(\Vert H\Vert \) is then given by:

$$\begin{aligned} M = 2 \max \left\{ |\xi _{\mathrm{{max}}}(H, b, j_M)|, |\xi _{\mathrm{{min}}}(H, b, j_M)| \right\} . \end{aligned}$$
(53)

We show below that \(\Vert H\Vert \le M \le 2\Vert H\Vert \), with high probability. This value can then be used together with a tolerance \(\epsilon \) to define a new iteration limit for the Lanczos method. After this new iteration limit is reached, we can either produce a direction of curvature at most \(-\epsilon /2\), or certify with high probability that \(\lambda _{\mathrm{{min}}}(H) \ge -\epsilon \): the desired outcomes for Procedure 2.

[Algorithm 5: randomized Lanczos with internal estimation of a bound on \(\Vert H\Vert \) (pseudocode figure)]
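A minimal sketch of this two-phase procedure is given below; the Lanczos routine uses full reorthogonalization for robustness, and the iteration counts \(j_M\) and \(j_{\mathrm {total}}\) follow the constants appearing in (55) only approximately, so they should be read as indicative rather than exact.

```python
import numpy as np

def lanczos_extreme_ritz(H, b, k):
    """Run (at most) k Lanczos iterations from the unit vector b and return the extreme
    Ritz values, i.e. xi_min(H, b, k-1) and xi_max(H, b, k-1) in the notation of (42)."""
    n = H.shape[0]
    Q = np.zeros((n, k))
    alphas, betas = [], []
    q = b / np.linalg.norm(b)
    q_prev, beta = np.zeros(n), 0.0
    for j in range(k):
        Q[:, j] = q
        w = H @ q - beta * q_prev
        alpha = q @ w
        w = w - alpha * q
        w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # full reorthogonalization
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        if beta < 1e-12:                              # invariant Krylov subspace: stop early
            break
        betas.append(beta)
        q_prev, q = q, w / beta
    T = (np.diag(alphas) + np.diag(betas[:len(alphas) - 1], 1)
         + np.diag(betas[:len(alphas) - 1], -1))
    evals = np.linalg.eigvalsh(T)
    return evals[0], evals[-1]

def procedure2_no_norm_bound(H, eps, delta, rng=None):
    """Sketch of Algorithm 5: estimate M ~ ||H|| in a first Lanczos phase, then continue
    to an iteration limit based on that estimate, and return lambda = xi_min(H, b, j_total)."""
    rng = np.random.default_rng() if rng is None else rng
    n = H.shape[0]
    b = rng.standard_normal(n)
    b /= np.linalg.norm(b)                            # uniform direction on the unit sphere
    # Phase 1: fixed iteration count (eps_bar = 1/4 in Lemma 9, failure probability delta/3).
    j_M = min(n, 1 + int(np.ceil(0.5 * np.log(25 * n / delta**2))))
    xi_min, xi_max = lanczos_extreme_ritz(H, b, j_M)
    M = 2.0 * max(abs(xi_max), abs(xi_min))           # estimate of ||H||, cf. (53)
    # Phase 2: continue (here: restart) Lanczos up to the limit implied by M and eps.
    j_total = max(j_M, min(n, 1 + int(np.ceil(
        0.5 * np.log(25 * n / delta**2) * np.sqrt(max(M, eps) / eps)))))
    lam, _ = lanczos_extreme_ritz(H, b, j_total)
    return lam     # satisfies lam <= lambda_min(H) + eps/2 with probability >= 1 - delta
```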

Algorithm 5 could be terminated earlier, in fewer than \(j_{\mathrm {total}}\) iterations, when a direction of sufficiently negative curvature is encountered. For simplicity, we do not consider this feature, but observe that it would not affect the guarantees of the method, described in Lemma 10 below.

Lemma 10

Consider Algorithm 5 with input parameters H and \(\epsilon \), and internal parameters \(\delta \) and b. The method outputs a value \(\lambda \) such that

$$\begin{aligned} \lambda \le \lambda _{\mathrm{{min}}}(H) + \frac{\epsilon }{2} \end{aligned}$$
(54)

in at most

$$\begin{aligned} \min \left\{ n, 1+\max \left\{ {\Bigg \lceil }\frac{1}{2} \ln (25n/\delta ^2){\Bigg \rceil }, {\Bigg \lceil }\frac{1}{2} \ln (25n/\delta ^2)\sqrt{\frac{2\Vert H\Vert }{\epsilon }}{\Bigg \rceil } \right\} \right\} \end{aligned}$$
(55)

matrix-vector multiplications by H, with probability at least \(1 - \delta \).

Proof

We begin by showing that the first phase of Algorithm 5 yields an accurate estimate of \(\Vert H\Vert \) with high probability. We assume that \(\Vert H\Vert >0\) as the result is trivially true otherwise. By setting \(\delta \leftarrow \delta /3\) and \({\bar{\epsilon }} = \frac{1}{4}\) in Lemma 9, we obtain that the following inequalities hold, each with probability at least \(1-\delta /3\):

$$\begin{aligned} \xi _{\mathrm{{max}}}(H,b,j_M)&\ge \lambda _{\mathrm{{max}}}(H) - \tfrac{1}{4} (\lambda _{\mathrm{{max}}}(H)-\lambda _{\mathrm{{min}}}(H)), \end{aligned}$$
(56a)
$$\begin{aligned} \xi _{\mathrm{{min}}}(H,b,j_M)&\le \lambda _{\mathrm{{min}}}(H) + \tfrac{1}{4} (\lambda _{\mathrm{{max}}}(H)-\lambda _{\mathrm{{min}}}(H)). \end{aligned}$$
(56b)

We consider the various possibilities for \(\lambda _{\mathrm{{min}}}(H)\) and \(\lambda _{\mathrm{{max}}}(H)\) separately, showing in each case that M defined by (53) has \(\Vert H\Vert \le M \le 2 \Vert H\Vert \).

  • When \(\lambda _{\mathrm{{max}}}(H) \ge \lambda _{\mathrm{{min}}}(H) \ge 0\), we have \(\xi _{\mathrm{{max}}}(H,b,j_M) \ge \frac{3}{4} \lambda _{\mathrm{{max}}}(H)\) and \(0 \le \xi _{\mathrm{{min}}}(H,b,j_M) \le \xi _{\mathrm{{max}}}(H,b,j_M)\), so that

    $$\begin{aligned} M&= 2 \xi _{\mathrm{{max}}}(H,b,j_M) \ge \frac{3}{2} \lambda _{\mathrm{{max}}}(H) = \frac{3}{2}\Vert H\Vert , \\ M&= 2 \xi _{\mathrm{{max}}}(H,b,j_M) \le 2 \lambda _{\mathrm{{max}}}(H) = 2 \Vert H\Vert , \end{aligned}$$

    as required.

  • When \(\lambda _{\mathrm{{min}}}(H) \le \lambda _{\mathrm{{max}}}(H) \le 0\), we have \(\xi _{\mathrm{{min}}}(H,b,j_M) \le \frac{3}{4} \lambda _{\mathrm{{min}}}(H) \le 0\) and \(\xi _{\mathrm{{min}}}(H,b,j_M) \le \xi _{\mathrm{{max}}}(H,b,j_M) \le 0\), so that

    $$\begin{aligned} M&= 2 | \xi _{\mathrm{{min}}}(H,b,j_M)| \ge \frac{3}{2} | \lambda _{\mathrm{{min}}}(H)| = \frac{3}{2} \Vert H\Vert , \\ M&= 2 | \xi _{\mathrm{{min}}}(H,b,j_M)| \le 2 | \lambda _{\mathrm{{min}}}(H)| = 2 \Vert H\Vert , \end{aligned}$$

    as required.

  • When \(\lambda _{\mathrm{{min}}}(H) \le 0 \le \lambda _{\mathrm{{max}}}(H)\) and \(-\lambda _{\mathrm{{min}}}(H) \le \lambda _{\mathrm{{max}}}(H)\), we have \(\lambda _{\mathrm{{max}}}(H)-\lambda _{\mathrm{{min}}}(H) \le 2 \lambda _{\mathrm{{max}}}(H)\), so from (56a), it follows that \(\xi _{\mathrm{{max}}}(H,b,j_M) \ge \frac{1}{2} \lambda _{\mathrm{{max}}}(H) = \frac{1}{2} \Vert H\Vert \), and so

    $$\begin{aligned} M&\ge 2 \xi _{\mathrm{{max}}}(H,b,j_M) \ge \Vert H\Vert , \\ M&= 2 \max \left\{ |\xi _{\mathrm{{max}}}(H, b, j_M)|, |\xi _{\mathrm{{min}}}(H, b, j_M)| \right\} \\&\le 2 \max \left\{ |\lambda _{\mathrm{{max}}}(H)|,|\lambda _{\mathrm{{min}}}(H)| \right\} = 2 \Vert H\Vert , \end{aligned}$$

    as required.

  • When \(\lambda _{\mathrm{{min}}}(H) \le 0 \le \lambda _{\mathrm{{max}}}(H)\) and \(-\lambda _{\mathrm{{min}}}(H) \ge \lambda _{\mathrm{{max}}}(H)\), we have \(\lambda _{\mathrm{{max}}}(H) - \lambda _{\mathrm{{min}}}(H) \le -2 \lambda _{\mathrm{{min}}}(H)\), so from (56b), it follows that \(\xi _{\mathrm{{min}}}(H,b,j_M) \le \frac{1}{2} \lambda _{\mathrm{{min}}}(H) \le 0\), and so

    $$\begin{aligned} M&\ge 2 |\xi _{\mathrm{{min}}}(H,b,j_M)| \ge | \lambda _{\mathrm{{min}}}(H)| = \Vert H\Vert , \\ M&=2 \max \left\{ |\xi _{\mathrm{{max}}}(H, b, j_M)|, |\xi _{\mathrm{{min}}}(H, b, j_M)| \right\} \\&\le 2 \max \left\{ |\lambda _{\mathrm{{max}}}(H)|,|\lambda _{\mathrm{{min}}}(H)| \right\} = 2 \Vert H\Vert , \end{aligned}$$

    as required.

Since each of the bounds in (56) holds with probability at least \(1-\delta /3\), both hold with probability at least \(1-2\delta /3\), by a union bound argument.

We finally consider the complete run of Algorithm 5, which requires \(j_{\mathrm {total}}\) iterations of Lanczos. If our estimate M is accurate, we have by setting \(\delta \leftarrow \delta /3\) and \(M \leftarrow 2 \Vert H\Vert \) in Lemma 2 that \(\lambda =\xi _{\mathrm{{min}}}(H,b,j_{\mathrm {total}})\) satisfies (54) with probability at least \(1-\delta /3\). A union bound that combines this event with the two events needed for M to be estimated accurately yields an overall success probability of at least \(1-\delta \).

In conclusion, Algorithm 5 runs \(j_{\mathrm {total}}\) iterations of Lanczos (each requiring one matrix-vector multiplication by H) and terminates correctly with probability at least \(1-\delta \). \(\square \)

The lemma above shows that Algorithm 5 is an instance of Procedure 2 that does not require an a priori bound on \(\Vert H\Vert \). Assuming \(\Vert H\Vert \le U_H\), we further observe that Algorithm 5 satisfies the conditions of Assumption 3 with \(\mathcal {C}_{\mathrm {meo}}= \frac{\ln (25n/\delta ^2)}{\sqrt{2}}\sqrt{U_H}\), which is within a modest constant multiple of the one obtained for the Lanczos method with knowledge of \(\Vert H\Vert \) or \(U_H\).

C Proof of Theorem 2

Proof

The proof proceeds by contradiction: If we assume that all conditions specified in the statement of the theorem hold, and in addition that

$$\begin{aligned} \frac{(y_{{\hat{J}}+1}-y_i)^\top {\bar{H}}(y_{{\hat{J}}+1}-y_i)}{\Vert y_{{\hat{J}}+1}-y_i\Vert ^2} \ge \epsilon , \quad \text{ for } \text{ all } i=0,1,\cdots ,{\hat{J}}-1, \end{aligned}$$
(57)

then we must have

$$\begin{aligned} \Vert r_{{\hat{J}}}\Vert \le \sqrt{T} \tau ^{{\hat{J}}/2} \Vert r_0\Vert , \end{aligned}$$
(58)

contradicting (11). Note that

$$\begin{aligned} \frac{(y_{{\hat{J}}+1}-y_{{\hat{J}}})^\top {\bar{H}}(y_{{\hat{J}}+1}-y_{{\hat{J}}})}{\Vert y_{{\hat{J}}+1}-y_{{\hat{J}}}\Vert ^2} = \frac{\alpha _{{\hat{J}}} p_{{\hat{J}}}^\top {\bar{H}}(\alpha _{{\hat{J}}} p_{{\hat{J}}})}{\Vert \alpha _{{\hat{J}}} p_{{\hat{J}}}\Vert ^2} = \frac{p_{{\hat{J}}}^\top {\bar{H}}p_{{\hat{J}}}}{\Vert p_{{\hat{J}}}\Vert ^2} \ge \epsilon \end{aligned}$$
(59)

by assumption; therefore, we can consider (57) to hold for \(i=0,1,\cdots ,{\hat{J}}\). Moreover, recalling the definition (38) of the quadratic function q, we have for any \(i=0,1,\cdots ,{\hat{J}}\) that

$$\begin{aligned} q(y_{{\hat{J}}+1}) = q(y_i) + \nabla q(y_i)^\top (y_{{\hat{J}}+1}-y_i) + \frac{1}{2}(y_{{\hat{J}}+1}-y_i)^\top {\bar{H}}(y_{{\hat{J}}+1}-y_i). \end{aligned}$$

Thus, (57) can be reformulated as

$$\begin{aligned} q(y_{{\hat{J}}+1}) \ge q(y_i) + r_i^\top (y_{{\hat{J}}+1}-y_i) + \frac{\epsilon }{2}\Vert y_{{\hat{J}}+1}-y_i\Vert ^2, \quad \text{ for } \text{ all } i=0,1,\cdots ,{\hat{J}},\nonumber \\ \end{aligned}$$
(60)

where we used \(\nabla q(y_i)=r_i\) together with the strong-convexity-type conditions (57) and (59) along the directions \(y_{{\hat{J}}+1}-y_i\). In the remainder of the proof, proceeding similarly to [5, Proof of Proposition 1], we will show that (60) leads to the contradiction (58), thus proving that (12) holds.

We define the sequence of functions \(\varPhi _j\), \(j=0,1,\cdots ,{\hat{J}}\), as follows:

$$\begin{aligned} \varPhi _0(z) := q(y_0) + \frac{\epsilon }{2}\Vert z-y_0\Vert ^2, \end{aligned}$$

and for \(j=0,\cdots ,{\hat{J}}-1\):

$$\begin{aligned} \varPhi _{j+1}(z) := \tau \varPhi _j(z) + (1-\tau )\left( q(y_j) + r_j^\top (z-y_j) +\frac{\epsilon }{2}\Vert z-y_j\Vert ^2 \right) . \end{aligned}$$
(61)

Since each \(\varPhi _j\) is a quadratic function with Hessian \(\epsilon I\), it can be written as follows:

$$\begin{aligned} \varPhi _j(z) \; = \; \varPhi _j^* + \frac{\epsilon }{2}\Vert z-v_j\Vert ^2, \end{aligned}$$
(62)

where \(v_j\) is the unique minimizer of \(\varPhi _j\), and \(\varPhi _j^* = \varPhi _j(v_j)\) is the minimum value of \(\varPhi _j\). (Note that \(v_0=y_0=0\) and \(\varPhi ^*_0 = q(y_0) = 0\).)

Defining

$$\begin{aligned} \psi (y) := q(y_0) - q(y) + \frac{\epsilon }{2}\Vert y-y_0\Vert ^2 = \varPhi _0(y) - q(y), \end{aligned}$$
(63)

we give a short inductive argument to establish that

$$\begin{aligned} \varPhi _j(y_{{\hat{J}}+1}) \le q(y_{{\hat{J}}+1}) + \tau ^j \psi (y_{{\hat{J}}+1}), \quad j=0,1,\cdots ,{\hat{J}}. \end{aligned}$$
(64)

For \(j=0\), (64) holds because \(\varPhi _0(y) = q(y) + \psi (y)\) by definition. Assuming that (64) holds for some index \(j \ge 0\), we find by first applying (61) (with \(z=y_{{\hat{J}}+1}\)) and then (60) (with \(i=j\)) that

$$\begin{aligned} \varPhi _{j+1}(y_{{\hat{J}}+1})= & {} \tau \varPhi _j(y_{{\hat{J}}+1}) + (1-\tau )\left( q(y_j) + r_j^\top (y_{{\hat{J}}+1}-y_j) +\frac{\epsilon }{2}\Vert y_{{\hat{J}}+1}-y_j\Vert ^2 \right) \\\le & {} \tau \varPhi _j(y_{{\hat{J}}+1}) + (1-\tau ) q(y_{{\hat{J}}+1}). \end{aligned}$$

Thus, we have

$$\begin{aligned} \varPhi _{j+1}(y_{{\hat{J}}+1})&\le \tau \varPhi _j(y_{{\hat{J}}+1}) + (1-\tau ) q(y_{{\hat{J}}+1}) \\&\le \tau q(y_{{\hat{J}}+1}) + \tau ^{j+1} \psi (y_{{\hat{J}}+1}) + (1-\tau ) q(y_{{\hat{J}}+1}) \quad \hbox {from }(64) \\&= q(y_{{\hat{J}}+1}) + \tau ^{j+1} \psi (y_{{\hat{J}}+1}), \end{aligned}$$

which proves (64) for \(j+1\), and thus completes the inductive argument.

We next prove another technical fact about the relationship between \(q(y_j)\) and \(\varPhi _j^*\), namely,

$$\begin{aligned} q(y_j) \le \varPhi _j^*, \quad j=0,1,\cdots , {\hat{J}}. \end{aligned}$$
(65)

We establish this result by an inductive argument that is quite lengthy and technical; we indicate clearly below where this phase of the proof ends.

This result trivially holds (with equality) at \(j=0\). Supposing that it holds for some \(j =0,1,\cdots , {\hat{J}}-1\), we will prove that it also holds for \(j+1\).

By using Properties 7 and 8 of Lemma 7, and also \(\Vert {\bar{H}}r_j\Vert \le (M+2\epsilon )\Vert r_j\Vert \), we have

$$\begin{aligned} q(y_{j+1}) = q(y_j) - \frac{\Vert r_j\Vert ^4}{2\, p_j^\top {\bar{H}}p_j} \le q(y_j) - \frac{\Vert r_j\Vert ^4}{2\, r_j^\top {\bar{H}}r_j} \le q(y_j) - \frac{\Vert r_j\Vert ^2}{2(M+2\epsilon )}. \end{aligned}$$

It follows from the induction hypothesis \(q(y_j) \le \varPhi _j^*\) that

$$\begin{aligned} q(y_{j+1})\le & {} q(y_j) - \frac{\Vert r_j\Vert ^2}{2(M+2\epsilon )} = \tau q(y_j) + (1-\tau ) q(y_j) -\frac{\Vert r_j\Vert ^2}{2(M+2\epsilon )} \nonumber \\\le & {} \tau \varPhi _j^* + (1-\tau ) q(y_j) -\frac{\Vert r_j\Vert ^2}{2(M+2\epsilon )}. \end{aligned}$$
(66)

By taking the derivative in (61), and using (62), we obtain

$$\begin{aligned} \nabla \varPhi _{j+1}(z)&= \tau \nabla \varPhi _j(z) + (1-\tau )\left[ r_j + \epsilon (z-y_j)\right] \\&\Rightarrow \;\; \epsilon (z-v_{j+1})= \epsilon \tau (z-v_j) + (1-\tau ) \left( r_j + \epsilon (z-y_j) \right) . \end{aligned}$$

By rearranging the above relation (and noting that the z terms cancel out), we obtain:

$$\begin{aligned} v_{j+1} \; = \; \tau v_j - \frac{1-\tau }{\epsilon }r_j + (1-\tau ) y_j. \end{aligned}$$
(67)

It follows from this expression together with Properties 1 and 2 of Lemma 7 that

$$\begin{aligned} v_j \in {{\,\mathrm{span}\,}}\left\{ p_0,p_1,\cdots ,p_{j-1}\right\} , \quad j=1,2,\cdots ,{\hat{J}}. \end{aligned}$$
(68)

(The result holds for \(j=1\): from (67) we have \(v_1 \in {{\,\mathrm{span}\,}}\{v_0, r_0, y_0\} = {{\,\mathrm{span}\,}}\{r_0\} = {{\,\mathrm{span}\,}}\{p_0 \}\), and an induction based on (67) establishes (68) for the remaining values of j.) By combining the expressions (62) for \(\varPhi _j\), \(\varPhi _{j+1}\) with the recurrence formula (61) for \(\varPhi _{j+1}\), we obtain

$$\begin{aligned} \varPhi _{j+1}^* + \frac{\epsilon }{2}\Vert y_j-v_{j+1}\Vert ^2&= \varPhi _{j+1}(y_j) \\&= \tau \varPhi _j(y_j) + (1-\tau ) q(y_j) \\&= \tau \left( \varPhi _j^* + \frac{\epsilon }{2}\Vert y_j-v_j\Vert ^2\right) + (1-\tau ) q(y_j) \end{aligned}$$

and therefore

$$\begin{aligned} \varPhi _{j+1}^* = \tau \left( \varPhi _j^* + \frac{\epsilon }{2}\Vert y_j-v_j\Vert ^2\right) + (1-\tau ) q(y_j) - \frac{\epsilon }{2}\Vert y_j-v_{j+1}\Vert ^2. \end{aligned}$$
(69)

On the other hand, we have by (67) that

$$\begin{aligned} \Vert y_j-v_{j+1}\Vert ^2= & {} \left\| \tau (y_j-v_j) +\frac{1-\tau }{\epsilon }r_j \right\| ^2 \nonumber \\= & {} \tau ^2 \Vert y_j-v_j\Vert ^2 + \frac{(1-\tau )^2}{\epsilon ^2}\Vert r_j\Vert ^2 + \frac{2}{\epsilon }(1-\tau )\tau r_j^\top (y_j-v_j) \nonumber \\= & {} \tau ^2 \Vert y_j-v_j\Vert ^2 + \frac{(1-\tau )^2}{\epsilon ^2}\Vert r_j\Vert ^2, \end{aligned}$$
(70)

where the last relation comes from \(r_j \perp {{\,\mathrm{span}\,}}\{p_0,\cdots ,p_{j-1}\}\) (Property 2 of Lemma 7) and (68) in the case \(j \ge 1\), and immediately in the case \(j=0\), since \(y_0=v_0=0\). By combining (69) and (70), we arrive at:

$$\begin{aligned} \varPhi _{j+1}^*&= \tau \left( \varPhi _j^* + \frac{\epsilon }{2}\Vert y_j-v_j\Vert ^2\right) + (1-\tau ) q(y_j) - \frac{\epsilon }{2}\Vert y_j-v_{j+1}\Vert ^2 \nonumber \\&= \tau \left( \varPhi _j^* + \frac{\epsilon }{2}\Vert y_j-v_j\Vert ^2\right) + (1-\tau ) q(y_j) \nonumber \\&\quad -\frac{\epsilon }{2}\tau ^2 \Vert y_j-v_j\Vert ^2 - \frac{(1-\tau )^2}{2\epsilon } \Vert r_j\Vert ^2 \nonumber \\&= \tau \varPhi _j^* + \frac{\epsilon }{2}\left[ \tau -\tau ^2\right] \Vert y_j-v_j\Vert ^2 + (1-\tau ) q(y_j) - \frac{(1-\tau )^2}{2\epsilon } \Vert r_j\Vert ^2 \nonumber \\&= \tau \varPhi _j^* + \frac{\epsilon }{2}(1-\tau )\tau \Vert y_j-v_j\Vert ^2 + (1-\tau ) q(y_j) - \frac{(1-\tau )^2}{2\epsilon } \Vert r_j\Vert ^2 \nonumber \\&\ge \tau \varPhi _j^* + (1-\tau ) q(y_j) - \frac{(1-\tau )^2}{2\epsilon } \Vert r_j\Vert ^2 \nonumber \\&\ge q(y_{j+1}) + \frac{1}{2(M+2\epsilon )}\Vert r_j\Vert ^2 - \frac{(1-\tau )^2}{2\epsilon } \Vert r_j\Vert ^2. \end{aligned}$$
(71)

where the last inequality comes from (66). By using the definitions of \(\tau \) and \(\kappa \) in Algorithm 1, we have

$$\begin{aligned} \frac{(1-\tau )^2}{2\epsilon } = \frac{1}{2\epsilon (\sqrt{\kappa }+1)^2} \le \frac{1}{2\epsilon \kappa } = \frac{1}{2(M+2\epsilon )}. \end{aligned}$$

It therefore follows from (71) that \(q(y_{j+1}) \le \varPhi _{j+1}^*\). At this point, we have shown that if \(q(y_j) \le \varPhi _j^*\) for some \(j \in \{0,1,\cdots ,{\hat{J}}-1\}\), then \(q(y_{j+1}) \le \varPhi _{j+1}^*\), which establishes the inductive step. As a result, our proof of (65) is complete.
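(As a quick numerical check of the inequality above, with \(\kappa = (M+2\epsilon )/\epsilon \) and \(\tau = \sqrt{\kappa }/(\sqrt{\kappa }+1)\) inferred from the displayed identity, and with arbitrary illustrative values of M and \(\epsilon \):)

```python
import math

M, eps = 10.0, 1e-2                          # illustrative values only
kappa = (M + 2 * eps) / eps                  # so that 1/(2*eps*kappa) = 1/(2*(M + 2*eps))
tau = math.sqrt(kappa) / (math.sqrt(kappa) + 1)
lhs = (1 - tau) ** 2 / (2 * eps)             # = 1/(2*eps*(sqrt(kappa)+1)**2)
rhs = 1 / (2 * (M + 2 * eps))                # = 1/(2*eps*kappa)
assert lhs <= rhs                            # holds since (sqrt(kappa)+1)**2 >= kappa
```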

By substituting \(j={\hat{J}}\) into (65), we obtain \(q(y_{{\hat{J}}}) \le \varPhi _{{\hat{J}}}^*\), which in combination with (64) with \(j={\hat{J}}\), and the definition (62), yields

$$\begin{aligned} q(y_{{\hat{J}}}) -q(y_{{\hat{J}}+1}) \le \varPhi _{{\hat{J}}}^* - q(y_{{\hat{J}}+1}) \le \varPhi _{{\hat{J}}}(y_{{\hat{J}}+1}) - q(y_{{\hat{J}}+1}) \le \tau ^{{\hat{J}}} \psi (y_{{\hat{J}}+1}).\qquad \end{aligned}$$
(72)

By substitution from (63), we obtain

$$\begin{aligned} q(y_{{\hat{J}}}) - q(y_{{\hat{J}}+1}) \; \le \; \tau ^{{\hat{J}}} \left( q(y_0)-q(y_{{\hat{J}}+1}) + \frac{\epsilon }{2}\Vert y_0-y_{{\hat{J}}+1}\Vert ^2\right) . \end{aligned}$$
(73)

We now depart from the analysis of [5], and complete the proof of this result by expressing (73) in terms of residual norms. On the left-hand side, we have

$$\begin{aligned} q(y_{{\hat{J}}}) - q(y_{{\hat{J}}+1})&= r_{{\hat{J}}+1}^\top (y_{{\hat{J}}}-y_{{\hat{J}}+1}) + \frac{1}{2}(y_{{\hat{J}}}-y_{{\hat{J}}+1})^\top {\bar{H}}(y_{{\hat{J}}}-y_{{\hat{J}}+1}) \\&= \frac{1}{2}(y_{{\hat{J}}}-y_{{\hat{J}}+1})^\top {\bar{H}}(y_{{\hat{J}}}-y_{{\hat{J}}+1}) \end{aligned}$$

because \(r_{{\hat{J}}+1}^\top (y_{{\hat{J}}}-y_{{\hat{J}}+1}) = r_{{\hat{J}}+1}^\top (\alpha _{{\hat{J}}} p_{{\hat{J}}}) = 0\) by Lemma 7, Property 2. We thus have from (59) that

$$\begin{aligned} q(y_{{\hat{J}}}) - q(y_{{\hat{J}}+1})&\ge \frac{\epsilon }{2}\Vert y_{{\hat{J}}}-y_{{\hat{J}}+1}\Vert ^2 \nonumber \\&{= \frac{\epsilon }{2} \Vert \alpha _{{\hat{J}}}p_{{\hat{J}}}\Vert ^2} \nonumber \\&{\ge \frac{\epsilon }{2 (M+2\epsilon )^2}\Vert {\bar{H}}(\alpha _{{\hat{J}}}p_{{\hat{J}}})\Vert ^2} \quad \quad \quad \text{(since } \Vert {\bar{H}}p_{{\hat{J}}}\Vert \le (M+2\epsilon )\Vert p_{{\hat{J}}}\Vert \text{) } \nonumber \\&= \frac{\epsilon }{2 (M+2\epsilon )^2} \Vert {\bar{H}}(y_{{\hat{J}}}-y_{{\hat{J}}+1})\Vert ^2 \nonumber \\&= \frac{\epsilon }{2 (M+2\epsilon )^2} \Vert r_{{\hat{J}}} - r_{{\hat{J}}+1}\Vert ^2 \quad \quad \quad (\hbox {since }r_j=g+{\bar{H}}y_j) \nonumber \\&= \frac{\epsilon }{2 (M+2\epsilon )^2} (\Vert r_{{\hat{J}}}\Vert ^2 + \Vert r_{{\hat{J}}+1}\Vert ^2) \quad \quad (\hbox {by Lemma}~7,\hbox { Property } 2) \nonumber \\&\ge \frac{\epsilon }{2 (M+2\epsilon )^2} \Vert r_{{\hat{J}}}\Vert ^2, \end{aligned}$$
(74)

On the right-hand side of (73), because of the strong convexity condition (60) at \(i=0\), we have

$$\begin{aligned} q(y_0)-q(y_{{\hat{J}}+1}) + \frac{\epsilon }{2}\Vert y_0-y_{{\hat{J}}+1}\Vert ^2&\le -\nabla q(y_0)^\top (y_{{\hat{J}}+1}-y_0) \\&= -r_0^\top (y_{{\hat{J}}+1}-y_0) \le \Vert r_0\Vert \Vert y_{{\hat{J}}+1}-y_0\Vert . \end{aligned}$$

Moreover, we have

$$\begin{aligned} \Vert y_{{\hat{J}}+1}-y_0\Vert \; = \; \left\| \sum _{i=0}^{{\hat{J}}} \alpha _i p_i \right\| \; \le \; \sum _{i=0}^{{\hat{J}}} \alpha _i \Vert p_i\Vert \; = \; \sum _{i=0}^{{\hat{J}}} \frac{\Vert r_i\Vert ^2}{p_i^\top {\bar{H}}p_i}\Vert p_i\Vert , \end{aligned}$$

where the last relation follows from the definition of \(\alpha _i\). By combining these last two bounds, and using Property 3 of Lemma 7, we obtain

$$\begin{aligned} q(y_0)-q(y_{{\hat{J}}+1}) + \frac{\epsilon }{2}\Vert y_0-y_{{\hat{J}}+1}\Vert ^2 \; \le \; \Vert r_0\Vert \sum _{i=0}^{{\hat{J}}} \Vert r_i\Vert \frac{\Vert p_i\Vert ^2}{p_i^\top {\bar{H}}p_i} \; \le \; \Vert r_0\Vert \frac{1}{\epsilon } \sum _{i=0}^{{\hat{J}}} \Vert r_i\Vert ,\nonumber \\ \end{aligned}$$
(75)

because \(p_j^\top {\bar{H}}p_j \ge \epsilon \Vert p_j\Vert ^2\) for \(j=0,1,\cdots ,{\hat{J}}\) by assumption.

To bound the sum in (75), we recall that since Algorithm 1 did not terminate until iteration \({\hat{J}}\), the residual norms \(\Vert r_i \Vert \) at all iterations \(i=0,1,\cdots ,{\hat{J}}-1\) must have decreased at the expected convergence rate. In particular, we have \(\Vert r_i\Vert \le \sqrt{T} \tau ^{i/2} \Vert r_0\Vert \) for the possibly smaller versions of \(\sqrt{T}\) and \(\tau \) that prevailed at iteration i, so certainly \(\Vert r_i\Vert \le \sqrt{T} \tau ^{i/2} \Vert r_0\Vert \) for the final values of these parameters. Thus for \(i=0,1,\cdots ,{\hat{J}}-1\), we have

$$\begin{aligned} \Vert r_i\Vert \le \sqrt{T} \tau ^{i/2} \Vert r_0\Vert \le \tau ^{(i-{\hat{J}})/2}\Vert r_{{\hat{J}}}\Vert , \end{aligned}$$

where we used \(\Vert r_{{\hat{J}}}\Vert \ge \sqrt{T} \tau ^{{\hat{J}}/2} \Vert r_0\Vert \) (from (11)) for the last inequality. Observing that this bound also holds (trivially) for \(i={\hat{J}}\), we obtain by substituting in (75) that

$$\begin{aligned} q(y_0)-q(y_{{\hat{J}}+1}) + \frac{\epsilon }{2}\Vert y_0-y_{{\hat{J}}+1}\Vert ^2&\le \Vert r_0\Vert \frac{1}{\epsilon } \sum _{i=0}^{{\hat{J}}} \tau ^{(i-{\hat{J}})/2} \Vert r_{{\hat{J}}}\Vert \nonumber \\&\le \Vert r_0\Vert \frac{\tau ^{-{\hat{J}}/2}}{\epsilon }\Vert r_{{\hat{J}}}\Vert \,\sum _{i=0}^{{\hat{J}}} (\sqrt{\tau })^{i} \nonumber \\&\le \Vert r_0\Vert \frac{\tau ^{-{\hat{J}}/2}}{\epsilon }\Vert r_{{\hat{J}}}\Vert \frac{1}{1-\sqrt{\tau }}. \end{aligned}$$
(76)

Applying successively (74), (73) and (76) finally yields:

$$\begin{aligned} \frac{\epsilon }{2 (M+2\epsilon )^2} \Vert r_{{\hat{J}}}\Vert ^2&\le q(y_{{\hat{J}}}) - q(y_{{\hat{J}}+1}) \\&\le \tau ^{{\hat{J}}} \left( q(y_0)-q(y_{{\hat{J}}+1}) + \frac{\epsilon }{2}\Vert y_0-y_{{\hat{J}}+1}\Vert ^2 \right) \\&\le \tau ^{{\hat{J}}} \Vert r_0\Vert \Vert r_{{\hat{J}}}\Vert \frac{\tau ^{-{\hat{J}}/2}}{\epsilon }\frac{1}{1-\sqrt{\tau }}. \end{aligned}$$

After rearranging this inequality and dividing by \(\Vert r_{{\hat{J}}}\Vert > 0\), we obtain

$$\begin{aligned} \Vert r_{{\hat{J}}}\Vert \le \frac{2 (M+2\epsilon )^2}{\epsilon ^2}\frac{\tau ^{{\hat{J}}/2}}{1-\sqrt{\tau }} \Vert r_0\Vert = \sqrt{T} \tau ^{{\hat{J}}/2}\Vert r_0\Vert . \end{aligned}$$
(77)

We have thus established (58) which, as we noted earlier, contradicts (11). Thus (57) cannot be true, so we have established (12), as required. \(\square \)

Cite this article

Royer, C.W., O’Neill, M. & Wright, S.J. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization. Math. Program. 180, 451–488 (2020). https://doi.org/10.1007/s10107-019-01362-7
