
Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

A worst-case complexity bound is proved for a sequential quadratic optimization (commonly known as SQP) algorithm that has been designed for solving optimization problems involving a stochastic objective function and deterministic nonlinear equality constraints. Barring additional terms that arise due to the adaptivity of the monotonically nonincreasing merit parameter sequence, the proved complexity bound is comparable to that known for the stochastic gradient algorithm for unconstrained nonconvex optimization. The overall complexity bound, which accounts for the adaptivity of the merit parameter sequence, shows that a result comparable to the unconstrained setting (with additional logarithmic factors) holds with high probability.


References

  1. Berahas, A.S., Curtis, F.E., O’Neill, M.J., Robinson, D.P.: A Stochastic Sequential Quadratic Optimization Algorithm for Nonlinear Equality Constrained Optimization with Rank-Deficient Jacobians. arXiv:2106.13015 (2021)

  2. Berahas, A.S., Curtis, F.E., Robinson, D.P., Zhou, B.: Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM J. Optim. 31, 1352–1379 (2021)


  3. Bertsekas, D.P.: Network Optimization: Continuous and Discrete Models. Athena Scientific, Belmont (1998)

  4. Betts, J.T.: Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. SIAM, Philadelphia (2010)

  5. Byrd, R.H., Lopez-Calva, G., Nocedal, J.: A line search exact penalty method using steering rules. Math. Program. 133, 39–73 (2012)


  6. Byrd, R.H., Nocedal, J., Waltz, R.A.: Steering exact penalty methods for nonlinear programming. Optim. Methods Softw. 23, 197–213 (2008)


  7. Cartis, C., Gould, N.I.M., Toint, P.L.: An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA J. Numer. Anal. 32, 1662–1695 (2012). https://doi.org/10.1093/imanum/drr035


  8. Cartis, C., Gould, N.I.M., Toint, P.L.: On the complexity of finding first-order critical points in constrained nonlinear optimization. Math. Program. 144, 93–106 (2014)


  9. Chen, C., Tung, F., Vedula, N., Mori, G.: Constraint-aware deep neural network compression. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 400–415 (2018)

  10. Curtis, F.E.: A penalty-interior-point algorithm for nonlinear constrained optimization. Math. Program. Comput. 4, 181–209 (2012)


  11. Curtis, F.E., Jiang, H., Robinson, D.P.: An adaptive augmented Lagrangian method for large-scale constrained optimization. Math. Program. 152, 201–245 (2015)


  12. Curtis, F.E., Robinson, D.P.: Exploiting negative curvature in deterministic and stochastic optimization. Math. Program. 176, 69–94 (2019)


  13. Curtis, F.E., Robinson, D.P., Samadi, M.: Complexity analysis of a trust funnel algorithm for equality constrained optimization. SIAM J. Optim. 28, 1533–1563 (2018)


  14. Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29, 207–239 (2019)


  15. Durrett, R.: Probability: Theory and Examples, vol. 49, Cambridge University Press (2019)

  16. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155, 267–305 (2016)


  17. Grapiglia, G.N., Yuan, Y.-X.: On the complexity of an augmented Lagrangian method for nonconvex optimization. IMA J. Numer. Anal. 41, 1546–1568 (2020)


  18. Han, S.P., Mangasarian, O.L.: Exact penalty functions in nonlinear programming. Math. Program. 17, 251–269 (1979). https://doi.org/10.1007/BF01588250


  19. Hazan, E., Luo, H.: Variance-reduced and projection-free stochastic optimization. In: International Conference on Machine Learning, PMLR, pp. 1263–1271 (2016)

  20. Kupfer, F., Sachs, E.W.: Numerical solution of a nonlinear parabolic control problem by a reduced SQP method. Comput. Optim. Appl. 1, 113–135 (1992)


  21. Li, X., Orabona, F.: A high probability analysis of adaptive SGD with momentum. arXiv:2007.14294 (2020)

  22. Mongeau, M., Sartenaer, A.: Automatic decrease of the penalty parameter in exact penalty function methods. Eur. J. Oper. Res. 83, 686–699 (1995)


  23. Na, S., Anitescu, M., Kolar, M.: An Adaptive Stochastic Sequential Quadratic Programming with Differentiable Exact Augmented Lagrangians. arXiv preprint arXiv:2102.05320 (2021)

  24. Na, S., Mahoney, M.W.: Asymptotic convergence rate and statistical inference for stochastic sequential quadratic programming. arXiv:2205.13687 (2022)

  25. Nandwani, Y., Pathak, A., Mausam, Singla, P.: A primal dual formulation for deep learning with constraints. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)

  26. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)


  27. Nocedal, J., Wächter, A., Waltz, R.A.: Adaptive barrier update strategies for nonlinear interior methods. SIAM J. Optim. 19, 1674–1693 (2009)


  28. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)

  29. Rees, T., Dollar, H.S., Wathen, A.J.: Optimal solvers for PDE-constrained optimization. SIAM J. Sci. Comput. 32, 271–298 (2010)


  30. Roy, S. K., Mhammedi, Z., Harandi, M.: Geometry aware constrained optimization techniques for deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4469 (2018)

  31. Stirzaker, D.: Elementary Probability, 2nd edn. Cambridge University Press, Cambridge (2003)

  32. Wilson, R.B.: A Simplicial Algorithm for Concave Programming. Ph.D. Dissertation, Graduate School of Business Administration, Harvard University (1963)


Funding

Funding was provided by National Science Foundation (Grant Nos. 2030859, CCF-2008484) and Office of Naval Research (Grant No. N00014-21-1-2532).

Author information


Corresponding author

Correspondence to Michael J. O’Neill.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of Theorem 1 (Deterministic Algorithm Complexity)

In this appendix, we prove Theorem 1, which states a worst-case complexity bound for Algorithm 2.1 of [2]. We refer to quantities defined and employed in the analysis in [2]. In particular, in this appendix, for all \(k \in \mathbb {N}^{}\), we suppose that \(g_k = \nabla f(x_k)\), and that \(d_k = u_k + v_k\) with \(u_k \in {{\,\textrm{Null}\,}}(J_k)\) and \(v_k \in {{\,\textrm{Range}\,}}(J_k^\top )\) is the search direction computed by solving the SQP subproblem with this choice of \(g_k\). As seen in [2], the convergence properties of Algorithm 2.1 in that paper are driven by reductions in a model of the merit function in each iteration. Our first lemma proves a useful lower bound for such a reduction.

Lemma 13

Define \((\kappa _{uv},\kappa _H,\kappa _v,\tau _{\min },\zeta ,\sigma ) \in (0,\infty )^5 \times (0,1)\) as in [2] and let

$$\begin{aligned} \hat{\kappa }:= \min \left\{ 1,\frac{1}{(1+\kappa _{uv})\kappa _v\kappa _H^2}\right\} \ \ \text {and}\ \ {\tilde{\kappa }}:= \tfrac{1}{4} \zeta \kappa _{uv} \kappa _v \hat{\kappa }. \end{aligned}$$
(46)

Then, for any \(\varepsilon \in (0,1)\), if \(\Vert g_k + J_k^\top y_k\Vert > \varepsilon \) and/or \(\sqrt{\Vert c_k\Vert _1} > \varepsilon \), then

$$\begin{aligned} \varDelta q(x_k, \tau _k, g_k, H_k, d_k) \ge \min \left\{ \sigma \hat{\kappa }, \tau _{\min } {\tilde{\kappa }} \right\} \varepsilon ^2. \end{aligned}$$
(47)

Proof

Consider arbitrary \((\varepsilon ,k) \in (0,1) \times \mathbb {N}^{}\) such that \(\Vert g_k + J_k^\top y_k\Vert > \varepsilon \) and/or \(\sqrt{\Vert c_k\Vert _1} > \varepsilon \). Let us consider two cases. First, suppose that \(\Vert c_k\Vert _1 > \hat{\kappa } \varepsilon ^2\). Then, by [2, equation (2.9)],

$$\begin{aligned} \varDelta q(x_k, \tau _k, g_k, H_k, d_k) \ge \tfrac{1}{2} \tau _k \max \{d_k^\top H_k d_k, 0\} + \sigma \Vert c_k\Vert _1 \ge \sigma \Vert c_k\Vert _1 \ge \sigma \hat{\kappa } \varepsilon ^2, \end{aligned}$$

which implies (47), as desired. Second, suppose that \(\Vert c_k\Vert _1 \le \hat{\kappa } \varepsilon ^2 \le \varepsilon ^2\), which by the definition of \((\varepsilon ,k)\) implies that \(\Vert g_k + J_k^\top y_k\Vert > \varepsilon \). It follows from this fact that \(\Vert d_k\Vert > \varepsilon /\kappa _H\); indeed, if \(\Vert d_k\Vert \le \varepsilon /\kappa _H\), then by [2, equation (2.6) and Assumption 2.4] one would find

$$\begin{aligned} \Vert g_k + J_k^\top y_k\Vert = \Vert H_k d_k\Vert \le \kappa _H \Vert d_k\Vert \le \varepsilon , \end{aligned}$$

which is a contradiction. Hence, \(\Vert d_k\Vert > \varepsilon /\kappa _H\), and by [2, Lemma 2.9], it follows that \(\Vert v_k\Vert ^2 \le \kappa _v \Vert c_k\Vert \le \kappa _v \Vert c_k\Vert _1 \le \kappa _v \hat{\kappa } \varepsilon ^2\), which combined shows that

$$\begin{aligned} \varepsilon ^2/\kappa _H^2 < \Vert d_k\Vert ^2 = \Vert u_k\Vert ^2 + \Vert v_k\Vert ^2 \le \Vert u_k\Vert ^2 + \kappa _v \hat{\kappa } \varepsilon ^2. \end{aligned}$$

From this fact and the definition of \(\hat{\kappa }\), it follows that

$$\begin{aligned} \Vert u_k\Vert ^2&> \frac{\varepsilon ^2}{\kappa _H^2} - \kappa _v \hat{\kappa } \varepsilon ^2 \ge \frac{\varepsilon ^2}{\kappa _H^2}\left( 1 - \frac{1}{1+\kappa _{uv}}\right) \\&= \frac{\kappa _{uv}\varepsilon ^2}{(1+\kappa _{uv})\kappa _H^2} \ge \kappa _{uv} \kappa _v \hat{\kappa } \varepsilon ^2 \ge \kappa _{uv} \Vert v_k\Vert ^2, \end{aligned}$$

which along with [2, Lemma 2.10] implies \(d_k^\top H_k d_k \ge \tfrac{1}{2}\zeta \Vert u_k\Vert ^2 \ge \tfrac{1}{2}\zeta \kappa _{uv} \kappa _v \hat{\kappa } \varepsilon ^2\). Thus,

$$\begin{aligned} \varDelta q(x_k, \tau _k, g_k, H_k, d_k)&\ge \tfrac{1}{2}\tau _k \max \{d_k^\top H_k d_k, 0\} + \sigma \Vert c_k\Vert _1 \ge \tfrac{1}{4} \tau _{\min } \zeta \kappa _{uv} \kappa _v \hat{\kappa } \varepsilon ^2, \end{aligned}$$

which implies (47), as desired. \(\square \)
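
To make the bound in (47) concrete, consider the following purely illustrative parameter values (hypothetical, not taken from [2]): \(\kappa _{uv} = \kappa _v = \kappa _H = 1\), \(\zeta = \sigma = \tfrac{1}{2}\), and \(\tau _{\min } = \tfrac{1}{10}\). Then (46) gives

$$\begin{aligned} \hat{\kappa } = \min \left\{ 1, \tfrac{1}{(1+1) \cdot 1 \cdot 1}\right\} = \tfrac{1}{2} \ \ \text {and}\ \ {\tilde{\kappa }} = \tfrac{1}{4} \cdot \tfrac{1}{2} \cdot 1 \cdot 1 \cdot \tfrac{1}{2} = \tfrac{1}{16}, \end{aligned}$$

so (47) reads \(\varDelta q(x_k, \tau _k, g_k, H_k, d_k) \ge \min \{\tfrac{1}{4}, \tfrac{1}{160}\} \varepsilon ^2 = \tfrac{1}{160} \varepsilon ^2\).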

We now prove Theorem 1, further details of which are provided in the statement below.

Theorem 4

Define \((\tau _{-1},f_{\text {low}},\alpha _{\min },\tau _{\min },\eta ,\sigma ) \in (0,\infty )^4 \times (0,1)^2\) as in [2] and \(({\hat{\kappa }},{\tilde{\kappa }}) \in (0,1] \times (0,\infty )\) as in (46). Then, for any \(\varepsilon \in (0,1)\), Theorem 1 holds with (4) given by

$$\begin{aligned} {\overline{K}}_\varepsilon := \left( \frac{\tau _{-1}(f_0 - f_{\text {low}}) + \Vert c_0\Vert _1}{\eta \alpha _{\min } \min \{\sigma \hat{\kappa }, \tau _{\min } {\tilde{\kappa }} \}}\right) \varepsilon ^{-2}. \end{aligned}$$

Proof

To derive a contradiction, suppose that (3) fails to hold for every \(k \in \{0,\dots ,{\overline{K}}_\varepsilon \}\). Then, by Lemma 13 and [2, equation (2.10) and Lemma 2.17], it follows for all such k that

$$\begin{aligned} \phi (x_k + \alpha _k d_k, \tau _k) - \phi (x_k, \tau _k)&\le - \eta \alpha _k \varDelta q(x_k, \tau _k, g_k, H_k, d_k) \\&\le - \eta \alpha _{\min } \min \{\sigma {\hat{\kappa }}, \tau _{\min } {\tilde{\kappa }}\} \varepsilon ^2. \end{aligned}$$

By the definition of \(\phi \), this means for all such k that

$$\begin{aligned} \tau _k f_{k+1} + \Vert c_{k+1}\Vert _1 \le \tau _k f_k + \Vert c_k\Vert _1 - \eta \alpha _{\min } \min \{\sigma {\hat{\kappa }}, \tau _{\min } {\tilde{\kappa }}\} \varepsilon ^2. \end{aligned}$$

Summing this inequality for all \(k \in \{0,\dots ,{\overline{K}}_\varepsilon \}\), one can deduce that

$$\begin{aligned}&\Vert c_{{\overline{K}}_\varepsilon +1}\Vert _1 - \Vert c_0\Vert _1 + \tau _{{\overline{K}}_\varepsilon }f_{{\overline{K}}_\varepsilon +1} - \tau _0 f_0 + \sum _{k=1}^{{\overline{K}}_\varepsilon } f_k (\tau _{k-1} - \tau _k) \\&\quad \le - ({\overline{K}}_\varepsilon +1) \eta \alpha _{\min } \min \{\sigma {\hat{\kappa }}, \tau _{\min } {\tilde{\kappa }}\} \varepsilon ^2. \end{aligned}$$

Since \(\{\tau _k\}\) is monotonically nonincreasing, \(\Vert c_{{\overline{K}}_\varepsilon +1}\Vert _1 \ge 0\), and \(f_k \ge f_{\text {low}}\) for all \(k \in \mathbb {N}^{}\),

$$\begin{aligned} - \Vert c_0\Vert _1 + \tau _{{\overline{K}}_\varepsilon }f_{\text {low}}- \tau _0 f_0 + f_{\text {low}}(\tau _{0} - \tau _{{\overline{K}}_\varepsilon }) \le - ({\overline{K}}_\varepsilon +1) \eta \alpha _{\min } \min \{\sigma {\hat{\kappa }}, \tau _{\min } {\tilde{\kappa }}\} \varepsilon ^2. \end{aligned}$$

Rearranging this inequality, one arrives at the conclusion that

$$\begin{aligned} {\overline{K}}_\varepsilon +1 \le \left( \frac{\tau _0 (f_0 - f_{\text {low}}) + \Vert c_0\Vert _1}{\eta \alpha _{\min } \min \{\sigma {\hat{\kappa }}, \tau _{\min } {\tilde{\kappa }}\}}\right) \varepsilon ^{-2} \le \left( \frac{\tau _{-1} (f_0 - f_{\text {low}}) + \Vert c_0\Vert _1}{\eta \alpha _{\min } \min \{\sigma {\hat{\kappa }}, \tau _{\min } {\tilde{\kappa }}\}}\right) \varepsilon ^{-2} \equiv {\overline{K}}_\varepsilon , \end{aligned}$$

which is a contradiction. Therefore, one arrives at the desired conclusion that Algorithm 2.1 yields an iterate satisfying (3) in at most \({\overline{K}}_\varepsilon \) iterations. \(\square \)
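
As an illustration of the \(\varepsilon ^{-2}\) dependence in this bound, suppose (hypothetically, with values not taken from [2]) that \(\tau _{-1} = 1\), \(f_0 - f_{\text {low}}= 10\), \(\Vert c_0\Vert _1 = 1\), and \(\eta \alpha _{\min } \min \{\sigma {\hat{\kappa }}, \tau _{\min } {\tilde{\kappa }}\} = 10^{-2}\). Then \({\overline{K}}_\varepsilon = \frac{1 \cdot 10 + 1}{10^{-2}}\varepsilon ^{-2} = 1100\,\varepsilon ^{-2}\), so the tolerance \(\varepsilon = 10^{-2}\) is reached after at most \(1.1 \times 10^{7}\) iterations, and halving \(\varepsilon \) quadruples the bound.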

B Proofs of Lemma 9

In this appendix, we prove Lemma 9. Toward this end, we first prove that, for any \(\delta \in (0,1)\), with \(\hat{\delta }\) as defined in (32) and \(\ell (s_{\max },\hat{\delta })\) as defined in (33), one finds

$$\begin{aligned} \mathbb {P}\left[ \sum _{k=0}^{k_{\max }} \mathbb {P}[\mathcal {T}_k < \mathcal {T}_{k-1} | \mathcal {F}_k] \le \ell (s_{\max },\hat{\delta })+1 \Bigg | E \right] \ge 1-\delta . \end{aligned}$$
(48)

We build to this result, ultimately proved as Theorem 5, with a series of preliminary lemmas.

As our first preliminary result, we state a particular form of Chernoff’s bound [31] in the following lemma, which will prove instrumental in deriving (48).

Lemma 14

For any \(k \in \mathbb {N}^{}\), let \(\{Y_0,\dots ,Y_k\}\) be independent Bernoulli random variables. Then, for any \(s \in \mathbb {N}^{}\) and \(\bar{\delta } \in (0,1)\), it follows that

$$\begin{aligned} \mu := \sum _{j=0}^k \mathbb {P}[Y_j = 1] \ge \ell (s,\bar{\delta })\ \ \implies \ \ \mathbb {P}\left[ \sum _{j=0}^k Y_j \le s\right] \le \bar{\delta }. \end{aligned}$$
(49)

Proof

Suppose that \(\mu \ge \ell (s,{\bar{\delta }})\). By the multiplicative form of Chernoff's bound, it follows for \(\rho := 1 - s/\mu \) (which lies in the interval (0, 1) since \(\mu \ge \ell (s,{\bar{\delta }}) > s\)) that

$$\begin{aligned} \mathbb {P}\left[ \sum _{j=0}^{k} Y_j \le s \right] \le e^{-\tfrac{1}{2}\mu \rho ^2} = e^{- \tfrac{1}{2}\mu (1-s/\mu )^2}. \end{aligned}$$

Hence, to prove the result, all that remains is to show that \(e^{-\tfrac{1}{2}\mu (1-s/\mu )^2} \le \bar{\delta }\), i.e., that \(-\tfrac{1}{2}\mu (1-s/\mu )^2 \le \log (\bar{\delta })\). Using \(\log (\bar{\delta }) = -\log (1/\bar{\delta })\), this inequality is equivalent to

$$\begin{aligned} 0 \le \tfrac{1}{2}\mu (1-s/\mu )^2 - \log (1/\bar{\delta }) = \tfrac{1}{2\mu } (\mu - s)^2 - \log (1/\bar{\delta }), \end{aligned}$$

which holds if and only if \(\mu ^2 - 2\mu (s+\log (1/\bar{\delta })) + s^2 \ge 0\). Viewing the left-hand side of this inequality as a convex quadratic function in \(\mu \), one finds that the inequality holds as long as \(\mu \) is greater than or equal to the positive root of the quadratic, i.e.,

$$\begin{aligned} s + \log (1/\bar{\delta }) + \sqrt{(s+\log (1/\bar{\delta }))^2 - s^2} = s + \log (1/\bar{\delta }) + \sqrt{\log (1/\bar{\delta })^2 + 2s\log (1/\bar{\delta })}. \end{aligned}$$

This holds since \(\mu \ge \ell (s,{\bar{\delta }})\); hence, the result is proved. \(\square \)
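
For intuition, the following sketch (hypothetical Python, not part of [2]) numerically illustrates the implication (49). It takes \(\ell (s,\bar{\delta }) = s + \log (1/\bar{\delta }) + \sqrt{\log (1/\bar{\delta })^2 + 2s\log (1/\bar{\delta })}\), the positive root derived in the proof above, and estimates the left-tail probability by Monte Carlo; since Chernoff's bound is not tight, the estimate is typically far below \(\bar{\delta }\).

```python
import math
import random

def ell(s, delta_bar):
    # Positive root of mu^2 - 2*mu*(s + log(1/delta_bar)) + s^2 = 0,
    # as derived in the proof of Lemma 14.
    log_term = math.log(1.0 / delta_bar)
    return s + log_term + math.sqrt(log_term ** 2 + 2.0 * s * log_term)

def tail_prob(probs, s, trials=100_000):
    # Monte Carlo estimate of P[sum_j Y_j <= s] for independent
    # Bernoulli random variables Y_j with P[Y_j = 1] = probs[j].
    hits = 0
    for _ in range(trials):
        hits += sum(random.random() < p for p in probs) <= s
    return hits / trials

s, delta_bar, n = 5, 0.01, 50
mu = ell(s, delta_bar)      # approximately 17.8
probs = [mu / n] * n        # sum of the probabilities equals ell(s, delta_bar)
print(tail_prob(probs, s))  # should be at most delta_bar = 0.01
```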

Now, we turn our attention to proving (48). For any realization of a run of the algorithm up to iteration \(k \in [k_{\max }]\), let \(w_k\) denote the number of times that the merit parameter has been decreased up to the beginning of iteration k and let \(\overline{p}_k\) denote the probability that the merit parameter is decreased during iteration k. The signature of a realization up to iteration \(k \in \mathbb {N}^{}\) is \((\overline{p}_0,\dots ,\overline{p}_k,w_0,\dots ,w_k)\), which encodes all of the pertinent information regarding the behavior of the merit parameter sequence up to the start of iteration k.

One could imagine using all possible signatures to define a tree whereby each node contains a subset of all realizations of the algorithm. To construct such a tree, one could first consider the root node, which could be denoted by \({\tilde{N}}(\overline{p}_0,w_0)\), where \(\overline{p}_0\) is uniquely defined by the starting conditions of our algorithm and \(w_0 = 0\). All realizations of our algorithm follow the same initialization, so \(\overline{p}_0\) and \(w_0\) would be in the signature of every realization. Now, one could define a node \({\tilde{N}}(\overline{p}_{[k]},w_{[k]})\) at depth \(k \in [k_{\max }]\) (where the root node has a depth of 0) in the tree as the set of all realizations of our algorithm for which the signature of the realization up to iteration k is \((\overline{p}_0,\dots ,\overline{p}_k,w_0,\dots ,w_k)\). One could then define the edges in the tree by connecting nodes at adjacent levels, where node \({\tilde{N}}(\overline{p}_{[k]},w_{[k]})\) is connected to node \({\tilde{N}}({\bar{p}}_{[k]},\overline{p}_{k+1},w_{[k]},w_{k+1})\) for any \(\overline{p}_{k+1} \in [0,1]\) and \(w_{k+1} \in \{w_k,w_k+1\}\).

Unfortunately, the construction described in the previous paragraph may lead to nodes in the tree representing realizations that occur with probability zero. To remedy this, we instead construct a tree whose nodes contain all realizations whose probability signatures fall within specified intervals. To define such intervals, consider arbitrary \(B \in \mathbb {N}^{} \setminus \{0\}\) and let us restrict the sequence of values \(p_{[k]}\) used to define our nodes to those with

$$\begin{aligned} p_{[k]} = (p_0, \dots , p_k) \in \left\{ 0, \tfrac{1}{B}, \dots , \tfrac{B-1}{B}\right\} ^{k+1}. \end{aligned}$$
(50)

For \(p \in \{0, 1/B, \dots , (B-1)/B\}\), these define the probability intervals \(\iota (p)\) given by

$$\begin{aligned} \iota (p) = {\left\{ \begin{array}{ll} \left[ p, p + \tfrac{1}{B}\right) &{} \text {if} \;p \in \left\{ 0, \tfrac{1}{B}, \dots , \tfrac{B-2}{B}\right\} , \\ \left[ \tfrac{B-1}{B}, 1\right] &{} \text {if} \;p = \tfrac{B-1}{B}. \end{array}\right. } \end{aligned}$$

Now, we can construct our tree as follows. As before, first consider the root node, which we denote by \(N(p_0,w_0)\), where \(p_0 \in \{0, 1/B, \dots , (B-1)/B\}\) is uniquely defined by the starting conditions of our algorithm so that \(\mathbb {P}[\mathcal {T}_0 < \tau _{-1} | \mathcal {F}_0] \in \iota (p_0)\) and \(w_0 = 0\). All realizations of our algorithm follow the same initialization, so with \({\bar{p}}_0 = \mathbb {P}[\mathcal {T}_0 < \tau _{-1} | \mathcal {F}_0]\) one finds that \({\bar{p}}_0 \in \iota (p_0)\) and \(w_0\) are in the signature of every realization. We define a node \(N(p_{[k]},w_{[k]})\) at depth \(k \in [k_{\max }]\) as the set of all realizations for which the signature of the realization at iteration k exactly matches \(w_{[k]}\) and has probabilities that fall within the intervals defined by \(p_{[k]}\); i.e., a realization with signature \((\overline{p}_{[k]},w_{[k]})\) is a member of \(N(p_{[k]},w_{[k]})\) if and only if, for all \(j \in [k]\), one finds that \(\overline{p}_j \in \iota (p_j)\). The edges in the tree connect nodes in adjacent levels, where \(N(p_{[k]},w_{[k]})\) is connected to \(N(p_{[k]},p_{k+1},w_{[k]},w_{k+1})\) for any \(p_{k+1} \in \{0,1/B,\dots ,(B-1)/B\}\) and \(w_{k+1} \in \{w_k,w_k+1\}\).
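
As a minimal sketch of this bucketing (hypothetical code, not part of [2]), the grid value \(p\) with \(\overline{p} \in \iota (p)\) can be computed as follows, and a node of the tree is then identified by the pair of tuples \((p_{[k]},w_{[k]})\):

```python
import math

def grid_value(p_bar, B):
    # Return p in {0, 1/B, ..., (B-1)/B} with p_bar in iota(p); the last
    # interval [(B-1)/B, 1] is closed, hence the min with B - 1.
    return min(math.floor(p_bar * B), B - 1) / B

def node_key(p_bar_seq, w_seq, B):
    # Identify the node N(p_[k], w_[k]) containing a realization whose
    # signature is (p_bar_[k], w_[k]).
    return tuple(grid_value(p, B) for p in p_bar_seq), tuple(w_seq)

# Example: with B = 10, probabilities 0.23 and 1.0 fall in iota(0.2)
# and iota(0.9), respectively.
print(node_key([0.23, 1.0], [0, 1], B=10))  # ((0.2, 0.9), (0, 1))
```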

Notationally, since the behavior of the algorithm up to iteration \(k \in \mathbb {N}^{}\) is determined by the initial conditions and stochastic gradients in \(G_{[k-1]}\), we write

$$\begin{aligned} G_{[k-1]} \in N(p_{[k]},w_{[k]}) \end{aligned}$$

to denote the event that the signature of the algorithm up to k is a member of \(N(p_{[k]},w_{[k]})\). The initial condition, denoted for consistency as \(G_{[-1]} \in N(p_0,w_0)\), occurs with probability one. Based on the description above, the nodes of our tree satisfy: For any node at a depth of \(k \ge 2\), the event \(G_{[k-1]} \in N(p_{[k]},w_{[k]})\) occurs if and only if

$$\begin{aligned} \begin{aligned} \mathbb {P}[\mathcal {T}_k< \mathcal {T}_{k-1} | \mathcal {F}_k]&\in \iota (p_k), \\ S_{k-1}:= \sum _{i=0}^{k-1} \mathcal {I}[\mathcal {T}_i < \mathcal {T}_{i-1}]&= w_k, \\ \text {and}\ \ G_{[k-2]}&\in N(p_{[k-1]},w_{[k-1]}), \end{aligned} \end{aligned}$$
(51)

where \(\mathcal {I}[\cdot ]\) denotes the indicator function of any given event.

Let us now define certain important sets of nodes in the tree. First, let

$$\begin{aligned} \mathcal {L}_{\text {good}}:= \left\{ N(p_{[k]},w_{[k]}): \left( \sum _{i=0}^k p_i \le \ell (s_{\max },\hat{\delta }) + 1\right) \wedge (w_{k} = s_{\max } \vee k = k_{\max }) \right\} \end{aligned}$$

be the set of nodes at which the sum of the elements of \(p_{[k]}\) is sufficiently small and either \(w_k\) has reached \(s_{\max }\) or k has reached \(k_{\max }\). A node in this set is of interest since, due to the iteration and/or merit parameter decrease limit having been reached, the probability is zero that a certain “bad” event can occur over all realizations with signatures that are members of the node; see Lemma 15 below. Second, let

$$\begin{aligned} \mathcal {L}_{\text {bad}}:= \left\{ N(p_{[k]},w_{[k]}): \sum _{i=0}^{k} p_i > \ell (s_{\max },\hat{\delta })+1 \right\} \end{aligned}$$

be the nodes in the complement of \(\mathcal {L}_{\text {good}}\) at which the sum of the elements of \(p_{[k]}\) has exceeded the threshold \(\ell (s_{\max },{\hat{\delta }})+1\). A node in this set is of interest since, due to this threshold having been exceeded, all realizations with signatures that are members of this node correspond to poor behavior of the algorithm (and there is no need to consider the behavior of the algorithm beyond this point). Going forward, we restrict attention to the tree defined by the root node and all paths from the root node that terminate at a node contained in \(\mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\). It is clear from this restriction and the definitions of \(\mathcal {L}_{\text {good}}\) and \(\mathcal {L}_{\text {bad}}\) that this tree is finite with the elements of \(\mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\) being leaves.

Let us now define relationships between nodes. The parent of a node is defined as

$$\begin{aligned} P(N(p_{[k]},w_{[k]})) = N(p_{[k-1]},w_{[k-1]}). \end{aligned}$$

On the other hand, the children of node \(N(p_{[k]},w_{[k]})\) are defined as

$$\begin{aligned} C(N(p_{[k]},w_{[k]})) = {\left\{ \begin{array}{ll} \{N(p_{[k]},p_{k+1},w_{[k]},w_{k+1})\} &{} \text {if }N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\\ \emptyset &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

This ensures that paths down the tree terminate at nodes in \(\mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\), making these nodes the leaves of the tree. For convenience in the remainder of our discussions, let \(C(\emptyset ) = \emptyset \).

We define the height of node \(N(p_{[k]},w_{[k]})\) as the length of the longest path from \(N(p_{[k]},w_{[k]})\) to a leaf node, i.e., the height is denoted as

$$\begin{aligned} h(N(p_{[k]},w_{[k]})):= \left( \min \{j \in \mathbb {N}^{} \setminus \{0\}: C^j(N(p_{[k]},w_{[k]})) = \emptyset \}\right) -1, \end{aligned}$$

where \(C^j(N(p_{[k]},w_{[k]}))\) is shorthand for applying the mapping \(C(\cdot )\) consecutively j times. From this definition, \(h(N(p_{[k]},w_{[k]})) = 0\) for all \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\).
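
Equivalently, this is the usual recursive notion of height in a finite tree, as in the following sketch (hypothetical code, not part of [2]):

```python
def height(node, children):
    # Height of `node`: 0 for a leaf, else one more than the maximum
    # height among its children; matches h(N(p_[k], w_[k])) above.
    kids = children(node)
    return 0 if not kids else 1 + max(height(kid, children) for kid in kids)
```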

Next, let us define two more sets of nodes that will be useful later. Let \(C_{\text {dec}}(N(p_{[k]},w_{[k]}))\) denote the set of children of \(N(p_{[k]},w_{[k]})\) at which the merit parameter decreases and let \(C_{\text {dec}}^c(N(p_{[k]},w_{[k]}))\) denote the set of children at which it does not decrease, so

$$\begin{aligned} C_{\text {dec}}(N(p_{[k]},w_{[k]}))&:= \{ N(p_{[k]},p_{k+1},w_{[k]},w_{k+1}) : \nonumber \\&\qquad (N(p_{[k]},p_{k+1},w_{[k]},w_{k+1}) \in C(N(p_{[k]},w_{[k]}))) \nonumber \\&\qquad \wedge (w_{k+1} = w_{k} + 1)\} \end{aligned}$$
(52)

and

$$\begin{aligned} C_{\text {dec}}^c(N(p_{[k]},w_{[k]}))&:= \{ N(p_{[k]},p_{k+1},w_{[k]},w_{k+1}) : \nonumber \\&\qquad (N(p_{[k]},p_{k+1},w_{[k]},w_{k+1}) \in C(N(p_{[k]},w_{[k]}))) \nonumber \\&\qquad \wedge (w_{k+1} = w_{k})\}. \end{aligned}$$
(53)

Finally, let us define the event \(E_{\text {bad},B}\) as the event that for some \(j \in [k_{\max }]\) one finds

$$\begin{aligned} \left( \sum _{i=0}^{j} \mathbb {P}[\mathcal {T}_i < \mathcal {T}_{i-1} | \mathcal {F}_i] > \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B} + 1\right) . \end{aligned}$$
(54)

With respect to our goal of proving (48), the event \(E_{\text {bad},B}\) is of interest since it is the event that the probabilities accumulated up to some iteration \(j \in [k_{\max }]\) exceed the threshold found in (48) plus a term that is inversely proportional to B.

Let us now prove some properties of the leaf nodes.

Lemma 15

For any \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {good}}\), one finds

$$\begin{aligned} \mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] = 0. \end{aligned}$$

On the other hand, for all \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\), one finds

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad \le \hat{\delta } \prod _{i=1}^k \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_i < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] . \end{aligned}$$

Proof

Consider an arbitrary index \(k \in [k_{\max }]\) and an arbitrary pair \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {good}}\). By the definition of \(\mathcal {L}_{\text {good}}\), it follows that

$$\begin{aligned} \sum _{i=0}^k p_i \le \ell (s_{\max }, \hat{\delta }) + 1. \end{aligned}$$
(55)

Since the maximum depth of a node is \(k_{\max }\) and the gap between the discrete values in (50) is \(\tfrac{1}{B}\), it follows along with (55) that

$$\begin{aligned}&\mathbb {P}\left[ \sum _{i=0}^k \mathbb {P}[\mathcal {T}_i < \mathcal {T}_{i-1} | \mathcal {F}_i]> \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B} + 1 \Big | E, G_{[k-1]} \in N(p_{[k]},w_{[k]})\right] \\&\quad \le \mathbb {P}\left[ \sum _{i=0}^k \left( p_i + \tfrac{1}{B}\right)> \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B} + 1 \Big | E, G_{[k-1]} \in N(p_{[k]},w_{[k]})\right] \\&\quad \le \mathbb {P}\left[ \ell (s_{\max },\hat{\delta }) + \tfrac{k+1}{B} + 1 > \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B} + 1 \Big | E, G_{[k-1]} \in N(p_{[k]},w_{[k]})\right] = 0. \end{aligned}$$

Therefore, for any \(j \in \{1,\dots ,k\}\), one finds from conditional probability that

$$\begin{aligned}&\mathbb {P}\left[ (54) \text { holds} \wedge G_{[j-1]} \in N(p_{[j]},w_{[j]}) | E\right] \\&\quad =\mathbb {P}\left[ \sum _{i=0}^{j} \mathbb {P}[\mathcal {T}_i < \mathcal {T}_{i-1} | \mathcal {F}_i] > \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B} + 1 \Big | E, G_{[j-1]} \in N(p_{[j]},w_{[j]})\right] \\&\qquad \quad \cdot \mathbb {P}\left[ G_{[j-1]} \in N(p_{[j]},w_{[j]}) | E\right] = 0. \end{aligned}$$

In addition, (54) cannot hold for \(j=0\) since \(\ell (s_{\max },\hat{\delta })+1 > 1\). Hence, along with the conclusion above, it follows that \(E_{\text {bad},B}\) does not occur when a signature up to iteration \(j \in \{1,\dots ,k\}\) falls into a node along any path from the root node to \(N(p_{[k]},w_{[k]})\). Now, by the definition of \(\mathcal {L}_{\text {good}}\), at least one of \(w_{k} = s_{\max }\) or \(k = k_{\max }\) holds. Let us consider each case in turn. If \(k = k_{\max }\), then it follows by the preceding arguments that

$$\begin{aligned} \mathbb {P}\left[ \sum _{i=0}^{k_{\max }} \mathbb {P}[\mathcal {T}_i < \mathcal {T}_{i-1} | \mathcal {F}_i] \le \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B} + 1 \Big | E, G_{[k-1]} \in N(p_{[k]},w_{[k]})\right] = 1. \end{aligned}$$

Otherwise, if \(w_{k} = s_{\max }\), then it follows by the definition of the event E that \(\mathbb {P}[\mathcal {T}_i < \mathcal {T}_{i-1} | \mathcal {F}_i] = 0\) for all \(i \in \{k,\dots ,k_{\max }\}\), and therefore the equation above again follows. Overall, it follows that \(\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] = 0\), as desired.

Next, we remark that

$$\begin{aligned} \mathbb {P}[G_{[-1]} \in N(p_0, w_0) \wedge E_{\text {bad},B}| E] = 0 \end{aligned}$$

since

$$\begin{aligned} \mathbb {P}[\mathcal {T}_0< \tau _{-1} | \mathcal {F}_0] \le p_0 + \frac{1}{B} < \ell (s_{\max },\hat{\delta }) + \frac{k_{\max }+1}{B} + 1. \end{aligned}$$

Thus, consider arbitrary \(k \in \mathbb {N}^{} \setminus \{0\}\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\). One finds

$$\begin{aligned}&\ \mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad =\ \mathbb {P}[E_{\text {bad},B}| E, G_{[k-1]} \in N(p_{[k]},w_{[k]})] \cdot \mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) | E] \\&\quad \le \ \mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) | E]. \end{aligned}$$

Hence, using the initial condition that \(G_{[-1]} \in N(p_0,w_0)\), it follows that

$$\begin{aligned}&\ \mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \nonumber \\&\quad \le \ \mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) | E] = \mathbb {P}\left[ (51) \text { holds} \big | E\right] \nonumber \\&\quad =\ \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{k}< \mathcal {T}_{k-1} | \mathcal {F}_k] \in \iota (p_k) \big | E, S_{k-1} = w_{k}, G_{[k-2]} \in N(p_{[k-1]},w_{[k-1]})\right] \nonumber \\&\qquad \ \cdot \mathbb {P}\left[ S_{k-1} = w_{k} \wedge G_{[k-2]} \in N(p_{[k-1]},w_{[k-1]}) \big | E\right] \nonumber \\&\quad =\ \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{k}< \mathcal {T}_{k-1} | \mathcal {F}_k] \in \iota (p_k) \big | E, S_{k-1} = w_{k}, G_{[k-2]} \in N(p_{[k-1]},w_{[k-1]})\right] \nonumber \\&\qquad \ \cdot \mathbb {P}\left[ S_{k-1} = w_{k} \big | E, G_{[k-2]} \in N(p_{[k-1]},w_{[k-1]})\right] \mathbb {P}\left[ G_{[k-2]} \in N(p_{[k-1]},w_{[k-1]}) \big | E\right] \nonumber \\&\quad =\ \mathbb {P}[G_{[-1]} \in N(p_0,w_0)] \nonumber \\&\qquad \ \cdot \prod _{i=1}^k \Big (\mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i}< \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \nonumber \\&\qquad \cdot \mathbb {P}\left[ S_{i-1} = w_{i} \big | E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \Big ) \nonumber \\&\quad =\ \prod _{i=1}^k \Big (\mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \nonumber \\&\qquad \ \cdot \mathbb {P}\left[ S_{i-1} = w_{i} \big | E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \Big ). \end{aligned}$$
(56)

Our goal is to bound (56). Toward this end, define

$$\begin{aligned} \mathcal {I}_{\text {dec}}:= \{i \in \{1,\dots ,k\}: w_{i} = w_{i-1} + 1\}\ \ \text {and}\ \ \mathcal {I}_{\text {dec}}^c:= \{i \in \{1,\dots ,k\}: w_{i} = w_{i-1}\}, \end{aligned}$$

which by the definition of \(w_{[k]}\) form a partition of \(\{1,\dots ,k\}\). For any \(i \in \mathcal {I}_{\text {dec}}\),

$$\begin{aligned}&\ \mathbb {P}[S_{i-1} = w_{i} | E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})] \\&\quad =\ \mathbb {P}[\mathcal {T}_{i-1} < \mathcal {T}_{i-2} |E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})] \le p_{i-1} + \tfrac{1}{B}. \end{aligned}$$

On the other hand, for any \(i \in \mathcal {I}_{\text {dec}}^c\),

$$\begin{aligned}&\ \mathbb {P}[S_{i-1} = w_{i} | E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})] \\&\quad =\ \mathbb {P}[\mathcal {T}_{i-1} = \mathcal {T}_{i-2} |E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})] \\&\quad =\ 1 - \mathbb {P}[\mathcal {T}_{i-1} < \mathcal {T}_{i-2} |E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})] \le 1-p_{i-1}. \end{aligned}$$

Thus, it follows that the latter term in (56) satisfies

$$\begin{aligned}&\prod _{i=1}^{k} \mathbb {P}[S_{i-1} = w_{i} | E, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})] \\&\quad \le \left( \prod _{i \in \mathcal {I}_{\text {dec}}} \left( p_{i-1} + \tfrac{1}{B}\right) \right) \left( \prod _{i \in \mathcal {I}_{\text {dec}}^c} (1-p_{i-1})\right) . \end{aligned}$$

Now let us bound this term. By the definition of \(\mathcal {L}_{\text {bad}}\), one finds that

$$\begin{aligned} \sum _{i=0}^k p_i> \ell (s_{\max },\hat{\delta }) + 1 \implies \sum _{i=0}^{k-1} p_i > \ell (s_{\max },\hat{\delta }). \end{aligned}$$
(57)

In addition, by the definition of \(s_{\max }\), it follows that \(w_k \le s_{\max }\) for all \(k \in [k_{\max }]\), from which it follows that \(|\mathcal {I}_{\text {dec}}| \le s_{\max }\). Now, let \(\{Z_0,\dots ,Z_{k-1}\}\) be independent Bernoulli random variables such that, for all \(i \in \{0,\dots ,k-1\}\), one has

$$\begin{aligned} \mathbb {P}[Z_i=1] = {\left\{ \begin{array}{ll} p_i + \frac{1}{B} &{} \text {if} \; i+1 \in \mathcal {I}_{\text {dec}}\\ p_i &{} \text {if} \; i+1 \in \mathcal {I}_{\text {dec}}^c. \end{array}\right. } \end{aligned}$$

By (57), it follows from the definition of these random variables that \(\sum _{i=0}^{k-1} \mathbb {P}[Z_i=1] \ge \ell (s_{\max }, \hat{\delta })\). Then, it follows by Lemma 14 and the preceding argument that

$$\begin{aligned}&\prod _{i \in \mathcal {I}_{\text {dec}}} \left( p_{i-1} + \frac{1}{B}\right) \prod _{i \in \mathcal {I}_{\text {dec}}^c} (1-p_{i-1}) \\&\quad = \mathbb {P}\left[ (Z_{i-1} = 1 \text { for all } i \in \mathcal {I}_{\text {dec}}) \wedge (Z_{i-1} = 0 \text { for all } i \in \mathcal {I}_{\text {dec}}^c) \right] \\&\quad = \mathbb {P}\left[ \left( \sum _{i=0}^{k-1} Z_i \le s_{\max }\right) \wedge (Z_{i-1} = 1 \text { for all } i \in \mathcal {I}_{\text {dec}}) \wedge (Z_{i-1} = 0 \text { for all } i \in \mathcal {I}_{\text {dec}}^c) \right] \\&\quad \le \mathbb {P}\left[ \sum _{i=0}^{k-1} Z_i \le s_{\max }\right] \le \hat{\delta }. \end{aligned}$$

Combining this with (56), the desired conclusion follows. \(\square \)

Next, we present a lemma about nodes in the sets defined in (52) and (53). The lemma essentially states that a certain probability of interest, defined as the product of probabilities along a path to a child node, can be reduced to a product of probabilities along the path to the child's parent node by partitioning the children into those at which a merit parameter decrease has occurred and those at which it has not.

Lemma 16

For all \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\), one finds that

$$\begin{aligned}&\sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]})) \}} \\&\quad \prod _{i=1}^{k+1} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i}< \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \\&\quad = \prod _{i=1}^k \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \end{aligned}$$

and, similarly, one finds that

$$\begin{aligned}&\sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]})) \}} \\&\quad \prod _{i=1}^{k+1} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i}< \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \\&\quad = \prod _{i=1}^k \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] , \end{aligned}$$

where by the definitions of \(C_{\text {dec}}\) and \(C_{\text {dec}}^c\) it follows that the value of \(w_{k+1}\) in the sum in the former equation is one greater than the value of \(w_{k+1}\) in the sum in the latter equation.

Proof

One finds that

$$\begin{aligned}&\sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]}))\}} \\&\quad \prod _{i=1}^{k+1} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i}< \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \\&\quad = \prod _{i=1}^{k} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i}< \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \\&\quad \cdot \sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]}))\}} \\&\quad \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{k+1} < \mathcal {T}_{k} | \mathcal {F}_{k+1}] \in \iota (p_{k+1}) \big | E, S_{k} = w_{k+1}, G_{[k-1]} \in N(p_{[k]},w_{[k]})\right] . \end{aligned}$$

The desired conclusion follows since, by the definition of \(C_{\text {dec}}(N(p_{[k]},w_{[k]}))\), all elements in the latter sum have \(S_k = w_{k+1} = w_{k}+1\), so the sum runs over every grid value \(p_{k+1}\); since the intervals \(\iota (p_{k+1})\) partition [0, 1], these conditional probabilities are exhaustive, and hence the sum is 1.

The proof of the second desired conclusion follows in the same manner with \(C_{\text {dec}}^c\) in place of \(C_{\text {dec}}\) and \(S_k=w_{k+1}=w_{k}\) in place of \(S_k=w_{k+1}=w_k+1\). \(\square \)

Next, we derive a result for certain nodes containing realizations with \(w_{k} = s_{\max }- 1\).

Lemma 17

For any \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) such that \(w_{k} = s_{\max } - 1\) and \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), it follows that

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \nonumber \\&\quad \le \hat{\delta } \prod _{i=1}^k \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] . \end{aligned}$$
(58)

Proof

By the supposition that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), it follows that any \((p_{[k]},w_{[k]})\) with \(h(N(p_{[k]},w_{[k]})) = 0\) has \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\), in which case the desired conclusion follows from Lemma 15. With this base case being established, we now prove the result by induction. Suppose that the result holds for all \((p_{[k]},w_{[k]})\) such that \(w_{k} = s_{\max } - 1\), \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), and \(h(N(p_{[k]},w_{[k]})) \le j\) for some \(j \in \mathbb {N}^{}\). Our goal is to show that the same statement holds with j replaced by \(j+1\). For this purpose, consider arbitrary \((p_{[k]},w_{[k]})\) such that \(w_{k} = s_{\max } - 1\), \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), and \(h(N(p_{[k]},w_{[k]})) = j + 1\). Observe that by the definition of the child operators C, \(C_{\text {dec}}\), and \(C_{\text {dec}}^c\), it follows that

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad = \sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C(N(p_{[k]},w_{[k]})) \}} \mathbb {P}[G_{[k]} \in N(p_{[k+1]},w_{[k+1]}) \wedge E_{\text {bad},B}| E] \\&\quad = \sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]})) \}} \mathbb {P}[G_{[k]} \in N(p_{[k+1]},w_{[k+1]}) \wedge E_{\text {bad},B}| E] \\&\quad + \sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]})) \}} \mathbb {P}[G_{[k]} \in N(p_{[k+1]},w_{[k+1]}) \wedge E_{\text {bad},B}| E]. \end{aligned}$$

Since \(w_{k} = s_{\max }-1\), it follows from the definition of \(C_{\text {dec}}\) that for any \((p_{k+1},w_{k+1})\) with \(N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]}))\), one finds that \(w_{k+1} = w_{k}+1 = s_{\max }\). By the definition of \(s_{\max }\), this implies that \(\mathbb {P}[\mathcal {T}_{k+1} < \mathcal {T}_k | \mathcal {F}_{k+1}] = 0\), so \(p_{k+1} = 0\). In addition, since \(C(N(p_{[k]},w_{[k]})) \ne \emptyset \), one has \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {bad}}\), so it follows that \(\sum _{i=0}^{k+1} p_{i} \le \ell (s_{\max },\hat{\delta })+1\), meaning \(N(p_{[k+1]},w_{[k+1]}) \in \mathcal {L}_{\text {good}}\). Consequently, from above and Lemma 15, one finds

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad = \sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]})) \}} \mathbb {P}[G_{[k]} \in N(p_{[k+1]},w_{[k+1]}) \wedge E_{\text {bad},B}| E]. \end{aligned}$$

Since \(h(N(p_{[k]},w_{[k]})) = j+1\), it follows that \(h(N(p_{[k+1]},w_{[k+1]})) \le j\) for any \((p_{[k+1]},w_{[k+1]})\) with \(N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]}))\). Therefore, by the induction hypothesis and the result of Lemma 16, it follows that

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad \le \sum _{\{(p_{k+1},w_{k+1}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]})) \}} \\&\qquad \hat{\delta } \prod _{i=1}^{k+1} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i}< \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \\&\quad \le \hat{\delta } \prod _{i=1}^k \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] , \end{aligned}$$

which completes the proof. \(\square \)

Using the preceding lemma as a base case, we now perform induction on the difference \(s_{\max }- w_{k}\) to prove a similar result for arbitrary \(s_{\max }\).

Lemma 18

For any \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), it follows that

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad \le \hat{\delta } \cdot \sum _{j=0}^{\min \{s_{\max }-w_{k}-1,h(N(p_{[k]},w_{[k]}))\}} \left( \begin{matrix} h(N(p_{[k]},w_{[k]})) \\ j \end{matrix} \right) \\&\qquad \cdot \prod _{i=1}^k \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_{i}, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] . \end{aligned}$$

Proof

For all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\) and \(h(N(p_{[k]},w_{[k]})) = 0\), it follows that \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\). The result holds in this case according to Lemma 15 since one finds that \(\sum _{j=0}^{\min \{s_{\max }-w_{k}-1,h(N(p_{[k]},w_{[k]}))\}} \left( {\begin{array}{c}h(N(p_{[k]},w_{[k]}))\\ j\end{array}}\right) = \left( {\begin{array}{c}0\\ 0\end{array}}\right) = 1\). On the other hand, for all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\) and \(s_{\max }- w_{k} = 1\), the result follows from Lemma 17. Hence, to prove the remainder of the result by induction, one may assume that it holds for all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), \(h(N(p_{[k]},w_{[k]})) \le t\) for some \(t \in \mathbb {N}^{}\), and \(s_{\max }- w_{k} = r\) for some \(r \in \mathbb {N}^{} {\setminus } \{0\}\), and show that it holds for all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), \(h(N(p_{[k]},w_{[k]})) = t + 1\), and \(s_{\max }- w_{k} = r\). (Notice that the base cases above show that the result holds for \(t=0\) and any \(r \in \mathbb {N}^{} \setminus \{0\}\), as well as for any \(t \in \mathbb {N}^{}\) and \(r=1\). Hence, one may complete the induction over the index pairs by showing that if it holds for (tr), then it holds for \((t+1,r)\), as claimed above.)

Consider arbitrary \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), \(h(N(p_{[k]},w_{[k]})) = t + 1\), and \(s_{\max }- w_{k} = r\). By the definitions of C, \(C_{\text {dec}}\), and \(C_{\text {dec}}^c\), it follows that

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad = \sum _{\{(p_{[k+1]},w_{[k+1]}) : N(p_{[k+1]},w_{[k+1]}) \in C(N(p_{[k]},w_{[k]})) \}} \mathbb {P}[G_{[k]} \in N(p_{[k+1]},w_{[k+1]}) \wedge E_{\text {bad},B}| E] \\&\quad = \sum _{\{(p_{[k+1]},w_{[k+1]}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]})) \}} \mathbb {P}[G_{[k]} \in N(p_{[k+1]},w_{[k+1]}) \wedge E_{\text {bad},B}| E] \\&\qquad + \sum _{\{(p_{[k+1]},w_{[k+1]}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]})) \}} \mathbb {P}[G_{[k]} \in N(p_{[k+1]},w_{[k+1]}) \wedge E_{\text {bad},B}| E]. \end{aligned}$$

Further by the definition of \(C_{\text {dec}}\), it follows that \(w_{k+1} = w_k + 1\) (thus \(s_{\max }- w_{k+1} = r-1\)) for all terms in the former sum on the right-hand side, whereas by the definition of \(C_{\text {dec}}^c\) it follows that \(w_{k+1} = w_k\) (thus \(s_{\max }- w_{k+1} = r\)) for all terms in the latter sum on the right-hand side. Moreover, from \(h(N(p_{[k]},w_{[k]})) = t+1\), it follows that \(h(N(p_{[k+1]},w_{[k+1]})) \le t\) for all terms on the right-hand side. Therefore, by the induction hypothesis, it follows that

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \\&\quad \le \sum _{\{(p_{[k+1]},w_{[k+1]}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]})) \}} \hat{\delta } \sum _{j=0}^{\min \{r-2,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right) \\&\quad \cdot \prod _{i=1}^{k+1} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i}< \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_{i}) \big | E, S_{i-1} = w_i, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] \\&\qquad + \sum _{\{(p_{[k+1]},w_{[k+1]}) : N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]})) \}} \hat{\delta } \sum _{j=0}^{\min \{r-1,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right) \\&\qquad \cdot \prod _{i=1}^{k+1} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_{i}) \big | E, S_{i-1} = w_i, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] , \end{aligned}$$

which by Lemma 16 implies that

$$\begin{aligned}&\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] \nonumber \\&\quad \le \hat{\delta } \left( \sum _{j=0}^{\min \{r-2,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right) + \sum _{j=0}^{\min \{r-1,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right) \right) \nonumber \\&\quad \cdot \prod _{i=1}^{k} \mathbb {P}\left[ \mathbb {P}[\mathcal {T}_{i} < \mathcal {T}_{i-1} | \mathcal {F}_i] \in \iota (p_i) \big | E, S_{i-1} = w_i, G_{[i-2]} \in N(p_{[i-1]},w_{[i-1]})\right] . \end{aligned}$$
(59)

To complete the proof, we need only consider two cases on the relationship between t and r. First, if \(t \le r-2\), then Pascal’s rule implies that

$$\begin{aligned} \sum _{j=0}^{\min \{r-2,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right) + \sum _{j=0}^{\min \{r-1,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right)&= 2\sum _{j=0}^{t} \left( {\begin{array}{c}t\\ j\end{array}}\right) \\&= \left( {\begin{array}{c}t\\ t\end{array}}\right) + \left( {\begin{array}{c}t\\ 0\end{array}}\right) + \sum _{j=1}^t \left( \left( {\begin{array}{c}t\\ j\end{array}}\right) + \left( {\begin{array}{c}t\\ j-1\end{array}}\right) \right) \\&= \left( {\begin{array}{c}t+1\\ t+1\end{array}}\right) + \left( {\begin{array}{c}t+1\\ 0\end{array}}\right) + \sum _{j=1}^t \left( {\begin{array}{c}t+1\\ j\end{array}}\right) \\&= \sum _{j=0}^{t+1} \left( {\begin{array}{c}t+1\\ j\end{array}}\right) = \sum _{j=0}^{h(N(p_{[k]},w_{[k]}))} \left( {\begin{array}{c}h(N(p_{[k]},w_{[k]}))\\ j\end{array}}\right) . \end{aligned}$$

Since \(t \le r-2\), it follows that \(h(N(p_{[k]},w_{[k]})) = t+1 \le r-1 = s_{\max }-w_{k}-1\), which combined with (59) proves the result in this case. Second, if \(t > r-2\), then similarly

$$\begin{aligned} \sum _{j=0}^{\min \{r-2,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right) + \sum _{j=0}^{\min \{r-1,t\}} \left( {\begin{array}{c}t\\ j\end{array}}\right)&= \sum _{j=0}^{r-2} \left( {\begin{array}{c}t\\ j\end{array}}\right) + \sum _{j=0}^{r-1} \left( {\begin{array}{c}t\\ j\end{array}}\right) \\&= \left( {\begin{array}{c}t\\ 0\end{array}}\right) + \sum _{j=1}^{r-1} \left( \left( {\begin{array}{c}t\\ j\end{array}}\right) + \left( {\begin{array}{c}t\\ j-1\end{array}}\right) \right) \\&= \left( {\begin{array}{c}t+1\\ 0\end{array}}\right) + \sum _{j=1}^{r-1} \left( {\begin{array}{c}t+1\\ j\end{array}}\right) \\&= \sum _{j=0}^{r-1} \left( {\begin{array}{c}t+1\\ j\end{array}}\right) = \sum _{j=0}^{s_{\max }-w_{k}-1} \left( {\begin{array}{c}h(N(p_{[k]},w_{[k]}))\\ j\end{array}}\right) . \end{aligned}$$

Since \(t > r-2\), \(h(N(p_{[k]},w_{[k]})) = t+1 > r-1 = s_{\max }-w_{k}-1\), which combined with (59) proves the result for this case as well. \(\square \)
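
The two cases above rest on a single binomial identity; the following sketch (hypothetical code, for verification only) confirms it numerically for small \(t\) and \(r\), with the convention that an empty sum is zero.

```python
from math import comb

def lhs(t, r):
    # Sum appearing in (59) after applying the induction hypothesis.
    return (sum(comb(t, j) for j in range(min(r - 2, t) + 1))
            + sum(comb(t, j) for j in range(min(r - 1, t) + 1)))

def rhs(t, r):
    # Bound claimed by Lemma 18 at height t + 1 with s_max - w_k = r.
    return sum(comb(t + 1, j) for j in range(min(r - 1, t + 1) + 1))

assert all(lhs(t, r) == rhs(t, r) for t in range(12) for r in range(1, 12))
```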

We now prove our first main result of this section.

Theorem 5

For any \(\delta \in (0,1)\) with \(\hat{\delta }\) as defined in (32) and \(\ell (s_{\max },\hat{\delta })\) as defined in (33), one finds that (48) holds.

Proof

First, consider the case where \(s_{\max }= 0\). Then, by the definition of \(s_{\max }\),

$$\begin{aligned} \mathbb {P}[\mathcal {T}_k < \mathcal {T}_{k-1}|\mathcal {F}_k] = 0, \end{aligned}$$

for all \(k \in [k_{\max }]\), so the result holds trivially.

Now, let \(s_{\max }\in \mathbb {N}^{} \setminus \{0\}\). By construction of our tree and the definitions of \(\mathcal {L}_{\text {good}}\) and \(\mathcal {L}_{\text {bad}}\), one finds that \(h(N(p_0,w_0)) \le k_{\max }\). In addition, by the definition of \(s_{\max }\), \(s_{\max }- 1 < k_{\max }\), so \(\min \{s_{\max }-w_0-1,h(N(p_0,w_0))\} = s_{\max }-1 \ge 0\). Consider arbitrary \(B \in \mathbb {N}^{} \setminus \{0\}\) (see (54)). By Lemma 18 and (32),

$$\begin{aligned} \mathbb {P}[E_{\text {bad},B}| E] = \mathbb {P}[G_{[-1]} \in N(p_0,w_0) \wedge E_{\text {bad},B}| E] \le \hat{\delta } \sum _{j=0}^{\min \{s_{\max }-1,k_{\max }\}} \left( {\begin{array}{c}k_{\max }\\ j\end{array}}\right) = \delta . \end{aligned}$$

Therefore, by the definition of \(E_{\text {bad},B}\) (see (54)), it follows that

$$\begin{aligned} \mathbb {P}\left[ \sum _{k=0}^{k_{\max }} \mathbb {P}[\mathcal {T}_k < \mathcal {T}_{k-1} | \mathcal {F}_k] \le \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B}+1 \Bigg | E \right] \ge 1-\delta . \end{aligned}$$

Now, let us define the event \(E_{\text {good},B}\) for \(B \in \mathbb {N}^{} {\setminus } \{0\}\) as the event that

$$\begin{aligned} \sum _{k=0}^{k_{\max }} \mathbb {P}[\mathcal {T}_k < \mathcal {T}_{k-1} | \mathcal {F}_k] \le \ell (s_{\max },\hat{\delta }) + \tfrac{k_{\max }+1}{B}+1. \end{aligned}$$

One sees that \(E_{\text {good},B} \supseteq E_{\text {good},B+1}\) for all such B. Therefore, by the properties of a decreasing sequence of events (see, for example [31, Section 1.5]), it follows that

$$\begin{aligned}&\mathbb {P}\left[ \sum _{k=0}^{k_{\max }} \mathbb {P}[\mathcal {T}_k< \mathcal {T}_{k-1} | \mathcal {F}_k] \le \ell (s_{\max },{\hat{\delta }}) +1 \Bigg | E \right] \\&\quad = \mathbb {P}\left[ \lim _{B \rightarrow \infty } \left( \sum _{k=0}^{k_{\max }} \mathbb {P}[\mathcal {T}_k< \mathcal {T}_{k-1} | \mathcal {F}_k] \le \ell (s_{\max },{\hat{\delta }}) + \frac{k_{\max }+1}{B}+1\right) \Bigg | E \right] \\&\quad = \lim _{B \rightarrow \infty } \mathbb {P}\left[ \sum _{k=0}^{k_{\max }} \mathbb {P}[\mathcal {T}_k < \mathcal {T}_{k-1} | \mathcal {F}_k] \le \ell (s_{\max },{\hat{\delta }}) + \frac{k_{\max }+1}{B}+1 \Bigg | E \right] \ge 1-\delta , \end{aligned}$$

as desired. \(\square \)

Next, we present some preliminary results that are required to prove the second statement of Lemma 9. Recall the random index set \(\mathcal {K}_{\tau }\) defined in (34). Our next lemma shows a property about any iteration \(k \in [k_{\max }]\) in which \(k \in \mathcal {K}_{\tau }\).

Lemma 19

Let Assumption 4 hold. Then, one finds that

$$\begin{aligned} \mathbb {P}[\mathcal {T}_k < \mathcal {T}_{k-1} | \mathcal {F}_k, k \in \mathcal {K}_{\tau }] \ge p_{\tau }. \end{aligned}$$

Proof

In any iteration where \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\), it follows that \(\mathcal {T}_k^{\text {trial},\text {true}}< \infty \), so

$$\begin{aligned} \mathcal {T}_k^{\text {trial},\text {true}}= \frac{(1-\sigma )\Vert c(X_k)\Vert _1}{\nabla f(X_k)^\top D_k^{\text {true}}+ \max \{(D_k^{\text {true}})^\top H_k D_k^{\text {true}},0\}} \end{aligned}$$

and thus

$$\begin{aligned} (1-\sigma )\Vert c(X_k)\Vert _1 < (\nabla f(X_k)^\top D_k^{\text {true}}+ \max \{(D_k^{\text {true}})^\top H_k D_k^{\text {true}},0\}) \mathcal {T}_{k-1}. \end{aligned}$$

By the definition of \(\mathcal {T}_k\), if

$$\begin{aligned} G_k^\top D_k + \max \{D_k^\top H_k D_k,0\} \ge \nabla f(X_k)^\top D_k^{\text {true}}+ \max \{(D_k^{\text {true}})^\top H_k D_k^{\text {true}},0\} \end{aligned}$$

in an iteration such that \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\), then

$$\begin{aligned} (1-\sigma )\Vert c(X_k)\Vert _1 < (G_k^\top D_k + \max \{D_k^\top H_k D_k,0\}) \mathcal {T}_{k-1}, \end{aligned}$$

meaning that \(\mathcal {T}_k < \mathcal {T}_{k-1}\). Observe that the event \(k \in \mathcal {K}_{\tau }\) only depends on the history of the algorithm prior to iteration k and thus the \(\sigma \)-algebra generated by \(k \in \mathcal {K}_{\tau }\) is included in \(\mathcal {F}_k\). Therefore, by [15, Theorem 4.1.13], for any random variable Z, we have

$$\begin{aligned} \mathbb {E}_k[Z | k \in \mathcal {K}_{\tau }] = \mathbb {E}_k[\mathbb {E}_k[Z] | k \in \mathcal {K}_{\tau }]. \end{aligned}$$

Therefore, with \({\varvec{1}}({\tilde{E}})\) denoting the indicator of event \({\tilde{E}}\), it follows from Assumption 4 that

$$\begin{aligned}&\mathbb {P}_k[\mathcal {T}_k < \mathcal {T}_{k-1} | k \in \mathcal {K}_{\tau }] \\&\quad \ge \mathbb {E}_k[{\varvec{1}}(G_k^\top D_k + \max \{D_k^\top H_k D_k,0\} \ge \nabla f(X_k)^\top D_k^{\text {true}}\\&\qquad + \max \{(D_k^{\text {true}})^\top H_k D_k^{\text {true}},0\}) | k \in \mathcal {K}_{\tau }] \\&\quad = \mathbb {E}_k[\mathbb {E}_k[{\varvec{1}}(G_k^\top D_k + \max \{D_k^\top H_k D_k,0\} \ge \nabla f(X_k)^\top D_k^{\text {true}}\\&\qquad + \max \{(D_k^{\text {true}})^\top H_k D_k^{\text {true}},0\})] | k \in \mathcal {K}_{\tau }] \\&\quad = \mathbb {E}_k[\mathbb {P}_k[G_k^\top D_k + \max \{D_k^\top H_k D_k,0\} \ge \nabla f(X_k)^\top D_k^{\text {true}}\\&\qquad + \max \{(D_k^{\text {true}})^\top H_k D_k^{\text {true}},0\}] | k \in \mathcal {K}_{\tau }] \\&\quad \ge \mathbb {E}_k[p_{\tau } | k \in \mathcal {K}_{\tau }] = p_{\tau }, \end{aligned}$$

as desired. \(\square \)

The previous lemma guarantees that in any iteration in which \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\), the probability is at least \(p_{\tau }\) that the merit parameter decreases. By the scheme for setting \(\mathcal {T}_k\),

$$\begin{aligned} \mathbb {P}_k[\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_k | \mathcal {T}_k^{\text {trial},\text {true}}\ge \mathcal {T}_{k-1}] = 0, \end{aligned}$$
(60)

so one must have \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\) in any iteration in which \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_k\). Thus, we can obtain a bound on the number of iterations at which \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_k\) by bounding the number of iterations at which \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\). Now we prove a result relating \(|\mathcal {K}_{\tau }|\) to the probabilities of decreasing the merit parameter over all iterations.

Lemma 20

One finds that

$$\begin{aligned} \sum _{k=0}^{k_{\max }} \mathbb {P}_k[\mathcal {T}_k < \mathcal {T}_{k-1}] \ge |\mathcal {K}_{\tau }| p_{\tau }. \end{aligned}$$

Proof

By the law of total probability,

$$\begin{aligned}&\sum _{k=0}^{k_{\max }} \mathbb {P}_k[\mathcal {T}_k< \mathcal {T}_{k-1}] \\&\quad = \sum _{k=0}^{k_{\max }} \left( \mathbb {P}_k[\mathcal {T}_k< \mathcal {T}_{k-1} | k \in \mathcal {K}_{\tau }] \mathbb {P}_k[k \in \mathcal {K}_{\tau }] + \mathbb {P}_k[\mathcal {T}_k< \mathcal {T}_{k-1} | k \in \mathcal {K}_{\tau }^c] \mathbb {P}_k[k \in \mathcal {K}_{\tau }^c]\right) \\&\quad \ge \sum _{k=0}^{k_{\max }} \mathbb {P}_k[\mathcal {T}_k < \mathcal {T}_{k-1} | k \in \mathcal {K}_{\tau }] \mathbb {P}_k[k \in \mathcal {K}_{\tau }]. \end{aligned}$$

Now, similar to the previous lemma, we note that \(k \in \mathcal {K}_{\tau }\) only depends on the history of the algorithm prior to iteration k. Denoting the indicator of the event \(k \in \mathcal {K}_{\tau }\) as \({\varvec{1}}(k \in \mathcal {K}_{\tau })\), it follows by Lemma 19 that

$$\begin{aligned} \sum _{k=0}^{k_{\max }} \mathbb {P}_k[\mathcal {T}_k < \mathcal {T}_{k-1} | k \in \mathcal {K}_{\tau }] \mathbb {P}_k[k \in \mathcal {K}_{\tau }]&\ge p_{\tau } \sum _{k=0}^{k_{\max }} \mathbb {E}_k[{\varvec{1}}(k \in \mathcal {K}_{\tau })] \\&= p_{\tau } \sum _{k=0}^{k_{\max }} {\varvec{1}}(k \in \mathcal {K}_{\tau }) \\&= p_{\tau } \left| \mathcal {K}_{\tau }\right| , \end{aligned}$$

where the second to last equality follows by \({\varvec{1}}(k \in \mathcal {K}_{\tau }) \in \mathcal {F}_k\). \(\square \)

Now, we are prepared to prove Lemma 9.

Proof

(Lemma 9) For the first statement, observe that, for any \(k \in [k_{\max }]\), by the definition of \(E_{k,3}\), the event \(\mathcal {T}_k < \mathcal {T}_{k-1}\) must occur whenever \(E_{k,3}\) occurs. Therefore, for any \(k \in [k_{\max }]\), one finds

$$\begin{aligned} \mathbb {P}[\mathcal {T}_k < \mathcal {T}_{k-1} | \mathcal {F}_k] \ge \mathbb {P}[E_{k,3} | \mathcal {F}_k]. \end{aligned}$$

Equation (35) then follows directly from Theorem 5.

Now, consider the second statement of Lemma 9. The proof follows by combining Theorem 5 and Lemma 20 with the preceding argument. \(\square \)

C Lemma Required for the Proof of Corollary 1

This appendix provides the following lemma, which shows that the order notation result in (5a) and (5b) holds, as required in the proof of Corollary 1.

Lemma 21

Let \(\delta \in (0,1)\), \(\hat{\delta }\) be defined in (32), \(s_{\max }\in \mathbb {N}^{} \setminus \{0\}\) and \(\ell (s_{\max },\hat{\delta })\) be defined in (33). Then,

$$\begin{aligned} \ell (s_{\max },\hat{\delta }) = \mathcal {O}\left( s_{\max }\log (k_{\max }) + \log (1/\delta )\right) . \end{aligned}$$

Proof

Since \(s_{\max }\in \mathbb {N}^{} \setminus \{0\}\), it follows that

$$\begin{aligned} \sum _{j=0}^{\max \{s_{\max }-1,0\}} \left( {\begin{array}{c}k_{\max }\\ j\end{array}}\right)&= \sum _{j=0}^{s_{\max }-1} \frac{(k_{\max })!}{j!(k_{\max }-j)!} \le \sum _{j=0}^{s_{\max }-1} \frac{(k_{\max })!}{(k_{\max }-j)!} \\&= 1+\sum _{j=1}^{s_{\max }-1} \prod _{i=k_{\max }+1-j}^{k_{\max }} i \le 1 + \sum _{j=1}^{s_{\max }-1} (k_{\max })^j \\&\le s_{\max }(k_{\max })^{s_{\max }-1}. \end{aligned}$$

Then, by the definitions of \(\ell (s_{\max },\hat{\delta })\) and \(\hat{\delta }\), it follows that

$$\begin{aligned} \ell (s_{\max },\hat{\delta })&= \mathcal {O}\left( s_{\max }+ \log (1/\hat{\delta })\right) \\&= \mathcal {O}\left( s_{\max }+ \log (s_{\max }) + (s_{\max }-1)\log (k_{\max }) + \log (1/\delta )\right) \\&= \mathcal {O}\left( s_{\max }\log (k_{\max }) + \log (1/\delta )\right) , \end{aligned}$$

as desired. \(\square \)
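
A quick numerical sanity check of the first chain of inequalities above (hypothetical code, for illustration only):

```python
from math import comb

k_max, s_max = 1000, 4
lhs = sum(comb(k_max, j) for j in range(s_max))  # sum_{j=0}^{s_max - 1} C(k_max, j)
rhs = s_max * k_max ** (s_max - 1)               # s_max * k_max^(s_max - 1)
assert lhs <= rhs                                 # the bound is loose but valid
```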



Cite this article

Curtis, F.E., O’Neill, M.J. & Robinson, D.P. Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization. Math. Program. 205, 431–483 (2024). https://doi.org/10.1007/s10107-023-01981-1

