Abstract
A worst-case complexity bound is proved for a sequential quadratic optimization (commonly known as SQP) algorithm that has been designed for solving optimization problems involving a stochastic objective function and deterministic nonlinear equality constraints. Barring additional terms that arise due to the adaptivity of the monotonically nonincreasing merit parameter sequence, the proved complexity bound is comparable to that known for the stochastic gradient algorithm for unconstrained nonconvex optimization. The overall complexity bound, which accounts for the adaptivity of the merit parameter sequence, shows that a result comparable to the unconstrained setting (with additional logarithmic factors) holds with high probability.
References
Berahas, A.S., Curtis, F.E., O’Neill, M.J., Robinson, D.P.: A Stochastic Sequential Quadratic Optimization Algorithm for Nonlinear Equality Constrained Optimization with Rank-Deficient Jacobians. arXiv:2106.13015 (2021)
Berahas, A.S., Curtis, F.E., Robinson, D.P., Zhou, B.: Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM J. Optim. 31, 1352–1379 (2021)
Bertsekas, D.P.: Network Optimization: Continuous and Discrete Models. Athena Scientific, Belmont (1998)
Betts, J.T.: Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. SIAM, Philadelphia (2010)
Byrd, R.H., Lopez-Calva, G., Nocedal, J.: A line search exact penalty method using steering rules. Math. Program. 133, 39–73 (2012)
Byrd, R.H., Nocedal, J., Waltz, R.A.: Steering exact penalty methods for nonlinear programming. Optim. Methods Softw. 23, 197–213 (2008)
Cartis, C., Gould, N.I.M., Toint, P.L.: An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA J. Numer. Anal. 32, 1662–1695 (2012). https://doi.org/10.1093/imanum/drr035
Cartis, C., Gould, N.I.M., Toint, P.L.: On the complexity of finding first-order critical points in constrained nonlinear optimization. Math. Program. 144, 93–106 (2014)
Chen, C., Tung, F., Vedula, N., Mori, G.: Constraint-aware deep neural network compression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 400–415 (2018)
Curtis, F.E.: A penalty-interior-point algorithm for nonlinear constrained optimization. Math. Program. Comput. 4, 181–209 (2012)
Curtis, F.E., Jiang, H., Robinson, D.P.: An adaptive augmented Lagrangian method for large-scale constrained optimization. Math. Program. 152, 201–245 (2015)
Curtis, F.E., Robinson, D.P.: Exploiting negative curvature in deterministic and stochastic optimization. Math. Program. 176, 69–94 (2019)
Curtis, F.E., Robinson, D.P., Samadi, M.: Complexity analysis of a trust funnel algorithm for equality constrained optimization. SIAM J. Optim. 28, 1533–1563 (2018)
Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29, 207–239 (2019)
Durrett, R.: Probability: Theory and Examples, vol. 49, Cambridge University Press (2019)
Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155, 267–305 (2016)
Grapiglia, G.N., Yuan, Y.-X.: On the complexity of an augmented Lagrangian method for nonconvex optimization. IMA J. Numer. Anal. 41, 1546–1568 (2020)
Han, S.P., Mangasarian, O.L.: Exact penalty functions in nonlinear programming. Math. Program. 17, 251–269 (1979). https://doi.org/10.1007/BF01588250
Hazan, E., Luo, H.: Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, PMLR, pp. 1263–1271 (2016)
Kupfer, F., Sachs, E.W.: Numerical solution of a nonlinear parabolic control problem by a reduced SQP method. Comput. Optim. Appl. 1, 113–135 (1992)
Li, X., Orabona, F.: A high probability analysis of adaptive SGD with momentum. arXiv: 2007.14294 (2020)
Mongeau, M., Sartenaer, A.: Automatic decrease of the penalty parameter in exact penalty function methods. Eur. J. Oper. Res. 83, 686–699 (1995)
Na, S., Anitescu, M., Kolar, M.: An Adaptive Stochastic Sequential Quadratic Programming with Differentiable Exact Augmented Lagrangians. arXiv:2102.05320 (2021)
Na, S., Mahoney, M.W.: Asymptotic convergence rate and statistical inference for stochastic sequential quadratic programming. arXiv:2205.13687 (2022)
Nandwani, Y., Pathak, A., Mausam, Singla, P.: A primal dual formulation for deep learning with constraints. In: NeurIPS (2019)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
Nocedal, J., Wächter, A., Waltz, R.A.: Adaptive barrier update strategies for nonlinear interior methods. SIAM J. Optim. 19, 1674–1693 (2009)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Science & Business Media (2006)
Rees, T., Dollar, H.S., Wathen, A.J.: Optimal solvers for PDE-constrained optimization. SIAM J. Sci. Comput. 32, 271–298 (2010)
Roy, S. K., Mhammedi, Z., Harandi, M.: Geometry aware constrained optimization techniques for deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4469 (2018)
Stirzaker, D.: Elementary probability. Cambridge University Press (2003)
Wilson, R.B.: A Simplicial Algorithm for Concave Programming. Ph.D. Dissertation, Graduate School of Business Administration, Harvard University (1963)
Funding
Funding was provided by National Science Foundation (Grant Nos. 2030859, CCF-2008484) and Office of Naval Research (Grant No. N00014-21-1-2532).
Appendices
A Proof of Theorem 1 (Deterministic Algorithm Complexity)
In this appendix, we prove Theorem 1, which states a worst-case complexity bound for Algorithm 2.1 of [2]. We refer to quantities defined and employed in the analysis in [2]. In particular, throughout this appendix, for all \(k \in \mathbb {N}^{}\), we suppose that \(g_k = \nabla f(x_k)\) and that \(d_k = u_k + v_k\), with \(u_k \in {{\,\textrm{Null}\,}}(J_k)\) and \(v_k \in {{\,\textrm{Range}\,}}(J_k^\top )\), is the search direction computed by solving the SQP subproblem with this choice of \(g_k\). As seen in [2], the convergence properties of Algorithm 2.1 in that paper are driven by the reduction in a model of the merit function achieved in each iteration. Our first lemma provides a useful lower bound for such a reduction.
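To make the roles of the two step components concrete, the following minimal NumPy sketch (illustrative only; the function and variable names are hypothetical, and this is not the subproblem solve of [2]) splits a given direction into a component in \({{\,\textrm{Range}\,}}(J_k^\top )\) and a component in \({{\,\textrm{Null}\,}}(J_k)\) via an orthogonal decomposition.

```python
import numpy as np

def decompose_step(J, d):
    """Split d into v in Range(J^T) and u in Null(J) with d = u + v.

    Generic orthogonal decomposition for illustration; it is not the
    normal/tangential step computation used by Algorithm 2.1 of [2].
    """
    # v is the orthogonal projection of d onto Range(J^T):
    # solve min_w ||J^T w - d||_2 and set v = J^T w.
    w, *_ = np.linalg.lstsq(J.T, d, rcond=None)
    v = J.T @ w
    u = d - v  # the remainder is orthogonal to Range(J^T), i.e., lies in Null(J)
    return u, v

# Example with random data: m = 3 equality constraints, n = 10 variables.
rng = np.random.default_rng(0)
J = rng.standard_normal((3, 10))
d = rng.standard_normal(10)
u, v = decompose_step(J, d)
print(np.linalg.norm(J @ u))  # ~1e-15: u is (numerically) in Null(J)
```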
Lemma 13
Define \((\kappa _{uv},\kappa _H,\kappa _v,\tau _{\min },\zeta ,\sigma ) \in (0,\infty )^5 \times (0,1)\) as in [2] and let
Then, for any \(\varepsilon \in (0,1)\), if \(\Vert g_k + J_k^\top y_k\Vert > \varepsilon \) and/or \(\sqrt{\Vert c_k\Vert _1} > \varepsilon \), then
Proof
Consider arbitrary \((\varepsilon ,k) \in (0,1) \times \mathbb {N}^{}\) such that \(\Vert g_k + J_k^\top y_k\Vert > \varepsilon \) and/or \(\sqrt{\Vert c_k\Vert _1} > \varepsilon \). Let us consider two cases. First, suppose that \(\Vert c_k\Vert _1 > \hat{\kappa } \varepsilon ^2\). Then, by [2, equation (2.9)],
which implies (47), as desired. Second, suppose that \(\Vert c_k\Vert _1 \le \hat{\kappa } \varepsilon ^2 \le \varepsilon ^2\), which by the definition of \((\varepsilon ,k)\) implies that \(\Vert g_k + J_k^\top y_k\Vert > \varepsilon \). It follows from this fact that \(\Vert d_k\Vert > \varepsilon /\kappa _H\); indeed, if \(\Vert d_k\Vert \le \varepsilon /\kappa _H\), then by [2, equation (2.6) and Assumption 2.4] one would find
which is a contradiction. Hence, \(\Vert d_k\Vert > \varepsilon /\kappa _H\), and by [2, Lemma 2.9], it follows that \(\Vert v_k\Vert ^2 \le \kappa _v \Vert c_k\Vert \le \kappa _v \Vert c_k\Vert _1 \le \kappa _v \hat{\kappa } \varepsilon ^2\), which combined shows that
From this fact and the definition of \(\hat{\kappa }\), it follows that
which along with [2, Lemma 2.10] implies \(d_k^\top H_k d_k \ge \tfrac{1}{2}\zeta \Vert u_k\Vert ^2 \ge \tfrac{1}{2}\zeta \kappa _{uv} \kappa _v \hat{\kappa } \varepsilon ^2\). Thus,
which implies (47), as desired. \(\square \)
We now prove Theorem 1, further details of which are provided in the statement below.
Theorem 4
Define \((\tau _{-1},f_{\text {low}},\alpha _{\min },\tau _{\min },\eta ,\sigma ) \in (0,\infty )^4 \times (0,1)^2\) as in [2] and \(({\hat{\kappa }},{\tilde{\kappa }}) \in (0,1] \times (0,\infty )\) as in (46). Then, for any \(\varepsilon \in (0,1)\), Theorem 1 holds with (4) given by
Proof
To derive a contradiction, suppose that (3) fails to hold for every \(k \in \{0,\dots ,{\overline{K}}_\varepsilon \}\). Then, by Lemma 13 along with [2, equation (2.10) and Lemma 2.17], it follows for all such \(k\) that
By the definition of \(\phi \), this means for all such k that
Summing this inequality for all \(k \in \{0,\dots ,{\overline{K}}_\varepsilon \}\), one can deduce that
Since \(\{\tau _k\}\) is monotonically nonincreasing, \(\Vert c_{{\overline{K}}_\varepsilon +1}\Vert _1 \ge 0\), and \(f_k \ge f_{\text {low}}\) for all \(k \in \mathbb {N}^{}\),
Rearranging this inequality, one arrives at the conclusion that
which is a contradiction. Therefore, one arrives at the desired conclusion that Algorithm 2.1 yields an iterate satisfying (3) in at most \({\overline{K}}_\varepsilon \) iterations. \(\square \)
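Since the displayed inequalities of the proof are not reproduced above, the following schematic (with a generic constant \(c > 0\) and lower bound \(\phi_{\mathrm{low}}\) standing in for the precise quantities from [2]) records the telescoping argument that underlies the bound: a per-iteration merit reduction of order \(\varepsilon^2\) can be sustained for at most order \(\varepsilon^{-2}\) iterations.

```latex
% Schematic of the telescoping argument; c > 0 and phi_low are generic
% stand-ins, not the exact constants of Theorem 4.
\begin{align*}
  ({\overline{K}}_\varepsilon + 1)\, c\, \varepsilon^2
    &\le \sum_{k=0}^{{\overline{K}}_\varepsilon}
        \bigl(\phi(x_k,\tau_k) - \phi(x_{k+1},\tau_k)\bigr)
     \le \phi(x_0,\tau_{-1}) - \phi_{\mathrm{low}} \\
  &\Longrightarrow\quad
  {\overline{K}}_\varepsilon + 1
    \le \frac{\phi(x_0,\tau_{-1}) - \phi_{\mathrm{low}}}{c\,\varepsilon^2},
\end{align*}
% where the second inequality uses the monotonicity of {tau_k} and f_k >= f_low.
```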
B Proof of Lemma 9
In this appendix, we prove Lemma 9. Toward this end, we first prove that, for any \(\delta \in (0,1)\), with \(\hat{\delta }\) as defined in (32) and \(\ell (s_{\max },\hat{\delta })\) as defined in (33), one finds
We build to this result, ultimately proved as Theorem 5, with a series of preliminary lemmas.
As our first preliminary result, we state a particular form of Chernoff’s bound [31] in the following lemma, which will prove instrumental in deriving (48).
Lemma 14
For any \(k \in \mathbb {N}^{}\), let \(\{Y_0,\dots ,Y_k\}\) be independent Bernoulli random variables. Then, for any \(s \in \mathbb {N}^{}\) and \(\bar{\delta } \in (0,1)\), it follows that
Proof
Suppose that \(\mu \ge \ell (s,{\bar{\delta }})\). By the multiplicative form of Chernoff’s bound, it follows for \(\rho := 1 - s/\mu \) (which is in the interval (0, 1) by (49)) that
Hence, to prove the result, all that remains is to show that \(e^{-\tfrac{1}{2}\mu (1-s/\mu )^2} \le \bar{\delta }\), i.e., that \(-\tfrac{1}{2}\mu (1-s/\mu )^2 \le \log (\bar{\delta })\). Using \(\log (\bar{\delta }) = -\log (1/\bar{\delta })\), this inequality is equivalent to
which holds if and only if \(\mu ^2 - 2\mu (s+\log (1/\bar{\delta })) + s^2 \ge 0\). Viewing the left-hand side of this inequality as a convex quadratic function in \(\mu \), one finds that the inequality holds as long as \(\mu \) is greater than or equal to the positive root of the quadratic, i.e.,
This holds since \(\mu \ge \ell (s,{\bar{\delta }})\); hence, the result is proved. \(\square \)
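For completeness, the positive root invoked in the final step can be written out explicitly; under the assumption that \(\ell (s,\bar{\delta })\), as defined in the omitted display (49), dominates this root, the condition \(\mu \ge \ell (s,\bar{\delta })\) indeed implies the required inequality:

```latex
% Positive root of the convex quadratic
%   q(mu) = mu^2 - 2 mu (s + log(1/delta_bar)) + s^2:
\begin{equation*}
  \mu \;\ge\; s + \log(1/\bar{\delta})
      + \sqrt{\bigl(s + \log(1/\bar{\delta})\bigr)^2 - s^2}
  \;=\; s + \log(1/\bar{\delta})
      + \sqrt{2 s \log(1/\bar{\delta}) + \log^2(1/\bar{\delta})}.
\end{equation*}
```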
Now, we turn our attention to proving (48). For any realization of a run of the algorithm up to iteration \(k \in [k_{\max }]\), let \(w_k\) denote the number of times that the merit parameter has been decreased up to the beginning of iteration k and let \(\overline{p}_k\) denote the probability that the merit parameter is decreased during iteration k. The signature of a realization up to iteration \(k \in \mathbb {N}^{}\) is \((\overline{p}_0,\dots ,\overline{p}_k,w_0,\dots ,w_k)\), which encodes all of the pertinent information regarding the behavior of the merit parameter sequence up to the start of iteration k.
One could imagine using all possible signatures to define a tree whereby each node contains a subset of all realizations of the algorithm. To construct such a tree, one could first consider the root node, which could be denoted by \({\tilde{N}}(\overline{p}_0,w_0)\), where \(\overline{p}_0\) is uniquely defined by the starting conditions of our algorithm and \(w_0 = 0\). All realizations of our algorithm follow the same initialization, so \(\overline{p}_0\) and \(w_0\) would be in the signature of every realization. Now, one could define a node \({\tilde{N}}(\overline{p}_{[k]},w_{[k]})\) at depth \(k \in [k_{\max }]\) (where the root node has a depth of 0) in the tree as the set of all realizations of our algorithm for which the signature of the realization up to iteration \(k\) is \((\overline{p}_0,\dots ,\overline{p}_k,w_0,\dots ,w_k)\). One could then define the edges in the tree by connecting nodes at adjacent levels, where node \({\tilde{N}}(\overline{p}_{[k]},w_{[k]})\) is connected to node \({\tilde{N}}({\bar{p}}_{[k]},\overline{p}_{k+1},w_{[k]},w_{k+1})\) for any \(\overline{p}_{k+1} \in [0,1]\) and \(w_{k+1} \in \{w_k,w_k+1\}\).
Unfortunately, the construction described in the previous paragraph may lead to tree nodes that represent realizations that occur with probability zero. In order to remedy this, we instead construct a tree whose nodes contain all realizations whose probability signatures fall within specified intervals. To define such intervals, consider an arbitrary \(B \in \mathbb {N}^{} \setminus \{0\}\) and let us restrict the sequence of values \(p_{[k]}\) used to define our nodes to those with
For \(p \in \{0, 1/B, \dots , (B-1)/B\}\), these define the open probability intervals \(\iota (p)\) given by
Now, we can construct our tree as follows. As before, first consider the root node, which we denote by \(N(p_0,w_0)\), where \(p_0 \in \{0, 1/B, \dots , (B-1)/B\}\) is uniquely defined by the starting conditions of our algorithm so that \(\mathbb {P}[\mathcal {T}_0 < \tau _{-1} | \mathcal {F}_0] \in \iota (p_0)\) and \(w_0 = 0\). All realizations of our algorithm follow the same initialization, so with \({\bar{p}}_0 = \mathbb {P}[\mathcal {T}_0 < \tau _{-1} | \mathcal {F}_0]\) one finds that \({\bar{p}}_0 \in \iota (p_0)\) and \(w_0\) are in the signature of every realization. We define a node \(N(p_{[k]},w_{[k]})\) at depth \(k \in [k_{\max }]\) as the set of all realizations for which the signature of the realization at iteration k exactly matches \(w_{[k]}\) and has probabilities that fall within the intervals defined by \(p_{[k]}\); i.e., a realization with signature \((\overline{p}_{[k]},w_{[k]})\) is a member of \(N(p_{[k]},w_{[k]})\) if and only if, for all \(j \in [k]\), one finds that \(\overline{p}_j \in \iota (p_j)\). The edges in the tree connect nodes in adjacent levels, where \(N(p_{[k]},w_{[k]})\) is connected to \(N(p_{[k]},p_{k+1},w_{[k]},w_{k+1})\) for any \(p_{k+1} \in \{0,1/B,\dots ,(B-1)/B\}\) and \(w_{k+1} \in \{w_k,w_k+1\}\).
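As a concrete (purely illustrative) picture of this bookkeeping, the sketch below maps a probability to its grid value \(p \in \{0, 1/B, \dots , (B-1)/B\}\) and keys nodes by the pair of histories \((p_{[k]},w_{[k]})\). The half-open interval convention used for \(\iota (p)\) is an assumption made only for this sketch, since the displayed definition of \(\iota (p)\) is not reproduced above.

```python
from collections import defaultdict

def bucket(prob: float, B: int) -> float:
    """Map a probability in [0, 1] to a grid value p in {0, 1/B, ..., (B-1)/B}.

    Assumes iota(p) = [p, p + 1/B) for p < (B-1)/B and iota((B-1)/B) = [(B-1)/B, 1];
    this interval convention is an illustrative assumption, not taken from the text.
    """
    return min(int(prob * B), B - 1) / B

# A node N(p_[k], w_[k]) is identified by the two histories; here each node
# simply counts the realizations whose signature falls into it.
node_counts = defaultdict(int)

def record_realization(prob_history, decrease_history, B):
    """Assign one realization to the node determined by its signature up to iteration k."""
    p_hist = tuple(bucket(pb, B) for pb in prob_history)  # (p_0, ..., p_k)
    w_hist = tuple(decrease_history)                      # (w_0, ..., w_k)
    node_counts[(p_hist, w_hist)] += 1

# Example: one realization over three iterations with B = 4.
record_realization([0.05, 0.30, 0.30], [0, 0, 1], B=4)
print(dict(node_counts))  # {((0.0, 0.25, 0.25), (0, 0, 1)): 1}
```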
Notationally, since the behavior of the algorithm up to iteration \(k \in \mathbb {N}^{}\) is determined by the initial conditions and stochastic gradients in \(G_{[k-1]}\), we write
to denote the event that the signature of the algorithm up to k is a member of \(N(p_{[k]},w_{[k]})\). The initial condition, denoted for consistency as \(G_{[-1]} \in N(p_0,w_0)\), occurs with probability one. Based on the description above, the nodes of our tree satisfy: For any node at a depth of \(k \ge 2\), the event \(G_{[k-1]} \in N(p_{[k]},w_{[k]})\) occurs if and only if
where \(\mathcal {I}[\cdot ]\) denotes the indicator function of any given event.
Let us now define certain important sets of nodes in the tree. First, let
be the set of nodes at which the sum of the elements of \(p_{[k]}\) is sufficiently small and either \(w_k\) has reached \(s_{\max }\) or \(k\) has reached \(k_{\max }\). A node in this set is of interest since, due to the iteration and/or merit parameter decrease limit having been reached, the probability is zero that a certain “bad” event can occur over all realizations with signatures that are members of the node; see Lemma 15 below. Second, let
be the nodes in the complement of \(\mathcal {L}_{\text {good}}\) at which the sum of the elements of \(p_{[k]}\) has exceeded the threshold \(\ell (s_{\max },{\hat{\delta }})+1\). A node in this set is of interest since, due to this threshold having been exceeded, all realizations with signatures that are members of this node correspond to poor behavior of the algorithm (and there is no need to consider the behavior of the algorithm beyond this point). Going forward, we restrict attention to the tree defined by the root node and all paths from the root node that terminate at a node contained in \(\mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\). It is clear from this restriction and the definitions of \(\mathcal {L}_{\text {good}}\) and \(\mathcal {L}_{\text {bad}}\) that this tree is finite with the elements of \(\mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\) being leaves.
Let us now define relationships between nodes. The parent of a node is defined as
On the other hand, the children of node \(N(p_{[k]},w_{[k]})\) are defined as
This ensures that paths down the tree terminate at nodes in \(\mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\), making these nodes the leaves of the tree. For convenience in the remainder of our discussions, let \(C(\emptyset ) = \emptyset \).
We define the height of node \(N(p_{[k]},w_{[k]})\) as the length of the longest path from \(N(p_{[k]},w_{[k]})\) to a leaf node, i.e., the height is denoted as
where \(C^j(N(p_{[k]},w_{[k]}))\) is shorthand for applying the mapping \(C(\cdot )\) consecutively j times. From this definition, \(h(N(p_{[k]},w_{[k]})) = 0\) for all \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {good}}\cup \mathcal {L}_{\text {bad}}\).
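A minimal recursive sketch of this height computation over an explicit children map (the dictionary-based representation and the names used are assumptions for illustration only):

```python
def height(node, children):
    """Length of the longest path from `node` to a leaf.

    `children` maps each node to the list C(node) of its children;
    leaves (the nodes in L_good or L_bad) map to the empty list.
    """
    kids = children.get(node, [])
    if not kids:  # leaf: h = 0, matching h(N) = 0 on L_good and L_bad
        return 0
    return 1 + max(height(child, children) for child in kids)

# Tiny example tree: root -> {a, b}, a -> {c}; b and c are leaves.
children = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(height("root", children))  # 2
```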
Next, let us define two more sets of nodes that will be useful later. Let \(C_{\text {dec}}(N(p_{[k]},w_{[k]}))\) denote the set of children of \(N(p_{[k]},w_{[k]})\) such that the merit parameter decreases and let \(C_{\text {dec}}^c(N(p_{[k]},w_{[k]}))\) denote the set of children of \(N(p_{[k]},w_{[k]})\) such that it does not decrease, so
and
Finally, let us define the event \(E_{\text {bad},B}\) as the event that for some \(j \in [k_{\max }]\) one finds
With respect to our goal of proving (48), the event \(E_{\text {bad},B}\) is of interest since it is the event that the given probabilities accumulated up to iteration \(j \in [k_{\max }]\) (and beyond) exceed the threshold found in (48) plus a factor that is inversely proportional to B.
Let us now prove some properties of the leaf nodes.
Lemma 15
For any \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {good}}\), one finds
On the other hand, for all \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\), one finds
Proof
Consider an arbitrary index \(k \in [k_{\max }]\) and an arbitrary pair \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {good}}\). By the definition of \(\mathcal {L}_{\text {good}}\), it follows that
Since the maximum depth of a node is \(k_{\max }\) and the gap between the discrete values in (50) is \(\tfrac{1}{B}\), it follows along with (55) that
Therefore, for any \(j \in \{1,\dots ,k\}\), one finds from conditional probability that
In addition, (54) cannot hold for \(j=0\) since \(\ell (s_{\max },\hat{\delta })+1 > 1\). Hence, along with the conclusion above, it follows that \(E_{\text {bad},B}\) does not occur when a signature up to iteration \(j \in \{1,\dots ,k\}\) falls into a node along any path from the root node to \(N(p_{[k]},w_{[k]})\). Now, by the definition of \(\mathcal {L}_{\text {good}}\), at least one of \(w_{k} = s_{\max }\) or \(k = k_{\max }\) holds. Let us consider each case in turn. If \(k = k_{\max }\), then it follows by the preceding arguments that
Otherwise, if \(w_{k} = s_{\max }\), then it follows by the definition of the event \(E\) that \(\mathbb {P}[\mathcal {T}_i < \mathcal {T}_{i-1} | \mathcal {F}_i] = 0\) for all \(i \in \{k,\dots ,k_{\max }\}\), and therefore the equation above again follows. Overall, it follows that \(\mathbb {P}[G_{[k-1]} \in N(p_{[k]},w_{[k]}) \wedge E_{\text {bad},B}| E] = 0\), as desired.
Next, we remark that
since
Thus, consider arbitrary \(k \in \mathbb {N}^{} \setminus \{0\}\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\). One finds
Hence, using the initial condition that \(G_{[-1]} \in N(p_0,w_0)\), it follows that
Our goal is to bound (56). Toward this end, define
which by the definition of \(w_{[k]}\) form a partition of \(\{1,\dots ,k\}\). For any \(i \in \mathcal {I}_{\text {dec}}\),
On the other hand, for any \(i \in \mathcal {I}_{\text {dec}}^c\),
Thus, it follows that the latter term in (56) satisfies
Now let us bound this term. By the definition of \(\mathcal {L}_{\text {bad}}\), one finds that
In addition, by the definition of \(s_{\max }\), one has \(w_k \le s_{\max }\) for all \(k \in [k_{\max }]\), from which it follows that \(|\mathcal {I}_{\text {dec}}| \le s_{\max }\). Now, let \(\{Z_0,\dots ,Z_{k-1}\}\) be independent Bernoulli random variables such that, for all \(i \in \{0,\dots ,k-1\}\), one has
By (57), it follows from the definition of these random variables that \(\sum _{i=0}^{k-1} \mathbb {P}[Z_i=1] \ge \ell (s_{\max }, \hat{\delta })\). Then, it follows by Lemma 14 and the preceding argument that
Combining this with (56), the desired conclusion follows. \(\square \)
Next, we present a lemma about nodes in the sets defined in (52) and (53). The lemma essentially states that a certain probability of interest, defined as the product of probabilities along a path to a child node, can be reduced to a product of probabilities along the path to the child’s parent node by partitioning the children into those at which a merit parameter decrease has occurred and those at which it has not.
Lemma 16
For all \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\), one finds that
and, similarly, one finds that
where by the definitions of \(C_{\text {dec}}\) and \(C_{\text {dec}}^c\) it follows that the value of \(w_{k+1}\) in the sum in the former equation is one greater than the value of \(w_{k+1}\) in the sum in the latter equation.
Proof
One finds that
The desired conclusion follows since, by the definition of \(C_{\text {dec}}(N(p_{[k]},w_{[k]}))\), all elements in the latter sum have \(S_k = w_{k+1} = w_{k}+1\), meaning that the sum is exhaustive over all possible outcomes with the same conditions, and hence the sum is 1.
The proof of the second desired conclusion follows in the same manner with \(C_{\text {dec}}^c\) in place of \(C_{\text {dec}}\) and \(S_k=w_{k+1}=w_{k}\) in place of \(S_k=w_{k+1}=w_k+1\). \(\square \)
Next, we derive a result for certain nodes containing realizations with \(w_{k} = s_{\max }- 1\).
Lemma 17
For any \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) such that \(w_{k} = s_{\max } - 1\) and \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), it follows that
Proof
By the supposition that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), it follows that any \((p_{[k]},w_{[k]})\) with \(h(N(p_{[k]},w_{[k]})) = 0\) has \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\), in which case the desired conclusion follows from Lemma 15. With this base case being established, we now prove the result by induction. Suppose that the result holds for all \((p_{[k]},w_{[k]})\) such that \(w_{k} = s_{\max } - 1\), \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), and \(h(N(p_{[k]},w_{[k]})) \le j\) for some \(j \in \mathbb {N}^{}\). Our goal is to show that the same statement holds with j replaced by \(j+1\). For this purpose, consider arbitrary \((p_{[k]},w_{[k]})\) such that \(w_{k} = s_{\max } - 1\), \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), and \(h(N(p_{[k]},w_{[k]})) = j + 1\). Observe that by the definition of the child operators C, \(C_{\text {dec}}\), and \(C_{\text {dec}}^c\), it follows that
Since \(w_{k} = s_{\max }-1\), it follows from the definition of \(C_{\text {dec}}\) that for any \((p_{k+1},w_{k+1})\) with \(N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}(N(p_{[k]},w_{[k]}))\), one finds that \(w_{k+1} = w_{k}+1 = s_{\max }\). By the definition of \(s_{\max }\), this implies that \(\mathbb {P}[\mathcal {T}_{k+1} < \mathcal {T}_k | \mathcal {F}_{k+1}] = 0\), so \(p_{k+1} = 0\). In addition, since \(C(N(p_{[k]},w_{[k]})) \ne \emptyset \), one has \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {bad}}\), so it follows that \(\sum _{i=0}^{k+1} p_{i} \le \ell (s_{\max },\hat{\delta })+1\), meaning \(N(p_{[k+1]},w_{[k+1]}) \in \mathcal {L}_{\text {good}}\). Consequently, from above and Lemma 15, one finds
Since \(h(N(p_{[k]},w_{[k]})) = j+1\), it follows that \(h(N(p_{[k+1]},w_{[k+1]})) \le j\) for any \((p_{[k+1]},w_{[k+1]})\) with \(N(p_{[k+1]},w_{[k+1]}) \in C_{\text {dec}}^c(N(p_{[k]},w_{[k]}))\). Therefore, by the induction hypothesis and the result of Lemma 16, it follows that
which completes the proof. \(\square \)
Using the preceding lemma as a base case, we now perform induction on the difference \(s_{\max }- w_{k}\) to prove a similar result for arbitrary \(s_{\max }\).
Lemma 18
For any \(k \in [k_{\max }]\) and \((p_{[k]},w_{[k]})\) with \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), it follows that
Proof
For all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\) and \(h(N(p_{[k]},w_{[k]})) = 0\), it follows that \(N(p_{[k]},w_{[k]}) \in \mathcal {L}_{\text {bad}}\). The result holds in this case according to Lemma 15 since one finds that \(\sum _{j=0}^{\min \{s_{\max }-w_{k}-1,h(N(p_{[k]},w_{[k]}))\}} \left( {\begin{array}{c}h(N(p_{[k]},w_{[k]}))\\ j\end{array}}\right) = \left( {\begin{array}{c}0\\ 0\end{array}}\right) = 1\). On the other hand, for all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\) and \(s_{\max }- w_{k} = 1\), the result follows from Lemma 17. Hence, to prove the remainder of the result by induction, one may assume that it holds for all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), \(h(N(p_{[k]},w_{[k]})) \le t\) for some \(t \in \mathbb {N}^{}\), and \(s_{\max }- w_{k} = r\) for some \(r \in \mathbb {N}^{} {\setminus } \{0\}\), and show that it holds for all \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), \(h(N(p_{[k]},w_{[k]})) = t + 1\), and \(s_{\max }- w_{k} = r\). (Notice that the base cases above show that the result holds for \(t=0\) and any \(r \in \mathbb {N}^{} \setminus \{0\}\), as well as for any \(t \in \mathbb {N}^{}\) and \(r=1\). Hence, one may complete the induction over the index pairs by showing that if it holds for (t, r), then it holds for \((t+1,r)\), as claimed above.)
Consider arbitrary \((p_{[k]},w_{[k]})\) such that \(N(p_{[k]},w_{[k]}) \not \in \mathcal {L}_{\text {good}}\), \(h(N(p_{[k]},w_{[k]})) = t + 1\), and \(s_{\max }- w_{k} = r\). By the definitions of C, \(C_{\text {dec}}\), and \(C_{\text {dec}}^c\), it follows that
Further by the definition of \(C_{\text {dec}}\), it follows that \(w_{k+1} = w_k + 1\) (thus \(s_{\max }- w_{k+1} = r-1\)) for all terms in the former sum on the right-hand side, whereas by the definition of \(C_{\text {dec}}^c\) it follows that \(w_{k+1} = w_k\) (thus \(s_{\max }- w_{k+1} = r\)) for all terms in the latter sum on the right-hand side. Moreover, from \(h(N(p_{[k]},w_{[k]})) = t+1\), it follows that \(h(N(p_{[k+1]},w_{[k+1]})) \le t\) for all terms on the right-hand side. Therefore, by the induction hypothesis, it follows that
which by Lemma 16 implies that
To complete the proof, we need only consider two cases on the relationship between t and r. First, if \(t \le r-2\), then Pascal’s rule implies that
Since \(t \le r-2\), it follows that \(h(N(p_{[k]},w_{[k]})) = t+1 \le r-1 = s_{\max }-w_{k}-1\), which combined with (59) proves the result in this case. Second, if \(t > r-2\), then similarly
Since \(t > r-2\), one has \(h(N(p_{[k]},w_{[k]})) = t+1 > r-1 = s_{\max }-w_{k}-1\), which combined with (59) proves the result for this case as well. \(\square \)
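Both cases combine the binomial sums from the induction hypothesis by applying Pascal's rule termwise; since the corresponding displays are not reproduced above, we record the identity being used:

```latex
% Pascal's rule, applied termwise to merge two sums of binomial
% coefficients with upper index t into one with upper index t + 1:
\begin{equation*}
  \binom{t+1}{j} \;=\; \binom{t}{j} + \binom{t}{j-1}
  \quad \text{for } j \in \{1,\dots ,t\},
  \qquad \binom{t+1}{0} = \binom{t}{0} = 1.
\end{equation*}
```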
We now prove our first main result of this section.
Theorem 5
For any \(\delta \in (0,1)\) with \(\hat{\delta }\) as defined in (32) and \(\ell (s_{\max },\hat{\delta })\) as defined in (33), one finds that (48) holds.
Proof
First, consider the case where \(s_{\max }= 0\). Then, by the definition of \(s_{\max }\),
for all \(k \in [k_{\max }]\), so the result holds trivially.
Now, let \(s_{\max }\in \mathbb {N}^{} \setminus \{0\}\). By construction of our tree and the definitions of \(\mathcal {L}_{\text {good}}\) and \(\mathcal {L}_{\text {bad}}\), one finds that \(h(N(p_0,w_0)) \le k_{\max }\). In addition, by the definition of \(s_{\max }\), \(s_{\max }- 1 < k_{\max }\), so \(\min \{s_{\max }-w_0-1,h(N(p_0,w_0))\} = s_{\max }-1 \ge 0\). Consider arbitrary \(B \in \mathbb {N}^{} \setminus \{0\}\) (see (54)). By Lemma 18 and (32),
Therefore, by the definition of \(E_{\text {bad},B}\) (see (54)), it follows that
Now, let us define the event \(E_{\text {good},B}\) for \(B \in \mathbb {N}^{} {\setminus } \{0\}\) as the event that
One sees that \(E_{\text {good},B} \supseteq E_{\text {good},B+1}\) for all such B. Therefore, by the properties of a decreasing sequence of events (see, for example [31, Section 1.5]), it follows that
as desired. \(\square \)
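The limit taken in the last step is justified by the continuity of probability along a decreasing sequence of events (see, e.g., [31, Section 1.5]); in the present notation it reads:

```latex
% Continuity of probability for the decreasing sequence of events
% E_{good,1} \supseteq E_{good,2} \supseteq \cdots:
\begin{equation*}
  \mathbb{P}\Bigl[\,\bigcap_{B=1}^{\infty} E_{\text{good},B}\,\Bigr]
  \;=\; \lim_{B \to \infty} \mathbb{P}\bigl[E_{\text{good},B}\bigr].
\end{equation*}
```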
Next, we present some preliminary results that are required to prove the second statement of Lemma 9. Recall the random index set \(\mathcal {K}_{\tau }\) defined in (34). Our next lemma shows a property about any iteration \(k \in [k_{\max }]\) in which \(k \in \mathcal {K}_{\tau }\).
Lemma 19
Let Assumption 4 hold. Then, one finds that
Proof
In any iteration where \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\), it follows that \(\mathcal {T}_k^{\text {trial},\text {true}}< \infty \), so
and thus
By the definition of \(\mathcal {T}_k\), if
in an iteration such that \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\), then
meaning that \(\mathcal {T}_k < \mathcal {T}_{k-1}\). Observe that the event \(k \in \mathcal {K}_{\tau }\) only depends on the history of the algorithm prior to iteration k and thus the \(\sigma \)-algebra generated by \(k \in \mathcal {K}_{\tau }\) is included in \(\mathcal {F}_k\). Therefore, by [15, Theorem 4.1.13], for any random variable Z, we have
Therefore, with \({\varvec{1}}({\tilde{E}})\) denoting the indicator of event \({\tilde{E}}\), it follows from Assumption 4 that
as desired. \(\square \)
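The step citing [15, Theorem 4.1.13] uses the defining property of conditional expectation: since the event \(k \in \mathcal {K}_{\tau }\) is \(\mathcal {F}_k\)-measurable, the indicator may be passed inside the conditional expectation. In the notation used above, the property reads:

```latex
% Defining property of conditional expectation: for an event A in F_k
% and an integrable random variable Z,
\begin{equation*}
  \mathbb{E}\bigl[\,{\varvec{1}}(A)\, Z\,\bigr]
  \;=\; \mathbb{E}\bigl[\,{\varvec{1}}(A)\, \mathbb{E}[Z \mid \mathcal{F}_k]\,\bigr].
\end{equation*}
```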
The previous lemma guarantees that in any iteration in which \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\), the probability is at least \(p_{\tau }\) that the merit parameter decreases. By the scheme for setting \(\mathcal {T}_k\),
so one must have \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\) in any iteration when \(\hat{\mathcal {T}}_k < \mathcal {T}_k\). Thus, we can obtain a bound on the number of iterations at which \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_k\) by bounding the number of iterations at which \(\mathcal {T}_k^{\text {trial},\text {true}}< \mathcal {T}_{k-1}\). Now we prove a result relating \(|\mathcal {K}_{\tau }|\) to the probabilities of decreasing the merit parameter over all iterations.
Lemma 20
One finds that
Proof
By the law of total probability,
Now, similar to the previous lemma, we note that \(k \in \mathcal {K}_{\tau }\) only depends on the history of the algorithm prior to iteration k. Denoting the indicator of the event \(k \in \mathcal {K}_{\tau }\) as \({\varvec{1}}(k \in \mathcal {K}_{\tau })\), it follows by Lemma 19 that
where the second to last equality follows by \({\varvec{1}}(k \in \mathcal {K}_{\tau }) \in \mathcal {F}_k\). \(\square \)
Now, we are prepared to prove Lemma 9.
Proof
(Lemma 9) For the first statement, observe that, for any \(k \in [k_{\max }]\), by the definition of \(E_{k,3}\), the event \(\mathcal {T}_k < \mathcal {T}_{k-1}\) must occur whenever \(E_{k,3}\) occurs. Therefore, for any \(k \in [k_{\max }]\), one finds
Equation (35) then follows directly from Theorem 5.
Now, consider the second statement of Lemma 9. The proof follows by combining Theorem 5 and Lemma 20 with the preceding argument. \(\square \)
C Lemma Required for the Proof of Corollary 1
This appendix provides the following lemma, which shows that the order notation result in (5a) and (5b) holds, as required in the proof of Corollary 1.
Lemma 21
Let \(\delta \in (0,1)\), \(\hat{\delta }\) be defined in (32), \(s_{\max }\in \mathbb {N}^{} \setminus \{0\}\) and \(\ell (s_{\max },\hat{\delta })\) be defined in (33). Then,
Proof
Since \(s_{\max }\in \mathbb {N}^{} \setminus \{0\}\), it follows that
Then, by the definitions of \(\ell (s_{\max },\hat{\delta })\) and \(\hat{\delta }\), it follows that
as desired. \(\square \)