Abstract
This chapter studies the robustness of reinforcement learning for discrete-time linear stochastic systems with multiplicative noise evolving in continuous state and action spaces. The robustness of policy iteration, one of the popular methods in reinforcement learning, is a longstanding open problem for the stochastic linear quadratic regulator (LQR) problem with multiplicative noise. A solution in the spirit of input-to-state stability is given, guaranteeing that the solutions of the policy iteration algorithm are bounded and enter a small neighborhood of the optimal solution, whenever the error in each iteration is bounded and small. In addition, a novel off-policy multiple-trajectory optimistic least-squares policy iteration algorithm is proposed, to learn a near-optimal solution of the stochastic LQR problem directly from online input/state data, without explicitly identifying the system matrices. The efficacy of the proposed algorithm is supported by rigorous convergence analysis and numerical results on a second-order example.
Dedicated to Laurent Praly, a beautiful mind
References
Abbasi-Yadkori, Y., Lazic, N., Szepesvari, C.: Model-free linear quadratic control via reduction to expert prediction. In: International Conference on Artificial Intelligence and Statistics (AISTATS) (2019)
Agarwal, R.P.: Difference Equations and Inequalities: Theory, Methods, and Applications, 2nd edn. Marcel Dekker Inc, New York (2000)
Athans, M., Ku, R., Gershwin, S.: The uncertainty threshold principle: some fundamental limitations of optimal decision making under dynamic uncertainty. IEEE Trans. Autom. Control 22(3), 491–495 (1977)
Beghi, A., D’Alessandro, D.: Discrete-time optimal control with control-dependent noise and generalized Riccati difference equations. Automatica 34(8), 1031–1034 (1998)
Bertsekas, D.P.: Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 9(3), 310–335 (2011)
Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena Scientific, Belmont, Massachusetts (2019)
Bian, T., Jiang, Z.P.: Continuous-time robust dynamic programming. SIAM J. Control Optim. 57(6), 4150–4174 (2019)
Bian, T., Wolpert, D.M., Jiang, Z.P.: Model-free robust optimal feedback mechanisms of biological motor control. Neural Comput. 32(3), 562–595 (2020)
Bitmead, R.R., Gevers, M., Wertz, V.: Adaptive Optimal Control: The Thinking Man’s GPC. Prentice-Hall, Englewood Cliffs, New Jersey (1990)
Breakspear, M.: Dynamic models of large-scale brain activity. Nat. Neurosci. 20(3), 340–352 (2017)
Bryson, A.E., Ho, Y.C.: Applied Optimal Control: Optimization, Estimation and Control. Taylor & Francis (1975)
Buşoniu, L., de Bruin, T., Tolić, D., Kober, J., Palunko, I.: Reinforcement learning for control: Performance, stability, and deep approximators. Annu. Rev. Control 46, 8–28 (2018)
Coppens, P., Patrinos, P.: Sample complexity of data-driven stochastic LQR with multiplicative uncertainty. In: The 59th IEEE Conference on Decision and Control (CDC), pp. 6210–6215 (2020)
Coppens, P., Schuurmans, M., Patrinos, P.: Data-driven distributionally robust LQR with multiplicative noise. In: Learning for Dynamics and Control (L4DC), pp. 521–530. PMLR (2020)
De Koning, W.L.: Infinite horizon optimal control of linear discrete time systems with stochastic parameters. Automatica 18(4), 443–453 (1982)
De Koning, W.L.: Compensatability and optimal compensation of systems with white parameters. IEEE Trans. Autom. Control 37(5), 579–588 (1992)
Drenick, R., Shaw, L.: Optimal control of linear plants with random parameters. IEEE Trans. Autom. Control 9(3), 236–244 (1964)
Du, K., Meng, Q., Zhang, F.: A Q-learning algorithm for discrete-time linear-quadratic control with random parameters of unknown distribution: convergence and stabilization. arXiv preprint arXiv:2011.04970 (2020)
Duncan, T.E., Guo, L., Pasik-Duncan, B.: Adaptive continuous-time linear quadratic gaussian control. IEEE Trans. Autom. Control 44(9), 1653–1662 (1999)
Gravell, B., Esfahani, P.M., Summers, T.: Learning robust controllers for linear quadratic systems with multiplicative noise via policy gradient. IEEE Trans. Autom. Control (2019)
Gravell, B., Esfahani, P.M., Summers, T.: Robust control design for linear systems via multiplicative noise. arXiv preprint arXiv:2004.08019 (2020)
Gravell, B., Ganapathy, K., Summers, T.: Policy iteration for linear quadratic games with stochastic parameters. IEEE Control Syst. Lett. 5(1), 307–312 (2020)
Guo, Y., Summers, T.H.: A performance and stability analysis of low-inertia power grids with stochastic system inertia. In: American Control Conference (ACC), pp. 1965–1970 (2019)
Hespanha, J.P., Naghshtabrizi, P., Xu, Y.: A survey of recent results in networked control systems. Proceedings of the IEEE 95(1), 138–162 (2007)
Hewer, G.: An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Trans. Autom. Control 16(4), 382–384 (1971)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, New York (2012)
Jiang, Y., Jiang, Z.P.: Adaptive dynamic programming as a theory of sensorimotor control. Biolog. Cybern. 108(4), 459–473 (2014)
Jiang, Y., Jiang, Z.P.: Robust Adaptive Dynamic Programming. Wiley, Hoboken, New Jersey (2017)
Jiang, Z.P., Bian, T., Gao, W.: Learning-based control: A tutorial and some recent results. Found. Trends Syst. Control. 8(3), 176–284 (2020)
Jiang, Z.P., Lin, Y., Wang, Y.: Nonlinear small-gain theorems for discrete-time feedback systems and applications. Automatica 40(12), 2129–2136 (2004)
Kamalapurkar, R., Walters, P., Rosenfeld, J., Dixon, W.: Reinforcement learning for optimal feedback control: A Lyapunov-based approach. Springer (2018)
Kantorovich, L.V., Akilov, G.P.: Functional Analysis in Normed Spaces. Macmillan, New York (1964)
Kiumarsi, B., Vamvoudakis, K.G., Modares, H., Lewis, F.L.: Optimal and autonomous control using reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2042–2062 (2018)
Knopp, K.: Theory and Application of Infinite Series, 2nd edn. Dover Publications, New York (1990)
Lai, J., Xiong, J., Shu, Z.: Model-free optimal control of discrete-time systems with additive and multiplicative noises. arXiv preprint arXiv:2008.08734 (2020)
Levine, S., Koltun, V.: Continuous inverse optimal control with locally optimal examples. In: International Conference on Machine Learning (ICML) (2012)
Levine, S., Kumar, A., Tucker, G., Fu, J.: Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020)
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. In: International Conference on Learning Representations (ICLR) (2016)
Ljung, L.: System Identification: Theory for the user, 2nd edn. Prentice Hall PTR, Upper Saddle River (1999)
Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, New York (2007)
Monfort, M., Liu, A., Ziebart, B.D.: Intent prediction and trajectory forecasting via predictive inverse linear-quadratic regulation. In: AAAI Conference on Artificial Intelligence (AAAI) (2015)
Morozan, T.: Stabilization of some stochastic discrete-time control systems. Stoch. Anal. Appl. 1(1), 89–116 (1983)
Pang, B., Bian, T., Jiang, Z.-P.: Robust policy iteration for continuous-time linear quadratic regulation. IEEE Trans. Autom. Control (2020)
Pang, B., Jiang, Z.-P.: Robust reinforcement learning: A case study in linear quadratic regulation. In: AAAI Conference on Artificial Intelligence (AAAI) (2020)
Powell, W.B.: From reinforcement learning to optimal control: A unified framework for sequential decisions. arXiv preprint arXiv:1912.03513 (2019)
Praly, L., Lin, S.-F., Kumar, P.R.: A robust adaptive minimum variance controller. SIAM J. Control Optim. 27(2), 235–266 (1989)
Rami, M.A., Chen, X., Zhou, X.Y.: Discrete-time indefinite LQ control with state and control dependent noises. J. Glob. Optim. 23(3), 245–265 (2002)
Åström, K.J., Wittenmark, B.: Adaptive Control, 2nd edn. Addison-Wesley, Reading, Massachusetts (1995)
Sontag, E.D.: Input to state stability: Basic concepts and results. In: Nonlinear and Optimal Control Theory. Lecture Notes in Mathematics, vol. 1932, pp. 163–220. Springer, Berlin (2008)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge, Massachusetts (2018)
Sutton, R.S., Barto, A.G., Williams, R.J.: Reinforcement learning is direct adaptive optimal control. IEEE Control Syst. Mag. 12(2), 19–22 (1992)
Tiedemann, A., De Koning, W.: The equivalent discrete-time optimal control problem for continuous-time systems with stochastic parameters. Int. J. Control 40(3), 449–466 (1984)
Todorov, E.: Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Neural Comput. 17(5), 1084–1108 (2005)
Tu, S., Recht, B.: The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In: Annual Conference on Learning Theory (COLT) (2019)
Xing, Y., Gravell, B., He, X., Johansson, K.H., Summers, T.: Linear system identification under multiplicative noise from multiple trajectory data. In: American Control Conference (ACC), pp. 5157–5261 (2020)
Huang, Y., Zhang, W., Zhang, H.: Infinite horizon LQ optimal control for discrete-time stochastic systems. In: The 6th World Congress on Intelligent Control and Automation (WCICA), vol. 1, pp. 252–256 (2006)
Acknowledgements
Confucius once said, “Virtue is not left to stand alone. He who practices it will have neighbors.” Laurent Praly, the former PhD advisor of the second-named author, is such a beautiful mind. His vision about, and seminal contributions to, control theory, especially nonlinear and adaptive control, have influenced generations of students, including the authors of this chapter. ZPJ is privileged to have had Laurent as his PhD advisor during 1989–1993 and is very grateful to Laurent for introducing him to the field of nonlinear control. It was under Laurent’s close guidance that ZPJ started working, in 1991, on the stability and control of interconnected nonlinear systems, work that laid the foundation for the nonlinear small-gain theory. The research findings presented here are a reflection of Laurent’s vision of the relationships between control and learning. We also thank the U.S. National Science Foundation for its continuous financial support.
Appendices
Appendix 1
The following lemma provides the relationship between operations \({{\,\mathrm{vec}\,}}(\cdot )\) and \({{\,\mathrm{svec}\,}}(\cdot )\).
Lemma 9.5
([40, Page 57]) For \(X\in \mathbb {S}^n\), there exists a unique matrix \(D_n\in \mathbb {R}^{n^2\times \frac{1}{2}n(n+1)}\) with full column rank, such that
\(D_n\) is called the duplication matrix.
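Lemma 9.5 is easy to verify numerically. The sketch below, in Python with NumPy, constructs \(D_n\) under the Magnus–Neudecker convention of [40], where the half-vectorization stacks the on-and-below-diagonal entries column by column (conventions that scale off-diagonal entries by \(\sqrt{2}\) differ only by an invertible diagonal factor), and checks both the identity \({{\,\mathrm{vec}\,}}(X)=D_n{{\,\mathrm{svec}\,}}(X)\) and the full-column-rank claim:

```python
import numpy as np

def vech(X):
    """Stack the on-and-below-diagonal entries of X column by column (svec without scaling)."""
    n = X.shape[0]
    return np.concatenate([X[j:, j] for j in range(n)])

def duplication_matrix(n):
    """Build D_n with vec(X) = D_n @ vech(X) for every symmetric X."""
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, col] = 1.0      # entry (i, j) in column-stacked vec ordering
            if i != j:
                D[i * n + j, col] = 1.0  # its symmetric partner (j, i)
            col += 1
    return D

n = 3
X = np.random.randn(n, n)
X = X + X.T                              # a symmetric test matrix
D = duplication_matrix(n)
assert np.allclose(D @ vech(X), X.flatten(order="F"))    # vec(X) = D_n svec(X)
assert np.linalg.matrix_rank(D) == n * (n + 1) // 2      # full column rank
print("vec(X) = D_n svec(X) verified; rank(D_n) =", np.linalg.matrix_rank(D))
```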
Lemma 9.6
([44, Lemma A.3]) Let \(\mathcal {O}\) be a compact set such that \(\rho (O)<1\) for any \(O\in \mathcal {O}\). Then there exist an \(a_0>0\) and a \(0<b_0<1\), such that
for any \(O\in \mathcal {O}\).
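Lemma 9.6 packages Gelfand's formula \(\Vert O^k\Vert ^{1/k}\rightarrow \rho (O)\) into a uniform geometric envelope. A minimal numerical illustration, with an illustrative non-normal matrix chosen so that \(\Vert O\Vert>1>\rho (O)\) and the transient growth absorbed into \(a_0\) is visible:

```python
import numpy as np

# An illustrative non-normal matrix: spectral radius below 1, norm above 1,
# so ||O^k|| first grows before it decays -- the growth the constant a_0 absorbs.
O = np.array([[0.9, 5.0],
              [0.0, 0.8]])
rho = max(abs(np.linalg.eigvals(O)))
assert rho < 1.0 < np.linalg.norm(O, 2)

b0 = (rho + 1.0) / 2.0       # any decay rate strictly between rho(O) and 1 works
norms = []
Ok = np.eye(2)
for k in range(200):
    norms.append(np.linalg.norm(Ok, 2))
    Ok = Ok @ O

# Gelfand's formula makes sup_k ||O^k|| / b0^k finite; for this matrix the
# transient peak of the ratio occurs well within the first 50 powers.
a0 = max(norms[k] / b0**k for k in range(50))
assert all(norms[k] <= a0 * b0**k + 1e-12 for k in range(200))
print(f"rho = {rho:.3f}, b0 = {b0:.3f}, a0 = {a0:.1f}")
```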
For \(X\in \mathbb {R}^{n\times n}\), \(Y\in \mathbb {R}^{n\times m}\), \(X+\Delta X\in \mathbb {R}^{n\times n}\), \(Y + \Delta Y\in \mathbb {R}^{n\times m}\), supposing X and \(X + \Delta X\) are invertible, the following inequality is repeatedly used:
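The displayed inequality bounds the effect of perturbing both factors of a product \(X^{-1}Y\). Whatever its exact constants, bounds of this kind rest on the exact identity \((X+\Delta X)^{-1}(Y+\Delta Y)-X^{-1}Y=(X+\Delta X)^{-1}\Delta Y-(X+\Delta X)^{-1}\Delta X X^{-1}Y\); the sketch below (with randomly generated illustrative matrices) checks this identity and the resulting norm bound, which is linear in \((\Vert \Delta X\Vert ,\Vert \Delta Y\Vert )\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # invertible by construction
Y = rng.standard_normal((n, m))
dX = 1e-3 * rng.standard_normal((n, n))
dY = 1e-3 * rng.standard_normal((n, m))

Xp = X + dX
lhs = np.linalg.solve(Xp, Y + dY) - np.linalg.solve(X, Y)

# Exact identity behind the perturbation bound:
#   (X+dX)^{-1}(Y+dY) - X^{-1}Y = (X+dX)^{-1} dY - (X+dX)^{-1} dX X^{-1} Y
rhs = np.linalg.solve(Xp, dY) - np.linalg.solve(Xp, dX @ np.linalg.solve(X, Y))
assert np.allclose(lhs, rhs)

# ... and the induced norm bound, linear in (||dX||, ||dY||):
nrm = lambda M: np.linalg.norm(M, 2)
bound = nrm(np.linalg.inv(Xp)) * (nrm(dY) + nrm(dX) * nrm(np.linalg.inv(X)) * nrm(Y))
assert nrm(lhs) <= bound + 1e-12
print(f"error norm {nrm(lhs):.2e} <= bound {bound:.2e}")
```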
Appendix 2
The following property of \(\mathcal {L}_{K}(\cdot )\) is useful.
Lemma 9.7
If K is mean-square stabilizing, then \(\mathcal {L}_K(Y_1)\le \mathcal {L}_K(Y_2)\implies Y_1\ge Y_2\), where \(Y_1,Y_2\in \mathbb {S}^n\).
Proof
Let \(\{x_t\}_{t=0}^\infty \) be the solution of the closed-loop system (9.1) with controller \(u=-Kx\). Then for any \(t\ge 1\)
Since K is mean-square stabilizing,
The proof is complete because \(x_0\) is arbitrary. \(\square \)
Now we are ready to prove Theorem 9.1.
Proof
(Theorem 9.1) By (9.7) and (9.8), for any \(x\in \mathbb {R}^n\),
Thus \(\mathcal {H}(P_1,K_2)\le 0\). By definition, \(P_1>0\) and
Then Lemma 9.1 implies that \(K_2\) is mean-square stabilizing. Inserting (9.7) into the above inequality yields \(\mathcal {L}_{K_2}(P_1) \le \mathcal {L}_{K_2}(P_2)\). This implies \(P_1\ge P_2\) by Lemma 9.7. An application of mathematical induction proves the first two items. For the last item, by a theorem on the convergence of a monotone sequence of self-adjoint operators (see [32, Pages 189–190]), \(\lim _{i\rightarrow \infty } P_i\) and \(\lim _{i\rightarrow \infty } K_i\) exist. Letting \(i\rightarrow \infty \) in (9.7) and (9.8), and eliminating \(K_\infty \) in (9.7) using (9.8), we have
The proof is complete by the uniqueness of \(P^*\). \(\square \)
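Theorem 9.1 is the exact, error-free counterpart of the algorithms studied later: policy evaluation solves a generalized Lyapunov equation, policy improvement is a Newton-type update, and \(P_i\) decreases monotonically to \(P^*\). The sketch below instantiates this on a hypothetical second-order system \(x_{t+1}=(A+\gamma _tC)x_t+Bu_t\) with a single zero-mean, unit-variance multiplicative noise; all matrices are illustrative choices, not taken from the chapter:

```python
import numpy as np

# A hypothetical instance of x_{t+1} = (A + g_t C) x_t + B u_t, where g_t is
# zero-mean, unit-variance multiplicative noise; all matrices are illustrative.
A = np.array([[1.0, 0.3], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
C = 0.1 * np.eye(2)                 # multiplicative-noise direction
S = np.eye(2)                       # state weighting
R = np.array([[1.0]])               # input weighting

def policy_eval(K):
    """Solve the generalized Lyapunov equation
       P = S + K'RK + A_K' P A_K + C' P C,  A_K = A - BK,  by vectorization."""
    AK = A - B @ K
    M = np.kron(AK.T, AK.T) + np.kron(C.T, C.T)
    q = (S + K.T @ R @ K).flatten(order="F")
    return np.linalg.solve(np.eye(4) - M, q).reshape(2, 2, order="F")

def policy_improve(P):
    """Exact policy improvement: K = (R + B'PB)^{-1} B'PA."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

K = np.array([[0.5, 1.0]])          # an initial mean-square stabilizing gain
costs = []
for _ in range(30):
    P = policy_eval(K)
    costs.append(np.trace(P))
    K = policy_improve(P)

# Monotone non-increasing iterates converging to the GARE fixed point:
assert all(costs[i] >= costs[i + 1] - 1e-9 for i in range(len(costs) - 1))
residual = P - (S + A.T @ P @ A + C.T @ P @ C
                - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A))
assert np.linalg.norm(residual) < 1e-8
print(f"converged; tr(P*) = {costs[-1]:.4f}")
```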
Appendix 3
Proof
(Lemma 9.2) Since \(\mathscr {K}(P^*)\) is mean-square stabilizing, by continuity there always exists a \(\bar{\delta }_0>0\), such that \(\mathscr {R}(P_i)\) is invertible and \(\mathscr {K}(P_i)\) is mean-square stabilizing for all \(P_i\in \mathcal {\bar{B}}_{\bar{\delta }_0}(P^*)\). Suppose \(P_i\in \mathcal {\bar{B}}_{\bar{\delta }_0}(P^*)\). Subtracting
from both sides of the GARE (9.4) yields
Subtracting (9.24) from (9.12), we have
Taking norms on both sides of the above equation, (9.11) yields
Since \(\mathscr {K}(\cdot )\) is locally Lipschitz continuous at \(P^*\), by continuity of matrix norm and matrix inverse, there exists a \(c_1>0\), such that
So for any \(0<\sigma <1\), there exists a \(\bar{\delta }_0\ge \delta _0>0\) with \(c_1\delta _0\le \sigma \). This completes the proof. \(\square \)
Appendix 4
Before the proof of Lemma 9.3, some auxiliary lemmas are first proved. Procedure 9.2 exhibits a singularity if \([\hat{G}_{i}]_{uu}\) in (9.10) is singular, or if the cost (9.2) of \(\hat{K}_{i+1}\) is infinite. The following lemma shows that if \(\Delta G_i\) is small, no singularity occurs. Let \(\bar{\delta }_0\) be as defined in the proof of Lemma 9.2; then \(\delta _0\le \bar{\delta }_0\).
Lemma 9.8
For any \(\tilde{P}_i\in \mathcal {B}_{\delta _0}(P^*)\), there exists a \(d(\delta _0)>0\), independent of \(\tilde{P}_i\), such that \(\hat{K}_{i+1}\) is mean-square stabilizing and \([\hat{G}_{i}]_{uu}\) is invertible, if \(\Vert \Delta G_i\Vert _F\le d\).
Proof
Since \(\mathcal {\bar{B}}_{\bar{\delta }_0}(P^*)\) is compact and \(\mathscr {A}(\mathscr {K}(\cdot ))\) is a continuous function, the set
is also compact. By continuity and Lemma 9.1, for each \(X\in \mathcal {S}\), there exists an \(r(X)>0\) such that \(\rho (Y+I_n\otimes I_n)<1\) for any \(Y\in \mathcal {B}_{r(X)}(X)\). The compactness of \(\mathcal {S}\) implies the existence of an \(\underline{r}>0\), such that \(\rho (Y+I_n\otimes I_n)<1\) for each \(Y\in \mathcal {B}_{\underline{r}}(X)\) and all \(X\in \mathcal {S}\). Similarly, there exists a \(d_1>0\) such that \([\hat{G}_{i}]_{uu}\) is invertible for all \(\tilde{P}_i\in \mathcal {\bar{B}}_{\bar{\delta }_0}(P^*)\), if \(\Vert \Delta G_i\Vert _F\le d_1\). Note that in the policy improvement step of Procedure 9.1 (the policy update step in Procedure 9.2), the improved policy \(\tilde{K}_{i+1}=[\tilde{G}_{i}]_{uu}^{-1}[\tilde{G}_{i}]_{ux}\) (the updated policy \(\hat{K}_{i+1}\)) is a continuous function of \(\tilde{G}_i\) (\(\hat{G}_i\)), and there exists a \(0<d_2\le d_1\), such that \(\mathscr {A}(\hat{K}_{i+1})\in \mathcal {B}_{\underline{r}}(\mathscr {A}(\mathscr {K}(\tilde{P}_i)))\) for all \(\tilde{P}_i\in \mathcal {\bar{B}}_{\bar{\delta }_0}(P^*)\), if \(\Vert \Delta G_i\Vert _F\le d_2\). Thus, Lemma 9.1 implies that \(\hat{K}_{i+1}\) is mean-square stabilizing. Setting \(d=d_2\) completes the proof. \(\square \)
By Lemma 9.8, if \(\Vert \Delta G_i\Vert _F\le d\), the sequence \(\{\tilde{P}_i\}_{i=0}^{\infty }\) satisfies (9.13). For simplicity, we denote \(\mathcal {E}(\tilde{G}_i,\Delta G_i)\) in (9.13) by \(\mathcal {E}_i\). The following lemma gives an upper bound on \(\Vert \mathcal {E}_i\Vert _F\) in terms of \(\Vert \Delta G_i\Vert _F\).
Lemma 9.9
For any \(\tilde{P}_i\in \mathcal {B}_{\delta _0}(P^*)\) and any \(c_2>0\), there exists a \(0<\delta _1^1(\delta _0,c_2)\le d\), independent of \(\tilde{P}_i\), where d is defined in Lemma 9.8, such that
if \(\Vert \Delta G_i\Vert _F<\delta _1^1\), where \(c_3(\delta _0)>0\).
Proof
For any \(\tilde{P}_i\in \bar{\mathcal {B}}_{\delta _0}(P^*)\), \(\Vert \Delta G_i\Vert _F\le d\), we have from (9.23)
where the last inequality comes from the continuity of matrix inverse and the extreme value theorem. Define
Define
Using (9.25), it is easy to check that \(\Vert \Delta \mathscr {A}_i\Vert _F\le c_5\Vert \Delta G_i\Vert _F\), \(\Vert \Delta b_i\Vert _2\le c_6\Vert \Delta G_i\Vert _F\), for some \(c_5(\delta _0,d)>0\), \(c_6(\delta _0,d)>0\). Then by (9.23)
where the last inequality comes from the continuity of matrix inverse and Lemma 9.8. Choosing \(0<\delta ^1_1\le d\) such that \(c_3\delta ^1_1<c_2\) completes the proof. \(\square \)
Now we are ready to prove Lemma 9.3.
Proof
(Lemma 9.3) Let \(c_2=(1-\sigma )\delta _0\) in Lemma 9.9, and \(\delta _1\) be equal to the \(\delta _1^1\) associated with \(c_2\). For any \(i\in \mathbb {Z}_+\), if \(\tilde{P}_i\in \mathcal {B}_{\delta _0}(P^*)\), then \([\hat{G}_i]_{uu}\) is invertible, \(\hat{K}_{i+1}\) is mean-square stabilizing and
where (9.26) and (9.28) are due to Lemmas 9.2 and 9.9. By induction, (9.26) to (9.28) hold for all \(i\in \mathbb {Z}_+\), thus by (9.27),
which proves (i) and (ii) in Lemma 9.3. Then (9.25) implies (iii) in Lemma 9.3.
In terms of (iv) in Lemma 9.3, for any \(\epsilon >0\), there exists an \(i_1\in \mathbb {Z}_+\), such that \(\sup \{\Vert \Delta G_i\Vert _F\}_{i=i_1}^\infty <\gamma ^{-1}(\epsilon /2)\). Take \(i_2\ge i_1\). For \(i\ge i_2\), we have by (ii) in Lemma 9.3,
where the second inequality is due to the boundedness of \(\tilde{P}_i\). Since \(\lim _{i\rightarrow \infty }\beta (c_7,i-i_2)=0\), there is an \(i_3\ge i_2\) such that \(\beta (c_7,i-i_2)<\epsilon /2\) for all \(i\ge i_3\), which completes the proof. \(\square \)
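Lemma 9.3 is an input-to-state stability statement: small bounded per-iteration errors produce bounded iterates that enter a small neighborhood of \(P^*\). Reusing the illustrative second-order system from the earlier sketch, the code below corrupts every policy-improvement step with a small bounded disturbance (standing in for \(\Delta G_i\)) and checks that \(\tilde{P}_i\) stays bounded and settles near \(P^*\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.3], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
C = 0.1 * np.eye(2)
S, R = np.eye(2), np.array([[1.0]])

def policy_eval(K):
    AK = A - B @ K
    M = np.kron(AK.T, AK.T) + np.kron(C.T, C.T)
    q = (S + K.T @ R @ K).flatten(order="F")
    return np.linalg.solve(np.eye(4) - M, q).reshape(2, 2, order="F")

def policy_improve(P):
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Reference optimal solution from exact policy iteration.
K = np.array([[0.5, 1.0]])
for _ in range(50):
    K = policy_improve(policy_eval(K))
P_star = policy_eval(K)

# Inexact policy iteration: each improvement step is corrupted by a small
# bounded disturbance, standing in for the per-iteration error Delta G_i.
K = np.array([[0.5, 1.0]])
errs = []
for _ in range(100):
    P = policy_eval(K)
    errs.append(np.linalg.norm(P - P_star))
    K = policy_improve(P) + 1e-3 * rng.uniform(-1.0, 1.0, size=(1, 2))

# ISS-style behavior: iterates stay bounded and settle in a small
# neighborhood of P* whose size shrinks with the disturbance bound.
assert max(errs) < 100.0
assert all(e < 0.05 for e in errs[10:])
print(f"final distance to P*: {errs[-1]:.2e}")
```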
Appendix 5
Notice that all the conclusions of Theorem 9.2 follow from Lemma 9.3 if
for Procedure 9.2. Thus, the proof of Theorem 9.2 reduces to the proof of the following lemma.
Lemma 9.10
Given a mean-square stabilizing \(\hat{K}_1\), there exist \(0<\delta _2<\min (\gamma ^{-1}(\epsilon ),\delta _1)\), \(\bar{i}\in \mathbb {Z}_+\), \(\alpha _2>0\), and \(\kappa _2>0\), such that \([\hat{G}_i]_{uu}\) is invertible, \(\hat{K}_i\) is mean-square stabilizing, \(\Vert \tilde{P}_i\Vert _F<\alpha _2\), \(\Vert \hat{K}_i\Vert _F<\kappa _2\), \(i=1,\cdots ,\bar{i}\), \(\tilde{P}_{\bar{i}}\in \mathcal {B}_{\delta _0}(P^*)\), as long as \(\Vert \Delta G\Vert _\infty <\delta _2\).
The next two lemmas state that under certain conditions on \(\Vert \Delta G_i\Vert _F\), each element in \(\{\hat{K}_i\}_{i=1}^{\bar{i}}\) is mean-square stabilizing, each element in \(\{[\hat{G}_i]_{uu}\}_{i=1}^{\bar{i}}\) is invertible, and \(\{\tilde{P}_i\}_{i=1}^{\bar{i}}\) is bounded. For simplicity, in the following we assume \(S>I_n\) and \(R>I_m\). All the proofs still work for any \(S>0\) and \(R>0\), by suitable rescaling.
Lemma 9.11
If \(\hat{K}_i\) is mean-square stabilizing, then \([\hat{G}_i]_{uu}\) is nonsingular and \(\hat{K}_{i+1}\) is mean-square stabilizing, as long as \(\Vert \Delta G_i\Vert _F < a_i\), where
Furthermore,
Proof
By definition,
Since \(R>I_m\), the eigenvalues \(\lambda _j([\tilde{G}_i]_{uu}^{-1})\in (0,1]\) for all \(1\le j\le m\). Then by the fact that for any \(X\in \mathbb {S}^m\)
we have
Thus by [26, Section 5.8], \([\hat{G}_i]_{uu}\) is invertible.
For any \(x\in \mathbb {R}^{n}\) on the unit ball, define
and
Then
where \(\vert \mathcal {X}_{\hat{K}_{i}}\vert _{abs}\) denotes the matrix obtained from \(\mathcal {X}_{\hat{K}_{i}}\) by taking the absolute value of each entry. Thus by (9.31) and the definition of \(\tilde{G}_i\), we have
where
For any x on the unit ball, \(\vert \mathbf {1}^Tx\vert _{abs}\le \sqrt{n}\). Similarly, for any \(K\in \mathbb {R}^{m\times n}\), by the definition of induced matrix norm, \(\vert \mathbf {1}^TKx\vert _{abs}\le \Vert K\Vert _2 \sqrt{m}\). This implies
which means \(\mathbf {1}^T\vert \mathcal {X}_K\vert _{abs}\mathbf {1}\le m(\sqrt{n} + \Vert K\Vert _2)^2\). Thus
Then \(S>I_n\) leads to
for all x on the unit ball. So \(\hat{K}_{i+1}\) is mean-square stabilizing by Lemma 9.1.
By definition,
where the second inequality comes from [26, Inequality (5.8.2)], and the last inequality is due to (9.30). This completes the proof. \(\square \)
Lemma 9.12
For any \(\bar{i}\in \mathbb {Z}_+\), \(\bar{i}>0\), if
where \(a_i\) is defined in Lemma 9.11, then
for \(i=1,\cdots ,\bar{i}\), where
Proof
Inequality (9.32) yields
where
Inserting (9.9) into the above inequality, and using Lemma 9.7, we have
With \(S>I_n\), (9.35) yields
Similar to (9.36), we have
From (9.36) to (9.37), we obtain
By definition of \(\epsilon _{2,i}\) and condition (9.34),
Then [34, §28. Theorem 3] yields
An application of (9.29) completes the proof. \(\square \)
Now we are ready to prove Lemma 9.10.
Proof
(Lemma 9.10) Consider Procedure 9.2 confined to the first \(\bar{i}\) iterations, where \(\bar{i}\) is a sufficiently large integer to be determined later in this proof. Suppose
Condition (9.38) implies condition (9.34). Thus \(\hat{K}_i\) is mean-square stabilizing, \([\hat{G}_i]_{uu}\) is invertible, and \(\Vert \tilde{P}_i\Vert _F\) and \(\Vert \hat{K}_i\Vert _F\) are bounded. By (9.9) we have
Letting \(E_i = \hat{K}_{i+1} - \mathscr {K}(\tilde{P}_i)\), the above equation can be rewritten as
where \(\mathcal {N}(\tilde{P_i}) = \mathcal {L}_{\mathscr {K}(\tilde{P_i})}^{-1}\circ \mathcal {R}(\tilde{P}_i),\) and
Given \(\hat{K}_1\), let \(\mathcal {M}_{\bar{i}}\) denote the set of all possible \(\tilde{P}_i\), generated by (9.39) under condition (9.38). By definition, \(\{\mathcal {M}_j\}_{j=1}^\infty \) is a nondecreasing sequence of sets, i.e., \(\mathcal {M}_1\subset \mathcal {M}_2 \subset \cdots \). Define \(\mathcal {M} = \cup _{j=1}^\infty \mathcal {M}_j\), \(\mathcal {D} = \{P\in \mathbb {S}^n\ \vert \ \Vert P\Vert _F\le 6\Vert \tilde{P}_1\Vert _F\}\). Then by Lemma 9.12 and Theorem 9.1, \(\mathcal {M}\subset \mathcal {D}\); \(\mathcal {M}\) is compact; \(\mathscr {K}(P)\) is stable for any \(P\in \mathcal {M}\).
Now we prove that \(\mathcal {N}(\cdot )\) is Lipschitz continuous on \(\mathcal {M}\). Using (9.11), we have
where the last inequality is due to the fact that matrix inversion, \(\mathscr {A}(\cdot )\), \(\mathscr {K}(\cdot )\), and \(\mathcal {R}(\cdot )\) are locally Lipschitz, and thus Lipschitz on the compact set \(\mathcal {M}\) with some Lipschitz constant \(L>0\).
Define \(\{P_{k\vert i}\}_{k=0}^{\infty }\) as the sequence generated by (9.12) with \(P_{0\vert i}=\tilde{P}_i\). Similar to (9.39), we have
By Theorem 9.1 and the fact that \(\mathcal {M}\) is compact, there exists \(k_0\in \mathbb {Z}_+\), such that
Suppose
We find an upper bound on \(\Vert P_{k\vert i}-\tilde{P}_{i+k}\Vert _F\). Notice that from (9.39) to (9.41),
An application of the Gronwall inequality [2, Theorem 4.1.1.] to the above inequality implies
By (9.11), the error term in (9.39) satisfies
where \(C_1\) is a constant and the inequality is due to the continuity of matrix inverse.
Let \(\bar{i}>k_0\), and set \(k=k_0\), \(i = \bar{i}-k_0\) in (9.44). Then by condition (9.38), Lemma 9.12, (9.43), (9.44), and (9.45), there exists an \(i_0\in \mathbb {Z}_+\), \(i_0>k_0\), such that \(\Vert P_{k_0\vert \bar{i}-k_0} -\tilde{P}_{\bar{i}}\Vert _F<\delta _0/2\) for all \(\bar{i}\ge i_0\). Setting \(i = \bar{i}-k_0\) in (9.42), the triangle inequality yields \(\tilde{P}_{\bar{i}}\in \mathcal {B}_{\delta _0}(P^*)\) for \(\bar{i}\ge i_0\). Then in (9.38), choosing \(\bar{i}\ge i_0\) such that \(\delta _2 = b_{\bar{i}}<\min (\gamma ^{-1}(\epsilon ),\delta _1)\) completes the proof. \(\square \)
Appendix 6
For a given \(\hat{K}_1\), let \(\mathcal {K}\) denote the set of control gains (including \(\hat{K}_1\)) generated by Procedure 9.2 with all possible \(\{\Delta G_i\}_{i=1}^\infty \) satisfying \(\Vert \Delta G\Vert _\infty <\delta _2\), where \(\delta _2\) is the one in Theorem 9.2. The following result is first derived.
Lemma 9.13
Under the conditions in Theorem 9.3, there exist \(\bar{L}_0>0\) and \(N_0>0\) such that for any \(\bar{L}\ge \bar{L}_0\) and \(N\ge N_0\), \(\hat{K}_i\in \mathcal {K}\) implies \(\Vert \Delta G_i\Vert _F< \delta _2\), almost surely.
Proof
By definition, in the context of Algorithm 9.1,
where \(\tilde{P}_i\) is the unique solution of (9.9) with \(K=\hat{K}_i\). Thus, the task is to prove that each term on the right-hand side of the above inequality is less than \(\delta _2/3\). To this end, we first study \(\Vert \tilde{P}_i - \hat{P}_{i,\bar{L}} \Vert _F\). Define \(\hat{p}_{i,j}={{\,\mathrm{vec}\,}}(\hat{P}_{i,j})\). By Lemma 9.5, Lines 11 and 12 in Algorithm 9.1 can be rewritten as
where \(\hat{p}_{i,0}\in \mathbb {R}^{n^2}\) and
Similar derivations applied to (9.20) with \(K=\hat{K}_i\) yield
Since (9.20) is identical to (9.14), (9.47) is identical to (9.15) with K and \({{\,\mathrm{vec}\,}}(P_{K,j})\) replaced by \(\hat{K}_i\) and \(\bar{p}_{i,j}\) respectively, and
Since \(\hat{K}_i\in \mathcal {K}\) is mean-square stabilizing, by Lemma 9.4
where \(\bar{P}_{i,j} = {{\,\mathrm{vec}\,}}^{-1}(\bar{p}_{i,j})\). By definition and Theorem 9.2, \(\bar{\mathcal {K}}\) is bounded, thus compact. Let \(\mathcal {V}\) be the set of the unique solutions of (9.5) with \(K\in \mathcal {K}\). Then by Theorem 9.2, \(\mathcal {V}\) is bounded. So \(\mathscr {A}(K)\) is mean-square stable for all \(K\in \bar{\mathcal {K}}\); otherwise, by (9.11) and Lemma 9.1, the boundedness of \(\mathcal {V}\) would be contradicted. Define \(\mathcal {K}_1 = \{\mathscr {A}(K)+I_n\otimes I_n\vert K\in \bar{\mathcal {K}}\}\). Then \(\rho (X)<1\) for any \(X\in \mathcal {K}_1\), and by continuity \(\mathcal {K}_1\) is a compact set. This implies the existence of a \(\delta _3>0\), such that \(\rho (X)<1\) for any \(X\in \bar{\mathcal {K}}_2\), where
Define
The boundedness of \(\mathcal {K}\), (9.22), and (9.48) imply the existence of \(N_1>0\), such that for any \(N\ge N_1\), any \(\hat{K}_i\in \mathcal {K}\), almost surely
where \(C_9>0\) is a constant. Then
and (9.46) admits a unique stable equilibrium, that is,
for some \(\mathring{P}_i\in \mathbb {S}^n\). From (9.46), (9.47), (9.49), and (9.51), we have
Thus by (9.23), for any \(N\ge N_1\), any \(\hat{K}_i\in \mathcal {K}\), almost surely
where \(C_{10}\) and \(C_{11}\) are some positive constants, and the last inequality is due to (9.48), (9.50) and the fact that \(\mathcal {K}_1\) and \(\bar{\mathcal {K}}_2\) are compact sets. Then for any \(\epsilon _1>0\), the boundedness of \(\mathcal {K}\) and (9.22) implies the existence of \(N_2\ge N_1\), such that for any \(N\ge N_2\), almost surely
as long as \(\hat{K}_i\in \mathcal {K}\). By Lemma 9.6 and (9.52), for any \(N\ge N_2\) and any \(\hat{K}_i\in \mathcal {K}\),
for some \(a_0>0\), \(1>b_0>0\) and \(a_1>0\). Therefore there exists a \(\bar{L}_1>0\), such that for any \(\bar{L}\ge \bar{L}_1\), and any \(N\ge N_2\), almost surely
as long as \(\hat{K}_i\in \mathcal {K}\). With (9.52) and (9.53), we obtain
almost surely for any \(\bar{L}\ge \bar{L}_1\), any \(N\ge N_2\), as long as \(\hat{K}_i\in \mathcal {K}\). Since \(\epsilon _1\) is arbitrary, we can choose \(\epsilon _1\) such that almost surely
for any \(\bar{L}\ge \bar{L}_1\), any \(N\ge N_2\), as long as \(\hat{K}_i\in \mathcal {K}\).
Secondly, by definition and (9.54), there exist \(\bar{L}_2\ge \bar{L}_1\) and \(N_3\ge N_2\), such that
for any \(\bar{L}\ge \bar{L}_2\), any \(N\ge N_3\), as long as \(\hat{K}_i\in \mathcal {K}\).
Thirdly, since \(\mathcal {V}\) is bounded, \(\hat{P}_{i,\bar{L}}\) is also almost surely bounded by (9.54). Thus, from Line 14 in Algorithm 9.1 and (9.22), there exists \(N_4\ge N_3\), such that
for any \(N\ge N_4\) and any \(\bar{L}\ge \bar{L}_2\), as long as \(\hat{K}_i\in \mathcal {K}\).
Setting \(N_0 = N_4\) and \(\bar{L}_0 = \bar{L}_2\) yields \(\Vert \Delta G_i\Vert _F<\delta _2\). \(\square \)
Now we are ready to prove the convergence of Algorithm 9.1.
Proof
(Theorem 9.3) Since \(\hat{K}_1\in \mathcal {K}\), Lemma 9.13 implies \(\Vert \Delta G_1\Vert _F<\delta _2\) almost surely. By definition, \(\hat{K}_2\in \mathcal {K}\). Thus \(\Vert \Delta G_i\Vert _F<\delta _2\), \(i=1,2,\cdots \), holds almost surely by mathematical induction. Then Theorem 9.2 completes the proof. \(\square \)
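Theorem 9.3 hinges on the estimation error \(\Delta G_i\) shrinking as the number of trajectories \(N\) grows. Algorithm 9.1 itself works from online input/state data; as a simplified stand-in, the sketch below (reusing the illustrative matrices from the earlier sketches) replaces the exact closed-loop second-moment operator by an average over \(N\) sampled noise realizations and shows the policy-evaluation error decreasing with \(N\):

```python
import numpy as np

A = np.array([[1.0, 0.3], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
C = 0.1 * np.eye(2)                    # multiplicative-noise direction
S, R = np.eye(2), np.array([[1.0]])
K = np.array([[0.5, 1.0]])             # a fixed mean-square stabilizing gain

# Exact policy evaluation (uses the true model), for reference only.
AK = A - B @ K
q = (S + K.T @ R @ K).flatten(order="F")
M = np.kron(AK.T, AK.T) + np.kron(C.T, C.T)
P_exact = np.linalg.solve(np.eye(4) - M, q).reshape(2, 2, order="F")

def mc_policy_eval(N, seed):
    """Policy evaluation with the closed-loop second-moment operator replaced
    by an average over N sampled noise realizations -- a simplified stand-in
    for the multiple-trajectory least-squares step of Algorithm 9.1."""
    g = np.random.default_rng(seed).standard_normal(N)
    M_hat = np.zeros((4, 4))
    for gi in g:
        An = A + gi * C - B @ K        # one sampled closed-loop matrix
        M_hat += np.kron(An.T, An.T)
    M_hat /= N
    return np.linalg.solve(np.eye(4) - M_hat, q).reshape(2, 2, order="F")

err = lambda N: np.mean([np.linalg.norm(mc_policy_eval(N, s) - P_exact)
                         for s in range(5)])
err_small, err_large = err(50), err(20000)
assert err_large < err_small           # more trajectories -> smaller error
print(f"mean error: N=50 -> {err_small:.3e}, N=20000 -> {err_large:.3e}")
```

With everything else fixed, the error decays at the Monte-Carlo rate \(O(1/\sqrt{N})\), which is why a sufficiently large \(N\) pushes \(\Vert \Delta G_i\Vert _F\) below \(\delta _2\) almost surely.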
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Pang, B., Jiang, ZP. (2022). Robust Reinforcement Learning for Stochastic Linear Quadratic Control with Multiplicative Noise. In: Jiang, ZP., Prieur, C., Astolfi, A. (eds) Trends in Nonlinear and Adaptive Control. Lecture Notes in Control and Information Sciences, vol 488. Springer, Cham. https://doi.org/10.1007/978-3-030-74628-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74627-8
Online ISBN: 978-3-030-74628-5