Abstract
In this paper, we deal with a linear quadratic optimal control problem with unknown dynamics. As a modeling assumption, we suppose that the knowledge an agent has of the current system is represented by a probability distribution \(\pi \) on the space of matrices. Furthermore, we assume that such a probability measure is suitably updated to take into account the increased experience that the agent obtains while exploring the environment, approximating the underlying dynamics with increasing accuracy. Under these assumptions, we show that the optimal control obtained by solving the “average” linear quadratic optimal control problem with respect to a certain \(\pi \) converges to the optimal control of the linear quadratic optimal control problem governed by the actual, underlying dynamics. This approach is closely related to model-based reinforcement learning algorithms in which prior and posterior probability distributions describing the knowledge of the uncertain system are recursively updated. In the last section, we present a numerical test that confirms the theoretical results.
1 Introduction
Reinforcement learning (RL) is one of the three basic machine learning paradigms, together with supervised learning and unsupervised learning. In RL, an agent interacts with a partially unknown environment, aiming at finding a policy that optimizes a measure of long-term performance [25]. The connection between RL and optimal control theory was identified long ago [26]. Nowadays, several research questions and tools coming from the RL literature are influencing the optimal control field and vice versa [23].
A natural setting in RL consists in considering a Markov decision process over state/action pairs evolving on a discrete time set. The discrete-time setting provides an excellent framework for developing methods and algorithms, which, however, often rest upon an underlying continuous-time structure. For this reason, in particular in the control system engineering field, significant attention has recently been given to continuous-time RL [8, 14, 15, 17].
Both in discrete- and continuous-time problem settings, one can consider two main RL philosophies: The first one, called model-based, usually concerns the reconstruction of a model from the data trying to mimic the unknown environment. That model is then used to plan and to compute a suboptimal policy. The second RL philosophy, called model-free, employs a direct approximation of the value function and/or a policy based on a dynamic-programming-like algorithm, without using a model to simulate the unknown environment. An excellent overview of the two approaches can be found in [25].
About ten years ago, PILCO was introduced [6, 7], an innovative and disruptive method that has inspired many subsequent model-based RL methods. Rather than exploiting the data to construct a single dynamics approximating the partially known environment, PILCO uses them to construct a probability distribution (more precisely, a Gaussian process) on a class of dynamical systems. At each iteration, this distribution is updated to fit the data set. After the model update, PILCO takes the policy improvement step, which boils down to solving an averaged optimal control problem, where the averaging distribution is the one extrapolated from the data of the previous experiments. This approach has the advantage of considerably reducing the model bias, one of the main shortcomings of model-based RL [1]. A general, rigorous framework capturing PILCO as well as other Bayesian model-based RL approaches (see, e.g., [3, 4, 10,11,12, 29]) has been developed in [18, 19]. In particular, it is important to mention that the framework developed in [18] is closely related to the averaging control framework and Riemann–Stieltjes optimal control [2, 16, 21, 24, 30].
The aim of this paper is to provide a stricter link between PILCO [7] and the framework introduced in [18]. We will concentrate on a specific physical system, governed by a linear quadratic regulator (LQR) optimal control problem, namely
$$\begin{aligned} \min _{u}\; \frac{1}{2}\int _0^T \left( x(t)^T Q\, x(t) + u(t)^T R\, u(t)\right) dt + \frac{1}{2}\, x(T)^T Q_f\, x(T), \qquad \dot{x}(t) = \hat{A} x(t) + B u(t), \quad x(0)=x_0, \end{aligned}$$(1)
where \(B, Q, R, Q_f\) are given, known matrices and \(x_0\in \mathbb {R}^n \) is a given vector. However, in this context we will assume that the physical system \(\hat{A}\) is unknown and that the knowledge we have about \(\hat{A}\) is merely represented by a probability distribution \(\pi \) constructed over a set of matrices \(\mathcal {A}\) (with \(\hat{A}\in \mathcal {A}\)) by using the data available from the physical system. This situation is in accordance with the PILCO setting, which is “not focusing on a single dynamics model”, but makes use of “a probabilistic dynamics model, a distribution over all plausible dynamics models that could have generated the observed experience” ([6], p. 34). Such a modeling setting allows us to define the averaged optimal control problem
$$\begin{aligned} \min _{u}\; \mathbb {E}_\pi \!\left[ \frac{1}{2}\int _0^T \left( x_A(t)^T Q\, x_A(t) + u(t)^T R\, u(t)\right) dt + \frac{1}{2}\, x_A(T)^T Q_f\, x_A(T)\right] , \qquad \dot{x}_A(t) = A x_A(t) + B u(t), \quad x_A(0)=x_0. \end{aligned}$$(2)
If indeed the real physical system is driven by the equation \(\dot{x}(t) = \hat{A} x(t) + B u(t)\)
for a certain matrix \(\hat{A}\), then it is reasonable to expect that increased experience will produce a more accurate distribution \(\pi \) over \(\mathcal {A}\). This fact can be translated into the assumption that the probability distribution is “close” (in a precise sense that will be specified in the sequel) to \(\delta _{\hat{A}}\) when enough experience of the environment (here represented by \(\hat{A}\)) has been gained.
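This closeness will be measured in the Wasserstein distance defined in Sect. 2. When \(\pi = \sum _i w_i \delta _{A_i}\) is discrete, its Wasserstein distance to \(\delta _{\hat{A}}\) admits the closed form \(\sum _i w_i \Vert A_i - \hat{A}\Vert _2\), since the only coupling with a Dirac mass is the product measure. A minimal numerical sketch (the matrices and weights below are purely illustrative):

```python
import numpy as np

def w1_to_dirac(atoms, weights, A_hat):
    """W1 distance between pi = sum_i w_i delta_{A_i} and delta_{A_hat}.

    The only coupling between any measure and a Dirac mass is the product
    measure, so W1 reduces to the pi-average of the 2-norm distances."""
    return sum(w * np.linalg.norm(A - A_hat, 2)
               for A, w in zip(atoms, weights))

A_hat = np.array([[0.0, 1.0], [-1.0, 0.0]])   # "true" dynamics (illustrative)
atoms = [A_hat, A_hat + 0.5 * np.eye(2)]      # candidate models in the posterior

# As experience accumulates, mass concentrates on A_hat and W1 shrinks linearly.
for w_true in (0.5, 0.9, 0.99):
    print(w_true, w1_to_dirac(atoms, [w_true, 1.0 - w_true], A_hat))
```

Shifting probability mass from the wrong atom to \(\hat{A}\) decreases the distance linearly in the transferred mass, which is the quantitative counterpart of "gaining experience" in this setting.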
We would like to stress that our goal is not to propose a new algorithm to find an optimal policy, but to consider a class of existing algorithms and explain their good performance. In particular, the paper aims to provide an insight into the convergence of Bayesian-like RL algorithms in which a recursive construction of probability measures is carried out. (Further considerations on the connection with RL are given in Remarks 3.4 and 5.5.) Here, by “convergence”, we mean convergence of the optimal policy obtained by estimating the underlying dynamics using data from the real system (the one constructed in the so-called policy improvement step) toward the optimal policy obtained by solving problem (1). More precisely, the questions we will tackle in this paper are:
- (1) Is the value function related to the optimal control problem (1) close to the value function associated with (2) when \(\pi \) is close to \(\delta _{\hat{A}}\) w.r.t. the Wasserstein distance (see (9) for the formal definition)?
- (2) Under the same assumptions on \(\pi \), is the optimal control of (2) close to the optimal control of (1)?
For both question (1) and question (2), we will provide positive answers. It is worth noticing that, in control theory, it is very uncommon to have a positive answer to question (2), even when one has a positive answer to question (1).
The paper is organized as follows. Section 2 introduces the basic notation that we will use throughout the paper. In Sect. 3, we state the problem formulation and study the basic properties of the LQ optimal control problem (2). Then, in Sect. 4, we derive a Pontryagin maximum principle for problem (2), refining some results in [2]. In Sect. 5, we state and prove the main results of the paper, providing positive answers to question (1) and question (2). In Sect. 6, we strengthen the results of Sect. 5 in the case in which one deals with a discrete probability measure \(\pi \). That result is further illustrated in Sect. 7, where we present and analyze a numerical example. Finally, we present the conclusion and discuss some future directions and open questions.
2 Preliminaries and notations
In this section, we will recall some useful notations and concepts which will be used throughout the paper. For vectors \(v \in \mathbb {R}^n\), |v| denotes the Euclidean norm; we use \(\mathbb {B}_n(x,r)\) to denote the open ball in \(\mathbb {R}^n\) centered at \(x\in \mathbb {R}^n\) and of radius \(r>0\). We will use \(\mathcal {L}\) to denote the Lebesgue \(\sigma \)-algebra and \(\mathcal {B}_X\) to denote the collection of all Borel sets on a given topological space X.
In the following, T will be our fixed time horizon. Given a time \(s \in [0,T]\), we denote by \(C([s,T]; \mathbb {R}^n)\) the space of continuous functions \(x :[s,T] \rightarrow \mathbb {R}^n\). For functions \(x(\cdot ) \in C([s,T]; \mathbb {R}^n)\), \(\left\Vert {x(\cdot )} \right\Vert _\infty \) or \(\left\Vert {x} \right\Vert _\infty \) denotes the sup norm. For continuous functions \(y(\cdot ) \in C(\mathbb {R}^n; \mathbb {R})\) and for a compact set \(K \subset \mathbb {R}^n\), we also define the sup norm restricted on the compact set K as \(\left\Vert {y(\cdot )} \right\Vert _{\infty , K} :=\sup _{x \in K} |y(x)|\).
Moreover, given \(p \in [1,\infty )\), we define the spaces of a.e. defined functions
and
For \(g \in L^p([s,T]; \mathbb {R}^n)\), the \(L^p\)-norm \(\left\Vert {g(\cdot )} \right\Vert _p\) or \(\left\Vert {g} \right\Vert _p\) is defined by
and for \(g \in W^{1,p}([s,T]; \mathbb {R}^n)\), we define
Let us denote by \(\mathbb {M}_{m \times n}\) the space of real matrices with m rows and n columns. For square matrices \(A \in \mathbb {M}_{n \times n}\), we consider the \(2\)-norm
Given two matrices \(A, A' \in \mathbb {M}_{n \times n}\), \(d_2(A,A') :=\left\Vert {A-A'} \right\Vert _2\) denotes the distance between the two matrices induced by the \(2\)-norm.
For a generic metric space (X, d), \(\mathcal {M}(X)\) will denote the space of probability measures on X, equipped with the weak-\(^\star \) topology, according to which \(\pi ^N \overset{*}{\rightharpoonup }\pi \) if and only if
where \(C_b(X)\) denotes the set of all bounded continuous functions \(f :X \rightarrow \mathbb {R}\). When the space X is compact, the weak-\(^\star \) topology is metrized by the Wasserstein distance (see, e.g., [27])
where \(\varGamma (\pi ,\pi ')\) is the collection of all probability measures on \(X \times X\) with marginals \(\pi \) and \(\pi '\) on the first and second factors, respectively.
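For two uniform discrete measures with the same number of atoms, Birkhoff's theorem implies that the infimum over the couplings \(\varGamma (\pi ,\pi ')\) is attained at a permutation of the atoms, so \(W_1\) can be computed exactly by enumeration when the number of atoms is small. A sketch with illustrative matrices (general weights would require solving a linear program instead):

```python
import itertools
import numpy as np

def w1_uniform(atoms1, atoms2):
    """Exact W1 between two uniform discrete measures on a space of matrices.

    By Birkhoff's theorem, the optimal coupling of two uniform measures with
    N atoms each is induced by a permutation, so for small N we enumerate."""
    n = len(atoms1)
    cost = [[np.linalg.norm(a - b, 2) for b in atoms2] for a in atoms1]
    return min(sum(cost[i][sigma[i]] for i in range(n)) / n
               for sigma in itertools.permutations(range(n)))

A = np.zeros((2, 2))
pi1 = [A, A + np.eye(2)]
pi2 = [A + np.eye(2), A]      # same measure, atoms listed in a different order
print(w1_uniform(pi1, pi2))   # 0.0: W1 does not see the labelling of the atoms
```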
Given a probability space \((\varOmega ,\mathcal {F},\pi )\) and a random variable Y on \(\varOmega \), \(\mathbb {E}_\pi [Y]\) denotes the expected value of Y with respect to \(\pi \).
3 Problem statements and preliminary results
We begin our discussion by considering two LQ optimal control problems; we will see in the sequel how the two problems are connected:
Problem A: the LQR problem.
Let us consider the classical LQR problem with finite horizon, which we will refer to as Problem A:
where \(s \in [0,T]\), \(x_0\in \mathbb {R}^n\), \(\mathcal {U}:=\{u:[s,T] \rightarrow \mathbb {R}^m\; \mathrm {Lebesgue}\; \mathrm {measurable}\}\) and
The pair \((x,u)(\cdot )\) such that \(u\in \mathcal {U}\) and \(x(\cdot )\) is the solution of the Cauchy problem
is called admissible process for Problem A.
Let us define the value function \(V:[0,T]\times \mathbb {R}^n\rightarrow \mathbb {R}\) for Problem A as
We shall say that \((\bar{x}, \bar{u})(\cdot )\) is an optimal process for Problem A if
for any other admissible process \((x,u)(\cdot )\) of Problem A. In this case, \(\bar{u}\) will be denoted as optimal control for Problem A.
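It is classical that Problem A can be solved numerically by integrating the Riccati differential equation \(-\dot{P} = A^T P + P A - P B R^{-1} B^T P + Q\), \(P(T)=Q_f\), backward in time and applying the feedback \(u = -R^{-1}B^T P x\). A minimal sketch via explicit Euler (the double-integrator data are illustrative, not taken from the paper):

```python
import numpy as np

def lqr_riccati(A, B, Q, R, Qf, T, steps=2000):
    """Backward Euler integration of the Riccati ODE
       -P' = A^T P + P A - P B R^{-1} B^T P + Q,  P(T) = Qf,
    returning the gains K(t) = R^{-1} B^T P(t) of the feedback u = -K x."""
    dt = T / steps
    Rinv = np.linalg.inv(R)
    P = Qf.copy()
    gains = [None] * (steps + 1)
    gains[steps] = Rinv @ B.T @ P
    for k in range(steps - 1, -1, -1):
        Pdot = A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q
        P = P + dt * Pdot              # stepping backward in time
        gains[k] = Rinv @ B.T @ P
    return gains

def simulate(A, B, gains, x0, T):
    """Closed-loop rollout with u(t) = -K(t) x(t), explicit Euler."""
    steps = len(gains) - 1
    dt = T / steps
    x = x0.copy()
    for k in range(steps):
        u = -gains[k] @ x
        x = x + dt * (A @ x + B @ u)
    return x

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # double integrator (illustrative)
B = np.array([[0.0], [1.0]])
Q, R, Qf = np.eye(2), np.eye(1), 10.0 * np.eye(2)
x0 = np.array([1.0, 0.0])
xT = simulate(A, B, lqr_riccati(A, B, Q, R, Qf, T=5.0), x0, T=5.0)
print(np.linalg.norm(xT))   # the regulator drives the state near the origin
```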
Problem B: an LQR problem with unknown dynamics. Let us now introduce an optimal control problem that does not require the exact knowledge of the matrix \(\hat{A}\), but merely a probability distribution defined on a compact space of matrices \(\mathcal {A}\) containing \(\hat{A}\). For each \(s\in [0,T]\), \(x_0\in \mathbb {R}^n\) and \(\pi \in \mathcal {M}(\mathcal {A})\), consider the following optimal control problem, which we will refer to as Problem B:
where \(\mathcal {U}:=\{u:[s,T] \rightarrow \mathbb {R}^m\; \mathrm {Lebesgue}\; \mathrm {measurable}\}\) and
Remark 3.1
Sometimes we will denote this problem as Problem B\(_\pi \), to stress its dependency on the probability distribution \(\pi \).
The definition of the value function \(V_\pi :[0,T]\times \mathbb {R}^n\rightarrow \mathbb {R}\) for Problem B is
The collection \(\left\{ (x_A,u)(\cdot ):\; A\in \mathcal {A}\right\} \) such that \(u\in \mathcal {U}\) and, for each \(A \in \mathcal {A}\), \(x_A(\cdot )\) is the solution of the Cauchy problem
is called admissible process for Problem B. Note that the initial condition is the same for every A. The admissible process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is optimal for Problem B if
for any other admissible process \(\left\{ (x_A,u)(\cdot ):\; A\in \mathcal {A}\right\} \) of Problem B. In this case, \(\bar{u}\) will be denoted as optimal control for Problem B.
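For a discrete measure \(\pi = \sum _i w_i \delta _{A_i}\), the averaged cost of Problem B is simply the \(\pi \)-weighted sum of the costs that the common control u incurs along each trajectory \(x_{A_i}\). A minimal sketch (Euler time-stepping; the quadratic cost with the 1/2 normalization consistent with the Hamiltonian of Sect. 4, and all problem data, are illustrative assumptions):

```python
import numpy as np

def cost_single(A, B, Q, R, Qf, x0, u_traj, dt):
    """Quadratic cost of the trajectory x_A(.;u), explicit Euler in time."""
    x, J = x0.copy(), 0.0
    for u in u_traj:
        J += 0.5 * dt * (x @ Q @ x + u @ R @ u)
        x = x + dt * (A @ x + B @ u)
    return J + 0.5 * x @ Qf @ x

def cost_averaged(atoms, weights, B, Q, R, Qf, x0, u_traj, dt):
    """J_{s,pi}[u] for a discrete pi: the same control u is applied to every
    candidate dynamics A_i and the costs are averaged with weights w_i."""
    return sum(w * cost_single(A, B, Q, R, Qf, x0, u_traj, dt)
               for A, w in zip(atoms, weights))

B = np.array([[0.0], [1.0]])
Q, R, Qf = np.eye(2), np.eye(1), np.eye(2)
x0 = np.array([1.0, 0.0])
dt, steps = 0.01, 100                        # horizon T = 1
u_zero = [np.zeros(1)] * steps
A1 = np.array([[0.0, 1.0], [0.0, 0.0]])
A2 = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(cost_averaged([A1, A2], [0.7, 0.3], B, Q, R, Qf, x0, u_zero, dt))
```

Note that the same control signal is fed to every candidate dynamics, mirroring the definition of admissible process for Problem B.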
Throughout the whole paper, the following Standing Hypothesis will be imposed both for Problem A and for Problem B:
- (SH): \(Q,Q_f \in \mathbb {M}_{n\times n}\) are symmetric and positive semidefinite, and \(R\in \mathbb {M}_{m\times m}\) is symmetric and positive definite.
Remark 3.2
As we will show in Lemma 3.6, Problem B admits a unique optimal process when the assumption (SH) holds true. Furthermore, it is interesting to observe that Problem A can be regarded as a particular case of Problem B when one chooses \(\pi = \delta _{\hat{A}}\), namely when \(\pi \) is a Dirac delta concentrated at \(\hat{A}\):
Remark 3.3
In our framework, we consider a probability distribution merely on the matrix A of the dynamics and not on the matrix B. It is not hard to check that the arguments proposed and the results achieved in the next sections hold as well in an extended framework where the matrix B is possibly unknown. However, we preferred to consider a simpler case to keep the overall presentation as clear as possible. Moreover, it seems reasonable to assume that the agent does not know how the environment works (matrix A), while being aware of how the control affects the system (matrix B).
Remark 3.4
(Nonlinear systems and connection with RL) We defined \(\pi \) as a probability distribution on a space of matrices \(\mathcal {A}\). From a broader perspective, we can identify \(\mathcal {A}\) with a class of linear dynamical systems, namely
and see \(\pi \) as a probability distribution on \(\widetilde{\mathcal {A}}\) as well.
More generally, one could consider a nonlinear dynamical system
assuming that f is unknown and that only a probability distribution \(\pi \) on a space X of possible dynamics is available. In analogy with the linear case, given a cost functional
one could define a Problem B for the nonlinear problem (see [18, 22]).
A concrete example of a nonlinear Problem B related to RL is presented in PILCO, by Deisenroth and Rasmussen [6, 7]. PILCO aims at approximating the actual dynamics with a Gaussian process (GP); the parameters of the GP are tuned according to the data gathered by an agent while interacting with the environment. Providing a probability distribution over the dynamics makes it possible to obtain a confidence interval for the actual dynamics at each point, rather than a mere pointwise estimate. We stress that a GP is a probability distribution on the set X of bounded, continuous functions, and thus it constitutes a perfect candidate to play the role of \(\pi \) in Problem B. In the policy improvement step, the optimal policy is then computed by minimizing the averaged cost over all possible realizations of the GP, similarly to the cost functional we defined in (16).
The framework presented in this paper incorporates the approach presented in PILCO, as well as other probabilistic model-based RL methods [29], when the dynamics is linear and the cost is quadratic. We will make further considerations about the link with PILCO in Remark 5.5.
3.1 Preliminary results for Problem B
Let us start by showing a series of basic results on the existence and regularity of trajectories and optimal controls for the system under consideration.
In this section, we assume that \(\pi \) is a probability distribution on a compact set of matrices \(\mathcal {A}\subset \mathbb {M}_{n \times n}\). Since \(\mathcal {A}\) is bounded, there exists a constant \(C_A\) such that \(\left\Vert {A} \right\Vert _2 \le C_A, \ \forall \, A \in \mathcal {A}\).
For a given matrix \(A \in \mathcal {A}\) and an admissible control \(u \in \mathcal {U}\), the notation \(x_A(t;u)\) will denote the solution of (18) relative to A and u; sometimes, when no ambiguity arises, we may omit the dependency on the control u and write only \(x_A(t)\).
Lemma 3.5
(Boundedness and continuity of trajectories) Let us consider the dynamical system in (18). The following results hold:
-
(i)
For each \(u \in L^1([s,T];\mathbb {R}^m)\), the trajectory \(x_A(\cdot ; u)\) is uniformly bounded for all \(A \in \mathcal {A}\):
$$\begin{aligned} |x_A(t; u)| \le C_x^u \qquad \forall \, t \in [s,T], \, \forall \, A \in \mathcal {A}\, , \end{aligned}$$where \(C_x^u\) is a constant which depends on u and on \(x_0\).
-
(ii)
For each \(u \in L^1([s,T];\mathbb {R}^m)\), the map \(A \mapsto x_A(\cdot ; u)\) is continuous;
-
(iii)
For each \(A \in \mathcal {A}\), the map \(u \mapsto x_A(\cdot ; u)\) is continuous for \(u \in L^1([s,T];\mathbb {R}^m)\).
Proof
-
(i)
Recall that \(x_A(t)\) satisfies the relation
$$\begin{aligned} x_A(t) = x_0 + \int _s^t \left( A x_A(\tau ) + B u(\tau ) \right) d\tau , \qquad \forall \, t \in [s,T] . \end{aligned}$$for each \(A\in \mathcal {A}\). Then,
$$\begin{aligned} |x_A(t)| \le |x_0| + \int _s^t \left\Vert {A} \right\Vert _2 |x_A(\tau )| \, d\tau + \left\Vert {B} \right\Vert _2 \int _s^t \left|{u(\tau )} \right| d\tau \qquad \forall \, t \in [s,T] , \end{aligned}$$so by the Grönwall lemma (see, e.g., Lemma 2.4.4 on [28]) we get
$$\begin{aligned} \begin{aligned} |x_A(t) |&\le \left( |x_0| + \left\Vert {B} \right\Vert _2 \int _s^t \left|{u(\tau )} \right| d\tau \right) \, e^{\left\Vert {A} \right\Vert _2(t-s)} \\&\le \left( |x_0| + \left\Vert {B} \right\Vert _2 \int _s^t \left|{u(\tau )} \right| d\tau \right) \, e^{C_A T}=:C^u_x \end{aligned} \end{aligned}$$(22)for every \(A \in \mathcal {A}\) and \(t\in [s,T]\). This shows condition (i).
-
(ii)
Fix a control \(u \in L^1([s,T];\mathbb {R}^m)\) and consider the trajectories solutions of (18) relative to two different matrices A and \(A'\). If we define
$$\begin{aligned} z (t) :=x_A(t) - x_{A'}(t) \, , \end{aligned}$$then z solves the following differential system:
$$\begin{aligned} \left\{ \begin{aligned} \dot{z}(t)&=A x_A(t) - A' x_{A'}(t) \qquad \qquad t \in [s,T] \\ z(s)&=0 \ . \end{aligned} \right. \end{aligned}$$Notice that we can rewrite the right-hand side as
$$\begin{aligned} A x_A(t) - A' x_{A'}(t) = A z(t) + (A-A') x_{A'}(t) \, , \end{aligned}$$and thus, we can give the estimate
$$\begin{aligned} | \dot{z} (t) | \le C_A |z (t)| + \left\Vert {A-A'} \right\Vert _2 C_x^u \, . \end{aligned}$$Applying again the Grönwall lemma on z, we get
$$\begin{aligned} |z(t)| \le \left\Vert {A-A'} \right\Vert _2 C_x^u \int _s^t e^{\int _\tau ^t{C_A d\sigma }} d\tau \le T C_x^u e^{C_A T} \left\Vert {A-A'} \right\Vert _2 \quad \forall \, t \in [s,T] \, , \end{aligned}$$which implies the continuity of the map \(A \mapsto x_A(\cdot )\).
-
(iii)
Fix a matrix \(A \in \mathcal {A}\) and consider the trajectories relative to two different controls \(u, u' \in \mathcal {U}\). In a similar way as in (ii), we define
$$\begin{aligned} z(t) :=x_A(t;u) - x_A(t; u') \, , \end{aligned}$$which solves the ODE system
$$\begin{aligned} \left\{ \begin{aligned} \dot{z}(t)&= A z(t) + B \! \left( u(t) - u'(t) \right) \qquad \qquad t \in [s,T] \\ z(s)&=0 \ , \end{aligned} \right. \end{aligned}$$and, by Grönwall’s lemma, we get
$$\begin{aligned} |z(t)| \le \left\Vert {B} \right\Vert _2 \int _s^t e^{\int _\tau ^t{C_A d\sigma }} |u(\tau ) - u'(\tau )| d\tau \le e^{C_A T} \left\Vert {B} \right\Vert _2 \left\Vert {u-u'} \right\Vert _1 \quad \end{aligned}$$for all \(t \in [s,T]\). The last inequality gives the continuity with respect to \(u \in L^1([s,T]; \mathbb {R}^m)\).
\(\square \)
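The a priori bound (22) can also be checked numerically: any Euler-discretized trajectory must stay below \(\left( |x_0| + \Vert B\Vert _2 \Vert u\Vert _1\right) e^{C_A T}\) whenever \(\Vert A\Vert _2 \le C_A\). A small randomized check (all data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, steps = 3, 1.0, 1000
dt = T / steps
B = np.eye(n)
x0 = np.ones(n)
C_A = 2.0                                    # radius of the matrix class

worst_ratio = 0.0
for _ in range(20):
    A = rng.standard_normal((n, n))
    A *= C_A / np.linalg.norm(A, 2)          # enforce ||A||_2 = C_A
    u = rng.standard_normal((steps, n))
    x = x0.copy()
    u_l1, sup_x = 0.0, np.linalg.norm(x)
    for k in range(steps):
        u_l1 += dt * np.linalg.norm(u[k])
        x = x + dt * (A @ x + B @ u[k])
        sup_x = max(sup_x, np.linalg.norm(x))
    bound = (np.linalg.norm(x0) + np.linalg.norm(B, 2) * u_l1) * np.exp(C_A * T)
    worst_ratio = max(worst_ratio, sup_x / bound)

print(worst_ratio)   # strictly below 1: the trajectories respect the bound
```

The discrete trajectories satisfy the bound exactly, since \(\Vert I + dt\,A\Vert _2 \le 1 + dt\,C_A \le e^{dt\,C_A}\), which is the discrete counterpart of the Grönwall argument in the proof.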
The following result guarantees that the minimization problem (15) is well posed:
Lemma 3.6
(Existence, uniqueness and upper bounds of the optimal control) Let the assumption (SH) hold true. Given a \(\pi \in \mathcal {M}(\mathcal {A})\), Problem B (15) admits a unique minimizer \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \), satisfying the following upper bound:
where \(\overline{C}_u\) does not depend on \(\pi \) and is defined as
where \(r_1\) is the smallest eigenvalue of the matrix R.
Proof
Consider a minimizing sequence \(u^k\in \mathcal {U}\) of the cost functional \( J_{s, \pi }\) defined in (16), satisfying \( J_{s, \pi }[u^k]\rightarrow \inf _{u\in \mathcal {U}} J_{s, \pi }[u]\), \(\inf _{u\in \mathcal {U}} J_{s, \pi }[u]\le J_{s, \pi }[u^k]\), and the related minimizing process \(\{(x_A^k, u^k)(\cdot ) :A \in \mathcal {A}\}\). Set \(\varepsilon _k :=J_{s, \pi }[u^k]-\inf _{u\in \mathcal {U}} J_{s, \pi }[u]\ge 0\). It is not restrictive to assume that \(\varepsilon _k<1\) for all \(k\in \mathbb {N}\).
Let us consider the system (18) when the control is \(u^0 \equiv 0\). The process \(\{(x_A^0, u^0)(\cdot ) :A \in \mathcal {A}\}\) is the solution of
for every \(A\in \mathcal {A}\). Hence,
Clearly, the cost achieved by the control \(u^0\) can be estimated as follows:
Furthermore, it follows from the construction of the minimizing sequence that
On the other hand, since the matrix \(R>0\), one has
where \(r_1\) is the smallest eigenvalue of the matrix R.
Hence, one obtains the bound on the minimizing sequence
which results in a uniformly bounded norm:
In view of the previous relation, it follows from standard compactness arguments that \(u^k\rightharpoonup \bar{u}\) weakly in \(L^2([s,T]; \mathbb {R}^m)\). Since \(u^k\) is uniformly bounded in \(L^2\), then using in turn the relation (22), the Hölder inequality and the relation (25), one obtains that there exists a constant \(C_x>0\) such that
holds for every \(k \in \mathbb {N}\), \(A\in \mathcal {A}\) and \(t\in [s,T]\). Furthermore, for each \(k \in \mathbb {N}\), \(A\in \mathcal {A}\) and \(t\in [s,T]\), one has
which implies that, for each \(A\in \mathcal {A}\), \(\dot{x}_A^k\rightharpoonup \dot{\bar{x}}_A\) weakly in \(L^1([s,T]; \mathbb {R}^n)\), \(x_A^k\rightarrow \bar{x}_A\) uniformly in [s, T] and, in view of the linearity of the control system, the process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is the solution of the linear system
for each \(A\in \mathcal {A}\). So the process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is a minimizer for Problem B, and in view of (24), \(\bar{u}\) satisfies the bound (23) with the stricter constant
Since the functional \(u\mapsto J_{s, \pi }[u]\) is strictly convex, the uniqueness of the minimizer follows from standard arguments. This completes the proof. \(\square \)
Remark 3.7
In view of the previous results, if \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is an optimal process for problem B, then the constant
is such that
and does not depend on \(\pi \).
4 Optimality conditions
Let us consider the optimal control problem
where \(U:[s,T]\leadsto \mathbb {R}^m\) is a \(\mathcal {L}\times \mathcal {B}_{\mathbb {R}^m}\)-measurable multifunction taking compact values and \(J_{s, \pi }\) is the cost functional defined in (16) for a given, fixed \(\pi \in \mathcal {M}(\mathcal {A})\). For a given reference process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \), we assume that the following condition holds true:
- (TH): There exist \(\delta >0\) and a function \(c\in L^2([s,T];\mathbb {R})\) such that
$$\begin{aligned} \left| Ax+Bu \right| \le c(t), \end{aligned}$$for all \(x\in \mathbb {B}_n(\bar{x}_A(t) ,\delta )\), \(u\in U(t)\), \(A\in \mathcal {A}\), a.e. \(t\in [s,T]\).
It follows from standard ODE theory that when condition (TH) holds, for every admissible process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) one has that \(\bar{x}_A(\cdot )\) is in \(W^{1,1}([s,T]; \mathbb {R}^n)\) for all \(A \in \mathcal {A}\).
Definition 4.1
For a given \(\delta >0\), a process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is said to be a \(W^{1,1}\)-local minimizer for problem (27) if
for every process \(\left\{ (x_A,u)(\cdot ):\; A\in \mathcal {A}\right\} \) such that
We recall the following result due to Bettiol and Khalil, which is a special case of Theorem 3.3 in [2]:
Theorem 4.2
(Bettiol-Khalil, 2019) Let \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) be a \(W^{1,1}\)-local minimizer for the optimal control problem (27). Let the assumption (TH) be satisfied. Then, there exists a \(\mathcal {L} \times \mathcal {B}_\mathcal {A}\) measurable function \(p :[s,T] \times \mathcal {A}\rightarrow \mathbb {R}^n\), \(p(t,A)\equiv p_A(t)\), such that
-
(i)
$$\begin{aligned} p_A(\cdot ) \in W^{1,1}([s,T]; \mathbb {R}^n) \qquad \forall \, A \in supp(\pi ) \, ; \end{aligned}$$
-
(ii)
$$\begin{aligned} \begin{aligned}&\int _\mathcal {A}p_A(t) B \bar{u}(t) \, d\pi (A) - \frac{1}{2} \bar{u}(t)^T R \bar{u}(t)\\&\qquad =\max _{u \in U(t)} \left\{ \int _\mathcal {A}p_A(t) B u\, d\pi (A) - \frac{1}{2} u^T R u \right\} \quad \mathrm {a.e.}\; t\in [s,T] \end{aligned} \end{aligned}$$
-
(iii)
$$\begin{aligned} -\dot{p}_A(t) = A^T p_A(t) - Q \bar{x}_A(t) \qquad a. e. \ t \in [s,T], \, \forall A \in supp(\pi ) \, ; \end{aligned}$$
-
(iv)
$$\begin{aligned} -p_A(T) = Q_f \bar{x}_A(T) \qquad \forall A \in supp(\pi ) \, . \end{aligned}$$
Remark 4.3
Let us notice that Theorem 3.3 in [2] is derived under the stronger assumption:
- (TH'): there exist \(\delta >0\) and \(c>0\) such that
$$\begin{aligned} \left| Ax+Bu \right| \le c \, , \end{aligned}$$for all \(x\in \mathbb {B}_n(\bar{x}_A(t), \delta )\), \(u\in U(t)\), \(A\in \mathcal {A}\), a.e. \(t\in [s,T]\).
However, scrutiny of the proof given there reveals that the result still holds true under the relaxed condition (TH). Furthermore, Theorem 3.3 in [2] is derived for a Mayer optimal control problem, i.e., with only a final cost, but an analogous theorem for Bolza optimal control problems can easily be obtained by using a standard state augmentation argument.
We are now ready to prove the necessary optimality conditions for Problem B:
Theorem 4.4
Assume the hypothesis (SH). Then, the following optimality condition is satisfied by the minimizer \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) for Problem B (15). There exists a \(\mathcal {L} \times \mathcal {B}_\mathcal {A}\) measurable function \(p :[s,T] \times \mathcal {A}\rightarrow \mathbb {R}^n\), \(p(t,A)\equiv p_A(t)\), such that
-
(i)
$$\begin{aligned} p_A(\cdot ) \in W^{1,1}([s,T]; \mathbb {R}^n) \qquad \forall \, A \in supp(\pi ) \, ; \end{aligned}$$
-
(ii)
$$\begin{aligned} \bar{u}(t) = + R^{-1} B^T \int _{\mathcal {A}} p_A(t) \, d\pi (A) \qquad t \in [s,T] \ ; \end{aligned}$$
-
(iii)
$$\begin{aligned} -\dot{p}_A(t) = A^T p_A(t) - Q \bar{x}_A(t) \qquad a. e. \ t \in [s,T], \, \forall A \in supp(\pi ) \, ; \end{aligned}$$
-
(iv)
$$\begin{aligned} -p_A(T) = Q_f \bar{x}_A(T) \qquad \forall A \in supp(\pi ) \, . \end{aligned}$$
Proof
Let us first observe that the optimal process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) exists and is unique, due to Lemma 3.6. Consider now the optimal control problem
Such an optimal control problem is a special case of (27) with the choice \(U(t)=\mathbb {B}_m(\bar{u}(t), 1)\). Clearly, since \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is a minimizer for Problem B, it is also a minimizer for problem (28). Furthermore, since any element \(u\in U(t)\) can be written as \(u=\bar{u}(t)+v\) for some \(v\in \mathbb {B}_m(0,1)\), in view of Remark 3.7 one can easily find \(\delta >0\) and a function \(c\in L^2([s,T], \mathbb {R})\) such that the hypothesis (TH) is satisfied. Indeed,
for all \(x\in \mathbb {B}_n(\bar{x}_A(t),\delta )\), \(\forall A\in \mathcal {A}\), for all \(u\in \mathbb {B}_m(\bar{u}(t),1)\), a.e. \(t\in [s,T]\), where \(\overline{C}_x\) is the constant appearing in Remark 3.7. So the process \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is a \(W^{1,1}\)-minimizer (see Definition 4.1) for the optimal control problem (28), and the hypothesis (TH) is satisfied. Then, one can invoke Theorem 4.2, which provides conditions (i)–(iii)–(iv) of the statement. In order to obtain condition (ii), it is enough to observe that
and that \(\bar{u}(t)\) is clearly an interior point of U(t) for a.e. \(t\in [s,T]\). Hence, \(\bar{u}(t)\) has to satisfy also condition (ii) of Theorem 4.4. This completes the proof. \(\square \)
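Conditions (ii)-(iv) of Theorem 4.4 suggest a simple numerical scheme for a discrete \(\pi \): integrate the states forward, the adjoints backward, and update the control through condition (ii) with a relaxation step. A sketch under illustrative data (the relaxed update amounts to a preconditioned gradient step on the convex cost, so it converges for a small enough step size; this scheme is our illustration, not an algorithm proposed by the paper):

```python
import numpy as np

def solve_problem_B(atoms, weights, B, Q, R, Qf, x0, T,
                    steps=200, iters=400, alpha=0.1):
    """Forward-backward sweep for the averaged optimality system:
         forward:   x_A' = A x_A + B u,        x_A(0) = x0,
         backward: -p_A' = A^T p_A - Q x_A,    p_A(T) = -Qf x_A(T),
         update:    u <- R^{-1} B^T sum_i w_i p_{A_i}   (with relaxation)."""
    dt = T / steps
    Rinv = np.linalg.inv(R)
    u = np.zeros((steps, B.shape[1]))
    for _ in range(iters):
        xs, ps = [], []
        for A in atoms:                    # forward sweep, one state per atom
            x = [x0]
            for k in range(steps):
                x.append(x[-1] + dt * (A @ x[-1] + B @ u[k]))
            xs.append(x)
        for A, x in zip(atoms, xs):        # backward sweep, one adjoint per atom
            p = [None] * (steps + 1)
            p[steps] = -Qf @ x[steps]
            for k in range(steps, 0, -1):
                p[k - 1] = p[k] + dt * (A.T @ p[k] - Q @ x[k])
            ps.append(p)
        u_new = np.array([Rinv @ B.T @ sum(w * p[k] for w, p in zip(weights, ps))
                          for k in range(steps)])
        u = (1 - alpha) * u + alpha * u_new   # relaxed control update
    return u

def averaged_cost(atoms, weights, B, Q, R, Qf, x0, u, T):
    """J_{0,pi}[u] for the discrete pi, same Euler discretization."""
    steps, dt, total = len(u), T / len(u), 0.0
    for A, w in zip(atoms, weights):
        x, J = x0.copy(), 0.0
        for k in range(steps):
            J += 0.5 * dt * (x @ Q @ x + u[k] @ R @ u[k])
            x = x + dt * (A @ x + B @ u[k])
        total += w * (J + 0.5 * x @ Qf @ x)
    return total

atoms = [np.array([[0.0, 1.0], [0.0, 0.0]]),
         np.array([[0.0, 1.0], [-0.5, 0.0]])]   # two candidate models (illustrative)
weights = [0.5, 0.5]
B = np.array([[0.0], [1.0]])
Q, R, Qf = np.eye(2), np.eye(1), np.eye(2)
x0, T = np.array([1.0, 0.0]), 1.0
u_bar = solve_problem_B(atoms, weights, B, Q, R, Qf, x0, T)
J_opt = averaged_cost(atoms, weights, B, Q, R, Qf, x0, u_bar, T)
J_zero = averaged_cost(atoms, weights, B, Q, R, Qf, x0, np.zeros_like(u_bar), T)
print(J_opt, J_zero)   # the computed control beats the zero control
```

The signs in the backward sweep follow (iii)-(iv) verbatim, so the control update uses the plus sign of condition (ii).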
Remark 4.5
Theorem 4.4 provides the existence of a multiplier \(p_A(\cdot )\) for all \(A \in supp(\pi )\). In general, we can extend its definition to all \(A \in \mathcal {A}\), considering \(p_A(\cdot )\) as the unique solution of
where \(\left\{ (\bar{x}_A,\bar{u})(\cdot ):\; A\in \mathcal {A}\right\} \) is the unique minimizer of Problem B.
The following result guarantees that this extension defined in Remark 4.5 is continuous with respect to A:
Lemma 4.6
(Boundedness and continuity of multipliers) Let us consider the multipliers defined in (30) for each \(A \in \mathcal {A}\). They have the following properties:
-
(i)
There exists a positive constant \(\overline{C}_p\), independent of \(\pi \), which bounds all multipliers uniformly, i.e.,
$$\begin{aligned} |p_A(t) | \le \overline{C}_p \qquad \forall \, t \in [s,T], \, \forall \, A \in \mathcal {A}\ , \end{aligned}$$
-
(ii)
The map \(A \mapsto p_A(\cdot )\) is continuous.
Proof
The proof is similar to that of Lemma 3.5, with the only difference that here we need to apply the Grönwall lemma backward instead of forward. Notice that the final condition \(p_A(T) = -Q_f \bar{x}_A(T)\)
will not be the same for all \(A \in \mathcal {A}\), whereas in Lemma 3.5 the initial condition was the same for every A. However, the final condition is still continuous with respect to A by Lemma 3.5, so the same arguments can be applied to prove (ii). \(\square \)
5 Main convergence results
In this section, we will present the main results, which are, respectively, the convergence of the value function (Corollary 5.2) and the convergence of the optimal control (Theorem 5.3). Given a sequence of probability distributions \(\{ \pi ^N \}\subset \mathcal {M}(\mathcal {A})\), for each \(N \in \mathbb {N}\) we consider Problem \(B_{\pi ^N}\), namely problem (15) relative to the distribution \(\pi ^N\). Recalling the definition of the value function in (17), we define
If \(\{ \pi ^N \}\subset \mathcal {M}(\mathcal {A})\) is such that \(W_1(\pi ^N,\pi ^\infty )\rightarrow 0\) as \(N\rightarrow \infty \), what can be said about the convergence of the value functions \(V^N\) to \(V^\infty \) and of the optimal controls \(u^N\) to \(u^\infty \) as \(N\rightarrow \infty \)? In this section, we give an answer to those questions.
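Before stating the results, the convergence in question can be illustrated numerically. For a discrete \(\pi \), Problem B is itself a standard LQR on a stacked state space: one copy of the state per atom \(A_i\), state costs weighted by \(w_i\), and a single shared control, so its value can be computed by a Riccati equation. In the sketch below (illustrative data; the 1/2-normalized quadratic cost is an assumption on our part), the gap \(|V^N - V^\infty |\) shrinks as \(\pi ^N\) concentrates on \(\delta _{\hat{A}}\):

```python
import numpy as np

def riccati_value(A, B, Q, R, Qf, x0, T, steps=4000):
    """Value (1/2) x0^T P(0) x0 of a standard LQR, integrating the Riccati
    ODE -P' = A^T P + P A - P B R^{-1} B^T P + Q, P(T) = Qf, backward."""
    dt = T / steps
    Rinv = np.linalg.inv(R)
    P = Qf.astype(float).copy()
    for _ in range(steps):
        P = P + dt * (A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q)
    return 0.5 * x0 @ P @ x0

def averaged_value(atoms, weights, B, Q, R, Qf, x0, T):
    """Problem B for a discrete pi as one big LQR: one copy of the state per
    atom A_i, state costs weighted by w_i, a single shared control."""
    N, n = len(atoms), atoms[0].shape[0]
    A_big = np.zeros((N * n, N * n))
    Q_big, Qf_big = np.zeros_like(A_big), np.zeros_like(A_big)
    for i, (A, w) in enumerate(zip(atoms, weights)):
        s = slice(i * n, (i + 1) * n)
        A_big[s, s], Q_big[s, s], Qf_big[s, s] = A, w * Q, w * Qf
    B_big = np.vstack([B] * N)
    x0_big = np.concatenate([x0] * N)
    return riccati_value(A_big, B_big, Q_big, R, Qf_big, x0_big, T)

A_hat = np.array([[0.0, 1.0], [0.0, 0.0]])     # "true" dynamics (illustrative)
A_alt = A_hat + 0.4 * np.eye(2)                # a competing, wrong model
B = np.array([[0.0], [1.0]])
Q, R, Qf = np.eye(2), np.eye(1), np.eye(2)
x0, T = np.array([1.0, 0.0]), 1.0

V_true = riccati_value(A_hat, B, Q, R, Qf, x0, T)
gaps = [abs(averaged_value([A_hat, A_alt], [w, 1 - w], B, Q, R, Qf, x0, T) - V_true)
        for w in (0.5, 0.9, 0.99)]             # pi^N -> delta_{A_hat}
print(gaps)   # the gap shrinks together with W1(pi^N, delta_{A_hat})
```

Here \(W_1(\pi ^N, \delta _{\hat{A}}) = (1-w)\Vert A_{\mathrm {alt}}-\hat{A}\Vert _2\), so the roughly linear decay of the gap previews the Lipschitz estimate of Theorem 5.1.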
Theorem 5.1
(Lipschitz estimate for the value function w.r.t. \(\pi \)) Let the assumption (SH) be satisfied. Given \(\pi ,\,\pi ' \in \mathcal {M}(\mathcal {A})\) and \((s,x_0)\in [0,T]\times K\), with \(K\subset \mathbb {R}^n\) a compact set, let us consider the two value functions \(V_\pi \) and \(V_{\pi '}\) as defined in (17). Then, the distance between \(V_\pi \) and \(V_{\pi '}\) can be bounded uniformly for \((s,x_0)\in [0,T]\times K\), that is:
where \(C_K= C_K(T, C_A, \left\Vert {Q} \right\Vert _2, \left\Vert {Q_f} \right\Vert _2, r_1, K)\) is a constant which does not depend on the distributions \(\pi \) and \(\pi '\), but only on the problem data and on the compact sets \(\mathcal {A}\) and K.
Proof
We divide the proof into three steps.
STEP 1: Fix two matrices \(A, A' \in \mathcal {A}\), any point \((s,x_0) \in [0,T]\times \mathbb {R}^n\) and a control \(u \in \mathcal {U}\). Using Grönwall’s lemma as we did for point (ii) of Lemma 3.5, we get the following estimate:
with \(C_x^u\) given by Lemma 3.5.
Let us denote by \(\ell \) the running cost and by h the final cost:
so we can write
Since both \(\ell \) and h are locally Lipschitz continuous, one has
and, similarly,
where we defined the Lipschitz constants
these two constants inherit from \(C_x^u\) the dependency on \(x_0\) and u.
Finally, the cost difference between two single trajectories can be easily bounded by
STEP 2: Fix an initial condition \(x(s)=x_0 \in \mathbb {R}^n\) with \(s\in [0,T]\) and a control \(u \in \mathcal {U}\). We want to prove a bound for the distance between \(J_{s,\pi }[u]\) and \(J_{s,\pi '}[u]\).
As a property of \(W_1\), there exists (see Theorem 4.1 in [27]) a probability distribution \(\gamma ^*\in \varGamma (\pi ,\pi ')\) on \(\mathcal {A}\times \mathcal {A}\) with marginal distributions \(\pi \) and \(\pi '\) such that
where \(d_2\) is the distance introduced in (7). We can thus write
where we have used that
for all measurable functions \(\phi , \psi \) on \(\mathcal {A}\), since \(\gamma ^*\) admits \(\pi \) and \(\pi '\) as marginals.
Using the bound (33) from STEP 1 and formula (34), we get
Note that the constant \(C_x^u\) which appears here depends merely on \(x_0\) and u.
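As an aside, for finitely supported measures the optimal coupling \(\gamma ^*\) used in STEP 2 can be computed explicitly as a small linear program (the Kantorovich formulation of \(W_1\)). The Python sketch below, which uses SciPy's `linprog` and replaces the distance \(d_2\) by the Frobenius norm for illustration, is not part of the proof but shows how the quantity \(W_1(\pi ,\pi ')\) appearing in the estimate can be evaluated in practice:

```python
import numpy as np
from scipy.optimize import linprog

def w1_discrete(atoms, alpha, beta):
    """W1 distance between two measures supported on the same finite
    set of matrices, with weights alpha and beta, via the Kantorovich
    linear program: minimize sum_{i,j} gamma[i,j] d(A_i, A_j) over all
    couplings gamma with marginals alpha and beta."""
    M = len(atoms)
    # transport cost c[i, j] = distance between atoms i and j
    c = np.array([[np.linalg.norm(atoms[i] - atoms[j]) for j in range(M)]
                  for i in range(M)]).ravel()
    # equality constraints: rows of gamma sum to alpha, columns to beta
    A_eq = np.zeros((2 * M, M * M))
    for i in range(M):
        A_eq[i, i * M:(i + 1) * M] = 1.0   # i-th row marginal
        A_eq[M + i, i::M] = 1.0            # i-th column marginal
    b_eq = np.concatenate([alpha, beta])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

When one of the two measures is a Dirac delta, the coupling is forced and the program reduces to a weighted sum of distances, as exploited later in Sect. 6.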
STEP 3: We will now show that an estimate similar to (35) holds true even for the value functions \(V_\pi \) and \(V_{\pi '}\).
Fix any point \((s, x_0)\in [0,T]\times K\). In view of Lemma 3.6, there exist controls \(\bar{u}, \bar{u}' \in \mathcal {U}\) such that
Then, one has
and, in the same way,
Hence,
Moreover, since both \(\bar{u}\) and \(\bar{u}'\) are optimal controls for some distribution on \(\mathcal {A}\), we can use the uniform constant \(\overline{C}_x\) given by Remark 3.7 and define analogous uniform constants for the Lipschitz continuity of \(\ell \) and h:
In this way, the estimate becomes independent of \(\bar{u}\) and \(\bar{u}'\), and thus of \(\pi \) and \(\pi '\):
where
Finally, noting that the estimate depends on \(x_0\) only through its norm, we get that the bound is uniform in compact sets \(K \subset \mathbb {R}^n\), letting
\(\square \)
A straightforward consequence of the previous theorem is the following:
Corollary 5.2
(Convergence of the value function) If \(\pi ^N \overset{*}{\rightharpoonup }\pi ^\infty \), then \(V^N(s, x_0) \rightarrow V^\infty (s, x_0)\) for each \((s, x_0) \in [0,T]\times \mathbb {R}^n\). The convergence is uniform in compact sets \([0,T]\times K\), where \(K \subset \mathbb {R}^n\) is compact.
In what follows, we will use \(\bar{u}^N(\cdot )\) to denote the optimal control of Problem \(B_{\pi ^N}\). Furthermore, \(\bar{x}_A^N(\cdot )\) and \(p_A^N(\cdot )\) denote, respectively, the optimal trajectories and the multipliers relative to Problem \(B_{\pi ^N}\), that is the solutions of the differential systems (18) and (30).
The following theorem provides a strong convergence of \(\bar{u}^N(\cdot )\) to the optimal control \(\bar{u}^\infty (\cdot )\) of the limit problem \(B_{\pi ^\infty }\), assuming that \(\pi ^N \overset{*}{\rightharpoonup }\pi ^\infty \).
Theorem 5.3
(Convergence of the optimal control) Let the assumption (SH) be satisfied. Consider a sequence of probability distributions \(\{ \pi ^N \}\subset \mathcal {M}(\mathcal {A})\) such that \(\pi ^N \overset{*}{\rightharpoonup }\pi ^\infty \), and fix \(s \in [0,T]\) and \(x_0\in \mathbb {R}^n\). If \(\bar{u}^N(\cdot )\) and \(\bar{u}^\infty (\cdot )\) are, respectively, the optimal controls of Problems \(B_{\pi ^N}\) and \(B_{\pi ^\infty }\), i.e., they satisfy (19), respectively, for \(\pi ^N\) and \(\pi ^\infty \), then \(\bar{u}^N(\cdot )\) converges uniformly to \(\bar{u}^\infty (\cdot )\) as \(N \rightarrow \infty \).
Proof
Without loss of generality, one can take \(s=0\), since all the other cases are similar.
Lemma 3.6 ensures that, for each \(\pi ^N\in \mathcal {M}(\mathcal {A})\), there exists a unique optimal process \(\left\{ (\bar{x}_A^N,\bar{u}^N)(\cdot ):\; A\in \mathcal {A}\right\} \). Taking into account Theorem 4.4 and Remark 4.5, that process satisfies the following necessary conditions: For each \(N\in \mathbb {N}\), there exists a continuous function \(p^N :[0,T] \times \mathcal {A}\rightarrow \mathbb {R}^n\), \(p^N(t,A)\equiv p_A^N(t)\), such that
(i) $$\begin{aligned} p_A^N(\cdot ) \in W^{1,1}([0,T]; \mathbb {R}^n) \qquad \forall \, A \in \mathcal {A} \, ; \end{aligned}$$
(ii) $$\begin{aligned} \bar{u}^N(t) = + R^{-1} B^T \int _{\mathcal {A}} p_A^N(t) \, d\pi ^N(A), \qquad \forall \;t \in [0,T] \ ; \end{aligned}$$
(iii) $$\begin{aligned} -\dot{p}_A^N(t) = A^T p^N_A(t) - Q \bar{x}_A^N(t) \qquad a.e.\ t \in [0,T], \; \forall A \in \mathcal {A} \, ; \end{aligned}$$
(iv) $$\begin{aligned} -p_A^N(T) = Q_f \bar{x}_A^N(T) \qquad \forall A \in \mathcal {A} \, . \end{aligned}$$
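For a finitely supported \(\pi ^N\), the integral in condition (ii) reduces to a weighted sum of the costates \(p_A^N(t)\). A minimal sketch of this evaluation (the names and shapes below are illustrative; the sign convention follows (ii) above):

```python
import numpy as np

def averaged_control(B, R, p_list, alpha):
    """Condition (ii) for pi^N = sum_i alpha_i delta_{A_i}: the integral
    over the space of matrices reduces to the weighted average of the
    costates, and u^N(t) = R^{-1} B^T sum_i alpha_i p_{A_i}(t)."""
    p_bar = sum(a * p for a, p in zip(alpha, p_list))   # averaged costate
    return np.linalg.inv(R) @ B.T @ p_bar
```
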
For each \(A \in \mathcal {A}\), the family of functions
is uniformly bounded due to Lemmas 3.5 and 4.6. Moreover, one can find the bounds on the derivatives:
for all \(A\in \mathcal {A}\), a.e. \(t\in [0,T]\). The second bound in (36) and the relation (ii) imply that also the map
is equibounded and equicontinuous in N. For each \(A\in \mathcal {A}\) fixed, one can then apply Theorem 2.5.3 of [28], implying the existence of some limit functions \(x_A^\infty , p_A^\infty \in W^{1,1}([0,T]; \mathbb {R}^n)\) and \(u^{\infty }\in C^{0}([0,T];\mathbb {R}^m)\) such that
and such that, for each \(A\in \mathcal {A}\), \((x^\infty _A, p_A^\infty )(\cdot )\) is a solution of the boundary value problem
Furthermore, since \(\pi ^N \overset{*}{\rightharpoonup }\pi ^\infty \), \(u^\infty \) satisfies the relation
In fact, the result follows from the estimate
for all \(t\in [0,T]\), which implies that
uniformly on [0, T] as \(N \rightarrow \infty \).
Notice that the convergence is a priori guaranteed only along a subsequence; however, the whole sequence converges, since the limit does not depend on the chosen subsequence (it solves (37)).
It remains to show that the limiting process \(\{(x^\infty _A, u^\infty )(\cdot ) :A \in \mathcal {A}\}\) is actually optimal for the Problem B (15) relative to \(\pi ^\infty \). To this aim, let us stress the following properties of the cost functional \(J_{0, \pi }\) in (16), using the lighter notation
1) if \(\pi ^N \overset{*}{\rightharpoonup }\pi ^\infty \), then \(J^N[u] \rightarrow J^\infty [u]\) for each \(u \in \mathcal {U}\), since the map \(A \mapsto x_A(\cdot )\) is continuous by Lemma 3.5;
2) each \(J^N\) is continuous with respect to u, since for each \(A \in \mathcal {A}\), the map \(u \mapsto x_A(\cdot ; u)\) is continuous, again by Lemma 3.5.
Since \(\bar{u}^N\) is the optimal control of Problem \(B_{\pi ^N}\) and u is an admissible control for the same problem, we get
so, letting \(N \rightarrow \infty \), it easily follows from the previous relation and properties 1) and 2) that
In view of the uniqueness of the optimal control \(\bar{u}^\infty \) for Problem \(B_{\pi ^\infty }\), one can conclude that \(u^\infty \equiv \bar{u}^\infty \). Hence, also the process \(\left\{ (x_A^\infty ,u^\infty )(\cdot ):\; A\in \mathcal {A}\right\} \) is optimal for the given problem. This concludes the proof. \(\square \)
Remark 5.4
(Special Case: \(\pi = \delta _{\hat{A}})\) The particular case in which \(\pi \) is a Dirac delta \(\delta _{\hat{A}}\) for a given matrix \(\hat{A} \in \mathbb {M}_{n \times n}\) deserves special attention. Indeed, when \(\pi = \delta _{\hat{A}}\), the cost functional \(J_{s, \pi }\) (16) becomes the cost functional
and Problem B in (15) coincides with a standard Problem A (see (10))
Furthermore, the definition of the value function and that of the optimal control we gave in (17) and (19) agree, in this special case, with the classic definitions in control theory (see (13) and (14)):
If we apply Corollary 5.2 and Theorem 5.3 to a sequence \(\pi ^N\) converging to \(\delta _{\hat{A}}\), then we obtain
and
where \(V^N\) and \(\bar{u}^N\) are, respectively, the value function in (17) and the optimal control (19) relative to \(\pi ^N\).
Remark 5.5
(Additional remarks on the connection with RL) Let us comment the results of this section bearing in mind the PILCO algorithm, presented in Remark 3.4. Recall that PILCO uses the agent experience on the environment to tune the parameters of a Gaussian process (GP), building up a stochastic model for the dynamics. The GP is then updated as the agent collects more data on the environment. This procedure generates a sequence of probability distributions \(\pi ^N\) on the space of continuous functions, which should get sufficiently close to the Dirac delta representing the actual dynamics.
Now, assume that we are applying PILCO to the linear system (12) in which the agent does not know the matrix \(\hat{A}\). Then, one can think of \(\left\{ \pi ^N\right\} _{N\in \mathbb {N}}\) as a sequence of probability distributions on the space of matrices \(\mathbb {M}_{n \times n}\) converging to a Dirac delta concentrated at the true matrix \(\hat{A}\). If, at each step, the agent picks the control \(u^N\), namely the one minimizing the expected cost with respect to \(\pi ^N\), and \(\pi ^N \overset{*}{\rightharpoonup }\pi ^\infty \), then one can apply Theorem 5.3 in the special case presented in Remark 5.4 and obtain the uniform convergence \(\bar{u}^N \rightarrow \bar{u}^\infty \), where \(\bar{u}^\infty \) is the optimal control for the limit problem.
This suggests that even if the distribution \(\pi ^N\) never exactly reaches \(\delta _{{\hat{A}}}\) but gets sufficiently close to it, the control \(\bar{u}^N\) is only slightly suboptimal for the actual LQR problem.
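To make the remark concrete, here is a toy finite-hypothesis sketch (not PILCO itself, which works with Gaussian processes) of how repeated noisy observations of a linear system can produce a sequence of beliefs \(\pi ^N\) concentrating at the true matrix; the dynamics, noise level, candidate set and Euler discretization below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bayesian update of the weights alpha_i of pi^N = sum_i alpha_i^N delta_{A_i}
# from noisy Euler-step observations of the (uncontrolled) true system.
A_true = np.array([[0.0, 1.0], [-1.0, 0.0]])             # harmonic oscillator
candidates = [A_true, A_true + 0.3 * np.eye(2), A_true - 0.3 * np.eye(2)]
alpha = np.full(len(candidates), 1.0 / len(candidates))  # uniform prior

dt, sigma = 0.1, 0.05
x = np.array([1.0, 0.0])
for _ in range(200):
    x_next = x + dt * A_true @ x + sigma * rng.normal(size=2)
    # Gaussian likelihood of the observed increment under each candidate
    lik = np.array([np.exp(-np.sum((x_next - x - dt * A @ x) ** 2)
                           / (2 * sigma ** 2)) for A in candidates])
    alpha = alpha * lik
    alpha = alpha / alpha.sum()      # posterior weights of pi^N
    x = x_next
# alpha now concentrates on the true matrix: pi^N -> delta_{A_true}
```

After a few hundred observations, the posterior weight of the true matrix is close to 1, i.e., \(\pi ^N\) is close to \(\delta _{\hat{A}}\) in \(W_1\), which is exactly the regime where Theorem 5.3 applies.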
6 A case study: finite support measures converging to \(\delta _{\hat{A}}\)
In this section, we will assume that \(\mathcal {A}\) is a finite set, namely \(\mathcal {A} :=\{A_1,\ldots , A_M\}\) for some integer \(M\in \mathbb {N}\). Let us consider a sequence of probability distributions \(\{\pi ^N\}\subset \mathcal {M}(\mathcal {A})\), which can be written as
For a given \(s\in [0,T]\) and \(x_0\in \mathbb {R}^n\), suppose that the underlying dynamics governing the optimal control problem we are interested in is a standard Problem A (see (10)):
where \({\hat{A}}\in \mathcal {A}\) and the cost functional \(J_{s}[u]\) is defined as in (11):
Without loss of generality, one can set \(\hat{A}\equiv A_1\). For each \(s\in [0,T]\) and \(x_0\in \mathbb {R}^n\), the value function \(V(s, x_0)\) for this problem has been defined in (13).
Then, one can expect that, after some interactions with the system, it is possible to construct a sequence of probability distributions \(\{\pi ^N\}\subset \mathcal {M}(\mathcal {A})\) capturing the current belief about the real system (42), such that \(\pi ^N\) gets closer and closer to \(\delta _{A_1}\) as N grows and eventually \(\pi ^N \overset{*}{\rightharpoonup }\delta _{A_1}\).
For each fixed \(N\in \mathbb {N}\), one can reformulate Problem B associated with \(\pi ^N\) as a classical LQR problem on an augmented system of dimension nM:
with cost functional
where we have used the compact notation \(X(t) :=\left( x_{A_1}(t),\ldots ,x_{A_M}(t)\right) \), \(X_0\in \mathbb {R}^{nM}\) and
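For the reader's convenience, a possible construction of the augmented matrices in Python follows. The block structure below (block-diagonal dynamics, the same B for each copy, and the weights \(\alpha _i^N\) entering the state cost) is our reading of the reformulation (44), consistent with the averaged cost functional, and should be checked against the displayed definitions:

```python
import numpy as np
from scipy.linalg import block_diag

def augmented_system(A_list, B, Q, alpha):
    """Augmented matrices for the nM-dimensional reformulation (44):
    the M copies of the state evolve independently (block-diagonal
    dynamics) under the same control, and the running cost averages
    the M quadratic costs with the weights alpha_i."""
    A_tilde = block_diag(*A_list)                  # diag(A_1, ..., A_M)
    B_tilde = np.vstack([B] * len(A_list))         # same B for every copy
    Q_tilde = block_diag(*[a * Q for a in alpha])  # diag(alpha_i * Q)
    return A_tilde, B_tilde, Q_tilde
```
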
In this section, we will use \(V^N(s,X_0)\) to denote the value function related to the LQR problem (44), namely
Since the optimal control problem (44) can be regarded as a classic LQR problem, one has the following relation between the value function and the optimal control in feedback form (see, e.g., [13], Theorem 3.4):
there exists \(P^N\) such that
where \([s,T]\ni t \mapsto P^N(t) \in \mathbb {M}_{nM \times nM}\) solves the Riccati equation
For \(x_0\in \mathbb {R}^n\), we will use the notation \(V^N(s,x_0)\) to denote the value function \(V^N(s,X_0)\) evaluated at \(X_0=(x_0,\ldots ,x_0)\in \mathbb {R}^{nM}\).
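The value function and the feedback law in (48) are obtained from the Riccati solution. A sketch of the backward integration follows, assuming (49) has the standard LQR form \(-\dot{P} = A^T P + P A - P B R^{-1} B^T P + Q\) with \(P(T) = Q_f\) (for the augmented problem, the matrices would be the augmented ones):

```python
import numpy as np
from scipy.integrate import solve_ivp

def solve_riccati(A, B, Q, R, Qf, s, T, n_eval=200):
    """Backward integration of the LQR Riccati ODE
        -P' = A^T P + P A - P B R^{-1} B^T P + Q,   P(T) = Qf,
    returning the time grid (ascending) and P(t) as an (n, n, n_eval) array."""
    n = A.shape[0]
    Rinv = np.linalg.inv(R)

    def rhs(t, p):
        P = p.reshape(n, n)
        dP = -(A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q)
        return dP.ravel()

    ts = np.linspace(T, s, n_eval)           # integrate backward in time
    sol = solve_ivp(rhs, (T, s), Qf.ravel(), t_eval=ts, rtol=1e-8, atol=1e-10)
    return sol.t[::-1], sol.y[:, ::-1].reshape(n, n, -1)
```

The optimal feedback is then \(u(t) = -R^{-1} B^T P(t)\, x(t)\), up to the sign convention used in (48).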
We can summarize the results of the previous section applied to problem (44) as follows:
Corollary 6.1
Let the assumption (SH) be satisfied. For each \(s\in [0,T]\), \(x_0\in \mathbb {R}^{n}\) and \(\{\pi ^N\} \subset \mathcal {M}(\mathcal {A})\), the optimal control problems (44) and (42) satisfy the following conditions:
(i) problems (44) and (42) admit a unique optimal process \(\left\{ (\bar{x}_A^N,\bar{u}^N)(\cdot ):\; A\in \mathcal {A}\right\} \) and \((\bar{x}, \bar{u})(\cdot )\), respectively;
(ii) for each compact \(K\subset \mathbb {R}^n\), there exists \(C_K\) such that
$$\begin{aligned} ||V^N - V||_{\infty ,K} \le C_K\, W_1(\pi ^N,\delta _{{\hat{A}}}) \ ; \end{aligned}$$ (50)
(iii) if, moreover, \(\{\pi ^N\}\) is such that \(\pi ^N\overset{*}{\rightharpoonup }\delta _{{\hat{A}}}\), then the optimal control \(\bar{u}^N\rightarrow \bar{u}\) uniformly for \(t\in [s,T]\).
For each \(N\in \mathbb {N}\), the solution of (49) is related to the matrices \(X^N, Y^N:[s,T] \rightarrow \mathbb {M}_{nM \times nM}\), such that the pair \((X^N,Y^N)(\cdot )\) is the solution of the backward Hamiltonian differential equation
where
The relation between the solutions of (49) and (51) was stated in precise terms by Coppel in [5, pp. 274–275]:
Theorem 6.2
Suppose that \(\mathbf (SH) \) holds true. Let \(X, Y:[s,T] \rightarrow \mathbb {M}_{nM \times nM}\) be the solutions of the Hamiltonian differential problem (51). Then,
1. X(t) is non-singular for all \(t \in [s,T]\);
2. the solution of (49) is
$$\begin{aligned} \widetilde{P}(t) = Y(t) X^{-1}(t), \qquad t \in [s,T] \ . \end{aligned}$$ (53)
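Theorem 6.2 can be checked numerically: integrating the linear Hamiltonian system backward and forming \(Y(t)X^{-1}(t)\) reproduces the Riccati solution. A sketch, where the standard LQR Hamiltonian is assumed for the matrix H in (52):

```python
import numpy as np
from scipy.integrate import solve_ivp

def riccati_via_hamiltonian(A, B, Q, R, Qf, s, T):
    """Compute P(s) = Y(s) X(s)^{-1} as in (53) by integrating the linear
    Hamiltonian system d/dt (X; Y) = H (X; Y) backward from X(T) = I,
    Y(T) = Qf, with the standard LQR Hamiltonian H."""
    n = A.shape[0]
    S = B @ np.linalg.inv(R) @ B.T
    H = np.block([[A, -S], [-Q, -A.T]])

    def rhs(t, z):
        Z = z.reshape(2 * n, n)
        return (H @ Z).ravel()

    z_T = np.vstack([np.eye(n), Qf]).ravel()
    sol = solve_ivp(rhs, (T, s), z_T, rtol=1e-9, atol=1e-11)
    Z_s = sol.y[:, -1].reshape(2 * n, n)
    X_s, Y_s = Z_s[:n], Z_s[n:]
    return Y_s @ np.linalg.inv(X_s)   # X(s) is nonsingular by Theorem 6.2
```

Unlike the Riccati ODE, the Hamiltonian system is linear, which is what makes the Grönwall and compactness arguments in the proof of Theorem 6.3 below straightforward.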
We are now ready to strengthen the results of the previous section by showing that, when the sequence of measures and its limit have finite support, the solution of the Riccati equation (49) also has good convergence properties:
Theorem 6.3
(Convergence of the Riccati equation) Let the assumption (SH) be satisfied. Suppose that the sequence \(\{\pi ^N\}\subset \mathcal {M}(\mathcal {A})\) is such that \(\pi ^N\overset{*}{\rightharpoonup }\delta _{A_1}\), that is, for each \(i=1, \dots , M\) the weights converge:
when \(N\rightarrow \infty \). Then, the sequence of matrices \(\{ \widetilde{P}^N(t) \}\subset \mathbb {M}_{nM \times nM}\) which solve the Riccati equation (49) for each \(N\in \mathbb {N}\) converges to the matrix
uniformly on \(t \in [s,T]\), where \(P(t)\in \mathbb {M}_{n \times n}\) is the solution of the Riccati equation related to the optimal control problem (42) with state matrix \(A_1\), namely:
Proof
Consider the Hamiltonian systems in (51) related to all different distributions \(\pi ^N\). Notice that the norm of \(H^N\) in (52) can be bounded:
for all \(N\in \mathbb {N}\).
Using Grönwall’s lemma, one can easily show that the pair of matrices \((X^N, Y^N)\) solving (51) is uniformly bounded and that, using again (51), \((\dot{X}^N, \dot{Y}^N)\) is uniformly integrally bounded, for all \(N\in \mathbb {N}\). It is then possible to apply Theorem 2.5.3 of [28] to show that the pair \((X^N, Y^N)\) converges to some matrices \((X^\infty , Y^\infty )\) solving the system
where
and
Then, in view of Theorem 6.2, the matrix \(X^\infty (t)\) is also nonsingular for each \(t \in [s,T]\), and the matrix \(\widetilde{P}^\infty (t) :=Y^\infty (t) X^\infty (t)^{-1}\) is the solution of the Riccati equation
Furthermore, since each \(X^N(t)^{-1}\) is continuous and well defined for each \(N\in \mathbb {N}\) and \(X^{\infty }(t)^{-1}\) is uniformly continuous on [s, T], we have
uniformly on \(t \in [s,T]\). Finally, consider the matrix \(\bar{P}(t)\in \mathbb {M}_{nM \times nM}\),
where \(P(t)\in \mathbb {M}_{n \times n}\) is the unique solution of the Riccati equation (54). A direct verification shows that \(\bar{P}(t)\in \mathbb {M}_{nM \times nM}\) also satisfies the Riccati equation (57). However, the problem (57) admits a unique solution, implying that \(\widetilde{P}^\infty (t) \equiv \bar{P}(t)\) for all \(t\in [s,T]\). This concludes the proof. \(\square \)
Remark 6.4
(Feedback optimal controls) We would like to stress that the convergence of the Riccati matrix solution \(\tilde{P}^N\) to P provided by Theorem 6.3 has the following implication: For each \((s,x_0)\in [0,T]\times \mathbb {R}^n\), we define
and then \(\bar{u}^N\) tends to \({\bar{u}}\) as N goes to \(+\infty \). Namely, the optimal control of problem (44), which satisfies formula (48), evaluated at \(X_0=\left( x_0,\ldots , x_0 \right) \), converges pointwise to the optimal control of problem (42). Furthermore, for each compact \(K\subset \mathbb {R}^n\), one has that
It is important to point out that whereas Theorem 5.3 proves the convergence for the class of optimal open-loop controls, Theorem 6.3 deals with the convergence of optimal controls in feedback form.
It remains an open question whether Theorem 6.3 can be proved for a generic sequence of probability measures \(\left\{ \pi ^N\right\} _{N\in \mathbb {N}}\) converging weakly-* to a generic probability measure \(\pi \). Such an issue is delicate and will be studied in a forthcoming paper.
7 A numerical example
The aim of this section is to verify that the results summarized in Corollary 6.1 hold in a concrete example.
The model is the one presented in Sect. 6. The true dynamics is a controlled harmonic oscillator described by the matrices
the coefficients that are used to define the cost functional are
and the final time is \(T = 5\). Let us write \(\pi ^N\) as in (41) with \(M = 9\), where the 9 matrices are defined in a neighborhood of the matrix \(\hat{A}\), i.e.,
where \(\{e_j\}_{j=1,\dots ,4}\) are the matrices of the canonical basis of \(\mathbb {R}^{2 \times 2}\). The probabilities \(\alpha _i^N\) are defined according to the following rule:
Note that the Wasserstein distance with respect to the Euclidean norm on \(\mathbb {R}^{2 \times 2}\) between \(\pi ^N\) and \(\delta _{\hat{A}}\) can be computed exactly:
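Since the target measure is a Dirac delta, the optimal transport plan is forced (all mass must travel to \(\hat{A}\)) and \(W_1\) reduces to the \(\pi ^N\)-average of the distances to \(\hat{A}\). A one-line sketch of this computation (the weights passed in would be the \(\alpha _i^N\) defined above):

```python
import numpy as np

def w1_to_dirac(alpha, A_list, A_hat):
    """W1 between pi = sum_i alpha_i delta_{A_i} and delta_{A_hat} with
    respect to the Euclidean (Frobenius) norm: the coupling is forced,
    so W1 = sum_i alpha_i * ||A_i - A_hat||."""
    return sum(a * np.linalg.norm(Ai - A_hat) for a, Ai in zip(alpha, A_list))
```
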
Both Problem A, that is, the LQR problem with the matrix \(\hat{A}\) known, and Problem B, that is, the LQR problem with the matrix \(\hat{A}\) unknown, can be solved by finding the solution of a Riccati equation (see §IV.5 in the Fleming–Rishel monograph [9]). We solved the equation numerically for \(N = 0, \dots , 9\). For each N, we computed the sup norm of the difference \(\widetilde{V}^N-V\) and of the difference \(\bar{u}^N-\bar{u}\), where \(\bar{u}^N\) and \(\bar{u}\) are, respectively, the optimal controls for the two problems starting from \(x_0 = (1,0)\). The results are summarized in Table 1. All the computations have been done using MATLAB on a MacBook Air 13” 2017 with Intel Core i5 Processor (2x 1.8 GHz).
Notice that when we increase N by one, we halve the distance \(W_1(\pi ^N,\delta _{\hat{A}})\), and Table 1 tells us that the error \(||\widetilde{V}^N-V||_{\infty , K}\), with \(K :=[-2,2]^2\), is halved as well; this is consistent with the estimate given by Corollary 6.1. At the same time, we remark that the error \(||\bar{u}^N-\bar{u}||_\infty \) is also halved, even though we did not have any estimate on the convergence rate of the optimal controls. We can say that, in this example, both errors go to 0 with order 1.
The optimal trajectory of Problem A starting from \(x_0 = (1,0)\) is represented in Fig. 1. For Problem B, the optimal trajectory actually consists of 9 trajectories, whose costs are averaged with the weights \(\alpha _i^N\) in order to compute the cost functional \(\widetilde{J}\). Two examples of optimal (multi-)trajectory, for \(\pi ^0\) and \(\pi ^2\) respectively, are represented in Figs. 2 and 3; note that the trajectory related to the true dynamics is \(x^1(t)\), which is the darkest one. Finally, in Fig. 4 the optimal controls for Problem B with \(N = 0, \dots , 4\) are compared with the true optimal control.
8 Conclusions
In this paper, we proved some convergence properties for the optimal policies of LQ optimal control problems with uncertainties (our Problem B), assuming that the current belief on the dynamics is represented by a generic probability distribution \(\pi \) on the space of matrices. Under standard hypotheses on the system dynamics and the cost functional, we proved that the open-loop optimal control \(\bar{u}_\pi \) of Problem B converges to the open-loop optimal control \(\bar{u}\) of the actual system as soon as the distribution \(\pi \) is sufficiently close (w.r.t. the Wasserstein distance (9)) to the Dirac delta \(\delta _{{\hat{A}}}\) concentrated at the actual system matrix \(\hat{A}\). We also showed that when the probability distribution \(\pi \) is a discrete measure, the closed-loop optimal control of Problem B also converges to the closed-loop optimal control of the actual system when \(\pi \) is sufficiently close to \(\delta _{{\hat{A}}}\). The latter result was also validated by the numerical example presented in Sect. 7.
It is worth stressing that the proposed approach has strong connections with several Bayesian-like RL algorithms (such as PILCO), providing a theoretical framework to obtain stability and convergence guarantees for such algorithms.
As a future direction, we would like to extend this approach to a nonlinear, control affine optimal control problem with convex functional, getting closer to the problem formulation studied in [7]. A first attempt in this direction has been recently proposed in [22]. Furthermore, we are interested in constructing new efficient RL algorithms using well-established tools from control theory. In this context, we have already developed a new method for solving LQR problems with unknown dynamics [20].
References
Atkeson CG, Santamaria JC (1997) A comparison of direct and model-based reinforcement learning. In: Proceedings of the international conference on robotics and automation, vol 4, pp 3557–3564. https://doi.org/10.1109/ROBOT.1997.606886
Bettiol P, Khalil N (2019) Necessary optimality conditions for average cost minimization problems. Discrete Continuous Dyn Syst B 24(5):2093
Chowdhary G, Kingravi HA, How JP, Vela PA (2013) A Bayesian nonparametric approach to adaptive control using Gaussian processes. In: CDC, IEEE, pp 874–879
Chua K, Calandra R, McAllister R, Levine S (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In: Advances in neural information processing systems, vol 2018-December, pp 4754–4765
Coppel WA (1975) Linear-quadratic optimal control. Proc R Soc Edinb Sect A Math 73:271–289
Deisenroth MP (2010) Efficient reinforcement learning using Gaussian processes, vol 9. KIT Scientific Publishing
Deisenroth MP, Fox D, Rasmussen CE (2013) Gaussian processes for data-efficient learning in robotics and control. IEEE Trans Pattern Anal Mach Intell 37(2):408–423
Doya K (2000) Reinforcement learning in continuous time and space. Neural Comput 12(1):219–245
Fleming WH, Rishel RW (2012) Deterministic and stochastic optimal control, vol 1. Springer, Berlin
Gal Y, McAllister R, Rasmussen CE (2016) Improving PILCO with Bayesian neural network dynamics models. In: Data-efficient machine learning workshop, ICML, vol 4, p 25
Janner M, Fu J, Zhang M, Levine S (2019) When to trust your model: model-based policy optimization. In: Advances in neural information processing systems, vol 32, pp 12519–12530
Kamthe S, Deisenroth M (2018) Data-efficient reinforcement learning with probabilistic model predictive control. In: International conference on artificial intelligence and statistics, PMLR, pp 1701–1710
Kwakernaak H, Sivan R (1972) Linear optimal control systems, vol 1. Wiley-Interscience, New York
Lee JY, Park JB, Choi YH (2014) Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations. IEEE Trans Neural Netw Learn Syst 26(5):916–932
Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst Mag 9(3):32
Lohéac J, Zuazua E (2016) From averaged to simultaneous controllability. Ann Fac Sci Toulouse Math 25:785–828
Munos R (2000) A study of reinforcement learning in the continuous case by the means of viscosity solutions. Mach Learn 40(3):265–299
Murray R, Palladino M (2018) A model for system uncertainty in reinforcement learning. Syst Control Lett 122:24–31
Murray R, Palladino M (2019) Modelling uncertainty in reinforcement learning. In: 2019 IEEE 58th conference on decision and control (CDC), IEEE, pp 2436–2441
Pacifico A, Pesare A, Falcone M (in press) A new algorithm for the LQR problem with partially unknown dynamics. In: 2021 large-scale scientific computing (LSSC). Springer International Publishing
Palladino M (2016) Necessary conditions for adverse control problems expressed by relaxed derivatives. Set-Valued Var Anal 24(4):659
Pesare A, Palladino M, Falcone M (in press) Convergence of the value function in optimal control problems with unknown dynamics. In: 2021 European control conference (ECC), IEEE
Recht B (2019) A tour of reinforcement learning: the view from continuous control. Ann Rev Control Robot Auton Syst 2:253–279
Ross IM, Proulx RJ, Karpenko M, Gong Q (2015) Riemann-Stieltjes optimal control problems for uncertain dynamic systems. J Guid Control Dyn 38(7):1251–1263
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction, 2nd edn. MIT Press, Cambridge, MA
Sutton RS, Barto AG, Williams RJ (1992) Reinforcement learning is direct adaptive optimal control. IEEE Control Syst 12(2):19–22
Villani C (2008) Optimal transport: old and new, vol 338. Springer, Berlin
Vinter R (2010) Optimal control. Springer, Berlin
Wang T, Bao X, Clavera I, Hoang J, Wen Y, Langlois E, Zhang S, Zhang G, Abbeel P, Ba J (2019) Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057
Zuazua E (2014) Averaged control. Automatica 50(12):3077–3087
Funding
Open access funding provided by Universitá degli Studi di Roma La Sapienza within the CRUI-CARE Agreement.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Published in the topical collection Machine Learning for Control Systems and Optimal Control.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pesare, A., Palladino, M. & Falcone, M. Convergence results for an averaged LQR problem with applications to reinforcement learning. Math. Control Signals Syst. 33, 379–411 (2021). https://doi.org/10.1007/s00498-021-00294-y
Keywords
- Reinforcement learning
- Linear quadratic regulator
- Averaged control
- Optimal control
- Model-based RL
- Convergence