COUPLING AND A GENERALISED POLICY ITERATION ALGORITHM IN
CONTINUOUS TIME
arXiv:1707.07834v1 [math.PR] 25 Jul 2017
SAUL D. JACKA, ALEKSANDAR MIJATOVIĆ, AND DEJAN ŠIRAJ
Abstract. We analyse a version of the policy iteration algorithm for the discounted infinite-horizon problem for controlled multidimensional diffusion processes, where both the drift and
the diffusion coefficient can be controlled. We prove that, under assumptions on the problem
data, the payoffs generated by the algorithm converge monotonically to the value function and
an accumulation point of the sequence of policies is an optimal policy. The algorithm is stated
and analysed in continuous time and state, with discretisation featuring neither in theorems nor
the proofs. A key technical tool used to show that the algorithm is well-defined is the mirror
coupling of Lindvall and Rogers.
1. Introduction
Howard’s policy iteration algorithm (PIA) [13] is a well-known tool for solving control problems for Markov decision processes (see e.g. [4] for a recent survey of approximate policy iteration methods for finite-state, discrete-time, stochastic dynamic programming problems). The algorithm can be recast for general state-space, continuous-time control problems, where the allowed actions can be chosen from a Polish space. In this paper we investigate the convergence of
the PIA for an infinite horizon discounted cost problem in the context of controlled diffusion
processes in Rd , where the control takes place in an arbitrary compact metric space. The main
aim of the paper is to analyse the convergence of a sequence of policies and the corresponding
payoff functions produced by the PIA under assumptions that are at least in principle verifiable
in terms of the model data.
Our control setting is similar to that of [1], where an ergodic cost criterion was considered.
The main differences, beyond the cost criterion, are as follows: (1) we allow the controller to
modulate the drift as well as the diffusion coefficient; (2) we consider a generalised version of the
PIA where an arbitrary scaling function can be used to simplify the algorithm; (3) we investigate
the convergence not only of payoffs but also of policies produced by the PIA; (4) we work with
Markov policies that are defined for every x ∈ Rd , not almost everywhere, and obtain a locally
uniform convergence of a subsequence to the optimal policy.
The last point in this list is particularly important in our setting, as our aim is to design and analyse an algorithm that can, at least in principle, be used to construct an optimal policy.
This requirement forces us to work in the context of classical solutions of PDEs, rather than
Date: November 7, 2018.
2010 Mathematics Subject Classification. 93E20, 49L99, 60H30.
Key words and phrases. Policy iteration algorithm, policy improvement, optimal stochastic control, controlled
diffusion processes, discounted infinite horizon problem, mirror coupling.
SD supported by the EPSRC grant EP/P00377X/1; AM supported by the EPSRC grant EP/P003818/1 and
the Programme on Data-Centric Engineering funded by Lloyd’s Register Foundation; DŠ supported by the Slovene
Human Resources Development and Scholarship Fund.
relying on generalised solutions of the Poisson equation in appropriate Sobolev spaces. The latter approach, followed in [1], is based on the fact that there exists a canonical solution to the Poisson equation in W^{2,p}_loc(Rd), see [2]. The analysis of the PIA can then be performed using Krylov’s extension [15] of Itô’s formula to functions in the Sobolev space W^{2,p}_loc(Rd).
Our method for solving the Poisson equation in the classical sense is based on the coupling
of Lindvall and Rogers [18]. This coupling plays a crucial role in the proof of Proposition 1,
guaranteeing that a payoff function for a locally Lipschitz Markov policy is the classical solution
of the corresponding Poisson equation. The main technical contribution of the paper is the
result in Lemma 7, which shows that the mirror coupling from [18] is successful with very high
probability, when the diffusion processes are started sufficiently close together. Interestingly, the
condition in [18] for the coupling to be successful is not satisfied in our setting in general. The
proof of Lemma 7 is based on a local path-wise comparison of a time-change of the distance
between the coupled diffusions and an appropriately chosen Bessel process.
The convergence of the policies and payoffs in the PIA is obtained in several steps. First
we show that the PIA always improves the payoff. Then we prove, using a “diagonalisation”
argument and an Arzela-Ascoli type result, that a subsequence of the policies produced by the
PIA converges locally uniformly to a locally Lipschitz limiting policy. The final stage of the
argument shows that this limiting policy is indeed an optimal policy with payoff equal to the
pointwise limit of the payoffs produced by the PIA. These steps are detailed in Section 2 and
proved in Section 5.
The literature on the PIA for Markov decision processes in various settings is extensive (see
e.g. [7], [11], [12], [17], [20], [16], [19], [22], [23] and [24] and the references therein). Our
approach is in some sense closest to the continuous time analysis in general state spaces in [7],
where the convergence of the subsequence of the policies is established in the case of finite action
space. In [24] this restriction is removed, but the controlled processes considered do not include
diffusions. A recent application of the PIA to impulse control in continuous time is given in [3].
Finally, as mentioned above, we observe that the PIA can be generalised by multiplying the
expression to be minimised by an arbitrary positive scaling function that can depend both on the
state and the control action (see (gPIA) below). A choice of the scaling function clearly influences
the sequence of policies produced by the algorithm. In particular, in the one-dimensional case, the
scaling function can be used to eliminate the second derivative of the payoff from the algorithm.
This idea is described in Section 3. A numerical example of the PIA is reported in Section 4. In this example at least, the convergence of the PIA is very fast, as the algorithm finds an optimal policy in fewer than six iterations.
2. Multidimensional controlled diffusion processes
Let (A, dA ) be a compact metric space of available control actions and, for some d, m ∈ N,
let σ : Rd × A → Rd×m and µ : Rd × A → Rd be measurable functions. Let the set A(x) of
admissible policies at x ∈ Rd consist of processes Π = (Π_t)_{t≥0} satisfying the following: Π is A-valued, adapted to a filtration (F_t)_{t≥0} and there exists an (F_t)-adapted process X^{Π,x} = (X_t^{Π,x})_{t≥0} satisfying the SDE
\[ X_t^{\Pi,x} = x + \int_0^t \sigma\bigl(X_s^{\Pi,x},\Pi_s\bigr)\,dB_s + \int_0^t \mu\bigl(X_s^{\Pi,x},\Pi_s\bigr)\,ds, \qquad t \ge 0, \tag{1} \]
where B = (Bt )t≥0 is an m-dimensional (Ft )-Brownian motion. Note that the filtration (Ft )t≥0
(and indeed the entire filtered probability space) depends on the policy Π in A(x). Pick measurable functions α : Rd × A → R and f : Rd × A → R. For any x ∈ Rd and Π ∈ A(x), define
the payoff of the policy Π by
\[ V_\Pi(x) := \mathbb{E}\Bigl[ \int_0^\infty e^{-\int_0^t \alpha(X_s^{\Pi,x},\Pi_s)\,ds}\, f\bigl(X_t^{\Pi,x},\Pi_t\bigr)\,dt \Bigr]. \]
Control problem. Construct the value function V , defined by
\[ V(x) := \inf_{\Pi \in A(x)} V_\Pi(x), \qquad x \in \mathbb{R}^d, \]
and, if it exists, an optimal control {Πx ∈ A(x) : x ∈ Rd }, satisfying V (x) = VΠx (x).
Note first that the problem is specified completely by the deterministic data σ, µ, α and f .
In order to define an algorithm that solves this problem, the functions σ, µ, α, f are assumed
to satisfy Assumption 1 below throughout this section.
Assumption 1. The functions σ, µ, α and f are bounded, and Lipschitz on compacts in Rd × A,
i.e. for every compact set K ⊆ Rd × A there exists a constant C_K > 0 such that
\[ \| h(x,p) - h(y,r) \| \le C_K \bigl( \|x-y\|^2 + d_A(p,r)^2 \bigr)^{1/2} \tag{2} \]
holds for every (x,p), (y,r) ∈ K and h ∈ {σ, µ, α, f}. In addition, α(x,p) > ε_0 > 0 for all (x,p) ∈ Rd × A, and there exists λ > 0 such that
\[ \langle \sigma(x,p)\sigma(x,p)^T v, v \rangle \ge \lambda \|v\|^2 \quad \text{for all } x \in \mathbb{R}^d,\ p \in A,\ v \in \mathbb{R}^d. \tag{3} \]
Remark 1. Here, and throughout the paper, ‖·‖ and ⟨·,·⟩ denote the Euclidean norm and inner product respectively. The norm ‖M‖ = sup{‖Mv‖/‖v‖ : v ∈ R^m \ {0}} = √(λ_max(M M^T)), for a matrix M ∈ R^{d×m}, is used in (2) for h = σ, where λ_max(M M^T) is the largest eigenvalue of the non-negative definite matrix M M^T ∈ R^{d×d} and M^T ∈ R^{m×d} denotes the transpose of M.
Remark 2. The uniform ellipticity condition in (3) is the multidimensional analogue of the
volatility being bounded away from 0. Hence, for all x ∈ Rd and p ∈ A, the smallest eigenvalue
of σ(x, p)σ(x, p)T ∈ Rd×d is at least of size λ and, in particular, m ≥ d (cf. Remark 3 below).
A measurable function π : Rd → A is a Markov policy (or synonymously Markov control ) if
for x ∈ Rd there exists an Rd -valued process X π,x = (Xtπ,x )t≥0 that satisfies the following SDE:
\[ X_t^{\pi,x} = x + \int_0^t \sigma\bigl(X_s^{\pi,x}, \pi(X_s^{\pi,x})\bigr)\,dB_s + \int_0^t \mu\bigl(X_s^{\pi,x}, \pi(X_s^{\pi,x})\bigr)\,ds, \qquad t \ge 0. \tag{4} \]
Let (F_t)_{t≥0} be a filtration with respect to which (X^{π,x}, B) is (F_t)-adapted and B is an (F_t)-Brownian motion. Such a filtration (F_t)_{t≥0} exists by the definition of a solution of SDE (4), see e.g. [14, Def. 5.3.1, p. 300]. Then (F_t)_{t≥0} can be taken to be the filtration in the definition of
the policy π(X π,x ) ∈ A(x). Moreover, without loss of generality, we may assume that (Ft )t≥0
satisfies the usual conditions.
For any function π : Rd → A that is Lipschitz on compacts (i.e. locally Lipschitz) in Rd ,
Assumption 1 implies (see e.g. [5, p. 45] and the references therein) that the SDE in (4) has a
unique, strong, non-exploding solution, thus making π a Markov policy. The payoff function of a locally Lipschitz Markov policy is a classical solution of a linear PDE, a fact that is key for (gPIA) to work. To state this formally, recall that h : Rd → Rk (for any k ∈ N) is (1/2)-Hölder continuous
on a compact D ⊂ Rd if ‖h(x′) − h(x′′)‖ ≤ K_D ‖x′ − x′′‖^{1/2} holds for some constant K_D > 0 and all x′, x′′ ∈ D. Streamline the notation for Markov policies as follows:
\[ V_\pi(\cdot) := V_{\pi(X^{\pi,\cdot})}(\cdot) \quad\text{and}\quad L_\pi h := \tfrac12 \mathrm{Tr}\bigl(\sigma_\pi^T H_h \sigma_\pi\bigr) + \mu_\pi^T \nabla h \ \ \text{for } h \in C^2(\mathbb{R}^d), \tag{5} \]
\[ \sigma_\pi(\cdot) := \sigma(\cdot,\pi(\cdot)), \quad \mu_\pi(\cdot) := \mu(\cdot,\pi(\cdot)), \quad \alpha_\pi(\cdot) := \alpha(\cdot,\pi(\cdot)), \quad f_\pi(\cdot) := f(\cdot,\pi(\cdot)), \]
where Hh and ∇h are the Hessian and gradient of h, respectively, and Tr(M ) denotes the trace
of any matrix M ∈ Rm×m .
Proposition 1. Let Assumption 1 hold. For a locally Lipschitz Markov policy π we have:
V_π ∈ C²(Rd) is the unique bounded solution of the Poisson equation L_π V_π − α_π V_π + f_π = 0 and the Hessian H_{V_π} is (1/2)-Hölder on compacts in Rd.
Remark 3. The proof of Proposition 1, given in Section 5.2, depends crucially on the coupling
in Lindvall and Rogers [18] of two d-dimensional diffusions via a coupling of the corresponding
d-dimensional driving Brownian motions (see Lemma 7 below). Moreover, it is easy to see that
the probability of coupling could in general be zero if the dimension m of the Brownian noise is
strictly less than d. In this case the controlled diffusions in Rd , started at distinct points, could
remain forever on disjoint submanifolds of positive co-dimension in Rd for any choice of control.
In particular, Proposition 1 fails in the case m < d, as demonstrated by Example 1 below.
Example 1. Let m = 1, d = 2, A = [−1, 1], f (x1 , x2 , a) := |x1 + x2 | + |x1 − x2 | + a2 /2,
α(x1 , x2 , a) ≡ 1, σ(x1 , x2 , a) ≡ (1, 1)T and µ(x1 , x2 , a) = a(1, −1)T . Then, for a constant policy
πa ≡ a ∈ A, the controlled process started at x ∈ R2 is given by Xtπ,x = x+(1, 1)T Bt +a(1, −1)T t,
for t ≥ 0. In particular, for a = 0 and any x = (x1 , x2 )T , we have Vπ0 (x) = |x1 − x2 | + g(x1 + x2 ),
where g(y) := E|y + 2Be1 |, y ∈ R, and e1 is an exponential random variable with mean 1,
independent of B. The function g is twice continuously differentiable (B_{e_1} has the explicit Laplace density z ↦ e^{−√2|z|}/√2), while the term |x_1 − x_2| is not differentiable on the diagonal {x_1 = x_2}; hence V_{π_0} is not in C²(R²) and cannot satisfy the conclusion of Proposition 1.
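The function g in Example 1 can be checked by Monte Carlo. Under our derivation (not stated in the paper), B_{e_1} is Laplace distributed with scale 1/√2, so Z := 2B_{e_1} is Laplace with scale √2 and g(y) = E|y + Z| = |y| + √2 e^{−|y|/√2}; the sample sizes and seed below are illustrative choices:

```python
import numpy as np

# Monte Carlo check of g(y) := E|y + 2 B_{e1}| from Example 1, against the
# closed form g(y) = |y| + sqrt(2) exp(-|y|/sqrt(2))  (our derivation).
rng = np.random.default_rng(1)
n = 400_000
e1 = rng.exponential(1.0, size=n)               # exponential time e1, mean 1
z = 2.0 * np.sqrt(e1) * rng.standard_normal(n)  # 2 * B_{e1}: Brownian motion at time e1

for y in (0.0, 1.0, -2.5):
    mc = np.abs(y + z).mean()
    closed = abs(y) + np.sqrt(2.0) * np.exp(-abs(y) / np.sqrt(2.0))
    assert abs(mc - closed) < 0.02
```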
Remark 4. As we are using the standard weak formulation of the control problem (i.e. the filtered
probability space is not specified in advance), all that matters for a Markov policy π is the law
of the controlled process X π,· , which solves the martingale problem in (4). Moreover, this law is
uniquely determined by the symmetric-matrix valued function (x, a) 7→ σ(x, a)σ(x, a)T ∈ Rd×d
(and of course the drift (x, a) 7→ µ(x, a)). Since the symmetric square root of σ(x, a)σ(x, a)T in
Rd×d satisfies Assumption 1, in the remainder of the paper we assume, without loss of generality,
that the noise and the controlled process are of the same dimension, i.e. d = m (cf. Remark 2).
Remark 5. For any locally Lipschitz Markov policy π, the process X π,· is strong Markov.
Hence [8, Thm. 1.7] implies that Vπ is in the domain D(Aπ ) of the generator Aπ of X π,· and
that the Poisson equation Aπ Vπ − απ Vπ + fπ = 0 holds. Recall that for a bounded continuous
g : Rd → R in D(Aπ ) the limit (Aπ g)(x) := limt→0 (Eg(Xtπ,x ) − g(x))/t exists for all x ∈ Rd .
Furthermore, if g is also in C 2 (Rd ), it is known that Aπ g = Lπ g. However, [8, Thm. 1.7] does not
imply that Vπ is in C 2 (Rd ). The PDE in Proposition 1, key for (gPIA) to work, is established
via the coupling argument in Section 5.2.
If a policy π : Rd → A is constant (i.e. π ≡ p ∈ A), write σp , µp , αp , fp , Lp and Vp
instead of σπ , µπ , απ , fπ , Lπ and Vπ , respectively. Let S : Rd × A → (0, ∞) be a continuous
function and, for any p ∈ A, denote Sp (x) := S(x, p). Under Assumption 1, the function
COUPLING AND A GENERALISED POLICY ITERATION ALGORITHM IN CONTINUOUS TIME
5
p ↦ S_p(x)(L_p h(x) − α_p(x)h(x) + f_p(x)), p ∈ A, is continuous for any x ∈ Rd and h ∈ C²(Rd). Since A is compact, there exists I_h(x) ∈ A which minimises this function.
Assumption 2. For any h ∈ C²(Rd), the function I_h : Rd → A can be chosen to be locally Lipschitz on Rd. The continuous scaling function S satisfies ε_S < S < M_S for some ε_S, M_S ∈ (0, ∞).
Generalised Policy Iteration Algorithm
Input: σ, µ, α, f , S satisfying Assumptions 1–2, constant policy π0 and N ∈ N.
for n = 0 to N − 1 do
Compute Vπn from the PDE in Proposition 1. Choose the policy πn+1 as follows:
\[ \pi_{n+1}(x) \in \operatorname*{argmin}_{p \in A} \bigl\{ S_p(x)\bigl( L_p V_{\pi_n}(x) - \alpha_p(x) V_{\pi_n}(x) + f_p(x) \bigr) \bigr\}, \qquad x \in \mathbb{R}^d. \tag{gPIA} \]
end
Output: Approximation VπN of the value function V .
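In a finite-state, discrete-time analogue, the loop above takes the following form. This is a minimal sketch of Howard-style policy iteration, not the paper's continuous algorithm: the function and variable names, the discount factor (standing in for the killing rate α), the finite action set and the choice S ≡ 1 are all our illustrative assumptions:

```python
import numpy as np

def policy_iteration(P, f, gamma, n_iter=1000):
    """Howard's PIA on a finite-state, finite-action analogue of (gPIA).

    P: array (|A|, n, n) of transition matrices, f: array (n, |A|) of
    running costs, gamma: discount factor in (0, 1).  Returns (V, policy).
    """
    n_actions, n, _ = P.shape
    policy = np.zeros(n, dtype=int)              # arbitrary initial policy pi_0
    for _ in range(n_iter):
        # Policy evaluation: solve (I - gamma P_pi) V = f_pi
        # (the discrete analogue of the Poisson equation in Proposition 1).
        P_pi = P[policy, np.arange(n), :]
        f_pi = f[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, f_pi)
        # Policy improvement: pointwise argmin, as in (gPIA) with S = 1.
        Q = f.T + gamma * (P @ V)                # shape (|A|, n)
        new_policy = Q.argmin(axis=0)
        if np.array_equal(new_policy, policy):   # algorithm has stopped (cf. Remark 8)
            break
        policy = new_policy
    return V, policy

# A small random instance: 3 actions, 6 states, discount 0.9.
rng = np.random.default_rng(2)
P = rng.random((3, 6, 6)); P /= P.sum(axis=2, keepdims=True)
f = rng.random((6, 3))
V, pi_star = policy_iteration(P, f, 0.9)
```

At termination the evaluated payoff satisfies the pointwise Bellman minimum, mirroring the verification step of Theorem 5.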
Remark 6. By Assumption 2, the policy πn+1 defined in (gPIA) is locally Lipschitz for all
0 ≤ n ≤ N − 1. Hence, by Proposition 1, the algorithm is well defined.
Remark 7. In the classical case of (gPIA) we take S ≡ 1. A non-trivial scaling function S,
which plays an important role in the one-dimensional context (see Section 3 below), makes the
algorithm into a generalised Policy Iteration Algorithm. Theorem 2 requires only the positivity and continuity of S. The uniform bounds on S in Assumption 2 are used only in the proof of Theorem 5.
(gPIA) always leads to an improved payoff (Theorem 2 is proved in Section 5.2 below):
Theorem 2. Under Assumptions 1–2, the inequality Vπn+1 ≤ Vπn holds on Rd for all n ∈
{0, . . . , N − 1}.
The sequence {VπN }N ∈N , obtained by running (gPIA) from a given policy π0 for each N ∈ N,
is non-increasing and bounded. Hence we may define
\[ V_{\lim}(x) := \lim_{N \to \infty} V_{\pi_N}(x), \qquad x \in \mathbb{R}^d. \tag{6} \]
However, the sequence of the corresponding Markov policies {πN }N ∈N need not converge and,
even if it did, the limit may be discontinuous and hence not necessarily a Markov policy.
Remark 8. If the algorithm stops before N , i.e. πn+1 = πn for some n < N , then clearly
VπN = Vπn and πN = πn . As this holds for any N > n, with (gPIA) started at a given π0 , we
may proceed directly to the verification lemma (Theorem 5 below) to conclude that Vπn is the
value function with an optimal policy πn .
Controlling the convergence of the policies requires the following additional assumptions.
Introduce the set S_{B,K} := {h ∈ C²(Rd) : ‖∇h(x)‖ < B¹, ‖H_h(x)‖ < B² for x ∈ D_K}, where D_K := {x ∈ Rd : ‖x‖ ≤ K} is the ball of radius K > 0 and B := (B¹, B²) ∈ (0, ∞)² are constants.
Assumption 3. For any K > 0, there exist constants B_K and C_K satisfying the following: if h ∈ S_{B_K,K}, then I_h (defined in Assumption 2) satisfies d_A(I_h(x), I_h(y)) ≤ C_K ‖x − y‖ for all x, y ∈ D_K.
Assumption 4. For any K > 0, let B_K, C_K be as in Assumption 3. Then for a locally Lipschitz Markov policy π : Rd → A such that d_A(π(x), π(y)) ≤ C_K ‖x − y‖, x, y ∈ D_K, the following holds: V_π ∈ S_{B_K,K}.
Remark 9. Non-trivial problem data that satisfy Assumptions 1–4 are described in Section 4
below. It is precisely these types of examples that motivated the form Assumptions 2–4 take.
Assumption 1 is standard and Assumptions 2–3 concern only the deterministic data specifying
the problem. Assumption 4 essentially states that kHVπ k has a prescribed bound on the ball DK
if the coefficients of the PDE in Proposition 1 have a prescribed Lipschitz constant. Schauder’s
boundary estimates for elliptic PDEs [10, p. 86] suggest that this requirement is both natural
and feasible. In fact, Assumption 4 may follow from assumptions of the type 1–3 on the problem
data. This is left for future research.
Proposition 3 and Theorems 4 and 5, proved in Section 5.2 below, show that (gPIA) converges.
Proposition 3. Let Assumptions 1–4 hold. Then there exists a subsequence of {πN }N ∈N that
converges uniformly on every compact subset of Rd to a locally Lipschitz Markov policy.
Let πlim : Rd → A denote a locally Lipschitz Markov policy that is a locally uniform limit of
a subsequence of {π_N}_{N∈N}. By Proposition 1, V_{π_lim} is a well-defined function in C²(Rd) that
solves the corresponding Poisson equation. However, since π_lim clearly depends on its defining subsequence, so may V_{π_lim}. Furthermore, V_lim may a priori depend on the choice of π_0 in (gPIA). In fact, neither dependence occurs, since V_lim equals both the value function V and the payoff of the policy π_lim.
Theorem 4. Under Assumptions 1–4, the equality Vlim = Vπlim holds on Rd for a policy πlim .
Theorem 5. Let Assumptions 1–4 hold. Then for every x ∈ Rd and Π ∈ A(x) the inequality
Vlim (x) ≤ VΠ (x) holds. Hence Vlim equals the value function V , does not depend on the choice of
π0 in (gPIA) and πlim is an optimal locally Lipschitz Markov policy for the control problem.
Remark 10. The key technical issue in the proof of Theorem 5 is that consecutive policies in the convergent subsequence constructed in the proof of Proposition 3 are not necessarily improvements of their predecessors in that subsequence (cf. (gPIA)). The idea of the proof is to work with a convergent subsequence of the
pairs of policies {(πN , πN +1 )}N ∈N , where {πN }N ∈N is produced by the (gPIA) (see Section 5.2
for details).
3. The one-dimensional case
There are two reasons for considering the one-dimensional control problem in its own right.
(A) The canonical choice for the scaling function S := 1/σ 2 simplifies the (gPIA) to
\[ \pi_{n+1}(x) \in \operatorname*{argmin}_{p \in A} \bigl\{ \bigl( \mu(x,p) V_{\pi_n}'(x) - \alpha(x,p) V_{\pi_n}(x) + f(x,p) \bigr) / \sigma^2(x,p) \bigr\}, \tag{7} \]
by removing the second derivative of the payoff function Vπn from the minimisation procedure.
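The cancellation behind (7) can be made explicit. With S := 1/σ² and the one-dimensional generator from Section 3, a sketch of the computation:
\[
S_p(x)\bigl(L_p V(x) - \alpha_p(x) V(x) + f_p(x)\bigr)
= \frac{1}{2} V''(x) + \frac{\mu(x,p)\,V'(x) - \alpha(x,p)\,V(x) + f(x,p)}{\sigma^2(x,p)}.
\]
Since the term ½V''(x) does not depend on the control p, it drops out of the pointwise minimisation, leaving exactly the expression in (7).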
This reduction appears to make the numerical implementation of the (gPIA) converge extremely
fast: in the example in Section 4.2 below the optimal payoff and policy are obtained in fewer
than half a dozen iterations.
(B) It is natural to control the process X Π,x only up to its first exit from an interval (a, b), where
a, b ∈ [−∞, ∞], and generalise the payoff as follows:
\[ V_\Pi(x) := \mathbb{E}\Bigl[ \int_0^{\tau_a^b(X^{\Pi,x})} e^{-\int_0^t \alpha_{\Pi_s}(X_s^{\Pi,x})\,ds}\, f_{\Pi_t}\bigl(X_t^{\Pi,x}\bigr)\,dt + e^{-\int_0^{\tau_a^b(X^{\Pi,x})} \alpha_{\Pi_t}(X_t^{\Pi,x})\,dt}\, g\bigl(X_{\tau_a^b(X^{\Pi,x})}^{\Pi,x}\bigr) \Bigr]. \]
Here µ, σ, α, f : (a, b)×A → R are measurable with the same notational convention as in Section 2
(fp (x) = f (x, p) etc.). Furthermore, Π ∈ A(x) if X Π,x follows SDE (1) on the stochastic interval
[0, τ_a^b(X^{Π,x})), where τ_a^b(X^{Π,x}) := inf{t ≥ 0 : X_t^{Π,x} ∈ {a, b}} (inf ∅ = ∞), and X_t^{Π,x} = X_{τ_a^b(X^{Π,x})}^{Π,x} for τ_a^b(X^{Π,x}) ≤ t (i.e. Π_t, t ∈ [τ_a^b(X^{Π,x}), ∞), are irrelevant for V_Π(x)). Pick an arbitrary function
g : {a, b} ∩ R → R and set the control problem as in Section 2 with Rd substituted by (a, b).
Remark 11. In Assumptions 1–4 we substitute Rd with (a, b). In particular, inequality (3) in
Assumption 1 takes the form σ 2 (x, p) ≥ λ for all x ∈ (a, b), p ∈ A. Assumption 1 hence implies
the requirement on the scaling function S = 1/σ 2 in Assumption 2. In Assumptions 3–4, the
family of closed balls (DK )K>0 in Rd is substituted with a family of compact intervals (DK )K>0
in (a, b), such that ∪K>0 DK = (a, b) and, if K ′ < K, DK ′ is contained in the interior of DK .
Remark 12. On the event {τ_a^b(X^{Π,x}) = ∞} we take
\[ e^{-\int_0^{\tau_a^b(X^{\Pi,x})} \alpha_{\Pi_t}(X_t^{\Pi,x})\,dt}\, g\bigl(X_{\tau_a^b(X^{\Pi,x})}^{\Pi,x}\bigr) = 0, \]
since by Assumption 1 the integral is infinite. Note also that on this event X_{τ_a^b(X^{Π,x})}^{Π,x} may not be defined. Moreover, if {a, b} ∩ R = ∅, then by Assumption 1 we have τ_a^b(X^{Π,x}) = ∞ a.s.
A measurable function π : (a, b) → A is a Markov policy if for every x ∈ (a, b) there exists a process X^{π,x} = (X_t^{π,x})_{t≥0} satisfying SDE (4) on the stochastic interval [0, τ_a^b(X^{π,x})) and X_t^{π,x} = X_{τ_a^b(X^{π,x})}^{π,x} if τ_a^b(X^{π,x}) ≤ t < ∞. If π is a Markov policy, then π(X^{π,x}) := (π(X_t^{π,x}))_{t≥0} ∈ A(x) for every x ∈ (a, b), where we pick arbitrary elements in A for the values π(a) and π(b) if a > −∞ and b < ∞, respectively. We use notation analogous to that in (5), e.g. L_π h := ½ σ_π² h″ + µ_π h′ for any h ∈ C²((a, b)).
Any locally Lipschitz function π : (a, b) → A is a Markov policy, since the Engelbert-Schmidt
conditions for the existence and uniqueness of the solution of the corresponding SDE (see e.g. [14,
Sec. 5.5]) are satisfied by Assumption 1. By substituting the state space Rd with the interval
(a, b) in Proposition 1, Theorem 2, Proposition 3 and Theorems 4 and 5 of Section 2, we obtain
the results of the present section, which thus solve the control problem in the one-dimensional
case. In the interest of brevity, we omit their statement. We stress that the main difference lies
in the fact that the proofs, in Sections 5.3 and 5.4 below, rely on the theory of ODEs and scalar
SDEs. In particular, we need to prove that the payoff has a continuous extension to a finite
boundary point of the state space.
4. Examples
4.1. Data satisfying Assumptions 1–4. We now describe a class of models that provably
satisfies Assumptions 1–4. The main aim in the present section is not to be exhaustive, but
merely to demonstrate that the form (particularly) of Assumptions 3–4 is natural in the context
of control problems considered here. The example we give is in dimension one, but it is clear from the construction below that it can easily be generalised.
Let A := [−a, a], for some constant a > 0, and σ, µ, f, α : R × [−a, a] → R be given by
\[ \sigma(x,p) := \sigma_1(x), \quad \mu(x,p) := \mu_1(x) + p\mu_2, \quad f(x,p) := f_1(x) + f_2(p), \quad \alpha(x,p) \equiv \alpha_0, \tag{8} \]
where σ_1, µ_1, f_1 ∈ C¹(R), f_2 ∈ C²((−a, a)) is convex and symmetric (i.e. f_2(p) = f_2(−p) for all p ∈ A) and µ_2 and α_0 are constants. For any h ∈ {σ_1, µ_1, f_1, f_2} (resp. h′ ∈ {σ_1′, µ_1′, f_1′, f_2′}) let the positive constant C_h (resp. C′_h) satisfy |h| ≤ C_h (resp. |h′| ≤ C′_h). In particular, we assume that the derivatives of σ_1, µ_1, f_1, f_2 are bounded. Moreover we may (and do) take C′_{f_2} := f_2′(a). Assume also that α_0 > 0 and σ_1² > λ > 0 (so that Assumption 1 is satisfied) and the scaling function S ≡ 1. Then the following proposition holds.
Proposition 6. Assume α_0 > C′_{µ_1} + |µ_2|(2 + C′_{f_1}/C′_{f_2}) and |µ_2| B²_∞ < L_{f_2} := inf_{p∈A} f_2″(p), where
\[ B^1_\infty := \frac{C'_{f_1} + C'_{f_2}}{\alpha_0 - C'_{\mu_1} - |\mu_2|}, \qquad B^2_\infty := \bigl( 2(C_{f_1} + C_{f_2}) + (C_{\mu_1} + a|\mu_2|)\, B^1_\infty \bigr)/\lambda; \]
then Assumptions 1–4 hold. Moreover, in Assumptions 3–4 we have B_K = (B^1_∞, B^2_∞) and C_K = 1 for any K > 0.
Remark 13. It is clear that the assumptions in Proposition 6 define a non-empty subclass of
models (8). Moreover, these assumptions are much stronger than what is required by our general
Assumptions 3–4 since the proposition yields global (rather than local) bounds on the derivatives
of the payoff functions and the Lipschitz coefficients of the policies arising in (gPIA).
Proof. Pick h ∈ C²(R) such that |h′(x)| < B¹_∞ and |h″(x)| < B²_∞ for all x ∈ R. Then the function I_h in Assumption 3 satisfies I_h(x) = argmin_{p∈A} {p µ_2 h′(x) + f_2(p)}. By assumption we have
\[ |\mu_2 h'(x)| \le |\mu_2|\, B^1_\infty < C'_{f_2} = f_2'(a), \quad \text{implying} \quad I_h(x) = (f_2')^{-1}\bigl(-\mu_2 h'(x)\bigr) \quad \forall x \in \mathbb{R}. \]
Differentiate I_h to obtain |I_h′(x)| ≤ |µ_2| B²_∞ / L_{f_2} < 1, x ∈ R, and note that Assumptions 2–3 follow.
We now establish Assumption 4. The idea is to start with any policy π : R → A in C²(R), such that its derivative satisfies |π′| ≤ 1 on all of R (e.g. a constant policy), and apply the stochastic flow of diffeomorphisms [21, Sec. V.10] to deduce the necessary regularity of the payoff function V_π. In the notation from (5), we have |µ_π′| ≤ C′_{µ_1} + |µ_2| and |σ_π′| = |σ_1′| ≤ C′_{σ_1}. Hence, for each x ∈ R, the stochastic exponential Y = (Y_t)_{t∈R_+}, given by
\[ Y_t = 1 + \int_0^t \mu_\pi'\bigl(X_s^{\pi,x}\bigr) Y_s\,ds + \int_0^t \sigma_\pi'\bigl(X_s^{\pi,x}\bigr) Y_s\,dW_s, \]
exists (we suppress the dependence on x (and π) from the notation). Since the coefficients of SDE (4) are in C¹(R) with bounded and locally Lipschitz first derivative, [21, Sec. V.10, Thm 49] implies that the flow of controlled processes {X^{π,x}}_{x∈R} may be constructed on a single probability space so that it is smooth in the initial condition x with ∂X^{π,x}/∂x = Y. The upshot here is that, by the argument in the proof of [9, Prop. 3.2], we obtain a stochastic representation for the derivative: (∂/∂x) E f_π(X_t^{π,x}) = E[Y_t f_π′(X_t^{π,x})] for every t ∈ R_+. Since Y_t = M_t exp(∫_0^t µ_π′(X_s^{π,x}) ds), where the stochastic exponential M = E(∫_0^· σ_π′(X_s^{π,x}) dW_s) is a true martingale by Novikov’s condition, the following inequality holds: E Y_t ≤ exp(t(C′_{µ_1} + |µ_2|)). Since α_0 > C′_{µ_1} + |µ_2| by assumption and the inequality |f_π′| ≤ C′_{f_1} + C′_{f_2} holds, we have |E ∫_0^∞ e^{−α_0 s} Y_s f_π′(X_s^{π,x}) ds| < B¹_∞ for all x ∈ R.
Recall that V_π(x) = E ∫_0^∞ e^{−α_0 s} f_π(X_s^{π,x}) ds. By [21, Sec. V.8, Thm 43], the family of random variables indexed by δ ∈ (0, 1),
\[ \frac{1}{\delta} \int_0^\infty e^{-\alpha_0 s} \bigl| f_\pi(X_s^{\pi,x+\delta}) - f_\pi(X_s^{\pi,x}) \bigr|\,ds \le (C'_{f_1} + C'_{f_2})\, \frac{1}{\delta} \int_0^\infty e^{-\alpha_0 s} \bigl| X_s^{\pi,x+\delta} - X_s^{\pi,x} \bigr|\,ds, \]
is uniformly integrable. Hence lim_{δ→0}(V_π(x + δ) − V_π(x))/δ takes the form
\[ V_\pi'(x) = \mathbb{E} \int_0^\infty e^{-\alpha_0 s}\, Y_s\, f_\pi'\bigl(X_s^{\pi,x}\bigr)\,ds, \quad \text{implying} \quad |V_\pi'(x)| < B^1_\infty \ \text{for all } x \in \mathbb{R}. \]
This inequality, the fact σ_1² > λ and Proposition 1 imply |V_π″| < B²_∞, concluding the proof.
Remark 14. The process Y in the proof of Proposition 6 exists in the multidimensional setting,
see [21, Sec. V.10, Thm 49]. Hence the same argument works in higher dimensions if we can
deduce a bound on the Hessian of the payoff function from the PDE in Proposition 1.
4.2. Numerical examples. Consider the one-dimensional control problem: A = [−1, 1], a = −10, b = 10, g(a) := a², g(b) := b², σ(x, p) := 1, µ(x, p) := p, α(x, p) := 1 and f(x, p) := x² + p²,
which is in the class discussed in Section 4.1. Explicitly, we seek to compute inf Π∈A(x) VΠ (x) for
every x ∈ (−10, 10), where the payoff VΠ (x) of a policy Π is defined in Section 3.
We implemented (gPIA), with the main step given by (7), in Matlab. The payoff function
at each step is obtained as the solution to the differential equation from Proposition 1 with the
boundary conditions given by the function g. The new policy at each step can be calculated
explicitly (cf. the proof of Proposition 6 above). Figures 1 and 2 graph the payoff functions
and the policies (colour coded). The initial policy π0 ≡ 1 and its payoff correspond to the blue
graphs.
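A minimal finite-difference version of this computation can be sketched as follows. The grid size, iteration count and tolerance are our discretisation choices, not from the paper (which used Matlab); the policy update uses the explicit minimiser of (7) for this data, p* = clip(−V′/2, −1, 1):

```python
import numpy as np

# (gPIA) for the example of Section 4.2: sigma = 1, mu(x,p) = p, alpha = 1,
# f(x,p) = x^2 + p^2, A = [-1,1], boundary data g(x) = x^2 on (-10, 10).
a, b, n = -10.0, 10.0, 401
x = np.linspace(a, b, n)
h = x[1] - x[0]

def evaluate(policy):
    """Solve 0.5 V'' + policy V' - V + x^2 + policy^2 = 0, V(a)=a^2, V(b)=b^2,
    by central differences on the interior nodes."""
    lower = 0.5 / h**2 - policy[1:-1] / (2 * h)   # coefficient of V_{i-1}
    diag = -1.0 / h**2 - 1.0                      # coefficient of V_i
    upper = 0.5 / h**2 + policy[1:-1] / (2 * h)   # coefficient of V_{i+1}
    A_mat = np.zeros((n - 2, n - 2))
    idx = np.arange(n - 2)
    A_mat[idx, idx] = diag
    A_mat[idx[:-1], idx[:-1] + 1] = upper[:-1]
    A_mat[idx[1:], idx[1:] - 1] = lower[1:]
    rhs = -(x[1:-1] ** 2 + policy[1:-1] ** 2)
    rhs[0] -= lower[0] * a**2                     # fold in boundary values
    rhs[-1] -= upper[-1] * b**2
    V = np.empty(n)
    V[0], V[-1] = a**2, b**2
    V[1:-1] = np.linalg.solve(A_mat, rhs)
    return V

policy = np.ones(n)                               # pi_0 = 1
for _ in range(10):
    V = evaluate(policy)
    Vp = np.gradient(V, h)
    # Step (7): argmin over p in [-1,1] of p V' + p^2 is clip(-V'/2, -1, 1).
    new_policy = np.clip(-Vp / 2.0, -1.0, 1.0)
    if np.max(np.abs(new_policy - policy)) < 1e-8:
        break
    policy = new_policy
```

On the discrete level the improvement property of Theorem 2 is preserved because the tridiagonal matrix above is an M-matrix for this grid spacing, so each iterate's payoff lies below its predecessor's.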
[Figure 1. The graphs of V_{π_n} for n ∈ {0, 1, 2, 3, 4}.]
[Figure 2. The graphs of π_n for n ∈ {0, 1, 2, 3, 4}.]
The graphs suggest that convergence effectively occurs in just a few steps. Figures 3 and 4,
containing the graphs of the differences of the consecutive payoffs and policies on the logarithmic
scale, confirm this. In Figures 1 and 2 fewer graphs appear than the captions state; the reason is that the final few graphs coincide. Moreover, the policies only differ on the subinterval (−2, 2): outside of it they coincide, as it is optimal to choose one of the boundary points of A = [−1, 1]. Finally, there is no numerical indication that the sequence of policies has more than one accumulation point, as the policies appear to converge very fast indeed.
5. Proofs
5.1. Auxiliary results - the multidimensional case.
5.1.1. The reflection coupling of Lindvall and Rogers [18] and the continuity of the payoff Vπ .
We now establish the continuity of the payoff function for a locally Lipschitz Markov policy π
under Assumption 1. The reflection coupling of Lindvall and Rogers [18] plays a crucial role
in this. In fact, the continuity of Vπ hinges on the following property of the coupling in [18]:
copies of X^{π,x} and X^{π,x′}, started very close to each other, will meet with high probability before moving apart by a certain distance greater than ‖x − x′‖ (see Lemma 7 below).
[Figure 3. The graphs of log(|V_{π_{n+1}} − V_{π_n}|) for n ∈ {0, 1, 2, 3, 4}.]
[Figure 4. The graphs of log(|π_{n+1} − π_n|) for n ∈ {0, 1, 2, 3, 4}.]
We first show that the coupling from [18] can be applied to the diffusion X π,· . As explained in
Remark 4, we may (and do) assume that the dimension of the noise and the controlled process
are equal, i.e. d = m. By Assumption 1 above, σ_π and µ_π are bounded and hence [18, As. (12)(ii)] holds. Inequality (3) in Assumption 1 implies that λ_max(σ_π^{−1} σ_π^{−T}) ≤ 1/λ. Hence, by Remark 1, we have ‖σ_π^{−1}‖ ≤ 1/√λ and [18, As. (12)(iii)] also holds. The assumption in [18, (12)(i)] requires that σ_π and µ_π be globally Lipschitz. But this assumption is only used in [18] as a guarantee that the corresponding SDE has a unique strong solution, which is the case in our setting under the locally Lipschitz condition in Assumption 1. Hence, for any x, x′ ∈ Rd, the coupling from [18]
can be applied to construct the process (X^{π,x}, X^{π,x′}) so that X^{π,x} follows SDE (4) and X^{π,x′} satisfies
\[ X_t^{\pi,x'} = x' + \int_0^t \mu_\pi\bigl(X_s^{\pi,x'}\bigr)\,ds + \int_0^t \sigma_\pi\bigl(X_s^{\pi,x'}\bigr) H_s\,dB_s, \qquad t \in [0, \rho_0(Y)), \]
where ρ_0(Y) := inf{t ≥ 0 : ‖Y_t‖ = 0} (inf ∅ := ∞) is the coupling time, Y := X^{π,x} − X^{π,x′}, and
\[ H_t := I - 2 u_t u_t^T, \quad \text{defined via} \quad u_t := \frac{\sigma_\pi^{-1}\bigl(X_t^{\pi,x'}\bigr) Y_t}{\bigl\| \sigma_\pi^{-1}\bigl(X_t^{\pi,x'}\bigr) Y_t \bigr\|} \quad \text{for } t \in [0, \rho_0(Y)), \tag{9} \]
is the reflection in Rd about the hyperplane orthogonal to the unit vector u_t. Moreover, we have X_t^{π,x′} = X_t^{π,x} for all t ∈ [ρ_0(Y), ∞). Note also that H_t ∈ O(d) is an orthogonal matrix for t ∈ [0, ρ_0(Y)) and the process B′ = (B′_t)_{t∈R_+}, given by B′_t := ∫_0^t (1_{{s<ρ_0(Y)}} H_s + 1_{{s≥ρ_0(Y)}} I) dB_s, is a Brownian motion by the Lévy characterisation theorem. Hence X^{π,x′} satisfies the SDE dX_t^{π,x′} = σ_π(X_t^{π,x′}) dB′_t + µ_π(X_t^{π,x′}) dt with X_0^{π,x′} = x′ (see [18, Sec. 3] for more details).
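The reflection construction above is easy to simulate. The following Euler sketch takes σ_π = I and µ_π = 0 (our simplifying choices, not the paper's generality), in which case u_t = Y_t/‖Y_t‖ and the reflected increment is H_t dB_t = dB_t − 2u_t(u_t^T dB_t); the grid, threshold, horizon and seed are illustrative assumptions. Empirically, almost all pairs of paths couple before a fixed horizon, in the spirit of Lemma 7:

```python
import numpy as np

# Euler sketch of the mirror (reflection) coupling of two copies of a
# d-dimensional diffusion, here with sigma = I and mu = 0.
rng = np.random.default_rng(3)
d, dt, n_steps, n_paths = 2, 1e-3, 20_000, 1_000
phi0 = 0.5                         # initial distance ||x - x'||
thresh = 2 * np.sqrt(dt)           # numerical coupling threshold ~ one Euler step

X = np.zeros((n_paths, d))
Xp = np.zeros((n_paths, d)); Xp[:, 0] = phi0
coupled = np.zeros(n_paths, dtype=bool)
for _ in range(n_steps):
    Y = X - Xp
    r = np.linalg.norm(Y, axis=1)
    coupled |= r < thresh          # declare coupling once the pair is close enough
    act = ~coupled                 # evolve only not-yet-coupled pairs
    if not act.any():
        break
    dB = rng.standard_normal((act.sum(), d)) * np.sqrt(dt)
    u = Y[act] / r[act, None]
    X[act] += dB                   # first copy: raw Brownian increment
    # second copy: mirrored increment H dB = dB - 2 u (u^T dB), as in (9)
    Xp[act] += dB - 2 * u * (u * dB).sum(axis=1, keepdims=True)
frac = coupled.mean()              # fraction of pairs coupled before the horizon
```

With σ = I the distance ‖Y‖ performs a one-dimensional Brownian motion with volatility 2, so the coupling is in fact successful with probability one; the simulation only observes a finite horizon, hence `frac` is close to, but below, one.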
Lemma 7. Fix a locally Lipschitz Markov policy π : Rd → A and x ∈ Rd. Then for every ε ∈ (0, 1) there exists φ̄ ∈ (0, 1] with the following property: for all φ ∈ (0, φ̄) there exists φ′ ∈ (0, φ) such that P(ρ_φ(Y) < ρ_0(Y)) < ε if ‖x − x′‖ < φ′, where ρ_c(Y) := inf{t ≥ 0 : ‖Y_t‖ = c} (inf ∅ = ∞) for any c > 0.
Remark 15. Note that the main assumption in [18, Thm. 1] is not satisfied in Lemma 7, as we have no assumption on the global variability of $\sigma_\pi$. Hence the coupling $(X^{\pi,x}, X^{\pi,x'})$ is not necessarily successful even if the starting points $x$ and $x'$ are very close to each other, i.e. possibly $P(\rho_0(Y) < \infty) < 1$ even if $\|Y_0\| = \|x - x'\|$ is very close to zero. However, by Lemma 7, the
coupling will occur with probability at least 1 − ǫ before the diffusions are more than ϕ̄ = ϕ̄(ǫ) away from each other, implying the continuity of Vπ (cf. Lemma 8 and Remark 16 below).

Proof. Let $\bar S := \|Y\|^2$, $\delta := \sigma_\pi(X^{\pi,x}) - \sigma_\pi(X^{\pi,x'})$ and $\beta := \mu_\pi(X^{\pi,x}) - \mu_\pi(X^{\pi,x'})$. Define
(10)
$$\alpha_t := \sigma_\pi\big(X_t^{\pi,x}\big) - \sigma_\pi\big(X_t^{\pi,x'}\big)\, H_t \qquad \text{and} \qquad v_t := Y_t/\|Y_t\| \qquad \text{for } t \in [0, \rho_0(Y)).$$
In this proof x ∈ Rd is fixed and x′ ∈ Rd is arbitrary in the ball of radius one centred at x.
Recall that $\nabla h(z) = 2z$ and $Hh(z) = 2I$ for $h(z) := \|z\|^2$, $z \in \mathbb{R}^d$, and apply Itô's lemma to $\bar S$:
(11)
$$\bar S_t = \|x - x'\|^2 + \int_0^t 2\sqrt{\bar S_s}\, v_s^T\alpha_s\,\mathrm{d}B_s + \int_0^t \Big(2\sqrt{\bar S_s}\, v_s^T\beta_s + \mathrm{Tr}\big(\alpha_s\alpha_s^T\big)\Big)\,\mathrm{d}s, \qquad t \in [0, \rho_0(Y)).$$
Our task is to study the behaviour of S̄ when started very close to zero. To do this, we first
establish the facts in (12) and (14) below, which in turn allow us to apply time-change and
coupling techniques to prove the lemma. We start by proving the following:
(12)
$$0 \le \mathrm{Tr}\big(\alpha_t\alpha_t^T\big) - \|v_t^T\alpha_t\|^2 = \mathrm{Tr}\big(\delta_t\delta_t^T\big) - \|v_t^T\delta_t\|^2 \le M_x^2\,\|Y_t\|^2 \qquad \text{for } t \in [0, \rho_0(Y)\wedge\rho_1(Y)),$$
where $M_x > 1$ is a Lipschitz constant for $\sigma_\pi$ and $\mu_\pi$ in the ball around $x$ of radius one. The first inequality in (12) follows since the trace is the sum of the eigenvalues of $\alpha_t\alpha_t^T$, which are all non-negative, while $\|v_t^T\alpha_t\|^2$ is at most the largest eigenvalue. The second inequality follows since $\sigma_\pi$ is Lipschitz on any ball around $x$ and $\|Y_t\| < 1$ for $t < \rho_1(Y)$. To establish the equality in (12) note that, as $\|v_t^T A\|^2 = \mathrm{Tr}(A A^T v_t v_t^T) = \mathrm{Tr}(v_t v_t^T A A^T)$ for any $A \in \mathbb{R}^{d\times d}$, we have
$$\mathrm{Tr}(\alpha_t\alpha_t^T) - \|v_t^T\alpha_t\|^2 - \big(\mathrm{Tr}(\delta_t\delta_t^T) - \|v_t^T\delta_t\|^2\big) = \mathrm{Tr}\big((I - v_t v_t^T)(\alpha_t\alpha_t^T - \delta_t\delta_t^T)\big).$$
Recall that $H_t^{-1} = H_t^T = H_t$. We therefore find
(13)
$$\alpha_t\alpha_t^T - \delta_t\delta_t^T = \sigma_\pi\big(X_t^{\pi,x}\big)(I - H_t)\,\sigma_\pi\big(X_t^{\pi,x'}\big)^T + \sigma_\pi\big(X_t^{\pi,x'}\big)(I - H_t)\,\sigma_\pi\big(X_t^{\pi,x}\big)^T = 2\big(v_t u_t^T \sigma_\pi\big(X_t^{\pi,x}\big)^T + \sigma_\pi\big(X_t^{\pi,x}\big)\,u_t v_t^T\big)\,\|Y_t\|\big/\big\|\sigma_\pi\big(X_t^{\pi,x'}\big)^{-1} Y_t\big\|,$$
where the second equality follows by definition (9) and the identity $v_t = Y_t/\|Y_t\|$. Hence (12) follows.

Since $\mathrm{Tr}\big(v_t u_t^T \sigma_\pi(X_t^{\pi,x})^T\big) = \langle \sigma_\pi(X_t^{\pi,x})\,\sigma_\pi(X_t^{\pi,x'})^{-1} v_t, v_t\rangle\,\|Y_t\|/\|\sigma_\pi(X_t^{\pi,x'})^{-1} Y_t\|$ holds for times $t \in [0, \rho_0(Y))$, equalities (12) and (13) yield:
$$\|v_t^T\alpha_t\|^2 - \|v_t^T\delta_t\|^2 = 4\,\langle \sigma_\pi(X_t^{\pi,x})\,\sigma_\pi(X_t^{\pi,x'})^{-1} v_t, v_t\rangle\,\|Y_t\|^2\big/\big\|\sigma_\pi(X_t^{\pi,x'})^{-1} Y_t\big\|^2.$$
Inequality (3) in Assumption 1 implies $\|\sigma_\pi^{-1}\| \le 1/\sqrt{\lambda}$ (cf. the second paragraph of this section). Hence $\|Y_t\|/\|\sigma_\pi(X_t^{\pi,x'})^{-1} Y_t\| \ge \sqrt{\lambda}$. By the definition of $\delta_t$ above we get
$$\|v_t^T\alpha_t\|^2 \ge \|v_t^T\alpha_t\|^2 - \|v_t^T\delta_t\|^2 \ge 4\lambda\big(1 + \langle \delta_t\,\sigma_\pi(X_t^{\pi,x'})^{-1} v_t, v_t\rangle\big) \ge 4\lambda\big(1 - \|\delta_t\|\,\|\sigma_\pi(X_t^{\pi,x'})^{-1}\|\big) \ge 4\lambda\big(1 - \|\delta_t\|/\sqrt{\lambda}\big).$$
For any $\epsilon \in (0, 1)$, define
$$\bar\varphi := \min\big\{1,\ \epsilon\sqrt{\lambda}/M_x,\ \epsilon(1-\epsilon)\lambda/M_x^2\big\},$$
where $M_x$ is as in (12) above. Then, if $\|x - x'\| < \bar\varphi$ and $t \in [0, \rho_{\bar\varphi}(Y))$, we have $\|Y_t\| < \bar\varphi$ and hence $\|\delta_t\| \le M_x\|Y_t\| \le \epsilon\sqrt{\lambda}$. In particular, we get
(14)
$$\|v_t^T\alpha_t\|^2 \ge 4\lambda(1-\epsilon) > 0 \qquad \text{for any } t \in [0, T), \text{ where } T := \rho_0(Y) \wedge \rho_{\bar\varphi}(Y).$$
Let $M > 0$ denote a global upper bound on $\sigma_\pi$ and $\mu_\pi$, which exists by Assumption 1. Since the inequalities $\|v_t^T\alpha_t\| \le \|\alpha_t\| \le \|\sigma_\pi(X_t^{\pi,x})\| + \|\sigma_\pi(X_t^{\pi,x'})\| \le 2M$ hold for all $t \in [0, \rho_0(Y))$, the increasing process $[N] = ([N]_t)_{t\in\mathbb{R}_+}$, given by $[N]_t := \int_0^t \mathbb{I}_{\{s<\rho_0(Y)\}}\,\|v_s^T\alpha_s\|^2\,\mathrm{d}s$, is well-defined and $[N]_t < \infty$ for every $t \in \mathbb{R}_+$. Hence $N = (N_t)_{t\in\mathbb{R}_+}$, given by $N_t := \int_0^t \mathbb{I}_{\{s<\rho_0(Y)\}}\,v_s^T\alpha_s\,\mathrm{d}B_s$,
is a well-defined local martingale with quadratic variation process $[N]$. Let $\tau = (\tau_s)_{s\in\mathbb{R}_+}$ and $W = (W_s)_{s\in\mathbb{R}_+}$ be the Dambis–Dubins–Schwartz (DDS) time-change and Brownian motion, respectively, for the local martingale $N$ (see [14, Thm 3.4.6, p. 174]). More precisely, let $s \mapsto \tau_s := \inf\{t \in \mathbb{R}_+ : [N]_t > s\}$ (with $\inf\emptyset = \infty$) be the inverse of $t \mapsto [N]_t$. Then $W$ satisfies $W_{[N]_t} = N_t$ for all $t \in \mathbb{R}_+$. Moreover, it holds that $\tau_s < \infty$ for $s < [N]_\infty := \lim_{t\uparrow\infty}[N]_t$. If $[N]_\infty < \infty$ with positive probability, we have to extend the probability space to support $W$ (see e.g. [14, Prob. 3.4.7, p. 175]). This extension, however, has no bearing on the coupling $(X^{\pi,x}, X^{\pi,x'})$.
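The DDS time-change is easy to illustrate on a discretised local martingale $N_t = \int_0^t h_s\,\mathrm{d}B_s$: computing $[N]$, inverting it, and reading $W$ off $N$ recovers the defining identity $W_{[N]_t} = N_t$ on the grid. A minimal numerical sketch (the integrand h below is an arbitrary bounded choice for illustration, not the $v^T\alpha$ of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dt = 10_000, 1e-3
t = np.arange(n + 1) * dt
dB = rng.standard_normal(n) * np.sqrt(dt)
h = 1.0 + 0.5 * np.sin(t[:-1])        # bounded integrand, a stand-in for v^T alpha

N = np.concatenate([[0.0], np.cumsum(h * dB)])      # N_t = int_0^t h_s dB_s
QV = np.concatenate([[0.0], np.cumsum(h**2 * dt)])  # [N]_t = int_0^t h_s^2 ds

# DDS time-change: tau_s is the (generalised) inverse of t -> [N]_t,
# and W_s := N_{tau_s} is the DDS Brownian motion of N.
idx_of = lambda s: np.searchsorted(QV, s)           # grid index of tau_s
W = lambda s: N[idx_of(s)]

# the defining identity W_{[N]_t} = N_t holds along the grid
idx = np.arange(0, n + 1, 500)
assert np.allclose([W(QV[i]) for i in idx], N[idx])
print("W_{[N]_t} = N_t verified on the time grid")
```

Because h is bounded away from zero here, $[N]$ is strictly increasing and the inverse is a genuine inverse; in the proof the indicator in $[N]$ plays the same role up to the coupling time.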
Let $\hat\alpha_s := \alpha_{\tau_s}$, $\hat\delta_s := \delta_{\tau_s}$, $\hat\beta_s := \beta_{\tau_s}$, $\hat v_s := v_{\tau_s}$ and $\hat S_s := \bar S_{\tau_s}$ for $s \in [0, [N]_{\rho_0(Y)})$, cf. (10) above. Assume $\|x - x'\| < \bar\varphi$ and time-change the integrals in (11) (see [14, Prop. 3.4.8, p. 176]) to get
$$\hat S_u = \|x - x'\|^2 + \int_0^u 2\sqrt{\hat S_s}\,\mathrm{d}W_s + \int_0^u (1 + \nu_s)\,\mathrm{d}s, \qquad \text{for any } u \in [0, [N]_T),$$
where $\nu_s := \big(2\sqrt{\hat S_s}\,\hat v_s^T\hat\beta_s + \mathrm{Tr}(\hat\delta_s\hat\delta_s^T) - \|\hat v_s^T\hat\delta_s\|^2\big)/\|\hat v_s^T\hat\alpha_s\|^2$ and $T$ is defined in (14). By (14) it holds that $\|\hat v_s^T\hat\alpha_s\|^2 \ge 4\lambda(1-\epsilon)$ for all $s \in [0, [N]_T)$. Then (12) and the definitions of $\hat\beta$, $\hat\delta$ and $\nu_s$ imply the inequalities $0 \le \nu_s < M_x^2\|Y_{\tau_s}\|^2/(\lambda(1-\epsilon))$ for all $s \in [0, [N]_T)$. Any $\varphi \in (0, \bar\varphi)$ satisfies $\varphi < \epsilon(1-\epsilon)\lambda/M_x^2$ and $R := \rho_0(Y) \wedge \rho_\varphi(Y) \le T$. Hence the Lipschitz property of $\sigma_\pi$ and $\mu_\pi$ on the ball of radius $\varphi$ around $x$ implies
(15)
$$\nu_s < \epsilon \qquad \text{for all } s \in [0, [N]_R).$$
The SDE $S_s = \|x - x'\| + \int_0^s 2\sqrt{S_r}\,\mathrm{d}W_r + (1+\epsilon)s$, $s \in \mathbb{R}_+$, for the squared Bessel process of dimension $1+\epsilon$ has a pathwise unique (and hence strong) solution $S = (S_s)_{s\in\mathbb{R}_+}$; see [6, App. A.3, p. 108]. Note that $S$ is driven by the DDS Brownian motion $W$ introduced above. Hence the coupling $(S, \hat S)$ on the stochastic interval $[0, [N]_R)$ allows us to compare the two processes pathwise. Assume $\|x - x'\| < \varphi$. Then the following equality holds:
$$\sqrt{S_s} - \sqrt{\hat S_s} = \frac{1}{2}\int_0^s \Big(\epsilon/\sqrt{S_r} - \nu_r/\sqrt{\hat S_r}\Big)\,\mathrm{d}r \qquad \text{for any } s \in [0, [N]_R).$$
Almost surely, the path of the process $\big(\sqrt{S_s} - \sqrt{\hat S_s}\big)_{s\in[0,[N]_R)}$ is continuously differentiable and, by (15), its derivative is strictly positive at every zero of the path. Since the derivative is continuous, it must be strictly positive on a neighbourhood of each zero. This implies that the only zero is at $s = 0$ (i.e. $S_0 = \hat S_0$), and it holds that
(16)
$$S_s \ge \hat S_s \qquad \text{for all } s \in [0, [N]_R).$$
We now conclude the proof of the lemma. Assume as before that $\|x - x'\| < \varphi$ and define $\Upsilon_\varphi(\hat S) := \inf\{s \in [0, [N]_T) : \hat S_s = \varphi\}$ (with $\inf\emptyset = \infty$). Note that the events $\{\Upsilon_\varphi(\hat S) < \infty\}$ and $\{[N]_{\rho_\varphi(Y)} < [N]_{\rho_0(Y)}\}$ coincide, since on either event we have $\Upsilon_\varphi(\hat S) = [N]_R = [N]_{\rho_\varphi(Y)}$. Hence,
(17)
$$\{\rho_\varphi(Y) < \rho_0(Y)\} = \{\Upsilon_\varphi(\hat S) < \infty\} \subseteq \{S \text{ exits the interval } (0, \varphi) \text{ at } \varphi\},$$
where the inclusion follows by (16). Recall that $s(z) = z^{(1-\epsilon)/2}$, $z \in \mathbb{R}_+$, is a scale function of the diffusion $S$. Hence $P(S \text{ exits the interval } (0,\varphi) \text{ at } \varphi) = s(\|x - x'\|)/s(\varphi)$. Define $\varphi' := \epsilon\varphi$ and note that by (17) we have $P(\rho_\varphi(Y) < \rho_0(Y)) < \epsilon$ for any $x' \in \mathbb{R}^d$ satisfying $\|x - x'\| < \varphi'$.
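The last step rests on the standard scale-function identity for a regular diffusion: started at $z_0 \in (0, \varphi)$, $S$ exits at the upper end of the interval with probability $(s(z_0) - s(0))/(s(\varphi) - s(0))$. A small sketch of this computation for $s(z) = z^{(1-\epsilon)/2}$ (the numerical values below are illustrative, not taken from the paper):

```python
# Scale function of the squared Bessel process of dimension 1 + eps (eps in (0,1)):
# s solves 2 z s'' + (1 + eps) s' = 0, giving s(z) = z^{(1-eps)/2} up to affine maps.
def scale(z, eps):
    return z ** ((1.0 - eps) / 2.0)

def exit_prob_upper(z0, phi, eps):
    """P(S started at z0 exits (0, phi) at phi) = (s(z0)-s(0)) / (s(phi)-s(0))."""
    return (scale(z0, eps) - scale(0.0, eps)) / (scale(phi, eps) - scale(0.0, eps))

eps, phi = 0.25, 0.1
# s is increasing and vanishes at 0, so the probability lies in (0, 1)
# and increases with the starting point z0:
ps = [exit_prob_upper(z0, phi, eps) for z0 in (0.001, 0.01, 0.05)]
assert all(0.0 < p < 1.0 for p in ps)
assert ps == sorted(ps)
print([round(p, 4) for p in ps])
```

Since $s(0) = 0$, the exit probability reduces to the ratio $s(z_0)/s(\varphi)$ used in the proof; the closer the starting point is to $0$, the likelier the coupling succeeds before the paths drift $\varphi$ apart.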
Lemma 8. Pick a locally Lipschitz Markov policy π : Rd → A and let Assumption 1 hold. Then
the corresponding payoff function in (5), Vπ : Rd → R+ , is continuous.
COUPLING AND A GENERALISED POLICY ITERATION ALGORITHM IN CONTINUOUS TIME
13
Proof. Fix $x \in \mathbb{R}^d$ and pick an arbitrary $\varepsilon \in (0, 1)$. By Assumption 1 there exist $\epsilon_0 > 0$ such that $\alpha_\pi \ge \epsilon_0$, and a constant $M > 1$ such that $\alpha_\pi, |f_\pi| < M$ and $M$ is a Lipschitz constant for $\alpha_\pi$ and $f_\pi$ on the ball of radius one around $x$. Apply Lemma 7 to $x$, $\epsilon := \varepsilon\epsilon_0/(6M)$ and $\pi$ to obtain $\bar\varphi \in (0, 1]$ such that $\forall \varphi \in (0, \bar\varphi)\ \exists \varphi' \in (0, \varphi)$ with $P(\rho_\varphi < \rho_0) < \epsilon$ for every $x' \in \mathbb{R}^d$ satisfying $\|x - x'\| < \varphi'$ (here $\rho_\varphi, \rho_0$ stand for $\rho_\varphi(Y), \rho_0(Y)$, resp.). Specifically, define
(18)
$$\varphi := \min\big\{\bar\varphi/2,\ \varepsilon/\big(3(1 + M/\epsilon_0)M/\epsilon_0\big),\ \varepsilon e\epsilon_0/(3M^2)\big\}$$
and fix $\varphi' \in (0, \varphi)$ such that the conclusion of Lemma 7 holds. Throughout this proof we use the notation and notions from Lemma 7. In particular, $(X^{\pi,x}, X^{\pi,x'})$ denotes the coupling of two controlled processes started at $(x, x')$ and we assume that $\|x - x'\| < \varphi'$.

Recall that $V_\pi(x') = \mathbb{E} F_\infty(X^{\pi,x'})$ for any $x' \in \mathbb{R}^d$, where $F_\infty(X^{\pi,x'})$ is given in (19). By decomposing the probability space into the complementary events $\{\rho_\varphi > \rho_0\}$ and $\{\rho_\varphi < \rho_0\}$, we obtain the inequality $|V_\pi(x) - V_\pi(x')| \le A + A' + A''$, where
$$A := \mathbb{E}\bigg[\mathbb{I}_{\{\rho_\varphi>\rho_0\}} \int_0^{\rho_0} \Big| e^{-\int_0^t \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, f_\pi\big(X_t^{\pi,x}\big) - e^{-\int_0^t \alpha_\pi(X_s^{\pi,x'})\,\mathrm{d}s}\, f_\pi\big(X_t^{\pi,x'}\big)\Big|\,\mathrm{d}t\bigg],$$
$$A' := \mathbb{E}\bigg[\mathbb{I}_{\{\rho_\varphi>\rho_0\}} \int_{\rho_0}^{\infty} \Big| e^{-\int_0^t \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, f_\pi\big(X_t^{\pi,x}\big) - e^{-\int_0^t \alpha_\pi(X_s^{\pi,x'})\,\mathrm{d}s}\, f_\pi\big(X_t^{\pi,x'}\big)\Big|\,\mathrm{d}t\bigg],$$
$$A'' := \mathbb{E}\bigg[\mathbb{I}_{\{\rho_\varphi<\rho_0\}} \int_0^{\infty} \Big| e^{-\int_0^t \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, f_\pi\big(X_t^{\pi,x}\big) - e^{-\int_0^t \alpha_\pi(X_s^{\pi,x'})\,\mathrm{d}s}\, f_\pi\big(X_t^{\pi,x'}\big)\Big|\,\mathrm{d}t\bigg].$$
Hence, by Lemma 7, we have $A'' \le P(\rho_\varphi < \rho_0)\,2M/\epsilon_0 < \epsilon\,2M/\epsilon_0 = \varepsilon/3$.

Since in the summands $A$ and $A'$ the coupling succeeds before the components of $(X^{\pi,x}, X^{\pi,x'})$ grow at least $\varphi$ apart, we can control these terms using the local regularity of $\alpha_\pi$ and $f_\pi$. Consider $A$. Add and subtract $e^{-\int_0^t \alpha_\pi(X_s^{\pi,x'})\,\mathrm{d}s}\, f_\pi(X_t^{\pi,x})$ to obtain the bound:
$$A \le \mathbb{E}\,\mathbb{I}_{\{\rho_\varphi>\rho_0\}} \int_0^{\rho_0} \Big( \big|f_\pi(X_t^{\pi,x}) - f_\pi(X_t^{\pi,x'})\big|\, e^{-\epsilon_0 t} + M\,\Big| e^{-\int_0^t \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s} - e^{-\int_0^t \alpha_\pi(X_s^{\pi,x'})\,\mathrm{d}s} \Big| \Big)\,\mathrm{d}t.$$
On the event $\{\rho_\varphi > \rho_0\}$, for $t < \rho_\varphi$ it holds that $\|X_t^{\pi,x} - X_t^{\pi,x'}\| < \varphi$. Since the map $z \mapsto e^{-z}$ is Lipschitz with constant one on $\mathbb{R}_+$, $\alpha_\pi - \epsilon_0 \ge 0$ and both $f_\pi$ and $\alpha_\pi$ are Lipschitz with constant $M$ on the ball of radius $\varphi$ around $x$, we get
$$A \le \mathbb{E}\,\mathbb{I}_{\{\rho_\varphi>\rho_0\}} \int_0^{\rho_0} \Big( M\varphi\, e^{-\epsilon_0 t} + M e^{-\epsilon_0 t} \int_0^t \big|\alpha_\pi(X_s^{\pi,x}) - \alpha_\pi(X_s^{\pi,x'})\big|\,\mathrm{d}s \Big)\,\mathrm{d}t < \varphi\Big(\frac{M}{\epsilon_0} + \frac{M^2}{\epsilon_0^2}\Big) \le \frac{\varepsilon}{3},$$
where the last inequality follows from (18). Furthermore, since $X_t^{\pi,x'} = X_t^{\pi,x}$ for all $t \ge \rho_0$, the following expectation equals $A'$:
$$\mathbb{E}\,\mathbb{I}_{\{\rho_\varphi>\rho_0\}} \int_{\rho_0}^\infty e^{-\rho_0\epsilon_0} \Big| e^{-\int_0^{\rho_0}(\alpha_\pi(X_s^{\pi,x})-\epsilon_0)\,\mathrm{d}s} - e^{-\int_0^{\rho_0}(\alpha_\pi(X_s^{\pi,x'})-\epsilon_0)\,\mathrm{d}s} \Big|\, e^{-\int_{\rho_0}^t \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, \big|f_\pi(X_t^{\pi,x})\big|\,\mathrm{d}t.$$
Since $|f_\pi| < M$, $\alpha_\pi \ge \epsilon_0$, $|e^{-z} - e^{-y}| \le |z - y|$ for $z, y \in \mathbb{R}_+$ and $t e^{-\epsilon_0 t} \le (e\epsilon_0)^{-1}$ for $t \in \mathbb{R}_+$, we find
$$A' \le \mathbb{E}\Big[\mathbb{I}_{\{\rho_\varphi>\rho_0\}}\, M e^{-\rho_0\epsilon_0} \int_0^{\rho_0} \big|\alpha_\pi(X_s^{\pi,x}) - \alpha_\pi(X_s^{\pi,x'})\big|\,\mathrm{d}s\Big] \le M^2\varphi\,\mathbb{E}\big[\mathbb{I}_{\{\rho_\varphi>\rho_0\}}\, e^{-\rho_0\epsilon_0}\rho_0\big] \le \varphi\,\frac{M^2}{e\epsilon_0},$$
which is, by (18), less than $\varepsilon/3$. Hence we have proved that $|V_\pi(x) - V_\pi(x')| < \varepsilon$ for any $x' \in \mathbb{R}^d$ satisfying $\|x - x'\| < \varphi'$, which concludes the proof of the lemma.
Remark 16. The proofs of Lemmas 7 and 8 show that if the locally Lipschitz property in Assumption 1 is substituted by the globally Lipschitz requirement, we can conclude that the payoff
function Vπ is in fact uniformly continuous. However, the coupling from [18] may still not be
successful, since the global Lipschitz condition only controls the local variability of the coefficients (albeit globally), whereas the assumptions in [18, Thm. 1] constrain the global variability of σπ. In fact, the idea of the proof of Lemma 7 can be used to construct an example where P(ρ0(Y) < ∞) < 1, by bounding kYk² from below by a squared Bessel process of dimension greater than two on an event of positive probability.
5.1.2. A version of the Ascoli–Arzelà Theorem. The following fact is key for proving the existence
of the optimal strategy and showing that a subsequence of {πN }N ∈N in (gPIA) converges to it.
Lemma 9. Let (M1 , d1 ) and (M2 , d2 ) be compact metric spaces, and for every n ∈ N let fn :
$M_1 \to M_2$. If the sequence $\{f_n\}_{n\in\mathbb{N}}$ is equicontinuous, i.e.
$$\forall \epsilon > 0\quad \exists \delta > 0\quad \forall x, y \in M_1\quad \forall n \in \mathbb{N}: \qquad d_1(x, y) < \delta \implies d_2(f_n(x), f_n(y)) < \epsilon,$$
then there exists a uniformly convergent subsequence {fnk }k∈N , i.e. ∃f : M1 → M2 such that for
every ǫ > 0 there exists N ∈ N such that supx∈M1 d2 (fnk (x), f (x)) < ǫ for all k ≥ N .
Proof. Let B(x, 1/m) := {y ∈ M1 : d1 (x, y) < 1/m} be a ball of radius 1/m, m ∈ N, centred
at x ∈ M1 . Since M1 is compact and metric, it is totally bounded: ∃Sm ⊆ M1 finite satisfying
M1 = ∪x∈Sm B(x, 1/m). Then S := ∪m∈N Sm = {xn ∈ M1 ; n ∈ N} is countable and dense in M1 .
We now apply the standard diagonalisation argument to find the subsequence in the lemma.
Let ι1 : N → N be an increasing function defining a subsequence {fι1 (n) }n∈N that converges at
x1 , i.e. limn→∞ fι1 (n) (x1 ) exists in M2 . Such a function ι1 exists since M2 is compact. Assume
now that we have constructed an increasing ιk : N → N such that {fιk (n) }n∈N converges on
the set {x1 , . . . , xk } for some k ∈ N. Then there exists an increasing ι : N → N such that the
sequence of functions {fιk+1 (n) }n∈N , where ιk+1 := ιk ◦ ι, converges at xk+1 as well as on the set
{x1 , . . . , xk }, as it is a subsequence of {fιk (n) }n∈N . Since k ∈ N was arbitrary, we have defined a
sequence of subsequences of {fn }n∈N , such that the k-th subsequence converges on {x1 , . . . , xk }.
Consider the “diagonal” subsequence {fnk }k∈N , fnk := fιk (k) for any k ∈ N. By construction
it converges on S. We now prove that it is uniformly Cauchy, which implies uniform convergence
since M2 is complete. Pick any ǫ > 0. By equicontinuity ∃m ∈ N such that for any k ∈ N and
x, y ∈ M1 satisfying d1 (x, y) < 1/m, it holds that d2 (fnk (x), fnk (y)) < ǫ/3. Furthermore, since
Sm is finite, ∃N ∈ N such that for all natural numbers k1 , k2 ≥ N we have d2 (fnk1 (y), fnk2 (y)) <
ǫ/3 for all y ∈ Sm . Finally, for any x ∈ M1 there exists y ∈ Sm such that d1 (x, y) < 1/m. Hence,
for any k1 , k2 ≥ N it holds that
d2 (fnk1 (x), fnk2 (x)) ≤ d2 (fnk1 (x), fnk1 (y)) + d2 (fnk1 (y), fnk2 (y)) + d2 (fnk2 (y), fnk2 (x)) < ǫ.
Since x ∈ M1 was arbitrary, the lemma follows.
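The diagonal extraction above can be mimicked numerically. The toy sketch below (unrelated to the policies of the paper) takes the equicontinuous family fₙ(x) = sin(x + aₙ) on [0, 2π] with parameters aₙ in a compact set, extracts a subsequence along which the parameters converge (the Bolzano–Weierstrass step), and checks that the corresponding functions are uniformly Cauchy — the conclusion of Lemma 9.

```python
import numpy as np

xs = np.linspace(0.0, 2 * np.pi, 1001)
f = lambda a: np.sin(xs + a)          # equicontinuous family: |f_a(x)-f_a(y)| <= |x-y|

# parameters a_n in the compact set [-2, 2] with two cluster points +-1/2
a = np.array([(-1.0) ** n * (0.5 + 1.0 / n) for n in range(1, 401)])

# Bolzano-Weierstrass step: pick the subsequence with a_{n_k} -> 1/2
sub = [n for n in range(a.size) if abs(a[n] - 0.5) < 0.25]
assert abs(a[sub[-1]] - 0.5) < abs(a[sub[0]] - 0.5)   # parameters converge along it

# uniform Cauchy check: sup-norm distance between late subsequence members is tiny
sup = lambda i, j: np.max(np.abs(f(a[sub[i]]) - f(a[sub[j]])))
assert sup(-1, -2) < sup(0, 1)
assert sup(-1, -2) < 1e-2
print("subsequence is uniformly Cauchy on [0, 2*pi]")
```

Here uniform convergence follows from the Lipschitz bound |sin(x+a) − sin(x+a′)| ≤ |a − a′|, which is exactly the equicontinuity-plus-pointwise-convergence mechanism exploited in the proof.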
5.1.3. A uniformly integrable martingale. If the process X π,x in (4), controlled by a Markov
policy π, exists for all x ∈ Rd , then X π,· is a strong Markov process [14, Thm 4.30, p. 322], since σ
and µ are bounded by Assumption 1. Define the additive functional F (X π,x ) = (Ft (X π,x ))t∈[0,∞] ,
(19)
$$F_t(X^{\pi,x}) := \int_0^t e^{-\int_0^u \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, f_\pi\big(X_u^{\pi,x}\big)\,\mathrm{d}u \qquad \text{for } t \in [0, \infty].$$
Remark 17. Note that Vπ (x) = EF∞ (X π,x ) and, by Assumption 1, the process |F (X π,x )| is
bounded by some constant C0 > 0. Hence |F∞ (X π,x )| < C0 and |Vπ (x)| < C0 .
Lemma 10. The following holds for every Markov policy $\pi$, $x \in \mathbb{R}^d$ and $(\mathcal{F}_t)$-stopping time $T$:
$$\mathbb{E}\big[F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_T\big] = F_T(X^{\pi,x}) + \mathbb{I}_{\{T<\infty\}}\, e^{-\int_0^T \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, V_\pi\big(X_T^{\pi,x}\big).$$
In particular, the process $M = (M_r)_{r\in[0,\infty]}$ is a uniformly integrable martingale, where
$$M_r := F_r(X^{\pi,x}) + \mathbb{I}_{\{r<\infty\}}\, e^{-\int_0^r \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, V_\pi\big(X_r^{\pi,x}\big).$$
Proof. The following calculations imply the lemma:
$$\mathbb{E}\big[F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_T\big] = F_T(X^{\pi,x}) + \mathbb{E}\Big[\mathbb{I}_{\{T<\infty\}} \int_0^\infty e^{-\int_0^{t+T}\alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, f_\pi\big(X_{t+T}^{\pi,x}\big)\,\mathrm{d}t \,\Big|\, \mathcal{F}_T\Big]$$
$$= F_T(X^{\pi,x}) + \mathbb{I}_{\{T<\infty\}}\, e^{-\int_0^T \alpha_\pi(X_t^{\pi,x})\,\mathrm{d}t}\, \mathbb{E}\Big[\int_0^\infty e^{-\int_0^{t}\alpha_\pi(X_{s+T}^{\pi,x})\,\mathrm{d}s}\, f_\pi\big(X_{t+T}^{\pi,x}\big)\,\mathrm{d}t \,\Big|\, \mathcal{F}_T\Big]$$
$$= F_T(X^{\pi,x}) + \mathbb{I}_{\{T<\infty\}}\, e^{-\int_0^T \alpha_\pi(X_t^{\pi,x})\,\mathrm{d}t}\, V_\pi\big(X_T^{\pi,x}\big),$$
where we applied the strong Markov property of $X^{\pi,\cdot}$ in the last step.
5.2. Proofs of results in Section 2.
Proof of Proposition 1. Assume m = d, cf. Remark 4. It suffices to prove that the PDE holds
on the ball D := {y ∈ Rd : ky − x′ k < 1} for any x′ ∈ Rd . Fix x ∈ D and define τ := inf{t ∈
R+ : Xtπ,x ∈ ∂D} (with inf ∅ = ∞) to be the first time the process X π,x hits the boundary
∂D := {y ∈ Rd : ky − x′ k = 1} of D. Note that, by Assumption 1, we have τ < ∞.
Let v ∈ C 2 (D) ∩ C(D̄), where D̄ := D ∪ ∂D, denote a solution of the boundary value problem
$$L_\pi v - \alpha_\pi v + f_\pi = 0 \quad \text{in } D, \qquad \text{where } v = V_\pi \text{ on } \partial D.$$
Since π is locally Lipschitz and (2) in Assumption 1 holds, the coefficients σπ , µπ , fπ , απ are
(1/2)-Hölder (in fact Lipschitz) on D̄. The boundary data Vπ |∂D is continuous by Lemma 8,
απ ≥ 0 and σπ satisfies (3). Hence, by [10, Thm 19, p. 87], the function v exists, is unique and
Hv is (1/2)-Hölder.
Note that, for all $t \in [0, \infty]$, we have $X_{t\wedge\tau}^{\pi,x} \in \bar D$. Hence we can define
(20)
$$Y_t := F_{t\wedge\tau}(X^{\pi,x}) + e^{-\int_0^{t\wedge\tau} \alpha_\pi(X_r^{\pi,x})\,\mathrm{d}r}\, v\big(X_{t\wedge\tau}^{\pi,x}\big), \qquad \text{for } t \in [0, \infty],$$
where the process $F_\cdot(X^{\pi,x})$ is given in (19) above. The process $Y = (Y_t)_{t\in[0,\infty]}$ is bounded by a constant and, by its definition, converges almost surely, $\lim_{t\to\infty} Y_t = Y_\infty$. Since $v$ solves the boundary value problem above and $X^{\pi,x}$ satisfies SDE (4), Itô's formula on the stochastic interval $[0, \tau] \subset \mathbb{R}_+$ yields
$$Y_t = v(x) + \int_0^{t\wedge\tau} e^{-\int_0^s \alpha_\pi(X_r^{\pi,x})\,\mathrm{d}r}\, \nabla v\big(X_s^{\pi,x}\big)^T \sigma_\pi\big(X_s^{\pi,x}\big)\,\mathrm{d}B_s, \qquad t \in [0, \infty],$$
making Y into a local martingale. Since Y is bounded, it is a uniformly integrable martingale
satisfying v(x) = Y0 = E[Y∞ ]. Since v = Vπ on ∂D and Xτπ,x ∈ ∂D, the definition of Y in (20)
and Lemma 10 (applied to the stopping time T := τ ) yield
$$Y_\infty = F_\tau(X^{\pi,x}) + e^{-\int_0^\tau \alpha_\pi(X_r^{\pi,x})\,\mathrm{d}r}\, V_\pi\big(X_\tau^{\pi,x}\big) = \mathbb{E}\big(F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_\tau\big),$$
implying v(x) = E(F∞ (X π,x )) = Vπ (x).
The uniqueness follows similarly: let v be another bounded solution of the Poisson equation
on Rd . Define the process Y as in (20) with τ ≡ ∞ and t < ∞. As above we have v(x) = EYt for
all t ∈ R+ . Then the DCT, applicable since v is bounded, yields v(x) = limt↑∞ EYt = Vπ (x).
Proof of Theorem 2. Let $\pi_n$ and $\pi_{n+1}$ be as in (gPIA), $n \in \mathbb{N}\cup\{0\}$. Define $Y = (Y_t)_{t\in\mathbb{R}_+}$ by
(21)
$$Y_t := F_t(X^{\pi_{n+1},x}) + e^{-\int_0^t \alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\,\mathrm{d}r}\, V_{\pi_n}\big(X_t^{\pi_{n+1},x}\big), \qquad t \in \mathbb{R}_+,$$
where $F_\cdot(X^{\pi_{n+1},x})$ is given in (19) above. Define $\tau_m := \inf\{t \ge 0 : \|X_t^{\pi_{n+1},x}\| = m\}$ for any fixed $m > \|x\|$ and note that $\tau_m < \infty$ by Assumption 1. Itô's formula, applicable by Proposition 1, yields
$$Y_{\cdot\wedge\tau_m} = V_{\pi_n}(x) + M + \int_0^{\cdot\wedge\tau_m} e^{-\int_0^s \alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\,\mathrm{d}r}\, \big(f_{\pi_{n+1}} + L_{\pi_{n+1}} V_{\pi_n} - \alpha_{\pi_{n+1}} V_{\pi_n}\big)\big(X_s^{\pi_{n+1},x}\big)\,\mathrm{d}s,$$
where $M = (M_t)_{t\in\mathbb{R}_+}$, $M_t := \int_0^{t\wedge\tau_m} e^{-\int_0^s \alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\,\mathrm{d}r}\, \big(\nabla V_{\pi_n}^T \sigma_{\pi_{n+1}}\big)\big(X_s^{\pi_{n+1},x}\big)\,\mathrm{d}B_s$, is a local martingale. Since the functions $\sigma_{\pi_{n+1}}$ and $\nabla V_{\pi_n}$ are bounded on the ball $\{y \in \mathbb{R}^d : \|y\| \le m\}$ by Assumption 1 and Proposition 1, respectively, and $\alpha_{\pi_{n+1}} > \epsilon_0 > 0$, the quadratic variation of $M$ is bounded above by a constant. Hence $M$ is a uniformly integrable martingale. In particular, $\mathbb{E} M_t = 0$ for all $t \in \mathbb{R}_+$. By (gPIA) and Proposition 1, we have
$$\big(f_{\pi_{n+1}} + L_{\pi_{n+1}} V_{\pi_n} - \alpha_{\pi_{n+1}} V_{\pi_n}\big)\, S_{\pi_{n+1}} \le \big(f_{\pi_n} + L_{\pi_n} V_{\pi_n} - \alpha_{\pi_n} V_{\pi_n}\big)\, S_{\pi_n} = 0 \qquad \text{on } \mathbb{R}^d.$$
Since $S_{\pi_{n+1}} > 0$, we have $\mathbb{E}(Y_{t\wedge\tau_m}) \le V_{\pi_n}(x)$. Hence (21), Assumption 1 and the DCT, as $t \uparrow \infty$, yield
$$V_{\pi_n}(x) \ge \mathbb{E}\, F_{\tau_m}(X^{\pi_{n+1},x}) + \mathbb{E}\Big[V_{\pi_n}\big(X_{\tau_m}^{\pi_{n+1},x}\big)\, e^{-\int_0^{\tau_m} \alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\,\mathrm{d}r}\Big].$$
Hence, by Remark 17, we have $V_{\pi_n}(x) \ge \mathbb{E}\, F_{\tau_m}(X^{\pi_{n+1},x}) - C_0\,\mathbb{E}\, e^{-\epsilon_0\tau_m}$. Since $X^{\pi_{n+1},x}$ satisfies SDE (4) for all $t \in \mathbb{R}_+$, we have $\lim_{m\uparrow\infty}\tau_m = \infty$. The DCT and Remark 17 yield $V_{\pi_{n+1}}(x) = \mathbb{E}\, F_\infty(X^{\pi_{n+1},x}) = \lim_{m\uparrow\infty}\big(\mathbb{E}\, F_{\tau_m}(X^{\pi_{n+1},x}) - C_0\,\mathbb{E}\, e^{-\epsilon_0\tau_m}\big) \le V_{\pi_n}(x)$, which concludes the proof.
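Theorem 2 is the continuous-time analogue of the monotone-improvement property of Howard's PIA [13]. The discrete-time counterpart is easy to verify on a toy example: the sketch below runs policy iteration on a hypothetical two-state, two-action discounted cost MDP (all numbers invented for illustration) and checks that the cost vectors decrease monotonically between iterations, mirroring Vπn+1 ≤ Vπn.

```python
import numpy as np

# hypothetical 2-state, 2-action discounted cost MDP (minimisation)
gamma = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),   # transition matrix under action 0
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}   # transition matrix under action 1
c = {0: np.array([2.0, 1.0]),                 # per-step costs under action 0
     1: np.array([0.5, 3.0])}                 # per-step costs under action 1

def evaluate(pi):
    """Policy evaluation: solve (I - gamma * P_pi) V = c_pi for the cost V."""
    Ppi = np.array([P[pi[s]][s] for s in range(2)])
    cpi = np.array([c[pi[s]][s] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * Ppi, cpi)

pi = [0, 0]                                   # initial policy
values = [evaluate(pi)]
for _ in range(10):
    V = values[-1]
    # greedy improvement: minimise one-step cost plus discounted continuation
    pi = [min((0, 1), key=lambda a: c[a][s] + gamma * P[a][s] @ V) for s in range(2)]
    values.append(evaluate(pi))

# monotone improvement: V_{n+1} <= V_n componentwise (analogue of Theorem 2)
for Vn, Vn1 in zip(values, values[1:]):
    assert np.all(Vn1 <= Vn + 1e-12)
print("policy iteration produced a monotonically decreasing sequence of costs")
```

In the paper's setting the greedy step is replaced by the pointwise optimisation in (gPIA) and exact policy evaluation by the Poisson equation of Proposition 1, but the monotonicity mechanism is the same.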
Proof of Proposition 3. Run (gPIA) to produce a sequence of policies $\{\pi_N\}_{N\in\mathbb{N}}$, starting from a constant policy $\pi_0$. Fix an arbitrary $K_0 > 0$ and consider the restriction of this sequence to the closed ball $D_{K_0}$. Since the Lipschitz constant of $\pi_0$ is equal to zero and hence smaller than $C_{K_0}$, Assumption 4 implies $V_{\pi_0} \in SB_{K_0,K_0}$. Assumption 3 implies that the Lipschitz constant of $\pi_1$ is also at most $C_{K_0}$. Iterating this argument shows that all the policies in the sequence $\{\pi_N\}_{N\in\mathbb{N}}$ have the same Lipschitz constant on $D_{K_0}$, making the sequence equicontinuous on $D_{K_0}$. By Lemma 9 above, there exists a subsequence that converges uniformly on $D_{K_0}$ to a function $\pi_\infty^0 : D_{K_0} \to A$. Moreover, $\pi_\infty^0$ is also Lipschitz with a constant bounded above by $C_{K_0}$.

Let $K_1 := 2K_0$ and repeat the argument above for $K_1$ and the subsequence of $\{\pi_N\}_{N\in\mathbb{N}}$ constructed in the previous paragraph. This yields a further subsequence of the policies that converges uniformly to a Lipschitz function $\pi_\infty^1 : D_{K_1} \to A$ with the Lipschitz constant bounded above by $C_{K_1}$. Since the sequence we started with converges pointwise to $\pi_\infty^0$ on $D_{K_0} \subset D_{K_1}$, so must its every subsequence. Hence it holds that $\pi_\infty^1(x) = \pi_\infty^0(x)$ for all $x \in D_{K_0}$.

For $k \in \mathbb{N}$, let $K_k := 2K_{k-1}$ and construct $\pi_\infty^k : D_{K_k} \to A$ inductively as above. Then the function $\pi_{\lim} : \mathbb{R}^d \to A$, given by $\pi_{\lim}(x) := \pi_\infty^n(x)$ for any $n \in \mathbb{N}$ such that $x \in D_{K_n}$, is well-defined and locally Lipschitz. Let the policy $\pi_{n_k} : \mathbb{R}^d \to A$ be the $k$-th element of the convergent subsequence used to define $\pi_\infty^k : D_{K_k} \to A$. Then, by construction, the “diagonal” subsequence $\{\pi_{n_k}\}_{k\in\mathbb{N}}$ of $\{\pi_N\}_{N\in\mathbb{N}}$ converges uniformly to $\pi_{\lim}$ on $D_K$ for any $K > 0$.
Proof of Theorem 4. Let $\{\pi_{n_k}\}_{k\in\mathbb{N}}$ be a subsequence of the output of (gPIA), $\{\pi_N\}_{N\in\mathbb{N}}$, that converges locally uniformly to a policy $\pi_{\lim} = \lim_{k\uparrow\infty}\pi_{n_k}$. By (6) and Theorem 2, $V_{\pi_{n_k}} \searrow V_{\lim}$ as $k \to \infty$. Fix $K > 0$ and let $\tau_K := \inf\{t \in \mathbb{R}_+ : X_t^{\pi_{\lim},x} - x \in \partial D_K\}$ be the first time $X^{\pi_{\lim},x}$ hits the boundary of the closed ball $x + D_K$ with radius $K$, centred at an arbitrary $x \in \mathbb{R}^d$. Pick $k \in \mathbb{N}$, $t \in \mathbb{R}_+$ and define
$$S_t^k := \int_0^t e^{-\int_0^s \alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, f_{\pi_{n_k}}\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s + e^{-\int_0^t \alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, V_{\pi_{n_k}}\big(X_t^{\pi_{\lim},x}\big).$$
Apply Itô's formula to the process $S^k = (S_t^k)_{t\ge0}$ on the stochastic interval $[0, \tau_K)$ to get
$$S_{t\wedge\tau_K}^k = V_{\pi_{n_k}}(x) + \int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, \big(\nabla V_{\pi_{n_k}}^T \sigma_{\pi_{\lim}}\big)\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}B_s + \int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, \big(f_{\pi_{n_k}} + L_{\pi_{\lim}} V_{\pi_{n_k}} - \alpha_{\pi_{n_k}} V_{\pi_{n_k}}\big)\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s.$$
Note that $\sigma_{\pi_{\lim}}$ and $\nabla V_{\pi_{n_k}}$ are bounded on $D_K$ by Assumption 1 and Proposition 1, respectively, and $\alpha_{\pi_{n_k}} > \epsilon_0 > 0$. Hence the quadratic variation of the stochastic integral is bounded, making it a true martingale. This fact and the equality $\alpha_{\pi_{n_k}} V_{\pi_{n_k}} - f_{\pi_{n_k}} = L_{\pi_{n_k}} V_{\pi_{n_k}}$ (Prop. 1) yield
$$\mathbb{E}\, S_{t\wedge\tau_K}^k = V_{\pi_{n_k}}(x) + \mathbb{E}\int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, \big(f_{\pi_{n_k}} + L_{\pi_{\lim}} V_{\pi_{n_k}} - \alpha_{\pi_{n_k}} V_{\pi_{n_k}}\big)\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s$$
(22)
$$= V_{\pi_{n_k}}(x) + \mathbb{E}\int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, \big(L_{\pi_{\lim}} V_{\pi_{n_k}} - L_{\pi_{n_k}} V_{\pi_{n_k}}\big)\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s.$$
Note $[L_{\pi_{\lim}} - L_{\pi_{n_k}}]V_{\pi_{n_k}} = (\mu_{\pi_{\lim}} - \mu_{\pi_{n_k}})^T \nabla V_{\pi_{n_k}} + \frac{1}{2}\mathrm{Tr}\big((\sigma_{\pi_{\lim}} + \sigma_{\pi_{n_k}})^T H V_{\pi_{n_k}} (\sigma_{\pi_{\lim}} - \sigma_{\pi_{n_k}})\big)$. Since, for every $k$, $V_{\pi_{n_k}}$ solves the corresponding Poisson equation in Proposition 1 and, by Assumption 1 and Remark 17, the family of functions $\{\sigma_{\pi_{n_k}}, \mu_{\pi_{n_k}}, \alpha_{\pi_{n_k}}, f_{\pi_{n_k}}, V_{\pi_{n_k}} : k \in \mathbb{N}\}$ is uniformly bounded on the ball $x + D_K$, Schauder's boundary estimate for elliptic PDEs [10, p. 86] implies that the sequences $\{\nabla V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ and $\{H V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ are also uniformly bounded on $x + D_K$. Since $\alpha_{\pi_{n_k}} > \epsilon_0 > 0$ for all $k \in \mathbb{N}$ and the limits $\lim_{k\uparrow\infty}\mu_{\pi_{n_k}} = \mu_{\pi_{\lim}}$ and $\lim_{k\uparrow\infty}\sigma_{\pi_{n_k}} = \sigma_{\pi_{\lim}}$ are uniform on $x + D_K$, the DCT and the equality in (22) imply $\lim_{k\uparrow\infty}\mathbb{E}\, S_{t\wedge\tau_K}^k = V_{\lim}(x)$. Hence, the definition of $S^k$ above, Assumption 1, Remark 17 and a further application of the DCT yield
(23)
$$V_{\lim}(x) = \mathbb{E}\int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\pi_{\lim}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, f_{\pi_{\lim}}\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s + E_{t\wedge\tau_K}, \qquad \text{where } E_{t\wedge\tau_K} := \mathbb{E}\Big[e^{-\int_0^{t\wedge\tau_K} \alpha_{\pi_{\lim}}(X_r^{\pi_{\lim},x})\,\mathrm{d}r}\, V_{\lim}\big(X_{t\wedge\tau_K}^{\pi_{\lim},x}\big)\Big].$$
By (6) and Remark 17, the inequality $|V_{\lim}(y)| \le C_0$ holds for all $y \in \mathbb{R}^d$. By Assumption 1 we hence get
$$0 \le \limsup_{t\wedge K\to\infty} |E_{t\wedge\tau_K}| \le C_0 \limsup_{t\wedge K\to\infty} \mathbb{E}\, e^{-\epsilon_0(t\wedge\tau_K)} = 0,$$
since $\tau_K \uparrow \infty$ as $K \uparrow \infty$. The DCT applied to the first summand in (23), as $t \wedge K \to \infty$, yields the equality $V_{\lim}(x) = V_{\pi_{\lim}}(x)$. Since $x \in \mathbb{R}^d$ was arbitrary, the theorem follows.
Proof of Theorem 5. The second assertion in the theorem follows from the first one and Theorem 4. We now establish the first assertion of Theorem 5. Equip $A \times A$ with a product metric, e.g. $d_\infty((p_1, p_2), (a_1, a_2)) := \max\{d_A(a_1, p_1), d_A(a_2, p_2)\}$, and let $\{\pi_N\}_{N\in\mathbb{N}}$ be constructed by the (gPIA). As in the proof of Proposition 3, $\{(\pi_{N+1}, \pi_N) : \mathbb{R}^d \to A \times A\}_{N\in\mathbb{N}}$ are Lipschitz on a closed ball $D_K$ of radius $K > 0$ with the Lipschitz constant $C_K$, independent of $N$. Hence, as in the proof of Proposition 3, there exists a subsequence $\{(\pi_{1+n_k}, \pi_{n_k})\}_{k\in\mathbb{N}}$ that converges uniformly on every compact subset of $\mathbb{R}^d$ to a locally Lipschitz function $(\tilde\pi_{\lim}, \pi_{\lim}) : \mathbb{R}^d \to A \times A$.

Pick any $x \in \mathbb{R}^d$, a policy $\Pi \in \mathcal{A}(x)$, $K > 0$ and let $\tau_K := \inf\{t \in \mathbb{R}_+ : X_t^{\Pi,x} - x \in \partial D_K\}$ be the first time the controlled process $X^{\Pi,x}$ hits the boundary of the closed ball $x + D_K$ with radius $K$ (centred at $x$). Since $\Pi_s \in A$ for all $s \in \mathbb{R}_+$, the (gPIA) implies the inequality
(24)
$$S_{\Pi_s}\big(f_{\Pi_s} + L_{\Pi_s} V_{\pi_{n_k}} - \alpha_{\Pi_s} V_{\pi_{n_k}}\big) \ge S_{\pi_{n_k+1}}\big(f_{\pi_{n_k+1}} + L_{\pi_{n_k+1}} V_{\pi_{n_k}} - \alpha_{\pi_{n_k+1}} V_{\pi_{n_k}}\big) \qquad \text{on } \mathbb{R}^d.$$
Denote $\mathcal{L}_\pi h := L_\pi h - \alpha_\pi h + f_\pi$ for any policy $\pi$ and $h \in C^2(\mathbb{R}^d)$. Then, for $k \in \mathbb{N}$, we find that
$$\mathbb{E}\Big[\int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\Pi_r}(X_r^{\Pi,x})\,\mathrm{d}r}\, f_{\Pi_s}\big(X_s^{\Pi,x}\big)\,\mathrm{d}s + e^{-\int_0^{t\wedge\tau_K} \alpha_{\Pi_r}(X_r^{\Pi,x})\,\mathrm{d}r}\, V_{\pi_{n_k}}\big(X_{t\wedge\tau_K}^{\Pi,x}\big)\Big]$$
$$\stackrel{\text{Itô}}{=} V_{\pi_{n_k}}(x) + \mathbb{E}\int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\Pi_r}(X_r^{\Pi,x})\,\mathrm{d}r}\, \big(f_{\Pi_s} + L_{\Pi_s} V_{\pi_{n_k}} - \alpha_{\Pi_s} V_{\pi_{n_k}}\big)\big(X_s^{\Pi,x}\big)\,\mathrm{d}s$$
(25)
$$\ge V_{\pi_{n_k}}(x) + \frac{\epsilon_S}{M_S}\,\mathbb{E}\int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\Pi_r}(X_r^{\Pi,x})\,\mathrm{d}r}\, \big(\mathcal{L}_{\pi_{n_k+1}} V_{\pi_{n_k}}\big)\big(X_s^{\Pi,x}\big)\,\mathrm{d}s,$$
where the last inequality follows from Assumption 2 and inequality (24).
The next task is to take the limit as $k \to \infty$ on both sides of inequality (25). Since the sequence $\{\pi_{1+n_k}\}_{k\in\mathbb{N}}$ converges locally uniformly to the locally Lipschitz policy $\tilde\pi_{\lim}$ (resp. $\pi_{\lim}$), Theorem 4 implies $V_{\tilde\pi_{\lim}} = V_{\lim}$ (resp. $V_{\pi_{\lim}} = V_{\lim}$). Proposition 1 implies $\mathcal{L}_{\tilde\pi_{\lim}} V_{\lim} = 0 = \mathcal{L}_{\pi_{\lim}} V_{\lim}$. Hence we can express $\mathcal{L}_{\pi_{n_k+1}} V_{\pi_{n_k}} = \mathcal{L}_{\pi_{n_k+1}} V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}} V_{\pi_{n_k}} + \mathcal{L}_{\tilde\pi_{\lim}} V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}} V_{\lim}$. By Schauder's boundary estimate for elliptic PDEs [10, p. 86], the sequences $\{\nabla V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ and $\{H V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ are uniformly bounded on $x + D_K$. By Assumption 1 and Remark 17, the bounded sequence $\{(\sigma_{\pi_{n_k+1}}, \mu_{\pi_{n_k+1}}, \alpha_{\pi_{n_k+1}}, f_{\pi_{n_k+1}}, V_{\pi_{n_k+1}})\}_{k\in\mathbb{N}}$ tends to the limit $(\sigma_{\tilde\pi_{\lim}}, \mu_{\tilde\pi_{\lim}}, \alpha_{\tilde\pi_{\lim}}, f_{\tilde\pi_{\lim}}, V_{\tilde\pi_{\lim}})$ uniformly on $x + D_K$ as $k \uparrow \infty$. Hence, so does
(26)
$$\mathcal{L}_{\pi_{n_k+1}} V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}} V_{\pi_{n_k}} = [L_{\pi_{n_k+1}} - L_{\tilde\pi_{\lim}}]V_{\pi_{n_k}} - (\alpha_{\pi_{n_k+1}} - \alpha_{\tilde\pi_{\lim}})V_{\pi_{n_k}} + (f_{\pi_{n_k+1}} - f_{\tilde\pi_{\lim}}) \to 0.$$
By the elliptic version of Theorem 15 in [10, p. 80] applied to the family of PDEs $\mathcal{L}_{\pi_{n_k}} V_{\pi_{n_k}} = 0$, $k \in \mathbb{N}$, there exists a subsequence of $\{V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ (again denoted by $\{V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$) such that the corresponding sequence $\{(V_{\pi_{n_k}}, \nabla V_{\pi_{n_k}}, H V_{\pi_{n_k}})\}_{k\in\mathbb{N}}$ converges uniformly on the closed ball $x + D_K$ to $(V_{\pi_{\lim}}, \nabla V_{\pi_{\lim}}, H V_{\pi_{\lim}}) = (V_{\lim}, \nabla V_{\lim}, H V_{\lim})$. Hence it follows that
(27)
$$\mathcal{L}_{\tilde\pi_{\lim}} V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}} V_{\lim} = L_{\tilde\pi_{\lim}}(V_{\pi_{n_k}} - V_{\lim}) - \alpha_{\tilde\pi_{\lim}}(V_{\pi_{n_k}} - V_{\lim}) \to 0, \qquad \text{as } k \to \infty.$$
Equations (26) and (27) imply that $\mathcal{L}_{\pi_{n_k+1}} V_{\pi_{n_k}} \to 0$ as $k \to \infty$ uniformly on the ball $x + D_K$. Apply the DCT to the right-hand side of (25) and Assumption 1 and Remark 17 to its left-hand side:
$$V_{\lim}(x) \le \mathbb{E}\int_0^{t\wedge\tau_K} e^{-\int_0^s \alpha_{\Pi_r}(X_r^{\Pi,x})\,\mathrm{d}r}\, f_{\Pi_s}\big(X_s^{\Pi,x}\big)\,\mathrm{d}s + C_0\,\mathbb{E}\, e^{-\epsilon_0(t\wedge\tau_K)}.$$
Since this inequality holds for all $K, t > 0$ and $\tau_K \uparrow \infty$ as $K \uparrow \infty$, the inequality $V_{\lim}(x) \le V_\Pi(x)$ follows by the DCT as $t \wedge K \to \infty$ (cf. the last paragraph of the proof of Theorem 4).
5.3. Auxiliary results – the one-dimensional case. Throughout Sections 5.3 and 5.4, define $\tau_c^d(Z) := \inf\{t \ge 0 : Z_t \in \{c, d\}\}$ ($\inf\emptyset = \infty$) for any continuous stochastic process $(Z_t)_{t\in\mathbb{R}_+}$ in $\mathbb{R}$ and $-\infty \le c < d \le \infty$.
Lemma 11. For any Markov policy π : (a, b) → A, the payoff function Vπ : (a, b) → R can be
continuously extended by defining Vπ (a) := g(a) if a > −∞ and Vπ (b) := g(b) if b < ∞.
Proof. Let $\{x_n\}_{n\in\mathbb{N}}$ be a decreasing sequence in $(a, b)$ that converges to $a > -\infty$. We now prove that $\lim_{n\to\infty} V_\pi(x_n) = g(a)$ (the argument for $b$ is analogous).

Pick an arbitrary $\epsilon > 0$. Since $\mu$ is bounded and $\sigma^2$ is bounded and bounded away from $0$, a simple coupling argument yields that the process $X^{\pi,x_n}$ can be bounded by a Brownian motion with drift so that $P\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\big) < \epsilon$ and $P\big(\tau_a^\infty(X^{\pi,x_n}) > \tau_{-\infty}^b(X^{\pi,x_n})\big) < \epsilon$ hold for large $n \in \mathbb{N}$. Hence, there exists $n_0 \in \mathbb{N}$ such that
$$P\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon \wedge \tau_{-\infty}^b(X^{\pi,x_n})\big) \le P\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\big) + P\big(\tau_a^\infty(X^{\pi,x_n}) > \tau_{-\infty}^b(X^{\pi,x_n})\big) < 2\epsilon$$
for all $n \ge n_0$. Define the quantities
$$B_a^b := \Big| e^{-\int_0^{\tau_a^b(X^{\pi,x_n})} \alpha_\pi(X_t^{\pi,x_n})\,\mathrm{d}t}\, g\big(X_{\tau_a^b(X^{\pi,x_n})}^{\pi,x_n}\big) - g(a) \Big| \quad\text{and}\quad A_a^b := \int_0^{\tau_a^b(X^{\pi,x_n})} e^{-\int_0^t \alpha_\pi(X_s^{\pi,x_n})\,\mathrm{d}s}\, \big|f_\pi\big(X_t^{\pi,x_n}\big)\big|\,\mathrm{d}t,$$
and the event $C := \{\tau_a^\infty(X^{\pi,x_n}) \le \epsilon \wedge \tau_{-\infty}^b(X^{\pi,x_n})\}$. Then we have
$$|V_\pi(x_n) - g(a)| \le \mathbb{E}\big[(A_a^b + B_a^b)\,\mathbb{I}_{\Omega\setminus C} + (A_a^b + B_a^b)\,\mathbb{I}_C\big].$$
We now show that there exists $M > 0$, which does not depend on $\epsilon$, such that $|V_\pi(x_n) - g(a)|$ is bounded above by $4M\epsilon$ for all $n \ge n_0$. The expectation on the event $\Omega\setminus C$, which has probability less than $2\epsilon$, is smaller than $2M\epsilon$ since $f, g$ are bounded and $\alpha \ge \epsilon_0 > 0$. On the event $C$ we have $\tau_a^b(X^{\pi,x_n}) \le \epsilon$, which implies $\mathbb{E}\, A_a^b\,\mathbb{I}_C < M\epsilon$. On $C$ it holds that $X_{\tau_a^b(X^{\pi,x_n})}^{\pi,x_n} = a$. Hence the elementary inequality $1 - e^{-x} \le x$ for $x \ge 0$ yields an upper bound on $\mathbb{E}\, B_a^b\,\mathbb{I}_C$ of the form $|g(a)|\,\mathbb{E}\big[\int_0^\epsilon \alpha_\pi(X_t^{\pi,x_n})\,\mathrm{d}t\big]$. This concludes the proof.
Lemma 12 is the analogue of Lemma 10 with an analogous proof, which we omit for brevity.

Lemma 12. The following holds for every Markov policy $\pi$, $x \in (a, b)$ and stopping time $\rho$:
$$\mathbb{E}\Big[F_{\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + e^{-\int_0^{\tau_a^b(X^{\pi,x})} \alpha_\pi(X_t^{\pi,x})\,\mathrm{d}t}\, g\big(X_{\tau_a^b(X^{\pi,x})}^{\pi,x}\big)\,\mathbb{I}_{\{\tau_a^b(X^{\pi,x})<\infty\}}\ \Big|\ \mathcal{F}_\rho\Big] = M_\rho,$$
where $M_r := F_{r\wedge\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + \mathbb{I}_{\{r<\infty\}}\, e^{-\int_0^{r\wedge\tau_a^b(X^{\pi,x})} \alpha_\pi(X_s^{\pi,x})\,\mathrm{d}s}\, V_\pi\big(X_{r\wedge\tau_a^b(X^{\pi,x})}^{\pi,x}\big)$, for $r \in [0, \infty]$. In particular, the process $M = (M_r)_{r\in[0,\infty]}$ is a uniformly integrable martingale.
5.4. Proofs of results in Section 3.
Proof of Proposition 1 in dimension one. Recall that Assumption 1 holds. We need to show that for any locally Lipschitz Markov policy $\pi : (a, b) \to A$ we have $V_\pi \in C^2((a, b))$ and $L_\pi V_\pi - \alpha_\pi V_\pi + f_\pi = 0$.

Let $a < a' < a'' < x < b'' < b' < b$, and for any $c < d$ denote $\tau_c^d := \tau_c^d(X^{\pi,x})$. Let $v \in C^2((a', b')) \cap C([a', b'])$ be the unique solution of the boundary value problem $L_\pi v - \alpha_\pi v + f_\pi = 0$, $v(a') = V_\pi(a')$, $v(b') = V_\pi(b')$, guaranteed to exist by Theorem 19 in [10, p. 87], which is applicable by Assumption 1. Let
$$S_t^{a'',b''} := F_{t\wedge\tau_{a''}^{b''}}(X^{\pi,x}) + e^{-\int_0^{t\wedge\tau_{a''}^{b''}} \alpha_\pi(X_r^{\pi,x})\,\mathrm{d}r}\, v\big(X_{t\wedge\tau_{a''}^{b''}}^{\pi,x}\big).$$
Then, by Itô's formula on $[0, \tau_{a''}^{b''}]$ and the definition of $v$, the process $S^{a'',b''} = (S_t^{a'',b''})_{t\ge0}$ satisfies
$$S_t^{a'',b''} = v(x) + \int_0^{t\wedge\tau_{a''}^{b''}} e^{-\int_0^s \alpha_\pi(X_r^{\pi,x})\,\mathrm{d}r}\, \big(\sigma_\pi v'\big)\big(X_s^{\pi,x}\big)\,\mathrm{d}B_s.$$
Hence $S^{a'',b''}$ is clearly a uniformly integrable martingale and the following equalities hold: $\lim_{t\uparrow\infty}\mathbb{E}\big|S_t^{a'',b''} - S_\infty^{a'',b''}\big| = 0$ and $v(x) = \mathbb{E}\, S_\infty^{a'',b''}$. Define $S^{a',b'}$ by substituting $\tau_{a''}^{b''}$ in the definition of $S^{a'',b''}$ with $\tau_{a'}^{b'}$. Since $X^{\pi,x}$ is continuous, we have $\lim_{a''\downarrow a',\,b''\uparrow b'} \tau_{a''}^{b''} = \tau_{a'}^{b'}$ a.s. Hence, by the DCT, $v(x) = \lim_{a''\downarrow a',\,b''\uparrow b'} \mathbb{E}\, S_\infty^{a'',b''} = \mathbb{E}\, S_\infty^{a',b'}$.

Note that the boundary conditions for $v$, the fact $X_{\tau_{a'}^{b'}}^{\pi,x} \in \{a', b'\}$ and Lemma 12 (with $\rho = \tau_{a'}^{b'}$) imply
$$S_\infty^{a',b'} = \mathbb{E}\Big(F_{\tau_a^b}(X^{\pi,x}) + e^{-\int_0^{\tau_a^b} \alpha_\pi(X_r^{\pi,x})\,\mathrm{d}r}\, g\big(X_{\tau_a^b}^{\pi,x}\big)\,\mathbb{I}_{\{\tau_a^b<\infty\}}\ \Big|\ \mathcal{F}_{\tau_{a'}^{b'}}\Big).$$
Taking expectations on both sides of this equality yields $v(x) = V_\pi(x)$.
Proof of Theorem 2 in dimension one. We claim that under Assumptions 1–2, the inequality $V_{\pi_{n+1}}(x) \le V_{\pi_n}(x)$ holds for all $x \in (a, b)$ and $n \in \mathbb{N}$, where $\pi_{n+1}$ is defined in (7).

Define the process $Y$ as in (21) in Section 5.2 and consider the stopped process $Y_{\cdot\wedge\tau_{a'}^{b'}}$, where $a < a' < x < b' < b$ and $\tau_c^d := \tau_c^d(X^{\pi_{n+1},x})$ for any $c < d$. Then the proof follows the same steps as the proof of Theorem 2 in Section 5.2. The only difference is that in the penultimate line of the proof in Section 5.2 we apply the DCT and Lemma 11 (instead of the DCT only) to obtain $V_{\pi_n}(x) \ge V_{\pi_{n+1}}(x)$.
The proof of the one-dimensional case of Proposition 3 is completely analogous to the multidimensional one and is hence omitted.
Proof of Theorem 4 in one dimension. We need to show that $V_{\lim}(x) = V_{\pi_{\lim}}(x)$ holds for all $x \in (a, b)$. The proof follows along the same lines as in the multi-dimensional case of Section 5.2. The only difference lies in the fact that we stop the process $X^{\pi_{\lim},x}$ at $\tau_{a'}^{b'}(X^{\pi_{\lim},x})$, where $a < a' < x < b' < b$, and take the limit as $(a', b', t) \to (a, b, \infty)$.
The verification lemma in the one-dimensional case is established exactly as in the proof of
Theorem 5. The details are omitted.
References
[1] A. Arapostathis, On the Policy Iteration Algorithm for Nondegenerate Controlled Diffusions Under the Ergodic Criterion. Optimization, Control, and Applications of Stochastic Systems, 1–12, 2012.
[2] A. Arapostathis and V.S. Borkar, Uniform recurrence properties of controlled diffusions and applications to
optimal control. SIAM J. Control Optim., 48(7), 152–160, 2010.
[3] P. Azimzadeh, E. Bayraktar and G. Labahn, Convergence of approximation schemes for weakly nonlocal
second order equations. arXiv:1705.02922v2, 2017.
[4] D. P. Bertsekas, Approximate Policy Iteration: a Survey and Some New Methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.
[5] A. N. Borodin, P. Salminen, Handbook of Brownian Motion – Facts and Formulae. Birkhauser, Basel, second
edition, 2002.
[6] A. S. Cherny and H.-J. Engelbert, Singular Stochastic Differential Equations. Lecture Notes in Mathematics,
Springer-Verlag, Berlin Heidelberg, 2005.
[7] B. T. Doshi, Continuous Time Control of Markov Processes on an Arbitrary State Space: Discounted Rewards.
The Annals of Statistics, 4(6):1219–1235, 1976.
[8] E.B. Dynkin, Markov Processes, vol I. Springer-Verlag, Berlin, 1965.
[9] E. Fournié, J.-M. Lasry, J. Lebuchoux, P.-L. Lions, N. Touzi, Applications of Malliavin calculus to Monte
Carlo methods in finance. Finance and Stochastics, 3:391–412, 1999.
[10] A. Friedman, Partial Differential Equations of Parabolic Type. Prentice-Hall, Englewood Cliffs, N.J., 1964.
[11] O. Hernández-Lerma and J. B. Lasserre, Policy iteration for average cost Markov control processes on Borel
spaces. Acta Applicandae Mathematicae, 47(2):125–154, 1997.
[12] A. Hordijk and M. L. Puterman, On the convergence of policy iteration in finite state undiscounted Markov
decision processes: the unichain case. Mathematics of Operations Research, 12(1):163–176, 1987.
[13] R. A. Howard, Dynamic Programming and Markov Processes. MIT Press, Cambridge, 1960.
[14] I. Karatzas and S. E. Shreve, Brownian Motion and Stochastic Calculus. Springer-Verlag, New York, second
edition, 1991.
[15] N. V. Krylov, Controlled Diffusion Processes. Springer-Verlag, New York, 1980.
[16] M. G. Lagoudakis and R. Parr, Least-Squares Policy Iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[17] J. B. Lasserre, A new policy iteration scheme for Markov decision processes using Schweitzer’s formula.
Journal of Applied Probability, 31(1):268–273, 1994.
[18] T. Lindvall and L. C. G. Rogers, Coupling of Multidimensional Diffusions by Reflection. The Annals of
Probability, 14(3):860–872, 1986.
[19] S. Mahadevan, Representation Policy Iteration. arXiv:1207.1408 [cs.AI], 2012.
[20] S. P. Meyn, The policy iteration algorithm for average reward Markov decision processes with general state
space. IEEE Transactions on Automatic Control, 42(12):1663–1680, 1997.
[21] P. E. Protter, Stochastic Integration and Differential Equations. Stochastic Modelling and Applied Probability,
Springer-Verlag, Berlin Heidelberg, 2004.
[22] J. P. Rust, A Comparison of Policy Iteration Methods for Solving Continuous-State, Infinite-Horizon Markovian Decision Problems Using Random, Quasi-Random, and Deterministic Discretizations. Available at
http://ssrn.com/abstract=37768, 1997.
[23] M. S. Santos and J. Rust, Convergence properties of policy iteration. SIAM Journal on Control and Optimization, 42(6):2094–2115, 2004.
[24] Q. X. Zhu, X. S. Yang and C. X. Huang, Policy iteration for continuous-time average reward Markov decision
processes in Polish spaces. Abstract and Applied Analysis, 2009, Article ID 103723, 17 pages.
Department of Statistics, University of Warwick, and The Alan Turing Institute, UK
E-mail address: s.d.jacka@warwick.ac.uk
Department of Mathematics, King’s College London, and The Alan Turing Institute, UK
E-mail address: aleksandar.mijatovic@kcl.ac.uk
Department of Statistics, University of Warwick, UK
E-mail address: d.siraj@warwick.ac.uk