

Coupling and a generalised Policy Iteration Algorithm in continuous time

2017, arXiv (Cornell University)

COUPLING AND A GENERALISED POLICY ITERATION ALGORITHM IN CONTINUOUS TIME

arXiv:1707.07834v1 [math.PR] 25 Jul 2017

SAUL D. JACKA, ALEKSANDAR MIJATOVIĆ, AND DEJAN ŠIRAJ

Abstract. We analyse a version of the policy iteration algorithm for the discounted infinite-horizon problem for controlled multidimensional diffusion processes, where both the drift and the diffusion coefficient can be controlled. We prove that, under assumptions on the problem data, the payoffs generated by the algorithm converge monotonically to the value function and an accumulation point of the sequence of policies is an optimal policy. The algorithm is stated and analysed in continuous time and state, with discretisation featuring neither in the theorems nor in the proofs. A key technical tool used to show that the algorithm is well-defined is the mirror coupling of Lindvall and Rogers.

Date: November 7, 2018.
2010 Mathematics Subject Classification. 93E20, 49L99, 60H30.
Key words and phrases. Policy iteration algorithm, policy improvement, optimal stochastic control, controlled diffusion processes, discounted infinite horizon problem, mirror coupling.
SD supported by the EPSRC grant EP/P00377X/1; AM supported by the EPSRC grant EP/P003818/1 and the Programme on Data-Centric Engineering funded by Lloyd's Register Foundation; DŠ supported by the Slovene Human Resources Development and Scholarship Fund.

1. Introduction

Howard's policy iteration algorithm (PIA) [13] is a well-known tool for solving control problems for Markov decision processes (see e.g. [4] for a recent survey of approximate policy iteration methods for finite-state, discrete-time stochastic dynamic programming problems). The algorithm can be recast for continuous-time control problems on general state spaces, where the allowed actions are chosen from a Polish space. In this paper we investigate the convergence of the PIA for an infinite-horizon discounted cost problem in the context of controlled diffusion processes in $\mathbb{R}^d$, where the control takes values in an arbitrary compact metric space.

The main aim of the paper is to analyse the convergence of the sequence of policies and the corresponding payoff functions produced by the PIA under assumptions that are, at least in principle, verifiable in terms of the model data. Our control setting is similar to that of [1], where an ergodic cost criterion was considered. The main differences, beyond the cost criterion, are as follows: (1) we allow the controller to modulate the drift as well as the diffusion coefficient; (2) we consider a generalised version of the PIA where an arbitrary scaling function can be used to simplify the algorithm; (3) we investigate the convergence not only of the payoffs but also of the policies produced by the PIA; (4) we work with Markov policies that are defined for every $x \in \mathbb{R}^d$, not almost everywhere, and obtain locally uniform convergence of a subsequence to the optimal policy.

The last point in this list is particularly important in our setting, as our aim is to design and analyse an algorithm that can, in principle at least, be used to construct an optimal policy. This requirement forces us to work in the context of classical solutions of PDEs, rather than relying on generalised solutions of the Poisson equation in appropriate Sobolev spaces. The latter approach, followed in [1], is based on the fact that there exists a canonical solution to the Poisson equation in $W^{2,p}_{\mathrm{loc}}(\mathbb{R}^d)$, see [2].
The analysis of the PIA can then be performed using Krylov's extension [15] of Itô's formula to functions in the Sobolev space $W^{2,p}_{\mathrm{loc}}(\mathbb{R}^d)$. Our method for solving the Poisson equation in the classical sense is based on the coupling of Lindvall and Rogers [18]. This coupling plays a crucial role in the proof of Proposition 1, guaranteeing that the payoff function of a locally Lipschitz Markov policy is the classical solution of the corresponding Poisson equation. The main technical contribution of the paper is the result in Lemma 7, which shows that the mirror coupling from [18] is successful with very high probability when the diffusion processes are started sufficiently close together. Interestingly, the condition in [18] for the coupling to be successful is not satisfied in our setting in general. The proof of Lemma 7 is based on a local pathwise comparison of a time-change of the distance between the coupled diffusions and an appropriately chosen Bessel process.

The convergence of the policies and payoffs in the PIA is obtained in several steps. First we show that the PIA always improves the payoff. Then we prove, using a "diagonalisation" argument and an Arzelà–Ascoli type result, that a subsequence of the policies produced by the PIA converges locally uniformly to a locally Lipschitz limiting policy. The final stage of the argument shows that this limiting policy is indeed an optimal policy with payoff equal to the pointwise limit of the payoffs produced by the PIA. These steps are detailed in Section 2 and proved in Section 5.

The literature on the PIA for Markov decision processes in various settings is extensive (see e.g. [7], [11], [12], [17], [20], [16], [19], [22], [23] and [24] and the references therein). Our approach is in some sense closest to the continuous-time analysis on general state spaces in [7], where the convergence of the subsequence of the policies is established in the case of a finite action space. In [24] this restriction is removed, but the controlled processes considered do not include diffusions. A recent application of the PIA to impulse control in continuous time is given in [3].

Finally, as mentioned above, we observe that the PIA can be generalised by multiplying the expression to be minimised by an arbitrary positive scaling function that can depend both on the state and the control action (see (gPIA) below). The choice of the scaling function clearly influences the sequence of policies produced by the algorithm. In particular, in the one-dimensional case, the scaling function can be used to eliminate the second derivative of the payoff from the algorithm. This idea is described in Section 3. A numerical example of the PIA is reported in Section 4. In this example at least, the convergence of the PIA is very fast, as the algorithm finds an optimal policy in fewer than six iterations.

2. Multidimensional controlled diffusion processes

Let $(A, d_A)$ be a compact metric space of available control actions and, for some $d, m \in \mathbb{N}$, let $\sigma : \mathbb{R}^d \times A \to \mathbb{R}^{d \times m}$ and $\mu : \mathbb{R}^d \times A \to \mathbb{R}^d$ be measurable functions. Let the set $\mathcal{A}(x)$ of admissible policies at $x \in \mathbb{R}^d$ consist of processes $\Pi = (\Pi_t)_{t \ge 0}$ satisfying the following: $\Pi$ is $A$-valued, adapted to a filtration $(\mathcal{F}_t)_{t \ge 0}$, and there exists an $(\mathcal{F}_t)$-adapted process $X^{\Pi,x} = (X_t^{\Pi,x})_{t \ge 0}$ satisfying the SDE

(1)  $X_t^{\Pi,x} = x + \int_0^t \sigma(X_s^{\Pi,x}, \Pi_s)\,\mathrm{d}B_s + \int_0^t \mu(X_s^{\Pi,x}, \Pi_s)\,\mathrm{d}s, \qquad t \ge 0,$

where $B = (B_t)_{t \ge 0}$ is an $m$-dimensional $(\mathcal{F}_t)$-Brownian motion.
Note that the filtration $(\mathcal{F}_t)_{t \ge 0}$ (and indeed the entire filtered probability space) depends on the policy $\Pi$ in $\mathcal{A}(x)$. Pick measurable functions $\alpha : \mathbb{R}^d \times A \to \mathbb{R}$ and $f : \mathbb{R}^d \times A \to \mathbb{R}$. For any $x \in \mathbb{R}^d$ and $\Pi \in \mathcal{A}(x)$, define the payoff of the policy $\Pi$ by

$V_\Pi(x) := \mathbb{E}\left[\int_0^\infty e^{-\int_0^t \alpha(X_s^{\Pi,x},\,\Pi_s)\,\mathrm{d}s}\, f(X_t^{\Pi,x}, \Pi_t)\,\mathrm{d}t\right].$

Control problem. Construct the value function $V$, defined by $V(x) := \inf_{\Pi \in \mathcal{A}(x)} V_\Pi(x)$, $x \in \mathbb{R}^d$, and, if it exists, an optimal control $\{\Pi^x \in \mathcal{A}(x) : x \in \mathbb{R}^d\}$ satisfying $V(x) = V_{\Pi^x}(x)$.

Note first that the problem is specified completely by the deterministic data $\sigma$, $\mu$, $\alpha$ and $f$. In order to define an algorithm that solves this problem, the functions $\sigma$, $\mu$, $\alpha$, $f$ are assumed to satisfy Assumption 1 below throughout this section.

Assumption 1. The functions $\sigma$, $\mu$, $\alpha$ and $f$ are bounded, and Lipschitz on compacts in $\mathbb{R}^d \times A$, i.e. for every compact set $K \subseteq \mathbb{R}^d \times A$ there exists a constant $C_K > 0$ such that

(2)  $\|h(x,p) - h(y,r)\| \le C_K \big(\|x-y\|^2 + d_A(p,r)^2\big)^{1/2}$

holds for every $(x,p), (y,r) \in K$ and $h \in \{\sigma, \mu, \alpha, f\}$. In addition, $\alpha(x,p) > \epsilon_0 > 0$ for all $(x,p) \in \mathbb{R}^d \times A$, and there exists $\lambda > 0$ such that

(3)  $\langle \sigma(x,p)\sigma(x,p)^T v, v\rangle \ge \lambda \|v\|^2$  for all $x \in \mathbb{R}^d$, $p \in A$, $v \in \mathbb{R}^d$.

Remark 1. Here, and throughout the paper, $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ denote the Euclidean norm and inner product, respectively. The norm $\|M\| = \sup\{\|Mv\|/\|v\| : v \in \mathbb{R}^m \setminus \{0\}\} = \sqrt{\lambda_{\max}(MM^T)}$, for a matrix $M \in \mathbb{R}^{d \times m}$, is used in (2) for $h = \sigma$, where $\lambda_{\max}(MM^T)$ is the largest eigenvalue of the non-negative definite matrix $MM^T \in \mathbb{R}^{d\times d}$ and $M^T \in \mathbb{R}^{m\times d}$ denotes the transpose of $M$.

Remark 2. The uniform ellipticity condition in (3) is the multidimensional analogue of the volatility being bounded away from 0. Hence, for all $x \in \mathbb{R}^d$ and $p \in A$, the smallest eigenvalue of $\sigma(x,p)\sigma(x,p)^T \in \mathbb{R}^{d\times d}$ is at least of size $\lambda$ and, in particular, $m \ge d$ (cf. Remark 3 below).

A measurable function $\pi : \mathbb{R}^d \to A$ is a Markov policy (or, synonymously, Markov control) if for every $x \in \mathbb{R}^d$ there exists an $\mathbb{R}^d$-valued process $X^{\pi,x} = (X_t^{\pi,x})_{t\ge 0}$ that satisfies the following SDE:

(4)  $X_t^{\pi,x} = x + \int_0^t \sigma\big(X_s^{\pi,x}, \pi(X_s^{\pi,x})\big)\,\mathrm{d}B_s + \int_0^t \mu\big(X_s^{\pi,x}, \pi(X_s^{\pi,x})\big)\,\mathrm{d}s, \qquad t \ge 0.$

Let $(\mathcal{F}_t)_{t\ge 0}$ be a filtration with respect to which $(X^{\pi,x}, B)$ is adapted and $B$ is an $(\mathcal{F}_t)$-Brownian motion. Such a filtration $(\mathcal{F}_t)_{t\ge 0}$ exists by the definition of a solution of SDE (4), see e.g. [14, Def. 5.3.1, p. 300]. Then $(\mathcal{F}_t)_{t\ge 0}$ can be taken to be the filtration in the definition of the policy $\pi(X^{\pi,x}) \in \mathcal{A}(x)$. Moreover, without loss of generality, we may assume that $(\mathcal{F}_t)_{t\ge 0}$ satisfies the usual conditions.

For any function $\pi : \mathbb{R}^d \to A$ that is Lipschitz on compacts (i.e. locally Lipschitz) in $\mathbb{R}^d$, Assumption 1 implies (see e.g. [5, p. 45] and the references therein) that the SDE in (4) has a unique, strong, non-exploding solution, thus making $\pi$ a Markov policy. The payoff function of a locally Lipschitz Markov policy is a classical solution of a linear PDE, a fact key for (gPIA) to work. To state this formally, recall that $h : \mathbb{R}^d \to \mathbb{R}^k$ (for any $k \in \mathbb{N}$) is $(1/2)$-Hölder continuous on a compact $D \subset \mathbb{R}^d$ if $\|h(x') - h(x'')\| \le K_D \|x' - x''\|^{1/2}$ holds for some constant $K_D > 0$ and all $x', x'' \in D$. Streamline the notation for Markov policies as follows:

(5)  $V_\pi(\cdot) := V_{\pi(X^{\pi,\cdot})}(\cdot)$ and $L_\pi h := \tfrac{1}{2}\mathrm{Tr}\big(\sigma_\pi^T\, Hh\, \sigma_\pi\big) + \mu_\pi^T \nabla h$ for $h \in C^2(\mathbb{R}^d)$, where $\sigma_\pi(\cdot) := \sigma(\cdot, \pi(\cdot))$, $\mu_\pi(\cdot) := \mu(\cdot, \pi(\cdot))$, $\alpha_\pi(\cdot) := \alpha(\cdot, \pi(\cdot))$, $f_\pi(\cdot) := f(\cdot, \pi(\cdot))$,

and where $Hh$ and $\nabla h$ are the Hessian and gradient of $h$, respectively, and $\mathrm{Tr}(M)$ denotes the trace of any matrix $M \in \mathbb{R}^{m\times m}$.
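To fix ideas before turning to the analytical results, the following minimal sketch (not taken from the paper) simulates the controlled SDE (4) under a Markov policy by an Euler–Maruyama scheme and estimates the payoff $V_\pi(x)$ by Monte Carlo. The discretisation, the truncation of the infinite horizon and all parameter values are our own illustrative assumptions; the one-dimensional data in the usage example mimic Section 4.2 below.

```python
import numpy as np

def estimate_payoff(x, policy, sigma, mu, alpha, f,
                    T=50.0, dt=1e-3, n_paths=2000, seed=0):
    """Monte Carlo estimate of V_pi(x) for a Markov policy pi (d = m = 1).

    Simulates dX = mu(X, pi(X)) dt + sigma(X, pi(X)) dB by Euler-Maruyama
    and accumulates the discounted running cost
    exp(-int_0^t alpha(X_s, pi(X_s)) ds) * f(X_t, pi(X_t)).
    The horizon T truncates the infinite-horizon integral; since
    alpha >= eps_0 > 0, the neglected tail is O(exp(-eps_0 * T)).
    """
    rng = np.random.default_rng(seed)
    X = np.full(n_paths, float(x))
    discount = np.ones(n_paths)   # exp(-int_0^t alpha ds) along each path
    payoff = np.zeros(n_paths)
    for _ in range(int(T / dt)):
        p = policy(X)
        payoff += discount * f(X, p) * dt
        discount *= np.exp(-alpha(X, p) * dt)
        dB = rng.standard_normal(n_paths) * np.sqrt(dt)
        X = X + mu(X, p) * dt + sigma(X, p) * dB
    return payoff.mean()

# Illustrative data in the spirit of Section 4.2: sigma = 1, mu = p,
# alpha = 1, f = x^2 + p^2, and the constant policy pi = 1.
v_hat = estimate_payoff(0.0,
                        policy=lambda x: np.ones_like(x),
                        sigma=lambda x, p: np.ones_like(x),
                        mu=lambda x, p: p,
                        alpha=lambda x, p: np.ones_like(x),
                        f=lambda x, p: x ** 2 + p ** 2)
```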
Proposition 1. Let Assumption 1 hold. For a locally Lipschitz Markov policy $\pi$ we have: $V_\pi \in C^2(\mathbb{R}^d)$ is the unique bounded solution of the Poisson equation $L_\pi V_\pi - \alpha_\pi V_\pi + f_\pi = 0$, and $HV_\pi$ is $(1/2)$-Hölder on compacts in $\mathbb{R}^d$.

Remark 3. The proof of Proposition 1, given in Section 5.2, depends crucially on the coupling in Lindvall and Rogers [18] of two $d$-dimensional diffusions via a coupling of the corresponding $d$-dimensional driving Brownian motions (see Lemma 7 below). Moreover, it is easy to see that the probability of coupling could in general be zero if the dimension $m$ of the Brownian noise is strictly less than $d$. In this case the controlled diffusions in $\mathbb{R}^d$, started at distinct points, could remain forever on disjoint submanifolds of positive co-dimension in $\mathbb{R}^d$ for any choice of control. In particular, Proposition 1 fails in the case $m < d$, as demonstrated by Example 1 below.

Example 1. Let $m = 1$, $d = 2$, $A = [-1,1]$, $f(x_1, x_2, a) := |x_1 + x_2| + |x_1 - x_2| + a^2/2$, $\alpha(x_1, x_2, a) \equiv 1$, $\sigma(x_1, x_2, a) \equiv (1,1)^T$ and $\mu(x_1, x_2, a) = a(1,-1)^T$. Then, for a constant policy $\pi_a \equiv a \in A$, the controlled process started at $x \in \mathbb{R}^2$ is given by $X_t^{\pi,x} = x + (1,1)^T B_t + a(1,-1)^T t$, for $t \ge 0$. In particular, for $a = 0$ and any $x = (x_1, x_2)^T$, we have $V_{\pi_0}(x) = |x_1 - x_2| + g(x_1 + x_2)$, where $g(y) := \mathbb{E}|y + 2B_{e_1}|$, $y \in \mathbb{R}$, and $e_1$ is an exponential random variable with mean 1, independent of $B$. Since $B_{e_1}$ has a smooth density, it follows that $g$ is also smooth; but the first summand $|x_1 - x_2|$ is not differentiable across the diagonal, so $V_{\pi_0} \notin C^2(\mathbb{R}^2)$ and $V_{\pi_0}$ cannot satisfy the conclusion of Proposition 1.

Remark 4. As we are using the standard weak formulation of the control problem (i.e. the filtered probability space is not specified in advance), all that matters for a Markov policy $\pi$ is the law of the controlled process $X^{\pi,\cdot}$, which solves the martingale problem in (4). Moreover, this law is uniquely determined by the symmetric-matrix valued function $(x,a) \mapsto \sigma(x,a)\sigma(x,a)^T \in \mathbb{R}^{d\times d}$ (and of course the drift $(x,a) \mapsto \mu(x,a)$). Since the symmetric square root of $\sigma(x,a)\sigma(x,a)^T$ in $\mathbb{R}^{d\times d}$ satisfies Assumption 1, in the remainder of the paper we assume, without loss of generality, that the noise and the controlled process are of the same dimension, i.e. $d = m$ (cf. Remark 2).

Remark 5. For any locally Lipschitz Markov policy $\pi$, the process $X^{\pi,\cdot}$ is strong Markov. Hence [8, Thm. 1.7] implies that $V_\pi$ is in the domain $D(A_\pi)$ of the generator $A_\pi$ of $X^{\pi,\cdot}$ and that the Poisson equation $A_\pi V_\pi - \alpha_\pi V_\pi + f_\pi = 0$ holds. Recall that for a bounded continuous $g : \mathbb{R}^d \to \mathbb{R}$ in $D(A_\pi)$ the limit $(A_\pi g)(x) := \lim_{t\to 0}(\mathbb{E} g(X_t^{\pi,x}) - g(x))/t$ exists for all $x \in \mathbb{R}^d$. Furthermore, if $g$ is also in $C^2(\mathbb{R}^d)$, it is known that $A_\pi g = L_\pi g$. However, [8, Thm. 1.7] does not imply that $V_\pi$ is in $C^2(\mathbb{R}^d)$. The PDE in Proposition 1, key for (gPIA) to work, is established via the coupling argument in Section 5.2.

If a policy $\pi : \mathbb{R}^d \to A$ is constant (i.e. $\pi \equiv p \in A$), write $\sigma_p$, $\mu_p$, $\alpha_p$, $f_p$, $L_p$ and $V_p$ instead of $\sigma_\pi$, $\mu_\pi$, $\alpha_\pi$, $f_\pi$, $L_\pi$ and $V_\pi$, respectively. Let $S : \mathbb{R}^d \times A \to (0,\infty)$ be a continuous function and, for any $p \in A$, denote $S_p(x) := S(x,p)$. Under Assumption 1, the function $p \mapsto S_p(x)\big(L_p h(x) - \alpha_p(x)h(x) + f_p(x)\big)$, $p \in A$, is continuous for any $x \in \mathbb{R}^d$ and $h \in C^2(\mathbb{R}^d)$. Since $A$ is compact, there exists $I_h(x) \in A$ which minimises this function.

Assumption 2. For any $h \in C^2(\mathbb{R}^d)$, the function $I_h : \mathbb{R}^d \to A$ can be chosen to be locally Lipschitz on $\mathbb{R}^d$. The continuous scaling function $S$ satisfies $\epsilon_S < S < M_S$ for some $\epsilon_S, M_S \in (0,\infty)$.
Generalised Policy Iteration Algorithm
Input: $\sigma$, $\mu$, $\alpha$, $f$, $S$ satisfying Assumptions 1–2, a constant policy $\pi_0$ and $N \in \mathbb{N}$.
for $n = 0$ to $N-1$ do
  Compute $V_{\pi_n}$ from the PDE in Proposition 1.
  Choose the policy $\pi_{n+1}$ as follows:
  (gPIA)  $\pi_{n+1}(x) \in \operatorname{argmin}_{p \in A}\big\{S_p(x)\big(L_p V_{\pi_n}(x) - \alpha_p(x)V_{\pi_n}(x) + f_p(x)\big)\big\}, \qquad x \in \mathbb{R}^d.$
end
Output: Approximation $V_{\pi_N}$ of the value function $V$.

Remark 6. By Assumption 2, the policy $\pi_{n+1}$ defined in (gPIA) is locally Lipschitz for all $0 \le n \le N-1$. Hence, by Proposition 1, the algorithm is well defined.

Remark 7. In the classical case of (gPIA) we take $S \equiv 1$. A non-trivial scaling function $S$, which plays an important role in the one-dimensional context (see Section 3 below), makes the algorithm into a generalised Policy Iteration Algorithm. Theorem 2 requires only the positivity and continuity of $S$. The uniform bounds on $S$ in Assumption 2 are used only in the proof of Theorem 5.
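The algorithm above is stated in continuous time and state; any numerical use of it requires a discretisation, which the paper deliberately keeps out of its theorems and proofs. Purely as an illustration, the following sketch implements one iteration of (gPIA) for the one-dimensional exit problem of Section 3 on a finite grid: the Poisson equation of Proposition 1 is solved by central finite differences, and the minimisation in (gPIA) is taken over a finite grid of actions. The grid sizes, the finite-difference scheme and the dense linear solve are our own assumptions, not part of the paper.

```python
import numpy as np

def gpia_step(pi, x, actions, sigma, mu, alpha, f, S, g_a, g_b):
    """One iteration of (gPIA) on a uniform 1D grid x (exit problem, Sec. 3).

    Policy evaluation: solve (1/2) sigma_pi^2 V'' + mu_pi V' - alpha_pi V
    + f_pi = 0 with V = g at the boundary, via central finite differences.
    Policy improvement: pointwise argmin over a finite action grid of
    S_p (L_p V - alpha_p V + f_p), as in (gPIA).
    """
    n, h = len(x), x[1] - x[0]
    sp, mp, ap, fp = sigma(x, pi), mu(x, pi), alpha(x, pi), f(x, pi)
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    A[0, 0] = A[-1, -1] = 1.0
    rhs[0], rhs[-1] = g_a, g_b          # boundary data V = g
    for i in range(1, n - 1):
        d2, d1 = 0.5 * sp[i] ** 2 / h ** 2, mp[i] / (2 * h)
        A[i, i - 1], A[i, i + 1] = d2 - d1, d2 + d1
        A[i, i] = -2.0 * d2 - ap[i]
        rhs[i] = -fp[i]
    V = np.linalg.solve(A, rhs)
    # Crude numerical derivatives of the payoff on the grid.
    V1 = np.gradient(V, h)
    V2 = np.gradient(V1, h)
    # Improvement: minimise the scaled Hamiltonian over the action grid.
    objective = np.stack([S(x, p) * (0.5 * sigma(x, p) ** 2 * V2
                                     + mu(x, p) * V1
                                     - alpha(x, p) * V + f(x, p))
                          for p in actions])   # one row per action
    return V, actions[np.argmin(objective, axis=0)]
```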
(gPIA) always leads to an improved payoff (Theorem 2 is proved in Section 5.2 below):

Theorem 2. Under Assumptions 1–2, the inequality $V_{\pi_{n+1}} \le V_{\pi_n}$ holds on $\mathbb{R}^d$ for all $n \in \{0, \ldots, N-1\}$.

The sequence $\{V_{\pi_N}\}_{N\in\mathbb{N}}$, obtained by running (gPIA) from a given policy $\pi_0$ for each $N \in \mathbb{N}$, is non-increasing and bounded. Hence we may define

(6)  $V_{\lim}(x) := \lim_{N\to\infty} V_{\pi_N}(x), \qquad x \in \mathbb{R}^d.$

However, the sequence of the corresponding Markov policies $\{\pi_N\}_{N\in\mathbb{N}}$ need not converge and, even if it did, the limit may be discontinuous and hence not necessarily a Markov policy.

Remark 8. If the algorithm stops before $N$, i.e. $\pi_{n+1} = \pi_n$ for some $n < N$, then clearly $V_{\pi_N} = V_{\pi_n}$ and $\pi_N = \pi_n$. As this holds for any $N > n$, with (gPIA) started at a given $\pi_0$, we may proceed directly to the verification lemma (Theorem 5 below) to conclude that $V_{\pi_n}$ is the value function with an optimal policy $\pi_n$.

Controlling the convergence of the policies requires the following additional assumptions. Introduce the set $S_{B,K} := \{h \in C^2(\mathbb{R}^d) : \|\nabla h(x)\| < B^1,\ \|Hh(x)\| < B^2 \text{ for } x \in D_K\}$, where $D_K := \{x \in \mathbb{R}^d : \|x\| \le K\}$ is the ball of radius $K > 0$ and $B := (B^1, B^2) \in (0,\infty)^2$ are constants.

Assumption 3. For any $K > 0$, there exist constants $B_K$ and $C_K$ satisfying the following: if $h \in S_{B_K,K}$, then $I_h$ (defined in Assumption 2) satisfies $d_A(I_h(x), I_h(y)) \le C_K\|x-y\|$ for all $x, y \in D_K$.

Assumption 4. For any $K > 0$, let $B_K$, $C_K$ be as in Assumption 3. Then for a locally Lipschitz Markov policy $\pi : \mathbb{R}^d \to A$ such that $d_A(\pi(x), \pi(y)) \le C_K\|x-y\|$, $x, y \in D_K$, the following holds: $V_\pi \in S_{B_K,K}$.

Remark 9. Non-trivial problem data that satisfy Assumptions 1–4 are described in Section 4 below. It is precisely these types of examples that motivated the form Assumptions 2–4 take. Assumption 1 is standard and Assumptions 2–3 concern only the deterministic data specifying the problem. Assumption 4 essentially states that $\|HV_\pi\|$ has a prescribed bound on the ball $D_K$ if the coefficients of the PDE in Proposition 1 have a prescribed Lipschitz constant. Schauder's boundary estimates for elliptic PDEs [10, p. 86] suggest that this requirement is both natural and feasible. In fact, Assumption 4 may follow from assumptions of the type 1–3 on the problem data. This is left for future research.

Proposition 3 and Theorems 4 and 5, proved in Section 5.2 below, show that (gPIA) converges.

Proposition 3. Let Assumptions 1–4 hold. Then there exists a subsequence of $\{\pi_N\}_{N\in\mathbb{N}}$ that converges uniformly on every compact subset of $\mathbb{R}^d$ to a locally Lipschitz Markov policy.

Let $\pi_{\lim} : \mathbb{R}^d \to A$ denote a locally Lipschitz Markov policy that is a locally uniform limit of a subsequence of $\{\pi_N\}_{N\in\mathbb{N}}$. By Proposition 1, $V_{\pi_{\lim}}$ is a well-defined function in $C^2(\mathbb{R}^d)$ that solves the corresponding Poisson equation. However, since $\pi_{\lim}$ clearly depends on its defining subsequence, so may $V_{\pi_{\lim}}$. Furthermore, $V_{\lim}$ may depend on the choice of $\pi_0$ in (gPIA). But this is not so, since $V_{\lim}$ equals both the value function $V$ and the payoff of the policy $\pi_{\lim}$.

Theorem 4. Under Assumptions 1–4, the equality $V_{\lim} = V_{\pi_{\lim}}$ holds on $\mathbb{R}^d$ for a policy $\pi_{\lim}$.

Theorem 5. Let Assumptions 1–4 hold. Then for every $x \in \mathbb{R}^d$ and $\Pi \in \mathcal{A}(x)$ the inequality $V_{\lim}(x) \le V_\Pi(x)$ holds. Hence $V_{\lim}$ equals the value function $V$, does not depend on the choice of $\pi_0$ in (gPIA), and $\pi_{\lim}$ is an optimal locally Lipschitz Markov policy for the control problem.

Remark 10. The key technical issue in the proof of Theorem 5 is that the policies in the convergent subsequence constructed in the proof of Proposition 3 are not improvements of their predecessors (cf. (gPIA)). The idea of the proof is to work with a convergent subsequence of the pairs of policies $\{(\pi_N, \pi_{N+1})\}_{N\in\mathbb{N}}$, where $\{\pi_N\}_{N\in\mathbb{N}}$ is produced by (gPIA) (see Section 5.2 for details).

3. The one-dimensional case

There are two reasons for considering the one-dimensional control problem in its own right.

(A) The canonical choice of the scaling function $S := 1/\sigma^2$ simplifies (gPIA) to

(7)  $\pi_{n+1}(x) \in \operatorname{argmin}_{p\in A}\big\{\big(\mu(x,p)V_{\pi_n}'(x) - \alpha(x,p)V_{\pi_n}(x) + f(x,p)\big)/\sigma^2(x,p)\big\},$

by removing the second derivative of the payoff function $V_{\pi_n}$ from the minimisation procedure (a code sketch of this scaled improvement step is given after Remark 12 below). This reduction appears to make the numerical implementation of (gPIA) converge extremely fast: in the example in Section 4.2 below the optimal payoff and policy are obtained in fewer than half a dozen iterations.

(B) It is natural to control the process $X^{\Pi,x}$ only up to its first exit from an interval $(a,b)$, where $a, b \in [-\infty,\infty]$, and generalise the payoff as follows:

$V_\Pi(x) := \mathbb{E}\left[\int_0^{\tau_a^b(X^{\Pi,x})} e^{-\int_0^t \alpha_{\Pi_s}(X_s^{\Pi,x})\,\mathrm{d}s}\, f_{\Pi_t}(X_t^{\Pi,x})\,\mathrm{d}t + e^{-\int_0^{\tau_a^b(X^{\Pi,x})} \alpha_{\Pi_t}(X_t^{\Pi,x})\,\mathrm{d}t}\, g\big(X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}\big)\right].$

Here $\mu, \sigma, \alpha, f : (a,b)\times A \to \mathbb{R}$ are measurable with the same notational convention as in Section 2 ($f_p(x) = f(x,p)$ etc.). Furthermore, $\Pi \in \mathcal{A}(x)$ if $X^{\Pi,x}$ follows SDE (1) on the stochastic interval $[0, \tau_a^b(X^{\Pi,x}))$, where $\tau_a^b(X^{\Pi,x}) := \inf\{t \ge 0 : X_t^{\Pi,x} \in \{a,b\}\}$ ($\inf\emptyset = \infty$), and $X^{\Pi,x}_t = X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}$ for $\tau_a^b(X^{\Pi,x}) \le t$ (i.e. $\Pi_t$, $t \in [\tau_a^b(X^{\Pi,x}), \infty)$, are irrelevant for $V_\Pi(x)$). Pick an arbitrary function $g : \{a,b\}\cap\mathbb{R} \to \mathbb{R}$ and set the control problem as in Section 2 with $\mathbb{R}^d$ substituted by $(a,b)$.

Remark 11. In Assumptions 1–4 we substitute $\mathbb{R}^d$ with $(a,b)$. In particular, inequality (3) in Assumption 1 takes the form $\sigma^2(x,p) \ge \lambda$ for all $x \in (a,b)$, $p \in A$. Assumption 1 hence implies the requirement on the scaling function $S = 1/\sigma^2$ in Assumption 2. In Assumptions 3–4, the family of closed balls $(D_K)_{K>0}$ in $\mathbb{R}^d$ is substituted with a family of compact intervals $(D_K)_{K>0}$ in $(a,b)$, such that $\cup_{K>0} D_K = (a,b)$ and, if $K' < K$, $D_{K'}$ is contained in the interior of $D_K$.

Remark 12. On the event $\{\tau_a^b(X^{\Pi,x}) = \infty\}$ we take $g\big(X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}\big)\big/\exp\big(\int_0^{\tau_a^b(X^{\Pi,x})}\alpha_{\Pi_t}(X_t^{\Pi,x})\,\mathrm{d}t\big) = 0$, since by Assumption 1 the integral is infinite. Note also that on this event $X^{\Pi,x}_{\tau_a^b(X^{\Pi,x})}$ may not be defined. Moreover, if $\{a,b\}\cap\mathbb{R} = \emptyset$, then by Assumption 1 we have $\tau_a^b(X^{\Pi,x}) = \infty$ a.s.
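Under the canonical scaling $S = 1/\sigma^2$ of (7), the term $\tfrac{1}{2}V_{\pi_n}''$ in $S_p(L_pV_{\pi_n} - \alpha_pV_{\pi_n} + f_p)$ is independent of $p$ and therefore drops out of the argmin. A minimal sketch of the resulting improvement step, under the same illustrative grid conventions as the hypothetical `gpia_step` sketch above:

```python
import numpy as np

def improve_scaled(x, V, V1, actions, sigma, mu, alpha, f):
    """Policy improvement step (7) with scaling S = 1/sigma^2.

    Dividing the Hamiltonian by sigma^2 makes the V'' contribution
    independent of the action p, so only V and V' = V1 are needed.
    """
    objective = np.stack([(mu(x, p) * V1 - alpha(x, p) * V + f(x, p))
                          / sigma(x, p) ** 2 for p in actions])
    return actions[np.argmin(objective, axis=0)]
```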
A measurable function $\pi : (a,b) \to A$ is a Markov policy if for every $x \in (a,b)$ there exists a process $X^{\pi,x} = (X_t^{\pi,x})_{t\ge 0}$ satisfying SDE (4) on the stochastic interval $[0, \tau_a^b(X^{\pi,x}))$ and $X_t^{\pi,x} = X^{\pi,x}_{\tau_a^b(X^{\pi,x})}$ if $\tau_a^b(X^{\pi,x}) \le t < \infty$. If $\pi$ is a Markov policy, then $\pi(X^{\pi,x}) := (\pi(X_t^{\pi,x}))_{t\ge 0} \in \mathcal{A}(x)$ for every $x \in (a,b)$, where we pick arbitrary elements in $A$ for the values $\pi(a)$ and $\pi(b)$ if $a > -\infty$ and $b < \infty$, respectively. We use analogous notation to that in (5), e.g. $L_\pi h := \frac{1}{2}\sigma_\pi^2 h'' + \mu_\pi h'$ for any $h \in C^2((a,b))$. Any locally Lipschitz function $\pi : (a,b) \to A$ is a Markov policy, since the Engelbert–Schmidt conditions for the existence and uniqueness of the solution of the corresponding SDE (see e.g. [14, Sec. 5.5]) are satisfied by Assumption 1.

By substituting the state space $\mathbb{R}^d$ with the interval $(a,b)$ in Proposition 1, Theorem 2, Proposition 3 and Theorems 4 and 5 of Section 2, we obtain the results of the present section, which thus solve the control problem in the one-dimensional case. In the interest of brevity, we omit their statements. We stress that the main difference lies in the fact that the proofs, in Sections 5.3 and 5.4 below, rely on the theory of ODEs and scalar SDEs. In particular, we need to prove that the payoff has a continuous extension to a finite boundary point of the state space.

4. Examples

4.1. Data satisfying Assumptions 1–4. We now describe a class of models that provably satisfies Assumptions 1–4. The main aim in the present section is not to be exhaustive, but merely to demonstrate that the form (particularly) of Assumptions 3–4 is natural in the context of the control problems considered here. The example we give is in dimension one, but it is clear from the construction below that it can easily be generalised. Let $A := [-a,a]$, for some constant $a > 0$, and let $\sigma, \mu, f, \alpha : \mathbb{R}\times[-a,a]\to\mathbb{R}$ be given by

(8)  $\sigma(x,p) := \sigma_1(x)$, $\mu(x,p) := \mu_1(x) + p\mu_2$, $f(x,p) := f_1(x) + f_2(p)$, $\alpha(x,p) \equiv \alpha_0$,

where $\sigma_1, \mu_1, f_1 \in C^1(\mathbb{R})$, $f_2 \in C^2((-a,a))$ is convex and symmetric (i.e. $f_2(p) = f_2(-p)$ for all $p \in A$) and $\mu_2$ and $\alpha_0$ are constants. For any $h \in \{\sigma_1, \mu_1, f_1, f_2\}$ (resp. $h' \in \{\sigma_1', \mu_1', f_1', f_2'\}$) let the positive constant $C_h$ (resp. $C_{h'}$) satisfy $|h| \le C_h$ (resp. $|h'| \le C_{h'}$). In particular, we assume that the derivatives of $\sigma_1, \mu_1, f_1, f_2$ are bounded. Moreover, we may (and do) take $C_{f_2'} := f_2'(a)$. Assume also that $\alpha_0 > 0$ and $\sigma_1^2 > \lambda > 0$ (so that Assumption 1 is satisfied) and that the scaling function is $S \equiv 1$. Then the following proposition holds.

Proposition 6. Assume $\alpha_0 > C_{\mu_1'} + |\mu_2|(2 + C_{f_1'}/C_{f_2'})$ and $|\mu_2|B_\infty^2 < L_{f_2} := \inf_{p\in A} f_2''(p)$, where $B_\infty^1 := (C_{f_1'} + C_{f_2'})/(\alpha_0 - C_{\mu_1'} - |\mu_2|)$ and $B_\infty^2 := \big(2(C_{f_1} + C_{f_2}) + (C_{\mu_1} + a|\mu_2|)B_\infty^1\big)/\lambda$. Then Assumptions 1–4 hold. Moreover, in Assumptions 3–4 we have $B_K = (B_\infty^1, B_\infty^2)$ and $C_K = 1$ for any $K > 0$.

Remark 13. It is clear that the assumptions in Proposition 6 define a non-empty subclass of the models (8). Moreover, these assumptions are much stronger than what is required by our general Assumptions 3–4, since the proposition yields global (rather than local) bounds on the derivatives of the payoff functions and the Lipschitz coefficients of the policies arising in (gPIA).

Proof. Pick $h \in C^2(\mathbb{R})$ such that $|h'(x)| < B_\infty^1$ and $|h''(x)| < B_\infty^2$ for all $x \in \mathbb{R}$. Then the function $I_h$ in Assumption 3 satisfies $I_h(x) = \operatorname{argmin}_{p\in A}\{p\mu_2 h'(x) + f_2(p)\}$. By assumption we have $|\mu_2 h'(x)| \le |\mu_2|B_\infty^1 < C_{f_2'} = f_2'(a)$, implying $I_h(x) = (f_2')^{-1}(-\mu_2 h'(x))$ for all $x \in \mathbb{R}$.
Differentiating $I_h$ we obtain $|I_h'(x)| \le |\mu_2|B_\infty^2/L_{f_2} < 1$, $x \in \mathbb{R}$, and Assumptions 2–3 follow.

We now establish Assumption 4. The idea is to start with any policy $\pi : \mathbb{R} \to A$ in $C^2(\mathbb{R})$ whose derivative satisfies $|\pi'| \le 1$ on all of $\mathbb{R}$ (e.g. a constant policy), and to apply the stochastic flow of diffeomorphisms [21, Sec. V.10] to deduce the necessary regularity of the payoff function $V_\pi$. In the notation from (5), we have $|\mu_\pi'| \le C_{\mu_1'} + |\mu_2|$ and $|\sigma_\pi'| = |\sigma_1'| \le C_{\sigma_1'}$. Hence, for each $x \in \mathbb{R}$, the stochastic exponential $Y = (Y_t)_{t\in\mathbb{R}_+}$, given by

$Y_t = 1 + \int_0^t \mu_\pi'(X_s^{\pi,x})Y_s\,\mathrm{d}s + \int_0^t \sigma_\pi'(X_s^{\pi,x})Y_s\,\mathrm{d}W_s,$

exists (we suppress the dependence on $x$ (and $\pi$) from the notation). Since the coefficients of SDE (4) are in $C^1(\mathbb{R})$ with bounded and locally Lipschitz first derivative, [21, Sec. V.10, Thm 49] implies that the flow of controlled processes $\{X^{\pi,x}\}_{x\in\mathbb{R}}$ may be constructed on a single probability space so that it is smooth in the initial condition $x$ with $\frac{\partial}{\partial x}X^{\pi,x} = Y$. The upshot here is that, by the argument in the proof of [9, Prop. 3.2], we obtain a stochastic representation for the derivative: $\frac{\partial}{\partial x}\mathbb{E} f_\pi(X_t^{\pi,x}) = \mathbb{E}[Y_t f_\pi'(X_t^{\pi,x})]$ for every $t \in \mathbb{R}_+$. Since $Y_t = M_t \exp\big(\int_0^t \mu_\pi'(X_s^{\pi,x})\,\mathrm{d}s\big)$, where the stochastic exponential $M = \mathcal{E}\big(\int_0^\cdot \sigma_\pi'(X_s^{\pi,x})\,\mathrm{d}W_s\big)$ is a true martingale by Novikov's condition, the following inequality holds: $\mathbb{E} Y_t \le \exp(t(C_{\mu_1'} + |\mu_2|))$. Since $\alpha_0 > C_{\mu_1'} + |\mu_2|$ by assumption and the inequality $|f_\pi'| \le C_{f_1'} + C_{f_2'}$ holds, we have $\big|\mathbb{E}\int_0^\infty e^{-\alpha_0 s}Y_s f_\pi'(X_s^{\pi,x})\,\mathrm{d}s\big| < B_\infty^1$ for all $x \in \mathbb{R}$.

Recall that $V_\pi(x) = \mathbb{E}\int_0^\infty e^{-\alpha_0 s}f_\pi(X_s^{\pi,x})\,\mathrm{d}s$. By [21, Sec. V.8, Thm 43], the family of random variables indexed by $\delta \in (0,1)$,

$\frac{1}{\delta}\int_0^\infty e^{-\alpha_0 s}\big|f_\pi(X_s^{\pi,x+\delta}) - f_\pi(X_s^{\pi,x})\big|\,\mathrm{d}s \le (C_{f_1'} + C_{f_2'})\,\frac{1}{\delta}\int_0^\infty e^{-\alpha_0 s}\big|X_s^{\pi,x+\delta} - X_s^{\pi,x}\big|\,\mathrm{d}s,$

is uniformly integrable. Hence $\lim_{\delta\to 0}(V_\pi(x+\delta) - V_\pi(x))/\delta$ takes the form

$V_\pi'(x) = \mathbb{E}\int_0^\infty e^{-\alpha_0 s}Y_s f_\pi'(X_s^{\pi,x})\,\mathrm{d}s,$ implying $|V_\pi'(x)| < B_\infty^1$ for all $x \in \mathbb{R}$.

This inequality, the fact $\sigma_1^2 > \lambda$ and Proposition 1 imply $|V_\pi''| < B_\infty^2$, concluding the proof. □

Remark 14. The process $Y$ in the proof of Proposition 6 exists in the multidimensional setting, see [21, Sec. V.10, Thm 49]. Hence the same argument works in higher dimensions if we can deduce a bound on the Hessian of the payoff function from the PDE in Proposition 1.

4.2. Numerical examples. Consider the one-dimensional control problem: $A = [-1,1]$, $a = -10$, $b = 10$, $g(a) := a^2$, $g(b) := b^2$, $\sigma(x,p) := 1$, $\mu(x,p) := p$, $\alpha(x,p) := 1$ and $f(x,p) := x^2 + p^2$, which is in the class discussed in Section 4.1. Explicitly, we seek to compute $\inf_{\Pi\in\mathcal{A}(x)} V_\Pi(x)$ for every $x \in (-10,10)$, where the payoff $V_\Pi(x)$ of a policy $\Pi$ is defined in Section 3. We implemented (gPIA), with the main step given by (7), in Matlab. The payoff function at each step is obtained as the solution to the differential equation from Proposition 1 with the boundary conditions given by the function $g$. The new policy at each step can be calculated explicitly (cf. the proof of Proposition 6 above). Figures 1 and 2 graph the payoff functions and the policies (colour coded). The initial policy $\pi_0 \equiv 1$ and its payoff correspond to the blue graphs.

[Figure 1. The graphs of $V_{\pi_n}$ for $n \in \{0,1,2,3,4\}$.]
[Figure 2. The graphs of $\pi_n$ for $n \in \{0,1,2,3,4\}$.]

The graphs suggest that convergence effectively occurs in just a few steps. Figures 3 and 4, containing the graphs of the differences of the consecutive payoffs and policies on the logarithmic scale, confirm this.

[Figure 3. The graphs of $\log(|V_{\pi_{n+1}} - V_{\pi_n}|)$ for $n \in \{0,1,2,3,4\}$.]
[Figure 4. The graphs of $\log(|\pi_{n+1} - \pi_n|)$ for $n \in \{0,1,2,3,4\}$.]

In Figures 1 and 2 it may seem that fewer graphs are presented than is stated in the captions; the reason is that the final few graphs coincide. Moreover, the policies only differ on the subinterval $(-2,2)$, because outside of it they coincide: there it is optimal to choose one of the boundary points of $A = [-1,1]$. Finally, there is no numerical indication that the sequence of policies has more than one accumulation point, as they appear to converge very fast indeed.
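The paper's Matlab implementation is not reproduced here, but the experiment can be re-created (up to the discretisation choices we assumed in the hypothetical `gpia_step` sketch of Section 2) by iterating that sketch on the data of this example:

```python
import numpy as np

# Data of the example: A = [-1, 1], (a, b) = (-10, 10), g(x) = x^2,
# sigma = 1, mu(x, p) = p, alpha = 1, f(x, p) = x^2 + p^2, S = 1.
x = np.linspace(-10.0, 10.0, 401)
actions = np.linspace(-1.0, 1.0, 201)
pi = np.ones_like(x)   # initial policy pi_0 = 1 (the blue graphs)
for n in range(5):     # five iterations, as in Figures 1 and 2
    V, pi = gpia_step(pi, x, actions,
                      sigma=lambda x, p: np.ones_like(x),
                      mu=lambda x, p: p * np.ones_like(x),
                      alpha=lambda x, p: np.ones_like(x),
                      f=lambda x, p: x ** 2 + p ** 2,
                      S=lambda x, p: np.ones_like(x),
                      g_a=100.0, g_b=100.0)
```

In this instance the argmin can also be computed in closed form: since $p \mapsto pV_{\pi_n}'(x) + p^2$ is quadratic, the minimiser is $-V_{\pi_n}'(x)/2$, clipped to $A = [-1,1]$ (cf. the proof of Proposition 6 above).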
5. Proofs

5.1. Auxiliary results - the multidimensional case.

5.1.1. The reflection coupling of Lindvall and Rogers [18] and the continuity of the payoff $V_\pi$. We now establish the continuity of the payoff function of a locally Lipschitz Markov policy $\pi$ under Assumption 1. The reflection coupling of Lindvall and Rogers [18] plays a crucial role in this. In fact, the continuity of $V_\pi$ hinges on the following property of the coupling in [18]: copies of $X^{\pi,x}$ and $X^{\pi,x'}$, started very close to each other, will meet with high probability before moving apart by a certain distance greater than $\|x - x'\|$ (see Lemma 7 below).

We first show that the coupling from [18] can be applied to the diffusion $X^{\pi,\cdot}$. As explained in Remark 4, we may (and do) assume that the dimensions of the noise and the controlled process are equal, i.e. $d = m$. By Assumption 1 above, $\sigma_\pi$ and $\mu_\pi$ are bounded and hence [18, As. (12)(ii)] holds. Inequality (3) in Assumption 1 implies that $\lambda_{\max}(\sigma_\pi^{-1}\sigma_\pi^{-T}) \le 1/\lambda$. Hence, by Remark 1, we have $\|\sigma_\pi^{-1}\| \le 1/\sqrt{\lambda}$ and [18, As. (12)(ii)] also holds. The assumption in [18, (12)(i)] requires that $\sigma_\pi$ and $\mu_\pi$ are globally Lipschitz. But this assumption is only used in [18] as a guarantee that the corresponding SDE has a unique strong solution, which is the case in our setting under the locally Lipschitz condition in Assumption 1. Hence, for any $x, x' \in \mathbb{R}^d$, the coupling from [18] can be applied to construct the process $(X^{\pi,x}, X^{\pi,x'})$ so that $X^{\pi,x}$ follows SDE (4) and $X^{\pi,x'}$ satisfies

$X_t^{\pi,x'} = x' + \int_0^t \mu_\pi\big(X_s^{\pi,x'}\big)\,\mathrm{d}s + \int_0^t \sigma_\pi\big(X_s^{\pi,x'}\big)H_s\,\mathrm{d}B_s, \qquad \text{for } t \in [0, \rho_0(Y)),$

where $\rho_0(Y) := \inf\{t \ge 0 : \|Y_t\| = 0\}$ ($\inf\emptyset := \infty$) is the coupling time, $Y := X^{\pi,x} - X^{\pi,x'}$, and

(9)  $H_t := I - 2u_tu_t^T$, defined via $u_t := \dfrac{\sigma_\pi^{-1}\big(X_t^{\pi,x'}\big)Y_t}{\big\|\sigma_\pi^{-1}\big(X_t^{\pi,x'}\big)Y_t\big\|}$ for $t \in [0, \rho_0(Y)),$

is the reflection in $\mathbb{R}^d$ about the hyperplane orthogonal to the unit vector $u_t$. Moreover, we have $X_t^{\pi,x'} = X_t^{\pi,x}$ for all $t \in [\rho_0(Y), \infty)$. Note also that $H_t \in O(d)$ is an orthogonal matrix for $t \in [0, \rho_0(Y))$ and the process $B' = (B_t')_{t\in\mathbb{R}_+}$, given by $B_t' := \int_0^t \big(I_{\{s<\rho_0(Y)\}}H_s + I_{\{s\ge\rho_0(Y)\}}I\big)\,\mathrm{d}B_s$, is a Brownian motion by the Lévy characterisation theorem. Hence $X^{\pi,x'}$ satisfies the SDE $\mathrm{d}X_t^{\pi,x'} = \sigma_\pi(X_t^{\pi,x'})\,\mathrm{d}B_t' + \mu_\pi(X_t^{\pi,x'})\,\mathrm{d}t$ with $X_0^{\pi,x'} = x'$ (see [18, Sec. 3] for more details).
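As an aside (not from the paper), the construction above is easy to simulate: in dimension $d = 1$ the reflection $H_t$ is simply $-1$, so the mirror-coupled copy is driven by $-\mathrm{d}B_t$ until the coupling time $\rho_0(Y)$. The following Euler–Maruyama sketch returns an empirical coupling time; the step size, horizon and the sign-change test for detecting $\rho_0(Y)$ are our own illustrative assumptions.

```python
import numpy as np

def mirror_coupling_1d(x, x_prime, sigma, mu, T=10.0, dt=1e-4, seed=0):
    """Euler-Maruyama simulation of the reflection coupling in d = 1.

    In dimension one H_t = -1, so the second process is driven by -dB
    until the coupling time rho_0(Y), after which the paths are glued.
    Returns an empirical coupling time (np.inf if the paths have not
    met by the horizon T).
    """
    rng = np.random.default_rng(seed)
    X, Xp, t = float(x), float(x_prime), 0.0
    while t < T:
        dB = rng.standard_normal() * np.sqrt(dt)
        X_new = X + mu(X) * dt + sigma(X) * dB
        Xp_new = Xp + mu(Xp) * dt - sigma(Xp) * dB   # mirrored noise
        # Detect a sign change of Y = X - X' over the step (coupling).
        if (X - Xp) * (X_new - Xp_new) <= 0.0:
            return t + dt
        X, Xp, t = X_new, Xp_new, t + dt
    return np.inf
```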
Lemma 7. Fix a locally Lipschitz Markov policy $\pi : \mathbb{R}^d \to A$ and $x \in \mathbb{R}^d$. Then for every $\epsilon \in (0,1)$ there exists $\bar\varphi \in (0,1]$ with the property: $\forall \varphi \in (0,\bar\varphi)$ $\exists \varphi' \in (0,\varphi)$ such that $\mathbb{P}(\rho_\varphi(Y) < \rho_0(Y)) < \epsilon$ if $\|x - x'\| < \varphi'$, where $\rho_c(Y) := \inf\{t \ge 0 : \|Y_t\| = c\}$ ($\inf\emptyset = \infty$) for any $c > 0$.

Remark 15. Note that the main assumption in [18, Thm. 1] is not satisfied in Lemma 7, as we have no assumption on the global variability of $\sigma_\pi$. Hence the coupling $(X^{\pi,x}, X^{\pi,x'})$ is not necessarily successful even if the starting points $x$ and $x'$ are very close to each other, i.e. possibly $\mathbb{P}(\rho_0(Y) < \infty) < 1$ even if $\|Y_0\| = \|x - x'\|$ is very close to zero. However, by Lemma 7, the coupling will occur with probability at least $1-\epsilon$ before the diffusions are more than $\bar\varphi = \bar\varphi(\epsilon)$ away from each other, implying the continuity of $V_\pi$ (cf. Lemma 8 and Remark 16 below).

Proof. Let $\bar S := \|Y\|^2$, $\delta := \sigma_\pi(X^{\pi,x}) - \sigma_\pi(X^{\pi,x'})$ and $\beta := \mu_\pi(X^{\pi,x}) - \mu_\pi(X^{\pi,x'})$. Define

(10)  $\alpha_t := \sigma_\pi(X_t^{\pi,x}) - \sigma_\pi(X_t^{\pi,x'})H_t$ and $v_t := Y_t/\|Y_t\|$ for $t \in [0,\rho_0(Y))$.

In this proof $x \in \mathbb{R}^d$ is fixed and $x' \in \mathbb{R}^d$ is arbitrary in the ball of radius one centred at $x$. Recall that $\nabla h(z) = 2z$ and $Hh(z) = 2I$ for $h(z) := \|z\|^2$, $z \in \mathbb{R}^d$, and apply Itô's lemma to $\bar S$:

(11)  $\bar S_t = \|x-x'\|^2 + 2\int_0^t \sqrt{\bar S_s}\,v_s^T\alpha_s\,\mathrm{d}B_s + \int_0^t \big(2\sqrt{\bar S_s}\,v_s^T\beta_s + \mathrm{Tr}(\alpha_s\alpha_s^T)\big)\,\mathrm{d}s, \qquad t \in [0,\rho_0(Y)).$

Our task is to study the behaviour of $\bar S$ when started very close to zero. To do this, we first establish the facts in (12) and (14) below, which in turn allow us to apply time-change and coupling techniques to prove the lemma. We start by proving the following:

(12)  $0 \le \mathrm{Tr}(\alpha_t\alpha_t^T) - \|v_t^T\alpha_t\|^2 = \mathrm{Tr}(\delta_t\delta_t^T) - \|v_t^T\delta_t\|^2 \le M_x^2\|Y_t\|^2$ for $t \in [0, \rho_0(Y)\wedge\rho_1(Y))$,

where $M_x > 1$ is a Lipschitz constant for $\sigma_\pi$ and $\mu_\pi$ in the ball around $x$ of radius one. The first inequality in (12) follows since the trace is the sum of the eigenvalues of $\alpha_t\alpha_t^T$, which are all non-negative, while $\|v_t^T\alpha_t\|^2$ is at most the largest eigenvalue. The second inequality follows since $\sigma_\pi$ is Lipschitz on any ball around $x$ and $\|Y_t\| < 1$ for $t < \rho_1(Y)$. To establish the equality in (12) note that, as $\|v_t^TA\|^2 = \mathrm{Tr}(AA^Tv_tv_t^T) = \mathrm{Tr}(v_tv_t^TAA^T)$ for any $A \in \mathbb{R}^{d\times d}$, we have $\mathrm{Tr}(\alpha_t\alpha_t^T) - \|v_t^T\alpha_t\|^2 - \big(\mathrm{Tr}(\delta_t\delta_t^T) - \|v_t^T\delta_t\|^2\big) = \mathrm{Tr}\big((I - v_tv_t^T)(\alpha_t\alpha_t^T - \delta_t\delta_t^T)\big)$. Recall that $H_t^{-1} = H_t^T = H_t$. We therefore find

(13)  $\alpha_t\alpha_t^T - \delta_t\delta_t^T = \sigma_\pi(X_t^{\pi,x})(I - H_t)\sigma_\pi(X_t^{\pi,x'})^T + \sigma_\pi(X_t^{\pi,x'})(I - H_t)\sigma_\pi(X_t^{\pi,x})^T = 2\big(v_tu_t^T\sigma_\pi(X_t^{\pi,x})^T + \sigma_\pi(X_t^{\pi,x})u_tv_t^T\big)\,\|Y_t\|\big/\big\|\sigma_\pi^{-1}(X_t^{\pi,x'})Y_t\big\|,$

where the second equality follows by definition (9) and the identity $v_t = Y_t/\|Y_t\|$. Hence (12) follows. Since $\mathrm{Tr}\big(v_tu_t^T\sigma_\pi(X_t^{\pi,x})^T\big) = \big\langle\sigma_\pi(X_t^{\pi,x})\sigma_\pi^{-1}(X_t^{\pi,x'})v_t, v_t\big\rangle\,\|Y_t\|\big/\big\|\sigma_\pi^{-1}(X_t^{\pi,x'})Y_t\big\|$ holds for times $t \in [0,\rho_0(Y))$, equalities (12) and (13) yield:

$\|v_t^T\alpha_t\|^2 \ge \|v_t^T\alpha_t\|^2 - \|v_t^T\delta_t\|^2 = 4\big\langle\sigma_\pi(X_t^{\pi,x})\sigma_\pi^{-1}(X_t^{\pi,x'})v_t, v_t\big\rangle\,\|Y_t\|^2\big/\big\|\sigma_\pi^{-1}(X_t^{\pi,x'})Y_t\big\|^2.$

Inequality (3) in Assumption 1 implies $\|\sigma_\pi^{-1}\| \le 1/\sqrt{\lambda}$ (cf. the second paragraph of this section). Hence $\|Y_t\|/\|\sigma_\pi^{-1}(X_t^{\pi,x'})Y_t\| \ge \sqrt{\lambda}$. By the definition of $\delta_t$ above we get

$\|v_t^T\alpha_t\|^2 \ge 4\lambda\big(1 + \langle\delta_t\sigma_\pi^{-1}(X_t^{\pi,x'})v_t, v_t\rangle\big) \ge 4\lambda\big(1 - \|\delta_t\|\,\|\sigma_\pi^{-1}(X_t^{\pi,x'})\|\big) \ge 4\lambda\big(1 - \|\delta_t\|/\sqrt{\lambda}\big).$

For any $\epsilon \in (0,1)$, define $\bar\varphi := \min\{1,\ \epsilon\sqrt{\lambda}/M_x,\ \epsilon(1-\epsilon)\lambda/M_x^2\}$, where $M_x$ is as in (12) above. Then, if $\|x-x'\| < \bar\varphi$ and $t \in [0, \rho_{\bar\varphi}(Y))$, we have $\|Y_t\| < \bar\varphi$ and hence $\|\delta_t\| \le M_x\|Y_t\| \le \epsilon\sqrt{\lambda}$. In particular, we get

(14)  $\|v_t^T\alpha_t\|^2 \ge 4\lambda(1-\epsilon) > 0$ for any $t \in [0,T)$, where $T := \rho_0(Y)\wedge\rho_{\bar\varphi}(Y)$.

Let $M > 0$ denote a global upper bound on $\sigma_\pi$ and $\mu_\pi$, which exists by Assumption 1.
Since the inequalities $\|v_t^T\alpha_t\| \le \|\alpha_t\| \le \|\sigma_\pi(X_t^{\pi,x})\| + \|\sigma_\pi(X_t^{\pi,x'})\| \le 2M$ hold for all $t \in [0,\rho_0(Y))$, the increasing process $[N] = ([N]_t)_{t\in\mathbb{R}_+}$, given by $[N]_t := \int_0^t I_{\{s<\rho_0(Y)\}}\|v_s^T\alpha_s\|^2\,\mathrm{d}s$, is well-defined and $[N]_t < \infty$ for every $t \in \mathbb{R}_+$. Hence $N = (N_t)_{t\in\mathbb{R}_+}$, given by $N_t := \int_0^t I_{\{s<\rho_0(Y)\}}v_s^T\alpha_s\,\mathrm{d}B_s$, is a well-defined local martingale with quadratic variation process $[N]$. Let $\tau = (\tau_s)_{s\in\mathbb{R}_+}$ and $W = (W_s)_{s\in\mathbb{R}_+}$ be the Dambis–Dubins–Schwartz (DDS) time-change and Brownian motion, respectively, for the local martingale $N$ (see [14, Thm 3.4.6, p. 174]). More precisely, let $s \mapsto \tau_s := \inf\{t\in\mathbb{R}_+ : [N]_t > s\}$ (with $\inf\emptyset = \infty$) be the inverse of $t \mapsto [N]_t$. Then $W$ satisfies $W_{[N]_t} = N_t$ for all $t \in \mathbb{R}_+$. Moreover, it holds that $\tau_s < \infty$ for $s < [N]_\infty := \lim_{t\uparrow\infty}[N]_t$. If $[N]_\infty < \infty$ with positive probability, we have to extend the probability space to support $W$ (see e.g. [14, Prob. 3.4.7, p. 175]). This extension, however, has no bearing on the coupling $(X^{\pi,x}, X^{\pi,x'})$.

Let $\hat\alpha_s := \alpha_{\tau_s}$, $\hat\delta_s := \delta_{\tau_s}$, $\hat\beta_s := \beta_{\tau_s}$, $\hat v_s := v_{\tau_s}$ and $\hat S_s := \bar S_{\tau_s}$ for $s \in [0, [N]_{\rho_0(Y)})$, cf. (10) above. Assume $\|x-x'\| < \bar\varphi$ and time-change the integrals in (11) (see [14, Prop. 3.4.8, p. 176]) to get

$\hat S_u = \|x-x'\|^2 + 2\int_0^u \sqrt{\hat S_s}\,\mathrm{d}W_s + \int_0^u (1+\nu_s)\,\mathrm{d}s, \qquad \text{for any } u \in [0, [N]_T),$

where $\nu_s := \big(2\sqrt{\hat S_s}\,\hat v_s^T\hat\beta_s + \mathrm{Tr}(\hat\delta_s\hat\delta_s^T) - \|\hat v_s^T\hat\delta_s\|^2\big)/\|\hat v_s^T\hat\alpha_s\|^2$ and $T$ is defined in (14). By (14) it holds that $\|\hat v_s^T\hat\alpha_s\|^2 \ge 4\lambda(1-\epsilon)$ for all $s \in [0,[N]_T)$. Then (12) and the definitions of $\hat\beta$, $\hat\delta$ and $\nu_s$ imply the inequalities $0 \le \nu_s < M_x^2\|Y_{\tau_s}\|^2/(\lambda(1-\epsilon))$ for all $s \in [0,[N]_T)$. Any $\varphi \in (0,\bar\varphi)$ satisfies $\varphi < \epsilon(1-\epsilon)\lambda/M_x^2$ and $R := \rho_0(Y)\wedge\rho_\varphi(Y) \le T$. Hence the Lipschitz property of $\sigma_\pi$ and $\mu_\pi$ on the ball of radius $\varphi$ around $x$ implies $\nu_s < \epsilon$ for all $s \in [0,[N]_R)$.

The SDE $S_s = \|x-x'\|^2 + \int_0^s 2\sqrt{S_r}\,\mathrm{d}W_r + (1+\epsilon)s$, $s \in \mathbb{R}_+$, for the squared Bessel process of dimension $1+\epsilon$, has a pathwise unique (and hence strong) solution $S = (S_s)_{s\in\mathbb{R}_+}$, see [6, App. A.3, p. 108]. Note that $S$ is driven by the DDS Brownian motion $W$ introduced above. Hence the coupling $(S,\hat S)$ on the stochastic interval $[0,[N]_R)$ allows us to compare the two processes pathwise. Assume $\|x-x'\| < \varphi$. Then the following equality holds:

(15)  $\sqrt{S_s} - \sqrt{\hat S_s} = \frac{1}{2}\int_0^s \Big(\epsilon/\sqrt{S_r} - \nu_r/\sqrt{\hat S_r}\Big)\,\mathrm{d}r \qquad \text{for any } s \in [0,[N]_R).$

Almost surely, the path of the process $\big(\sqrt{S_s} - \sqrt{\hat S_s}\big)_{s\in[0,[N]_R)}$ is continuously differentiable and, by (15), its derivative is strictly positive at every zero of the path. Since the derivative is continuous, it must be strictly positive on a neighbourhood of each zero. This implies that the only zero is at $s = 0$ (i.e. $S_0 = \hat S_0$), and it holds that

(16)  $S_s \ge \hat S_s$ for all $s \in [0,[N]_R)$.

We now conclude the proof of the lemma. Assume as before that $\|x-x'\| < \varphi$ and define $\Upsilon_\varphi(\hat S) := \inf\{s \in [0,[N]_T) : \hat S_s = \varphi\}$ (with $\inf\emptyset = \infty$). Note that the events $\{\Upsilon_\varphi(\hat S) < \infty\}$ and $\{[N]_{\rho_\varphi(Y)} < [N]_{\rho_0(Y)}\}$ coincide, since on either event we have $\Upsilon_\varphi(\hat S) = [N]_R = [N]_{\rho_\varphi(Y)}$. Hence,

(17)  $\{\rho_\varphi(Y) < \rho_0(Y)\} = \{\Upsilon_\varphi(\hat S) < \infty\} \subseteq \{S \text{ exits the interval } (0,\varphi) \text{ at } \varphi\},$

where the inclusion follows by (16). Recall that $s(z) = z^{(1-\epsilon)/2}$, $z \in \mathbb{R}_+$, is a scale function of the diffusion $S$. Hence $\mathbb{P}(S \text{ exits the interval } (0,\varphi) \text{ at } \varphi) = s(\|x-x'\|)/s(\varphi)$. Define $\varphi' := \epsilon^{2/(1-\epsilon)}\varphi$, so that $s(\varphi')/s(\varphi) = \epsilon$, and note that by (17) we have: $\mathbb{P}(\rho_\varphi(Y) < \rho_0(Y)) < \epsilon$ for any $x'\in\mathbb{R}^d$ satisfying $\|x-x'\| < \varphi'$. □

Lemma 8. Pick a locally Lipschitz Markov policy $\pi : \mathbb{R}^d \to A$ and let Assumption 1 hold.
Then the corresponding payoff function in (5), $V_\pi : \mathbb{R}^d \to \mathbb{R}_+$, is continuous.

Proof. Fix $x \in \mathbb{R}^d$ and pick an arbitrary $\varepsilon \in (0,1)$. By Assumption 1 there exist $\epsilon_0 > 0$, such that $\alpha_\pi \ge \epsilon_0$, and a constant $M > 1$ that simultaneously bounds $\alpha_\pi, |f_\pi| < M$ and is a Lipschitz constant on the ball of radius one around $x$ for $\alpha_\pi$ and $f_\pi$. Apply Lemma 7 to $x$, $\epsilon := \varepsilon\epsilon_0/(6M)$ and $\pi$ to obtain $\bar\varphi \in (0,1]$ such that $\forall\varphi\in(0,\bar\varphi)$ $\exists\varphi'\in(0,\varphi)$ with $\mathbb{P}(\rho_\varphi < \rho_0) < \epsilon$ for every $x'\in\mathbb{R}^d$ satisfying $\|x-x'\|<\varphi'$ (here $\rho_\varphi$, $\rho_0$ stand for $\rho_\varphi(Y)$, $\rho_0(Y)$, respectively). Specifically, define

(18)  $\varphi := \min\big\{\bar\varphi/2,\ \varepsilon/\big(3(1+M/\epsilon_0)M/\epsilon_0\big),\ \varepsilon e\epsilon_0/(3M^2)\big\}$

and fix $\varphi' \in (0,\varphi)$ such that the conclusion of Lemma 7 holds. Throughout this proof we use the notation and notions from Lemma 7. In particular, $(X^{\pi,x}, X^{\pi,x'})$ denotes the coupling of the two controlled processes started at $(x,x')$ and we assume that $\|x-x'\| < \varphi'$.

Recall that $V_\pi(x') = \mathbb{E} F_\infty(X^{\pi,x'})$ for any $x' \in \mathbb{R}^d$, where $F_\infty(X^{\pi,x'})$ is given in (19). By decomposing the probability space into the complementary events $\{\rho_\varphi > \rho_0\}$ and $\{\rho_\varphi < \rho_0\}$, we obtain the inequality $|V_\pi(x) - V_\pi(x')| \le A + A' + A''$, where

$A := \mathbb{E}\Big[I_{\{\rho_\varphi>\rho_0\}}\int_0^{\rho_0}\Big|e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}f_\pi(X_t^{\pi,x}) - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})\mathrm{d}s}f_\pi(X_t^{\pi,x'})\Big|\,\mathrm{d}t\Big],$

$A' := \mathbb{E}\Big[I_{\{\rho_\varphi>\rho_0\}}\int_{\rho_0}^{\infty}\Big|e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}f_\pi(X_t^{\pi,x}) - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})\mathrm{d}s}f_\pi(X_t^{\pi,x'})\Big|\,\mathrm{d}t\Big],$

$A'' := \mathbb{E}\Big[I_{\{\rho_\varphi<\rho_0\}}\int_0^{\infty}\Big|e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}f_\pi(X_t^{\pi,x}) - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})\mathrm{d}s}f_\pi(X_t^{\pi,x'})\Big|\,\mathrm{d}t\Big].$

Hence, by Lemma 7, we have $A'' \le \mathbb{P}(\rho_\varphi<\rho_0)\,2M/\epsilon_0 < \epsilon\,2M/\epsilon_0 = \varepsilon/3$.

Since in the summands $A$ and $A'$ the coupling succeeds before the components of $(X^{\pi,x}, X^{\pi,x'})$ grow at least $\varphi$ apart, we can control these terms using the local regularity of $\alpha_\pi$ and $f_\pi$. Consider $A$. Add and subtract $e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})\mathrm{d}s}f_\pi(X_t^{\pi,x})$ to obtain the bound:

$A \le \mathbb{E}\Big[I_{\{\rho_\varphi>\rho_0\}}\int_0^{\rho_0}\Big(\big|f_\pi(X_t^{\pi,x}) - f_\pi(X_t^{\pi,x'})\big|e^{-\epsilon_0 t} + M\Big|e^{-\int_0^t\alpha_\pi(X_s^{\pi,x})\mathrm{d}s} - e^{-\int_0^t\alpha_\pi(X_s^{\pi,x'})\mathrm{d}s}\Big|\Big)\,\mathrm{d}t\Big].$

On the event $\{\rho_\varphi>\rho_0\}$, for $t < \rho_\varphi$ it holds that $\|X_t^{\pi,x} - X_t^{\pi,x'}\| < \varphi$. Since $z\mapsto e^{-z}$ has derivative bounded in absolute value by one for $z\in\mathbb{R}_+$, $\alpha_\pi - \epsilon_0 \ge 0$ and both $f_\pi$ and $\alpha_\pi$ are Lipschitz with constant $M$ on the ball of radius $\varphi$ around $x$, we get

$A \le \mathbb{E}\Big[I_{\{\rho_\varphi>\rho_0\}}\int_0^{\rho_0}\Big(M\varphi e^{-\epsilon_0 t} + Me^{-\epsilon_0 t}\int_0^t\big|\alpha_\pi(X_s^{\pi,x'}) - \alpha_\pi(X_s^{\pi,x})\big|\,\mathrm{d}s\Big)\,\mathrm{d}t\Big] < \varphi\Big(\frac{M}{\epsilon_0} + \frac{M^2}{\epsilon_0^2}\Big) \le \frac{\varepsilon}{3},$

where the last inequality follows from (18). Furthermore, since $X_t^{\pi,x} = X_t^{\pi,x'}$ for all $t \ge \rho_0$, the following expectation equals $A'$:

$\mathbb{E}\Big[I_{\{\rho_\varphi>\rho_0\}}\,e^{-\rho_0\epsilon_0}\Big|e^{-\int_0^{\rho_0}(\alpha_\pi(X_s^{\pi,x})-\epsilon_0)\mathrm{d}s} - e^{-\int_0^{\rho_0}(\alpha_\pi(X_s^{\pi,x'})-\epsilon_0)\mathrm{d}s}\Big|\int_{\rho_0}^\infty e^{-\int_{\rho_0}^t\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}\big|f_\pi(X_t^{\pi,x})\big|\,\mathrm{d}t\Big].$

Since $|f_\pi| < M$, $\alpha_\pi \ge \epsilon_0$, $|e^{-z} - e^{-y}| \le |z-y|$ for $z,y\in\mathbb{R}_+$ and $te^{-t\epsilon_0} \le e^{-1}/\epsilon_0$ for $t\in\mathbb{R}_+$, we find

$A' \le \mathbb{E}\Big[I_{\{\rho_\varphi>\rho_0\}}\,Me^{-\rho_0\epsilon_0}\int_0^{\rho_0}\big|\alpha_\pi(X_s^{\pi,x}) - \alpha_\pi(X_s^{\pi,x'})\big|\,\mathrm{d}s\Big] \le M^2\varphi\,\mathbb{E}\big[I_{\{\rho_\varphi>\rho_0\}}e^{-\rho_0\epsilon_0}\rho_0\big] \le \varphi\,\frac{M^2}{e\epsilon_0},$

which by (18) is less than $\varepsilon/3$. Hence, for any $\|x-x'\| \le \varphi'$, we have proved that $|V_\pi(x) - V_\pi(x')| < \varepsilon$, which concludes the proof of the lemma. □

Remark 16. The proofs of Lemmas 7 and 8 show that if the locally Lipschitz property in Assumption 1 is replaced by a globally Lipschitz requirement, we can conclude that the payoff function $V_\pi$ is in fact uniformly continuous. However, the coupling from [18] may still not be successful, since the global Lipschitz condition controls globally the local variability of the coefficients, whereas the assumptions in [18, Thm. 1] constrain the global variability of $\sigma_\pi$. In fact, the idea of the proof of Lemma 7 can be used to construct an example where $\mathbb{P}(\rho_0(Y) < \infty) < 1$, by bounding $\|Y\|^2$ from below by a squared Bessel process of dimension greater than two on an event of positive probability.
5.1.2. A version of the Ascoli–Arzelà theorem. The following fact is key for proving the existence of the optimal strategy and showing that a subsequence of $\{\pi_N\}_{N\in\mathbb{N}}$ in (gPIA) converges to it.

Lemma 9. Let $(M_1, d_1)$ and $(M_2, d_2)$ be compact metric spaces and, for every $n \in \mathbb{N}$, let $f_n : M_1 \to M_2$. If the sequence $\{f_n\}_{n\in\mathbb{N}}$ is equicontinuous, i.e.

$\forall \epsilon > 0\ \exists \delta > 0\ \forall x,y \in M_1\ \forall n\in\mathbb{N}:\quad d_1(x,y) < \delta \implies d_2(f_n(x), f_n(y)) < \epsilon,$

then there exists a uniformly convergent subsequence $\{f_{n_k}\}_{k\in\mathbb{N}}$, i.e. $\exists f : M_1 \to M_2$ such that for every $\epsilon > 0$ there exists $N \in \mathbb{N}$ such that $\sup_{x\in M_1} d_2(f_{n_k}(x), f(x)) < \epsilon$ for all $k \ge N$.

Proof. Let $B(x, 1/m) := \{y\in M_1 : d_1(x,y) < 1/m\}$ be the ball of radius $1/m$, $m\in\mathbb{N}$, centred at $x\in M_1$. Since $M_1$ is compact and metric, it is totally bounded: there exists a finite $S_m \subseteq M_1$ satisfying $M_1 = \cup_{x\in S_m} B(x, 1/m)$. Then $S := \cup_{m\in\mathbb{N}} S_m = \{x_n \in M_1 : n\in\mathbb{N}\}$ is countable and dense in $M_1$. We now apply the standard diagonalisation argument to find the subsequence in the lemma. Let $\iota_1 : \mathbb{N}\to\mathbb{N}$ be an increasing function defining a subsequence $\{f_{\iota_1(n)}\}_{n\in\mathbb{N}}$ that converges at $x_1$, i.e. $\lim_{n\to\infty} f_{\iota_1(n)}(x_1)$ exists in $M_2$. Such a function $\iota_1$ exists since $M_2$ is compact. Assume now that we have constructed an increasing $\iota_k : \mathbb{N}\to\mathbb{N}$ such that $\{f_{\iota_k(n)}\}_{n\in\mathbb{N}}$ converges on the set $\{x_1,\ldots,x_k\}$ for some $k\in\mathbb{N}$. Then there exists an increasing $\iota : \mathbb{N}\to\mathbb{N}$ such that the sequence of functions $\{f_{\iota_{k+1}(n)}\}_{n\in\mathbb{N}}$, where $\iota_{k+1} := \iota_k\circ\iota$, converges at $x_{k+1}$ as well as on the set $\{x_1,\ldots,x_k\}$, as it is a subsequence of $\{f_{\iota_k(n)}\}_{n\in\mathbb{N}}$. Since $k\in\mathbb{N}$ was arbitrary, we have defined a sequence of subsequences of $\{f_n\}_{n\in\mathbb{N}}$ such that the $k$-th subsequence converges on $\{x_1,\ldots,x_k\}$.

Consider the "diagonal" subsequence $\{f_{n_k}\}_{k\in\mathbb{N}}$, $f_{n_k} := f_{\iota_k(k)}$ for any $k\in\mathbb{N}$. By construction it converges on $S$. We now prove that it is uniformly Cauchy, which implies uniform convergence since $M_2$ is complete. Pick any $\epsilon > 0$. By equicontinuity there exists $m\in\mathbb{N}$ such that for any $k\in\mathbb{N}$ and $x,y\in M_1$ satisfying $d_1(x,y) < 1/m$, it holds that $d_2(f_{n_k}(x), f_{n_k}(y)) < \epsilon/3$. Furthermore, since $S_m$ is finite, there exists $N\in\mathbb{N}$ such that for all natural numbers $k_1, k_2 \ge N$ we have $d_2(f_{n_{k_1}}(y), f_{n_{k_2}}(y)) < \epsilon/3$ for all $y\in S_m$. Finally, for any $x\in M_1$ there exists $y\in S_m$ such that $d_1(x,y) < 1/m$. Hence, for any $k_1, k_2 \ge N$ it holds that

$d_2(f_{n_{k_1}}(x), f_{n_{k_2}}(x)) \le d_2(f_{n_{k_1}}(x), f_{n_{k_1}}(y)) + d_2(f_{n_{k_1}}(y), f_{n_{k_2}}(y)) + d_2(f_{n_{k_2}}(y), f_{n_{k_2}}(x)) < \epsilon.$

Since $x\in M_1$ was arbitrary, the lemma follows. □

5.1.3. A uniformly integrable martingale. If the process $X^{\pi,x}$ in (4), controlled by a Markov policy $\pi$, exists for all $x\in\mathbb{R}^d$, then $X^{\pi,\cdot}$ is a strong Markov process [14, Thm 4.30, p. 322], since $\sigma$ and $\mu$ are bounded by Assumption 1. Define the additive functional $F(X^{\pi,x}) = (F_t(X^{\pi,x}))_{t\in[0,\infty]}$,

(19)  $F_t(X^{\pi,x}) := \int_0^t e^{-\int_0^u\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}f_\pi(X_u^{\pi,x})\,\mathrm{d}u \qquad \text{for } t\in[0,\infty].$

Remark 17. Note that $V_\pi(x) = \mathbb{E} F_\infty(X^{\pi,x})$ and, by Assumption 1, the process $|F(X^{\pi,x})|$ is bounded by some constant $C_0 > 0$. Hence $|F_\infty(X^{\pi,x})| < C_0$ and $|V_\pi(x)| < C_0$.

Lemma 10. The following holds for every Markov policy $\pi$, $x\in\mathbb{R}^d$ and $(\mathcal{F}_t)$-stopping time $T$:

$\mathbb{E}\big[F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_T\big] = F_T(X^{\pi,x}) + I_{\{T<\infty\}}\,e^{-\int_0^T\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}\,V_\pi(X_T^{\pi,x}).$
In particular, the process $M = (M_r)_{r\in[0,\infty]}$ is a uniformly integrable martingale, where

$M_r := F_r(X^{\pi,x}) + I_{\{r<\infty\}}\,e^{-\int_0^r\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}\,V_\pi(X_r^{\pi,x}).$

Proof. The following calculation implies the lemma:

$\mathbb{E}\big[F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_T\big] = F_T(X^{\pi,x}) + \mathbb{E}\Big[I_{\{T<\infty\}}\int_0^\infty e^{-\int_0^{t+T}\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}f_\pi(X_{t+T}^{\pi,x})\,\mathrm{d}t\,\Big|\,\mathcal{F}_T\Big]$
$= F_T(X^{\pi,x}) + I_{\{T<\infty\}}\,e^{-\int_0^T\alpha_\pi(X_t^{\pi,x})\mathrm{d}t}\,\mathbb{E}\Big[\int_0^\infty e^{-\int_0^t\alpha_\pi(X_{s+T}^{\pi,x})\mathrm{d}s}f_\pi(X_{t+T}^{\pi,x})\,\mathrm{d}t\,\Big|\,\mathcal{F}_T\Big]$
$= F_T(X^{\pi,x}) + I_{\{T<\infty\}}\,e^{-\int_0^T\alpha_\pi(X_t^{\pi,x})\mathrm{d}t}\,V_\pi(X_T^{\pi,x}),$

where we applied the strong Markov property of $X^{\pi,\cdot}$ in the last step. □

5.2. Proofs of results in Section 2.

Proof of Proposition 1. Assume $m = d$, cf. Remark 4. It suffices to prove that the PDE holds on the ball $D := \{y\in\mathbb{R}^d : \|y-x'\| < 1\}$ for any $x'\in\mathbb{R}^d$. Fix $x\in D$ and define $\tau := \inf\{t\in\mathbb{R}_+ : X_t^{\pi,x}\in\partial D\}$ (with $\inf\emptyset = \infty$) to be the first time the process $X^{\pi,x}$ hits the boundary $\partial D := \{y\in\mathbb{R}^d : \|y-x'\| = 1\}$ of $D$. Note that, by Assumption 1, we have $\tau < \infty$. Let $v\in C^2(D)\cap C(\bar D)$, where $\bar D := D\cup\partial D$, denote a solution of the boundary value problem

$L_\pi v - \alpha_\pi v + f_\pi = 0$ in $D$, with $v = V_\pi$ on $\partial D$.

Since $\pi$ is locally Lipschitz and (2) in Assumption 1 holds, the coefficients $\sigma_\pi, \mu_\pi, f_\pi, \alpha_\pi$ are $(1/2)$-Hölder (in fact Lipschitz) on $\bar D$. The boundary data $V_\pi|_{\partial D}$ is continuous by Lemma 8, $\alpha_\pi \ge 0$ and $\sigma_\pi$ satisfies (3). Hence, by [10, Thm 19, p. 87], the function $v$ exists, is unique and $Hv$ is $(1/2)$-Hölder. Note that, for all $t\in[0,\infty]$, we have $X_{t\wedge\tau}^{\pi,x}\in\bar D$. Hence we can define

(20)  $Y_t := F_{t\wedge\tau}(X^{\pi,x}) + e^{-\int_0^{t\wedge\tau}\alpha_\pi(X_r^{\pi,x})\mathrm{d}r}\,v\big(X_{t\wedge\tau}^{\pi,x}\big), \qquad \text{for } t\in[0,\infty],$

where the process $F_\cdot(X^{\pi,x})$ is given in (19) above. The process $Y = (Y_t)_{t\in[0,\infty]}$ is bounded by a constant and by definition converges almost surely, $\lim_{t\to\infty}Y_t = Y_\infty$. Since $v$ solves the boundary value problem above and $X^{\pi,x}$ satisfies SDE (4), Itô's formula on the stochastic interval $[0,\tau]\subset\mathbb{R}_+$ yields

$Y_t = v(x) + \int_0^{t\wedge\tau}e^{-\int_0^s\alpha_\pi(X_r^{\pi,x})\mathrm{d}r}\,\nabla v(X_s^{\pi,x})^T\sigma_\pi(X_s^{\pi,x})\,\mathrm{d}B_s, \qquad t\in[0,\infty],$

making $Y$ a local martingale. Since $Y$ is bounded, it is a uniformly integrable martingale satisfying $v(x) = Y_0 = \mathbb{E}[Y_\infty]$. Since $v = V_\pi$ on $\partial D$ and $X_\tau^{\pi,x}\in\partial D$, the definition of $Y$ in (20) and Lemma 10 (applied to the stopping time $T := \tau$) yield

$Y_\infty = F_\tau(X^{\pi,x}) + e^{-\int_0^\tau\alpha_\pi(X_r^{\pi,x})\mathrm{d}r}\,V_\pi(X_\tau^{\pi,x}) = \mathbb{E}\big(F_\infty(X^{\pi,x})\,\big|\,\mathcal{F}_\tau\big),$

implying $v(x) = \mathbb{E}(F_\infty(X^{\pi,x})) = V_\pi(x)$. The uniqueness follows similarly: let $v$ be another bounded solution of the Poisson equation on $\mathbb{R}^d$. Define the process $Y$ as in (20) with $\tau\equiv\infty$ and $t<\infty$. As above we have $v(x) = \mathbb{E} Y_t$ for all $t\in\mathbb{R}_+$. Then the DCT, applicable since $v$ is bounded, yields $v(x) = \lim_{t\uparrow\infty}\mathbb{E} Y_t = V_\pi(x)$. □

Proof of Theorem 2. Let $\pi_n$ and $\pi_{n+1}$ be as in (gPIA), $n\in\mathbb{N}\cup\{0\}$. Define $Y = (Y_t)_{t\in\mathbb{R}_+}$ by

(21)  $Y_t := F_t(X^{\pi_{n+1},x}) + e^{-\int_0^t\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\mathrm{d}r}\,V_{\pi_n}\big(X_t^{\pi_{n+1},x}\big), \qquad t\in\mathbb{R}_+,$

where $F_\cdot(X^{\pi_{n+1},x})$ is given in (19) above. Define $\tau_m := \inf\{t\ge 0 : \|X_t^{\pi_{n+1},x}\| = m\}$ for any fixed $m > \|x\|$ and note that $\tau_m < \infty$ by Assumption 1. Itô's formula, applicable by Proposition 1, yields

$Y_{\cdot\wedge\tau_m} = V_{\pi_n}(x) + M + \int_0^{\cdot\wedge\tau_m}e^{-\int_0^s\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\mathrm{d}r}\big(f_{\pi_{n+1}} + L_{\pi_{n+1}}V_{\pi_n} - \alpha_{\pi_{n+1}}V_{\pi_n}\big)\big(X_s^{\pi_{n+1},x}\big)\,\mathrm{d}s,$

where $M = (M_t)_{t\in\mathbb{R}_+}$, $M_t := \int_0^{t\wedge\tau_m}e^{-\int_0^s\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\mathrm{d}r}\big(\nabla V_{\pi_n}^T\sigma_{\pi_{n+1}}\big)\big(X_s^{\pi_{n+1},x}\big)\,\mathrm{d}B_s$, is a local martingale. Since the functions $\sigma_{\pi_{n+1}}$ and $\nabla V_{\pi_n}$ are bounded on the ball $\{y\in\mathbb{R}^d : \|y\|\le m\}$ by Assumption 1 and Proposition 1, respectively, and $\alpha_{\pi_{n+1}} > \epsilon_0 > 0$, the quadratic variation of $M$ is bounded above by a constant. Hence $M$ is a uniformly integrable martingale. In particular, $\mathbb{E} M_t = 0$ for all $t\in\mathbb{R}_+$.
By (gPIA) and Proposition 1, we have

$\big(f_{\pi_{n+1}} + L_{\pi_{n+1}}V_{\pi_n} - \alpha_{\pi_{n+1}}V_{\pi_n}\big)S_{\pi_{n+1}} \le \big(f_{\pi_n} + L_{\pi_n}V_{\pi_n} - \alpha_{\pi_n}V_{\pi_n}\big)S_{\pi_n} = 0 \qquad \text{on } \mathbb{R}^d.$

Since $S_{\pi_{n+1}} > 0$, we have $\mathbb{E}(Y_{t\wedge\tau_m}) \le V_{\pi_n}(x)$. Hence (21), Assumption 1 and the DCT, as $t\uparrow\infty$, yield

$V_{\pi_n}(x) \ge \mathbb{E} F_{\tau_m}(X^{\pi_{n+1},x}) + \mathbb{E}\Big[V_{\pi_n}\big(X_{\tau_m}^{\pi_{n+1},x}\big)\,e^{-\int_0^{\tau_m}\alpha_{\pi_{n+1}}(X_r^{\pi_{n+1},x})\mathrm{d}r}\Big].$

Hence, by Remark 17, we have $V_{\pi_n}(x) \ge \mathbb{E} F_{\tau_m}(X^{\pi_{n+1},x}) - C_0\mathbb{E} e^{-\epsilon_0\tau_m}$. Since $X^{\pi_{n+1},x}$ satisfies SDE (4) for all $t\in\mathbb{R}_+$, we have $\lim_{m\uparrow\infty}\tau_m = \infty$. The DCT and Remark 17 yield $V_{\pi_{n+1}}(x) = \mathbb{E} F_\infty(X^{\pi_{n+1},x}) = \lim_{m\uparrow\infty}\big(\mathbb{E} F_{\tau_m}(X^{\pi_{n+1},x}) - C_0\mathbb{E} e^{-\epsilon_0\tau_m}\big) \le V_{\pi_n}(x)$, which concludes the proof. □

Proof of Proposition 3. Run (gPIA) to produce a sequence of policies $\{\pi_N\}_{N\in\mathbb{N}}$, starting from a constant policy $\pi_0$. Fix an arbitrary $K_0 > 0$ and consider the restriction of this sequence to the closed ball $D_{K_0}$. Since the Lipschitz constant of $\pi_0$ is equal to zero, and hence smaller than $C_{K_0}$, Assumption 4 implies $V_{\pi_0}\in S_{B_{K_0},K_0}$. Assumption 3 implies that the Lipschitz constant of $\pi_1$ is also at most $C_{K_0}$. Iterating this argument implies that all the policies in the sequence $\{\pi_N\}_{N\in\mathbb{N}}$ have the same Lipschitz constant on $D_{K_0}$, making the sequence equicontinuous on $D_{K_0}$. By Lemma 9 above, there exists a subsequence that converges uniformly on $D_{K_0}$ to a function $\pi_\infty^0 : D_{K_0}\to A$. Moreover, $\pi_\infty^0$ is also Lipschitz with a constant bounded above by $C_{K_0}$.

Let $K_1 := 2K_0$ and repeat the argument above for $K_1$ and the subsequence of $\{\pi_N\}_{N\in\mathbb{N}}$ constructed in the previous paragraph. This yields a further subsequence of the policies that converges uniformly to a Lipschitz function $\pi_\infty^1 : D_{K_1}\to A$ with Lipschitz constant bounded above by $C_{K_1}$. Since the sequence we started with converges pointwise to $\pi_\infty^0$ on $D_{K_0}\subset D_{K_1}$, so must its every subsequence. Hence it holds that $\pi_\infty^1(x) = \pi_\infty^0(x)$ for all $x\in D_{K_0}$.

For $k\in\mathbb{N}$, let $K_k := 2K_{k-1}$ and construct $\pi_\infty^k : D_{K_k}\to A$ inductively as above. Then the function $\pi_{\lim} : \mathbb{R}^d\to A$, given by $\pi_{\lim}(x) := \pi_\infty^n(x)$ for any $n\in\mathbb{N}$ such that $x\in D_{K_n}$, is well-defined and locally Lipschitz. Let the policy $\pi_{n_k} : \mathbb{R}^d\to A$ be the $k$-th element of the convergent subsequence used to define $\pi_\infty^k : D_{K_k}\to A$. Then, by construction, the "diagonal" subsequence $\{\pi_{n_k}\}_{k\in\mathbb{N}}$ of $\{\pi_N\}_{N\in\mathbb{N}}$ converges uniformly to $\pi_{\lim}$ on $D_K$ for any $K > 0$. □

Proof of Theorem 4. Let $\{\pi_{n_k}\}_{k\in\mathbb{N}}$ be a subsequence of the output $\{\pi_N\}_{N\in\mathbb{N}}$ of (gPIA) that converges locally uniformly to a policy $\pi_{\lim} = \lim_{k\uparrow\infty}\pi_{n_k}$. By (6) and Theorem 2, $V_{\pi_{n_k}}\searrow V_{\lim}$ as $k\to\infty$. Fix $K > 0$ and let $\tau_K := \inf\{t\in\mathbb{R}_+ : X_t^{\pi_{\lim},x} - x\in\partial D_K\}$ be the first time $X^{\pi_{\lim},x}$ hits the boundary of the closed ball $x + D_K$ with radius $K$, centred at an arbitrary $x\in\mathbb{R}^d$. Pick $k\in\mathbb{N}$, $t\in\mathbb{R}_+$ and define

$S_t^k := \int_0^t e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\mathrm{d}r}f_{\pi_{n_k}}(X_s^{\pi_{\lim},x})\,\mathrm{d}s + e^{-\int_0^t\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\mathrm{d}r}\,V_{\pi_{n_k}}\big(X_t^{\pi_{\lim},x}\big).$

Apply Itô's formula to the process $S^k = (S_t^k)_{t\ge 0}$ on the stochastic interval $[0,\tau_K)$ to get

$S_{t\wedge\tau_K}^k = V_{\pi_{n_k}}(x) + \int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\mathrm{d}r}\,\nabla V_{\pi_{n_k}}^T\sigma_{\pi_{\lim}}(X_s^{\pi_{\lim},x})\,\mathrm{d}B_s$
$\qquad + \int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\mathrm{d}r}\big(f_{\pi_{n_k}} + L_{\pi_{\lim}}V_{\pi_{n_k}} - \alpha_{\pi_{n_k}}V_{\pi_{n_k}}\big)\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s.$

Note that $\sigma_{\pi_{\lim}}$ and $\nabla V_{\pi_{n_k}}$ are bounded on $x + D_K$ by Assumption 1 and Proposition 1, respectively, and $\alpha_{\pi_{n_k}} > \epsilon_0 > 0$. Hence the quadratic variation of the stochastic integral is bounded, making it a true martingale. This fact and the equality $\alpha_{\pi_{n_k}}V_{\pi_{n_k}} - f_{\pi_{n_k}} = L_{\pi_{n_k}}V_{\pi_{n_k}}$ (Proposition 1) yield
(22)  $\mathbb{E} S_{t\wedge\tau_K}^k = V_{\pi_{n_k}}(x) + \mathbb{E}\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\mathrm{d}r}\big(f_{\pi_{n_k}} + L_{\pi_{\lim}}V_{\pi_{n_k}} - \alpha_{\pi_{n_k}}V_{\pi_{n_k}}\big)\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s$
$= V_{\pi_{n_k}}(x) + \mathbb{E}\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{n_k}}(X_r^{\pi_{\lim},x})\mathrm{d}r}\big([L_{\pi_{\lim}} - L_{\pi_{n_k}}]V_{\pi_{n_k}}\big)\big(X_s^{\pi_{\lim},x}\big)\,\mathrm{d}s.$

Note that $[L_{\pi_{\lim}} - L_{\pi_{n_k}}]V_{\pi_{n_k}} = (\mu_{\pi_{\lim}} - \mu_{\pi_{n_k}})^T\nabla V_{\pi_{n_k}} + \frac{1}{2}\mathrm{Tr}\big((\sigma_{\pi_{\lim}} + \sigma_{\pi_{n_k}})^T HV_{\pi_{n_k}}(\sigma_{\pi_{\lim}} - \sigma_{\pi_{n_k}})\big)$. Since, for every $k$, $V_{\pi_{n_k}}$ solves the corresponding Poisson equation in Proposition 1 and, by Assumption 1 and Remark 17, the family of functions $\{\sigma_{\pi_{n_k}}, \mu_{\pi_{n_k}}, \alpha_{\pi_{n_k}}, f_{\pi_{n_k}}, V_{\pi_{n_k}} : k\in\mathbb{N}\}$ is uniformly bounded on the ball $x + D_K$, Schauder's boundary estimate for elliptic PDEs [10, p. 86] implies that the sequences $\{\nabla V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ and $\{HV_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ are also uniformly bounded on $x + D_K$. Since $\alpha_{\pi_{n_k}} > \epsilon_0 > 0$ for all $k\in\mathbb{N}$ and the limits $\lim_{k\uparrow\infty}\mu_{\pi_{n_k}} = \mu_{\pi_{\lim}}$ and $\lim_{k\uparrow\infty}\sigma_{\pi_{n_k}} = \sigma_{\pi_{\lim}}$ are uniform on $x + D_K$, the DCT and the equality in (22) imply $\lim_{k\uparrow\infty}\mathbb{E} S_{t\wedge\tau_K}^k = V_{\lim}(x)$. Hence the definition of $S^k$ above, Assumption 1, Remark 17 and a further application of the DCT yield

(23)  $V_{\lim}(x) = \mathbb{E}\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\pi_{\lim}}(X_r^{\pi_{\lim},x})\mathrm{d}r}f_{\pi_{\lim}}(X_s^{\pi_{\lim},x})\,\mathrm{d}s + E_{t\wedge\tau_K},$

where $E_{t\wedge\tau_K} := \mathbb{E}\Big[e^{-\int_0^{t\wedge\tau_K}\alpha_{\pi_{\lim}}(X_r^{\pi_{\lim},x})\mathrm{d}r}\,V_{\lim}\big(X_{t\wedge\tau_K}^{\pi_{\lim},x}\big)\Big]$. By (6) and Remark 17, the inequality $|V_{\lim}(y)|\le C_0$ holds for all $y\in\mathbb{R}^d$. By Assumption 1 we hence get

$0 \le \limsup_{t\wedge K\to\infty}|E_{t\wedge\tau_K}| \le C_0\limsup_{t\wedge K\to\infty}\mathbb{E} e^{-\epsilon_0(t\wedge\tau_K)} = 0,$

since $\tau_K\uparrow\infty$ as $K\uparrow\infty$. The DCT applied to the first summand in (23), as $t\wedge K\to\infty$, yields the equality $V_{\lim}(x) = V_{\pi_{\lim}}(x)$. Since $x\in\mathbb{R}^d$ was arbitrary, the theorem follows. □

Proof of Theorem 5. The second assertion in the theorem follows from the first one and Theorem 4. We now establish the first assertion of Theorem 5. Equip $A\times A$ with a product metric, e.g. $d_\infty((p_1,p_2),(a_1,a_2)) := \max\{d_A(a_1,p_1), d_A(a_2,p_2)\}$, and let $\{\pi_N\}_{N\in\mathbb{N}}$ be constructed by (gPIA). As in the proof of Proposition 3, the maps $\{(\pi_{N+1},\pi_N) : \mathbb{R}^d\to A\times A\}_{N\in\mathbb{N}}$ are Lipschitz on a closed ball $D_K$ of radius $K > 0$ with Lipschitz constant $C_K$, independent of $N$. Hence, as in the proof of Proposition 3, there exists a subsequence $\{(\pi_{1+n_k},\pi_{n_k})\}_{k\in\mathbb{N}}$ that converges uniformly on every compact subset of $\mathbb{R}^d$ to a locally Lipschitz function $(\tilde\pi_{\lim},\pi_{\lim}) : \mathbb{R}^d\to A\times A$.

Pick any $x\in\mathbb{R}^d$, a policy $\Pi\in\mathcal{A}(x)$, $K > 0$ and let $\tau_K := \inf\{t\in\mathbb{R}_+ : X_t^{\Pi,x} - x\in\partial D_K\}$ be the first time the controlled process $X^{\Pi,x}$ hits the boundary of the closed ball $x + D_K$ with radius $K$ (centred at $x$). Since $\Pi_s\in A$ for all $s\in\mathbb{R}_+$, (gPIA) implies the inequality

(24)  $S_{\Pi_s}\big(f_{\Pi_s} + L_{\Pi_s}V_{\pi_{n_k}} - \alpha_{\Pi_s}V_{\pi_{n_k}}\big) \ge S_{\pi_{n_k+1}}\big(f_{\pi_{n_k+1}} + L_{\pi_{n_k+1}}V_{\pi_{n_k}} - \alpha_{\pi_{n_k+1}}V_{\pi_{n_k}}\big) \qquad \text{on } \mathbb{R}^d.$

Denote $\mathcal{L}_\pi h := L_\pi h - \alpha_\pi h + f_\pi$ for any policy $\pi$ and $h\in C^2(\mathbb{R}^d)$. Then, for $k\in\mathbb{N}$, we find that

(25)  $\mathbb{E}\Big[\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})\mathrm{d}r}f_{\Pi_s}(X_s^{\Pi,x})\,\mathrm{d}s + e^{-\int_0^{t\wedge\tau_K}\alpha_{\Pi_r}(X_r^{\Pi,x})\mathrm{d}r}\,V_{\pi_{n_k}}\big(X_{t\wedge\tau_K}^{\Pi,x}\big)\Big]$
$\overset{\text{Itô}}{=} V_{\pi_{n_k}}(x) + \mathbb{E}\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})\mathrm{d}r}\big(f_{\Pi_s} + L_{\Pi_s}V_{\pi_{n_k}} - \alpha_{\Pi_s}V_{\pi_{n_k}}\big)\big(X_s^{\Pi,x}\big)\,\mathrm{d}s$
$\ge V_{\pi_{n_k}}(x) + \frac{\epsilon_S}{M_S}\,\mathbb{E}\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})\mathrm{d}r}\big(\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}}\big)\big(X_s^{\Pi,x}\big)\,\mathrm{d}s,$

where the last inequality follows from Assumption 2 and inequality (24).

The next task is to take the limit as $k\to\infty$ on both sides of inequality (25). Since the sequence $\{\pi_{1+n_k}\}_{k\in\mathbb{N}}$ (resp. $\{\pi_{n_k}\}_{k\in\mathbb{N}}$) converges locally uniformly to the locally Lipschitz policy $\tilde\pi_{\lim}$ (resp. $\pi_{\lim}$), Theorem 4 implies $V_{\tilde\pi_{\lim}} = V_{\lim}$ (resp. $V_{\pi_{\lim}} = V_{\lim}$). Proposition 1 implies $\mathcal{L}_{\tilde\pi_{\lim}}V_{\lim} = 0 = \mathcal{L}_{\pi_{\lim}}V_{\lim}$. Hence we can express $\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}} = \mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}}V_{\pi_{n_k}} + \mathcal{L}_{\tilde\pi_{\lim}}V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}}V_{\lim}$. By Schauder's boundary estimate for elliptic PDEs [10, p. 86],
the sequences $\{\nabla V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ and $\{HV_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ are uniformly bounded on $x + D_K$. By Assumption 1 and Remark 17, the bounded sequence $\{(\sigma_{\pi_{n_k+1}}, \mu_{\pi_{n_k+1}}, \alpha_{\pi_{n_k+1}}, f_{\pi_{n_k+1}}, V_{\pi_{n_k+1}})\}_{k\in\mathbb{N}}$ tends to the limit $(\sigma_{\tilde\pi_{\lim}}, \mu_{\tilde\pi_{\lim}}, \alpha_{\tilde\pi_{\lim}}, f_{\tilde\pi_{\lim}}, V_{\tilde\pi_{\lim}})$ uniformly on $x + D_K$ as $k\uparrow\infty$. Hence, so does

(26)  $\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}}V_{\pi_{n_k}} = [L_{\pi_{n_k+1}} - L_{\tilde\pi_{\lim}}]V_{\pi_{n_k}} - (\alpha_{\pi_{n_k+1}} - \alpha_{\tilde\pi_{\lim}})V_{\pi_{n_k}} + (f_{\pi_{n_k+1}} - f_{\tilde\pi_{\lim}}) \to 0.$

By the elliptic version of Theorem 15 in [10, p. 80], applied to the family of PDEs $\mathcal{L}_{\pi_{n_k}}V_{\pi_{n_k}} = 0$, $k\in\mathbb{N}$, there exists a subsequence of $\{V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$ (again denoted by $\{V_{\pi_{n_k}}\}_{k\in\mathbb{N}}$) such that the corresponding sequence $\{(V_{\pi_{n_k}}, \nabla V_{\pi_{n_k}}, HV_{\pi_{n_k}})\}_{k\in\mathbb{N}}$ converges uniformly on the closed ball $x + D_K$ to $(V_{\pi_{\lim}}, \nabla V_{\pi_{\lim}}, HV_{\pi_{\lim}}) = (V_{\lim}, \nabla V_{\lim}, HV_{\lim})$. Hence it follows that

(27)  $\mathcal{L}_{\tilde\pi_{\lim}}V_{\pi_{n_k}} - \mathcal{L}_{\tilde\pi_{\lim}}V_{\lim} = L_{\tilde\pi_{\lim}}(V_{\pi_{n_k}} - V_{\lim}) - \alpha_{\tilde\pi_{\lim}}(V_{\pi_{n_k}} - V_{\lim}) \to 0, \qquad \text{as } k\to\infty.$

Equations (26) and (27) imply that $\mathcal{L}_{\pi_{n_k+1}}V_{\pi_{n_k}} \to 0$ as $k\to\infty$, uniformly on the ball $x + D_K$. Apply the DCT to the right-hand side of (25), and Assumption 1 and Remark 17 to its left-hand side:

$V_{\lim}(x) \le \mathbb{E}\int_0^{t\wedge\tau_K}e^{-\int_0^s\alpha_{\Pi_r}(X_r^{\Pi,x})\mathrm{d}r}f_{\Pi_s}(X_s^{\Pi,x})\,\mathrm{d}s + C_0\mathbb{E} e^{-\epsilon_0(t\wedge\tau_K)}.$

Since this inequality holds for all $K, t > 0$ and $\tau_K\uparrow\infty$ as $K\uparrow\infty$, the inequality $V_{\lim}(x)\le V_\Pi(x)$ follows by the DCT as $t\wedge K\to\infty$ (cf. the last paragraph of the proof of Theorem 4). □

5.3. Auxiliary results - the one-dimensional case. Throughout Sections 5.3 and 5.4, define $\tau_c^d(Z) := \inf\{t\ge 0 : Z_t\in\{c,d\}\}$ ($\inf\emptyset = \infty$) for any continuous stochastic process $(Z_t)_{t\in\mathbb{R}_+}$ in $\mathbb{R}$ and $-\infty\le c < d\le\infty$.

Lemma 11. For any Markov policy $\pi : (a,b)\to A$, the payoff function $V_\pi : (a,b)\to\mathbb{R}$ can be continuously extended by defining $V_\pi(a) := g(a)$ if $a > -\infty$ and $V_\pi(b) := g(b)$ if $b < \infty$.

Proof. Let $\{x_n\}_{n\in\mathbb{N}}$ be a decreasing sequence in $(a,b)$ that converges to $a > -\infty$. We now prove that $\lim_{n\to\infty}V_\pi(x_n) = g(a)$ (the argument for $b$ is analogous). Pick an arbitrary $\epsilon > 0$. Since $\mu$ is bounded and $\sigma^2$ is bounded and bounded away from $0$, a simple coupling argument shows that the process $X^{\pi,x_n}$ can be bounded by a Brownian motion with drift, so that $\mathbb{P}\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\big) < \epsilon$ and $\mathbb{P}\big(\tau_a^\infty(X^{\pi,x_n}) > \tau_{-\infty}^b(X^{\pi,x_n})\big) < \epsilon$ hold for all large $n\in\mathbb{N}$. Hence, there exists $n_0\in\mathbb{N}$ such that

$\mathbb{P}\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\wedge\tau_{-\infty}^b(X^{\pi,x_n})\big) \le \mathbb{P}\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\big) + \mathbb{P}\big(\tau_a^\infty(X^{\pi,x_n}) > \tau_{-\infty}^b(X^{\pi,x_n})\big) < 2\epsilon$

for all $n\ge n_0$. Define the quantities $B_a^b := \big|e^{-\int_0^{\tau_a^b(X^{\pi,x_n})}\alpha_\pi(X_t^{\pi,x_n})\mathrm{d}t}\,g\big(X^{\pi,x_n}_{\tau_a^b(X^{\pi,x_n})}\big) - g(a)\big|$ and $A_a^b := \int_0^{\tau_a^b(X^{\pi,x_n})}e^{-\int_0^t\alpha_\pi(X_s^{\pi,x_n})\mathrm{d}s}\big|f_\pi(X_t^{\pi,x_n})\big|\,\mathrm{d}t$, and the event $C := \{\tau_a^\infty(X^{\pi,x_n})\le\epsilon\wedge\tau_{-\infty}^b(X^{\pi,x_n})\}$. Then we have

$|V_\pi(x_n) - g(a)| \le \mathbb{E}\big[(A_a^b + B_a^b)I_{\Omega\setminus C} + (A_a^b + B_a^b)I_C\big].$

We now show that there exists $M > 0$, which does not depend on $\epsilon$, such that $|V_\pi(x_n) - g(a)|$ is bounded above by $4M\epsilon$ for all $n\ge n_0$. The expectation on the event $\Omega\setminus C$, which has probability less than $2\epsilon$, is smaller than $2M\epsilon$, since $f, g$ are bounded and $\alpha\ge\epsilon_0 > 0$. On the event $C$ we have $\tau_a^b(X^{\pi,x_n})\le\epsilon$, which implies $\mathbb{E} A_a^bI_C < M\epsilon$. On $C$ it holds that $X^{\pi,x_n}_{\tau_a^b(X^{\pi,x_n})} = a$. Hence the elementary inequality $1 - e^{-x}\le x$ for $x\ge 0$ yields an upper bound on $\mathbb{E} B_a^bI_C$ of the form $|g(a)|\,\mathbb{E}\big[\int_0^\epsilon\alpha_\pi(X_t^{\pi,x_n})\,\mathrm{d}t\big]$. This concludes the proof. □

Lemma 12 is the analogue of Lemma 10, with an analogous proof, which we omit for brevity.

Lemma 12. The following holds for every Markov policy $\pi$, $x\in(a,b)$ and stopping time $\rho$:

$\mathbb{E}\Big(F_{\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + e^{-\int_0^{\tau_a^b(X^{\pi,x})}\alpha_\pi(X_t^{\pi,x})\mathrm{d}t}\,g\big(X^{\pi,x}_{\tau_a^b(X^{\pi,x})}\big)I_{\{\tau_a^b(X^{\pi,x})<\infty\}}\,\Big|\,\mathcal{F}_\rho\Big) = M_\rho,$

where $M_r := F_{r\wedge\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + I_{\{r<\infty\}}\,e^{-\int_0^{r\wedge\tau_a^b(X^{\pi,x})}\alpha_\pi(X_s^{\pi,x})\mathrm{d}s}\,V_\pi\big(X^{\pi,x}_{r\wedge\tau_a^b(X^{\pi,x})}\big)$, for $r\in[0,\infty]$. In particular, the process $M = (M_r)_{r\in[0,\infty]}$ is a uniformly integrable martingale.
5.3. Auxiliary results - the one-dimensional case. Throughout Sections 5.3 and 5.4, define $\tau_c^d(Z) := \inf\{t\ge0 : Z_t\in\{c,d\}\}$ (with $\inf\emptyset = \infty$) for any continuous stochastic process $(Z_t)_{t\in\mathbb{R}_+}$ in $\mathbb{R}$ and any $-\infty\le c<d\le\infty$.

Lemma 11. For any Markov policy $\pi : (a,b)\to A$, the payoff function $V_\pi : (a,b)\to\mathbb{R}$ can be continuously extended by defining $V_\pi(a) := g(a)$ if $a > -\infty$ and $V_\pi(b) := g(b)$ if $b < \infty$.

Proof. Let $\{x_n\}_{n\in\mathbb{N}}$ be a decreasing sequence in $(a,b)$ that converges to $a > -\infty$. We now prove that $\lim_{n\to\infty} V_\pi(x_n) = g(a)$ (the argument for $b$ is analogous). Pick an arbitrary $\epsilon > 0$. Since $\mu$ is bounded and $\sigma^2$ is bounded and bounded away from $0$, a simple coupling argument yields that the process $X^{\pi,x_n}$ can be bounded by a Brownian motion with drift, so that $P\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\big) < \epsilon$ and $P\big(\tau_a^\infty(X^{\pi,x_n}) > \tau_{-\infty}^b(X^{\pi,x_n})\big) < \epsilon$ hold for all large $n\in\mathbb{N}$. Hence there exists $n_0\in\mathbb{N}$ such that
$$P\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\wedge\tau_{-\infty}^b(X^{\pi,x_n})\big) \le P\big(\tau_a^\infty(X^{\pi,x_n}) > \epsilon\big) + P\big(\tau_a^\infty(X^{\pi,x_n}) > \tau_{-\infty}^b(X^{\pi,x_n})\big) < 2\epsilon$$
for all $n\ge n_0$. Define the quantities
$$B_a^b := \Big|e^{-\int_0^{\tau_a^b(X^{\pi,x_n})}\alpha_\pi(X_t^{\pi,x_n})\,dt}\, g\big(X^{\pi,x_n}_{\tau_a^b(X^{\pi,x_n})}\big) - g(a)\Big| \quad\text{and}\quad A_a^b := \int_0^{\tau_a^b(X^{\pi,x_n})} e^{-\int_0^t\alpha_\pi(X_s^{\pi,x_n})\,ds}\,\big|f_\pi(X_t^{\pi,x_n})\big|\,dt,$$
and the event $C := \{\tau_a^\infty(X^{\pi,x_n}) \le \epsilon\wedge\tau_{-\infty}^b(X^{\pi,x_n})\}$. Then we have
$$|V_\pi(x_n) - g(a)| \le E\big[(A_a^b + B_a^b)I_{\Omega\setminus C} + (A_a^b + B_a^b)I_C\big].$$
We now show that there exists $M > 0$, which does not depend on $\epsilon$, such that $|V_\pi(x_n) - g(a)|$ is bounded above by $4M\epsilon$ for all $n\ge n_0$. The expectation on the event $\Omega\setminus C$, which has probability less than $2\epsilon$, is smaller than $2M\epsilon$ since $f, g$ are bounded and $\alpha\ge\epsilon_0 > 0$. On the event $C$ we have $\tau_a^b(X^{\pi,x_n})\le\epsilon$, which implies $E A_a^b I_C < M\epsilon$. On $C$ it also holds that $X^{\pi,x_n}_{\tau_a^b(X^{\pi,x_n})} = a$. Hence the elementary inequality $1 - e^{-x}\le x$ for $x\ge0$ yields an upper bound on $E B_a^b I_C$ of the form $|g(a)|\,E\int_0^\epsilon \alpha_\pi(X_t^{\pi,x_n})\,dt$, which is at most $M\epsilon$ since $\alpha$ is bounded. This concludes the proof. □

Lemma 12 is the analogue of Lemma 10, with an analogous proof, which we omit for brevity.

Lemma 12. The following holds for every Markov policy $\pi$, $x\in(a,b)$ and stopping time $\rho$:
$$E\Big[F_{\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + e^{-\int_0^{\tau_a^b(X^{\pi,x})}\alpha_\pi(X_t^{\pi,x})\,dt}\, g\big(X^{\pi,x}_{\tau_a^b(X^{\pi,x})}\big)\, I_{\{\tau_a^b(X^{\pi,x})<\infty\}} \,\Big|\, \mathcal{F}_\rho\Big] = M_\rho,$$
where $M_r := F_{r\wedge\tau_a^b(X^{\pi,x})}(X^{\pi,x}) + I_{\{r<\infty\}}\, e^{-\int_0^{r\wedge\tau_a^b(X^{\pi,x})}\alpha_\pi(X_s^{\pi,x})\,ds}\, V_\pi\big(X^{\pi,x}_{r\wedge\tau_a^b(X^{\pi,x})}\big)$ for $r\in[0,\infty]$. In particular, the process $M = (M_r)_{r\in[0,\infty]}$ is a uniformly integrable martingale.
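The boundary continuity established in Lemma 11 is easy to observe numerically. The following Monte Carlo sketch is illustrative only: the constant policy and the model data $(\mu, \sigma, \alpha, f, g)$ below are hypothetical stand-ins satisfying the assumptions of the lemma, and $V_\pi(x_0)$ is estimated by Euler-Maruyama simulation for starting points $x_0$ approaching the boundary point $a$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model data on (a, b) = (0, 1): bounded drift, sigma^2 bounded
# away from 0 and alpha >= eps_0 > 0, as the standing assumptions require.
a, b = 0.0, 1.0
mu, sigma, alpha = 0.2, 1.0, 1.0
f = lambda y: np.cos(np.pi * y)               # bounded running cost
g = lambda y: 0.0 if y <= a else 1.0          # boundary payoff: g(a) = 0, g(b) = 1

def payoff(x0, n_paths=1000, dt=1e-3, t_max=5.0):
    """Euler-Maruyama estimate of V_pi(x0) for a constant policy pi."""
    total = 0.0
    for _ in range(n_paths):
        x, t, disc, acc = x0, 0.0, 1.0, 0.0
        while a < x < b and t < t_max:        # run until the exit time of (a, b)
            acc += disc * f(x) * dt           # accumulate discounted running cost
            x += mu * dt + sigma * np.sqrt(dt) * rng.standard_normal()
            disc *= np.exp(-alpha * dt)       # discount factor e^{-int_0^t alpha}
            t += dt
        # discounted boundary payoff; truncating at t_max contributes at most
        # e^{-alpha * t_max}, which is negligible here
        total += acc + disc * g(float(np.clip(x, a, b)))
    return total / n_paths

# As x0 decreases to a, the estimates approach g(a) = 0, as in Lemma 11.
for x0 in [0.2, 0.1, 0.05, 0.01]:
    print(f"x0 = {x0:5.2f}   V_hat = {payoff(x0):+.3f}")
```

Under these stand-in data the estimates decrease towards $g(a) = 0$ as $x_0\downarrow a$, reflecting the fact that the exit time through $a$ becomes small, which is the mechanism exploited in the proof of Lemma 11.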
5.4. Proofs of results in Section 3.

Proof of Proposition 1 in dimension one. Recall that Assumption 1 holds. We need to show that for any locally Lipschitz Markov policy $\pi : (a,b)\to A$ we have $V_\pi\in C^2((a,b))$ and $L_\pi V_\pi - \alpha_\pi V_\pi + f_\pi = 0$. Let $a < a' < a'' < x < b'' < b' < b$ and, for any $c < d$, denote $\tau_c^d := \tau_c^d(X^{\pi,x})$. Let $v\in C^2((a',b'))\cap C([a',b'])$ be the unique solution of the boundary value problem $L_\pi v - \alpha_\pi v + f_\pi = 0$, $v(a') = V_\pi(a')$, $v(b') = V_\pi(b')$, guaranteed to exist by Theorem 19 in [10, p. 87], which is applicable by Assumption 1. Let
$$S_t^{a'',b''} := F_{t\wedge\tau_{a''}^{b''}}(X^{\pi,x}) + e^{-\int_0^{t\wedge\tau_{a''}^{b''}}\alpha_\pi(X_r^{\pi,x})\,dr}\, v\big(X^{\pi,x}_{t\wedge\tau_{a''}^{b''}}\big).$$
Then, by Itô's formula on $[0,\tau_{a''}^{b''}]$ and the definition of $v$, the process $S^{a'',b''} = (S_t^{a'',b''})_{t\ge0}$ satisfies
$$S_t^{a'',b''} = v(x) + \int_0^{t\wedge\tau_{a''}^{b''}} e^{-\int_0^s\alpha_\pi(X_r^{\pi,x})\,dr}\,\big(\sigma_\pi v'\big)(X_s^{\pi,x})\,dB_s.$$
Hence $S^{a'',b''}$ is clearly a uniformly integrable martingale and the following equalities hold: $\lim_{t\uparrow\infty} E\big|S_t^{a'',b''} - S_\infty^{a'',b''}\big| = 0$ and $v(x) = E S_\infty^{a'',b''}$. Define $S^{a',b'}$ by substituting $\tau_{a''}^{b''}$ in the definition of $S^{a'',b''}$ with $\tau_{a'}^{b'}$. Since $X^{\pi,x}$ is continuous, we have $\lim_{a''\downarrow a',\, b''\uparrow b'}\tau_{a''}^{b''} = \tau_{a'}^{b'}$ a.s. Hence, by the DCT, $v(x) = \lim_{a''\downarrow a',\, b''\uparrow b'} E S_\infty^{a'',b''} = E S_\infty^{a',b'}$.

Note that the boundary conditions for $v$, the fact $X^{\pi,x}_{\tau_{a'}^{b'}}\in\{a',b'\}$ and Lemma 12 (with $\rho = \tau_{a'}^{b'}$) imply
$$S_\infty^{a',b'} = E\Big(F_{\tau_a^b}(X^{\pi,x}) + e^{-\int_0^{\tau_a^b}\alpha_\pi(X_r^{\pi,x})\,dr}\, g\big(X^{\pi,x}_{\tau_a^b}\big)\, I_{\{\tau_a^b<\infty\}}\,\Big|\,\mathcal{F}_{\tau_{a'}^{b'}}\Big).$$
Taking expectations on both sides of this equality yields $v(x) = V_\pi(x)$. □

Proof of Theorem 2 in dimension one. We claim that, under Assumptions 1–2, the inequality $V_{\pi_{n+1}}(x)\le V_{\pi_n}(x)$ holds for all $x\in(a,b)$ and $n\in\mathbb{N}$, where $\pi_{n+1}$ is defined in (7). Define the process $Y$ as in (21) in Section 5.2 and consider the stopped process $Y_{\cdot\wedge\tau_{a'}^{b'}}$, where $a < a' < x < b' < b$ and $\tau_c^d := \tau_c^d(X^{\pi_{n+1},x})$ for any $c < d$. Then the proof follows the same steps as the proof of Theorem 2 in Section 5.2. The only difference is that in the penultimate line of the proof in Section 5.2 we apply the DCT and Lemma 11 (instead of the DCT only) to obtain $V_{\pi_n}(x)\ge V_{\pi_{n+1}}(x)$. □

The proof of the one-dimensional case of Proposition 3 is completely analogous to the multi-dimensional one and is hence omitted.

Proof of Theorem 4 in one dimension. We need to show that $V_{\lim}(x) = V_{\pi_{\lim}}(x)$ holds for all $x\in(a,b)$. The proof follows along the same lines as in the multi-dimensional case of Section 5.2. The only difference lies in the fact that we stop the process $X^{\pi_{\lim},x}$ at $\tau_{a'}^{b'}(X^{\pi_{\lim},x})$, where $a < a' < x < b' < b$, and take the limit as $(a',b',t)\to(a,b,\infty)$. □

The verification lemma in the one-dimensional case is established exactly as in the proof of Theorem 5. The details are omitted.

References

[1] A. Arapostathis, On the policy iteration algorithm for nondegenerate controlled diffusions under the ergodic criterion. Optimization, Control, and Applications of Stochastic Systems, 1–12, 2012.
[2] A. Arapostathis and V. S. Borkar, Uniform recurrence properties of controlled diffusions and applications to optimal control. SIAM J. Control Optim., 48(7):152–160, 2010.
[3] P. Azimzadeh, E. Bayraktar and G. Labahn, Convergence of approximation schemes for weakly nonlocal second order equations. arXiv:1705.02922v2, 2017.
[4] D. P. Bertsekas, Approximate policy iteration: a survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.
[5] A. N. Borodin and P. Salminen, Handbook of Brownian Motion – Facts and Formulae. Birkhäuser, Basel, second edition, 2002.
[6] A. S. Cherny and H.-J. Engelbert, Singular Stochastic Differential Equations. Lecture Notes in Mathematics, Springer-Verlag, Berlin Heidelberg, 2005.
[7] B. T. Doshi, Continuous time control of Markov processes on an arbitrary state space: discounted rewards. The Annals of Statistics, 4(6):1219–1235, 1976.
[8] E. B. Dynkin, Markov Processes, vol. I. Springer-Verlag, Berlin, 1965.
[9] E. Fournié, J.-M. Lasry, J. Lebuchoux, P.-L. Lions and N. Touzi, Applications of Malliavin calculus to Monte Carlo methods in finance. Finance and Stochastics, 3:391–412, 1999.
[10] A. Friedman, Partial Differential Equations of Parabolic Type. Prentice-Hall, Englewood Cliffs, N.J., 1964.
[11] O. Hernández-Lerma and J. B. Lasserre, Policy iteration for average cost Markov control processes on Borel spaces. Acta Applicandae Mathematicae, 47(2):125–154, 1997.
[12] A. Hordijk and M. L. Puterman, On the convergence of policy iteration in finite state undiscounted Markov decision processes: the unichain case. Mathematics of Operations Research, 12(1):163–176, 1987.
[13] R. A. Howard, Dynamic Programming and Markov Processes. MIT Press, Cambridge, 1960.
[14] I. Karatzas and S. E. Shreve, Brownian Motion and Stochastic Calculus. Springer-Verlag, New York, second edition, 1991.
[15] N. V. Krylov, Controlled Diffusion Processes. Springer-Verlag, New York, 1980.
[16] M. G. Lagoudakis and R. Parr, Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[17] J. B. Lasserre, A new policy iteration scheme for Markov decision processes using Schweitzer's formula. Journal of Applied Probability, 31(1):268–273, 1994.
[18] T. Lindvall and L. C. G. Rogers, Coupling of multidimensional diffusions by reflection. The Annals of Probability, 14(3):860–872, 1986.
[19] S. Mahadevan, Representation policy iteration. arXiv:1207.1408 [cs.AI], 2012.
[20] S. P. Meyn, The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Transactions on Automatic Control, 42(12):1663–1680, 1997.
[21] P. E. Protter, Stochastic Integration and Differential Equations. Stochastic Modelling and Applied Probability, Springer-Verlag, Berlin Heidelberg, 2004.
[22] J. P. Rust, A comparison of policy iteration methods for solving continuous-state, infinite-horizon Markovian decision problems using random, quasi-random, and deterministic discretizations. Available at http://ssrn.com/abstract=37768, 1997.
[23] M. S. Santos and J. Rust, Convergence properties of policy iteration. SIAM Journal on Control and Optimization, 42(6):2094–2115, 2004.
[24] Q. X. Zhu, X. S. Yang and C. X. Huang, Policy iteration for continuous-time average reward Markov decision processes in Polish spaces. Abstract and Applied Analysis, 2009, Article ID 103723, 17 pages.
Department of Statistics, University of Warwick, and, The Alan Turing Institute, UK E-mail address: s.d.jacka@warwick.ac.uk Department of Mathematics, King’s College London, and, The Alan Turing Institute, UK E-mail address: aleksandar.mijatovic@kcl.ac.uk Department of Statistics, University of Warwick, UK E-mail address: d.siraj@warwick.ac.uk