Bayesian Reinforcement Learning: A Survey
Foundations and Trends® in Machine Learning
Vol. 8, No. 5-6 (2015) 359–492
© 2015 M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar
DOI: 10.1561/2200000049
arXiv:1609.04436v1 [cs.AI] 14 Sep 2016
Mohammad Ghavamzadeh
Adobe Research & INRIA
mohammad.ghavamzadeh@inria.fr
Shie Mannor
Technion
shie@ee.technion.ac.il
Joelle Pineau
McGill University
jpineau@cs.mcgill.ca
Aviv Tamar
University of California, Berkeley
avivt@berkeley.edu
Contents

1 Introduction 3

2 Technical Background 11
2.1 Multi-Armed Bandits 11
2.2 Markov Decision Processes 14
2.3 Partially Observable Markov Decision Processes 18
2.4 Reinforcement Learning 20
2.5 Bayesian Learning 22

3 Bayesian Bandits 29
3.1 Classical Results 30
3.2 Bayes-UCB 33
3.3 Thompson Sampling 33

4 Model-based Bayesian Reinforcement Learning 41

5 Model-free Bayesian Reinforcement Learning 75

6 Risk-aware Bayesian Reinforcement Learning 101

7 BRL Extensions 109

8 Outlook 117

Acknowledgements 121
Appendices 123
References 131
1
Introduction
An Example Domain
We present an illustrative domain suitable to be solved using the BRL
techniques surveyed in this paper. This running example will be used
throughout the paper to elucidate the difference between the various
BRL approaches and to clarify various BRL concepts.
Example 1.1 (The Online Shop). In the online shop domain, a retailer
aims to maximize profit by sequentially suggesting products to online
shopping customers. Formally, the domain is characterized by the fol-
lowing model:
[Figure 1.1: A taxonomy of the Bayesian RL approaches surveyed in this paper. Bandits (Section 3): Bayes-UCB, Thompson sampling. Model-based BRL (Section 4): offline value approximation (finite-state controllers, BEETLE); online near-myopic value approximation (Bayesian DP, VOI heuristic); online tree search approximation (forward search, Bayesian sparse sampling, HMDP, BFS3, branch-and-bound search, BAMCP); exploration bonus approximation (BOSS, BEB, VBRB, BOLT). Model-free BRL (Section 5): value function algorithms (GPTD, GPSARSA); policy gradient algorithms (Bayesian quadrature, two Bayesian models for estimating the policy gradient); actor-critic algorithms (GPTD + Bayesian policy gradient).]
Throughout the paper, we revisit the online shop domain, and spec-
ify explicit configurations that are relevant to the surveyed methods.
2
Technical Background
Definition 2.1 (Regret). Let $a^* \in \arg\max_{a \in A} \mathbb{E}_{y \sim P(\cdot|a)}[r(y)]$ denote the optimal arm.
Example 2.1 (Online shop bandit setting). Recall the online shop domain of Example 1.1. In the MAB setting, there is no state information available to the retailer.
A rule according to which the agent selects its actions at each possible
state, or policy, is defined as a mapping from past observations to a
distribution over the set of actions. A policy is called Markov if the
distribution depends only on the last state of the observation sequence.
A policy is called stationary if it does not change over time. A stationary
Markov policy $\pi(\cdot|s) \in \mathcal{P}(A)$ is a probability distribution over the set of actions given a state $s \in S$. A policy is deterministic if the probability distribution concentrates on a single action for all histories. A deterministic policy is identified by a mapping from the set of states to the set of actions, i.e., $\pi : S \to A$. In the rest of the paper, we use
the term policy to refer to stationary Markov policies.
The MDP controlled by a policy $\pi$ induces a Markov chain $M^\pi$ with reward distribution $q^\pi(\cdot|s) = q\big(\cdot|s, \pi(s)\big)$, such that $R^\pi(s) = R\big(s, \pi(s)\big)$. The probability of observing a path $\xi = \{z_0, z_1, \ldots, z_T\}$ under policy $\pi$ is
$$\Pr(\xi|\pi) = P_0^\pi(z_0) \prod_{t=1}^{T} P^\pi(z_t \mid z_{t-1}).$$
All assumptions are similar to MDPs, with the addition that $\mathcal{P}(O)$ is the set of probability distributions on (Borel) subsets of $O$, and $\Omega$ is stationary. As a motivation for the POMDP model, we revisit the online shop domain.
Example 2.2 (Online shop hidden state setting). Recall the online shop domain of Example 1.1. In a realistic scenario, some features of the customer, such as gender or age, might not be visible to the decision maker due to privacy or other reasons. In such a case, the MDP model, which requires full state information for making decisions, is not suitable. In the POMDP model, only observable quantities, such as the items in the shopping cart, are used for making decisions.

Since the state is not directly observed, the agent must rely on the recent history of actions and observations, $\{o_{t+1}, a_t, o_t, \ldots, o_1, a_0\}$, to maintain a distribution over its current state, known as the belief.
In that case, the best action, $\pi(b_t)$, is the one associated with the $\alpha$-vector that returns the best value.
The belief as defined here can be interpreted as a state in a particu-
lar kind of MDP, often called the belief-MDP. The main intuition is that
for any partially observable system with known parameters, the belief is fully observable and can be computed exactly for small enough state spaces (and approximated for larger ones). Thus, the planning problem is
in fact one of planning in an MDP, where the state space corresponds
to the belief simplex. This does not lead to any computational advan-
tage, but conceptually, suggests that known results and methods for
MDPs can also be applied to the belief-MDP.
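To make the belief-MDP view concrete, the following minimal sketch computes the belief update that serves as the state transition of the belief-MDP. The transition and observation tables and the tiny example numbers are illustrative assumptions, not part of the text.

```python
# A minimal sketch of the POMDP belief update underlying the belief-MDP.
# P[a, s, s'] and O[a, s', o] are illustrative, assumed model tensors.
import numpy as np

def belief_update(b, a, o, P, O):
    """Compute b'(s') proportional to O[a, s', o] * sum_s P[a, s, s'] * b(s)."""
    b_pred = b @ P[a]              # predictive state distribution after action a
    b_new = O[a, :, o] * b_pred    # weight by the likelihood of observation o
    return b_new / b_new.sum()     # normalize to obtain the posterior belief

# Tiny 2-state, 1-action, 2-observation example (arbitrary numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # P[a, s, s']
O = np.array([[[0.7, 0.3], [0.1, 0.9]]])   # O[a, s', o]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, P=P, O=O))
```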
3. We observe data Y = y.
$$Y_T = H F_T + N_T, \tag{2.12}$$
where $F_T = \big(F(x_1), \ldots, F(x_T)\big)^\top$, $Y_T = (y_1, \ldots, y_T)^\top$, and $N_T \sim \mathcal{N}(0, \Sigma)$. Here $[\Sigma]_{i,j}$ is the measurement noise covariance between the $i$th and $j$th samples. In the linear statistical model, we then have $F_T \sim \mathcal{N}(\bar{f}, K)$, where $\bar{f} = \big(\bar{f}(x_1), \ldots, \bar{f}(x_T)\big)^\top$ and $[K]_{i,j} = k(x_i, x_j)$ is a $T \times T$ kernel matrix. Since $F_T$ and $N_T$ are Gaussian and independent of each other, we have $Y_T \sim \mathcal{N}\big(H\bar{f},\, HKH^\top + \Sigma\big)$.

Considering a query point $x$, we then have
$$\begin{pmatrix} F(x) \\ F_T \\ N_T \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \bar{f}(x) \\ \bar{f} \\ 0 \end{pmatrix}, \begin{pmatrix} k(x,x) & k(x)^\top & 0 \\ k(x) & K & 0 \\ 0 & 0 & \Sigma \end{pmatrix} \right),$$
where $k(x) = \big(k(x_1, x), \ldots, k(x_T, x)\big)^\top$. Using Eq. 2.12, we have the following transformation:
$$\begin{pmatrix} F(x) \\ F_T \\ Y_T \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & I & 0 \\ 0 & H & I \end{pmatrix} \begin{pmatrix} F(x) \\ F_T \\ N_T \end{pmatrix}, \tag{2.13}$$
where $I$ is the identity matrix. From Eq. 2.13, we have
$$\begin{pmatrix} F(x) \\ F_T \\ Y_T \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \bar{f}(x) \\ \bar{f} \\ H\bar{f} \end{pmatrix}, \begin{pmatrix} k(x,x) & k(x)^\top & k(x)^\top H^\top \\ k(x) & K & KH^\top \\ Hk(x) & HK & HKH^\top + \Sigma \end{pmatrix} \right).$$
The posterior moments of $F$, conditioned on the observed data $\mathcal{D}_T$, are therefore
$$\mathbb{E}\big[F(x)|\mathcal{D}_T\big] = \bar{f}(x) + k(x)^\top H^\top (HKH^\top + \Sigma)^{-1} (y_T - H\bar{f}),$$
$$\mathrm{Cov}\big[F(x), F(x')|\mathcal{D}_T\big] = k(x, x') - k(x)^\top H^\top (HKH^\top + \Sigma)^{-1} H k(x'), \tag{2.14}$$
which may be written more compactly as
$$\mathbb{E}\big[F(x)|\mathcal{D}_T\big] = \bar{f}(x) + k(x)^\top \alpha, \qquad \mathrm{Cov}\big[F(x), F(x')|\mathcal{D}_T\big] = k(x, x') - k(x)^\top C\, k(x'), \tag{2.15}$$
where
$$\alpha = H^\top (HKH^\top + \Sigma)^{-1} (y_T - H\bar{f}), \qquad C = H^\top (HKH^\top + \Sigma)^{-1} H. \tag{2.16}$$
From Eqs. 2.15 and 2.16, we can conclude that $\alpha$ and $C$ are sufficient statistics for the posterior moments.
If we set the transformation $H$ to be the identity and assume that the noise terms corrupting each sample are i.i.d. Gaussian, i.e., $N_T \sim \mathcal{N}(0, \sigma^2 I)$, where $\sigma^2$ is the variance of each noise term, the linear statistical model reduces to the standard linear regression model. In this case, the posterior moments become
$$\mathbb{E}\big[F(x)|\mathcal{D}_T\big] = \bar{f}(x) + k(x)^\top (K + \sigma^2 I)^{-1}(y_T - \bar{f}),$$
$$\mathrm{Cov}\big[F(x), F(x')|\mathcal{D}_T\big] = k(x, x') - k(x)^\top (K + \sigma^2 I)^{-1} k(x'), \tag{2.17}$$
i.e., Eq. 2.16 with $H = I$ and $\Sigma = \sigma^2 I$.
In the parametric setting, $F(x) = \phi(x)^\top W$, where $\phi(\cdot) = \big(\phi_1(\cdot), \ldots, \phi_n(\cdot)\big)^\top$ is the feature vector and $W = (W_1, \ldots, W_n)^\top$ is the weight vector, with prior $W \sim \mathcal{N}(\bar{w}, S_w)$. The model equation (2.11) now becomes
$$Y_T = H \Phi^\top W + N_T, \tag{2.19}$$
where $\Phi = \big[\phi(x_1), \ldots, \phi(x_T)\big]$ is an $n \times T$ feature matrix. Finally, since $F(x) = \phi(x)^\top W$, the posterior mean and covariance of $F$ can be easily computed as
$$\mathbb{E}\big[F(x)|\mathcal{D}_T\big] = \phi(x)^\top \bar{w} + \phi(x)^\top S_w \Phi H^\top \big( H \Phi^\top S_w \Phi H^\top + \Sigma \big)^{-1} \big( y_T - H \Phi^\top \bar{w} \big),$$
$$\mathrm{Cov}\big[F(x), F(x')|\mathcal{D}_T\big] = \phi(x)^\top S_w \phi(x') - \phi(x)^\top S_w \Phi H^\top \big( H \Phi^\top S_w \Phi H^\top + \Sigma \big)^{-1} H \Phi^\top S_w \phi(x'). \tag{2.20}$$
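To make the Gaussian-process machinery of this section concrete, the following sketch computes the posterior moments of Eq. 2.17 (the $H = I$, i.i.d.-noise case). The RBF kernel, noise level, and toy data are illustrative assumptions, not part of the text.

```python
# A small sketch of GP regression using the posterior moments of Eq. 2.17.
# The RBF kernel, noise level, and data below are assumed for illustration.
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

def gp_posterior(X, y, Xq, sigma=0.1, f_bar=lambda x: np.zeros_like(x)):
    K = rbf(X, X)                                   # [K]_{i,j} = k(x_i, x_j)
    kq = rbf(X, Xq)                                 # kernel vectors for the queries
    A = np.linalg.solve(K + sigma**2 * np.eye(len(X)), np.eye(len(X)))
    alpha = A @ (y - f_bar(X))                      # (K + sigma^2 I)^{-1} (y_T - f_bar)
    mean = f_bar(Xq) + kq.T @ alpha                 # E[F(x) | D_T]
    cov = rbf(Xq, Xq) - kq.T @ A @ kq               # Cov[F(x), F(x') | D_T]
    return mean, cov

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X) + 0.1 * np.random.randn(3)
mean, cov = gp_posterior(X, y, Xq=np.linspace(0, 2, 5))
```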
3
Bayesian Bandits

3.1 Classical Results
In their seminal work, Lai and Robbins [Lai and Robbins, 1985] proved
asymptotically tight bounds on the frequentist regret in terms of the
Kullback-Leibler (KL) divergence between the distributions of the re-
wards of the different arms. These bounds grow logarithmically with
the number of steps T , such that regret is O (ln T ). Mannor and Tsit-
siklis [Mannor and Tsitsiklis, 2004] later showed non-asymptotic lower bounds with a similar logarithmic dependence on $T$. For the Bayesian regret, the lower bound is of order $\sqrt{KT}$ (see, e.g., Theorem 3.5 of Bubeck and Cesa-Bianchi [Bubeck and Cesa-Bianchi, 2012]).
In the Bayesian setting, and for models that admit sufficient statis-
tics, Gittins [Gittins, 1979] showed that an optimal strategy may be
found by solving a specific MDP planning problem. The key observa-
tion here is that the dynamics of the posterior for each arm may be
represented by a special MDP termed a bandit process [Gittins, 1979].
The objective is then to maximize the $\gamma$-discounted reward
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{i_t}\big(s_{i_t}(t), 1\big)\right].$$
The derivation of the optimal index policy is technically involved and beyond the scope of this survey. For reference see [Gittins, 1979], and also [Tsitsiklis, 1994] for a simpler derivation and [Niño-Mora, 2011] for a finite-horizon setting.
Due to the technical complexity of calculating optimal Gittins
index policies, recent approaches concentrate on much simpler algo-
rithms that nonetheless admit optimal upper bounds (i.e., match the
order of their respective lower bounds up to constant factors) on the
expected regret.
3.2 Bayes-UCB
3.3 Thompson Sampling

Theorem 3.2. [Agrawal and Goyal, 2012] For the $K$-armed stochastic bandit problem, the TS algorithm has (frequentist) expected regret
$$\mathbb{E}\big[\mathrm{Regret}(T)\big] \leq O\left( \Big( \sum_{a \neq a^*} \frac{1}{\Delta_a^2} \Big)^{2} \ln T \right).$$

Theorem 3.3. [Kaufmann et al., 2012b] For any $\epsilon > 0$, there exists a problem-dependent constant $C(\epsilon, \mu)$ such that the regret of TS satisfies
$$\mathbb{E}\big[\mathrm{Regret}(T)\big] \leq (1 + \epsilon) \sum_{a \neq a^*} \frac{\Delta_a \big( \ln(T) + \ln\ln(T) \big)}{\mathrm{KL}(\mu_a, \mu_{a^*})} + C(\epsilon, \mu).$$

Theorem 3.4. [Gopalan et al., 2014] For $\epsilon, \delta \in (0, 1)$, there exists $T^* > 0$ such that for all $T > T^*$, with probability at least $1 - \delta$,
$$\sum_{a \neq a^*} N_T(a) \leq B + C \log T,$$
Note that the information ratio $\Gamma_t$ is also random and depends on the prior and on the
history of the algorithm up to time t. As Russo and Van Roy [Russo
and Van Roy, 2014a] explain, the numerator in Eq. 3.1 roughly captures
how much knowing that the selected action is optimal influences the
expected reward observed, while the denominator measures how much
on average, knowing which action is optimal changes the observations
at the selected action. Intuitively, the information ratio tends to be
small when knowing which action is optimal significantly influences
the anticipated observations at many other actions.
The following theorem relates a bound on t to a bound on the
Bayesian regret.
Russo and Van Roy [Russo and Van Roy, 2014a] further showed that, in general, $\Gamma_t \leq \sqrt{K/2}$, giving an order-optimal upper bound of $O\big(\sqrt{\tfrac{1}{2} H(p_1) K T}\big)$. However, structure between the arms may be exploited to bound the information ratio more tightly. For example, consider the case of linear optimization under bandit feedback, where $\mathcal{A} \subseteq \mathbb{R}^d$ and the reward satisfies $\mathbb{E}_{y \sim P(\cdot|a)}[r(y)] = a^\top \theta$. In this case, an order-optimal bound of $O\big(\sqrt{\tfrac{1}{2} H(p_1) d T}\big)$ holds [Russo and Van Roy, 2014a]. It is important to note that the term $H(p_1)$ is bounded by $\log(K)$, but in fact may be much smaller when there is an informative prior available.
An analysis of TS with a slightly different flavour was given by Guha and Munagala [Guha and Munagala, 2014], who studied the stochastic regret of TS, defined as the expected number of times a sub-optimal arm is chosen, where the expectation is Bayesian, i.e., taken with respect to $P_{\text{prior}}(\theta)$. For some horizon $T$ and prior $P_{\text{prior}}$, let $OPT(T, P_{\text{prior}})$ denote the stochastic regret of an optimal policy. Such a policy exists, and in principle may be calculated using dynamic programming (cf. the Gittins index discussion above). For the cases of two-armed bandits and $K$-MABs with Bernoulli point priors, Guha and Munagala [Guha and Munagala, 2014] show that the stochastic regret of TS, labeled $TS(T, P_{\text{prior}})$, is a 2-approximation of $OPT(T, P_{\text{prior}})$, namely $TS(T, P_{\text{prior}}) \leq 2\, OPT(T, P_{\text{prior}})$. Interestingly, and in contrast to the asymptotic regret results discussed above, this result holds for all $T$.
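As a concrete illustration of Thompson sampling, the sketch below implements the Bernoulli $K$-armed case with independent Beta(1, 1) priors; the arm probabilities and horizon are illustrative assumptions.

```python
# A minimal sketch of Thompson sampling for a Bernoulli K-armed bandit.
# The true reward probabilities and horizon below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.6]           # assumed, unknown to the learner
alpha = np.ones(len(true_means))       # Beta posterior parameters (successes + 1)
beta = np.ones(len(true_means))        # Beta posterior parameters (failures + 1)

for t in range(10_000):
    theta = rng.beta(alpha, beta)      # sample a model from the posterior
    a = int(np.argmax(theta))          # act greedily with respect to the sample
    r = rng.random() < true_means[a]   # Bernoulli reward
    alpha[a] += r                      # conjugate posterior update
    beta[a] += 1 - r

print(alpha / (alpha + beta))          # posterior means concentrate on true_means
```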
We conclude by noting that contextual bandits may be approached using Bayesian techniques in a very similar manner to the MAB algorithms described above. The only difference is that the unknown vector $\theta$ should now parameterize the outcome distribution given the action and context, $P(\cdot|a, s)$. Empirically, the efficiency of TS was demonstrated in an online-advertising application of contextual bandits [Chapelle and Li, 2011]. On the theoretical side, Agrawal and Goyal [Agrawal and Goyal, 2013b] study contextual bandits with rewards that depend linearly on the context, and show a frequentist regret bound of $O\big(d^2 \sqrt{T^{1+\epsilon}}\big)$, where $d$ is the dimension of the context vector and $\epsilon$ is an algorithm parameter that can be chosen in $(0, 1)$. For the same problem, Russo and Van Roy [Russo and Van Roy, 2014b] derive a Bayesian regret bound of order $O\big(d \sqrt{T} \ln T\big)$, which, up to logarithmic terms, matches the order of the $\Omega\big(d\sqrt{T}\big)$ lower bound for this problem [Rusmevichientong and Tsitsiklis, 2010].
4
Model-based Bayesian Reinforcement Learning
The key idea is to keep a joint posterior distribution over the model parameters and the true physical state of the system; a policy is then derived to select optimal actions with respect to this posterior.
Let $\theta_{s,a,s'}$ be the (unknown) probability of transitioning from state $s$ to state $s'$ when taking action $a$, $\theta_{s,a,r}$ the (unknown) probability of obtaining reward $r$ when taking action $a$ at state $s$, and $\theta$ the set of all such parameters. The belief $P_0(\theta)$ expresses our initial knowledge about the model parameters. We can then compute $b_t(\theta)$, the belief after a $t$-step trajectory $\{s_{t+1}, r_t, a_t, s_t, \ldots, s_1, r_0, a_0, s_0\}$. Considering a single observation $(s, a, r, s')$, we have
$$b_{t+1}(\theta) = \frac{\Pr(s', r \mid s, a, \theta)\, b_t(\theta)}{\int_{\Theta} \Pr(s', r \mid s, a, \theta')\, b_t(\theta')\, d\theta'}, \tag{4.1}$$
where $\mathbb{I}(\cdot)$ denotes the indicator function:
$$P'(s', \phi' \mid s, a, \phi) = \frac{\phi_{s,a,s'}}{\sum_{s'' \in S} \phi_{s,a,s''}}\; \mathbb{I}\big(\phi'_{s,a,s'} = \phi_{s,a,s'} + 1\big). \tag{4.2}$$
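Under an independent Dirichlet prior for each state-action pair, the posterior is summarized by transition counts, and the hyper-state update of Eq. 4.2 amounts to incrementing a single count. The following sketch illustrates this; the problem sizes and prior counts are illustrative assumptions.

```python
# A sketch of the Dirichlet-count representation of the BAMDP posterior: the
# hyper-state is (s, phi), where phi[s, a, s'] are Dirichlet counts, and
# observing a transition increments one count, in the spirit of Eq. 4.2.
import numpy as np

n_states, n_actions = 5, 2
phi = np.ones((n_states, n_actions, n_states))   # illustrative uniform prior counts

def mean_transition(phi, s, a):
    """Posterior-expected transition probabilities phi[s,a,.] / sum(phi[s,a,.])."""
    return phi[s, a] / phi[s, a].sum()

def update(phi, s, a, s_next):
    """Hyper-state transition: increment the count of the observed transition."""
    phi = phi.copy()
    phi[s, a, s_next] += 1
    return phi

phi = update(phi, s=0, a=1, s_next=0)
print(mean_transition(phi, s=0, a=1))
```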
Example 4.1 (The Chain problem). The 5-state Chain problem [Strens,
2000], shown in Figure 4.1, requires the MDP agent to select between
two abstract actions {1, 2}. Action 1 causes the agent to move to the right with probability 0.8 (effect “a” in Figure 4.1) and causes the agent to reset to the initial state with probability 0.2 (effect “b” in Figure 4.1). Action 2 causes the agent to reset with probability 0.8 (effect “b”) and causes the agent to move to the right with probability 0.2 (effect “a”). Effect “b” has a constant reward of +2. Rewards otherwise vary based on the state and effect (“a” and “b”), as shown in Figure 4.1. The optimal policy is to always choose action 1, causing the agent to potentially receive +10 several times until slipping back (randomly) to the initial state.
Figure 4.1: The Chain problem (figure reproduced from [Strens, 2000])
Finite-state controllers
Duff [Duff, 2001] suggests using Finite-State Controllers (FSC) to com-
pactly represent the optimal policy of a BAMDP, and then finding
the best FSC in the space of FSCs of some bounded size.
where
$$\mathrm{Gain}_{s,a}(q_{s,a}) = \begin{cases} \mathbb{E}[q_{s,a_2}] - q_{s,a} & \text{if } a = a_1 \text{ and } q_{s,a} < \mathbb{E}[q_{s,a_2}], \\ q_{s,a} - \mathbb{E}[q_{s,a_1}] & \text{if } a \neq a_1 \text{ and } q_{s,a} > \mathbb{E}[q_{s,a_1}], \\ 0 & \text{otherwise}, \end{cases}$$
assuming $a_1$ and $a_2$ denote the actions with the best and second-best expected values $\mathbb{E}[q_{s,a}]$, respectively.
Forward search
An approach of this type was used in the case of a partially observable
extension to the BAMDP [Ross et al., 2008a]. The idea here is to con-
sider the current hyper-state and build a fixed-depth forward search
tree containing all hyper-states reachable within some fixed finite plan-
ning horizon, denoted d. Assuming some default value for the leaves of
this tree, denoted $V_0(s, \phi)$, dynamic programming backups can be ap-
plied to estimate the expected return of the possible actions at the root
hyper-state. The action with highest return over that finite horizon is
executed and then the forward search is conducted again on the next
hyper-state. This approach is detailed in Algorithm 2 and illustrated
in Figure 4.2. The main limitation of this approach is the fact that for
most domains, a full forward search (i.e., without pruning of the search
tree) can only be achieved over a very short decision horizon, since
the number of nodes explored is $O(|S|^d)$, where $d$ is the search depth.
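The following sketch illustrates the flavour of such a fixed-depth forward search over hyper-states (state plus Dirichlet counts). It is a simplified stand-in for Algorithm 2, not a reproduction of it: the reward table, discount factor, and zero leaf values are illustrative assumptions.

```python
# A sketch of fixed-depth forward search over hyper-states (s, phi): expand all
# actions and next states up to depth d, back up expected values, and return
# the best root action. R, gamma, and V0 = 0 at the leaves are assumptions.
import numpy as np

def forward_search(s, phi, R, depth, gamma=0.95):
    """Return (value, best_action) of the root hyper-state (s, phi)."""
    if depth == 0:
        return 0.0, None                       # default leaf value V0(s, phi) = 0
    n_actions = phi.shape[1]
    q = np.zeros(n_actions)
    for a in range(n_actions):
        p = phi[s, a] / phi[s, a].sum()        # posterior-mean transition
        for s_next, prob in enumerate(p):
            phi_next = phi.copy()
            phi_next[s, a, s_next] += 1        # posterior update along the branch
            v_next, _ = forward_search(s_next, phi_next, R, depth - 1, gamma)
            q[a] += prob * (R[s, a] + gamma * v_next)
    best = int(np.argmax(q))
    return q[best], best
```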
Figure 4.2: An AND-OR tree constructed by the forward search process for the Chain problem. The top node contains the initial state 1 and the prior over the model $\phi_0$. After the first action, the agent can end up in either state 1 or state 2 of the Chain and update its posterior accordingly. Note that depending on which action was chosen (1 or 2) and the effect (“a” or “b”), different parameters of $\phi$ are updated, as per Algorithm 2.
the leaf nodes; this can either be a naive estimate, such as the immedi-
ate reward, maxa R(s, a), or a value computed from repeated Bellman
backups, such as the one used for the Bayesian dynamic programming
approach. The next algorithm we review proposes some solutions to
these problems.
We take this opportunity to draw the reader's attention to the sur-
vey paper on online POMDP planning algorithms [Ross et al., 2008c],
which provides a comprehensive review and empirical evaluation of a
range of search-based POMDP solving algorithms, including options for
combining offline and online methods in the context of forward search
trees. Some of these methods may be much more efficient than those
presented above and could be applied to plan in the BAMDP model.
a lower-bound at the leaf) and then prune nodes whenever they cannot
improve the lower-bound (assuming we are maximizing values). In the
context of BAMDPs, this can be used to prune hyper-state nodes in
the forward search tree [Paquet et al., 2005]. The challenge in this case
is to find good bounds; this is especially difficult given the uncertainty
over the underlying model. The method has been used in the context
of partially observable BAMDP [Png and Pineau, 2011, Png, 2011] us-
ing a naive heuristic, $\sum_{d=0}^{D} \gamma^d R_{\max}$, where $D$ is the search depth and $R_{\max}$ is the maximum reward. The method was applied successfully to
solve simulated dialogue management problems; computational scala-
bility was achieved via a number of structural constraints, including
the parameter tying method proposed by [Poupart et al., 2006].
The algorithm uses two policies (UCT and rollout) to traverse and grow the forward search
tree. Starting at the root node, for any node that has been previously
visited, it uses the UCT criteria
$$a = \arg\max_{a}\; Q(s, h, a) + c \sqrt{\frac{\log n(s, h)}{n(s, h, a)}}$$
to select actions. Along the way, it updates the statistics n(s, h) (the
number of times the node corresponding to state s and history h has
been visited in the search tree) and n(s, h, a) (the number of times
action a was chosen in this node); these are initialized at 0 when the
tree is created. Once it reaches a leaf node, instead of using a default
value function (or bound), it first selects any untried action (updating
its count to 1) and then continues to search forward using a rollout
policy until it reaches a given depth (or terminal node). The nodes
visited by the rollout policy are not added to the tree (i.e., no $n(\cdot)$ statistics are preserved).
To apply this to the Bayesian context, BAMCP must select actions
according to the posterior of the model parameters. Rather than sam-
pling multiple models from the posterior, BAMCP samples a single
model Pt at the root of the tree and uses this same model (without
posterior updating) throughout the search tree to sample next states,
after both UCT and rollout actions. In practice, to reduce computa-
tion in large domains, the model Pt is not sampled in its entirety at
the beginning of the tree building process, rather, it is generated lazily
as samples are required.
In addition, to further improve efficiency, BAMCP uses learning
within the forward rollouts to direct resources to important areas of
the search space. Rather than using a random policy for the rollouts,
a model-free estimate of the value function is maintained:
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha\Big( r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t) \Big),$$
and actions during rollouts are selected according to the $\epsilon$-greedy policy defined by this estimated Q function.
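The sketch below condenses the search loop just described: a single model is sampled from the posterior at the root of each simulation, UCT selects actions inside the tree, and a rollout policy is used below it. The posterior sampler, reward table, and constants are illustrative assumptions; for brevity only the transition model is sampled, and the learned rollout Q-function is replaced by a uniformly random rollout.

```python
# A condensed sketch of a BAMCP-style search loop with root sampling and UCT.
# sample_model(), R, actions, and the constants are illustrative assumptions.
import math, random
from collections import defaultdict

def bamcp_action(s0, sample_model, R, actions, n_sim=1000, depth=15,
                 gamma=0.95, c=1.4):
    N = defaultdict(int)      # n(s, h): visit count of a tree node
    Na = defaultdict(int)     # n(s, h, a): visit count of action a at the node
    Q = defaultdict(float)    # Q(s, h, a): value estimate inside the tree

    def rollout(P, s, d):
        if d == 0:
            return 0.0
        a = random.choice(actions)
        s2 = random.choices(range(len(P[s][a])), weights=P[s][a])[0]
        return R[s][a] + gamma * rollout(P, s2, d - 1)

    def simulate(P, s, h, d):
        if d == 0:
            return 0.0
        if N[(s, h)] == 0:                       # leaf node: expand, then roll out
            N[(s, h)] = 1
            return rollout(P, s, d)
        def ucb(a):                              # UCT criterion
            if Na[(s, h, a)] == 0:
                return float("inf")
            return Q[(s, h, a)] + c * math.sqrt(math.log(N[(s, h)]) / Na[(s, h, a)])
        a = max(actions, key=ucb)
        s2 = random.choices(range(len(P[s][a])), weights=P[s][a])[0]
        ret = R[s][a] + gamma * simulate(P, s2, h + ((a, s2),), d - 1)
        N[(s, h)] += 1
        Na[(s, h, a)] += 1
        Q[(s, h, a)] += (ret - Q[(s, h, a)]) / Na[(s, h, a)]   # running mean update
        return ret

    for _ in range(n_sim):
        simulate(sample_model(), s0, (), depth)  # root sampling from the posterior
    return max(actions, key=lambda a: Q[(s0, (), a)])
```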
BAMCP is shown to converge (in probability) to the optimal Bayesian policy (denoted $V^*(s, h)$ in general, or $V^*(s, \phi)$ when the posterior is constrained to a Dirichlet distribution). The main complexity result for BAMCP is based on the UCT analysis and shows that the error in estimating $V(s_t, h_t)$ decreases as $O\big(\log n(s_t, h_t) / n(s_t, h_t)\big)$.
Empirical evaluation of BAMCP with a number of simulation do-
mains has shown that it outperforms Bayesian Dynamic Programming,
the Value of Information heuristic, BFS3 [Asmuth and Littman, 2011],
as well as BEB [Kolter and Ng, 2009] and SmartSampler [Castro and
Precup, 2010], both of which will be described in the next section. A
good part of this success could be attributed to the fact that unlike
many forward search sparse sampling algorithms (e.g., BFS3), BAMCP
can take advantage of its learning during rollouts to effectively bias the
search tree towards good solutions.
We now review a class of algorithms for the BAMDP model that are
guaranteed to select actions such as to incur only a small loss compared
to the optimal Bayesian policy. Algorithmically, these approaches are
similar to those examined in §4.5 and typically require forward sam-
pling of the model and decision space. Analytically, these approaches
have been shown to achieve bounded error in a polynomial number
of steps using analysis techniques from the Probably Approximately
Correct (PAC) literature. These methods are rooted in earlier papers
showing that reinforcement learning in finite MDPs can be achieved
in a polynomial number of steps [Kearns and Singh, 1998, Brafman
and Tennenholtz, 2003, Strehl and Littman, 2005]. These earlier pa-
pers did not assume a Bayesian learning framework; the extension to
Bayesian learning was first established in the BOSS (Best of Sampled
Sets) approach.
The main idea behind many of the methods presented in this section
is the notion of Optimism in the Face of Uncertainty, which suggests
that, when in doubt, an agent should act according to an optimistic
model of the MDP; in the case where the optimistic model is correct,
the agent will gather reward, and if not, the agent will receive valuable
information from which it can infer a better model.
Empirical results show that the method can be used to solve small
domains (e.g., 5x5 grid) somewhat more effectively than a non-Bayesian
method such as RMAX [Brafman and Tennenholtz, 2003]. Results also
show that BFS3 can take advantage of structural assumptions in the
prior (e.g., parameter tying) to tackle much larger domains, up to $10^{16}$ states.
¹We have slightly modified the description of this theorem, and others below, to improve legibility of the paper. Refer to the original publication for full details.
there exists some MDP $M$ and parameter $\epsilon_0(\beta, p)$, such that with probability greater than $\delta_0 = 0.15$,
$$V^{A_t}(s_t) < V^*(s_t) - \epsilon_0$$
will hold for an unbounded number of time steps.
Empirical evaluation of BEB showed that in the Chain problem
(Figure 4.1), it could outperform a (non-Bayesian) PAC-MDP algo-
rithm in terms of sample efficiency and find the correct optimal policy.
$$V_t(s, b) = \max_{a \in A} \Big( R(s, b, a) + \tilde{R}_{s,b,a} + \sum_{s' \in S} P(s'|s, b, a)\, V_{t-1}(s', b) \Big),$$
where the reward bonus $\tilde{R}_{s,b,a}$ is defined as
$$\tilde{R}_{s,b,a} = \beta_R\, \sigma_{R(s,b,a)} + \beta_P \sqrt{\sum_{s' \in S} \sigma^2_{P(s'|s,b,a)}},$$
with
$$\sigma^2_{R(s,b,a)} = \int_\Theta R(s, \theta, a)^2\, b(\theta)\, d\theta - R(s, b, a)^2, \tag{4.4}$$
$$\sigma^2_{P(s'|s,b,a)} = \int_\Theta P(s'|s, \theta, a)^2\, b(\theta)\, d\theta - P(s'|s, b, a)^2. \tag{4.5}$$
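For a Dirichlet posterior over transitions, the variance terms in Eq. 4.5 have a closed form, so the bonus can be computed directly from the counts. The following sketch shows the transition part of the bonus; the coefficient $\beta_P$ and the counts are illustrative assumptions.

```python
# A sketch of the transition part of the variance-based reward bonus:
# beta_P * sqrt(sum_{s'} Var[P(s'|s,a)]) for a Dirichlet(phi_sa) posterior.
import numpy as np

def transition_variance(phi_sa):
    """Dirichlet component variances: a_i (a_0 - a_i) / (a_0^2 (a_0 + 1))."""
    a0 = phi_sa.sum()
    return phi_sa * (a0 - phi_sa) / (a0 ** 2 * (a0 + 1))

def reward_bonus(phi_sa, beta_P=1.0):
    return beta_P * np.sqrt(transition_variance(phi_sa).sum())

print(reward_bonus(np.array([1.0, 1.0, 1.0])))     # large bonus for an unvisited pair
print(reward_bonus(np.array([50.0, 30.0, 20.0])))  # small bonus after many visits
```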
policy with respect to the MDP induced by its posterior, from its current state, on all but
$$O\left( \frac{\sum_{s,a} C(s,a)}{\epsilon^2 (1-\gamma)^2}\, \ln\frac{1}{\delta}\, \ln\frac{1}{\epsilon(1-\gamma)} \right)$$
time steps, with probability at least $1 - 4\delta$.
For the case where uncertainty over the transition model is modelled
using independent Dirichlet priors (as we have considered throughout
this section), the sample complexity of this approach decreases at a rate of $O(1/\sqrt{n})$. This is consistent with other PAC-MDP results,
which also bound the sample complexity to achieve small error with
respect to the optimal policy of the true underlying MDP. However, it
is not as fast as the BEB approach, which bounds error with respect
to the best Bayesian policy.
Empirical results for the variance-based approach show that it is
competitive with BEETLE, BOSS and BEB on versions of the Chain
problem that use parameter tying to reduce the space of model un-
certainty, and shows better performance for the variant of the domain
where uncertainty over all state-action pairs is modelled with an inde-
pendent Dirichlet distribution. Results also show superior performance
on a $4 \times 4$ grid-world inspired by the Wumpus domain of [Russell and
Norvig, 2002].
[Table 4.1: Summary of the online tree search approximation methods. For each algorithm (e.g., HMDP, BAMCP, BFS3) the table lists the search depth (fixed or variable $d$), the action and next-state selection rules (random actions with $s' \sim \Pr(s'|s,a,\theta)$ for HMDP, the UCT criterion for BAMCP, $\arg\max_a U(s,\phi,a)$ and $\arg\max_{s'}\big(U(s') - L(s')\big)$ for BFS3), how models are sampled (re-sampled at each node, or a single sample at the root for BAMCP), how leaves are evaluated (backups in the tree with LP or heuristic values at the leaves, rollouts for BAMCP), and the known guarantees (not known for HMDP, Bayes-optimal in the limit for BAMCP, PAC guarantees for BFS3).]
$$s_t = g_T(s_{t-1}, a_{t-1}, V_t),$$
We can extend the BAMDP model to capture uncertainty over the parameters of a POMDP (as defined in §2.3) by introducing the Bayes-Adaptive POMDP (BAPOMDP) model [Ross et al., 2008a, 2011]. Given a POMDP model, $\langle S, A, O, P, \Omega, P_0, R \rangle$, we define a corresponding BAPOMDP as follows:

Here $\phi$ and $\psi$ capture the space of Dirichlet parameters for the conjugate prior over the transition and observation functions, respectively. The Bayesian approach to learning the transition $P$ and observation $\Omega$ involves starting with prior distributions, which we denote $\phi_0$ and $\psi_0$, and maintaining the posterior distribution over $\phi$ and $\psi$ after observing the history of action-observation-rewards. The belief can be tracked using a Bayesian filter, with appropriate handling of the observations.
where $Q(\cdot)$ is just Eq. 4.9 without the maximization over actions and $a^*$ is the optimal action at $b_t(s, \phi, \psi)$. Because the second term is independent of the action choice, to minimize the Bayes risk, we simply consider maximizing the first term over actions. The analysis of this algorithm has yielded a bound on the number of samples needed to ensure $\epsilon$-error. The bound is quite loose in practice, but at least pro-
vides some upper-bound on the number of queries needed to achieve
good performance. Note that this analysis is based on the Bayes risk
criterion that provides a myopic view of uncertainty (i.e., it assumes
that the next action will resolve the uncertainty over models).
The planning approach suggested by Ross et al. [2011] aims to ap-
proximate the optimal BAPOMDP strategy by employing a forward
search similar to that outlined in Algorithm 2. In related work, Png
and Pineau [2011] use a branch-and-bound algorithm to approximate
the BAPOMDP solution. Many of the other techniques outlined in
Table 4.1 could also be extended to the BAPOMDP model.
Finally, it is worth mentioning that the method of Dallaire et al.
[2009], described in §4.8, is also able to handle continuous partially
observable domains by using an additional Gaussian process for the
observation function.
The work and methods presented above focus on the case where the
model representation consists of a discrete (flat) set of states. For many
larger domains, it is common to assume that the states are arranged in a
more sophisticated structure, whether a simple clustering or more com-
plicated hierarchy. It is therefore interesting to consider how Bayesian
reinforcement learning methods can be used in those cases.
The simplest case, called parameter tying in previous sections, corre-
sponds to the case where states are grouped according to a pre-defined
clustering assignment. In this case, it is common to aggregate learned
parameters according to this assignment [Poupart et al., 2006, Sorg
et al., 2010, Png and Pineau, 2011]. The advantage is that there are
fewer parameters to estimate, and thus, learning can be achieved with
fewer samples. The main downside is that this requires a hand-coded
assignment. In practice, it may also be preferable to use a coarser clus-
tering than is strictly correct in order to reduce variance (at the ex-
pense of more bias). This is a standard model selection problem.
More sophisticated approaches have been proposed to automate the
process of clustering. In cases where the (unknown) model parameters
5
Model-free Bayesian Reinforcement Learning
By defining $R_T = \big(R(s_0), \ldots, R(s_{T-1})\big)^\top$, $V_{T+1} = \big(V(s_0), \ldots, V(s_T)\big)^\top$, and $N_T = \big(N(s_0, s_1), \ldots, N(s_{T-1}, s_T)\big)^\top$, we write the above set of $T$ equations as
$$R_T = H V_{T+1} + N_T. \tag{5.6}$$
$$R_T = H V_T + N_T. \tag{5.9}$$
Figure 5.1: A graph illustrating the conditional independencies between the latent value variables $V(s_t)$ (bottom row), the noise variables $\Delta V(s_t)$ (top row), and the observable reward variables $R(s_t)$ (middle row) in the GPTD model. As in the case of GP regression, all of the $V(s_t)$ variables should be connected by arrows, due to the dependencies introduced by the prior; to avoid cluttering the diagram, this is indicated by the dashed frame surrounding them.
Assumption A2: The residuals $\Delta V_{T+1} = \big(\Delta V(s_0), \ldots, \Delta V(s_T)\big)^\top$ can be modeled as a Gaussian process.

By definition (Eq. 5.2), $\mathbb{E}[\Delta V(s)] = 0$ for all $s$. Using Assumption A3, it is easy to show that $\mathbb{E}[\Delta V(x_t)^2] = \mathrm{Var}[D(x_t)]$, which we denote by $\sigma_t^2$.
Eq. 5.11 indicates that the Gaussian noise process $N_T$ is colored, with a tri-diagonal covariance matrix. If we assume that $\sigma_t = \sigma$ for all $t = 0, \ldots, T$, then $\mathrm{diag}(\boldsymbol{\sigma}_{T+1}) = \sigma^2 I$ and Eq. 5.11 may be simplified and written as
$$\Sigma = \sigma^2 H H^\top = \sigma^2 \begin{pmatrix} 1+\gamma^2 & -\gamma & 0 & \cdots & 0 & 0 \\ -\gamma & 1+\gamma^2 & -\gamma & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1+\gamma^2 & -\gamma \\ 0 & 0 & 0 & \cdots & -\gamma & 1+\gamma^2 \end{pmatrix}.$$
Eq. 5.6 (or in case of episodic problems Eq. 5.9), along with the
measurement noise distribution of Eq. 5.11, and a prior distribution
for V (defined either parametrically or non-parametrically, see 2.5.2),
completely specify a statistical generative model relating the value and
reward random processes. In order to infer value estimates from a se-
quence of observed rewards, Bayes rule can be applied to this gener-
ative model to derive a posterior distribution over the value function
conditioned on the observed rewards.
In the case in which we place a Gaussian prior over V T , both V T
and N T are normally distributed, and thus, the generative model of
Eq. 5.6 (Eq. 5.9) will belong to the family of linear statistical models
discussed in §2.5.2. Consequently, both parametric and non-parametric treatments of this model (see §2.5.2) may be applied in full to the generative model of Eq. 5.6 (Eq. 5.9), with $H$ given by Eq. 5.7 (Eq. 5.10). Figure 5.2 demonstrates how the GPTD model described in this section is related to the family of linear statistical models and GP regression discussed in §2.5.2.
Figure 5.2: The connection between the GPTD model described in §5.1.1 and the family of linear statistical models and GP regression discussed in §2.5.2. The figure contrasts GP regression, where the observable process satisfies $Y_T = H F_T + N_T$ with Gaussian noise $N_T \sim \mathcal{N}(0, \Sigma)$ independent of $F$, with GPTD, where $R_T = H V_{T+1} + N_T$, the noise covariance is $\Sigma = H\, \mathrm{diag}(\boldsymbol{\sigma}_T)\, H^\top$, and
$$N(s_t, s_{t+1}) \stackrel{\text{def}}{=} \Delta V(s_t) - \gamma \Delta V(s_{t+1}), \qquad \Delta V(s_t) \stackrel{\text{def}}{=} D(s_t) - V(s_t).$$
$$\mathbb{E}\big[W \mid R_T = r_T\big] = \Phi H^\top \big( H \Phi^\top \Phi H^\top + \Sigma \big)^{-1} r_T, \qquad \mathbb{E}\big[V(s) \mid R_T = r_T\big] = k(s)^\top \alpha.$$
A closely related observation connects GPTD to the classical LSTD algorithm [Bradtke and Barto, 1996, Boyan, 1999]. The main idea is to obtain the value
of the weight vector W in the parametric GPTD model by carrying
out maximum likelihood (ML) inference (W is the value for which the
observed data is most likely to be generated by the stochastic GPTD
model). This allows us to derive LSTD($\lambda$) for each value of $\lambda \in [0, 1]$, as an ML solution arising from some GPTD generative model with a specific noise covariance matrix $\Sigma$.
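The following sketch shows non-parametric GPTD inference on a single trajectory, applying the sufficient statistics $\alpha$ and $C$ of §2.5.2 to the generative model $R_T = H V_{T+1} + N_T$ with $\Sigma = \sigma^2 H H^\top$. The RBF kernel, noise level, and data are illustrative assumptions.

```python
# A sketch of non-parametric GPTD: posterior value moments at a query state
# from one trajectory s_0..s_T and rewards r_0..r_{T-1}. Kernel and sigma are
# illustrative assumptions.
import numpy as np

def gptd_posterior(states, rewards, query, kernel, gamma=0.95, sigma=0.1):
    T = len(rewards)
    K = kernel(states, states)                       # (T+1) x (T+1) kernel matrix
    H = np.zeros((T, T + 1))
    H[np.arange(T), np.arange(T)] = 1.0
    H[np.arange(T), np.arange(T) + 1] = -gamma       # rows follow the [1, -gamma] pattern
    Sigma = sigma ** 2 * H @ H.T                     # colored noise covariance
    G = H @ K @ H.T + Sigma
    alpha = H.T @ np.linalg.solve(G, rewards)        # sufficient statistic alpha
    C = H.T @ np.linalg.solve(G, H)                  # sufficient statistic C
    kq = kernel(states, query)
    mean = kq.T @ alpha                              # E[V(s) | r_T]
    cov = kernel(query, query) - kq.T @ C @ kq       # posterior covariance
    return mean, cov

rbf = lambda a, b: np.exp(-0.5 * (np.atleast_1d(a)[:, None] - np.atleast_1d(b)[None, :]) ** 2)
m, c = gptd_posterior(np.array([0., 1., 2.]), np.array([1.0, 0.0]), np.array([1.5]), rbf)
```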
We first provide a brief overview of Bayesian quadrature and then present the BPG models and algorithms.
where $K$ is the kernel (or Gram) matrix, and $[\Sigma]_{i,j}$ is the measurement noise covariance between the $i$th and $j$th samples. It is typically

⁴Similar to [O'Hagan, 1991], here for simplicity we consider the case where the integral to be estimated is a scalar-valued integral. However, the results of this section can be extended to vector-valued integrals, such as the gradient of the expected return with respect to the policy parameters that we shall study in §5.2.2 (see Ghavamzadeh et al. [2013]).
Figure 5.3: The connection between BQ and the two BPG models of [Ghavamzadeh and Engel, 2006], and the Bayesian actor-critic (BAC) formulation of §5.3. In each case the gradient of the expected return is written as an integral of a deterministic factor $g$ and a factor $f$ that is modeled as a GP, whose posterior moments $\mathbb{E}[\cdot|\mathcal{D}_M]$ and $\mathrm{Var}[\cdot|\mathcal{D}_M]$ are computed by BQ; for BAC,
$$\nabla \eta(\theta) = \int ds\, da\; \underbrace{\nu(s;\theta)\, \nabla\pi(a|s;\theta)}_{g}\; \underbrace{Q(s, a;\theta)}_{f,\ \text{modeled as GP}}.$$
|                                       | Model 1 | Model 2 |
|---------------------------------------|---------|---------|
| Deterministic factor ($g$)            | $g(\xi;\theta) = \Pr(\xi;\theta)$ | $g(\xi;\theta) = \nabla \Pr(\xi;\theta)$ |
| GP factor ($f$)                       | $f(\xi;\theta) = \rho(\xi)\, \nabla \log \Pr(\xi;\theta)$ | $f(\xi) = \rho(\xi)$ |
| Measurement ($y$)                     | $y(\xi;\theta) = \rho(\xi)\, \nabla \log \Pr(\xi;\theta)$ | $y(\xi) = \rho(\xi)$ |
| Prior mean of $f$                     | $\mathbb{E}[f_j(\xi;\theta)] = 0$ | $\mathbb{E}[f(\xi)] = 0$ |
| Prior cov. of $f$                     | $\mathrm{Cov}[f_j(\xi;\theta), f_\ell(\xi';\theta)] = \delta_{j,\ell}\, k(\xi,\xi')$ | $\mathrm{Cov}[f(\xi), f(\xi')] = k(\xi,\xi')$ |
| Kernel function                       | $k(\xi,\xi') = \big(1 + u(\xi)^\top G^{-1} u(\xi')\big)^2$ | $k(\xi,\xi') = u(\xi)^\top G^{-1} u(\xi')$ |
| $\mathbb{E}[\nabla\eta_B(\theta)\,\vert\,\mathcal{D}_M]$  | $Y C b$ | $B C y$ |
| $\mathrm{Cov}[\nabla\eta_B(\theta)\,\vert\,\mathcal{D}_M]$ | $(b_0 - b^\top C b)\, I$ | $B_0 - B C B^\top$ |
| $b$ or $B$                            | $(b)_i = 1 + u(\xi_i)^\top G^{-1} u(\xi_i)$ | $B = U$ |
| $b_0$ or $B_0$                        | $b_0 = 1 + n$ | $B_0 = G$ |
Each component of $f(\cdot;\theta)$ is modeled independently of all other components, and thus the same kernel function $K$ and noise covariance $\Sigma$ are used for all components of $f(\cdot;\theta)$. Hence, for the $j$th component of $f$ and $y$ we have a priori
$$f_j = \big(f_j(\xi_1;\theta), \ldots, f_j(\xi_M;\theta)\big)^\top \sim \mathcal{N}(0, K),$$
$$y_j = \big(y_j(\xi_1;\theta), \ldots, y_j(\xi_M;\theta)\big)^\top \sim \mathcal{N}(0, K + \Sigma).$$
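As an illustration of how the quantities in the table combine, the sketch below computes the Model 2 posterior gradient moments from sampled paths, using $B = U$, $B_0 = G$, and the linear Fisher kernel. The score matrix, returns, Fisher matrix, and the small noise term are inputs assumed to be supplied by the caller.

```python
# A sketch of the Model 2 posterior gradient estimate: with the linear Fisher
# kernel K = U^T G^{-1} U, B = U, and C = (K + Sigma)^{-1}, the posterior mean
# of the gradient is B C y and its covariance is B_0 - B C B^T.
import numpy as np

def bpg_model2_gradient(U, returns, G, noise=1e-3):
    """U: n x M matrix of path score vectors u(xi_i); returns: length-M vector rho(xi_i)."""
    G_inv = np.linalg.inv(G)
    K = U.T @ G_inv @ U                        # linear Fisher kernel matrix
    C = np.linalg.inv(K + noise * np.eye(len(returns)))
    mean_grad = U @ C @ returns                # E[grad eta_B(theta) | D_M] = B C y
    cov_grad = G - U @ C @ U.T                 # Cov[...] = B_0 - B C B^T
    return mean_grad, cov_grad
```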
The above two BPG models can define Bayesian algorithms for eval-
uating the gradient of the expected return with respect to the policy
parameters (see [Ghavamzadeh and Engel, 2006] and [Ghavamzadeh
et al., 2013] for the pseudo-code of the resulting algorithms). It is im-
portant to note that computing the quadratic and linear Fisher kernels
used in Models 1 and 2 requires calculating the Fisher information
matrix G() (Eq. 5.27). Consequently, every time the policy parame-
ters are updated, G needs to be recomputed. Ghavamzadeh and En-
gel [Ghavamzadeh and Engel, 2006] suggest two possible approaches
for online estimation of the Fisher information matrix.
Similar to most non-parametric methods, the Bayesian gradient
evaluation algorithms can be made more efficient, both in time and
memory, by sparsifying the solution. Sparsification also helps to nu-
merically stabilize the algorithms when the kernel matrix is singular,
or nearly so. Ghavamzadeh et al. [Ghavamzadeh et al., 2013] show how
one can incrementally perform such sparsification in their Bayesian gra-
dient evaluation algorithms, i.e., how to selectively add a new observed
path to a set of dictionary paths used as a basis for representing or
approximating the full solution.
$$\mathbb{E}\big[\nabla\eta_B(\theta) \mid \mathcal{D}_M\big].$$
These equations provide us with the general form of the posterior policy gradient moments. We are now left with a computational issue, namely, how to compute the integrals appearing in these expressions. We need to be able to evaluate the following integrals:
$$B_t = \int_{\mathcal{Z}} dz\; g(z;\theta)\, k_t(z)^\top, \qquad B_0 = \int_{\mathcal{Z}^2} dz\, dz'\; g(z;\theta)\, k(z, z')\, g(z';\theta)^\top. \tag{5.31}$$
Using the definitions of $B_t$ and $B_0$, the gradient posterior moments may be written as
$$B_t = U_t, \qquad B_0 = G, \tag{5.35}$$
where $U_t = \big[u(z_0), u(z_1), \ldots, u(z_t)\big]$. As a result, the integrals of the gradient posterior moments (Eq. 5.32) are analytically tractable (see Ghavamzadeh et al. [2013] for more details). An immediate consequence of Eq. 5.35 is that, in order to compute the posterior moments of the policy gradient, we only need to be able to evaluate (or estimate) the score vectors $u(z_i)$, $i = 0, \ldots, t$, and the Fisher information matrix $G$ of the policy.
Similar to the BPG method, Ghavamzadeh and En-
gel [Ghavamzadeh and Engel, 2007] suggest methods for online
estimation of the Fisher information matrix in Eq. 5.34 and for using
online sparsification to make the BAC algorithm more efficient in both
time and memory (see Ghavamzadeh et al. [2013] for more details).
They also report experimental results [Ghavamzadeh and Engel, 2007,
Ghavamzadeh et al., 2013] which indicate that the BAC algorithm
tends to significantly reduce the number of samples needed to obtain
accurate gradient estimates, and thus, given a fixed number of samples
per iteration, finds a better policy than both MC-based policy gradient
methods and the BPG algorithm, which do not take into account the Markovian property of the system.

⁶Similar to $u(\cdot)$ and $G$ defined by Eqs. 5.19 and 5.27, to simplify the notation we omit the dependence of $u$ and $G$ on the policy parameters $\theta$, and replace $u(z;\theta)$ and $G(\theta)$ with $u(z)$ and $G$ in the sequel.
6
Risk-aware Bayesian Reinforcement Learning
The results presented so far have all been concerned with optimiz-
ing the expected return of the policy. However, in many applications,
the decision-maker is also interested in minimizing the risk of a pol-
icy. By risk,1 we mean performance criteria that take into account not
only the expected return, but also some additional statistics of it, such
as variance, Value-at-Risk (VaR), expected shortfall (also known as
conditional-value-at-risk or CVaR), etc. The primary motivation for
dealing with risk-sensitive performance criteria comes from finance,
where risk plays a dominant role in decision-making. However, there
are many other application domains where risk is important, such as
process control, resource allocation, and clinical decision-making.
In general, there are two sources that contribute to the reward un-
certainty in MDPs: internal uncertainty and parametric uncertainty.
Internal uncertainty reflects the uncertainty of the return due to the
stochastic transitions and rewards, for a single and known MDP. Para-
metric uncertainty, on the other hand, reflects the uncertainty about
the unknown MDP parameters, namely the transition and reward distributions.

¹The term risk here should not be confused with the Bayes risk, defined in Chapter 3.
Let $\bar{P}(s'|s, a) = \mathbb{E}_{\text{post}}\big[P(s'|s, a)\big]$ denote the expected transition probabilities under the posterior.
Theorem 6.1. [Mannor et al., 2007] The expectation (under the posterior) of $\hat{V}$ satisfies
$$\mathbb{E}_{\text{post}}[\hat{V}] = \bar{V} + \gamma^2 \bar{X}\big(\tilde{Q}\bar{V} + \tilde{B}\big) + L_{\text{bias}},$$
where $\bar{X} = (I - \gamma\bar{P})^{-1}$, and the vector $\tilde{B}$ and matrix $\tilde{Q}$ are computed according to
$$\tilde{B}_i = \sum_a \pi(a|i)^2\, \bar{R}(i, a)\, e_i^\top M_{i,a}\, \bar{X}_{\cdot, i},$$
and
$$\tilde{Q}_{i,j} = \bar{X}_{j, \cdot}\, \mathrm{Cov}^{(i)}\, \bar{X}_{\cdot, i}, \quad \text{in which} \quad \mathrm{Cov}^{(i)} = \sum_a \pi(a|i)^2 M_{i,a},$$
where the matrix $M_{i,a}$ is the posterior covariance matrix of $P(\cdot|s, a)$, and the higher order terms are
$$L_{\text{bias}} = \sum_{k=3}^{\infty} \gamma^k\, \mathbb{E}\big[f_k(\Delta P)\big]\, \bar{R},$$
in which $\Delta P = \hat{P} - \bar{P}$ and $f_k(\Delta P) = \bar{X}\big(\Delta P\, \bar{X}\big)^k$.
Theorem 6.2. [Mannor et al., 2007] Using the same notation as Theorem 6.1, the second moment (under the posterior) of $\hat{V}$ satisfies
$$\mathbb{E}_{\text{post}}\big[\hat{V}\hat{V}^\top\big] = \bar{V}\bar{V}^\top + \gamma^2 \bar{X}\Big( \tilde{Q}\bar{V}\bar{R}^\top + \bar{R}\bar{V}^\top\tilde{Q}^\top + \tilde{B}\bar{R}^\top + \bar{R}\tilde{B}^\top + \tilde{W} \Big)\bar{X}^\top + L_{\text{var}},$$
It is important to note that except for the higher order terms Lbias
and Lvar , the terms in Theorems 6.1 and 6.2 do not depend on the
unknown transitions, and thus, may be calculated from the data. This
results in a second-order approximation of the bias and variance of $\hat{V}$,
which may be used to derive confidence intervals around the estimated
value function. This approximation was also used by Delage and Man-
nor [Delage and Mannor, 2010] for risk-sensitive policy optimization,
as we describe next.
Percentile Criterion
Consider again a setting where $n(s, a, s')$ transitions from state $s$ to state $s'$ under action $a$ from MDP $M$ were observed, and let $P_{\text{post}}$ denote the posterior distribution of the transition probabilities in $M$ given these observations. Delage and Mannor [Delage and Mannor, 2010] investigate the percentile criterion for $M$, defined as
$$\max_{\pi,\, y \in \mathbb{R}}\; y \tag{6.1}$$
$$\text{s.t.}\quad P_{\text{post}}\left( \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(Z_t) \,\Big|\, Z_0 \sim P_0, \pi \right] \geq y \right) \geq 1 - \epsilon,$$
The next result of [Delage and Mannor, 2010] shows that, given enough observations, optimizing $F(\pi)$ leads to an approximate solution of the percentile problem (6.1).
Theorem 6.3. [Delage and Mannor, 2010] Let $N = \min_{s,a} \sum_{s'} n(s, a, s')$ denote the minimum number of state transitions observed from any state using any action, and let $\epsilon \in (0, 0.5]$. The policy
$$\pi^* = \arg\max_\pi F(\pi)$$
is $O\big(1/\sqrt{N}\big)$ optimal with respect to the percentile optimization criterion (6.1).
Note that, as discussed earlier, $F(\pi)$ may be efficiently evaluated for every $\pi$. However, $F(\pi)$ is non-convex in $\pi$, but empirically, global optimization techniques for maximizing $F(\pi)$ lead to useful solutions [Delage and Mannor, 2010].
Delage and Mannor [Delage and Mannor, 2010] also consider a case
where the state transitions are known, but there is uncertainty in the
reward distribution. For general reward distributions the corresponding
percentile optimization problem is also NP-hard. However, for the case
of a Gaussian prior, the resulting optimization problem is a second-
order cone program, for which efficient solutions are known.
Max-Min Criterion
Consider the percentile criterion (6.1) in the limit $\epsilon \to 0$. In this case, we are interested in the performance under the worst realizable posterior transition probability, i.e.,
$$\max_\pi\; \min_{P \in P_{\text{post}}}\; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(Z_t) \,\Big|\, Z_0 \sim P_0, P, \pi \right], \tag{6.2}$$
⁵Note that the first-order term is ignored, since it would cancel anyway in the optimization that follows.
Formally, given a measure $\mu$ over the interval $[0, 1]$, consider the following optimization problem:
$$\max_{\pi,\, y \in \mathcal{F}}\; \int_x y(x)\, d\mu \tag{6.4}$$
$$\text{s.t.}\quad P_{\text{post}}\left( \mathbb{E}\left[ \sum_{t=0}^{T} R(Z_t) \,\Big|\, Z_0 \sim P_0, \pi \right] \geq y(x) \right) \geq x \quad \forall x \in [0, 1],$$
where $\mathcal{F}$ is the class of real-valued, bounded, $\mu$-integrable functions on the interval $[0, 1]$, $P_{\text{post}}$ denotes the probability of drawing a transition matrix $P'$ from the posterior distribution of the transitions (as before), and the expectation is with respect to a concrete realization of that $P'$. Note that here the horizon is $T$ and there is no discounting, as opposed to the infinite-horizon discounted setting discussed earlier.
The percentile measure criterion (6.4) may be seen as a generalization of the percentile criterion (6.1), which is obtained by setting $\mu$ to a Dirac delta at $1 - \epsilon$. In addition, when $\mu$ is uniform on $[0, 1]$, (6.4) is equivalent to the expected value of Theorem 6.1, and when $\mu$ is a Dirac delta at 0, we obtain the max-min criterion (6.2). Finally, when $\mu$ is a step function, an expected shortfall (CVaR) criterion is obtained.
Chen and Bowling [Chen and Bowling, 2012] introduce the k-of-N
family of percentile measures, which admits an efficient solution under
a general uncertainty structure. For a policy , the k-of-N measure
is equivalent to the following sampling procedure: first draw N MDPs
according to the posterior distribution, then select a set of the k MDPs
with the worst expected performance under policy , and finally choose
a random MDP from this set (according to a uniform distribution). By
selecting suitable k and N , the k-of-N measure may be tuned to closely
approximate the CVaR or max-min criterion.
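The sampling procedure just described is easy to state in code; in the sketch below, the posterior sampler and policy evaluator are placeholders assumed to be supplied by the caller.

```python
# A sketch of the k-of-N sampling procedure: draw N MDPs from the posterior,
# keep the k with the worst expected performance under the current policy,
# and return one of them uniformly at random.
import random

def k_of_n_sample(sample_mdp_from_posterior, evaluate_policy, policy, k, N):
    mdps = [sample_mdp_from_posterior() for _ in range(N)]
    worst_k = sorted(mdps, key=lambda m: evaluate_policy(m, policy))[:k]
    return random.choice(worst_k)
```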
The main reason for using the k-of-N measure is that the above
sampling procedure may be seen as a two-player zero-sum extensive-
form game with imperfect information, which may be solved efficiently
using counterfactual regret minimization. This results in the following
convergence guarantee:
Theorem 6.4. [Chen and Bowling, 2012] For any $\epsilon > 0$ and $\delta \in (0, 1]$, let
$$T^* = \left(1 + \sqrt{\tfrac{2}{\delta}}\right)^2 \frac{16\, \Delta_T^2\, |I_1|^2\, |A|}{\epsilon^2},$$
where $\Delta_T$ is the maximum difference in total reward over $T$ steps. With probability $1 - \delta$, the current strategy at an iteration chosen uniformly at random from the interval $[1, T^*]$ is an $\epsilon$-approximation to a solution of (6.4) when $\mu$ is a k-of-N measure. The total time complexity is $O\big((T^*/\epsilon^2)\, |I_1|^3 |A|^3 N \log N\big)$, where $|I_1| \in O(|S|^T)$ for arbitrary reward uncertainty and $|I_1| \in O\big(|S|^{T+1} |A|^T\big)$ for arbitrary transition and reward uncertainty.
Note that in line with the previous hardness results, both the CVaR
and max-min criteria may be represented using a non-increasing and
piecewise Lipschitz continuous measure, while the percentile criterion
may not.
7
BRL Extensions
the size of the state space. Michini and How [Michini and How, 2012a]
suggest a potential solution to this problem in which they partition the
observations (system trajectories generated by the expert) into sets of
smaller sub-demonstrations, such that each sub-demonstration is at-
tributed to a smaller and less-complex class of reward functions. These
simple rewards can be intuitively interpreted as sub-goals of the expert.
They propose a BIRL algorithm that uses a Bayesian non-parametric
mixture model to automate this partitioning process. The proposed al-
gorithm uses a Chinese restaurant process prior over partitions so that
there is no need to specify the number of partitions a priori.
Most of the work in IRL assumes that the data is generated by a
single expert optimizing a fixed reward function. However, there are
many applications in which we observe multiple experts, each execut-
ing a policy that is optimal (or good) with respect to its own reward
function. There have been a few works that consider this scenario. Dim-
itrakakis and Rothkopf [Dimitrakakis and Rothkopf, 2011] generalize
BIRL to multi-task learning. They assume a common prior for the
trajectories and estimate the reward functions individually for each
trajectory. Other than the common prior, they do not make any addi-
tional efforts to group trajectories that are likely to be generated from
the same or similar reward functions. Babes et al. [Babes et al., 2011]
propose a method that combines expectation maximization (EM) clus-
tering with IRL. They cluster the trajectories based on the inferred
reward functions, where one reward function is defined per cluster.
However, their proposed method is based on the assumption that the
number of clusters (the number of the reward functions) is known. Fi-
nally, Choi and Kim [Choi and Kim, 2012] present a nonparametric
Bayesian approach using the Dirichlet process mixture model in order
to address the IRL problem with multiple reward functions. They de-
velop an efficient Metropolis-Hastings sampler that utilizes the gradient
of the reward function posterior to infer the reward functions from the
behavior data. Moreover, after completing IRL on the behavior data,
their method can efficiently estimate the reward function for a new tra-
jectory by computing the mean of the reward function posterior, given
the pre-learned results.
adopt the GPTD value function model ([Engel et al., 2005a]; see also §5.1.1) for each task, model the distribution over the value functions
using an HBM, and develop solutions to the following problems: (i)
joint learning of the value functions (multi-task learning), and (ii) effi-
cient transfer of the information acquired in (i) to facilitate learning the
value function of a newly observed task (transfer learning). They first
present an HBM for the case in which all of the value functions belong
to the same class, and derive an EM algorithm to find MAP estimates
of the value functions and the model's hyper-parameters. However, if
the functions do not belong to the same class, simply learning them to-
gether can be detrimental (negative transfer). It is therefore important
to have models that will generally benefit from related tasks and will
not hurt performance when the tasks are unrelated. This is particularly
important in RL as changing the policy at each step of policy itera-
tion (this is true even for fitted value iteration) can change the way in
which tasks are clustered together. This means that even if we start
with value functions that all belong to the same class, after one itera-
tion the new value functions may be clustered into several classes. To
address this issue, they introduce a Dirichlet process (DP) based HBM
for the case where the value functions belong to an undefined number
of classes, and derive inference algorithms for both the multi-task and
transfer learning scenarios in this model.
The MTRL approach in [Wilson et al., 2007] also uses a DP-based
HBM to model the distribution over a common structure of the tasks.
In this work, the tasks share structure in their dynamics and reward
functions. The setting is incremental, i.e., the tasks are observed as a se-
quence, and there is no restriction on the number of samples generated
by each task. The focus is not on joint learning with finite number of
samples, but on using the information gained from the previous tasks
to facilitate learning in a new one. In other words, the focus in this
work is on transfer and not on multi-task learning.
8
Outlook
prior. While this is certainly the case for model-based BRL, for larger
problems there is always a need for some sort of regularization. A prior serves as a means to regularize the model selection problem. Thus, the problem of specifying a prior is implicitly faced by any RL algorithm that is intended to work for large-scale problems. We believe that a promising
direction for future research concerns devising priors based on observed
data, as per the empirical Bayes approach [Efron, 2010]. A related
issue is model mis-specification and how to quantify the performance
degradation that may arise from not knowing the model precisely.
There are also some algorithmic and theoretical challenges that
we would like to point out here. First, scaling BRL is a major issue.
Being able to solve large problems is still an elusive goal. We should
mention that currently, a principled approach for scaling up RL in gen-
eral is arguably missing. Approximate value function methods [Bert-
sekas and Tsitsiklis, 1996] have proved successful for solving certain
large scale problems [Powell, 2011], and policy search has been suc-
cessfully applied to many robotics tasks [Kober et al., 2013]. However,
there is currently no solution (neither frequentist nor Bayesian) to the
exploration-exploitation dilemma in large-scale MDPs. We hope that
scaling up BRL, in which exploration-exploitation is naturally handled,
may help us overcome this barrier. Conceptually, BRL may be easier to
scale since it allows us, in some sense, to embed domain knowledge into
a problem. Second, the BRL framework we presented assumes that the
model is specified correctly. Dealing with model misspecification and
consequently with model selection is a thorny issue for all RL algo-
rithms. When the state parameters are unknown or not observed, the
dynamics may stop being Markov and many RL algorithms fail in the-
ory and in practice. The BRL framework may offer a solution to this
issue since the Bayesian approach can naturally handle model selection
and misspecification by not committing to a single model and rather
sustaining a posterior over the possible models.
Another perceived limitation of BRL is the complexity of imple-
menting the majority of BRL methods. Indeed, some frequentist RL
algorithms are more elegant than their Bayesian counterparts, which
may discourage some practitioners from using BRL. This is, by no
Appendices
A
Index of Symbols
Notation Definition
$\mathbb{R}$  set of real numbers
$\mathbb{N}$  set of natural numbers
$\mathbb{E}$  expected value
$\mathrm{Var}$  variance
$\mathrm{Cov}$  covariance
$\mathrm{KL}$  Kullback-Leibler divergence
$H$  Shannon entropy
$\mathcal{N}(m, \sigma^2)$  Gaussian (normal) distribution with mean $m$ and variance $\sigma^2$
$P$  probability distribution
$M$  model
$A$  set of actions
$K$  cardinality of $A$
$T$  time horizon
$a \in A$  a nominal action

¹In this paper, we use upper-case and lower-case letters to refer to random variables and the values taken by random variables, respectively. We also use bold-face letters for vectors and matrices.
$Y$  set of outcomes (in MAB)
$y \in Y$  a nominal outcome
$r(y)$  reward for outcome $y$
$P(y|a) \in \mathcal{P}(Y)$  probability of observing outcome $y$ after taking action $a$
$a^*$  optimal arm
$\Delta_a$  difference in expected reward between arms $a^*$ and $a$
$S$  set of states (or contexts in a contextual MAB)
$s \in S$  a nominal state (context)
$P_S(s) \in \mathcal{P}(S)$  context probability (in contextual MAB)
$P(y|a, s) \in \mathcal{P}(Y)$  probability of observing outcome $y$ after taking action $a$ when the context is $s$ (in contextual MAB)
$O$  set of observations
$o \in O$  a nominal observation
$P \in \mathcal{P}(S)$  transition probability function
$P(s'|s, a) \in \mathcal{P}(S)$  probability of being in state $s'$ after taking action $a$ in state $s$
$\Omega(o|s, a) \in \mathcal{P}(O)$  probability of seeing observation $o$ after taking action $a$ to reach state $s$
$q \in \mathcal{P}(\mathbb{R})$  probability distribution over reward
$R(s, a) \sim q(\cdot|s, a)$  random variable of the reward of taking action $a$ in state $s$
$r(s, a)$  reward of taking action $a$ in state $s$
$\bar{r}(s, a)$  expected reward of taking action $a$ in state $s$
$R_{\max}$  maximum random immediate reward
$\bar{R}_{\max}$  maximum expected immediate reward
$P_0 \in \mathcal{P}(S)$  initial state distribution
$b_t \in \mathcal{P}(S)$  POMDP's state distribution (belief) at time $t$
$\tau(b_t, a, o)$  the information state (belief) update equation
$\pi$  a (stationary and Markov) policy
$\pi(a|s)$  probability that policy $\pi$ selects action $a$ in state $s$
$\pi^*$  an optimal policy
$M^\pi$  the Markov chain induced by policy $\pi$
$P^\pi$  probability distribution of the Markov chain induced by policy $\pi$
$P^\pi(s'|s)$  probability of being in state $s'$ after taking an action according to policy $\pi$ in state $s$
$P_0^\pi$  initial state distribution of the Markov chain induced by policy $\pi$
$q^\pi$  reward distribution of the Markov chain induced by policy $\pi$
$q^\pi(\cdot|s)$  reward distribution of the Markov chain induced by policy $\pi$ at state $s$
$R^\pi(s) \sim q^\pi(\cdot|s)$  reward random variable of the Markov chain induced by policy $\pi$ at state $s$
$\nu^\pi \in \mathcal{P}(S)$  stationary distribution over states of the Markov chain induced by policy $\pi$
$Z = S \times A$  set of state-action pairs
$z = (s, a) \in Z$  a nominal state-action pair
$\xi = \{z_0, z_1, \ldots, z_T\}$  a system path or trajectory
$\Xi$  set of all possible system trajectories
$\rho(\xi)$  (discounted) return of path $\xi$
$\bar{\rho}(\xi)$  expected value of the (discounted) return of path $\xi$
$\eta(\pi)$  expected return of policy $\pi$
$D^\pi(s)$  random variable of the (discounted) return of state $s$
$D^\pi(z)$  random variable of the (discounted) return of state-action pair $z$
$V^\pi$  value function of policy $\pi$
$Q^\pi$  action-value function of policy $\pi$
$V^*$  optimal value function
$Q^*$  optimal action-value function
$\gamma$  discount factor
$\alpha$  linear function over the belief simplex in POMDPs
$\Gamma$  the set of $\alpha$-functions representing the POMDP value function
$k(\cdot, \cdot)$  a kernel function
$K$  kernel (covariance) matrix, $[K]_{i,j} = k(x_i, x_j)$
$k(x) = \big(k(x_1, x), \ldots, k(x_T, x)\big)^\top$  a kernel vector of size $T$
$I$  identity matrix
$\mathcal{D}_T$  a set of training samples of size $T$
$\phi_i$  a basis function (e.g., over states or state-action pairs)
$\phi(\cdot) = \big(\phi_1(\cdot), \ldots, \phi_n(\cdot)\big)^\top$  a feature vector of size $n$
$\Phi = [\phi(x_1), \ldots, \phi(x_T)]$  an $n \times T$ feature matrix
$\theta$  vector of unknown parameters
$\Theta$  a parameter space that the unknown parameters belong to ($\theta \in \Theta$)
B
Discussion on GPTD Assumptions on the Noise Process
Assumption A2: The residuals $\Delta V_{T+1} = \big(\Delta V(s_0), \ldots, \Delta V(s_T)\big)^\top$ can be modeled as a Gaussian process.

This may not seem like a correct assumption in general; however, in the absence of any prior information concerning the distribution of the residuals, it is the simplest assumption that can be made, since the Gaussian distribution has the highest entropy among all distributions with the same covariance.
References