Interpreting Systems as Solving POMDPs: A Step Towards a Formal Understanding of Agency

  • Conference paper
  • In: Active Inference (IWAI 2022)

Abstract

Under what circumstances can a system be said to have beliefs and goals, and how do such agency-related features relate to its physical state? Recent work has proposed a notion of interpretation map, a function that maps the state of a system to a probability distribution representing its beliefs about an external world. Such a map is not completely arbitrary, as the beliefs it attributes to the system must evolve over time in a manner that is consistent with Bayes’ theorem, and consequently the dynamics of a system constrain its possible interpretations. Here we build on this approach, proposing a notion of interpretation not just in terms of beliefs but in terms of goals and actions. To do this we make use of the existing theory of partially observable Markov decision processes (POMDPs): we say that a system can be interpreted as a solution to a POMDP if it not only admits an interpretation map describing its beliefs about the hidden state of a POMDP but also takes actions that are optimal according to its belief state. An agent is then a system together with an interpretation of this system as a POMDP solution. Although POMDPs are not the only possible formulation of what it means to have a goal, this nevertheless represents a step towards a more general formal definition of what it means for a system to be an agent.

This project was made possible through the support of Grant 62229 from the John Templeton Foundation. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation. Work on this project was also supported by a grant from GoodAI.


Notes

  1.

    These machines are probably equivalent to the sufficient information state processes in [9, definition 2] but establishing this is beyond the scope of this work.

  2.

    The FEP literature includes both publications on how to construct agents that solve problems (e.g. [8]) and publications on when a system of stochastic differential equations contains an agent performing approximate Bayesian inference. Only the latter literature addresses a question comparable to the one addressed in the present manuscript.

  3.

    We choose actions to be deterministic functions of the machine state because the stochastic Moore machines considered here also have deterministic outputs. Other choices may be more suitable in other cases.

  4.

    If the denominator in Eq. (3) is zero for some value \(s \,{\in }\, {\mathcal {S}}\) then define e.g. \(f_{\pi ^*}(b,s)=b\).

References

  1. Beer, R.D.: The cognitive domain of a glider in the game of life. Artif. Life 20(2), 183–206 (2014). https://doi.org/10.1162/ARTL_a_00125

  2. Biehl, M., Ikegami, T., Polani, D.: Towards information based spatiotemporal patterns as a foundation for agent representation in dynamical systems. In: Proceedings of the Artificial Life Conference 2016, pp. 722–729. The MIT Press (2016). https://doi.org/10.7551/978-0-262-33936-0-ch115. https://mitpress.mit.edu/sites/default/files/titles/content/conf/alife16/ch115.html

  3. Da Costa, L., Friston, K., Heins, C., Pavliotis, G.A.: Bayesian mechanics for stationary processes. arXiv:2106.13830 [math-ph, physics:nlin, q-bio] (2021)

  4. Dennett, D.C.: True believers: the intentional strategy and why it works. In: Heath, A.F. (ed.) Scientific Explanation: Papers Based on Herbert Spencer Lectures Given in the University of Oxford, pp. 53–75. Clarendon Press (1981)

  5. Friston, K.: Life as we know it. J. R. Soc. Interface 10(86) (2013). https://doi.org/10.1098/rsif.2013.0475. http://rsif.royalsocietypublishing.org/content/10/86/20130475

  6. Friston, K.: A free energy principle for a particular physics. arXiv:1906.10184 [q-bio] (2019)

  7. Friston, K., et al.: The free energy principle made simpler but not too simple (2022). https://doi.org/10.48550/arXiv.2201.06387 [cond-mat, physics:nlin, physics:physics, q-bio]

  8. Friston, K., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., Pezzulo, G.: Active inference and epistemic value. Cogn. Neurosci. 6(4), 187–214 (2015). https://doi.org/10.1080/17588928.2015.1020053

  9. Hauskrecht, M.: Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000). https://doi.org/10.1613/jair.678. http://arxiv.org/abs/1106.0234 [cs]

  10. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1–2), 99–134 (1998). https://doi.org/10.1016/S0004-3702(98)00023-X. http://www.sciencedirect.com/science/article/pii/S000437029800023X

  11. Kenton, Z., Kumar, R., Farquhar, S., Richens, J., MacDermott, M., Everitt, T.: Discovering agents (2022). https://doi.org/10.48550/arXiv.2208.08345 [cs]

  12. Orseau, L., McGill, S.M., Legg, S.: Agents and devices: a relative definition of agency. arXiv:1805.12387 [cs, stat] (2018)

  13. Parr, T.: Inferential dynamics: comment on: How particular is the physics of the free energy principle? by Aguilera et al. Phys. Life Rev. 42, 1–3 (2022). https://doi.org/10.1016/j.plrev.2022.05.006. https://www.sciencedirect.com/science/article/pii/S1571064522000276

  14. Parr, T., Da Costa, L., Friston, K.: Markov blankets, information geometry and stochastic thermodynamics. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 378(2164), 20190159 (2020). https://doi.org/10.1098/rsta.2019.0159. https://royalsocietypublishing.org/doi/full/10.1098/rsta.2019.0159

  15. Sondik, E.J.: The optimal control of partially observable Markov processes over the infinite horizon: discounted costs. Oper. Res. 26(2), 282–304 (1978). http://www.jstor.org/stable/169635

  16. Virgo, N., Biehl, M., McGregor, S.: Interpreting dynamical systems as Bayesian reasoners. arXiv:2112.13523 [cs, q-bio] (2021)

Author information

Corresponding author: Martin Biehl.

Appendices

A Consistency in More Familiar Form

One way to turn Eq. (2) into a more familiar form is to introduce some abbreviations and look at some special cases. We follow a strategy similar to that of [16]. Let

$$\begin{aligned} \psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m){:}{=}\sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\alpha (m)}(a) \end{aligned}$$
(8)

and

$$\begin{aligned} \psi _{{\mathcal {I}}}(i|m){:}{=}\sum _{h' \in {\mathcal {H}}}\psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m). \end{aligned}$$
(9)

Now consider the case of a deterministic machine and choose the \(m' \in {\mathcal {M}}\) that actually occurs for a given input \(i \in {\mathcal {I}}\), i.e. the \(m'\) with \({\mu }(m'|i,m)=1\), or, abusing notation, \(m' = m'(i,m)\). Then Eq. (2) gives:

$$\begin{aligned} \psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m)=\psi (h'|{\mu }(i,m)) \psi _{{\mathcal {I}}}(i|m). \end{aligned}$$
(10)

If we also consider an input \(i \in {\mathcal {I}}\) that is subjectively possible as defined in [16], which here means that \(\psi _{{\mathcal {I}}}(i|m)>0\), we get

$$\begin{aligned} \psi (h'|m'(i,m))= \frac{\psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m)}{\psi _{{\mathcal {I}}}(i|m)}. \end{aligned}$$
(11)

This makes it more apparent that, under the interpretation, the updated machine state \(m'=m'(i,m)\) parameterizes a belief \(\psi (h'|m'(i,m))\) which is equal to the posterior distribution over the hidden state given input i. In the non-deterministic case, note that when \({\mu }(m'|i,m)=0\) the consistency equation imposes no condition, which makes sense since that machine state \(m'\) can never occur. When \({\mu }(m'|i,m)>0\) we can divide Eq. (2) by this probability to again obtain Eq. (10). The subsequent argument then holds not only for the single possible next state \(m'=m'(i,m)\) but for every \(m'\) with \({\mu }(m'|i,m)>0\). So in this case (if i is subjectively possible) any of the possible next states parameterizes a belief \(\psi (h'|m')\) equal to the posterior.
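
To make the bookkeeping in Eqs. (8)–(11) concrete, here is a small Python sketch. It uses a hypothetical finite example with randomly generated kernels (nothing from the main text) and computes \(\psi _{{\mathcal {H}},{\mathcal {I}}}\), \(\psi _{{\mathcal {I}}}\) and the posterior of Eq. (11) that consistency requires the next belief \(\psi (\cdot |m')\) to equal.

```python
import numpy as np

# Hypothetical finite example: hidden states H, inputs I, machine states M, actions A.
nH, nI, nM, nA = 2, 2, 3, 2
rng = np.random.default_rng(0)

# POMDP kernel kappa(h', i | h, a), stored as kappa[hp, i, h, a], normalized over (h', i).
kappa = rng.random((nH, nI, nH, nA))
kappa /= kappa.sum(axis=(0, 1), keepdims=True)

# Interpretation map psi(h | m), stored as psi[h, m]; a distribution over H for every m.
psi = rng.random((nH, nM))
psi /= psi.sum(axis=0, keepdims=True)

# Deterministic action function alpha(m), playing the role of delta_{alpha(m)}(a) in Eq. (8).
alpha = rng.integers(0, nA, size=nM)

def psi_HI(m):
    """Joint belief psi_{H,I}(h', i | m) of Eq. (8)."""
    a = alpha[m]
    return np.einsum('jih,h->ji', kappa[:, :, :, a], psi[:, m])

def psi_I(m):
    """Marginal belief psi_I(i | m) over the next input, Eq. (9)."""
    return psi_HI(m).sum(axis=0)

def required_posterior(m, i):
    """Posterior of Eq. (11) that consistency forces psi(. | m'(i, m)) to equal,
    defined whenever the input i is subjectively possible (psi_I(i | m) > 0)."""
    joint = psi_HI(m)
    return joint[:, i] / joint[:, i].sum()

print(psi_I(0))                  # distribution over next inputs for machine state m = 0
print(required_posterior(0, 1))  # the belief that psi(. | m'(1, 0)) would have to equal
```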

B Proof of Proposition 1

For the readers’s convenience we recall the proposition:

Proposition 1 (restated)

Consider a stochastic Moore machine \(({\mathcal {M}},{\mathcal {I}},{\mathcal {O}},{\mu },{\omega })\), together with a consistent interpretation as a solution to a POMDP, given by the POMDP \(({\mathcal {H}},{\mathcal {O}},{\mathcal {I}},\kappa ,{r})\) and Markov kernel \(\psi :{\mathcal {M}}\rightarrow P{\mathcal {H}}\). Suppose it is given an input \(i\in {\mathcal {I}}\), and that this input has a positive probability according to the interpretation (that is, Eq. (14) is obeyed). Then the parameterized distributions \(\psi (m)\) update as required by the belief state update equation (Eq. (3)) whenever \(a=\pi ^*(b)\), i.e. whenever the action taken is the optimal action. More formally, for any \(m,m' \in {\mathcal {M}}\) with \({\mu }(m'|i,m)>0\) and any \(i \in {\mathcal {I}}\) that can occur according to the POMDP transition and sensor kernels, we have for all \(h' \in {\mathcal {H}}\)

$$\begin{aligned} \psi (h'|m') = f(\psi (m),\pi ^*(\psi (m)),i)(h'). \end{aligned}$$
(12)

Proof

By assumption the machine part \(({\mathcal {M}},{\mathcal {I}},{\mu })\) of the stochastic Moore machine has a consistent Bayesian influenced filtering interpretation \(({\mathcal {H}},{\mathcal {O}},\psi ,{\omega },\kappa )\).

This means that the belief \(\psi (m)\) parameterized by the machine state obeys Eq. (2). In particular, for every possible next state \(m'\) (i.e. every \(m'\) with \({\mu }(m'|i,m)>0\)) we have

$$\begin{aligned} \begin{aligned} \sum _{h \in {\mathcal {H}}}&\sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) \\&= \psi (h'|m') \left( \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a)\right) \end{aligned} \end{aligned}$$
(13)

and for every subjectively possible input, that is, for every input \(i \in {\mathcal {I}}\) with

$$\begin{aligned} \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) >0 \end{aligned}$$
(14)

(see below for a note on why this assumption is reasonable) we will have:

$$\begin{aligned} \psi (h'|m') =&\,\frac{ \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) }{ \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a)}\end{aligned}$$
(15)
$$\begin{aligned} =&\, \frac{ \sum _{h \in {\mathcal {H}}}\kappa (h',i|h,\omega (m)) \psi (h|m) }{ \sum _{h \in {\mathcal {H}}}\sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,\omega (m)) \psi (h|m)}. \end{aligned}$$
(16)

Now consider the belief update function, Eq. (3), in terms of which the optimal policy is defined:

$$\begin{aligned} f(b,a,s)(h') {:}{=}&\, \frac{\sum _{h \in {\mathcal {H}}}\kappa (h',s|h,a)b(h)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',s|\bar{h},a)b(\bar{h})} \end{aligned}$$
(17)

and plug in the belief \(b=\psi (m)\) parameterized by the machine state, the optimal action \(\pi ^*(\psi (m))\) specified for that belief by the optimal policy \(\pi ^*\), and the sensor value \(s=i\):

$$\begin{aligned} f(\psi (m),\pi ^*(\psi (m)),i)(h') =&\, \frac{\sum _{h \in {\mathcal {H}}}\kappa (h',i|h,\pi ^*(\psi (m)))\psi (m)(h)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',i|\bar{h},\pi ^*(\psi (m)))\psi (m)(\bar{h})}. \end{aligned}$$
(18)

Writing \(\psi (h|m)\) for \(\psi (m)(h)\) as usual, and using that by the definition of an interpretation as a solution the machine's output is the optimal action for its belief, \(\omega (m)=\pi ^*(\psi (m))\), the right hand side coincides with the right hand side of Eq. (16), so

$$\begin{aligned} f(\psi (m),\pi ^*(\psi (m)),i)(h') =&\, \frac{\sum _{h \in {\mathcal {H}}}\kappa (h',i|h,\pi ^*(\psi (m)))\psi (h|m)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',i|\bar{h},\pi ^*(\psi (m)))\psi (\bar{h}|m)}\end{aligned}$$
(19)
$$\begin{aligned} =&\, \psi (h'|m'). \end{aligned}$$
(20)

This is what we wanted to prove.

Note that if Eq. (14) does not hold, i.e. the input i is impossible according to the POMDP kernel \(\kappa \), the interpretation map \(\psi \), and the action \(\omega (m)\), then Eq. (13) puts no constraint on the machine kernel \({\mu }\) since both sides are zero. The behavior of the stochastic Moore machine in this case is therefore arbitrary. This makes sense: according to the POMDP that we use to interpret the machine this input is impossible, so our interpretation should tell us nothing about this situation.

C Sondik’s Example

We now consider the example from [15], which has a known optimal solution. From this solution we constructed a stochastic Moore machine that has an interpretation as a solution to Sondik’s POMDP. This proves the existence of stochastic Moore machines with such interpretations.

Consider the following stochastic Moore machine:

  • State space \({\mathcal {M}}{:}{=}[0,1]\). (This state will be interpreted as the belief probability of the hidden state being equal to 1.)

  • input space \({\mathcal {I}}=\{1,2\}\)

  • machine kernel \({\mu }:{\mathcal {I}}\times {\mathcal {M}}\rightarrow P{\mathcal {M}}\) defined by deterministic function \(g:{\mathcal {I}}\times {\mathcal {M}}\rightarrow {\mathcal {M}}\):

    $$\begin{aligned} {\mu }(m'|s,m){:}{=}\delta _{g(s,m)}(m') \end{aligned}$$
    (21)

    where

    $$\begin{aligned} g(S=1,m){:}{=}{\left\{ \begin{array}{ll} \frac{15}{6m+20}-\frac{1}{2} &{}\text { if } 0 \le m \le 0.1188\\ \frac{9}{5} - \frac{72}{5m+60} &{}\text { if } 0.1188 \le m \le 1. \end{array}\right. } \end{aligned}$$
    (22)

    and

    $$\begin{aligned} g(S=2,m){:}{=}{\left\{ \begin{array}{ll} 2+\frac{20}{3m-15} &{}\text { if } 0 \le m \le 0.1188\\ - \frac{1}{5} - \frac{12}{5m-40} &{}\text { if } 0.1188 \le m \le 1. \end{array}\right. } \end{aligned}$$
    (23)
  • output space \({\mathcal {O}}{:}{=}\{1,2\}\)

  • expose kernel \(\omega :{\mathcal {M}}\rightarrow {\mathcal {O}}\) defined by

    $$\begin{aligned} \omega (m){:}{=}{\left\{ \begin{array}{ll} 1 \text { if } 0 \le m < 0.1188\\ 2 \text { if } 0.1188 \le m \le 1. \end{array}\right. } \end{aligned}$$
    (24)

A consistent interpretation as a solution to a POMDP for this stochastic Moore machine is given by

  • The POMDP with

    • state space \({\mathcal {H}}{:}{=}\{1,2\}\)

    • action space equal to the output space \({\mathcal {O}}\) of the machine above

    • sensor space equal to the input space \({\mathcal {I}}\) of the machine above

    • model kernel \(\kappa :{\mathcal {H}}\times {\mathcal {O}}\rightarrow P({\mathcal {H}}\times {\mathcal {I}})\) defined by

      $$\begin{aligned} \kappa (h',s|h,a){:}{=}\nu (h'|h,a)\phi (s|h',a) \end{aligned}$$
      (25)

      where \(\nu :{\mathcal {H}}\times {\mathcal {O}}\rightarrow P{\mathcal {H}}\) and \(\phi :{\mathcal {H}}\times {\mathcal {O}}\rightarrow P{\mathcal {I}}\) are shown in Table 1

      Table 1. Sondik’s POMDP data.
    • reward function \(r:{\mathcal {H}}\times {\mathcal {O}}\rightarrow \mathbb {R}\) also shown in Table 1.

  • Markov kernel \(\psi : {\mathcal {M}}\rightarrow P{\mathcal {H}}\) given by:

    $$\begin{aligned} \psi (h|m):=m^{\delta _1(h)}(1-m)^{\delta _2(h)}. \end{aligned}$$
    (26)

To verify this we have to check that \(({\mathcal {H}},{\mathcal {O}},\psi ,{\omega },\kappa )\) is a consistent Bayesian influenced filtering interpretation of the machine \(({\mathcal {M}},{\mathcal {I}},{\mu })\). For this we need to check Eq. (2) with \(\delta _{\alpha (m)}(a){:}{=}\delta _{\omega (m)}(a)\). So for each \(m \in [0,1]\), \(h' \in \{1,2\}\), \(i \in \{1,2\}\), and \(m' \in [0,1]\) we need to check:

$$\begin{aligned} \begin{aligned} \left( \sum _{h \in {\mathcal {H}}}\right.&\left. \sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) \right) {\mu }(m'|i,m)\\&= \psi (h'|m') \left( \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a)\right) {\mu }(m'|i,m). \end{aligned} \end{aligned}$$
(27)

This is tedious to check but true. We would usually also have to show that \(\omega \) is indeed the optimal policy for Sondik’s POMDP, but this is shown in [15].
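
The machine and interpretation above are easy to transcribe into code. The following Python sketch implements \(g\), \(\omega \) and \(\psi \) as defined in Eqs. (22)–(26) and provides a generic checker for the consistency condition Eq. (27). Since the numerical entries of Table 1 are not reproduced here, the kernel \(\kappa \) must be supplied by the reader (the 0-based index convention in the sketch is our own assumption), and only the Table-1-independent sanity checks are actually run.

```python
import numpy as np

B = 0.1188  # breakpoint of the update maps and of the policy

def g(s, m):
    """Deterministic machine update g(s, m) of Eqs. (22)-(23); s in {1, 2}, m in [0, 1]."""
    if s == 1:
        return 15.0 / (6 * m + 20) - 0.5 if m <= B else 9.0 / 5 - 72.0 / (5 * m + 60)
    else:
        return 2 + 20.0 / (3 * m - 15) if m <= B else -1.0 / 5 - 12.0 / (5 * m - 40)

def omega(m):
    """Output (action) function of Eq. (24)."""
    return 1 if m < B else 2

def psi(m):
    """Interpretation map of Eq. (26): belief (P(h=1), P(h=2)) parameterized by m."""
    return np.array([m, 1.0 - m])

def check_consistency(kappa, ms=np.linspace(0.0, 1.0, 101), tol=1e-9):
    """Check Eq. (27) for this deterministic machine on a grid of machine states.

    `kappa[hp, s, h, a]` must contain kappa(h', s | h, a) built from the Table 1
    kernels nu and phi (0-based indices, i.e. value k is stored at index k-1);
    it is not reproduced in this sketch.
    """
    for m in ms:
        a = omega(m) - 1
        for s in (1, 2):
            joint = kappa[:, s - 1, :, a] @ psi(m)   # psi_{H,I}(., s | m)
            norm = joint.sum()                        # psi_I(s | m)
            if norm <= tol:
                continue                              # subjectively impossible input
            if not np.allclose(psi(g(s, m)), joint / norm, atol=1e-6):
                return False
    return True

# Sanity checks that need no Table 1 data: g maps into [0, 1] and psi is a distribution.
for m in np.linspace(0.0, 1.0, 101):
    for s in (1, 2):
        assert 0.0 <= g(s, m) <= 1.0
    assert np.isclose(psi(m).sum(), 1.0)
```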

D POMDPs and Belief State MDPs

Here we give some more details about belief state MDPs and the optimal value function and policy of Eqs. (4) and (5). There is no original content in this section and it follows closely the expositions in [9, 10].

We first define an MDP and its solution and then add some details about the belief state MDP associated to a POMDP.

Definition 6

A Markov decision process (MDP) can be defined as a tuple \(({\mathcal {X}},{\mathcal {A}},{\nu },{r})\) consisting of a set \({\mathcal {X}}\) called the state space, a set \({\mathcal {A}}\) called the action space, a Markov kernel \({\nu }:{\mathcal {X}}\times {\mathcal {A}}\rightarrow P({\mathcal {X}})\) called the transition kernel, and a reward function \({r}:{\mathcal {X}}\times {\mathcal {A}}\rightarrow \mathbb {R}\). Here, the transition kernel takes a state \(x \in {\mathcal {X}}\) and an action \(a \in {\mathcal {A}}\) to a probability distribution \({\nu }(x,a)\) over next states, and the reward function returns a real-valued instantaneous reward \(r(x,a)\) depending on the state and the action.

A solution to a given MDP is a control policy. As the goal of the MDP we here choose the maximization of expected cumulative discounted reward over an infinite time horizon (an alternative would be to consider finite time horizons). This means that an optimal policy maximizes

$$\begin{aligned} \mathbb {E}\left[ \sum _{t=1}^\infty \gamma ^{t-1} {r}(x_t,a_t)\right] . \end{aligned}$$
(28)

where \(0<\gamma <1\) is a parameter called the discount factor. This specifies the goal.

To express the optimal policy explicitly we can use the optimal value function \(V^*:{\mathcal {X}}\rightarrow \mathbb {R}\). This is the solution to the Bellman equation [10]:

$$\begin{aligned} V^*(x)=&\max _{a \in {\mathcal {A}}} \left( {r}(x,a) + \gamma \sum _{x' \in {\mathcal {X}}} {\nu }(x'|a,x) V^*(x') \right) . \end{aligned}$$
(29)

The optimal policy is then the function \(\pi ^*:{\mathcal {X}}\rightarrow {\mathcal {A}}\) that greedily maximizes the optimal value function [10]:

$$\begin{aligned} \pi ^*(x)=&\mathop {\mathrm {arg\,max}}\limits _{a \in {\mathcal {A}}} \left( {r}(x,a) + \gamma \sum _{x' \in {\mathcal {X}}} {\nu }(x'|a,x) V^*(x') \right) . \end{aligned}$$
(30)
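
For a finite MDP the optimal value function and policy of Eqs. (29) and (30) can be computed by standard value iteration. The following Python sketch is a minimal illustration with an arbitrary randomly generated MDP; none of the numbers come from the paper.

```python
import numpy as np

def value_iteration(nu, r, gamma=0.95, tol=1e-8):
    """Solve the Bellman equation (29) by fixed-point iteration.

    nu[x_next, a, x] = transition probability nu(x'|a, x)
    r[x, a]          = instantaneous reward
    Returns the optimal value function V* and the greedy policy pi* of Eq. (30).
    """
    n_states, n_actions = r.shape
    V = np.zeros(n_states)
    while True:
        # Q[x, a] = r(x, a) + gamma * sum_x' nu(x'|a, x) V(x')
        Q = r + gamma * np.einsum('yax,y->xa', nu, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny random example (2 states, 2 actions), purely illustrative.
rng = np.random.default_rng(0)
nu = rng.random((2, 2, 2))
nu /= nu.sum(axis=0, keepdims=True)   # normalize over next states x'
r = rng.random((2, 2))
V_star, pi_star = value_iteration(nu, r)
print(V_star, pi_star)
```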

D.1 Belief State MDP

The belief state MDP for a POMDP (see Definition 4) is defined using the belief state update function of Eq. (3). We first define this function again here with an additional intermediate step:

$$\begin{aligned} f(b,a,s)(h'){:}{=}&\, Pr(h'|b,a,s)\end{aligned}$$
(31)
$$\begin{aligned} =&\,\frac{Pr(h',s|b,a)}{Pr(s|b,a)}\end{aligned}$$
(32)
$$\begin{aligned} =&\,\frac{\sum _{h \in {\mathcal {H}}}\kappa (h',s|h,a)b(h)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',s|\bar{h},a)b(\bar{h})}. \end{aligned}$$
(33)

The function \(f(b,a,s)\) returns the posterior belief over the next hidden state \(h'\) given prior belief \(b \in P{\mathcal {H}}\), an action \(a \in {\mathcal {A}}\) and observation \(s \in {\mathcal {S}}\). The Markov kernel \(\delta _{f}:P{\mathcal {H}}\times {\mathcal {A}}\times {\mathcal {S}}\rightarrow P(P{\mathcal {H}})\) associated to this function can be seen as the probability of the next belief state \(b'\) given current belief state b, action a and sensor value s:

$$\begin{aligned} Pr(b'|b,a,s) =\delta _{f(b,a,s)}(b'). \end{aligned}$$
(34)

Intuitively, the belief state MDP has as its transition kernel the probability \(Pr(b'|b,a)\) of the next belief state \(b'\), expected over all next sensor values, given that the current belief state is b, the action is a, and beliefs get updated according to the rules of probability, so

$$\begin{aligned} Pr(b'|b,a)=&\sum _s Pr(b'|b,a,s) Pr(s|b,a)\end{aligned}$$
(35)
$$\begin{aligned} =&\sum _{s \in {\mathcal {S}}} \delta _{f(b,a,s)}(b') \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h). \end{aligned}$$
(36)

This gives some intuition behind the definition of belief state MDPs.
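
The belief update \(f(b,a,s)\) of Eq. (33) and the induced distribution over next beliefs of Eqs. (35)–(36) are short to write down for a finite POMDP. The Python sketch below uses a hypothetical randomly generated kernel \(\kappa \) (not anything from the paper) and, following footnote 4, leaves the belief unchanged when the observation has probability zero.

```python
import numpy as np

def belief_update(kappa, b, a, s):
    """Posterior belief f(b, a, s) of Eq. (33).

    kappa[h_next, s, h, a] = kappa(h', s | h, a).
    If Pr(s | b, a) = 0 the observation is impossible under b; following
    footnote 4 we simply return b unchanged in that case.
    """
    joint = kappa[:, s, :, a] @ b   # Pr(h', s | b, a) as a vector over h'
    norm = joint.sum()               # Pr(s | b, a)
    return b if norm == 0 else joint / norm

def next_belief_distribution(kappa, b, a):
    """All next beliefs b' = f(b, a, s) with their probabilities Pr(s | b, a), Eq. (36)."""
    n_sensors = kappa.shape[1]
    out = []
    for s in range(n_sensors):
        p_s = (kappa[:, s, :, a] @ b).sum()
        if p_s > 0:
            out.append((p_s, belief_update(kappa, b, a, s)))
    return out

# Hypothetical 2-state, 2-action, 2-sensor POMDP kernel, normalized over (h', s).
rng = np.random.default_rng(1)
kappa = rng.random((2, 2, 2, 2))
kappa /= kappa.sum(axis=(0, 1), keepdims=True)

b = np.array([0.5, 0.5])
for p_s, b_next in next_belief_distribution(kappa, b, a=0):
    print(p_s, b_next)
```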

Definition 7

Given a POMDP \(({\mathcal {H}},{\mathcal {A}},{\mathcal {S}},\kappa ,{r})\) the associated belief state Markov decision process (belief state MDP) is the MDP \((P{\mathcal {H}},{\mathcal {A}},{\beta },{\rho })\) where

  • the state space \(P{\mathcal {H}}\) is the space of probability distributions (beliefs) over the hidden state of the POMDP. We write b(h) for the probability of a hidden state \(h \in {\mathcal {H}}\) according to belief \(b \in P{\mathcal {H}}\).

  • the action space \(\mathcal {A}\) is the same as for the underlying POMDP

  • the transition kernel \({\beta }: P{\mathcal {H}}\times {\mathcal {A}}\rightarrow P(P{\mathcal {H}})\) is defined as [10, Section 3.4]

    $$\begin{aligned} {\beta }(b'|b,a){:}{=}&\sum _{s \in {\mathcal {S}}} \delta _{f(b,a,s)}(b') \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h). \end{aligned}$$
    (37)
  • the reward function \({\rho }:P{\mathcal {H}}\times {\mathcal {A}}\rightarrow \mathbb {R}\) is defined as

    $$\begin{aligned} {\rho }(b,a){:}{=}\sum _{h \in {\mathcal {H}}} b(h) {r}(h,a). \end{aligned}$$
    (38)

    So the reward for action a under belief b is the expectation, under belief b, of the original POMDP reward for that action.

D.2 Optimal Belief-MDP Policy

Using the belief MDP we can express the optimal policy for the POMDP.

The optimal policy can be expressed in terms of the optimal value function of the belief MDP. This is the solution to the equation [9]

$$\begin{aligned} V^*(b)=&\max _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{b' \in P{\mathcal {H}}} {\beta }(b'|a,b) V^*(b') \right) \end{aligned}$$
(39)
$$\begin{aligned} V^*(b)=&\max _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{b' \in P{\mathcal {H}}} \sum _{s \in {\mathcal {S}}} \delta _{f(b,a,s)}(b') \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h) V^*(b') \right) \end{aligned}$$
(40)
$$\begin{aligned} V^*(b)=&\max _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{s \in {\mathcal {S}}} \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h) V^*(f(b,a,s)) \right) . \end{aligned}$$
(41)

This is the expression we used in Eq. (4). The optimal policy for the belief MDP is then [9]:

$$\begin{aligned} \pi ^*(b)=&\mathop {\mathrm {arg\,max}}\limits _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{s \in {\mathcal {S}}} \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h) V^*(f(b,a,s)) \right) . \end{aligned}$$
(42)

This is the expression we used in Eq. (5).
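
A crude but simple way to evaluate Eqs. (41) and (42) numerically is to discretize the belief space and iterate the Bellman backup on the grid; the exact \(V^*\) is piecewise linear and convex [15], so this is only an approximation. The Python sketch below does this for a POMDP with two hidden states, using a hypothetical randomly generated kernel and reward (not the Table 1 data).

```python
import numpy as np

def pomdp_grid_solver(kappa, r, gamma=0.95, n_grid=201, n_iter=200):
    """Approximate V* (Eq. (41)) and pi* (Eq. (42)) for a 2-hidden-state POMDP.

    The belief is summarized by b1 = Pr(h = 1) and discretized on a grid; V*(f(b,a,s))
    is read off at the nearest grid point. kappa[h_next, s, h, a] = kappa(h', s | h, a),
    r[h, a] = reward. This is a rough stand-in for the exact piecewise-linear solution.
    """
    nH, nS, _, nA = kappa.shape
    assert nH == 2
    grid = np.linspace(0.0, 1.0, n_grid)
    V = np.zeros(n_grid)

    def backup(V):
        Q = np.zeros((n_grid, nA))
        for k, b1 in enumerate(grid):
            b = np.array([b1, 1.0 - b1])
            for a in range(nA):
                rho = b @ r[:, a]                    # rho(b, a), Eq. (38)
                future = 0.0
                for s in range(nS):
                    joint = kappa[:, s, :, a] @ b    # Pr(h', s | b, a)
                    p_s = joint.sum()
                    if p_s > 0:
                        b_next = joint[0] / p_s       # next belief Pr(h' = 1)
                        future += p_s * V[np.abs(grid - b_next).argmin()]
                Q[k, a] = rho + gamma * future
        return Q

    for _ in range(n_iter):
        V = backup(V).max(axis=1)
    return grid, V, backup(V).argmax(axis=1)          # approximate V* and pi* on the grid

# Hypothetical POMDP: kernel and reward are random placeholders, not the Table 1 data.
rng = np.random.default_rng(2)
kappa = rng.random((2, 2, 2, 2)); kappa /= kappa.sum(axis=(0, 1), keepdims=True)
r = rng.random((2, 2))
grid, V_star, pi_star = pomdp_grid_solver(kappa, r)
```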

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Biehl, M., Virgo, N. (2023). Interpreting Systems as Solving POMDPs: A Step Towards a Formal Understanding of Agency. In: Buckley, C.L., et al. Active Inference. IWAI 2022. Communications in Computer and Information Science, vol 1721. Springer, Cham. https://doi.org/10.1007/978-3-031-28719-0_2

  • DOI: https://doi.org/10.1007/978-3-031-28719-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28718-3

  • Online ISBN: 978-3-031-28719-0

  • eBook Packages: Computer Science (R0)
