Interpreting Systems as Solving POMDPs: A Step Towards a Formal Understanding of Agency

  • Conference paper
  • In: Active Inference (IWAI 2022)

Abstract

Under what circumstances can a system be said to have beliefs and goals, and how do such agency-related features relate to its physical state? Recent work has proposed a notion of interpretation map, a function that maps the state of a system to a probability distribution representing its beliefs about an external world. Such a map is not completely arbitrary, as the beliefs it attributes to the system must evolve over time in a manner that is consistent with Bayes’ theorem, and consequently the dynamics of a system constrain its possible interpretations. Here we build on this approach, proposing a notion of interpretation not just in terms of beliefs but in terms of goals and actions. To do this we make use of the existing theory of partially observable Markov decision processes (POMDPs): we say that a system can be interpreted as a solution to a POMDP if it not only admits an interpretation map describing its beliefs about the hidden state of a POMDP but also takes actions that are optimal according to its belief state. An agent is then a system together with an interpretation of this system as a POMDP solution. Although POMDPs are not the only possible formulation of what it means to have a goal, this nevertheless represents a step towards a more general formal definition of what it means for a system to be an agent.

This project was made possible through the support of Grant 62229 from the John Templeton Foundation. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation. Work on this project was also supported by a grant from GoodAI.


Notes

  1.

    These machines are probably equivalent to the sufficient information state processes in [9, definition 2] but establishing this is beyond the scope of this work.

  2.

    The FEP literature includes both publications on how to construct agents that solve problems (e.g. [8]) and publications on when a system of stochastic differential equations contains an agent performing approximate Bayesian inference. Only the latter literature addresses a question comparable to the one addressed in the present manuscript.

  3.

    We choose actions to be deterministic functions of the machine state because the stochastic Moore machines considered here also have deterministic outputs. Other choices may be more suitable in other cases.

  4.

    If the denominator in Eq. (3) is zero for some value \(s \,{\in }\, {\mathcal {S}}\) then define e.g. \(f_{\pi ^*}(b,s)=b\).

References

  1. Beer, R.D.: The cognitive domain of a glider in the game of life. Artif. Life 20(2), 183–206 (2014). https://doi.org/10.1162/ARTL_a_00125

  2. Biehl, M., Ikegami, T., Polani, D.: Towards information based spatiotemporal patterns as a foundation for agent representation in dynamical systems. In: Proceedings of the Artificial Life Conference 2016, pp. 722–729. The MIT Press (2016). https://doi.org/10.7551/978-0-262-33936-0-ch115. https://mitpress.mit.edu/sites/default/files/titles/content/conf/alife16/ch115.html

  3. Da Costa, L., Friston, K., Heins, C., Pavliotis, G.A.: Bayesian mechanics for stationary processes. arXiv:2106.13830 [math-ph, physics:nlin, q-bio] (2021)

  4. Dennett, D.C.: True believers: the intentional strategy and why it works. In: Heath, A.F. (ed.) Scientific Explanation: Papers Based on Herbert Spencer Lectures Given in the University of Oxford, pp. 53–75. Clarendon Press (1981)

  5. Friston, K.: Life as we know it. J. R. Soc. Interface 10(86) (2013). https://doi.org/10.1098/rsif.2013.0475. http://rsif.royalsocietypublishing.org/content/10/86/20130475

  6. Friston, K.: A free energy principle for a particular physics. arXiv:1906.10184 [q-bio] (2019)

  7. Friston, K., et al.: The free energy principle made simpler but not too simple (2022). https://doi.org/10.48550/arXiv.2201.06387 [cond-mat, physics:nlin, physics:physics, q-bio]

  8. Friston, K., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., Pezzulo, G.: Active inference and epistemic value. Cogn. Neurosci. 6(4), 187–214 (2015). https://doi.org/10.1080/17588928.2015.1020053

  9. Hauskrecht, M.: Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000). https://doi.org/10.1613/jair.678. http://arxiv.org/abs/1106.0234 [cs]

  10. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1–2), 99–134 (1998). https://doi.org/10.1016/S0004-3702(98)00023-X. http://www.sciencedirect.com/science/article/pii/S000437029800023X

  11. Kenton, Z., Kumar, R., Farquhar, S., Richens, J., MacDermott, M., Everitt, T.: Discovering agents (2022). https://doi.org/10.48550/arXiv.2208.08345 [cs]

  12. Orseau, L., McGill, S.M., Legg, S.: Agents and devices: a relative definition of agency. arXiv:1805.12387 [cs, stat] (2018)

  13. Parr, T.: Inferential dynamics: comment on: How particular is the physics of the free energy principle? by Aguilera et al. Phys. Life Rev. 42, 1–3 (2022). https://doi.org/10.1016/j.plrev.2022.05.006. https://www.sciencedirect.com/science/article/pii/S1571064522000276

  14. Parr, T., Da Costa, L., Friston, K.: Markov blankets, information geometry and stochastic thermodynamics. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 378(2164), 20190159 (2020). https://doi.org/10.1098/rsta.2019.0159. https://royalsocietypublishing.org/doi/full/10.1098/rsta.2019.0159

  15. Sondik, E.J.: The optimal control of partially observable Markov processes over the infinite horizon: discounted costs. Oper. Res. 26(2), 282–304 (1978). http://www.jstor.org/stable/169635

  16. Virgo, N., Biehl, M., McGregor, S.: Interpreting dynamical systems as Bayesian reasoners. arXiv:2112.13523 [cs, q-bio] (2021)

Author information

Corresponding author: Martin Biehl.

Appendices

A Consistency in More Familiar Form

One way to turn Eq. (2) into a more familiar form is to introduce some abbreviations and look at some special cases. We follow a strategy similar to that of [16]. Let

$$\begin{aligned} \psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m){:}{=}\sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\alpha (m)}(a) \end{aligned}$$
(8)

and

$$\begin{aligned} \psi _{{\mathcal {I}}}(i|m){:}{=}\sum _{h' \in {\mathcal {H}}}\psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m). \end{aligned}$$
(9)

Now consider the case of a deterministic machine and choose the \(m' \in {\mathcal {M}}\) that actually occurs for a given input \(i \in {\mathcal {I}}\), i.e. the \(m'\) with \({\mu }(m'|i,m)=1\), or, abusing notation, \(m' = m'(i,m)\). Then Eq. (2) gives:

$$\begin{aligned} \psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m)=\psi (h'|{\mu }(i,m)) \psi _{{\mathcal {I}}}(i|m). \end{aligned}$$
(10)

If we also consider an input \(i \in {\mathcal {I}}\) that is subjectively possible as defined in [16], which here means that \(\psi _{{\mathcal {I}}}(i|m)>0\), we get

$$\begin{aligned} \psi (h'|m'(i,m))= \frac{\psi _{{\mathcal {H}},{\mathcal {I}}}(h',i|m)}{\psi _{{\mathcal {I}}}(i|m)}. \end{aligned}$$
(11)

This makes it more apparent that, under the interpretation, the updated machine state \(m'=m'(i,m)\) parameterizes a belief \(\psi (h'|m'(i,m))\) which is equal to the posterior distribution over the hidden state given input i. In the non-deterministic case, note that when \({\mu }(m'|i,m)=0\) the consistency equation imposes no condition, which makes sense since that machine state \(m'\) can never occur. When \({\mu }(m'|i,m)>0\) we can divide Eq. (2) by this probability to again obtain Eq. (10). The subsequent argument then holds not only for the single possible next state \(m'=m'(i,m)\) but for every \(m'\) with \({\mu }(m'|i,m)>0\). So in this case (if i is subjectively possible) any of the possible next states parameterizes a belief \(\psi (h'|m')\) equal to the posterior.
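
To make the bookkeeping in Eqs. (8)–(11) concrete, here is a small Python sketch. It uses a hypothetical finite example with randomly generated kernels (nothing from the main text) and computes \(\psi _{{\mathcal {H}},{\mathcal {I}}}\), \(\psi _{{\mathcal {I}}}\) and the posterior of Eq. (11) that consistency requires the next belief \(\psi (\cdot |m')\) to equal.

```python
import numpy as np

# Hypothetical finite example: hidden states H, inputs I, machine states M, actions A.
nH, nI, nM, nA = 2, 2, 3, 2
rng = np.random.default_rng(0)

# POMDP kernel kappa(h', i | h, a), stored as kappa[hp, i, h, a], normalized over (h', i).
kappa = rng.random((nH, nI, nH, nA))
kappa /= kappa.sum(axis=(0, 1), keepdims=True)

# Interpretation map psi(h | m), stored as psi[h, m]; a distribution over H for every m.
psi = rng.random((nH, nM))
psi /= psi.sum(axis=0, keepdims=True)

# Deterministic action function alpha(m), playing the role of delta_{alpha(m)}(a) in Eq. (8).
alpha = rng.integers(0, nA, size=nM)

def psi_HI(m):
    """Joint belief psi_{H,I}(h', i | m) of Eq. (8)."""
    a = alpha[m]
    return np.einsum('jih,h->ji', kappa[:, :, :, a], psi[:, m])

def psi_I(m):
    """Marginal belief psi_I(i | m) over the next input, Eq. (9)."""
    return psi_HI(m).sum(axis=0)

def required_posterior(m, i):
    """Posterior of Eq. (11) that consistency forces psi(. | m'(i, m)) to equal,
    defined whenever the input i is subjectively possible (psi_I(i | m) > 0)."""
    joint = psi_HI(m)
    return joint[:, i] / joint[:, i].sum()

print(psi_I(0))                  # distribution over next inputs for machine state m = 0
print(required_posterior(0, 1))  # the belief that psi(. | m'(1, 0)) would have to equal
```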

B Proof of Proposition 1

For the readers’s convenience we recall the proposition:

Proposition 1 (restated)

Consider a stochastic Moore machine \(({\mathcal {M}},{\mathcal {I}},{\mathcal {O}},{\mu },{\omega })\), together with a consistent interpretation as a solution to a POMDP, given by the POMDP \(({\mathcal {H}},{\mathcal {O}},{\mathcal {I}},\kappa ,{r})\) and Markov kernel \(\psi :{\mathcal {M}}\rightarrow P{\mathcal {H}}\). Suppose it is given an input \(i\in {\mathcal {I}}\), and that this input has a positive probability according to the interpretation (that is, Eq. (14) is obeyed). Then the parameterized distributions \(\psi (m)\) update as required by the belief state update equation (Eq. (3)) whenever \(a=\pi ^*(b)\), i.e. whenever the action taken is the optimal action. More formally, for any \(m,m' \in {\mathcal {M}}\) with \({\mu }(m'|i,m)>0\) and any \(i \in {\mathcal {I}}\) that can occur according to the POMDP transition and sensor kernels, we have for all \(h' \in {\mathcal {H}}\)

$$\begin{aligned} \psi (h'|m') = f(\psi (m),\pi ^*(\psi (m)),i)(h'). \end{aligned}$$
(12)

Proof

By assumption the machine part \(({\mathcal {M}},{\mathcal {I}},{\mu })\) of the stochastic Moore machine has a consistent Bayesian influenced filtering interpretation \(({\mathcal {H}},{\mathcal {O}},\psi ,{\omega },\kappa )\).

This means that the belief \(\psi (m)\) parameterized by the machine state obeys Eq. (2). In particular, for every possible next state \(m'\) (i.e. every \(m'\) with \({\mu }(m'|i,m)>0\)) we have

$$\begin{aligned} \begin{aligned} \sum _{h \in {\mathcal {H}}}&\sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) \\&= \psi (h'|m') \left( \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a)\right) \end{aligned} \end{aligned}$$
(13)

and for every subjectively possible input, that is, for every input \(i \in {\mathcal {I}}\) with

$$\begin{aligned} \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) >0 \end{aligned}$$
(14)

(see below for a note on why this assumption is reasonable) we will have:

$$\begin{aligned} \psi (h'|m') =&\,\frac{ \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) }{ \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a)}\end{aligned}$$
(15)
$$\begin{aligned} =&\, \frac{ \sum _{h \in {\mathcal {H}}}\kappa (h',i|h,\omega (m)) \psi (h|m) }{ \sum _{h \in {\mathcal {H}}}\sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,\omega (m)) \psi (h|m)}. \end{aligned}$$
(16)

Now consider the belief update function, Eq. (3), in terms of which the optimal policy is defined:

$$\begin{aligned} f(b,a,s)(h') {:}{=}&\, \frac{\sum _{h \in {\mathcal {H}}}\kappa (h',s|h,a)b(h)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',s|\bar{h},a)b(\bar{h})} \end{aligned}$$
(17)

and plug in the belief \(b=\psi (m)\) parameterized by the machine state, the optimal action \(\pi ^*(\psi (m))\) specified for that belief by the optimal policy \(\pi ^*\), and the sensor value \(s=i\):

$$\begin{aligned} f(\psi (m),\pi ^*(\psi (m)),i)(h') =&\, \frac{\sum _{h \in {\mathcal {H}}}\kappa (h',i|h,\pi ^*(\psi (m)))\psi (m)(h)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',i|\bar{h},\pi ^*(\psi (m)))\psi (m)(\bar{h})}. \end{aligned}$$
(18)

Writing \(\psi (h|m)\) for \(\psi (m)(h)\) as usual, and using that by the definition of an interpretation as a solution the machine's output is the optimal action for its belief, \(\omega (m)=\pi ^*(\psi (m))\), the right hand side coincides with the right hand side of Eq. (16), so

$$\begin{aligned} f(\psi (m),\pi ^*(\psi (m)),i)(h') =&\, \frac{\sum _{h \in {\mathcal {H}}}\kappa (h',i|h,\pi ^*(\psi (m)))\psi (h|m)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',i|\bar{h},\pi ^*(\psi (m)))\psi (\bar{h}|m)}\end{aligned}$$
(19)
$$\begin{aligned} =&\, \psi (h'|m'). \end{aligned}$$
(20)

This is what we wanted to prove.

Note that if Eq. (14) does not hold, i.e. the input i is impossible according to the POMDP kernel \(\kappa \), the interpretation map \(\psi \), and the action \(\omega (m)\), then Eq. (13) puts no constraint on the machine kernel \({\mu }\) since both sides are zero. The behavior of the stochastic Moore machine in this case is therefore arbitrary. This makes sense: according to the POMDP that we use to interpret the machine this input is impossible, so our interpretation should tell us nothing about this situation.

C Sondik’s Example

We now consider the example from [15], which has a known optimal solution. From this solution we constructed a stochastic Moore machine that has an interpretation as a solution to Sondik’s POMDP. This proves the existence of stochastic Moore machines with such interpretations.

Consider the following stochastic Moore machine:

  • State space \({\mathcal {M}}{:}{=}[0,1]\). (This state will be interpreted as the belief probability of the hidden state being equal to 1.)

  • input space \({\mathcal {I}}=\{1,2\}\)

  • machine kernel \({\mu }:{\mathcal {I}}\times {\mathcal {M}}\rightarrow P{\mathcal {M}}\) defined by deterministic function \(g:{\mathcal {I}}\times {\mathcal {M}}\rightarrow {\mathcal {M}}\):

    $$\begin{aligned} {\mu }(m'|s,m){:}{=}\delta _{g(s,m)}(m') \end{aligned}$$
    (21)

    where

    $$\begin{aligned} g(S=1,m){:}{=}{\left\{ \begin{array}{ll} \frac{15}{6m+20}-\frac{1}{2} &{}\text { if } 0 \le m \le 0.1188\\ \frac{9}{5} - \frac{72}{5m+60} &{}\text { if } 0.1188 \le m \le 1. \end{array}\right. } \end{aligned}$$
    (22)

    and

    $$\begin{aligned} g(S=2,m){:}{=}{\left\{ \begin{array}{ll} 2+\frac{20}{3m-15} &{}\text { if } 0 \le m \le 0.1188\\ - \frac{1}{5} - \frac{12}{5m-40} &{}\text { if } 0.1188 \le m \le 1. \end{array}\right. } \end{aligned}$$
    (23)
  • output space \({\mathcal {O}}{:}{=}\{1,2\}\)

  • expose kernel \(\omega :{\mathcal {M}}\rightarrow {\mathcal {O}}\) defined by

    $$\begin{aligned} \omega (m){:}{=}{\left\{ \begin{array}{ll} 1 \text { if } 0 \le m < 0.1188\\ 2 \text { if } 0.1188 \le m \le 1. \end{array}\right. } \end{aligned}$$
    (24)

A consistent interpretation as a solution to a POMDP for this stochastic Moore machine is given by

  • The POMDP with

    • state space \({\mathcal {H}}{:}{=}\{1,2\}\)

    • action space equal to the output space \({\mathcal {O}}\) of the machine above

    • sensor space equal to the input space \({\mathcal {I}}\) of the machine above

    • model kernel \(\kappa :{\mathcal {H}}\times {\mathcal {O}}\rightarrow P({\mathcal {H}}\times {\mathcal {I}})\) defined by

      $$\begin{aligned} \kappa (h',s|h,a){:}{=}\nu (h'|h,a)\phi (s|h',a) \end{aligned}$$
      (25)

      where \(\nu :{\mathcal {H}}\times {\mathcal {O}}\rightarrow P{\mathcal {H}}\) and \(\phi :{\mathcal {H}}\times {\mathcal {O}}\rightarrow P{\mathcal {I}}\) are shown in Table 1

      Table 1. Sondik’s POMDP data.
    • reward function \(r:{\mathcal {H}}\times {\mathcal {O}}\rightarrow \mathbb {R}\) also shown in Table 1.

  • Markov kernel \(\psi : {\mathcal {M}}\rightarrow P{\mathcal {H}}\) given by:

    $$\begin{aligned} \psi (h|m):=m^{\delta _1(h)}(1-m)^{\delta _2(h)}. \end{aligned}$$
    (26)

To verify this we have to check that \(({\mathcal {H}},{\mathcal {O}},\psi ,{\omega },\kappa )\) is a consistent Bayesian influenced filtering interpretation of the machine \(({\mathcal {M}},{\mathcal {I}},{\mu })\). For this we need to check Eq. (2) with \(\delta _{\alpha (m)}(a){:}{=}\delta _{\omega (m)}(a)\). So for each \(m \in [0,1]\), \(h' \in \{1,2\}\), \(i \in \{1,2\}\), and \(m' \in [0,1]\) we need to check:

$$\begin{aligned} \begin{aligned} \left( \sum _{h \in {\mathcal {H}}}\right.&\left. \sum _{a \in {\mathcal {A}}}\kappa (h',i|h,a) \psi (h|m)\delta _{\omega (m)}(a) \right) {\mu }(m'|i,m)\\&= \psi (h'|m') \left( \sum _{h \in {\mathcal {H}}}\sum _{a \in {\mathcal {A}}} \sum _{ h'' \in {\mathcal {H}}}\kappa ( h'',i|h,a) \psi (h|m)\delta _{\omega (m)}(a)\right) {\mu }(m'|i,m). \end{aligned} \end{aligned}$$
(27)

This is tedious to check but true. We would usually also have to show that \(\omega \) is indeed the optimal policy for Sondik’s POMDP, but this is shown in [15].
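
The machine and interpretation above are easy to transcribe into code. The following Python sketch implements \(g\), \(\omega \) and \(\psi \) as defined in Eqs. (22)–(26) and provides a generic checker for the consistency condition Eq. (27). Since the numerical entries of Table 1 are not reproduced here, the kernel \(\kappa \) must be supplied by the reader (the 0-based index convention in the sketch is our own assumption), and only the Table-1-independent sanity checks are actually run.

```python
import numpy as np

B = 0.1188  # breakpoint of the update maps and of the policy

def g(s, m):
    """Deterministic machine update g(s, m) of Eqs. (22)-(23); s in {1, 2}, m in [0, 1]."""
    if s == 1:
        return 15.0 / (6 * m + 20) - 0.5 if m <= B else 9.0 / 5 - 72.0 / (5 * m + 60)
    else:
        return 2 + 20.0 / (3 * m - 15) if m <= B else -1.0 / 5 - 12.0 / (5 * m - 40)

def omega(m):
    """Output (action) function of Eq. (24)."""
    return 1 if m < B else 2

def psi(m):
    """Interpretation map of Eq. (26): belief (P(h=1), P(h=2)) parameterized by m."""
    return np.array([m, 1.0 - m])

def check_consistency(kappa, ms=np.linspace(0.0, 1.0, 101), tol=1e-9):
    """Check Eq. (27) for this deterministic machine on a grid of machine states.

    `kappa[hp, s, h, a]` must contain kappa(h', s | h, a) built from the Table 1
    kernels nu and phi (0-based indices, i.e. value k is stored at index k-1);
    it is not reproduced in this sketch.
    """
    for m in ms:
        a = omega(m) - 1
        for s in (1, 2):
            joint = kappa[:, s - 1, :, a] @ psi(m)   # psi_{H,I}(., s | m)
            norm = joint.sum()                        # psi_I(s | m)
            if norm <= tol:
                continue                              # subjectively impossible input
            if not np.allclose(psi(g(s, m)), joint / norm, atol=1e-6):
                return False
    return True

# Sanity checks that need no Table 1 data: g maps into [0, 1] and psi is a distribution.
for m in np.linspace(0.0, 1.0, 101):
    for s in (1, 2):
        assert 0.0 <= g(s, m) <= 1.0
    assert np.isclose(psi(m).sum(), 1.0)
```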

D POMDPs and Belief State MDPs

Here we give some more details about belief state MDPs and the optimal value function and policy of Eqs. (4) and (5). There is no original content in this section and it follows closely the expositions in [9, 10].

We first define an MDP and its solution and then add some details about the belief state MDP associated to a POMDP.

Definition 6

A Markov decision process (MDP) can be defined as a tuple \(({\mathcal {X}},{\mathcal {A}},{\nu },{r})\) consisting of a set \({\mathcal {X}}\) called the state space, a set \({\mathcal {A}}\) called the action space, a Markov kernel \({\nu }:{\mathcal {X}}\times {\mathcal {A}}\rightarrow P({\mathcal {X}})\) called the transition kernel, and a reward function \({r}:{\mathcal {X}}\times {\mathcal {A}}\rightarrow \mathbb {R}\). Here, the transition kernel takes a state \(x \in {\mathcal {X}}\) and an action \(a \in {\mathcal {A}}\) to a probability distribution \({\nu }(x,a)\) over next states, and the reward function returns a real-valued instantaneous reward \(r(x,a)\) depending on the state and the action.

A solution to a given MDP is a control policy. As the goal of the MDP we here choose the maximization of expected cumulative discounted reward over an infinite time horizon (an alternative would be to consider finite time horizons). This means that an optimal policy maximizes

$$\begin{aligned} \mathbb {E}\left[ \sum _{t=1}^\infty \gamma ^{t-1} {r}(x_t,a_t)\right] . \end{aligned}$$
(28)

where \(0<\gamma <1\) is a parameter called the discount factor. This specifies the goal.

To express the optimal policy explicitly we can use the optimal value function \(V^*:{\mathcal {X}}\rightarrow \mathbb {R}\). This is the solution to the Bellman equation [10]:

$$\begin{aligned} V^*(x)=&\max _{a \in {\mathcal {A}}} \left( {r}(x,a) + \gamma \sum _{x' \in {\mathcal {X}}} {\nu }(x'|a,x) V^*(x') \right) . \end{aligned}$$
(29)

The optimal policy is then the function \(\pi ^*:{\mathcal {X}}\rightarrow {\mathcal {A}}\) that greedily maximizes the optimal value function [10]:

$$\begin{aligned} \pi ^*(x)=&\mathop {\mathrm {arg\,max}}\limits _{a \in {\mathcal {A}}} \left( {r}(x,a) + \gamma \sum _{x' \in {\mathcal {X}}} {\nu }(x'|a,x) V^*(x') \right) . \end{aligned}$$
(30)
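
For a finite MDP the optimal value function and policy of Eqs. (29) and (30) can be computed by standard value iteration. The following Python sketch is a minimal illustration with an arbitrary randomly generated MDP; none of the numbers come from the paper.

```python
import numpy as np

def value_iteration(nu, r, gamma=0.95, tol=1e-8):
    """Solve the Bellman equation (29) by fixed-point iteration.

    nu[x_next, a, x] = transition probability nu(x'|a, x)
    r[x, a]          = instantaneous reward
    Returns the optimal value function V* and the greedy policy pi* of Eq. (30).
    """
    n_states, n_actions = r.shape
    V = np.zeros(n_states)
    while True:
        # Q[x, a] = r(x, a) + gamma * sum_x' nu(x'|a, x) V(x')
        Q = r + gamma * np.einsum('yax,y->xa', nu, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny random example (2 states, 2 actions), purely illustrative.
rng = np.random.default_rng(0)
nu = rng.random((2, 2, 2))
nu /= nu.sum(axis=0, keepdims=True)   # normalize over next states x'
r = rng.random((2, 2))
V_star, pi_star = value_iteration(nu, r)
print(V_star, pi_star)
```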

D.1 Belief State MDP

The belief state MDP for a POMDP (see Definition 4) is defined using the belief state update function of Eq. (3). We first define this function again here with an additional intermediate step:

$$\begin{aligned} f(b,a,s)(h'){:}{=}&\, Pr(h'|b,a,s)\end{aligned}$$
(31)
$$\begin{aligned} =&\,\frac{Pr(h',s|b,a)}{Pr(s|b,a)}\end{aligned}$$
(32)
$$\begin{aligned} =&\,\frac{\sum _{h \in {\mathcal {H}}}\kappa (h',s|h,a)b(h)}{\sum _{\bar{h},\bar{h}' \in {\mathcal {H}}} \kappa (\bar{h}',s|\bar{h},a)b(\bar{h})}. \end{aligned}$$
(33)

The function \(f(b,a,s)\) returns the posterior belief over the next hidden state \(h'\) given prior belief \(b \in P{\mathcal {H}}\), an action \(a \in {\mathcal {A}}\) and observation \(s \in {\mathcal {S}}\). The Markov kernel \(\delta _{f}:P{\mathcal {H}}\times {\mathcal {A}}\times {\mathcal {S}}\rightarrow P(P{\mathcal {H}})\) associated to this function can be seen as the probability of the next belief state \(b'\) given current belief state b, action a and sensor value s:

$$\begin{aligned} Pr(b'|b,a,s) =\delta _{f(b,a,s)}(b'). \end{aligned}$$
(34)

Intuitively, the belief state MDP has as its transition kernel the probability \(Pr(b'|b,a)\) of the next belief state \(b'\), expected over all next sensor values, given that the current belief state is b, the action is a, and beliefs get updated according to the rules of probability, so

$$\begin{aligned} Pr(b'|b,a)=&\sum _s Pr(b'|b,a,s) Pr(s|b,a)\end{aligned}$$
(35)
$$\begin{aligned} =&\sum _{s \in {\mathcal {S}}} \delta _{f(b,a,s)}(b') \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h). \end{aligned}$$
(36)

This gives some intuition behind the definition of belief state MDPs.
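
The belief update \(f(b,a,s)\) of Eq. (33) and the induced distribution over next beliefs of Eqs. (35)–(36) are short to write down for a finite POMDP. The Python sketch below uses a hypothetical randomly generated kernel \(\kappa \) (not anything from the paper) and, following footnote 4, leaves the belief unchanged when the observation has probability zero.

```python
import numpy as np

def belief_update(kappa, b, a, s):
    """Posterior belief f(b, a, s) of Eq. (33).

    kappa[h_next, s, h, a] = kappa(h', s | h, a).
    If Pr(s | b, a) = 0 the observation is impossible under b; following
    footnote 4 we simply return b unchanged in that case.
    """
    joint = kappa[:, s, :, a] @ b   # Pr(h', s | b, a) as a vector over h'
    norm = joint.sum()               # Pr(s | b, a)
    return b if norm == 0 else joint / norm

def next_belief_distribution(kappa, b, a):
    """All next beliefs b' = f(b, a, s) with their probabilities Pr(s | b, a), Eq. (36)."""
    n_sensors = kappa.shape[1]
    out = []
    for s in range(n_sensors):
        p_s = (kappa[:, s, :, a] @ b).sum()
        if p_s > 0:
            out.append((p_s, belief_update(kappa, b, a, s)))
    return out

# Hypothetical 2-state, 2-action, 2-sensor POMDP kernel, normalized over (h', s).
rng = np.random.default_rng(1)
kappa = rng.random((2, 2, 2, 2))
kappa /= kappa.sum(axis=(0, 1), keepdims=True)

b = np.array([0.5, 0.5])
for p_s, b_next in next_belief_distribution(kappa, b, a=0):
    print(p_s, b_next)
```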

Definition 7

Given a POMDP \(({\mathcal {H}},{\mathcal {A}},{\mathcal {S}},\kappa ,{r})\) the associated belief state Markov decision process (belief state MDP) is the MDP \((P{\mathcal {H}},{\mathcal {A}},{\beta },{\rho })\) where

  • the state space \(P{\mathcal {H}}\) is the space of probability distributions (beliefs) over the hidden state of the POMDP. We write b(h) for the probability of a hidden state \(h \in {\mathcal {H}}\) according to belief \(b \in P{\mathcal {H}}\).

  • the action space \(\mathcal {A}\) is the same as for the underlying POMDP

  • the transition kernel \({\beta }: P{\mathcal {H}}\times {\mathcal {A}}\rightarrow P(P{\mathcal {H}})\) is defined as [10, Section 3.4]

    $$\begin{aligned} {\beta }(b'|b,a){:}{=}&\sum _{s \in {\mathcal {S}}} \delta _{f(b,a,s)}(b') \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h). \end{aligned}$$
    (37)
  • the reward function \({\rho }:P{\mathcal {H}}\times {\mathcal {A}}\rightarrow \mathbb {R}\) is defined as

    $$\begin{aligned} {\rho }(b,a){:}{=}\sum _{h \in {\mathcal {H}}} b(h) {r}(h,a). \end{aligned}$$
    (38)

    So the reward for action a under belief b is the expectation, under belief b, of the original POMDP reward for that action.

D.2 Optimal Belief-MDP Policy

Using the belief MDP we can express the optimal policy for the POMDP.

The optimal policy can be expressed in terms of the optimal value function of the belief MDP. This is the solution to the equation [9]

$$\begin{aligned} V^*(b)=&\max _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{b' \in P{\mathcal {H}}} {\beta }(b'|a,b) V^*(b') \right) \end{aligned}$$
(39)
$$\begin{aligned} V^*(b)=&\max _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{b' \in P{\mathcal {H}}} \sum _{s \in {\mathcal {S}}} \delta _{f(b,a,s)}(b') \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h) V^*(b') \right) \end{aligned}$$
(40)
$$\begin{aligned} V^*(b)=&\max _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{s \in {\mathcal {S}}} \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h) V^*(f(b,a,s)) \right) . \end{aligned}$$
(41)

This is the expression we used in Eq. (4). The optimal policy for the belief MDP is then [9]:

$$\begin{aligned} \pi ^*(b)=&\mathop {\mathrm {arg\,max}}\limits _{a \in {\mathcal {A}}} \left( {\rho }(b,a) + \gamma \sum _{s \in {\mathcal {S}}} \sum _{h,h' \in {\mathcal {H}}} \kappa (h',s|h,a)b(h) V^*(f(b,a,s)) \right) . \end{aligned}$$
(42)

This is the expression we used in Eq. (5).
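
A crude but simple way to evaluate Eqs. (41) and (42) numerically is to discretize the belief space and iterate the Bellman backup on the grid; the exact \(V^*\) is piecewise linear and convex [15], so this is only an approximation. The Python sketch below does this for a POMDP with two hidden states, using a hypothetical randomly generated kernel and reward (not the Table 1 data).

```python
import numpy as np

def pomdp_grid_solver(kappa, r, gamma=0.95, n_grid=201, n_iter=200):
    """Approximate V* (Eq. (41)) and pi* (Eq. (42)) for a 2-hidden-state POMDP.

    The belief is summarized by b1 = Pr(h = 1) and discretized on a grid; V*(f(b,a,s))
    is read off at the nearest grid point. kappa[h_next, s, h, a] = kappa(h', s | h, a),
    r[h, a] = reward. This is a rough stand-in for the exact piecewise-linear solution.
    """
    nH, nS, _, nA = kappa.shape
    assert nH == 2
    grid = np.linspace(0.0, 1.0, n_grid)
    V = np.zeros(n_grid)

    def backup(V):
        Q = np.zeros((n_grid, nA))
        for k, b1 in enumerate(grid):
            b = np.array([b1, 1.0 - b1])
            for a in range(nA):
                rho = b @ r[:, a]                    # rho(b, a), Eq. (38)
                future = 0.0
                for s in range(nS):
                    joint = kappa[:, s, :, a] @ b    # Pr(h', s | b, a)
                    p_s = joint.sum()
                    if p_s > 0:
                        b_next = joint[0] / p_s       # next belief Pr(h' = 1)
                        future += p_s * V[np.abs(grid - b_next).argmin()]
                Q[k, a] = rho + gamma * future
        return Q

    for _ in range(n_iter):
        V = backup(V).max(axis=1)
    return grid, V, backup(V).argmax(axis=1)          # approximate V* and pi* on the grid

# Hypothetical POMDP: kernel and reward are random placeholders, not the Table 1 data.
rng = np.random.default_rng(2)
kappa = rng.random((2, 2, 2, 2)); kappa /= kappa.sum(axis=(0, 1), keepdims=True)
r = rng.random((2, 2))
grid, V_star, pi_star = pomdp_grid_solver(kappa, r)
```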

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Biehl, M., Virgo, N. (2023). Interpreting Systems as Solving POMDPs: A Step Towards a Formal Understanding of Agency. In: Buckley, C.L., et al. Active Inference. IWAI 2022. Communications in Computer and Information Science, vol 1721. Springer, Cham. https://doi.org/10.1007/978-3-031-28719-0_2

  • DOI: https://doi.org/10.1007/978-3-031-28719-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28718-3

  • Online ISBN: 978-3-031-28719-0

  • eBook Packages: Computer Science (R0)
