POMDP Controllers with Optimal Budget

  • Conference paper
  • In: Quantitative Evaluation of Systems (QEST 2022)

Abstract

Parametric Markov chains (pMCs) have transitions labeled with functions over a fixed set of parameters. They are useful if the exact transition probabilities are uncertain, e.g., when checking a model for robustness. This paper presents a simple way to check whether the expected total reward until reaching a given target state is monotonic in (some of) the parameters. We exploit this monotonicity together with parameter lifting to find an \(\varepsilon \)-close bound on the optimal expected total reward. Our results are also useful to automatically synthesise controllers with a fixed memory structure for partially observable Markov decision processes (POMDPs), a popular model in AI planning. We experimentally show that our approach can successfully find \(\varepsilon \)-optimal controllers for optimal budget in such POMDPs.

Supported by DFG RTG 2236 “UnRAVeL” and DFG 1462/4-1 “PASYWI”.


Notes

  1. ETR = Existential Theory of the Reals. ETR-complete decision problems are as hard as deciding whether a multivariate polynomial has a real root. ETR lies between NP and PSPACE.

  2. To be precise, on the interior of the closed set R.

References

  1. Amato, C., Bernstein, D.S., Zilberstein, S.: Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Auton. Agents Multi Agent Syst. 21(3), 293–320 (2010)

  2. Baier, C., Hensel, C., Hutschenreiter, L., Junges, S., Katoen, J.P., Klein, J.: Parametric Markov chains: PCTL complexity and fraction-free Gaussian elimination. Inf. Comput. 272 (2020)

  3. Baier, C., Katoen, J.P.: Principles of Model Checking. MIT Press, Cambridge (2008)

  4. Bonet, B.: Solving large POMDPs using real time dynamic programming. In: Proceedings of the AAAI Fall Symposium on POMDPs (1998)

  5. Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI, pp. 1641–1646 (2009)

  6. Bork, A., Junges, S., Katoen, J.-P., Quatmann, T.: Verification of indefinite-horizon POMDPs. In: Hung, D.V., Sokolsky, O. (eds.) ATVA 2020. LNCS, vol. 12302, pp. 288–304. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59152-6_16

  7. Bork, A., Katoen, J.-P., Quatmann, T.: Under-approximating expected total rewards in POMDPs. In: TACAS 2022. LNCS, vol. 13244, pp. 22–40. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99527-0_2

  8. Braziunas, D., Boutilier, C.: Stochastic local search for POMDP controllers. In: AAAI, pp. 690–696. AAAI Press/The MIT Press, Menlo Park (2004)

  9. Carr, S., Jansen, N., Topcu, U.: Verifiable RNN-based policies for POMDPs under temporal logic constraints. In: IJCAI, pp. 4121–4127. ijcai.org (2020)

  10. Carr, S., Jansen, N., Wimmer, R., Serban, A.C., Becker, B., Topcu, U.: Counterexample-guided strategy improvement for POMDPs using recurrent neural networks. In: IJCAI, pp. 5532–5539. ijcai.org (2019)

  11. Chatterjee, K., Chmelik, M., Davies, J.: A symbolic SAT-based algorithm for almost-sure reachability with small strategies in POMDPs. In: AAAI, pp. 3225–3232. AAAI Press, Menlo Park (2016)

  12. Chen, T., Feng, Y., Rosenblum, D.S., Su, G.: Perturbation analysis in verification of discrete-time Markov chains. In: Baldan, P., Gorla, D. (eds.) CONCUR 2014. LNCS, vol. 8704, pp. 218–233. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44584-6_16

  13. Cleaveland, M., Ruchkin, I., Sokolsky, O., Lee, I.: Monotonic safety for scalable and data-efficient probabilistic safety analysis. In: ICCPS, pp. 92–103. ACM (2022)

  14. Daws, C.: Symbolic and parametric model checking of discrete-time Markov chains. In: Liu, Z., Araki, K. (eds.) ICTAC 2004. LNCS, vol. 3407, pp. 280–294. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31862-0_21

  15. Doshi, F., Pineau, J., Roy, N.: Reinforcement learning with limited reinforcement: using Bayes risk for active learning in POMDPs. In: ICML. ACM International Conference Proceeding Series, vol. 307, pp. 256–263. ACM (2008)

  16. Fang, X., Calinescu, R., Gerasimou, S., Alhwikem, F.: Fast parametric model checking through model fragmentation. In: ICSE, pp. 835–846. IEEE (2021)

  17. Filieri, A., Ghezzi, C., Tamburrelli, G.: Run-time efficient probabilistic model checking. In: ICSE. ACM (2011)

  18. Gainer, P., Hahn, E.M., Schewe, S.: Accelerated model checking of parametric Markov chains. In: Lahiri, S.K., Wang, C. (eds.) ATVA 2018. LNCS, vol. 11138, pp. 300–316. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01090-4_18

  19. Gouberman, A., Siegle, M., Tati, B.: Markov chains with perturbed rates to absorption: theory and application to model repair. Perform. Eval. 130, 32–50 (2019)

  20. Hahn, E.M., Han, T., Zhang, L.: Synthesis for PCTL in parametric Markov decision processes. In: Bobaru, M., Havelund, K., Holzmann, G.J., Joshi, R. (eds.) NFM 2011. LNCS, vol. 6617, pp. 146–161. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20398-5_12

  21. Hahn, E.M., Hermanns, H., Zhang, L.: Probabilistic reachability for parametric Markov models. Int. J. Softw. Tools Technol. Transfer 13, 319 (2011). https://doi.org/10.1007/s10009-010-0146-x

  22. Hartmanns, A., Klauck, M., Parker, D., Quatmann, T., Ruijters, E.: The quantitative verification benchmark set. In: Vojnar, T., Zhang, L. (eds.) TACAS 2019. LNCS, vol. 11427, pp. 344–350. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17462-0_20

  23. Hauskrecht, M.: Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000)

  24. Hensel, C., Junges, S., Katoen, J.-P., Quatmann, T., Volk, M.: The probabilistic model checker Storm. Int. J. Softw. Tools Technol. Transfer 1–22 (2021). https://doi.org/10.1007/s10009-021-00633-z

  25. Hutschenreiter, L., Baier, C., Klein, J.: Parametric Markov chains: PCTL complexity and fraction-free Gaussian elimination. In: GandALF. EPTCS, vol. 256 (2017)

  26. Jansen, N., Corzilius, F., Volk, M., Wimmer, R., Ábrahám, E., Katoen, J.-P., Becker, B.: Accelerating parametric probabilistic verification. In: Norman, G., Sanders, W. (eds.) QEST 2014. LNCS, vol. 8657, pp. 404–420. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10696-0_31

  27. Junges, S.: Parameter synthesis in Markov models. Ph.D. thesis, RWTH Aachen University, Germany (2020)

  28. Junges, S., et al.: Parameter synthesis for Markov models. CoRR abs/1903.07993 (2019)

  29. Junges, S., et al.: Finite-state controllers of POMDPs using parameter synthesis. In: UAI. AUAI Press, Monterey (2018)

  30. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1–2), 99–134 (1998)

  31. Lanotte, R., Maggiolo-Schettini, A., Troina, A.: Parametric probabilistic transition systems for system design and analysis. Formal Aspects Comput. 19(1), 93–109 (2007)

  32. Littman, M.L., Cassandra, A.R., Kaelbling, L.P.: Learning policies for partially observable environments: scaling up. In: ICML, pp. 362–370. Morgan Kaufmann (1995)

  33. Lovejoy, W.S.: Computationally feasible bounds for partially observed Markov decision processes. Oper. Res. 39(1), 162–175 (1991)

  34. Pathak, S., Ábrahám, E., Jansen, N., Tacchella, A., Katoen, J.P.: A greedy approach for the efficient repair of stochastic models. In: NFM. LNCS, vol. 9058 (2015)

  35. Quatmann, T., Dehnert, C., Jansen, N., Junges, S., Katoen, J.-P.: Parameter synthesis for Markov models: faster than ever. In: Artho, C., Legay, A., Peled, D. (eds.) ATVA 2016. LNCS, vol. 9938, pp. 50–67. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46520-3_4

  36. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 4th edn. Pearson, Boston (2020)

  37. Silver, D., Veness, J.: Monte-Carlo planning in large POMDPs. In: NIPS, pp. 2164–2172. Curran Associates, Inc., Red Hook (2010)

  38. Smith, T., Simmons, R.G.: Heuristic search value iteration for POMDPs. In: UAI, pp. 520–527. AUAI Press (2004)

  39. Spel, J., Junges, S., Katoen, J.-P.: Are parametric Markov chains monotonic? In: Chen, Y.-F., Cheng, C.-H., Esparza, J. (eds.) ATVA 2019. LNCS, vol. 11781, pp. 479–496. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-31784-3_28

  40. Spel, J., Junges, S., Katoen, J.P.: Are parametric Markov chains monotonic? Extended version. CoRR abs/1907.08491 (2019)

  41. Spel, J., Junges, S., Katoen, J.-P.: Finding provably optimal Markov chains. In: TACAS 2021. LNCS, vol. 12651, pp. 173–190. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72016-2_10

  42. Spel, J., Stein, S., Katoen, J.P.: POMDP controllers with optimal budget (artifact). Zenodo (2022). https://doi.org/10.5281/zenodo.6793377

  43. Wang, Y., Chaudhuri, S., Kavraki, L.E.: Bounded policy synthesis for POMDPs with safe-reachability objectives. In: AAMAS, pp. 238–246. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC / ACM (2018)

  44. Winkler, T., Junges, S., Pérez, G.A., Katoen, J.: On the complexity of reachability in parametric Markov decision processes. In: CONCUR. LIPIcs, vol. 140, pp. 14:1–14:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

  45. Winterer, L., et al.: Motion planning under partial observability using game-based abstraction. In: CDC, pp. 2201–2208. IEEE (2017)

  46. Yang, L., Murugesan, S., Zhang, J.: Real-time scheduling over Markovian channels: when partial observability meets hard deadlines. In: GLOBECOM, pp. 1–5. IEEE (2011)


Author information

Correspondence to Jip Spel.

Appendices

A Algorithm

In this section, we describe the algorithmic approach to obtain a sufficient reward order, which is used for monotonicity checking.

(Algorithm 1: iterative construction of a reward order; figure not reproduced)

Our algorithm takes as input a pMC and bounds for all states. It returns a set of annotated reward orders \((\mathcal {A}, \preceq ^{{\texttt {rew}}, \mathcal {A}})\), where \(\mathcal {A}\) is a set of assumptions of the form \(s \preceq ^{{\texttt {rew}}}_{}s'\). The algorithm iteratively computes a set of reward orders. At this stage, neither the bounds nor the assumptions are used yet.

Initially, we start with the trivial order for all \(s\in S\). The queue is initialised (l.1) with an empty set of assumptions, the trivial order and all non-target states in S. At each iteration, we pick an order from the queue. If all states are processed (l.5), we are done building the reward order. Otherwise, we pick a state s to process (l.7) and try to order this state based on the reasoning from Lemmas 2 and 3 in Algorithm 1. We pick this state in reversed topological order, as this increases the likelihood that all successor states are already contained in the order. Once we have processed s, we put the (possibly) extended order in the queue together with the set of assumptions and the states we still need to process (l.25). Note that, if the state reward is 0 (l.9–18), the algorithm is equivalent to Algorithm 1 extended for treating cycles as found in [39, Sect. 4.1 and 4.3]. As assumptions are not used, Algorithm 1 in fact computes a single reward order; it runs in time linear in the number of transitions. Note that this order is not necessarily sufficient for the pMC.
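To make the loop structure concrete, the following Python sketch mirrors the description above. It is not the implementation used in the paper; the helper try_to_order (standing in for the checks of Lemmas 2 and 3) and the state representation are hypothetical.

    from collections import deque

    def build_reward_orders(states_rev_topo, targets, try_to_order):
        # states_rev_topo: all states, listed in reversed topological order
        # targets: set of target states
        # try_to_order(order, s): hypothetical callback applying Lemmas 2 and 3;
        #   returns the (possibly) extended order
        todo = [s for s in states_rev_topo if s not in targets]
        queue = deque([(frozenset(), {}, todo)])      # (assumptions, order, states left) -- l.1
        finished = []
        while queue:
            assumptions, order, todo = queue.popleft()
            if not todo:                              # all states processed -- l.5
                finished.append((assumptions, order))
                continue
            s, rest = todo[0], todo[1:]               # pick the next state -- l.7
            order = try_to_order(dict(order), s)      # order s w.r.t. its successors
            queue.append((assumptions, order, rest))  # re-enqueue the extended order -- l.25
        return finished

Without assumptions, the queue never holds more than one order, so this procedure indeed computes a single reward order in one pass over the states.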

(Algorithm 2: extending the reward order with assumptions; figure not reproduced)

Assumptions. To obtain a sufficient reward order, we consider assumptions, as described in Sect. 3.3. We exploit the annotations (called assumptions) that were ignored so far. Recall from Definition 6 that a reward order is not sufficient at a parametric state s, if its successors \(s_1\) and \(s_2\) are not totally ordered. To remain sound, we consider all possible orderings of \(s_1\) and \(s_2\). We add for each possible ordering a copy of the reward order to the queue, and continue as if the ordering of \(s_1\) and \(s_2\) is known. We extend Algorithm 1 with assumptions by adding the code of Algorithm 2 after lines 18 and 24.
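Continuing the sketch above, branching on an assumption could look as follows (again with hypothetical helper names; the actual Algorithm 2 is inserted after lines 18 and 24 of Algorithm 1):

    def branch_on_assumption(queue, assumptions, order, todo, s1, s2):
        # s1 and s2 are successors of a parametric state that the current order
        # does not yet relate. Enqueue one copy of the order per possible
        # ordering and continue as if that ordering were known.
        for lo, hi in ((s1, s2), (s2, s1)):
            new_assumptions = assumptions | {(lo, hi)}   # assumption: lo <=_rew hi
            new_order = dict(order)
            new_order[(lo, hi)] = True                   # record lo <=_rew hi in the order
            queue.append((new_assumptions, new_order, todo))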

Using Bounds. As creating assumptions might lead to an exponential explosion in the number of orders in the queue, we also consider the situation in which we have bounds \(L_R(s)\) and \(U_R(s)\) at our disposal satisfying

$$ L_R(s) \, \le \, \textrm{ER}^{s}_{\mathcal {M}[\vec {u}]}(\lozenge T) \, \le \, U_R(s) \quad \text {for all } \vec {u} \in R~. $$

As these bounds are a relatively cheap way to order states, we extend Algorithm 1 by adding Algorithm 3 directly after line 7. This algorithm uses Lemma 6 to order the successor states of s in order to obtain a sufficient order for s (l.1–5). Furthermore, it uses Lemmas 7 and 8 to order a state relative to its successor states (l.10–15).
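A sketch of the bound-based checks for the two-successor case: the conditions of Lemma 7 are taken from its statement in Appendix C.2, while the shape of Lemma 6 (ordering two successors whose bound intervals do not overlap) is an assumption of this sketch and Lemma 8 is not shown. f_min and f_max are assumed to bound \(f = \mathcal {P}(s, s_1)\) over the region R.

    def order_with_bounds(L, U, r, s, s1, s2, f_min, f_max):
        # L, U: lower/upper bound maps L_R, U_R; r: state-reward map.
        # Assumes succ(s) = {s1, s2}, s1 <=_rew s2 already holds, and
        # 0 <= f_min <= f(u) <= f_max for all u in R.
        derived = set()
        # Assumed form of Lemma 6: non-overlapping bound intervals order the successors.
        if U[s1] <= L[s2]:
            derived.add((s1, s2))                    # s1 <=_rew s2
        # Lemma 7, case 1: r(s) >= f(u) * (U_R(s2) - L_R(s1)) for all u in R.
        # Since U_R(s2) - L_R(s1) >= 0 here, checking at f_max suffices.
        if r[s] >= f_max * (U[s2] - L[s1]):
            derived.add((s2, s))                     # s2 <=_rew s
        # Lemma 7, case 2: r(s) <= f(u) * (L_R(s2) - U_R(s1)) for all u in R.
        c = L[s2] - U[s1]
        if r[s] <= (f_min * c if c >= 0 else f_max * c):
            derived.add((s, s2))                     # s <=_rew s2
        return derived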

(Algorithm 3: ordering states using the bounds; figure not reproduced)
Table 3. Overview of the found \(\varepsilon \)-optimal values

B Results

Table 3 shows the \(\varepsilon \)-optimal values for the integrated approach, for \(\varepsilon =0.1\) and \(\varepsilon = 0.05\). For all entries, the results for \(\varepsilon =0.1\) and \(\varepsilon = 0.05\) with heuristics coincide. This confirms that our initial guess for CurMax is a good one, even though we do need to split the region. The splitting is necessary because the bound found with parameter lifting is not yet tight enough to confirm that CurMax is indeed an \(\varepsilon \)-optimum. When maximizing the 4\(\,\times \,\)4 grid (2), we find a value of 10685 with heuristics and 10679 without heuristics. This can be explained by the new way of updating CurMax.

C Proofs

C.1 Proof Sketch of Theorem 1

Theorem 1

$$\begin{aligned} \Big (\forall s \in S.\ \textsf{ER}^{s \rightarrow T}{}{\uparrow _{p}^{\ell }}\Big ) \ \text { implies } \ \textsf{ER}^{s_{ I } \rightarrow T}{}{\uparrow _{p}}. \end{aligned}$$

Proof

We sketch the proof of Theorem 1, which follows the lines of [39, Thm. 2]. First, we lift the notion of local monotonicity (Definition 4) to local monotonicity for n steps. Second, we claim that local monotonicity implies local monotonicity for n steps, and that global monotonicity follows from this.

We define the length of a finite path \(\hat{\pi }= s_0 s_1 \ldots s_n\) as \(|\hat{\pi }| = n\). Let \({ Paths}^{n}(s)\) denote the set of all paths of length n starting in state \(s\in S\).

Definition 8

(Locally monotonic increasing for n steps). \(\textsf{ER}_{\mathcal {M}}^{s \rightarrow T}\) is locally monotonic increasing for n steps in parameter p (at s) on region R, denoted \(\textsf{ER}_{\mathcal {M}}^{s \rightarrow T}{\uparrow _{p}^{\ell , {n}, R}}\), iff for all \(\vec {u} \in R\):

$$\begin{aligned} \left( \sum \limits _{\hat{\pi }\in { Paths}^{n}(s)} \left( \dfrac{{\scriptstyle \partial }}{{\scriptstyle \partial p}}{\Pr (\hat{\pi })}\right) \cdot \left( r(\hat{\pi }, \lozenge T) + \textsf{ER}_{\mathcal {M}}^{\hat{\pi }_n \rightarrow T}\right) \right) (\vec {u})\ \ge \ 0. \end{aligned}$$

For \(n=1\), Definition 8 corresponds to Definition 4 (made explicit below). The claim that local monotonicity implies local monotonicity for n steps (for all n) can be proven by induction over n. The claim that global monotonicity follows from this can be shown similarly to the proof of [39, Theorem 2, Equation 2]; the main difference is that no state \(s \in S\) exists for which \(\textsf{Pr}^{s \rightarrow T} = 0\), cf. Remark 1.
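To make the correspondence for \(n=1\) explicit, here is a sketch under two assumptions not spelled out above: \(r(\hat{\pi }, \lozenge T)\) collects the state rewards of all but the last state of \(\hat{\pi }\), and Definition 4 requires \(\sum _{s' \in \textsf {succ}(s)} \frac{\partial }{\partial p}\mathcal {P}(s,s') \cdot \textsf{ER}_{\mathcal {M}}^{s' \rightarrow T} \ge 0\) on R. Every path in \({ Paths}^{1}(s)\) has the form \(s\, s'\) with \(s' \in \textsf {succ}(s)\), so the sum in Definition 8 becomes

$$\begin{aligned} \sum \limits _{s' \in \textsf {succ}(s)} \left( \dfrac{{\scriptstyle \partial }}{{\scriptstyle \partial p}}{\mathcal {P}(s,s')}\right) \cdot \left( r(s) + \textsf{ER}_{\mathcal {M}}^{s' \rightarrow T}\right) (\vec {u})\ \ge \ 0. \end{aligned}$$

As \(\sum _{s' \in \textsf {succ}(s)} \mathcal {P}(s,s') = 1\), its derivative \(\sum _{s'} \frac{\partial }{\partial p}\mathcal {P}(s,s')\) vanishes, so the \(r(s)\) terms cancel and the condition reduces to \(\sum _{s'} \frac{\partial }{\partial p}\mathcal {P}(s,s') \cdot \textsf{ER}_{\mathcal {M}}^{s' \rightarrow T} \ge 0\) on R, i.e., local monotonicity at s.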

C.2 Proof of Lemma 7

Lemma 7

For any state s with \(\textsf {succ}(s) = \{ s_1, s_2 \}\), \(f = \mathcal {P}(s, s_1)\), region R and \(s_1 \preceq ^{{\texttt {rew}}}_{}s_2\): if for all \(\vec {u} \in R\)

  1. \(r(s) \ge f(\vec {u}) \cdot \left( U_R(s_2) - L_R(s_1)\right) \ \text { then } \ s_2 \preceq ^{{\texttt {rew}}}_{}s \text {, and }\)

  2. \( r(s) \le f(\vec {u}) \cdot \left( L_R(s_2) - U_R(s_1)\right) \ \text { then } \ s \preceq ^{{\texttt {rew}}}_{}s_2.\)

Proof

Let \(f' = \mathcal {P}(s, s_2) = 1{-}f\). For the first case, we derive for all \(\vec {u} \in R\):

$$\begin{aligned} r(s)&\ge {f(\vec {u})} \cdot \left( U_R(s_2) - L_R(s_1)\right) \\&\ge f(\vec {u}) \cdot \left( \textsf{ER}^{s_2 \rightarrow T} - \textsf{ER}^{s_1 \rightarrow T}\right) \\&= f(\vec {u}) \cdot \textsf{ER}^{s_2 \rightarrow T} - f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T}\\&= (1{-}f'(\vec {u})) \cdot \textsf{ER}^{s_2 \rightarrow T} - f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T}\\&= \textsf{ER}^{s_2 \rightarrow T} -f'(\vec {u}) \cdot \textsf{ER}^{s_2 \rightarrow T} - f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T}. \end{aligned}$$

From this, it immediately follows for all \(\vec {u} \in R\):

$$\begin{aligned} \underbrace{r(s) + f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T} + f'(\vec {u}) \cdot \textsf{ER}^{s_2 \rightarrow T}}_{= \textsf{ER}^{s \rightarrow T}} \ge \textsf{ER}^{s_2 \rightarrow T}, \end{aligned}$$

so \(s_2 \preceq ^{{\texttt {rew}}}_{}s\).

For the second case, the proof follows in a similar way.
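As a small illustration (not part of the proof), the following Python snippet numerically checks case 1 on a hand-picked instance with \(\textsf {succ}(s) = \{s_1, s_2\}\) and \(f = \mathcal {P}(s, s_1) = p\) on the region \(R = [0.1, 0.9]\); all concrete numbers are made up for the example.

    import random

    def check_lemma7_case1(trials=10000):
        # Hand-picked instance: ER^{s1->T} = 2, ER^{s2->T} = 5 (so s1 <=_rew s2),
        # bounds L_R(s1) = 1.5 and U_R(s2) = 5.5, and P(s, s1) = p on R = [0.1, 0.9].
        er1, er2 = 2.0, 5.0
        L1, U2 = 1.5, 5.5
        p_lo, p_hi = 0.1, 0.9
        r_s = p_hi * (U2 - L1)          # smallest r(s) satisfying the premise for all p in R
        for _ in range(trials):
            p = random.uniform(p_lo, p_hi)
            er_s = r_s + p * er1 + (1 - p) * er2   # ER^{s->T} at this parameter value
            assert er_s >= er2                      # conclusion of case 1: s2 <=_rew s
        print("Lemma 7, case 1 holds on", trials, "sampled parameter values")

    check_lemma7_case1()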


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Spel, J., Stein, S., Katoen, J.P. (2022). POMDP Controllers with Optimal Budget. In: Ábrahám, E., Paolieri, M. (eds.) Quantitative Evaluation of Systems. QEST 2022. Lecture Notes in Computer Science, vol. 13479. Springer, Cham. https://doi.org/10.1007/978-3-031-16336-4_6


  • DOI: https://doi.org/10.1007/978-3-031-16336-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16335-7

  • Online ISBN: 978-3-031-16336-4

  • eBook Packages: Computer Science, Computer Science (R0)
