POMDP Controllers with Optimal Budget

  • Conference paper
  • In: Quantitative Evaluation of Systems (QEST 2022)

Abstract

Parametric Markov chains (pMCs) have transitions labeled with functions over a fixed set of parameters. They are useful if the exact transition probabilities are uncertain, e.g., when checking a model for robustness. This paper presents a simple way to check whether the expected total reward until reaching a given target state is monotonic in (some of) the parameters. We exploit this monotonicity together with parameter lifting to find an \(\varepsilon \)-close bound on the optimal expected total reward. Our results are also useful to automatically synthesise controllers with a fixed memory structure for partially observable Markov decision processes (POMDPs), a popular model in AI planning. We experimentally show that our approach can successfully find \(\varepsilon \)-optimal controllers for optimal budget in such POMDPs.

Supported by DFG RTG 2236 “UnRAVeL” and DFG 1462/4-1 “PASYWI”.


Notes

  1. ETR = Existential Theory of the Reals. ETR-complete decision problems are as hard as deciding whether a multivariate polynomial has a real root. ETR lies between NP and PSPACE.

  2. To be precise, on the interior of the closed set R.

References

  1. Amato, C., Bernstein, D.S., Zilberstein, S.: Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Auton. Agents Multi Agent Syst. 21(3), 293–320 (2010)

  2. Baier, C., Hensel, C., Hutschenreiter, L., Junges, S., Katoen, J.P., Klein, J.: Parametric Markov chains: PCTL complexity and fraction-free Gaussian elimination. Inf. Comput. 272 (2020)

  3. Baier, C., Katoen, J.P.: Principles of Model Checking. MIT Press, Cambridge (2008)

  4. Bonet, B.: Solving large POMDPs using real time dynamic programming. In: Proceedings of the AAAI Fall Symposium on POMDPs (1998)

  5. Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI, pp. 1641–1646 (2009)

  6. Bork, A., Junges, S., Katoen, J.-P., Quatmann, T.: Verification of indefinite-horizon POMDPs. In: Hung, D.V., Sokolsky, O. (eds.) ATVA 2020. LNCS, vol. 12302, pp. 288–304. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59152-6_16

  7. Bork, A., Katoen, J.-P., Quatmann, T.: Under-approximating expected total rewards in POMDPs. In: TACAS 2022. LNCS, vol. 13244, pp. 22–40. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99527-0_2

  8. Braziunas, D., Boutilier, C.: Stochastic local search for POMDP controllers. In: AAAI, pp. 690–696. AAAI Press/The MIT Press, Menlo Park (2004)

  9. Carr, S., Jansen, N., Topcu, U.: Verifiable RNN-based policies for POMDPs under temporal logic constraints. In: IJCAI, pp. 4121–4127. ijcai.org (2020)

  10. Carr, S., Jansen, N., Wimmer, R., Serban, A.C., Becker, B., Topcu, U.: Counterexample-guided strategy improvement for POMDPs using recurrent neural networks. In: IJCAI, pp. 5532–5539. ijcai.org (2019)

  11. Chatterjee, K., Chmelik, M., Davies, J.: A symbolic SAT-based algorithm for almost-sure reachability with small strategies in POMDPs. In: AAAI, pp. 3225–3232. AAAI Press, Menlo Park (2016)

  12. Chen, T., Feng, Y., Rosenblum, D.S., Su, G.: Perturbation analysis in verification of discrete-time Markov chains. In: Baldan, P., Gorla, D. (eds.) CONCUR 2014. LNCS, vol. 8704, pp. 218–233. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44584-6_16

  13. Cleaveland, M., Ruchkin, I., Sokolsky, O., Lee, I.: Monotonic safety for scalable and data-efficient probabilistic safety analysis. In: ICCPS, pp. 92–103. ACM (2022)

  14. Daws, C.: Symbolic and parametric model checking of discrete-time Markov chains. In: Liu, Z., Araki, K. (eds.) ICTAC 2004. LNCS, vol. 3407, pp. 280–294. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31862-0_21

  15. Doshi, F., Pineau, J., Roy, N.: Reinforcement learning with limited reinforcement: using Bayes risk for active learning in POMDPs. In: ICML. ACM International Conference Proceeding Series, vol. 307, pp. 256–263. ACM (2008)

  16. Fang, X., Calinescu, R., Gerasimou, S., Alhwikem, F.: Fast parametric model checking through model fragmentation. In: ICSE, pp. 835–846. IEEE (2021)

  17. Filieri, A., Ghezzi, C., Tamburrelli, G.: Run-time efficient probabilistic model checking. In: ICSE. ACM (2011)

  18. Gainer, P., Hahn, E.M., Schewe, S.: Accelerated model checking of parametric Markov chains. In: Lahiri, S.K., Wang, C. (eds.) ATVA 2018. LNCS, vol. 11138, pp. 300–316. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01090-4_18

  19. Gouberman, A., Siegle, M., Tati, B.: Markov chains with perturbed rates to absorption: theory and application to model repair. Perform. Eval. 130, 32–50 (2019)

  20. Hahn, E.M., Han, T., Zhang, L.: Synthesis for PCTL in parametric Markov decision processes. In: Bobaru, M., Havelund, K., Holzmann, G.J., Joshi, R. (eds.) NFM 2011. LNCS, vol. 6617, pp. 146–161. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20398-5_12

  21. Hahn, E.M., Hermanns, H., Zhang, L.: Probabilistic reachability for parametric Markov models. Int. J. Softw. Tools Technol. Transfer 13, 319 (2011). https://doi.org/10.1007/s10009-010-0146-x

  22. Hartmanns, A., Klauck, M., Parker, D., Quatmann, T., Ruijters, E.: The quantitative verification benchmark set. In: Vojnar, T., Zhang, L. (eds.) TACAS 2019. LNCS, vol. 11427, pp. 344–350. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17462-0_20

  23. Hauskrecht, M.: Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000)

  24. Hensel, C., Junges, S., Katoen, J.-P., Quatmann, T., Volk, M.: The probabilistic model checker Storm. Int. J. Softw. Tools Technol. Transfer 1–22 (2021). https://doi.org/10.1007/s10009-021-00633-z

  25. Hutschenreiter, L., Baier, C., Klein, J.: Parametric Markov chains: PCTL complexity and fraction-free Gaussian elimination. In: GandALF. EPTCS, vol. 256 (2017)

  26. Jansen, N., Corzilius, F., Volk, M., Wimmer, R., Ábrahám, E., Katoen, J.-P., Becker, B.: Accelerating parametric probabilistic verification. In: Norman, G., Sanders, W. (eds.) QEST 2014. LNCS, vol. 8657, pp. 404–420. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10696-0_31

  27. Junges, S.: Parameter synthesis in Markov models. Ph.D. thesis, RWTH Aachen University, Germany (2020)

  28. Junges, S., et al.: Parameter synthesis for Markov models. CoRR abs/1903.07993 (2019)

  29. Junges, S., et al.: Finite-state controllers of POMDPs using parameter synthesis. In: UAI. AUAI Press, Monterey (2018)

  30. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1–2), 99–134 (1998)

  31. Lanotte, R., Maggiolo-Schettini, A., Troina, A.: Parametric probabilistic transition systems for system design and analysis. Formal Aspects Comput. 19(1), 93–109 (2007)

  32. Littman, M.L., Cassandra, A.R., Kaelbling, L.P.: Learning policies for partially observable environments: scaling up. In: ICML, pp. 362–370. Morgan Kaufmann (1995)

  33. Lovejoy, W.S.: Computationally feasible bounds for partially observed Markov decision processes. Oper. Res. 39(1), 162–175 (1991)

  34. Pathak, S., Ábrahám, E., Jansen, N., Tacchella, A., Katoen, J.P.: A greedy approach for the efficient repair of stochastic models. In: NFM. LNCS, vol. 9058 (2015)

  35. Quatmann, T., Dehnert, C., Jansen, N., Junges, S., Katoen, J.-P.: Parameter synthesis for Markov models: faster than ever. In: Artho, C., Legay, A., Peled, D. (eds.) ATVA 2016. LNCS, vol. 9938, pp. 50–67. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46520-3_4

  36. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 4th edn. Pearson, Boston (2020)

  37. Silver, D., Veness, J.: Monte-Carlo planning in large POMDPs. In: NIPS, pp. 2164–2172. Curran Associates, Inc., Red Hook (2010)

  38. Smith, T., Simmons, R.G.: Heuristic search value iteration for POMDPs. In: UAI, pp. 520–527. AUAI Press (2004)

  39. Spel, J., Junges, S., Katoen, J.-P.: Are parametric Markov chains monotonic? In: Chen, Y.-F., Cheng, C.-H., Esparza, J. (eds.) ATVA 2019. LNCS, vol. 11781, pp. 479–496. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-31784-3_28

  40. Spel, J., Junges, S., Katoen, J.P.: Are parametric Markov chains monotonic? Extended version. CoRR abs/1907.08491 (2019)

  41. Spel, J., Junges, S., Katoen, J.-P.: Finding provably optimal Markov chains. In: TACAS 2021. LNCS, vol. 12651, pp. 173–190. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72016-2_10

  42. Spel, J., Stein, S., Katoen, J.P.: POMDP controllers with optimal budget (artifact). Zenodo (2022). https://doi.org/10.5281/zenodo.6793377

  43. Wang, Y., Chaudhuri, S., Kavraki, L.E.: Bounded policy synthesis for POMDPs with safe-reachability objectives. In: AAMAS, pp. 238–246. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC / ACM (2018)

  44. Winkler, T., Junges, S., Pérez, G.A., Katoen, J.: On the complexity of reachability in parametric Markov decision processes. In: CONCUR. LIPIcs, vol. 140, pp. 14:1–14:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)

  45. Winterer, L., et al.: Motion planning under partial observability using game-based abstraction. In: CDC, pp. 2201–2208. IEEE (2017)

  46. Yang, L., Murugesan, S., Zhang, J.: Real-time scheduling over Markovian channels: when partial observability meets hard deadlines. In: GLOBECOM, pp. 1–5. IEEE (2011)


Author information

Correspondence to Jip Spel.

Appendices

A Algorithm

In this section, we describe the algorithmic approach to obtain a sufficient reward order, which is used for monotonicity checking.

(Algorithm 1: iterative construction of a reward order; figure not reproduced)

Our algorithm takes as input a pMC and bounds for all states. It returns a set of annotated reward orders \((\mathcal {A}, \preceq ^{{\texttt {rew}}, \mathcal {A}})\), where \(\mathcal {A}\) is a set of assumptions of the form \(s \preceq ^{{\texttt {rew}}}_{}s'\). The algorithm iteratively computes a set of reward orders. At this stage, neither the bounds nor the assumptions are used yet.

Initially, we start with the trivial order for all \(s\in S\). The queue is initialised (l.1) with an empty set of assumptions, the trivial order and all non-target states in S. At each iteration, we pick an order from the queue. If all states are processed (l.5), we are done building the reward order. Otherwise, we pick a state s to process (l.7) and try to order this state based on the reasoning from Lemmas 2 and 3 in Algorithm 1. We pick this state in reversed topological order, as this increases the likelihood that all successor states are already contained in the order. Once we have processed s, we put the (possibly) extended order in the queue together with the set of assumptions and the states we still need to process (l.25). Note that, if the state reward is 0 (l.9–18), the algorithm is equivalent to Algorithm 1 extended for treating cycles as found in [39, Sect. 4.1 and 4.3]. As assumptions are not used, Algorithm 1 in fact computes a single reward order; it runs in time linear in the number of transitions. Note that this order is not necessarily sufficient for the pMC.
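To make the loop structure concrete, the following Python sketch mirrors the description above. It is not the implementation used in the paper; the helper try_to_order (standing in for the checks of Lemmas 2 and 3) and the state representation are hypothetical.

    from collections import deque

    def build_reward_orders(states_rev_topo, targets, try_to_order):
        # states_rev_topo: all states, listed in reversed topological order
        # targets: set of target states
        # try_to_order(order, s): hypothetical callback applying Lemmas 2 and 3;
        #   returns the (possibly) extended order
        todo = [s for s in states_rev_topo if s not in targets]
        queue = deque([(frozenset(), {}, todo)])      # (assumptions, order, states left) -- l.1
        finished = []
        while queue:
            assumptions, order, todo = queue.popleft()
            if not todo:                              # all states processed -- l.5
                finished.append((assumptions, order))
                continue
            s, rest = todo[0], todo[1:]               # pick the next state -- l.7
            order = try_to_order(dict(order), s)      # order s w.r.t. its successors
            queue.append((assumptions, order, rest))  # re-enqueue the extended order -- l.25
        return finished

Without assumptions, the queue never holds more than one order, so this procedure indeed computes a single reward order in one pass over the states.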

(Algorithm 2: extending the reward order with assumptions; figure not reproduced)

Assumptions. To obtain a sufficient reward order, we consider assumptions, as described in Sect. 3.3. We exploit the annotations (called assumptions) that were ignored so far. Recall from Definition 6 that a reward order is not sufficient at a parametric state s, if its successors \(s_1\) and \(s_2\) are not totally ordered. To remain sound, we consider all possible orderings of \(s_1\) and \(s_2\). We add for each possible ordering a copy of the reward order to the queue, and continue as if the ordering of \(s_1\) and \(s_2\) is known. We extend Algorithm 1 with assumptions by adding the code of Algorithm 2 after lines 18 and 24.
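Continuing the sketch above, branching on an assumption could look as follows (again with hypothetical helper names; the actual Algorithm 2 is inserted after lines 18 and 24 of Algorithm 1):

    def branch_on_assumption(queue, assumptions, order, todo, s1, s2):
        # s1 and s2 are successors of a parametric state that the current order
        # does not yet relate. Enqueue one copy of the order per possible
        # ordering and continue as if that ordering were known.
        for lo, hi in ((s1, s2), (s2, s1)):
            new_assumptions = assumptions | {(lo, hi)}   # assumption: lo <=_rew hi
            new_order = dict(order)
            new_order[(lo, hi)] = True                   # record lo <=_rew hi in the order
            queue.append((new_assumptions, new_order, todo))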

Using Bounds. As creating assumptions might lead to an exponential explosion in the number of orders in the queue, we also consider the situation in which we have bounds \(L_R(s)\) and \(U_R(s)\) at our disposal satisfying

$$ L_R(s) \, \le \, \textrm{ER}^{s}_{\mathcal {M}[\vec {u}]}(\lozenge T) \, \le \, U_R(s) \quad \text {for all } \vec {u} \in R~. $$

As these bounds are a relatively cheap way to order states, we extend Algorithm 1 by adding Algorithm 3 directly after line 7. This algorithm uses Lemma 6 to order the successor states of s in order to obtain a sufficient order for s (l.1–5). Furthermore, it uses Lemmas 7 and 8 to order a state relative to its successor states (l.10–15).
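A sketch of the bound-based checks for the two-successor case: the conditions of Lemma 7 are taken from its statement in Appendix C.2, while the shape of Lemma 6 (ordering two successors whose bound intervals do not overlap) is an assumption of this sketch and Lemma 8 is not shown. f_min and f_max are assumed to bound \(f = \mathcal {P}(s, s_1)\) over the region R.

    def order_with_bounds(L, U, r, s, s1, s2, f_min, f_max):
        # L, U: lower/upper bound maps L_R, U_R; r: state-reward map.
        # Assumes succ(s) = {s1, s2}, s1 <=_rew s2 already holds, and
        # 0 <= f_min <= f(u) <= f_max for all u in R.
        derived = set()
        # Assumed form of Lemma 6: non-overlapping bound intervals order the successors.
        if U[s1] <= L[s2]:
            derived.add((s1, s2))                    # s1 <=_rew s2
        # Lemma 7, case 1: r(s) >= f(u) * (U_R(s2) - L_R(s1)) for all u in R.
        # Since U_R(s2) - L_R(s1) >= 0 here, checking at f_max suffices.
        if r[s] >= f_max * (U[s2] - L[s1]):
            derived.add((s2, s))                     # s2 <=_rew s
        # Lemma 7, case 2: r(s) <= f(u) * (L_R(s2) - U_R(s1)) for all u in R.
        c = L[s2] - U[s1]
        if r[s] <= (f_min * c if c >= 0 else f_max * c):
            derived.add((s, s2))                     # s <=_rew s2
        return derived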

(Algorithm 3: ordering states using the bounds; figure not reproduced)
Table 3. Overview of the found \(\varepsilon \)-optimal values

B Results

Table 3 shows the \(\varepsilon \)-optimal values for the integrated approach, for \(\varepsilon =0.1\) and \(\varepsilon = 0.05\). For all entries, the results for \(\varepsilon =0.1\) and \(\varepsilon = 0.05\) with heuristics coincide. This confirms that our initial guess for CurMax is a good one, even though we do need to split the region. The splitting is necessary because the bound found with parameter lifting is not yet tight enough to confirm that CurMax is indeed an \(\varepsilon \)-optimum. When maximizing the 4\(\,\times \,\)4 grid (2), we find a value of 10685 with heuristics and 10679 without heuristics. This can be explained by the new way of updating CurMax.

C Proofs

C.1 Proof Sketch of Theorem 1

Theorem 1

$$\begin{aligned} \Big (\forall s \in S.\ \textsf{ER}^{s \rightarrow T}{}{\uparrow _{p}^{\ell }}\Big ) \ \text { implies } \ \textsf{ER}^{s_{ I } \rightarrow T}{}{\uparrow _{p}}. \end{aligned}$$

Proof

We sketch the proof of Theorem 1, which follows the lines of [39, Thm. 2]. First, we lift the notion of local monotonicity (Definition 4) to local monotonicity for n steps. Second, we claim that local monotonicity implies local monotonicity for n steps, and that global monotonicity follows from this.

We define the length of a finite path \(\hat{\pi }= s_0 s_1 \ldots s_n\) as \(|\hat{\pi }| = n\). Let \({ Paths}^{n}(s)\) denote the set of all paths of length n starting in state \(s\in S\).

Definition 8

(Locally monotonic increasing for n steps). \(\textsf{ER}_{\mathcal {M}}^{s \rightarrow T}\) is locally monotonic increasing for n steps in parameter p (at s) on region R, denoted \(\textsf{ER}_{\mathcal {M}}^{s \rightarrow T}{\uparrow _{p}^{\ell , {n}, R}}\), iff for all \(\vec {u} \in R\):

$$\begin{aligned} \left( \sum \limits _{\hat{\pi }\in { Paths}^{n}(s)} \left( \dfrac{{\scriptstyle \partial }}{{\scriptstyle \partial p}}{\Pr (\hat{\pi })}\right) \cdot \left( r(\hat{\pi }, \lozenge T) + \textsf{ER}_{\mathcal {M}}^{\hat{\pi }_n \rightarrow T}\right) \right) (\vec {u})\ \ge \ 0. \end{aligned}$$

For \(n=1\), Definition 8 corresponds to Definition 4 (made explicit below). The claim that local monotonicity implies local monotonicity for n steps (for all n) can be proven by induction over n. The claim that global monotonicity follows from this can be shown similarly to the proof of [39, Theorem 2, Equation 2]; the main difference is that no state \(s \in S\) exists for which \(\textsf{Pr}^{s \rightarrow T} = 0\), cf. Remark 1.
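To make the correspondence for \(n=1\) explicit, here is a sketch under two assumptions not spelled out above: \(r(\hat{\pi }, \lozenge T)\) collects the state rewards of all but the last state of \(\hat{\pi }\), and Definition 4 requires \(\sum _{s' \in \textsf {succ}(s)} \frac{\partial }{\partial p}\mathcal {P}(s,s') \cdot \textsf{ER}_{\mathcal {M}}^{s' \rightarrow T} \ge 0\) on R. Every path in \({ Paths}^{1}(s)\) has the form \(s\, s'\) with \(s' \in \textsf {succ}(s)\), so the sum in Definition 8 becomes

$$\begin{aligned} \sum \limits _{s' \in \textsf {succ}(s)} \left( \dfrac{{\scriptstyle \partial }}{{\scriptstyle \partial p}}{\mathcal {P}(s,s')}\right) \cdot \left( r(s) + \textsf{ER}_{\mathcal {M}}^{s' \rightarrow T}\right) (\vec {u})\ \ge \ 0. \end{aligned}$$

As \(\sum _{s' \in \textsf {succ}(s)} \mathcal {P}(s,s') = 1\), its derivative \(\sum _{s'} \frac{\partial }{\partial p}\mathcal {P}(s,s')\) vanishes, so the \(r(s)\) terms cancel and the condition reduces to \(\sum _{s'} \frac{\partial }{\partial p}\mathcal {P}(s,s') \cdot \textsf{ER}_{\mathcal {M}}^{s' \rightarrow T} \ge 0\) on R, i.e., local monotonicity at s.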

C.2 Proof of Lemma 7

Lemma 7

For any state s with \(\textsf {succ}(s) = \{ s_1, s_2 \}\), \(f = \mathcal {P}(s, s_1)\), region R and \(s_1 \preceq ^{{\texttt {rew}}}_{}s_2\): if for all \(\vec {u} \in R\)

  1. \(r(s) \ge f(\vec {u}) \cdot \left( U_R(s_2) - L_R(s_1)\right) \ \text { then } \ s_2 \preceq ^{{\texttt {rew}}}_{}s \text {, and }\)

  2. \( r(s) \le f(\vec {u}) \cdot \left( L_R(s_2) - U_R(s_1)\right) \ \text { then } \ s \preceq ^{{\texttt {rew}}}_{}s_2.\)

Proof

Let \(f' = \mathcal {P}(s, s_2) = 1{-}f\). For the first case, we derive for all \(\vec {u} \in R\):

$$\begin{aligned} r(s)&\ge {f(\vec {u})} \cdot \left( U_R(s_2) - L_R(s_1)\right) \\&\ge f(\vec {u}) \cdot \left( \textsf{ER}^{s_2 \rightarrow T} - \textsf{ER}^{s_1 \rightarrow T}\right) \\&= f(\vec {u}) \cdot \textsf{ER}^{s_2 \rightarrow T} - f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T}\\&= (1{-}f'(\vec {u})) \cdot \textsf{ER}^{s_2 \rightarrow T} - f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T}\\&= \textsf{ER}^{s_2 \rightarrow T} -f'(\vec {u}) \cdot \textsf{ER}^{s_2 \rightarrow T} - f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T}. \end{aligned}$$

From this, it immediately follows for all \(\vec {u} \in R\):

$$\begin{aligned} \underbrace{r(s) + f(\vec {u}) \cdot \textsf{ER}^{s_1 \rightarrow T} + f'(\vec {u}) \cdot \textsf{ER}^{s_2 \rightarrow T}}_{= \textsf{ER}^{s \rightarrow T}} \ge \textsf{ER}^{s_2 \rightarrow T}, \end{aligned}$$

so \(s_2 \preceq ^{{\texttt {rew}}}_{}s\).

For the second case, the proof follows in a similar way.
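As a small illustration (not part of the proof), the following Python snippet numerically checks case 1 on a hand-picked instance with \(\textsf {succ}(s) = \{s_1, s_2\}\) and \(f = \mathcal {P}(s, s_1) = p\) on the region \(R = [0.1, 0.9]\); all concrete numbers are made up for the example.

    import random

    def check_lemma7_case1(trials=10000):
        # Hand-picked instance: ER^{s1->T} = 2, ER^{s2->T} = 5 (so s1 <=_rew s2),
        # bounds L_R(s1) = 1.5 and U_R(s2) = 5.5, and P(s, s1) = p on R = [0.1, 0.9].
        er1, er2 = 2.0, 5.0
        L1, U2 = 1.5, 5.5
        p_lo, p_hi = 0.1, 0.9
        r_s = p_hi * (U2 - L1)          # smallest r(s) satisfying the premise for all p in R
        for _ in range(trials):
            p = random.uniform(p_lo, p_hi)
            er_s = r_s + p * er1 + (1 - p) * er2   # ER^{s->T} at this parameter value
            assert er_s >= er2                      # conclusion of case 1: s2 <=_rew s
        print("Lemma 7, case 1 holds on", trials, "sampled parameter values")

    check_lemma7_case1()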


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Spel, J., Stein, S., Katoen, J.P. (2022). POMDP Controllers with Optimal Budget. In: Ábrahám, E., Paolieri, M. (eds.) Quantitative Evaluation of Systems. QEST 2022. Lecture Notes in Computer Science, vol. 13479. Springer, Cham. https://doi.org/10.1007/978-3-031-16336-4_6


  • DOI: https://doi.org/10.1007/978-3-031-16336-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16335-7

  • Online ISBN: 978-3-031-16336-4

  • eBook Packages: Computer Science, Computer Science (R0)
