
A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Published: 01 July 2011

Abstract

Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Process. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent's return improve as a function of experience.
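
To make the framework concrete, the following is a minimal sketch, not the paper's own algorithm, of how belief tracking in a Bayes-adaptive POMDP can be approximated with a particle filter over small discrete state, action, and observation sets. Each particle pairs a hypothesised hidden state with Dirichlet counts over transitions and observations, so a single belief update simultaneously tracks the state (point 2 above) and refines the model estimate (point 1); planning (point 3) would then treat each weighted particle as a candidate state-and-model hypothesis. All names (update_belief, dirichlet_mean, the phi/psi count dictionaries) are illustrative assumptions, not taken from the paper.

import random

def dirichlet_mean(counts, key, outcome, n_outcomes):
    """Posterior-mean probability of `outcome` under an add-one (uniform) Dirichlet prior."""
    total = sum(counts.get(key + (x,), 0) for x in range(n_outcomes))
    return (counts.get(key + (outcome,), 0) + 1.0) / (total + n_outcomes)

def update_belief(particles, a, z, n_states, n_obs):
    """One belief update after executing action `a` and receiving observation `z`.
    Each particle is a dict {'s': int, 'phi': dict, 'psi': dict, 'w': float} holding a
    hypothesised state, transition counts phi[(s, a, s')], observation counts
    psi[(s', a, z)], and an importance weight."""
    updated = []
    for p in particles:
        # Sample a successor state from this particle's expected transition model.
        probs = [dirichlet_mean(p['phi'], (p['s'], a), s2, n_states) for s2 in range(n_states)]
        s2 = random.choices(range(n_states), weights=probs)[0]
        # Reweight by how well that successor explains the observation actually received.
        w = p['w'] * dirichlet_mean(p['psi'], (s2, a), z, n_obs)
        # Copy the counts and credit the experienced transition and observation.
        phi, psi = dict(p['phi']), dict(p['psi'])
        phi[(p['s'], a, s2)] = phi.get((p['s'], a, s2), 0) + 1
        psi[(s2, a, z)] = psi.get((s2, a, z), 0) + 1
        updated.append({'s': s2, 'phi': phi, 'psi': psi, 'w': w})
    # Normalise the weights so they again form a distribution over particles.
    total_w = sum(p['w'] for p in updated) or 1.0
    for p in updated:
        p['w'] /= total_w
    return updated

In practice the particles would be initialised by sampling states from the prior belief together with the prior Dirichlet counts, and resampled whenever the weights become too concentrated.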




Published In

The Journal of Machine Learning Research, Volume 12 (2011), 3426 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org


Cited By

  • (2023) Toward AI assistants that let designers design. AI Magazine, 44(1):85-96. DOI: 10.1002/aaai.12077. Online publication date: 5-Apr-2023.
  • (2022) Learning state-variable relationships for improving POMCP performance. Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, pages 739-747. DOI: 10.1145/3477314.3507049. Online publication date: 25-Apr-2022.
  • (2021) Modelling Stock Markets by Multi-agent Reinforcement Learning. Computational Economics, 57(1):113-147. DOI: 10.1007/s10614-020-10038-w. Online publication date: 1-Jan-2021.
  • (2019) Bayesian Reinforcement Learning in Factored POMDPs. Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 7-15. DOI: 10.5555/3306127.3331668. Online publication date: 8-May-2019.
  • (2019) On overfitting and asymptotic bias in batch reinforcement learning with partial observability. Journal of Artificial Intelligence Research, 65(1):1-30. DOI: 10.1613/jair.1.11478. Online publication date: 1-May-2019.
  • (2019) Active Perception in Adversarial Scenarios using Maximum Entropy Deep Reinforcement Learning. 2019 International Conference on Robotics and Automation (ICRA), pages 3384-3390. DOI: 10.1109/ICRA.2019.8794389. Online publication date: 20-May-2019.
  • (2019) Bayes-Adaptive Planning for Data-Efficient Verification of Uncertain Markov Decision Processes. Quantitative Evaluation of Systems, pages 91-108. DOI: 10.1007/978-3-030-30281-8_6. Online publication date: 10-Sep-2019.
  • (2018) Interactive learning and decision making. Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5703-5708. DOI: 10.5555/3304652.3304829. Online publication date: 13-Jul-2018.
  • (2017) Learning in POMDPs with Monte Carlo tree search. Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1819-1827. DOI: 10.5555/3305381.3305569. Online publication date: 6-Aug-2017.
  • (2017) Robotic manipulation of multiple objects as a POMDP. Artificial Intelligence, 247(C):213-228. DOI: 10.1016/j.artint.2015.04.001. Online publication date: 1-Jun-2017.
