
A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Published: 01 July 2011

Abstract

Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Process. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent's return improve as a function of experience.
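
To make the framework concrete, the following is a minimal sketch, not the paper's own algorithm, of how belief tracking in a Bayes-adaptive POMDP can be approximated with a particle filter over small discrete state, action, and observation sets. Each particle pairs a hypothesised hidden state with Dirichlet counts over transitions and observations, so a single belief update simultaneously tracks the state (point 2 above) and refines the model estimate (point 1); planning (point 3) would then treat each weighted particle as a candidate state-and-model hypothesis. All names (update_belief, dirichlet_mean, the phi/psi count dictionaries) are illustrative assumptions, not taken from the paper.

import random

def dirichlet_mean(counts, key, outcome, n_outcomes):
    """Posterior-mean probability of `outcome` under an add-one (uniform) Dirichlet prior."""
    total = sum(counts.get(key + (x,), 0) for x in range(n_outcomes))
    return (counts.get(key + (outcome,), 0) + 1.0) / (total + n_outcomes)

def update_belief(particles, a, z, n_states, n_obs):
    """One belief update after executing action `a` and receiving observation `z`.
    Each particle is a dict {'s': int, 'phi': dict, 'psi': dict, 'w': float} holding a
    hypothesised state, transition counts phi[(s, a, s')], observation counts
    psi[(s', a, z)], and an importance weight."""
    updated = []
    for p in particles:
        # Sample a successor state from this particle's expected transition model.
        probs = [dirichlet_mean(p['phi'], (p['s'], a), s2, n_states) for s2 in range(n_states)]
        s2 = random.choices(range(n_states), weights=probs)[0]
        # Reweight by how well that successor explains the observation actually received.
        w = p['w'] * dirichlet_mean(p['psi'], (s2, a), z, n_obs)
        # Copy the counts and credit the experienced transition and observation.
        phi, psi = dict(p['phi']), dict(p['psi'])
        phi[(p['s'], a, s2)] = phi.get((p['s'], a, s2), 0) + 1
        psi[(s2, a, z)] = psi.get((s2, a, z), 0) + 1
        updated.append({'s': s2, 'phi': phi, 'psi': psi, 'w': w})
    # Normalise the weights so they again form a distribution over particles.
    total_w = sum(p['w'] for p in updated) or 1.0
    for p in updated:
        p['w'] /= total_w
    return updated

In practice the particles would be initialised by sampling states from the prior belief together with the prior Dirichlet counts, and resampled whenever the weights become too concentrated.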




Published In

The Journal of Machine Learning Research, Volume 12 (2011), 3426 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org


Cited By

  • (2023) Toward AI assistants that let designers design. AI Magazine, 44(1):85-96. DOI: 10.1002/aaai.12077. Online publication date: 5-Apr-2023.
  • (2022) Learning state-variable relationships for improving POMCP performance. Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, pages 739-747. DOI: 10.1145/3477314.3507049. Online publication date: 25-Apr-2022.
  • (2021) Modelling Stock Markets by Multi-agent Reinforcement Learning. Computational Economics, 57(1):113-147. DOI: 10.1007/s10614-020-10038-w. Online publication date: 1-Jan-2021.
  • (2019) Bayesian Reinforcement Learning in Factored POMDPs. Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 7-15. DOI: 10.5555/3306127.3331668. Online publication date: 8-May-2019.
  • (2019) On overfitting and asymptotic bias in batch reinforcement learning with partial observability. Journal of Artificial Intelligence Research, 65(1):1-30. DOI: 10.1613/jair.1.11478. Online publication date: 1-May-2019.
  • (2019) Active Perception in Adversarial Scenarios using Maximum Entropy Deep Reinforcement Learning. 2019 International Conference on Robotics and Automation (ICRA), pages 3384-3390. DOI: 10.1109/ICRA.2019.8794389. Online publication date: 20-May-2019.
  • (2019) Bayes-Adaptive Planning for Data-Efficient Verification of Uncertain Markov Decision Processes. Quantitative Evaluation of Systems, pages 91-108. DOI: 10.1007/978-3-030-30281-8_6. Online publication date: 10-Sep-2019.
  • (2018) Interactive learning and decision making. Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5703-5708. DOI: 10.5555/3304652.3304829. Online publication date: 13-Jul-2018.
  • (2017) Learning in POMDPs with Monte Carlo tree search. Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1819-1827. DOI: 10.5555/3305381.3305569. Online publication date: 6-Aug-2017.
  • (2017) Robotic manipulation of multiple objects as a POMDP. Artificial Intelligence, 247(C):213-228. DOI: 10.1016/j.artint.2015.04.001. Online publication date: 1-Jun-2017.
