
DOI: 10.5555/3600270.3601759

Planning to the information horizon of BAMDPs via epistemic state abstraction

Published: 03 April 2024

Abstract

The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we establish a computationally-intractable, exact planning algorithm that takes advantage of this measure to plan more efficiently. We then conclude by introducing a specific form of state abstraction that has the potential to reduce BAMDP complexity and gives rise to a computationally-tractable, approximate planning algorithm.
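The central object behind the abstract is the BAMDP hyper-state, which pairs a physical state with the agent's posterior (epistemic state) over the unknown dynamics; Bayes-optimal planning is then ordinary dynamic programming over these hyper-states. The sketch below is a minimal, illustrative rendering of that idea only. It is not the paper's algorithm, and the toy MDP, the Dirichlet prior over transitions, and all names (update_counts, bayes_adaptive_value) are assumptions introduced purely for illustration.

```python
from functools import lru_cache

N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.95

# Known rewards R[s][a]; only the transition dynamics are treated as
# unknown here (an illustrative assumption).
R = ((1.0, 0.0),
     (0.0, 0.5),
     (0.2, 1.0))


def update_counts(counts, s, a, s_next):
    """Return a new immutable count table with one extra (s, a, s_next) observation."""
    return tuple(
        tuple(
            tuple(c + 1 if (si, ai, ni) == (s, a, s_next) else c
                  for ni, c in enumerate(row))
            for ai, row in enumerate(rows)
        )
        for si, rows in enumerate(counts)
    )


@lru_cache(maxsize=None)
def bayes_adaptive_value(s, counts, horizon):
    """Finite-horizon Bayes-optimal value of the hyper-state (s, counts).

    Because the hyper-state couples the physical state with the agent's
    epistemic state (the Dirichlet counts), this single Bellman backup
    already balances exploration (how the counts would change) against
    exploitation (immediate reward).
    """
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in range(N_ACTIONS):
        row = counts[s][a]
        total = sum(row)
        q = R[s][a]
        for s_next, c in enumerate(row):
            p = c / total  # Dirichlet-multinomial posterior predictive
            if p == 0.0:
                continue
            q += GAMMA * p * bayes_adaptive_value(
                s_next, update_counts(counts, s, a, s_next), horizon - 1)
        best = max(best, q)
    return best


if __name__ == "__main__":
    # Uniform Dirichlet(1, ..., 1) prior over next states for every (s, a) pair.
    prior = tuple(tuple(tuple(1 for _ in range(N_STATES))
                        for _ in range(N_ACTIONS))
                  for _ in range(N_STATES))
    # Even in this tiny 3-state MDP, the hyper-state tree grows rapidly with
    # the horizon, which is the intractability that motivates a complexity
    # measure for BAMDP planning and abstraction-based approximation.
    print(bayes_adaptive_value(0, prior, 4))
```

The Dirichlet-multinomial posterior is used only because its predictive distribution and conjugate update are one-liners; any Bayesian model of the dynamics would play the same role in the hyper-state, and the rapid growth of distinct (s, counts) pairs is what an epistemic state abstraction aims to tame.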

Supplementary Material

Additional material (3600270.3601759_supp.pdf)
Supplemental material.



    Published In

    NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems
    November 2022
    39114 pages

    Publisher

    Curran Associates Inc.

    Red Hook, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited
