DOI: 10.5555/3600270.3602182

IMED-RL: regret optimal learning of ergodic Markov decision processes


Abstract

We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average-reward criterion, and focus on minimizing the regret with respect to an optimal policy when the learner knows neither the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB and KL-UCB strategies or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from a prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed new light on regret minimization strategies by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic. Under this assumption, we introduce IMED-RL and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case where the supports of the transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case where they are known. Rewards are assumed to be light-tailed and semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, both ergodic and merely communicating, showing the competitiveness of IMED-RL in finite time against state-of-the-art algorithms. IMED-RL also benefits from light computational complexity.
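
For readers unfamiliar with the bandit algorithm being extended, the sketch below illustrates the Indexed Minimum Empirical Divergence (IMED) decision rule in the simplest multi-armed bandit setting with Bernoulli rewards: each arm a is assigned the index N_a * kl(mu_hat_a, mu_hat_*) + log N_a, where mu_hat_* is the best empirical mean, and the arm with the smallest index is played. This is an illustrative Python sketch of the bandit-level rule only, with hypothetical helper names; it is not the IMED-RL algorithm for MDPs analysed in the paper.

    import numpy as np

    def bernoulli_kl(p, q, eps=1e-12):
        # Binary KL divergence kl(p, q), with both arguments clipped away
        # from 0 and 1 so the logarithms stay finite.
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def imed_choose_arm(counts, means):
        # IMED plays the arm minimizing N_a * kl(mu_hat_a, mu_hat_*) + log N_a.
        # An unplayed arm receives index -infinity so every arm is tried once.
        best = max(means)
        indexes = []
        for n, mu in zip(counts, means):
            if n == 0:
                indexes.append(-np.inf)
            else:
                indexes.append(n * bernoulli_kl(mu, best) + np.log(n))
        return int(np.argmin(indexes))

On the empirically best arm the divergence term vanishes and the index reduces to log N_a, so exploration and exploitation are balanced without computing confidence bounds; as described in the abstract, IMED-RL extends this index-based principle from arms to the reinforcement learning setting.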

Supplementary Material

Additional material (3600270.3602182_supp.pdf)
Supplemental material.


Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems
November 2022
39114 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States

Publication History

Published: 03 April 2024

Qualifiers

• Research-article
• Research
• Refereed limited
