DOI: 10.5555/3600270.3602182

IMED-RL: regret optimal learning of ergodic Markov decision processes


Abstract

We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average-reward criterion, and focus on minimizing the regret with respect to an optimal policy when the learner knows neither the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB and KL-UCB strategies or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from a prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed new light on regret minimization strategies by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic. Under this assumption, we introduce IMED-RL and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case where the supports of the transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case where they are known. Rewards are assumed to be light-tailed and semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, both ergodic and merely communicating, showing the competitiveness of IMED-RL in finite time against state-of-the-art algorithms. IMED-RL also benefits from light computational complexity.
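
For readers unfamiliar with the bandit algorithm being extended, the sketch below illustrates the Indexed Minimum Empirical Divergence (IMED) decision rule in the simplest multi-armed bandit setting with Bernoulli rewards: each arm a is assigned the index N_a * kl(mu_hat_a, mu_hat_*) + log N_a, where mu_hat_* is the best empirical mean, and the arm with the smallest index is played. This is an illustrative Python sketch of the bandit-level rule only, with hypothetical helper names; it is not the IMED-RL algorithm for MDPs analysed in the paper.

    import numpy as np

    def bernoulli_kl(p, q, eps=1e-12):
        # Binary KL divergence kl(p, q), with both arguments clipped away
        # from 0 and 1 so the logarithms stay finite.
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def imed_choose_arm(counts, means):
        # IMED plays the arm minimizing N_a * kl(mu_hat_a, mu_hat_*) + log N_a.
        # An unplayed arm receives index -infinity so every arm is tried once.
        best = max(means)
        indexes = []
        for n, mu in zip(counts, means):
            if n == 0:
                indexes.append(-np.inf)
            else:
                indexes.append(n * bernoulli_kl(mu, best) + np.log(n))
        return int(np.argmin(indexes))

On the empirically best arm the divergence term vanishes and the index reduces to log N_a, so exploration and exploitation are balanced without computing confidence bounds; as described in the abstract, IMED-RL extends this index-based principle from arms to the reinforcement learning setting.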

Supplementary Material

Additional material (3600270.3602182_supp.pdf)
Supplemental material.


Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems
November 2022
39114 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States

Publication History

Published: 03 April 2024

Qualifiers

• Research-article
• Research
• Refereed limited
