DOI: 10.5555/3600270.3600926

Deciding what to model: value-equivalent sampling for reinforcement learning

Published: 03 April 2024

Abstract

The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent's capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn and incurring only bounded sub-optimality. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately value-equivalent, lossy compression of the environment that an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity.
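
The rate-distortion computation described above is, in the classical finite-alphabet setting, the kind of problem typically handled by a Blahut-Arimoto-style alternating minimization. The sketch below is illustrative only and is not the paper's algorithm: it assumes a finite set of candidate environment models with a known prior and a precomputed distortion matrix standing in for a value-equivalence gap between a model and a simpler surrogate; the function name, arguments, and the Lagrange-multiplier parameterization are choices made for this example.

```python
import numpy as np

def blahut_arimoto(prior, distortion, beta, n_iter=500, tol=1e-9):
    """Illustrative sketch: rate-distortion-optimal lossy compression of a model class.

    prior:      shape (M,), prior p(m) over candidate environment models
    distortion: shape (M, K), d(m, m_tilde), e.g. a value-equivalence gap between
                model m and candidate surrogate m_tilde (assumed precomputed)
    beta:       Lagrange multiplier trading rate against expected distortion
    Returns the channel q(m_tilde | m), the rate I(M; M_tilde) in nats, and the
    expected distortion under (prior, q).
    """
    M, K = distortion.shape
    q = np.full((M, K), 1.0 / K)          # start from the uniform channel
    for _ in range(n_iter):
        r = prior @ q                     # marginal over surrogates, shape (K,)
        # Alternating-minimization update: tilt the marginal by exp(-beta * d)
        q_new = r[None, :] * np.exp(-beta * distortion)
        q_new /= q_new.sum(axis=1, keepdims=True)
        if np.max(np.abs(q_new - q)) < tol:
            q = q_new
            break
        q = q_new
    r = prior @ q
    joint = prior[:, None] * q            # p(m) * q(m_tilde | m)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(joint > 0, np.log(q / r[None, :]), 0.0)
    rate = float(np.sum(joint * log_ratio))
    expected_distortion = float(np.sum(joint * distortion))
    return q, rate, expected_distortion
```

Sweeping beta traces out the trade-off the abstract alludes to: a small multiplier favors a low-rate (simpler) compression at the cost of a larger value-equivalence gap, while a large multiplier drives the compression toward exact value equivalence at a higher rate, loosely mirroring the two forms of the regret bound.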

Supplementary Material

Additional material (3600270.3600926_supp.pdf)
Supplemental material.

Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems
November 2022
39114 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States

Publication History

Published: 03 April 2024

Qualifiers

  • Research-article
  • Research
  • Refereed limited
