Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models
Figure: Example of the implicit regression architecture: (a) linear case, (b) quadratic case.
Figure: The evolution of an energy function as the training sample size is increased from 100 to 100,000 in an environment with two distinct context categories (labeled 0, blue, and 1, orange) with optimal actions around 3 and 9, respectively.
Figure: Example of a simulation environment used for testing the algorithms with two contexts. The probability of receiving a reward given an action depends on the context, as shown on the left. The linearly and non-linearly separable contexts are shown in the centre and on the right, respectively.
Abstract
1. Introduction
- Introducing a technique for maximum entropy exploration in contextual bandits with a continuous action space, in which a neural network estimates rewards and actions are sampled using Hamiltonian Monte Carlo (a minimal sketch of this sampling step follows this list);
- A novel algorithm that uses neural-network-based Energy Based Models to solve contextual bandit problems;
- Testing our algorithms in different simulation environments (including dynamic ones), giving practitioners a guide to which algorithm to use in each scenario.
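As a rough illustration of the first contribution, the snippet below draws a continuous action from a Boltzmann policy π(a|x) ∝ exp(r̂(x, a)/α), where r̂ is the estimated reward and α a temperature, using Hamiltonian Monte Carlo. This is a minimal sketch: the analytic `reward_estimate` stand-in, the temperature value, and the sampler settings are illustrative assumptions rather than the paper's configuration; in the actual algorithm r̂ would be the trained reward network evaluated at the current context.

```python
import tensorflow as tf
import tensorflow_probability as tfp

ALPHA = 1.0  # temperature: larger values give a higher-entropy (more exploratory) policy

# Stand-in for a trained neural reward model r_hat(x, a). A smooth analytic
# function of the action is used so the example runs without any training.
def reward_estimate(action):
    return -0.5 * tf.square(action - 3.0)

# Boltzmann policy pi(a | x) ∝ exp(r_hat(x, a) / alpha),
# so the unnormalised log-density is simply r_hat / alpha.
def target_log_prob(action):
    return reward_estimate(action) / ALPHA

kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob,
    step_size=0.1,
    num_leapfrog_steps=5,
)

samples = tfp.mcmc.sample_chain(
    num_results=200,
    num_burnin_steps=100,
    current_state=tf.constant(0.0),
    kernel=kernel,
    trace_fn=None,
)

action_to_play = float(samples[-1])  # one action drawn from the max-entropy policy
```

For a discrete action space the same policy can be sampled exactly with a softmax over the per-action reward estimates, as sketched under Algorithm 1 below.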
2. Related Work
2.1. Entropy Based Exploration in Contextual Bandits
2.2. Energy Based Models in Reinforcement Learning
3. Algorithms to Solve Contextual Bandit Problems with Maximum Entropy Exploration
3.1. Contextual Bandit Problem Formulation
3.2. Maximum Entropy Exploration
3.3. Maximum Entropy Exploration with Neural Networks Modelling Reward
3.3.1. Discrete Action Sampling
3.3.2. Continuous Action Sampling
Algorithm 1 Maximum entropy exploration with neural networks
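The full pseudocode of Algorithm 1 is not reproduced here; the following is a minimal sketch of the discrete-action case it covers, under the assumption that a reward network outputs one estimated reward per action and that actions are drawn from a softmax (Boltzmann) distribution over those estimates. The architecture, temperature, and squared-error update are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
import tensorflow as tf

N_ACTIONS = 10
ALPHA = 0.2  # temperature of the Boltzmann policy

# Hypothetical reward network: maps a context vector to one estimated
# reward per discrete action (architecture is illustrative only).
reward_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def select_action(context):
    """Maximum entropy exploration: sample a ~ pi(a|x) ∝ exp(r_hat(x, a) / alpha)."""
    r_hat = reward_net(context[None, :])[0]                       # shape (N_ACTIONS,)
    probs = tf.nn.softmax(r_hat / ALPHA).numpy().astype(np.float64)
    probs /= probs.sum()                                          # guard against rounding error
    return int(np.random.choice(N_ACTIONS, p=probs))

def update(context, action, reward):
    """Regress the reward model towards the observed reward for the played action."""
    with tf.GradientTape() as tape:
        pred = reward_net(context[None, :])[0, action]
        loss = tf.square(pred - reward)
    grads = tape.gradient(loss, reward_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, reward_net.trainable_variables))

# One bandit round (context and reward would come from the environment):
context = tf.random.normal([4])
a = select_action(context)
update(context, a, reward=1.0)
```

A higher α flattens the softmax and increases the entropy of the policy; as α approaches 0 the rule reduces to greedy action selection.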
3.4. Maximum Entropy Exploration with Energy Based Models (EBMs)
3.4.1. Training EBMs to Solve Contextual Bandit Problems
3.4.2. Architectures for EBMs
Algorithm 2 Contextual bandit with Energy Based Models
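The full Algorithm 2 listing is likewise not reproduced here. The sketch below shows one standard way to realise a contextual bandit with a deep Energy Based Model, assumed for illustration: an energy network E(x, a) defines the policy π(a|x) ∝ exp(−E(x, a)), actions are drawn with gradient-based (Langevin) sampling, and the network is updated contrastively so that rewarded context-action pairs receive lower energy than actions sampled from the current model. The architecture, sampler settings, and loss are assumptions and may differ from the paper's exact training procedure.

```python
import tensorflow as tf

# Hypothetical energy network E(x, a): low energy should correspond
# to context-action pairs with high expected reward.
energy_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def energy(context, action):
    # context: (batch, d), action: (batch, 1) -> energies of shape (batch,)
    return energy_net(tf.concat([context, action], axis=-1))[:, 0]

def sample_action(context, n_steps=50, step_size=0.05, noise_scale=0.05):
    """Approximately sample a ~ pi(a|x) ∝ exp(-E(x, a)) via Langevin dynamics."""
    a = tf.random.uniform((context.shape[0], 1), 0.0, 10.0)
    for _ in range(n_steps):
        with tf.GradientTape() as tape:
            tape.watch(a)
            e = tf.reduce_sum(energy(context, a))
        grad = tape.gradient(e, a)
        a = a - step_size * grad + noise_scale * tf.random.normal(tf.shape(a))
    return a

def train_step(context, rewarded_action):
    """Contrastive update: lower the energy of actions that earned a reward,
    raise the energy of actions sampled from the current model."""
    negative_action = sample_action(context)  # computed outside the tape, treated as data
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            energy(context, rewarded_action) - energy(context, negative_action))
    grads = tape.gradient(loss, energy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, energy_net.trainable_variables))
```

At decision time, `sample_action` plays the role of the maximum entropy exploration step; the gradient-based sampler is what allows the same machinery to handle continuous actions.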
3.4.3. Evolution of the Energy Distribution with Dataset Size
4. Experiments
4.1. The Simulation Environments
4.2. Experimental Setup
4.3. Baseline Algorithms
4.4. Specific Configurations of the Algorithms
4.5. Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Average and best stochastic regret for each algorithm in the linear and circle environments, under static and dynamic settings.

| Experiment | Linear, Static (Avg.) | Linear, Static (Best) | Linear, Dynamic (Avg.) | Linear, Dynamic (Best) | Circle, Static (Avg.) | Circle, Static (Best) | Circle, Dynamic (Avg.) | Circle, Dynamic (Best) |
|---|---|---|---|---|---|---|---|---|
| EBM | 551 ± 123 | 450 | 3275 ± 59 | 3194 | 987 ± 402 | 613 | 3307 ± 71 | 3208 |
| NN HMC | 2229 ± 100 | 2080 | 3578 ± 88 | 3502 | 2074 ± 254 | 1642 | 3572 ± 42 | 3532 |
| NN Discrete | 1166 ± 98 | 1046 | 3720 ± 75 | 3627 | 991 ± 66 | 879 | 3645 ± 37 | 3616 |
| UCB1 | 3721 ± 66 | 3675 | 4508 ± 127 | 4319 | 3683 ± 68 | 3608 | 4002 ± 63 | 3900 |
| TS | 3552 ± 87 | 3480 | 4868 ± 65 | 4808 | 3619 ± 24 | 3583 | 4760 ± 65 | 4656 |
| linUCB | 1196 ± 632 | 633 | 3822 ± 497 | 3134 | 4446 ± 558 | 3955 | 4933 ± 66 | 4842 |
| linTS | 558 ± 47 | 481 | 3447 ± 50 | 3375 | 4473 ± 56 | 4413 | 5012 ± 53 | 4952 |
Cite as: Elwood, A.; Leonardi, M.; Mohamed, A.; Rozza, A. Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models. Entropy 2023, 25, 188. https://doi.org/10.3390/e25020188