
Maximal Objectives in the Multiarmed Bandit with Applications

Published: 15 March 2024

Abstract

In several applications of the stochastic multiarmed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, we study a new objective in the classic setup. Given K arms, instead of maximizing the expected total reward from T pulls (the traditional “sum” objective), we consider the vector of total rewards earned from each of the K arms at the end of T pulls and aim to maximize the expected highest total reward across arms (the “max” objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of Ω(log T) (with a higher instance-dependent constant compared with the traditional objective) and a worst-case regret of Ω(K^{1/3}T^{2/3}). We then design an adaptive explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and simultaneously achieves these bounds (up to logarithmic factors). We then generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top m arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared with several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of conditioning an adequate supply of value-providing market entities (workers/sellers/service providers) in online platforms and marketplaces.
This paper was accepted by Vivek Farias, data science.
Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.00801.
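For a concrete picture of the policy described in the abstract, the sketch below implements a minimal adaptive explore-then-commit loop for the “max” objective: round-robin exploration with Hoeffding-style confidence bounds on each arm’s mean reward, an adaptive stopping rule that commits once one arm’s lower bound clears every other arm’s upper bound (or once an exploration budget is exhausted), and a commit phase that places all remaining pulls on the chosen arm so its running total, the quantity the max objective rewards, keeps growing. The Bernoulli rewards, the confidence-bound tuning, and the exploration cap are illustrative assumptions, not the paper’s exact algorithm.

```python
import numpy as np

def adaptive_etc_max(means, T, rng=None):
    """Illustrative adaptive explore-then-commit for the 'max' objective.

    The payoff is max_i S_i(T), where S_i(T) is the total reward collected
    on arm i over T pulls. Bernoulli rewards and the specific tuning of the
    confidence bounds and exploration cap are assumptions for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(means)
    delta = 1.0 / T                      # confidence level (illustrative tuning)
    pulls = np.zeros(K, dtype=int)       # number of pulls of each arm
    totals = np.zeros(K)                 # running total reward S_i of each arm
    # Per-arm exploration cap, loosely mirroring the K^{1/3} T^{2/3}
    # worst-case scaling (illustrative constant).
    explore_cap = max(1, int((T / K) ** (2.0 / 3.0)))

    t, committed = 0, None
    while t < T and committed is None:
        # Round-robin exploration: one pull of each arm per round.
        for i in range(K):
            if t >= T:
                break
            totals[i] += rng.binomial(1, means[i])
            pulls[i] += 1
            t += 1
        # Hoeffding-style confidence bounds on each arm's mean reward.
        n = np.maximum(pulls, 1)
        mu_hat = totals / n
        rad = np.sqrt(np.log(2 * K / delta) / (2 * n))
        ucb, lcb = mu_hat + rad, mu_hat - rad
        best = int(np.argmax(mu_hat))
        others = np.delete(ucb, best)
        # Adaptive stopping: commit once one arm dominates all others,
        # or once the exploration budget is exhausted.
        if others.size == 0 or lcb[best] >= others.max() or pulls.min() >= explore_cap:
            committed = best

    # Commit phase: all remaining pulls go to the chosen arm, growing the
    # single per-arm total that the max objective rewards.
    while t < T:
        totals[committed] += rng.binomial(1, means[committed])
        t += 1
    return totals.max()   # realized 'max' payoff, to compare against mu_max * T

# Example: three Bernoulli arms over T = 10,000 pulls.
print(adaptive_etc_max([0.3, 0.5, 0.7], T=10_000), "vs oracle", 0.7 * 10_000)
```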




Published In

Management Science, Volume 70, Issue 12
December 2024
906 pages
DOI: 10.1287/mnsc.2024.70.issue-12

Publisher

INFORMS

Linthicum, MD, United States

Publication History

Published: 15 March 2024
Accepted: 09 June 2023
Received: 14 March 2022

Author Tags

  1. multiarmed bandits
  2. L objective
  3. online platforms

Qualifiers

  • Research-article
