
Maximal Objectives in the Multiarmed Bandit with Applications

Published: 15 March 2024

Abstract

In several applications of the stochastic multiarmed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, we study a new objective in the classic setup. Given K arms, instead of maximizing the expected total reward from T pulls (the traditional “sum” objective), we consider the vector of total rewards earned from each of the K arms at the end of T pulls and aim to maximize the expected highest total reward across arms (the “max” objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of Ω(log T) (with a higher instance-dependent constant compared with the traditional objective) and a worst-case regret of Ω(K^{1/3}T^{2/3}). We then design an adaptive explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and simultaneously achieves these bounds (up to logarithmic factors). We then generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top m arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared with several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of conditioning an adequate supply of value-providing market entities (workers/sellers/service providers) in online platforms and marketplaces.
This paper was accepted by Vivek Farias, data science.
Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.00801.
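For a concrete picture of the policy described in the abstract, the sketch below implements a minimal adaptive explore-then-commit loop for the “max” objective: round-robin exploration with Hoeffding-style confidence bounds on each arm’s mean reward, an adaptive stopping rule that commits once one arm’s lower bound clears every other arm’s upper bound (or once an exploration budget is exhausted), and a commit phase that places all remaining pulls on the chosen arm so its running total, the quantity the max objective rewards, keeps growing. The Bernoulli rewards, the confidence-bound tuning, and the exploration cap are illustrative assumptions, not the paper’s exact algorithm.

```python
import numpy as np

def adaptive_etc_max(means, T, rng=None):
    """Illustrative adaptive explore-then-commit for the 'max' objective.

    The payoff is max_i S_i(T), where S_i(T) is the total reward collected
    on arm i over T pulls. Bernoulli rewards and the specific tuning of the
    confidence bounds and exploration cap are assumptions for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(means)
    delta = 1.0 / T                      # confidence level (illustrative tuning)
    pulls = np.zeros(K, dtype=int)       # number of pulls of each arm
    totals = np.zeros(K)                 # running total reward S_i of each arm
    # Per-arm exploration cap, loosely mirroring the K^{1/3} T^{2/3}
    # worst-case scaling (illustrative constant).
    explore_cap = max(1, int((T / K) ** (2.0 / 3.0)))

    t, committed = 0, None
    while t < T and committed is None:
        # Round-robin exploration: one pull of each arm per round.
        for i in range(K):
            if t >= T:
                break
            totals[i] += rng.binomial(1, means[i])
            pulls[i] += 1
            t += 1
        # Hoeffding-style confidence bounds on each arm's mean reward.
        n = np.maximum(pulls, 1)
        mu_hat = totals / n
        rad = np.sqrt(np.log(2 * K / delta) / (2 * n))
        ucb, lcb = mu_hat + rad, mu_hat - rad
        best = int(np.argmax(mu_hat))
        others = np.delete(ucb, best)
        # Adaptive stopping: commit once one arm dominates all others,
        # or once the exploration budget is exhausted.
        if others.size == 0 or lcb[best] >= others.max() or pulls.min() >= explore_cap:
            committed = best

    # Commit phase: all remaining pulls go to the chosen arm, growing the
    # single per-arm total that the max objective rewards.
    while t < T:
        totals[committed] += rng.binomial(1, means[committed])
        t += 1
    return totals.max()   # realized 'max' payoff, to compare against mu_max * T

# Example: three Bernoulli arms over T = 10,000 pulls.
print(adaptive_etc_max([0.3, 0.5, 0.7], T=10_000), "vs oracle", 0.7 * 10_000)
```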




Published In

Management Science, Volume 70, Issue 12
December 2024
906 pages
DOI: 10.1287/mnsc.2024.70.issue-12

Publisher

INFORMS

Linthicum, MD, United States

Publication History

Published: 15 March 2024
Accepted: 09 June 2023
Received: 14 March 2022

Author Tags

  1. multiarmed bandits
  2. L objective
  3. online platforms

Qualifiers

  • Research-article
