DOI: 10.5555/1625275.1625409
Article

Continuous time associative bandit problems

Published: 06 January 2007

Abstract

In this paper we consider an extension of the multi-armed bandit problem. In this generalized setting, the decision maker receives some side information, performs an action chosen from a finite set, and then receives a reward. Unlike in the standard bandit setting, performing an action takes a random period of time. The environment is assumed to be stationary, stochastic, and memoryless. The goal is to maximize the average reward received per unit time, that is, to maximize the average rate of return. We consider the on-line learning problem where the decision maker initially knows nothing about the environment and must learn about it by trial and error. We propose an "upper confidence bound"-style algorithm that exploits the structure of the problem. We show that the regret of this algorithm, relative to the optimal algorithm that has perfect knowledge of the problem, grows at the optimal logarithmic rate in the number of decisions and scales polynomially with the parameters of the problem.
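The abstract's core idea, optimism over reward *rates* rather than mean rewards, can be illustrated with a short sketch. The Python below is a generic illustration only, assuming bounded rewards and durations; the class name RateUCB, the confidence radii, and all constants are invented for the example and are not the paper's exact algorithm or bounds, and the side information of the associative setting is omitted for brevity.

```python
import math
import random

class RateUCB:
    """UCB-style index policy for a bandit where each play of an arm yields
    a random reward AND occupies a random amount of time, and the goal is
    to maximize long-run reward per unit time.  A sketch only: constants
    and radii are illustrative, not taken from the paper."""

    def __init__(self, n_arms, reward_range=1.0, time_range=1.0, min_time=1e-6):
        self.pulls = [0] * n_arms          # number of plays per arm
        self.reward_sum = [0.0] * n_arms   # cumulative reward per arm
        self.time_sum = [0.0] * n_arms     # cumulative elapsed time per arm
        self.reward_range = reward_range   # assumed bound on rewards
        self.time_range = time_range       # assumed bound on durations
        self.min_time = min_time           # keeps the rate index finite
        self.t = 0                         # total number of decisions so far

    def select(self):
        self.t += 1
        # Play every arm once before trusting any confidence bound.
        for arm, cnt in enumerate(self.pulls):
            if cnt == 0:
                return arm
        best_arm, best_index = 0, -math.inf
        for arm, cnt in enumerate(self.pulls):
            radius = math.sqrt(2.0 * math.log(self.t) / cnt)
            # Optimism for a rate: inflate the mean reward, deflate the
            # mean duration, then take the ratio.
            opt_reward = self.reward_sum[arm] / cnt + self.reward_range * radius
            opt_time = max(self.time_sum[arm] / cnt - self.time_range * radius,
                           self.min_time)
            index = opt_reward / opt_time
            if index > best_index:
                best_arm, best_index = arm, index
        return best_arm

    def update(self, arm, reward, elapsed):
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward
        self.time_sum[arm] += elapsed


# Toy run: arm 0 pays 0.9 over ~1.5s (rate 0.6), arm 1 pays 0.5 over
# ~0.5s (rate 1.0); the policy should concentrate on arm 1.
policy = RateUCB(n_arms=2, reward_range=1.0, time_range=2.0)
means = [(0.9, 1.5), (0.5, 0.5)]
for _ in range(2000):
    a = policy.select()
    mu, tau = means[a]
    reward = random.random() < mu            # Bernoulli reward
    elapsed = random.expovariate(1.0 / tau)  # random duration, mean tau
    policy.update(a, float(reward), elapsed)
print("pulls per arm:", policy.pulls)
```

Note how optimism enters both the numerator and the denominator: an arm looks attractive if its reward could plausibly be high or its duration plausibly short, which is what aligns the index with the rate-of-return objective rather than the per-decision reward.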




Information

Published In

IJCAI'07: Proceedings of the 20th international joint conference on Artificial intelligence
January 2007
2953 pages
  • Editors: Rajeev Sangal, Harish Mehta, R. K. Bagga

Sponsors

  • The International Joint Conferences on Artificial Intelligence, Inc.

Publisher

Morgan Kaufmann Publishers Inc.

San Francisco, CA, United States

Publication History

Published: 06 January 2007

Qualifiers

  • Article


Article Metrics

  • Downloads (last 12 months): 0
  • Downloads (last 6 weeks): 0
Reflects downloads up to 22 Nov 2024

Cited By
  • (2023) Bandit task assignment with unknown processing time. Proceedings of the 37th International Conference on Neural Information Processing Systems, 8937-8957. DOI: 10.5555/3666122.3666514. Online publication date: 10-Dec-2023.
  • (2019) Learning to Control Renewal Processes with Bandit Feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(2), 1-32. DOI: 10.1145/3341617.3326158. Online publication date: 19-Jun-2019.
  • (2018) Bandits with Knapsacks. Journal of the ACM, 65(3), 1-55. DOI: 10.1145/3164539. Online publication date: 1-Mar-2018.
  • (2017) When to reset your keys. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 3679-3685. DOI: 10.5555/3298023.3298102. Online publication date: 4-Feb-2017.
  • (2014) Optimal resource allocation with semi-bandit feedback. Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, 477-486. DOI: 10.5555/3020751.3020801. Online publication date: 23-Jul-2014.
