DOI: 10.5555/1625275.1625409
Article

Continuous time associative bandit problems

Published: 06 January 2007

Abstract

In this paper we consider an extension of the multi-armed bandit problem. In this generalized setting, the decision maker receives some side information, performs an action chosen from a finite set, and then receives a reward. Unlike in the standard bandit setting, performing an action takes a random period of time. The environment is assumed to be stationary, stochastic, and memoryless. The goal is to maximize the average reward received per unit time, that is, to maximize the average rate of return. We consider the on-line learning problem where the decision maker initially knows nothing about the environment and must learn about it by trial and error. We propose an "upper confidence bound"-style algorithm that exploits the structure of the problem. We show that the regret of this algorithm, relative to the optimal algorithm that has perfect knowledge of the problem, grows at the optimal logarithmic rate in the number of decisions and scales polynomially with the parameters of the problem.
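The abstract's core idea, optimism over reward *rates* rather than mean rewards, can be illustrated with a short sketch. The Python below is a generic illustration only, assuming bounded rewards and durations; the class name RateUCB, the confidence radii, and all constants are invented for the example and are not the paper's exact algorithm or bounds, and the side information of the associative setting is omitted for brevity.

```python
import math
import random

class RateUCB:
    """UCB-style index policy for a bandit where each play of an arm yields
    a random reward AND occupies a random amount of time, and the goal is
    to maximize long-run reward per unit time.  A sketch only: constants
    and radii are illustrative, not taken from the paper."""

    def __init__(self, n_arms, reward_range=1.0, time_range=1.0, min_time=1e-6):
        self.pulls = [0] * n_arms          # number of plays per arm
        self.reward_sum = [0.0] * n_arms   # cumulative reward per arm
        self.time_sum = [0.0] * n_arms     # cumulative elapsed time per arm
        self.reward_range = reward_range   # assumed bound on rewards
        self.time_range = time_range       # assumed bound on durations
        self.min_time = min_time           # keeps the rate index finite
        self.t = 0                         # total number of decisions so far

    def select(self):
        self.t += 1
        # Play every arm once before trusting any confidence bound.
        for arm, cnt in enumerate(self.pulls):
            if cnt == 0:
                return arm
        best_arm, best_index = 0, -math.inf
        for arm, cnt in enumerate(self.pulls):
            radius = math.sqrt(2.0 * math.log(self.t) / cnt)
            # Optimism for a rate: inflate the mean reward, deflate the
            # mean duration, then take the ratio.
            opt_reward = self.reward_sum[arm] / cnt + self.reward_range * radius
            opt_time = max(self.time_sum[arm] / cnt - self.time_range * radius,
                           self.min_time)
            index = opt_reward / opt_time
            if index > best_index:
                best_arm, best_index = arm, index
        return best_arm

    def update(self, arm, reward, elapsed):
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward
        self.time_sum[arm] += elapsed


# Toy run: arm 0 pays 0.9 over ~1.5s (rate 0.6), arm 1 pays 0.5 over
# ~0.5s (rate 1.0); the policy should concentrate on arm 1.
policy = RateUCB(n_arms=2, reward_range=1.0, time_range=2.0)
means = [(0.9, 1.5), (0.5, 0.5)]
for _ in range(2000):
    a = policy.select()
    mu, tau = means[a]
    reward = random.random() < mu            # Bernoulli reward
    elapsed = random.expovariate(1.0 / tau)  # random duration, mean tau
    policy.update(a, float(reward), elapsed)
print("pulls per arm:", policy.pulls)
```

Note how optimism enters both the numerator and the denominator: an arm looks attractive if its reward could plausibly be high or its duration plausibly short, which is what aligns the index with the rate-of-return objective rather than the per-decision reward.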




Information

Published In

IJCAI'07: Proceedings of the 20th international joint conference on Artificial intelligence
January 2007
2953 pages
  • Editors: Rajeev Sangal, Harish Mehta, R. K. Bagga

Sponsors

  • The International Joint Conferences on Artificial Intelligence, Inc.

Publisher

Morgan Kaufmann Publishers Inc.

San Francisco, CA, United States

Publication History

Published: 06 January 2007

Qualifiers

  • Article


Article Metrics

  • Downloads (last 12 months): 0
  • Downloads (last 6 weeks): 0
Reflects downloads up to 22 Nov 2024

Cited By
  • (2023) Bandit task assignment with unknown processing time. Proceedings of the 37th International Conference on Neural Information Processing Systems, 8937-8957. DOI: 10.5555/3666122.3666514. Online publication date: 10-Dec-2023.
  • (2019) Learning to Control Renewal Processes with Bandit Feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(2), 1-32. DOI: 10.1145/3341617.3326158. Online publication date: 19-Jun-2019.
  • (2018) Bandits with Knapsacks. Journal of the ACM, 65(3), 1-55. DOI: 10.1145/3164539. Online publication date: 1-Mar-2018.
  • (2017) When to reset your keys. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 3679-3685. DOI: 10.5555/3298023.3298102. Online publication date: 4-Feb-2017.
  • (2014) Optimal resource allocation with semi-bandit feedback. Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, 477-486. DOI: 10.5555/3020751.3020801. Online publication date: 23-Jul-2014.
