DOI: 10.5555/3666122.3666468 · Research article · NIPS Conference Proceedings

Non-stationary bandits with auto-regressive temporal dependency

Published: 30 May 2024

Abstract

Traditional multi-armed bandit (MAB) frameworks, predominantly examined under stochastic or adversarial settings, often overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising. This paper introduces a novel non-stationary MAB framework that captures the temporal structure of these real-world dynamics through an auto-regressive (AR) reward structure. We propose an algorithm that integrates two key mechanisms: (i) an alternation mechanism adept at leveraging temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism designed to discard out-of-date information. Our algorithm achieves a regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark. Finally, via a real-world case study on tourism demand prediction, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to more complex, rapidly evolving time series.
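The abstract's two mechanisms can be illustrated with a toy simulation. The sketch below is not the paper's algorithm; it is a minimal, assumption-laden illustration of the setting: each arm's reward follows an AR(1) process (so the last observation is informative about the next), the policy exploits the arm with the highest recent observation, and a periodic restart re-samples every arm to discard out-of-date information. All parameter names and values (`phi`, `restart_every`, the reward ranges) are hypothetical.

```python
import random

def ar1_reward(prev, mu, phi=0.6, sigma=0.1):
    # AR(1) dynamics: the next reward depends linearly on the previous one,
    # reverting toward the arm's hidden mean mu.
    return phi * prev + (1 - phi) * mu + random.gauss(0.0, sigma)

def run_bandit(horizon=1000, n_arms=3, restart_every=200, seed=0):
    random.seed(seed)
    mus = [random.uniform(0.2, 0.8) for _ in range(n_arms)]  # hidden means
    state = list(mus)          # current AR(1) state of each arm
    last_obs = [0.0] * n_arms  # most recent observation per arm
    total = 0.0
    for t in range(horizon):
        if t % restart_every < n_arms:
            # restarting mechanism: periodically re-sample every arm once,
            # discarding reliance on stale observations
            arm = t % restart_every
        else:
            # exploit temporal dependency: under AR(1) rewards, the arm with
            # the highest last observation is likely best in the short run
            arm = max(range(n_arms), key=lambda a: last_obs[a])
        # all arms' processes evolve; only the pulled arm is observed
        for a in range(n_arms):
            state[a] = ar1_reward(state[a], mus[a])
        last_obs[arm] = state[arm]
        total += state[arm]
    return total / horizon

avg = run_bandit()
```

Averaged over the horizon, the earned reward sits near the best arm's mean; the restart period trades off tracking speed against the cost of forced exploration.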



Published In

NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, December 2023, 80772 pages.

Publisher: Curran Associates Inc., Red Hook, NY, United States.
