Research article
DOI: 10.1145/3289600.3290958

When People Change their Mind: Off-Policy Evaluation in Non-stationary Recommendation Environments

Published: 30 January 2019

Abstract

We consider the novel problem of evaluating a recommendation policy offline in environments where the reward signal is non-stationary. Non-stationarity appears in many Information Retrieval (IR) applications such as recommendation and advertising, but its effect on off-policy evaluation has not been studied before; we are the first to address this issue. First, we analyze standard off-policy estimators in non-stationary environments and show, both theoretically and experimentally, that their bias grows with time. Then, we propose new off-policy estimators based on moving averages and show that their bias is independent of time and can be bounded. Furthermore, we provide a method to trade off bias and variance in a principled way, yielding an off-policy estimator that works well in both non-stationary and stationary environments. We experiment on publicly available recommendation datasets and show that our newly proposed moving average estimators accurately capture changes in non-stationary environments, while standard off-policy estimators fail to do so.
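To make the setup in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) contrasting a standard inverse propensity scoring (IPS) estimator with a sliding-window variant, one simple instance of the moving-average idea rather than the paper's exact estimators. It assumes a simplified bandit log in which each interaction records the observed reward and the logged action's probability under both the logging and the target policy; the function names, the simulated reward drift, and the window size are all illustrative.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Standard IPS estimate of the target policy's value over the full log."""
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    return float(np.mean(w * np.asarray(rewards)))

def windowed_ips_estimate(rewards, target_probs, logging_probs, window=500):
    """Sliding-window IPS: average only the most recent `window` interactions.
    Fewer samples mean higher variance, but the estimate can track reward drift."""
    r = np.asarray(rewards)[-window:]
    w = np.asarray(target_probs)[-window:] / np.asarray(logging_probs)[-window:]
    return float(np.mean(w * r))

# Toy log (illustrative only): the click probability of the recommended item
# decays over time, so the full-history estimate reflects a mix of old and new
# behaviour, while the windowed estimate stays closer to the current reward.
rng = np.random.default_rng(0)
T = 5000
logging_probs = np.full(T, 0.5)   # logging policy picked the logged action w.p. 0.5
target_probs = np.full(T, 0.8)    # target policy would pick it w.p. 0.8
rewards = rng.binomial(1, np.linspace(0.7, 0.2, T))  # non-stationary click signal

print(ips_estimate(rewards, target_probs, logging_probs))
print(windowed_ips_estimate(rewards, target_probs, logging_probs, window=500))
```

Under reward drift, the full-history IPS estimate converges to a time-averaged value and drifts away from the current policy value, which is the growing bias described in the abstract; shrinking the window reduces this bias at the cost of higher variance, mirroring the bias-variance trade-off the paper addresses.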


      Published In

      WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
      January 2019
      874 pages
ISBN: 9781450359405
DOI: 10.1145/3289600
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 January 2019


      Author Tags

      1. non-stationary rewards
      2. off-policy evaluation

      Qualifiers

      • Research-article

      Conference

      WSDM '19

      Acceptance Rates

WSDM '19 paper acceptance rate: 84 of 511 submissions, 16%
Overall acceptance rate: 498 of 2,863 submissions, 17%
