research-article

Offline A/B Testing for Recommender Systems

Authors:

Alexandre Gilotte,

Clément Calauzènes,

Thomas Nedelec,

Alexandre Abraham,

Simon Dollé

WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining

Pages 198 - 206

https://doi.org/10.1145/3159652.3159687

Published: 02 February 2018

Abstract

Online A/B testing evaluates the impact of a new technology by running it in a real production environment and testing its performance on a subset of the users of the platform. It is a well-known practice to run a preliminary offline evaluation on historical data to iterate faster on new ideas, and to detect poor policies in order to avoid losing money or breaking the system. For such offline evaluations, we are interested in methods that can compute offline an estimate of the potential uplift of performance generated by a new technology. Offline performance can be measured using estimators known as counterfactual or off-policy estimators. Traditional counterfactual estimators, such as capped importance sampling or normalised importance sampling, exhibit unsatisfying bias-variance compromises when experimenting on personalized product recommendation systems. To overcome this issue, we model the bias incurred by these estimators rather than bound it in the worst case, which leads us to propose a new counterfactual estimator. We provide a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
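
The traditional estimators named in the abstract can be sketched as follows. This is a minimal illustration of basic, capped and normalised importance sampling, not the new estimator the paper proposes. The function names, the `cap` parameter and the toy log are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def importance_weights(pi_test: Sequence[float], pi_logged: Sequence[float]) -> List[float]:
    """Ratio of test-policy to logging-policy probability for each logged action."""
    return [pt / pl for pt, pl in zip(pi_test, pi_logged)]


def is_estimate(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Basic importance sampling: average of weight * reward over the log."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def capped_is_estimate(rewards: Sequence[float], weights: Sequence[float], cap: float) -> float:
    """Capped importance sampling: clip each weight at `cap` (less variance, some bias)."""
    return sum(min(w, cap) * r for w, r in zip(weights, rewards)) / len(rewards)


def normalised_is_estimate(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Normalised importance sampling: divide by the sum of weights instead of the log size."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Toy log with made-up numbers (assumption, not data from the paper).
rewards = [1.0, 0.0, 1.0, 0.0]          # observed reward per logged event
pi_logged = [0.5, 0.5, 0.125, 0.25]     # probability of the logged action under the production policy
pi_test = [0.25, 0.5, 0.5, 0.25]        # probability of the same action under the candidate policy

weights = importance_weights(pi_test, pi_logged)
is_value = is_estimate(rewards, weights)
capped_value = capped_is_estimate(rewards, weights, cap=2.0)
normalised_value = normalised_is_estimate(rewards, weights)
```

On this toy log the uncapped estimate exceeds 1 even though rewards are binary, because a single large weight dominates the sum. Capping and normalising both pull the estimate down, each at the cost of some bias, which is the bias-variance compromise the abstract refers to.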

Index Terms

Offline A/B Testing for Recommender Systems
1. Computing methodologies
  1. Machine learning
    1. Learning settings
      1. Learning from implicit feedback
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Published In

WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining

February 2018

821 pages

ISBN:9781450355810

DOI:10.1145/3159652

General Chairs: Yi Chang (Jilin University, Huawei Inc.), Chengxiang Zhai (University of Illinois Urbana-Champaign)

Program Chairs: Yan Liu (University of Southern California), Yoelle Maarek (Amazon)
Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Honorable Mention

Qualifiers

Research-article

Conference

WSDM 2018

WSDM 2018: The Eleventh ACM International Conference on Web Search and Data Mining

February 5 - 9, 2018

Marina Del Rey, CA, USA

Acceptance Rates

WSDM '18 Paper Acceptance Rate 81 of 514 submissions, 16%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%
