DOI: 10.1145/3159652.3159687
research-article

Offline A/B Testing for Recommender Systems

Published: 02 February 2018

Abstract

Online A/B testing evaluates the impact of a new technology by running it in a real production environment and testing its performance on a subset of the platform's users. It is common practice to first run a preliminary offline evaluation on historical data, both to iterate faster on new ideas and to detect poor policies before they lose money or break the system. For such offline evaluations, we are interested in methods that can estimate, offline, the potential performance uplift generated by a new technology. Offline performance can be measured using estimators known as counterfactual or off-policy estimators. Traditional counterfactual estimators, such as capped importance sampling or normalised importance sampling, exhibit unsatisfying bias-variance compromises when experimenting on personalized product recommendation systems. To overcome this issue, we model the bias incurred by these estimators rather than bound it in the worst case, which leads us to propose a new counterfactual estimator. We provide a benchmark of the different estimators showing their correlation with business metrics observed by running online A/B tests on a large-scale commercial recommender system.
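
To make the traditional estimators named in the abstract concrete, here is a minimal sketch of plain, capped, and normalised importance sampling in their textbook forms. It is not the new estimator proposed in the paper, which the abstract does not specify. The function names, the cap value, and the toy log are all illustrative assumptions.

```python
from typing import List, Sequence


def importance_weights(target_probs: Sequence[float],
                       logging_probs: Sequence[float]) -> List[float]:
    """Per-event weight: probability of the logged action under the new
    (target) policy divided by its probability under the logging policy."""
    return [p / q for p, q in zip(target_probs, logging_probs)]


def is_estimate(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Plain importance sampling: unbiased, but large weights inflate variance."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def capped_is_estimate(rewards: Sequence[float], weights: Sequence[float],
                       cap: float) -> float:
    """Capped importance sampling: clip each weight at `cap`,
    which lowers variance at the cost of introducing bias."""
    return sum(min(w, cap) * r for w, r in zip(weights, rewards)) / len(rewards)


def normalised_is_estimate(rewards: Sequence[float],
                           weights: Sequence[float]) -> float:
    """Normalised importance sampling: divide by the sum of the weights
    instead of the number of logged events."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Toy log with made-up numbers (assumption, for illustration only).
rewards = [1.0, 0.0, 1.0, 0.0]           # e.g. click / no click
logging_probs = [0.5, 0.5, 0.1, 0.25]    # logged action under the logging policy
target_probs = [0.25, 0.5, 0.8, 0.25]    # same action under the new policy
weights = importance_weights(target_probs, logging_probs)

print("IS:           ", is_estimate(rewards, weights))
print("capped IS:    ", capped_is_estimate(rewards, weights, cap=2.0))
print("normalised IS:", normalised_is_estimate(rewards, weights))
```

On this toy log a single event with a large weight dominates the plain estimate and pushes it above the maximum possible reward of 1. Capping or normalising pulls the estimate back into range, but both do so by accepting some bias, which is the trade-off the abstract refers to.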

Published In

WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
February 2018
821 pages
ISBN: 9781450355810
DOI: 10.1145/3159652
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Honorable Mention

Author Tags

  1. counterfactual estimation
  2. importance sampling
  3. off-policy evaluation
  4. recommender system

Qualifiers

  • Research-article

Conference

WSDM 2018

Acceptance Rates

WSDM '18 Paper Acceptance Rate 81 of 514 submissions, 16%
Overall Acceptance Rate 498 of 2,863 submissions, 17%
