

Estimating Error and Bias in Offline Evaluation Results

Published: 14 March 2020

Abstract

Offline evaluations of recommender systems attempt to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations provide researchers and developers with first approximations of the likely performance of a new system and help weed out bad ideas before presenting them to users. However, offline evaluation cannot accurately assess novel, relevant recommendations, because the most novel items were previously unknown to the user, so they are missing from the historical data and cannot be judged as relevant. We present a simulation study to estimate the error that such missing data causes in commonly-used evaluation metrics in order to assess its prevalence and impact. We find that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender. Substantial breakthroughs in recommendation quality, therefore, will be difficult to assess with existing offline techniques.
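The abstract's central claim, that an incomplete and popularity-biased observation process systematically distorts offline metric values, can be illustrated with a small simulation. The sketch below is not the paper's simulation model (the supplementary ZIP specifies that); it is a minimal illustration under assumed distributions, with an assumed popularity-skewed observation step, a hypothetical recall_at_k helper, and arbitrary constants.

# Minimal illustration (not the paper's simulation model): how an incomplete,
# popularity-biased observation process can distort offline metric values.
# All sizes, distributions, and the recall_at_k helper are illustrative
# assumptions chosen for this sketch.
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, k = 500, 200, 10

# "True" preferences: users like items with probability tied to a skewed
# item-popularity distribution.
item_pop = rng.dirichlet(np.full(n_items, 0.3))
true_rel = rng.random((n_users, n_items)) < np.minimum(item_pop * 40, 1.0)

# Observation process: only some truly liked items are ever recorded, and
# popular items are more likely to be observed (missing not at random).
obs_prob = np.clip(item_pop * 60, 0.05, 0.9)
observed = true_rel & (rng.random((n_users, n_items)) < obs_prob)

def recall_at_k(scores, truth, k=10):
    # Mean recall@k of a score matrix against a boolean relevance matrix.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(truth, topk, axis=1).sum(axis=1)
    return float(np.mean(hits / np.maximum(truth.sum(axis=1), 1)))

# Two recommenders: an oracle that knows the true preferences, and a
# non-personalized most-popular ranker fit to the observed data.
oracle_scores = true_rel.astype(float) + 1e-6 * rng.random((n_users, n_items))
pop_scores = np.tile(observed.sum(axis=0).astype(float), (n_users, 1))

for name, truth in [("observed (incomplete) data", observed),
                    ("true preferences", true_rel)]:
    print(f"Evaluated against {name}:")
    print(f"  oracle     recall@{k} = {recall_at_k(oracle_scores, truth, k):.3f}")
    print(f"  popularity recall@{k} = {recall_at_k(pop_scores, truth, k):.3f}")

Under these deliberately skewed assumptions, the popularity ranker's recall measured against the observed data is typically inflated relative to its recall against the full simulated truth, and it can even appear to beat the oracle, which is the kind of systematic mis-estimation the paper studies.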

Supplementary Material

ZIP File (chiirsp146aux.zip)
The ZIP archive includes code for reproducing the experiments, along with a more complete specification of the simulation models.




    Published In

    CHIIR '20: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval
    March 2020
    596 pages
    ISBN: 9781450368926
    DOI: 10.1145/3343413
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. offline evaluation
    2. simulation

    Qualifiers

    • Short-paper

    Conference

    CHIIR '20
    Acceptance Rates

    Overall Acceptance Rate 55 of 163 submissions, 34%


    Cited By

    • (2023) Understanding or Manipulation: Rethinking Online Performance Gains of Modern Recommender Systems. ACM Transactions on Information Systems 42(4), 1-32. https://doi.org/10.1145/3637869. Online publication date: 15-Dec-2023
    • (2023) Distributionally-Informed Recommender System Evaluation. ACM Transactions on Recommender Systems 2(1), 1-27. https://doi.org/10.1145/3613455. Online publication date: 5-Aug-2023
    • (2023) Introducing LensKit-Auto, an Experimental Automated Recommender System (AutoRecSys) Toolkit. Proceedings of the 17th ACM Conference on Recommender Systems, 1212-1216. https://doi.org/10.1145/3604915.3610656. Online publication date: 14-Sep-2023
    • (2023) Candidate Set Sampling for Evaluating Top-N Recommendation. 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 88-94. https://doi.org/10.1109/WI-IAT59888.2023.00018. Online publication date: 26-Oct-2023
    • (2021) Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently? Proceedings of the 15th ACM Conference on Recommender Systems, 708-713. https://doi.org/10.1145/3460231.3478848. Online publication date: 13-Sep-2021
    • (2021) SimuRec: Workshop on Synthetic Data and Simulation Methods for Recommender Systems Research. Proceedings of the 15th ACM Conference on Recommender Systems, 803-805. https://doi.org/10.1145/3460231.3470938. Online publication date: 13-Sep-2021
    • (2020) LensKit for Python: Next-Generation Software for Recommender Systems Experiments. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2999-3006. https://doi.org/10.1145/3340531.3412778. Online publication date: 19-Oct-2020
