

Estimating Error and Bias in Offline Evaluation Results

Published: 14 March 2020

Abstract

Offline evaluations of recommender systems attempt to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations provide researchers and developers with first approximations of the likely performance of a new system and help weed out bad ideas before presenting them to users. However, offline evaluation cannot accurately assess novel, relevant recommendations, because the most novel items were previously unknown to the user, so they are missing from the historical data and cannot be judged as relevant. We present a simulation study to estimate the error that such missing data causes in commonly-used evaluation metrics in order to assess its prevalence and impact. We find that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender. Substantial breakthroughs in recommendation quality, therefore, will be difficult to assess with existing offline techniques.
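The abstract's central claim, that an incomplete and popularity-biased observation process systematically distorts offline metric values, can be illustrated with a small simulation. The sketch below is not the paper's simulation model (the supplementary ZIP specifies that); it is a minimal illustration under assumed distributions, with an assumed popularity-skewed observation step, a hypothetical recall_at_k helper, and arbitrary constants.

# Minimal illustration (not the paper's simulation model): how an incomplete,
# popularity-biased observation process can distort offline metric values.
# All sizes, distributions, and the recall_at_k helper are illustrative
# assumptions chosen for this sketch.
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, k = 500, 200, 10

# "True" preferences: users like items with probability tied to a skewed
# item-popularity distribution.
item_pop = rng.dirichlet(np.full(n_items, 0.3))
true_rel = rng.random((n_users, n_items)) < np.minimum(item_pop * 40, 1.0)

# Observation process: only some truly liked items are ever recorded, and
# popular items are more likely to be observed (missing not at random).
obs_prob = np.clip(item_pop * 60, 0.05, 0.9)
observed = true_rel & (rng.random((n_users, n_items)) < obs_prob)

def recall_at_k(scores, truth, k=10):
    # Mean recall@k of a score matrix against a boolean relevance matrix.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(truth, topk, axis=1).sum(axis=1)
    return float(np.mean(hits / np.maximum(truth.sum(axis=1), 1)))

# Two recommenders: an oracle that knows the true preferences, and a
# non-personalized most-popular ranker fit to the observed data.
oracle_scores = true_rel.astype(float) + 1e-6 * rng.random((n_users, n_items))
pop_scores = np.tile(observed.sum(axis=0).astype(float), (n_users, 1))

for name, truth in [("observed (incomplete) data", observed),
                    ("true preferences", true_rel)]:
    print(f"Evaluated against {name}:")
    print(f"  oracle     recall@{k} = {recall_at_k(oracle_scores, truth, k):.3f}")
    print(f"  popularity recall@{k} = {recall_at_k(pop_scores, truth, k):.3f}")

Under these deliberately skewed assumptions, the popularity ranker's recall measured against the observed data is typically inflated relative to its recall against the full simulated truth, and it can even appear to beat the oracle, which is the kind of systematic mis-estimation the paper studies.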

Supplementary Material

ZIP File (chiirsp146aux.zip)
The ZIP archive includes code for reproducing the experiments, along with a more complete specification of the simulation models.




    Published In

    CHIIR '20: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval
    March 2020
    596 pages
    ISBN: 9781450368926
    DOI: 10.1145/3343413
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. offline evaluation
    2. simulation

    Qualifiers

    • Short-paper

    Conference

    CHIIR '20
    Acceptance Rates

    Overall Acceptance Rate 55 of 163 submissions, 34%


    Cited By

    • (2023) Understanding or Manipulation: Rethinking Online Performance Gains of Modern Recommender Systems. ACM Transactions on Information Systems 42(4), 1-32. https://doi.org/10.1145/3637869. Online publication date: 15-Dec-2023
    • (2023) Distributionally-Informed Recommender System Evaluation. ACM Transactions on Recommender Systems 2(1), 1-27. https://doi.org/10.1145/3613455. Online publication date: 5-Aug-2023
    • (2023) Introducing LensKit-Auto, an Experimental Automated Recommender System (AutoRecSys) Toolkit. Proceedings of the 17th ACM Conference on Recommender Systems, 1212-1216. https://doi.org/10.1145/3604915.3610656. Online publication date: 14-Sep-2023
    • (2023) Candidate Set Sampling for Evaluating Top-N Recommendation. 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 88-94. https://doi.org/10.1109/WI-IAT59888.2023.00018. Online publication date: 26-Oct-2023
    • (2021) Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently? Proceedings of the 15th ACM Conference on Recommender Systems, 708-713. https://doi.org/10.1145/3460231.3478848. Online publication date: 13-Sep-2021
    • (2021) SimuRec: Workshop on Synthetic Data and Simulation Methods for Recommender Systems Research. Proceedings of the 15th ACM Conference on Recommender Systems, 803-805. https://doi.org/10.1145/3460231.3470938. Online publication date: 13-Sep-2021
    • (2020) LensKit for Python: Next-Generation Software for Recommender Systems Experiments. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2999-3006. https://doi.org/10.1145/3340531.3412778. Online publication date: 19-Oct-2020
