DOI: 10.1145/2740908.2742565
research-article

Unbiased Ranking Evaluation on a Budget

Published: 18 May 2015

Abstract

We address the problem of assessing the quality of a ranking system (e.g., search engine, recommender system, review ranker) given a fixed budget for collecting expert judgments. In particular, we propose a method that selects which items to judge in order to optimize the accuracy of the quality estimate. Our method is not only efficient but also provides unbiased estimates, unlike common approaches that tend to underestimate performance or that are biased against new systems evaluated by reusing previous relevance scores.
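The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea suggested by the author tags (importance sampling, DCG), not the authors' actual method. It assumes ranks are sampled with replacement in proportion to the standard DCG discount 1/log2(rank + 1), and that a hypothetical `judge(i)` callback returns the expert relevance score for the item at 0-based rank `i`. The names `estimate_dcg`, `judge`, and `budget` are illustrative assumptions.

```python
import math
import random


def dcg(relevances):
    """Exact DCG: sum of rel_i / log2(i + 2) over 0-based ranks i."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def estimate_dcg(judge, n_items, budget, rng):
    """Importance-sampling estimate of DCG from `budget` sampled ranks.

    judge(i) is a hypothetical callback returning the relevance of the item
    at 0-based rank i (the expensive expert judgment). Ranks are drawn with
    replacement, with probability proportional to the DCG discount
    (an assumption of this sketch, not taken from the paper).
    """
    discounts = [1.0 / math.log2(i + 2) for i in range(n_items)]
    total = sum(discounts)
    probs = [d / total for d in discounts]

    cache = {}  # each distinct item is judged at most once
    acc = 0.0
    for i in rng.choices(range(n_items), weights=probs, k=budget):
        if i not in cache:
            cache[i] = judge(i)
        # Dividing by the sampling probability keeps the estimate unbiased:
        # E[d_i * r_i / p_i] = sum_i d_i * r_i = DCG.
        acc += discounts[i] * cache[i] / probs[i]
    return acc / budget


if __name__ == "__main__":
    rels = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]  # made-up relevance labels
    rng = random.Random(0)
    est = estimate_dcg(lambda i: rels[i], len(rels), budget=5, rng=rng)
    print("true DCG:", dcg(rels), "estimate from 5 draws:", est)
```

A single estimate from a small budget can be far from the true value; unbiasedness only means that the average over repeated runs converges to the exact DCG. A sampler that ignores low ranks entirely would instead systematically underestimate performance, which is the kind of bias the abstract warns about.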

Cited By

  • (2017) Can Deep Effectiveness Metrics Be Evaluated Using Shallow Judgment Pools? Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 35-44. DOI: 10.1145/3077136.3080793. Online publication date: 7-Aug-2017

    Published In

    WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web
    May 2015
    1602 pages
    ISBN: 9781450334730
    DOI: 10.1145/2740908

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. budget
    2. dcg
    3. evaluation
    4. importance sampling

    Qualifiers

    • Research-article

    Funding Sources

    • NSF

    Conference

    WWW '15
    Sponsor:
    • IW3C2

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
