research-article

Unbiased Ranking Evaluation on a Budget

Authors: Tobias Schnabel, Adith Swaminathan, Thorsten Joachims

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web

Pages 935 - 937

https://doi.org/10.1145/2740908.2742565

Published: 18 May 2015

Abstract

We address the problem of assessing the quality of a ranking system (e.g., search engine, recommender system, review ranker) given a fixed budget for collecting expert judgments. In particular, we propose a method that selects which items to judge so as to optimize the accuracy of the quality estimate. Our method is not only efficient but also provides unbiased estimates, unlike common approaches that tend to underestimate performance or that are biased against new systems evaluated by reusing previous relevance scores.
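To make the idea of an unbiased estimate under a judgment budget concrete, here is a minimal Python sketch. It is not the authors' method: it assumes the quality metric is DCG, that items to judge are drawn by importance sampling with probability proportional to the position discount, and that a hypothetical `judge` callback returns the expert relevance of an item. All function names are illustrative.

```python
import math
import random


def dcg_weights(num_items):
    """DCG position discounts 1 / log2(rank + 1) for ranks 1..num_items."""
    return [1.0 / math.log2(rank + 1) for rank in range(1, num_items + 1)]


def sampling_distribution(weights):
    """Assumed heuristic: sample each position proportionally to its discount."""
    total = sum(weights)
    return [w / total for w in weights]


def estimate_dcg(judge, num_items, budget, rng):
    """Importance-sampling estimate of DCG from `budget` sampled judgments.

    `judge(i)` is a hypothetical callback returning the expert relevance
    of the item at 0-based position i.
    """
    weights = dcg_weights(num_items)
    probs = sampling_distribution(weights)
    # Draw positions with replacement according to the sampling distribution.
    sampled = rng.choices(range(num_items), weights=probs, k=budget)
    # Dividing each term by its sampling probability makes the average
    # an unbiased estimate of sum_i weights[i] * relevance[i].
    return sum(weights[i] * judge(i) / probs[i] for i in sampled) / budget
```

Because every position has nonzero sampling probability, the expectation of the estimate equals the true DCG for any budget; the budget only controls the variance. Sampling is with replacement here, so a real system would likely cache judgments to avoid paying twice for the same item.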

Index Terms

Unbiased Ranking Evaluation on a Budget
1. Information systems
  1. Information retrieval

Published In

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web

May 2015

1602 pages

ISBN:9781450334730

DOI:10.1145/2740908

General Chairs:
Aldo Gangemi (National Research Council, Italy & Paris 13 University-CNRS, France),
Stefano Leonardi (Sapienza University of Rome, Italy),
Alessandro Panconesi (Sapienza University of Rome, Italy)

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW '15

Sponsor:

IW3C2

WWW '15: 24th International World Wide Web Conference

May 18 - 22, 2015

Florence, Italy

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
