DOI: 10.1145/1835449.1835541

Reusable test collections through experimental design

Published: 19 July 2010

Abstract

Portable, reusable test collections are a vital part of research and development in information retrieval. Reusability is difficult to assess, however. The standard approach--simulating judgment collection when groups of systems are held out, then evaluating those held-out systems--only works when there is a large set of relevance judgments to draw on during the simulation. As test collections adapt to larger and larger corpora, it becomes less and less likely that there will be sufficient judgments for such simulation experiments. Thus we propose a methodology for information retrieval experimentation that collects evidence for or against the reusability of a test collection while judgments are being made. Using this methodology along with the appropriate statistical analyses, researchers will be able to estimate the reusability of their test collections while building them and implement "course corrections" if the collection does not seem to be achieving desired levels of reusability. We show the robustness of our design to inherent sources of variance, and provide a description of an actual implementation of the framework for creating a large test collection.
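The hold-out simulation the abstract describes can be sketched in a few lines: pool judgments from a subset of systems, evaluate all systems against that pool, and compare the resulting ranking to the one obtained from the full pool. The sketch below is illustrative only; the system count, pool depth, simulated relevance rates, and the treatment of unjudged documents as nonrelevant are all invented assumptions, not the paper's actual experimental setup.

```python
import itertools
import random

random.seed(0)

# Hypothetical setup: 6 systems, 10 topics, 100 documents per topic.
N_SYS, N_TOPICS, N_DOCS, POOL_DEPTH = 6, 10, 100, 10

# Simulated truth: roughly 10% of documents are relevant to each topic.
rels = [{d for d in range(N_DOCS) if random.random() < 0.1}
        for _ in range(N_TOPICS)]

def make_system(quality):
    """A synthetic system: higher quality ranks relevant docs higher."""
    runs = []
    for t in range(N_TOPICS):
        scores = {d: quality * (d in rels[t]) + random.random()
                  for d in range(N_DOCS)}
        runs.append(sorted(scores, key=scores.get, reverse=True))
    return runs

systems = [make_system(q) for q in [2.0, 1.6, 1.2, 0.8, 0.4, 0.1]]

def pool(contributors):
    """Depth-k pool: the judged documents per topic come only from
    the contributing systems' top-k results."""
    return [{d for s in contributors for d in s[t][:POOL_DEPTH]}
            for t in range(N_TOPICS)]

def avg_prec(run_t, judged_t, rels_t):
    """Average precision, treating unjudged documents as nonrelevant."""
    hits, total = 0, 0.0
    for i, d in enumerate(run_t, 1):
        if d in judged_t and d in rels_t:
            hits += 1
            total += hits / i
    return total / max(1, len(rels_t & judged_t))

def ranking(judged):
    """Rank all systems by mean average precision over the topics."""
    maps = [sum(avg_prec(s[t], judged[t], rels[t])
                for t in range(N_TOPICS)) / N_TOPICS
            for s in systems]
    return sorted(range(N_SYS), key=lambda i: maps[i], reverse=True)

def kendall_tau(a, b):
    """Kendall's tau between two rankings of the same items."""
    pos_a = {x: i for i, x in enumerate(a)}
    pos_b = {x: i for i, x in enumerate(b)}
    conc = disc = 0
    for x, y in itertools.combinations(a, 2):
        agree = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        conc += agree > 0
        disc += agree < 0
    n = len(a)
    return (conc - disc) / (n * (n - 1) / 2)

# Hold each system out of pooling in turn; a high tau between the
# full-pool ranking and the held-out ranking is evidence the pool
# ranks non-contributing systems fairly.
full = ranking(pool(systems))
for held_out in range(N_SYS):
    contributors = [s for i, s in enumerate(systems) if i != held_out]
    tau = kendall_tau(full, ranking(pool(contributors)))
    print(f"hold out system {held_out}: tau = {tau:.2f}")
```

The point of the paper is that this kind of simulation requires many judgments to already exist; its proposed design instead gathers the evidence for reusability while the judgments are being collected.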





Published In

SIGIR '10: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2010, 944 pages
ISBN: 9781450301534
DOI: 10.1145/1835449

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. evaluation
    2. information retrieval
    3. reusability
    4. test collections


    Acceptance Rates

SIGIR '10 Paper Acceptance Rate: 87 of 520 submissions, 17%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%


Cited By

    • (2022) Information Retrieval Evaluation. Online publication date: 10-Mar-2022
    • (2019) Traversing semantically annotated queries for task-oriented query recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 511-515. 10.1145/3298689.3346994. Online publication date: 10-Sep-2019
    • (2017) Building Test Collections. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1407-1410. 10.1145/3077136.3082064. Online publication date: 7-Aug-2017
    • (2016) Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1199-1201. 10.1145/2911451.2914803. Online publication date: 7-Jul-2016
    • (2016) A Short Survey on Online and Offline Methods for Search Quality Evaluation. In Information Retrieval, pages 38-87. 10.1007/978-3-319-41718-9_3. Online publication date: 26-Jul-2016
    • (2015) Statistical Significance Testing in Information Retrieval. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, pages 7-9. 10.1145/2808194.2809445. Online publication date: 27-Sep-2015
    • (2015) Pooling-based continuous evaluation of information retrieval systems. Information Retrieval Journal, 18(5):445-472. 10.1007/s10791-015-9266-y. Online publication date: 8-Sep-2015
    • (2014) Improving test collection pools with machine learning. In Proceedings of the 19th Australasian Document Computing Symposium, pages 2-9. 10.1145/2682862.2682864. Online publication date: 26-Nov-2014
    • (2013) Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems, 41(3):345-369. 10.1007/s10844-013-0249-4. Online publication date: 1-Dec-2013
    • (2012) The Reusability of a Diversified Search Test Collection. In Information Retrieval Technology, pages 26-38. 10.1007/978-3-642-35341-3_3. Online publication date: 2012
