DOI: 10.1145/2484028.2484038

On the measurement of test collection reliability

Published: 28 July 2013

Abstract

The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to find a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative founded on analysis of variance that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice, because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
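
To make the Generalizability Theory indicators mentioned in the abstract concrete, the sketch below estimates the variance components of a fully crossed system-by-query design and derives the relative (Eρ²) and absolute (Φ) stability coefficients for a hypothetical number of queries. This is a minimal illustration under standard G-theory formulas, not the authors' code; the function name, the NumPy dependency, and the random example matrix are assumptions made purely for the example.

import numpy as np

def g_theory_reliability(scores, n_queries_prime=None):
    # scores: (n_systems, n_queries) matrix of per-query effectiveness scores,
    # e.g. Average Precision. Returns (Erho2, Phi): relative and absolute
    # stability coefficients for a hypothetical collection of
    # n_queries_prime queries (defaults to the observed number of queries).
    n_s, n_q = scores.shape
    if n_queries_prime is None:
        n_queries_prime = n_q

    grand = scores.mean()
    sys_means = scores.mean(axis=1)   # mean score of each system
    qry_means = scores.mean(axis=0)   # mean score of each query

    # Mean squares of the two-way crossed ANOVA (one observation per cell).
    ms_s = n_q * np.sum((sys_means - grand) ** 2) / (n_s - 1)
    ms_q = n_s * np.sum((qry_means - grand) ** 2) / (n_q - 1)
    resid = scores - sys_means[:, None] - qry_means[None, :] + grand
    ms_e = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))

    # Variance components; negative estimates are clipped at zero.
    var_s = max((ms_s - ms_e) / n_q, 0.0)  # system effect
    var_q = max((ms_q - ms_e) / n_s, 0.0)  # query effect
    var_e = ms_e                           # system-query interaction plus error

    erho2 = var_s / (var_s + var_e / n_queries_prime)            # relative stability
    phi = var_s / (var_s + (var_q + var_e) / n_queries_prime)    # absolute stability
    return erho2, phi

# Toy usage with random data standing in for a TREC system-by-query matrix
# (the 30x50 shape and beta-distributed scores are illustrative assumptions).
rng = np.random.default_rng(0)
ap = rng.beta(2, 5, size=(30, 50))
print(g_theory_reliability(ap))           # reliability at the observed 50 queries
print(g_theory_reliability(ap, 200))      # projected reliability with 200 queries

Projecting the coefficients for a larger hypothetical query set, as in the second call above, is the same what-if question that swap rates and Kendall tau correlations address empirically, which is why relating the two families of indicators matters for practical interpretation.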

    Published In

    SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
    July 2013
    1188 pages
    ISBN: 9781450320344
    DOI: 10.1145/2484028

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. evaluation
    2. generalizability theory
    3. reliability
    4. test collection
    5. trec

    Qualifiers

    • Research-article

    Conference

    SIGIR '13

    Acceptance Rates

    SIGIR '13 Paper Acceptance Rate: 73 of 366 submissions, 20%
    Overall Acceptance Rate: 792 of 3,983 submissions, 20%


    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media