DOI: 10.1145/2484028.2484038

On the measurement of test collection reliability

Published: 28 July 2013

Abstract

The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to find a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative founded on analysis of variance that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice, because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
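
To make the Generalizability Theory indicators mentioned in the abstract concrete, the sketch below estimates the variance components of a fully crossed system-by-query design and derives the relative (Eρ²) and absolute (Φ) stability coefficients for a hypothetical number of queries. This is a minimal illustration under standard G-theory formulas, not the authors' code; the function name, the NumPy dependency, and the random example matrix are assumptions made purely for the example.

import numpy as np

def g_theory_reliability(scores, n_queries_prime=None):
    # scores: (n_systems, n_queries) matrix of per-query effectiveness scores,
    # e.g. Average Precision. Returns (Erho2, Phi): relative and absolute
    # stability coefficients for a hypothetical collection of
    # n_queries_prime queries (defaults to the observed number of queries).
    n_s, n_q = scores.shape
    if n_queries_prime is None:
        n_queries_prime = n_q

    grand = scores.mean()
    sys_means = scores.mean(axis=1)   # mean score of each system
    qry_means = scores.mean(axis=0)   # mean score of each query

    # Mean squares of the two-way crossed ANOVA (one observation per cell).
    ms_s = n_q * np.sum((sys_means - grand) ** 2) / (n_s - 1)
    ms_q = n_s * np.sum((qry_means - grand) ** 2) / (n_q - 1)
    resid = scores - sys_means[:, None] - qry_means[None, :] + grand
    ms_e = np.sum(resid ** 2) / ((n_s - 1) * (n_q - 1))

    # Variance components; negative estimates are clipped at zero.
    var_s = max((ms_s - ms_e) / n_q, 0.0)  # system effect
    var_q = max((ms_q - ms_e) / n_s, 0.0)  # query effect
    var_e = ms_e                           # system-query interaction plus error

    erho2 = var_s / (var_s + var_e / n_queries_prime)            # relative stability
    phi = var_s / (var_s + (var_q + var_e) / n_queries_prime)    # absolute stability
    return erho2, phi

# Toy usage with random data standing in for a TREC system-by-query matrix
# (the 30x50 shape and beta-distributed scores are illustrative assumptions).
rng = np.random.default_rng(0)
ap = rng.beta(2, 5, size=(30, 50))
print(g_theory_reliability(ap))           # reliability at the observed 50 queries
print(g_theory_reliability(ap, 200))      # projected reliability with 200 queries

Projecting the coefficients for a larger hypothetical query set, as in the second call above, is the same what-if question that swap rates and Kendall tau correlations address empirically, which is why relating the two families of indicators matters for practical interpretation.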

    Published In

    SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
    July 2013
    1188 pages
    ISBN: 9781450320344
    DOI: 10.1145/2484028

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. evaluation
    2. generalizability theory
    3. reliability
    4. test collection
    5. trec

    Qualifiers

    • Research-article

    Conference

    SIGIR '13

    Acceptance Rates

    SIGIR '13 Paper Acceptance Rate: 73 of 366 submissions, 20%
    Overall Acceptance Rate: 792 of 3,983 submissions, 20%


    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media