DOI: 10.1145/3340531.3411998

Corpus Bootstrapping for Assessment of the Properties of Effectiveness Measures

Published: 19 October 2020

Abstract

Bootstrapping is an established tool for examining the behaviour of offline information retrieval (IR) experiments, where it has primarily been used to assess statistical significance and the robustness of significance tests. In this work we consider how bootstrapping can be used to assess the reliability of effectiveness measures for experimental IR. We use bootstrapping of the corpus of documents rather than, as in most prior work, the set of queries. We demonstrate that bootstrapping can provide new insights into the behaviour of effectiveness measures: the precision of the measurement of a system for a query can be quantified; some measures are more consistent than others; rankings of systems on a test corpus likewise have a precision (or uncertainty) that can be quantified; and, in experiments with limited volumes of relevance judgements, measures can be wildly different in terms of reliability and precision. Our results show that the uncertainty in measurement and ranking of system performance can be substantial and thus our approach to corpus bootstrapping provides a key tool for helping experimenters to choose measures and understand reported outcomes.
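A minimal sketch of the idea follows (not the authors' exact procedure): resample the document corpus with replacement, re-score each replicate, and read the spread of scores as the uncertainty of the measurement. The Python below uses invented toy data (corpus, ranking, and judgements are all made up), counts documents drawn more than once only once, and simply drops unsampled documents from the ranking and from the relevance judgements; it is an illustration of corpus bootstrapping under those simplifying assumptions, not a reimplementation of the paper's experiments.

    import random
    from statistics import mean, stdev

    def average_precision(ranking, relevant):
        """Uninterpolated average precision for one ranked list."""
        hits, total = 0, 0.0
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / i
        return total / len(relevant) if relevant else 0.0

    def rbp(ranking, relevant, p=0.8):
        """Rank-biased precision with persistence p (binary relevance)."""
        return (1 - p) * sum(p ** i for i, doc in enumerate(ranking) if doc in relevant)

    def corpus_bootstrap(measure, corpus, ranking, relevant, replicates=1000, seed=0):
        """Resample the corpus with replacement and re-score one run on each
        replicate; documents not drawn are dropped from the ranking and from
        the judgements (duplicate draws are ignored in this sketch)."""
        rng = random.Random(seed)
        scores = []
        for _ in range(replicates):
            sampled = set(rng.choices(corpus, k=len(corpus)))
            boot_ranking = [d for d in ranking if d in sampled]
            scores.append(measure(boot_ranking, relevant & sampled))
        return mean(scores), stdev(scores)

    # Toy data: a 20-document corpus, one query, one system's ranked list.
    corpus = [f"d{i}" for i in range(20)]
    ranking = ["d3", "d7", "d1", "d12", "d5", "d9", "d0", "d15"]
    relevant = {"d3", "d12", "d9", "d18"}

    for name, measure in [("AP", average_precision), ("RBP(0.8)", rbp)]:
        m, s = corpus_bootstrap(measure, corpus, ranking, relevant)
        print(f"{name}: {m:.3f} +/- {s:.3f}")

Repeating this per query and per system and then aggregating is what allows the precision of a single score, and the stability of a ranking of systems, to be quantified, which is the use to which the paper puts corpus bootstrapping.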

Supplementary Material

MP4 File (3340531.3411998.mp4)
10-minute presentation of the 2020 CIKM paper "Corpus Bootstrapping for Assessment of the Properties of Effectiveness Measures". The talk explains how repeated re-sampling of a document corpus can be used to produce versions of the corpus that induce variation in the measured properties of information retrieval systems. These measurements can be taken with several commonly used tools that behave differently; the differences are illustrated and quantified by bootstrapping, showing for example that some of the tools are much more accurate than others. A particular concern demonstrated by bootstrapping is that a popular tool can, in some circumstances, produce wildly misleading results.



    Published In

    CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
    October 2020
    3619 pages
    ISBN: 9781450368599
    DOI: 10.1145/3340531


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bootstrap
    2. corpus properties
    3. experimental design
    4. measurement

    Qualifiers

    • Research-article

    Conference

    CIKM '20

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


    Bibliometrics & Citations


    Article Metrics

    • Downloads (last 12 months): 14
    • Downloads (last 6 weeks): 4
    Reflects downloads up to 29 Sep 2024


    Cited By

    • (2024) Query Variability and Experimental Consistency: A Concerning Case Study. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 35-41. DOI: 10.1145/3664190.3672519. Online publication date: 2-Aug-2024.
    • (2024) Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. Advances in Information Retrieval, pp. 286-302. DOI: 10.1007/978-3-031-56060-6_19. Online publication date: 16-Mar-2024.
    • (2023) The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures. ACM Transactions on Information Systems, 42(1), pp. 1-31. DOI: 10.1145/3596511. Online publication date: 18-Aug-2023.
    • (2023) When Measurement Misleads. ACM SIGIR Forum, 56(1), pp. 1-20. DOI: 10.1145/3582524.3582540. Online publication date: 27-Jan-2023.
    • (2023) Bootstrapped nDCG Estimation in the Presence of Unjudged Documents. Advances in Information Retrieval, pp. 313-329. DOI: 10.1007/978-3-031-28244-7_20. Online publication date: 17-Mar-2023.
    • (2022) How Train–Test Leakage Affects Zero-Shot Retrieval. String Processing and Information Retrieval, pp. 147-161. DOI: 10.1007/978-3-031-20643-6_11. Online publication date: 8-Nov-2022.
    • (2021) Evaluating the Predictivity of IR Experiments. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1667-1671. DOI: 10.1145/3404835.3463040. Online publication date: 11-Jul-2021.
    • (2021) TILDE: Term Independent Likelihood moDEl for Passage Re-ranking. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1483-1492. DOI: 10.1145/3404835.3462922. Online publication date: 11-Jul-2021.
    • (2021) Deep Query Likelihood Model for Information Retrieval. Advances in Information Retrieval, pp. 463-470. DOI: 10.1007/978-3-030-72240-1_49. Online publication date: 28-Mar-2021.
