DOI: 10.1145/3340531.3411998

Corpus Bootstrapping for Assessment of the Properties of Effectiveness Measures

Published: 19 October 2020

Abstract

Bootstrapping is an established tool for examining the behaviour of offline information retrieval (IR) experiments, where it has primarily been used to assess statistical significance and the robustness of significance tests. In this work we consider how bootstrapping can be used to assess the reliability of effectiveness measures for experimental IR. We use bootstrapping of the corpus of documents rather than, as in most prior work, the set of queries. We demonstrate that bootstrapping can provide new insights into the behaviour of effectiveness measures: the precision of the measurement of a system for a query can be quantified; some measures are more consistent than others; rankings of systems on a test corpus likewise have a precision (or uncertainty) that can be quantified; and, in experiments with limited volumes of relevance judgements, measures can be wildly different in terms of reliability and precision. Our results show that the uncertainty in measurement and ranking of system performance can be substantial and thus our approach to corpus bootstrapping provides a key tool for helping experimenters to choose measures and understand reported outcomes.
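A minimal sketch of the idea follows (not the authors' exact procedure): resample the document corpus with replacement, re-score each replicate, and read the spread of scores as the uncertainty of the measurement. The Python below uses invented toy data (corpus, ranking, and judgements are all made up), counts documents drawn more than once only once, and simply drops unsampled documents from the ranking and from the relevance judgements; it is an illustration of corpus bootstrapping under those simplifying assumptions, not a reimplementation of the paper's experiments.

    import random
    from statistics import mean, stdev

    def average_precision(ranking, relevant):
        """Uninterpolated average precision for one ranked list."""
        hits, total = 0, 0.0
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / i
        return total / len(relevant) if relevant else 0.0

    def rbp(ranking, relevant, p=0.8):
        """Rank-biased precision with persistence p (binary relevance)."""
        return (1 - p) * sum(p ** i for i, doc in enumerate(ranking) if doc in relevant)

    def corpus_bootstrap(measure, corpus, ranking, relevant, replicates=1000, seed=0):
        """Resample the corpus with replacement and re-score one run on each
        replicate; documents not drawn are dropped from the ranking and from
        the judgements (duplicate draws are ignored in this sketch)."""
        rng = random.Random(seed)
        scores = []
        for _ in range(replicates):
            sampled = set(rng.choices(corpus, k=len(corpus)))
            boot_ranking = [d for d in ranking if d in sampled]
            scores.append(measure(boot_ranking, relevant & sampled))
        return mean(scores), stdev(scores)

    # Toy data: a 20-document corpus, one query, one system's ranked list.
    corpus = [f"d{i}" for i in range(20)]
    ranking = ["d3", "d7", "d1", "d12", "d5", "d9", "d0", "d15"]
    relevant = {"d3", "d12", "d9", "d18"}

    for name, measure in [("AP", average_precision), ("RBP(0.8)", rbp)]:
        m, s = corpus_bootstrap(measure, corpus, ranking, relevant)
        print(f"{name}: {m:.3f} +/- {s:.3f}")

Repeating this per query and per system and then aggregating is what allows the precision of a single score, and the stability of a ranking of systems, to be quantified, which is the use to which the paper puts corpus bootstrapping.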

Supplementary Material

MP4 File (3340531.3411998.mp4)
10-minute presentation of the 2020 CIKM paper "Corpus Bootstrapping for Assessment of the Properties of Effectiveness Measures". The talk explains how repeated re-sampling of a document corpus can be used to produce versions of the corpus that induce variation in the measured properties of information retrieval systems. These measurements can be taken with several commonly used tools that behave differently; the differences are illustrated and quantified by bootstrapping, showing for example that some of the tools are much more accurate than others. A particular concern demonstrated by bootstrapping is that a popular tool can, in some circumstances, produce wildly misleading results.



    Published In

    CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
    October 2020
    3619 pages
    ISBN: 9781450368599
    DOI: 10.1145/3340531


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bootstrap
    2. corpus properties
    3. experimental design
    4. measurement

    Qualifiers

    • Research-article

    Conference

    CIKM '20

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


    Bibliometrics & Citations


    Article Metrics

    • Downloads (last 12 months): 14
    • Downloads (last 6 weeks): 4
    Reflects downloads up to 29 Sep 2024


    Cited By

    • (2024) Query Variability and Experimental Consistency: A Concerning Case Study. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 35-41. DOI: 10.1145/3664190.3672519. Online publication date: 2-Aug-2024.
    • (2024) Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. Advances in Information Retrieval, pp. 286-302. DOI: 10.1007/978-3-031-56060-6_19. Online publication date: 16-Mar-2024.
    • (2023) The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures. ACM Transactions on Information Systems, 42(1), pp. 1-31. DOI: 10.1145/3596511. Online publication date: 18-Aug-2023.
    • (2023) When Measurement Misleads. ACM SIGIR Forum, 56(1), pp. 1-20. DOI: 10.1145/3582524.3582540. Online publication date: 27-Jan-2023.
    • (2023) Bootstrapped nDCG Estimation in the Presence of Unjudged Documents. Advances in Information Retrieval, pp. 313-329. DOI: 10.1007/978-3-031-28244-7_20. Online publication date: 17-Mar-2023.
    • (2022) How Train–Test Leakage Affects Zero-Shot Retrieval. String Processing and Information Retrieval, pp. 147-161. DOI: 10.1007/978-3-031-20643-6_11. Online publication date: 8-Nov-2022.
    • (2021) Evaluating the Predictivity of IR Experiments. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1667-1671. DOI: 10.1145/3404835.3463040. Online publication date: 11-Jul-2021.
    • (2021) TILDE: Term Independent Likelihood moDEl for Passage Re-ranking. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1483-1492. DOI: 10.1145/3404835.3462922. Online publication date: 11-Jul-2021.
    • (2021) Deep Query Likelihood Model for Information Retrieval. Advances in Information Retrieval, pp. 463-470. DOI: 10.1007/978-3-030-72240-1_49. Online publication date: 28-Mar-2021.
