DOI: 10.1145/1321440.1321528
research-article

A comparison of statistical significance tests for information retrieval evaluation

Published: 06 November 2007

Abstract

Information retrieval (IR) researchers commonly use three tests of statistical significance: Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR, but these tests have seen little use. For each of these five tests, we took the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t tests. Both the Wilcoxon and sign tests have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test, and their use should be discontinued for measuring the significance of a difference between means.
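
To make the randomization test concrete, here is a minimal sketch of a two-sided paired randomization test on the difference in mean scores. It assumes each run is represented as a list of per-topic average precision values in the same topic order. The function name `paired_randomization_test`, its parameters, and the sample scores are hypothetical illustrations, not the authors' implementation. Under the null hypothesis the two system labels are exchangeable on every topic, which is equivalent to flipping the sign of each per-topic difference at random.

```python
import itertools
import random


def paired_randomization_test(scores_a, scores_b, n_samples=10000, seed=0):
    """Two-sided paired randomization test on the difference in mean scores.

    Returns the fraction of sign assignments whose absolute mean difference
    is at least as large as the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both runs must be scored on the same non-empty topic set")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n

    if 2 ** n <= n_samples:
        # Few topics: enumerate all 2^n sign assignments (exact test).
        assignments = itertools.product((1, -1), repeat=n)
        total = 2 ** n
    else:
        # Many topics: draw random sign assignments (Monte Carlo approximation).
        rng = random.Random(seed)
        assignments = ([rng.choice((1, -1)) for _ in range(n)]
                       for _ in range(n_samples))
        total = n_samples

    tolerance = 1e-12  # treat floating-point near-ties as ties
    extreme = 0
    for signs in assignments:
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - tolerance:
            extreme += 1
    return extreme / total


# Hypothetical per-topic average precision for two runs on four topics.
run_a = [0.5, 0.6, 0.7, 0.8]
run_b = [0.4, 0.4, 0.4, 0.4]
print(paired_randomization_test(run_a, run_b))  # 0.125
```

With four topics there are only 16 sign assignments, and just two of them (all positive, all negative) are as extreme as the observed difference, giving 2/16 = 0.125. The sketch keeps the magnitude of every per-topic difference in the test statistic; the sign test, by contrast, looks only at the direction of each difference.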

Supplementary Material

PDF File (p623-smucker.pdf)
This is the original PDF as published in the proceedings. An error was found in the Conclusion and corrected post-publication. The Corrected Version of Record is now posted in the ACM Digital Library.

    Published In

    CIKM '07: Proceedings of the sixteenth ACM conference on information and knowledge management
    November 2007
    1048 pages
    ISBN:9781595938039
    DOI:10.1145/1321440
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bootstrap
    2. hypothesis test
    3. permutation
    4. randomization
    5. sign
    6. statistical significance
    7. student's t-test
    8. wilcoxon

    Qualifiers

    • Research-article

    Conference

    CIKM07

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
