research-article

A comparison of statistical significance tests for information retrieval evaluation

Authors: Mark D. Smucker, Ben Carterette

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Pages 623 - 632

https://doi.org/10.1145/1321440.1321528

Published: 06 November 2007

Abstract

Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR but these tests have seen little use. For each of these five tests, we took the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t tests. Both the Wilcoxon and sign test have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test and their use should be discontinued for measuring the significance of a difference between means.
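To make the randomization test mentioned in the abstract concrete, the following is a minimal sketch of a two-sided paired randomization (permutation) test on the difference in mean average precision. It is an illustration under stated assumptions, not the authors' implementation: the function name `randomization_test`, the Monte Carlo sampling of sign assignments (rather than exact enumeration), and the default of 10,000 samples are all choices made here, and the inputs are assumed to be per-topic average precision scores for two runs over the same topics.

```python
import random
from statistics import mean


def randomization_test(scores_a, scores_b, n_samples=10_000, seed=0):
    """Approximate two-sided p-value for the difference in mean score.

    scores_a and scores_b are per-topic scores (for example, average
    precision) for two runs, aligned so index i is the same topic in both.
    The name, defaults, and Monte Carlo sampling are illustrative assumptions.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs need exactly one score per topic")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))

    rng = random.Random(seed)  # seeded so repeated calls give the same estimate
    at_least_as_extreme = 0
    for _ in range(n_samples):
        # Under the null hypothesis the two run labels are exchangeable
        # within each topic, so each per-topic difference is equally
        # likely to have either sign.
        permuted = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            at_least_as_extreme += 1

    return at_least_as_extreme / n_samples


if __name__ == "__main__":
    # Hypothetical per-topic scores, invented purely for illustration.
    run_a = [0.42, 0.31, 0.55, 0.20, 0.61, 0.47, 0.38, 0.52]
    run_b = [0.35, 0.30, 0.41, 0.22, 0.50, 0.40, 0.33, 0.45]
    print(f"estimated p-value: {randomization_test(run_a, run_b):.4f}")
```

The test statistic here is the difference between means, which is the quantity the abstract says the tests are applied to. Replacing the per-topic differences with their signs or ranks before permuting would give tests closer in spirit to the sign and Wilcoxon tests, which is one way to read the abstract's remark that they are simplified variants of the randomization test.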

Supplementary Material

PDF File (p623-smucker.pdf)

This is the original PDF as published in the proceedings. An error was found in the Conclusion and corrected post-publication. The Corrected Version of Record is now posted in the ACM Digital Library. See Full Text above.

Index Terms

Information systems → Information retrieval

Published In

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

November 2007

1048 pages

ISBN: 9781595938039

DOI: 10.1145/1321440

Co-chair: Alberto H. F. Laender
Conference Chairs: André O. Falcão (Universidade de Lisboa, Portugal), Øystein Haug Olsen
General Chair: Mário J. Silva (Universidade de Lisboa, Portugal)
Program Chairs: Ricardo Baeza-Yates, Deborah L. McGuinness, Bjorn Olstad

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM07: Conference on Information and Knowledge Management

November 6 - 10, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
