Evaluating Evaluation Measure Stability

Authors:

Ellen M. Voorhees

ACM SIGIR Forum, Volume 51, Issue 2

Pages 235 - 242

https://doi.org/10.1145/3130348.3130373

Published: 02 August 2017

Abstract

This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb that experimenters use, such as that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries, or will have to require a very large difference in evaluation scores between two methods before concluding that the two methods are actually different.
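For readers unfamiliar with the two measures the abstract compares, the following is a minimal sketch of their standard textbook definitions for a single query. It is not taken from the paper and does not reproduce its error-rate analysis; the function names, the boolean-list input format, and the example judgments are assumptions made for illustration.

```python
from typing import Sequence


def precision_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Ranks beyond the end of a short result list count as non-relevant.
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance: Sequence[bool], num_relevant: int) -> float:
    """Sum of the precision at each rank holding a relevant document,
    divided by the total number of relevant documents for the query."""
    if num_relevant <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


# Made-up judgments for one query: relevant documents at ranks 1 and 3.
ranking = [True, False, True, False, False]
p_at_5 = precision_at_k(ranking, 5)              # 2 relevant in the top 5
ap = average_precision(ranking, num_relevant=2)  # (1/1 + 2/3) / 2
```

Precision at a fixed cutoff depends only on how many relevant documents fall above the cutoff, while Average Precision is sensitive to the rank of every relevant document.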

Index Terms

Evaluating Evaluation Measure Stability
1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

ACM SIGIR Forum Volume 51, Issue 2

SIGIR Test-of-Time Awardees 1978-2001

July 2017

276 pages

ISSN: 0163-5840

DOI: 10.1145/3130348

Editors:
Donna Harman, National Institute of Standards and Technology, Gaithersburg MD, USA
Diane Kelly, University of Tennessee, Knoxville TN, USA

Copyright © 2017. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Column
