Evaluating evaluation measure stability

Authors:

Ellen M. Voorhees

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

Pages 33 - 40

https://doi.org/10.1145/345508.345543

Published: 01 July 2000

Abstract

This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb experimenters use, such as that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries, or will have to require a very large difference in evaluation scores between two methods, before concluding that the two methods are actually different.
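For readers unfamiliar with the two measures the abstract compares, the following is a minimal sketch of how Precision at k documents and Average Precision are conventionally computed for a single query. It uses the standard textbook definitions, not code from the paper; the function names, the boolean relevance list, and the example ranking are all illustrative assumptions.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Ranks beyond the end of the list count as non-relevant.
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[bool], num_relevant: int) -> float:
    """Sum of the precision at each rank holding a relevant document,
    divided by the total number of relevant documents for the query
    (relevant documents that are never retrieved contribute zero)."""
    if num_relevant <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


# Hypothetical ranking for one query: True marks a relevant document.
# Assume 4 relevant documents exist in total, 3 of which are retrieved.
ranking = [True, False, True, False, False, True, False, False, False, False]

p_at_10 = precision_at_k(ranking, 10)
avg_prec = average_precision(ranking, num_relevant=4)
print(f"P@10 = {p_at_10:.3f}, AP = {avg_prec:.3f}")
```

Precision at k looks only at a fixed cutoff, whereas Average Precision takes into account the rank of every relevant document; in an experiment, both scores would then be averaged over all queries in the test collection.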

Index Terms

Evaluating evaluation measure stability
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Information

Published In

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

July 2000

396 pages

ISBN:1581132263

DOI:10.1145/345508

Chairmen:
Emmanuel Yannakoudakis (Athens Univ. of Economics and Business, Greece)
Nicholas J. Belkin (Rutgers Univ.)
Mun-Kew Leong (Kent Ridge Digital Labs)
Peter Ingwersen (Royal School of Library and Information Science)

Copyright © 2000 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Greek Com Soc: Greek Computer Society
SIGIR: ACM Special Interest Group on Information Retrieval
Athens U of Econ & Business: Athens University of Economics and Business

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR00

SIGIR00: 23rd ACM International SIGIR Conference on Research and Development in Information Retrieval

July 24 - 28, 2000

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
