DOI: 10.1145/383952.383992
Article

Why batch and user evaluations do not give the same results

Published: 01 September 2001

Abstract

Much system-oriented evaluation of information retrieval systems has used the Cranfield approach based upon queries run against test collections in a batch mode. Some researchers have questioned whether this approach can be applied to the real world, but little data exists for or against that assertion. We have studied this question in the context of the TREC Interactive Track. Previous results demonstrated that improved performance as measured by relevance-based metrics in batch studies did not correspond with the results of outcomes based on real user searching tasks. The experiments in this paper analyzed those results to determine why this occurred. Our assessment showed that while the queries entered by real users into systems yielding better results in batch studies gave comparable gains in ranking of relevant documents for those users, they did not translate into better performance on specific tasks. This was most likely due to users being able to adequately find and utilize relevant documents ranked further down the output list.
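To make the abstract's reasoning concrete: a batch-mode gain in a relevance-based ranking metric need not improve a user's task outcome if relevant documents were already within the depth the user is willing to scan. The following Python sketch is purely illustrative; the rankings, relevance judgments, scan depth, and the choice of average precision as the batch metric are all assumptions for illustration, not data or methods from the paper or from TREC.

    # Hypothetical illustration: a batch metric (average precision) improves
    # between two runs, while a user-oriented outcome ("a relevant document
    # appears within the depth the user scans") is identical for both.

    def average_precision(ranking, relevant):
        """Batch-style metric: mean precision at the rank of each relevant document."""
        hits, precisions = 0, []
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    def task_success(ranking, relevant, scan_depth=10):
        """User-style outcome: at least one relevant document within the scanned depth."""
        return any(doc in relevant for doc in ranking[:scan_depth])

    relevant = {"d3", "d7"}  # hypothetical relevance judgments
    baseline = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    improved = ["d3", "d1", "d2", "d7", "d4", "d5", "d6", "d8", "d9", "d10"]

    for name, run in [("baseline", baseline), ("improved", improved)]:
        print(name, round(average_precision(run, relevant), 3), task_success(run, relevant))
    # baseline 0.31 True
    # improved 0.75 True
    # The "improved" run more than doubles the batch metric, yet both runs
    # succeed on the task because the user scans deep enough either way.

This mirrors the paper's conclusion only in spirit: when users will examine ten or so documents anyway, moving relevant documents from ranks 3 and 7 up to ranks 1 and 4 raises the batch score without changing whether the task gets done.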




Published In

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN: 1581133316
DOI: 10.1145/383952

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGIR '01 paper acceptance rate: 47 of 201 submissions (23%)
Overall acceptance rate: 792 of 3,983 submissions (20%)

