Article

Reliable information retrieval evaluation with incomplete and biased judgements

Authors:

Stefan Büttcher,

Charles L. A. Clarke,

Peter C. K. Yeung,

Ian Soboroff

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 63 - 70

https://doi.org/10.1145/1277741.1277755

Published: 23 July 2007

Abstract

Information retrieval evaluation based on the pooling method is inherently biased against systems that did not contribute to the pool of judged documents. This may distort the results obtained about the relative quality of the systems evaluated and thus lead to incorrect conclusions about the performance of a particular ranking technique.
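The bias described above can be illustrated with a small toy example. This is only a sketch under stated assumptions: the document IDs, the two contributing runs, the pool depth, and the helper names `build_pool` and `precision_at_k` are all hypothetical and are not taken from the paper. It assumes the common convention that a document missing from the judgements is counted as non-relevant.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents of every contributing run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at_k(ranking, qrels, k):
    """Precision at k; a document missing from `qrels` counts as non-relevant."""
    return sum(1 for doc in ranking[:k] if qrels.get(doc, False)) / k


# Hypothetical ground truth, normally unknown to the evaluator.
truth = {"d1": True, "d2": True, "d3": False,
         "d4": True, "d5": True, "d6": False}

# Two runs contribute to the pool; `new_run` is evaluated afterwards.
runs = {"A": ["d1", "d3", "d2"], "B": ["d2", "d1", "d6"]}
new_run = ["d4", "d5", "d1"]

pool = build_pool(runs, depth=3)
qrels = {doc: truth[doc] for doc in pool}  # only pooled documents get judged

pooled_score = precision_at_k(new_run, qrels, 3)  # d4 and d5 are unjudged
true_score = precision_at_k(new_run, truth, 3)    # uses the full ground truth
```

In this toy setting the new run retrieves only relevant documents, yet two of them were never judged, so its pooled score falls well below its true score.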

We examine the magnitude of this effect and explore how it can be countered by automatically building an unbiased set of judgements from the original, biased judgements obtained through pooling. We compare the performance of this method with other approaches to the problem of incomplete judgements, such as bpref, and show that the proposed method leads to higher evaluation accuracy, especially if the set of manual judgements is rich in documents, but highly biased against some systems.
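For readers unfamiliar with bpref, the following is a minimal sketch of one commonly used formulation, in which unjudged documents are skipped and each retrieved relevant document is penalised by the number of judged non-relevant documents ranked above it. The normalisation by min(R, N) is an assumption of this sketch; published variants differ in that detail. The function name and the toy judgements are hypothetical.

```python
def bpref(ranking, qrels):
    """bpref for one topic.

    ranking: list of document IDs, best first.
    qrels:   dict mapping judged document IDs to True (relevant) / False.
    Documents absent from `qrels` are unjudged and are ignored.
    """
    num_rel = sum(1 for rel in qrels.values() if rel)         # R
    num_nonrel = sum(1 for rel in qrels.values() if not rel)  # N
    if num_rel == 0:
        return 0.0

    nonrel_seen = 0  # judged non-relevant documents seen so far
    total = 0.0
    for doc in ranking:
        if doc not in qrels:
            continue  # unjudged: neither rewarded nor penalised
        if qrels[doc]:
            if num_nonrel > 0:
                # Only the first R judged non-relevant documents count.
                penalty = min(nonrel_seen, num_rel) / min(num_rel, num_nonrel)
                total += 1.0 - penalty
            else:
                total += 1.0
        else:
            nonrel_seen += 1
    return total / num_rel


# Hypothetical judgements: two relevant, two non-relevant documents.
qrels = {"d1": True, "d2": True, "d3": False, "d6": False}

perfect = bpref(["d1", "d2", "d3"], qrels)
mixed = bpref(["d4", "d3", "d1", "d5", "d2"], qrels)  # d4, d5 are unjudged
```

Because unjudged documents are skipped rather than counted as non-relevant, the measure depends only on the relative order of judged documents.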


Index Terms

Reliable information retrieval evaluation with incomplete and biased judgements
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Information & Contributors

Information

Published In

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

July 2007

946 pages

ISBN:9781595935977

DOI:10.1145/1277741

General Chairs: Wessel Kraaij (TNO, The Netherlands), Arjen P. de Vries (CWI, The Netherlands)

Program Chairs: Charles L. A. Clarke (University of Waterloo, Canada), Norbert Fuhr (University of Duisburg-Essen, Germany), Noriko Kando (National Institute of Informatics, Japan)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR07

SIGIR07: The 30th Annual International SIGIR Conference

July 23 - 27, 2007

Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
