
Choices in batch information retrieval evaluation

Published: 05 December 2013 · DOI: 10.1145/2537734.2537745

Abstract

Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores.
Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users undertaking a total of six search tasks each, using two systems of markedly different quality.
We hope to encourage broader awareness of the many factors that go into an evaluation of search effectiveness and of the implications of these choices, and to encourage researchers to report all aspects of the evaluation process carefully when describing their system performance experiments.
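
To make the chain concrete, here is a minimal Python sketch of the three stages the abstract describes: relevance judgments, a metric that turns ranked runs into per-topic scores, and a statistical test used to argue that one system is "superior" to another. The metric (precision at depth k), the toy judgments and runs, and the choice of a paired t-test are all illustrative assumptions for exposition, not details taken from the paper.

```python
# A minimal, self-contained sketch (not taken from the paper) of the batch
# evaluation chain: judgments -> per-topic metric scores -> a paired
# significance test over two systems. All data and names are hypothetical.

from scipy.stats import ttest_rel

# Hypothetical relevance judgments ("qrels"): topic id -> relevant doc ids.
qrels = {
    "t1": {"d1", "d4"},
    "t2": {"d2", "d6"},
    "t3": {"d3", "d5", "d6"},
    "t4": {"d7"},
    "t5": {"d1", "d8"},
}

# Hypothetical ranked runs for two systems: topic id -> ranked list of docs.
run_a = {
    "t1": ["d1", "d4", "d9"], "t2": ["d2", "d6", "d7"],
    "t3": ["d3", "d5", "d8"], "t4": ["d7", "d2", "d1"],
    "t5": ["d1", "d9", "d8"],
}
run_b = {
    "t1": ["d9", "d2", "d1"], "t2": ["d8", "d2", "d9"],
    "t3": ["d8", "d9", "d3"], "t4": ["d4", "d5", "d7"],
    "t5": ["d9", "d5", "d1"],
}

def precision_at_k(ranking, relevant, k=3):
    """Turn one ranking plus its judgments into a numeric score."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def score_run(run, qrels, k=3):
    """Score every topic, in a fixed order so the scores can be paired."""
    return [precision_at_k(run[t], qrels[t], k) for t in sorted(qrels)]

scores_a = score_run(run_a, qrels)
scores_b = score_run(run_b, qrels)

# Infer "superiority" from the two sets of scores; a paired t-test is just
# one of several possible choices at this stage.
result = ttest_rel(scores_a, scores_b)
print(f"mean A = {sum(scores_a) / len(scores_a):.3f}, "
      f"mean B = {sum(scores_b) / len(scores_b):.3f}, "
      f"p = {result.pvalue:.3f}")
```

Each stage of this sketch can be varied independently: the judgments could come from a different pooling or crowdsourcing process, the metric could be replaced by a graded or user-model-based measure, and the t-test could be replaced by another inference method; that space of choices is what the paper explores.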

Cited By

  • (2016) System And User Centered Evaluation Approaches in Interactive Information Retrieval (SAUCE 2016). Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, pages 337--340. DOI: 10.1145/2854946.2886106
  • (2015) IR Evaluation. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1129--1132. DOI: 10.1145/2766462.2767875

    Published In

    ADCS '13: Proceedings of the 18th Australasian Document Computing Symposium
    December 2013
    126 pages
    ISBN: 9781450325240
    DOI: 10.1145/2537734

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. effectiveness
    2. evaluation
    3. information retrieval
    4. relevance judgment

    Qualifiers

    • Research-article

    Conference

    ADCS '13: The Australasian Document Computing Symposium
    December 5 - 6, 2013
    Brisbane, Queensland, Australia

    Acceptance Rates

    ADCS '13 Paper Acceptance Rate: 12 of 23 submissions (52%)
    Overall Acceptance Rate: 30 of 57 submissions (53%)
