When Measurement Misleads: The Limits of Batch Assessment of Retrieval Systems

Published: 27 January 2023

Abstract

The discipline of information retrieval (IR) has a long history of examination of how best to measure performance. In particular, there is an extensive literature on the practice of assessing retrieval systems using batch experiments based on collections and relevance judgements. However, this literature has only rarely considered an underlying principle: that measured scores are inherently incomplete as a representation of human activity, that is, there is an innate gap between measured scores and the desired goal of human satisfaction. There are separate challenges such as poor experimental practices or the shortcomings of specific measures, but the issue considered here is more fundamental - straightforwardly, in batch experiments the human-machine gap cannot be closed. In other disciplines, the issue of the gap is well recognised and has been the subject of observations that provide valuable perspectives on the behaviour and effects of measures and the ways in which they can lead to unintended consequences, notably Goodhart's law and the Lucas critique. Here I describe these observations and argue that there is evidence that they apply to IR, thus showing that blind pursuit of performance gains based on optimisation of scores, and analysis based solely on aggregated measurements, can lead to misleading and unreliable outcomes.
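As a concrete illustration of the kind of batch experiment the abstract describes, the sketch below (a minimal Python example; the qrels, run, and precision_at_k names and the toy data are assumptions for illustration only, not the paper's method) scores each topic against its relevance judgements and then reduces everything to a single aggregated mean, the sort of score whose optimisation the paper cautions against treating as a proxy for user satisfaction.

```python
# A minimal, hypothetical sketch of the batch ("Cranfield-style") evaluation the
# abstract refers to: a system's ranked results per topic are scored against
# relevance judgements, and the per-topic scores are averaged into a single
# number. All names and data below are illustrative, not taken from the paper.

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

# Relevance judgements (qrels): topic id -> set of judged-relevant documents.
qrels = {
    "T1": {"d3", "d7"},
    "T2": {"d1", "d2", "d4", "d9"},
}

# A system's run: topic id -> documents in ranked order.
run = {
    "T1": ["d3", "d8", "d7", "d5", "d6", "d0", "d1", "d2", "d9", "d4"],
    "T2": ["d5", "d6", "d0", "d8", "d3", "d7", "d1", "d2", "d4", "d9"],
}

per_topic = {t: precision_at_k(run[t], qrels[t]) for t in qrels}
mean_score = sum(per_topic.values()) / len(per_topic)

# The aggregated mean is the number that gets optimised and reported; it says
# nothing about per-topic variation, or about whether a user would actually be
# satisfied by these rankings, which is the gap the article is concerned with.
print(per_topic, round(mean_score, 3))
```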



Published In

ACM SIGIR Forum, Volume 56, Issue 1 (June 2022), 109 pages. ISSN 0163-5840. DOI: 10.1145/3582524.

Publisher

Association for Computing Machinery

New York, NY, United States


Cited By

  • (2024) Toward Evaluating the Reproducibility of Information Retrieval Systems with Simulated Users. Proceedings of the 2nd ACM Conference on Reproducibility and Replicability, 25-29. DOI: 10.1145/3641525.3663619. Online publication date: 18-Jun-2024.
  • (2024) Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines. Advances in Information Retrieval, 56-71. DOI: 10.1007/978-3-031-56063-7_4. Online publication date: 24-Mar-2024.
  • (2024) How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology, 75(6), 686-703. DOI: 10.1002/asi.24874. Online publication date: 15-Feb-2024.
  • (2023) Report on the Dagstuhl Seminar on Frontiers of Information Access Experimentation for Research and Education. ACM SIGIR Forum, 57(1), 1-28. DOI: 10.1145/3636341.3636351. Online publication date: 4-Dec-2023.
  • (2023) Simulating Users in Interactive Web Table Retrieval. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 3875-3879. DOI: 10.1145/3583780.3615187. Online publication date: 21-Oct-2023.
  • (2023) The Information Retrieval Experiment Platform. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2826-2836. DOI: 10.1145/3539618.3591888. Online publication date: 19-Jul-2023.
