When Measurement Misleads: The Limits of Batch Assessment of Retrieval Systems

Published: 27 January 2023

Abstract

The discipline of information retrieval (IR) has a long history of examination of how best to measure performance. In particular, there is an extensive literature on the practice of assessing retrieval systems using batch experiments based on collections and relevance judgements. However, this literature has only rarely considered an underlying principle: that measured scores are inherently incomplete as a representation of human activity, that is, there is an innate gap between measured scores and the desired goal of human satisfaction. There are separate challenges such as poor experimental practices or the shortcomings of specific measures, but the issue considered here is more fundamental - straightforwardly, in batch experiments the human-machine gap cannot be closed. In other disciplines, the issue of the gap is well recognised and has been the subject of observations that provide valuable perspectives on the behaviour and effects of measures and the ways in which they can lead to unintended consequences, notably Goodhart's law and the Lucas critique. Here I describe these observations and argue that there is evidence that they apply to IR, thus showing that blind pursuit of performance gains based on optimisation of scores, and analysis based solely on aggregated measurements, can lead to misleading and unreliable outcomes.
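As a concrete illustration of the kind of batch experiment the abstract describes, the sketch below (a minimal Python example; the qrels, run, and precision_at_k names and the toy data are assumptions for illustration only, not the paper's method) scores each topic against its relevance judgements and then reduces everything to a single aggregated mean, the sort of score whose optimisation the paper cautions against treating as a proxy for user satisfaction.

```python
# A minimal, hypothetical sketch of the batch ("Cranfield-style") evaluation the
# abstract refers to: a system's ranked results per topic are scored against
# relevance judgements, and the per-topic scores are averaged into a single
# number. All names and data below are illustrative, not taken from the paper.

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

# Relevance judgements (qrels): topic id -> set of judged-relevant documents.
qrels = {
    "T1": {"d3", "d7"},
    "T2": {"d1", "d2", "d4", "d9"},
}

# A system's run: topic id -> documents in ranked order.
run = {
    "T1": ["d3", "d8", "d7", "d5", "d6", "d0", "d1", "d2", "d9", "d4"],
    "T2": ["d5", "d6", "d0", "d8", "d3", "d7", "d1", "d2", "d4", "d9"],
}

per_topic = {t: precision_at_k(run[t], qrels[t]) for t in qrels}
mean_score = sum(per_topic.values()) / len(per_topic)

# The aggregated mean is the number that gets optimised and reported; it says
# nothing about per-topic variation, or about whether a user would actually be
# satisfied by these rankings, which is the gap the article is concerned with.
print(per_topic, round(mean_score, 3))
```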



Published In

ACM SIGIR Forum, Volume 56, Issue 1 (June 2022), 109 pages. ISSN 0163-5840. DOI: 10.1145/3582524.

Publisher

Association for Computing Machinery

New York, NY, United States


Cited By

  • (2024) Toward Evaluating the Reproducibility of Information Retrieval Systems with Simulated Users. Proceedings of the 2nd ACM Conference on Reproducibility and Replicability, 25-29. DOI: 10.1145/3641525.3663619. Online publication date: 18-Jun-2024.
  • (2024) Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines. Advances in Information Retrieval, 56-71. DOI: 10.1007/978-3-031-56063-7_4. Online publication date: 24-Mar-2024.
  • (2024) How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology, 75(6), 686-703. DOI: 10.1002/asi.24874. Online publication date: 15-Feb-2024.
  • (2023) Report on the Dagstuhl Seminar on Frontiers of Information Access Experimentation for Research and Education. ACM SIGIR Forum, 57(1), 1-28. DOI: 10.1145/3636341.3636351. Online publication date: 4-Dec-2023.
  • (2023) Simulating Users in Interactive Web Table Retrieval. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 3875-3879. DOI: 10.1145/3583780.3615187. Online publication date: 21-Oct-2023.
  • (2023) The Information Retrieval Experiment Platform. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2826-2836. DOI: 10.1145/3539618.3591888. Online publication date: 19-Jul-2023.
