DOI: 10.1145/2766462.2767875

IR Evaluation: Designing an End-to-End Offline Evaluation Pipeline

Published: 09 August 2015

Abstract

This tutorial aims to provide attendees with a detailed understanding of an end-to-end evaluation pipeline based on human judgments (offline measurement). The tutorial will give an overview of the state-of-the-art methods, techniques, and metrics required at each stage of the evaluation process. We will focus primarily on evaluating an information retrieval (search) system, but other tasks such as recommendation and classification will also be discussed. Practical examples will be drawn both from the literature and from real-world usage scenarios in industry.
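
To make the kind of offline metric the tutorial covers concrete, below is a minimal, illustrative sketch (not taken from the tutorial itself) that computes nDCG@k from graded human judgments. The judgment grades and cutoff are hypothetical, and the linear-gain form of DCG is assumed.

import math

def dcg(gains):
    # Discounted cumulative gain with linear gains:
    # sum of gain_i / log2(i + 1), where ranks i start at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(gains, k):
    # nDCG@k: DCG of the observed ranking divided by the DCG of the
    # ideal (descending-by-grade) ordering of the same judgments.
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments (0 = not relevant ... 3 = highly relevant)
# for the top five results returned for a single query.
print(round(ndcg_at_k([3, 2, 0, 1, 2], k=5), 3))   # ~0.96

Exponential gains (2^grade - 1), common in web-search evaluation, would only change the gain values fed to dcg; the discounting and normalization stay the same.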

    Published In

    SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2015
    1198 pages
    ISBN:9781450336215
    DOI:10.1145/2766462
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. crowdsourcing
    2. evaluation
    3. experiment design
    4. human judges
    5. measures

    Qualifiers

    • Tutorial

    Conference

    SIGIR '15

    Acceptance Rates

    SIGIR '15 Paper Acceptance Rate 70 of 351 submissions, 20%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%
