DOI: 10.1145/2766462.2767875

IR Evaluation: Designing an End-to-End Offline Evaluation Pipeline

Published: 09 August 2015


This tutorial aims to give attendees a detailed understanding of an end-to-end evaluation pipeline based on human judgments (offline measurement). It gives an overview of the state-of-the-art methods, techniques, and metrics needed at each stage of the evaluation process. We focus mostly on evaluating an information retrieval (search) system, but other tasks, such as recommendation and classification, are also discussed. Practical examples are drawn both from the literature and from real-world usage scenarios in industry.



    Published In

    SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2015
    1198 pages
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.



    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. crowdsourcing
    2. evaluation
    3. experiment design
    4. human judges
    5. measures


    • Tutorial


    SIGIR '15

    Acceptance Rates

    SIGIR '15 Paper Acceptance Rate 70 of 351 submissions, 20%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Article Metrics

    • Total Citations: 0
    • Total Downloads: 231
    • Downloads (Last 12 months): 7
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 28 Feb 2025

