
Choices in batch information retrieval evaluation

Published: 05 December 2013 · DOI: 10.1145/2537734.2537745

Abstract

Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores.
Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users undertaking a total of six search tasks each, using two systems of markedly different quality.
We hope to encourage broader awareness of the many factors that go into an evaluation of search effectiveness and of the implications of these choices, and to encourage researchers to report all aspects of the evaluation process carefully when describing their system performance experiments.
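
To make the chain concrete, here is a minimal Python sketch of the three stages the abstract describes: relevance judgments, a metric that turns ranked runs into per-topic scores, and a statistical test used to argue that one system is "superior" to another. The metric (precision at depth k), the toy judgments and runs, and the choice of a paired t-test are all illustrative assumptions for exposition, not details taken from the paper.

```python
# A minimal, self-contained sketch (not taken from the paper) of the batch
# evaluation chain: judgments -> per-topic metric scores -> a paired
# significance test over two systems. All data and names are hypothetical.

from scipy.stats import ttest_rel

# Hypothetical relevance judgments ("qrels"): topic id -> relevant doc ids.
qrels = {
    "t1": {"d1", "d4"},
    "t2": {"d2", "d6"},
    "t3": {"d3", "d5", "d6"},
    "t4": {"d7"},
    "t5": {"d1", "d8"},
}

# Hypothetical ranked runs for two systems: topic id -> ranked list of docs.
run_a = {
    "t1": ["d1", "d4", "d9"], "t2": ["d2", "d6", "d7"],
    "t3": ["d3", "d5", "d8"], "t4": ["d7", "d2", "d1"],
    "t5": ["d1", "d9", "d8"],
}
run_b = {
    "t1": ["d9", "d2", "d1"], "t2": ["d8", "d2", "d9"],
    "t3": ["d8", "d9", "d3"], "t4": ["d4", "d5", "d7"],
    "t5": ["d9", "d5", "d1"],
}

def precision_at_k(ranking, relevant, k=3):
    """Turn one ranking plus its judgments into a numeric score."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def score_run(run, qrels, k=3):
    """Score every topic, in a fixed order so the scores can be paired."""
    return [precision_at_k(run[t], qrels[t], k) for t in sorted(qrels)]

scores_a = score_run(run_a, qrels)
scores_b = score_run(run_b, qrels)

# Infer "superiority" from the two sets of scores; a paired t-test is just
# one of several possible choices at this stage.
result = ttest_rel(scores_a, scores_b)
print(f"mean A = {sum(scores_a) / len(scores_a):.3f}, "
      f"mean B = {sum(scores_b) / len(scores_b):.3f}, "
      f"p = {result.pvalue:.3f}")
```

Each stage of this sketch can be varied independently: the judgments could come from a different pooling or crowdsourcing process, the metric could be replaced by a graded or user-model-based measure, and the t-test could be replaced by another inference method; that space of choices is what the paper explores.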

Cited By

  • (2016) System And User Centered Evaluation Approaches in Interactive Information Retrieval (SAUCE 2016). Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, pages 337--340. DOI: 10.1145/2854946.2886106
  • (2015) IR Evaluation. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1129--1132. DOI: 10.1145/2766462.2767875

    Published In

    ADCS '13: Proceedings of the 18th Australasian Document Computing Symposium
    December 2013
    126 pages
    ISBN: 9781450325240
    DOI: 10.1145/2537734

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. effectiveness
    2. evaluation
    3. information retrieval
    4. relevance judgment

    Qualifiers

    • Research-article

    Conference

    ADCS '13: The Australasian Document Computing Symposium
    December 5 - 6, 2013
    Brisbane, Queensland, Australia

    Acceptance Rates

    ADCS '13 Paper Acceptance Rate: 12 of 23 submissions (52%)
    Overall Acceptance Rate: 30 of 57 submissions (53%)
