Reproduce and Improve: An Evolutionary Approach to Select a Few Good Topics for Information Retrieval Evaluation

Published: 29 September 2018

Abstract

Effectiveness evaluation of information retrieval systems by means of a test collection is a widely used methodology. However, it is rather expensive in terms of resources, time, and money; therefore, many researchers have proposed methods for a cheaper evaluation. One particular approach, on which we focus in this article, is to use fewer topics: in TREC-like initiatives, system effectiveness is usually evaluated as the average effectiveness on a set of n topics (usually n=50, although more than 1,000 have also been adopted); instead of using the full set, it has been proposed to find small subsets of a few good topics that evaluate the systems as similarly as possible to the full set. The computational complexity of the task has so far limited the analyses that could be performed. We develop a novel and efficient approach based on a multi-objective evolutionary algorithm. The higher efficiency of our new implementation allows us to reproduce some notable results on topic set reduction, as well as to perform new experiments that generalize and improve them. We show that our approach both reproduces the main state-of-the-art results and lets us analyze the effect of the collection, metric, and pool depth used for the evaluation. Finally, unlike previous studies, which have been mainly theoretical, we also discuss some practical topic selection strategies, integrating results of automatic evaluation approaches.
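To make the approach concrete, here is a minimal sketch of the topic-subset search idea. It is not the article's implementation: the article relies on a multi-objective evolutionary algorithm, whereas this toy version runs a single-objective (mu + lambda) loop over hypothetical random data, scoring each candidate subset of c topics by the Kendall's tau correlation between the system ranking it induces and the ranking induced by the full topic set. All identifiers and dimensions (scores, n_systems, n_topics, c, population and generation sizes) are illustrative assumptions, not values from the article.

# Minimal, illustrative sketch (not the article's implementation): evolve a
# subset of c topics whose per-system mean scores rank systems as similarly
# as possible to the full topic set. Data here are hypothetical random values.
import random

import numpy as np
from scipy.stats import kendalltau

rng = random.Random(42)
n_systems, n_topics, c = 100, 50, 10                    # toy dimensions (assumptions)
scores = np.random.default_rng(0).random((n_systems, n_topics))  # stand-in for per-topic AP
full_means = scores.mean(axis=1)                        # system ranking on the full set

def fitness(subset):
    """Kendall's tau between the subset-based and full-set system rankings."""
    sub_means = scores[:, sorted(subset)].mean(axis=1)
    tau, _ = kendalltau(sub_means, full_means)
    return tau

def mutate(subset):
    """Swap one topic inside the subset for one currently outside it."""
    inside = sorted(subset)
    outside = [t for t in range(n_topics) if t not in subset]
    inside[rng.randrange(len(inside))] = rng.choice(outside)
    return frozenset(inside)

# Simple (mu + lambda) evolutionary loop; the article works with a
# multi-objective algorithm, which is omitted here for brevity.
population = [frozenset(rng.sample(range(n_topics), c)) for _ in range(30)]
for _ in range(200):
    offspring = [mutate(p) for p in population]
    population = sorted(set(population) | set(offspring), key=fitness, reverse=True)[:30]

best = population[0]
print("best subset:", sorted(best), "tau:", round(fitness(best), 3))

In a realistic use, scores would contain per-topic effectiveness values (e.g., average precision) of actual TREC runs, and the search would be repeated for several subset cardinalities so that the correlation achievable with few topics can be compared against the full 50-topic set.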

Published In

Journal of Data and Information Quality, Volume 10, Issue 3
Special Issue on Reproducibility in IR: Evaluation Campaigns, Collections and Analyses
September 2018
94 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3282439
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 September 2018
Accepted: 01 July 2018
Revised: 01 April 2018
Received: 01 October 2017
Published in JDIQ Volume 10, Issue 3


Author Tags

  1. Test collection
  2. evolutionary algorithms
  3. few topics
  4. reproducibility
  5. topic selection strategy
  6. topic sets

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 25 Nov 2024

Cited By

  • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems 42(1), 1-26. DOI: 10.1145/3597201. Online publication date: 22-May-2023.
  • (2021) Estimation of Fair Ranking Metrics with Incomplete Judgments. Proceedings of the Web Conference 2021, 1065-1075. DOI: 10.1145/3442381.3450080. Online publication date: 19-Apr-2021.
  • (2020) Effectiveness evaluation without human relevance judgments. Information Processing and Management: an International Journal 57(2). DOI: 10.1016/j.ipm.2019.102149. Online publication date: 1-Mar-2020.
  • (2020) Fewer topics? A million topics? Both?! On topics subsets in test collections. Information Retrieval 23(1), 49-85. DOI: 10.1007/s10791-019-09357-w. Online publication date: 1-Feb-2020.
  • (2019) Towards Stochastic Simulations of Relevance Profiles. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2217-2220. DOI: 10.1145/3357384.3358123. Online publication date: 3-Nov-2019.
