
Examining Additivity and Weak Baselines

Published: 09 June 2016

Abstract

We present a study of which baseline to use when testing a new retrieval technique. In contrast to past work, we show that measuring a statistically significant improvement over a weak baseline is not a good predictor of whether a similar improvement will be measured on a strong baseline. Sometimes strong baselines are made worse when a new technique is applied. We investigate whether conducting comparisons against a range of weaker baselines can increase confidence that an observed effect will also show improvements on a stronger baseline. Our results indicate that this is not the case: at best, testing against a range of baselines means that an experimenter can be more confident that the new technique is unlikely to significantly harm a strong baseline. Examining recent past work, we present evidence that the information retrieval (IR) community continues to test against weak baselines. This is unfortunate as, in light of our experiments, we conclude that the only way to be confident that a new technique is a contribution is to compare it against nothing less than the state of the art.
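The kind of comparison the abstract describes can be sketched as a paired significance test over per-topic effectiveness scores. The sketch below is ours, not the paper's protocol: the per-topic MAP scores are invented, the `paired_t` helper is a hypothetical illustration, and only the Python standard library is used.

```python
import math

def paired_t(a, b):
    """Paired t statistic over per-topic score differences a[i] - b[i]."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1          # t statistic, dof

# Hypothetical per-topic MAP scores (illustrative only, not from the paper).
weak       = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.13, 0.10]  # weak baseline
weak_new   = [0.14, 0.16, 0.11, 0.19, 0.15, 0.12, 0.17, 0.13]  # + new technique
strong     = [0.30, 0.28, 0.25, 0.33, 0.29, 0.27, 0.31, 0.26]  # strong baseline
strong_new = [0.31, 0.27, 0.26, 0.32, 0.30, 0.26, 0.32, 0.25]  # + new technique

t_weak, dof = paired_t(weak_new, weak)      # large t: significant gain
t_strong, _ = paired_t(strong_new, strong)  # t near zero: gain vanishes
print(f"weak: t = {t_weak:.2f}, strong: t = {t_strong:.2f} (dof = {dof})")
```

With 7 degrees of freedom the two-tailed 5% critical value is about 2.36, so in this invented example the same additive technique tests as a significant win over the weak baseline while leaving the strong baseline statistically unchanged, which is exactly the non-additivity the abstract warns about.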




    Published In

    ACM Transactions on Information Systems, Volume 34, Issue 4
    September 2016
    217 pages
    ISSN: 1046-8188
    EISSN: 1558-2868
    DOI: 10.1145/2954381

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 June 2016
    Accepted: 01 January 2016
    Revised: 01 November 2015
    Received: 01 July 2015
    Published in TOIS Volume 34, Issue 4


    Author Tags

    1. baselines
    2. evaluation
    3. information retrieval

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Australian Research Council's Discovery Projects scheme
    • NICTA


    Cited By

    • (2024) What Happened in CLEF… For Another While? Experimental IR Meets Multilinguality, Multimodality, and Interaction, 3-57. DOI: 10.1007/978-3-031-71736-9_1. Online publication date: 14-Sep-2024.
    • (2023) DECAF: A Modular and Extensible Conversational Search Framework. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3075-3085. DOI: 10.1145/3539618.3591913. Online publication date: 19-Jul-2023.
    • (2023) A Geometric Framework for Query Performance Prediction in Conversational Search. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1355-1365. DOI: 10.1145/3539618.3591625. Online publication date: 19-Jul-2023.
    • (2022) A bias–variance evaluation framework for information retrieval systems. Information Processing and Management 59(1). DOI: 10.1016/j.ipm.2021.102747. Online publication date: 1-Jan-2022.
    • (2021) A Comparison between Term-Independence Retrieval Models for Ad Hoc Retrieval. ACM Transactions on Information Systems 40(3), 1-37. DOI: 10.1145/3483612. Online publication date: 8-Dec-2021.
    • (2021) Component-based Analysis of Dynamic Search Performance. ACM Transactions on Information Systems 40(3), 1-47. DOI: 10.1145/3483237. Online publication date: 22-Nov-2021.
    • (2021) The Simplest Thing That Can Possibly Work: (Pseudo-)Relevance Feedback via Text Classification. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 123-129. DOI: 10.1145/3471158.3472261. Online publication date: 11-Jul-2021.
    • (2020) Examining the Additivity of Top-k Query Processing Innovations. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 1085-1094. DOI: 10.1145/3340531.3412000. Online publication date: 19-Oct-2020.
    • (2020) Supervised approaches for explicit search result diversification. Information Processing & Management 57(6). DOI: 10.1016/j.ipm.2020.102356. Online publication date: Nov-2020.
    • (2020) Predicting the Size of Candidate Document Set for Implicit Web Search Result Diversification. Advances in Information Retrieval, 410-417. DOI: 10.1007/978-3-030-45442-5_51. Online publication date: 14-Apr-2020.
