DOI: 10.1145/3511808.3557714
Short paper · Open access

TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

Published: 17 October 2022

Abstract

Robust test collections are crucial for Information Retrieval research. Recently, there has been growing interest in evaluating retrieval systems on domain-specific retrieval tasks; however, these tasks often lack a reliable test collection with human-annotated relevance assessments following the Cranfield paradigm. In the medical domain, the TripClick collection was recently proposed; it contains click log data from the Trip search engine and includes two click-based test sets. However, the clicks are biased towards the retrieval model used, which remains unknown, and a previous study shows that the test sets have low judgement coverage for the Top-10 results of lexical and neural retrieval models. In this paper we present TripJudge, a novel relevance judgement test collection for TripClick health retrieval. We collect relevance judgements in an annotation campaign and ensure the quality and reusability of TripJudge through a variety of ranking methods for pool creation, multiple judgements per query-document pair and an at least moderate inter-annotator agreement. Comparing system evaluation with TripJudge and TripClick, we find that click-based and judgement-based evaluation can lead to substantially different system rankings.
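
The abstract leans on two standard measures: inter-annotator agreement (Cohen's kappa, where 0.41-0.60 counts as "moderate") for judgement quality, and rank correlation (Kendall's tau) for comparing how two test sets order the same systems. The sketch below, with invented annotator labels, system names and scores (none of it the authors' code or data), shows how both are typically computed in Python:

# Minimal sketch with hypothetical data: Cohen's kappa for inter-annotator
# agreement and Kendall's tau for comparing system rankings.
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels from two annotators for the same ten
# query-document pairs (1 = relevant, 0 = not relevant).
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0.58 here, i.e. moderate agreement

# Hypothetical effectiveness scores for four systems under click-based and
# judgement-based evaluation; only the induced ordering matters for tau.
systems = ["BM25", "dense_retriever", "cross_encoder", "hybrid"]
clicks = [0.31, 0.46, 0.45, 0.40]  # e.g. MRR@10 against click labels
judged = [0.35, 0.38, 0.49, 0.41]  # e.g. nDCG@10 against human judgements
order_clicks = [s for _, s in sorted(zip(clicks, systems), reverse=True)]
order_judged = [s for _, s in sorted(zip(judged, systems), reverse=True)]
print("Click-based ranking:    ", order_clicks)
print("Judgement-based ranking:", order_judged)
tau, _ = kendalltau(clicks, judged)
print(f"Kendall's tau between the rankings: {tau:.2f}")  # 0.33 here

A tau well below 1.0 over a larger pool of systems is exactly the kind of divergence between click-based and judgement-based system rankings that the paper reports.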

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. health retrieval
2. relevance judgements
3. test collections

Qualifiers

• Short-paper

Funding Sources

• EU Horizon 2020 ITN/ETN

Conference

CIKM '22

Acceptance Rates

CIKM '22 paper acceptance rate: 621 of 2,257 submissions, 28%
Overall acceptance rate: 1,861 of 8,427 submissions, 22%

Article Metrics

• Downloads (Last 12 months): 124
• Downloads (Last 6 weeks): 32
Reflects downloads up to 09 Nov 2024

Cited By

• (2024) Resources for Combining Teaching and Research in Information Retrieval Coursework. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1115-1125. https://doi.org/10.1145/3626772.3657886. Online publication date: 10-Jul-2024.
• (2024) Validating Synthetic Usage Data in Living Lab Environments. Journal of Data and Information Quality 16(1), 1-33. https://doi.org/10.1145/3623640. Online publication date: 6-Mar-2024.
• (2024) Enriching Simple Keyword Queries for Domain-Aware Narrative Retrieval. Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries, 143-154. https://doi.org/10.1109/JCDL57899.2023.00029. Online publication date: 26-Jun-2024.
• (2023) Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 139-149. https://doi.org/10.1145/3624918.3625333. Online publication date: 26-Nov-2023.
• (2023) LongEval-Retrieval: French-English Dynamic Test Collection for Continuous Web Search Evaluation. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3086-3094. https://doi.org/10.1145/3539618.3591921. Online publication date: 19-Jul-2023.
