Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3477495.3531657acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
short-paper

A Common Framework for Exploring Document-at-a-Time and Score-at-a-Time Retrieval Methods

Published: 07 July 2022 Publication History

Abstract

Document-at-a-time (DaaT) and score-at-a-time (SaaT) query evaluation techniques are different approaches to top-k retrieval with inverted indexes. While modern systems are dominated by DaaT, the academic literature has seen decades of debate about the merits of each. Recently, there has been renewed interest in SaaT methods for learned sparse lexical models, where studies have shown that transformers generate "wacky weights" that appear to reduce opportunities for optimizations in DaaT methods. However, researchers currently lack an easy-to-use SaaT system to support further exploration. This is the gap that our work fills. Starting with a modern SaaT system (JASS), we built Python bindings in order to integrate into the DaaT Pyserini IR toolkit (Lucene). The result is a common frontend to both a DaaT and a SaaT system. We demonstrate how recent experiments with a wide range of learned sparse lexical models can be easily reproduced. Our contribution is a framework that enables future research comparing DaaT and SaaT methods in the context of modern neural retrieval models.

References

[1]
Vo Ngoc Anh, Owen de Kretser, and Alistair Moffat. 2001. Vector-Space Ranking with Effective Early Termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001). 35--42.
[2]
Nima Asadi and Jimmy Lin. 2013. Effectiveness/Efficiency Tradeoffs for Candidate Generation in Multi-Stage Retrieval Architectures. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013). Dublin, Ireland, 997--1000.
[3]
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018).
[4]
David M. Beazley. 2003. Automated Scientific Software Scripting With SWIG. Future Generation Computer Systems, Vol. 19, 5 (2003), 599--609.
[5]
Matt Crane, J. Shane Culpepper, Jimmy Lin, Joel Mackenzie, and Andrew Trotman. 2017. A Comparison of Document-at-a-Time and Score-at-a-Time Query Evaluation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM 2017). 201--210.
[6]
Josh Devins, Julie Tibshirani, and Jimmy Lin. 2022. Aligning the Research and Practice of Building Search Applications: Elasticsearch and Pyserini. In Proceedings of the 15th ACM International Conference on Web Search and Data Mining (WSDM 2022). 1573--1576.
[7]
Marcus Fontoura, Vanja Josifovski, Jinhui Liu, Srihari Venkatesan, Xiangfei Zhu, and Jason Zien. 2011. Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indexes. Proc. VLDB Endow., Vol. 4, 12 (2011), 1213--1224.
[8]
Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv:2109.10086 (2021).
[9]
Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online, 3030--3042.
[10]
Adrien Grand, Robert Muir, Jim Ferenczi, and Jimmy Lin. 2020. From Max­Score to Block-Max WAND: The Story of How Lucene Significantly Improved Query Evaluation Performance. In Proceedings of the 42nd European Conference on Information Retrieval, Part II (ECIR 2020). 20--27.
[11]
Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). 113--122.
[12]
Xiang-Fei Jia, Andrew Trotman, and Richard O'Keefe. 2010. Efficient accumulator initialisation. In Proceedings of the 15th Australasian Document Computing Symposium. 44--51.
[13]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, Vol. 7, 3 (2021), 535--547.
[14]
Vladimir Karpukhin, Barlas Ouguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online, 6769--6781.
[15]
Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807 (2021).
[16]
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021 a. Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). 2356--2362.
[17]
Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, and Arjen de Vries. 2020. Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020). 2149--2152.
[18]
Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2021 b. Pretrained Transformers for Text Ranking: BERT and Beyond. Morgan & Claypool Publishers.
[19]
Jimmy Lin and Andrew Trotman. 2015. Anytime Ranking for Impact-Ordered Indexes. In Proceedings of the ACM International Conference on the Theory of Information Retrieval (ICTIR 2015). 301--304.
[20]
Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021 c. In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021). 163--173.
[21]
Xueguang Ma, Kai Sun, Ronak Pradeep, Minghan Li, and Jimmy Lin. 2022. Comparing Score Aggregation Approaches for Document Retrieval with Pretrained Transformers. In Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022), Part I. Stavanger, Norway, 613--626.
[22]
Joel Mackenzie, J. Shane Culpepper, Roi Blanco, Matt Crane, Charles L. A. Clarke, and Jimmy Lin. 2018. Query driven algorithm selection in early stage retrieval. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM 2018). 396--404.
[23]
Joel Mackenzie, Falk Scholer, and J. Shane Culpepper. 2017. Early Termination Heuristics for Score-at-a-Time Index Traversal. In Proceedings of the 22nd Australasian Document Computing Symposium (ADCS 2017). 8.1-8.8.
[24]
Joel Mackenzie, Andrew Trotman, and Jimmy Lin. 2021. Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation. arXiv:2110.11540 (2021).
[25]
Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learning Passage Impacts for Inverted Indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). 1723--1727.
[26]
Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant Indexes and Search for Academia. In Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019): CEUR Workshop Proceedings Vol-2409. 50--56.
[27]
Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery.
[28]
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375 (2019).
[29]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, Vol. 21, 140 (2020), 1--67.
[30]
Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, Vol. 3, 4 (2009), 333--389.
[31]
Andrew Trotman and Matt Crane. 2019. Micro- and Macro-optimizations of textscSaaT Search. Software: Practice and Experience, Vol. 49, 5 (2019), 942--950.
[32]
Andrew Trotman and Kat Lilly. 2018. Elias Revisited: Group Elias SIMD Coding. In Proceedings of the 23rd Australasian Document Computing Symposium (ADCS 2018). Article 4, 8 pages.
[33]
Howard R. Turtle and James Flood. 1995. Query Evaluation: Strategies and Optimizations. Information Processing & Management, Vol. 31, 6 (1995), 831--850.
[34]
Shuai Wang, Shengyao Zhuang, and Guido Zuccon. 2021. BERT-Based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). 317--324.
[35]
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021).
[36]
Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017). Tokyo, Japan, 1253--1256.
[37]
Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible Ranking Baselines Using Lucene. Journal of Data and Information Quality, Vol. 10, 4 (2018), Article 16.
[38]
Shengyao Zhuang and Guido Zuccon. 2021 a. Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion. arXiv:2108.08513 (2021).
[39]
Shengyao Zhuang and Guido Zuccon. 2021 b. TILDE: Term Independent Likelihood moDEl for Passage Re-ranking. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). 1483--1492.

Index Terms

  1. A Common Framework for Exploring Document-at-a-Time and Score-at-a-Time Retrieval Methods

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN:9781450387323
    DOI:10.1145/3477495
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 July 2022

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. anserini
    2. efficiency
    3. jass
    4. procrastination
    5. python

    Qualifiers

    • Short-paper

    Funding Sources

    • Natural Sciences and Engineering Research Council (NSERC) of Canada
    • Australian Research Council

    Conference

    SIGIR '22
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 126
      Total Downloads
    • Downloads (Last 12 months)23
    • Downloads (Last 6 weeks)2
    Reflects downloads up to 16 Nov 2024

    Other Metrics

    Citations

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media