short-paper

A Common Framework for Exploring Document-at-a-Time and Score-at-a-Time Retrieval Methods

Authors: Andrew Trotman, Joel Mackenzie, Pradeesh Parameswaran, Jimmy Lin

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 3229 - 3234

https://doi.org/10.1145/3477495.3531657

Published: 07 July 2022

Abstract

Document-at-a-time (DaaT) and score-at-a-time (SaaT) query evaluation techniques are different approaches to top-k retrieval with inverted indexes. While modern systems are dominated by DaaT, the academic literature has seen decades of debate about the merits of each. Recently, there has been renewed interest in SaaT methods for learned sparse lexical models, where studies have shown that transformers generate "wacky weights" that appear to reduce opportunities for optimizations in DaaT methods. However, researchers currently lack an easy-to-use SaaT system to support further exploration. This is the gap that our work fills. Starting with a modern SaaT system (JASS), we built Python bindings in order to integrate into the DaaT Pyserini IR toolkit (Lucene). The result is a common frontend to both a DaaT and a SaaT system. We demonstrate how recent experiments with a wide range of learned sparse lexical models can be easily reproduced. Our contribution is a framework that enables future research comparing DaaT and SaaT methods in the context of modern neural retrieval models.
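
To make the distinction between the two traversal strategies concrete, the following is a minimal sketch in plain Python. It is not taken from JASS, Pyserini, or Lucene. The toy index, the function names `daat` and `saat`, and the `max_postings` budget are all assumptions made for illustration, and the impact values are invented.

```python
import heapq
from collections import defaultdict

# Hypothetical toy index: term -> list of (doc_id, impact). Values are made up.
INDEX = {
    "cat": [(1, 3), (2, 1), (4, 2)],
    "hat": [(2, 5), (3, 1), (4, 2)],
}


def daat(index, query, k):
    """Document-at-a-time: postings are in doc_id order, and each document
    is fully scored before the traversal moves on to the next one."""
    lists = [sorted(index.get(term, [])) for term in query]
    cursors = [0] * len(lists)
    heap = []  # min-heap of (score, doc_id) holding the current top-k
    while True:
        pending = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not pending:
            break
        doc = min(pending)  # smallest unscored doc_id across all lists
        score = 0
        for i, lst in enumerate(lists):
            c = cursors[i]
            if c < len(lst) and lst[c][0] == doc:
                score += lst[c][1]
                cursors[i] += 1
        heapq.heappush(heap, (score, doc))
        if len(heap) > k:
            heapq.heappop(heap)  # drop the lowest-scoring candidate
    return sorted(((d, s) for s, d in heap), key=lambda x: (-x[1], x[0]))


def saat(index, query, k, max_postings=None):
    """Score-at-a-time: postings are processed in decreasing impact order
    into per-document accumulators; an optional budget stops early."""
    postings = [(impact, doc) for term in query
                for doc, impact in index.get(term, [])]
    postings.sort(key=lambda p: (-p[0], p[1]))  # highest impact first
    if max_postings is not None:
        postings = postings[:max_postings]  # assumed early-termination budget
    accumulators = defaultdict(int)
    for impact, doc in postings:
        accumulators[doc] += impact
    return heapq.nsmallest(k, accumulators.items(),
                           key=lambda x: (-x[1], x[0]))


exact_daat = daat(INDEX, ["cat", "hat"], k=2)
exact_saat = saat(INDEX, ["cat", "hat"], k=2)
approx_saat = saat(INDEX, ["cat", "hat"], k=2, max_postings=2)
```

On this toy index both exhaustive traversals return the same top-2 list, while the budgeted SaaT run returns an approximate ranking built from only the two highest-impact postings. This sketch omits the dynamic pruning and index compression that real DaaT and SaaT systems rely on.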

Index Terms

A Common Framework for Exploring Document-at-a-Time and Score-at-a-Time Retrieval Methods
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability

Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2022

3569 pages

ISBN:9781450387323

DOI:10.1145/3477495

General Chairs: Enrique Amigo (UNED), Pablo Castells (UAM and Amazon), Julio Gonzalo (UNED)

Program Chairs: Ben Carterette (Spotify), J. Shane Culpepper (RMIT University), Gabriella Kazai (Waseda University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Natural Sciences and Engineering Research Council (NSERC) of Canada
Australian Research Council

Conference

SIGIR '22

Sponsor:

SIGIR

SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2022

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
