Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies

Puxuan Yu,
Antonio Mallia &
Matthias Petri

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14610)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

We explore corpus-specific vocabularies as a way to improve both the efficiency and the effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes that are incorporated into the document expansion process, improves retrieval quality by up to 12% while, in some scenarios, decreasing latency by up to 50%. Our experiments show that adopting a corpus-specific vocabulary and increasing the vocabulary size decreases the average postings list length, which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and the sparsification objectives of sparse models. Both the effectiveness and the efficiency improvements transfer to different retrieval approaches, such as uniCOIL and SPLADE, and offer a simple yet effective way to obtain new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
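The link between vocabulary granularity and postings list length can be illustrated with a toy inverted index. The sketch below is not the authors' pipeline: the four documents, the two vocabularies (`GENERIC_VOCAB`, `CORPUS_VOCAB`), and the greedy longest-match tokenizer are all hypothetical stand-ins chosen for illustration, and no learned term weighting or document expansion is modelled. It only shows why whole-word, corpus-fitted terms tend to match fewer documents each than short generic subword pieces.

```python
from collections import defaultdict

def tokenize(word, vocab):
    """Greedy longest-match split of a word into vocabulary pieces.

    A simplified, WordPiece-like scheme (no '##' continuation markers).
    Characters not covered by any piece fall back to single characters.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing matched: emit one character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

def build_index(docs, vocab):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            for piece in tokenize(word, vocab):
                index[piece].add(doc_id)
    return index

def avg_postings_length(index):
    """Average number of documents per term in the index."""
    return sum(len(postings) for postings in index.values()) / len(index)

# Hypothetical toy corpus.
DOCS = ["sparse retrieval", "sparse index", "dense retrieval", "dense index"]

# Hypothetical generic vocabulary: short pieces shared across many words.
GENERIC_VOCAB = {"s", "p", "a", "r", "e", "t", "i", "v", "l",
                 "d", "n", "x", "se", "re", "in"}

# Hypothetical corpus-specific vocabulary: whole words from the corpus.
CORPUS_VOCAB = {"sparse", "retrieval", "dense", "index"}

avg_generic = avg_postings_length(build_index(DOCS, GENERIC_VOCAB))
avg_corpus = avg_postings_length(build_index(DOCS, CORPUS_VOCAB))

print(f"generic vocabulary: {avg_generic:.2f} documents per term")
print(f"corpus vocabulary:  {avg_corpus:.2f} documents per term")
```

In this toy setting the corpus-specific vocabulary yields shorter postings lists on average, so a query touching a fixed number of terms would traverse fewer postings. Real learned sparse models add expansion terms and learned weights, so the size of the effect on an actual collection has to be measured rather than inferred from a sketch like this.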

A. Mallia—Work partly done while working at Amazon Alexa.

Author information

Authors and Affiliations

University of Massachusetts Amherst, Amherst, USA
Puxuan Yu
Pinecone, New York, USA
Antonio Mallia
Amazon AGI, Seattle, USA
Matthias Petri

Corresponding author

Correspondence to Antonio Mallia.

Editor information

Editors and Affiliations

Georgetown University, Washington, DC, USA
Nazli Goharian
University of Pisa, Pisa, Italy
Nicola Tonellotto
King's College London, London, UK
Yulan He
University College London, London, UK
Aldo Lipani
University of Glasgow, Glasgow, UK
Graham McDonald
University of Glasgow, Glasgow, UK
Craig Macdonald
University of Glasgow, Glasgow, UK
Iadh Ounis

Cite this paper

Yu, P., Mallia, A., Petri, M. (2024). Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_12

DOI: https://doi.org/10.1007/978-3-031-56063-7_12
Published: 23 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56062-0
Online ISBN: 978-3-031-56063-7
eBook Packages: Computer Science, Computer Science (R0)
