Abstract
We explore corpus-specific vocabularies as a way to improve both the efficiency and the effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, with different vocabulary sizes that are then incorporated into the document expansion process, improves retrieval quality by up to 12% while, in some scenarios, decreasing latency by up to 50%. Our experiments show that adopting a corpus-specific vocabulary and increasing the vocabulary size decrease the average postings list length, which in turn reduces latency. Ablation studies reveal interactions between custom vocabularies, document expansion techniques, and the sparsification objectives of sparse models. Both the effectiveness and the efficiency improvements transfer to different retrieval approaches, such as uniCOIL and SPLADE, and offer a simple yet effective way to obtain new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
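The link between vocabulary size and postings list length can be illustrated with a toy sketch. It is not the authors' pipeline: the three-document collection, both tokenizations, and the helper names `build_index` and `average_postings_length` are hypothetical and chosen only to show the mechanism. It assumes that a small vocabulary splits words into subword pieces shared by many documents, while a larger corpus-specific vocabulary keeps those words whole.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def average_postings_length(index):
    """Average number of documents per term in the inverted index."""
    return sum(len(plist) for plist in index.values()) / len(index)


# Hypothetical toy collection, tokenized two ways (illustrative data only).
# Small vocabulary: words are split into subword pieces that several
# documents share, so those pieces get longer postings lists.
small_vocab_docs = [
    ["retri", "##eval", "index"],
    ["retri", "##ever", "model"],
    ["index", "##ing", "model"],
]
# Larger corpus-specific vocabulary: the same words are kept whole.
large_vocab_docs = [
    ["retrieval", "index"],
    ["retriever", "model"],
    ["indexing", "model"],
]

small_avg = average_postings_length(build_index(small_vocab_docs))
large_avg = average_postings_length(build_index(large_vocab_docs))

print(f"small vocabulary: {small_avg:.2f} postings per term")
print(f"large vocabulary: {large_avg:.2f} postings per term")
```

In this toy setting the larger vocabulary yields shorter lists on average (1.20 versus 1.50 postings per term), because shared pieces such as `retri` are replaced by more selective whole-word terms. Query processing cost grows with the number of postings traversed, which is the intuition behind the latency reduction reported in the abstract; the sketch makes no claim about the size of the effect on real collections.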
A. Mallia—Work partly done while working at Amazon Alexa.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yu, P., Mallia, A., Petri, M. (2024). Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_12
DOI: https://doi.org/10.1007/978-3-031-56063-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56062-0
Online ISBN: 978-3-031-56063-7
eBook Packages: Computer Science, Computer Science (R0)