
Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies

  • Conference paper
Advances in Information Retrieval (ECIR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14610)


Abstract

We explore leveraging corpus-specific vocabularies that improve both the efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting a corpus-specific vocabulary and increasing vocabulary size decrease the average postings list length, which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE, and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
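
The link between vocabulary and postings list length can be illustrated with a toy sketch. It is not the authors' experimental setup: the four documents, both vocabularies, and the simplified greedy subword splitter below are all assumptions made for illustration only. The sketch builds a small inverted index under a generic subword vocabulary and under a larger corpus-specific one, then compares the mean postings list length.

```python
from collections import defaultdict


def split_word(word, vocab):
    """Greedy longest-match-first subword split (a simplified, hypothetical
    stand-in for WordPiece); unmatched text falls back to single characters."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one char long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def average_postings_length(docs, vocab):
    """Build a toy inverted index and return the mean postings list length."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            for term in split_word(word, vocab):
                postings[term].add(doc_id)
    return sum(len(ids) for ids in postings.values()) / len(postings)


# Illustrative data only; none of this comes from the paper.
docs = ["sparse retrieval", "sparse indexing", "dense retrieval", "dense indexing"]
generic_vocab = {"spar", "den", "se", "retriev", "al", "index", "ing"}
corpus_vocab = generic_vocab | {"sparse", "dense", "retrieval", "indexing"}

generic_avg = average_postings_length(docs, generic_vocab)
corpus_avg = average_postings_length(docs, corpus_vocab)
print(f"generic vocabulary: {generic_avg:.2f} postings per term")
print(f"corpus vocabulary:  {corpus_avg:.2f} postings per term")
```

In this toy case the shared fragment "se" appears in every document under the generic vocabulary, producing one long postings list. The corpus-specific vocabulary keeps whole words as single terms, so that list disappears and the average drops. Real learned sparse models also expand and reweight terms, which this sketch does not model.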

A. Mallia—Work partly done while working at Amazon Alexa.




Corresponding author

Correspondence to Antonio Mallia.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, P., Mallia, A., Petri, M. (2024). Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_12


  • DOI: https://doi.org/10.1007/978-3-031-56063-7_12


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56062-0

  • Online ISBN: 978-3-031-56063-7

  • eBook Packages: Computer Science, Computer Science (R0)
