Abstract
Negative sample selection has been shown to have a crucial effect on the training procedure of dense retrieval systems. Nevertheless, most existing negative selection methods end up choosing randomly from some pool of candidates, which calls for a better sampling solution. We define the desired requirements for negative sample selection: the chosen samples should be informative, to advance the learning process, and diverse, to help the model generalize. We compose a sampling method designed to meet these requirements, and show that using it to enhance the training procedure of a recent prominent dense retrieval solution (coCondenser) improves the resulting model's performance. Specifically, we observe a \(\sim 2\%\) improvement in MRR@10 on the MS MARCO dataset (from 38.2 to 38.8) and a \(\sim 1.5\%\) improvement in Recall@5 on the Natural Questions dataset (from \(71\%\) to \(72.1\%\)), both statistically significant. Unlike other methods, our solution requires neither training nor running inference with a large model, and it adds only a small overhead (\(\sim 1\%\) additional time) to the training procedure. Finally, we report ablation studies showing that the objectives we define are indeed important when selecting negative samples for dense retrieval.
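The two requirements named above can be made concrete with a small sketch. The following is a rough illustration only, not the paper's InDi procedure: it filters a candidate pool with a relevance-model (cross-encoder) score to keep informative hard negatives, then clusters the survivors and picks one per cluster for diversity. The candidate format, the threshold, and the cluster count are all illustrative assumptions.

```python
# Illustrative sketch only -- NOT the paper's InDi implementation.
# Assumes `candidates` is a list of (passage_id, embedding, ce_score)
# tuples for one query, where ce_score is a cross-encoder relevance
# score in [0, 1]; threshold and k are made-up example values.
import numpy as np
from sklearn.cluster import KMeans

def select_negatives(candidates, ce_threshold=0.8, k=8):
    # Informativeness: drop candidates the cross-encoder scores as
    # likely positives (potential false negatives), keep the rest.
    pool = [c for c in candidates if c[2] < ce_threshold]
    if len(pool) <= k:
        return [pid for pid, _, _ in pool]

    # Diversity: cluster the remaining embeddings and take the
    # sample closest (in Euclidean distance) to each centroid.
    embs = np.stack([emb for _, emb, _ in pool])
    km = KMeans(n_clusters=k, n_init=10).fit(embs)
    selected = []
    for j in range(k):
        members = np.where(km.labels_ == j)[0]
        dists = np.linalg.norm(embs[members] - km.cluster_centers_[j], axis=1)
        selected.append(pool[int(members[np.argmin(dists)])][0])
    return selected
```

Picking the member closest to each centroid, rather than a random member, keeps the selection deterministic given the clustering.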
N. Cohen, H. Cohen-Indelman, Y. Fairstein, and G. Kushilevitz contributed equally to this work.
H. Cohen-Indelman: work done as an intern at Amazon.
Notes
1. In this work we focus on retrieving passages, but dense retrieval methods are used to retrieve other items as well, such as products, documents, and images.
2.
3.
4. For example, many close-to-zero gradient vectors, pointing in different directions, will all fall in the same cluster (illustrated in the first sketch after these notes).
5. We find that the best result is achieved with a threshold of 0.8 on the CE score.
6. \(\mathcal{T}\) denotes the top-\(t\) retrieved samples for \(q\).
7. We opt for Euclidean distance because it is desirable that the distance to a sample be minimized by the sample itself.
8. To make sure the number of passages is sufficient, we define a minimal ratio between the number of passages that pass the CE-filtering and the number of negatives selected (notes 5–8 are combined in the second sketch after these notes).
9. RocketQA trains and runs inference with an ERNIE-large model.
10. Times are measured using an NVIDIA T4 GPU.
11. Our sampling method requires only a CPU; time was measured on a 4-core machine.
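Note 4 describes a pitfall of clustering raw gradient vectors under Euclidean distance: vectors of near-zero norm land in one cluster regardless of their direction. A minimal, self-contained illustration with made-up values:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two near-zero gradients pointing in opposite directions, plus two
# large, orthogonal gradients. With k=3, Euclidean k-means groups the
# two small vectors together even though their directions disagree.
grads = np.array([[0.01, 0.0], [-0.01, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(grads)
print(labels)  # the first two points share a cluster label
```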
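Notes 5–8 together specify selection parameters: a CE-score threshold of 0.8, a top-\(t\) retrieval pool \(\mathcal{T}\) for each query \(q\), Euclidean distance (under which a sample's distance to itself is exactly zero, a property dot-product similarity does not guarantee), and a minimal ratio between CE-filtered passages and selected negatives. Below is a hedged sketch of how these checks could fit together; the retriever and cross-encoder interfaces and all names are our assumptions, not the paper's code.

```python
CE_THRESHOLD = 0.8  # note 5: threshold on the cross-encoder (CE) score
MIN_RATIO = 2.0     # note 8: illustrative minimal (filtered / selected) ratio

def candidate_pool(query, retriever, cross_encoder, t, n_negatives):
    """Build the CE-filtered negative pool for one query (assumed interfaces)."""
    top_t = retriever.retrieve(query, k=t)  # note 6: the pool T of top-t samples
    # Note 5: drop passages the CE scores as likely positives
    # (potential false negatives); keep the hard remainder.
    filtered = [p for p in top_t
                if cross_encoder.score(query, p) < CE_THRESHOLD]
    # Note 8: require enough surviving passages relative to the
    # number of negatives we intend to select.
    if len(filtered) < MIN_RATIO * n_negatives:
        raise ValueError("too few passages passed CE-filtering")
    return filtered
```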
References
Ash, J.T., Zhang, C., Krishnamurthy, A., Langford, J., Agarwal, A.: Deep batch active learning by diverse, uncertain gradient lower bounds. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net (2020)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (2019)
Formal, T., Lassance, C., Piwowarski, B., Clinchant, S.: SPLADE v2: sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086 (2021)
Fu, Y., Zhu, X., Li, B.: A survey on instance selection for active learning. Knowl. Inf. Syst. 35(2), 249–283 (2013)
Gao, L., Callan, J.: Condenser: a pre-training architecture for dense retrieval. In: Moens, M., Huang, X., Specia, L., Yih, S.W. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7–11 November, 2021. pp. 981–993. Association for Computational Linguistics (2021)
Gao, L., Callan, J.: Unsupervised corpus aware language model pre-training for dense passage retrieval. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2843–2853. Association for Computational Linguistics, Dublin, Ireland (May 2022)
Gao, L., Ma, X., Lin, J., Callan, J.: Tevatron: An efficient and flexible toolkit for dense retrieval. CoRR abs/2203.05765 (2022)
Gissin, D., Shalev-Shwartz, S.: Discriminative active learning. CoRR abs/1907.06347 (2019)
Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., Kumar, S.: Accelerating large-scale inference with anisotropic vector quantization. In: International Conference on Machine Learning (2020)
Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.: Retrieval augmented language model pre-training. In: International Conference on Machine Learning. pp. 3929–3938. PMLR (2020)
Hofstätter, S., Lin, S.C., Yang, J.H., Lin, J., Hanbury, A.: Efficiently teaching an effective dense retriever with balanced topic aware sampling. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 113–122 (2021)
Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. pp. 2333–2338 (2013)
Huang, S.J., Jin, R., Zhou, Z.H.: Active learning by querying informative and representative examples. Advances in Neural Information Processing Systems 23 (2010)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7(3), 535–547 (2019)
Jones, K.S.: A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation (1972)
Karpukhin, V., Oguz, B., Min, S., Lewis, P.S.H., Wu, L., Edunov, S., Chen, D., Yih, W.: Dense passage retrieval for open-domain question answering. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020. pp. 6769–6781. Association for Computational Linguistics (2020)
Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 39–48 (2020)
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.W., Dai, A.M., Uszkoreit, J., Le, Q., Petrov, S.: Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 452–466 (2019)
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., Carlini, N.: Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499 (2021)
Lewis, D.D.: A sequential algorithm for training text classifiers: corrigendum and additional data. In: ACM SIGIR Forum. vol. 29, pp. 13–19. ACM, New York, NY, USA (1995)
Lin, Z., Gong, Y., Liu, X., Zhang, H., Lin, C., Dong, A., Jiao, J., Lu, J., Jiang, D., Majumder, R., et al.: PROD: progressive distillation for dense retrieval. arXiv preprint arXiv:2209.13335 (2022)
Lu, J., Ábrego, G.H., Ma, J., Ni, J., Yang, Y.: Multi-stage training with improved negative contrast for neural passage retrieval. In: Moens, M., Huang, X., Specia, L., Yih, S.W. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7–11 November, 2021. pp. 6091–6103. Association for Computational Linguistics (2021)
Lu, Y., Liu, Y., Liu, J., Shi, Y., Huang, Z., Sun, S.F.Y., Tian, H., Wu, H., Wang, S., Yin, D., et al.: ERNIE-Search: bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval. arXiv preprint arXiv:2205.09153 (2022)
Luan, Y., Eisenstein, J., Toutanova, K., Collins, M.: Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics 9, 329–345 (2021)
Mackenzie, J., Dai, Z., Gallagher, L., Callan, J.: Efficiency implications of term weighting for passage retrieval. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1821–1824 (2020)
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: A human generated machine reading comprehension dataset. In: Besold, T.R., Bordes, A., d’Avila Garcez, A.S., Wayne, G. (eds.) Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016. CEUR Workshop Proceedings, vol. 1773. CEUR-WS.org (2016)
Nogueira, R., Lin, J.: From doc2query to docTTTTTquery. Online preprint (2019)
Nogueira, R., Yang, W., Lin, J., Cho, K.: Document expansion by query prediction. arXiv preprint arXiv:1904.08375 (2019)
Prince, M.: Does active learning work? A review of the research. J. Eng. Educ. 93(3), 223–231 (2004)
Qu, Y., Ding, Y., Liu, J., Liu, K., Ren, R., Zhao, W.X., Dong, D., Wu, H., Wang, H.: RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In: Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tür, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y. (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6–11, 2021. pp. 5835–5847. Association for Computational Linguistics (2021)
Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M.M., Gatford, M., et al.: Okapi at TREC-3. NIST Special Publication SP 109, 109 (1995)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988)
Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 9929–9939. PMLR (2020)
Xiong, L., Xiong, C., Li, Y., Tang, K., Liu, J., Bennett, P.N., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net (2021)
Yuan, M., Lin, H., Boyd-Graber, J.L.: Cold-start active learning through self-supervised language modeling. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020. pp. 7935–7948. Association for Computational Linguistics (2020)
Zhao, W.X., Liu, J., Ren, R., Wen, J.R.: Dense text retrieval based on pretrained language models: A survey. arXiv preprint arXiv:2211.14876 (2022)
Zhdanov, F.: Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954 (2019)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Cohen, N., Cohen-Indelman, H., Fairstein, Y., Kushilevitz, G. (2024). InDi: Informative and Diverse Sampling for Dense Retrieval. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_16
Print ISBN: 978-3-031-56062-0
Online ISBN: 978-3-031-56063-7