research-article

Open access

Constructing Tree-based Index for Efficient and Effective Dense Retrieval

Authors:

Zhao CaoAuthors Info & Claims

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 131 - 140

https://doi.org/10.1145/3539618.3591651

Published: 18 July 2023 Publication History

Abstract

Recent studies have shown that Dense Retrieval (DR) techniques can significantly improve the performance of first-stage retrieval in IR systems. Despite its empirical effectiveness, the application of DR is still limited. In contrast to statistic retrieval models that rely on highly efficient inverted index solutions, DR models build dense embeddings that are difficult to be pre-processed with most existing search indexing systems. To avoid the expensive cost of brute-force search, the Approximate Nearest Neighbor (ANN) algorithm and corresponding indexes are widely applied to speed up the inference process of DR models. Unfortunately, while ANN can improve the efficiency of DR models, it usually comes with a significant price on retrieval performance.

To solve this issue, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. Specifically, we design a new unified contrastive learning loss to train tree-based index and query encoder in an end-to-end manner. The tree-based negative sampling strategy is applied to make the tree have the maximum heap property, which supports the effectiveness of beam search well. Moreover, we treat the cluster assignment as an optimization problem to update the tree-based index that allows overlapped clustering. We evaluate JTR on numerous popular retrieval benchmarks. Experimental results show that JTR achieves better retrieval performance while retaining high system efficiency compared with widely-adopted baselines. It provides a potential solution to balance efficiency and effectiveness in neural retrieval system designs.

References

[1]

Artem Babenko and Victor Lempitsky. 2014. The inverted multi-index. IEEE transactions on pattern analysis and machine intelligence, Vol. 37, 6 (2014), 1247--1260.

[2]

Erik Bernhardsson. 2017. Annoy: approximate nearest neighbors in C/Python optimized for memory usage and loading/saving to disk. GitHub https://github. com/spotify/annoy (2017).

[3]

Jia Chen, Haitao Li, Weihang Su, Qingyao Ai, and Yiqun Liu. [n.,d.]. THUIR at WSDM Cup 2023 Task 1: Unbiased Learning to Rank. ([n.,d.]).

[4]

Guillaume Cleuziou. 2008. An extended version of the k-means method for overlapping clustering. In 2008 19th International Conference on Pattern Recognition. IEEE, 1--4.

[5]

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020a. Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820 (2020).

[6]

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Ellen M. Voorhees. 2020b. Overview of the TREC 2020 Deep Learning Track. ArXiv, Vol. abs/2003.07820 (2020).

[7]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[8]

Qian Dong, Yiding Liu, Suqi Cheng, Shuaiqiang Wang, Zhicong Cheng, Shuzi Niu, and Dawei Yin. 2022. Incorporating Explicit Knowledge in Pre-trained Language Models for Passage Re-ranking. arXiv preprint arXiv:2204.11673 (2022).

[9]

Yan Fang, Jingtao Zhan, Yiqun Liu, Jiaxin Mao, Min Zhang, and Shaoping Ma. 2022. Joint Optimization of Multi-vector Representation with Product Quantization. In Natural Language Processing and Chinese Computing: 11th CCF International Conference, NLPCC 2022, Guilin, China, September 24-25, 2022, Proceedings, Part I. Springer, 631--642.

Digital Library

[10]

Chao Feng, Wuchao Li, Defu Lian, Zheng Liu, and Enhong Chen. 2022a. Recommender Forest for Efficient Retrieval. Advances in Neural Information Processing Systems, Vol. 35 (2022), 38912--38924.

[11]

Chao Feng, Defu Lian, Zheng Liu, Xing Xie, Le Wu, and Enhong Chen. 2022b. Forest-based Deep Recommender. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 523--532.

Digital Library

[12]

Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. 2021. Complement lexical retrieval model with semantic residual embeddings. In European Conference on Information Retrieval. Springer, 146--160.

Digital Library

[13]

Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence, Vol. 36, 4 (2013), 744--755.

[14]

Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, Vol. 33, 1 (2010), 117--128.

[15]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, Vol. 7, 3 (2019), 535--547.

[16]

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).

[17]

Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39--48.

Digital Library

[18]

Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, and Jaewoo Kang. 2019. Contextualized sparse representations for real-time open-domain question answering. arXiv preprint arXiv:1911.02896 (2019).

[19]

Haitao Li, Jia Chen, Weihang Su, Qingyao Ai, and Yiqun Liu. 2023. Towards Better Web Search Performance: Pre-training, Fine-tuning and Learning to Rank. arXiv preprint arXiv:2303.04710 (2023).

[20]

Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2020. Distilling dense representations for ranking using tightly-coupled teachers. arXiv preprint arXiv:2010.11386 (2020).

[21]

Xuanqing Liu, Wei-Cheng Chang, Hsiang-Fu Yu, Cho-Jui Hsieh, and Inderjit Dhillon. 2021. Label disentanglement in partition-based extreme multilabel classification. Advances in Neural Information Processing Systems, Vol. 34 (2021), 15359--15369.

[22]

Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in adam. (2018).

[23]

Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, Vol. 42, 4 (2018), 824--836.

[24]

Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learning passage impacts for inverted indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1723--1727.

Digital Library

[25]

Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. Best-first beam search. Transactions of the Association for Computational Linguistics, Vol. 8 (2020), 795--809.

[26]

Marius Muja and David Lowe. 2009. Flann-fast library for approximate nearest neighbors user manual. Computer Science Department, University of British Columbia, Vancouver, BC, Canada, Vol. 5 (2009).

[27]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@ NIPs.

[28]

Ninh Pham and Tao Liu. 2022. Falconn: A Locality-sensitive Filtering Approach for Approximate Nearest Neighbor Search. arXiv preprint arXiv:2206.01382 (2022).

[29]

Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2020. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191 (2020).

[30]

Yi Tay, Vinh Q Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022. Transformer memory as a differentiable search index. arXiv preprint arXiv:2202.06991 (2022).

[31]

Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, et al. 2022. A Neural Corpus Indexer for Document Retrieval. arXiv preprint arXiv:2206.02743 (2022).

[32]

Joyce Jiyoung Whang, Inderjit S Dhillon, and David F Gleich. 2015. Non-exhaustive, overlapping k-means. In Proceedings of the 2015 SIAM international conference on data mining. SIAM, 936--944.

[33]

Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, Haitao Li, Yiqun Liu, et al. 2023. T2Ranking: A large-scale Chinese Benchmark for Passage Ranking. arXiv preprint arXiv:2304.03679 (2023).

[34]

Shenghao Yang, Haitao Li, Zhumin Chu, Jingtao Zhan, Yiqun Liu, Min Zhang, and Shaoping Ma. 2022. THUIR at the NTCIR-16 WWW-4 Task. Proceedings of NTCIR-16. to appear (2022).

[35]

Shenghao Yang, Yiqun Liu, Xiaohui Xie, Min Zhang, and Shaoping Ma. 2023. Enhance Performance of Ad-hoc Search via Prompt Learning. In Information Retrieval: 28th China Conference, CCIR 2022, Chongqing, China, September 16-18, 2022, Revised Selected Papers. Springer, 28--39.

Digital Library

[36]

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021a. Jointly optimizing query encoder and product quantization to improve retrieval performance. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2487--2496.

Digital Library

[37]

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021b. Optimizing dense retrieval model training with hard negatives. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1503--1512.

Digital Library

[38]

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2022. Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1328--1336.

Digital Library

[39]

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2020. RepBERT: Contextualized text embeddings for first-stage retrieval. arXiv preprint arXiv:2006.15498 (2020).

[40]

Han Zhu, Daqing Chang, Ziru Xu, Pengye Zhang, Xiang Li, Jie He, Han Li, Jian Xu, and Kun Gai. 2019. Joint optimization of tree-based index and deep model for recommender systems. Advances in Neural Information Processing Systems, Vol. 32 (2019).

[41]

Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1079--1088.

Digital Library

[42]

Jingwei Zhuo, Ziru Xu, Wei Dai, Han Zhu, Han Li, Jian Xu, and Kun Gai. 2020. Learning optimal tree models under beam search. In International Conference on Machine Learning. PMLR, 11650--11659.

Cited By

Hosking TTang HLapata M(2024)Hierarchical Indexing for Retrieval-Augmented Opinion SummarizationTransactions of the Association for Computational Linguistics10.1162/tacl_a_0070312(1533-1555)Online publication date: 27-Nov-2024
https://doi.org/10.1162/tacl_a_00703
Zhou XGao YJie XCai XBu JWang HHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)EASE-DR: Enhanced Sentence Embeddings for Dense RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657925(2374-2378)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657925
Yue LLiu QZhao LWang LGao WAn YHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)Event Grounded Criminal Court View Generation with Cooperative (Large) Language ModelsProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657698(2221-2230)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657698
Show More Cited By

Index Terms

Constructing Tree-based Index for Efficient and Effective Dense Retrieval
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking

Recommendations

Efficient Textual Web Retrieval using Wavelet Tree

Searching on the web is one of the most progressive and expanding field nowadays. A large amount of information is available on the World Wide Web, motivating the need of efficient text indexing method that support fast text retrieval. In the past, two ...
Fast Forward Index Methods for Pseudo-Relevance Feedback Retrieval

The inverted index is the dominant indexing method in information retrieval systems. It enables fast return of the list of all documents containing a given query term. However, for retrieval schemes involving query expansion, as in pseudo-relevance ...
Improving zero-shot retrieval using dense external expansion
Abstract
Pseudo-relevance feedback (PRF) is a classical technique to improve search engine retrieval effectiveness, by closing the vocabulary gap between users’ query formulations and the relevant documents. While PRF is typically applied on ...
Highlights
- Dense external expansion improves zero-shot retrieval performance.
- High quality ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2023

3567 pages

ISBN:9781450394086

DOI:10.1145/3539618

General Chairs:
Hsin-Hsi Chen
National Taiwan University
,
Wei-Jou (Edward) Duh
National Taiwan University
,
Hen-Hsen Huang
Academia Sinica
,
Program Chairs:
Makoto P. Kato
Spotify
,
Josiane Mothe
Universite de Toulouse
,
Barbara Poblete
University of Chile and Amazon Visiting Academic

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 July 2023

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Tsinghua-Tencent Tiangong Institute for Intelligent Computing
Natural Science Foundation of China
Tsinghua University Guoqiang Research Institute

Conference

SIGIR '23

Sponsor:

SIGIR

SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 23 - 27, 2023

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
914
Total Downloads

Downloads (Last 12 months)750
Downloads (Last 6 weeks)92

Reflects downloads up to 30 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Hosking TTang HLapata M(2024)Hierarchical Indexing for Retrieval-Augmented Opinion SummarizationTransactions of the Association for Computational Linguistics10.1162/tacl_a_0070312(1533-1555)Online publication date: 27-Nov-2024
https://doi.org/10.1162/tacl_a_00703
Zhou XGao YJie XCai XBu JWang HHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)EASE-DR: Enhanced Sentence Embeddings for Dense RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657925(2374-2378)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657925
Yue LLiu QZhao LWang LGao WAn YHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)Event Grounded Criminal Court View Generation with Cooperative (Large) Language ModelsProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657698(2221-2230)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657698
Pan JWang JLi G(2024)Survey of vector database management systemsThe VLDB Journal10.1007/s00778-024-00864-x33:5(1591-1615)Online publication date: 15-Jul-2024
https://doi.org/10.1007/s00778-024-00864-x
Li HChen YGe ZAi QLiu YZhou QHuo S(2024)Towards an In-Depth Comprehension of Case Relevance for Better Legal RetrievalNew Frontiers in Artificial Intelligence10.1007/978-981-97-3076-6_15(212-227)Online publication date: 28-May-2024
https://dl.acm.org/doi/10.1007/978-981-97-3076-6_15

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents