Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3539618.3591651acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article
Open access

Constructing Tree-based Index for Efficient and Effective Dense Retrieval

Published: 18 July 2023 Publication History

Abstract

Recent studies have shown that Dense Retrieval (DR) techniques can significantly improve the performance of first-stage retrieval in IR systems. Despite its empirical effectiveness, the application of DR is still limited. In contrast to statistic retrieval models that rely on highly efficient inverted index solutions, DR models build dense embeddings that are difficult to be pre-processed with most existing search indexing systems. To avoid the expensive cost of brute-force search, the Approximate Nearest Neighbor (ANN) algorithm and corresponding indexes are widely applied to speed up the inference process of DR models. Unfortunately, while ANN can improve the efficiency of DR models, it usually comes with a significant price on retrieval performance.
To solve this issue, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. Specifically, we design a new unified contrastive learning loss to train tree-based index and query encoder in an end-to-end manner. The tree-based negative sampling strategy is applied to make the tree have the maximum heap property, which supports the effectiveness of beam search well. Moreover, we treat the cluster assignment as an optimization problem to update the tree-based index that allows overlapped clustering. We evaluate JTR on numerous popular retrieval benchmarks. Experimental results show that JTR achieves better retrieval performance while retaining high system efficiency compared with widely-adopted baselines. It provides a potential solution to balance efficiency and effectiveness in neural retrieval system designs.

References

[1]
Artem Babenko and Victor Lempitsky. 2014. The inverted multi-index. IEEE transactions on pattern analysis and machine intelligence, Vol. 37, 6 (2014), 1247--1260.
[2]
Erik Bernhardsson. 2017. Annoy: approximate nearest neighbors in C/Python optimized for memory usage and loading/saving to disk. GitHub https://github. com/spotify/annoy (2017).
[3]
Jia Chen, Haitao Li, Weihang Su, Qingyao Ai, and Yiqun Liu. [n.,d.]. THUIR at WSDM Cup 2023 Task 1: Unbiased Learning to Rank. ([n.,d.]).
[4]
Guillaume Cleuziou. 2008. An extended version of the k-means method for overlapping clustering. In 2008 19th International Conference on Pattern Recognition. IEEE, 1--4.
[5]
Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020a. Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820 (2020).
[6]
Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Ellen M. Voorhees. 2020b. Overview of the TREC 2020 Deep Learning Track. ArXiv, Vol. abs/2003.07820 (2020).
[7]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[8]
Qian Dong, Yiding Liu, Suqi Cheng, Shuaiqiang Wang, Zhicong Cheng, Shuzi Niu, and Dawei Yin. 2022. Incorporating Explicit Knowledge in Pre-trained Language Models for Passage Re-ranking. arXiv preprint arXiv:2204.11673 (2022).
[9]
Yan Fang, Jingtao Zhan, Yiqun Liu, Jiaxin Mao, Min Zhang, and Shaoping Ma. 2022. Joint Optimization of Multi-vector Representation with Product Quantization. In Natural Language Processing and Chinese Computing: 11th CCF International Conference, NLPCC 2022, Guilin, China, September 24-25, 2022, Proceedings, Part I. Springer, 631--642.
[10]
Chao Feng, Wuchao Li, Defu Lian, Zheng Liu, and Enhong Chen. 2022a. Recommender Forest for Efficient Retrieval. Advances in Neural Information Processing Systems, Vol. 35 (2022), 38912--38924.
[11]
Chao Feng, Defu Lian, Zheng Liu, Xing Xie, Le Wu, and Enhong Chen. 2022b. Forest-based Deep Recommender. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 523--532.
[12]
Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. 2021. Complement lexical retrieval model with semantic residual embeddings. In European Conference on Information Retrieval. Springer, 146--160.
[13]
Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence, Vol. 36, 4 (2013), 744--755.
[14]
Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, Vol. 33, 1 (2010), 117--128.
[15]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, Vol. 7, 3 (2019), 535--547.
[16]
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
[17]
Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39--48.
[18]
Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, and Jaewoo Kang. 2019. Contextualized sparse representations for real-time open-domain question answering. arXiv preprint arXiv:1911.02896 (2019).
[19]
Haitao Li, Jia Chen, Weihang Su, Qingyao Ai, and Yiqun Liu. 2023. Towards Better Web Search Performance: Pre-training, Fine-tuning and Learning to Rank. arXiv preprint arXiv:2303.04710 (2023).
[20]
Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2020. Distilling dense representations for ranking using tightly-coupled teachers. arXiv preprint arXiv:2010.11386 (2020).
[21]
Xuanqing Liu, Wei-Cheng Chang, Hsiang-Fu Yu, Cho-Jui Hsieh, and Inderjit Dhillon. 2021. Label disentanglement in partition-based extreme multilabel classification. Advances in Neural Information Processing Systems, Vol. 34 (2021), 15359--15369.
[22]
Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in adam. (2018).
[23]
Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, Vol. 42, 4 (2018), 824--836.
[24]
Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learning passage impacts for inverted indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1723--1727.
[25]
Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. Best-first beam search. Transactions of the Association for Computational Linguistics, Vol. 8 (2020), 795--809.
[26]
Marius Muja and David Lowe. 2009. Flann-fast library for approximate nearest neighbors user manual. Computer Science Department, University of British Columbia, Vancouver, BC, Canada, Vol. 5 (2009).
[27]
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@ NIPs.
[28]
Ninh Pham and Tao Liu. 2022. Falconn: A Locality-sensitive Filtering Approach for Approximate Nearest Neighbor Search. arXiv preprint arXiv:2206.01382 (2022).
[29]
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2020. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191 (2020).
[30]
Yi Tay, Vinh Q Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022. Transformer memory as a differentiable search index. arXiv preprint arXiv:2202.06991 (2022).
[31]
Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, et al. 2022. A Neural Corpus Indexer for Document Retrieval. arXiv preprint arXiv:2206.02743 (2022).
[32]
Joyce Jiyoung Whang, Inderjit S Dhillon, and David F Gleich. 2015. Non-exhaustive, overlapping k-means. In Proceedings of the 2015 SIAM international conference on data mining. SIAM, 936--944.
[33]
Xiaohui Xie, Qian Dong, Bingning Wang, Feiyang Lv, Ting Yao, Weinan Gan, Zhijing Wu, Xiangsheng Li, Haitao Li, Yiqun Liu, et al. 2023. T2Ranking: A large-scale Chinese Benchmark for Passage Ranking. arXiv preprint arXiv:2304.03679 (2023).
[34]
Shenghao Yang, Haitao Li, Zhumin Chu, Jingtao Zhan, Yiqun Liu, Min Zhang, and Shaoping Ma. 2022. THUIR at the NTCIR-16 WWW-4 Task. Proceedings of NTCIR-16. to appear (2022).
[35]
Shenghao Yang, Yiqun Liu, Xiaohui Xie, Min Zhang, and Shaoping Ma. 2023. Enhance Performance of Ad-hoc Search via Prompt Learning. In Information Retrieval: 28th China Conference, CCIR 2022, Chongqing, China, September 16-18, 2022, Revised Selected Papers. Springer, 28--39.
[36]
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021a. Jointly optimizing query encoder and product quantization to improve retrieval performance. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2487--2496.
[37]
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021b. Optimizing dense retrieval model training with hard negatives. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1503--1512.
[38]
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2022. Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1328--1336.
[39]
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2020. RepBERT: Contextualized text embeddings for first-stage retrieval. arXiv preprint arXiv:2006.15498 (2020).
[40]
Han Zhu, Daqing Chang, Ziru Xu, Pengye Zhang, Xiang Li, Jie He, Han Li, Jian Xu, and Kun Gai. 2019. Joint optimization of tree-based index and deep model for recommender systems. Advances in Neural Information Processing Systems, Vol. 32 (2019).
[41]
Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1079--1088.
[42]
Jingwei Zhuo, Ziru Xu, Wei Dai, Han Zhu, Han Li, Jian Xu, and Kun Gai. 2020. Learning optimal tree models under beam search. In International Conference on Machine Learning. PMLR, 11650--11659.

Cited By

View all
  • (2024)Hierarchical Indexing for Retrieval-Augmented Opinion SummarizationTransactions of the Association for Computational Linguistics10.1162/tacl_a_0070312(1533-1555)Online publication date: 27-Nov-2024
  • (2024)EASE-DR: Enhanced Sentence Embeddings for Dense RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657925(2374-2378)Online publication date: 10-Jul-2024
  • (2024)Event Grounded Criminal Court View Generation with Cooperative (Large) Language ModelsProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657698(2221-2230)Online publication date: 10-Jul-2024
  • Show More Cited By

Index Terms

  1. Constructing Tree-based Index for Efficient and Effective Dense Retrieval

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2023
      3567 pages
      ISBN:9781450394086
      DOI:10.1145/3539618
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 July 2023

      Check for updates

      Author Tags

      1. approximate nearest neighbor
      2. information retrieval
      3. tree-based index

      Qualifiers

      • Research-article

      Funding Sources

      • Tsinghua-Tencent Tiangong Institute for Intelligent Computing
      • Natural Science Foundation of China
      • Tsinghua University Guoqiang Research Institute

      Conference

      SIGIR '23
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)750
      • Downloads (Last 6 weeks)92
      Reflects downloads up to 30 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Hierarchical Indexing for Retrieval-Augmented Opinion SummarizationTransactions of the Association for Computational Linguistics10.1162/tacl_a_0070312(1533-1555)Online publication date: 27-Nov-2024
      • (2024)EASE-DR: Enhanced Sentence Embeddings for Dense RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657925(2374-2378)Online publication date: 10-Jul-2024
      • (2024)Event Grounded Criminal Court View Generation with Cooperative (Large) Language ModelsProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657698(2221-2230)Online publication date: 10-Jul-2024
      • (2024)Survey of vector database management systemsThe VLDB Journal10.1007/s00778-024-00864-x33:5(1591-1615)Online publication date: 15-Jul-2024
      • (2024)Towards an In-Depth Comprehension of Case Relevance for Better Legal RetrievalNew Frontiers in Artificial Intelligence10.1007/978-981-97-3076-6_15(212-227)Online publication date: 28-May-2024

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media