
Learning Balanced Tree Indexes for Large-Scale Vector Retrieval

Published: 04 August 2023
DOI: 10.1145/3580305.3599406

Abstract

Vector retrieval focuses on finding the k-nearest neighbors of a query among a large collection of data points, and is widely used in a diverse set of areas such as information retrieval and recommender systems. The current state-of-the-art methods, represented by HNSW, usually produce indexes with a large memory footprint, which restricts the scale of data they can handle unless they resort to a hybrid index backed by external storage. Space-partitioning learned indexes, which occupy only a small amount of memory, have made great breakthroughs in recent years. However, these methods rely on a large amount of labeled data for supervised learning, so model complexity affects their generalization.
To this end, we propose a lightweight learnable hierarchical space-partitioning index based on a balanced K-ary tree, called BAlanced Tree Learner (BATL), where each bucket of data points is represented by a path from the root to the corresponding leaf. Instead of mapping each query directly into a bucket, BATL classifies it into a sequence of branches (i.e., a path), which drastically reduces the number of classes and potentially improves generalization. BATL updates the classifier and the balanced tree in an alternating way. When updating the classifier, we innovatively leverage the sequence-to-sequence learning paradigm to learn to route each query to its ground-truth leaf on the balanced tree. Retrieval then boils down to a sequence (i.e., path) generation task, which can be accomplished simply by beam search over the encoder-decoder. When updating the balanced tree, we apply the classifier to navigate each data point into the tree nodes layer by layer under balance constraints. We finally evaluate BATL on several large-scale vector datasets; the experimental results show that the proposed method outperforms the SOTA baselines in the tradeoff among latency, accuracy, and memory cost.
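
The query-time routing described in the abstract can be illustrated with a short, hedged example. The snippet below is a minimal sketch, not the authors' implementation: it assumes a hypothetical `score_branches(query, path)` callable standing in for BATL's encoder-decoder, which returns one score per branch given the query and the branches chosen so far. Beam search keeps the `beam_width` highest-scoring partial paths at each tree layer; the leaves reached by the surviving paths identify the buckets whose points are then re-ranked exactly.

```python
# Minimal sketch (assumed interfaces, not the authors' code) of routing a query
# through a balanced K-ary tree with beam search and re-ranking the candidates.
import heapq
import numpy as np


def route_with_beam_search(query, score_branches, depth, k_ary, beam_width):
    """Return the `beam_width` highest-scoring root-to-leaf paths.

    `score_branches(query, path)` is a hypothetical scorer returning one score
    per branch; in BATL this role is played by the decoder conditioned on the
    branches chosen so far.
    """
    beam = [(0.0, ())]  # (cumulative score, path of branch ids)
    for _ in range(depth):
        candidates = []
        for score, path in beam:
            branch_scores = score_branches(query, path)
            for b in range(k_ary):
                candidates.append((score + float(branch_scores[b]), path + (b,)))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam  # each surviving path identifies one leaf bucket


def search(query, base_vectors, bucket_of, beam, k):
    """Scan the routed buckets and re-rank the gathered points exactly."""
    leaves = {path for _, path in beam}
    cand = [i for i in range(len(base_vectors)) if bucket_of[i] in leaves]
    cand.sort(key=lambda i: float(np.linalg.norm(base_vectors[i] - query)))
    return cand[:k]


# Toy usage: a random linear scorer stands in for the learned encoder-decoder.
rng = np.random.default_rng(0)
dim, depth, k_ary = 16, 3, 4
layer_proj = rng.normal(size=(depth, dim, k_ary))  # one toy scorer per layer
scorer = lambda q, path: layer_proj[len(path)].T @ q
query = rng.normal(size=dim)
paths = route_with_beam_search(query, scorer, depth, k_ary, beam_width=2)
```

The toy scorer is purely for illustration; in the paper the per-layer branch scores come from the learned encoder-decoder, and the tree itself is built under balance constraints, neither of which is modeled in this sketch.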

Supplementary Material

MP4 File (rtfp1343-2min-promo.mp4)
Presentation video - short version. We propose a new lightweight learnable hierarchical space partitioning index based on a balanced K-ary tree, called BATL. In this video, we briefly introduce the framework and algorithm of the index, and show some experimental results and conclusions.
MP4 File (rtfp1343-20min-video.mp4)
Presentation video. Learning Balanced Tree Indexes for Large-Scale Vector Retrieval.


Cited By

  • (2024) Probabilistic routing for graph-based approximate nearest neighbor search. Proceedings of the 41st International Conference on Machine Learning, pp. 33177-33195. DOI: 10.5555/3692070.3693417. Online publication date: 21-Jul-2024.
  • (2023) Design of FPGA Deep Neural Network Accelerator Based on High-Level Synthesis. 2023 5th International Academic Exchange Conference on Science and Technology Innovation (IAECST), pp. 163-166. DOI: 10.1109/IAECST60924.2023.10502749. Online publication date: 8-Dec-2023.


      Published In

      KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
      August 2023
      5996 pages
      ISBN: 9798400701030
      DOI: 10.1145/3580305
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. learn to index
      2. maximum inner product search (mips)
      3. nearest neighbor search (nns)
      4. transformer
      5. tree
      6. vector retrieval

      Qualifiers

      • Research-article

      Funding Sources

      • the National Natural Science Foundation of China

      Conference

      KDD '23

      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Article Metrics

      • Downloads (Last 12 months): 607
      • Downloads (Last 6 weeks): 51
      Reflects downloads up to 17 Feb 2025

