Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3292500.3330875acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Revisiting kd-tree for Nearest Neighbor Search

Published: 25 July 2019 Publication History

Abstract

\kdtree \citefriedman1976algorithm has long been deemed unsuitable for exact nearest-neighbor search in high dimensional data. The theoretical guarantees and the empirical performance of \kdtree do not show significant improvements over brute-force nearest-neighbor search in moderate to high dimensions. \kdtree has been used relatively more successfully for approximate search \citemuja2009flann but lack theoretical guarantees. In the article, we build upon randomized-partition trees \citedasgupta2013randomized to propose \kdtree based approximate search schemes with $O(d łog d + łog n)$ query time for data sets with n points in d dimensions and rigorous theoretical guarantees on the search accuracy. We empirically validate the search accuracy and the query time guarantees of our proposed schemes, demonstrating the significantly improved scaling for same level of accuracy.

References

[1]
Nir Ailon and Bernard Chazelle. 2009. The fast Johnson--Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on computing, Vol. 39, 1 (2009), 302--322.
[2]
Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. 2015. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems. 1225--1233.
[3]
Alexandr Andoni, Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten. 2018. Hölder homeomorphisms and approximate nearest neighbors. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 159--169.
[4]
Alexandr Andoni, Huy L Nguyen, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten. 2017. Approximate near neighbors for general symmetric norms. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. ACM, 902--913.
[5]
Sunil Arya, David M Mount, Nathan Netanyahu, Ruth Silverman, and Angela Y Wu. 1994. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In Proc. 5th ACM-SIAM Sympos. Discrete Algorithms. 573--582.
[6]
Patrice Assouad. 1979. Etude d'une dimension metrique liee a la possibilite de plongements dans R^ n. CR Acad. Sci. Paris Sér. AB, Vol. 288 (1979), A731--A734.
[7]
Arthur Asuncion and David Newman. 2007. UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets.html
[8]
Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. 2017. ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In International Conference on Similarity Search and Applications. Springer, 34--49. https://github.com/erikbern/ann-benchmarks
[9]
Artem Babenko and Victor Lempitsky. 2012. The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 3069--3076.
[10]
Alina Beygelzimer, Sham Kakade, and John Langford. 2006. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning. ACM, 97--104.
[11]
Moses S Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. ACM, 380--388.
[12]
Kenneth L Clarkson. 2006. Nearest-neighbor searching and metric space dimensions. Nearest-neighbor methods for learning and vision: theory and practice (2006), 15--59.
[13]
Sanjoy Dasgupta and Yoav Freund. 2008. Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing. ACM, 537--546.
[14]
Sanjoy Dasgupta and Kaushik Sinha. 2013. Randomized partition trees for exact nearest neighbor search. In Conference on Learning Theory. 317--337.
[15]
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry. ACM, 253--262.
[16]
Jerome H Friedman, Jon Louis Bentley, and Raphael Ari Finkel. 1976. An algorithm for finding best matches in logarithmic time. ACM Trans. Math. Software, Vol. 3, SLAC-PUB-1549-REV. 2 (1976), 209--226.
[17]
Keinosuke Fukunaga and Patrenahalli M. Narendra. 1975. A branch and bound algorithm for computing k-nearest neighbors. IEEE transactions on computers, Vol. 100, 7 (1975), 750--753.
[18]
Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2014. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence, Vol. 36, 4 (2014), 744--755.
[19]
Aristides Gionis, Piotr Indyk, Rajeev Motwani, et almbox. 1999. Similarity search in high dimensions via hashing. In Vldb, Vol. 99. 518--529.
[20]
Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. 2013. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, 12 (2013), 2916--2929.
[21]
Robert M Gray et almbox. 2006. Toeplitz and circulant matrices: A review. Foundations and Trends® in Communications and Information Theory, Vol. 2, 3 (2006), 155--239.
[22]
Kiana Hajebi, Yasin Abbasi-Yadkori, Hossein Shahbazi, and Hong Zhang. 2011. Fast approximate nearest-neighbor search with k-nearest neighbor graph. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22. 1312.
[23]
Ville Hyvönen, Teemu Pitk"anen, Sotiris Tasoulis, Elias J"a"asaari, Risto Tuomainen, Liang Wang, Jukka Corander, and Teemu Roos. 2016. Fast nearest neighbor search through sparse random projections and voting. In Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 881--888.
[24]
Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 604--613.
[25]
Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, Vol. 33, 1 (2011), 117--128.
[26]
Omid Keivani and Kaushik Sinha. 2018. Improved nearest neighbor search using auxiliary information and priority functions. In International Conference on Machine Learning . 2578--2586.
[27]
Omid Keivani, Kaushik Sinha, and Parikshit Ram. 2017. Improved maximum inner product search with better theoretical guarantees. In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2927--2934.
[28]
Omid Keivani, Kaushik Sinha, and Parikshit Ram. 2018. Improved maximum inner product search with better theoretical guarantee using randomized partition trees. Machine Learning (2018), 1--26.
[29]
Quoc Le, Tamás Sarlós, and Alex Smola. 2013. Fastfood-approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning, Vol. 85.
[30]
Ting Liu, Andrew W Moore, Ke Yang, and Alexander G Gray. 2005. An investigation of practical approximate nearest neighbor algorithms. In Advances in neural information processing systems. 825--832.
[31]
Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, 950--961.
[32]
Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. 2014. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, Vol. 45 (2014), 61--68.
[33]
Yu A Malkov and Dmitry A Yashunin. 2016. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv preprint arXiv:1603.09320 (2016).
[34]
James McNames. 2001. A fast nearest-neighbor algorithm based on a principal axis search tree. IEEE Transactions on Pattern Analysis & Machine Intelligence 9 (2001), 964--976.
[35]
Andrew W Moore. 2000. The anchors hierarchy: Using the triangle inequality to survive high dimensional data. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 397--405.
[36]
Marius Muja and David G Lowe. 2009. Flann, fast library for approximate nearest neighbors. In International Conference on Computer Vision Theory and Applications (VISAPP'09), Vol. 3. INSTICC Press.
[37]
David Nister and Henrik Stewenius. 2006. Scalable recognition with a vocabulary tree. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, Vol. 2. Ieee, 2161--2168.
[38]
Michael T Orchard. 1991. A fast nearest-neighbor search algorithm. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on. IEEE, 2297--2300.
[39]
Parikshit Ram, Dongryeol Lee, and Alexander G Gray. 2012. Nearest-neighbor search on a time budget via max-margin trees. In Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 1011--1022.
[40]
Kaushik Sinha. 2014. LSH vs Randomized Partition Trees: Which One to Use for Nearest Neighbor Search?. In Machine Learning and Applications (ICMLA), 2014 13th International Conference on. IEEE, 41--46.
[41]
Kaushik Sinha and Omid Keivani. 2017. Sparse Randomized Partition Trees for Nearest Neighbor Search. In Artificial Intelligence and Statistics. 681--689.
[42]
Antonio Torralba, Rob Fergus, and William T Freeman. 2008. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, Vol. 30, 11 (2008), 1958--1970.
[43]
Santosh S Vempala. 2012. Randomly-oriented kd trees adapt to intrinsic dimension. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 18. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[44]
Nakul Verma, Samory Kpotufe, and Sanjoy Dasgupta. 2009. Which spatial partition trees are adaptive to intrinsic dimension?. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 565--574.
[45]
Enrique Vidal. 1994. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Pattern Recognition Letters, Vol. 15, 1 (1994), 1--7.
[46]
Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. 2016. Learning to hash for indexing big data-a survey. Proc. IEEE, Vol. 104, 1 (2016), 34--57.
[47]
Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et almbox. 2018. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40, 4 (2018), 769--790.
[48]
Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N Holtmann-Rice, David Simcha, and Felix Yu. 2017. Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems. 5745--5755.
[49]
Yubao Wu, Ruoming Jin, and Xiang Zhang. 2014. Fast and unified local search for random walk based k-nearest-neighbor query in large graphs. In Proceedings of the 2014 ACM SIGMOD international conference on Management of Data. ACM, 1139--1150.
[50]
Felix Yu, Sanjiv Kumar, Yunchao Gong, and Shih-Fu Chang. 2014. Circulant binary embedding. In International conference on machine learning . 946--954.

Cited By

View all
  • (2025)DEG: Efficient Hybrid Vector Search Using the Dynamic Edge Navigation GraphProceedings of the ACM on Management of Data10.1145/37096793:1(1-28)Online publication date: 11-Feb-2025
  • (2025)Path Overlap Detection and Property Graph Construction from TCX Data for Advanced Analysis2025 IEEE 23rd World Symposium on Applied Machine Intelligence and Informatics (SAMI)10.1109/SAMI63904.2025.10883235(000489-000494)Online publication date: 23-Jan-2025
  • (2025)SpaCE: a spatial counterfactual explainable deep learning model for predicting out-of-hospital cardiac arrest survival outcomeInternational Journal of Geographical Information Science10.1080/13658816.2024.2443757(1-32)Online publication date: 28-Jan-2025
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN:9781450362016
DOI:10.1145/3292500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 July 2019

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. nearest-neighbor search
  2. randomized algorithms
  3. similarity search
  4. space-partitioning trees

Qualifiers

  • Research-article

Conference

KDD '19
Sponsor:

Acceptance Rates

KDD '19 Paper Acceptance Rate 110 of 1,200 submissions, 9%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)367
  • Downloads (Last 6 weeks)38
Reflects downloads up to 19 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)DEG: Efficient Hybrid Vector Search Using the Dynamic Edge Navigation GraphProceedings of the ACM on Management of Data10.1145/37096793:1(1-28)Online publication date: 11-Feb-2025
  • (2025)Path Overlap Detection and Property Graph Construction from TCX Data for Advanced Analysis2025 IEEE 23rd World Symposium on Applied Machine Intelligence and Informatics (SAMI)10.1109/SAMI63904.2025.10883235(000489-000494)Online publication date: 23-Jan-2025
  • (2025)SpaCE: a spatial counterfactual explainable deep learning model for predicting out-of-hospital cardiac arrest survival outcomeInternational Journal of Geographical Information Science10.1080/13658816.2024.2443757(1-32)Online publication date: 28-Jan-2025
  • (2025)A robust few‐shot classifier with image as set of pointsIET Computer Vision10.1049/cvi2.1234019:1Online publication date: 2-Jan-2025
  • (2025)Impacts of Lake Basin Topography Predicted by Long Short-Term Memory on Difference of Lake Area and Water Resource EvolutionJournal of Cleaner Production10.1016/j.jclepro.2025.144781(144781)Online publication date: Jan-2025
  • (2025)Oversampling multi-label data based on natural neighbor and label correlationExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.125257259:COnline publication date: 1-Jan-2025
  • (2024)Cucumber Seedling Segmentation Network Based on a Multiview Geometric Graph Encoder from 3D Point CloudsPlant Phenomics10.34133/plantphenomics.02546(0254)Online publication date: 2024
  • (2024)Spatial Pattern and Environmental Driving Factors of Treeline Elevations in Yulong Snow Mountain, ChinaForests10.3390/f1507126115:7(1261)Online publication date: 19-Jul-2024
  • (2024)Modeling and Simulation of Distribution Networks with High Renewable Penetration in Open-Source Software: QGIS and OpenDSSEnergies10.3390/en1712292517:12(2925)Online publication date: 14-Jun-2024
  • (2024)Visual Odometry in GPS-Denied Zones for Fixed-Wing Unmanned Aerial Vehicle with Reduced Accumulative Error Based on Satellite ImageryApplied Sciences10.3390/app1416742014:16(7420)Online publication date: 22-Aug-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media