DOI: 10.1145/3448016.3457240
research-article
Open access

Point-to-Hyperplane Nearest Neighbor Search Beyond the Unit Hypersphere

Published: 18 June 2021

Abstract

Point-to-Hyperplane Nearest Neighbor Search (P2HNNS) is a fundamental yet challenging problem with applications in many fields. Existing hyperplane hashing schemes enjoy sub-linear query time and achieve excellent performance on applications such as large-scale active learning with Support Vector Machines (SVMs). However, they handle this problem only under a strong assumption: that all data objects are normalized to lie on the unit hypersphere. Without this assumption, those hyperplane hashing schemes may be arbitrarily bad. In this paper, we introduce a new asymmetric transformation and develop the first two provable hyperplane hashing schemes, Nearest Hyperplane hashing (NH) and Furthest Hyperplane hashing (FH), for high-dimensional P2HNNS beyond the unit hypersphere. With this asymmetric transformation, we demonstrate that the hash functions of NH and FH are locality-sensitive to hyperplane queries, and both enjoy quality guarantees on query results. Moreover, we propose a data-dependent multi-partition strategy to boost the search performance of FH. NH can perform hyperplane queries in sub-linear time, while FH enjoys better practical performance. We evaluate NH and FH on five real-life datasets and show that they are around 3~100x faster than the best competitor on four of the five datasets, especially for recalls in [20%, 80%]. Code is available at https://github.com/HuangQiang/P2HNNS.
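As context for the problem the abstract describes: given a query hyperplane with normal vector a and offset b, P2HNNS asks for the data point closest to that hyperplane. The sketch below is a minimal brute-force baseline using the standard point-to-hyperplane distance |a·x + b| / ||a||; it is not the paper's NH/FH hashing (which exists precisely to avoid this O(nd) linear scan), and all names here are illustrative.

```python
import numpy as np

def p2h_distances(X, a, b=0.0):
    """Distance from each row of X to the hyperplane {y : a.y + b = 0}."""
    return np.abs(X @ a + b) / np.linalg.norm(a)

def p2h_nn(X, a, b=0.0):
    """Index of the point nearest to the hyperplane (brute force, O(nd))."""
    return int(np.argmin(p2h_distances(X, a, b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # points need NOT lie on the unit hypersphere
a = rng.normal(size=8)           # hyperplane normal of arbitrary norm
nearest = p2h_nn(X, a)
```

Note that nothing here requires ||x|| = 1 for the data points; handling exactly this unnormalized setting with sub-linear query time is what the proposed NH and FH schemes address.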

Supplementary Material

MP4 File (3448016.3457240.mp4)



Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. active learning
  2. asymmetric transformation
  3. furthest neighbor search
  4. locality-sensitive hashing
  5. nearest neighbor search


Funding Sources

  • National Research Foundation Singapore under its Strategic Capability Research Centres Funding Initiative

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 248
  • Downloads (last 6 weeks): 44
Reflects downloads up to 25 Nov 2024


Cited By

  • (2024) From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying. Proceedings of the VLDB Endowment, 17(8), 1898-1910. DOI: 10.14778/3659437.3659446
  • (2024) Cluster-Based Graph Collaborative Filtering. ACM Transactions on Information Systems, 42(6), 1-24. DOI: 10.1145/3687481
  • (2024) HJG: An Effective Hierarchical Joint Graph for ANNS in Multi-Metric Spaces. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 4275-4287. DOI: 10.1109/ICDE60146.2024.00326
  • (2023) A New Sparse Data Clustering Method Based On Frequent Items. Proceedings of the ACM on Management of Data, 1(1), 1-28. DOI: 10.1145/3588685
  • (2023) Co-design Hardware and Algorithm for Vector Search. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3581784.3607045
  • (2023) DB-LSH 2.0: Locality-Sensitive Hashing With Query-Based Dynamic Bucketing. IEEE Transactions on Knowledge and Data Engineering, 36(3), 1000-1015. DOI: 10.1109/TKDE.2023.3295831
  • (2023) Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 436-449. DOI: 10.1109/ICDE55515.2023.00040
  • (2022) MQH. Proceedings of the VLDB Endowment, 16(4), 864-876. DOI: 10.14778/3574245.3574269
  • (2022) ONe Index for All Kernels (ONIAK). Proceedings of the VLDB Endowment, 15(13), 3937-3949. DOI: 10.14778/3565838.3565847
  • (2022) DB-LSH: Locality-Sensitive Hashing with Query-based Dynamic Bucketing. 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2250-2262. DOI: 10.1109/ICDE53745.2022.00214
