Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Efficient continuous kNN join over dynamic high-dimensional data

Published: 11 September 2023 Publication History

Abstract

Given a user dataset U and an object dataset I, a kNN join query in high-dimensional space returns the k nearest neighbors of each object in dataset U from the object dataset I. The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multi-media, machine learning, recommendation systems, and many more. In the real world, datasets frequently update dynamically as objects are added or removed. In this paper, we propose novel methods of continuous kNN join over dynamic high-dimensional data. We firstly propose the HDR+ Tree, which supports more efficient insertion, deletion, and batch update. Further observed that the existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, our HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest, including the precomputation of the PCA states of data items and pruning-based kNN recomputation during item deletion. For the completeness of the work, we also present the proof of computing distances in reduced dimensions of PCA in HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.

References

[1]
Dasarathy, B.V.: Nearest neighbor (nn) norms: Nn pattern classification techniques. IEEE Computer Society Tutorial (1991)
[2]
Zhang S, Li X, Zong M, Zhu X, and Wang R Efficient knn classification with different numbers of nearest neighbors IEEE Trans. Neural Netw. Learn. Sys. 2017 29 5 1774-1785
[3]
Zhou, C., Tham, C.-K.: Graphel: A graph-based ensemble learning method for distributed diagnostics and prognostics in the industrial internet of things. In: 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS), pp. 903–909 IEEE (2018)
[4]
Hartigan JA and Wong MA Algorithm as 136: A k-means clustering algorithm J. R. Stat. Soc. Ser C (applied statistics) 1979 28 1 100-108
[5]
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, and Wu AY An efficient k-means clustering algorithm: Analysis and implementation IEEE Trans. Pattern Anal. Mach. Intell. 2002 24 7 881-892
[6]
Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: Lof: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000)
[7]
Angiulli F, Basta S, and Pizzuti C Distance-based detection and prediction of outliers IEEE transactions on knowledge and data engineering 2005 18 2 145-160
[8]
Ghoting A, Parthasarathy S, and Otey ME Fast mining of distance-based outliers in high-dimensional datasets Data Min. Knowl. Disc. 2008 16 3 349-364
[9]
Lu, W., Shen, Y., Chen, S., Ooi, B.C.: Efficient processing of k nearest neighbor joins using mapreduce. (2012) arXiv preprint arXiv:1207.0141
[10]
Ning J, Chen L, Zhou C, and Wen Y Parameter k search strategy in outlier detection Pattern Recogn. Lett. 2018 112 56-62
[11]
Lin, R.A.K.-l., Shim, H.S.S.K.: Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In: Proceeding of the 21th International Conference on Very Large Data Bases, pp. 490–501. Citeseer (1995)
[12]
Zhang Y, Wu J, Wang J, and Xing C A transformation-based framework for knn set similarity search IEEE Trans. Knowl. Data Eng. 2018 32 3 409-423
[13]
Amorim, L.A., Freitas, M.F., da Silva, P.H., Martins, W.S.: A fast similarity search knn for textual datasets. In: 2018 Symposium on High Performance Computing Systems (WSCAD), pp. 229–236. IEEE (2018)
[14]
Samariya D, Ma J, Aryal S, and Zhao X Detection and explanation of anomalies in healthcare data Health. Inf. Sci. Syst. 2023 11 1 20
[15]
Ashour AS, Hawas AR, and Guo Y Comparative study of multiclass classification methods on light microscopic images for hepatic schistosomiasis fibrosis diagnosis Health. Inf. Sci. Syst. 2018 6 1-12
[16]
Bajaj V, Taran S, and Sengur A Emotion classification using flexible analytic wavelet transform for electroencephalogram signals Health. Inf. Sci. Syst. 2018 6 1-7
[17]
Chen C, Zhu Q, Wu Y, Sun R, Wang X, and Liu X Efficient critical relationships identification in bipartite networks World. Wide. Web. 2022 25 2 741-761
[18]
Rabie AH and Saleh AI A new diagnostic autism spectrum disorder (DASD) strategy using ensemble diagnosis methodology based on blood tests Health. Inf. Sci. Syst. 2023 11 1 36
[19]
Tweets, M.: Twitter official blog [web-page]. 22 feb. Electronic resource https://blog.twitter.com/official/en_us/a/2010/measuring-tweets.html (2010)
[20]
Böhm C and Krebs F The k-nearest neighbour join: Turbo charging the kdd process Knowl. Inf. Syst. 2004 6 6 728-749
[21]
Xia, C., Lu, H., Ooi, B.C., Hu, J.: Gorder: an efficient method for knn join processing. In: Proceedings of the Thirtieth International Conference on Very Large Data bases-Volume 30, pp. 756–767. (2004)
[22]
Yu C, Cui B, Wang S, and Su J Efficient index-based knn join processing for high-dimensional data Inf. Softw. Technol. 2007 49 4 332-344
[23]
Wang, J., Lin, L., Huang, T., Wang, J., He, Z.: Efficient k-nearest neighbor join algorithms for high dimensional sparse data. (2010) arXiv preprint arXiv:1011.2807
[24]
Zhang, C., Li, F., Jestes, J.: Efficient parallel knn joins for large data in mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 38–49 (2012)
[25]
Yu C, Zhang R, Huang Y, and Xiong H High-dimensional knn joins with incremental updates Geoinformatica 2010 14 1 55-82
[26]
Yang, C., Yu, X., Liu, Y.: Continuous knn join processing for real-time recommendation. In: 2014 IEEE International Conference on Data Mining, pp. 640–649. IEEE (2014)
[27]
Ukey, N., Yang, Z., Zhang, G., Liu, B., Li, B., Zhang, W.: Efficient knn join over dynamic high-dimensional data. In: Australasian Database Conference, pp. 63–75. Springer (2022)
[28]
Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., El Abbadi, A.: Vector approximation based indexing for non-uniform high dimensional data sets. In: Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 202–209 (2000)
[29]
Cui B, Coi BC, Su J, and Tan K-L Indexing high-dimensional data for efficient in-memory similarity search IEEE Trans. Knowl. Data Eng. 2005 17 3 339-353
[30]
Chakrabarti, K., Mehrotra, S.: Local dimensionality reduction: A new approach to indexing high dimensional spaces. Technical Report, TR-MARS-00-04, University of California at Irvin (2000). http://www-db.ics.uci.edu/pages/publications/
[31]
Cheema MA, Zhang W, Lin X, and Zhang Y Efficiently processing snapshot and continuous reverse k nearest neighbors queries The VLDB Journal 2012 21 5 703-728
[32]
Berchtold, S., Böhm, C., Kriegal, H.-P.: The pyramid-technique: Towards breaking the curse of dimensionality. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 142–153 (1998)
[33]
Yu, C., Ooi, B.C., Tan, K.-L., Jagadish, H.: Indexing the distance: An efficient method to knn processing. In: Vldb, vol. 1, pp. 421–430. (2001)
[34]
Jagadish HV, Ooi BC, Tan K-L, Yu C, and Zhang R idistance: An adaptive b+-tree based indexing method for nearest neighbor search ACM Trans. Database Syst. (TODS) 2005 30 2 364-397
[35]
Cui, B., Ooi, B.C., Su, J., Tan, K.-L.: Contorting high dimensional data for efficient main memory knn processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 479–490 (2003)
[36]
Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB, vol. 98, pp. 194–205. (1998)
[37]
Ooi, B.C., Tan, K.-L., Yu, C., Bressan, S.: Indexing the edges-a simple and yet efficient approach to high-dimensional indexing. In: Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 166–174. (2000)
[38]
Pillai, K.G., Sturlaugson, L., Banda, J.M., Angryk, R.A.: Extending high-dimensional indexing techniques pyramid and iminmax (θ): Lessons learned. In: Big Data: 29th British National Conference on Databases, BNCOD 2013, Oxford, UK, July 8-10, 2013. Proceedings 29, pp. 253–267. Springer (2013)
[39]
Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. In: Vldb, vol. 99, pp. 518–529 (1999)
[40]
Hu Y, Yang C, Zhan P, Zhao J, Li Y, and Li X Efficient continuous knn join processing for real-time recommendation Pers. Ubiquit. Comput. 2021 25 6 1001-1011
[41]
Böhm, C., Krebs, F.: Supporting kdd applications by the k-nearest neighbor join. In: International Conference on Database and Expert Systems Applications, pp. 504–516. Springer (2003)
[42]
Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pp. 47–57 (1984)
[43]
Ukey N, Yang Z, Li B, Zhang G, Hu Y, and Zhang W Survey on exact knn queries over high-dimensional data space Sensors 2023 23 2 629
[44]
Jiaqi, J., Chung, Y.: Research on k nearest neighbor join for big data. In: 2017 IEEE International Conference on Information and Automation (ICIA), pp. 1077–1081. IEEE (2017)
[45]
Souza, V., Carvalho, L.O., de Oliveira, D., Bedo, M., Santos, L.F.: Adding result diversification to k nn-based joins in a map-reduce framework. In: International Conference on Database and Expert Systems Applications, pp. 68–83. Springer (2023)
[46]
Nalepa, F., Batko, M., Zezula, P.: Speeding up continuous knn join by binary sketches. In: Advances in Data Mining. Applications and Theoretical Aspects: 18th Industrial Conference, ICDM 2018, New York, NY, USA, July 11-12, 2018, Proceedings 18, pp. 183–198. Springer (2018)
[47]
Shahvarani, A., Jacobsen, H.-A.: Distributed stream knn join. In: Proceedings of the 2021 International Conference on Management of Data, pp. 1597–1609 (2021)
[48]
Lee, H., Chang, J.-W., Chae, C.: knn-join query processing algorithm on mapreduce for large amounts of data. In: 2021 International Symposium on Electrical, Electronics and Information Engineering, pp. 538–544 (2021)
[49]
Allheeib, N., Adhinugraha, K., Taniar, D., Islam, Md. Saiful.: Computing reverse nearest neighbourhood on road maps. World. Wide. Web. 1–32 (2022)
[50]
Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: Vldb, vol. 97, pp. 426–435 (1997)
[51]
Chakrabarti, K., Mehrotra, S.: Local dimensionality reduction: A new approach to indexing high dimensional spaces. In: VLDB Conference (2000)
[52]
Achlioptas, D.: Database-friendly random projections. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281 (2001)
[53]
Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, and Equitz W Efficient and effective querying by image content Journal of intelligent information systems 1994 3 231-262
[54]
Leon SJ, Björck Å, and Gander W Gram-schmidt orthogonalization: 100 years and more Numer. Linear Algebra Appl. 2013 20 3 492-532
[55]
Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: Nus-wide: a real-world Web image database from national university of singapore. In: Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 1–9. (2009)

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image World Wide Web
World Wide Web  Volume 26, Issue 6
Nov 2023
448 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 11 September 2023
Accepted: 22 August 2023
Revision received: 22 August 2023
Received: 27 February 2023

Author Tags

  1. K nearest neighbors
  2. KNN join
  3. Dynamic data
  4. High-dimensional data

Qualifiers

  • Research-article

Funding Sources

  • University of New South Wales

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 0
    Total Downloads
  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Nov 2024

Other Metrics

Citations

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media