Nothing Special   »   [go: up one dir, main page]

Skip to main content

Survey on KNN Methods in Data Science

  • Conference paper
  • First Online:
Learning and Intelligent Optimization (LION 2022)

Abstract

The k-nearest neighbors (KNN) algorithm remains a useful and widely applied approach. In the recent years, we have seen many advances in KNN methods, but few research works give a holistic account of all aspects of KNN and the progress made. This paper is a brief survey on modern KNN methods and their role in data science. Furthermore, we survey: the challenges, how they are approached in the literature, the impact of the distance metric, several KNN variations, as well as query methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 79.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 99.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Alfeilat, H.A., et al.: Effects of distance measure choice on K-nearest neighbor classifier performance: a review. Big Data, 7 (2019)

    Google Scholar 

  2. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. (1991)

    Google Scholar 

  3. Ando, S.: Classifying imbalanced data in distance-based feature space. Knowl. Inf. Syst. 46 (2016)

    Google Scholar 

  4. Andoni, A., Razenshteyn, I.: Optimal data-dependent hashing for approximate near neighbors. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, pp. 793–801. Association for Computing Machinery, New York, NY (2015). ISBN 9781450335362

    Google Scholar 

  5. Arnaiz-González, Á., Díez-Pastor, J.-F., Rodríguez, J.J., García-Osorio, C.: Instance selection of linear complexity for big data. Knowl.-Based Syst. 107, 83–95 (2016)

    Article  Google Scholar 

  6. Babenko, A., Lempitsky, V.: The inverted multi-index. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1247–1260 (2014)

    Article  Google Scholar 

  7. Bandaragoda, T.R., Ting, K.M., Albrecht, D., Liu, F.T., Wells, J.R.: Efficient anomaly detection by isolation using nearest neighbour ensemble. In: 2014 IEEE International Conference on Data Mining Workshop, pp. 698–705. IEEE (2014)

    Google Scholar 

  8. Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104 (2006)

    Google Scholar 

  9. Chatzimilioudis, G., Costa, C., Zeinalipour-Yazti, D., Lee, W.-C., Pitoura, E.: Distributed in-memory processing of all k nearest neighbor queries. IEEE Trans. Knowl. Data Eng. 28(4), 925–938 (2015)

    Article  Google Scholar 

  10. Cunningham, P., Delany, S.: k-nearest neighbour classifiers. Mult Classif. Syst. 54, 04 (2007)

    Google Scholar 

  11. Cunningham, P., Delany, S.J.: k-nearest neighbour classifiers - a tutorial. ACM Comput. Surv. (CSUR) 54(6), 1–25 (2021)

    Article  Google Scholar 

  12. Derrac, J., García, S., Herrera, F.: Fuzzy nearest neighbor algorithms: taxonomy, experimental analysis and prospects. Inf. Sci. 260, 98–119 (2014)

    Article  Google Scholar 

  13. Dudani, S.A.: The distance-weighted k-nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. SMC-6(4), 325–327 (1976)

    Google Scholar 

  14. Fernández, A., del Río, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data classification: outcomes and challenges. Complex Intell. Syst. 3(2), 105–120 (2017)

    Article  Google Scholar 

  15. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. (TOMS) 3(3), 209–226 (1977)

    Article  MATH  Google Scholar 

  16. Garcia, S., Derrac, J., Cano, J., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012)

    Article  Google Scholar 

  17. García-Pedrajas, N., Romero del Castillo, J.A., Cerruela-García, G.: A proposal for local \(k\) values for \(k\) -nearest neighbor rule. IEEE Trans. Neural Netw. Learn. Syst. 28(2), 470–475 (2017)

    Article  Google Scholar 

  18. Goldberger, J., Hinton, G.E., Roweis, S., Salakhutdinov, R.R.: Neighbourhood components analysis. In: Saul, L., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems, vol. 17. MIT Press, Cambridge (2004)

    Google Scholar 

  19. Gou, J., Du, L., Zhang, Y., Xiong, T.: A new distance-weighted k-nearest neighbor classifier. J. Inf. Comput. Sci. 9, 1429–1436 (2012)

    Google Scholar 

  20. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. (2009)

    Google Scholar 

  21. He, J., Liu, W., Chang, S.-F.: Scalable similarity search with optimized kernel hashing. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1129–1138 (2010)

    Google Scholar 

  22. He, X., Niyogi, P.: Locality preserving projections. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2003)

    Google Scholar 

  23. Hu, L.-Y., Huang, M.-W., Ke, S.-W., Tsai, C.-F.: The distance function effect on k-nearest neighbor classification for medical datasets. Springerplus 5(1), 1–9 (2016). https://doi.org/10.1186/s40064-016-2941-7

    Article  Google Scholar 

  24. Indyk, P.: Nearest neighbors in high-dimensional spaces (2004)

    Google Scholar 

  25. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Conference Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 604–613, October 2000

    Google Scholar 

  26. Iwamura, M., Sato, T., Kise, K.: What is the most efficient way to select nearest neighbor candidates for fast approximate nearest neighbor search? In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3535–3542 (2013)

    Google Scholar 

  27. Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2010)

    Article  Google Scholar 

  28. Jin, Z., Li, C., Lin, Y., Cai, D.: Density sensitive hashing. IEEE Trans. Cybern. 44(8), 1362–1371 (2013)

    Article  Google Scholar 

  29. Kim, W., Kim, Y., Shim, K.: Parallel computation of k-nearest neighbor joins using mapreduce. In: 2016 IEEE International Conference on Big Data (Big Data), pp. 696–705. IEEE (2016)

    Google Scholar 

  30. Li, J., et al.: Feature selection: a data perspective. ACM Comput. Surv. (CSUR) 50(6), 1–45 (2017)

    Article  Google Scholar 

  31. Li, S., Harner, E.J., Adjeroh, D.A.: Random KNN feature selection - a fast and stable alternative to random forests. BMC Bioinform. 12(1), 1–11 (2011)

    Article  Google Scholar 

  32. Liu, W., Chawla, S.: Class confidence weighted kNN algorithms for imbalanced data sets. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD 2011. LNCS (LNAI), vol. 6635, pp. 345–356. Springer, Heidelberg (2011). ISBN 978-3-642-20847-8. https://doi.org/10.1007/978-3-642-20847-8_29

  33. Loizou, G., Maybank, S.J.: The nearest neighbor and the Bayes error rates. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9(2), 254–262 (1987)

    Google Scholar 

  34. Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: KNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl.-Based Syst. 117, 3–15 (2017)

    Article  Google Scholar 

  35. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2 (331–340), 2 (2009)

    Google Scholar 

  36. Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014)

    Article  Google Scholar 

  37. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 2, pp. 2161–2168. IEEE (2006)

    Google Scholar 

  38. Pang, G., Ting, K.M., Albrecht, D.: LeSiNN: detecting anomalies by identifying least similar nearest neighbours. In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pp. 623–630. IEEE (2015)

    Google Scholar 

  39. Park, C.H., Kim, S.B.: Sequential random k-nearest neighbor feature selection for high-dimensional data. Expert Syst. Appl. 42(5), 2336–2342 (2015)

    Article  Google Scholar 

  40. Patwary, M.M.A., et al.: Panda: extreme scale parallel k-nearest neighbor on distributed architectures. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 494–503. IEEE (2016)

    Google Scholar 

  41. Rogati, M., Yang, Y.: High-performing feature selection for text classification. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 659–661 (2002)

    Google Scholar 

  42. Shalev-Shwartz, S., Singer, Y., Ng, A.Y.: Online and batch learning of pseudo-metrics. In: Proceedings of the Twenty-First International Conference on Machine Learning, ICML 2004, pp. 94. Association for Computing Machinery, New York (2004)

    Google Scholar 

  43. Shimomura, L.C., Oyamada, R.S., Vieira, M.R., Kaster, D.S.: A survey on graph-based methods for similarity searches in metric spaces. Inf. Syst. 95, 101507 (2021)

    Article  Google Scholar 

  44. Silpa-Anan, C., Hartley, R.: Optimised KD-trees for fast image descriptor matching. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)

    Google Scholar 

  45. Sisodia, D., Sisodia, D.S.: Quad division prototype selection-based k-nearest neighbor classifier for click fraud detection from highly skewed user click dataset. Int. J. Eng. Sci. Technol. 28, 101011 (2022)

    Google Scholar 

  46. Song, Y., Liang, J., Lu, J., Zhao, X.: An efficient instance selection algorithm for k nearest neighbor regression. Neurocomputing 251, 26–34 (2017)

    Article  Google Scholar 

  47. Susan, S., Kumar, A.: DST-ML-EkNN: data space transformation with metric learning and elite k-nearest neighbor cluster formation for classification of imbalanced datasets. In: Chiplunkar, N.N., Fukao, T. (eds.) Advances in Artificial Intelligence and Data Engineering. AISC, vol. 1133, pp. 319–328. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-3514-7_26

    Chapter  Google Scholar 

  48. Tahir, M.A., Bouridane, A., Kurugollu, F.: Simultaneous feature selection and feature weighting using hybrid Tabu search/K-nearest neighbor classifier. Pattern Recogn. Lett. 28(4), 438–446 (2007)

    Article  Google Scholar 

  49. Tang, B., He, H.: ENN: extended nearest neighbor method for pattern recognition [research frontier]. IEEE Comput. Intell. Mag. 10(3), 52–60 (2015)

    Article  Google Scholar 

  50. Ting, K.M., Washio, T., Wells, J.R., Aryal, S.: Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach. Learn. 106(1), 55–91 (2017)

    Article  MathSciNet  MATH  Google Scholar 

  51. Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F.: MRPR: a mapreduce solution for prototype reduction in big data classification. Neurocomputing 150, 331–345 (2015)

    Google Scholar 

  52. Triguero, I., García-Gil, D., Maillo, J., Luengo, J., García, S., Herrera, F.: Transforming big data into smart data: an insight on the use of the k-nearest neighbors algorithm to obtain quality data. WIREs Data Min. Knowl. Discov. 9(2) (2019)

    Google Scholar 

  53. Vasuki, A., Vanathi, P.: A review of vector quantization techniques. IEEE Potentials 25(4), 39–47 (2006)

    Article  Google Scholar 

  54. Vincent, P., Bengio, Y.: K-local hyperplane and convex distance nearest neighbor algorithms. In: Dietterich, T., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems, vol. 14. MIT Press, Cambridge (2001)

    Google Scholar 

  55. Wang, A., An, N., Chen, G., Li, L., Alterovitz, G.: Accelerating wrapper-based feature selection with k-nearest-neighbor. Knowl.-Based Syst. 83, 81–91 (2015)

    Article  Google Scholar 

  56. Wang, J., Neskovic, P., Cooper, L.N.: Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence. Pattern Recogn. 39(3), 417–423 (2006)

    Article  MATH  Google Scholar 

  57. Wang, J., Zhang, T., Song, J., Sebe, N., Shen, H.T.: A survey on learning to hash. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 769–790 (2018)

    Google Scholar 

  58. Wang, M., Xu, X., Yue, Q., Wang, Y.: A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. arXiv preprint arXiv:2101.12631 (2021)

  59. Weinberger, K., Blitzer, J., Saul, L.: Distance metric learning for large margin nearest neighbor classification, January 2005

    Google Scholar 

  60. Wettschereck, D., Dietterich, T.: Locally adaptive nearest neighbor algorithms. In: Cowan, J., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems, vol. 6. Morgan-Kaufmann, Burlington (1993)

    Google Scholar 

  61. Wu, Z., Yu, J.: Vector quantization: a review. Front. Inf. Technol. Electron. Eng. 20(4), 507–524 (2019). https://doi.org/10.1631/FITEE.1700833

    Article  Google Scholar 

  62. Xiao, C., Chaovalitwongse, W.A.: Optimization models for feature selection of decomposed nearest neighbor. IEEE Trans. Syst. Man Cybern. Syst. 46(2), 177–184 (2016)

    Article  Google Scholar 

  63. Xing, E., Jordan, M., Russell, S.J., Ng, A.: Distance metric learning with application to clustering with side-information. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems, vol. 15. MIT Press, Cambridge (2002)

    Google Scholar 

  64. Xu, H., Wang, J., Li, Z., Zeng, G., Li, S., Yu, N.: Complementary hashing for approximate nearest neighbor search. In: 2011 International Conference on Computer Vision, pp. 1631–1638 (2011)

    Google Scholar 

  65. Yu, Z., Chen, H., Liu, J., You, J., Leung, H., Han, G.: Hybrid \(k\)-nearest neighbor classifier. IEEE Trans. Cybern. 46(6), 1263–1275 (2016)

    Google Scholar 

  66. Zhang, S.: Challenges in KNN classification. IEEE Trans. Knowl. Data Eng. 1 (2021)

    Google Scholar 

  67. Zhang, S., Li, X., Zong, M., Zhu, X., Cheng, D.: Learning k for KNN classification. ACM Trans. Intell. Syst. Technol. (TIST) 8(3), 1–19 (2017)

    Google Scholar 

  68. Zhang, S., Li, X., Zong, M., Zhu, X., Wang, R.: Efficient KNN classification with different numbers of nearest neighbors. IEEE Trans. Neural Netw. Learn. Syst. 29(5), 1774–1785 (2017)

    Article  MathSciNet  Google Scholar 

  69. Zhang, S., Cheng, D., Deng, Z., Zong, M., Deng, X.: A novel KNN algorithm with data-driven k parameter computation. Pattern Recogn. Lett. 109, 44–54 (2018). Special Issue on Pattern Discovery Multi-Source Data (PDMSD)

    Google Scholar 

  70. Zhang, X., Li, Y., Kotagiri, R., Wu, L., Tari, Z., Cheriet, M.: KRNN: k rare-class nearest neighbour classification. Pattern Recogn. 62, 33–44 (2017)

    Article  Google Scholar 

  71. Zhang, X., Xiao, H., Gao, R., Zhang, H., Wang, Y.: K-nearest neighbors rule combining prototype selection and local feature weighting for classification. Knowl.-Based Syst. 243, 108451 (2022)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Panos K. Syriopoulos .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Syriopoulos, P.K., Kotsiantis, S.B., Vrahatis, M.N. (2022). Survey on KNN Methods in Data Science. In: Simos, D.E., Rasskazova, V.A., Archetti, F., Kotsireas, I.S., Pardalos, P.M. (eds) Learning and Intelligent Optimization. LION 2022. Lecture Notes in Computer Science, vol 13621. Springer, Cham. https://doi.org/10.1007/978-3-031-24866-5_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-24866-5_28

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24865-8

  • Online ISBN: 978-3-031-24866-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics