Abstract
Although the \(k\)-NN classifier is a popular classification method, it suffers from high computational cost and storage requirements. This paper proposes two effective cluster-based data reduction algorithms for efficient \(k\)-NN classification. Both have low preprocessing cost and achieve high data reduction rates while keeping \(k\)-NN classification accuracy at high levels. The first algorithm, reduction through homogeneous clusters (RHC), is based on a fast preprocessing clustering procedure that creates homogeneous clusters; the centroids of these clusters constitute the reduced training set. The second algorithm is a dynamic version of RHC that retains all its properties and, in addition, can manage datasets that do not fit in main memory, making it appropriate for dynamic environments where new training data become gradually available. Experimental results on fourteen datasets show that both algorithms are faster and achieve higher reduction rates than four known methods, while maintaining high classification accuracy.
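The RHC idea summarized above can be sketched in a few lines: the training set is recursively split by \(k\)-Means seeded with one centroid per class (the class means) until every cluster is class-homogeneous, and each homogeneous cluster is then replaced by its centroid. The sketch below is a minimal illustration based only on this high-level description; the function name `rhc`, the fixed number of Lloyd iterations, and the degenerate-split fallback are assumptions, not the authors' implementation.

```python
import numpy as np

def rhc(X, y):
    """Sketch of Reduction through Homogeneous Clusters (RHC).

    Recursively split non-homogeneous clusters with k-Means seeded
    by the per-class means; keep one centroid per homogeneous cluster.
    """
    prototypes, proto_labels = [], []
    queue = [(X, y)]
    while queue:
        Xc, yc = queue.pop()
        classes = np.unique(yc)
        if len(classes) == 1:
            # Homogeneous cluster: its centroid becomes a prototype.
            prototypes.append(Xc.mean(axis=0))
            proto_labels.append(classes[0])
            continue
        # Seed k-Means with one centroid per class (the class means).
        seeds = np.array([Xc[yc == c].mean(axis=0) for c in classes])
        for _ in range(10):  # a few Lloyd iterations suffice for a sketch
            dists = ((Xc[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            for j in range(len(seeds)):
                if np.any(assign == j):
                    seeds[j] = Xc[assign == j].mean(axis=0)
        nonempty = [j for j in range(len(seeds)) if np.any(assign == j)]
        if len(nonempty) < 2:
            # Degenerate split (e.g., coincident points of different
            # classes): fall back to one prototype per class.
            for c in classes:
                prototypes.append(Xc[yc == c].mean(axis=0))
                proto_labels.append(c)
            continue
        for j in nonempty:
            queue.append((Xc[assign == j], yc[assign == j]))
    return np.array(prototypes), np.array(proto_labels)
```

A \(k\)-NN classifier is then run against the returned prototypes instead of the full training set, which is where the speed and storage savings come from.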
Notes
DRTs can be viewed from two points of view: (1) item reduction and (2) dimensionality reduction. We consider them from the first point of view.
One should not confuse \(k\) of \(k\)-NN with \(k\) of \(k\)-Means.
Many datasets distributed by the KEEL repository have been preprocessed to remove duplicates.
Acknowledgments
We are grateful to the anonymous reviewers for their valuable comments on the original form of the paper.
Additional information
S. Ougiaroglou is supported by a scholarship from the state scholarships foundation of Greece (I.K.Y.).
Cite this article
Ougiaroglou, S., Evangelidis, G. RHC: a non-parametric cluster-based data reduction for efficient \(k\)-NN classification. Pattern Anal Applic 19, 93–109 (2016). https://doi.org/10.1007/s10044-014-0393-7