Abstract
Although the \(k\)-NN classifier is a popular classification method, it suffers from high computational cost and storage requirements. This paper proposes two effective cluster-based data reduction algorithms for efficient \(k\)-NN classification. Both have low preprocessing cost and achieve high data reduction rates while keeping \(k\)-NN classification accuracy at high levels. The first algorithm, reduction through homogeneous clusters (RHC), is based on a fast preprocessing clustering procedure that creates homogeneous clusters; the centroids of these clusters constitute the reduced training set. The second algorithm is a dynamic version of RHC that retains all its properties and, in addition, can manage datasets that do not fit in main memory, making it appropriate for dynamic environments where new training data become gradually available. Experimental results on fourteen datasets show that both algorithms are faster and achieve higher reduction rates than four known methods, while maintaining high classification accuracy.
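The RHC idea summarized above can be sketched in a few lines: the training set is recursively split by \(k\)-Means seeded with one centroid per class (the class means) until every cluster is class-homogeneous, and each homogeneous cluster is then replaced by its centroid. The sketch below is a minimal illustration based only on this high-level description; the function name `rhc`, the fixed number of Lloyd iterations, and the degenerate-split fallback are assumptions, not the authors' implementation.

```python
import numpy as np

def rhc(X, y):
    """Sketch of Reduction through Homogeneous Clusters (RHC).

    Recursively split non-homogeneous clusters with k-Means seeded
    by the per-class means; keep one centroid per homogeneous cluster.
    """
    prototypes, proto_labels = [], []
    queue = [(X, y)]
    while queue:
        Xc, yc = queue.pop()
        classes = np.unique(yc)
        if len(classes) == 1:
            # Homogeneous cluster: its centroid becomes a prototype.
            prototypes.append(Xc.mean(axis=0))
            proto_labels.append(classes[0])
            continue
        # Seed k-Means with one centroid per class (the class means).
        seeds = np.array([Xc[yc == c].mean(axis=0) for c in classes])
        for _ in range(10):  # a few Lloyd iterations suffice for a sketch
            dists = ((Xc[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            for j in range(len(seeds)):
                if np.any(assign == j):
                    seeds[j] = Xc[assign == j].mean(axis=0)
        nonempty = [j for j in range(len(seeds)) if np.any(assign == j)]
        if len(nonempty) < 2:
            # Degenerate split (e.g., coincident points of different
            # classes): fall back to one prototype per class.
            for c in classes:
                prototypes.append(Xc[yc == c].mean(axis=0))
                proto_labels.append(c)
            continue
        for j in nonempty:
            queue.append((Xc[assign == j], yc[assign == j]))
    return np.array(prototypes), np.array(proto_labels)
```

A \(k\)-NN classifier is then run against the returned prototypes instead of the full training set, which is where the speed and storage savings come from.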
Notes
DRTs can be viewed from two points of view: (1) item reduction and (2) dimensionality reduction. We consider them from the first point of view.
One should not confuse \(k\) of \(k\)-NN with \(k\) of \(k\)-Means.
Many datasets distributed by the KEEL repository have been preprocessed to remove duplicates.
Acknowledgments
We are grateful to the anonymous reviewers for their valuable comments on the original form of the paper.
Additional information
S. Ougiaroglou is supported by a scholarship from the state scholarships foundation of Greece (I.K.Y.).
Cite this article
Ougiaroglou, S., Evangelidis, G. RHC: a non-parametric cluster-based data reduction for efficient \(k\)-NN classification. Pattern Anal Applic 19, 93–109 (2016). https://doi.org/10.1007/s10044-014-0393-7