Abstract
Incomplete data are common in data sets used for clustering, and treating them inappropriately can significantly degrade clustering performance. To account for the uncertainty of missing attributes, we propose an interval representation of missing attributes based on nearest-neighbor information, called the nearest-neighbor interval, and present a hybrid approach that combines a genetic algorithm with fuzzy c-means for incomplete data clustering. The overall algorithm operates within the genetic algorithm framework: it searches the corresponding nearest-neighbor intervals for appropriate imputations of the missing attributes to recover the incomplete data set, while fuzzy c-means simultaneously performs the clustering analysis and provides the fitness metric for the genetic optimization. Experimental results on several real-life data sets demonstrate that the hybrid approach achieves better clustering performance than the compared methods.
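The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the distance measure, the genetic operators, or any parameter values, so everything below is an assumption made for illustration: the partial-distance measure, truncation selection, arithmetic crossover, Gaussian mutation, and the hypothetical names `partial_distance`, `nearest_neighbor_intervals`, `fcm_objective`, and `ga_impute`. Missing attributes are encoded as NaN, and each attribute is assumed to be observed in at least one other sample.

```python
import numpy as np

def partial_distance(a, b):
    """Distance over attributes observed in both vectors (assumed measure)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return np.inf
    # Rescale by the fraction of attributes that could be compared.
    return np.sqrt(a.size / mask.sum() * np.sum((a[mask] - b[mask]) ** 2))

def nearest_neighbor_intervals(X, q=3):
    """For each missing X[k, j], return (k, j, lo, hi) spanned by the values of
    attribute j among the q nearest samples that have attribute j observed."""
    intervals = []
    for k, j in zip(*np.where(np.isnan(X))):
        cands = [(partial_distance(X[k], X[i]), X[i, j])
                 for i in range(len(X)) if i != k and not np.isnan(X[i, j])]
        cands.sort(key=lambda t: t[0])
        vals = [v for _, v in cands[:q]]
        intervals.append((int(k), int(j), min(vals), max(vals)))
    return intervals

def fcm_objective(X, c=2, m=2.0, iters=50, seed=0):
    """Run plain fuzzy c-means on complete data and return the objective
    evaluated with the final memberships and distances."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)
    for _ in range(iters):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)          # cluster prototypes
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U = D ** (-2.0 / (m - 1))
        U /= U.sum(axis=0)                                   # memberships sum to 1
    return float(np.sum((U ** m) * D ** 2))

def ga_impute(X, q=3, c=2, pop=20, gens=30, seed=0):
    """Search the nearest-neighbor intervals for imputations that minimize
    the fuzzy c-means objective (illustrative operators, not the paper's)."""
    rng = np.random.default_rng(seed)
    iv = nearest_neighbor_intervals(X, q)
    rows = np.array([k for k, _, _, _ in iv], dtype=int)
    cols = np.array([j for _, j, _, _ in iv], dtype=int)
    lo = np.array([l for _, _, l, _ in iv])
    hi = np.array([h for _, _, _, h in iv])

    def fitness(genes):
        Y = X.copy()
        Y[rows, cols] = genes          # recover the data set with candidate values
        return fcm_objective(Y, c)

    P = rng.uniform(lo, hi, size=(pop, len(iv)))
    for _ in range(gens):
        f = np.array([fitness(g) for g in P])
        elite = P[np.argsort(f)[: pop // 2]]                 # truncation selection
        n_child = pop - len(elite)
        a = elite[rng.integers(len(elite), size=n_child)]
        b = elite[rng.integers(len(elite), size=n_child)]
        w = rng.random((n_child, 1))
        children = w * a + (1 - w) * b                       # arithmetic crossover
        children += rng.normal(0, 0.1, children.shape) * (hi - lo)  # mutation
        P = np.vstack([elite, np.clip(children, lo, hi)])    # stay inside intervals
    f = np.array([fitness(g) for g in P])
    Y = X.copy()
    Y[rows, cols] = P[np.argmin(f)]
    return Y, iv

# Usage: two loose clusters with one missing attribute in each.
X = np.array([[0.0, 0.1], [0.2, np.nan], [0.1, 0.0],
              [5.0, 5.1], [np.nan, 4.9], [5.2, 5.0]])
Y, iv = ga_impute(X, q=2, c=2, pop=8, gens=5)
```

Clipping each chromosome to its interval bounds keeps every candidate imputation consistent with the nearest-neighbor information, which is the role the abstract assigns to the intervals; the fuzzy c-means objective serves as the fitness value, so lower is better in this sketch.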
Communicated by T. P. Hong.
Cite this article
Li, D., Gu, H. & Zhang, L. A hybrid genetic algorithm–fuzzy c-means approach for incomplete data clustering based on nearest-neighbor intervals. Soft Comput 17, 1787–1796 (2013). https://doi.org/10.1007/s00500-013-0997-7
DOI: https://doi.org/10.1007/s00500-013-0997-7