Missing value imputation using a fuzzy clustering-based EM approach

Md. Geaur Rahman¹ &
Md Zahidul Islam¹

Abstract

Data preprocessing and cleansing play a vital role in data mining by ensuring good data quality. Data-cleansing tasks include imputing missing values, identifying outliers, and identifying and correcting noisy data. In this paper, we present a novel technique called A Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI). FEMI imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record having a missing value. To identify a group of similar records and make a guess based on that group, it applies a fuzzy clustering approach and our novel fuzzy expectation maximization algorithm. We evaluate FEMI on eight publicly available natural data sets by comparing its performance with that of five high-quality existing techniques, namely EMI, GkNN, FKMI, SVR, and IBLLS. We use thirty-two types (patterns) of missing values for each data set and two evaluation criteria, namely root mean squared error and mean absolute error. Our experimental results indicate (according to confidence interval and $t$ test analyses) that FEMI performs significantly better than EMI, GkNN, FKMI, SVR, and IBLLS.
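To make the general idea concrete, the sketch below fills a missing numerical value from fuzzy cluster memberships and then scores imputations with the two criteria named in the abstract. It is not the FEMI algorithm, whose fuzzy expectation maximization step is not described on this page. It is a simplified stand-in that assumes the cluster centers are already known, uses the standard fuzzy c-means membership formula with fuzzifier `m`, and handles numerical attributes only. The function names `memberships`, `impute`, `rmse`, and `mae` are hypothetical.

```python
import math

def memberships(record, centers, observed, m=2.0):
    """Fuzzy c-means style membership of one record in each cluster.

    Distances use only the attribute indices listed in `observed`.
    Assumption: the cluster centers are given (e.g. from a prior clustering run).
    """
    dists = [math.sqrt(sum((record[j] - c[j]) ** 2 for j in observed))
             for c in centers]
    # A record lying exactly on one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]

def impute(record, centers, m=2.0):
    """Replace each None in `record` with a membership-weighted mix of centers."""
    observed = [j for j, v in enumerate(record) if v is not None]
    u = memberships(record, centers, observed, m)
    filled = list(record)
    for j, v in enumerate(record):
        if v is None:
            filled[j] = sum(ui * c[j] for ui, c in zip(u, centers))
    return filled

def rmse(actual, imputed):
    """Root mean squared error between true and imputed values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / len(actual))

def mae(actual, imputed):
    """Mean absolute error between true and imputed values."""
    return sum(abs(a - p) for a, p in zip(actual, imputed)) / len(actual)

if __name__ == "__main__":
    centers = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]  # illustrative values only
    print(impute([1.0, None, 1.0], centers))
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because every cluster contributes in proportion to its membership degree, a record near a cluster boundary receives a blended estimate rather than the value of a single hard-assigned group, which is the motivation for using fuzzy rather than crisp clustering in this setting.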

Acknowledgments

The authors would like to thank anonymous reviewers for their valuable comments.

Author information

Authors and Affiliations

Center for Research in Complex Systems (CRiCS), School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW, 2795, Australia
Md. Geaur Rahman & Md Zahidul Islam

Corresponding author

Correspondence to Md Zahidul Islam.

About this article

Cite this article

Rahman, M.G., Islam, M.Z. Missing value imputation using a fuzzy clustering-based EM approach. Knowl Inf Syst 46, 389–422 (2016). https://doi.org/10.1007/s10115-015-0822-y

Received: 12 July 2013
Revised: 17 October 2014
Accepted: 31 January 2015
Published: 25 February 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s10115-015-0822-y
