Abstract
Many datasets contain missing attribute values. Data mining techniques are generally not applicable in the presence of missing values, so missing value management is an important preprocessing step in a data mining task. One of the most important categories of missing value management techniques is missing value imputation. This paper presents a new imputation technique based on statistical measurements. The proposed technique employs an ensemble of estimators, built separately from the positively and the negatively correlated observed attributes, to estimate the missing values. Each estimator guesses a value for a missing entry based on the mean and variance of that feature, both of which are estimated from the feature's non-missing values. The final consensus value for a missing entry is the weighted aggregation of the values produced by the different estimators. The primary weight is the attribute correlation, and the secondary weight depends on kernel functions such as kurtosis, skewness, the number of involved samples and their composition. For evaluation, missing values are deliberately introduced at random at different levels. The experiments indicate that the proposed technique achieves good accuracy in comparison with the classical methods.
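To make the idea of a correlation-weighted ensemble concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one simple estimator per observed attribute: the donor attribute's z-score is mapped onto the target feature's mean and standard deviation, with the sign taken from the correlation, and the estimates are averaged with the absolute correlation as the weight. The secondary weights mentioned in the abstract (kurtosis, skewness, sample counts) are omitted. The function name `correlation_weighted_impute`, the synthetic data and the 10% missing rate are all hypothetical choices made for illustration.

```python
import numpy as np

def correlation_weighted_impute(X):
    """Fill NaNs in a 2-D array with a correlation-weighted ensemble estimate.

    Illustrative assumption: each observed attribute k yields one estimate
    mean_j + sign(r_jk) * std_j * z_ik, weighted by |r_jk|.
    """
    X = np.asarray(X, dtype=float)
    n_cols = X.shape[1]
    observed = ~np.isnan(X)

    # Mean and standard deviation of each feature from its non-missing values.
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0  # guard against constant columns

    # Pairwise Pearson correlation over rows where both attributes are observed.
    corr = np.zeros((n_cols, n_cols))
    for j in range(n_cols):
        for k in range(n_cols):
            both = observed[:, j] & observed[:, k]
            if j != k and both.sum() >= 3:
                r = np.corrcoef(X[both, j], X[both, k])[0, 1]
                corr[j, k] = 0.0 if np.isnan(r) else r

    Z = (X - mean) / std  # z-scores; NaN wherever X is missing
    filled = X.copy()
    for i, j in zip(*np.where(~observed)):
        # Donors: attributes observed in row i that correlate with feature j.
        donors = observed[i] & (corr[j] != 0)
        weights = np.abs(corr[j, donors])
        if weights.sum() == 0:
            filled[i, j] = mean[j]  # no usable donor: fall back to the mean
            continue
        # Positively and negatively correlated donors differ only in sign.
        estimates = mean[j] + np.sign(corr[j, donors]) * std[j] * Z[i, donors]
        filled[i, j] = np.sum(weights * estimates) / weights.sum()
    return filled

# Synthetic data with one positively and one negatively correlated attribute.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
full = np.column_stack([
    a,
    2 * a + rng.normal(scale=0.1, size=200),
    -a + rng.normal(scale=0.1, size=200),
])

# Remove values completely at random at a chosen level (10% here).
mask = rng.random(full.shape) < 0.10
masked = full.copy()
masked[mask] = np.nan

imputed = correlation_weighted_impute(masked)
mean_filled = np.where(mask, np.nanmean(masked, axis=0), masked)

rmse_ensemble = np.sqrt(np.mean((imputed[mask] - full[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled[mask] - full[mask]) ** 2))
print(f"ensemble RMSE: {rmse_ensemble:.3f}, mean-imputation RMSE: {rmse_mean:.3f}")
```

Using the absolute correlation as the weight lets strongly related attributes dominate the consensus value, while the sign term allows negatively correlated attributes to contribute instead of cancelling out. Comparing against plain mean imputation on the masked entries mirrors the evaluation setup described in the abstract, where values are removed at random and then re-estimated.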
Acknowledgements
We thank anonymous reviewers for their very useful comments and suggestions.
Cite this article
Jenghara, M.M., Ebrahimpour-Komleh, H., Rezaie, V. et al. Imputing missing value through ensemble concept based on statistical measures. Knowl Inf Syst 56, 123–139 (2018). https://doi.org/10.1007/s10115-017-1118-1
DOI: https://doi.org/10.1007/s10115-017-1118-1