Missing data imputation using decision trees and fuzzy clustering with iterative learning

Sanaz Nikfalazar¹,
Chung-Hsing Yeh¹,
Susan Bedingfield¹ &
…
Hadi A. Khorshidi²

Abstract

Various imputation approaches have been proposed to address the issue of missing values in data mining and machine learning applications. To improve the accuracy of missing data imputation, this paper proposes a new method called DIFC, which integrates the merits of decision trees and fuzzy clustering into an iterative learning approach. To compare the performance of the DIFC method against five effective imputation methods, extensive experiments are conducted on six widely used datasets with numerical and categorical missing data, and with various amounts and types of missing values. The experimental results show that the DIFC method outperforms the other methods in terms of imputation accuracy. Further experiments on the effect of missing value types demonstrate the robustness of the DIFC method in dealing with different types of missing values. This paper contributes to missing data imputation research by providing an accurate and robust method.
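The abstract does not spell out the steps of DIFC, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea of combining fuzzy clustering with iterative refinement for numerical data: missing entries start at the column mean, plain fuzzy c-means is run on the filled data, and each missing entry is then replaced by the membership-weighted average of the cluster centers, repeating until the estimates stop changing. The decision tree component is omitted entirely, and every name and parameter here (`fuzzy_cmeans`, `iterative_fuzzy_impute`, `n_clusters`, the fuzzifier `m`, `max_rounds`, `tol`) is an assumption made for illustration.

```python
import numpy as np


def fuzzy_cmeans(X, n_clusters, m=2.0, n_iter=50, seed=0):
    """Plain fuzzy c-means on complete data; returns (memberships, centers)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)  # each row of memberships sums to 1
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the records.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers


def iterative_fuzzy_impute(X, n_clusters=2, max_rounds=20, tol=1e-6):
    """Hypothetical helper: iteratively re-estimate NaN entries of X.

    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    if not missing.any():
        return filled
    # Initial guess: column mean of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = col_means[np.where(missing)[1]]
    for _ in range(max_rounds):
        U, centers = fuzzy_cmeans(filled, n_clusters)
        estimate = U @ centers  # membership-weighted centers per record
        change = np.max(np.abs(estimate[missing] - filled[missing]))
        filled[missing] = estimate[missing]  # observed values are never touched
        if change < tol:
            break
    return filled


# Toy usage: two well-separated groups, one missing value in the first group.
demo = np.array([
    [1.0, 1.1],
    [0.9, 1.0],
    [1.1, np.nan],
    [5.0, 5.1],
    [5.1, 4.9],
    [4.9, 5.0],
])
demo_filled = iterative_fuzzy_impute(demo, n_clusters=2)
```

In this toy setup the missing entry starts at the overall column mean and is pulled toward the center of the cluster its record mostly belongs to. A faithful DIFC implementation would also need the decision tree step and support for categorical attributes, neither of which is attempted here.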

Acknowledgements

This project was supported through an Australian Government Research Training Program Scholarship. The authors are grateful to the editor and the anonymous reviewers for their valuable comments and suggestions.

Author information

Authors and Affiliations

Faculty of Information Technology, Monash University, Clayton, VIC, 3800, Australia
Sanaz Nikfalazar, Chung-Hsing Yeh & Susan Bedingfield
School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia
Hadi A. Khorshidi

Corresponding author

Correspondence to Sanaz Nikfalazar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nikfalazar, S., Yeh, CH., Bedingfield, S. et al. Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowl Inf Syst 62, 2419–2437 (2020). https://doi.org/10.1007/s10115-019-01427-1

Received: 28 May 2019
Revised: 20 November 2019
Accepted: 26 November 2019
Published: 11 December 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10115-019-01427-1
