Model-based clustering and outlier detection with missing data

Hung Tong¹ &
Cristina Tortora¹

Abstract

The multivariate contaminated normal (MCN) distribution is recommended in model-based clustering for data characterized by mild outliers: the model can simultaneously detect outliers automatically and produce robust parameter estimates in each cluster. However, one limitation of this approach is that it requires complete data, i.e. the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e. data sets with some values missing at random. Parameters are estimated using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization step is replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model with those of mixtures of multivariate normal and Student's t distributions for incomplete data. The simulation also includes a study of the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI machine learning repository). The results show that, while the Student's t distribution gives similar classification performance, the MCN is better at detecting outliers, with a lower false positive rate. The performance of all the techniques decreases linearly as the percentage of missing values increases.
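
To make the outlier-detection idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard two-component form of the contaminated normal density, alpha * N(mu, Sigma) + (1 - alpha) * N(mu, eta * Sigma) with alpha in (0.5, 1) and eta > 1, and it handles a partially observed point by evaluating the density on the observed coordinates only (the marginal of a normal distribution is normal with the corresponding sub-vector and sub-matrix). The function names, the use of None for a missing value, and the 0.5 threshold for flagging an outlier are illustrative assumptions; the sketch covers a single cluster with known parameters and omits the ECM updates entirely.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma) evaluated at x."""
    d = len(x)
    low = cholesky(sigma)
    # Forward substitution solves low @ z = x - mu; z'z is the squared
    # Mahalanobis distance between x and mu.
    z = []
    for i in range(d):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((x[i] - mu[i] - s) / low[i][i])
    maha = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2.0 * math.pi) + logdet + maha)


def prob_good(x, mu, sigma, alpha, eta):
    """Posterior probability that x is a typical ("good") point.

    Entries of x equal to None are treated as missing (assumption of this
    sketch) and are marginalized out by dropping the matching coordinates.
    """
    obs = [j for j, v in enumerate(x) if v is not None]
    if not obs:
        return alpha  # nothing observed: only the prior weight is available
    x_o = [x[j] for j in obs]
    mu_o = [mu[j] for j in obs]
    sig_o = [[sigma[i][j] for j in obs] for i in obs]
    sig_bad = [[eta * v for v in row] for row in sig_o]  # inflated covariance
    log_good = math.log(alpha) + normal_logpdf(x_o, mu_o, sig_o)
    log_bad = math.log(1.0 - alpha) + normal_logpdf(x_o, mu_o, sig_bad)
    diff = log_bad - log_good
    # Numerically stable evaluation of 1 / (1 + exp(diff)).
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    mu = [0.0, 0.0]
    sigma = [[1.0, 0.0], [0.0, 1.0]]
    for point in ([0.0, 0.0], [0.0, None], [6.0, None]):
        p = prob_good(point, mu, sigma, alpha=0.9, eta=10.0)
        label = "outlier" if p < 0.5 else "typical"
        print(point, round(p, 4), label)
```

In this sketch a point is labelled an outlier when its probability of being good falls below 0.5. In a full mixture, the same quantity would be computed within each cluster using the current parameter estimates, after the point has been assigned to a cluster.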

Author information

Authors and Affiliations

San José State University, San Jose, USA
Hung Tong & Cristina Tortora

Corresponding author

Correspondence to Cristina Tortora.

About this article

Cite this article

Tong, H., Tortora, C. Model-based clustering and outlier detection with missing data. Adv Data Anal Classif 16, 5–30 (2022). https://doi.org/10.1007/s11634-021-00476-1

Received: 07 January 2021
Revised: 18 August 2021
Accepted: 03 October 2021
Published: 22 January 2022
Issue Date: March 2022
DOI: https://doi.org/10.1007/s11634-021-00476-1

Mathematics Subject Classification

62H30
