Abstract
Uncertainty in data occurs in diverse applications due to measurement errors, data incompleteness, and multiple repeated measurements. Several classifiers for uncertain data have been developed to tackle this uncertainty. However, the existing classifiers do not consider the dependencies among uncertain features, even though these dependencies have a critical effect on classification accuracy. Therefore, we propose a new Bayesian classification model that considers the correlation among uncertain features. To handle the uncertainty of data, new multivariate kernel density estimators are developed to estimate the class conditional probability density function of categorical, continuous, and mixed uncertain data. Experimental results with simulated data and real-life data sets show that the proposed approach achieves higher classification accuracy than the existing approaches for classification of uncertain data.
Acknowledgements
Part of this work was supported by the Korea Institute for Advancement of Technology grant funded by the Korea Government (Grant No.: P0008691, HRD Program for Industrial Innovation) and by the research fund of the National Research Foundation of Korea (Grant No.: NRF-2019R1F1A1042307). We thank the anonymous reviewers whose comments and suggestions helped improve and clarify this manuscript.
Appendix: Derivation of \( E\left[ {K_{\varvec{H}} \left( {\varvec{x} - \varvec{U}_{\varvec{i}} } \right)} \right] \) in Eq. (5)
\( E\left[ {K_{\varvec{H}} \left( {\varvec{x} - \varvec{U}_{\varvec{i}} } \right)} \right] \) can be obtained by the following convolution integral:
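As a sketch, assuming a Gaussian kernel \( K_{\varvec{H}} \) with bandwidth matrix \( \varvec{H} \) (treated as a covariance-type matrix) and \( \varvec{U}_{i} \sim N(\varvec{\mu}_{i}, \varvec{\varSigma}_{i}) \), the convolution integral takes the form below. The integration variable \( \varvec{u} \) and the dimension \( d \) are notation introduced here for illustration, and the normalization convention of the kernel is an assumption.

```latex
E\left[ K_{\varvec{H}}\left( \varvec{x} - \varvec{U}_{i} \right) \right]
  = \int
    \frac{\exp\left\{ -\tfrac{1}{2}
          \left( \varvec{x} - \varvec{u} \right)^{T} \varvec{H}^{-1}
          \left( \varvec{x} - \varvec{u} \right) \right\}}
         {\left( 2\pi \right)^{d/2} \left| \varvec{H} \right|^{1/2}}
    \cdot
    \frac{\exp\left\{ -\tfrac{1}{2}
          \left( \varvec{u} - \varvec{\mu}_{i} \right)^{T} \varvec{\varSigma}_{i}^{-1}
          \left( \varvec{u} - \varvec{\mu}_{i} \right) \right\}}
         {\left( 2\pi \right)^{d/2} \left| \varvec{\varSigma}_{i} \right|^{1/2}}
    \, d\varvec{u}
```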
Using the factorization of quadratic forms, Equation (A.1) can be represented as follows:
where \( \varvec{c} = \left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right)^{ - 1} \left( {\varvec{H}^{ - 1} \varvec{x} +\varvec{\varSigma}_{i}^{ - 1}\varvec{\mu}_{i} } \right) \) and \( \varvec{C} = \varvec{H}^{ - 1} \left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right)^{ - 1}\varvec{\varSigma}_{i}^{ - 1} = \left( {\varvec{H} +\varvec{\varSigma}_{i} } \right)^{ - 1} \).
Because \( \left| {\left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right)^{ - 1} } \right|^{{\frac{1}{2}}} = \frac{1}{{\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right|^{{\frac{1}{2}}} }} \), Eq. (A.2) can be rewritten as
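Finally, \( \left| \varvec{H} \right|\left| {\varvec{\varSigma}_{i} } \right|\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right| \) can be simplified as \( \left| {\varvec{H} +\varvec{\varSigma}_{i} } \right| \) because the determinant of a product equals the product of the determinants: \( \left| \varvec{H} \right|\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right|\left| {\varvec{\varSigma}_{i} } \right| = \left| {\varvec{H}\left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right)\varvec{\varSigma}_{i} } \right| = \left| {\varvec{\varSigma}_{i} +\varvec{H} } \right| \).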
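The quantities in this derivation can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a Gaussian kernel with bandwidth matrix \( \varvec{H} \) and \( \varvec{U}_{i} \sim N(\varvec{\mu}_{i}, \varvec{\varSigma}_{i}) \), under which the expectation should equal a Gaussian density with mean \( \varvec{\mu}_{i} \) and covariance \( \varvec{H} + \varvec{\varSigma}_{i} \) evaluated at \( \varvec{x} \). The matrices, the evaluation point, the sample size, and the helper `gaussian_pdf` are all illustrative choices.

```python
import numpy as np


def gaussian_pdf(points, mean, cov):
    """Multivariate normal density at one point (d,) or a batch (n, d)."""
    diff = np.atleast_2d(points) - mean
    d = mean.shape[0]
    cov_inv = np.linalg.inv(cov)
    # Quadratic form diff^T cov^{-1} diff for every row of diff.
    quad = np.einsum("ni,ij,nj->n", diff, cov_inv, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm


rng = np.random.default_rng(0)

# Illustrative (assumed) values; any symmetric positive definite matrices work.
H = np.array([[0.5, 0.1], [0.1, 0.3]])          # bandwidth matrix
Sigma = np.array([[1.0, -0.2], [-0.2, 0.8]])    # covariance of U_i
mu = np.array([0.5, -1.0])                      # mean of U_i
x = np.array([0.2, -0.4])                       # evaluation point

# Closed form implied by the derivation: N(x; mu, H + Sigma).
closed_form = gaussian_pdf(x, mu, H + Sigma)[0]

# Monte Carlo estimate of E[K_H(x - U_i)] with U_i ~ N(mu, Sigma).
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
mc_estimate = gaussian_pdf(x - samples, np.zeros(2), H).mean()

# Matrix identities used in the derivation.
H_inv, S_inv = np.linalg.inv(H), np.linalg.inv(Sigma)
C = H_inv @ np.linalg.inv(H_inv + S_inv) @ S_inv
det_lhs = np.linalg.det(H) * np.linalg.det(Sigma) * np.linalg.det(H_inv + S_inv)
det_rhs = np.linalg.det(H + Sigma)

print("closed form :", closed_form)
print("Monte Carlo :", mc_estimate)
print("C == (H + Sigma)^-1 :", np.allclose(C, np.linalg.inv(H + Sigma)))
print("determinant identity:", np.isclose(det_lhs, det_rhs))
```

The Monte Carlo estimate should agree with the closed form up to sampling error, and the two matrix identities should hold to floating-point precision.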
Cite this article
Kim, B., Jeong, YS. & Jeong, M.K. New multivariate kernel density estimator for uncertain data classification. Ann Oper Res 303, 413–431 (2021). https://doi.org/10.1007/s10479-020-03715-4
DOI: https://doi.org/10.1007/s10479-020-03715-4