New multivariate kernel density estimator for uncertain data classification

  • S.I.: Data Mining and Decision Analytics
Annals of Operations Research

Abstract

Uncertainty in data arises in diverse applications due to measurement errors, data incompleteness, and multiple repeated measurements. Several classifiers have been developed to handle such uncertain data. However, the existing classifiers do not consider the dependencies among uncertain features, even though these dependencies have a critical effect on classification accuracy. We therefore propose a new Bayesian classification model that accounts for the correlation among uncertain features. To handle the uncertainty of the data, new multivariate kernel density estimators are developed to estimate the class conditional probability density functions of categorical, continuous, and mixed uncertain data. Experimental results on simulated and real-life data sets show that the proposed approach outperforms existing approaches for the classification of uncertain data in terms of classification accuracy.
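The closed form derived in the Appendix, \( E\left[ {K_{\varvec{H}} \left( {\varvec{x} - \varvec{U}_{\varvec{i}} } \right)} \right] = N\left( {\varvec{x};\varvec{\mu}_{\varvec{i}}, \varvec{H} + \varvec{\varSigma}_{i} } \right) \), makes the continuous-data case of such a classifier easy to sketch. The following is a minimal illustrative sketch, not the authors' implementation: it assumes each uncertain object is summarized by a mean \( \varvec{\mu}_{\varvec{i}} \) and covariance \( \varvec{\varSigma}_{i} \), and classes are compared by prior times estimated class-conditional density. All function names are hypothetical.

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = len(mean)
    diff = np.asarray(x) - np.asarray(mean)
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def uncertain_kde(x, means, covs, H):
    """Class-conditional density at x from uncertain objects (mu_i, Sigma_i),
    using the closed form E[K_H(x - U_i)] = N(x; mu_i, H + Sigma_i)."""
    return float(np.mean([mvn_pdf(x, m, H + S) for m, S in zip(means, covs)]))

def classify(x, classes, H):
    """Assign x to the class maximizing prior * estimated conditional density."""
    n_total = sum(len(c["means"]) for c in classes.values())
    scores = {
        label: (len(c["means"]) / n_total) * uncertain_kde(x, c["means"], c["covs"], H)
        for label, c in classes.items()
    }
    return max(scores, key=scores.get)

# Toy example: two classes of uncertain 2-D objects.
I = np.eye(2)
classes = {
    "A": {"means": [np.zeros(2), np.array([0.3, -0.2])], "covs": [0.1 * I, 0.1 * I]},
    "B": {"means": [np.array([5.0, 5.0]), np.array([4.8, 5.2])], "covs": [0.1 * I, 0.1 * I]},
}
H = 0.5 * I
print(classify(np.array([0.2, 0.1]), classes, H))  # → A
```

Note that each uncertain object contributes a Gaussian bump whose spread is inflated by its own covariance \( \varvec{\varSigma}_{i} \), which is how the estimator propagates measurement uncertainty into the density estimate.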



Acknowledgements

Part of this work was supported by the Korea Institute for Advancement of Technology grant funded by the Korea Government (Grant No.: P0008691, HRD Program for Industrial Innovation) and by the research fund of the National Research Foundation of Korea (Grant No.: NRF-2019R1F1A1042307). We thank the anonymous reviewers whose comments and suggestions helped improve and clarify this manuscript.

Author information

Corresponding author

Correspondence to Young-Seon Jeong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Derivation of \( E\left[ {K_{\varvec{H}} \left( {\varvec{x} - \varvec{U}_{\varvec{i}} } \right)} \right] \) in Eq. (5)


\( E\left[ {K_{\varvec{H}} \left( {\varvec{x} - \varvec{U}_{\varvec{i}} } \right)} \right] \) can be obtained by the following convolution integral:

$$ \begin{aligned} & \mathop \int \limits_{\varvec{u}} \left( {2\pi } \right)^{{ - \frac{s}{2}}} \left| \varvec{H} \right|^{{ - \frac{1}{2}}} e^{{ - \frac{{\left( {\varvec{x} - \varvec{u}} \right)^{T} \varvec{H}^{ - 1} \left( {\varvec{x} - \varvec{u}} \right)}}{2}}} \left( {2\pi } \right)^{{ - \frac{s}{2}}} \left| {\varSigma_{i} } \right|^{{ - \frac{1}{2}}} e^{{ - \frac{{\left( {\varvec{u} -\varvec{\mu}_{\varvec{i}} } \right)^{T}\varvec{\varSigma}_{i}^{ - 1} \left( {\varvec{u} -\varvec{\mu}_{\varvec{i}} } \right)}}{2}}} d\varvec{u} \\ & \quad \varvec{ = }\mathop \int \limits_{\varvec{u}} \frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| \varvec{H} \right|^{{\frac{1}{2}}} }}\frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| {\varSigma_{i} } \right|^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\left( {\varvec{x} - \varvec{u}} \right)^{T} \varvec{H}^{ - 1} \left( {\varvec{x} - \varvec{u}} \right) + \left( {\varvec{u} -\varvec{\mu}_{\varvec{i}} } \right)^{T}\varvec{\varSigma}_{i}^{ - 1} \left( {\varvec{u} -\varvec{\mu}_{\varvec{i}} } \right)} \right) }} d\varvec{u} \\ \end{aligned} $$
(A.1)

By completing the square in \( \varvec{u} \) (factorization of the quadratic forms), Eq. (A.1) can be rewritten as follows:

$$ \begin{aligned} & \mathop \int \limits_{\varvec{u}} \frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| \varvec{H} \right|^{{\frac{1}{2}}} }}\frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| {\varSigma_{i} } \right|^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\left( {\varvec{u} - \varvec{x}} \right)^{T} \varvec{H}^{ - 1} \left( {\varvec{u} - \varvec{x}} \right) + \left( {\varvec{u} -\varvec{\mu}_{\varvec{i}} } \right)^{T}\varvec{\varSigma}_{i}^{ - 1} \left( {\varvec{u} -\varvec{\mu}_{\varvec{i}} } \right)} \right) }} d\varvec{u} \\ & \quad = \mathop \int \limits_{\varvec{u}} \frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| \varvec{H} \right|^{{\frac{1}{2}}} }}\frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| {\varSigma_{i} } \right|^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\left( {\varvec{u} - \varvec{c}} \right)^{T} \left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right)\left( {\varvec{u} - \varvec{c}} \right) + \left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right)^{T} \varvec{C}\left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right)} \right) }} d\varvec{u} \\ & \quad = \frac{{\left| {\left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right)^{ - 1} } \right|^{{\frac{1}{2}}} }}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| \varvec{H} \right|^{{\frac{1}{2}}} \left| {\varSigma_{i} } \right|^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right)^{T} \varvec{C}\left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right) }} \\ & \quad \quad \times \mathop \int \limits_{\varvec{u}} \frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| {\left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right)^{ - 1} } \right|^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\varvec{u} - \varvec{c}} \right)^{T} \left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right)\left( {\varvec{u} - \varvec{c}} \right) }} d\varvec{u} \\ & \quad = \frac{{\left| {\left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right)^{ - 1} } \right|^{{\frac{1}{2}}} }}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| \varvec{H} \right|^{{\frac{1}{2}}} \left| {\varSigma_{i} } \right|^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right)^{T} \varvec{C}\left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right) }} \\ \end{aligned} $$
(A.2)

where \( \varvec{c} = \left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right)^{ - 1} \left( {\varvec{H}^{ - 1} \varvec{x} + {\varvec{\Sigma}}_{\varvec{i}}^{ - 1}\varvec{\mu}_{\varvec{i}} } \right) \) and \( \varvec{C} = \varvec{H}^{ - 1} \left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{i}^{ - 1} } \right)^{ - 1} {\varvec{\Sigma}}_{\varvec{i}}^{ - 1} = \left( {\varvec{H} +\varvec{\varSigma}_{i} } \right)^{ - 1} \).
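As a numerical sanity check (not part of the paper), the matrix identity \( \varvec{H}^{ - 1} \left( {\varvec{H}^{ - 1} + \varvec{\varSigma}_{i}^{ - 1} } \right)^{ - 1} \varvec{\varSigma}_{i}^{ - 1} = \left( {\varvec{H} + \varvec{\varSigma}_{i} } \right)^{ - 1} \) can be verified for arbitrary symmetric positive definite matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(s):
    """Random symmetric positive definite s x s matrix."""
    A = rng.standard_normal((s, s))
    return A @ A.T + s * np.eye(s)

H, Sigma = random_spd(3), random_spd(3)
Hinv, Sinv = np.linalg.inv(H), np.linalg.inv(Sigma)

# Left-hand side: H^-1 (H^-1 + Sigma^-1)^-1 Sigma^-1
C_left = Hinv @ np.linalg.inv(Hinv + Sinv) @ Sinv
# Right-hand side: (H + Sigma)^-1
C_right = np.linalg.inv(H + Sigma)

print(np.allclose(C_left, C_right))  # → True
```

The identity follows by inverting both sides: \( \left( \varvec{C}_{\text{left}} \right)^{-1} = \varvec{\varSigma}_{i} \left( \varvec{H}^{-1} + \varvec{\varSigma}_{i}^{-1} \right) \varvec{H} = \varvec{\varSigma}_{i} + \varvec{H} \).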

Because \( \left| {\left( {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right)^{ - 1} } \right|^{{\frac{1}{2}}} = \frac{1}{{\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right|^{{\frac{1}{2}}} }} \), Eq. (A.2) can be rewritten as

$$ \begin{aligned} & \frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left| \varvec{H} \right|^{{\frac{1}{2}}} \left| {\varvec{\varSigma}_{i} } \right|^{{\frac{1}{2}}} \left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right|^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right)^{T} \left( {\varvec{H} +\varvec{\varSigma}_{i} } \right)^{ - 1} \left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right) }} \\ & \quad = \frac{1}{{\left( {2\pi } \right)^{{\frac{s}{2}}} \left( {\left| \varvec{H} \right|\left| {\varvec{\varSigma}_{\varvec{i}} } \right|\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right|} \right)^{{\frac{1}{2}}} }}e^{{ - \frac{1}{2} \left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right)^{T} \left( {\varvec{H} +\varvec{\varSigma}_{i} } \right)^{ - 1} \left( {\varvec{x} -\varvec{\mu}_{\varvec{i}} } \right) }} . \\ \end{aligned} $$

Finally, \( \left| \varvec{H} \right|\left| {\varvec{\varSigma}_{\varvec{i}} } \right|\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right| \) can be simplified as \( \left| {\varvec{H} +\varvec{\varSigma}_{i} } \right| \) because

$$ \begin{aligned} \left| \varvec{H} \right|\left| {\varvec{\varSigma}_{\varvec{i}} } \right|\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right| & = \left| {\varvec{H\varSigma }_{\varvec{i}} } \right|\left| {\varvec{H}^{ - 1} +\varvec{\varSigma}_{\varvec{i}}^{ - 1} } \right| \\ & = \left| {\varvec{H\varSigma }_{\varvec{i}} \varvec{H}^{ - 1} + \varvec{H}} \right| = \left| {\varvec{H\varSigma }_{\varvec{i}} \varvec{H}^{ - 1} + \varvec{HHH}^{ - 1} } \right| \\ & = \left| {\varvec{H}\left( {\varvec{\varSigma}_{\varvec{i}} + \varvec{H}} \right)\varvec{H}^{ - 1} } \right| = \left| \varvec{H} \right|\left| {\varvec{\varSigma}_{\varvec{i}} + \varvec{H}} \right|\left| {\varvec{H}^{ - 1} } \right| \\ & = \left| {{\varvec{\Sigma}}_{\varvec{i}} + \varvec{H}} \right|. \\ \end{aligned} $$
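Both the determinant simplification above and the final closed form \( E\left[ {K_{\varvec{H}} \left( {\varvec{x} - \varvec{U}_{\varvec{i}} } \right)} \right] = N\left( {\varvec{x};\varvec{\mu}_{\varvec{i}}, \varvec{H} + \varvec{\varSigma}_{i} } \right) \) can be checked numerically. The sketch below (a sanity check, not from the paper) verifies the determinant identity exactly and the convolution result by Monte Carlo, sampling \( \varvec{U} \sim N\left( {\varvec{\mu}, \varvec{\varSigma}} \right) \) and averaging \( K_{\varvec{H}} \left( {\varvec{x} - \varvec{U}} \right) \):

```python
import numpy as np

rng = np.random.default_rng(1)
s = 2
H = np.array([[0.5, 0.1], [0.1, 0.4]])
Sigma = np.array([[0.3, -0.05], [-0.05, 0.6]])
mu = np.array([1.0, -0.5])
x = np.array([0.2, 0.3])

# Determinant identity: |H| |Sigma| |H^-1 + Sigma^-1| = |H + Sigma|
lhs = (np.linalg.det(H) * np.linalg.det(Sigma)
       * np.linalg.det(np.linalg.inv(H) + np.linalg.inv(Sigma)))
rhs = np.linalg.det(H + Sigma)
assert np.isclose(lhs, rhs)

def gauss(v, cov):
    """Gaussian density of deviation v under covariance cov."""
    d = len(v)
    q = v @ np.linalg.solve(cov, v)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

# Monte Carlo estimate of E[K_H(x - U)] with U ~ N(mu, Sigma)
U = rng.multivariate_normal(mu, Sigma, size=200_000)
diff = x - U
q = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(H), diff)
mc = np.mean(np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** s * np.linalg.det(H)))

# Closed form: N(x; mu, H + Sigma)
closed = gauss(x - mu, H + Sigma)
print(mc, closed)
assert abs(mc - closed) / closed < 0.05
```

With 200,000 samples the Monte Carlo average agrees with the closed form to well within a few percent, consistent with the convolution of the two Gaussians derived above.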

About this article

Cite this article

Kim, B., Jeong, YS. & Jeong, M.K. New multivariate kernel density estimator for uncertain data classification. Ann Oper Res 303, 413–431 (2021). https://doi.org/10.1007/s10479-020-03715-4
