Abstract
Missing components in incomplete instances prevent a kernel-based model from using the partially observed components of those instances and from computing kernels, including the Gaussian kernels that are used extensively in machine learning modeling and applications. Existing methods for handling incomplete data with Gaussian kernels, however, assume independence among variables. In this study, we propose a new method, the expected Gaussian kernel with correlated variables, that estimates the Gaussian kernel with incomplete data by considering the correlation among variables. In the proposed method, the squared distance between two instance vectors is modeled as the sum of the correlated squared unit-dimensional distances between the instances, and the Gaussian kernel with missing values is obtained by estimating the expectation of the Gaussian kernel function under the probability distribution of the squared distance between the vectors. The proposed method is evaluated on synthetic data and on real-life data from benchmarks and a case from a multi-pattern photolithographic process for wafer fabrication in semiconductor manufacturing. The experimental results show that the proposed method improves the estimation of Gaussian kernels for incomplete data with correlated variables.
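The final step summarized above, taking the expectation of the Gaussian kernel under a distribution for the squared distance, has a closed form when that distribution is Gamma. The following is a minimal sketch under stated assumptions: the kernel is parametrized as exp(-d^2 / (2 sigma^2)), and the squared distance follows Gamma(k, theta) with shape k and scale theta. The function names and the bandwidth convention are illustrative and not taken from the article. The closed form is the Gamma moment generating function evaluated at -1 / (2 sigma^2), and it is compared with a Monte Carlo estimate.

```python
import math
import random


def expected_gaussian_kernel(k, theta, sigma):
    """E[exp(-Z / (2 sigma^2))] for Z ~ Gamma(shape=k, scale=theta).

    Uses the Gamma moment generating function E[exp(t Z)] = (1 - theta t)^(-k)
    at t = -1 / (2 sigma^2). The kernel parametrization is an assumption.
    """
    s = 1.0 / (2.0 * sigma ** 2)
    return (1.0 + theta * s) ** (-k)


def monte_carlo_kernel(k, theta, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of the same expectation, for comparison."""
    rng = random.Random(seed)
    s = 1.0 / (2.0 * sigma ** 2)
    # random.gammavariate(alpha, beta) takes the shape alpha and the scale beta.
    total = sum(math.exp(-s * rng.gammavariate(k, theta)) for _ in range(n))
    return total / n


closed = expected_gaussian_kernel(k=1.5, theta=2.0, sigma=1.0)  # equals 2 ** -1.5
approx = monte_carlo_kernel(k=1.5, theta=2.0, sigma=1.0)
print(closed, approx)
```

The closed form avoids sampling entirely, so in this sketch the cost of one kernel entry is constant once k and theta are available.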
References
Alvarez, M. A., Rosasco, L., & Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3), 195–266.
Andridge, R. R., & Little, R. J. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78(1), 40–64.
Bae, J., & Park, J. (2020). Count-based change point detection via multi-output log-Gaussian Cox processes. IISE Transactions, 52(9), 998–1013.
Belanche, L. A., Kobayashi, V., & Aluja, T. (2014). Handling missing values in kernel methods with application to microbiology data. Neurocomputing, 141, 110–116.
Cai, J. F., Candès, E. J., & Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4), 1956–1982.
Choi, J., Son, Y., & Jeong, M. K. (2021). Restricted relevance vector machine for missing data and application to virtual metrology. IEEE Transactions on Automation Science and Engineering, 19(4), 3172–3183.
Cotton, C. (1991). Functional description of the generalized edit and imputation system. Statistics Canada, Business Survey Methods Division, 59, 447–461.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Covo, S., & Elalouf, A. (2014). A novel single-gamma approximation to the sum of independent gamma variables, and a generalization to infinitely divisible distributions. Electronic Journal of Statistics, 8(1), 894–926.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (methodological), 39(1), 1–38.
Di Maio, F., Tsui, K. L., & Zio, E. (2012). Combining relevance vector machines and exponential regression for bearing residual life estimation. Mechanical Systems and Signal Processing, 31, 405–427.
Eirola, E., Doquire, G., Verleysen, M., & Lendasse, A. (2013). Distance estimation in numerical data sets with missing values. Information Sciences, 240, 115–128.
Eirola, E., Lendasse, A., Vandewalle, V., & Biernacki, C. (2014). Mixture of Gaussians for distance estimation with missing data. Neurocomputing, 131, 32–42.
Feng, Y., Wen, M., Zhang, J., Ji, F., & Ning, G. X. (2016). Sum of arbitrarily correlated Gamma random variables with unequal parameters and its application in wireless communications. In 2016 international conference on computing, networking and communications (ICNC) (pp. 1–5).
Gazzola, G., Choi, J., Kwak, D. S., Kim, B., Kim, D. M., Tong, S. H., & Jeong, M. K. (2018). Integrated variable importance assessment in multi-stage processes. IEEE Transactions on Semiconductor Manufacturing, 31(3), 343–355.
He, S., Xiao, L., Wang, Y., Liu, X., Yang, C., Lu, J., Gui, W., & Sun, Y. (2017). A novel fault diagnosis method based on optimal relevance vector machine. Neurocomputing, 267, 651–663.
Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3), 1171–1220.
Huang, K., Wen, H., Yang, C., Gui, W., & Hu, S. (2021). Outlier detection for process monitoring in industrial cyber-physical systems. IEEE Transactions on Automation Science and Engineering.
Hwang, S., Jeong, M. K., & Yum, B. J. (2014). Robust relevance vector machine with variational inference for improving virtual metrology accuracy. IEEE Transactions on Semiconductor Manufacturing, 27(1), 83–94.
Jia, S., Ma, B., Guo, W., & Li, Z. S. (2021). A sample entropy based prognostics method for lithium-ion batteries using relevance vector machine. Journal of Manufacturing Systems, 61, 773–781.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1970). Continuous univariate distributions. Houghton Mifflin.
Jurado, S., Nebot, À., Mugica, F., & Mihaylov, M. (2017). Fuzzy inductive reasoning forecasting strategies able to cope with missing data: A smart grid application. Applied Soft Computing, 51, 225–238.
Kim, B., Jeong, Y. S., & Jeong, M. K. (2021). New multivariate kernel density estimator for uncertain data classification. Annals of Operations Research, 303(1), 413–431.
Kim, J. K., & Fuller, W. (2004). Fractional hot deck imputation. Biometrika, 91(3), 559–578.
Lichman, M. (2013). UCI Machine Learning Repository. University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml.
Lin, T. H. (2010). A comparison of multiple imputation with EM algorithm and MCMC method for quality of life missing data. Quality & Quantity, 44, 277–287.
Little, R. J. (1992). Regression with missing X’s: A review. Journal of the American Statistical Association, 87(420), 1227–1237.
Little, R. J., & Rubin, D. B. (2020). Statistical analysis with missing data. John Wiley and Sons.
Meng, X. L., & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2), 267–278.
Mesquita, D. P., Gomes, J. P., Corona, F., Junior, A. H. S., & Nobre, J. S. (2019). Gaussian kernels for incomplete data. Applied Soft Computing, 77, 356–365.
Mesquita, D. P., Gomes, J. P., Junior, A. H. S., & Nobre, J. S. (2017). Euclidean distance estimation in incomplete datasets. Neurocomputing, 248, 11–18.
Nakagami, M. (1960). The m-distribution—A general formula of intensity distribution of rapid fading. In Statistical methods in radio wave propagation (pp. 3–36).
Nebot-Troyano, G., & Belanche-Muñoz, L. A. (2009). A kernel extension to handle missing data. In Research and development in intelligent systems XXVI: incorporating applications and innovations in intelligent systems XVII (pp. 165–178). Springer.
Nguyen, T. T., & Tsoy, Y. (2017). A kernel PLS based classification method with missing data handling. Statistical Papers, 58(1), 211–225.
Pelckmans, K., De Brabanter, J., Suykens, J. A., & De Moor, B. (2005). Handling missing values in support vector machine classifiers. Neural Networks, 18(5), 684–692.
Piccialli, V., & Sciandrone, M. (2022). Nonlinear optimization and support vector machines. Annals of Operations Research, 314(1), 15–47.
Genton, M. G. (Ed.). (2004). Skew-elliptical distributions and their applications: A journey beyond normality. CRC Press.
Roberts, C., & Geisser, S. (1966). A necessary and sufficient condition for the square of a random variable to be gamma. Biometrika, 53(1/2), 275–278.
Rubin, D. B. (1978). Multiple imputations in sample surveys-a phenomenological Bayesian approach to nonresponse. In Proceedings of the survey research methods section of the American Statistical Association (Vol. 1, pp. 20–34). American Statistical Association.
Sande, I. G. (1983). Hot-deck imputation procedures. Incomplete Data in Sample Surveys, 3, 339–349.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.
Schölkopf, B., Smola, A. J., & Bach, F. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press.
Sexton, J., & Swensen, A. R. (2000). ECM algorithms that converge at the rate of EM. Biometrika, 87(3), 651–662.
Shahzad, U., Sengupta, T., Rao, A., & Cui, L. (2023). Forecasting carbon emissions future prices using the machine learning methods. Annals of Operations Research. https://doi.org/10.1007/s10479-023-05188-7
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.
Smola, A. J., Schölkopf, B., & Müller, K. R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11(4), 637–649.
Smola, A. J., Vishwanathan, S. V. N., & Hofmann, T. (2005). Kernel methods for missing variables. In Proceedings of the 10th international workshop on artificial intelligence and statistics (pp. 325–332).
Son, Y., Byun, H., & Lee, J. (2016). Nonparametric machine learning models for predicting the credit default swaps: An empirical study. Expert Systems with Applications, 58, 210–220.
Sun, C. Y., Yin, Y. Z., Kang, H. B., & Ma, H. J. (2022). A quality-related fault detection method based on the dynamic data-driven algorithm for industrial systems. IEEE Transactions on Automation Science and Engineering, 19(4), 3942–3952.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–67.
Van Hulse, J., & Khoshgoftaar, T. M. (2014). Incomplete-case nearest neighbor imputation in software measurement data. Information Sciences, 259, 596–610.
Von Hippel, P. T. (2009). How to impute interactions, squares, and other transformed variables. Sociological Methodology, 39(1), 265–291.
Wang, Y., & Fu, L. (2023). Study on regional tourism performance evaluation based on the fuzzy analytic hierarchy process and radial basis function neural network. Annals of Operations Research. https://doi.org/10.1007/s10479-023-05224-6
Wei, C., Chen, J., Song, Z., & Chen, C. I. (2018). Development of self-learning kernel regression models for virtual sensors on nonlinear processes. IEEE Transactions on Automation Science and Engineering, 16(1), 286–297.
Zhang, K., Song, Z., & Guan, Y. L. (2004). Simulation of Nakagami fading channels with arbitrary cross-correlation and fading parameters. IEEE Transactions on Wireless Communications, 3(5), 1463–1468.
Zhong, Y., Ma, A., Soon Ong, Y., Zhu, Z., & Zhang, L. (2018). Computational intelligence in optical remote sensing image processing. Applied Soft Computing, 64, 75–93.
Funding
This research was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (MSIT) of Korea (No. RS-2023-00208412).
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Appendices
Appendix A: Proof of Proposition 1
For the squared distance between two real vectors \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\), a Gamma variable \(\zeta_{ij}\) is approximated from the sum of correlated Gamma variables \(\gamma_{ijp} \sim Gamma\left( {k_{ijp} ,\theta_{ijp} } \right)\) for \(p = 1, \ldots, D\), based on the approximation in Feng et al. (2016). The shape parameter \(k_{ijp}\), which is estimated using \(E\left[ {\gamma_{ijp} } \right]\) in (19) and \(Var\left( {\gamma_{ijp} } \right)\) in (20) from the moments of the missing components in the original space, satisfies the condition \(k_{ijp} \ge \frac{1}{2}\) if \(\sigma_{pp,i} + \sigma_{pp,j} > 0\):
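One way to see this bound is sketched here under assumptions that are not confirmed by the surrounding text: that \(\gamma_{ijp} = (X_{ip} - X_{jp})^{2}\) with \(X_{ip} - X_{jp}\) normally distributed, and that \(k_{ijp}\) is obtained by matching the first two moments of a Gamma distribution. The shorthand \(\mu_{ijp} = \tilde{x}_{ip} - \tilde{x}_{jp}\) and \(s_{ijp} = \sigma_{pp,i} + \sigma_{pp,j}\) (a variance) is introduced only for this sketch.

```latex
\begin{align*}
E\left[\gamma_{ijp}\right] &= \mu_{ijp}^{2} + s_{ijp}, \qquad
Var\left(\gamma_{ijp}\right) = 2 s_{ijp}^{2} + 4 \mu_{ijp}^{2} s_{ijp}, \\
k_{ijp} &= \frac{E\left[\gamma_{ijp}\right]^{2}}{Var\left(\gamma_{ijp}\right)}
         = \frac{\left(\mu_{ijp}^{2} + s_{ijp}\right)^{2}}{2 s_{ijp}\left(s_{ijp} + 2\mu_{ijp}^{2}\right)}
         = \frac{1}{2} + \frac{\mu_{ijp}^{4}}{2 s_{ijp}\left(s_{ijp} + 2\mu_{ijp}^{2}\right)}
         \ge \frac{1}{2}.
\end{align*}
```

Equality holds when \(\mu_{ijp} = 0\), in which case the Gamma distribution coincides with a scaled chi-square distribution with one degree of freedom.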
Similarly, the scale parameter \(\theta_{ijp}\) satisfies the condition \(\theta_{ijp}\) \(>\) \(0\) if \(\sigma_{pp,i} + \sigma_{pp,j}\) \(>\) \(0\):
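Assuming that \(\gamma_{ijp}\) is the square of a normal variable with mean \(\mu\) and variance \(s = \sigma_{pp,i} + \sigma_{pp,j}\), and that the Gamma parameters come from moment matching, the scale is \(\theta = Var/E = 2s(s + 2\mu^{2})/(\mu^{2} + s)\), which is strictly positive whenever \(s > 0\). The sketch below computes both parameters and checks the two conditions numerically on a small grid. The function name `match_gamma` and the symbols `mu` and `s` are hypothetical and do not appear in the article.

```python
def match_gamma(mu, s):
    """Moment-match Gamma(k, theta) to X**2, where X ~ Normal(mean=mu, variance=s).

    Assumes moment matching (k = E^2 / Var, theta = Var / E); this is an
    illustrative assumption, not a quotation of the article's estimator.
    """
    if s <= 0:
        raise ValueError("the variance s must be strictly positive")
    mean = mu ** 2 + s                      # E[X^2]
    var = 2.0 * s ** 2 + 4.0 * mu ** 2 * s  # Var(X^2)
    return mean ** 2 / var, var / mean      # (shape k, scale theta)


# A centered unit-variance difference gives the chi-square(1) case.
k0, theta0 = match_gamma(0.0, 1.0)  # (0.5, 2.0)

# Check k >= 1/2 and theta > 0 over a small grid of means and variances.
grid = [match_gamma(mu, s) for mu in (0.0, 0.5, 1.0, 3.0) for s in (0.1, 1.0, 10.0)]
all_ok = all(k >= 0.5 - 1e-12 and theta > 0.0 for k, theta in grid)
print(k0, theta0, all_ok)
```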
Appendix B: Covariance between squared unit-dimensional distances
Under the assumption of independence between the two instances \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\), the covariance between \(\gamma_{ijp}\) and \(\gamma_{ijq}\) in (20) can be rewritten as
where the two terms in (20) are
and
To compute the high-order moments of the \(i\)-th instance in (B.1), let \({\mathbf{x}}_{i(pq)} = \left[ {X_{ip} ,X_{iq} } \right]^{{\text{T}}}\), a subset of the variables in \({\mathbf{x}}_{i}\), follow a bivariate normal distribution with mean \({\tilde{\mathbf{x}}}_{{i\left( {pq} \right)}} = \left[ {\tilde{x}_{ip} , \tilde{x}_{iq} } \right]^{{\text{T}}}\) and covariance matrix \({\tilde{\mathbf{S}}}_{{i\left( {pq} \right)}} = \left[ {\begin{array}{*{20}c} {\sigma_{pp,i} } & {\sigma_{pq,i} } \\ {\sigma_{pq,i} } & {\sigma_{qq,i} } \\ \end{array} } \right]\). Let \(M\left( {\mathbf{t}} \right)\) be the moment generating function of \({\mathbf{x}}_{{i\left( {pq} \right)}}\) with argument vector \({\mathbf{t}} = \left[ {t_{p} ,t_{q} } \right]^{{\text{T}}}\), given by
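For reference, the moment generating function of a bivariate normal vector with the mean and covariance matrix defined above has the following standard form, written here in the notation of this appendix:

```latex
\begin{align*}
M\left(\mathbf{t}\right)
  &= E\left[\exp\left(\mathbf{t}^{\text{T}} \mathbf{x}_{i(pq)}\right)\right]
   = \exp\left(\mathbf{t}^{\text{T}} \tilde{\mathbf{x}}_{i(pq)}
       + \tfrac{1}{2}\, \mathbf{t}^{\text{T}} \tilde{\mathbf{S}}_{i(pq)} \mathbf{t}\right) \\
  &= \exp\left(t_{p}\tilde{x}_{ip} + t_{q}\tilde{x}_{iq}
       + \tfrac{1}{2}\left(\sigma_{pp,i} t_{p}^{2} + 2\sigma_{pq,i} t_{p} t_{q} + \sigma_{qq,i} t_{q}^{2}\right)\right).
\end{align*}
```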
High-order raw cross moments of \({\mathbf{x}}_{{i\left( {pq} \right)}}\) are given by
and, accordingly, we have
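As a sketch of what these steps yield for a bivariate normal vector, raw cross moments follow from partial derivatives of \(M(\mathbf{t})\) at \(\mathbf{t} = \mathbf{0}\). The moments listed below are standard results in this appendix's notation. Which of them the original displays contain, and under which equation numbers, is an assumption.

```latex
\begin{align*}
E\left[X_{ip}^{a} X_{iq}^{b}\right]
  &= \left.\frac{\partial^{a+b} M\left(\mathbf{t}\right)}{\partial t_{p}^{a}\, \partial t_{q}^{b}}\right|_{\mathbf{t} = \mathbf{0}}, \\
E\left[X_{ip} X_{iq}\right] &= \tilde{x}_{ip}\tilde{x}_{iq} + \sigma_{pq,i}, \\
E\left[X_{ip}^{2} X_{iq}\right] &= \tilde{x}_{iq}\left(\tilde{x}_{ip}^{2} + \sigma_{pp,i}\right) + 2\tilde{x}_{ip}\sigma_{pq,i}, \\
E\left[X_{ip} X_{iq}^{2}\right] &= \tilde{x}_{ip}\left(\tilde{x}_{iq}^{2} + \sigma_{qq,i}\right) + 2\tilde{x}_{iq}\sigma_{pq,i}, \\
E\left[X_{ip}^{2} X_{iq}^{2}\right] &= \left(\tilde{x}_{ip}^{2} + \sigma_{pp,i}\right)\left(\tilde{x}_{iq}^{2} + \sigma_{qq,i}\right)
  + 4\tilde{x}_{ip}\tilde{x}_{iq}\sigma_{pq,i} + 2\sigma_{pq,i}^{2}.
\end{align*}
```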
From (B.4) to (B.7), the covariance in (B.1) becomes
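A numerical check of this kind of result can be sketched as follows. It assumes that \(\gamma_{ijp} = (X_{ip} - X_{jp})^{2}\), that the two instances are independent bivariate normal vectors, and that the covariance then reduces to \(2c^{2} + 4\mu_{p}\mu_{q}c\) with \(c = \sigma_{pq,i} + \sigma_{pq,j}\), \(\mu_{p} = \tilde{x}_{ip} - \tilde{x}_{jp}\), and \(\mu_{q} = \tilde{x}_{iq} - \tilde{x}_{jq}\). That closed form is derived for this sketch from standard normal-moment identities and is not quoted from the article. All function names and numeric inputs are hypothetical.

```python
import math
import random


def cov_sq_dist(mean_i, cov_i, mean_j, cov_j):
    """Closed-form Cov(gamma_p, gamma_q) under the assumptions stated above.

    mean_* = (m_p, m_q); cov_* = (s_pp, s_pq, s_qq) of each bivariate normal.
    """
    mu_p = mean_i[0] - mean_j[0]
    mu_q = mean_i[1] - mean_j[1]
    c = cov_i[1] + cov_j[1]  # cross-covariance of the difference vector
    return 2.0 * c ** 2 + 4.0 * mu_p * mu_q * c


def sample_bvn(rng, mean, cov):
    """Draw one bivariate normal sample using a hand-written Cholesky factor."""
    s_pp, s_pq, s_qq = cov
    a = math.sqrt(s_pp)
    b = s_pq / a
    d = math.sqrt(s_qq - b * b)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mean[0] + a * z1, mean[1] + b * z1 + d * z2


def monte_carlo_cov(mean_i, cov_i, mean_j, cov_j, n=200_000, seed=0):
    """Sample covariance of the two squared unit-dimensional distances."""
    rng = random.Random(seed)
    gp, gq = [], []
    for _ in range(n):
        xi = sample_bvn(rng, mean_i, cov_i)
        xj = sample_bvn(rng, mean_j, cov_j)  # drawn independently of xi
        gp.append((xi[0] - xj[0]) ** 2)
        gq.append((xi[1] - xj[1]) ** 2)
    mp, mq = sum(gp) / n, sum(gq) / n
    return sum((u - mp) * (v - mq) for u, v in zip(gp, gq)) / (n - 1)


args = ((1.0, 0.5), (1.0, 0.6, 1.0), (0.0, 0.0), (0.5, 0.2, 0.8))
exact = cov_sq_dist(*args)       # 2 * 0.8**2 + 4 * 1.0 * 0.5 * 0.8 = 2.88
approx = monte_carlo_cov(*args)
print(exact, approx)
```

Setting both cross-covariances to zero makes the closed form vanish, which corresponds to the independence-among-variables setting that the abstract contrasts with the proposed method.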
About this article
Cite this article
Choi, J., Son, Y. & Jeong, M.K. Gaussian kernel with correlated variables for incomplete data. Ann Oper Res 341, 223–244 (2024). https://doi.org/10.1007/s10479-023-05656-0
DOI: https://doi.org/10.1007/s10479-023-05656-0