
Kernelized elastic net regularization: Generalization bounds, and sparse recovery

Published: 01 March 2016

Abstract

Kernelized elastic net regularization (KENReg) is a kernelization of the well-known elastic net regularization (Zou & Hastie, 2005). The kernel in KENReg is not required to be a Mercer kernel since it learns from a kernelized dictionary in the coefficient space. Feng, Yang, Zhao, Lv, and Suykens (2014) showed that KENReg has some nice properties, including stability, sparseness, and generalization. In this letter, we continue the study of KENReg by conducting a refined learning theory analysis. This letter makes three main contributions. First, we present a refined error analysis of the generalization performance of KENReg. The main difficulty in analyzing the generalization error of KENReg lies in characterizing the population version of its empirical target function. We overcome this by introducing a weighted Banach space associated with the elastic net regularization. We are then able to conduct an elaborated learning theory analysis and obtain fast convergence rates under proper complexity and regularity assumptions. Second, we study the sparse recovery problem in KENReg with fixed design and show that the kernelization may improve the sparse recovery ability compared to the classical elastic net regularization. Finally, we discuss the interplay among different properties of KENReg, including sparseness, stability, and generalization. We show that the stability of KENReg leads to generalization, and that its sparseness confidence can be derived from generalization. Moreover, KENReg can be simultaneously sparse and stable, which makes it attractive both theoretically and practically.
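To make the coefficient-space formulation concrete, the following is a minimal sketch, assuming the standard coefficient-based elastic net objective over a kernelized dictionary, min_{α ∈ R^n} (1/n) Σ_i (y_i − Σ_j α_j K(x_i, x_j))² + λ₁‖α‖₁ + λ₂‖α‖₂², where K need not be a Mercer kernel. The Gaussian kernel, the class name KENReg, and all hyperparameter values below are illustrative assumptions, not taken from the letter; the sketch simply reuses an off-the-shelf elastic net solver on the kernel matrix treated as a feature dictionary.

```python
# Hedged sketch of coefficient-based kernelized elastic net regression.
# The kernel matrix K[i, j] = K(x_i, x_j) is treated as a plain design matrix,
# so an ordinary elastic net solver can be applied to the coefficient vector.
import numpy as np
from sklearn.linear_model import ElasticNet


def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix G[i, j] = exp(-||X_i - Z_j||^2 / (2 sigma^2)) (illustrative choice)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


class KENReg:
    """Elastic net in the coefficient space of a kernelized dictionary (assumed formulation)."""

    def __init__(self, sigma=1.0, alpha=0.1, l1_ratio=0.5):
        self.sigma, self.alpha, self.l1_ratio = sigma, alpha, l1_ratio

    def fit(self, X, y):
        self.X_ = X
        K = gaussian_kernel(X, X, self.sigma)          # kernelized dictionary {K(., x_j)}
        self.model_ = ElasticNet(alpha=self.alpha, l1_ratio=self.l1_ratio,
                                 fit_intercept=False, max_iter=10000)
        self.model_.fit(K, y)                          # sparse coefficient vector alpha
        return self

    def predict(self, X):
        K = gaussian_kernel(X, self.X_, self.sigma)    # columns indexed by training points
        return K @ self.model_.coef_


# Toy usage: recover a smooth target from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)
est = KENReg(sigma=0.3, alpha=0.01).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(est.model_.coef_))
```

Because the kernel only enters as a design matrix, the same construction applies to nonsymmetric or indefinite kernels, which is the point of learning in the coefficient space rather than in a reproducing kernel Hilbert space.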

References

[1]
Bach, F. (2013). Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the Twenty-Sixth Annual Conference on Computational Learning Theory. New York: Springer.
[2]
Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33(4), 1497-1537.
[3]
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499-526.
[4]
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
[5]
Buldygin, V. V., & Kozachenko, Y. V. (2000). Metric characterization of random variables and random processes. Providence, RI: American Mathematical Society.
[6]
Candès, E. J., Romberg, J. K., & Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8), 1207-1223.
[7]
Candès, E. J., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51(12), 4203-4215.
[8]
Chen, H., Pan, Z., Li, L., & Tang, Y. (2013). Error analysis of coefficient-based regularized algorithm for density-level detection. Neural Computation, 25(4), 1107-1121.
[9]
Cucker, F., & Zhou, D.-X. (2007). Learning theory: An approximation theory viewpoint. Cambridge: Cambridge University Press.
[10]
Debruyne, M., Hubert, M., & Suykens, J. A. K. (2008). Model selection in kernel based regression using the influence function. Journal of Machine Learning Research, 9, 2377-2400.
[11]
Donoho, D. L., & Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5), 2197-2202.
[12]
Donoho, D. L., & Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7), 2845-2862.
[13]
Drezet, P.M.L., & Harrison, R. R. (1998). Support vector machines for system identification. In Proceedings of UKACC International Conference on Control (vol. 1, pp. 688-692). London: Institution of Electrical Engineers.
[14]
Dudley, R. M. (1999). Uniform central limit theorems. Cambridge: Cambridge University Press.
[15]
Feng, Y., Huang, X., Shi, L., Yang, Y., & Suykens, J. A. K. (2015). Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research, 16, 993-1034.
[16]
Feng, Y., Yang, Y., Zhao, Y., Lv, S., & Suykens, J. A. K. (2014). Learning with kernelized elastic net regularization (Internal Report, ESAT-STADIUS). Leuven, Belgium: KU Leuven. ftp://ftp.esat.kuleuven.ac.be/stadius/yfeng/KENReg2014.pdf
[17]
Györfi, L., Kohler, M., Krzyzak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. New York: Springer.
[18]
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: The approach based on influence functions. Hoboken, NJ: Wiley.
[19]
Hang, H., & Steinwart, I. (2014). Fast learning from α-mixing observations. Journal of Multivariate Analysis, 127, 184-199.
[20]
Honeine, P. (2014). Analyzing sparse dictionaries for online learning with kernels. arXiv:1409.6045
[21]
Honeine, P. (2015). Approximation errors of online sparsification criteria. IEEE Transactions on Signal Processing, 63(17), 4700-4709.
[22]
Huang, J., & Zhang, T. (2010). The benefit of group sparsity. Annals of Statistics, 38(4), 1978-2004.
[23]
Huang, J., Zhang, T., & Metaxas, D. (2011). Learning with structured sparsity. Journal of Machine Learning Research, 12, 3371-3412.
[24]
Huber, P. J., & Ronchetti, E. (2009). Robust statistics. Hoboken, NJ: Wiley.
[25]
Koltchinskii, V. (2011). Oracle inequalities in empirical risk minimization and sparse recovery problems. New York: Springer.
[26]
Koltchinskii, V., Sakhanenko, L., & Cai, S. (2007). Integral curves of noisy vector fields and statistical problems in diffusion tensor imaging: Nonparametric kernel estimation and hypotheses testing. Annals of Statistics, 35(4), 1576-1607.
[27]
Lin, S., Zeng, J., Fang, J., & Xu, Z. (2014). Learning rates of lq-coefficient regularization learning with Gaussian kernel. Neural Computation, 26(10), 2350-2378.
[28]
Lv, S. (2015). Refined generalization bounds of gradient learning over reproducing kernel Hilbert spaces. Neural Computation, 27(6), 1294-1320.
[29]
Maronna, R., Martin, D., & Yohai, V. (2006). Robust statistics: Theory and methods. Chichester: Wiley.
[30]
Pinelis, I. (1994). Optimum bounds for the distributions of martingales in Banach spaces. Annals of Probability, 22(4), 1679-1706.
[31]
Richard, C., Bermudez, J. C. M., & Honeine, P. (2009). Online prediction of time series data with kernels. IEEE Transactions on Signal Processing, 57(3), 1058-1067.
[32]
Roth, V. (2004). The generalized Lasso. IEEE Transactions on Neural Networks, 15(1), 16-28.
[33]
Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
[34]
Shi, L. (2013). Learning theory estimates for coefficient-based regularized regression. Applied and Computational Harmonic Analysis, 34(2), 252-265.
[35]
Shi, L., Feng, Y., & Zhou, D.-X. (2011). Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis, 31(2), 286-302.
[36]
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
[37]
Sun, H., & Wu, Q. (2011). Least square regression with indefinite kernels and coefficient regularization. Applied and Computational Harmonic Analysis, 30(1), 96-109.
[38]
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002). Least squares support vector machines. Singapore: World Scientific.
[39]
Tibshirani, R. J. (2013). The Lasso problem and uniqueness. Electronic Journal of Statistics, 7, 1456-1490.
[40]
Tong, H., Chen, D.-R., & Yang, F. (2010). Least square regression with lp-coefficient regularization. Neural Computation, 22(12), 3221-3235.
[41]
Tropp, J., & Gilbert, A.C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12), 4655-4666.
[42]
Vapnik, V. (1998). Statistical learning theory. Hoboken, NJ: Wiley-Interscience.
[43]
Wang, G., Yeung, D.-Y., & Lochovsky, F. H. (2007). The kernel path in kernelized Lasso. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 580-587). JMLR.
[44]
Wu, Q., & Zhou, D.-X. (2005). SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Computation, 17(5), 1160-1187.
[45]
Wu, Q., & Zhou, D.-X. (2008). Learning with sample dependent hypothesis spaces. Computers & Mathematics with Applications, 56(11), 2896-2907.
[46]
Xiao, Q.-W., & Zhou, D.-X. (2010). Learning by nonsymmetric kernels with data dependent spaces and l1-regularizer. Taiwanese Journal of Mathematics, 14(5), 1821-1836.
[47]
Xu, H., Caramanis, C., & Mannor, S. (2012). Sparse algorithms are not stable: A no-free-lunch theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 187-193.
[48]
Xu, Z., Jin, R., Shen, B., & Zhu, S. (2015). Nyström approximation for sparse kernel methods: Theoretical analysis and empirical evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
[49]
Zhang, T. (2010). Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11, 1081-1107.
[50]
Zhang, T. (2011). Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9), 6215-6221.
[51]
Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541-2563.
[52]
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301-320.

Published In

Neural Computation, Volume 28, Issue 3, March 2016, 168 pages

Publisher

MIT Press, Cambridge, MA, United States

Cited By

• (2023) New bounds for hyperparameter tuning of regression problems across instances. Proceedings of the 37th International Conference on Neural Information Processing Systems, 80066-80078. 10.5555/3666122.3669629. Online publication date: 10-Dec-2023.
• (2020) Sparse shrunk additive models. Proceedings of the 37th International Conference on Machine Learning, 6194-6204. 10.5555/3524938.3525513. Online publication date: 13-Jul-2020.
• (2017) Error analysis of regularized least-square regression with Fredholm kernel. Neurocomputing, 249:C, 237-244. 10.1016/j.neucom.2017.03.076. Online publication date: 2-Aug-2017.
• (2017) Robust Lp-norm least squares support vector regression with feature selection. Applied Mathematics and Computation, 305:C, 32-52. 10.1016/j.amc.2017.01.062. Online publication date: 15-Jul-2017.
• (2016) Error analysis of generalized Nyström kernel regression. Proceedings of the 30th International Conference on Neural Information Processing Systems, 2549-2557. 10.5555/3157382.3157383. Online publication date: 5-Dec-2016.
• (2016) Local Rademacher complexity bounds based on covering numbers. Neurocomputing, 218:C, 320-330. 10.1016/j.neucom.2016.08.074. Online publication date: 19-Dec-2016.
