Abstract
We discuss regularization methods in machine learning, with an emphasis on the interplay between statistical and computational aspects. We begin by recalling a connection between inverse problems and machine learning. We then discuss how this connection allows one to translate modeling principles into computational procedures, suggesting strategies to deal with large-scale learning problems.
Notes
- 1.
More precisely, if two continuous functions are equal almost everywhere, they are guaranteed to coincide only on \(\mathcal{X}_\rho \).
- 2.
Or rather the second moment matrix.
References
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950). http://dx.doi.org/10.2307/1990404
Dieuleveut, A., Bach, F.: Nonparametric stochastic approximation with large step-sizes. Ann. Stat. 44(4), 1363–1399 (2016). https://doi.org/10.1214/15-AOS1391
Bhatia, R.: Matrix Analysis. Graduate Texts in Mathematics, vol. 169. Springer, New York (1997)
Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7, 2399–2434 (2006)
Bertero, M., De Mol, C., Pike, E.R.: Linear inverse problems with discrete data. I. General formulation and singular system analysis. Inverse Probl. 1(4), 301 (1985)
Blanchard, G., Bousquet, O., Zwald, L.: Statistical properties of kernel principal component analysis. Mach. Learn. 66(2–3), 259–294 (2007)
Blanchard, G., Krämer, N.: Optimal learning rates for kernel conjugate gradient regression. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, pp. 226–234. Curran Associates, Inc. (2010). https://proceedings.neurips.cc/paper/2010/hash/b2f627fff19fda463cb386442eac2b3d-Abstract.html
Blanchard, G., Mücke, N.: Optimal rates for regularization of statistical inverse learning problems. Found. Comput. Math. 18(4), 971–1013 (2018)
Bottou, L., Bousquet, O.: The Tradeoffs of Large Scale Learning. In: NIPS (2007)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Bühlmann, P., Yu, B.: Boosting with the \(L_2\) loss: regression and classification. J. Am. Stat. Assoc. 98(462), 324–339 (2003)
Caponnetto, A., De Vito, E.: Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7(3), 331–368 (2007)
Caponnetto, A., Yao, Y.: Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 8, 161–183 (2010)
Coifman, R.R., Lafon, S.: Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Appl. Comput. Harmon. Anal. 21(1), 31–52 (2006)
Cucker, F., Smale, S.: On the mathematical foundations of learning. Bull. Am. Math. Soc. 39(1), 1–49 (2001)
Cucker, F., Zhou, D.X.: Learning theory: an approximation theory viewpoint, vol. 24. Cambridge University Press (2007)
De Vito, E., Caponnetto, A., Rosasco, L.: Model selection for regularized least-squares algorithm in learning theory. Found. Comput. Math. 5(1) (2005)
De Vito, E., Rosasco, L., Caponnetto, A.: Discretization error analysis for Tikhonov regularization. Anal. Appl. (Singap.) 4(1), 81–99 (2006)
De Vito, E., Rosasco, L., Caponnetto, A., Giovannini, U.D., Odone, F.: Learning from examples as an inverse problem. J. Mach. Learn. Res. 6(May), 883–904 (2005)
Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition, vol. 31. Springer Science & Business Media (2013)
Dudley, R.: Real Analysis and Probability, Cambridge Studies in Advanced Mathematics, vol. 74. Cambridge University Press, Cambridge (2002). https://doi.org/10.1017/CBO9780511755347
Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Springer (1996)
Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Birkhäuser Basel (2013)
Györfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media (2006)
Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970)
Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. J. Mach. Learn. Res.-Proc. Track 23, 9.1–9.24 (2012)
Kress, R.: Linear Integral Equations, vol. 82. Springer Science & Business Media (1989)
Landweber, L.: An iteration formula for Fredholm integral equations of the first kind. Am. J. Math. 73, 615–624 (1951)
Lang, S.: Real and functional analysis, Graduate Texts in Mathematics, vol. 142, 3rd edn. Springer, New York (1993). https://doi.org/10.1007/978-1-4612-0897-6
Lin, J., Rosasco, L.: Optimal rates for multi-pass stochastic gradient methods. J. Mach. Learn. Res. 18, 97:1–97:47 (2017). http://jmlr.org/papers/v18/17-176.html
Lin, J., Rudi, A., Rosasco, L., Cevher, V.: Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Appl. Comput. Harmon. Anal. 48(3), 868–890 (2020)
Minsker, S.: On some extensions of Bernstein's inequality for self-adjoint operators. Stat. Probab. Lett. 127, 111–119 (2017)
Mücke, N., Neu, G., Rosasco, L.: Beating SGD saturation with tail-averaging and minibatching. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 12568–12577 (2019). https://proceedings.neurips.cc/paper/2019/hash/4d0b954f0bef437c29dfa73fafdf3fa5-Abstract.html
Pagliana, N., Rosasco, L.: Implicit regularization of accelerated methods in Hilbert spaces. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 14454–14464 (2019). https://proceedings.neurips.cc/paper/2019/hash/c61aed648da48aa3893fb3eaadd88a7f-Abstract.html
Pinelis, I.: Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probab. 22(4), 1679–1706 (1994)
Pinelis, I.: Correction: Optimum bounds for the distributions of martingales in Banach spaces. [Ann. Probab. 22(4), 1679–1706 (1994); MR1331198 (96b:60010)]. Ann. Probab. 27(4), 2119 (1999)
Poggio, T., Girosi, F.: Networks for approximation and learning. Proc. IEEE 78(9), 1481–1497 (1990)
Rahimi, A., Recht, B.: Random Features for Large-Scale Kernel Machines. In: NIPS, pp. 1177–1184. Curran Associates, Inc. (2007)
Rosasco, L., Villa, S.: Learning with incremental iterative regularization. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1630–1638 (2015). https://proceedings.neurips.cc/paper/2015/hash/1587965fb4d4b5afe8428a4a024feb0d-Abstract.html
Rudi, A., Camoriano, R., Rosasco, L.: Less is more: Nyström computational regularization. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 1648–1656. Curran Associates, Inc. (2015). http://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization.pdf
Rudi, A., Camoriano, R., Rosasco, L.: Generalization properties of learning with random features. arXiv preprint arXiv:1602.04474 (2016)
Rudi, A., Canas, G.D., Rosasco, L.: On the sample complexity of subspace learning. Adv. Neural. Inf. Process. Syst. 26, 2067–2075 (2013)
Smale, S., Zhou, D.X.: Shannon sampling and function reconstruction from point values. Bull. Am. Math. Soc. 41(3), 279–305 (2004)
Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26(2), 153–172 (2007)
Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008). https://books.google.de/books?id=HUnqnrpYt4IC
Steinwart, I., Hush, D.R., Scovel, C., et al.: Optimal rates for regularized least squares regression. In: COLT, pp. 79–93 (2009)
Tropp, J.A.: User-friendly tools for random matrices: an introduction (2012)
Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
Wahba, G.: Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia (1990)
Yen, E.H., Lin, T.W., Lin, S.D., Ravikumar, P.K., Dhillon, I.S.: Sparse Random Feature Algorithm as Coordinate Descent in Hilbert Space. In: Advances in Neural Information Processing Systems 27 (2014). http://papers.nips.cc/paper/5479-sparse-random-feature-algorithm-as-coordinate-descent-in-hilbert-space.pdf
Yurinsky, V.: Sums and Gaussian Vectors, vol. 1617. Springer, Berlin (1995)
Acknowledgements
This chapter evolved from a set of notes for the Summer School "Applied Harmonic Analysis and Machine Learning" in Genova, the Summer School "Structured Regularization for High-Dimensional Data Analysis" at IHP Paris, and the Spring School "Structural Inference" in Malente, Germany. The results we present are an overview of work carried out with, and by, a number of collaborators and colleagues. Among others, we would like to thank Gilles Blanchard, Andrea Caponnetto, Luigi Carratino, Raffaello Camoriano, Junhong Lin, Nicole Mücke, Nicolò Pagliana, Steve Smale, Alessandro Verri, Silvia Villa and Ding-Xuan Zhou.
This material is based on work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. L. R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-18-1-7009, FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS-DLV-777826. Part of this work was funded by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). A. R. acknowledges the financial support of the European Research Council (grant REAL 947908). E.D.V. is a member of the Gruppo Nazionale per l'Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM).
Appendix
In this section, we collect the notation and some basic mathematical facts, adapted to our setting.
\(\mathcal {H},\mathcal{G}\) | Hilbert spaces
\(\bigl \Vert {f}\bigr \Vert _{\mathcal {H}}\) | norm of \(f\in \mathcal {H}\)
\(\left\langle {f},{g}\right\rangle _{\mathcal {H}}\) | scalar product between \(f,g\in \mathcal {H}\)
\(\mathcal K^\perp =\{f\in \mathcal {H}:\left\langle {f},{f'}\right\rangle _\mathcal {H}=0\,\forall f'\in \mathcal K\}\) | orthogonal complement of a closed subspace \(\mathcal K\subset \mathcal {H}\)
\(P_\mathcal K\) | orthogonal projection onto a closed subspace \(\mathcal K\subset \mathcal {H}\)
\(\mathop \mathrm{ker}{A}=\{f\in \mathcal {H}:Af=0\}\) | kernel of \(A:\mathcal {H}\rightarrow \mathcal{G}\)
\({\text {ran}}{A}=\{Af\in \mathcal{G}:f\in \mathcal {H}\}\) | range of \(A:\mathcal {H}\rightarrow \mathcal{G}\)
\(A^*:\mathcal{G}\rightarrow \mathcal {H}\) | adjoint of \(A:\mathcal {H}\rightarrow \mathcal{G}\)
\(|{A}|=\sqrt{A^*A}\) | absolute value of \(A:\mathcal {H}\rightarrow \mathcal{G}\)
\(\bigl \Vert {A}\bigr \Vert _\infty = \sup _{f\in \mathcal {H},\,f\ne 0} \frac{\bigl \Vert {Af}\bigr \Vert _{\mathcal{G}}}{\Vert {f}\Vert _\mathcal {H}}\) | operator norm of \(A:\mathcal {H}\rightarrow \mathcal{G}\)
\(w\otimes v = \left\langle {\cdot },{v}\right\rangle _\mathcal {H} w:\mathcal {H}\rightarrow \mathcal{G}\) | rank-one operator, with \(v\in \mathcal {H}\), \(w\in \mathcal{G}\)
\((\Omega ,\mathcal F,\mathbb P)\) | probability space
\(\mathbb E[\xi ]\) | expectation of the random variable \(\xi :\Omega \rightarrow \mathcal {H}\)
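All of the objects above reduce to matrices and vectors when \(\mathcal {H}\) and \(\mathcal{G}\) are finite dimensional, so the notation can be sanity-checked numerically. The following Python snippet is a minimal sketch of our own (not part of the chapter), taking \(\mathcal {H}={\mathbb {R}}^3\) and \(\mathcal{G}={\mathbb {R}}^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(3)   # v in H = R^3
w = rng.standard_normal(2)   # w in G = R^2

# Rank-one operator w (x) v : H -> G,  f |-> <f, v>_H w.
W = np.outer(w, v)           # its matrix
f = rng.standard_normal(3)
assert np.allclose(W @ f, np.dot(f, v) * w)

# The adjoint A* : G -> H is the transpose in this real setting,
# characterized by <A f, g>_G = <f, A* g>_H.
g = rng.standard_normal(2)
assert np.isclose(np.dot(W @ f, g), np.dot(f, W.T @ g))

# Operator norm ||A||_inf = largest singular value.
A = rng.standard_normal((2, 3))
assert np.isclose(np.linalg.norm(A, 2), np.linalg.svd(A, compute_uv=False)[0])
```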
We recall the following results; short numerical sketches of several of them, in finite dimensions, are given after the list.
- Trace class operators: an operator \(A:\mathcal {H}\rightarrow \mathcal {H}\) is trace class if for some (and hence any) orthonormal basis \((v_\ell )_{\ell \in \Lambda }\) of \(\mathcal {H}\) the series \(\sum _{\ell \in \Lambda } \left\langle {|{A}|v_\ell },{v_\ell }\right\rangle _\mathcal {H}\) converges, and
$$ {\text {Tr}}\, A = \sum _{\ell \in \Lambda } \left\langle {A v_\ell },{v_\ell }\right\rangle _\mathcal {H} $$
is called the trace of A. An operator \(A:\mathcal {H}\rightarrow \mathcal{G}\) is trace class if \(|{A}|\) is trace class, and the space \(\mathcal S_1(\mathcal {H},\mathcal{G})\) of trace class operators from \(\mathcal {H}\) to \(\mathcal{G}\) is a Banach space with respect to the norm
$$ \bigl \Vert {A}\bigr \Vert _{ \mathcal S_1(\mathcal {H},\mathcal{G})} = {\text {Tr}}(|{A}|). $$
- Hilbert–Schmidt operators: \(A:\mathcal {H}\rightarrow \mathcal{G}\) is a Hilbert–Schmidt operator if \(A^*A\) is a trace class operator. The space \(\mathcal S_2(\mathcal {H},\mathcal{G})\) of Hilbert–Schmidt operators from \(\mathcal {H}\) to \(\mathcal{G}\) is a Hilbert space with respect to the scalar product
$$ \left\langle {A},{B}\right\rangle _{ \mathcal S_2(\mathcal {H},\mathcal{G})} ={\text {Tr}}(B^*A), $$
and \(\mathcal S_1(\mathcal {H},\mathcal{G})\subset \mathcal S_2(\mathcal {H},\mathcal{G})\).
- Hilbert–Schmidt theorem: if \(A:\mathcal {H}\rightarrow \mathcal {H}\) is a Hilbert–Schmidt self-adjoint operator, there exist an orthonormal basis \((v_\ell )_{\ell \in \Lambda }\) of \(\mathcal {H}\) and a sequence \((\sigma _\ell )_{\ell \in \Lambda }\) of real numbers such that
$$\begin{aligned} A v_\ell =\sigma _\ell v_\ell . \end{aligned}$$
- Functional calculus: if \(A:\mathcal {H}\rightarrow \mathcal {H}\) is a Hilbert–Schmidt self-adjoint operator and \(\varphi :{\mathbb {R}}\rightarrow {\mathbb {R}}\) is a bounded function, by the functional calculus the operator \(\varphi (A):\mathcal {H}\rightarrow \mathcal {H}\) is defined as
$$ \varphi (A) f =\sum _{\ell \in \Lambda } \varphi (\sigma _\ell ) \left\langle {f},{v_\ell }\right\rangle _\mathcal {H} v_\ell , $$
where the series converges in \(\mathcal {H}\).
- For any operator \(A:\mathcal {H}\rightarrow \mathcal{G}\),
$$\begin{aligned} \Vert A\Vert ^2_\infty = \Vert AA^*\Vert _\infty . \end{aligned}$$ (68)
- Cordes inequality: for any pair of positive operators \(A, B:\mathcal {H}\rightarrow \mathcal {H}\) and \(s \in (0,1]\),
$$\begin{aligned} \Vert A^s B^s\Vert _\infty \le \Vert A B\Vert ^s_\infty . \end{aligned}$$ (69)
- Take two operators \(B,C:\mathcal {H}\rightarrow \mathcal{G}\) such that
$$\begin{aligned} B B^* \le CC^*; \end{aligned}$$ (70)
then
$$\begin{aligned} \bigl \Vert {AB}\bigr \Vert _\infty \le \bigl \Vert {AC}\bigr \Vert _\infty \end{aligned}$$ (71)
for any operator \(A:\mathcal{G}\rightarrow \mathcal{G}'\). Indeed, by (70),
$$ ABB^*A^* \le ACC^*A^* \qquad \Longrightarrow \qquad \bigl \Vert {ABB^*A^*}\bigr \Vert _\infty \le \bigl \Vert {ACC^*A^*}\bigr \Vert _\infty . $$
Then,
$$\begin{aligned} \bigl \Vert {AB}\bigr \Vert ^2_\infty&= \bigl \Vert {ABB^*A^*}\bigr \Vert _\infty \le \bigl \Vert {ACC^*A^*}\bigr \Vert _\infty = \bigl \Vert {AC}\bigr \Vert ^2_\infty . \end{aligned}$$
- If \(A:\mathcal {H}\rightarrow \mathcal{G}\) and \(\lambda >0\),
$$\begin{aligned} I - (A^*A + \lambda I)^{-1}A^*A = \lambda (A^*A + \lambda I)^{-1}; \end{aligned}$$ (72)
see [22].
- For any positive operator \(C:\mathcal {H}\rightarrow \mathcal {H}\), projection \(P:\mathcal {H}\rightarrow \mathcal {H}\), and \(\lambda >0\),
$$\begin{aligned} P(PCP + \lambda I)^{-1}P \le P(C + \lambda I)^{-1}P; \end{aligned}$$ (73)
see, for example, Theorem V.2.3-(iv) of [3] for the proof of this property of operator convex functions, and Cor. V.2.6 together with Exercises V.1.10-(ii) and V.2.11 in the same book for the proof of operator convexity of \((\cdot +\lambda )^{-1}\).
- For any pair \(A,B:\mathcal {H}\rightarrow \mathcal {H}\) of positive operators with bounded inverse such that
$$\begin{aligned} \bigl \Vert {B^{-\frac{1}{2}} (A-B) B^{-\frac{1}{2}}}\bigr \Vert _{\infty } < 1, \end{aligned}$$ (74)
it holds that
$$\begin{aligned} \bigl \Vert {A^{-\frac{1}{2}} B^{\frac{1}{2}} }\bigr \Vert _{\infty }&\le \sqrt{\frac{1}{1 - \bigl \Vert {B^{-\frac{1}{2}} (A-B) B^{-\frac{1}{2}}}\bigr \Vert _{\infty }}}. \end{aligned}$$ (75)
Indeed, observe that
$$\begin{aligned} (A^{-\frac{1}{2}} B^{\frac{1}{2}})^* A^{-\frac{1}{2}} B^{\frac{1}{2}}&= B^{\frac{1}{2}} A^{-1} B^{\frac{1}{2}}=B^{\frac{1}{2}} \left( A-B+B\right) ^{-1} B^{\frac{1}{2}} \\&= B^{\frac{1}{2}} \left( B^{\frac{1}{2}}\left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}+I\right) B^{\frac{1}{2}}\right) ^{-1} B^{\frac{1}{2}}\\&= \left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}+I\right) ^{-1}=\sum _{\ell =0}^{+\infty } (-1)^\ell \left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}\right) ^{\ell }, \end{aligned}$$
where the Neumann series converges by (74). Hence, the triangle inequality gives
$$\begin{aligned} \bigl \Vert {A^{-\frac{1}{2}} B^{\frac{1}{2}}}\bigr \Vert ^2_\infty&= \bigl \Vert {\left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}+I\right) ^{-1}}\bigr \Vert _{\infty } \le \frac{1}{1 - \bigl \Vert { B^{-\frac{1}{2}} (A-B) B^{-\frac{1}{2}}}\bigr \Vert _{\infty }}. \end{aligned}$$
- Hoeffding inequality in separable Hilbert spaces [35, 36, 51]: take a family
$$\begin{aligned} \xi _1,\ldots ,\xi _n:\Omega \rightarrow \mathcal {H}\end{aligned}$$
of independent zero-mean random variables such that \(\Vert {\xi _i}\Vert _\mathcal {H}\le \kappa \); then for all \(\epsilon >0\),
$$ \mathbb P\left[ \Vert {\frac{1}{n} \sum _{i=1}^n \xi _i}\Vert _\mathcal {H} >\epsilon \right] \le 2 \exp \left( -\frac{\epsilon ^2 n }{4 \kappa ^2}\right) , $$
i.e., for all \(\tau >0\), with probability at least \(1- 2 e^{-\tau }\),
$$\begin{aligned} \Vert {\frac{1}{n} \sum _{i=1}^n \xi _i}\Vert _\mathcal {H} \le \dfrac{2\kappa \sqrt{\tau }}{\sqrt{n}}. \end{aligned}$$ (76)
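As a concrete illustration of the first facts in the list (trace and trace norm, the Hilbert–Schmidt scalar product, the Hilbert–Schmidt theorem, and the functional calculus), here is a minimal numerical sketch of our own, under the assumption \(\mathcal {H}={\mathbb {R}}^4\), where every operator is automatically trace class and Hilbert–Schmidt and the eigendecomposition provides the basis \((v_\ell )\) and the eigenvalues \((\sigma _\ell )\).

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M + M.T                      # self-adjoint operator on H = R^4

# Hilbert-Schmidt theorem: A v_l = sigma_l v_l with an orthonormal basis (v_l).
sigma, V = np.linalg.eigh(A)     # columns of V are the eigenvectors
assert np.allclose(A @ V, V * sigma)

# Trace norm ||A||_S1 = Tr|A| = sum_l |sigma_l|; trace Tr A = sum_l sigma_l.
assert np.isclose(np.abs(sigma).sum(), np.linalg.norm(A, 'nuc'))
assert np.isclose(sigma.sum(), np.trace(A))

# Hilbert-Schmidt scalar product <A, B>_S2 = Tr(B* A).
B = rng.standard_normal((4, 4))
assert np.isclose(np.trace(B.T @ A), (A * B).sum())

# Functional calculus: phi(A) f = sum_l phi(sigma_l) <f, v_l> v_l.
phi = np.tanh                    # any bounded function
f = rng.standard_normal(4)
phi_A_f = (V * phi(sigma)) @ (V.T @ f)
assert np.allclose(phi_A_f, V @ np.diag(phi(sigma)) @ V.T @ f)
```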
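The operator identities and inequalities (68), (69), (72), and (75) can likewise be spot-checked on random matrices. In the sketch below, mat_pow and random_pd are our own helper names; fractional powers are computed through the eigendecomposition, and small tolerances absorb floating-point noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
op_norm = lambda M: np.linalg.norm(M, 2)   # operator norm

def mat_pow(M, p):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    w, U = np.linalg.eigh(M)
    return (U * w**p) @ U.T

def random_pd(n, rng):
    M = rng.standard_normal((n, n))
    return M @ M.T + 1e-3 * np.eye(n)      # strictly positive, hence invertible

# (68): ||A||_inf^2 = ||A A*||_inf.
A = rng.standard_normal((3, n))
assert np.isclose(op_norm(A) ** 2, op_norm(A @ A.T))

# (69) Cordes inequality: ||A^s B^s|| <= ||A B||^s for positive A, B, s in (0,1].
Ap, Bp = random_pd(n, rng), random_pd(n, rng)
for s in (0.25, 0.5, 0.9, 1.0):
    assert op_norm(mat_pow(Ap, s) @ mat_pow(Bp, s)) <= op_norm(Ap @ Bp) ** s + 1e-10

# (72): I - (A*A + lam I)^{-1} A*A = lam (A*A + lam I)^{-1}.
lam = 0.1
AtA = A.T @ A
R = np.linalg.inv(AtA + lam * np.eye(n))
assert np.allclose(np.eye(n) - R @ AtA, lam * R)

# (75): ||A^{-1/2} B^{1/2}||^2 <= 1/(1 - d), with d = ||B^{-1/2}(A-B)B^{-1/2}|| < 1;
# here B is a small perturbation of A so that (74) holds.
Apos = random_pd(n, rng)
Bpos = Apos + 0.01 * random_pd(n, rng)
d = op_norm(mat_pow(Bpos, -0.5) @ (Apos - Bpos) @ mat_pow(Bpos, -0.5))
assert d < 1
assert op_norm(mat_pow(Apos, -0.5) @ mat_pow(Bpos, 0.5)) ** 2 <= 1 / (1 - d) + 1e-10
```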
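Finally, a small Monte Carlo illustration of the Hoeffding bound (76): we draw bounded, zero-mean, independent vectors (uniform on a sphere of radius \(\kappa \) in \({\mathbb {R}}^{20}\), an arbitrary choice made only so that \(\Vert {\xi _i}\Vert _\mathcal {H}=\kappa \) exactly) and check how often the norm of the empirical mean stays below \(2\kappa \sqrt{\tau }/\sqrt{n}\). The bound is conservative, so the empirical coverage should comfortably exceed \(1-2e^{-\tau }\).

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, kappa, tau = 20, 500, 1.0, 3.0
trials, hits = 2000, 0
bound = 2 * kappa * np.sqrt(tau / n)       # right-hand side of (76)
for _ in range(trials):
    xi = rng.standard_normal((n, d))
    xi *= kappa / np.linalg.norm(xi, axis=1, keepdims=True)  # project to sphere
    hits += np.linalg.norm(xi.mean(axis=0)) <= bound
print(f"empirical coverage: {hits / trials:.3f}  "
      f"(guaranteed >= {1 - 2 * np.exp(-tau):.3f})")
```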
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
De Vito, E., Rosasco, L., Rudi, A. (2021). Regularization: From Inverse Problems to Large-Scale Machine Learning. In: De Mari, F., De Vito, E. (eds) Harmonic and Applied Analysis. Applied and Numerical Harmonic Analysis. Birkhäuser, Cham. https://doi.org/10.1007/978-3-030-86664-8_5
DOI: https://doi.org/10.1007/978-3-030-86664-8_5
Publisher Name: Birkhäuser, Cham
Print ISBN: 978-3-030-86663-1
Online ISBN: 978-3-030-86664-8