Regularization: From Inverse Problems to Large-Scale Machine Learning

Chapter in: Harmonic and Applied Analysis

Abstract

We discuss regularization methods in machine learning with an emphasis on the interplay between statistical and computational aspects. We begin by recalling a connection between inverse problems and machine learning. We then discuss how this connection allows modeling principles to be translated into computational procedures, suggesting strategies to deal with large-scale learning problems.


Notes

  1. More precisely, two continuous functions that are equal almost everywhere are only guaranteed to coincide on \(\mathcal{X}_\rho \).

  2. Or rather the second moment matrix.

References

  1. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950). http://dx.doi.org/10.2307/1990404

  2. Dieuleveut, A., Bach, F.: Nonparametric stochastic approximation with large step-sizes. Ann. Stat. 44(4), 1363–1399 (2016). https://doi.org/10.1214/15-AOS1391

  3. Bhatia, R.: Matrix Analysis. Graduate Texts in Mathematics, vol. 169. Springer, New York (1997)

  4. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7, 2399–2434 (2006)

  5. Bertero, M., De Mol, C., Pike, E.R.: Linear inverse problems with discrete data. I. General formulation and singular system analysis. Inverse Probl. 1(4), 301 (1985)

  6. Blanchard, G., Bousquet, O., Zwald, L.: Statistical properties of kernel principal component analysis. Mach. Learn. 66(2–3), 259–294 (2007)

  7. Blanchard, G., Krämer, N.: Optimal learning rates for kernel conjugate gradient regression. In: Advances in Neural Information Processing Systems 23 (NIPS 2010), pp. 226–234. Curran Associates, Inc. (2010). https://proceedings.neurips.cc/paper/2010/hash/b2f627fff19fda463cb386442eac2b3d-Abstract.html

  8. Blanchard, G., Mücke, N.: Optimal rates for regularization of statistical inverse learning problems. Found. Comput. Math. 18(4), 971–1013 (2018)

  9. Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems 20 (NIPS 2007)

  10. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  11. Bühlmann, P., Yu, B.: Boosting with the \(L_2\) loss: regression and classification. J. Am. Stat. Assoc. 98(462), 324–339 (2003)

  12. Caponnetto, A., De Vito, E.: Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7(3), 331–368 (2007)

  13. Caponnetto, A., Yao, Y.: Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 8, 161–183 (2010)

  14. Coifman, R.R., Lafon, S.: Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Appl. Comput. Harmon. Anal. 21(1), 31–52 (2006)

  15. Cucker, F., Smale, S.: On the mathematical foundations of learning. Bull. Am. Math. Soc. 39(1), 1–49 (2001)

  16. Cucker, F., Zhou, D.X.: Learning Theory: An Approximation Theory Viewpoint, vol. 24. Cambridge University Press, Cambridge (2007)

  17. De Vito, E., Caponnetto, A., Rosasco, L.: Model selection for regularized least-squares algorithm in learning theory. Found. Comput. Math. 5(1) (2005)

  18. De Vito, E., Rosasco, L., Caponnetto, A.: Discretization error analysis for Tikhonov regularization. Anal. Appl. (Singap.) 4(1), 81–99 (2006)

  19. De Vito, E., Rosasco, L., Caponnetto, A., De Giovannini, U., Odone, F.: Learning from examples as an inverse problem. J. Mach. Learn. Res. 6, 883–904 (2005)

  20. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition, vol. 31. Springer Science & Business Media (2013)

  21. Dudley, R.: Real Analysis and Probability. Cambridge Studies in Advanced Mathematics, vol. 74. Cambridge University Press, Cambridge (2002). https://doi.org/10.1017/CBO9780511755347

  22. Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Springer (1996)

  23. Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Birkhäuser, Basel (2013)

  24. Györfi, L., Kohler, M., Krzyzak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media (2006)

  25. Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970)

  26. Hsu, D., Kakade, S.M., Zhang, T.: Random design analysis of ridge regression. J. Mach. Learn. Res. Proc. Track 23, 9.1–9.24 (2012)

  27. Kress, R.: Linear Integral Equations, vol. 82. Springer Science & Business Media (1989)

  28. Landweber, L.: An iteration formula for Fredholm integral equations of the first kind. Am. J. Math. 73, 615–624 (1951)

  29. Lang, S.: Real and Functional Analysis. Graduate Texts in Mathematics, vol. 142, 3rd edn. Springer, New York (1993). https://doi.org/10.1007/978-1-4612-0897-6

  30. Lin, J., Rosasco, L.: Optimal rates for multi-pass stochastic gradient methods. J. Mach. Learn. Res. 18, 97:1–97:47 (2017). http://jmlr.org/papers/v18/17-176.html

  31. Lin, J., Rudi, A., Rosasco, L., Cevher, V.: Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Appl. Comput. Harmon. Anal. 48(3), 868–890 (2020)

  32. Minsker, S.: On some extensions of Bernstein's inequality for self-adjoint operators. Stat. Probab. Lett. 127, 111–119 (2017)

  33. Mücke, N., Neu, G., Rosasco, L.: Beating SGD saturation with tail-averaging and minibatching. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 12568–12577 (2019). https://proceedings.neurips.cc/paper/2019/hash/4d0b954f0bef437c29dfa73fafdf3fa5-Abstract.html

  34. Pagliana, N., Rosasco, L.: Implicit regularization of accelerated methods in Hilbert spaces. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 14454–14464 (2019). https://proceedings.neurips.cc/paper/2019/hash/c61aed648da48aa3893fb3eaadd88a7f-Abstract.html

  35. Pinelis, I.: Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probab. 22(4), 1679–1706 (1994)

  36. Pinelis, I.: Correction: “Optimum bounds for the distributions of martingales in Banach spaces” [Ann. Probab. 22(4), 1679–1706 (1994)]. Ann. Probab. 27(4), 2119 (1999)

  37. Poggio, T., Girosi, F.: Networks for approximation and learning. Proc. IEEE 78(9), 1481–1497 (1990)

  38. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems 20 (NIPS 2007), pp. 1177–1184. Curran Associates, Inc. (2007)

  39. Rosasco, L., Villa, S.: Learning with incremental iterative regularization. In: Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 1630–1638 (2015). https://proceedings.neurips.cc/paper/2015/hash/1587965fb4d4b5afe8428a4a024feb0d-Abstract.html

  40. Rudi, A., Camoriano, R., Rosasco, L.: Less is more: Nyström computational regularization. In: Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 1648–1656. Curran Associates, Inc. (2015). http://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization.pdf

  41. Rudi, A., Camoriano, R., Rosasco, L.: Generalization properties of learning with random features (2016)

  42. Rudi, A., Canas, G.D., Rosasco, L.: On the sample complexity of subspace learning. In: Advances in Neural Information Processing Systems 26 (NIPS 2013), pp. 2067–2075 (2013)

  43. Smale, S., Zhou, D.X.: Shannon sampling and function reconstruction from point values. Bull. Am. Math. Soc. 41(3), 279–305 (2004)

  44. Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26(2), 153–172 (2007)

  45. Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008). https://books.google.de/books?id=HUnqnrpYt4IC

  46. Steinwart, I., Hush, D.R., Scovel, C.: Optimal rates for regularized least squares regression. In: COLT, pp. 79–93 (2009)

  47. Tropp, J.A.: User-friendly tools for random matrices: an introduction (2012)

  48. Vapnik, V.N.: Statistical Learning Theory, vol. 1. Wiley, New York (1998)

  49. Wahba, G.: Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia (1990)

  50. Yen, E.H., Lin, T.W., Lin, S.D., Ravikumar, P.K., Dhillon, I.S.: Sparse random feature algorithm as coordinate descent in Hilbert space. In: Advances in Neural Information Processing Systems 27 (NIPS 2014). http://papers.nips.cc/paper/5479-sparse-random-feature-algorithm-as-coordinate-descent-in-hilbert-space.pdf

  51. Yurinsky, V.: Sums and Gaussian Vectors, vol. 1617. Springer, Berlin (1995)

Acknowledgements

This chapter evolved from a set of notes for the Summer School “Applied Harmonic Analysis and Machine Learning” in Genova, the Summer School “Structured Regularization for High-Dimensional Data Analysis” at IHP Paris, and the Spring School “Structural Inference” in Malente, Germany. The results we present are an overview of work carried out with, and by, a number of collaborators and colleagues. Among others, we would like to thank Gilles Blanchard, Andrea Caponnetto, Luigi Carratino, Raffaello Camoriano, Junhong Lin, Nicole Mücke, Nicoló Pagliana, Steve Smale, Alessandro Verri, Silvia Villa and Ding-Xuan Zhou.

This material is based on work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. L. R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-18-1-7009, FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS-DLV-777826. Part of this work was funded by the French government under the management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). A. R. acknowledges the financial support of the European Research Council (grant REAL 947908). E.D.V. is a member of the Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM).


Corresponding author

Correspondence to Ernesto De Vito.


Appendix

In this section, we collect the notation and some basic mathematical facts, adapted to our setting.

  • \(\mathcal {H},\mathcal{G}\): Hilbert spaces

  • \(\bigl \Vert {f}\bigr \Vert _{\mathcal {H}}\): norm of \(f\in \mathcal {H}\)

  • \(\left\langle {f},{g}\right\rangle _{\mathcal {H}}\): scalar product of \(f,g\in \mathcal {H}\)

  • \(\mathcal K^\perp =\{f\in \mathcal {H}:\left\langle {f},{f'}\right\rangle _\mathcal {H}=0\,\forall f'\in \mathcal K\}\): orthogonal complement of a closed subspace \(\mathcal K\subset \mathcal {H}\)

  • \(P_\mathcal K\): orthogonal projection onto a closed subspace \(\mathcal K\subset \mathcal {H}\)

  • \(\mathop \mathrm{ker}{A}=\{f\in \mathcal {H}:Af=0\}\): kernel of \(A:\mathcal {H}\rightarrow \mathcal{G}\)

  • \({\text {ran}}{A}=\{Af\in \mathcal{G}:f\in \mathcal {H}\}\): range of \(A:\mathcal {H}\rightarrow \mathcal{G}\)

  • \(A^*:\mathcal{G}\rightarrow \mathcal {H}\): adjoint of \(A:\mathcal {H}\rightarrow \mathcal{G}\)

  • \(|{A}|=\sqrt{A^*A}\): absolute value of \(A:\mathcal {H}\rightarrow \mathcal{G}\)

  • \(\bigl \Vert {A}\bigr \Vert _\infty = \sup _{f\in \mathcal {H}} \frac{\bigl \Vert {Af}\bigr \Vert _{\mathcal{G}}}{\Vert {f}\Vert _\mathcal {H}}\): operator norm of \(A:\mathcal {H}\rightarrow \mathcal{G}\)

  • \(w\otimes v = \left\langle {\cdot },{v}\right\rangle _\mathcal {H} w:\mathcal {H}\rightarrow \mathcal{G}\): rank-one operator, with \(v\in \mathcal {H}\) and \(w\in \mathcal{G}\)

  • \((\Omega ,\mathcal F,\mathbb P)\): probability space

  • \(\mathbb E[\xi ]\): expectation of the random variable \(\xi :\Omega \rightarrow \mathcal {H}\)
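
As a concrete finite-dimensional illustration of this notation, the following minimal NumPy sketch (an illustrative addition; the specific matrices are arbitrary examples) takes \(\mathcal {H}={\mathbb {R}}^3\) and \(\mathcal{G}={\mathbb {R}}^2\), so that operators are matrices, the adjoint is the transpose and \(\bigl \Vert {A}\bigr \Vert _\infty \) is the largest singular value.

    # Finite-dimensional illustration of the notation above (H = R^3, G = R^2).
    # Added sketch for intuition only; the matrices are arbitrary examples.
    import numpy as np

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0]])        # A : H -> G, here a 2 x 3 matrix
    A_adj = A.T                            # adjoint A* : G -> H (transpose in the real case)

    # operator norm ||A||_inf = largest singular value; see also identity (68) below
    s = np.linalg.svd(A, compute_uv=False)
    op_norm = s.max()
    assert np.isclose(op_norm ** 2, np.linalg.norm(A @ A_adj, 2))

    # trace norm ||A||_{S_1} = Tr|A| = sum of singular values;
    # Hilbert-Schmidt norm ||A||_{S_2} = Frobenius norm
    trace_norm, hs_norm = s.sum(), np.linalg.norm(A, 'fro')
    assert trace_norm >= hs_norm >= op_norm

    # rank-one operator  w (x) v = <., v>_H w : H -> G
    v, w = np.array([1.0, 0.0, 2.0]), np.array([0.5, -1.0])
    f = np.array([3.0, 1.0, -1.0])
    assert np.allclose(np.outer(w, v) @ f, np.dot(f, v) * w)

    # orthogonal projection P_K onto the line K spanned by a unit vector u in H
    u = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
    P = np.outer(u, u)
    assert np.allclose(P @ P, P) and np.allclose(P, P.T)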

We recall the following results.

  • Trace class operators: an operator \(A:\mathcal {H}\rightarrow \mathcal {H}\) is trace class if, for some (equivalently, for every) orthonormal basis \((v_\ell )_{\ell \in \Lambda }\) of \(\mathcal {H}\), the series \(\sum _{\ell \in \Lambda } \left\langle {|{A}|v_\ell },{v_\ell }\right\rangle _\mathcal {H}\) converges and

    $$ {\text {Tr}} A = \sum _{\ell \in \Lambda } \left\langle {A v_\ell },{v_\ell }\right\rangle _\mathcal {H} $$

    is called the trace of A. An operator \(A:\mathcal {H}\rightarrow \mathcal{G}\) is trace class if \(|{A}|\) is trace class, and the space \(\mathcal S_1(\mathcal {H},\mathcal{G})\) of trace class operators from \(\mathcal {H}\) to \(\mathcal{G}\) is a Banach space with respect to the norm

    $$ \bigl \Vert {A}\bigr \Vert _{ \mathcal S_1(\mathcal {H},\mathcal{G})} = {\text {Tr}}(|{A}|). $$
  • Hilbert–Schmidt operators: \(A:\mathcal {H}\rightarrow \mathcal{G}\) is a Hilbert–Schmidt operator if \(A^*A\) is a trace class operator. The space \(\mathcal S_2(\mathcal {H},\mathcal{G})\) of Hilbert–Schmidt operators from \(\mathcal {H}\) to \(\mathcal{G}\) is a Hilbert space with respect to the scalar product

    $$ \left\langle {A},{B}\right\rangle _{ \mathcal S_2(\mathcal {H},\mathcal{G})} ={\text {Tr}}(B^*A), $$

    and \(\mathcal S_1(\mathcal {H},\mathcal{G})\subset \mathcal S_2(\mathcal {H},\mathcal{G})\).

  • Hilbert–Schmidt theorem: if \(A:\mathcal {H}\rightarrow \mathcal {H}\) is a Hilbert–Schmidt self-adjoint operator, there exist an orthonormal basis \((v_\ell )_{\ell \in \Lambda }\) of \(\mathcal {H}\) and a sequence \((\sigma _\ell )_{\ell \in \Lambda }\) of real numbers such that

    $$\begin{aligned} A v_\ell =\sigma _\ell v_\ell . \end{aligned}$$
  • Functional calculus: if \(A:\mathcal {H}\rightarrow \mathcal {H}\) is a Hilbert–Schmidt self-adjoint operator and \(\varphi :{\mathbb {R}}\rightarrow {\mathbb {R}}\) is a bounded function, by the functional calculus the operator \(\varphi (A):\mathcal {H}\rightarrow \mathcal {H}\) is defined as

    $$ \varphi (A) f =\sum _{\ell \in \Lambda } \varphi (\sigma _\ell ) \left\langle {f},{v_\ell }\right\rangle _\mathcal {H} v_\ell , $$

    where the series converges in \(\mathcal {H}\).

  • For any operator \(A:\mathcal {H}\rightarrow \mathcal{G}\)

    $$\begin{aligned} \Vert A\Vert ^2_\infty = \Vert AA^*\Vert _\infty \end{aligned}$$
    (68)
  • Cordes inequality (see also the numerical sketch after this list): for any pair of positive operators \(A, B:\mathcal {H}\rightarrow \mathcal {H}\) and any \(s \in (0,1]\)

    $$\begin{aligned} \Vert A^s B^s\Vert _\infty \le \Vert A B\Vert ^s_\infty \end{aligned}$$
    (69)
  • Take two operators \(B,C:\mathcal {H}\rightarrow \mathcal{G}\) such that

    $$\begin{aligned} B B^* \le CC^*, \end{aligned}$$
    (70)

    then

    $$\begin{aligned} \bigl \Vert {AB}\bigr \Vert _\infty \le \bigl \Vert {AC}\bigr \Vert _\infty \end{aligned}$$
    (71)

    for any operator \(A:\mathcal{G}\rightarrow \mathcal{G}'\). Indeed, by (70)

    $$ ABB^*A^* \le ACC^*A^* \qquad \Longrightarrow \qquad \bigl \Vert {ABB^*A^*}\bigr \Vert _\infty \le \bigl \Vert {ACC^*A^*}\bigr \Vert _\infty . $$

    Then,

    $$\begin{aligned} \bigl \Vert {AB}\bigr \Vert ^2_\infty&= \bigl \Vert {ABB^*A^*}\bigr \Vert _\infty \le \bigl \Vert {ACC^*A^*}\bigr \Vert _\infty = \bigl \Vert {AC}\bigr \Vert ^2_\infty . \end{aligned}$$
  • If \(A:\mathcal {H}\rightarrow \mathcal{G}\) and \(\lambda >0\)

    $$\begin{aligned} (A^*A + \lambda I)^{-1}A^*A - I = -\lambda (A^*A + \lambda I)^{-1} \end{aligned}$$
    (72)

    see [22].

  • For any positive operator \(C:\mathcal {H}\rightarrow \mathcal {H}\) and projection \(P:\mathcal {H}\rightarrow \mathcal {H}\)

    $$\begin{aligned} P(PCP + \lambda I)^{-1}P \le P(C + \lambda I)^{-1}P \end{aligned}$$
    (73)

    see, for example, Theorem V.2.3-(iv) of [3] for the proof of this property of operator convex functions, and Cor. V.2.6 together with Exercise V.1.10-(ii) and Exercise V.2.11, in the same book, for the proof of operator convexity of \((\cdot +\lambda )^{-1}\).

  • For any pair \(A,B:\mathcal {H}\rightarrow \mathcal {H}\) of positive operators with bounded inverse such that

    $$\begin{aligned} \bigl \Vert {B^{-\frac{1}{2}} (A-B) B^{-\frac{1}{2}}}\bigr \Vert _{\infty } < 1 \end{aligned}$$
    (74)

    then

    $$\begin{aligned} \bigl \Vert {A^{-\frac{1}{2}} B^{\frac{1}{2}} }\bigr \Vert _{\infty }&\le \sqrt{\frac{1}{1 - \bigl \Vert {B^{-\frac{1}{2}} (A-B) B^{-\frac{1}{2}}}\bigr \Vert _{\infty }}}. \end{aligned}$$
    (75)

    Indeed, observe that

    $$\begin{aligned} (A^{-\frac{1}{2}} B^{\frac{1}{2}})^* A^{-\frac{1}{2}} B^{\frac{1}{2}}&= B^{\frac{1}{2}} A^{-1} B^{\frac{1}{2}}=B^{\frac{1}{2}} \left( A-B+B\right) ^{-1} B^{\frac{1}{2}} \\&= B^{\frac{1}{2}} \left( B^{\frac{1}{2}}\left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}+I\right) B^{\frac{1}{2}}\right) ^{-1} B^{\frac{1}{2}}\\&= \left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}+I\right) ^{-1}=\sum _{\ell =0}^{+\infty } (-1)^\ell \left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}\right) ^{\ell }, \end{aligned}$$

    where the Neumann series converges by (74). Hence, the triangle inequality gives

    $$\begin{aligned} \bigl \Vert {A^{-\frac{1}{2}} B^{\frac{1}{2}}}\bigr \Vert ^2_\infty&= \bigl \Vert {\left( B^{-\frac{1}{2}}(A-B) B^{-\frac{1}{2}}+I\right) ^{-1}}\bigr \Vert _{\infty } \le \frac{1}{1 - \bigl \Vert { B^{-\frac{1}{2}} (A-B) B^{-\frac{1}{2}}}\bigr \Vert _{\infty }}. \end{aligned}$$
  • Hoeffding inequality in separable Hilbert spaces [35, 36, 51]: take a family

    $$\begin{aligned} \xi _1,\ldots ,\xi _n:\Omega \rightarrow \mathcal {H}\end{aligned}$$

    of independent zero mean random variables such that \(\Vert {\xi _i}\Vert _\mathcal {H}\le \kappa \), then for all \(\epsilon >0\)

    $$ \mathbb P\left[ \Vert {\frac{1}{n} \sum _{i=1}^n \xi _i}\Vert _\mathcal {H} >\epsilon \right] \le 2 \exp \left( -\frac{\epsilon ^2 n }{4 \kappa ^2}\right) , $$

    i.e., for all \(\tau >0\) with probability at least \(1- 2 e^{-\tau }\)

    $$\begin{aligned} \Vert {\frac{1}{n} \sum _{i=1}^n \xi _i}\Vert _\mathcal {H} \le \dfrac{2\kappa \sqrt{\tau }}{\sqrt{n}}. \end{aligned}$$
    (76)
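
As an added sanity check, the following minimal NumPy sketch spot-checks three of the facts above on random finite-dimensional examples: the Cordes inequality (69), the identity (72), and the concentration bound (76). The dimension, sample sizes and random seed are arbitrary choices made only for illustration.

    # Numerical sanity check of (69), (72) and (76) on random finite-dimensional
    # examples (added sketch; all parameters below are arbitrary).
    import numpy as np

    rng = np.random.default_rng(0)
    d = 5

    def random_psd(d):
        """Random symmetric positive semidefinite d x d matrix."""
        M = rng.standard_normal((d, d))
        return M @ M.T

    def mat_power(M, p):
        """p-th power of a symmetric positive semidefinite matrix via eigendecomposition."""
        w, V = np.linalg.eigh(M)
        return V @ np.diag(np.maximum(w, 0.0) ** p) @ V.T

    # Cordes inequality (69): ||A^s B^s|| <= ||A B||^s for positive A, B and s in (0, 1]
    A, B, s = random_psd(d), random_psd(d), 0.5
    lhs = np.linalg.norm(mat_power(A, s) @ mat_power(B, s), 2)
    rhs = np.linalg.norm(A @ B, 2) ** s
    assert lhs <= rhs + 1e-10

    # identity (72): (C*C + lam I)^{-1} C*C - I = -lam (C*C + lam I)^{-1}
    lam = 0.1
    C = rng.standard_normal((d, d))
    G = C.T @ C
    assert np.allclose(np.linalg.solve(G + lam * np.eye(d), G) - np.eye(d),
                       -lam * np.linalg.inv(G + lam * np.eye(d)))

    # bound (76): ||(1/n) sum xi_i|| <= 2 kappa sqrt(tau / n) with prob. >= 1 - 2 e^{-tau},
    # so the empirical failure rate should not exceed 2 e^{-tau} (up to Monte Carlo error)
    n, tau, kappa, trials = 200, 3.0, 1.0, 2000
    threshold = 2 * kappa * np.sqrt(tau / n)
    failures = 0
    for _ in range(trials):
        xi = rng.uniform(-1.0, 1.0, size=(n, d))                          # zero-mean samples
        xi /= np.maximum(np.linalg.norm(xi, axis=1, keepdims=True), 1.0)  # enforce ||xi_i|| <= 1
        if np.linalg.norm(xi.mean(axis=0)) > threshold:
            failures += 1
    assert failures / trials <= 2 * np.exp(-tau)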


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

De Vito, E., Rosasco, L., Rudi, A. (2021). Regularization: From Inverse Problems to Large-Scale Machine Learning. In: De Mari, F., De Vito, E. (eds) Harmonic and Applied Analysis. Applied and Numerical Harmonic Analysis. Birkhäuser, Cham. https://doi.org/10.1007/978-3-030-86664-8_5
