ECE595 / STAT598: Machine Learning I

Lecture 03: Regression with Kernels

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering


Purdue University

© Stanley Chan 2020. All Rights Reserved.
Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method


Kernel Method
Dual Form
Kernel Trick
Algorithm
Examples
Radial Basis Function (RBF)
Regression using RBF
Kernel Methods in Classification
Why Another Method?
Linear regression: Pick a global model, best fit globally.
Kernel method: Pick a local model, best fit locally.
In the kernel method, instead of picking a line or a quadratic equation, we pick a kernel.
A kernel is a measure of distance between training samples.
The kernel method buys us the ability to handle nonlinearity.
Ordinary regression is based on the columns (features) of A.
The kernel method is based on the rows (samples) of A.

Pictorial Illustration

(Figure not reproduced in this extraction.)
Overview of the Method
Model Parameter:
We want the model parameter θ̂ to look like (How? Question 1):

\hat{\theta} = \sum_{n=1}^{N} \alpha_n x_n .

This model expresses θ̂ as a combination of the samples.
The trainable parameters are α_n, where n = 1, . . . , N.
If we can make α_n local, i.e., non-zero for only a few of them, then we can achieve our goal: localized, sample-dependent.

Predicted Value:
The predicted value of a new sample x is

\hat{y} = \hat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x_n \rangle .

We want this model to encapsulate nonlinearity. (How? Question 2)
Dual Form of Linear Regression
Goal: Address Question 1: Express θ̂ as

\hat{\theta} = \sum_{n=1}^{N} \alpha_n x_n .

We start by stating a technical lemma:

Lemma
For any A ∈ R^{N×d}, y ∈ R^N, and λ > 0,

(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y .   (1)

Proof: See Appendix.

Remark:
The identity matrix I on the left is d × d; on the right it is N × N.
If λ = 0, then the above is true only when A is invertible.
Dual Form of Linear Regression
Using the Lemma, we can show that

\hat{\theta} = (A^T A + \lambda I)^{-1} A^T y   \qquad \text{(Primal Form)}
             = A^T \underbrace{(A A^T + \lambda I)^{-1} y}_{\overset{\text{def}}{=}\ \alpha}   \qquad \text{(Dual Form)}
             = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}
               \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{bmatrix}
             = \sum_{n=1}^{N} \alpha_n x_n ,
\qquad \text{where } \alpha_n \overset{\text{def}}{=} \big[ (A A^T + \lambda I)^{-1} y \big]_n .

Here the columns of A^T are the samples x_1, . . . , x_N, since the rows of A are (x_n)^T.
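As a quick sanity check (not part of the original slides), here is a minimal NumPy sketch verifying that the primal and dual forms give the same θ̂ on random data; the dimensions and the value of λ are arbitrary choices.

```python
import numpy as np

np.random.seed(0)
N, d, lam = 50, 5, 0.1            # N samples, d features, ridge parameter lambda
A = np.random.randn(N, d)         # rows of A are the samples x_n^T
y = np.random.randn(N)

# Primal form: theta = (A^T A + lam I)^{-1} A^T y   (a d x d system)
theta_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

# Dual form: alpha = (A A^T + lam I)^{-1} y, then theta = A^T alpha   (an N x N system)
alpha = np.linalg.solve(A @ A.T + lam * np.eye(N), y)
theta_dual = A.T @ alpha

print(np.allclose(theta_primal, theta_dual))   # expected: True
```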
The Kernel Trick
Goal: Address Question 2: Introduce nonlinearity to

\hat{y} = \hat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x_n \rangle .

The Idea:
Replace the inner product ⟨x, x_n⟩ by k(x, x_n):

\hat{y} = \hat{\theta}^T x = \sum_{n=1}^{N} \alpha_n k(x, x_n) .

k(·, ·) is called a kernel.
A kernel is a measure of the distance between two samples x_i and x_j.
⟨x_i, x_j⟩ measures distance in the ambient space; k(x_i, x_j) measures distance in a transformed space.
In particular, a valid kernel takes the form k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ for some nonlinear transform φ.
Kernels Illustrated

A kernel typically lifts the ambient dimension to a higher one.
For example, the following maps R^2 to R^3:

x_n = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\qquad \text{and} \qquad
\phi(x_n) = \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} .
Relationship between Kernel and Transform
Consider the following kernel: k(u, v) = (u^T v)^2. What is the transform?
Suppose u and v are in R^2. Then (u^T v)^2 is

(u^T v)^2 = \left( \sum_{i=1}^{2} u_i v_i \right) \left( \sum_{j=1}^{2} u_j v_j \right)
          = \sum_{i=1}^{2} \sum_{j=1}^{2} (u_i u_j)(v_i v_j)
          = \begin{bmatrix} u_1^2 & u_1 u_2 & u_2 u_1 & u_2^2 \end{bmatrix}
            \begin{bmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{bmatrix} .

So if we define φ as

\phi(u) = \phi\!\left( \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \right)
        = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix} ,

then (u^T v)^2 = ⟨φ(u), φ(v)⟩.
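A quick numerical check of this identity (my own illustrative sketch, not from the slides):

```python
import numpy as np

def phi(u):
    """Feature map for the kernel k(u, v) = (u^T v)^2 with u in R^2."""
    return np.array([u[0]**2, u[0]*u[1], u[1]*u[0], u[1]**2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print((u @ v)**2)        # kernel evaluated directly: 1.0
print(phi(u) @ phi(v))   # inner product in the lifted space: also 1.0
```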
Radial Basis Function
A useful kernel is the radial basis function (RBF) kernel:

k(u, v) = \exp\left( -\frac{\| u - v \|^2}{2\sigma^2} \right) .

The corresponding nonlinear transform of the RBF is infinite dimensional. See Appendix.
‖u − v‖² measures the distance between two data points u and v.
σ is the standard deviation, defining "far" and "close".
The RBF enforces local structure: only a few samples are effectively used.
Kernel Method
Given the choice of the kernel function, we can write down the algorithm as follows (a code sketch is given below).
1. Pick a kernel function k(·, ·).
2. Construct a kernel matrix K ∈ R^{N×N}, where [K]_{ij} = k(x_i, x_j), for i = 1, . . . , N and j = 1, . . . , N.
3. Compute the coefficients α ∈ R^N, with

   \alpha_n = \big[ (K + \lambda I)^{-1} y \big]_n .

4. Estimate the predicted value for a new sample x:

   g_{\theta}(x) = \sum_{n=1}^{N} \alpha_n k(x, x_n) .

Therefore, the choice of the regression function is shifted to the choice of the kernel.
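Here is a minimal NumPy sketch of these four steps with the RBF kernel (an illustration under my own naming, not the course's reference code; the values of σ and λ are arbitrary):

```python
import numpy as np

def rbf_kernel(U, V, sigma=1.0):
    """Kernel matrix with entries K[i, j] = exp(-||u_i - v_j||^2 / (2 sigma^2))."""
    sq = np.sum(U**2, axis=1)[:, None] + np.sum(V**2, axis=1)[None, :] - 2 * U @ V.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_fit(X, y, sigma=1.0, lam=0.1):
    """Steps 2-3: build K and solve (K + lambda I) alpha = y."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_predict(X_new, X_train, alpha, sigma=1.0):
    """Step 4: g(x) = sum_n alpha_n k(x, x_n) for each row of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha

# Usage on noisy 1-D data (time stamps t, heights y)
t = np.linspace(0, 100, 100).reshape(-1, 1)
y = np.sin(t.ravel() / 10) + 0.1 * np.random.randn(100)
alpha = kernel_fit(t, y, sigma=3.0, lam=0.5)
y_hat = kernel_predict(t, t, alpha, sigma=3.0)
```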
Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method


Kernel Method
Dual Form
Kernel Trick
Algorithm
Examples
Radial Basis Function (RBF)
Regression using RBF
Kernel Methods in Classification
Example
Goal: Use the kernel method to fit the data points shown below.
What is the input feature vector x_n? x_n = t_n: the time stamps.
What is the output y_n? y_n is the height.
Which kernel to choose? Let us consider the RBF.
(Scatter plot of the data: heights between roughly −0.1 and 0.8, plotted against time stamps from 0 to 100.)
Example (using RBF)

Define the fitted function as g_θ(t). [Here, θ refers to α.]
The RBF is defined as k(t_i, t_j) = exp{−(t_i − t_j)² / 2σ²}, for some σ.
The matrix K looks like

[K]_{ij} = \exp\left\{ -\frac{(t_i - t_j)^2}{2\sigma^2} \right\} .

(Image of the 100 × 100 kernel matrix K, with large entries concentrated near the diagonal.)

K is a banded matrix, with its large entries concentrated around the diagonal. (Why?)
The coefficient vector α has entries α_n = [(K + λI)⁻¹ y]_n.
Example (using RBF)
Using the RBF, the predicted value is given by

g_{\theta}(t_{\mathrm{new}}) = \sum_{n=1}^{N} \alpha_n k(t_{\mathrm{new}}, t_n)
                             = \sum_{n=1}^{N} \alpha_n \exp\left\{ -\frac{(t_{\mathrm{new}} - t_n)^2}{2\sigma^2} \right\} .

Pictorially, the predicted function g_θ can be viewed as a linear combination of the Gaussian kernels.
Effect of σ
Large σ: Flat kernel. Over-smoothing.
Small σ: Narrow kernel. Under-smoothing.
Below is an example of the fitting and of the kernel matrix K.

(Three panels, each plotting the noisy samples, the kernel regression fit, and the ground truth over t = 10 to 100, together with the corresponding 100 × 100 kernel matrix K, for three different values of σ.)
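The following self-contained sketch (illustrative only; the σ values, λ, and the synthetic ground truth are my own choices) reproduces this kind of comparison by sweeping σ:

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf(T1, T2, sigma):
    """RBF kernel matrix for 1-D inputs T1, T2 of shape (N, 1) and (M, 1)."""
    d2 = (T1[:, None, 0] - T2[None, :, 0])**2
    return np.exp(-d2 / (2 * sigma**2))

t = np.linspace(0, 100, 100).reshape(-1, 1)
truth = np.sin(t.ravel() / 10)
y = truth + 0.1 * np.random.randn(100)
lam = 0.5

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, sigma in zip(axes, [0.5, 3.0, 30.0]):      # narrow, moderate, flat kernels
    K = rbf(t, t, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    ax.plot(t, y, '.', label='noisy sample')
    ax.plot(t, K @ alpha, label='kernel regression')
    ax.plot(t, truth, label='ground truth')
    ax.set_title(f'sigma = {sigma}')
    ax.legend()
plt.show()
```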
Any Improvement?

We can improve the above kernel by considering x_n = [y_n, t_n]^T.
Define the kernel as

k(x_i, x_j) = \exp\left\{ -\left( \frac{(y_i - y_j)^2}{2\sigma_r^2} + \frac{(t_i - t_j)^2}{2\sigma_s^2} \right) \right\} .

This new kernel is adaptive (edge-aware).
Any Improvement?
Here is a comparison.

(Two panels plotting the noisy samples, the kernel regression fit, and the ground truth over t = 10 to 100: the left panel without the improvement, the right panel with it.)

This idea is known as the bilateral filter in the computer vision literature.
It can be further extended to 2D images, where x_n = [y_n, s_n] for some spatial coordinate s_n.
Many applications. See the Reading List.
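Below is a minimal sketch of such an adaptive kernel plugged into the same regression recipe (my own illustration; the step-edge data, σ_r, σ_s, and λ are arbitrary choices):

```python
import numpy as np

def bilateral_kernel(y, t, sigma_r=0.1, sigma_s=3.0):
    """Edge-aware kernel: weights depend on both the heights y and the time stamps t."""
    dy2 = (y[:, None] - y[None, :])**2   # range (height) differences
    dt2 = (t[:, None] - t[None, :])**2   # spatial (time) differences
    return np.exp(-(dy2 / (2 * sigma_r**2) + dt2 / (2 * sigma_s**2)))

t = np.linspace(0, 100, 100)
y = np.where(t < 50, 0.2, 0.7) + 0.05 * np.random.randn(100)   # a step edge plus noise
K = bilateral_kernel(y, t)
alpha = np.linalg.solve(K + 0.5 * np.eye(100), y)
y_hat = K @ alpha   # smooths within each flat region without blurring across the edge
```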
Kernel Methods in Classification

The concept of lifting the data to a higher dimension is useful for classification.

(Figure not reproduced. Image source: https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78)
Kernels in Support Vector Machines

Example: RBF for SVM. (We will discuss SVM later in the semester.)
The radial basis function is often used in support vector machines.
A poor choice of parameter can lead to low training loss, but with the risk of over-fitting, as the sketch below illustrates.
An under-fitted model can sometimes give better generalization.
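A small scikit-learn sketch illustrating this parameter sensitivity (my own toy example; the data and the gamma values are arbitrary, and scikit-learn's gamma plays the role of 1/(2σ²)):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy 2-D data: two noisy, overlapping classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.1, 100.0]:   # moderate vs. very narrow RBF
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_tr, y_tr)
    print(gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
# A very large gamma tends to push training accuracy toward 1 while test accuracy drops (over-fit).
```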
Reading List
Kernel Method:
Learning from Data (Chapter 3.4), https://work.caltech.edu/telecourse
CMU 10701, Lecture 4, https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kernels_SVM_04_7_2011-ann.pdf
Berkeley CS 194, Lecture 7, https://people.eecs.berkeley.edu/~russell/classes/cs194/
Oxford C19, Lecture 3, http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Kernel Regression in Computer Vision:
Bilateral Filter, https://people.csail.mit.edu/sparis/bf_course/course_notes.pdf
Takeda and Milanfar, "Kernel regression for image processing and reconstruction", IEEE Trans. Image Process. (2007), https://ieeexplore.ieee.org/document/4060955
Appendix

Proof of Lemma
Lemma
For any matrix A ∈ R^{N×d}, y ∈ R^N, and λ > 0,

(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y .   (2)

The left hand side is the solution to the regularized normal equation, which means

A^T A \theta + \lambda \theta = A^T y .

Rearranging terms gives \theta = A^T \left[ \tfrac{1}{\lambda} (y - A\theta) \right].
Define \alpha = \tfrac{1}{\lambda} (y - A\theta); then \theta = A^T \alpha.
Substituting \theta = A^T \alpha into \alpha = \tfrac{1}{\lambda} (y - A\theta), we have

\alpha = \frac{1}{\lambda} \left( y - A A^T \alpha \right) .

Rearranging terms gives (A A^T + \lambda I)\alpha = y, which yields \alpha = (A A^T + \lambda I)^{-1} y.
Substituting into \theta = A^T \alpha gives \theta = A^T (A A^T + \lambda I)^{-1} y.
Non-Linear Transform for RBF
Let us consider a scalar u ∈ R and, for simplicity, take 2σ² = 1, so that

k(u, v) = \exp\{-(u - v)^2\}
        = \exp\{-u^2\}\, \exp\{2uv\}\, \exp\{-v^2\}
        = \exp\{-u^2\} \left( \sum_{k=0}^{\infty} \frac{2^k u^k v^k}{k!} \right) \exp\{-v^2\}
        = \exp\{-u^2\} \left( 1,\ \sqrt{\tfrac{2^1}{1!}}\, u,\ \sqrt{\tfrac{2^2}{2!}}\, u^2,\ \sqrt{\tfrac{2^3}{3!}}\, u^3,\ \ldots \right)^T
          \left( 1,\ \sqrt{\tfrac{2^1}{1!}}\, v,\ \sqrt{\tfrac{2^2}{2!}}\, v^2,\ \sqrt{\tfrac{2^3}{3!}}\, v^3,\ \ldots \right) \exp\{-v^2\} .

So Φ is

\phi(x) = \exp\{-x^2\} \left( 1,\ \sqrt{\tfrac{2^1}{1!}}\, x,\ \sqrt{\tfrac{2^2}{2!}}\, x^2,\ \sqrt{\tfrac{2^3}{3!}}\, x^3,\ \ldots \right) .
Kernels are Positive Semi-Definite

Given {x_j}_{j=1}^{N}, construct an N × N matrix K such that

[K]_{ij} = K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) .

Claim: K is positive semi-definite.
Let z be an arbitrary vector. Then

z^T K z = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i K_{ij} z_j
        = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \Phi(x_i)^T \Phi(x_j) z_j
        = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i z_j \sum_{k} [\Phi(x_i)]_k [\Phi(x_j)]_k
        \overset{(a)}{=} \sum_{k} \left( \sum_{i=1}^{N} [\Phi(x_i)]_k z_i \right)^2 \ge 0 ,

where [Φ(x_i)]_k denotes the k-th element of the vector Φ(x_i), and the sum over k runs over the components of Φ.
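A quick numerical illustration of this claim (my own sketch): the eigenvalues of an RBF kernel matrix built from random data are non-negative, up to floating-point round-off.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 3)    # 20 samples in R^3
sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)         # RBF kernel matrix with sigma = 1

eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)    # expected: True (PSD up to round-off)
```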
Existence of Nonlinear Transform

We just showed that if K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) for any x_1, . . . , x_N, then K is symmetric positive semi-definite.
The converse also holds: if K is symmetric positive semi-definite for any x_1, . . . , x_N, then there exists a Φ such that K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j).
This converse is difficult to prove.
It is called the Mercer Condition.
Kernels satisfying Mercer's condition have a valid Φ.
You can use the condition to rule out invalid kernels.
But proving that a kernel is valid is still hard.
