ECE595 / STAT598: Machine Learning I

Lecture 03: Regression with Kernels

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering


Purdue University

© Stanley Chan 2020. All Rights Reserved.
Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method


Kernel Method
Dual Form
Kernel Trick
Algorithm
Examples
Radial Basis Function (RBF)
Regression using RBF
Kernel Methods in Classification
Why Another Method?
Linear regression: Pick a global model, best fit globally.
Kernel method: Pick a local model, best fit locally.
In the kernel method, instead of picking a line or a quadratic equation, we pick a kernel.
A kernel is a measure of distance between training samples.
The kernel method buys us the ability to handle nonlinearity.
Ordinary regression is based on the columns (features) of A.
The kernel method is based on the rows (samples) of A.

Pictorial Illustration

(Figure not reproduced in this extraction.)
Overview of the Method
Model Parameter:
We want the model parameter θ̂ to look like (How? Question 1):

\hat{\theta} = \sum_{n=1}^{N} \alpha_n x_n .

This model expresses θ̂ as a combination of the samples.
The trainable parameters are α_n, where n = 1, . . . , N.
If we can make α_n local, i.e., non-zero for only a few of them, then we can achieve our goal: localized, sample-dependent.

Predicted Value:
The predicted value of a new sample x is

\hat{y} = \hat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x_n \rangle .

We want this model to encapsulate nonlinearity. (How? Question 2)
Dual Form of Linear Regression
Goal: Address Question 1: Express θ̂ as

\hat{\theta} = \sum_{n=1}^{N} \alpha_n x_n .

We start by stating a technical lemma:

Lemma
For any A ∈ R^{N×d}, y ∈ R^N, and λ > 0,

(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y .   (1)

Proof: See Appendix.

Remark:
The identity matrix I on the left is d × d; on the right it is N × N.
If λ = 0, then the above is true only when A is invertible.
Dual Form of Linear Regression
Using the Lemma, we can show that

\hat{\theta} = (A^T A + \lambda I)^{-1} A^T y   \qquad \text{(Primal Form)}
             = A^T \underbrace{(A A^T + \lambda I)^{-1} y}_{\overset{\text{def}}{=}\ \alpha}   \qquad \text{(Dual Form)}
             = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}
               \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N \end{bmatrix}
             = \sum_{n=1}^{N} \alpha_n x_n ,
\qquad \text{where } \alpha_n \overset{\text{def}}{=} \big[ (A A^T + \lambda I)^{-1} y \big]_n .

Here the columns of A^T are the samples x_1, . . . , x_N, since the rows of A are (x_n)^T.
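As a quick sanity check (not part of the original slides), here is a minimal NumPy sketch verifying that the primal and dual forms give the same θ̂ on random data; the dimensions and the value of λ are arbitrary choices.

```python
import numpy as np

np.random.seed(0)
N, d, lam = 50, 5, 0.1            # N samples, d features, ridge parameter lambda
A = np.random.randn(N, d)         # rows of A are the samples x_n^T
y = np.random.randn(N)

# Primal form: theta = (A^T A + lam I)^{-1} A^T y   (a d x d system)
theta_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

# Dual form: alpha = (A A^T + lam I)^{-1} y, then theta = A^T alpha   (an N x N system)
alpha = np.linalg.solve(A @ A.T + lam * np.eye(N), y)
theta_dual = A.T @ alpha

print(np.allclose(theta_primal, theta_dual))   # expected: True
```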
The Kernel Trick
Goal: Address Question 2: Introduce nonlinearity to

\hat{y} = \hat{\theta}^T x = \sum_{n=1}^{N} \alpha_n \langle x, x_n \rangle .

The Idea:
Replace the inner product ⟨x, x_n⟩ by k(x, x_n):

\hat{y} = \hat{\theta}^T x = \sum_{n=1}^{N} \alpha_n k(x, x_n) .

k(·, ·) is called a kernel.
A kernel is a measure of the distance between two samples x_i and x_j.
⟨x_i, x_j⟩ measures distance in the ambient space; k(x_i, x_j) measures distance in a transformed space.
In particular, a valid kernel takes the form k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ for some nonlinear transform φ.
Kernels Illustrated

A kernel typically lifts the ambient dimension to a higher one.
For example, the following maps R^2 to R^3:

x_n = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\qquad \text{and} \qquad
\phi(x_n) = \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix} .
Relationship between Kernel and Transform
Consider the following kernel: k(u, v) = (u^T v)^2. What is the transform?
Suppose u and v are in R^2. Then (u^T v)^2 is

(u^T v)^2 = \left( \sum_{i=1}^{2} u_i v_i \right) \left( \sum_{j=1}^{2} u_j v_j \right)
          = \sum_{i=1}^{2} \sum_{j=1}^{2} (u_i u_j)(v_i v_j)
          = \begin{bmatrix} u_1^2 & u_1 u_2 & u_2 u_1 & u_2^2 \end{bmatrix}
            \begin{bmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{bmatrix} .

So if we define φ as

\phi(u) = \phi\!\left( \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \right)
        = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix} ,

then (u^T v)^2 = ⟨φ(u), φ(v)⟩.
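A quick numerical check of this identity (my own illustrative sketch, not from the slides):

```python
import numpy as np

def phi(u):
    """Feature map for the kernel k(u, v) = (u^T v)^2 with u in R^2."""
    return np.array([u[0]**2, u[0]*u[1], u[1]*u[0], u[1]**2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print((u @ v)**2)        # kernel evaluated directly: 1.0
print(phi(u) @ phi(v))   # inner product in the lifted space: also 1.0
```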
Radial Basis Function
A useful kernel is the radial basis function (RBF) kernel:

k(u, v) = \exp\left( -\frac{\| u - v \|^2}{2\sigma^2} \right) .

The corresponding nonlinear transform of the RBF is infinite dimensional. See Appendix.
‖u − v‖² measures the distance between two data points u and v.
σ is the standard deviation, defining "far" and "close".
The RBF enforces local structure: only a few samples are effectively used.
Kernel Method
Given the choice of the kernel function, we can write down the algorithm as follows (a code sketch is given below).
1. Pick a kernel function k(·, ·).
2. Construct a kernel matrix K ∈ R^{N×N}, where [K]_{ij} = k(x_i, x_j), for i = 1, . . . , N and j = 1, . . . , N.
3. Compute the coefficients α ∈ R^N, with

   \alpha_n = \big[ (K + \lambda I)^{-1} y \big]_n .

4. Estimate the predicted value for a new sample x:

   g_{\theta}(x) = \sum_{n=1}^{N} \alpha_n k(x, x_n) .

Therefore, the choice of the regression function is shifted to the choice of the kernel.
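Here is a minimal NumPy sketch of these four steps with the RBF kernel (an illustration under my own naming, not the course's reference code; the values of σ and λ are arbitrary):

```python
import numpy as np

def rbf_kernel(U, V, sigma=1.0):
    """Kernel matrix with entries K[i, j] = exp(-||u_i - v_j||^2 / (2 sigma^2))."""
    sq = np.sum(U**2, axis=1)[:, None] + np.sum(V**2, axis=1)[None, :] - 2 * U @ V.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_fit(X, y, sigma=1.0, lam=0.1):
    """Steps 2-3: build K and solve (K + lambda I) alpha = y."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_predict(X_new, X_train, alpha, sigma=1.0):
    """Step 4: g(x) = sum_n alpha_n k(x, x_n) for each row of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha

# Usage on noisy 1-D data (time stamps t, heights y)
t = np.linspace(0, 100, 100).reshape(-1, 1)
y = np.sin(t.ravel() / 10) + 0.1 * np.random.randn(100)
alpha = kernel_fit(t, y, sigma=3.0, lam=0.5)
y_hat = kernel_predict(t, t, alpha, sigma=3.0)
```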
Outline

Mathematical Background
Lecture 1: Linear regression: A basic data analytic tool
Lecture 2: Regularization: Constraining the solution
Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method


Kernel Method
Dual Form
Kernel Trick
Algorithm
Examples
Radial Basis Function (RBF)
Regression using RBF
Kernel Methods in Classification
Example
Goal: Use the kernel method to fit the data points shown below.
What is the input feature vector x_n? x_n = t_n: the time stamps.
What is the output y_n? y_n is the height.
Which kernel to choose? Let us consider the RBF.
(Scatter plot of the data: heights between roughly −0.1 and 0.8, plotted against time stamps from 0 to 100.)
Example (using RBF)

Define the fitted function as g_θ(t). [Here, θ refers to α.]
The RBF is defined as k(t_i, t_j) = exp{−(t_i − t_j)² / 2σ²}, for some σ.
The matrix K looks like

[K]_{ij} = \exp\left\{ -\frac{(t_i - t_j)^2}{2\sigma^2} \right\} .

(Image of the 100 × 100 kernel matrix K, with large entries concentrated near the diagonal.)

K is a banded matrix, with its large entries concentrated around the diagonal. (Why?)
The coefficient vector α has entries α_n = [(K + λI)⁻¹ y]_n.
Example (using RBF)
Using the RBF, the predicted value is given by

g_{\theta}(t_{\mathrm{new}}) = \sum_{n=1}^{N} \alpha_n k(t_{\mathrm{new}}, t_n)
                             = \sum_{n=1}^{N} \alpha_n \exp\left\{ -\frac{(t_{\mathrm{new}} - t_n)^2}{2\sigma^2} \right\} .

Pictorially, the predicted function g_θ can be viewed as a linear combination of the Gaussian kernels.
Effect of σ
Large σ: Flat kernel. Over-smoothing.
Small σ: Narrow kernel. Under-smoothing.
Below is an example of the fitting and of the kernel matrix K.

(Three panels, each plotting the noisy samples, the kernel regression fit, and the ground truth over t = 10 to 100, together with the corresponding 100 × 100 kernel matrix K, for three different values of σ.)
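The following self-contained sketch (illustrative only; the σ values, λ, and the synthetic ground truth are my own choices) reproduces this kind of comparison by sweeping σ:

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf(T1, T2, sigma):
    """RBF kernel matrix for 1-D inputs T1, T2 of shape (N, 1) and (M, 1)."""
    d2 = (T1[:, None, 0] - T2[None, :, 0])**2
    return np.exp(-d2 / (2 * sigma**2))

t = np.linspace(0, 100, 100).reshape(-1, 1)
truth = np.sin(t.ravel() / 10)
y = truth + 0.1 * np.random.randn(100)
lam = 0.5

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, sigma in zip(axes, [0.5, 3.0, 30.0]):      # narrow, moderate, flat kernels
    K = rbf(t, t, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    ax.plot(t, y, '.', label='noisy sample')
    ax.plot(t, K @ alpha, label='kernel regression')
    ax.plot(t, truth, label='ground truth')
    ax.set_title(f'sigma = {sigma}')
    ax.legend()
plt.show()
```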
Any Improvement?

We can improve the above kernel by considering x_n = [y_n, t_n]^T.
Define the kernel as

k(x_i, x_j) = \exp\left\{ -\left( \frac{(y_i - y_j)^2}{2\sigma_r^2} + \frac{(t_i - t_j)^2}{2\sigma_s^2} \right) \right\} .

This new kernel is adaptive (edge-aware).
Any Improvement?
Here is a comparison.

(Two panels plotting the noisy samples, the kernel regression fit, and the ground truth over t = 10 to 100: the left panel without the improvement, the right panel with it.)

This idea is known as the bilateral filter in the computer vision literature.
It can be further extended to 2D images, where x_n = [y_n, s_n] for some spatial coordinate s_n.
Many applications. See the Reading List.
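Below is a minimal sketch of such an adaptive kernel plugged into the same regression recipe (my own illustration; the step-edge data, σ_r, σ_s, and λ are arbitrary choices):

```python
import numpy as np

def bilateral_kernel(y, t, sigma_r=0.1, sigma_s=3.0):
    """Edge-aware kernel: weights depend on both the heights y and the time stamps t."""
    dy2 = (y[:, None] - y[None, :])**2   # range (height) differences
    dt2 = (t[:, None] - t[None, :])**2   # spatial (time) differences
    return np.exp(-(dy2 / (2 * sigma_r**2) + dt2 / (2 * sigma_s**2)))

t = np.linspace(0, 100, 100)
y = np.where(t < 50, 0.2, 0.7) + 0.05 * np.random.randn(100)   # a step edge plus noise
K = bilateral_kernel(y, t)
alpha = np.linalg.solve(K + 0.5 * np.eye(100), y)
y_hat = K @ alpha   # smooths within each flat region without blurring across the edge
```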
Kernel Methods in Classification

The concept of lifting the data to a higher dimension is useful for classification.

(Figure not reproduced. Image source: https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78)
Kernels in Support Vector Machines

Example: RBF for SVM. (We will discuss SVM later in the semester.)
The radial basis function is often used in support vector machines.
A poor choice of parameter can lead to low training loss, but with the risk of over-fitting, as the sketch below illustrates.
An under-fitted model can sometimes give better generalization.
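A small scikit-learn sketch illustrating this parameter sensitivity (my own toy example; the data and the gamma values are arbitrary, and scikit-learn's gamma plays the role of 1/(2σ²)):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy 2-D data: two noisy, overlapping classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.1, 100.0]:   # moderate vs. very narrow RBF
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_tr, y_tr)
    print(gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
# A very large gamma tends to push training accuracy toward 1 while test accuracy drops (over-fit).
```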
Reading List
Kernel Method:
Learning from Data (Chapter 3.4), https://work.caltech.edu/telecourse
CMU 10701, Lecture 4, https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kernels_SVM_04_7_2011-ann.pdf
Berkeley CS 194, Lecture 7, https://people.eecs.berkeley.edu/~russell/classes/cs194/
Oxford C19, Lecture 3, http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Kernel Regression in Computer Vision:
Bilateral Filter, https://people.csail.mit.edu/sparis/bf_course/course_notes.pdf
Takeda and Milanfar, "Kernel regression for image processing and reconstruction", IEEE Trans. Image Process. (2007), https://ieeexplore.ieee.org/document/4060955
Appendix

Proof of Lemma
Lemma
For any matrix A ∈ R^{N×d}, y ∈ R^N, and λ > 0,

(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y .   (2)

The left hand side is the solution to the regularized normal equation, which means

A^T A \theta + \lambda \theta = A^T y .

Rearranging terms gives \theta = A^T \left[ \tfrac{1}{\lambda} (y - A\theta) \right].
Define \alpha = \tfrac{1}{\lambda} (y - A\theta); then \theta = A^T \alpha.
Substituting \theta = A^T \alpha into \alpha = \tfrac{1}{\lambda} (y - A\theta), we have

\alpha = \frac{1}{\lambda} \left( y - A A^T \alpha \right) .

Rearranging terms gives (A A^T + \lambda I)\alpha = y, which yields \alpha = (A A^T + \lambda I)^{-1} y.
Substituting into \theta = A^T \alpha gives \theta = A^T (A A^T + \lambda I)^{-1} y.
Non-Linear Transform for RBF
Let us consider a scalar u ∈ R and, for simplicity, take 2σ² = 1, so that

k(u, v) = \exp\{-(u - v)^2\}
        = \exp\{-u^2\}\, \exp\{2uv\}\, \exp\{-v^2\}
        = \exp\{-u^2\} \left( \sum_{k=0}^{\infty} \frac{2^k u^k v^k}{k!} \right) \exp\{-v^2\}
        = \exp\{-u^2\} \left( 1,\ \sqrt{\tfrac{2^1}{1!}}\, u,\ \sqrt{\tfrac{2^2}{2!}}\, u^2,\ \sqrt{\tfrac{2^3}{3!}}\, u^3,\ \ldots \right)^T
          \left( 1,\ \sqrt{\tfrac{2^1}{1!}}\, v,\ \sqrt{\tfrac{2^2}{2!}}\, v^2,\ \sqrt{\tfrac{2^3}{3!}}\, v^3,\ \ldots \right) \exp\{-v^2\} .

So Φ is

\phi(x) = \exp\{-x^2\} \left( 1,\ \sqrt{\tfrac{2^1}{1!}}\, x,\ \sqrt{\tfrac{2^2}{2!}}\, x^2,\ \sqrt{\tfrac{2^3}{3!}}\, x^3,\ \ldots \right) .
Kernels are Positive Semi-Definite

Given {x_j}_{j=1}^{N}, construct an N × N matrix K such that

[K]_{ij} = K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) .

Claim: K is positive semi-definite.
Let z be an arbitrary vector. Then

z^T K z = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i K_{ij} z_j
        = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \Phi(x_i)^T \Phi(x_j) z_j
        = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i z_j \sum_{k} [\Phi(x_i)]_k [\Phi(x_j)]_k
        \overset{(a)}{=} \sum_{k} \left( \sum_{i=1}^{N} [\Phi(x_i)]_k z_i \right)^2 \ge 0 ,

where [Φ(x_i)]_k denotes the k-th element of the vector Φ(x_i), and the sum over k runs over the components of Φ.
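A quick numerical illustration of this claim (my own sketch): the eigenvalues of an RBF kernel matrix built from random data are non-negative, up to floating-point round-off.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(20, 3)    # 20 samples in R^3
sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)         # RBF kernel matrix with sigma = 1

eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)    # expected: True (PSD up to round-off)
```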
Existence of Nonlinear Transform

We just showed that if K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) for any x_1, . . . , x_N, then K is symmetric positive semi-definite.
The converse also holds: if K is symmetric positive semi-definite for any x_1, . . . , x_N, then there exists a Φ such that K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j).
This converse is difficult to prove.
It is called the Mercer Condition.
Kernels satisfying Mercer's condition have a valid Φ.
You can use the condition to rule out invalid kernels.
But proving that a kernel is valid is still hard.
