
EECS 189 Introduction to Machine Learning

Fall 2020 Note 2


Our goal in machine learning is to extract a relationship from data. In regression tasks, this
relationship takes the form of a function y = f (x), where y ∈ R is some quantity that can be
predicted from an input x ∈ Rd , which should for the time being be thought of as some collection
of numerical measurements. The true relationship f is unknown to us, and our aim is to recover it
as well as we can from data. Our end product is a function ŷ = h(x), called the hypothesis, that
should approximate f. We assume that we have access to a dataset D = {(x_i, y_i)}_{i=1}^n, where each pair
(xi , yi ) is an example (possibly noisy or otherwise approximate) of the input-output mapping to be
learned. Since learning arbitrary functions is intractable, we restrict ourselves to some hypothesis
class H of allowable functions. More specifically, we typically employ a parametric model,
meaning that there is some finite-dimensional vector w ∈ Rd , the elements of which are known as
parameters or weights, that controls the behavior of the function. That is,

$$h_w(x) = g(x, w)$$

for some other function g. The hypothesis class is then the set of all functions induced by the
possible choices of the parameters w:

$$H = \{h_w \mid w \in \mathbb{R}^d\}$$

After designating a cost function L, which measures how poorly the predictions ŷ of the hypothesis
match the true output y, we can proceed to search for the parameters that best fit the data by
minimizing this function:
$$w^* = \arg\min_w L(w)$$
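To make this setup concrete, here is a minimal NumPy sketch (not part of the original note) of a parametric hypothesis and its squared-error cost on a small synthetic dataset; the data, the choice g(x, w) = x^⊤w, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset D = {(x_i, y_i)}: 50 points in R^3 (illustrative assumption).
n, d = 50, 3
X = rng.normal(size=(n, d))                 # rows are the inputs x_i
w_true = np.array([1.0, -2.0, 0.5])         # unknown "true" relationship (linear here)
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy examples of the mapping

def h(x, w):
    """Parametric hypothesis h_w(x) = g(x, w); here g is the inner product."""
    return x @ w

def L(w, X, y):
    """Cost: sum of squared prediction errors over the dataset."""
    residuals = h(X, w) - y
    return np.sum(residuals ** 2)

# Evaluate the cost at two candidate parameter vectors.
print(L(np.zeros(d), X, y), L(w_true, X, y))
```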

1 Ordinary Least Squares


Ordinary least squares (OLS) is one of the simplest regression problems, but it is well-understood
and practically useful. It is a linear regression problem, which means that we take hw to be of the
form h_w(x) = x^⊤w. We want
$$y_i \approx \hat{y}_i = h_w(x_i) = x_i^\top w$$
for each i = 1, . . . , n. This set of equations can be written in matrix form as
$$\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}}_{y} \approx \underbrace{\begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}}_{w}$$

In words, the matrix X ∈ Rn×d has the input datapoint xi as its ith row. This matrix is sometimes
called the design matrix. Usually n ≥ d, meaning that there are more datapoints than measure-
ments.
There will in general be no exact solution to the equation y = Xw (even if the data were perfect,
consider how many equations and variables there are), but we can find an approximate solution by
minimizing the sum (or equivalently, the mean) of the squared errors:
$$\min_w L(w) = \min_w \sum_{i=1}^{n} (x_i^\top w - y_i)^2 = \min_w \|Xw - y\|_2^2$$

Now that we have formulated an optimization problem, we want to go about solving it. We will see
that the particular structure of OLS allows us to compute a closed-form expression for a globally
optimal solution, which we denote w∗ols .
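As a sanity check on this formulation, the sketch below (an illustration with assumed synthetic data, not part of the note) minimizes the OLS objective numerically with NumPy's np.linalg.lstsq, which solves exactly this least squares problem, and confirms that random perturbations of the returned weights never achieve a lower cost.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))                            # design matrix
y = X @ rng.normal(size=d) + 0.05 * rng.normal(size=n) # noisy targets

# Solve min_w ||Xw - y||_2^2 directly.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

cost = lambda w: np.sum((X @ w - y) ** 2)

# Random perturbations should never do better than the OLS solution.
no_improvement = all(cost(w_ols) <= cost(w_ols + 1e-2 * rng.normal(size=d))
                     for _ in range(1000))
print(cost(w_ols), no_improvement)   # small cost, True
```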

1.1 Approach 1: Vector calculus


Calculus is the primary mathematical workhorse for studying the optimization of differentiable
functions. Recall the following important result: if L : Rd → R is continuously differentiable, then
any local optimum w∗ satisfies ∇L(w∗ ) = 0. In the OLS case,

$$\begin{aligned}
L(w) &= \|Xw - y\|_2^2 \\
&= (Xw - y)^\top (Xw - y) \\
&= (Xw)^\top Xw - (Xw)^\top y - y^\top Xw + y^\top y \\
&= w^\top X^\top X w - 2 w^\top X^\top y + y^\top y
\end{aligned}$$

Using the following results from matrix calculus,

$$\nabla_x (a^\top x) = a, \qquad \nabla_x (x^\top A x) = (A + A^\top) x$$

the gradient of L is easily seen to be

$$\begin{aligned}
\nabla L(w) &= \nabla_w (w^\top X^\top X w - 2 w^\top X^\top y + y^\top y) \\
&= \nabla_w (w^\top X^\top X w) - 2 \nabla_w (w^\top X^\top y) + \underbrace{\nabla_w (y^\top y)}_{0} \\
&= 2 X^\top X w - 2 X^\top y
\end{aligned}$$

where in the last line we have used the symmetry of X^⊤X to simplify X^⊤X + (X^⊤X)^⊤ = 2X^⊤X. Setting
the gradient to 0, we conclude that any optimum w∗ols satisfies

$$X^\top X w^*_{\text{ols}} = X^\top y$$

If X is full rank, then X^⊤X is as well (assuming n ≥ d), so we can solve for a unique solution

$$w^*_{\text{ols}} = (X^\top X)^{-1} X^\top y$$

Note: Although we write (X^⊤X)^{-1}, in practice one would not actually compute the inverse; it
is more numerically stable to solve the linear system of equations above (e.g. with Gaussian
elimination).
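The following sketch illustrates this point with assumed synthetic data: it computes w∗ols three ways, via the explicit inverse, via solving the normal equations with np.linalg.solve, and via np.linalg.lstsq (which factors X directly and is generally the most stable choice); all three should agree on a well-conditioned problem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# 1) The textbook formula with an explicit inverse (fine for exposition).
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2) Solving the normal equations X^T X w = X^T y as a linear system.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3) A least squares solver that works on X directly (most stable in practice).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_lstsq))
```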
In this derivation we have used the condition ∇L(w∗ ) = 0, which is a necessary but not sufficient
condition for optimality. We found a critical point, but in general such a point could be a local
minimum, a local maximum, or a saddle point. Fortunately, in this case the objective function
is convex, which implies that any critical point is indeed a global minimum. To show that L is
convex, it suffices to compute the Hessian of L, which in this case is

$$\nabla^2 L(w) = 2 X^\top X$$

and show that this is positive semi-definite:

$$\forall w, \quad w^\top (2 X^\top X) w = 2 (Xw)^\top X w = 2 \|Xw\|_2^2 \geq 0$$
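As an optional numerical check (not part of the note), the sketch below compares the closed-form gradient 2X^⊤Xw − 2X^⊤y against finite differences and verifies that the Hessian 2X^⊤X has no negative eigenvalues, using assumed random data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 60, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

L = lambda w: np.sum((X @ w - y) ** 2)

# Closed-form gradient from the derivation above.
grad = 2 * X.T @ X @ w - 2 * X.T @ y

# Finite-difference check of the gradient.
eps = 1e-6
fd = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(grad, fd, atol=1e-3))

# The Hessian 2 X^T X is PSD: all eigenvalues are >= 0 (up to roundoff).
eigvals = np.linalg.eigvalsh(2 * X.T @ X)
print(eigvals.min() >= -1e-10)
```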

1.2 Approach 2: Orthogonal projection


There is also a linear algebraic way to arrive at the same solution: orthogonal projections.
Recall that if V is an inner product space and S a subspace of V, then any v ∈ V can be decomposed
uniquely in the form
$$v = v_S + v_\perp$$
where v_S ∈ S and v_⊥ ∈ S^⊥. Here S^⊥ is the orthogonal complement of S, i.e. the set of vectors that
are perpendicular to every vector in S.
The orthogonal projection onto S, denoted P_S, is the linear operator that maps v to v_S in the
decomposition above. An important property of the orthogonal projection is that

$$\|v - P_S v\| \leq \|v - s\|$$

for all s ∈ S, with equality if and only if s = P_S v. That is,

$$P_S v = \arg\min_{s \in S} \|v - s\|$$

Proof. By the Pythagorean theorem,

$$\|v - s\|^2 = \|\underbrace{v - P_S v}_{\in S^\perp} + \underbrace{P_S v - s}_{\in S}\|^2 = \|v - P_S v\|^2 + \|P_S v - s\|^2 \geq \|v - P_S v\|^2$$

with equality holding if and only if ‖P_S v − s‖² = 0, i.e. s = P_S v. Taking square roots on both sides
gives ‖v − s‖ ≥ ‖v − P_S v‖ as claimed (since norms are nonnegative). □

[Figure omitted: a visual representation of the argument above.]

In the OLS case,

$$w^*_{\text{ols}} = \arg\min_w \|Xw - y\|_2^2$$

But observe that the set of vectors that can be written Xw for some w ∈ R^d is precisely the range
of X, which we know to be a subspace of R^n, so

$$\min_{z \in \operatorname{range}(X)} \|z - y\|_2^2 = \min_{w \in \mathbb{R}^d} \|Xw - y\|_2^2$$

By pattern matching with the earlier optimality statement about P_S, we observe that P_range(X) y =
Xw∗ols, where w∗ols is any optimum for the right-hand side. The projected point Xw∗ols is always
unique, but if X is full rank (again assuming n ≥ d), then the optimum w∗ols is also unique (as
expected). This is because X being full rank means that the columns of X are linearly independent,
in which case there is a one-to-one correspondence between w and Xw.
To solve for w∗ols, we need the following fact, often stated as part of the Fundamental Theorem of
Linear Algebra:

$$\operatorname{null}(X^\top) = \operatorname{range}(X)^\perp$$

Since we are projecting onto range(X), the orthogonality condition for optimality is that
y − P_range(X) y ⊥ range(X), i.e. y − Xw∗ols ∈ null(X^⊤). This leads to the equation

$$X^\top (y - X w^*_{\text{ols}}) = 0$$

which is equivalent to

$$X^\top X w^*_{\text{ols}} = X^\top y$$

as before.
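The sketch below (an illustration with assumed random data) verifies both conclusions numerically: the least squares residual y − Xw∗ols lies in null(X^⊤), and Xw∗ols coincides with the projection of y onto range(X) computed from an orthonormal basis obtained by QR factorization.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 80, 6
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_ols                      # candidate for P_range(X) y

# Orthogonality condition: the residual lies in null(X^T).
print(np.allclose(X.T @ (y - y_hat), 0, atol=1e-8))

# Compare with an explicit projection built from an orthonormal basis of range(X).
Q, _ = np.linalg.qr(X)                 # columns of Q span range(X)
print(np.allclose(Q @ (Q.T @ y), y_hat))
```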

2 Ridge Regression
While Ordinary Least Squares can be used for solving linear least squares problems, it falls short
due to numerical instability and generalization issues. Numerical instability arises when the features
of the data are close to collinear (leading to linearly dependent feature columns), causing the input
matrix X to lose rank or have singular values that are very close to 0. Why are small singular
values bad? Let us illustrate this via the singular value decomposition (SVD) of X:

$$X = U \Sigma V^\top$$

where U ∈ R^{n×n}, Σ ∈ R^{n×d}, V ∈ R^{d×d}. In the context of OLS, we must have that X^⊤X is invertible,
or equivalently, rank(X^⊤X) = rank(X^⊤) = rank(X) = d. Assuming that X and X^⊤ have rank d, we can
express the SVD of X as

$$X = U \begin{bmatrix} \Sigma_d \\ 0 \end{bmatrix} V^\top$$

where Σ_d ∈ R^{d×d} is a diagonal matrix with strictly positive entries. Now let's try to expand the
(X^⊤X)^{-1} term in OLS using the SVD of X:

$$\begin{aligned}
(X^\top X)^{-1} &= \left( V \begin{bmatrix} \Sigma_d & 0 \end{bmatrix} U^\top U \begin{bmatrix} \Sigma_d \\ 0 \end{bmatrix} V^\top \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_d & 0 \end{bmatrix} I \begin{bmatrix} \Sigma_d \\ 0 \end{bmatrix} V^\top \right)^{-1} \\
&= (V \Sigma_d^2 V^\top)^{-1} = (V^\top)^{-1} (\Sigma_d^2)^{-1} V^{-1} = V \Sigma_d^{-2} V^\top
\end{aligned}$$

This means that (X^⊤X)^{-1} will have singular values that are the inverse squares of the singular
values of X, potentially leading to extremely large singular values when the singular values of X are
close to 0. Such excessively large singular values can be very problematic for numerical stability.
In addition, abnormally large entries in the optimal w would prevent OLS from generalizing to
unseen data.
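Here is a small demonstration of this effect, assuming two nearly collinear synthetic features; the exact numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)    # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + 0.01 * rng.normal(size=n)

# The smallest singular value of X is tiny ...
print(np.linalg.svd(X, compute_uv=False))

# ... so (X^T X)^{-1} has an enormous largest singular value,
print(np.linalg.svd(np.linalg.inv(X.T @ X), compute_uv=False))

# and the OLS weights are huge and change drastically under a small perturbation of y.
w1, *_ = np.linalg.lstsq(X, y, rcond=None)
w2, *_ = np.linalg.lstsq(X, y + 1e-2 * rng.normal(size=n), rcond=None)
print(w1, w2)
```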

There is a very simple solution to these issues: discourage the entries of w from becoming too large.
We can do this by adding a penalty term constraining the norm of w. For a fixed, small scalar
λ > 0, we now have:

$$\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$

Note that the λ in our objective function is a hyperparameter that controls how strongly large
values in w are penalized. Just like the degree in polynomial features, λ is a value that we must
choose ourselves, typically through validation. Let's expand the terms of the objective function:

$$\begin{aligned}
L(w) &= \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \\
&= w^\top X^\top X w - 2 w^\top X^\top y + y^\top y + \lambda w^\top w
\end{aligned}$$

Finally, take the gradient of the objective and find the value of w that achieves 0 for the gradient:

$$\begin{aligned}
\nabla_w L(w) &= 0 \\
2 X^\top X w - 2 X^\top y + 2 \lambda w &= 0 \\
(X^\top X + \lambda I) w &= X^\top y \\
w^*_{\text{ridge}} &= (X^\top X + \lambda I)^{-1} X^\top y
\end{aligned}$$
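A minimal sketch of this closed form, under assumed synthetic data: it solves the regularized system rather than forming the inverse, checks that the gradient vanishes at w∗ridge, and shows the shrinkage effect relative to OLS.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, lam = 100, 8, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Closed-form ridge solution (solve the linear system rather than inverting).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# It should zero the gradient of ||Xw - y||^2 + lam ||w||^2 ...
grad = 2 * X.T @ (X @ w_ridge - y) + 2 * lam * w_ridge
print(np.allclose(grad, 0, atol=1e-8))

# ... and (with lam > 0) its norm is smaller than that of the OLS solution.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_ridge) <= np.linalg.norm(w_ols))
```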

This value is guaranteed to achieve the (unique) global minimum, because the objective function is
strongly convex. To show that L is strongly convex, it suffices to compute the Hessian of L, which
in this case is

$$\nabla^2 L(w) = 2 X^\top X + 2 \lambda I$$

and show that this is positive definite (PD):

$$\forall w \neq 0, \quad w^\top (2 X^\top X + 2 \lambda I) w = 2 (Xw)^\top X w + 2 \lambda w^\top w = 2 \|Xw\|_2^2 + 2 \lambda \|w\|_2^2 > 0$$

Since the Hessian is positive definite, we can equivalently say that its eigenvalues are strictly
positive and that the objective function is strongly convex. A useful property of strongly convex
functions is that they have a unique optimum, so the solution to ridge regression is unique. We
cannot make such guarantees about ordinary least squares, because the corresponding Hessian could
have eigenvalues that are 0. Let us explore the case in OLS when the Hessian has a 0 eigenvalue.
In this context, the term X^⊤X is not invertible, but this does not imply that no solution exists! In
OLS, there always exists a solution; when the Hessian is PD that solution is unique, and when the
Hessian is merely PSD there are infinitely many solutions. (A solution to X^⊤Xw = X^⊤y always
exists because the range of X^⊤X equals the range of X^⊤; since X^⊤y lies in the range of X^⊤, it must
also lie in the range of X^⊤X, so there is always some w satisfying X^⊤Xw = X^⊤y.)
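The sketch below illustrates this situation with an assumed rank-deficient design matrix: the normal equations admit infinitely many solutions (any null-space direction can be added), whereas ridge regression returns a single well-defined solution.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 50, 4
B = rng.normal(size=(n, 3))
X = np.column_stack([B, B[:, 0] + B[:, 1]])   # 4th column is dependent: rank(X) = 3
y = rng.normal(size=n)

# lstsq returns one solution of the normal equations (the minimum-norm one).
w0, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any vector in null(X) can be added without changing X^T X w = X^T y.
null_dir = np.array([1.0, 1.0, 0.0, -1.0])    # X @ null_dir = 0 by construction
w1 = w0 + 5.0 * null_dir
print(np.allclose(X.T @ X @ w0, X.T @ y), np.allclose(X.T @ X @ w1, X.T @ y))

# Ridge regression picks a single solution even though X is rank deficient.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)
```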
The technique we just described is known as ridge regression. Note that now the expression
X^⊤X + λI is invertible, regardless of the rank of X. Writing the SVD of X in the general (possibly
rank-deficient) form

$$X = U \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} V^\top$$

where r = rank(X) and Σ_r ∈ R^{r×r} is diagonal with strictly positive entries, let's find
(X^⊤X + λI)^{-1} through the SVD:

$$\begin{aligned}
(X^\top X + \lambda I)^{-1} &= \left( V \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix}^\top U^\top U \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} V^\top + \lambda I \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_r^2 & 0 \\ 0 & 0 \end{bmatrix} V^\top + \lambda I \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_r^2 & 0 \\ 0 & 0 \end{bmatrix} V^\top + V (\lambda I) V^\top \right)^{-1} \\
&= \left( V \left( \begin{bmatrix} \Sigma_r^2 & 0 \\ 0 & 0 \end{bmatrix} + \lambda I \right) V^\top \right)^{-1} \\
&= \left( V \begin{bmatrix} \Sigma_r^2 + \lambda I & 0 \\ 0 & \lambda I \end{bmatrix} V^\top \right)^{-1} \\
&= (V^\top)^{-1} \begin{bmatrix} \Sigma_r^2 + \lambda I & 0 \\ 0 & \lambda I \end{bmatrix}^{-1} V^{-1} \\
&= V \begin{bmatrix} (\Sigma_r^2 + \lambda I)^{-1} & 0 \\ 0 & \frac{1}{\lambda} I \end{bmatrix} V^\top
\end{aligned}$$

Now with our slight tweak, the matrix X^⊤X + λI has become full rank and thus invertible. The
singular values of (X^⊤X + λI)^{-1} are 1/(σ_i² + λ) and 1/λ, meaning that they are guaranteed to be
at most 1/λ, solving our numerical instability issues. Furthermore, we have partially solved the
overfitting issue. By penalizing the norm of w, we keep the weights corresponding to relevant
features that capture the main structure of the true model, while shrinking the weights corresponding
to complex features that only serve to fine-tune the model and fit noise in the data.
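To close, here is a short numerical check (with an assumed rank-deficient X) that the singular values of (X^⊤X + λI)^{-1} match 1/(σ_i² + λ) and are bounded by 1/λ, as derived above.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, lam = 100, 5, 0.5
B = rng.normal(size=(n, 4))
X = np.column_stack([B, B[:, 0]])     # repeated column: rank(X) = 4 < d

# Singular values of X (the last one is numerically ~0 because of the repeat).
sigma = np.linalg.svd(X, compute_uv=False)

# X^T X + lam*I is invertible despite the rank deficiency ...
inv = np.linalg.inv(X.T @ X + lam * np.eye(d))

# ... and its singular values match 1/(sigma_i^2 + lam) from the derivation,
# so they are all bounded above by 1/lam.
predicted = 1.0 / (sigma ** 2 + lam)
actual = np.linalg.svd(inv, compute_uv=False)
print(np.allclose(np.sort(actual), np.sort(predicted)))
print(actual.max() <= 1.0 / lam + 1e-12)
```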

Note 2, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission.