
02 - Linear Regression


Machine Learning

Introduction
• In the case of classification, you were provided a training set, which consists of
pairs of examples.

S_n = {(x^(t), y^(t)), t = 1, ..., n},  x^(t) ∈ R^d,  y^(t) ∈ {−1, +1}
• These examples are just feature vectors with associated labels.
• You have n examples, and the goal of your classifier is to learn the mapping from
the feature vectors to the corresponding labels.
• Feature vectors are of dimension d.
• Because we are talking about the case of classification, we assume that our
labels are, for instance, −1 and +1.

• Let’s consider the case of stock exchange monitoring.
• We looked at this problem, again, as a classification problem.
• Our feature vector would, for instance, record the prices for the previous d days.
• More specifically, we look at the prices of the d days preceding the moment at
which we have to make a decision; this is our history. We then recommend
either +1 or −1, either sell or buy.

• In many problems, it is not enough to answer yes-or-no questions.
• Sometimes you really want to know the extent of something happening.
• For instance, in this case, one doesn’t only want the classifier to tell someone to
sell or to buy. We also want to know how much we expect the price to
increase, i.e., the extent of the increase.
• If we look at this as a classification problem, we are really oversimplifying by
limiting the prediction to just two values.
• What we ideally want to do is, for instance, to predict the next day’s price.

• Simple regression analysis is a statistical tool that gives us the ability to estimate
the mathematical relationship between a dependent variable (usually called y)
and an independent variable (usually called x).
• The dependent variable is the variable for which we want to make a prediction.
• While various non-linear forms may be used, simple linear regression models
are the most common.

• The primary goal of quantitative analysis is to use current information about a
phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret
them as representing the observed values of an independent (or predictor)
variable x and a dependent (or response) variable y.
• The goal of the analyst who studies the data is to find a functional relation
between the response variable y and the predictor variable x.

The statement that the relation between x and y is statistical should be interpreted
as providing the following guidelines:
• Regard y as a random variable.
• For each x, take f (x) to be the expected value (i.e., mean value) of y.
• Given that E(y) denotes the expected value of y, call the equation E(y) = f (x) the
regression function.

[Figures: a series of scatter plots of the dependent variable Y against the
independent variable X, illustrating possible relationships between the two.]
[Figure: a fitted regression line; for a given value of the independent variable X,
the predicted value lies on the line, and the error is the gap between the actual
value and the predicted value.]
Worked example. Consider the following data:

 x   y
 1   3
 2   4
 3   2
 4   4
 5   5

Mean: x̄ = 3, ȳ = 3.6, giving the mean point (3, 3.6).
Adding the deviations from the means:

 x   y   x − x̄   y − ȳ   (x − x̄)²   (x − x̄)(y − ȳ)
 1   3    −2      −0.6       4            1.2
 2   4    −1       0.4       1           −0.4
 3   2     0      −1.6       0            0
 4   4     1       0.4       1            0.4
 5   5     2       1.4       4            2.8
                          ∑ = 10       ∑ = 4

θ = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)² = 4/10 = 0.4
The fitted line is y = θx + θ₀. Since it passes through the mean point (3, 3.6):

3.6 = 0.4 × 3 + θ₀
3.6 = 1.2 + θ₀
θ₀ = 2.4
The fitted line y = 0.4x + 2.4 gives the predictions:

ŷ = 0.4 × 1 + 2.4 = 2.8
ŷ = 0.4 × 2 + 2.4 = 3.2
ŷ = 0.4 × 3 + 2.4 = 3.6
ŷ = 0.4 × 4 + 2.4 = 4.0
ŷ = 0.4 × 5 + 2.4 = 4.4

The job is to find the line that minimizes the errors.
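As an aside (not part of the original slides), the worked example above can be reproduced with a few lines of Python using the same mean-centering formulas:

```python
# Least-squares fit of y = theta * x + theta_0 for the worked example above.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]

n = len(xs)
x_bar = sum(xs) / n  # 3.0
y_bar = sum(ys) / n  # 3.6

# Slope: theta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) = 4 / 10
theta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))

# Intercept: the fitted line passes through the mean point (x_bar, y_bar).
theta_0 = y_bar - theta * x_bar

print(theta, theta_0)                     # 0.4 2.4
print([theta * x + theta_0 for x in xs])  # ~[2.8, 3.2, 3.6, 4.0, 4.4]
```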
• In a regression problem, the setup is the same as in classification, except for
the nature of y^(t).

S_n = {(x^(t), y^(t)), t = 1, ..., n},  x^(t) ∈ R^d,  y^(t) ∈ R

• The goal of our model will be to learn how to map the feature vectors
to these continuous values.

• There could be many different responses for a value of x.
• In simple linear regression, the model assumes that for each value of x the
observed values of the response variable y are normally distributed with a mean
that depends on x.
• We assume that these means all lie on a straight line when plotted against x (a
line of means).

Regression analysis serves three major purposes:
• Description
• Control
• Prediction

The several purposes of regression analysis frequently overlap in practice.

Formal Statement of the Model
General regression model

y = θ·x + θ₀ + ε

• θ and θ₀ are parameters
• x is a known constant
• The deviations ε ~ N(0, σ²)

• The values of the regression parameters θ0 and θ are not known. We estimate
them from data.
• θ indicates the change in the mean response per unit increase in x.

• We will write an estimated regression line based on sample data as

ŷ = θ·x + θ₀

• The method of least squares chooses the values for θ0 and θ to minimize the
sum of squared errors

SSE = ∑_{i=1}^{n} (y_i − ŷ_i)² = ∑_{i=1}^{n} (y_i − (θ·x_i + θ₀))²

Regression as a machine learning technique
Since it’s a linear regression, we take our predictor variable x as well as some
parameters (the same way as for a classifier), and we just use a linear combination
(a linear weighting) of the coordinates of x.

f(x; θ, θ₀) = ∑_{i=1}^{d} θ_i x_i + θ₀ = θ·x + θ₀

In order to simplify things and for compactness of notation, we assume that
θ₀ = 0.
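In code, this prediction function is just a dot product plus an offset — a minimal sketch (the function name is my own):

```python
def predict(x, theta, theta_0=0.0):
    """Linear prediction f(x; theta, theta_0) = theta . x + theta_0."""
    return sum(th_i * x_i for th_i, x_i in zip(theta, x)) + theta_0

print(predict([1.0, 2.0], [0.5, -0.25]))  # 0.5*1.0 - 0.25*2.0 + 0 = 0.0
```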

You can think of it as a very simple linear combination. What if our dependency were
much more complex?

It is still possible to use a linear model by making an appropriate representation of
the feature vector.

We’re going to take our problem, map it into a very complex space, and in this space
linear regression may be sufficient.
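One standard instance of such a mapping — an illustration on my part, since the slides don’t name one — is a polynomial feature map:

```python
def polynomial_features(x, degree=3):
    """Map a scalar x to the feature vector (x, x^2, ..., x^degree)."""
    return [x ** k for k in range(1, degree + 1)]

# The model f(x; theta) = theta . polynomial_features(x) is still linear in
# theta, even though it is a non-linear (here, cubic) function of the input x.
print(polynomial_features(2.0))  # [2.0, 4.0, 8.0]
```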

Issues to address:

• We need to look at how we will formulate our objective. What would be an
appropriate objective that can quantify the extent of our mistakes?

• The second problem we need to address is how we actually do the learning
(the learning algorithm). We have an objective, but we still need to be able
to effectively utilize our training data to find the best parameter set.

• Regularization: in order to have a general model that estimates future
cases well.

Empirical Risk
Intuitively, what we want to do is quantify how much our predictions deviate
from the known values of y.
Because these are continuous numbers, we can simply look at how far the predictions
are from the expected values.
This objective is called the empirical risk, and we denote it R_n.

R_n(θ) = (1/n) ∑_{t=1}^{n} (y^(t) − θ·x^(t))² / 2
Dividing the SSE by 2 is only for mathematical convenience.
A question we can ask ourselves: why do we square the distance between our
prediction and the true value?
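Translated directly into Python (a sketch assuming θ₀ = 0, consistent with the simplification above):

```python
def empirical_risk(theta, xs, ys):
    """R_n(theta) = (1/n) * sum_t (y^(t) - theta . x^(t))^2 / 2."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - sum(th_i * x_i for th_i, x_i in zip(theta, x))
        total += residual ** 2 / 2
    return total / len(xs)

print(empirical_risk([0.5], [[2.0], [4.0]], [1.0, 2.0]))  # perfect fit -> 0.0
```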

Objective
The goal is not only to optimize the training results (even though this is the best we
can do in practice).
What we really want is generalization: how well do we generalize to the test
examples?
In an ideal case, what we would minimize is something like this:

Training risk:
R_n(θ) = (1/n) ∑_{t=1}^{n} (y^(t) − θ·x^(t))² / 2

Test risk (on n' unseen examples):
R_{n'}(θ) = (1/n') ∑_{t=n+1}^{n+n'} (y^(t) − θ·x^(t))² / 2

Two possible types of mistakes:
• Structural: the linear model doesn’t fit the data; the mapping is highly
non-linear.

• Estimation: the dataset is very limited, so even if we know that the mapping
itself is linear, we cannot estimate it correctly.

The objective is to minimize R_n: find the θ which results in the smallest empirical
risk.

Learning Algorithm: Gradient Based Approach

∇_θ [ (y^(t) − θ·x^(t))² / 2 ] = (y^(t) − θ·x^(t)) ∇_θ (y^(t) − θ·x^(t))
                               = −(y^(t) − θ·x^(t)) x^(t)

Algorithm:
θ ← 0
Randomly pick t ∈ {1, …, n}
θ ← θ − η (−(y^(t) − θ·x^(t)) x^(t)) = θ + η (y^(t) − θ·x^(t)) x^(t)

A variable learning rate can also be used: η_k = 1 / (1 + k)
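Put together, a minimal Python sketch of this stochastic gradient algorithm (the step count and function name are my own; the slides specify only the update rule):

```python
import random

def sgd_linear_regression(xs, ys, num_steps=10_000):
    """SGD for linear regression with theta_0 = 0.

    Update: theta <- theta + eta_k * (y^(t) - theta . x^(t)) * x^(t),
    with the variable learning rate eta_k = 1 / (1 + k).
    """
    theta = [0.0] * len(xs[0])             # theta <- 0
    for k in range(num_steps):
        eta = 1.0 / (1.0 + k)              # variable learning rate
        t = random.randrange(len(xs))      # randomly pick t in {1, ..., n}
        x, y = xs[t], ys[t]
        residual = y - sum(th_i * x_i for th_i, x_i in zip(theta, x))
        theta = [th_i + eta * residual * x_i for th_i, x_i in zip(theta, x)]
    return theta
```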
A few notes about the algorithm:

• The parameter is updated (corrected) at every step where there is some
discrepancy. The question here is not whether there is a mistake or not;
we look at how large the difference is. If the prediction and the correct
value deviate a lot, we correct more, and if they are very close to each
other, we correct less. This is the difference from the classification
algorithm.

• The algorithm is somewhat self-correcting. If, for example, y^(t) > θ·x^(t),
then the expression y^(t) − θ·x^(t) is positive, and we push θ in the
direction of the point x^(t). Each time a new point is selected, it may push θ
in its own direction, bringing the predictions closer to the true values.

Learning Algorithm: Closed form solution

R_n(θ) = (1/n) ∑_{t=1}^{n} (y^(t) − θ·x^(t))² / 2

Taking the gradient at the minimizer θ̂:

∇_θ R_n(θ) |_{θ=θ̂} = (1/n) ∑_{t=1}^{n} ∇_θ [ (y^(t) − θ·x^(t))² / 2 ] |_{θ=θ̂}
                   = −(1/n) ∑_{t=1}^{n} (y^(t) − θ̂·x^(t)) x^(t)
                   = −(1/n) ∑_{t=1}^{n} y^(t) x^(t) + (1/n) ∑_{t=1}^{n} (θ̂·x^(t)) x^(t)
                   = −b + (1/n) ∑_{t=1}^{n} x^(t) (x^(t))^T θ̂
                   = −b + A θ̂

where b = (1/n) ∑_{t=1}^{n} y^(t) x^(t) ∈ R^d (note that θ̂·x^(t) is a dot product,
i.e. a scalar), and A = (1/n) ∑_{t=1}^{n} x^(t) (x^(t))^T is a d × d matrix.

Setting the gradient to zero gives A θ̂ = b, hence θ̂ = A⁻¹ b — provided A⁻¹ exists.
Two caveats:

• A is invertible only if our x^(t) span R^d. Do we have enough samples in our dataset?

• Computational cost: solving for θ̂ requires inverting the d × d matrix A, which
takes O(d³) time.
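Using NumPy, the closed-form solution is a few lines (a sketch; `np.linalg.solve` is preferred over explicitly forming A⁻¹):

```python
import numpy as np

def closed_form_fit(X, y):
    """Solve A theta = b with A = (1/n) sum x x^T and b = (1/n) sum y x."""
    n = X.shape[0]
    A = X.T @ X / n                 # d x d matrix
    b = X.T @ y / n                 # vector in R^d
    return np.linalg.solve(A, b)    # raises LinAlgError if A is singular

# Data generated by y = 2*x1 - x2 is recovered exactly:
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])
print(closed_form_fit(X, y))        # ~[ 2. -1.]
```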

Regularization and Generalization

y = θ·x + θ₀ + ε,  ε ~ N(0, σ²)



Regularization
J_{λ,n}(θ) = (λ/2) ‖θ‖² + R_n(θ)
           = (λ/2) ‖θ‖² + (1/n) ∑_{t=1}^{n} (y^(t) − θ·x^(t))² / 2

Both the gradient-based and the closed-form solutions can very easily be adjusted to
this new loss function.



For a single training example t, the gradient of the regularized loss is:

∇_θ [ (λ/2) ‖θ‖² + (y^(t) − θ·x^(t))² / 2 ] = λθ − (y^(t) − θ·x^(t)) x^(t)

The update of the parameter θ in the gradient-based algorithm will therefore be:

θ ← θ − η (λθ − (y^(t) − θ·x^(t)) x^(t))
  = (1 − λη) θ + η (y^(t) − θ·x^(t)) x^(t)
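This regularized update drops into the earlier SGD sketch with a one-line change (a sketch of a single step):

```python
def ridge_sgd_update(theta, x, y, eta, lam):
    """One step: theta <- (1 - lam*eta) * theta + eta * (y - theta . x) * x."""
    residual = y - sum(th_i * x_i for th_i, x_i in zip(theta, x))
    return [(1.0 - lam * eta) * th_i + eta * residual * x_i
            for th_i, x_i in zip(theta, x)]
```

Each step first shrinks θ toward zero by the factor (1 − λη) and then applies the usual data-driven correction.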

