Unit 3 Notes
Linear regression is the most general form of predictive analysis and one of the most broadly used statistical techniques. It measures the relationship between one or more predictor variables and one outcome variable; that is, regression analysis examines the relationship between a dependent variable and independent variables.
UQ. What do you mean by a linear regression? Which applications are best modeled by linear regression?
(SPPU - Q. 5(a), March 19, Q. 5(b), March 19, Q. 1(b), Nov./Dec. 17, 4 Marks)
A linear regression model illustrates the relationship between two variables or factors; regression analysis is generally used to show the correlation between two variables.
The variable being predicted in the linear regression equation is known as the dependent variable. Let us call it Y. The variables that predict the dependent variable are known as independent variables; let us call them X.
Y is the dependent variable because the prediction (Y) depends on the other variables (X).
In simple linear regression analysis, each observation has two variables: the independent variable and the dependent variable. Multiple regression analysis involves two or more independent variables and examines how they relate to the dependent variable. The equation that defines how Y is related to X is called the regression model.
The coefficient reflects that a change in the independent variable is not necessarily matched by an equal change in Y.
Now let us see how such a line is found. The goal is to put a line through our data that fits the data best. A regression line may show a positive linear relationship (the line slopes up), a negative linear relationship (the line slopes down), or no relationship at all (a flat line).
The point where the line crosses the vertical axis is called the constant (intercept).
The steeper the slope, the greater the increase in salary per year of experience. For example, if an extra year of experience would normally add $10,000 to the salary (y), a steeper slope might instead add something like $15,000.
When we look at a graph, vertical lines can be drawn from the line to our actual
observations. The actual observations can be clearly seen as the dots, while the line
displays the model observations (the predictions).
Fig. 3.1.5
Fig. 3.1.6
These vertical lines show the difference between what an employee actually earns and what the model predicts the employee would earn. To find the best line, we take the sum of all the squared differences and choose the line for which this sum is minimum. This is known as the ordinary least squares (OLS) method.
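To make this concrete, here is a minimal sketch of ordinary least squares on made-up years-of-experience versus salary numbers (the figures are purely illustrative and not taken from the text):

```python
# A minimal sketch of ordinary least squares on made-up experience/salary data.
import numpy as np

experience = np.array([1, 2, 3, 4, 5, 6], dtype=float)   # years of experience (x)
salary = np.array([40_000, 52_000, 61_000, 69_000, 82_000, 91_000], dtype=float)  # observed salary (y)

# Closed-form OLS estimates: slope b and intercept a minimise the sum of squared residuals.
x_mean, y_mean = experience.mean(), salary.mean()
b = np.sum((experience - x_mean) * (salary - y_mean)) / np.sum((experience - x_mean) ** 2)
a = y_mean - b * x_mean

predictions = a + b * experience
sse = np.sum((salary - predictions) ** 2)   # the quantity OLS minimises
print(f"slope={b:.2f}, intercept={a:.2f}, SSE={sse:.2f}")
```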
Regression is one of the parametric techniques that make assumptions. Let's have a
glance at the assumptions it makes :
1. A linear and additive relationship is present between dependent variable (DV) and
independent variable (IV). Linear relationship means, that the change in DV by 1 unit
change in IV is constant. By additive it means, the effect of X on Y is independent of
other variables.
2. There must be no correlation among the independent variables, as the presence of correlation in the independent variables causes multicollinearity. That is, it becomes difficult for the model to determine the actual effect of each IV on the DV.
3. The error terms should have constant variance. The absence of constant variance causes heteroskedasticity.
4. The error at εt should not determine the error at εt+1, i.e. the error terms must be uncorrelated.
5. Correlation in the error terms is called autocorrelation. Its presence severely affects the regression coefficients and standard error values, since these are computed under the assumption of uncorrelated error terms.
6. The dependent variable and the error terms should follow a normal distribution.
These assumptions make regression relatively restrictive. By restrictive we mean that the performance of a regression model depends on how well these assumptions are satisfied.
Simple linear regression has a single input, and statistics can be used to estimate the coefficients. This requires calculating statistical properties from the data, such as means, standard deviations, correlations and covariance. All of the data must be available in order to traverse it and calculate these statistics. The hypothesis function is given by
y = β0 + β1 x
In simple linear regression, the topic of this section, the predictions of Y when plotted as
a function of X form a straight line.
The example data in Table 3.1.1 is plotted in Fig. 3.1.7. There exists a positive relationship between X and Y: if Y is predicted from X, the higher the value of X, the higher the prediction of Y.
Table 3.1.1
X	Y
1.00	1.00
2.00	2.00
3.00	1.30
4.00	3.75
5.00	2.25
Linear regression finds the best-fitting straight line through the points; that best-fitting line is known as the regression line. In Fig. 3.1.8, the black diagonal line is the regression line, which consists of the predicted score on Y for each possible value of X. The errors of prediction are represented by the vertical lines from each point to the regression line. As shown in Fig. 3.1.8, the red point is near the regression line, so its error of prediction is small, whereas the yellow point is much farther from the line, so its error of prediction is larger.
The black line consists of the predictions, the points depict the actual data, and the vertical lines between the points and the black line denote the prediction errors.
Table 3.1.2 (columns : X, Y, Y', Y – Y', (Y – Y')²)
We have not yet defined the term "best-fitting line." The most widely used criterion is the line that minimises the sum of the squared errors of prediction, and this is the criterion used to find the line in Fig. 3.1.8.
The squared errors of prediction are given in the last column of Table 3.1.2. Compared with any other regression line, the line in Fig. 3.1.8 gives the lowest sum of squared errors of prediction shown in Table 3.1.2.
Y' = bX + A
where Y' is the predicted value, b is the slope of the line, and A is the Y intercept.
The equation for the line in Fig. 3.1.8 is
Y' = 0.425X + 0.785
For X = 1, Y' = 0.425(1) + 0.785 = 1.21
For X = 2, Y' = 0.425(2) + 0.785 = 1.635
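As a quick check, the slope and intercept above can be reproduced from the Table 3.1.1 data with a few lines of Python (a sketch, not part of the original text):

```python
# Reproducing the regression line Y' = 0.425X + 0.785 from the Table 3.1.1 data.
import numpy as np

X = np.array([1.00, 2.00, 3.00, 4.00, 5.00])
Y = np.array([1.00, 2.00, 1.30, 3.75, 2.25])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
A = Y.mean() - b * X.mean()
print(round(b, 3), round(A, 3))        # 0.425 0.785

Y_pred = b * X + A                     # predicted scores Y'
errors = Y - Y_pred                    # errors of prediction (vertical distances)
print(np.round(Y_pred, 3), round(np.sum(errors ** 2), 3))
```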
The relationship between more than one explanatory variables and response variable is
modeled by multiple linear regression through fitting a linear equation to observed data.
Each value of the independent variable x is associated with a value of the dependent
variable y. The population regression line for p explanatory variables x1, x2, ..., xp is defined to be
μy = β0 + β1x1 + β2x2 + ... + βpxp
This line describes how the mean response μy changes with the explanatory variables. The observed values of y vary about their mean μy and are assumed to have the same standard deviation σ. The parameters β0, β1, ..., βp are estimated by the fitted values b0, b1, ..., bp of the sample regression line.
The multiple regression model includes a term for this variation, since the observed values of y vary about their mean μy:
DATA = FIT + RESIDUAL, where the "FIT" term represents the expression β0 + β1x1 + β2x2 + ... + βpxp.
The "RESIDUAL" term signifies the deviations of the observed values y from their mean μy, which are normally distributed with mean 0 and variance σ². The notation for the model deviations is ε.
So far we have learnt the concept of simple linear regression, where a single predictor variable X was used to model the response variable Y. In many applications, more than one factor influences the response.
Multiple regression models describe how a single response variable Y depends linearly on a number of predictor variables.
Examples
The selling price of a house can depend on various factors like the popularity of the
location, the number of bedrooms, the number of bathrooms, the year the house was
built, the square footage of the plot etc.
A child's height can depend on the height of the parents, the nutrition the child receives, and other environmental factors.
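As a hedged sketch of the house-price example, the snippet below fits a multiple linear regression with scikit-learn; the feature values and prices are invented purely for illustration:

```python
# A sketch of multiple linear regression for the house-price example; all numbers are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: bedrooms, bathrooms, year built, plot square footage
X = np.array([
    [3, 2, 1995, 1800],
    [4, 3, 2005, 2400],
    [2, 1, 1980, 1200],
    [5, 4, 2015, 3000],
    [3, 2, 2000, 2000],
], dtype=float)
y = np.array([250_000, 340_000, 180_000, 450_000, 275_000], dtype=float)  # selling prices

model = LinearRegression().fit(X, y)          # estimates b0, b1, ..., bp
print("intercept b0:", model.intercept_)
print("coefficients b1..bp:", model.coef_)
print("prediction for a new house:", model.predict([[4, 2, 2010, 2200]]))
```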
The “least squares” method is a type of mathematical regression analysis that determines
the best fit line for a collection of data, displaying the relationship between the points
visually.
UQ. What do you mean by least square method? Explain least square method in the context of linear regression. (SPPU - Q. 2(b), Dec. 19, 5 Marks, Q. 1(a), May/June 2016, 5 Marks)
UQ. What do you mean by coefficient of regression? Explain SST, SSE, SSR, MSE in the context of regression. (SPPU - Q. 4(b), Dec. 19, 5 Marks)
Regression analysis is used to identify the linear relationship between a single dependent variable and an independent variable.
Y = β0 + β1 X + ε
where,
Y = value of the dependent variable
X = independent variable
ε = random error
b = Y-axis intercept = β0
m = slope of the line = β1
Fig. 3.3.1
Even if X = 0, i.e. the value of the independent variable is zero, the expected value of Y is β0.
The regression line does not need to have the same number of sample points above and below it.
β0 = ȳ – β1 x̄ ; β1 = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)²
The least square method is discussed with respect to simple univariate linear regression,
Y = β0 + β1X + ε
Here, the target is to find the values of β0 and β1 that best fit the given sample data.
y1 is predicted as ŷ1 = β0 + β1x1
y2 is predicted as ŷ2 = β0 + β1x2
For each point, we can calculate the difference between the actual value y and the value predicted by the regression line,
for point x1, e1 = y1 – ŷ1
The difference between the actual value and the predicted value is called the residual or error.
The least square method is used to find the values of β0 and β1 that give the regression line for which the sum of squares of all errors (SSE) is minimum.
SSE = Σ ei² (summation over i = 1 to n)
The objective of the least square method is to find the values of β0 and β1 for which Σ ei² is minimum, i.e. the sum of squared errors is minimum,
SSE = Σ ei² = Σ (yi – ŷi)²
SSE = Σ (yi – β0 – β1xi)² ...(1)
To get the minimum value of SSE, the partial derivatives of SSE with respect to β0 and β1 must be equal to 0,
∂SSE/∂β0 = 0 ; ∂SSE/∂β1 = 0
∂SSE/∂β0 = Σ 2(yi – β0 – β1xi)(–1) = 0
which can be written as,
–2 Σ (yi – β0 – β1xi) = 0
Σ (yi – β0 – β1xi) = 0 ...(2)
Similarly, for β1, take the partial derivative to get the value of β1,
∂SSE/∂β1 = 0
Σ 2(yi – β0 – β1xi)(–xi) = 0
–2 Σ xi(yi – β0 – β1xi) = 0
Σ xi(yi – β0 – β1xi) = 0 ...(3)
Expanding equation (2),
Σ yi – nβ0 – β1 Σ xi = 0
nβ0 = Σ yi – β1 Σ xi
β0 = (Σ yi)/n – β1 (Σ xi)/n
But x̄ = (Σ xi)/n and ȳ = (Σ yi)/n, so
β0 = ȳ – β1 x̄ ...(4)
Expanding equation (3) and substituting β0 = ȳ – β1 x̄,
Σ xiyi – (ȳ – β1 x̄) Σ xi – β1 Σ xi² = 0
Σ xiyi – ȳ Σ xi = β1 (Σ xi² – x̄ Σ xi)
β1 = (Σ xiyi – n x̄ ȳ) / (Σ xi² – n x̄²)
β1 = Σ (xi – x̄)(yi – ȳ) / Σ (xi – x̄)² ...(5)
The fitted regression line is therefore
Ŷ = β0 + β1X, with β0 = ȳ – β1 x̄ and β1 = Σ (xi – x̄)(yi – ȳ) / Σ (xi – x̄)²
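The closed-form results in equations (4) and (5) translate directly into code. Below is a small sketch (the data array is illustrative):

```python
# Implementing equations (4) and (5) above: beta1 from the covariance/variance ratio,
# beta0 from the sample means.
import numpy as np

def least_squares_fit(x, y):
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # eq. (5)
    beta0 = y_bar - beta1 * x_bar                                          # eq. (4)
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
beta0, beta1 = least_squares_fit(x, y)
print(beta0, beta1)   # fitted intercept and slope minimising SSE
```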
A cost function is a measure of how wrong the model is in terms of its ability to estimate
the relationship between X and y. This is typically expressed as a difference or distance
between the predicted value and the actual value.
Regression models deal with predicting a continuous value for example salary of an
employee, price of a car, loan prediction, etc.
A cost function used in the regression problem is called “Regression Cost Function”.
In this cost function, the error for each training example is calculated and then the mean of all these errors is taken.
Calculating the mean of the errors is the simplest and most intuitive approach possible.
The errors can be both negative and positive. So they can cancel each other out during
summation giving zero mean error for the model.
Thus this is not a recommended cost function but it does lay the foundation for other cost
functions of regression models.
This addresses the drawback encountered with the mean error above. Here the square of the difference between the actual and predicted value is calculated, so there is no possibility of a negative error.
It is measured as the average of the sum of squared differences between predictions and
actual observations. It is also known as L2 loss.
In MSE, since each error is squared, it helps to penalize even small deviations in
prediction when compared to MAE. But if our dataset has outliers that contribute to
larger prediction errors, then squaring this error further will magnify the error many times
more and also lead to higher MSE error.
This cost function also addresses the shortcoming of mean error differently. Here an
absolute difference between the actual and predicted value is calculated to avoid any
possibility of negative error.
So in this cost function, MAE is measured as the average of the sum of absolute
differences between predictions and actual observations.
It is robust to outliers thus it will give better results even when our dataset has noise or
outliers.
R-Squared
R-squared (R²) is a statistical measure that represents the proportion of the variance in a dependent variable that is explained by an independent variable or variables in a regression model. It is also known as the coefficient of determination.
yi = β0 + β1xi + εi
The values of β0 and β1 are estimated by various methods, e.g. the least square method or the maximum likelihood method. The fitted line is
ŷ = b0 + b1x (where b0 and b1 are the estimates of β0 and β1)
(a) Sum of Squared Errors (SSE)
SSE = Σ (yi – ŷi)² (summation over i = 1 to n)
(b) Mean Squared Error (MSE)
MSE = SSE/n = (1/n) Σ (yi – ŷi)²
(c) Root Mean Squared Error (RMSE)
RMSE = √MSE = √[(1/n) Σ (yi – ŷi)²]
(d) Normalised Mean Squared Error (NMSE)
NMSE = SSE/SST = Σ (yi – ŷi)² / Σ (yi – ȳ)²
(e) R-Squared
R-squared = 1 – SSE/SST
(f) Mean Absolute Error (MAE)
MAE = (1/n) Σ |yi – ŷi|
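The metrics above can be computed directly for any set of actual and predicted values. The sketch below does so; note that NMSE is taken here as MSE normalised by the variance of y (i.e. SSE/SST), which is one common convention and an assumption on our part:

```python
# Computing the regression metrics listed above for a small set of (actual, predicted) pairs.
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # actual values
y_hat = np.array([2.8, 5.4, 6.9, 9.5])    # predictions from a fitted model

sse  = np.sum((y - y_hat) ** 2)
sst  = np.sum((y - y.mean()) ** 2)
mse  = sse / len(y)
rmse = np.sqrt(mse)
nmse = sse / sst                           # assumed convention: MSE normalised by Var(y)
r2   = 1 - sse / sst
mae  = np.mean(np.abs(y - y_hat))
print(sse, mse, rmse, nmse, r2, mae)
```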
Solved Examples
Ex. 3.4.1 :
Consider the following data for 5 students. Each Xi (i = 1 to 5) represents the score of the ith student in standard X, and the corresponding Yi (i = 1 to 5) represents the score of the same student in standard XII.
Sample data
Student (i)	Xi	Yi
1	95	85
2	85	95
3	80	70
4	70	65
5	60	70
Soln. :
Given : Xi = score of the ith student in Xth std.
Yi = score of the ith student in XIIth std.
In this example, n = 5
(a) Finding the regression equation

xi	yi	(xi – x̄)	(xi – x̄)²	(yi – ȳ)	(xi – x̄)(yi – ȳ)
95	85	17	289	+8	136
85	95	7	49	+18	126
80	70	2	4	–7	–14
70	65	–8	64	–12	96
60	70	–18	324	–7	126
Sum			730		470

Σxi = 390 ; x̄ = Σxi/n = 390/5 = 78
Σyi = 385 ; ȳ = Σyi/n = 385/5 = 77
β1 = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)² = 470/730 = 0.644
β0 = ȳ – β1 x̄ = 77 – 0.644(78) = 26.768
ŷ = 26.768 + 0.644x
(b) Interpretation of the linear regression equation
ŷ = 26.768 + 0.644x
Interpretation 1 : For every additional mark a student scores in Xth standard, the expected score in XIIth standard increases by 0.644 marks (the slope).
Interpretation 2 : Even if X = 0 (practically it is not possible to apply this interpretation in the real world), i.e. a student's score in Xth standard is 0, the expected score of the student in XIIth standard is 26.768 (the intercept).
(c) If a student's score in Xth std is 90, then the student's expected score in XIIth std is calculated as :
ŷ = 26.768 + 0.644(90) = 26.768 + 57.96 = 84.73 (approximately)
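A few lines of Python verify the hand computation above (a sketch; the small rounding difference comes from using the unrounded slope):

```python
# Verifying the solved example: scores in Xth std (x) and XIIth std (y).
import numpy as np

x = np.array([95, 85, 80, 70, 60], dtype=float)
y = np.array([85, 95, 70, 65, 70], dtype=float)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(round(beta1, 3))                 # 0.644
print(round(beta0, 3))                 # about 26.78 (the text's 26.768 uses the rounded slope)
print(round(beta0 + beta1 * 90, 2))    # predicted XIIth std score for x = 90 (about 84.7)
```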
UQ. For a given data having 100 examples, if squared errors SE1, SE2, and SE3 are 13.33, 3.33 and 4.00 respectively, calculate Mean Squared Error (MSE). State the formula for MSE. (SPPU - Q. 1(b), May/June 16, 5 Marks)
UQ. Consider the following data points. Calculate the cost function for θ0 = 0.5 and θ1 = 1 using linear regression.
X	Y
1	1.5
2	2.75
3	4
4	4.5
5	5.5
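For the question above, a sketch is given below. It assumes the hypothesis h(x) = θ0 + θ1x and the common squared-error cost J = (1/2n) Σ(h(xi) − yi)²; some texts use the plain MSE instead, so both values are printed:

```python
# A sketch for the cost-function question above, under the assumed conventions noted in the text.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([1.5, 2.75, 4, 4.5, 5.5])
theta0, theta1 = 0.5, 1.0

residuals = (theta0 + theta1 * X) - Y  # h(x) - y for each data point
mse = np.mean(residuals ** 2)          # 0.0625
j = mse / 2                            # 0.03125 with the 1/(2n) convention
print(mse, j)
```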
Linear regression is a statistical model that captures the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and independent variable(s). A linear relationship essentially means that when one (or more) independent variables increase (or decrease), the dependent variable increases (or decreases) as well.
As can be seen, a linear relationship may be positive (independent variable goes up, dependent variable goes up) or negative (independent variable goes up, dependent variable goes down).
Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to observed data. The steps needed to perform multiple linear regression are similar to those of simple linear regression.
The difference lies in the evaluation: it can be used to find out which factor has the highest impact on the predicted output and how the different variables relate to each other.
There are two important disadvantages of linear regression. Let's consider that the true model is actually close to (or exactly) linear, i.e.,
yi = r(xi) + εi, i = 1, ..., n,
for some underlying regression function r(x) that is approximately (or exactly) linear in x.
For a linear model, linear regression has expected test error σ² + pσ²/n. The first term is the irreducible error; the second term comes entirely from the variance of the linear regression estimate (averaged over the input points). Its bias is exactly zero.
What can be understood from this? If another predictor variable is added into the mix, the same amount of variance, σ²/n, will be added, irrespective of whether its true coefficient is large or small (or zero).
Hence, in the last example, variance was being "spent" in trying to fit truly small coefficients; there were 20 of them out of 30.
One may find that we can do better by shrinking small coefficients towards zero, which possibly introduces some bias but also reduces the variance. In other words, "small details" are ignored in order to get a more stable "big picture". If done properly, this approach can actually work.
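A hedged sketch of this idea: ridge regression shrinks coefficients towards zero, trading a little bias for lower variance. The data, the 10-large/20-small coefficient split and the alpha value below are illustrative only:

```python
# Shrinking small coefficients towards zero with ridge regression; all data is simulated.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:10] = rng.normal(scale=2.0, size=10)   # 10 large coefficients, 20 truly small (zero)
y = X @ true_coef + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Ridge keeps the big coefficients but pulls the 20 small ones much closer to zero.
print("OLS   mean |coef| on the 20 small ones:", np.abs(ols.coef_[10:]).mean().round(3))
print("Ridge mean |coef| on the 20 small ones:", np.abs(ridge.coef_[10:]).mean().round(3))
```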
UQ. Write short notes on : Linearly and non-linearly separable data. (SPPU - Q. 6(a), March 19, 5 Marks)
UQ. What is a polynomial regression? How can it be represented in the form of a matrix? (SPPU - Q. 2(b), May/June 18, 5 Marks)
UQ. What do you mean by zero-centered and uncorrelated features? What is their use in the solution of multivariate linear regression? (SPPU - Q. 4(b), May/June 18, 6 Marks)
Polynomial regression fits the non-linear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).
In the case above, the model remains linear externally, but it can hold internal non-
linearity. Let’s consider the above Fig. 3.9.1, which shows how scikit-learn implements
this technique. This is obviously a non-linear dataset, and any linear regression based
only on the original two-dimensional points cannot capture the dynamics.
If a linear model is applied to a linear dataset, it gives good results, as seen in simple linear regression. But if the same model is applied without any alteration to a non-linear dataset, the results produced can be drastically poor: the loss function increases, the error rate becomes high, and accuracy ultimately decreases.
Thus in such cases, a Polynomial Regression model is needed, where data points are
arranged in a non-linear fashion. This can be understood in a better way using the
comparison shown in below Fig. 3.9.2 and
Fig. 3.9.3 of the linear dataset and non-linear dataset.
Henceforth, if the datasets are organized in a non-linear way, then the Polynomial
Regression model is used instead of Simple Linear Regression.
y = b0 + b1x + b2 x2 + b3 x3 + …... + bn xn
Polynomial regression is a crucial extension of linear regression; its main idea is how to select the features. Consider multivariate regression with 2 variables, x1 and x2. Linear regression will look like this:
y = a1 * x1 + a2 * x2
With polynomial (degree-2) features it becomes:
y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1² + a5 * x2²
A polynomial term, quadratic (squared) or cubic (cubed), turns a linear model into a curve. But since it is the data X that is squared or cubed, not the beta coefficients, the model is still linear. This lets us model curves easily without explicitly building a complicated nonlinear model.
One frequent pattern in machine learning is to train linear models on nonlinear functions of the data (see the sketch below). This approach keeps the fast performance of linear methods while letting them fit a much wider range of data.
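A minimal sketch of this pattern with scikit-learn, training a linear model on polynomial features of a made-up curved dataset:

```python
# A linear model trained on polynomial (nonlinear) transformations of the input.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.3, size=40)   # a curved relationship

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

# The model is still linear in its coefficients, but the features now include x and x^2.
print(poly_model.predict([[2.0]]))
```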
1. Polynomial regression delivers the best approximation of the relationship between the dependent and independent variable.
2. A wide range of functions can be fit under it.
3. Polynomial regression can fit a wide range of curvature.
Fig. 3.12.1
A machine learning model can be called good if it generalizes appropriately to new input data from the problem domain. This allows it to make predictions on future data that the model has never seen.
To check how well a machine learning model learns and generalizes to new data, we use the concepts of overfitting and underfitting, which are the main causes of poor performance of machine learning algorithms.
3.12.1 Underfitting
Underfitting means the model is not able to fit the data well. This usually happens when there is limited data to build an accurate model, and also when a linear model is built for data that is actually non-linear.
In these situations the machine learning model applies rules that are too simple to the minimal data, and as a result it makes wrong predictions. To avoid underfitting, more data and careful feature selection are required.
3.12.2 Overfitting
A model is said to be overfitted when it is trained with a lot of data and detail. When this happens, it starts learning from the noise and the inaccurate entries in the data set.
The model then cannot categorize new data properly, because of too much detail and noise.
Non-parametric and non-linear methods are prone to overfitting, because these types of machine learning algorithms have more freedom in building the model from the dataset and hence can build unrealistic models.
For instance, decision trees are a nonparametric machine learning algorithm; because they are very flexible, the problem of overfitting arises.
This problem can be overcome by pruning the tree after learning so that some of the detail it has picked up is removed.
1. Cross-validation : A standard way to estimate the out-of-sample prediction error; 5-fold cross-validation is commonly used (see the sketch after this list).
2. Early stopping : This rule tells us how many iterations can be run before the learner begins to overfit.
3. Pruning : Pruning is widely used while building tree-based models. Nodes that have little predictive power for the problem in hand are removed.
4. Regularization : Regularization introduces a penalty (cost) term into the objective function, pushing the coefficients of many variables towards zero and thereby reducing model complexity.
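As a sketch of point 1 above, 5-fold cross-validation can be used to compare a restricted model against one likely to overfit; the data and tree depths below are illustrative:

```python
# 5-fold cross-validation as an estimate of out-of-sample error, on simulated data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=120)

shallow = DecisionTreeRegressor(max_depth=3)      # restricted tree
deep = DecisionTreeRegressor(max_depth=None)      # unrestricted tree, prone to overfitting

for name, model in [("shallow", shallow), ("deep", deep)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, "CV MSE:", round(-scores.mean(), 3))
```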
3.13.1 Bias
Let's consider two values: one predicted by our model and the other the actual value of the data (the target value).
Bias refers to the gap between these two values (the value predicted by our model and the actual value of the data).
Bias helps us to generalize better and make our model less sensitive to some single data
point.
Model with high bias pays very little attention to the training data and oversimplifies the
model.
High Bias
Our estimated data value is a long way from the actual data value, resulting in a large gap
between the two.
Low Bias
Our estimated data value is close to the actual data value, i.e. there is a smaller gap
between expected and actual data value.
3.13.2 Variance
A high variance model pays close attention to training data and does not generalise to
data it hasn't seen before.
On training data, such models work well, but on test data, they have a high error rate.
Variance comes from highly complex models with a large number of features.
Low Bias, Low Variance : The difference between the actual and predicted values is small; this case belongs to the low bias and low variance region (refer Fig. 3.13.1).
Low Bias, High Variance : The data is scattered due to high variance, but because of the low bias it is not far from the actual data (target value), as seen in Fig. 3.13.1.
High Bias, High Variance : By the rule of high bias there is a huge gap between predicted and actual values, and by the rule of high variance the data is scattered (refer Fig. 3.13.1).
The ideal option is Low Bias and Low Variance, where the predicted values are almost identical to the actual values of the data.
Underfitting
Data predicted with high bias falls along a straight line that does not fit the data in the data set adequately.