Introduction to Statistical Learning

Introduction
1. An overview of statistical learning
a. Statistical learning refers to a vast set of tools for understanding data
b. Statistical learning tools are classified as below
i. Supervised - Building a statistical model for predicting or estimating an
output based on one or more inputs
ii. Unsupervised - there are inputs but no supervised output
iii. Typical problem types (a short code sketch of these appears at the
end of this section):
1. Regression (supervised) - continuous or quantitative output (ex: wage
data: predicting wage based on age, year and education)
2. Classification (supervised) - categorical or qualitative output (ex:
stock market data: predicting whether today's stock will increase or
decrease)
3. Clustering (unsupervised) - we only observe inputs without predicting
any output (ex: gene expression data: grouping cancer cell lines into
groups)
2. Notations:
a. n → number of distinct data points or observations
b. p → number of variables available in making predictions
c. 𝑥𝑖𝑗→ represents value of the j th variable for i th observation.
d. Vectors of length n → bold font
e. Vectors of length p → normal font
3. Datasets used:
a. ISLR package at https://cran.r-project.org/web/packages/ISLR/index.html
b. The datasets used in the book are included in the ISLR package
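
A minimal sketch of the three task types above (regression, classification,
clustering). This is not from the book, which works in R with the ISLR
datasets; the snippet below uses Python with scikit-learn on made-up synthetic
data, and every name in it is an illustrative assumption.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Regression (supervised, quantitative output): predict a wage-like value from age
age = rng.uniform(20, 65, size=(100, 1))
wage = 20 + 0.8 * age[:, 0] + rng.normal(0, 5, size=100)
reg = LinearRegression().fit(age, wage)

# Classification (supervised, qualitative output): predict an up/down label
X_cls = rng.normal(size=(100, 2))
up = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X_cls, up)

# Clustering (unsupervised, no output to predict): group observations
X_clu = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
groups = KMeans(n_clusters=2, n_init=10).fit_predict(X_clu)

print(reg.coef_, clf.score(X_cls, up), np.bincount(groups))
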
Chapter 2: Statistical Learning
1. What is statistical learning?
a. Input variables are also called predictors, independent variables, features,
or simply variables. They are typically denoted by X.
b. Output variables are also called the response or dependent variables.
They are typically denoted by Y.
c. Suppose that we observe a quantitative response Y and p different predictors
X = (X1, X2, ..., Xp). We assume that there is some relation between Y and X
which can be written as
Y = f(X) + ε
where f is some fixed but unknown function of X1, X2, ..., Xp
and ε is a random error term, which is independent of X and has
mean zero.
d. Statistical learning refers to a set of approaches to estimate f.
e. Why estimate f: Two reasons
i. Prediction
1. In many situations a set of inputs X is available, but the output Y
cannot be easily obtained. Since the error averages to zero, we can
predict Y using
Y^ = f^(X)
where f^ represents our estimate for f and Y^ represents the
resulting prediction for Y.
2. The accuracy of Y^ as a prediction for Y depends on two
quantities, called the reducible error and the irreducible error.
a. Reducible error: error that can be reduced by using the most
appropriate statistical learning technique to estimate f.
b. Irreducible error: error caused by ε, which may depend on
variables that are not included in X and therefore cannot be
predicted from X.
c. We can say
E(Y − Y^)² = [f(X) − f^(X)]² + Var(ε)
where E(Y − Y^)² represents the average, or expected value, of the
squared difference between the predicted and actual value of Y,
[f(X) − f^(X)]² is the reducible error, and Var(ε), the variance
associated with the error term, is the irreducible error (see the
simulation sketch at the end of this chapter's notes).
3. Example for prediction: Suppose we run a marketing
campaign and have a set of demographic variables
available. If the client wants to know which individuals will respond
positively to the campaign, that is prediction: we do not go
into the details of how each demographic variable affects
the response to the campaign.
ii. Inference:
1. In inference we are not necessarily interested in predicting Y; we want
to understand how Y varies as X1, X2, ..., Xp vary.
2. Here f^ cannot be treated as a black box, because we need
to know its exact form.
3. We ask questions such as:
a. Which predictors are associated with the response?
b. What is the relation between the response and each
predictor?
c. Is the relationship between Y and each predictor
linear, or is it more complicated?
4. Example: If the client wants to know how sales are affected
by certain variables (predictors) such as price, store location,
competitor price, etc., this is inference, because we are
interested in the relation between sales and each of the
predictors.
5. Another simple example of prediction versus inference: in a
real estate business, if the builder wants to know whether having a
school nearby will increase or decrease a house's value, that is
inference (the effect of one predictor on the response). If instead
he wants to put a price on a house that has a school nearby, that
is prediction.
f. How do we estimate f?
1. All of the methods share certain common characteristics.
2. We use a set of data points, or observations, called the
training data, because we use this data to train, or teach, our
model to estimate f.
3. Corresponding to each input x_i there will be an output y_i. A set of
such input-output pairs is called the training data:
{(𝑥1, 𝑦1),(𝑥2, 𝑦2),......(𝑥𝑛, 𝑦𝑛)} → training data
4. Statistical learning methods for estimating f can be categorised as
parametric or non-parametric, based on how f is estimated.
a. Parametric methods:
i. In a parametric method we first make an assumption about the
functional form, or shape, of f, and then we need a
procedure that uses the training data to fit or train the model.
ii. Example: we assume that the function is linear,
f(X) = β0 + β1X1 + β2X2 + ... + βpXp
Now we only have to estimate the p+1 parameters
(β0, β1, β2, ..., βp), which greatly simplifies the problem
compared with estimating a completely arbitrary function f. The
most common approach to fitting β0, β1, β2, ..., βp in the linear
model is (ordinary) least squares. (The book shows a figure of the
estimated linear fit for the Income data.)

iii. Disadvantage: the model we choose will usually not match
the true unknown form of f exactly.
b. Non-parametric methods: not read yet.
2. The trade-off between prediction accuracy and model interpretability
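
A small simulation sketch of the Y = f(X) + ε framework and the
reducible/irreducible error decomposition noted above. This is not the book's
code; it is a hedged Python illustration that assumes a made-up true function
f and a linear parametric model fitted by ordinary least squares.

import numpy as np

rng = np.random.default_rng(1)
n = 10_000

def f(x):
    # the true regression function, normally unknown to us (assumed here)
    return 3.0 + 2.0 * x

x = rng.uniform(0.0, 10.0, n)
eps = rng.normal(0.0, 2.0, n)     # error term: mean zero, Var(eps) = 4
y = f(x) + eps                    # Y = f(X) + eps

# Parametric assumption: f(X) ~ beta0 + beta1*X, fitted by ordinary least squares
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)   # returns [slope, intercept]
y_hat = beta0_hat + beta1_hat * x                # Y^ = f^(X)

# E(Y - Y^)^2 splits into the reducible part [f(X) - f^(X)]^2 plus Var(eps)
total_mse = np.mean((y - y_hat) ** 2)
reducible = np.mean((f(x) - y_hat) ** 2)
print(f"total ~ {total_mse:.3f}, reducible ~ {reducible:.5f}, Var(eps) = 4")

Because the assumed f here really is linear, almost all of the remaining error
after fitting is irreducible (close to Var(ε) = 4), and the reducible part is
nearly zero.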

Chapter 3: Linear Regression

Simple linear regression:


1. Predicting a quantitative response Y on the basis of a single predictor variable X.
2. Assumes that there is approximately a linear relationship between X and Y
3. Mathematically,
Y ≈ β0 + β1X
read as "Y is approximately modeled as β0 + β1X",
where β0 and β1 are two unknown constants that represent the intercept and
slope terms of the linear model. These are also known as the model coefficients
or parameters.
4. Estimating the coefficients
a. Let us assume that we have n observation pairs {(x1, y1), (x2, y2), ..., (xn, yn)},
each representing an input X and a corresponding output Y for that
observation. Our goal is to estimate coefficients β0^ and β1^ so that
yi ≈ β0^ + β1^ xi   for i = 1, 2, ..., n.
In other words, we want to find an
intercept β0^ and slope β1^ such that the resulting line is as close as possible
to the n data points. There are a number of ways to measure closeness, of which
the most commonly used is the least squares criterion.
b. Using the least squares criterion we obtain the coefficient estimates
β1^ = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
β0^ = ȳ − β1^ x̄
where x̄ and ȳ are the means (averages) of the n observed inputs and
corresponding outputs (see the sketch after this list).
c.
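
A short sketch of the least squares estimates above, computed directly from
the formulas for β0^ and β1^. This is not the book's code (the book uses R);
the Python function name and data values below are made up purely for
illustration.

import numpy as np

def simple_ols(x, y):
    # closed-form least squares estimates for simple linear regression
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b0, b1 = simple_ols(x, y)
print(b0, b1)                    # beta0^ ~= 0.14, beta1^ = 1.96
print(np.polyfit(x, y, deg=1))   # cross-check: [slope, intercept] should match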
