PA DL Consolidated

BITS Pilani, Pilani Campus

ZG 512 Predictive Analytics
Pravin Mhaske

M1 Predictive Analytics
Lecture 1 Introduction, Model Assessment

Model Assessment

• The model is developed on the training data.
• For Regression, the most commonly used measure is the Mean Squared Error:
  MSE = (1/n) Σ (yi − ŷi)²
• For Classification, the most commonly used measure is the Error Rate:
  Error Rate = (1/n) Σ I(yi ≠ ŷi), where I is the indicator function
• What about the accuracy of the prediction on unseen test data?

Training and Test Errors

• The statistical method estimates f (y = f(X)) by minimizing MSETr
• A procedure that minimizes MSETr will tend to "overfit" the data
• The training error shows the performance of the model on the training data
• The usefulness of the model depends on the performance on unseen test data
• We need a model that minimizes the test error: we want a method that gives the lowest MSETe, as opposed to the lowest MSETr
• There are ways of estimating MSETe: Test Data, Cross-Validation
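The MSETr / MSETe distinction is easy to see in a few lines of NumPy. This is a sketch on synthetic data (the quadratic "true" f and all names are made up for illustration): training MSE can only fall as the polynomial degree grows, while a held-out test set tells a different story.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): the true f is quadratic, noise ~ N(0, 1)
x_tr = rng.uniform(-3, 3, 30);  y_tr = x_tr**2 + rng.normal(0, 1, 30)
x_te = rng.uniform(-3, 3, 200); y_te = x_te**2 + rng.normal(0, 1, 200)

def mse(y, y_hat):
    # Average of squared prediction errors
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 2, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)                   # fit on training data only
    results[degree] = (mse(y_tr, np.polyval(coefs, x_tr)),   # MSE_Tr
                       mse(y_te, np.polyval(coefs, x_te)))   # MSE_Te
```

Degree 9 never loses to degree 2 on MSETr (the models are nested), but its test MSE is typically worse: minimizing MSETr is not the goal.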
Overfitting and Underfitting

Data is fitted to a linear function and a polynomial function.
• The polynomial function is a perfect fit – overfitting, since it is adapting to the training set
• The linear function is more rigid but may generalize better
• If the two functions were used to extrapolate beyond the fitted data, the linear model
  • May generalize better
  • May make better predictions

Overfitting
• Occurs when the model captures the noise of the training data – fits the data too well
• A method is said to be overfitting the data when it generates a small MSETr and a large MSETe
• It is often a result of an excessively complicated model
• Can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data
• The model may have low bias but high variance

Underfitting
• Occurs when the model cannot capture the underlying trend of the training data
• Occurs when the model does not fit the data well enough
• It is often a result of an excessively simple model
• The model may have low variance but high bias

Both overfitting and underfitting lead to poor predictions on new data sets.

Bias Vs Variance

The goal of any supervised statistical learning algorithm is to achieve
• low bias and low variance
• thereby achieving good prediction performance

In reality, we cannot calculate the real bias and variance error terms because we do not know the actual underlying target function. Nevertheless, as a framework, bias and variance provide the tools to understand the behaviour of machine learning algorithms in the pursuit of predictive performance.

The algorithm learns a model from training data. The prediction error can be broken down into three parts: Bias Error, Variance Error & Irreducible Error (noise).
• The irreducible error cannot be reduced: it is the error introduced by modelling a real-life scenario
• Bias error arises from the simplifying assumptions made by a model to make the target function easier to learn
• Variance is the amount that the predictions will vary with different training data sets

We want a good predictor – low bias and low variance.

Bias Vs Variance

• Low Bias, High Variance: overly flexible model
• High Bias, Low Variance: less flexible model

Training and Test Errors

Underfitting
1. Rigidity or under-complexity
2. High Bias, Low Variance

Overfitting
1. Flexibility or over-complexity
2. Low Bias, High Variance

Questions
1. When do we know that we are underfitting?
2. When are we overfitting?
3. What is the optimal flexibility?
4. Does obtaining more data help in a case of underfitting? Overfitting?

Training and Test Errors

The algorithm learns a model from training data. The prediction error can be broken down into three parts: Bias Error, Variance Error & Irreducible Error.
• Bias error arises from the simplifying assumptions made by a model to make the target function easier to learn
• Variance is the amount that the predictions will vary with different training data sets
• The irreducible error cannot be reduced: it is the error introduced by modelling a real-life scenario

We want a good predictor – low bias and low variance. At any iteration, we would like to know whether we have High Bias or High Variance.

High Bias:
• Large MSETr & Large MSETe
• MSETr ~ MSETe
• Bias² = (E[f̂(x0)] − f(x0))²: the inability of our model to capture the true relationship (no matter how much data we throw at it!)

High Variance:
• Small MSETr & Large MSETe (MSETr << MSETe)
• Variance = E[(f̂(x0) − E[f̂(x0)])²]: measures how much the predictions change across different training sets

Bias-Variance Trade-off

[Figure: Mean Squared Error (0.0 to 2.5) vs Flexibility (2, 5, 10, 20) – training error decreases with flexibility while test error is U-shaped]
A rigid and a flexible model

f is the true model. Given a data set D = {(x, y)}:
• f̂1(x) = ȳ (predict the mean: a rigid model)
• f̂2(x) = β0 + β1x1 + β2x2 + ⋯ + βkxk (a flexible model)
• f̂1 has higher bias & lower variance
• f̂2 has lower bias & higher variance

We want a good predictor – low bias and low variance. But:
As we increase the complexity of our assumptions,
• the Bias decreases and the Variance increases
Conversely, as we decrease the complexity of our assumptions,
• the Bias increases and the Variance decreases
Thus, there is a trade-off at play between these two concerns (Bias & Variance).

Bias – Variance Equation

Let
• f̂ be our estimation of the true f
• x0 be any observation drawn from the population

Error at x0 = E[(y0 − f̂(x0))²]
            = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)
            = Bias² + Variance + Irreducible Error
            = Reducible Error + Irreducible Error

Note that E[ε] = 0.


The Bias

E[(y0 − f̂(x0))²] = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)

• The first term is the squared bias
• It is the difference between the averages of the estimate and the true values
• Bias refers to the error that is introduced by approximating a real-life problem
• A model with high bias is not capturing the underlying behaviour of the true functional form well
• For example, if a simple linear regression is used to model a sine curve, no matter how low the MSETr is, it will never capture the non-linearity inherent in a sine curve

The Variance

E[(y0 − f̂(x0))²] = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)

• Note that different training sets will generate different f̂
• Ideally the estimates should not vary too much across training sets
• The second term is the variance of the estimate across many training sets
• It determines how much the model estimate deviates across the training data
• In general, more flexible methods have higher variance

The Irreducible Error

E[(y0 − f̂(x0))²] = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)

• The last term is known as the irreducible error
• It is the minimum lower bound for the MSETe
• This variance arises because
  • ε may contain unknown variables that are useful in predicting Y
  • Y may contain unmeasurable variations

ZG 512 Predictive Analytics
Pravin Mhaske
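The decomposition above can be checked numerically. A small Monte Carlo sketch (synthetic sine data; every name and constant is an assumption for illustration): fit the same model class on many independent training sets, record f̂(x0), and compare Bias² + Variance + Var(ε) against the directly simulated E[(y0 − f̂(x0))²].

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                      # sd of the irreducible noise ε
x0 = 1.0                         # the query point

def f(x):                        # the (normally unknown) true function
    return np.sin(x)

# Fit a straight line on many independent training sets; record f̂(x0)
preds = []
for _ in range(5000):
    x = rng.uniform(-2, 2, 20)
    y = f(x) + rng.normal(0, sigma, 20)
    preds.append(np.polyval(np.polyfit(x, y, 1), x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2     # (E[f̂(x0)] − f(x0))²
variance = preds.var()                  # E[(f̂(x0) − E[f̂(x0)])²]

# Fresh test responses at x0, independent of the fits
y0 = f(x0) + rng.normal(0, sigma, 5000)
expected_err = float(np.mean((y0 - preds) ** 2))   # E[(y0 − f̂(x0))²]
```

Up to Monte Carlo error, expected_err ≈ bias2 + variance + sigma², mirroring the equation; a straight line fitted to a sine carries visible bias, which a more flexible model would trade for variance.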


What is Machine Learning?

According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and author of the book Machine Learning (McGraw-Hill),

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with the experience E."

We now have a set of objects to define machine learning: Task (T), Experience (E), and Performance (P). With a computer running a set of tasks, the experience should lead to performance increases (to satisfy the definition). Many data mining tasks are executed successfully with the help of machine learning.

M3 Linear Regression
Lecture 2 Linear Regression 1

Types of Machine Learning

Regression

Predict the value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics and in the neural network field.

Examples:
◦ Predicting sales amounts of a new product based on advertising expenditure
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting the price of a house based on its attributes

Variables

The Advertising data set has 4 variables and 6 observations. The variable names are "TV", "Radio", "Paper" and "Sales".

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75     7.2

Sales is the Dependent Variable
• Also known as the Response or Target
• Generically referred to as Y

TV, Radio and Paper are the independent variables
• Also known as features, inputs, or predictors
• Generically referred to as X (or X1, X2, X3)

Matrix X and Vector y

• p = 3 (the number of independent variables); n = 6 (the number of observations)
• X represents the input data set; X is a 6 × 3 matrix
• y represents the output variable; y is a 6 × 1 vector

Matrix X and Vector y (contd.)

• X is a 6 × 3 matrix (X6×3) & y is a 6 × 1 vector (y6×1)
• xi represents the ith observation: the vector (xi1, xi2, …, xip)
• xj represents the jth variable: the vector (x1j, x2j, …, xnj)
• yi represents the ith observation of the output variable; y is the vector (y1, y2, …, yn)

A Linear Model

The linear model is an important example of a parametric model:

f(X) = β0 + β1X1 + β2X2 + … + βpXp

• A linear model is specified in terms of p + 1 parameters: β0, β1, …, βp
• We estimate the parameters by fitting the model to training data
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X)

A Linear Model (Parametric)

f(X) = β0 + β1X1 + β2X2 + … + βpXp

• The linear model has (p + 1) parameters: β0, β1, …, βp
• We estimate the parameters by fitting the model to training data
• Simple Linear Regression: only one x variable
• Multiple Linear Regression: many x variables

Is there an ideal f(X)?

We want to predict Y for a given value of x.
• What is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4.
• A good value is f(4) = E(Y | X = 4), the expected value of Y given X = 4.
• This ideal f(x) = E(Y | X = x) is called the regression function.

A Linear Model (contd.)

Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε

• β's: unknown constants, known as coefficients or parameters
• βj: the average effect on Y of a unit increase in Xj, holding all other predictors fixed
• ε is the error term – it captures measurement errors and missing variables
• ε is a random variable independent of X, with E(ε) = 0
• In the advertising example, the model becomes
  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε
• f is said to represent the systematic information that X provides about Y

Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (no Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables

Multicollinearity and VIF

• X1 and X2 are significant when included separately, but together the effect of both variables shrinks. Multicollinearity exists when there is correlation between multiple independent variables in a multiple regression model; this can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical significance of the independent variables.
• Test for Multicollinearity: Variance Inflation Factor (VIF)
  • VIF equal to 1: variables are not correlated
  • VIF between 1 and 5: variables are moderately correlated
  • VIF greater than 5: variables are highly correlated
• Solutions to multicollinearity
  1. Drop unnecessary variables
  2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression

Residual Analysis

• The red line should be approximately horizontal at zero.
• There is no pattern in the first residual plot. The presence of a pattern may indicate a problem with some aspect of the linear model (case 2).
• E(ε) = 0
• Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
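VIF can be computed directly from its definition, VIFj = 1 / (1 − Rj²), where Rj² comes from regressing Xj on the remaining predictors. A self-contained sketch on synthetic data (all names are illustrative; statsmodels' variance_inflation_factor does the same job):

```python
import numpy as np

def vif(X):
    # VIF for each column of X (n × p): regress that column on the rest,
    # then return 1 / (1 − R²)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r2 = 1 - (y - Z @ beta).var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent predictor
v = vif(np.column_stack([x1, x2, x3]))      # v[0], v[1] large; v[2] near 1
```

Here x1 and x2 show VIF well above 5 (highly correlated), while the independent x3 sits near 1.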

Homoscedasticity Vs Heteroscedasticity

• Are the residuals spread equally along the ranges of predictors?
• The plot should have a horizontal line with equally spread points.
• In the second plot, this is not the case: the variability (variance) of the residuals increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (heteroscedasticity).

Types of Regression Models

Simple Regression
(Education) x → y (Income)

Multiple Regression
(Education) x1, (Soft Skills) x2, (Experience) x3, (Age) x4 → y (Income)

Direct Solution Method

Least Squares Method (Ordinary Least Squares or OLS)

• Slope for the Estimated Regression Equation
  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

• y-Intercept for the Estimated Regression Equation
  b0 = ȳ − b1x̄

where:
  xi = value of the independent variable for the ith observation
  yi = value of the dependent variable for the ith observation
  x̄ = mean value of the independent variable
  ȳ = mean value of the dependent variable

Exercise

Kumar's Electronics periodically has a special week-long sale. As part of the advertising campaign Kumar runs one or more TV commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are shown below.

# of TV Ads (x)   # of Cars Sold (y)
1                 14
3                 24
2                 18
1                 17
3                 27

Solution

x    y    xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1    14   −1       −6       6                  1
3    24   1        4        4                  1
2    18   0        −2       0                  0
1    17   −1       −3       3                  1
3    27   1        7        7                  1

Sum: Σx = 10, Σy = 100; Σ(xi − x̄) = 0, Σ(yi − ȳ) = 0, Σ(xi − x̄)(yi − ȳ) = 20, Σ(xi − x̄)² = 4
Mean: x̄ = 2, ȳ = 20

• Slope: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5
• y-Intercept: b0 = ȳ − b1x̄ = 20 − 10 = 10
• Estimated Regression Equation: ŷ = b0 + b1x = 10 + 5x
• Predict Sales if Ads run = 5? 15?

Evaluation of Regression Model
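The hand computation above can be reproduced in a few lines of plain Python (same data, same formulas):

```python
x = [1, 3, 2, 1, 3]          # number of TV ads
y = [14, 24, 18, 17, 27]     # number of cars sold

x_bar = sum(x) / len(x)      # 2.0
y_bar = sum(y) / len(y)      # 20.0

# b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)      # 20/4 = 5.0
b0 = y_bar - b1 * x_bar                        # 20 − 10 = 10.0

def predict(ads):
    return b0 + b1 * ads

# predict(5) -> 35.0; predict(15) -> 85.0
```

Note that 15 ads is far outside the observed range (1 to 3), so that second prediction is a long extrapolation and should be treated with caution.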
Goodness of Fit

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Coefficient of Determination (R-squared)
• Proportion of variance in the dependent variable that can be explained by the independent variable(s)

Exercise (already covered)
• Simple Linear Regression using Excel
• Multiple Linear Regression using Excel
• Multiple Linear Regression using statsmodels
• Multiple Linear Regression using scikit-learn

Qualitative Predictors

M3 Linear Regression
Lecture 2 Extensions to the Linear Model

One Hot Encoding (Dummy Variables)

1. Find the count of distinct values, say n, in the column.
2. Create n new columns, each named after one of the distinct values.
3. Encode 0 or 1 under those columns depending upon the value in that observation.
4. Avoid when the count of distinct values is high.

Label Encoding

1. Find the count of distinct values, say n, in the column.
2. Find the order.
3. Encode values 0, 1, 2, … based on the order.
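Both encodings can be sketched in plain Python (the 'city' column and its values are hypothetical; in practice pandas.get_dummies or scikit-learn's OneHotEncoder / LabelEncoder would be used). Sorting alphabetically is an assumption here; for ordinal data the order should come from the domain:

```python
cities = ["Pune", "Delhi", "Pune", "Chennai"]

# Step 1-2: distinct values define the new columns / the label order
levels = sorted(set(cities))                 # ['Chennai', 'Delhi', 'Pune']

# One-hot: one 0/1 column per distinct value
one_hot = [[int(c == level) for level in levels] for c in cities]

# Label encoding: map each value to its position in the chosen order
codes = {level: i for i, level in enumerate(levels)}
labels = [codes[c] for c in cities]          # [2, 1, 2, 0]
```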
The Advertising Model

In the first model, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media. The estimated model was

Ŷ = 2.9389 + 0.0458*TV + 0.1885*Radio − 0.0010*Paper

From this we infer that the average effect on sales of a one-unit increase in TV is always 0.0458, regardless of the amount spent on radio.

Call:
lm(formula = Sales ~ TV + Radio, data = Ad)

Residuals:
    Min      1Q  Median      3Q     Max
-8.7977 -0.8752  0.2422  1.1708  2.8328

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.92110    0.29449   9.919   <2e-16 ***
TV           0.04575    0.00139  32.909   <2e-16 ***
Radio        0.18799    0.00804  23.382   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16

The Synergy Effect

• Suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
• In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
• In marketing, this is known as a synergy effect; in statistics it is referred to as an interaction effect.

The Interaction Model

The model takes the form
sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
      = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε

The Interaction Model (fit)

Call: lm(formula = Sales ~ TV * Radio, data = Ad)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3366 -0.4028  0.1831  0.5948  1.5246

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.7502  2.479e-01  27.233   <2e-16 ***
TV            0.0191  1.504e-03  12.699   <2e-16 ***
Radio         0.0289  8.905e-03   3.241   0.0014 **
TV:Radio      0.0011  5.242e-05  20.727   <2e-16 ***

Residual standard error: 0.9435 on 196 degrees of freedom
Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16

Interpretation I

The results in this table suggest that interactions are important.
• The p-value for the interaction term TV×Radio is ~0 < α, indicating strong evidence for HA: βTV×Radio ≠ 0.
• The R² for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.
• This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
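A sketch of fitting the interaction model with ordinary least squares. The Advertising data is not reproduced here, so the sketch simulates data whose true coefficients echo the slide's estimates (an assumption for illustration); the point is that the interaction term is just one more column in the design matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
tv = rng.uniform(0, 300, 200)
radio = rng.uniform(0, 50, 200)
sales = (6.75 + 0.019 * tv + 0.029 * radio + 0.0011 * tv * radio
         + rng.normal(0, 1, 200))                 # coefficients echo the slide

# Design matrix: intercept, TV, radio, and the TV×radio interaction column
X = np.column_stack([np.ones_like(tv), tv, radio, tv * radio])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)  # OLS estimates

def tv_effect(r):
    # Effect of one extra unit of TV depends on radio: β1 + β3·radio
    return beta[1] + beta[3] * r
```

With this setup, tv_effect(50) exceeds tv_effect(0): the slope for TV grows with radio spend, which is exactly the synergy the interaction term captures.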

Interpretation II

Predicted Sales = 6.7502 + 0.0191 TV + 0.0289 Radio + 0.0011 TV × Radio

• An increase in TV ads of $1000 leads to incremental sales of 19.1 + 1.1 × Radio
• An increase in radio ads of $1000 leads to incremental sales of 28.9 + 1.1 × TV

Hierarchy Principle

• Sometimes an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
• The hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
• The rationale for this principle is that interactions are hard to interpret in a model without main effects – their meaning is changed.
• If the main effect terms are dropped, the interaction terms will include main effects.

Curse of Dimensionality

• With high-dimensional data, the efficiency of algorithms deteriorates
• Data becomes sparse; more features require more training data
• Leads to computational complexity, high training time, overfitting, and spurious correlations
• Algorithms like kNN are highly susceptible to the curse of dimensionality
• Dimensionality reduction techniques help avoid high-dimensional data:
  • Feature selection – keep the most relevant features
  • Feature extraction – transformation from high dimension to low dimension, e.g. using Principal Component Analysis

ZG 512 Predictive Analytics
Pravin Mhaske

Non-linear Relationships

M3 Linear Regression
Lecture 3 Linear Regression - Diagnostics

Extend the linear model to accommodate non-linear relationships. Perform EDA to find patterns in the relationships.

• Polynomial: fits a higher-degree polynomial function to the data
  y = β0 + β1x + β2x² + … + βnxⁿ
• Exponential: y = αe^(βx)
• Logarithmic: y = α + β ln(x)
• Power: y = αx^β
• Generalized additive model (GAM): combines multiple regression models for a complex relationship
  y = f1(x1) + f2(x2) + … + fn(xn)

Polynomial Regression

The figure suggests that

mpg = β0 + β1 × horsepower + β2 × horsepower² + ε

may provide a better fit.

Exercise: explore relationships for other predictors.

Model: Non-linear Regression

Call: lm(formula = mpg ~ horsepower + I(horsepower^2), data = auto)

Residuals:
     Min       1Q   Median       3Q      Max
-14.7135  -2.5943  -0.0859   2.2868  15.8961

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16

Linear Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (no Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables
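The same kind of fit can be sketched with NumPy. The Auto data is not reproduced here, so this simulates horsepower/mpg data whose true coefficients echo the slide's estimates (an assumption for illustration). Note the model is still linear in the parameters; horsepower² is just another regressor:

```python
import numpy as np

rng = np.random.default_rng(4)
hp = rng.uniform(50, 230, 392)
mpg = 57 - 0.47 * hp + 0.0012 * hp**2 + rng.normal(0, 4.4, 392)

# Degree-2 least squares fit; np.polyfit returns the highest degree first
quadratic, linear, intercept = np.polyfit(hp, mpg, 2)
```

The signs match the slide's output: a negative linear term and a small positive quadratic term, giving the U-shaped mpg curve.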
Model Adequacy (Homoscedasticity)

Ideally, there should not be any pattern or trend in the residuals.
• The red line should be approximately horizontal at zero.
• The presence of a pattern may indicate a problem with some aspect of the linear model.

In the plot, there is no pattern in the residual plot, so we may assume
• There is a linear relationship between the selected predictors and the outcome variable
• E(ε) = 0
• There may still be an issue with the Homoscedasticity assumption

Case 1: The red line is fairly horizontal about 0
Case 2: The red line is not horizontal, indicating an inadequate model

Examples

Sales ~
1. TV + Radio + Paper
2. TV*Radio
3. TV*Radio + I(TV^2)
4. TV*Radio + I(TV^2) + I(Radio^2)

Is ε Normal?

[Figure: Q-Q plots of residuals – one approximately normal, one not normal]


Iterative Method - Gradient Descent

• What is Gradient?
• What is Descent?

Gradient Descent with one parameter
• w = parameter or coefficient
• J(w) = cost function corresponding to that w
• Objective: to compute (approximate) the w that minimizes J(w)

Gradient Descent Process

• The starting point (labelled 1) is a guess
• The slope (dJ/dw) gives the direction to move
• Take small steps iteratively towards the local minimum
• What is the learning rate?

Positive and Negative Gradient

Learning Rate

• Small steps: the process will eventually reach the local minimum, but it may be very slow
• Large steps: the process may not converge – it just bounces back and forth

Types of Gradient Descent

• Batch gradient descent calculates the derivative from all training data before calculating an update
• Minibatch gradient descent calculates the derivative on mini groups (batches) of training data before calculating an update
• Stochastic gradient descent calculates the derivative from each training data instance and applies the update immediately

Batch Gradient Descent

We need to minimize J(β) = (1/2n) Σi=1..n (yi − xiᵀβ)²

Differentiating J(β) with respect to β gives the gradient ∇J = −(1/n) Xᵀ(y − Xβ)

The following iterative process solves ∇J = 0. Repeat until convergence:

{ βj = βj − α ∂J/∂βj,  j = 0, …, p } – update the β's simultaneously

Batch Gradient Descent for SLR

y = f(x) + ε = β0 + β1x + ε

We need to minimize J(β) = (1/2n) Σi=1..n (yi − (β0 + β1xi))²

This is given by

dJ/dβ0 = (1/n) Σi=1..n (β0 + β1xi − yi) = 0  and
dJ/dβ1 = (1/n) Σi=1..n (β0 + β1xi − yi) xi = 0

The Iterative Process

y = f(x) + ε = β0 + β1x + ε. We need to minimize J(β) = (1/2n) Σi=1..n (yi − (β0 + β1xi))²

Repeat until convergence:
{
  β0 = β0 − (α/n) Σi=1..n (β0 + β1xi − yi)
  β1 = β1 − (α/n) Σi=1..n (β0 + β1xi − yi) xi
}

Problem

Training Set:
X    Y
-1   -1
1    3

Learning Rate (α) = 0.08
After the 4th iteration of the gradient method, β0 = 0.64 and β1 = 1.64.
Compute the estimates at the next iteration.
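The update rule above, and the worked example that follows, can be checked with a short implementation (plain Python, matching the slide's data {(-1, -1), (1, 3)} and α = 0.08):

```python
def gd_step(b0, b1, xs, ys, alpha):
    # One batch gradient descent update for y = b0 + b1*x (simultaneous update)
    n = len(xs)
    g0 = sum((b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
    g1 = sum(((b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
    return b0 - alpha * g0, b1 - alpha * g1

xs, ys = [-1, 1], [-1, 3]

# State after the 4th iteration, as given on the slide
b0_5, b1_5 = gd_step(0.64, 1.64, xs, ys, 0.08)   # -> (0.6688, 1.6688)

# Iterating further converges to the OLS solution b0 = 1, b1 = 2
b0, b1 = b0_5, b1_5
for _ in range(500):
    b0, b1 = gd_step(b0, b1, xs, ys, 0.08)
```

One step reproduces the slide's answer (0.6688, 1.6688), and continued iteration approaches the direct OLS solution for these two points.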

Solution

Training Set: (−1, −1) and (1, 3); α = 0.08; after 4 iterations, β0 = 0.64 and β1 = 1.64

β0 = β0 − (α/2) [(β0 + β1x1 − y1) + (β0 + β1x2 − y2)]
β1 = β1 − (α/2) [(β0 + β1x1 − y1)x1 + (β0 + β1x2 − y2)x2]

Both estimates need to be updated simultaneously.

β0 = 0.64 − 0.04[(0.64 + 1.64 × (−1) − (−1)) + (0.64 + 1.64 × 1 − 3)]
   = 0.64 − 0.04[0 + (−0.72)] = 0.64 + 0.0288 = 0.6688

β1 = 1.64 − 0.04[(0.64 + 1.64 × (−1) − (−1)) × (−1) + (0.64 + 1.64 × 1 − 3) × 1]
   = 1.64 − 0.04[0 × (−1) + (−0.72) × 1] = 1.64 + 0.0288 = 1.6688

Gradient Descent with 2 parameters

[Figure: the cost surface J(β0, β1) over the two parameters]

Gradient Descent (housing example)

Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

[Figures: a sequence of slides showing the cost surface and contour plots as gradient descent iterates toward the minimum. Source credit: slides by Andrew Ng]

ZG 512 Predictive Analytics
Pravin Mhaske

Linear Model for 3D

Lecture 4 Non-parametric Methods

In a three-dimensional setting, with two predictors and one response,
• The least squares regression line becomes a plane
• The 'plane (of best fit)' is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane

Linear Model for 2 Independent Variables

[Figures: the real function, a linear (parametric) model, a moderately complex non-parametric fit, and an overfit model]

Non-parametric Methods

• The aim of a regression analysis is to develop a reasonable approximation to the unknown response function f, where Y = f(X) + ε
• Parametric methods make an assumption about the shape of f. For example, OLS assumes f to be linear and estimates a small set of parameters: β0, β1, β2, …
• Unlike the parametric approach, where the function f is fully described by a finite set of parameters, nonparametric modelling accommodates a very flexible form of the regression curve
• They do not make any explicit assumption about the functional form of f
• Instead they seek an estimate of f that is close to the data points without being too flexible
• So they can fit many possible shapes of f (Pic 1 on previous slide)
• They provide a versatile method of exploring a general relationship between variables
• Disadvantage: they need much more data to accurately fit the function f (or model)

k Nearest Neighbors

• K = 3, with six blue observations and six orange observations; x is a test observation
• The 3 closest points are identified, and therefore x belongs to the blue class
• The KNN decision boundary for this example is shown in black

Flexibility

• The training error rate (blue) and test error rate (orange) vs flexibility on simulated data
• Flexibility is measured by 1/k
• High k: stable model; low k: flexible model
• Bias-Variance Tradeoff?

Distance Metrics

• Euclidean:
  • Straight-line distance
  • Scale variant, sensitive to data dimensionality
  • Normalization (scaling) can solve this issue
• Manhattan:
  • Taxicab or city-block distance
  • Sum of absolute differences of cartesian coordinates
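The two metrics in code (plain Python; the point values are illustrative):

```python
import math

def euclidean(p, q):
    # Straight-line distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Taxicab distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

d_e = euclidean((0, 0), (3, 4))   # 5.0
d_m = manhattan((0, 0), (3, 4))   # 7
```

Manhattan distance is always at least as large as Euclidean for the same pair of points, and both change if one feature is rescaled, which is why scaling matters for kNN.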

Normalization and Standardization

✓ Feature scaling – equivalence of features
✓ Algorithms can be biased towards the features having numerically larger values
✓ Faster computations

Normalization:
• Min-Max scaling
• Converts to the 0–1 range

Standardization:
• Converts to z-values using the mean and standard deviation
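Both scalings in a few lines (plain Python; the feature column is made up). The sketch uses the population standard deviation; scikit-learn's StandardScaler behaves similarly:

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0, 10.0]      # a hypothetical feature column

# Min-Max scaling: (v − min) / (max − min) maps the column onto [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: z = (v − mean) / sd gives mean 0 and unit variance
mu = statistics.mean(values)
sd = statistics.pstdev(values)
standardized = [(v - mu) / sd for v in values]
```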
The right k

Small K
• Many small regions
• Non-smooth decision boundaries
• May overfit

Large K
• Fewer, larger regions
• Smoother decision boundaries
• May underfit

Choosing K
• Domain dependent
• Cross-validation

Practical Considerations

• Accuracy depends solely on the data model
• Scaling of features may be required
• Nearest Neighbor classifiers are lazy learners
• Can sometimes be the best method
• Memory intensive
• Low training time, high prediction time

kNN function

kNN Regression

• kNN for Regression (advertising data)
• K = 1: a rough step function. Flexible fit. Low bias, high variance. Overfitting.
• K = 9: a smoother function. More rigid fit. High bias, low variance. Changing one observation has a small effect.


kNN Regression

• kNN regression is non-parametric
• For regression, the average of the neighbours' responses is computed
• (For classification, majority voting is taken into account)
• The algorithm requires:
• K
• A distance metric (usually the Euclidean metric)
• Features should be on the same scale
• Memory intensive – instance-based method
• Curse of dimensionality
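A minimal plain-Python sketch of kNN regression on a made-up 1-D dataset, showing how small k gives a jumpier fit than a larger k:

```python
def knn_regress(train_X, train_y, x0, k):
    # predict at x0 as the average response of the k nearest training points
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x0))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
print(knn_regress(X, y, 2.6, k=1))  # copies the single nearest response
print(knn_regress(X, y, 2.6, k=3))  # averaging 3 neighbours smooths the fit
```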
ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus


Resampling Methods
BITS Pilani
Pilani Campus

• They involve repeatedly drawing samples from the training set in order to
obtain additional information about the fitted model
• For example,

• Estimate population parameters by sampling with replacement the dataset

• To estimate test-set prediction error

• Select the appropriate level of flexibility


Lecture 5 Resampling Methods
• Two resampling methods: cross-validation and the bootstrap



Training Error Vs Test Error

• Recall the distinction between the test error and the training error:
• The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.
• In contrast, the training error is calculated by applying the statistical learning method to the observations used in its training.
• But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

Validation set approach

Holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations to estimate the test error.
• We randomly divide the training set into two equal parts: a training set and a validation or hold-out set.
• The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
• The resulting validation-set error provides an estimate of the test error.
• This is typically assessed using MSE in the case of a quantitative response and misclassification rate in the case of a qualitative (discrete) response.
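The 50/50 split can be sketched in plain Python (the data here is just a stand-in for real observations):

```python
import random

def validation_split(data, seed=0):
    # randomly divide the observations into two equal halves:
    # a training set and a validation (hold-out) set
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train = [data[i] for i in idx[:half]]
    valid = [data[i] for i in idx[half:]]
    return train, valid

data = list(range(20))
train, valid = validation_split(data)
# fit on `train`, then estimate the test error with MSE (quantitative
# response) or misclassification rate (qualitative) computed on `valid`
print(len(train), len(valid))  # 10 10
```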

Drawbacks

• The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
• In the validation approach, only a subset of the observations — those that are included in the training set — are used to fit the model. We lose a major proportion of the data.

Leave One Out Cross Validation (LOOCV)

• Tries to address the drawbacks of the validation set approach
• Only a single observation is used for validation; n − 1 observations for training.
• Very time consuming, computationally expensive.

[Figure: Mean Squared Error vs Degree of Polynomial, for the validation-set approach (left, highly variable across splits) and LOOCV (right)]
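A plain-Python sketch of the LOOCV loop; the mean-only "model" below is a deliberately trivial stand-in for a real learner:

```python
def loocv_mse(xs, ys, fit, predict):
    # leave-one-out CV: each of the n observations is held out once,
    # the model is fit on the remaining n-1, and squared errors are averaged
    errs = []
    for i in range(len(xs)):
        tr_x = xs[:i] + xs[i + 1:]
        tr_y = ys[:i] + ys[i + 1:]
        model = fit(tr_x, tr_y)
        errs.append((predict(model, xs[i]) - ys[i]) ** 2)
    return sum(errs) / len(errs)

# toy "model": always predict the mean of the training responses
fit = lambda xs, ys: sum(ys) / len(ys)
predict = lambda model, x: model

mse = loocv_mse([1, 2, 3, 4], [1.0, 2.0, 3.0, 4.0], fit, predict)
print(mse)
```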


Bias-Variance Tradeoff in Cross-Validation

• Statistical methods tend to perform worse when trained on fewer observations
• In the 50% validation set approach, only a subset is available for training
• In LOOCV, bias will be the lowest, but variance is high. It also doesn't shake up the data enough.
• What's the way in between?
• Set k = 5 or 10
• Moderate bias, moderate variance

k-fold Cross-Validation

• Widely used approach for estimating test error.
• Estimates can be used to
• Select the best model
• Estimate the test error of the final chosen model.
• Process
• Randomly divide the data into k equal-sized parts
• Leave out part k, fit the model to the other k − 1 parts (combined), and then obtain predictions and the error for the left-out kth part.
• This is done in turn for each part k = 1, 2, . . . , k
• The average of the k errors is the estimate of the test error
• Setting k = n yields n-fold or leave-one-out cross-validation (LOOCV)
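The fold construction can be sketched in plain Python (index bookkeeping only; fitting and scoring are left to the caller):

```python
import random

def kfold_indices(n, k, seed=0):
    # randomly divide the n observations into k (nearly) equal parts;
    # each part serves once as the held-out fold
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(10, 5)
for train_idx, test_idx in splits:
    # fit on train_idx, evaluate on test_idx; average the k errors
    print(len(train_idx), len(test_idx))  # 8 2 each time
```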

5-fold CV

• Widely used approach for estimating test error.

Bootstrap

• The bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.
• Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.
• Each of these "bootstrap data sets" is created by sampling with replacement and is the same size as our original dataset. As a result, some observations may appear more than once in each bootstrap data set and some not at all.
• Example: Training dataset = {1, 2, 3} can give the possible datasets below:

111 222 333
112 221 331
113 223 332
122 233 322
123 231 321
133 211 311
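A minimal sketch of drawing bootstrap datasets from {1, 2, 3} with Python's random module:

```python
import random

def bootstrap_sample(data, seed=None):
    # sample with replacement, same size as the original dataset:
    # some observations may appear more than once, some not at all
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

data = [1, 2, 3]
samples = [bootstrap_sample(data, seed=s) for s in range(5)]
print(samples)
```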


Implementing 5-fold CV
• House Prices Dataset

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus


Curse of Dimensionality
BITS Pilani
Pilani Campus

• Despite their simplicity, the linear models often have a good predictive power.
• If the relationship between predictor and response is linear, they tend to have low bias.
• And if n is large enough, i.e., n >> p, they have low variance too. No overfitting.
• But, what if p > n?
• New technologies can generate unlimited features.
• On the other hand, collection of data at times may have some constraints like cost and
availability. Result: p > n.
Lecture 6 Regularization
• How to manage irrelevant variables?



Dimensionality Reduction

• We need better generalization, low overfitting and high interpretability.
• How to manage irrelevant variables?
• Perform feature selection
• Push the coefficients of some variables to zero
• Methods for variable selection/elimination:
• Subset selection: Select a subset of features which we believe are relevant
• Shrinkage: Shrink the coefficients of some variables to zero. Called "Regularization". Result: feature selection.
• Dimension reduction: Project the p predictors into an M-dimensional subspace, M < p. Achieved by computing M different linear combinations of the p variables. Fit the model on the M features. Principal Component Analysis.

Shrinkage Methods

• Fit a model on all p predictors, but constrain or regularize the coefficient estimates, i.e., shrink them toward zero.
• Ridge Regression and Lasso Regression

Ridge Regression
• Recall: the Residual Sum of Squares
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
• The least squares method estimates the coefficients β0, β1, β2, …, βp that minimize RSS, which can also be expressed as
RSS = Σᵢ₌₁ⁿ (yᵢ − β0 − Σⱼ₌₁ᵖ βⱼxᵢⱼ)²
• Ridge regression is the same as OLS, except that the coefficients are estimated by minimizing a slightly different quantity:
Σᵢ₌₁ⁿ (yᵢ − β0 − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ² = RSS + λ Σⱼ₌₁ᵖ βⱼ²
• where λ ≥ 0 is called the regularization parameter.
Ridge Regression

• Ridge regression minimizes RSS + λ Σⱼ₌₁ᵖ βⱼ²

• The second term, λ Σⱼ₌₁ᵖ βⱼ², is called the shrinkage penalty. As the β values → 0, the penalty decreases.

• This penalty is called the ℓ2 penalty.

• The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates.

• Selecting a good value for λ is critical; cross-validation is used for this.

• λ = 0 ⇒ the penalty has no effect; the estimates are the same as the least squares regression coefficients.

• λ → ∞ ⇒ the coefficients β1, . . . , βp → 0. What happens to β0? (The intercept is not penalized.)
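A one-predictor sketch makes the shrinkage visible. Assuming x is standardized and y centered (no intercept), minimizing Σ(y − bx)² + λb² gives the closed form b = Sxy/(Sxx + λ); the toy data below is made up:

```python
def ridge_beta(xs, ys, lam):
    # one-predictor ridge (x standardized, y centered, no intercept):
    # minimizes sum((y - b*x)^2) + lam * b^2, giving b = Sxy / (Sxx + lam)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-1.5, -0.5, 0.5, 1.5]
ys = [-3.1, -0.9, 1.1, 2.9]
print(ridge_beta(xs, ys, 0.0))   # lam = 0: same as least squares
print(ridge_beta(xs, ys, 5.0))   # larger lam shrinks the coefficient
print(ridge_beta(xs, ys, 1e10))  # lam -> infinity: coefficient -> 0
```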

λ Vs β's

Bias-Variance Tradeoff

• As λ increases, flexibility decreases, which means a more robust model, less overfitting, less variance. But bias increases.
• An appropriate value of λ can minimize the sum of the bias and variance errors, i.e., the MSE.


λ from 0.01 to 10¹⁰

OLS:
(Intercept) Income Limit Rating Cards Age Education Female Student Married Caucasian
-479.21 -7.80 0.19 1.14 17.72 -0.61 -1.10 -10.65 425.75 -8.53 10.11
At λ = 0.01:
-473.9088 -7.7987 0.1801 1.2954 16.9949 -0.6373 -1.0560 -10.3862 425.8620 -7.4602 1.4765
At λ = 10¹⁰:
359.9029 0.0002 0.0336 0.0023 0.0000 -0.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000

Lasso Regression

• Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
• The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients β̂λL minimize the quantity

Σᵢ₌₁ⁿ (yᵢ − β0 − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ |βⱼ| = RSS + λ Σⱼ₌₁ᵖ |βⱼ|

• In statistical parlance, the lasso uses an ℓ1 penalty instead of an ℓ2 penalty.
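The same one-predictor setting shows the difference: the lasso solution is a soft-thresholding of Sxy, so a large enough λ drives the coefficient to exactly zero (toy data as in the ridge sketch; this is an illustration, not the general coordinate-descent algorithm):

```python
def lasso_beta(xs, ys, lam):
    # one-predictor lasso (x standardized, y centered, no intercept):
    # minimizes sum((y - b*x)^2) + lam * |b| -> soft-thresholding of Sxy
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if sxy > lam / 2:
        return (sxy - lam / 2) / sxx
    if sxy < -lam / 2:
        return (sxy + lam / 2) / sxx
    return 0.0  # the l1 penalty pushes small coefficients to exactly zero

xs = [-1.5, -0.5, 0.5, 1.5]
ys = [-3.1, -0.9, 1.1, 2.9]
print(lasso_beta(xs, ys, 0.0))   # same as least squares
print(lasso_beta(xs, ys, 4.0))   # shrunk
print(lasso_beta(xs, ys, 30.0))  # exactly 0: the feature is eliminated
```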

Ridge Vs Lasso

Parameter | Ridge | Lasso
Coefficients | Shrinks close to 0 | Pushes to exact 0 (if λ is large enough)
Penalty type | L2 | L1
Penalty | λ * sum of squared coefficients | λ * sum of absolute coefficients
Feature selection | Reduces impact of least important features | Eliminates least important features
Complexity and interpretability | High | Low
Sparsity | Low | High
Sensitivity | More robust; less sensitive to outliers compared with Lasso | More sensitive to outliers

Ridge Vs Lasso
• Ridge and Lasso for auto-mpg data


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 7 Bayesian Learning

BITS Pilani Pravin Mhaske


Pilani Campus

Conditional Probability

Consider a family with two children. Sample space S = {GG, GB, BG, BB}.
• What is the probability that both the children are girls?
• What is the probability that both children are girls, given that the first child is a girl?
• What is the probability that both the children are girls, given that at least one of them is a girl?

Sol.: Let A, B, and C be the events that both children are girls, the first child is a girl, and at least one of them is a girl, respectively.
Here A = {GG}, B = {GG, GB}, C = {GG, GB, BG}; P(A) = 1/4, P(B) = 1/2, P(C) = 3/4

P(A|B) = P(A ∩ B)/P(B) = P(A)/P(B) = (1/4)/(1/2) = 1/2
P(A|C) = P(A ∩ C)/P(C) = P(A)/P(C) = (1/4)/(3/4) = 1/3

Q. In rolling of a fair die, what is the probability that the outcome is an odd number, given that it is less than or equal to 3?
Sol. P(A|B) = P(A ∩ B)/P(B) = (2/6)/(3/6) = 2/3
Total Probability Rule

I have three bags each containing 100 balls.
• Bag 1 has 75 red and 25 blue balls
• Bag 2 has 60 red and 40 blue balls.
• Bag 3 has 45 red and 55 blue balls.

If I choose a bag at random, and then pick a ball from it, also at random, what is the probability that the chosen ball is red?

Total Probability Rule

Sol.: Let R be the event of picking a red ball, and B1, B2, B3 the events of choosing bag 1, bag 2, and bag 3.
Using the total probability rule,
P(R) = P(B1 ∩ R) + P(B2 ∩ R) + P(B3 ∩ R)
= P(B1)P(R|B1) + P(B2)P(R|B2) + P(B3)P(R|B3)
= (1/3)(0.75) + (1/3)(0.60) + (1/3)(0.45) = 0.60

Q 2: A manufacturing firm receives shipments of parts from two different suppliers. Supplier A provides 65%, whereas supplier B provides 35% of total parts. Historical data suggests that supplier A provides 2% bad parts, whereas supplier B provides 5% bad parts. With this data, can you find out the total probability of bad parts? (You may use a tree diagram)

Bayes Theorem

For any two events A and B, where P(A) ≠ 0,
P(B|A) = P(B ∩ A)/P(A) = P(B)P(A|B)/P(A)

Bayes' theorem helps to revise probabilities (prior to posterior) with new information. This theorem has wide applications in business statistics.

Q 1. I have three bags each containing 100 balls.
• Bag 1 has 75 red and 25 blue balls
• Bag 2 has 60 red and 40 blue balls.
• Bag 3 has 45 red and 55 blue balls.
A bag is chosen at random, and a ball is picked also at random. If the ball is found to be red, what is the probability that Bag 1 was chosen?

By the total probability rule,
P(R) = P(B1)P(R|B1) + P(B2)P(R|B2) + P(B3)P(R|B3) = 0.60
Now by Bayes' theorem,
P(B1|R) = P(B1)P(R|B1)/P(R) = P(B1)P(R|B1) / Σᵢ₌₁³ P(Bᵢ)P(R|Bᵢ) = (1/3 × 0.75)/0.60 = 5/12
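The slide's computation can be checked directly in a few lines of Python:

```python
# priors: a bag is chosen uniformly at random
priors = {"B1": 1 / 3, "B2": 1 / 3, "B3": 1 / 3}
# likelihoods: probability of drawing a red ball from each bag
p_red = {"B1": 0.75, "B2": 0.60, "B3": 0.45}

# total probability rule
p_r = sum(priors[b] * p_red[b] for b in priors)
# Bayes' theorem: posterior probability of Bag 1 given a red ball
posterior_b1 = priors["B1"] * p_red["B1"] / p_r

print(p_r)           # 0.60
print(posterior_b1)  # 5/12 = 0.4166...
```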


The ideal f̂

• Y: categorical variable with K ≥ 2 classes
• x0 is an observation
• Suppose the conditional probabilities are available: {P(Y = j|X = x): ∀x, ∀j}
• x0 is assigned to class j if P(Y = j|X = x0) = Max{P(Y = j|X = x0) : j = 1 … K}
• This is the ideal f̂. This classifier gives the lowest possible test error rate.

Bayesian Classifiers

Y: categorical variable with K ≥ 2 classes
Given an observation (X1, X2, …, Xp), we want to find the k that maximizes P(Y = k|X1, X2, …, Xp)
Can we estimate P(Y = k|X1, X2, …, Xp) from the data?

From Bayes Theorem, P(Y = k|X1, X2, …, Xp) = P(X1, X2, …, Xp|Y = k) P(Y = k) / P(X1, X2, …, Xp)
Now
max_k P(Y = k|X1, X2, …, Xp) ≡ max_k P(X1, X2, …, Xp|Y = k) × P(Y = k) = Likelihood × Prior

The Naïve Bayes Classifier

Suppose the predictors are independent. Then
• P(Y = k|X1, X2, …, Xp) = P(X1, X2, …, Xp|k) P(k) / P(X1, X2, …, Xp) = P(X1|k) × P(X2|k) × ⋯ × P(Xp|k) P(k) / P(X1, X2, …, Xp)
• max_k P(k|X1, X2, …, Xp) ≡ max_k P(X1, X2, …, Xp|k) P(k) = max_k P(X1|k) × P(X2|k) × ⋯ × P(Xp|k) P(k)
• P(X1 = x1|Y = k) can be estimated from the data

Fruits Dataset

No Long Sweet Yellow Fruit
1 Y Y Y Banana
2 Y Y Y Banana
3 Y Y Y Banana
4 Y Y Y Banana
5 Y Y Y Banana
6 Y Y Y Banana

DATA SUMMARY
Fruit | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | TOTAL
Banana | 400 | 100 | 350 | 150 | 450 | 50 | 500
Orange | 0 | 300 | 150 | 150 | 300 | 0 | 300
Rest | 100 | 100 | 150 | 50 | 50 | 150 | 200
TOTAL | 500 | 500 | 650 | 350 | 800 | 200 | 1000
Fruits Dataset

• P(A ∩ B) = P(A|B) × P(B) = P(B|A) × P(A); if A and B are independent, P(A ∩ B) = P(A) × P(B)
• P(A|B) = P(A ∩ B)/P(B) = P(B|A) × P(A)/P(B)
• If E1, E2 & E3 are independent features,
P(A|E1 & E2 & E3) = P(A)/(P(E1) × P(E2) × P(E3)) × P(E1|A) × P(E2|A) × P(E3|A)

From the DATA SUMMARY:
P(Banana) = 0.5, P(Orange) = 0.3, P(Other) = 0.2
P(Long) = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8

P(B|L=1 & S=1 & Y=1) = 0.5/(0.5×0.65×0.80) × P(L=1|B) × P(S=1|B) × P(Y=1|B) = (1/0.26) × 0.5 × 0.8 × 0.7 × 0.9 = 0.9692
P(O|L=1 & S=1 & Y=1) = 0.3/(0.5×0.65×0.80) × P(L=1|O) × P(S=1|O) × P(Y=1|O) = (1/0.26) × 0.3 × 0 × 0.5 × 1 = 0
P(R|L=1 & S=1 & Y=1) = 0.2/(0.5×0.65×0.80) × P(L=1|R) × P(S=1|R) × P(Y=1|R) = (1/0.26) × 0.2 × 0.5 × 0.75 × 0.25 = 0.0721

Note: they do not add up to 1.
Divide each "probability" by the sum of the three "probabilities" to get 0.9308, 0, 0.0692.
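The same table-driven computation in plain Python (the class-conditional probabilities are read straight off the DATA SUMMARY):

```python
# class priors and per-class feature likelihoods from the DATA SUMMARY table
priors   = {"Banana": 0.5,       "Orange": 0.3,       "Other": 0.2}
p_long   = {"Banana": 400 / 500, "Orange": 0 / 300,   "Other": 100 / 200}
p_sweet  = {"Banana": 350 / 500, "Orange": 150 / 300, "Other": 150 / 200}
p_yellow = {"Banana": 450 / 500, "Orange": 300 / 300, "Other": 50 / 200}

# naive-Bayes score for Long=1, Sweet=1, Yellow=1; the evidence term is
# omitted because it is the same for every class
scores = {k: priors[k] * p_long[k] * p_sweet[k] * p_yellow[k] for k in priors}
total = sum(scores.values())
posteriors = {k: v / total for k, v in scores.items()}
print(posteriors)  # Banana ~0.9308, Orange 0, Other ~0.0692
```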

Spherical, Sweet and Red?

Fruit | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | TOTAL
Banana | 400 | 100 | 350 | 150 | 450 | 50 | 500
Orange | 0 | 300 | 150 | 150 | 300 | 0 | 300
Rest | 100 | 100 | 150 | 50 | 50 | 150 | 200
TOTAL | 500 | 500 | 650 | 350 | 800 | 200 | 1000

P(A|E1 & E2 & E3) = P(A)/(P(E1) × P(E2) × P(E3)) × P(E1|A) × P(E2|A) × P(E3|A)
P(Banana) = 0.5, P(Orange) = 0.3, P(Other) = 0.2
P(Long) = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8

Here 1/(P(L=0) × P(S=1) × P(Y=0)) = 1/(0.5×0.35×0.20) = 28.57:
P(B|L=0 & S=1 & Y=0) = 28.57 × 0.5 × P(L=0|B) × P(S=1|B) × P(Y=0|B) = 28.57 × 0.5 × 0.2 × 0.7 × 0.1 = 0.20
P(O|L=0 & S=1 & Y=0) = 28.57 × 0.3 × P(L=0|O) × P(S=1|O) × P(Y=0|O) = 28.57 × 0.3 × 1 × 0.5 × 0 = 0
P(R|L=0 & S=1 & Y=0) = 28.57 × 0.2 × P(L=0|R) × P(S=1|R) × P(Y=0|R) = 28.57 × 0.2 × 0.5 × 0.75 × 0.75 = 1.607

Normalizing: B = 0.20/(0.20 + 0 + 1.607) ≈ 0.11, O = 0, R ≈ 0.89

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus
Lecture 8 k Nearest Neighbours

Eager and Lazy learning

Parameter | Eager learning | Lazy Learning
Process | The system tries to construct a general, input-independent target function during training | Generalization beyond the training data is delayed until a query is made to the system (Instance-Based Learning)
Generalization | Generalizes globally, beyond the training data, independent of the new query | Generalizes locally, depending upon the query instance
Training time | High | Almost nil
Prediction time | Low | High
Memory requirement | Low | High
Examples | Linear Regression, Logistic Regression | kNN

k Nearest Neighbors

K = 3, with six blue observations and six orange observations; x is a test observation.
The 3 closest points are identified, and therefore x belongs to the blue class.
The KNN decision boundary for this example is shown in black.

Flexibility

• The Training error rate (blue) & Test error rate (orange) vs Flexibility on a simulated data set
• Flexibility is measured by 1/k
• High k: stable model
• Low k: flexible model
• Bias-Variance Tradeoff?


Distance Metrics

• Euclidean:
• Straight-line distance
• Scale variant, sensitive to data dimensionality
• Normalization (scaling) can solve this issue

• Manhattan:
• Taxicab or city-block distance
• Sum of absolute differences of Cartesian coordinates

Normalization and Standardization

✓ Feature Scaling – equivalence of features
✓ Algorithms can be biased towards the features having numerically larger values
✓ Faster computations

Normalization:
• Min-Max scaling
• Converts to 0-1 range

Standardization:
• Converts to z-values using mean and std dev

The right k

Small K
• Many small regions
• Non-smooth decision boundaries
• May overfit
Large K
• Fewer, larger regions
• Smoother decision boundaries
• May underfit
Choosing K
• Domain dependent
• Cross-validation


Practical Considerations

• Accuracy depends solely on a data model
• Scaling of features may be required
• Nearest Neighbor classifiers are lazy learners
• Can sometimes be the best method
• Memory Intensive
• Low training time, high prediction time

Exercise

• Classification model using kNN

Mid-Sem exam

• 5-7 questions
• Difficulty level: Easy, moderate
• Application based: 2-4
• Theory: 2-4
• Topics: Linear Regression, Bias-Variance, Over and underfitting, Types of errors, Model selection, Regularization, Cross-validation, Parametric or non-parametric methods, Resampling

How to write?

• Try to write in the textbox wherever possible. Use sheet upload for calculations, charts etc.
• Upload only the answer to the specific question, not the complete answer sheet.
• Take the photo at an appropriate angle. No creative photography!
• Write legibly. If the evaluator can't read or understand, you may lose marks.
• Underline, box or highlight important parts of the answer, like the keywords, final answer, etc.
• If you don't really know the answer, writing anything doesn't give marks.

Re-evaluations:
• Cross-check the answer with the sample answers. If it is not correct or even nearby, do not apply.
• Unjustified and vague requests may result in deduction.
• No marks for "attempting" a question.


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 9 Classification, Logistic Regression

BITS Pilani Pravin Mhaske


Pilani Campus

Classification

Here the response variable Y is qualitative/categorical
• Email Spam: email is one of Y = {spam, email}
• Handwritten Digit Recognition: digit class is one of Y = {0, 1, . . . , 9}.

Our goals are:
1. Prediction
• Build a classifier f(X) that assigns a class label to a future unlabeled observation X
• Estimate the probability that X belongs to each category in Y
Example: We may be more interested in an estimate of the probability that a transaction is fraudulent than in classifying whether the transaction is fraudulent or not
2. Inference
• Understand the roles of the different predictors among X = (X1, X2, . . . , Xp)

Regression Vs Classification

Variables can either be quantitative or qualitative (categorical)
• Quantitative variables take on numerical values – Income, Bill amount
• Categorical variables take on values in one of K different classes – Gender, Digit

Regression problem: the response variable is quantitative
Classification problem: the response variable is categorical


Classification Algorithms

• Naïve Bayes
• K-nearest Neighbour
• Logistic Regression
• Discriminant Analysis
• Decision Trees
• Support Vector Machine

Estimator and Error Rate

We have seen y = f(X).
f is a function that best maps an input x to output y. We wish to estimate this f.

The accuracy of f̂ is usually defined by the Error Rate
• The proportion of mis-classifications
Ave(Σ I(yᵢ ≠ ŷᵢ))
where I is the indicator function

There are two error rates:
• Training Error Rate
• Test Error Rate

A good classifier is one for which the test error rate is the smallest.

Regression Revision

Relationships between a numerical response and numerical / categorical predictors
• Hypothesis tests for all regression parameters together – testing the model
• Model coefficient interpretation
• Hypothesis tests for each regression parameter
• Confidence intervals for regression parameters
• Confidence and prediction intervals for predicted means and values
• Model diagnostics, residual plots, outliers
• RSS, MSE, R2
• Interpreting computer outputs

Classification – why and how?

• Regression gives a number. What if I want to identify a class or category and not a number?
• Let's say I want to identify genuine emails vs spam emails, or genuine transactions vs fraud transactions. Here the outcomes are text values, but models can understand only numbers.
• How do I handle this? I will replace the 2 classes by numbers, say one class as 1 and another as 0, and train a model which can predict the outcome value, that is 0 or 1.
• But models can't give discrete values 0 and 1.
• We can rather make them give a continuous value between 0 and 1.
• If the value is closer to 1 (i.e. >= 0.5), I consider it as 1, otherwise 0.


Classification – why and how?

• Can I use the concepts of linear regression here? How?
• Linear regression can throw out any value between −∞ and +∞.
• However, I want to map or convert that range (−∞, +∞) to (0, 1).
• We need a link function to do this.
• The most appropriate one is a sigmoid or logistic function.

Odds and Logit Function

Odds is commonly used in gambling (and logistic regression).
For an event E,
• If we know P(E), then
Odds(E) = P(E)/P(~E) = P(E)/(1 − P(E))
If the odds of E are "x to y", then P(E) = x/(x + y)

Logit function:
• logit(p) = ln(p/(1 − p)), 0 ≤ p ≤ 1
log Odds(E) = ln(P(E)/(1 − P(E)))

logit can be interpreted as the log odds of a success.

Logit Function and Logistic (Sigmoid) Function

• Logistic regression is a Generalized Linear Model (GLM)
• Uses the logistic or sigmoid function.

The logit function
• logit(p) = ln(p/(1 − p)), 0 < p < 1
• Converts (0, 1) to the range (−∞, +∞)

The inverse function is known as the sigmoid function (or logistic function)
• S(x) = e^x/(1 + e^x) = 1/(1 + e^(−x)), −∞ < x < +∞
• Converts (−∞, +∞) to the range (0, 1)

Sigmoid Curve
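A quick sketch confirming that the two functions are inverses of each other:

```python
import math

def logit(p):
    # log odds: maps (0, 1) to (-inf, +inf)
    return math.log(p / (1 - p))

def sigmoid(x):
    # logistic function: maps (-inf, +inf) to (0, 1); inverse of logit
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))           # 0.5
print(logit(0.5))           # 0.0
print(sigmoid(logit(0.8)))  # round-trips back to 0.8
```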


Evaluating the Model: Confusion Matrix

The performance of f̂ can also be described by a confusion matrix.

A confusion matrix is a table that is used to describe the performance of a classification model (or "classifier") on a set of data for which the true values are known.

The confusion matrix gives strong clues as to where f̂ is going wrong.

Example

Consider a classical problem of predicting spam and non-spam email.


The objective is to identify Spams.
The training set consists of 15 emails that are Spam, and 85 emails that are Not Spam
The model correctly classified 95 emails
• All 85 Non-Spams were correctly classified
• 10 Spams were correctly classified
• 5 Spams were classified as Non-Spams (False Negative if Target is Spam).

The Matrix

The objective is to identify Spams.

True Class
Predicted Class | Spam | Non-Spam
Spam | 10 | 0
Non-Spam | 5 | 85

• "true positive" for correctly identifying a target event
• "true negative" for correctly identifying a non-target event
• "false positive" for incorrectly identifying a non-target event as a target event
• "false negative" for incorrectly identifying a target event as a non-target event

TP = 10, FP = 0, TN = 85, FN = 5

True Class
Predicted Class | Target | Non-Target
Target | TP | FP
Non-Target | FN | TN

Is accuracy a good metric always?

Building a highly accurate (useless) classifier
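Computing the usual metrics from the spam example's counts:

```python
# counts from the spam example: TP = 10, FP = 0, TN = 85, FN = 5
TP, FP, TN, FN = 10, 0, 85, 5

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)  # of predicted spams, how many are spam
recall    = TP / (TP + FN)  # of actual spams, how many were caught (TPR)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy)   # 0.95
print(precision)  # 1.0
print(recall)     # 0.666...
print(f1)         # 0.8
```

Note how the 95% accuracy hides that a third of the spams were missed, which is exactly the point of the "highly accurate (useless) classifier" warning.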

What’s the right metric? The Metrics

1. Many classifiers are designed to optimize error/accuracy


2. This tends to bias the performance towards the majority class
1 is same as Positive
3. Anytime there is an imbalance in the data this can happen 0 is same as Negative
4. It is particularly pronounced, though, when the imbalance is more pronounced
5. Accuracy is not the right measure of classifier performance in such cases
6. What are other metrics?
1. Precision
2. Recall (Sensitivity or TPR or True Positive Rate = TP/P)
3. F1-score?

Also check*
1. Specificity (TNR or True Negative Rate = TN/N)
2. False Positive Rate (FPR) = FP/N = 1 – TNR
3. And others…

Refer https://en.wikipedia.org/wiki/Confusion_matrix

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Strategies for Imbalanced Data

1. Under-sampling
2. Over-sampling
3. Optimize AUC

Under-sampling (majority class)

Create a new training data set by:
• including all k "positive" examples
• randomly picking k "negative" examples

Pros:
• Easy to implement
• Training becomes much more efficient (smaller training set)
• For some domains, can work very well

Cons:
• Throwing away a lot of data/information

Over-sampling (minority class)

Create a new training data set by:
• including all m negative examples
• including m positive examples:
• repeat each example a fixed number of times, or
• sample with replacement

Pros:
• Easy to implement
• Utilizes all of the training data
• Tends to perform well in a broader set of circumstances than subsampling

Cons:
• Computationally expensive to train a classifier

Multiclass Classification

Suppose the possible responses are A, B & C.
f̂ was run on the training set and the following Confusion Matrix was generated:

True Class
Predicted Class | A | B | C
A | 30 | 20 | 10
B | 50 | 60 | 10
C | 20 | 20 | 80

The confusion matrix gives strong clues as to where f̂ is going wrong.
• For class A, f̂ incorrectly assigned label B for the majority of the mislabelled cases. Perhaps features need to be added to improve classification of this label.
The more zeroes or smaller the numbers on all cells but the diagonal, the better the classifier is doing.


Multiclass Classification

Suppose the possible responses are A, B & C.
f̂ was run on the training set and the following Confusion Matrix was generated:

True Class
Predicted Class | A | B | C
A | 30 | 20 | 10
B | 50 | 60 | 10
C | 20 | 20 | 80

True Positives are those observations of a particular class that were classified correctly
False Positives are those observations that were incorrectly mapped to one class
False Negatives are those observations of a particular class that were classified incorrectly
True Negatives: applicable for a two-class scenario

TP_A = 30, TP_B = 60, TP_C = 80
FP_A = 30, FP_B = 60, FP_C = 40
FN_A = 70, FN_B = 40, FN_C = 20

More Metrics

True Positive Rate (TPR)
• TP/(TP+FN) or TP/AP
• Proportion of actual positive instances correctly identified
• Also called Recall or Sensitivity
• You'd always want to maximize TPR
• Example: Correctly detected defaulters out of all actual defaulters

False Positive Rate (FPR)
• FP/(TN+FP) or FP/AN
• Proportion of actual negatives incorrectly identified as positives (false alarms)
• Also called Type 1 error
• You'd always want to minimize FPR
• Especially harmful in cases like medical diagnosis
• Example: Incorrectly detected as defaulter out of all non-defaulter customers
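The per-class counts can be read off the matrix programmatically (rows are predictions, columns are true classes, as in the slide):

```python
# confusion matrix: rows = predicted class, columns = true class
labels = ["A", "B", "C"]
cm = [[30, 20, 10],
      [50, 60, 10],
      [20, 20, 80]]

n = len(labels)
TP = {labels[i]: cm[i][i] for i in range(n)}
# FP for a class: its row total minus the diagonal entry
FP = {labels[i]: sum(cm[i]) - cm[i][i] for i in range(n)}
# FN for a class: its column total minus the diagonal entry
FN = {labels[j]: sum(cm[i][j] for i in range(n)) - cm[j][j] for j in range(n)}

print(TP)  # {'A': 30, 'B': 60, 'C': 80}
print(FP)  # {'A': 30, 'B': 60, 'C': 40}
print(FN)  # {'A': 70, 'B': 40, 'C': 20}
```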

Receiver Operating Characteristics (ROC)

• Displays the trade-off between TPR and FPR at all possible thresholds.
• Changing the threshold changes TPR and FPR.
• Good to compare multiple different classifiers.
• The "no information" classifier (which works no better than a random guess) has AUC = 0.5
• The closer the curve to the top left, the better the classifier:
• High TPR - minimum FN
• Low FPR - minimum FP
• Maximize the Area Under the ROC Curve (AUC) for a better classifier. It provides a quantitative measure of the model's performance.

Exercise

• Building a Logistic Regression Classifier using sklearn
• Plot ROC
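A plain-Python sketch of tracing ROC points by sweeping a threshold over made-up scores (in practice a library routine such as sklearn's roc_curve would be used):

```python
def roc_points(scores, labels, thresholds):
    # for each threshold t, classify score >= t as positive and
    # record (FPR, TPR); sweeping t traces the ROC curve
    pts = []
    P = sum(labels)
    N = len(labels) - P
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return pts

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # made-up predicted probabilities
labels = [1,   1,   0,   1,   0,   0]    # made-up true classes
pts = roc_points(scores, labels, [0.0, 0.5, 1.0])
print(pts)  # from (1.0, 1.0) down to (0.0, 0.0) as the threshold rises
```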


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 11 Tree-based Methods
BITS Pilani Pravin Mhaske
Pilani Campus

Tree based methods

• Tree-based methods are simple and useful for interpretation.
• Tree-based methods work for both regression and classification.
• These involve stratifying or segmenting the predictor space into a number of simple regions.
• Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods.
• Decision trees can be applied to both regression and classification problems, so they are also called CART.

Pros and Cons

• However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.
• Hence, we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction.
• Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.


Classification Trees

• For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
• We use recursive binary splitting to grow a classification tree.
• Cost function is the classification error rate. This is simply the fraction of the training observations in that region that do not belong to the most common class:
  E = 1 − max_k p̂_mk
  Here p̂_mk represents the proportion of training observations in the mth region that are from the kth class.
• However, classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable: the Gini index and cross-entropy.

Gini Index (Gini Impurity)

• The Gini index is defined by
  G = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)
• The Gini index takes on a small value if all of the p̂_mk’s are close to zero or one.
• For this reason, the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
• For two classes, G = 0.5 ⇒ maximum impurity.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Cross-Entropy

• An alternative to the Gini index is cross-entropy, given by
  D = − Σ_{k=1}^{K} p̂_mk log p̂_mk
• Related to this is the Information Gain: the decrease of D from one level to the next.

Gini Vs Cross-Entropy

It turns out that the Gini index and the cross-entropy are very similar numerically.
For two classes (with log base 2), D = 1 ⇒ maximum impurity.
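A small sketch of both impurity measures, assuming log base 2 for the entropy (consistent with D = 1 at maximum impurity for two classes); the numbers reproduce the root node of the worked example that follows.

```python
# Gini index and cross-entropy for a node's class proportions.
import math

def gini(p):
    """G = sum_k p_k * (1 - p_k) over class proportions p."""
    return sum(pk * (1 - pk) for pk in p)

def entropy(p):
    """D = -sum_k p_k * log2(p_k); terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))       # 0.5 1.0 -> maximum impurity
print(round(gini([6/8, 2/8]), 2),
      round(entropy([6/8, 2/8]), 2))               # 0.38 0.81, the root node below
```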

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Building a tree

X1, X2 & X3 are the independent variables; Y is the response variable

X1 X2 X3 Y
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  0
1  0  0  0
1  0  1  1
1  1  0  0
1  1  1  1

Level 0: 6 ‘0’ and 2 ‘1’ ⇒ G = (6/8)(2/8) + (2/8)(6/8) = 0.38 and D = −(6/8)log2(6/8) − (2/8)log2(2/8) = 0.81

Level 1:
• Splitting candidate X1 (X1 < 0.5 vs X1 ≥ 0.5): YL: 0,0,0,0 (G = 0, D = 0); YR: 0,1,0,1 (G = 0.5, D = 1); GTotal = 0.25, DTotal = 0.5
• Splitting candidate X2: YL: 0,0,0,1 (G = 0.38, D = 0.81); YR: 0,0,0,1 (G = 0.38, D = 0.81); GTotal = 0.38, DTotal = 0.81
• Splitting candidate X3: YL: 0,0,0,0 (G = 0, D = 0); YR: 0,0,1,1 (G = 0.5, D = 1); GTotal = 0.25, DTotal = 0.5
Optimum candidate: X1, with G = 0.25 and D = 0.50
• Left branch: (0,0,0,0), (0,0,1,0), (0,1,0,0), (0,1,1,0); G = 0, D = 0; no further splitting
• Right branch: (1,0,0,0), (1,0,1,1), (1,1,0,0), (1,1,1,1); G = 0.5, D = 1

Level 2 (right branch only):
• Splitting candidate X2: YL: 0,1 (G = 0.5, D = 1); YR: 0,1 (G = 0.5, D = 1); GTotal = 0.5, DTotal = 1
• Splitting candidate X3: YL: 0,0 (G = 0, D = 0); YR: 1,1 (G = 0, D = 0); GTotal = 0, DTotal = 0
Optimum candidate: X3

Level 3: No further splitting
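The hand-built tree can be checked against scikit-learn; a minimal sketch, noting that in this table Y = 1 exactly when X1 = 1 and X3 = 1.

```python
# Fit the 8-row example with a Gini-based decision tree.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
     [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]   # columns X1, X2, X3
y = [0, 0, 0, 0, 0, 1, 0, 1]                        # response Y

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(tree.score(X, y))                         # 1.0 -> training error 0, as in the table
print(tree.predict([[1, 0, 1], [0, 1, 1]]))     # 1 for X1=X3=1, else 0
```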

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Tree and predictor space

X1 X2 X3 Y
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  0
1  0  0  0
1  0  1  1
1  1  0  0
1  1  1  1

The predictor space over (X1, X3) is partitioned at X1 = 0.5 and, for X1 ≥ 0.5, at X3 = 0.5 (plot omitted).

Error rate

Level | Error Rate
0 | 0.25 = 2/8
1 | 0.25 = (0 + 2)/8
2 | 0 = 0 + 0
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Pruning a tree Pruning a tree and cross validation

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Pros and Cons

▲ Trees are very easy to explain, even easier than linear regression!
▲ Decision trees more closely mirror human decision-making than do the other regression and
classification approaches.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert
(especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy
variables.
▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of
the other regression and classification approaches.

However, by aggregating many decision trees, the predictive performance of


trees can be substantially improved.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Summary
BITS Pilani
Pilani Campus

• Decision trees are simple and interpretable models for regression and classification
• However, they are often not competitive with other methods in terms of
prediction accuracy
• Bagging, random forests, and boosting are good methods for improving the prediction
accuracy of trees.
• Random Forest – They work by growing many trees on the training data and then combining
the predictions of the resulting ensemble of trees.
• The latter two methods, random forests and boosting, are among the state-of-the-art
methods for supervised learning. However, their results can be difficult to interpret. Lecture 11 Model Ensembles
• It appears that overfitting may not happen in Boosting

BITS Pilani, Pilani Campus

Why Ensemble? Bootstrap aggregating

• Decision trees are simple and interpretable models for regression and classification • A general-purpose procedure for reducing the variance of a statistical learning method
• However they are often not competitive with other methods in terms of prediction
accuracy • It is particularly useful and frequently used in the context of decision trees.

• Bagging, random forests and boosting are good methods for improving the prediction accuracy of • This is not practical because we generally do not have access to multiple training sets.
trees. They work by growing many trees on the training data and then combining the predictions
of the resulting ensemble of trees.
• The latter two methods— random forests and boosting— are among the state-of-the-art
methods for supervised learning. However their results can be difficult to interpret.
• It appears that overfitting may not happen in Boosting

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Bootstrap Bagging

• The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty
associated with a given estimator or statistical learning method.

• Randomly drawing bootstrap samples from the original training data, each of size n and drawn with replacement.

• Distinct data sets by repeatedly sampling observations from the original data set with replacement.

• New dataset obtained is of same size as our original dataset.

• This step ensures that the base models are trained on diverse subsets of the data, as some samples
may appear multiple times in the new subset, while others may be omitted. It reduces the risks of
overfitting and improves the accuracy of the model.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Bagging - Regression

Since we do not have access to multiple training sets, we bootstrap:
• Generate B different bootstrapped training data sets.
  Sample with replacement B times from the (single) training data set.
• Train the method on the bth bootstrapped training set to get f̂^b(x), b = 1, …, B
• Average all the predictions to obtain
  f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)

Bagging - Classification

• Generate B different bootstrapped training data sets.
• Train the method on the bth bootstrapped training set to get f̂^b(x), b = 1, …, B
• Take the majority vote: the overall prediction is the most commonly occurring class.
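The bagging recipe can be sketched directly: B bootstrap samples of size n drawn with replacement, one regression tree per sample, predictions averaged. The 1-D data set here is assumed purely for illustration.

```python
# Manual bagging of regression trees on an assumed toy target.
import random
from sklearn.tree import DecisionTreeRegressor

random.seed(0)
X = [[float(i)] for i in range(10)]
y = [2.0 * row[0] + 1.0 for row in X]     # a simple known target

B, n, preds = 25, len(X), []
for b in range(B):
    idx = [random.randrange(n) for _ in range(n)]   # bootstrap: n rows WITH replacement
    Xb = [X[i] for i in idx]
    yb = [y[i] for i in idx]
    fb = DecisionTreeRegressor(random_state=b).fit(Xb, yb)
    preds.append(fb.predict([[4.5]])[0])            # f^b evaluated at x = 4.5

f_avg = sum(preds) / B                              # f_avg(x) = (1/B) * sum_b f^b(x)
print(f_avg)
```

scikit-learn's `BaggingRegressor` packages these same steps.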

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Random Forest Random Forest

• An improvement over bagged trees (m = p).


• Predictions from bagged trees are highly correlated (a strong predictor tends to dominate the top split of every tree), which limits how much averaging can reduce the variance.
• Random forests address this by decorrelating the trees.

• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p


• For example, if there are say 15 predictors, each split considers only 4 of them (√15 ≈ 4). At any single split, the algorithm is not even allowed to consider a majority of the predictors, so all predictors get a fair chance.
• Each tree becomes different from the others. Averaging the predictions reduces variance.
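In scikit-learn the per-split subsampling is controlled by `max_features`; a minimal sketch with `max_features="sqrt"` (m ≈ √p) on assumed synthetic data:

```python
# Random forest with m = sqrt(p) features considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)

print(rf.n_features_in_)   # p = 16 predictors; each split considers sqrt(16) = 4
print(rf.score(X, y))      # training accuracy of the ensemble
```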

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Random Forest - Regression Random Forest - Classification

• Generate B different bootstrapped training data sets. • Generate B different bootstrapped training data sets.
Sample with replacement B times from the (single) training data set. Sample with replacement B times from the (single) training data set.
• When building a decision tree • When building a decision tree
• At each split, a random sample of m predictors is chosen as split candidates from the full set of • At each split, a random sample of m predictors is chosen as split candidates from the full
p predictors. set of p predictors.
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
• We get f̂^b(x), b = 1, …, B
• Regression: average all the predictions to obtain f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)
• Classification: take the majority vote; the overall prediction is the most commonly occurring class

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Boosting Boosting - Procedure

• Boosting can be applied to regression and classification methods


• h1 is applied on the original dataset &
misclassifies 3 points. The weights of these points
• Bagging involves a parallel process
• Creating multiple datasets using the bootstrap are increased and others are decreased.
• Fitting a separate decision tree to each copy • h2 is applied on the new dataset & misclassifies 3
• Combining all of the trees in order to create a single predictive model. points. The weights of these points are increased
• Each tree is built independently of the other trees.
and others are decreased.
• h3 is applied on the new dataset & misclassifies 3
• Boosting is a serial process
• The trees are grown sequentially points.
• Each tree is grown using information from previously grown trees • The final classifier is the weighted sum of the
H = α1h1 + α2h2 + α3h3
• Each tree works on a modified version of the training set three classifiers.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Bagging Vs Boosting Types of Boosting


AdaBoost (Adaptive Boosting)
• Multiple weak learners into a single strong learner.
• The weak learners in AdaBoost are decision trees with a single split, called decision stumps

Gradient Boosting
• Sequentially adding predictors to an ensemble, each one correcting its predecessor.
• Tries to fit the new predictor to the residual errors made by the previous predictor.

XGBoost (Extreme Gradient Boosting)

• An implementation of gradient boosted decision trees designed for speed and performance.
• Plain gradient boosting is generally slow because models are trained sequentially; XGBoost mitigates this with parallelized tree construction and other systems optimizations.
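A minimal AdaBoost sketch with scikit-learn, whose default weak learner is a decision stump (a single-split tree), matching the description above; the data set is synthetic and assumed for illustration.

```python
# AdaBoost: many weak stumps combined into a strong learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=1)
ada = AdaBoostClassifier(n_estimators=50, random_state=1).fit(X, y)

print(ada.score(X, y))   # training accuracy after 50 boosting rounds
```

`GradientBoostingClassifier` and the `xgboost` package follow the same fit/predict pattern.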

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 12 Support Vector Machines

BITS Pilani Pravin Mhaske


Pilani Campus

Support Vector Machine

• Maximum Margin Classifier
• Can classify linearly separable as well as non-linearly separable data
• Can also be used for regression
• A data point is considered a p-dimensional vector
• Objective is to separate such data points using a (p − 1)-dimensional hyperplane

Hyperplane

• A hyperplane in p dimensions is a flat subspace of dimension p − 1.
• In general, the equation for a hyperplane has the form
  β0 + β1X1 + β2X2 + . . . + βpXp = 0
• In two dimensions, the hyperplane β0 + β1X1 + β2X2 = 0 is a line.
• If β0 = 0, the hyperplane goes through the origin, otherwise not.
• The vector β = (β1, β2, · · · , βp) is called the normal vector.

Data Dimension | Hyperplane dimension | Hyperplane
1 | 0 | Point
2 | 1 | Line
3 | 2 | Plane
p | p − 1 | Hyperplane

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


2D and 3D Hyperplane Maximum margin classifier

• H1 – can’t classify
• H2 – Classifies, but with thin margin
• H3 – Best among all. Maximum margin.

Data Dimension | Hyperplane dimension | Hyperplane
2 | 1 | Line
3 | 2 | Plane

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Maximum margin classifier Support Vectors

The data points or vectors that are the closest to


the hyperplane and which affect the position of
the hyperplane are termed as Support Vector.

Since these vectors decide the hyperplane, they


are called Support vectors.

Suppose the data is perfectly linearly separable


• There will be an infinite number of such hyperplanes
• There will be one with the biggest gap or margin between the two classes.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Classification using hyperplane

• In 2 dimensions, let’s assume a hyperplane β0 + β1X1 + β2X2 = 0
• If a point X (x1, x2) satisfies the above equation, X lies on the hyperplane
• If X makes the equation > 0, X lies on one side of the hyperplane
• If X makes it < 0, X lies on the other side
• For example, for the hyperplane 1 + 2X1 + 3X2 = 0
  • Blue region: set of points for which 1 + 2X1 + 3X2 > 0
  • Purple region: set of points for which 1 + 2X1 + 3X2 < 0

Classification – a maximization problem

maximize M over β0, β1, …, βp
subject to
  Σ_{j=1}^p βj² = 1 AND yi × (β0 + β1xi1 + β2xi2 + ⋯ + βpxip) ≥ M ∀ i = 1, …, n

1. With this normalization, the distance of a point from the hyperplane is given by yi × (β0 + β1xi1 + β2xi2 + ⋯ + βpxip)
2. β0 + β1xi1 + β2xi2 + ⋯ + βpxip > 0 ⇒ y = +1
3. In the figure there are 3 Support Vectors
4. The optimal separating classifier depends only on these vectors
5. M is the distance from a support vector to the hyperplane
6. 2*M is the width of the margin (distance between the margins)
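A sketch of the maximal margin classifier with a linear `SVC` on small, assumed toy data; a very large `C` approximates a hard margin, and `support_vectors_` holds the few points that determine the hyperplane.

```python
# Linear SVM on two separable groups of assumed points.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]   # two separable groups
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6).fit(X, y)            # large C ~ hard margin

print(len(clf.support_vectors_))      # only a few points act as support vectors
print(clf.predict([[0, 0], [5, 5]]))  # points deep inside each region
```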

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Non-separable data Noisy data


• Support vector classifiers are sensitive to individual observations.
• Maximal margin classifier works only if a separating hyperplane exists.
• Sometimes, data is separable, but noisy. Right: an additional observation is added.
• Here, the data is not separable by a linear boundary.
• The hyperplane shifts dramatically. The margin is also very small. Higher margin means a robust and reliable classifier.
• Solution creating a hyperplane that almost separates. • Solution: Allow a few mis-classifications to maximize the margin. Soft Margin Classifier.
• It is called soft margin.
(Scatter plot over X1 and X2 of data not separable by a linear boundary omitted.)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Soft Margin Classifier

• It is worthwhile to misclassify a few training observations in order to do a better job in classifying the majority of observations.
• The margin is kept soft. That is, it can be violated by some observations.
• Sometimes, some observations can be on the wrong side of the margin (points 1 and 8) or on the wrong side of the hyperplane (points 11 and 12).

The Slack Variable

maximize M over β0, β1, …, βp, ε1, …, εn
s.t. Σ_{j=1}^p βj² = 1
     yi(β0 + β1xi1 + β2xi2 + ⋯ + βpxip) ≥ M(1 − εi) ∀ i = 1, …, n
     εi ≥ 0 and Σ εi ≤ C

The slack variable εi
• Allows the observation to be on the wrong side of the margin or the hyperplane
• If εi = 0, the i-th observation is on the right side of the margin
• If εi > 0, the i-th observation is on the wrong side of the margin
• If εi > 1, the i-th observation is on the wrong side of the hyperplane

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

The Tuning Parameter / Effect of C on margin

maximize M over β0, β1, …, βp, ε1, …, εn
s.t. Σ_{j=1}^p βj² = 1
     yi(β0 + β1xi1 + β2xi2 + ⋯ + βpxip) ≥ M(1 − εi) ∀ i = 1, …, n
     εi ≥ 0 and Σ εi ≤ C

The tuning parameter C
• Controls the error/margin tradeoff, that is, the bias-variance trade-off. How?
• Think of C as a budget for the amount that the margin can be violated by the n observations
• Higher the C value, more the budget, wider the margin
• C = 0 means there is no budget for violating the margin
• For C > 0, no more than C training points can be on the wrong side of the hyperplane
• C is chosen by cross-validation
• As C increases, the bias potentially increases and the variance may decrease
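A cautionary sketch: scikit-learn's `C` is a penalty on margin violations, the inverse of the budget C above, so a small sklearn `C` behaves like a large budget (a softer, wider margin resting on more support vectors). The overlapping data generated here is assumed for illustration.

```python
# Number of support vectors under a heavy vs. light violation penalty.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.2, (40, 2)),      # class 0 cloud
               rng.normal(2.0, 1.2, (40, 2))])     # overlapping class 1 cloud
y = np.array([0] * 40 + [1] * 40)

n_sv = {c: int(SVC(kernel="linear", C=c).fit(X, y).n_support_.sum())
        for c in (0.01, 100.0)}
print(n_sv)   # the softer margin (small sklearn C) rests on more support vectors
```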
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Unsupervised Learning
BITS Pilani Pravin Mhaske
Lecture 13 Principal Components Analysis
Pilani Campus

Supervised Vs Unsupervised

• We observe both a set of features X1, X2, . . . , Xp for each object, as well as a response or
outcome variable Y. The goal is then to predict Y using X1, X2, . . . , Xp.

• Here we instead focus on unsupervised learning, where we observe only the features X1, X2,
. . . , Xp. We are not interested in prediction, because we do not have an associated response
variable Y.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
Challenges of Unsupervised Learning Goals of Unsupervised Learning
• Unsupervised learning is more subjective than supervised learning, as there is no simple goal
for the analysis, such as prediction of response.
• The goal is to discover interesting things about the measurements:
• Difficult to assess the results. Nobody knows the true answer. No universally accepted
• Is there an informative way to visualize the data?
mechanism for cross-validation.
• Can we discover subgroups among the variables or among the observations?
• Unsupervised learning is often performed as part of an exploratory data analysis.
• Can we interpret these subgroups?
• But techniques for unsupervised learning are of growing importance in a number of fields:
• subgroups of breast cancer patients grouped by their gene expression measurements
• We discuss two methods:
• groups of shoppers characterized by their browsing and purchase histories
• principal components analysis, a tool used for data visualization or data pre-
• movies grouped by the ratings assigned by movie viewers
processing before supervised techniques are applied, and
• Search engine will display results based on search history of other users with similar search
• clustering, a broad class of methods for discovering unknown subgroups in
patterns
data.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Dimensionality Reduction Dimensionality Reduction


• Modern methods generate very high-dimension data • Original data (2 dimensions)
• Computationally expensive, difficult to interpret, difficult to visualize • Two Principal Components
• Dimensionality needs to be reduced, without losing much information • Data projected on first PC (only one dimension)
• Multiple methods for dimensionality reduction • Even in high dimensions, generally, first 2-3 components explain a high proportion of variance, and
makes it easy to plot
• PCA – one of the most popular method based on vectors, matrices, eigen values and eigenvectors
• Principal Components are the vectors that define a new coordinate system
• First axis (PC) goes in the direction of highest variance in the data, followed by second and so on…
• All axes/PCs are orthogonal to each other…every new PC contains information that is not present in the
earlier PCs.
• Data is projected on principal components
• https://www.youtube.com/watch?v=fNk_zzaMoSs&t=21s
• https://www.youtube.com/watch?v=FD4DeN81ODY

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Principal Components Analysis Why Principal Components?

• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear • Say, we have n observations and p features and we wish to plot scatterplots for all pairs.
combinations of the variables that have maximal variance and are mutually uncorrelated.
• If p = 10, it will create 10C2 = 45 plots! This grows exponentially with p.
• When there is a large set of correlated variables, principal components allow summarizing those by a
• Each feature contains a small fraction of information. And not all dimensions are equally interesting.
smaller set of representative variables. Dimensionality reduction.
• We need a lower dimension representation of data without losing much information.
• We use these new variables in another supervised learning problem.
• PCA does this job. Finds small number of dimensions that are as interesting as possible.
• PCA – a process of computing the principal components and using those.
• Interesting = how much the observation vary in each direction.
• Unsupervised because only X1, X2, X3… Xp are available. No label.
• Each new dimension found is a linear combination of the p features. It is a Principal Component.
• Apart from producing derived variables for use in supervised learning problems, PCA also serves as a
tool for data visualization.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Principal Components PCA example

• The first principal component found is a normalized linear combination of all features. • The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles.
The green solid line indicates the first principal component direction, and the blue dashed line
Z1 = Φ11X1 + Φ21X2 +…+ Φp1Xp
indicates the second principal component direction.
• The second principal component looks like (just to understand the notations)
Z2 = Φ12X1 + Φ22X2 +…+ Φp2Xp
• Normalized means Σ_{j=1}^p Φj1² = 1. Φ11, Φ21, …, Φp1 are called the loadings (or eigenvectors) of the first PC.
• Together, the loadings make up the principal component loading vector,
  Φ1 = (Φ11, Φ21, …, Φp1)ᵀ
• We constrain the sum of the squared loadings to equal 1.
• These Zs are constructs and are not directly observable.
• All these Zs are mutually uncorrelated (their loading vectors are orthogonal to each other).
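A sketch with scikit-learn's `PCA`: the rows of `components_` are the loading vectors Φ, each of unit length and mutually orthogonal. The 3-feature data set with one dominant direction is assumed for illustration.

```python
# PCA loadings: unit length, orthogonal, variance-ordered.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 1))                  # variation along one direction
X = scores @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(100, 3))

pca = PCA().fit(X)
phi1, phi2 = pca.components_[0], pca.components_[1]

print(np.sum(phi1 ** 2))              # sums to 1: the loading vector is normalized
print(phi1 @ phi2)                    # ~0: loading vectors are orthogonal
print(pca.explained_variance_ratio_)  # PC1 explains almost all the variance here
```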
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
PCA example / Geometry of PCA and the scores

(Scatter plot of Variable X1 vs Variable X2 with the PC 1 and PC 2 directions omitted.)

• Z1 = φ11X1 + φ21X2 + . . . + φp1Xp
• The loading vector φ1 with elements φ11, φ21, . . . , φp1 defines a direction in feature space along which the data vary the most.
• If we project the n data points x1, . . . , xn onto this direction, the projected values are the first principal component scores z11, . . . , zn1 themselves.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

PCA – two borderline cases Computation of Principal Components*

• Only one PC
• Two almost equal PCs

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Computation of Principal Components* Further Principal Components*

• The second principal component is the linear combination of X1, . . . , Xp that


has maximal variance among all linear combinations that are uncorrelated
with Z1.
• The second principal component scores z12, z22, . . . , zn2 take the form zi2 =
φ12xi1 + φ22xi2 + . . . + φp2xip, where φ2 is the second principal component
loading vector, with elements φ12, φ22, . . . , φp2.
• It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to
constraining the direction φ2 to be orthogonal (perpendicular) to the direction
φ1. And so on.
• The principal component directions φ1, φ2, φ3, . . . are the ordered sequence
of right singular vectors of the matrix X, and the variances of the components
are 1/n times the squares of the singular values. There are at most min(n − 1,
p) principal components.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Clustering
BITS Pilani
Pilani Campus

• Clustering refers to a very broad set of techniques for finding subgroups, or clusters,
in a data set.
• We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.
• We must define what it means for two or more observations to be similar or different.
• This is often a domain-specific consideration that must be made based on knowledge
of the data being studied.
• Finally, we must be able to interpret the clusters for the exercise to be useful
Unsupervised Learning • Example, Customer/market/industry segmentation, document clustering, fake news,
fraud detection
Lecture 14 Clustering

BITS Pilani, Pilani Campus

Clustering Distance Metrics

• Clustering is an unsupervised learning technique


Euclidean:
• It is the task of grouping together a set of objects in such a way that elements in the
• Straight line distance
same group are more similar to each other than the elements in other groups.
• Scale variant, Sensitive to data dimensionality
• Similarity is a metric that reflects the strength of relationships between two data points • Normalization (scaling) can solve this issue

Manhattan:
• Taxicab or cityblock distance
• Sum of absolute differences of cartesian coordinates

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


K-means Clustering K-means Clustering

❑ A centroid-based clustering method Objective: Minimize Sum of Squared distances

❑ A centroid is a data point at the center of a cluster. • Step 0: Create random k centroids

❑ In centroid-based clustering, clusters are represented by a central vector or a centroid. Repeat iteratively till convergence (no further improvement/movement)

This centroid might not necessarily be a point of the dataset. • Step 1: Assign points to the closest centroid

❑ Centroid-based clustering is an iterative algorithm in which the notion of similarity is • Step 2: Re-calculate and move centroids

derived by how close a data point is to the centroid of the cluster.

❑ Objective is to minimize the sum of squared distances (distance between every point and the corresponding cluster center).
• http://alekseynp.com/viz/k-means.html
• https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

What’s the right k? – Elbow Method Global and Local Minima

✓ The plot represents the variance within the clusters.
✓ It decreases as k increases, but a bend (or “elbow”) can be seen at k = 4.
✓ This bend indicates that additional clusters beyond the fourth have little value.
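The elbow curve can be sketched with scikit-learn, where `inertia_` is exactly the within-cluster sum of squared distances that k-means minimizes; four well-separated blobs are generated here as assumed example data.

```python
# Within-cluster sum of squares for k = 1..7 (the elbow method).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}
print(inertias)   # falls sharply up to k = 4, then flattens: the elbow
```

Plotting `inertias` against k reproduces the elbow plot described above.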

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


The Problem – Local Minima The Solution – k-Means++

Smarter initialization of centroids!


Steps:
1. Randomly select the first centroid from the data points.
2. For each data point, compute its distance from the nearest, previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point as
centroid is directly proportional to its distance from the nearest, previously chosen centroid.
(i.e. the point having maximum distance from the nearest centroid is most likely to be selected
next as a centroid)
4. Repeat steps 2 and 3 until k centroids have been sampled
5. Continue with standard k-means.
Reason? – k-Means is sensitive to the initialization of centroid.
Some centroids may not get any points at all. Some clusters get split into multiple clusters.
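scikit-learn applies the seeding above by default (`init="k-means++"`); a minimal sketch comparing it with purely random centroids, each run once (`n_init=1`), on assumed blob data:

```python
# k-means++ seeding vs. random initial centroids, one run each.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=2)

plus = KMeans(n_clusters=5, init="k-means++", n_init=1, random_state=0).fit(X)
rand = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X)

print(plus.inertia_, rand.inertia_)   # smarter seeding rarely ends up worse
```

For any single seed either run can win; over many seeds k-means++ lands in poor local minima far less often.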

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Exercise / Density-based Clustering

(Scatter plot for the clustering exercise omitted.)

• k-means may, at times, be insufficient to learn the unknown trends in the data.
• It assumes clusters to be spherical, which may not always be true.
• Solution: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


DBSCAN DBSCAN

• Philosophy: If a particular point belongs to a cluster, it should be near to lots of other points in that cluster.
• It groups closely packed data points together based on two parameters: a distance measure (epsilon) and a density threshold (minPoints).
• Noise point: If a newly selected point doesn’t have at least minPoints within epsilon, it remains unclustered.
• Border point: Two clusters may compete for a point that qualifies for both; such a point may end up in either cluster, essentially arbitrarily.
Steps:
1. Select two parameters: Epsilon and minPoints. • https://scikit-learn.org/dev/auto_examples/cluster/plot_dbscan.html
2. Pick a random point.
3. If there are at least minPoints points (including the point itself) within distance epsilon, merge them into a cluster.
4. Keep repeating step 3 iteratively for all the new points included in the cluster. If no more points
found within epsilon, stop.
5. Select a new random unclustered point as a new cluster.
6. Repeat steps 3, 4 and 5.
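The steps above can be sketched with scikit-learn's `DBSCAN` on two half-moons, a non-spherical shape where k-means struggles; `eps` and `min_samples` correspond to epsilon and minPoints, and the data set and parameter values are assumed for illustration.

```python
# DBSCAN on two interleaving half-moons.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 marks noise points
print(n_clusters)                  # the two moons are discovered, no k required
print(int(np.sum(labels == -1)))   # noise points, if any
```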

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

K-Means Vs DBSCAN

                     kMeans                            DBSCAN
Type                 Centroid-based                    Density-based
Cluster Shape        Spherical/Convex                  Irregular
Number of clusters   To be specified                   Decided by algorithm
Dimensions           Can handle high dimension         Not efficient for high dimension
Outliers             Get clustered                     Become noise points
Anomaly detection    Not useful                        Highlights anomalous points
When to use?         K is known and data has           K is unknown and data has
                     spherical shape                   irregular shape

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus

BITS Pilani, Pilani Campus


Linear Models
BITS Pilani
Pilani Campus

• Simple to implement and interpret.


• But has limitations. Real relations may not be always linear. We only make an
assumption of linearity.
• Can the linearity assumption be relaxed?
• Extension to linear model – polynomial regression and step function
• Polynomial regression extends the linear model by adding extra predictors.
• The extra predictors are higher order versions of original predictors X1, X2…
• You can still make a ‘Linear’ Regression model.
Lecture 16 Beyond Linearity

BITS Pilani, Pilani Campus

Non-linear Relationships

Extend the linear model to accommodate non-linear relationships. Perform EDA to find patterns in the relationships.
• Polynomial: fits a higher-degree polynomial function to the data
  y = β0 + β1x + β2x² + ….. + βnxⁿ
• Exponential: y = αe^(βx)
• Logarithmic: y = α + β ln(x)
• Power: y = αx^β
• Generalized additive model (GAM): combines multiple regression models for a complex relationship
  y = f1(x1) + f2(x2) + … + fn(xn)

Polynomial Regression

• The relationship between X and y is modeled as an nth-degree polynomial.
• The ‘line of best fit’ is not a line, rather a polynomial curve.
• Replaces the standard linear regression equation y = β0 + β1X1 + β2X2 + ….. + βnXn by a higher-degree polynomial function y = β0 + β1X + β2X² + ….. + βnXⁿ
• It is still only a linear model, with new predictors created from the original predictors.
• Generally not recommended to go beyond degree 3 or 4; higher degrees make the curve overly flexible and prone to overfitting.
• How to choose the correct degree? Compare candidate degrees on validation data (cross-validation).
• This may also be called feature engineering.
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Polynomial Regression Example

The figure suggests that
mpg = β0 + β1 × horsepower + β2 × horsepower² + ε
may provide a better fit.

Exercise: explore relationships for other predictors.

Years of Experience   Salary (in dollars)
1                     50000
2                     55000
3                     65000
4                     80000
5                     110000
6                     150000
7                     200000
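A sketch fitting the salary table above: `PolynomialFeatures` adds an X² column, so the model stays linear in its (new) predictors, exactly as the slides describe.

```python
# Degree-2 polynomial regression on the experience/salary table.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5], [6], [7]], dtype=float)       # years
y = np.array([50000, 55000, 65000, 80000, 110000, 150000, 200000],
             dtype=float)                                            # salary

X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # [X, X^2]
quad = LinearRegression().fit(X2, y)
lin = LinearRegression().fit(X, y)

print(lin.score(X, y), quad.score(X2, y))   # the quadratic fit has the higher R^2
```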

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


AI vs ML vs DL

Understanding Deep
Neural Networks

Why Deep Learning is a buzz now? Relation with Data

ANN ANN

● A large part of this section will focus on ● Theory Topics


theory behind many of the ideas we will ○ Perceptron Model to Neural Networks
implement with code. ○ Activation Functions
● Let’s do a quick review of how we will ○ Cost Functions
gradually build an understanding of ○ Feed Forward Networks
artificial neural networks. ○ BackPropagation

Perceptron model

● To begin understanding deep learning, we


will build up our model abstractions:
Perceptron Model ○ Single Biological Neuron
○ Perceptron
○ Multi-layer Perceptron Model
○ Deep Learning Neural Network
Perceptron model Perceptron model

● As we learn about more complex models, ● If the whole idea behind deep learning is to
we’ll also introduce concepts, such as: have computers artificially mimic biological
○ Activation Functions natural intelligence, we should probably
○ Gradient Descent build a general understanding of how
○ BackPropagation biological neurons work!

Perceptron model Perceptron model


● Illustration of biological neurons
● Stained Neurons in cerebral cortex

Perceptron model Perceptron model


● Illustration of biological neurons ● Let’s really simplify this!

Perceptron model Perceptron model


● Simplified Biological Neuron Model
● A perceptron was a form of neural network
introduced in 1958 by Frank Rosenblatt.
● Amazingly, even back then he saw huge
Dendrites potential:
Axon
○ "...perceptron may eventually be able
to learn, make decisions, and translate
Nucleus languages."
Perceptron model Perceptron model

● However, in 1969 Marvin Minsky and ● Fortunately for us, we now know the
Seymour Papert published their book amazing power of neural networks, which
Perceptrons. all stem from the simple perceptron model,
● It suggested that there were severe so let’s head back and convert our simple
limitations to what perceptrons could do. biological neuron model into the
● This marked the beginning of what is perceptron model.
known as the AI Winter, with little funding
into AI and Neural Networks in the 1970s.

Perceptron model Perceptron model


● Perceptron Model ● Perceptron Model

Dendrites Inputs Output


Axon

Nucleus

Perceptron model Perceptron model


● Let’s work through a simple example ● Let’s work through a simple example

x1 x1
Inputs Output Inputs f(X) Output

x2 x2

Perceptron model Perceptron model


● Let’s work through a simple example ● If f(X) is just a sum, then y=x1+x2

x1 x1
y y
Inputs f(X) Output Inputs f(X) Output

x2 x2
Perceptron model Perceptron model
● Realistically, we would want to be able to ● Let’s add an adjustable weight we multiply
adjust some parameter in order to “learn” against x
w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2

x2 x2

Perceptron model Perceptron model


● Now y = x1w1 + x2w2 ● We could update the weights to affect y

w1 w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2 w2

x2 x2

Perceptron model Perceptron model


● But what if an x is zero? w won’t change ● Let’s add in a bias term b to the inputs.
anything!
w1 w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2 w2

x2 x2

Perceptron model Perceptron model


● Let’s add in a bias term b to the inputs. ● y = (x1w1 + b) + (x2w2 + b)

*w1 + b *w1 + b
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output

x2 *w2 + b x2 *w2 + b
Perceptron model Perceptron model
● We can expand this to a generalization:
● We’ve been able to model a biological
neuron as a simple perceptron!
x1 Mathematically our generalization was:
*w1 + b y
Inputs f(X) Output
x2 *w2 + b

xn *wn + b

Perceptron model Perceptron model

● Later on we will see how we can expand ● Also we’ll see we can even simplify the bias
this model to have X be a tensor of to be at a layer level instead of a bias per
information (an n-dimensional matrix). input.

Perceptron model Perceptron model


● Notice how at the end we sum a bunch of ● Theoretically for any number of biases,
biases… there exists a bias that is the sum.
x1 x1
*w1 + b y *w1 + b y
Inputs f(X) Output Inputs f(X) Output
x2 *w2 + b x2 *w2 + b

xn *wn + b xn *wn + b

Perceptron model Perceptron model


● Theoretically for any number of biases, ● Theoretically for any number of biases,
there exists a bias that is the sum. there exists a bias that is the sum.
x1 x1
*w1 y *w1 y
Inputs f(X) Output Inputs f(X) Output
x2 *w2 x2 *w2

xn *wn xn *wn
B B = b1 + b2 + … +bn
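The generalized perceptron above (y = x1w1 + x2w2 + … + xnwn + B, with one layer-level bias) can be sketched in a few lines of Python. This is an illustrative sketch, not code from any library; the function name `perceptron` is just a placeholder:

```python
# Minimal sketch of the generalized perceptron: y = x1*w1 + ... + xn*wn + B
def perceptron(x, w, b):
    """Weighted sum of the inputs plus a single layer-level bias B."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b

# Example: two inputs, two weights, one consolidated bias
y = perceptron([1.0, 2.0], [0.5, -0.25], b=0.1)  # 1*0.5 + 2*(-0.25) + 0.1 = 0.1
```

Note how the per-input biases b1…bn collapse into the single term `b`, exactly as the slides argue.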
Perceptron model

● Let’s review what we learned:


○ We understand the very basics of a
biological neuron Neural Networks
○ We saw how we can create a simple
perceptron model replicating the core
concepts behind a neuron.

Neural Networks Neural Networks

● A single perceptron won’t be enough to ● A single perceptron won’t be enough to


learn complicated systems. learn complicated systems.
● Fortunately, we can expand on the idea of ● Fortunately, we can expand on the idea of
a single perceptron, to create a multi-layer a single perceptron, to create a multi-layer
perceptron model. perceptron model.
● We’ll also introduce the idea of activation
functions.

Neural Networks Neural Networks

● To build a network of perceptrons, we can ● The outputs of one perceptron are directly
connect layers of perceptrons, using a fed into as inputs to another perceptron.
multi-layer perceptron model.

Neural Networks Neural Networks

● This allows the network as a whole to learn ● The first layer is the input layer
about interactions and relationships
between features.
Neural Networks Neural Networks

● The last layer is the output layer. ● Layers in between the input and output
● Note: This last layer can be more than one layers are the hidden layers.
neuron

Neural Networks Neural Networks

● Hidden layers are difficult to interpret, due ● Neural Networks become “deep neural
to their high interconnectivity and distance networks" if they contain 2 or more
away from known input or output values. hidden layers.

Neural Networks Neural Networks

● Neural Networks become “deep neural ● Terminology:


networks” if then contain 2 or more ○ Input Layer: First layer that directly
hidden layers. accepts real data values
○ Hidden Layer: Any layer between
input and output layers
○ Output Layer: The final estimate of the
output.

Relation with Data Neural Networks

● What is incredible about the neural


network framework is that it can be used
to approximate any function.
● Zhou Lu and later on Boris Hanin proved
mathematically that Neural Networks can
approximate any continuous
function.
Neural Networks Neural Networks

● For more details on this check out the ● Previously in our simple model we saw that
Wikipedia page for “Universal the perceptron itself contained a very
Approximation Theorem” simple summation function f(x).
● For most use cases however that won’t be
useful, we’ll want to be able to set
constraints to our output values, especially
in classification tasks.

Neural Networks

● In classification tasks, it would be useful


to have all outputs fall between 0 and 1.
● These values can then represent probability Activation Functions
assignments for each class.
● In the next lecture, we’ll explore how to use
activation functions to set boundaries to
output values from the neuron.

Neural Networks Neural Networks

● Recall that inputs x have a weight w and a ● Which means we have


bias term b attached to them in the ○ x*w + b
perceptron model. ● Clearly w implies how much weight or
● Which means we have strength to give the incoming input.
○ x*w + b ● We can think of b as an offset value,
making x*w have to reach a certain
threshold before having an effect.

Neural Networks Neural Networks

● For example if b= -10 ● Next we want to set boundaries for the


○ x*w + b overall output value of:
● Then the effects of x*w won’t really start ○ x*w + b
to overcome the bias until their product ● We can state:
surpasses 10. ○ z = x*w + b
● After that, then the effect is solely based ● And then pass z through some
on the value of w. activation function to limit its value.
● Thus the term “bias”
Neural Networks Perceptron model
● Recall our simple perceptron has an f(X)
● A lot of research has been done into
activation functions and their effectiveness. w1
● Let’s explore some common activation x1
functions. y
Inputs f(X) Output
w2
+b
x2

Perceptron model Neural Networks


● If we had a binary classification problem,
● To avoid confusion, let’s define the total
we would want an output of either 0 or 1.
w1 inputs as a variable z.
x1 ● Where z = wx + b
y ● In this context, we’ll then refer to activation
Inputs f(X) Output
w2 functions as f(z).
+b ● Keep in mind, you will often see these
x2
variables capitalized f(Z) or X to denote a
tensor input consisting of multiple values.

Deep Learning Deep Learning

● The most simple networks rely on a basic ● Regardless of the values, this always
step function that outputs 0 or 1. outputs 0 or 1.
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● This sort of function could be useful for ● However this is a very “strong” function,
classification (0 or 1 class). since small changes aren’t reflected.
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b
Deep Learning Deep Learning

● There is just an immediate cut off that ● It would be nice if we could have a more
splits between 0 and 1. dynamic function, for example the red
1 line! 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● Lucky for us, this is the sigmoid function! ● Changing the activation function used can
be beneficial depending on the task!
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● This still works for classification, and will be ● Let’s discuss a few more activation
more sensitive to small changes. functions that we’ll encounter!
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● Hyperbolic Tangent: tanh(z) ● Hyperbolic Tangent: tanh(z)


● Outputs between -1 and 1 instead of 0 to 1
1 1
Output Output

-1 -1
0 0
Deep Learning Deep Learning

● Rectified Linear Unit (ReLU): This is ● ReLu has been found to have very good
actually a relatively simple function: performance, especially when dealing
max(0,z) with the issue of vanishing gradient.
Output ● We’ll often default to ReLu due to its
overall good performance.
0
z = wx + b

Deep Learning

● For a full list of possible activation functions


check out:
● en.wikipedia.org/wiki/Activation_function
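The four activation functions discussed so far (step, sigmoid, tanh, ReLU) can be written directly from their definitions. A minimal sketch using only the standard library:

```python
import math

def step(z):
    """Basic step function: outputs 0 or 1 with an immediate cutoff at z = 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Smooth curve with outputs in (0, 1); sensitive to small changes in z."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: outputs between -1 and 1 instead of 0 to 1."""
    return math.tanh(z)

def relu(z):
    """Rectified Linear Unit: simply max(0, z)."""
    return max(0.0, z)
```

Each takes the same input z = wx + b; only the shape of the output boundary differs.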

Deep Learning

Multi-Class ● Notice all these activation functions make


sense for a single output, either a
Activation Functions continuous label or trying to predict a
binary classification (either a 0 or 1).
● But what should we do if we have a multi-
class situation?

Deep Learning Deep Learning

● There are 2 main types of multi-class ● Non-Exclusive Classes


situations ■ A data point can have multiple
○ Non-Exclusive Classes classes/categories assigned to it
■ A data point can have multiple ○ Photos can have multiple tags (e.g.
classes/categories assigned to it beach, family, vacation, etc…)
○ Mutually Exclusive Classes
■ Only one class per data point.
Deep Learning Deep Learning

● Mutually Exclusive Classes ● Organizing Multiple Classes


■ A data point can only have one ○ The easiest way to organize multiple
class/category assigned to it classes is to simply have 1 output node
○ Photos can be categorized as being in per class.
grayscale (black and white) or full color
photos. A photo can not be both at the
same time.

Neural Networks Neural Networks

● Previously we thought of the last output ● This single node could output a continuous
layer as a single node. regression value or binary classification (0
or 1).

Neural Networks Multiclass Classification

● Let’s expand this output layer to work for ● Organizing for Multiple Classes
the case of multi-classification.

Hidden Layers

Multiclass Classification Deep Learning

● Organizing for Multiple Classes ● Organizing Multiple Classes


Class One
○ This means we will need to organize
categories for this output layer.
Class Two ○ We can’t just have categories like “red”,
“blue”, “green”, etc...
Hidden Layers
Class N
Deep Learning Deep Learning

● Organizing Multiple Classes ● Mutually Exclusive Classes


○ Instead we use one-hot encoding
Data Point 1 RED
○ Let’s take a look at what this looks like Data Point 2 GREEN

for mutually exclusive classes. Data Point 3 BLUE

... ...

Data Point N RED

Deep Learning Deep Learning

● Mutually Exclusive Classes ● Non-Exclusive Classes


RED GREEN BLUE A B C
Data Point 1 RED Data Point 1 A,B
Data Point 1 1 0 0 Data Point 1 1 1 0
Data Point 2 GREEN Data Point 2 A
Data Point 2 0 1 0 Data Point 2 1 0 0
Data Point 3 BLUE Data Point 3 C,B
Data Point 3 0 0 1 Data Point 3 0 1 1
... ... ... ...
... ... ... ... ... ... ... ...
Data Point N RED Data Point N B
Data Point N 1 0 0 Data Point N 0 1 0
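The two encodings in the tables above can be sketched as small helper functions. These names (`one_hot`, `multi_hot`) are illustrative, not from a particular library:

```python
def one_hot(labels, classes):
    """Mutually exclusive classes: exactly one 1 per row."""
    return [[1 if c == lab else 0 for c in classes] for lab in labels]

def multi_hot(label_sets, classes):
    """Non-exclusive classes: a row can have several 1s."""
    return [[1 if c in labs else 0 for c in classes] for labs in label_sets]

# Mutually exclusive: RED -> [1,0,0], GREEN -> [0,1,0]
one_hot(["RED", "GREEN"], ["RED", "GREEN", "BLUE"])
# Non-exclusive: {A,B} -> [1,1,0], {A} -> [1,0,0]
multi_hot([{"A", "B"}, {"A"}], ["A", "B", "C"])
```

One output node per class then simply targets one column of the encoded matrix.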

Deep Learning Deep Learning

● Now that we have our data correctly ● Non-exclusive


organized, we just need to choose the ○ Sigmoid function
correct classification activation function ■ Each neuron will output a value
that the last output layer should have. between 0 and 1, indicating the
probability of having that class
assigned to it.

Multiclass Classification Multiclass Classification

● Sigmoid Function for Non-Exclusive ● Sigmoid Function for Non-Exclusive


Classes Classes 1
Class One Class One 0.8
0

1
Class Two Class Two 0.2
0

Hidden Layers Hidden Layers


1
Class N Class N 0.3
0
Multiclass Classification Deep Learning

● Sigmoid Function for Non-Exclusive ● Non-exclusive


Classes 1
○ Sigmoid function
Class One 0.8
0
■ Keep in mind this allows each neuron
1
Class Two 0.6 to output independent of the other
0
classes, allowing for a single data
Hidden Layers
Class N
1 point fed into the function to have
0.2
0 multiple classes assigned to it.

Deep Learning Deep Learning

● Mutually Exclusive Classes ● Softmax Function


○ But what do we do when each data
point can only have a single class
assigned to it?
○ We can use the softmax function for
this!

Deep Learning Deep Learning

● Mutually Exclusive Classes


○ The softmax function calculates the
probability distribution of an event
over K different events.
○ This function will calculate the
probabilities of each target class over all
possible target classes.

Deep Learning Deep Learning

● Mutually Exclusive Classes ● Mutually Exclusive Classes


○ The range will be 0 to 1, and the sum of ○ The main thing to keep in mind is that if
all the probabilities will be equal to you use softmax for multi-class
one. problems you get this sort of output:
○ The model returns the probabilities of ■ [Red , Green , Blue]
each class and the target class chosen ■ [ 0.1 , 0.6 , 0.3 ]
will have the highest probability.
Deep Learning

● Mutually Exclusive Classes


○ The probabilities for each class all sum
up to 1. We choose the highest
probability as our assignment.
■ [Red , Green , Blue]
■ [ 0.1 , 0.6 , 0.3 ]
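The softmax function described above (outputs in (0, 1), summing to 1, highest value chosen as the assigned class) can be sketched with the standard library:

```python
import math

def softmax(z):
    """Exponentiate each value and normalize, so outputs lie in (0, 1) and sum to 1."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# The largest input always receives the highest probability,
# so argmax of the softmax output gives the predicted class.
probs = softmax([1.0, 2.0, 0.5])
```

Contrast this with the sigmoid case: softmax couples the outputs (they must sum to 1), which is why it suits mutually exclusive classes.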

Deep Learning

● Review
○ Perceptrons expanded to neural
network model
○ Weights and Biases
○ Activation Functions
○ Time to learn about Cost Functions!
Deep Learning

Cost Functions and ● We now understand that neural


networks take in inputs, multiply them
Gradient Descent by weights, and add biases to them.
● Then this result is passed through an
activation function which at the end of
all the layers leads to some output.

Deep Learning Deep Learning

● This output ŷ is the model’s estimation of ● We need to take the estimated outputs
what it predicts the label to be. of the network and then compare them
● So after the network creates its to the real values of the label.
prediction, how do we evaluate it? ● Keep in mind this is using the training
● And after the evaluation how can we data set during the fitting/training of the
update the network’s weights and model.
biases?

Deep Learning Deep Learning

● The cost function (often referred to as a ● We’ll use the following variables:
loss function) must be an average so it ○ y to represent the true value
can output a single value. ○ a to represent neuron’s prediction
● We can keep track of our loss/cost ● In terms of weights and bias:
during training to monitor network ○ w*x + b = z
performance. ○ Pass z into activation function σ(z) = a

Deep Learning Deep Learning

● One very common cost function is the ● We simply calculate the difference
quadratic cost function: between the real values y(x) against our
predicted values a(x).
Deep Learning Deep Learning

● Note: The notation shown here ● Notice how squaring this does 2 useful
corresponds to vector inputs and things for us, keeps everything positive
outputs, since we will be dealing with a and punishes large errors!
batch of training points and predictions.
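The quadratic cost described above, averaging squared differences between the true values y(x) and the predictions a(x), can be sketched as follows. The 1/(2n) scaling is one common convention; other texts use 1/n:

```python
def quadratic_cost(y_true, y_pred):
    """Quadratic (MSE-style) cost: average of squared errors over the batch.

    Squaring keeps every term positive and punishes large errors.
    """
    n = len(y_true)
    return sum((y - a) ** 2 for y, a in zip(y_true, y_pred)) / (2 * n)

# Example: two training points with small prediction errors
c = quadratic_cost([1.0, 0.0], [0.8, 0.2])  # (0.2^2 + 0.2^2) / 4 = 0.02
```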

Deep Learning Deep Learning

● We can think of the cost function as: ● W is our neural network's weights, B is
our neural network's biases, Sr is the
input of a single training sample, and Er
is the desired output of that training
sample.

Deep Learning Deep Learning

● Notice how that information was all ● This means that if we have a huge
encoded in our simplified notation. network, we can expect C to be quite
● The a(x) holds information about complex, with huge vectors of weights
weights and biases. and biases.

Deep Learning Deep Learning

● Here is a small network with all its ● That is a lot to calculate!


parameters labeled: ● How do we solve this?
Deep Learning Deep Learning

● In a real case, this means we have some ● For simplicity, let’s imagine we only had
cost function C dependent on lots of one weight in our cost function w.
weights! ● We want to minimize our loss/cost
○ C(w1,w2,w3,....wn) (overall error).
● How do we figure out which weights ● Which means we need to figure out
lead us to the lowest cost? what value of w results in the minimum
of C(w)

Deep Learning Deep Learning

● Our “simple” function C(w) ● What value of w minimizes our cost?


C(w) C(w)

w w

Deep Learning Deep Learning

● What value of w minimizes our cost? ● What value of w minimizes our cost?
C(w) C(w)

w w

wmin

Deep Learning Deep Learning

● Students of calculus know we could take ● But recall our real cost function will be
a derivative and solve for 0. C(w)
very complex! C(w)

w w

wmin
Deep Learning Deep Learning

● And it will be n-dimensional! ● And it will be n-dimensional since our


C(w)
networks will have 1000s of weights C(w)

w w

Deep Learning

● We can use gradient descent to solve


this problem. C(w)

w
Deep Learning

● Let’s go back to our simplified version to


see how this works. C(w)

wmin

Deep Learning Deep Learning

● We can calculate the slope at a point ● We can calculate the slope at a point
C(w) C(w)

w w

wmin wmin
Deep Learning Deep Learning

● Then we move in the downward ● Then we move in the downward


direction of the slope. C(w)
direction of the slope. C(w)

w w

wmin wmin

Deep Learning Deep Learning

● Then we move in the downward ● Until we converge to zero, indicating a


direction of the slope. C(w)
minimum. C(w)

w w

wmin wmin

Deep Learning Deep Learning

● We could have changed our step size to ● Our steps:


find the next point! C(w) C(w)

w w

wmin wmin

Deep Learning Deep Learning

● Smaller steps sizes take longer to find ● Larger steps are faster, but we risk
the minimum. C(w)
overshooting the minimum! C(w)

w w

wmin wmin
Deep Learning Deep Learning

● This step size is known as the learning ● The learning rate we showed in our
rate. C(w) illustrations was constant (each step size
was equal)
● But we can be clever and adapt our step
size as we go along.
w

wmin

Deep Learning Deep Learning

● We could start with larger steps, then go ● In 2015, Kingma and Ba published their
smaller as we realize the slope gets paper: “Adam: A Method for Stochastic
closer to zero. Optimization“.
● This is known as adaptive gradient ● Adam is a much more efficient way of
descent. searching for these minimums, so you
will see us use it for our code!
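The basic descent loop from the illustrations, computing the slope at a point and stepping downhill at a fixed learning rate, can be sketched for the one-weight case. This is the plain (non-adaptive) version, not Adam:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step in the downward direction of the slope of C(w).

    grad: function returning dC/dw at a point
    lr:   the learning rate (the step size from the illustrations)
    """
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move against the slope
    return w

# Example: C(w) = (w - 3)^2 has its minimum at w = 3, with C'(w) = 2*(w - 3)
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

Shrinking `lr` makes convergence slower but safer; enlarging it risks overshooting the minimum, exactly as the slides describe.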

Deep Learning Deep Learning

● Adam versus other gradient descent ● Realistically we’re calculating this


algorithms: descent in an n-dimensional space for all
our weights.

Deep Learning Deep Learning

● When dealing with these N-dimensional ● For classification problems, we often use
vectors (tensors), the notation changes the cross entropy loss function.
from derivative to gradient. ● The assumption is that your model
● This means we calculate ∇C(w1,w2,...wn) predicts a probability distribution p(y=i)
for each class i=1,2,…,C.
Deep Learning Deep Learning

● For a binary classification this results in: ● Review:


○ Cost Functions
○ Gradient Descent
● For M number of classes > 2 ○ Adam Optimizer
○ Quadratic Cost and Cross-Entropy
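The cross-entropy formulas referenced above (the slide's equations are not reproduced in this transcript) follow the standard definitions: for binary classification, -(y log a + (1-y) log(1-a)); for M > 2 classes, the negative log-probability of the true class summed over the one-hot target. A sketch:

```python
import math

def binary_cross_entropy(y, a):
    """Binary cross entropy for a single example: y in {0, 1}, prediction a in (0, 1)."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def categorical_cross_entropy(y_onehot, p):
    """Cross entropy for M > 2 classes: -sum_i y_i * log(p_i) over a one-hot target."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y_onehot, p))

# A confident, correct prediction gives a small loss
loss = categorical_cross_entropy([0, 1, 0], [0.1, 0.6, 0.3])  # = -log(0.6)
```

In practice, predictions are clipped away from exactly 0 or 1 to avoid log(0); libraries handle this internally.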

Deep Learning Deep Learning

● So far we understand how networks can ● The next thing we need to learn about
take in input, affect that input with theory is:
weights, biases, and activation functions ○ Once we get our cost/loss value,
to produce an estimated output. how do we actually go back and
● Then we learned how to evaluate that adjust our weights and biases?
output. ● This is backpropagation, and it is
what we are going to cover next!
