PA DL Consolidated

BITS Pilani, Pilani Campus

ZG 512 Predictive Analytics
Pravin Mhaske

M1 Predictive Analytics
Lecture 1 Introduction, Model Assessment

Model Assessment

• The model is developed on the training data.
• For Regression, the most commonly used measure is the Mean Squared Error:
  MSE = (1/n) Σ (yi − ŷi)²
• For Classification, the most commonly used measure is the Error Rate:
  Error Rate = (1/n) Σ I(yi ≠ ŷi), where I is the indicator function
• What about the accuracy of the prediction on unseen test data?

Training and Test Errors

• The statistical method estimates f (y = f(X)) by minimizing MSETr
• A procedure that minimizes MSETr will tend to "overfit" the data
• The training error shows the performance of the model on the training data
• The usefulness of the model depends on the performance on unseen test data
• We need a model that minimizes the test error: we want a method that gives the lowest MSETe, as opposed to the lowest MSETr
• There are ways of estimating MSETe: Test Data, Cross-Validation
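The MSETr / MSETe distinction is easy to see in a few lines of NumPy. This is a sketch on synthetic data (the quadratic "true" f and all names are made up for illustration): training MSE can only fall as the polynomial degree grows, while a held-out test set tells a different story.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): the true f is quadratic, noise ~ N(0, 1)
x_tr = rng.uniform(-3, 3, 30);  y_tr = x_tr**2 + rng.normal(0, 1, 30)
x_te = rng.uniform(-3, 3, 200); y_te = x_te**2 + rng.normal(0, 1, 200)

def mse(y, y_hat):
    # Average of squared prediction errors
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 2, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)                   # fit on training data only
    results[degree] = (mse(y_tr, np.polyval(coefs, x_tr)),   # MSE_Tr
                       mse(y_te, np.polyval(coefs, x_te)))   # MSE_Te
```

Degree 9 never loses to degree 2 on MSETr (the models are nested), but its test MSE is typically worse: minimizing MSETr is not the goal.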
Overfitting and Underfitting

Data is fitted to a linear function and a polynomial function.
• The polynomial function is a perfect fit – overfitting, since it is adapting to the training set
• The linear function is more rigid but may generalize better
• If the two functions were used to extrapolate beyond the fitted data, the linear model
  • May generalize better
  • May make better predictions

Overfitting
• Occurs when the model captures the noise of the training data – fits the data too well
• A method is said to be overfitting the data when it generates a small MSETr and a large MSETe
• It is often a result of an excessively complicated model
• Can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data
• The model may have low bias but high variance

Underfitting
• Occurs when the model cannot capture the underlying trend of the training data
• Occurs when the model does not fit the data well enough
• It is often a result of an excessively simple model
• The model may have low variance but high bias

Both overfitting and underfitting lead to poor predictions on new data sets.

Bias Vs Variance

The goal of any supervised statistical learning algorithm is to achieve
• low bias and low variance
• thereby achieving good prediction performance

In reality, we cannot calculate the real bias and variance error terms because we do not know the actual underlying target function. Nevertheless, as a framework, bias and variance provide the tools to understand the behaviour of machine learning algorithms in the pursuit of predictive performance.

The algorithm learns a model from training data. The prediction error can be broken down into three parts: Bias Error, Variance Error & Irreducible Error (noise).
• The irreducible error cannot be reduced: it is the error introduced by modelling a real-life scenario
• Bias error arises from the simplifying assumptions made by a model to make the target function easier to learn
• Variance is the amount that the predictions will vary with different training data sets

We want a good predictor – low bias and low variance.

Bias Vs Variance

• Low Bias, High Variance: overly flexible model
• High Bias, Low Variance: less flexible model

Training and Test Errors

Underfitting
1. Rigidity or under-complexity
2. High Bias, Low Variance

Overfitting
1. Flexibility or over-complexity
2. Low Bias, High Variance

Questions
1. When do we know that we are underfitting?
2. When are we overfitting?
3. What is the optimal flexibility?
4. Does obtaining more data help in a case of underfitting? Overfitting?

Training and Test Errors

The algorithm learns a model from training data. The prediction error can be broken down into three parts: Bias Error, Variance Error & Irreducible Error.
• Bias error arises from the simplifying assumptions made by a model to make the target function easier to learn
• Variance is the amount that the predictions will vary with different training data sets
• The irreducible error cannot be reduced: it is the error introduced by modelling a real-life scenario

We want a good predictor – low bias and low variance. At any iteration, we would like to know whether we have High Bias or High Variance.

High Bias:
• Large MSETr & Large MSETe
• MSETr ~ MSETe
• Bias² = (E[f̂(x0)] − f(x0))²: the inability of our model to capture the true relationship (no matter how much data we throw at it!)

High Variance:
• Small MSETr & Large MSETe (MSETr << MSETe)
• Variance = E[(f̂(x0) − E[f̂(x0)])²]: measures how much the predictions change across different training sets

Bias-Variance Trade-off

[Figure: Mean Squared Error (0.0 to 2.5) vs Flexibility (2, 5, 10, 20) – training error decreases with flexibility while test error is U-shaped]
A rigid and a flexible model

f is the true model. Given a data set D = {(x, y)}:
• f̂1(x) = ȳ (predict the mean: a rigid model)
• f̂2(x) = β0 + β1x1 + β2x2 + ⋯ + βkxk (a flexible model)
• f̂1 has higher bias & lower variance
• f̂2 has lower bias & higher variance

We want a good predictor – low bias and low variance. But:
As we increase the complexity of our assumptions,
• the Bias decreases and the Variance increases
Conversely, as we decrease the complexity of our assumptions,
• the Bias increases and the Variance decreases
Thus, there is a trade-off at play between these two concerns (Bias & Variance).

Bias – Variance Equation

Let
• f̂ be our estimation of the true f
• x0 be any observation drawn from the population

Error at x0 = E[(y0 − f̂(x0))²]
            = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)
            = Bias² + Variance + Irreducible Error
            = Reducible Error + Irreducible Error

Note that E[ε] = 0.


The Bias

E[(y0 − f̂(x0))²] = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)

• The first term is the squared bias
• It is the difference between the averages of the estimate and the true values
• Bias refers to the error that is introduced by approximating a real-life problem
• A model with high bias is not capturing the underlying behaviour of the true functional form well
• For example, if a simple linear regression is used to model a sine curve, no matter how low the MSETr is, it will never capture the non-linearity inherent in a sine curve

The Variance

E[(y0 − f̂(x0))²] = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)

• Note that different training sets will generate different f̂
• Ideally the estimates should not vary too much across training sets
• The second term is the variance of the estimate across many training sets
• It determines how much the model estimate deviates across the training data
• In general, more flexible methods have higher variance

The Irreducible Error

E[(y0 − f̂(x0))²] = (E[f̂(x0)] − f(x0))² + E[(f̂(x0) − E[f̂(x0)])²] + Var(ε)

• The last term is known as the irreducible error
• It is the minimum lower bound for the MSETe
• This variance arises because
  • ε may contain unknown variables that are useful in predicting Y
  • Y may contain unmeasurable variations

ZG 512 Predictive Analytics
Pravin Mhaske
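The decomposition above can be checked numerically. A small Monte Carlo sketch (synthetic sine data; every name and constant is an assumption for illustration): fit the same model class on many independent training sets, record f̂(x0), and compare Bias² + Variance + Var(ε) against the directly simulated E[(y0 − f̂(x0))²].

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                      # sd of the irreducible noise ε
x0 = 1.0                         # the query point

def f(x):                        # the (normally unknown) true function
    return np.sin(x)

# Fit a straight line on many independent training sets; record f̂(x0)
preds = []
for _ in range(5000):
    x = rng.uniform(-2, 2, 20)
    y = f(x) + rng.normal(0, sigma, 20)
    preds.append(np.polyval(np.polyfit(x, y, 1), x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2     # (E[f̂(x0)] − f(x0))²
variance = preds.var()                  # E[(f̂(x0) − E[f̂(x0)])²]

# Fresh test responses at x0, independent of the fits
y0 = f(x0) + rng.normal(0, sigma, 5000)
expected_err = float(np.mean((y0 - preds) ** 2))   # E[(y0 − f̂(x0))²]
```

Up to Monte Carlo error, expected_err ≈ bias2 + variance + sigma², mirroring the equation; a straight line fitted to a sine carries visible bias, which a more flexible model would trade for variance.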


What is Machine Learning?

According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and author of the book Machine Learning (McGraw-Hill),

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with the experience E."

We now have a set of objects to define machine learning: Task (T), Experience (E), and Performance (P). With a computer running a set of tasks, the experience should lead to performance increases (to satisfy the definition). Many data mining tasks are executed successfully with the help of machine learning.

M3 Linear Regression
Lecture 2 Linear Regression 1

Types of Machine Learning

Regression

Predict the value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics and in the neural network field.

Examples:
◦ Predicting sales amounts of a new product based on advertising expenditure
◦ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
◦ Predicting the price of a house based on its attributes

Variables

The Advertising data set has 4 variables and 6 observations. The variable names are "TV", "Radio", "Paper" and "Sales".

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75     7.2

Sales is the Dependent Variable
• Also known as the Response or Target
• Generically referred to as Y

TV, Radio and Paper are the independent variables
• Also known as features, inputs, or predictors
• Generically referred to as X (or X1, X2, X3)

Matrix X and Vector y

• p = 3 (the number of independent variables); n = 6 (the number of observations)
• X represents the input data set; X is a 6 × 3 matrix
• y represents the output variable; y is a 6 × 1 vector

Matrix X and Vector y (contd.)

• X is a 6 × 3 matrix (X6×3) & y is a 6 × 1 vector (y6×1)
• xi represents the ith observation: the vector (xi1, xi2, …, xip)
• xj represents the jth variable: the vector (x1j, x2j, …, xnj)
• yi represents the ith observation of the output variable; y is the vector (y1, y2, …, yn)

A Linear Model

The linear model is an important example of a parametric model:

f(X) = β0 + β1X1 + β2X2 + … + βpXp

• A linear model is specified in terms of p + 1 parameters: β0, β1, …, βp
• We estimate the parameters by fitting the model to training data
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X)

A Linear Model (Parametric)

f(X) = β0 + β1X1 + β2X2 + … + βpXp

• The linear model has (p + 1) parameters: β0, β1, …, βp
• We estimate the parameters by fitting the model to training data
• Simple Linear Regression: only one x variable
• Multiple Linear Regression: many x variables

Is there an ideal f(X)?

We want to predict Y for a given value of x.
• What is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4.
• A good value is f(4) = E(Y | X = 4), the expected value of Y given X = 4.
• This ideal f(x) = E(Y | X = x) is called the regression function.

A Linear Model (contd.)

Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε

• β's: unknown constants, known as coefficients or parameters
• βj: the average effect on Y of a unit increase in Xj, holding all other predictors fixed
• ε is the error term – it captures measurement errors and missing variables
• ε is a random variable independent of X, with E(ε) = 0
• In the advertising example, the model becomes
  sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε
• f is said to represent the systematic information that X provides about Y

Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (no Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables

Multicollinearity and VIF

• X1 and X2 are significant when included separately, but together the effect of both variables shrinks. Multicollinearity exists when there is correlation between multiple independent variables in a multiple regression model; this can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical significance of the independent variables.
• Test for Multicollinearity: Variance Inflation Factor (VIF)
  • VIF equal to 1: variables are not correlated
  • VIF between 1 and 5: variables are moderately correlated
  • VIF greater than 5: variables are highly correlated
• Solutions to multicollinearity
  1. Drop unnecessary variables
  2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression

Residual Analysis

• The red line should be approximately horizontal at zero.
• There is no pattern in the first residual plot. The presence of a pattern may indicate a problem with some aspect of the linear model (case 2).
• E(ε) = 0
• Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
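VIF can be computed directly from its definition, VIFj = 1 / (1 − Rj²), where Rj² comes from regressing Xj on the remaining predictors. A self-contained sketch on synthetic data (all names are illustrative; statsmodels' variance_inflation_factor does the same job):

```python
import numpy as np

def vif(X):
    # VIF for each column of X (n × p): regress that column on the rest,
    # then return 1 / (1 − R²)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r2 = 1 - (y - Z @ beta).var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent predictor
v = vif(np.column_stack([x1, x2, x3]))      # v[0], v[1] large; v[2] near 1
```

Here x1 and x2 show VIF well above 5 (highly correlated), while the independent x3 sits near 1.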

Homoscedasticity Vs Heteroscedasticity

• Are the residuals spread equally along the ranges of predictors?
• The plot should have a horizontal line with equally spread points.
• In the second plot, this is not the case: the variability (variance) of the residuals increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (heteroscedasticity).

Types of Regression Models

Simple Regression
(Education) x → y (Income)

Multiple Regression
(Education) x1, (Soft Skills) x2, (Experience) x3, (Age) x4 → y (Income)

Direct Solution Method

Least Squares Method (Ordinary Least Squares or OLS)

• Slope for the Estimated Regression Equation
  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

• y-Intercept for the Estimated Regression Equation
  b0 = ȳ − b1x̄

where:
  xi = value of the independent variable for the ith observation
  yi = value of the dependent variable for the ith observation
  x̄ = mean value of the independent variable
  ȳ = mean value of the dependent variable

Exercise

Kumar's Electronics periodically has a special week-long sale. As part of the advertising campaign Kumar runs one or more TV commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are shown below.

# of TV Ads (x)   # of Cars Sold (y)
1                 14
3                 24
2                 18
1                 17
3                 27

Solution

x    y    xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1    14   −1       −6       6                  1
3    24   1        4        4                  1
2    18   0        −2       0                  0
1    17   −1       −3       3                  1
3    27   1        7        7                  1

Sum: Σx = 10, Σy = 100; Σ(xi − x̄) = 0, Σ(yi − ȳ) = 0, Σ(xi − x̄)(yi − ȳ) = 20, Σ(xi − x̄)² = 4
Mean: x̄ = 2, ȳ = 20

• Slope: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5
• y-Intercept: b0 = ȳ − b1x̄ = 20 − 10 = 10
• Estimated Regression Equation: ŷ = b0 + b1x = 10 + 5x
• Predict Sales if Ads run = 5? 15?

Evaluation of Regression Model
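The hand computation above can be reproduced in a few lines of plain Python (same data, same formulas):

```python
x = [1, 3, 2, 1, 3]          # number of TV ads
y = [14, 24, 18, 17, 27]     # number of cars sold

x_bar = sum(x) / len(x)      # 2.0
y_bar = sum(y) / len(y)      # 20.0

# b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)      # 20/4 = 5.0
b0 = y_bar - b1 * x_bar                        # 20 − 10 = 10.0

def predict(ads):
    return b0 + b1 * ads

# predict(5) -> 35.0; predict(15) -> 85.0
```

Note that 15 ads is far outside the observed range (1 to 3), so that second prediction is a long extrapolation and should be treated with caution.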
Goodness of Fit

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Coefficient of Determination (R-squared)
• Proportion of variance in the dependent variable that can be explained by the independent variable(s)

Exercise (already covered)
• Simple Linear Regression using Excel
• Multiple Linear Regression using Excel
• Multiple Linear Regression using statsmodels
• Multiple Linear Regression using scikit-learn

Qualitative Predictors

M3 Linear Regression
Lecture 2 Extensions to the Linear Model

One Hot Encoding (Dummy Variables)

1. Find the count of distinct values, say n, in the column.
2. Create n new columns, each named after one of the distinct values.
3. Encode 0 or 1 under those columns depending upon the value in that observation.
4. Avoid when the count of distinct values is high.

Label Encoding

1. Find the count of distinct values, say n, in the column.
2. Find the order.
3. Encode values 0, 1, 2, … based on the order.
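Both encodings can be sketched in plain Python (the 'city' column and its values are hypothetical; in practice pandas.get_dummies or scikit-learn's OneHotEncoder / LabelEncoder would be used). Sorting alphabetically is an assumption here; for ordinal data the order should come from the domain:

```python
cities = ["Pune", "Delhi", "Pune", "Chennai"]

# Step 1-2: distinct values define the new columns / the label order
levels = sorted(set(cities))                 # ['Chennai', 'Delhi', 'Pune']

# One-hot: one 0/1 column per distinct value
one_hot = [[int(c == level) for level in levels] for c in cities]

# Label encoding: map each value to its position in the chosen order
codes = {level: i for i, level in enumerate(levels)}
labels = [codes[c] for c in cities]          # [2, 1, 2, 0]
```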
The Advertising Model

In the first model, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media. The estimated model was

Ŷ = 2.9389 + 0.0458*TV + 0.1885*Radio − 0.0010*Paper

From this we infer that the average effect on sales of a one-unit increase in TV is always 0.0458, regardless of the amount spent on radio.

Call:
lm(formula = Sales ~ TV + Radio, data = Ad)

Residuals:
    Min      1Q  Median      3Q     Max
-8.7977 -0.8752  0.2422  1.1708  2.8328

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.92110    0.29449   9.919   <2e-16 ***
TV           0.04575    0.00139  32.909   <2e-16 ***
Radio        0.18799    0.00804  23.382   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16

The Synergy Effect

• Suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
• In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
• In marketing, this is known as a synergy effect; in statistics it is referred to as an interaction effect.

The Interaction Model

The model takes the form
sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
      = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε

The Interaction Model (fit)

Call: lm(formula = Sales ~ TV * Radio, data = Ad)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3366 -0.4028  0.1831  0.5948  1.5246

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.7502  2.479e-01  27.233   <2e-16 ***
TV            0.0191  1.504e-03  12.699   <2e-16 ***
Radio         0.0289  8.905e-03   3.241   0.0014 **
TV:Radio      0.0011  5.242e-05  20.727   <2e-16 ***

Residual standard error: 0.9435 on 196 degrees of freedom
Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16

Interpretation I

The results in this table suggest that interactions are important.
• The p-value for the interaction term TV×Radio is ~0 < α, indicating strong evidence for HA: βTV×Radio ≠ 0.
• The R² for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.
• This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
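A sketch of fitting the interaction model with ordinary least squares. The Advertising data is not reproduced here, so the sketch simulates data whose true coefficients echo the slide's estimates (an assumption for illustration); the point is that the interaction term is just one more column in the design matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
tv = rng.uniform(0, 300, 200)
radio = rng.uniform(0, 50, 200)
sales = (6.75 + 0.019 * tv + 0.029 * radio + 0.0011 * tv * radio
         + rng.normal(0, 1, 200))                 # coefficients echo the slide

# Design matrix: intercept, TV, radio, and the TV×radio interaction column
X = np.column_stack([np.ones_like(tv), tv, radio, tv * radio])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)  # OLS estimates

def tv_effect(r):
    # Effect of one extra unit of TV depends on radio: β1 + β3·radio
    return beta[1] + beta[3] * r
```

With this setup, tv_effect(50) exceeds tv_effect(0): the slope for TV grows with radio spend, which is exactly the synergy the interaction term captures.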

Interpretation II

Predicted Sales = 6.7502 + 0.0191 TV + 0.0289 Radio + 0.0011 TV × Radio

• An increase in TV ads of $1000 leads to incremental sales of 19.1 + 1.1 × Radio
• An increase in radio ads of $1000 leads to incremental sales of 28.9 + 1.1 × TV

Hierarchy Principle

• Sometimes an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
• The hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
• The rationale for this principle is that interactions are hard to interpret in a model without main effects – their meaning is changed.
• If the main effect terms are dropped, the interaction terms will include main effects.

Curse of Dimensionality

• With high-dimensional data, the efficiency of algorithms deteriorates
• Data becomes sparse; more features require more training data
• Leads to computational complexity, high training time, overfitting, and spurious correlations
• Algorithms like kNN are highly susceptible to the curse of dimensionality
• Dimensionality reduction techniques help avoid high-dimensional data:
  • Feature selection – keep the most relevant features
  • Feature extraction – transformation from high dimension to low dimension, e.g. using Principal Component Analysis

ZG 512 Predictive Analytics
Pravin Mhaske

Non-linear Relationships

M3 Linear Regression
Lecture 3 Linear Regression - Diagnostics

Extend the linear model to accommodate non-linear relationships. Perform EDA to find patterns in the relationships.

• Polynomial: fits a higher-degree polynomial function to the data
  y = β0 + β1x + β2x² + … + βnxⁿ
• Exponential: y = αe^(βx)
• Logarithmic: y = α + β ln(x)
• Power: y = αx^β
• Generalized additive model (GAM): combines multiple regression models for a complex relationship
  y = f1(x1) + f2(x2) + … + fn(xn)

Polynomial Regression

The figure suggests that

mpg = β0 + β1 × horsepower + β2 × horsepower² + ε

may provide a better fit.

Exercise: explore relationships for other predictors.

Model: Non-linear Regression

Call: lm(formula = mpg ~ horsepower + I(horsepower^2), data = auto)

Residuals:
     Min       1Q   Median       3Q      Max
-14.7135  -2.5943  -0.0859   2.2868  15.8961

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16

Linear Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ² for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (no Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables
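The same kind of fit can be sketched with NumPy. The Auto data is not reproduced here, so this simulates horsepower/mpg data whose true coefficients echo the slide's estimates (an assumption for illustration). Note the model is still linear in the parameters; horsepower² is just another regressor:

```python
import numpy as np

rng = np.random.default_rng(4)
hp = rng.uniform(50, 230, 392)
mpg = 57 - 0.47 * hp + 0.0012 * hp**2 + rng.normal(0, 4.4, 392)

# Degree-2 least squares fit; np.polyfit returns the highest degree first
quadratic, linear, intercept = np.polyfit(hp, mpg, 2)
```

The signs match the slide's output: a negative linear term and a small positive quadratic term, giving the U-shaped mpg curve.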
Model Adequacy (Homoscedasticity)

Ideally, there should not be any pattern or trend in the residuals.
• The red line should be approximately horizontal at zero.
• The presence of a pattern may indicate a problem with some aspect of the linear model.

In the plot, there is no pattern in the residual plot, so we may assume
• There is a linear relationship between the selected predictors and the outcome variable
• E(ε) = 0
• There may still be an issue with the Homoscedasticity assumption

Case 1: The red line is fairly horizontal about 0
Case 2: The red line is not horizontal, indicating an inadequate model

Examples

Sales ~
1. TV + Radio + Paper
2. TV*Radio
3. TV*Radio + I(TV^2)
4. TV*Radio + I(TV^2) + I(Radio^2)

Is ε Normal?

[Figure: Q-Q plots of residuals – one approximately normal, one not normal]


Iterative Method - Gradient Descent

• What is Gradient?
• What is Descent?

Gradient Descent with one parameter
• w = parameter or coefficient
• J(w) = cost function corresponding to that w
• Objective: to compute (approximate) the w that minimizes J(w)

Gradient Descent Process

• The starting point (labelled 1) is a guess
• The slope (dJ/dw) gives the direction to move
• Take small steps iteratively towards the local minimum
• What is the learning rate?

Positive and Negative Gradient

Learning Rate

• Small steps: the process will eventually reach the local minimum, but it may be very slow
• Large steps: the process may not converge – it just bounces back and forth

Types of Gradient Descent

• Batch gradient descent calculates the derivative from all training data before calculating an update
• Minibatch gradient descent calculates the derivative on mini groups (batches) of training data before calculating an update
• Stochastic gradient descent calculates the derivative from each training data instance and applies the update immediately

Batch Gradient Descent

We need to minimize J(β) = (1/2n) Σi=1..n (yi − xiᵀβ)²

Differentiating J(β) with respect to β gives the gradient ∇J = −(1/n) Xᵀ(y − Xβ)

The following iterative process solves ∇J = 0. Repeat until convergence:

{ βj = βj − α ∂J/∂βj,  j = 0, …, p } – update the β's simultaneously

Batch Gradient Descent for SLR

y = f(x) + ε = β0 + β1x + ε

We need to minimize J(β) = (1/2n) Σi=1..n (yi − (β0 + β1xi))²

This is given by

dJ/dβ0 = (1/n) Σi=1..n (β0 + β1xi − yi) = 0  and
dJ/dβ1 = (1/n) Σi=1..n (β0 + β1xi − yi) xi = 0

The Iterative Process

y = f(x) + ε = β0 + β1x + ε. We need to minimize J(β) = (1/2n) Σi=1..n (yi − (β0 + β1xi))²

Repeat until convergence:
{
  β0 = β0 − (α/n) Σi=1..n (β0 + β1xi − yi)
  β1 = β1 − (α/n) Σi=1..n (β0 + β1xi − yi) xi
}

Problem

Training Set:
X    Y
-1   -1
1    3

Learning Rate (α) = 0.08
After the 4th iteration of the gradient method, β0 = 0.64 and β1 = 1.64.
Compute the estimates at the next iteration.
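The update rule above, and the worked example that follows, can be checked with a short implementation (plain Python, matching the slide's data {(-1, -1), (1, 3)} and α = 0.08):

```python
def gd_step(b0, b1, xs, ys, alpha):
    # One batch gradient descent update for y = b0 + b1*x (simultaneous update)
    n = len(xs)
    g0 = sum((b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
    g1 = sum(((b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
    return b0 - alpha * g0, b1 - alpha * g1

xs, ys = [-1, 1], [-1, 3]

# State after the 4th iteration, as given on the slide
b0_5, b1_5 = gd_step(0.64, 1.64, xs, ys, 0.08)   # -> (0.6688, 1.6688)

# Iterating further converges to the OLS solution b0 = 1, b1 = 2
b0, b1 = b0_5, b1_5
for _ in range(500):
    b0, b1 = gd_step(b0, b1, xs, ys, 0.08)
```

One step reproduces the slide's answer (0.6688, 1.6688), and continued iteration approaches the direct OLS solution for these two points.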

Solution

Training Set: (−1, −1) and (1, 3); α = 0.08; after 4 iterations, β0 = 0.64 and β1 = 1.64

β0 = β0 − (α/2) [(β0 + β1x1 − y1) + (β0 + β1x2 − y2)]
β1 = β1 − (α/2) [(β0 + β1x1 − y1)x1 + (β0 + β1x2 − y2)x2]

Both estimates need to be updated simultaneously.

β0 = 0.64 − 0.04[(0.64 + 1.64 × (−1) − (−1)) + (0.64 + 1.64 × 1 − 3)]
   = 0.64 − 0.04[0 + (−0.72)] = 0.64 + 0.0288 = 0.6688

β1 = 1.64 − 0.04[(0.64 + 1.64 × (−1) − (−1)) × (−1) + (0.64 + 1.64 × 1 − 3) × 1]
   = 1.64 − 0.04[0 × (−1) + (−0.72) × 1] = 1.64 + 0.0288 = 1.6688

Gradient Descent with 2 parameters

[Figure: the cost surface J(β0, β1) over the two parameters]

Gradient Descent (housing example)

Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

[Figures: a sequence of slides showing the cost surface and contour plots as gradient descent iterates toward the minimum. Source credit: slides by Andrew Ng]

ZG 512 Predictive Analytics
Pravin Mhaske

Linear Model for 3D

Lecture 4 Non-parametric Methods

In a three-dimensional setting, with two predictors and one response,
• The least squares regression line becomes a plane
• The 'plane (of best fit)' is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane

Linear Model for 2 Independent Variables

[Figures: the real function, a linear (parametric) model, a moderately complex non-parametric fit, and an overfit model]

Non-parametric Methods

• The aim of a regression analysis is to develop a reasonable approximation to the unknown response function f, where Y = f(X) + ε
• Parametric methods make an assumption about the shape of f. For example, OLS assumes f to be linear and estimates a small set of parameters: β0, β1, β2, …
• Unlike the parametric approach, where the function f is fully described by a finite set of parameters, nonparametric modelling accommodates a very flexible form of the regression curve
• They do not make any explicit assumption about the functional form of f
• Instead they seek an estimate of f that is close to the data points without being too flexible
• So they can fit many possible shapes of f (Pic 1 on previous slide)
• They provide a versatile method of exploring a general relationship between variables
• Disadvantage: they need much more data to accurately fit the function f (or model)

k Nearest Neighbors

• K = 3, with six blue observations and six orange observations; x is a test observation
• The 3 closest points are identified, and therefore x belongs to the blue class
• The KNN decision boundary for this example is shown in black

Flexibility

• The training error rate (blue) and test error rate (orange) vs flexibility on simulated data
• Flexibility is measured by 1/k
• High k: stable model; low k: flexible model
• Bias-Variance Tradeoff?

Distance Metrics

• Euclidean:
  • Straight-line distance
  • Scale variant, sensitive to data dimensionality
  • Normalization (scaling) can solve this issue
• Manhattan:
  • Taxicab or city-block distance
  • Sum of absolute differences of cartesian coordinates
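The two metrics in code (plain Python; the point values are illustrative):

```python
import math

def euclidean(p, q):
    # Straight-line distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Taxicab distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

d_e = euclidean((0, 0), (3, 4))   # 5.0
d_m = manhattan((0, 0), (3, 4))   # 7
```

Manhattan distance is always at least as large as Euclidean for the same pair of points, and both change if one feature is rescaled, which is why scaling matters for kNN.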

Normalization and Standardization

✓ Feature scaling – equivalence of features
✓ Algorithms can be biased towards the features having numerically larger values
✓ Faster computations

Normalization:
• Min-Max scaling
• Converts to the 0–1 range

Standardization:
• Converts to z-values using the mean and standard deviation
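Both scalings in a few lines (plain Python; the feature column is made up). The sketch uses the population standard deviation; scikit-learn's StandardScaler behaves similarly:

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0, 10.0]      # a hypothetical feature column

# Min-Max scaling: (v − min) / (max − min) maps the column onto [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: z = (v − mean) / sd gives mean 0 and unit variance
mu = statistics.mean(values)
sd = statistics.pstdev(values)
standardized = [(v - mu) / sd for v in values]
```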
The right k

Small K
• Many small regions
• Non-smooth decision boundaries
• May overfit

Large K
• Fewer, larger regions
• Smoother decision boundaries
• May underfit

Choosing K
• Domain dependent
• Cross-validation

Practical Considerations

• Accuracy depends solely on the data model
• Scaling of features may be required
• Nearest Neighbor classifiers are lazy learners
• Can sometimes be the best method
• Memory intensive
• Low training time, high prediction time

kNN function

kNN Regression

• kNN for Regression (advertising data)
• K = 1: a rough step function. Flexible fit. Low bias, high variance. Overfitting.
• K = 9: a smoother function. More rigid fit. High bias, low variance. Changing one observation has a small effect.


kNN Regression

• kNN regression is non-parametric
• For regression, the average of the neighbours' responses is computed
• (For classification, majority voting is taken into account)
• The algorithm requires:
• K
• A distance metric (usually the Euclidean metric)
• Features should be on the same scale
• Memory intensive – instance-based method
• Curse of dimensionality
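A minimal plain-Python sketch of kNN regression on a made-up 1-D dataset, showing how small k gives a jumpier fit than a larger k:

```python
def knn_regress(train_X, train_y, x0, k):
    # predict at x0 as the average response of the k nearest training points
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x0))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
print(knn_regress(X, y, 2.6, k=1))  # copies the single nearest response
print(knn_regress(X, y, 2.6, k=3))  # averaging 3 neighbours smooths the fit
```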
ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus


Resampling Methods
BITS Pilani
Pilani Campus

• They involve repeatedly drawing samples from the training set in order to
obtain additional information about the fitted model
• For example,

• Estimate population parameters by sampling with replacement the dataset

• To estimate test-set prediction error

• Select the appropriate level of flexibility


Lecture 5 Resampling Methods
• Two resampling methods: cross-validation and the bootstrap



Training Error Vs Test Error

• Recall the distinction between the test error and the training error:
• The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.
• In contrast, the training error is calculated by applying the statistical learning method to the observations used in its training.
• But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

Validation set approach

Holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations to estimate the test error.
• We randomly divide the training set into two equal parts: a training set and a validation or hold-out set.
• The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
• The resulting validation-set error provides an estimate of the test error.
• This is typically assessed using MSE in the case of a quantitative response and misclassification rate in the case of a qualitative (discrete) response.
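The 50/50 split can be sketched in plain Python (the data here is just a stand-in for real observations):

```python
import random

def validation_split(data, seed=0):
    # randomly divide the observations into two equal halves:
    # a training set and a validation (hold-out) set
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train = [data[i] for i in idx[:half]]
    valid = [data[i] for i in idx[half:]]
    return train, valid

data = list(range(20))
train, valid = validation_split(data)
# fit on `train`, then estimate the test error with MSE (quantitative
# response) or misclassification rate (qualitative) computed on `valid`
print(len(train), len(valid))  # 10 10
```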

Drawbacks

• The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
• In the validation approach, only a subset of the observations — those that are included in the training set — are used to fit the model. We lose a major proportion of the data.

Leave One Out Cross Validation (LOOCV)

• Tries to address the drawbacks of the validation set approach
• Only a single observation is used for validation; n − 1 observations for training.
• Very time consuming, computationally expensive.

[Figure: Mean Squared Error vs Degree of Polynomial, for the validation-set approach (left, highly variable across splits) and LOOCV (right)]
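A plain-Python sketch of the LOOCV loop; the mean-only "model" below is a deliberately trivial stand-in for a real learner:

```python
def loocv_mse(xs, ys, fit, predict):
    # leave-one-out CV: each of the n observations is held out once,
    # the model is fit on the remaining n-1, and squared errors are averaged
    errs = []
    for i in range(len(xs)):
        tr_x = xs[:i] + xs[i + 1:]
        tr_y = ys[:i] + ys[i + 1:]
        model = fit(tr_x, tr_y)
        errs.append((predict(model, xs[i]) - ys[i]) ** 2)
    return sum(errs) / len(errs)

# toy "model": always predict the mean of the training responses
fit = lambda xs, ys: sum(ys) / len(ys)
predict = lambda model, x: model

mse = loocv_mse([1, 2, 3, 4], [1.0, 2.0, 3.0, 4.0], fit, predict)
print(mse)
```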


Bias-Variance Tradeoff in Cross-Validation

• Statistical methods tend to perform worse when trained on fewer observations
• In the 50% validation set approach, only a subset is available for training
• In LOOCV, bias will be the lowest, but variance is high. It also doesn't shake up the data enough.
• What's the way in between?
• Set k = 5 or 10
• Moderate bias, moderate variance

k-fold Cross-Validation

• Widely used approach for estimating test error.
• Estimates can be used to
• Select the best model
• Estimate the test error of the final chosen model.
• Process
• Randomly divide the data into k equal-sized parts
• Leave out part k, fit the model to the other k − 1 parts (combined), and then obtain predictions and the error for the left-out kth part.
• This is done in turn for each part k = 1, 2, . . . , k
• The average of the k errors is the estimate of the test error
• Setting k = n yields n-fold or leave-one-out cross-validation (LOOCV)
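The fold construction can be sketched in plain Python (index bookkeeping only; fitting and scoring are left to the caller):

```python
import random

def kfold_indices(n, k, seed=0):
    # randomly divide the n observations into k (nearly) equal parts;
    # each part serves once as the held-out fold
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(10, 5)
for train_idx, test_idx in splits:
    # fit on train_idx, evaluate on test_idx; average the k errors
    print(len(train_idx), len(test_idx))  # 8 2 each time
```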

5-fold CV

• Widely used approach for estimating test error.

Bootstrap

• The bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.
• Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.
• Each of these "bootstrap data sets" is created by sampling with replacement and is the same size as our original dataset. As a result, some observations may appear more than once in each bootstrap data set and some not at all.
• Example: Training dataset = {1, 2, 3} can give the possible datasets below:

111 222 333
112 221 331
113 223 332
122 233 322
123 231 321
133 211 311
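A minimal sketch of drawing bootstrap datasets from {1, 2, 3} with Python's random module:

```python
import random

def bootstrap_sample(data, seed=None):
    # sample with replacement, same size as the original dataset:
    # some observations may appear more than once, some not at all
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

data = [1, 2, 3]
samples = [bootstrap_sample(data, seed=s) for s in range(5)]
print(samples)
```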


Implementing 5-fold CV
• House Prices Dataset

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus


Curse of Dimensionality
BITS Pilani
Pilani Campus

• Despite their simplicity, the linear models often have a good predictive power.
• If the relationship between predictor and response is linear, they tend to have low bias.
• And if n is large enough, i.e., n >> p, they have low variance too. No overfitting.
• But, what if p > n?
• New technologies can generate unlimited features.
• On the other hand, collection of data at times may have some constraints like cost and
availability. Result: p > n.
Lecture 6 Regularization
• How to manage irrelevant variables?



Dimensionality Reduction

• We need better generalization, low overfitting and high interpretability.
• How to manage irrelevant variables?
• Perform feature selection
• Push the coefficients of some variables to zero
• Methods for variable selection/elimination:
• Subset selection: Select a subset of features which we believe are relevant
• Shrinkage: Shrink the coefficients of some variables to zero. Called "Regularization". Result: feature selection.
• Dimension reduction: Project the p predictors into an M-dimensional subspace, M < p. Achieved by computing M different linear combinations of the p variables. Fit the model on the M features. Principal Component Analysis.

Shrinkage Methods

• Fit a model on all p predictors, but constrain or regularize the coefficient estimates, i.e., shrink them toward zero.
• Ridge Regression and Lasso Regression

Ridge Regression
• Recall: the Residual Sum of Squares
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
• The least squares method estimates the coefficients β0, β1, β2, …, βp that minimize RSS, which can also be expressed as
RSS = Σᵢ₌₁ⁿ (yᵢ − β0 − Σⱼ₌₁ᵖ βⱼxᵢⱼ)²
• Ridge regression is the same as OLS, except that the coefficients are estimated by minimizing a slightly different quantity:
Σᵢ₌₁ⁿ (yᵢ − β0 − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ² = RSS + λ Σⱼ₌₁ᵖ βⱼ²
• where λ ≥ 0 is called the regularization parameter.
Ridge Regression

• Ridge regression minimizes RSS + λ Σⱼ₌₁ᵖ βⱼ²

• The second term, λ Σⱼ₌₁ᵖ βⱼ², is called the shrinkage penalty. As the β values → 0, the penalty decreases.

• This penalty is called the ℓ2 penalty.

• The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates.

• Selecting a good value for λ is critical; cross-validation is used for this.

• λ = 0 ⇒ the penalty has no effect; the estimates are the same as the least squares regression coefficients.

• λ → ∞ ⇒ the coefficients β1, . . . , βp → 0. What happens to β0? (The intercept is not penalized.)
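A one-predictor sketch makes the shrinkage visible. Assuming x is standardized and y centered (no intercept), minimizing Σ(y − bx)² + λb² gives the closed form b = Sxy/(Sxx + λ); the toy data below is made up:

```python
def ridge_beta(xs, ys, lam):
    # one-predictor ridge (x standardized, y centered, no intercept):
    # minimizes sum((y - b*x)^2) + lam * b^2, giving b = Sxy / (Sxx + lam)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-1.5, -0.5, 0.5, 1.5]
ys = [-3.1, -0.9, 1.1, 2.9]
print(ridge_beta(xs, ys, 0.0))   # lam = 0: same as least squares
print(ridge_beta(xs, ys, 5.0))   # larger lam shrinks the coefficient
print(ridge_beta(xs, ys, 1e10))  # lam -> infinity: coefficient -> 0
```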

λ Vs β's

Bias-Variance Tradeoff

• As λ increases, flexibility decreases, which means a more robust model, less overfitting, less variance. But bias increases.
• An appropriate value of λ can minimize the sum of the bias and variance errors, i.e., the MSE.


λ from 0.01 to 10¹⁰

OLS:
(Intercept) Income Limit Rating Cards Age Education Female Student Married Caucasian
-479.21 -7.80 0.19 1.14 17.72 -0.61 -1.10 -10.65 425.75 -8.53 10.11
At λ = 0.01:
-473.9088 -7.7987 0.1801 1.2954 16.9949 -0.6373 -1.0560 -10.3862 425.8620 -7.4602 1.4765
At λ = 10¹⁰:
359.9029 0.0002 0.0336 0.0023 0.0000 -0.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000

Lasso Regression

• Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
• The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients β̂λL minimize the quantity

Σᵢ₌₁ⁿ (yᵢ − β0 − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ |βⱼ| = RSS + λ Σⱼ₌₁ᵖ |βⱼ|

• In statistical parlance, the lasso uses an ℓ1 penalty instead of an ℓ2 penalty.
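The same one-predictor setting shows the difference: the lasso solution is a soft-thresholding of Sxy, so a large enough λ drives the coefficient to exactly zero (toy data as in the ridge sketch; this is an illustration, not the general coordinate-descent algorithm):

```python
def lasso_beta(xs, ys, lam):
    # one-predictor lasso (x standardized, y centered, no intercept):
    # minimizes sum((y - b*x)^2) + lam * |b| -> soft-thresholding of Sxy
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if sxy > lam / 2:
        return (sxy - lam / 2) / sxx
    if sxy < -lam / 2:
        return (sxy + lam / 2) / sxx
    return 0.0  # the l1 penalty pushes small coefficients to exactly zero

xs = [-1.5, -0.5, 0.5, 1.5]
ys = [-3.1, -0.9, 1.1, 2.9]
print(lasso_beta(xs, ys, 0.0))   # same as least squares
print(lasso_beta(xs, ys, 4.0))   # shrunk
print(lasso_beta(xs, ys, 30.0))  # exactly 0: the feature is eliminated
```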

Ridge Vs Lasso

Parameter | Ridge | Lasso
Coefficients | Shrinks close to 0 | Pushes to exact 0 (if λ is large enough)
Penalty type | L2 | L1
Penalty | λ * sum of squared coefficients | λ * sum of absolute coefficients
Feature selection | Reduces impact of least important features | Eliminates least important features
Complexity and interpretability | High | Low
Sparsity | Low | High
Sensitivity | More robust; less sensitive to outliers compared with Lasso | More sensitive to outliers

Ridge Vs Lasso
• Ridge and Lasso for auto-mpg data


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 7 Bayesian Learning

BITS Pilani Pravin Mhaske


Pilani Campus

Conditional Probability

Consider a family with two children. Sample space S = {GG, GB, BG, BB}.
• What is the probability that both the children are girls?
• What is the probability that both children are girls, given that the first child is a girl?
• What is the probability that both the children are girls, given that at least one of them is a girl?

Sol.: Let A, B, and C be the events that both children are girls, the first child is a girl, and at least one of them is a girl, respectively.
Here A = {GG}, B = {GG, GB}, C = {GG, GB, BG}; P(A) = 1/4, P(B) = 1/2, P(C) = 3/4

P(A|B) = P(A ∩ B)/P(B) = P(A)/P(B) = (1/4)/(1/2) = 1/2
P(A|C) = P(A ∩ C)/P(C) = P(A)/P(C) = (1/4)/(3/4) = 1/3

Q. In rolling of a fair die, what is the probability that the outcome is an odd number, given that it is less than or equal to 3?
Sol. P(A|B) = P(A ∩ B)/P(B) = (2/6)/(3/6) = 2/3
Total Probability Rule

I have three bags each containing 100 balls.
• Bag 1 has 75 red and 25 blue balls
• Bag 2 has 60 red and 40 blue balls.
• Bag 3 has 45 red and 55 blue balls.

If I choose a bag at random, and then pick a ball from it, also at random, what is the probability that the chosen ball is red?

Total Probability Rule

Sol.: Let R be the event of picking a red ball, and B1, B2, B3 the events of choosing bag 1, bag 2, and bag 3.
Using the total probability rule,
P(R) = P(B1 ∩ R) + P(B2 ∩ R) + P(B3 ∩ R)
= P(B1)P(R|B1) + P(B2)P(R|B2) + P(B3)P(R|B3)
= (1/3)(0.75) + (1/3)(0.60) + (1/3)(0.45) = 0.60

Q 2: A manufacturing firm receives shipments of parts from two different suppliers. Supplier A provides 65%, whereas supplier B provides 35% of total parts. Historical data suggests that supplier A provides 2% bad parts, whereas supplier B provides 5% bad parts. With this data, can you find out the total probability of bad parts? (You may use a tree diagram)

Bayes Theorem

For any two events A and B, where P(A) ≠ 0,
P(B|A) = P(B ∩ A)/P(A) = P(B)P(A|B)/P(A)

Bayes' theorem helps to revise probabilities (prior to posterior) with new information. This theorem has wide applications in business statistics.

Q 1. I have three bags each containing 100 balls.
• Bag 1 has 75 red and 25 blue balls
• Bag 2 has 60 red and 40 blue balls.
• Bag 3 has 45 red and 55 blue balls.
A bag is chosen at random, and a ball is picked also at random. If the ball is found to be red, what is the probability that Bag 1 was chosen?

By the total probability rule,
P(R) = P(B1)P(R|B1) + P(B2)P(R|B2) + P(B3)P(R|B3) = 0.60
Now by Bayes' theorem,
P(B1|R) = P(B1)P(R|B1)/P(R) = P(B1)P(R|B1) / Σᵢ₌₁³ P(Bᵢ)P(R|Bᵢ) = (1/3 × 0.75)/0.60 = 5/12
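The slide's computation can be checked directly in a few lines of Python:

```python
# priors: a bag is chosen uniformly at random
priors = {"B1": 1 / 3, "B2": 1 / 3, "B3": 1 / 3}
# likelihoods: probability of drawing a red ball from each bag
p_red = {"B1": 0.75, "B2": 0.60, "B3": 0.45}

# total probability rule
p_r = sum(priors[b] * p_red[b] for b in priors)
# Bayes' theorem: posterior probability of Bag 1 given a red ball
posterior_b1 = priors["B1"] * p_red["B1"] / p_r

print(p_r)           # 0.60
print(posterior_b1)  # 5/12 = 0.4166...
```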


The ideal f̂

• Y: categorical variable with K ≥ 2 classes
• x0 is an observation
• Suppose the conditional probabilities are available: {P(Y = j|X = x): ∀x, ∀j}
• x0 is assigned to class j if P(Y = j|X = x0) = Max{P(Y = j|X = x0) : j = 1 … K}
• This is the ideal f̂. This classifier gives the lowest possible test error rate.

Bayesian Classifiers

Y: categorical variable with K ≥ 2 classes
Given an observation (X1, X2, …, Xp), we want to find the k that maximizes P(Y = k|X1, X2, …, Xp)
Can we estimate P(Y = k|X1, X2, …, Xp) from the data?

From Bayes Theorem, P(Y = k|X1, X2, …, Xp) = P(X1, X2, …, Xp|Y = k) P(Y = k) / P(X1, X2, …, Xp)
Now
max_k P(Y = k|X1, X2, …, Xp) ≡ max_k P(X1, X2, …, Xp|Y = k) × P(Y = k) = Likelihood × Prior

The Naïve Bayes Classifier

Suppose the predictors are independent. Then
• P(Y = k|X1, X2, …, Xp) = P(X1, X2, …, Xp|k) P(k) / P(X1, X2, …, Xp) = P(X1|k) × P(X2|k) × ⋯ × P(Xp|k) P(k) / P(X1, X2, …, Xp)
• max_k P(k|X1, X2, …, Xp) ≡ max_k P(X1, X2, …, Xp|k) P(k) = max_k P(X1|k) × P(X2|k) × ⋯ × P(Xp|k) P(k)
• P(X1 = x1|Y = k) can be estimated from the data

Fruits Dataset

No Long Sweet Yellow Fruit
1 Y Y Y Banana
2 Y Y Y Banana
3 Y Y Y Banana
4 Y Y Y Banana
5 Y Y Y Banana
6 Y Y Y Banana

DATA SUMMARY
Fruit | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | TOTAL
Banana | 400 | 100 | 350 | 150 | 450 | 50 | 500
Orange | 0 | 300 | 150 | 150 | 300 | 0 | 300
Rest | 100 | 100 | 150 | 50 | 50 | 150 | 200
TOTAL | 500 | 500 | 650 | 350 | 800 | 200 | 1000
Fruits Dataset

• P(A ∩ B) = P(A|B) × P(B) = P(B|A) × P(A); if A and B are independent, P(A ∩ B) = P(A) × P(B)
• P(A|B) = P(A ∩ B)/P(B) = P(B|A) × P(A)/P(B)
• If E1, E2 & E3 are independent features,
P(A|E1 & E2 & E3) = P(A)/(P(E1) × P(E2) × P(E3)) × P(E1|A) × P(E2|A) × P(E3|A)

From the DATA SUMMARY:
P(Banana) = 0.5, P(Orange) = 0.3, P(Other) = 0.2
P(Long) = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8

P(B|L=1 & S=1 & Y=1) = 0.5/(0.5×0.65×0.80) × P(L=1|B) × P(S=1|B) × P(Y=1|B) = (1/0.26) × 0.5 × 0.8 × 0.7 × 0.9 = 0.9692
P(O|L=1 & S=1 & Y=1) = 0.3/(0.5×0.65×0.80) × P(L=1|O) × P(S=1|O) × P(Y=1|O) = (1/0.26) × 0.3 × 0 × 0.5 × 1 = 0
P(R|L=1 & S=1 & Y=1) = 0.2/(0.5×0.65×0.80) × P(L=1|R) × P(S=1|R) × P(Y=1|R) = (1/0.26) × 0.2 × 0.5 × 0.75 × 0.25 = 0.0721

Note: they do not add up to 1.
Divide each "probability" by the sum of the three "probabilities" to get 0.9308, 0, 0.0692.
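The same table-driven computation in plain Python (the class-conditional probabilities are read straight off the DATA SUMMARY):

```python
# class priors and per-class feature likelihoods from the DATA SUMMARY table
priors   = {"Banana": 0.5,       "Orange": 0.3,       "Other": 0.2}
p_long   = {"Banana": 400 / 500, "Orange": 0 / 300,   "Other": 100 / 200}
p_sweet  = {"Banana": 350 / 500, "Orange": 150 / 300, "Other": 150 / 200}
p_yellow = {"Banana": 450 / 500, "Orange": 300 / 300, "Other": 50 / 200}

# naive-Bayes score for Long=1, Sweet=1, Yellow=1; the evidence term is
# omitted because it is the same for every class
scores = {k: priors[k] * p_long[k] * p_sweet[k] * p_yellow[k] for k in priors}
total = sum(scores.values())
posteriors = {k: v / total for k, v in scores.items()}
print(posteriors)  # Banana ~0.9308, Orange 0, Other ~0.0692
```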

Spherical, Sweet and Red?

Fruit | Long | Not Long | Sweet | Not Sweet | Yellow | Not Yellow | TOTAL
Banana | 400 | 100 | 350 | 150 | 450 | 50 | 500
Orange | 0 | 300 | 150 | 150 | 300 | 0 | 300
Rest | 100 | 100 | 150 | 50 | 50 | 150 | 200
TOTAL | 500 | 500 | 650 | 350 | 800 | 200 | 1000

P(A|E1 & E2 & E3) = P(A)/(P(E1) × P(E2) × P(E3)) × P(E1|A) × P(E2|A) × P(E3|A)
P(Banana) = 0.5, P(Orange) = 0.3, P(Other) = 0.2
P(Long) = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8

Here 1/(P(L=0) × P(S=1) × P(Y=0)) = 1/(0.5×0.35×0.20) = 28.57:
P(B|L=0 & S=1 & Y=0) = 28.57 × 0.5 × P(L=0|B) × P(S=1|B) × P(Y=0|B) = 28.57 × 0.5 × 0.2 × 0.7 × 0.1 = 0.20
P(O|L=0 & S=1 & Y=0) = 28.57 × 0.3 × P(L=0|O) × P(S=1|O) × P(Y=0|O) = 28.57 × 0.3 × 1 × 0.5 × 0 = 0
P(R|L=0 & S=1 & Y=0) = 28.57 × 0.2 × P(L=0|R) × P(S=1|R) × P(Y=0|R) = 28.57 × 0.2 × 0.5 × 0.75 × 0.75 = 1.607

Normalizing: B = 0.20/(0.20 + 0 + 1.607) ≈ 0.11, O = 0, R ≈ 0.89

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus
Lecture 8 k Nearest Neighbours

Eager and Lazy learning

Parameter | Eager learning | Lazy Learning
Process | The system tries to construct a general, input-independent target function during training | Generalization beyond the training data is delayed until a query is made to the system (Instance-Based Learning)
Generalization | Generalizes globally, beyond the training data, independent of the new query | Generalizes locally, depending upon the query instance
Training time | High | Almost nil
Prediction time | Low | High
Memory requirement | Low | High
Examples | Linear Regression, Logistic Regression | kNN

k Nearest Neighbors

K = 3, with six blue observations and six orange observations; x is a test observation.
The 3 closest points are identified, and therefore x belongs to the blue class.
The KNN decision boundary for this example is shown in black.

Flexibility

• The Training error rate (blue) & Test error rate (orange) vs Flexibility on a simulated data set
• Flexibility is measured by 1/k
• High k: stable model
• Low k: flexible model
• Bias-Variance Tradeoff?


Distance Metrics

• Euclidean:
• Straight-line distance
• Scale variant, sensitive to data dimensionality
• Normalization (scaling) can solve this issue

• Manhattan:
• Taxicab or city-block distance
• Sum of absolute differences of Cartesian coordinates

Normalization and Standardization

✓ Feature Scaling – equivalence of features
✓ Algorithms can be biased towards the features having numerically larger values
✓ Faster computations

Normalization:
• Min-Max scaling
• Converts to 0-1 range

Standardization:
• Converts to z-values using mean and std dev

The right k

Small K
• Many small regions
• Non-smooth decision boundaries
• May overfit
Large K
• Fewer, larger regions
• Smoother decision boundaries
• May underfit
Choosing K
• Domain dependent
• Cross-validation


Practical Considerations

• Accuracy depends solely on a data model
• Scaling of features may be required
• Nearest Neighbor classifiers are lazy learners
• Can sometimes be the best method
• Memory Intensive
• Low training time, high prediction time

Exercise

• Classification model using kNN

Mid-Sem exam

• 5-7 questions
• Difficulty level: Easy, moderate
• Application based: 2-4
• Theory: 2-4
• Topics: Linear Regression, Bias-Variance, Over and underfitting, Types of errors, Model selection, Regularization, Cross-validation, Parametric or non-parametric methods, Resampling

How to write?

• Try to write in the textbox wherever possible. Use sheet upload for calculations, charts etc.
• Upload only the answer to the specific question, not the complete answer sheet.
• Take the photo at an appropriate angle. No creative photography!
• Write legibly. If the evaluator can't read or understand, you may lose marks.
• Underline, box or highlight important parts of the answer, like the keywords, final answer, etc.
• If you don't really know the answer, writing anything doesn't give marks.

Re-evaluations:
• Cross-check the answer with the sample answers. If it is not correct or even nearby, do not apply.
• Unjustified and vague requests may result in deduction.
• No marks for "attempting" a question.


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 9 Classification, Logistic Regression

BITS Pilani Pravin Mhaske


Pilani Campus

Classification

Here the response variable Y is qualitative/categorical
• Email Spam: email is one of Y = {spam, email}
• Handwritten Digit Recognition: digit class is one of Y = {0, 1, . . . , 9}.

Our goals are:
1. Prediction
• Build a classifier f(X) that assigns a class label to a future unlabeled observation X
• Estimate the probability that X belongs to each category in Y
Example: We may be more interested in an estimate of the probability that a transaction is fraudulent than in classifying whether the transaction is fraudulent or not
2. Inference
• Understand the roles of the different predictors among X = (X1, X2, . . . , Xp)

Regression Vs Classification

Variables can either be quantitative or qualitative (categorical)
• Quantitative variables take on numerical values – Income, Bill amount
• Categorical variables take on values in one of K different classes – Gender, Digit

Regression problem: the response variable is quantitative
Classification problem: the response variable is categorical


Classification Algorithms

• Naïve Bayes
• K-nearest Neighbour
• Logistic Regression
• Discriminant Analysis
• Decision Trees
• Support Vector Machine

Estimator and Error Rate

We have seen y = f(X).
f is a function that best maps an input x to output y. We wish to estimate this f.

The accuracy of f̂ is usually defined by the Error Rate
• The proportion of mis-classifications
Ave(Σ I(yᵢ ≠ ŷᵢ))
where I is the indicator function

There are two error rates:
• Training Error Rate
• Test Error Rate

A good classifier is one for which the test error rate is the smallest.

Regression Revision

Relationships between a numerical response and numerical / categorical predictors
• Hypothesis tests for all regression parameters together – testing the model
• Model coefficient interpretation
• Hypothesis tests for each regression parameter
• Confidence intervals for regression parameters
• Confidence and prediction intervals for predicted means and values
• Model diagnostics, residual plots, outliers
• RSS, MSE, R2
• Interpreting computer outputs

Classification – why and how?

• Regression gives a number. What if I want to identify a class or category and not a number?
• Let's say I want to identify genuine emails vs spam emails, or genuine transactions vs fraud transactions. Here the outcomes are text values, but models can understand only numbers.
• How do I handle this? I will replace the 2 classes by numbers, say one class as 1 and another as 0, and train a model which can predict the outcome value, that is 0 or 1.
• But models can't give discrete values 0 and 1.
• We can rather make them give a continuous value between 0 and 1.
• If the value is closer to 1 (i.e. >= 0.5), I consider it as 1, otherwise 0.


Classification – why and how?

• Can I use the concepts of linear regression here? How?
• Linear regression can throw out any value between −∞ and +∞.
• However, I want to map or convert that range (−∞, +∞) to (0, 1).
• We need a link function to do this.
• The most appropriate one is a sigmoid or logistic function.

Odds and Logit Function

Odds is commonly used in gambling (and logistic regression).
For an event E,
• If we know P(E), then
Odds(E) = P(E)/P(~E) = P(E)/(1 − P(E))
If the odds of E are "x to y", then P(E) = x/(x + y)

Logit function:
• logit(p) = ln(p/(1 − p)), 0 ≤ p ≤ 1
log Odds(E) = ln(P(E)/(1 − P(E)))

logit can be interpreted as the log odds of a success.

Logit Function and Logistic (Sigmoid) Function

• Logistic regression is a Generalized Linear Model (GLM)
• Uses the logistic or sigmoid function.

The logit function
• logit(p) = ln(p/(1 − p)), 0 < p < 1
• Converts (0, 1) to the range (−∞, +∞)

The inverse function is known as the sigmoid function (or logistic function)
• S(x) = e^x/(1 + e^x) = 1/(1 + e^(−x)), −∞ < x < +∞
• Converts (−∞, +∞) to the range (0, 1)

Sigmoid Curve
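A quick sketch confirming that the two functions are inverses of each other:

```python
import math

def logit(p):
    # log odds: maps (0, 1) to (-inf, +inf)
    return math.log(p / (1 - p))

def sigmoid(x):
    # logistic function: maps (-inf, +inf) to (0, 1); inverse of logit
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))           # 0.5
print(logit(0.5))           # 0.0
print(sigmoid(logit(0.8)))  # round-trips back to 0.8
```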


Evaluating the Model: Confusion Matrix

The performance of f̂ can also be described by a confusion matrix.

A confusion matrix is a table that is used to describe the performance of a classification model (or "classifier") on a set of data for which the true values are known.

The confusion matrix gives strong clues as to where f̂ is going wrong.

Example

Consider a classical problem of predicting spam and non-spam email.


The objective is to identify Spams.
The training set consists of 15 emails that are Spam, and 85 emails that are Not Spam
The model correctly classified 95 emails
• All 85 Non-Spams were correctly classified
• 10 Spams were correctly classified
• 5 Spams were classified as Non-Spams (False Negative if Target is Spam).

The Matrix

The objective is to identify Spams.

True Class
Predicted Class | Spam | Non-Spam
Spam | 10 | 0
Non-Spam | 5 | 85

• "true positive" for correctly identifying a target event
• "true negative" for correctly identifying a non-target event
• "false positive" for incorrectly identifying a non-target event as a target event
• "false negative" for incorrectly identifying a target event as a non-target event

TP = 10, FP = 0, TN = 85, FN = 5

True Class
Predicted Class | Target | Non-Target
Target | TP | FP
Non-Target | FN | TN

Is accuracy a good metric always?

Building a highly accurate (useless) classifier
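Computing the usual metrics from the spam example's counts:

```python
# counts from the spam example: TP = 10, FP = 0, TN = 85, FN = 5
TP, FP, TN, FN = 10, 0, 85, 5

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)  # of predicted spams, how many are spam
recall    = TP / (TP + FN)  # of actual spams, how many were caught (TPR)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy)   # 0.95
print(precision)  # 1.0
print(recall)     # 0.666...
print(f1)         # 0.8
```

Note how the 95% accuracy hides that a third of the spams were missed, which is exactly the point of the "highly accurate (useless) classifier" warning.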

What’s the right metric? The Metrics

1. Many classifiers are designed to optimize error/accuracy


2. This tends to bias the performance towards the majority class
1 is same as Positive
3. Anytime there is an imbalance in the data this can happen 0 is same as Negative
4. It is particularly pronounced, though, when the imbalance is more pronounced
5. Accuracy is not the right measure of classifier performance in such cases
6. What are other metrics?
1. Precision
2. Recall (Sensitivity or TPR or True Positive Rate = TP/P)
3. F1-score?

Also check*
1. Specificity (TNR or True Negative Rate = TN/N)
2. False Positive Rate (FPR) = FP/N = 1 – TNR
3. And others…

Refer https://en.wikipedia.org/wiki/Confusion_matrix

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Strategies for Imbalanced Data

1. Under-sampling
2. Over-sampling
3. Optimize AUC

Under-sampling (majority class)

Create a new training data set by:
• including all k "positive" examples
• randomly picking k "negative" examples

Pros:
• Easy to implement
• Training becomes much more efficient (smaller training set)
• For some domains, can work very well

Cons:
• Throwing away a lot of data/information

Over-sampling (minority class)

Create a new training data set by:
• including all m negative examples
• including m positive examples:
• repeat each example a fixed number of times, or
• sample with replacement

Pros:
• Easy to implement
• Utilizes all of the training data
• Tends to perform well in a broader set of circumstances than subsampling

Cons:
• Computationally expensive to train a classifier

Multiclass Classification

Suppose the possible responses are A, B & C.
f̂ was run on the training set and the following Confusion Matrix was generated:

True Class
Predicted Class | A | B | C
A | 30 | 20 | 10
B | 50 | 60 | 10
C | 20 | 20 | 80

The confusion matrix gives strong clues as to where f̂ is going wrong.
• For class A, f̂ incorrectly assigned label B for the majority of the mislabelled cases. Perhaps features need to be added to improve classification of this label.
The more zeroes or smaller the numbers on all cells but the diagonal, the better the classifier is doing.


Multiclass Classification

Suppose the possible responses are A, B & C.
f̂ was run on the training set and the following Confusion Matrix was generated:

True Class
Predicted Class | A | B | C
A | 30 | 20 | 10
B | 50 | 60 | 10
C | 20 | 20 | 80

True Positives are those observations of a particular class that were classified correctly
False Positives are those observations that were incorrectly mapped to one class
False Negatives are those observations of a particular class that were classified incorrectly
True Negatives: applicable for a two-class scenario

TP_A = 30, TP_B = 60, TP_C = 80
FP_A = 30, FP_B = 60, FP_C = 40
FN_A = 70, FN_B = 40, FN_C = 20

More Metrics

True Positive Rate (TPR)
• TP/(TP+FN) or TP/AP
• Proportion of actual positive instances correctly identified
• Also called Recall or Sensitivity
• You'd always want to maximize TPR
• Example: Correctly detected defaulters out of all actual defaulters

False Positive Rate (FPR)
• FP/(TN+FP) or FP/AN
• Proportion of actual negatives incorrectly identified as positives (false alarms)
• Also called Type 1 error
• You'd always want to minimize FPR
• Especially harmful in cases like medical diagnosis
• Example: Incorrectly detected as defaulter out of all non-defaulter customers
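The per-class counts can be read off the matrix programmatically (rows are predictions, columns are true classes, as in the slide):

```python
# confusion matrix: rows = predicted class, columns = true class
labels = ["A", "B", "C"]
cm = [[30, 20, 10],
      [50, 60, 10],
      [20, 20, 80]]

n = len(labels)
TP = {labels[i]: cm[i][i] for i in range(n)}
# FP for a class: its row total minus the diagonal entry
FP = {labels[i]: sum(cm[i]) - cm[i][i] for i in range(n)}
# FN for a class: its column total minus the diagonal entry
FN = {labels[j]: sum(cm[i][j] for i in range(n)) - cm[j][j] for j in range(n)}

print(TP)  # {'A': 30, 'B': 60, 'C': 80}
print(FP)  # {'A': 30, 'B': 60, 'C': 40}
print(FN)  # {'A': 70, 'B': 40, 'C': 20}
```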

Receiver Operating Characteristics (ROC)

• Displays the trade-off between TPR and FPR at all possible thresholds.
• Changing the threshold changes TPR and FPR.
• Good to compare multiple different classifiers.
• The "no information" classifier (which works no better than a random guess) has AUC = 0.5
• The closer the curve to the top left, the better the classifier:
• High TPR - minimum FN
• Low FPR - minimum FP
• Maximize the Area Under the ROC Curve (AUC) for a better classifier. It provides a quantitative measure of the model's performance.

Exercise

• Building a Logistic Regression Classifier using sklearn
• Plot ROC
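A plain-Python sketch of tracing ROC points by sweeping a threshold over made-up scores (in practice a library routine such as sklearn's roc_curve would be used):

```python
def roc_points(scores, labels, thresholds):
    # for each threshold t, classify score >= t as positive and
    # record (FPR, TPR); sweeping t traces the ROC curve
    pts = []
    P = sum(labels)
    N = len(labels) - P
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return pts

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # made-up predicted probabilities
labels = [1,   1,   0,   1,   0,   0]    # made-up true classes
pts = roc_points(scores, labels, [0.0, 0.5, 1.0])
print(pts)  # from (1.0, 1.0) down to (0.0, 0.0) as the threshold rises
```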


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 11 Tree-based Methods
BITS Pilani Pravin Mhaske
Pilani Campus

Tree based methods

• Tree-based methods are simple and useful for interpretation.
• Tree-based methods work for both regression and classification.
• These involve stratifying or segmenting the predictor space into a number of simple regions.
• Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods.
• Decision trees can be applied to both regression and classification problems, so they are also called CART.

Pros and Cons

• However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.
• Hence, we also discuss bagging, random forests, and boosting. These methods grow multiple trees which are then combined to yield a single consensus prediction.
• Combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.


Classification Trees

• For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
• We use recursive binary splitting to grow a classification tree.
• Cost function is the classification error rate. This is simply the fraction of the training observations in that region that do not belong to the most common class:
  E = 1 − max_k p̂_mk
  Here p̂_mk represents the proportion of training observations in the mth region that are from the kth class.
• However, classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable: the Gini index and cross-entropy.

Gini Index (Gini Impurity)

• The Gini index is defined by
  G = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)
• The Gini index takes on a small value if all of the p̂_mk’s are close to zero or one.
• For this reason, the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.
• For two classes, G = 0.5 ⇒ maximum impurity.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Cross-Entropy

• An alternative to the Gini index is cross-entropy, given by
  D = − Σ_{k=1}^{K} p̂_mk log p̂_mk
• Related to this is the Information Gain: the decrease of D from one level to the next.

Gini Vs Cross-Entropy

It turns out that the Gini index and the cross-entropy are very similar numerically.
For two classes (with log base 2), D = 1 ⇒ maximum impurity.
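A small sketch of both impurity measures, assuming log base 2 for the entropy (consistent with D = 1 at maximum impurity for two classes); the numbers reproduce the root node of the worked example that follows.

```python
# Gini index and cross-entropy for a node's class proportions.
import math

def gini(p):
    """G = sum_k p_k * (1 - p_k) over class proportions p."""
    return sum(pk * (1 - pk) for pk in p)

def entropy(p):
    """D = -sum_k p_k * log2(p_k); terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))       # 0.5 1.0 -> maximum impurity
print(round(gini([6/8, 2/8]), 2),
      round(entropy([6/8, 2/8]), 2))               # 0.38 0.81, the root node below
```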

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Building a tree

X1, X2 & X3 are the independent variables; Y is the response variable

X1 X2 X3 Y
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  0
1  0  0  0
1  0  1  1
1  1  0  0
1  1  1  1

Level 0: 6 ‘0’ and 2 ‘1’ ⇒ G = (6/8)(2/8) + (2/8)(6/8) = 0.38 and D = −(6/8)log2(6/8) − (2/8)log2(2/8) = 0.81

Level 1:
• Splitting candidate X1 (X1 < 0.5 vs X1 ≥ 0.5): YL: 0,0,0,0 (G = 0, D = 0); YR: 0,1,0,1 (G = 0.5, D = 1); GTotal = 0.25, DTotal = 0.5
• Splitting candidate X2: YL: 0,0,0,1 (G = 0.38, D = 0.81); YR: 0,0,0,1 (G = 0.38, D = 0.81); GTotal = 0.38, DTotal = 0.81
• Splitting candidate X3: YL: 0,0,0,0 (G = 0, D = 0); YR: 0,0,1,1 (G = 0.5, D = 1); GTotal = 0.25, DTotal = 0.5
Optimum candidate: X1, with G = 0.25 and D = 0.50
• Left branch: (0,0,0,0), (0,0,1,0), (0,1,0,0), (0,1,1,0); G = 0, D = 0; no further splitting
• Right branch: (1,0,0,0), (1,0,1,1), (1,1,0,0), (1,1,1,1); G = 0.5, D = 1

Level 2 (right branch only):
• Splitting candidate X2: YL: 0,1 (G = 0.5, D = 1); YR: 0,1 (G = 0.5, D = 1); GTotal = 0.5, DTotal = 1
• Splitting candidate X3: YL: 0,0 (G = 0, D = 0); YR: 1,1 (G = 0, D = 0); GTotal = 0, DTotal = 0
Optimum candidate: X3

Level 3: No further splitting
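The hand-built tree can be checked against scikit-learn; a minimal sketch, noting that in this table Y = 1 exactly when X1 = 1 and X3 = 1.

```python
# Fit the 8-row example with a Gini-based decision tree.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
     [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]   # columns X1, X2, X3
y = [0, 0, 0, 0, 0, 1, 0, 1]                        # response Y

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(tree.score(X, y))                         # 1.0 -> training error 0, as in the table
print(tree.predict([[1, 0, 1], [0, 1, 1]]))     # 1 for X1=X3=1, else 0
```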

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Tree and predictor space

X1 X2 X3 Y
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  0
1  0  0  0
1  0  1  1
1  1  0  0
1  1  1  1

The predictor space over (X1, X3) is partitioned at X1 = 0.5 and, for X1 ≥ 0.5, at X3 = 0.5 (plot omitted).

Error rate

Level | Error Rate
0 | 0.25 = 2/8
1 | 0.25 = (0 + 2)/8
2 | 0 = 0 + 0
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Pruning a tree Pruning a tree and cross validation

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Pros and Cons

▲ Trees are very easy to explain, even easier than linear regression!
▲ Decision trees more closely mirror human decision-making than do the other regression and
classification approaches.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert
(especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy
variables.
▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of
the other regression and classification approaches.

However, by aggregating many decision trees, the predictive performance of


trees can be substantially improved.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Summary
BITS Pilani
Pilani Campus

• Decision trees are simple and interpretable models for regression and classification
• However, they are often not competitive with other methods in terms of
prediction accuracy
• Bagging, random forests, and boosting are good methods for improving the prediction
accuracy of trees.
• Random Forest – They work by growing many trees on the training data and then combining
the predictions of the resulting ensemble of trees.
• The latter two methods, random forests and boosting, are among the state-of-the-art
methods for supervised learning. However, their results can be difficult to interpret. Lecture 11 Model Ensembles
• It appears that overfitting may not happen in Boosting

BITS Pilani, Pilani Campus

Why Ensemble? Bootstrap aggregating

• Decision trees are simple and interpretable models for regression and classification • A general-purpose procedure for reducing the variance of a statistical learning method
• However they are often not competitive with other methods in terms of prediction
accuracy • It is particularly useful and frequently used in the context of decision trees.

• Bagging, random forests and boosting are good methods for improving the prediction accuracy of • This is not practical because we generally do not have access to multiple training sets.
trees. They work by growing many trees on the training data and then combining the predictions
of the resulting ensemble of trees.
• The latter two methods— random forests and boosting— are among the state-of-the-art
methods for supervised learning. However their results can be difficult to interpret.
• It appears that overfitting may not happen in Boosting

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Bootstrap Bagging

• The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty
associated with a given estimator or statistical learning method.

• Randomly drawing bootstrap samples from the original training data, each of size n and drawn with replacement.

• Distinct data sets by repeatedly sampling observations from the original data set with replacement.

• New dataset obtained is of same size as our original dataset.

• This step ensures that the base models are trained on diverse subsets of the data, as some samples
may appear multiple times in the new subset, while others may be omitted. It reduces the risks of
overfitting and improves the accuracy of the model.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Bagging - Regression

Since we do not have access to multiple training sets, we bootstrap:
• Generate B different bootstrapped training data sets.
  Sample with replacement B times from the (single) training data set.
• Train the method on the bth bootstrapped training set to get f̂^b(x), b = 1, …, B
• Average all the predictions to obtain
  f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)

Bagging - Classification

• Generate B different bootstrapped training data sets.
• Train the method on the bth bootstrapped training set to get f̂^b(x), b = 1, …, B
• Take the majority vote: the overall prediction is the most commonly occurring class.
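The bagging recipe can be sketched directly: B bootstrap samples of size n drawn with replacement, one regression tree per sample, predictions averaged. The 1-D data set here is assumed purely for illustration.

```python
# Manual bagging of regression trees on an assumed toy target.
import random
from sklearn.tree import DecisionTreeRegressor

random.seed(0)
X = [[float(i)] for i in range(10)]
y = [2.0 * row[0] + 1.0 for row in X]     # a simple known target

B, n, preds = 25, len(X), []
for b in range(B):
    idx = [random.randrange(n) for _ in range(n)]   # bootstrap: n rows WITH replacement
    Xb = [X[i] for i in idx]
    yb = [y[i] for i in idx]
    fb = DecisionTreeRegressor(random_state=b).fit(Xb, yb)
    preds.append(fb.predict([[4.5]])[0])            # f^b evaluated at x = 4.5

f_avg = sum(preds) / B                              # f_avg(x) = (1/B) * sum_b f^b(x)
print(f_avg)
```

scikit-learn's `BaggingRegressor` packages these same steps.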

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Random Forest Random Forest

• An improvement over bagged trees (m = p).


• Predictions from bagged trees are highly correlated (a strong predictor tends to dominate the top split of every tree), which limits how much averaging can reduce the variance.
• Random forests address this by decorrelating the trees.

• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p


• For example, if there are say 15 predictors, each split considers only 4 of them (√15 ≈ 4). At any single split, the algorithm is not even allowed to consider a majority of the predictors, so all predictors get a fair chance.
• Each tree becomes different from the others. Averaging the predictions reduces variance.
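In scikit-learn the per-split subsampling is controlled by `max_features`; a minimal sketch with `max_features="sqrt"` (m ≈ √p) on assumed synthetic data:

```python
# Random forest with m = sqrt(p) features considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)

print(rf.n_features_in_)   # p = 16 predictors; each split considers sqrt(16) = 4
print(rf.score(X, y))      # training accuracy of the ensemble
```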

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Random Forest - Regression Random Forest - Classification

• Generate B different bootstrapped training data sets. • Generate B different bootstrapped training data sets.
Sample with replacement B times from the (single) training data set. Sample with replacement B times from the (single) training data set.
• When building a decision tree • When building a decision tree
• At each split, a random sample of m predictors is chosen as split candidates from the full set of • At each split, a random sample of m predictors is chosen as split candidates from the full
p predictors. set of p predictors.
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
• We get f̂^b(x), b = 1, …, B
• Regression: average all the predictions to obtain f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂^b(x)
• Classification: take the majority vote; the overall prediction is the most commonly occurring class

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Boosting Boosting - Procedure

• Boosting can be applied to regression and classification methods


• h1 is applied on the original dataset &
misclassifies 3 points. The weights of these points
• Bagging involves a parallel process
• Creating multiple datasets using the bootstrap are increased and others are decreased.
• Fitting a separate decision tree to each copy • h2 is applied on the new dataset & misclassifies 3
• Combining all of the trees in order to create a single predictive model. points. The weights of these points are increased
• Each tree is built independently of the other trees.
and others are decreased.
• h3 is applied on the new dataset & misclassifies 3
• Boosting is a serial process
• The trees are grown sequentially points.
• Each tree is grown using information from previously grown trees • The final classifier is the weighted sum of the
H = α1h1 + α2h2 + α3h3
• Each tree works on a modified version of the training set three classifiers.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Bagging Vs Boosting Types of Boosting


AdaBoost (Adaptive Boosting)
• Multiple weak learners into a single strong learner.
• The weak learners in AdaBoost are decision trees with a single split, called decision stumps

Gradient Boosting
• Sequentially adding predictors to an ensemble, each one correcting its predecessor.
• Tries to fit the new predictor to the residual errors made by the previous predictor.

XGBoost (Extreme Gradient Boosting)

• An implementation of gradient boosted decision trees designed for speed and performance.
• Plain gradient boosting is generally slow because models are trained sequentially; XGBoost mitigates this with parallelized tree construction and other systems optimizations.
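A minimal AdaBoost sketch with scikit-learn, whose default weak learner is a decision stump (a single-split tree), matching the description above; the data set is synthetic and assumed for illustration.

```python
# AdaBoost: many weak stumps combined into a strong learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=1)
ada = AdaBoostClassifier(n_estimators=50, random_state=1).fit(X, y)

print(ada.score(X, y))   # training accuracy after 50 boosting rounds
```

`GradientBoostingClassifier` and the `xgboost` package follow the same fit/predict pattern.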

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Lecture 12 Support Vector Machines

BITS Pilani Pravin Mhaske


Pilani Campus

Support Vector Machine

• Maximum Margin Classifier
• Can classify linearly separable as well as non-linearly separable data
• Can also be used for regression
• A data point is considered a p-dimensional vector
• Objective is to separate such data points using a (p − 1)-dimensional hyperplane

Hyperplane

• A hyperplane in p dimensions is a flat subspace of dimension p − 1.
• In general, the equation for a hyperplane has the form
  β0 + β1X1 + β2X2 + . . . + βpXp = 0
• In two dimensions, the hyperplane β0 + β1X1 + β2X2 = 0 is a line.
• If β0 = 0, the hyperplane goes through the origin, otherwise not.
• The vector β = (β1, β2, · · · , βp) is called the normal vector.

Data Dimension | Hyperplane dimension | Hyperplane
1 | 0 | Point
2 | 1 | Line
3 | 2 | Plane
p | p − 1 | Hyperplane

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


2D and 3D Hyperplane Maximum margin classifier

• H1 – can’t classify
• H2 – Classifies, but with thin margin
• H3 – Best among all. Maximum margin.

Data Dimension | Hyperplane dimension | Hyperplane
2 | 1 | Line
3 | 2 | Plane

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Maximum margin classifier Support Vectors

The data points or vectors that are the closest to


the hyperplane and which affect the position of
the hyperplane are termed as Support Vector.

Since these vectors decide the hyperplane, they


are called Support vectors.

Suppose the data is perfectly linearly separable


• There will be an infinite number of such hyperplanes
• There will be one with the biggest gap or margin between the two classes.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Classification using hyperplane

• In 2 dimensions, let’s assume a hyperplane β0 + β1X1 + β2X2 = 0
• If a point X (x1, x2) satisfies the above equation, X lies on the hyperplane
• If X makes the equation > 0, X lies on one side of the hyperplane
• If X makes it < 0, X lies on the other side
• For example, for the hyperplane 1 + 2X1 + 3X2 = 0
  • Blue region: set of points for which 1 + 2X1 + 3X2 > 0
  • Purple region: set of points for which 1 + 2X1 + 3X2 < 0

Classification – a maximization problem

maximize M over β0, β1, …, βp
subject to
  Σ_{j=1}^p βj² = 1 AND yi × (β0 + β1xi1 + β2xi2 + ⋯ + βpxip) ≥ M ∀ i = 1, …, n

1. With this normalization, the distance of a point from the hyperplane is given by yi × (β0 + β1xi1 + β2xi2 + ⋯ + βpxip)
2. β0 + β1xi1 + β2xi2 + ⋯ + βpxip > 0 ⇒ y = +1
3. In the figure there are 3 Support Vectors
4. The optimal separating classifier depends only on these vectors
5. M is the distance from a support vector to the hyperplane
6. 2*M is the width of the margin (distance between the margins)
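A sketch of the maximal margin classifier with a linear `SVC` on small, assumed toy data; a very large `C` approximates a hard margin, and `support_vectors_` holds the few points that determine the hyperplane.

```python
# Linear SVM on two separable groups of assumed points.
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]   # two separable groups
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6).fit(X, y)            # large C ~ hard margin

print(len(clf.support_vectors_))      # only a few points act as support vectors
print(clf.predict([[0, 0], [5, 5]]))  # points deep inside each region
```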

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Non-separable data Noisy data


• Support vector classifiers are sensitive to individual observations.
• Maximal margin classifier works only if a separating hyperplane exists.
• Sometimes, data is separable, but noisy. Right: an additional observation is added.
• Here, the data is not separable by a linear boundary.
• The hyperplane shifts dramatically. The margin is also very small. Higher margin means a robust and reliable classifier.
• Solution creating a hyperplane that almost separates. • Solution: Allow a few mis-classifications to maximize the margin. Soft Margin Classifier.
• It is called soft margin.
(Scatter plot over X1 and X2 of data not separable by a linear boundary omitted.)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Soft Margin Classifier

• It is worthwhile to misclassify a few training observations in order to do a better job in classifying the majority of observations.
• The margin is kept soft. That is, it can be violated by some observations.
• Sometimes, some observations can be on the wrong side of the margin (points 1 and 8) or on the wrong side of the hyperplane (points 11 and 12).

The Slack Variable

maximize M over β0, β1, …, βp, ε1, …, εn
s.t. Σ_{j=1}^p βj² = 1
     yi(β0 + β1xi1 + β2xi2 + ⋯ + βpxip) ≥ M(1 − εi) ∀ i = 1, …, n
     εi ≥ 0 and Σ εi ≤ C

The slack variable εi
• Allows the observation to be on the wrong side of the margin or the hyperplane
• If εi = 0, the i-th observation is on the right side of the margin
• If εi > 0, the i-th observation is on the wrong side of the margin
• If εi > 1, the i-th observation is on the wrong side of the hyperplane

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

The Tuning Parameter / Effect of C on margin

maximize M over β0, β1, …, βp, ε1, …, εn
s.t. Σ_{j=1}^p βj² = 1
     yi(β0 + β1xi1 + β2xi2 + ⋯ + βpxip) ≥ M(1 − εi) ∀ i = 1, …, n
     εi ≥ 0 and Σ εi ≤ C

The tuning parameter C
• Controls the error/margin tradeoff, that is, the bias-variance trade-off. How?
• Think of C as a budget for the amount that the margin can be violated by the n observations
• Higher the C value, more the budget, wider the margin
• C = 0 means there is no budget for violating the margin
• For C > 0, no more than C training points can be on the wrong side of the hyperplane
• C is chosen by cross-validation
• As C increases, the bias potentially increases and the variance may decrease
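A cautionary sketch: scikit-learn's `C` is a penalty on margin violations, the inverse of the budget C above, so a small sklearn `C` behaves like a large budget (a softer, wider margin resting on more support vectors). The overlapping data generated here is assumed for illustration.

```python
# Number of support vectors under a heavy vs. light violation penalty.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.2, (40, 2)),      # class 0 cloud
               rng.normal(2.0, 1.2, (40, 2))])     # overlapping class 1 cloud
y = np.array([0] * 40 + [1] * 40)

n_sv = {c: int(SVC(kernel="linear", C=c).fit(X, y).n_support_.sum())
        for c in (0.01, 100.0)}
print(n_sv)   # the softer margin (small sklearn C) rests on more support vectors
```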
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
BITS Pilani
Pilani Campus

ZG 512
Predictive Analytics Unsupervised Learning
BITS Pilani Pravin Mhaske
Lecture 13 Principal Components Analysis
Pilani Campus

Supervised Vs Unsupervised

• We observe both a set of features X1, X2, . . . , Xp for each object, as well as a response or
outcome variable Y. The goal is then to predict Y using X1, X2, . . . , Xp.

• Here we instead focus on unsupervised learning, where we observe only the features X1, X2,
. . . , Xp. We are not interested in prediction, because we do not have an associated response
variable Y.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
Challenges of Unsupervised Learning Goals of Unsupervised Learning
• Unsupervised learning is more subjective than supervised learning, as there is no simple goal
for the analysis, such as prediction of response.
• The goal is to discover interesting things about the measurements:
• Difficult to assess the results. Nobody knows the true answer. No universally accepted
• Is there an informative way to visualize the data?
mechanism for cross-validation.
• Can we discover subgroups among the variables or among the observations?
• Unsupervised learning is often performed as part of an exploratory data analysis.
• Can we interpret these subgroups?
• But techniques for unsupervised learning are of growing importance in a number of fields:
• subgroups of breast cancer patients grouped by their gene expression measurements
• We discuss two methods:
• groups of shoppers characterized by their browsing and purchase histories
• principal components analysis, a tool used for data visualization or data pre-
• movies grouped by the ratings assigned by movie viewers
processing before supervised techniques are applied, and
• Search engine will display results based on search history of other users with similar search
• clustering, a broad class of methods for discovering unknown subgroups in
patterns
data.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Dimensionality Reduction Dimensionality Reduction


• Modern methods generate very high-dimension data • Original data (2 dimensions)
• Computationally expensive, difficult to interpret, difficult to visualize • Two Principal Components
• Dimensionality needs to be reduced, without losing much information • Data projected on first PC (only one dimension)
• Multiple methods for dimensionality reduction • Even in high dimensions, generally, first 2-3 components explain a high proportion of variance, and
makes it easy to plot
• PCA – one of the most popular method based on vectors, matrices, eigen values and eigenvectors
• Principal Components are the vectors that define a new coordinate system
• First axis (PC) goes in the direction of highest variance in the data, followed by second and so on…
• All axes/PCs are orthogonal to each other…every new PC contains information that is not present in the
earlier PCs.
• Data is projected on principal components
• https://www.youtube.com/watch?v=fNk_zzaMoSs&t=21s
• https://www.youtube.com/watch?v=FD4DeN81ODY

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Principal Components Analysis Why Principal Components?

• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear • Say, we have n observations and p features and we wish to plot scatterplots for all pairs.
combinations of the variables that have maximal variance and are mutually uncorrelated.
• If p = 10, it will create 10C2 = 45 plots! This grows exponentially with p.
• When there is a large set of correlated variables, principal components allow summarizing those by a
• Each feature contains a small fraction of information. And not all dimensions are equally interesting.
smaller set of representative variables. Dimensionality reduction.
• We need a lower dimension representation of data without losing much information.
• We use these new variables in another supervised learning problem.
• PCA does this job. Finds small number of dimensions that are as interesting as possible.
• PCA – a process of computing the principal components and using those.
• Interesting = how much the observation vary in each direction.
• Unsupervised because only X1, X2, X3… Xp are available. No label.
• Each new dimension found is a linear combination of the p features. It is a Principal Component.
• Apart from producing derived variables for use in supervised learning problems, PCA also serves as a
tool for data visualization.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Principal Components PCA example

• The first principal component found is a normalized linear combination of all features. • The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles.
The green solid line indicates the first principal component direction, and the blue dashed line
Z1 = Φ11X1 + Φ21X2 +…+ Φp1Xp
indicates the second principal component direction.
• The second principal component looks like (just to understand the notations)
Z2 = Φ12X1 + Φ22X2 +…+ Φp2Xp
• Normalized means Σ_{j=1}^p Φj1² = 1. Φ11, Φ21, …, Φp1 are called the loadings (or eigenvectors) of the first PC.
• Together, the loadings make up the principal component loading vector,
  Φ1 = (Φ11, Φ21, …, Φp1)ᵀ
• We constrain the sum of the squared loadings to equal 1.
• These Zs are constructs and are not directly observable.
• All these Zs are mutually uncorrelated (their loading vectors are orthogonal to each other).
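A sketch with scikit-learn's `PCA`: the rows of `components_` are the loading vectors Φ, each of unit length and mutually orthogonal. The 3-feature data set with one dominant direction is assumed for illustration.

```python
# PCA loadings: unit length, orthogonal, variance-ordered.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 1))                  # variation along one direction
X = scores @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(100, 3))

pca = PCA().fit(X)
phi1, phi2 = pca.components_[0], pca.components_[1]

print(np.sum(phi1 ** 2))              # sums to 1: the loading vector is normalized
print(phi1 @ phi2)                    # ~0: loading vectors are orthogonal
print(pca.explained_variance_ratio_)  # PC1 explains almost all the variance here
```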
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
PCA example / Geometry of PCA and the scores

(Scatter plot of Variable X1 vs Variable X2 with the PC 1 and PC 2 directions omitted.)

• Z1 = φ11X1 + φ21X2 + . . . + φp1Xp
• The loading vector φ1 with elements φ11, φ21, . . . , φp1 defines a direction in feature space along which the data vary the most.
• If we project the n data points x1, . . . , xn onto this direction, the projected values are the first principal component scores z11, . . . , zn1 themselves.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

PCA – two borderline cases Computation of Principal Components*

• Only one PC
• Two almost equal PCs

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Computation of Principal Components* Further Principal Components*

• The second principal component is the linear combination of X1, . . . , Xp that


has maximal variance among all linear combinations that are uncorrelated
with Z1.
• The second principal component scores z12, z22, . . . , zn2 take the form zi2 =
φ12xi1 + φ22xi2 + . . . + φp2xip, where φ2 is the second principal component
loading vector, with elements φ12, φ22, . . . , φp2.
• It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to
constraining the direction φ2 to be orthogonal (perpendicular) to the direction
φ1. And so on.
• The principal component directions φ1, φ2, φ3, . . . are the ordered sequence
of right singular vectors of the matrix X, and the variances of the components
are 1/n times the squares of the singular values. There are at most min(n − 1,
p) principal components.

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Clustering
BITS Pilani
Pilani Campus

• Clustering refers to a very broad set of techniques for finding subgroups, or clusters,
in a data set.
• We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.
• We must define what it means for two or more observations to be similar or different.
• This is often a domain-specific consideration that must be made based on knowledge
of the data being studied.
• Finally, we must be able to interpret the clusters for the exercise to be useful
Unsupervised Learning • Example, Customer/market/industry segmentation, document clustering, fake news,
fraud detection
Lecture 14 Clustering

BITS Pilani, Pilani Campus

Clustering Distance Metrics

• Clustering is an unsupervised learning technique


Euclidean:
• It is the task of grouping together a set of objects in such a way that elements in the
• Straight line distance
same group are more similar to each other than the elements in other groups.
• Scale variant, Sensitive to data dimensionality
• Similarity is a metric that reflects the strength of relationships between two data points • Normalization (scaling) can solve this issue

Manhattan:
• Taxicab or cityblock distance
• Sum of absolute differences of cartesian coordinates

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


K-means Clustering K-means Clustering

❑ A centroid-based clustering method Objective: Minimize Sum of Squared distances

❑ A centroid is a data point at the center of a cluster. • Step 0: Create random k centroids

❑ In centroid-based clustering, clusters are represented by a central vector or a centroid. Repeat iteratively till convergence (no further improvement/movement)

This centroid might not necessarily be a point of the dataset. • Step 1: Assign points to the closest centroid

❑ Centroid-based clustering is an iterative algorithm in which the notion of similarity is • Step 2: Re-calculate and move centroids

derived by how close a data point is to the centroid of the cluster.

❑ Objective is to minimize the sum of squared distances (distance between every point and the corresponding cluster center).
• http://alekseynp.com/viz/k-means.html
• https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

What’s the right k? – Elbow Method Global and Local Minima

✓ The plot represents the variance within the clusters.
✓ It decreases as k increases, but a bend (or “elbow”) can be seen at k = 4.
✓ This bend indicates that additional clusters beyond the fourth have little value.
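The elbow curve can be sketched with scikit-learn, where `inertia_` is exactly the within-cluster sum of squared distances that k-means minimizes; four well-separated blobs are generated here as assumed example data.

```python
# Within-cluster sum of squares for k = 1..7 (the elbow method).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}
print(inertias)   # falls sharply up to k = 4, then flattens: the elbow
```

Plotting `inertias` against k reproduces the elbow plot described above.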

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


The Problem – Local Minima The Solution – k-Means++

Smarter initialization of centroids!


Steps:
1. Randomly select the first centroid from the data points.
2. For each data point, compute its distance from the nearest, previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point as
centroid is directly proportional to its distance from the nearest, previously chosen centroid.
(i.e. the point having maximum distance from the nearest centroid is most likely to be selected
next as a centroid)
4. Repeat steps 2 and 3 until k centroids have been sampled
5. Continue with standard k-means.
Reason? – k-Means is sensitive to the initialization of centroid.
Some centroids may not get any points at all. Some clusters get split into multiple clusters.
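scikit-learn applies the seeding above by default (`init="k-means++"`); a minimal sketch comparing it with purely random centroids, each run once (`n_init=1`), on assumed blob data:

```python
# k-means++ seeding vs. random initial centroids, one run each.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=2)

plus = KMeans(n_clusters=5, init="k-means++", n_init=1, random_state=0).fit(X)
rand = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X)

print(plus.inertia_, rand.inertia_)   # smarter seeding rarely ends up worse
```

For any single seed either run can win; over many seeds k-means++ lands in poor local minima far less often.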

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Exercise / Density-based Clustering

(Scatter plot for the clustering exercise omitted.)

• k-means may, at times, be insufficient to learn the unknown trends in the data.
• It assumes clusters to be spherical, which may not always be true.
• Solution: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


DBSCAN DBSCAN

• Philosophy: If a particular point belongs to a cluster, it should be near to lots of other points in that cluster.
• It groups closely packed data points together based on two parameters: a distance measure (epsilon) and a density threshold (minPoints).
• Noise point: If a newly selected point doesn’t have at least minPoints within epsilon, it remains unclustered.
• Border point: Two clusters may compete for a point that qualifies for both; such a point may end up in either cluster, essentially arbitrarily.
Steps:
1. Select two parameters: Epsilon and minPoints. • https://scikit-learn.org/dev/auto_examples/cluster/plot_dbscan.html
2. Pick a random point.
3. If there are at least minPoints points (including the point itself) within distance epsilon, merge them into a cluster.
4. Keep repeating step 3 iteratively for all the new points included in the cluster. If no more points
found within epsilon, stop.
5. Select a new random unclustered point as a new cluster.
6. Repeat steps 3, 4 and 5.
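The steps above can be sketched with scikit-learn's `DBSCAN` on two half-moons, a non-spherical shape where k-means struggles; `eps` and `min_samples` correspond to epsilon and minPoints, and the data set and parameter values are assumed for illustration.

```python
# DBSCAN on two interleaving half-moons.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 marks noise points
print(n_clusters)                  # the two moons are discovered, no k required
print(int(np.sum(labels == -1)))   # noise points, if any
```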

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

K-Means Vs DBSCAN

                     kMeans                            DBSCAN
Type                 Centroid-based                    Density-based
Cluster Shape        Spherical/Convex                  Irregular
Number of clusters   To be specified                   Decided by algorithm
Dimensions           Can handle high dimension         Not efficient for high dimension
Outliers             Get clustered                     Become noise points
Anomaly detection    Not useful                        Highlights anomalous points
When to use?         K is known and data has           K is unknown and data has
                     spherical shape                   irregular shape

ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus

BITS Pilani, Pilani Campus


Linear Models
BITS Pilani
Pilani Campus

• Simple to implement and interpret.


• But has limitations. Real relations may not be always linear. We only make an
assumption of linearity.
• Can the linearity assumption be relaxed?
• Extension to linear model – polynomial regression and step function
• Polynomial regression extends the linear model by adding extra predictors.
• The extra predictors are higher order versions of original predictors X1, X2…
• You can still make a ‘Linear’ Regression model.
Lecture 16 Beyond Linearity

BITS Pilani, Pilani Campus

Non-linear Relationships

Extend the linear model to accommodate non-linear relationships. Perform EDA to find patterns in the relationships.
• Polynomial: fits a higher-degree polynomial function to the data
  y = β0 + β1x + β2x² + ….. + βnxⁿ
• Exponential: y = αe^(βx)
• Logarithmic: y = α + β ln(x)
• Power: y = αx^β
• Generalized additive model (GAM): combines multiple regression models for a complex relationship
  y = f1(x1) + f2(x2) + … + fn(xn)

Polynomial Regression

• The relationship between X and y is modeled as an nth-degree polynomial.
• The ‘line of best fit’ is not a line, rather a polynomial curve.
• Replaces the standard linear regression equation y = β0 + β1X1 + β2X2 + ….. + βnXn by a higher-degree polynomial function y = β0 + β1X + β2X² + ….. + βnXⁿ
• It is still only a linear model, with new predictors created from the original predictors.
• Generally not recommended to go beyond degree 3 or 4; higher degrees make the curve overly flexible and prone to overfitting.
• How to choose the correct degree? Compare candidate degrees on validation data (cross-validation).
• This may also be called feature engineering.
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Polynomial Regression Example

The figure suggests that
mpg = β0 + β1 × horsepower + β2 × horsepower² + ε
may provide a better fit.

Exercise: explore relationships for other predictors.

Years of Experience   Salary (in dollars)
1                     50000
2                     55000
3                     65000
4                     80000
5                     110000
6                     150000
7                     200000
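A sketch fitting the salary table above: `PolynomialFeatures` adds an X² column, so the model stays linear in its (new) predictors, exactly as the slides describe.

```python
# Degree-2 polynomial regression on the experience/salary table.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5], [6], [7]], dtype=float)       # years
y = np.array([50000, 55000, 65000, 80000, 110000, 150000, 200000],
             dtype=float)                                            # salary

X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # [X, X^2]
quad = LinearRegression().fit(X2, y)
lin = LinearRegression().fit(X, y)

print(lin.score(X, y), quad.score(X2, y))   # the quadratic fit has the higher R^2
```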

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


AI vs ML vs DL

Understanding Deep
Neural Networks

Why Deep Learning is a buzz now? Relation with Data

ANN ANN

● A large part of this section will focus on ● Theory Topics


theory behind many of the ideas we will ○ Perceptron Model to Neural Networks
implement with code. ○ Activation Functions
● Let’s do a quick review of how we will ○ Cost Functions
gradually build an understanding of ○ Feed Forward Networks
artificial neural networks. ○ BackPropagation

Perceptron model

● To begin understanding deep learning, we


will build up our model abstractions:
Perceptron Model ○ Single Biological Neuron
○ Perceptron
○ Multi-layer Perceptron Model
○ Deep Learning Neural Network
Perceptron model Perceptron model

● As we learn about more complex models, ● If the whole idea behind deep learning is to
we’ll also introduce concepts, such as: have computers artificially mimic biological
○ Activation Functions natural intelligence, we should probably
○ Gradient Descent build a general understanding of how
○ BackPropagation biological neurons work!

Perceptron model Perceptron model


● Illustration of biological neurons
● Stained Neurons in cerebral cortex

Perceptron model Perceptron model


● Illustration of biological neurons ● Let’s really simplify this!

Perceptron model Perceptron model


● Simplified Biological Neuron Model
● A perceptron was a form of neural network
introduced in 1958 by Frank Rosenblatt.
● Amazingly, even back then he saw huge
Dendrites potential:
Axon
○ "...perceptron may eventually be able
to learn, make decisions, and translate
Nucleus languages."
Perceptron model Perceptron model

● However, in 1969 Marvin Minsky and ● Fortunately for us, we now know the
Seymour Papert published their book amazing power of neural networks, which
Perceptrons. all stem from the simple perceptron model,
● It suggested that there were severe so let’s head back and convert our simple
limitations to what perceptrons could do. biological neuron model into the
● This marked the beginning of what is perceptron model.
known as the AI Winter, with little funding
into AI and Neural Networks in the 1970s.

Perceptron model Perceptron model


● Perceptron Model ● Perceptron Model

Dendrites Inputs Output


Axon

Nucleus

Perceptron model Perceptron model


● Let’s work through a simple example ● Let’s work through a simple example

x1 x1
Inputs Output Inputs f(X) Output

x2 x2

Perceptron model Perceptron model


● Let’s work through a simple example ● If f(X) is just a sum, then y=x1+x2

x1 x1
y y
Inputs f(X) Output Inputs f(X) Output

x2 x2
Perceptron model Perceptron model
● Realistically, we would want to be able to ● Let’s add an adjustable weight we multiply
adjust some parameter in order to “learn” against x
w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2

x2 x2

Perceptron model Perceptron model


● Now y = x1w1 + x2w2 ● We could update the weights to affect y

w1 w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2 w2

x2 x2

Perceptron model Perceptron model


● But what if an x is zero? w won’t change ● Let’s add in a bias term b to the inputs.
anything!
w1 w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2 w2

x2 x2

Perceptron model Perceptron model


● Let’s add in a bias term b to the inputs. ● y = (x1w1 + b) + (x2w2 + b)

*w1 + b *w1 + b
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output

x2 *w2 + b x2 *w2 + b
Perceptron model Perceptron model
● We can expand this to a generalization:
● We’ve been able to model a biological
neuron as a simple perceptron!
x1 Mathematically our generalization was:
*w1 + b y
Inputs f(X) Output
x2 *w2 + b

xn *wn + b

Perceptron model Perceptron model

● Later on we will see how we can expand ● Also we’ll see we can even simplify the bias
this model to have X be a tensor of to be at a layer level instead of a bias per
information (an n-dimensional matrix). input.

Perceptron model Perceptron model


● Notice how at the end we sum a bunch of ● Theoretically for any number of biases,
biases… there exists a bias that is the sum.
x1 x1
*w1 + b y *w1 + b y
Inputs f(X) Output Inputs f(X) Output
x2 *w2 + b x2 *w2 + b

xn *wn + b xn *wn + b

Perceptron model Perceptron model


● Theoretically for any number of biases, ● Theoretically for any number of biases,
there exists a bias that is the sum. there exists a bias that is the sum.
x1 x1
*w1 y *w1 y
Inputs f(X) Output Inputs f(X) Output
x2 *w2 x2 *w2

xn *wn xn *wn
B B = b1 + b2 + … +bn
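The generalized perceptron above (y = x1w1 + x2w2 + … + xnwn + B, with one layer-level bias) can be sketched in a few lines of Python. This is an illustrative sketch, not code from any library; the function name `perceptron` is just a placeholder:

```python
# Minimal sketch of the generalized perceptron: y = x1*w1 + ... + xn*wn + B
def perceptron(x, w, b):
    """Weighted sum of the inputs plus a single layer-level bias B."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b

# Example: two inputs, two weights, one consolidated bias
y = perceptron([1.0, 2.0], [0.5, -0.25], b=0.1)  # 1*0.5 + 2*(-0.25) + 0.1 = 0.1
```

Note how the per-input biases b1…bn collapse into the single term `b`, exactly as the slides argue.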
Perceptron model

● Let’s review what we learned:


○ We understand the very basics of a
biological neuron Neural Networks
○ We saw how we can create a simple
perceptron model replicating the core
concepts behind a neuron.

Neural Networks Neural Networks

● A single perceptron won’t be enough to ● A single perceptron won’t be enough to


learn complicated systems. learn complicated systems.
● Fortunately, we can expand on the idea of ● Fortunately, we can expand on the idea of
a single perceptron, to create a multi-layer a single perceptron, to create a multi-layer
perceptron model. perceptron model.
● We’ll also introduce the idea of activation
functions.

Neural Networks Neural Networks

● To build a network of perceptrons, we can ● The outputs of one perceptron are directly
connect layers of perceptrons, using a fed into as inputs to another perceptron.
multi-layer perceptron model.

Neural Networks Neural Networks

● This allows the network as a whole to learn ● The first layer is the input layer
about interactions and relationships
between features.
Neural Networks Neural Networks

● The last layer is the output layer. ● Layers in between the input and output
● Note: This last layer can be more than one layers are the hidden layers.
neuron

Neural Networks Neural Networks

● Hidden layers are difficult to interpret, due ● Neural Networks become “deep neural
to their high interconnectivity and distance networks" if they contain 2 or more
away from known input or output values. hidden layers.

Neural Networks Neural Networks

● Neural Networks become “deep neural ● Terminology:


networks” if then contain 2 or more ○ Input Layer: First layer that directly
hidden layers. accepts real data values
○ Hidden Layer: Any layer between
input and output layers
○ Output Layer: The final estimate of the
output.

Relation with Data Neural Networks

● What is incredible about the neural


network framework is that it can be used
to approximate any function.
● Zhou Lu and later on Boris Hanin proved
mathematically that Neural Networks can
approximate any continuous
function.
Neural Networks Neural Networks

● For more details on this check out the ● Previously in our simple model we saw that
Wikipedia page for “Universal the perceptron itself contained a very
Approximation Theorem” simple summation function f(x).
● For most use cases however that won’t be
useful, we’ll want to be able to set
constraints to our output values, especially
in classification tasks.

Neural Networks

● In classification tasks, it would be useful


to have all outputs fall between 0 and 1.
● These values can then represent probability Activation Functions
assignments for each class.
● In the next lecture, we’ll explore how to use
activation functions to set boundaries to
output values from the neuron.

Neural Networks Neural Networks

● Recall that inputs x have a weight w and a ● Which means we have


bias term b attached to them in the ○ x*w + b
perceptron model. ● Clearly w implies how much weight or
● Which means we have strength to give the incoming input.
○ x*w + b ● We can think of b as an offset value,
making x*w have to reach a certain
threshold before having an effect.

Neural Networks Neural Networks

● For example if b= -10 ● Next we want to set boundaries for the


○ x*w + b overall output value of:
● Then the effects of x*w won’t really start ○ x*w + b
to overcome the bias until their product ● We can state:
surpasses 10. ○ z = x*w + b
● After that, then the effect is solely based ● And then pass z through some
on the value of w. activation function to limit its value.
● Thus the term “bias”
Neural Networks Perceptron model
● Recall our simple perceptron has an f(X)
● A lot of research has been done into
activation functions and their effectiveness. w1
● Let’s explore some common activation x1
functions. y
Inputs f(X) Output
w2
+b
x2

Perceptron model Neural Networks


● If we had a binary classification problem,
● To avoid confusion, let’s define the total
we would want an output of either 0 or 1.
w1 inputs as a variable z.
x1 ● Where z = wx + b
y ● In this context, we’ll then refer to activation
Inputs f(X) Output
w2 functions as f(z).
+b ● Keep in mind, you will often see these
x2
variables capitalized f(Z) or X to denote a
tensor input consisting of multiple values.

Deep Learning Deep Learning

● The most simple networks rely on a basic ● Regardless of the values, this always
step function that outputs 0 or 1. outputs 0 or 1.
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● This sort of function could be useful for ● However this is a very “strong” function,
classification (0 or 1 class). since small changes aren’t reflected.
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b
Deep Learning Deep Learning

● There is just an immediate cut off that ● It would be nice if we could have a more
splits between 0 and 1. dynamic function, for example the red
1 line! 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● Lucky for us, this is the sigmoid function! ● Changing the activation function used can
be beneficial depending on the task!
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● This still works for classification, and will be ● Let’s discuss a few more activation
more sensitive to small changes. functions that we’ll encounter!
1 1
Output Output

0 0
0 0
z = wx + b z = wx + b

Deep Learning Deep Learning

● Hyperbolic Tangent: tanh(z) ● Hyperbolic Tangent: tanh(z)


● Outputs between -1 and 1 instead of 0 to 1
1 1
Output Output

-1 -1
0 0
Deep Learning Deep Learning

● Rectified Linear Unit (ReLU): This is ● ReLu has been found to have very good
actually a relatively simple function: performance, especially when dealing
max(0,z) with the issue of vanishing gradient.
Output ● We’ll often default to ReLu due to its
overall good performance.
0
z = wx + b

Deep Learning

● For a full list of possible activation functions


check out:
● en.wikipedia.org/wiki/Activation_function
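The four activation functions discussed so far (step, sigmoid, tanh, ReLU) can be written directly from their definitions. A minimal sketch using only the standard library:

```python
import math

def step(z):
    """Basic step function: outputs 0 or 1 with an immediate cutoff at z = 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Smooth curve with outputs in (0, 1); sensitive to small changes in z."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: outputs between -1 and 1 instead of 0 to 1."""
    return math.tanh(z)

def relu(z):
    """Rectified Linear Unit: simply max(0, z)."""
    return max(0.0, z)
```

Each takes the same input z = wx + b; only the shape of the output boundary differs.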

Deep Learning

Multi-Class ● Notice all these activation functions make


sense for a single output, either a
Activation Functions continuous label or trying to predict a
binary classification (either a 0 or 1).
● But what should we do if we have a multi-
class situation?

Deep Learning Deep Learning

● There are 2 main types of multi-class ● Non-Exclusive Classes


situations ■ A data point can have multiple
○ Non-Exclusive Classes classes/categories assigned to it
■ A data point can have multiple ○ Photos can have multiple tags (e.g.
classes/categories assigned to it beach, family, vacation, etc…)
○ Mutually Exclusive Classes
■ Only one class per data point.
Deep Learning Deep Learning

● Mutually Exclusive Classes ● Organizing Multiple Classes


■ A data point can only have one ○ The easiest way to organize multiple
class/category assigned to it classes is to simply have 1 output node
○ Photos can be categorized as being in per class.
grayscale (black and white) or full color
photos. A photo can not be both at the
same time.

Neural Networks Neural Networks

● Previously we thought of the last output ● This single node could output a continuous
layer as a single node. regression value or binary classification (0
or 1).

Neural Networks Multiclass Classification

● Let’s expand this output layer to work for ● Organizing for Multiple Classes
the case of multi-classification.

Hidden Layers

Multiclass Classification Deep Learning

● Organizing for Multiple Classes ● Organizing Multiple Classes


Class One
○ This means we will need to organize
categories for this output layer.
Class Two ○ We can’t just have categories like “red”,
“blue”, “green”, etc...
Hidden Layers
Class N
Deep Learning Deep Learning

● Organizing Multiple Classes ● Mutually Exclusive Classes


○ Instead we use one-hot encoding
Data Point 1 RED
○ Let’s take a look at what this looks like Data Point 2 GREEN

for mutually exclusive classes. Data Point 3 BLUE

... ...

Data Point N RED

Deep Learning Deep Learning

● Mutually Exclusive Classes ● Non-Exclusive Classes


RED GREEN BLUE A B C
Data Point 1 RED Data Point 1 A,B
Data Point 1 1 0 0 Data Point 1 1 1 0
Data Point 2 GREEN Data Point 2 A
Data Point 2 0 1 0 Data Point 2 1 0 0
Data Point 3 BLUE Data Point 3 C,B
Data Point 3 0 0 1 Data Point 3 0 1 1
... ... ... ...
... ... ... ... ... ... ... ...
Data Point N RED Data Point N B
Data Point N 1 0 0 Data Point N 0 1 0
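The two encodings in the tables above can be sketched as small helper functions. These names (`one_hot`, `multi_hot`) are illustrative, not from a particular library:

```python
def one_hot(labels, classes):
    """Mutually exclusive classes: exactly one 1 per row."""
    return [[1 if c == lab else 0 for c in classes] for lab in labels]

def multi_hot(label_sets, classes):
    """Non-exclusive classes: a row can have several 1s."""
    return [[1 if c in labs else 0 for c in classes] for labs in label_sets]

# Mutually exclusive: RED -> [1,0,0], GREEN -> [0,1,0]
one_hot(["RED", "GREEN"], ["RED", "GREEN", "BLUE"])
# Non-exclusive: {A,B} -> [1,1,0], {A} -> [1,0,0]
multi_hot([{"A", "B"}, {"A"}], ["A", "B", "C"])
```

One output node per class then simply targets one column of the encoded matrix.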

Deep Learning Deep Learning

● Now that we have our data correctly ● Non-exclusive


organized, we just need to choose the ○ Sigmoid function
correct classification activation function ■ Each neuron will output a value
that the last output layer should have. between 0 and 1, indicating the
probability of having that class
assigned to it.

Multiclass Classification Multiclass Classification

● Sigmoid Function for Non-Exclusive ● Sigmoid Function for Non-Exclusive


Classes Classes 1
Class One Class One 0.8
0

1
Class Two Class Two 0.2
0

Hidden Layers Hidden Layers


1
Class N Class N 0.3
0
Multiclass Classification Deep Learning

● Sigmoid Function for Non-Exclusive ● Non-exclusive


Classes 1
○ Sigmoid function
Class One 0.8
0
■ Keep in mind this allows each neuron
1
Class Two 0.6 to output independent of the other
0
classes, allowing for a single data
Hidden Layers
Class N
1 point fed into the function to have
0.2
0 multiple classes assigned to it.

Deep Learning Deep Learning

● Mutually Exclusive Classes ● Softmax Function


○ But what do we do when each data
point can only have a single class
assigned to it?
○ We can use the softmax function for
this!

Deep Learning Deep Learning

● Mutually Exclusive Classes


○ The softmax function calculates the
probability distribution of an event
over K different events.
○ This function will calculate the
probabilities of each target class over all
possible target classes.

Deep Learning Deep Learning

● Mutually Exclusive Classes ● Mutually Exclusive Classes


○ The range will be 0 to 1, and the sum of ○ The main thing to keep in mind is that if
all the probabilities will be equal to you use softmax for multi-class
one. problems you get this sort of output:
○ The model returns the probabilities of ■ [Red , Green , Blue]
each class and the target class chosen ■ [ 0.1 , 0.6 , 0.3 ]
will have the highest probability.
Deep Learning

● Mutually Exclusive Classes


○ The probabilities for each class all sum
up to 1. We choose the highest
probability as our assignment.
■ [Red , Green , Blue]
■ [ 0.1 , 0.6 , 0.3 ]
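The softmax function described above (outputs in (0, 1), summing to 1, highest value chosen as the assigned class) can be sketched with the standard library:

```python
import math

def softmax(z):
    """Exponentiate each value and normalize, so outputs lie in (0, 1) and sum to 1."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# The largest input always receives the highest probability,
# so argmax of the softmax output gives the predicted class.
probs = softmax([1.0, 2.0, 0.5])
```

Contrast this with the sigmoid case: softmax couples the outputs (they must sum to 1), which is why it suits mutually exclusive classes.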

Deep Learning

● Review
○ Perceptrons expanded to neural
network model
○ Weights and Biases
○ Activation Functions
○ Time to learn about Cost Functions!
Deep Learning

Cost Functions and ● We now understand that neural


networks take in inputs, multiply them
Gradient Descent by weights, and add biases to them.
● Then this result is passed through an
activation function which at the end of
all the layers leads to some output.

Deep Learning Deep Learning

● This output ŷ is the model’s estimation of ● We need to take the estimated outputs
what it predicts the label to be. of the network and then compare them
● So after the network creates its to the real values of the label.
prediction, how do we evaluate it? ● Keep in mind this is using the training
● And after the evaluation how can we data set during the fitting/training of the
update the network’s weights and model.
biases?

Deep Learning Deep Learning

● The cost function (often referred to as a ● We’ll use the following variables:
loss function) must be an average so it ○ y to represent the true value
can output a single value. ○ a to represent neuron’s prediction
● We can keep track of our loss/cost ● In terms of weights and bias:
during training to monitor network ○ w*x + b = z
performance. ○ Pass z into activation function σ(z) = a

Deep Learning Deep Learning

● One very common cost function is the ● We simply calculate the difference
quadratic cost function: between the real values y(x) against our
predicted values a(x).
Deep Learning Deep Learning

● Note: The notation shown here ● Notice how squaring this does 2 useful
corresponds to vector inputs and things for us, keeps everything positive
outputs, since we will be dealing with a and punishes large errors!
batch of training points and predictions.
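The quadratic cost described above, averaging squared differences between the true values y(x) and the predictions a(x), can be sketched as follows. The 1/(2n) scaling is one common convention; other texts use 1/n:

```python
def quadratic_cost(y_true, y_pred):
    """Quadratic (MSE-style) cost: average of squared errors over the batch.

    Squaring keeps every term positive and punishes large errors.
    """
    n = len(y_true)
    return sum((y - a) ** 2 for y, a in zip(y_true, y_pred)) / (2 * n)

# Example: two training points with small prediction errors
c = quadratic_cost([1.0, 0.0], [0.8, 0.2])  # (0.2^2 + 0.2^2) / 4 = 0.02
```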

Deep Learning Deep Learning

● We can think of the cost function as: ● W is our neural network's weights, B is
our neural network's biases, Sr is the
input of a single training sample, and Er
is the desired output of that training
sample.

Deep Learning Deep Learning

● Notice how that information was all ● This means that if we have a huge
encoded in our simplified notation. network, we can expect C to be quite
● The a(x) holds information about complex, with huge vectors of weights
weights and biases. and biases.

Deep Learning Deep Learning

● Here is a small network with all its ● That is a lot to calculate!


parameters labeled: ● How do we solve this?
Deep Learning Deep Learning

● In a real case, this means we have some ● For simplicity, let’s imagine we only had
cost function C dependent on lots of one weight in our cost function w.
weights! ● We want to minimize our loss/cost
○ C(w1,w2,w3,....wn) (overall error).
● How do we figure out which weights ● Which means we need to figure out
lead us to the lowest cost? what value of w results in the minimum
of C(w)

Deep Learning Deep Learning

● Our “simple” function C(w) ● What value of w minimizes our cost?


C(w) C(w)

w w

Deep Learning Deep Learning

● What value of w minimizes our cost? ● What value of w minimizes our cost?
C(w) C(w)

w w

wmin

Deep Learning Deep Learning

● Students of calculus know we could take ● But recall our real cost function will be
a derivative and solve for 0. C(w)
very complex! C(w)

w w

wmin
Deep Learning Deep Learning

● And it will be n-dimensional! ● And it will be n-dimensional since our


C(w)
networks will have 1000s of weights C(w)

w w

Deep Learning

● We can use gradient descent to solve


this problem. C(w)

w
Deep Learning

● Let’s go back to our simplified version to


see how this works. C(w)

wmin

Deep Learning Deep Learning

● We can calculate the slope at a point ● We can calculate the slope at a point
C(w) C(w)

w w

wmin wmin
Deep Learning Deep Learning

● Then we move in the downward ● Then we move in the downward


direction of the slope. C(w)
direction of the slope. C(w)

w w

wmin wmin

Deep Learning Deep Learning

● Then we move in the downward ● Until we converge to zero, indicating a


direction of the slope. C(w)
minimum. C(w)

w w

wmin wmin

Deep Learning Deep Learning

● We could have changed our step size to ● Our steps:


find the next point! C(w) C(w)

w w

wmin wmin

Deep Learning Deep Learning

● Smaller steps sizes take longer to find ● Larger steps are faster, but we risk
the minimum. C(w)
overshooting the minimum! C(w)

w w

wmin wmin
Deep Learning Deep Learning

● This step size is known as the learning ● The learning rate we showed in our
rate. C(w) illustrations was constant (each step size
was equal)
● But we can be clever and adapt our step
size as we go along.
w

wmin

Deep Learning Deep Learning

● We could start with larger steps, then go ● In 2015, Kingma and Ba published their
smaller as we realize the slope gets paper: “Adam: A Method for Stochastic
closer to zero. Optimization“.
● This is known as adaptive gradient ● Adam is a much more efficient way of
descent. searching for these minimums, so you
will see us use it for our code!
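The basic descent loop from the illustrations, computing the slope at a point and stepping downhill at a fixed learning rate, can be sketched for the one-weight case. This is the plain (non-adaptive) version, not Adam:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step in the downward direction of the slope of C(w).

    grad: function returning dC/dw at a point
    lr:   the learning rate (the step size from the illustrations)
    """
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move against the slope
    return w

# Example: C(w) = (w - 3)^2 has its minimum at w = 3, with C'(w) = 2*(w - 3)
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

Shrinking `lr` makes convergence slower but safer; enlarging it risks overshooting the minimum, exactly as the slides describe.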

Deep Learning Deep Learning

● Adam versus other gradient descent ● Realistically we’re calculating this


algorithms: descent in an n-dimensional space for all
our weights.

Deep Learning Deep Learning

● When dealing with these N-dimensional ● For classification problems, we often use
vectors (tensors), the notation changes the cross entropy loss function.
from derivative to gradient. ● The assumption is that your model
● This means we calculate ∇C(w1,w2,...wn) predicts a probability distribution p(y=i)
for each class i=1,2,…,C.
Deep Learning Deep Learning

● For a binary classification this results in: ● Review:


○ Cost Functions
○ Gradient Descent
● For M number of classes > 2 ○ Adam Optimizer
○ Quadratic Cost and Cross-Entropy
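The cross-entropy formulas referenced above (the slide's equations are not reproduced in this transcript) follow the standard definitions: for binary classification, -(y log a + (1-y) log(1-a)); for M > 2 classes, the negative log-probability of the true class summed over the one-hot target. A sketch:

```python
import math

def binary_cross_entropy(y, a):
    """Binary cross entropy for a single example: y in {0, 1}, prediction a in (0, 1)."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def categorical_cross_entropy(y_onehot, p):
    """Cross entropy for M > 2 classes: -sum_i y_i * log(p_i) over a one-hot target."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y_onehot, p))

# A confident, correct prediction gives a small loss
loss = categorical_cross_entropy([0, 1, 0], [0.1, 0.6, 0.3])  # = -log(0.6)
```

In practice, predictions are clipped away from exactly 0 or 1 to avoid log(0); libraries handle this internally.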

Deep Learning Deep Learning

● So far we understand how networks can ● The next thing we need to learn about
take in input, affect that input with theory is:
weights, biases, and activation functions ○ Once we get our cost/loss value,
to produce an estimated output. how do we actually go back and
● Then we learned how to evaluate that adjust our weights and biases?
output. ● This is backpropagation, and it is
what we are going to cover next!
