PA DL Consolidated
PA DL Consolidated
PA DL Consolidated
Pilani Campus
ZG 512
Predictive Analytics M1 Predictive Analytics
BITS Pilani Pravin Mhaske
Lecture 1 Introduction, Model Assessment
Pilani Campus
For Classification, the most commonly used measure is the Error Rate: • The training error shows the performance of the model on the training
data
Ave I yi ≠ yො i
What about the accuracy of the prediction on an unseen test
data?
Where I is the Indicator function
• The usefulness of the model depends on the performance on unseen
test data
• We need a model that minimizes the test error
• We want a method that gives the lowest MSETe as opposed to
the lowest MSETr
• There are ways of estimating MSETe: Test Data, Cross Validation
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Overfitting and underfitting Overfitting and underfitting
Both overfitting and underfitting lead to poor predictions on new data sets.
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Underfitting
Low Bias, High Variance High Bias, Low Variance 1. Rigidity or under-complexity
Overly Flexible Less Flexible 2. High Bias, Low variance
Overfitting
1. Flexibility or over-complexity
2. Low Bias, High variance
• Variance is the amount that the predictions will vary with different training data sets
• The irreducible error cannot be reduced: it is the error introduced by modelling a real-life scenario
Mean Squared Error
High Bias:
1.5
High Variance: 2
• Bias = E መf x0 − f x0
• Small MSETr & Large MSETe
Inability of our model to capture the true relationship (no matter how much data we throw at it!)
0.0
2 2 2 2 2 2
E y0 − fመ x0 = E fመ x0 − f x0 + E fመ x0 − E fመ x0 + Var(ε) E y0 − fመ x0 = E fመ x0 − f x0 + E fመ x0 − E fመ x0 + Var(ε)
• The first term is the squared bias • Note that different training sets will generate different fመ
• It is the difference between the averages of the estimate and the true values • Ideally the estimates should not vary too much across training sets
• Bias refers to the error that is introduced by approximating a real-life problem • The second term is the variance of the estimate across many training sets
• A model with high bias is not capturing the underlying behaviour of the true • It determines how much the average model estimation deviates across the training
functional form well data
• For example, if a simple linear regression is used to model a sine curve, no matter
how low the MSETr is, it will never capture the non-linearity inherent in a sine curve • In general, more flexible methods have higher variance
2 2 2
E y0 − fመ x0 = E fመ x0 − f x0 + E fመ x0 − E fመ x0 + Var(ε)
According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and
author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with the experience
E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to performance increases (to
satisfy the definition)
M3 Linear Regression
Lecture 2 Linear Regression 1
Many data mining tasks are executed successfully with help of machine learning
# TV Radio Paper Sales The Advertising data set has 4 variables and 6 # TV Radio Paper Sales
Sales is the Dependent Variable 1 230.1 37.8 69.2 22.1 observations 1 230.1 37.8 69.2 22.1
• Also known as the Response or Target
• Generically referred to as Y 2 44.5 39.3 45.1 10.4 2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3 The variable names are “TV”, “Radio”, “Paper” and 3 17.2 45.9 69.3 9.3
TV, Radio and Paper are the independent variables “Sales”
• Also known as features, or inputs, or predictors
4 151.5 41.3 58.5 18.5 4 151.5 41.3 58.5 18.5
• Generically referred to as X (or X1, X2, X3) 5 180.8 10.8 58.4 12.9 p = 3 (the number of independent variables)
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75 7.2 n = 6 (the number of observations) 6 8.7 48.9 75 7.2
The linear model is an example of a parametric model We want to predict Y for a given value of x
f ( X ) = β0 + β 1 X 1 + β 2 X 2 + . . . β p X p
Is there an ideal f (X)?
• A linear model is specified in terms of p + 1 parameters: β 0 , β 1 , . . . , β p . • What is a good value for f (X) at any selected value of X , say X = 4? There
can be many Y values at X = 4
• The linear model: f ( X ) = β 0 + β 1 X 1 + β 2 X 2 + . . . β p X p has (p + 1) parameters
A good value is f (4) = E(Y |X = 4), the expected value of Y given X = 4.
• We estimate the parameters by fitting the model to training data.
This ideal f (x) = E(Y |X = x) is called the regression function.
• Simple Linear Regression: Only one x variable
• Multiple Linear Regression: Many x variables
Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
σ 𝑥𝑖 −𝑥ҧ 𝑦𝑖 −𝑦ത
𝑏1 = σ 𝑥𝑖 −𝑥ҧ 2
# of TV Ads # of Cars Sold
(x) (y)
• y Intercept for the Estimated Regression Equation
1 14
𝑏0 = 𝑦ത - 𝑏1 𝑥ҧ 3 24
2 18
where:
xi = value of independent variable for ith observation 1 17
yi =value of dependent variable for ith observation 3 27
𝑥=
ҧ mean value for dependent variable
𝑦=
ത mean value for dependent variable
σ 𝑥𝑖 −𝑥ҧ 𝑦𝑖−𝑦ത
• Slope for the Estimated Regression Equation 𝑏1 = σ 𝑥𝑖 −𝑥ҧ 2
= 20/4 = 5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
• Proportion of variance in a dependent variable that can be explained by an independent variable • Simple Linear Regression using Excel
• Multiple Linear Regression using Excel
• Multiple Linear Regression using statsmodel
• Multiple Linear Regression using scikitlearn
M3 Linear Regression
Lecture 2 Extensions to the Linear Model
Label Encoding
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The Advertising Model The Advertising Model
Call:
lm(formula = Sales ~ TV + Radio, data = Ad)
In the first model, we assumed that the effect on sales of increasing one advertising
Residuals: medium is independent of the amount spent on the other media.
Min 1Q Median 3Q Max
-8.7977 -0.8752 0.2422 1.1708 2.8328 The estimated model was
= 2.9389 + 0.0458*TV + 0.1885*Radio – 0.0010*Paper
Y
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.92110 0.29449 9.919 <2e-16 ***
TV 0.04575 0.00139 32.909 <2e-16 *** From this we infer that the average effect on sales of a one-unit increase in TV is always
Radio 0.18799 0.00804 23.382 <2e-16 *** 0.0458, regardless of the amount spent on radio.
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Non-linear Relationships
BITS Pilani
Pilani Campus
Extend the linear model to accommodate non-linear relationships. Perform EDA to find patterns in the
relationships.
• Exponential: y = αe(βx)
• Logarithmic: y = α + β*ln(x)
• Power: y = αxβ
M3 Linear Regression • Generalized additive model (GAM): Combines multiple regression models for a complex relationship
Lecture 3 Linear Regression - Diagnostics
y = f1(x1) + f2(x2) +…+ fn(xn)
Exercise:
Explore relationships for other predictors
Examples Is ε Normal?
1. TV + Radio + Paper,
2. TV*Radio
3. TV*Radio + I(TV^2)
• X1 and X2 are significant when included separately, but together the effect of both variables shrink. Multicollinearity exists • What is Gradient?
when there is a correlation between multiple independent variables in a multiple regression model. This can adversely affect • What is Descent? Gradient Descent with one parameter
the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical significance of the w = parameter or coefficient
independent variables.
J(w) = Cost function corresponding to that w
• Test for Multicollinearity: Variance Inflation Factor J(w)
Objective:
• VIF equal to 1 = variables are not correlated To compute (approximate) the w that minimizes
• VIF between 1 and 5 = variables are moderately correlated
J(w)
• VIF greater than 5 = variables are highly correlated
Solutions to multicollinearity
w
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
• Minibatch refers to calculating derivative of mini groups of training data before calculating an update.
• Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating
the update immediately
Small Steps: The process will Large Steps: The process may not
eventually reach the local minimum converge – just bounces back and
but it may be very slow forth
1 2
We need to minimize J β = σn yi − xiT β
2n i=1
y = f(x) + ε = β0 + β1x + ε
Differentiating J(β) wrt β, we get 𝛻 = 𝐗 T 𝐲 − 𝐗β
1 2
The following iterative process solves 𝛻 = 0: We need to minimize J β = σn yi − β0 + β1x
2∗n i=1
dJ 1
= σni=1 β0 + β1 xi − yi xi = 0
dβ1 n
Training Set:
y = f(x) + ε = β0 + β1x + ε
X Y
1 2
We need to minimize J β = σn yi − β0 + β1x -1 -1
2∗n i=1
1 3
Repeat until convergence:
Learning Rate (α) = 0.08
{
After 4th iteration of the Gradient Method, β0 = 0.64 &, β1 = 1.64
α n α n
β0 = β0 − σ β0 + β1 xi − yi , β1 = β1 − σ β0 + β1 xi − yi xi
𝑛 i=1 𝑛 i=1 Compute the estimates at the next iteration
}
Training Set: (-1, -1) and (1, 3), α = 0.08 and after 4 iterations, β0 = 0.64 &, β1 = 1.64
α
β 0 = β0 − β0 + β1 x1 − y1 + β0 + β1 x2 − y2
2
α
β1 = β1 − [ β0 + β1 x1 − y1 x1 + β0 + β1 x2 − y2 x2 ]
2
ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus
Source Credit : Slide by
Andrew Ng
• The aim of a regression analysis is to develop a reasonable approximation to the unknown response
function f, where Y = f(X) + ε
• Parametric methods make and assumption about the shape of f. For example, OLS assumes f to be
linear and estimates a small set of parameters, i.e., β0, β1, β2…
• Unlike parametric approach where the function f is fully described by a finite set of parameters,
nonparametric modeling accommodate a very flexible form of the regression curve.
• They do not make any explicit assumption of the functional form of f
• Instead try to seek an estimate of f that is close to the data points without being too flexible
• So, they can fit multiple possible shapes of f (Pic 1 on previous slide)
K = 3 & six blue observations and six orange observations
• They provide a versatile method of exploring a general relationship between variables
• Disadvantage: Need much more data to accurately fit the function f (or model) x is a test observation
The 3 closest points are identified, and therefore x belongs to the blue class
The KNN decision boundary for this example is shown in black.
• Euclidean:
• The Training error rate (blue) & Test • Straight line distance
error rate (orange) Vs Flexibility on a • Scale variant, Sensitive to data dimensionality
simulated data • Normalization (scaling) can solve this issue
• Flexibility is measured by 1/k
• High k, stable model
• Low k, flexible model
• Bias-Variance Tradeoff?
• Manhattan:
• Taxicab or cityblock distance
• Sum of absoluter differences of cartesian coordinates
Normalization:
• Min-Max scaling
• Converts to 0-1 range
Standardization:
• Converts to z-values using mean and std dev
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
The right k Practical Considerations
K = 1, a rough step function. Flexible fit. Low bias, high variance. Overfitting.
K = 9, a smoother function. More rigid fit. High Bias, low variance. Changing one observation has small
effect.
Resampling Methods
BITS Pilani
Pilani Campus
• They involve repeatedly drawing samples from the training set in order to
obtain additional information about the fitted model
• For example,
• Recall the distinction between the test error and the training error: Holding out a subset of the training observations from the fitting process, and then applying the statistical learning
method to those held out observations to estimate the test error
• The test error is the average error that results from using a statistical learning method
• We randomly divide the training set into two equal parts: a training set and a validation or hold-out set.
to predict the response on a new observation, one that was not used in training the • The model is fit on the training set, and the fitted model is used to predict the responses for the observations in
method. the validation set.
16 18 20 22 24 26 28
20
16 18
2 4 6 8 10 2 4 6 8 10
5-fold CV Bootstrap
• Widely used approach for estimating test error. • The bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we
can estimate the variability of our estimate without generating additional samples.
• Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets
by repeatedly sampling observations from the original data set with replacement.
• Each of these “bootstrap data sets” is created by sampling with replacement and is the same size as our original
dataset. As a result. some observations may appear more than once in each bootstrap data set and some, not at
all.
• Example: Training dataset = {1,2,3} can give below possible datasets
ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus
Curse of Dimensionality
BITS Pilani
Pilani Campus
• Despite their simplicity, the linear models often have a good predictive power.
• If the relationship between predictor and response is linear, they tend to have low bias.
• And if n is large enough, i.e., n >> p, they have low variance too. No overfitting.
• But, what if p > n?
• New technologies can generate unlimited features.
• On the other hand, collection of data at times may have some constraints like cost and
availability. Result: p > n.
Lecture 6 Regularization
• How to manage irrelevant variables?
• We need better generalization, low overfitting and high interpretability. • Fit a model on all p predictors. But constrain or regularize the coefficient estimates, or shrink
• How to manage irrelevant variables? those to zero.
• Perform feature selection • Ridge Regression and Lasso Regression
• Push the coefficients to some variables to zero
• Methods for variable selection/elimination:
• Subset selection: Select subset of feature which we believe are relevant
• Shrinkage: Shrink the coefficients of some variables to zero. Called “Regularization”.
Result: Feature selection.
• Dimension reduction: Project the p predictors into an M-dimension subspace. M < p.
Achieved by computing M different linear combinations of p variables. Fit the model on M
features. Principal Component Analysis.
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Ridge Regression
• Recall : The Residual Sum of Squares
• The least squares method estimates the β0, β1, β2, … βp coefficients those minimize RSS, which
can also be expressed as
• Ridge regression is same as OLS, except that the coefficients are estimated by minimizing a
slightly different quantity which is
p
• The second term λ σj=1 β2j is called shrinkage penalty. As β values → 0, penalty decreases.
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• As λ increases, flexibility decreases which means more robust model, less overfitting, less
variance. But bias increases.
• Appropriate value of λ can minimize the sum of bias and variance errors, i.e., the MSE.
OLS • Ridge regression does have one obvious disadvantage: unlike subset selection, which
(Intercept) Income Limit Rating Cards Age Education Female Student Married Caucasian will generally select models that involve just a subset of the variables, ridge regression
-479.21 -7.80 0.19 1.14 17.72 -0.61 -1.10 -10.65 425.75 -8.53 10.11 will include all p predictors in the final model
At λ = 0.01 • The Lasso is a relatively recent alternative to ridge regression that overcomes this
disadvantage. The lasso coefficients 𝛽𝜆𝐿 , minimize the quantity
(Intercept) Income Limit Rating Cards Age Education Female Student Married Caucasian
2
𝑛 𝑝 𝑝 𝑝
-473.9088 -7.7987 0.1801 1.2954 16.9949 -0.6373 -1.056 -10.3862 425.8620 -7.4602 1.4765
𝑦𝑖 − 𝛽0 − 𝛽𝑗 𝑥𝑖𝑗 + 𝜆 |𝛽𝑗 | = 𝑅𝑆𝑆 + 𝜆 |𝛽𝑗 |
At λ = 1010
𝑖=1 𝑗=1 𝑗=1 𝑗=1
(Intercept) Income Limit Rating Cards Age Education Female Student Married Caucasian
359.9029 0.0002 0.0336 0.0023 0.0000 -0.0000 -0.0000 0.0000 0.0000 -0.0000 -0.0000
• In statistical parlance, the lasso uses an l 1 penalty instead of an l 2 penalty.
ZG 512
Predictive Analytics Lecture 7 Bayesian Learning
Consider a family with two children. Sample space S = {GG, GB, BG, BB}.
• What is the probability that both the children are girls? Sol. The two child problem:
• What is the probability that both children are girls, given that the first child is a P ( A B) P ( A) 14 1
girl? P ( A | B) = = = =
• What is the probability that both the children are girls, given that at least one of P ( B) P ( B) 12 2
them is a girl?
P( AC) P ( A) 14 1
P ( A| C ) = = = =
Sol.: Let A, B, and C be the events that both children are girls, first child is a girl, and at P (C ) P (C ) 34 3
least one of them is a girl, respectively.
Q. In rolling of a fair die, what is the probability that the outcome is an odd
Here A = {GG}, B = {GB, GG}, C = {GG, GB, BG}; P(A)=1/4, P(B)=1/2, P(C)=3/4
What are now
number, given that it is less than or equal to 3?
P ( A|B ) , P ( A|C ) ? Sol. P ( A B) 26 2
P ( A | B) = = =
P ( B) 36 3
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Conditional Probability Total Probability Rule
Q. In rolling of a fair die, what is the probability that the outcome is an odd I have three bags each containing 100 balls.
• Bag 1 has 75 red and 25 blue balls
number, given that it is less than or equal to 3?
• Bag 2 has 60 red and 40 blue balls.
Sol. P ( A B) 26 2 • Bag 3 has 45 red and 55 blue balls.
P ( A | B) = = =
P ( B)
If I choose a bag at random, and then pick a ball from it, also at random, what is the probability
36 3 that the chosen ball is red?
For any two events A and B, where P ( A ) 0 Q 1. I have three bags each containing 100 balls.
• Bag 1 has 75 red and 25 blue balls
P ( B A) P ( B) P ( A | B)
P ( B A) =
• Bag 2 has 60 red and 40 blue balls.
=
P ( A) P ( A) • Bag 3 has 45 red and 55 blue balls.
A bag is chosen at random, and a ball is picked also at random. If the ball is found to be red, what is
the probability that Bag 1 was chosen?
Bayes’ theorem helps to revise probabilities (prior to posterior) with the new information. This By total probability rule,
P(B ) P(R | B )
12
i i
i =1
5 Y Y Y Banana
And 6 Y Y Y Banana
• P(X1 = x1 |Y = k) can be estimated from the data DATA SUMMARY
Long Not Long Sweet Not Sweet Yellow Not Yellow TOTAL
Banana 400 100 350 150 450 50 500
Orange 0 300 150 150 300 0 300
Rest 100 100 150 50 50 150 200
TOTAL 500 500 650 350 800 200 1000
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
Fruits Dataset Fruits Dataset
DATA SUMMARY
DATA SUMMARY Long Not Long Sweet Not Sweet Yellow Not Yellow TOTAL
Long Not Long Sweet Not Sweet Yellow Not Yellow TOTAL Banana 400 100 350 150 450 50 500
Orange 0 300 150 150 300 0 300
Banana 400 100 350 150 450 50 500 Rest 100 100 150 50 50 150 200
TOTAL 500 500 650 350 800 200 1000
Orange 0 300 150 150 300 0 300
P(A)
Rest 100 100 150 50 50 150 200 P A│E1&E2&E3 = P E1 ∗P E2 ∗P(E3)
∗ P E1 A ∗ P E2 A ∗ P(E3|A)
TOTAL 500 500 650 350 800 200 1000 P(Banana) = 0.5, P(Orange) = 0.3, P(Other) = 0.2
P(Long) = 0.5, P(Sweet) = 0.65, P(Yellow) = 0.8
P(A)
P A│E1&E2&E3 = P E1 ∗P E2 ∗P(E3)
∗ P E1 A ∗ P E2 A ∗ P(E3|A)
B = 0.1, R = 0.9
BITS Pilani, Pilani Campus
Eager and Lazy learning
BITS Pilani
Pilani Campus
x is a test observation
The 3 closest points are identified, and therefore x belongs to the blue class
The KNN decision boundary for this example is shown in black.
• Euclidean:
• Straight line distance
• Scale variant, Sensitive to data dimensionality
• Normalization (scaling) can solve this issue
• Manhattan:
• Taxicab or cityblock distance
• Sum of absoluter differences of cartesian coordinates
• Application based: 2-4 • Write legibly. If the evaluator can't read or understand, you may lose marks.
• Theory: 2-4 • Underline, box or highlight important parts of the answer, like the keywords, final answer, etc.
• If you don’t really know the answer, writing anything doesn’t give marks.
• Topics: Linear Regression, Bias-Variance, Over and underfitting, Types of
errors, Model selection, Regularization, Cross-validation, Parametric or non-
parametric methods, Resampling Re-evaluations:
• Cross-check the answer with the sample answers. If it is not correct or even nearby, do not apply.
• Unjustified and vague requests may result in deduction.
• No marks for “attempting” a question.
ZG 512
Predictive Analytics Lecture 9 Classification, Logistic Regression
A good classifier is one for which the test error rate is the smallest
Logit function:
p
• logit p = ln ,0 ≤ p ≤ 1
1−p
P(E)
log Odds(E) = ln 1−P(E)
Confusion Matrix
መ
The performance of𝑓can also be described
by a confusion matrix
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
Example
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
The Matrix Is accuracy a good metric always?
TP = 10, FP = 0, TN = 85, FN = 5
True Class
Predicted Class Target Non-Target
Target TP FP
Non-Target FN TN
Also check*
1. Specificity (TNR or True Negative Rate = TN/N)
2. False Positive Rate (FPR) = FP/N = 1 – TNR
3. And others…
Refer https://en.wikipedia.org/wiki/Confusion_matrix
Cons:
Throwing away a lot of data/information
ZG 512
Predictive Analytics Lecture 11 Tree-based Methods
BITS Pilani Pravin Mhaske
Pilani Campus
• Since the set of splitting rules used to segment the predictor space can be • Hence, we also discuss bagging, random forests, and boosting. These methods grow
summarized in a tree, these types of approaches are known as decision-tree multiple trees which are then combined to yield a single consensus prediction.
methods.
• Combining a large number of trees can often result in dramatic improvements in
prediction accuracy, at the expense of some loss in interpretation.
• Decision trees can be applied to both regression and classification problems. So also
called CART.
E = 1 − max pො mk
k The Gini index takes on a small value if all of the pො mk ’s are close to zero or
Here pො mk represents the proportion of training observations in the mth region that are from the kth one.
class. • For this reason, the Gini index is referred to as a measure of node purity — a
small value indicates that a node contains predominantly observations from
• However, classification error is not sufficiently sensitive for tree-growing, and in practice two other
a single class.
measures are preferable Gini Index and Cross-Entropy
• G = 0.5 maximum impurity
It turns out that the Gini index and the cross-entropy are very similar
numerically.
D = 1 maximum impurity
X1, X2 & X3 are the independent variables; Y is the response variable Level 0: G = 0.38 & D = 0.81
X1 X2 X3 Y X1 X2 X3 Y Level 1: Optimum Candidate: X1 with G = 0.25 & D = 0.50
62 62 6 6 2 2 Left Branch: (0, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 0); G = 0 & D = 0; No further splitting
0 0 0 0 Level 0: 6 ‘0’ & 2 ‘1’⇒ 𝐆 = 8 8 + 8 8 = 0.38 & 𝐃 = − 8 log 2 8
− 8 log 2 8
= 0.81 0 0 0 0
Right Branch: {1, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1); G = 0.5 & D = 1
0 0 1 0 Level 1: 0 0 1 0
0 1 0 0 • Splitting Candidate X1: X1 < 0.5 & X1 ≥ 0.5 0 1 0 0 Level 2:
0 1 1 0 YL: 0,0,0,0; G = 0 & D = 0; YR: 0,1,0,1; G = 0.5 & D = 1; GTotal = 0.25 & DTotal = 0.5 0 1 1 0 Left Branch: G = 0 & D = 0. No further splitting
1 0 0 0 1 0 0 0
• Splitting Candidate X2: X2 < 0.5 & X2 ≥ 0.5 Right Branch:
1 0 1 1 YL: 0,0,0,1; G = 0.38 & D = 0.81; ; YR: 0,0,0,1; G = 0.38 & D = 0.81; GTotal = 0.38 & DTotal =
1 0 1 1
1 1 0 0 1 1 0 0 • Splitting Candidate X2: X2 < 0.5 & X2 >= 0.5
0.81 YL: 0,1; G = 0.5 & D = 1; ; YR: 0,1; G = 0.5 & D = 1; ; GTotal = 0.25 & DTotal = 1
1 1 1 1 1 1 1 1
• Splitting Candidate X3: X3 < 0.5 & X3 > 0.5 • Splitting Candidate X3: X3 < 0.5 & X3 > 0.5
YL: 0,0,0,0; G = 0 & D = 0; ; YR: 0,0,1,1; G = 0.5 & D = 1; GTotal = 0.25 & DTotal = 0.5 YL: 0,0; G = 0 & D = 0; ; YR: 1,1; G = 0 & D = 0; ; GTotal = 0 & DTotal = 0
Optimum Candidate: X3
Optimum Candidate: X1
Level 3: No further splitting
(0, 0) 0.5 X1
▲ Trees are very easy to explain, even easier than linear regression!
▲ Decision trees more closely mirror human decision-making than do the other regression and
classification approaches.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert
(especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy
variables.
▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of
the other regression and classification approaches.
• Decision trees are simple and interpretable models for regression and classification
• However, they are often not competitive with other methods in terms of
prediction accuracy
• Bagging, random forests, and boosting are good methods for improving the prediction
accuracy of trees.
• Random Forest – They work by growing many trees on the training data and then combining
the predictions of the resulting ensemble of trees.
• The latter two methods, random forests and boosting, are among the state-of-the-art
methods for supervised learning. However, their results can be difficult to interpret. Lecture 11 Model Ensembles
• It appears that overfitting may not happen in Boosting
• Decision trees are simple and interpretable models for regression and classification • A general-purpose procedure for reducing the variance of a statistical learning method
• However they are often not competitive with other methods in terms of prediction
accuracy • It is particularly useful and frequently used in the context of decision trees.
• Bagging, random forests and boosting are good methods for improving the prediction accuracy of • This is not practical because we generally do not have access to multiple training sets.
trees. They work by growing many trees on the training data and then combining the predictions
of the resulting ensemble of trees.
• The latter two methods— random forests and boosting— are among the state-of-the-art
methods for supervised learning. However their results can be difficult to interpret.
• It appears that overfitting may not happen in Boosting
• The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty
associated with a given estimator or statistical learning method.
• Distinct data sets by repeatedly sampling observations from the original data set with replacement.
• This step ensures that the base models are trained on diverse subsets of the data, as some samples
may appear multiple times in the new subset, while others may be omitted. It reduces the risks of
overfitting and improves the accuracy of the model.
Since we do not have access to multiple training sets, we bootstrap • Generate B different bootstrapped training data sets.
Sample with replacement B times from the (single) training dataset.
• Generate B different bootstrapped training data sets.
Sample with replacement B times from the (single) training data set. • Train our method on the bth bootstrapped training set to get fመ b (x), b =
1, …, B
• Train the method on the bth bootstrapped training set to get fመ b (x), b = 1,
…, B
• Take the majority vote:
• Average all the predictions to obtain The overall prediction is the most commonly occurring class
B
1
fመavg x = fመ b (x)
B
b=1
• Generate B different bootstrapped training data sets. • Generate B different bootstrapped training data sets.
Sample with replacement B times from the (single) training data set. Sample with replacement B times from the (single) training data set.
• When building a decision tree • When building a decision tree
• At each split, a random sample of m predictors is chosen as split candidates from the full set of • At each split, a random sample of m predictors is chosen as split candidates from the full
p predictors. set of p predictors.
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
• We get fመ b (x)
• We get fመ b (x), b = 1, …, B
𝟏
• Take the majority vote:
• Average all the predictions to obtain 𝐟መ𝐚𝐯𝐠 𝐱 = σ𝐁𝐛=𝟏 𝐟መ 𝐛 (x) The overall prediction is the most commonly occurring class
𝐁
Gradient Boosting
• Sequentially adding predictors to an ensemble, each one correcting its predecessor.
• Tries to fit the new predictor to the residual errors made by the previous predictor.
ZG 512
Predictive Analytics Lecture 12 Support Vector Machines
• H1 – can’t classify
• H2 – Classifies, but with thin margin
• H3 – Best among all. Maximum margin.
0 1 2 3
X1
• It is worthwhile to misclassify a few training observations in order to do a better job in classifying the maximize M
majority of observations. β0 ,β1 ,…,βp ,ε1 ,…,εn
p
• The margin is kept soft. That is, it can be violated by some observations. s.t. σj=1 β2j = 1
• Sometimes, some observation can be on wrong side of the margin (1 and 8) or on wrong side of the yi β0 + β1 xi1 + β2 xi2 + ⋯ + βp xip ≥ M 1 − εi ∀ i = 1, … , n
hyperplane (11 and 12). εi ≥ 0 & σ εi ≤ C
ZG 512
Predictive Analytics Unsupervised Learning
BITS Pilani Pravin Mhaske
Lecture 13 Principal Components Analysis
Pilani Campus
Supervised Vs Unsupervised
• We observe both a set of features X1, X2, . . . , Xp for each object, as well as a response or
outcome variable Y. The goal is then to predict Y using X1, X2, . . . , Xp.
• Here we instead focus on unsupervised learning, where we observe only the features X1, X2,
. . . , Xp. We are not interested in prediction, because we do not have an associated response
variable Y.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
Challenges of Unsupervised Learning Goals of Unsupervised Learning
• Unsupervised learning is more subjective than supervised learning, as there is no simple goal
for the analysis, such as prediction of response.
• The goal is to discover interesting things about the measurements:
• Difficult to assess the results. Nobody knows the true answer. No universally accepted
• Is there an informative way to visualize the data?
mechanism for cross-validation.
• Can we discover subgroups among the variables or among the observations?
• Unsupervised learning is often performed as part of an exploratory data analysis.
• Can we interpret these subgroups?
• But techniques for unsupervised learning are of growing importance in a number of fields:
• subgroups of breast cancer patients grouped by their gene expression measurements
• We discuss two methods:
• groups of shoppers characterized by their browsing and purchase histories
• principal components analysis, a tool used for data visualization or data pre-
• movies grouped by the ratings assigned by movie viewers
processing before supervised techniques are applied, and
• Search engine will display results based on search history of other users with similar search
• clustering, a broad class of methods for discovering unknown subgroups in
patterns
data.
• PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear • Say, we have n observations and p features and we wish to plot scatterplots for all pairs.
combinations of the variables that have maximal variance and are mutually uncorrelated.
• If p = 10, it will create 10C2 = 45 plots! This grows exponentially with p.
• When there is a large set of correlated variables, principal components allow summarizing those by a
• Each feature contains a small fraction of information. And not all dimensions are equally interesting.
smaller set of representative variables. Dimensionality reduction.
• We need a lower dimension representation of data without losing much information.
• We use these new variables in another supervised learning problem.
• PCA does this job. Finds small number of dimensions that are as interesting as possible.
• PCA – a process of computing the principal components and using those.
• Interesting = how much the observation vary in each direction.
• Unsupervised because only X1, X2, X3… Xp are available. No label.
• Each new dimension found is a linear combination of the p features. It is a Principal Component.
• Apart from producing derived variables for use in supervised learning problems, PCA also serves as a
tool for data visualization.
• The first principal component found is a normalized linear combination of all features. • The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles.
The green solid line indicates the first principal component direction, and the blue dashed line
Z1 = Φ11X1 + Φ21X2 +…+ Φp1Xp
indicates the second principal component direction.
• The second principal component looks like (just to understand the notations)
Z2 = Φ12X1 + Φ22X2 +…+ Φp2Xp
𝑝 2
• Normalized means 𝛴𝑗=1 𝜙𝑗1 = 1. Φ11, Φ21,… Φp1 are called loadings (or eigen vectors) of the first
PC.
• Together, the loadings make up the principal component loading vector,
PC 2 If we project the n data points x1, . . . , x n onto this direction, the projected values are the first
principal component scores z11, . . . , zn1 themselves.
Variable X 2
0
-8 -6 -4 -2 0 2 4 6 8 10 12
-2
-4
-6
Variable X1
• Only one PC
• Two almost equal PCs
ZG 512
Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus
• Clustering refers to a very broad set of techniques for finding subgroups, or clusters,
in a data set.
• We seek a partition of the data into distinct groups so that the observations within each
group are quite similar to each other,
• We must define what it means for two or more observations to be similar or different.
• This is often a domain-specific consideration that must be made based on knowledge
of the data being studied.
• Finally, we must be able to interpret the clusters for the exercise to be useful
Unsupervised Learning • Example, Customer/market/industry segmentation, document clustering, fake news,
fraud detection
Lecture 14 Clustering
Manhattan:
• Taxicab or cityblock distance
• Sum of absolute differences of cartesian coordinates
❑ A centroid is a data point at the center of a cluster. • Step 0: Create random k centroids
❑ In centroid-based clustering, clusters are represented by a central vector or a centroid. Repeat iteratively till convergence (no further improvement/movement)
This centroid might not necessarily be a point of the dataset. • Step 1: Assign points to the closest centroid
❑ Centroid-based clustering is an iterative algorithm in which the notion of similarity is • Step 2: Re-calculate and move centroids
❑ Objective is to minimize the sum of sqaured distances (distance between every point and
• http://alekseynp.com/viz/k-means.html
the corresponding cluster center).
• https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
14
12
10
0
0 10 20 30 40 50 60
• Philosophy: If a particular point belongs to a cluster, it should be near to lots of other points in • Noise point: If a new point selected doesn’t have more than minPoints within epsilon, it remains
that cluster. unclustered.
• It groups closely packed data points together based on two parameters - a distance measure • Border point: There maybe a fight between two clusters for a certain point qualifying for both
(epsilon) and a density threshold (minPoints). the clusters. That point may get into any arbitrary cluster.
Steps:
1. Select two parameters: Epsilon and minPoints. • https://scikit-learn.org/dev/auto_examples/cluster/plot_dbscan.html
2. Pick a random point.
3. If there are minPoints (including self) with the distance epsilon, merge them.
4. Keep repeating step 3 iteratively for all the new points included in the cluster. If no more points
found within epsilon, stop.
5. Select a new random unclustered point as a new cluster.
6. Repeat steps 3, 4 and 5.
K-Means Vs DBSCAN
kMeans DBSCAN
Type Centroid-based Density-based
Cluster Shape Spherical/Convex Irregular
Number of clusters To be specified Decided by algorithm
Dimensions Can handle high dimension Not efficient for high dimension
Outliers Get clustered Become Noise points
Anomaly detection Not useful Highlights anomalous points ZG 512
When to use? K is known and data has spherical K is unknown and data has
shape irregular shape Predictive Analytics
BITS Pilani Pravin Mhaske
Pilani Campus
Understanding Deep
Neural Networks
ANN ANN
Perceptron model
● As we learn about more complex models, ● If the whole idea behind deep learning is to
we’ll also introduce concepts, such as: have computers artificially mimic biological
○ Activation Functions natural intelligence, we should probably
○ Gradient Descent build a general understanding of how
○ BackPropagation biological neurons work!
● However, in 1969 Marvin Minsky and ● Fortunately for us, we now know the
Seymour Papert's published their book amazing power of neural networks, which
Perceptrons. all stem from the simple perceptron model,
● It suggested that there were severe so let’s head back and convert our simple
limitations to what perceptrons could do. biological neuron model into the
● This marked the beginning of what is perceptron model.
known as the AI Winter, with little funding
into AI and Neural Networks in the 1970s.
Nucleus
x1 x1
Inputs Output Inputs f(X) Output
x2 x2
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
x2 x2
Perceptron model Perceptron model
● Realistically, we would want to be able to ● Let’s add an adjustable weight we multiply
adjust some parameter in order to “learn” against x
w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2
x2 x2
w1 w1
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
w2 w2
x2 x2
x2 x2
*w1 + b *w1 + b
x1 x1
y y
Inputs f(X) Output Inputs f(X) Output
x2 *w2 + b x2 *w2 + b
Perceptron model Perceptron model
● We can expand this to a generalization:
● We’ve been able to model a biological
neuron as a simple perceptron!
x1 Mathematically our generalization was:
*w1 + b y
Inputs f(X) Output
x2 *w2 + b
xn *wn + b
● Later on we will see how we can expand ● Also we’ll see we can even simplify the bias
this model to have X be a tensor of to be at a layer level instead of a bias per
information ( an n-dimensional matrix). input.
xn *wn + b xn *wn + b
xn *wn xn *wn
B B = b1 + b2 + … +bn
Perceptron model
● To build a network of perceptrons, we can ● The outputs of one perceptron are directly
connect layers of perceptrons, using a fed into as inputs to another perceptron.
multi-layer perceptron model.
● This allows the network as a whole to learn ● The first layer is the input layer
about interactions and relationships
between features.
Neural Networks Neural Networks
● The last layer is the output layer. ● Layers in between the input and output
● Note: This last layer can be more than one layers are the hidden layers.
neuron
● Hidden layers are difficult to interpret, due ● Neural Networks become “deep neural
to their high interconnectivity and distance networks” if then contain 2 or more
away from known input or output values. hidden layers.
● For more details on this check out the ● Previously in our simple model we saw that
Wikipedia page for “Universal the perceptron itself contained a very
Approximation Theorem” simple summation function f(x).
● For most use cases however that won’t be
useful, we’ll want to be able to set
constraints to our output values, especially
in classification tasks.
Neural Networks
● The most simple networks rely on a basic ● Regardless of the values, this always
step function that outputs 0 or 1. outputs 0 or 1.
1 1
Output Output
0 0
0 0
z = wx + b z = wx + b
● This sort of function could be useful for ● However this is a very “strong” function,
classification (0 or 1 class). since small changes aren’t reflected.
1 1
Output Output
0 0
0 0
z = wx + b z = wx + b
Deep Learning Deep Learning
● There is just an immediate cut off that ● It would be nice if we could have a more
splits between 0 and 1. dynamic function, for example the red
1 line! 1
Output Output
0 0
0 0
z = wx + b z = wx + b
● Lucky for us, this is the sigmoid function! ● Changing the activation function used can
be beneficial depending on the task!
1 1
Output Output
0 0
0 0
z = wx + b z = wx + b
● This still works for classification, and will be ● Let’s discuss a few more activation
more sensitive to small changes. functions that we’ll encounter!
1 1
Output Output
0 0
0 0
z = wx + b z = wx + b
-1 -1
0 0
Deep Learning Deep Learning
● Rectified Linear Unit (ReLU): This is ● ReLu has been found to have very good
actually a relatively simple function: performance, especially when dealing
max(0,z) with the issue of vanishing gradient.
Output ● We’ll often default to ReLu due to its
overall good performance.
0
z = wx + b
Deep Learning
Deep Learning
● Previously we thought of the last output ● This single node could output a continuous
layer as a single node. regression value or binary classification (0
or 1).
● Let’s expand this output layer to work for ● Organizing for Multiple Classes
the case of multi-classification.
Hidden Layers
... ...
1
Class Two Class Two 0.2
0
Deep Learning
● Review
○ Perceptrons expanded to neural
network model
○ Weights and Biases
○ Activation Functions
○ Time to learn about Cost Functions!
Deep Learning
● This output ŷ is the model’s estimation of ● We need to take the estimated outputs
what it predicts the label to be. of the network and then compare them
● So after the network creates its to the real values of the label.
prediction, how do we evaluate it? ● Keep in mind this is using the training
● And after the evaluation how can we data set during the fitting/training of the
update the network’s weights and model.
biases?
● The cost function (often referred to as a ● We’ll use the following variables:
loss function) must be an average so it ○ y to represent the true value
can output a single value. ○ a to represent neuron’s prediction
● We can keep track of our loss/cost ● In terms of weights and bias:
during training to monitor network ○ w*x + b = z
performance. ○ Pass z into activation function σ(z) = a
● One very common cost function is the ● We simply calculate the difference
quadratic cost function: between the real values y(x) against our
predicted values a(x).
Deep Learning Deep Learning
● Note: The notation shown here ● Notice how squaring this does 2 useful
corresponds to vector inputs and things for us, keeps everything positive
outputs, since we will be dealing with a and punishes large errors!
batch of training points and predictions.
● We can think of the cost function as: ● W is our neural network's weights, B is
our neural network's biases, Sr is the
input of a single training sample, and Er
is the desired output of that training
sample.
● Notice how that information was all ● This means that if we have a huge
encoded in our simplified notation. network, we can expect C to be quite
● The a(x) holds information about complex, with huge vectors of weights
weights and biases. and biases.
● In a real case, this means we have some ● For simplicity, let’s imagine we only had
cost function C dependent lots of one weight in our cost function w.
weights! ● We want to minimize our loss/cost
○ C(w1,w2,w3,....wn) (overall error).
● How do we figure out which weights ● Which means we need to figure out
lead us to the lowest cost? what value of w results in the minimum
of C(w)
w w
● What value of w minimizes our cost? ● What value of w minimizes our cost?
C(w) C(w)
w w
wmin
● Students of calculus know we could take ● But recall our real cost function will be
a derivative and solve for 0. C(w)
very complex! C(w)
w w
wmin
Deep Learning Deep Learning
w w
Deep Learning
w
Deep Learning
wmin
● We can calculate the slope at a point ● We can calculate the slope at a point
C(w) C(w)
w w
wmin wmin
Deep Learning Deep Learning
w w
wmin wmin
w w
wmin wmin
w w
wmin wmin
● Smaller steps sizes take longer to find ● Larger steps are faster, but we risk
the minimum. C(w)
overshooting the minimum! C(w)
w w
wmin wmin
Deep Learning Deep Learning
● This step size is known as the learning ● The learning rate we showed in our
rate. C(w) illustrations was constant (each step size
was equal)
● But we can be clever and adapt our step
size as we go along.
w
wmin
● We could start with larger steps, then go ● In 2015, Kingma and Ba published their
smaller as we realize the slope gets paper: “Adam: A Method for Stochastic
closer to zero. Optimization“.
● This is known as adaptive gradient ● Adam is a much more efficient way of
descent. searching for these minimums, so you
will see us use it for our code!
● When dealing with these N-dimensional ● For classification problems, we often use
vectors (tensors), the notation changes the cross entropy loss function.
from derivative to gradient. ● The assumption is that your model
● This means we calculate ∇C(w1,w2,...wn) predicts a probability distribution p(y=i)
for each class i=1,2,…,C.
Deep Learning Deep Learning
● So far we understand how networks can ● The next thing we need to learn about
take in input , effect that input with theory is:
weights, biases, and activation functions ○ Once we get our cost/loss value,
to produce an estimated output. how do we actually go back and
● Then we learned how to evaluate that adjust our weights and biases?
output. ● This is backpropagation, and it is
what we are going to cover next!