Random Forest: Proprietary Content. ©great Learning. All Rights Reserved. Unauthorized Use or Distribution Prohibited
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Basic steps - Classification algorithms
Should I invest in a company – ask the experts
• Employee of XYZ: knows internal functionality; lacks a broader perspective on competitors.
• Financial Advisor of XYZ: has a perspective on companies vs competition; has been right 75% of the time.
• Stock Market Trader: has observed the company's stock price over the past 3 years; has been right 70% of the time.
• Employee of a competitor: knows the internal functionality of the competitor firms; has been right 60% of the time.
• Market Research team: analyzes customer preference for XYZ's product; has been right 75% of the time.
• Social Media Expert: understands product positioning; unaware of details beyond digital marketing.
Scenario 1 – Combine all the info – informed decision
Scenario 2 – info from similar sources
• 6 experts, all of them employees of XYZ working in the same division
• Everyone has a propensity of 70% to advocate correctly.
• What if we combine their advice into a single prediction based on voting?
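The voting question can be explored with a little arithmetic. The sketch below assumes the six experts make their mistakes independently of one another; the slide does not say this, and it is unlikely to hold for six colleagues from the same division. `majority_vote_accuracy` is a hypothetical helper name.

```python
from math import comb

def majority_vote_accuracy(n_experts: int, p_correct: float) -> float:
    """Probability that a strict majority of independent experts is correct."""
    needed = n_experts // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(needed, n_experts + 1)
    )

# Assumption: errors are independent across experts.
independent = majority_vote_accuracy(6, 0.70)
# If all six always give the same advice, the vote equals one expert's opinion.
identical = 0.70

print(f"independent experts: {independent:.3f}")
print(f"identical experts:   {identical:.3f}")
```

Under the independence assumption a strict majority is right roughly 74% of the time (3-3 ties are counted as failures here). When the experts' opinions move together, voting adds little over a single 70% expert, which is why an ensemble benefits from diverse sources.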
Ensemble learning
• Machine learning technique that combines several base
models in order to produce one optimal predictive model.
• Weak classifiers
• Different set of variables for each classifier
• Combine into a single prediction
What is a bootstrapped dataset?

Original dataset
Sno  X1   X2  Y
1    432  29  Yes
2    529  34  Yes
3    125  67  No
4    144  29  No

Random sample rows with replacement:

Bootstrapped dataset 1
Sno  X1   X2  Y
4    144  29  No
2    529  34  Yes
3    125  67  No

Bootstrapped dataset 2
Sno  X1   X2  Y
3    125  67  No
4    144  29  No
4    144  29  No

Bootstrapped dataset 3
Sno  X1   X2  Y
3    125  67  No
2    529  34  Yes
3    125  67  No
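Sampling rows with replacement can be sketched in a few lines of Python. This is a minimal illustration using the four rows from the slide; the helper name `bootstrap_sample`, the seed, and the choice of three rows per sample (mirroring the slide) are assumptions made for the example. Many implementations draw as many rows as the original dataset has.

```python
import random

# Original dataset from the slide: (Sno, X1, X2, Y)
rows = [
    (1, 432, 29, "Yes"),
    (2, 529, 34, "Yes"),
    (3, 125, 67, "No"),
    (4, 144, 29, "No"),
]

def bootstrap_sample(data, n_rows, rng):
    """Draw n_rows rows at random WITH replacement, so duplicates are allowed."""
    return rng.choices(data, k=n_rows)

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
samples = [bootstrap_sample(rows, 3, rng) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    print(f"Bootstrapped dataset {i}: Sno {[row[0] for row in sample]}")
```

Because each draw is made from the full dataset, the same row can appear more than once in a sample and some rows may not appear at all.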
Using a random set of variables every time
Original dataset
Sno  X1   X2  X3   X4  Y
1    432  29  313  6   Yes
2    529  34  379  2   Yes
3    125  67  317  4   No
4    144  29  103  8   No

Random sample rows with replacement, and a random subset of X variables:

Bootstrapped dataset 1 (X1, X2)
Sno  X1   X2  Y
4    144  29  No
2    529  34  Yes
3    125  67  No

Bootstrapped dataset 2 (X3, X4)
Sno  X3   X4  Y
3    317  4   No
4    103  8   No
4    103  8   No

Bootstrapped dataset 3 (X1, X3)
Sno  X1   X3   Y
3    125  317  No
2    529  379  Yes
3    125  317  No
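The two sources of randomness shown above (rows and variables) can be combined in one small helper. This is a sketch that follows the simplified per-dataset picture in the slide; standard random forest implementations usually draw a fresh variable subset at each split of each tree instead. The helper name `bootstrap_with_feature_subset` and the seed are assumptions.

```python
import random

columns = ["X1", "X2", "X3", "X4"]
data = [
    {"Sno": 1, "X1": 432, "X2": 29, "X3": 313, "X4": 6, "Y": "Yes"},
    {"Sno": 2, "X1": 529, "X2": 34, "X3": 379, "X4": 2, "Y": "Yes"},
    {"Sno": 3, "X1": 125, "X2": 67, "X3": 317, "X4": 4, "Y": "No"},
    {"Sno": 4, "X1": 144, "X2": 29, "X3": 103, "X4": 8, "Y": "No"},
]

def bootstrap_with_feature_subset(data, columns, n_rows, n_features, rng):
    """Sample rows with replacement and keep a random subset of X variables."""
    features = sorted(rng.sample(columns, n_features))  # variables: no repeats
    sampled_rows = rng.choices(data, k=n_rows)          # rows: repeats allowed
    keep = ["Sno", *features, "Y"]
    return features, [{key: row[key] for key in keep} for row in sampled_rows]

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
for i in range(1, 4):
    features, subset = bootstrap_with_feature_subset(data, columns, 3, 2, rng)
    print(f"Dataset {i}: variables {features}, Sno {[r['Sno'] for r in subset]}")
```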
Basic idea of random forest
Steps in random forest algorithm
Out of bag data points
Sno  X1   X2  Y
4    144  29  No
2    529  34  Yes
3    125  67  No

• When we create a bootstrapped dataset, ~1/3 of the original rows are typically left out of it; these are the out-of-bag (OOB) data points.
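The "~1/3" figure comes from the chance that a given row is never picked: with n draws with replacement from n rows it is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this and lists the out-of-bag rows for the bootstrapped dataset shown above; the helper names are hypothetical.

```python
import math

def oob_fraction(n_rows: int) -> float:
    """Chance that a given row is never picked in n_rows draws with replacement."""
    return (1 - 1 / n_rows) ** n_rows

def out_of_bag_ids(all_ids, sampled_ids):
    """Rows of the original dataset that do not appear in the bootstrap sample."""
    return sorted(set(all_ids) - set(sampled_ids))

print(f"n = 1000: {oob_fraction(1000):.4f}  (1/e = {math.exp(-1):.4f})")

# Bootstrapped dataset from the slide contains Sno 4, 2, 3; row 1 is left out.
print(out_of_bag_ids([1, 2, 3, 4], [4, 2, 3]))  # -> [1]
```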
How to calculate accuracy
• OOB samples are used to measure how accurate our random forest is
• Accuracy is the proportion of out-of-bag samples correctly classified by the random forest model
• The proportion of OOB samples incorrectly classified is the out-of-bag error
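As a rough illustration of the calculation, each sample is scored only by the trees that did not see it during training. The three "trees" below are hand-written one-rule stumps invented for this example (they are not fitted models), paired with the in-bag row numbers of the three bootstrapped datasets from the earlier slide; `oob_accuracy` is a hypothetical helper.

```python
from collections import Counter

def oob_accuracy(samples, labels, trees):
    """Score each sample using only the trees for which it is out of bag.

    samples: dict of id -> features, labels: dict of id -> true class,
    trees: list of (predict_function, set_of_in_bag_ids).
    """
    correct = 0
    scored = 0
    for sid, x in samples.items():
        votes = [predict(x) for predict, in_bag in trees if sid not in in_bag]
        if not votes:
            continue  # sample was in every bag, so it has no OOB vote
        scored += 1
        majority = Counter(votes).most_common(1)[0][0]
        correct += majority == labels[sid]
    if scored == 0:
        raise ValueError("no sample is out of bag for any tree")
    return correct / scored

samples = {
    1: {"X1": 432, "X2": 29},
    2: {"X1": 529, "X2": 34},
    3: {"X1": 125, "X2": 67},
    4: {"X1": 144, "X2": 29},
}
labels = {1: "Yes", 2: "Yes", 3: "No", 4: "No"}

# Hypothetical stumps standing in for trained trees.
def stump_x1(x):
    return "Yes" if x["X1"] > 300 else "No"

def stump_x2(x):
    return "Yes" if x["X2"] < 30 else "No"

trees = [(stump_x1, {4, 2, 3}), (stump_x2, {3, 4}), (stump_x1, {3, 2})]

accuracy = oob_accuracy(samples, labels, trees)
print(f"OOB accuracy: {accuracy:.3f}")      # 0.667
print(f"OOB error:    {1 - accuracy:.3f}")  # 0.333
```

Row 3 appears in every bag and is therefore skipped; of the three rows that can be scored, two are classified correctly.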
How to decide on how many variables to use per step?
Summary of Random forest
• Consists of a large number of individual decision trees that operate as an ensemble
• Each tree in the random forest spits out a class prediction
• The class with the most votes becomes the model's prediction
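The voting step in this summary can be sketched as follows. The three rules are made-up stand-ins for trained decision trees, and `forest_predict` is a hypothetical name; the point is only to show how per-tree class predictions are tallied.

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the majority class and the vote tally across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0], dict(votes)

# Hypothetical stand-ins for three trained trees.
trees = [
    lambda x: "Yes" if x["X1"] > 300 else "No",
    lambda x: "Yes" if x["X2"] < 30 else "No",
    lambda x: "Yes" if x["X1"] > 500 else "No",
]

prediction, tally = forest_predict(trees, {"X1": 432, "X2": 29})
print(prediction, tally)  # Yes {'Yes': 2, 'No': 1}
```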
Overall flow of the RF classification process
Read csv file
Feature engineering – convert relevant variables to categorical
Find baseline Y class % to check class imbalance
EDA – univariate
• Boxplot for num var
• Barplot for cat var
EDA – bivariate
• Boxplot – num X vs cat Y
• Stacked bar – cat X vs Y
Split into training and test sets
Build a random forest model
Tune ntree & mtry
Predict for train & test
Model performance
• Acc, sens, spec
• AUC
Variable importance
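The model performance measures named in the flow (accuracy, sensitivity, specificity, AUC) can be computed from predictions as sketched below. The labels, predictions and scores are made-up illustrative values, not results from any model; the function names are hypothetical, and both classes are assumed to be present in the data.

```python
def classification_metrics(y_true, y_pred, positive="Yes"):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(y_true, scores, positive="Yes"):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up values for illustration only.
y_true = ["Yes", "Yes", "No", "No", "Yes", "No"]
y_pred = ["Yes", "No", "No", "No", "Yes", "Yes"]
scores = [0.9, 0.4, 0.2, 0.3, 0.8, 0.6]  # e.g. share of trees voting "Yes"

metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
print(f"AUC: {auc(y_true, scores):.3f}")
```

Running the same functions on both the training and the test predictions makes it easy to compare the two and spot overfitting.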