Lecture Notes For Chapter 3 Introduction To Data Mining, 2 Edition
Lecture Notes For Chapter 3 Introduction To Data Mining, 2 Edition
Lecture Notes For Chapter 3 Introduction To Data Mining, 2 Edition
Task:
– Learn a model that maps each attribute set x
into one of the predefined class labels y
Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets
Ensemble Classifiers
– Boosting, Bagging, Random Forests
cal cal us
i i o
or or nu
teg
teg
nti
ass
ca ca co cl
Splitting Attributes
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
1 Yes Single 125K No Home
2 No Married 100K No Owner
Yes No
3 No Single 70K No
4 Yes Married 120K No NO MarSt
5 No Divorced 95K Yes Single, Divorced Married
6 No Married 60K No
Income NO
7 Yes Divorced 220K No
< 80K > 80K
8 No Single 85K Yes
9 No Married 75K No NO YES
10 No Single 90K Yes
10
Test Data
Start from the root of tree.
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married Assign Defaulted to
“No”
Income NO
< 80K > 80K
NO YES
cal cal us
i i o
or or nu
teg
teg
nti
ass
l
ca ca co c MarSt Single,
Married Divorced
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
NO Home
1 Yes Single 125K No
Yes Owner No
2 No Married 100K No
3 No Single 70K No NO Income
4 Yes Married 120K No < 80K > 80K
5 No Divorced 95K Yes
NO YES
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No There could be more than one tree that
10 No Single 90K
fits the same data!
Yes
10
6 No Medium 60K No
Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?
15 No Large 67K ?
10
Test Set
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Defaulted = No
(3,0) Single,
Married
Annual
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Multi-way split:
Marital
– Use as many partitions as Status
distinct values.
Binary split:
– Divides values into two subsets
{Small, {Medium,
Large} Extra Large}
2/1/2021 Introduction to Data Mining, 2 nd Edition 24
Test Condition for Continuous Attributes
Annual Annual
Income Income?
> 80K?
< 10K > 80K
Yes No
Sorted data for price (in dollars): 5, 10, 50, 72, 92, 104, 215
• Number of bins: 3
• A=5, B=215, N=3
• W = (B –A)/N.
• W= (215-5)/3 = 70
28
Example: Equal-depth (frequency)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34 29
How to determine the Best Split
Greedy approach:
– Nodes with purer class distribution are
preferred
C0: 5 C0: 9
C1: 5 C1: 1
Entropy 𝑐 −1
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 =− ∑ 𝑝 𝑖 ( 𝑡 ) 𝑙𝑜 𝑔2 𝑝 𝑖 (𝑡)
𝑖=0
Misclassification error
𝐶𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑐𝑎𝑡𝑖𝑜𝑛𝑒𝑟𝑟𝑜𝑟 =1− max [𝑝¿ ¿ 𝑖(𝑡)]¿
Gain = P - M
A? B?
Yes No Yes No
M1 M2
Gain = P – M1 vs P – M2
2/1/2021 Introduction to Data Mining, 2 nd Edition 34
Measure of Impurity: GINI
𝑖=0
Where is the frequency of class at node , and is the total
number of classes
𝑖=0
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500
𝑖=0
Annual Income ?
gather count matrix and compute
its Gini index
≤ 80 > 80
– Computationally Inefficient!
Repetition of work. Defaulted Yes 0 3
Defaulted No 3 4
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Information Gain:
𝑘
𝑛𝑖
𝐺𝑎𝑖 𝑛𝑠𝑝𝑙𝑖𝑡 = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 ( 𝑝 ) − ∑ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑖)
𝑖=1 𝑛
Gain Ratio:
𝑘
𝐺𝑎𝑖𝑛 𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜= 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜=− ∑ 𝑙𝑜𝑔 2
𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 𝑖=1 𝑛 𝑛
Gain Ratio:
𝑘
𝐺𝑎𝑖𝑛 𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜= 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜=∑ 𝑙𝑜 𝑔 2
𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 𝑖=1 𝑛 𝑛
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
Gini(N1) N1 N2
= 1 – (3/3)2 – (0/3)2 Gini(Children)
C1 3 4 = 3/10 * 0
=0
C2 0 3 + 7/10 * 0.489
Gini(N2) Gini=0.342 = 0.342
= 1 – (4/7)2 – (3/7)2
= 0.489 Gini improves but
error remains the
same!!
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
N1 N2 N1 N2
C1 3 4 C1 3 4
C2 0 3 C2 1 2
Gini=0.342 Gini=0.416