4. Decision Trees
Overview
Decision Trees
- Simple but powerful learning algorithm
- One of the most widely used learning algorithms in Kaggle competitions
- Lets us introduce ensembles, a key idea in ML more broadly
- Useful information-theoretic concepts (entropy, mutual information, etc.)
[Figure: an example decision tree with Yes/No branches over the input attributes]
Decision Tree: Example with Discrete Inputs
The tree to decide whether to wait (T) or not (F)
Decision Trees
Classification tree:
- discrete output
- leaf value $y^m$ typically set to the most common value in $\{t^{(m_1)}, \ldots, t^{(m_k)}\}$
Regression tree:
- continuous output
- leaf value $y^m$ typically set to the mean value in $\{t^{(m_1)}, \ldots, t^{(m_k)}\}$
Note: We will focus on classification
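As a small, hedged illustration of the two leaf rules above (the helper names below are ours, not from the lecture):

```python
from collections import Counter

def classification_leaf_value(targets):
    """Most common target value among the training examples reaching this leaf."""
    return Counter(targets).most_common(1)[0][0]

def regression_leaf_value(targets):
    """Mean target value among the training examples reaching this leaf."""
    return sum(targets) / len(targets)

print(classification_leaf_value(["T", "F", "T"]))   # 'T'
print(regression_leaf_value([1.0, 2.0, 4.0]))       # 2.333...
```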
Expressiveness
Learning the simplest (smallest) decision tree is an NP-complete problem [if you are interested, check Hyafil & Rivest, 1976].
Resort to a greedy heuristic (sketched in code below):
- Start from an empty decision tree
- Split on the "best" attribute
- Recurse
Which attribute is the "best"?
- Choose based on accuracy?
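The greedy heuristic can be written as a short recursive sketch. This is illustrative Python under our own naming (`grow_tree`, `best_attribute`); the scoring of the "best" attribute is left abstract, since how to choose it is exactly the question discussed next.

```python
from collections import Counter

def grow_tree(examples, targets, attributes, best_attribute):
    """Greedy top-down construction: pick the 'best' attribute, split, recurse.

    examples:   list of dicts mapping attribute name -> discrete value
    targets:    list of class labels, aligned with examples
    attributes: attribute names still available for splitting
    """
    # Base case: pure node, or nothing left to split on -> majority-class leaf.
    if len(set(targets)) == 1 or not attributes:
        return {"leaf": Counter(targets).most_common(1)[0][0]}

    attr = best_attribute(examples, targets, attributes)  # e.g. highest information gain
    children = {}
    for value in set(ex[attr] for ex in examples):
        subset = [(ex, t) for ex, t in zip(examples, targets) if ex[attr] == value]
        sub_ex, sub_t = zip(*subset)
        children[value] = grow_tree(list(sub_ex), list(sub_t),
                                    [a for a in attributes if a != attr],
                                    best_attribute)
    return {"split": attr, "children": children}
```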
Choosing a Good Split
Note: Let’s take a slight detour and remember concepts from information theory
We Flip Two Different Coins
Sequence 1:
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ?
Sequence 2:
0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ... ?
[Figure: outcome histograms, 16 zeros and 2 ones for Sequence 1, versus 8 zeros and 10 ones for Sequence 2]
Quantifying Uncertainty
Entropy is a measure of expected “surprise”:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

[Figure: two distributions over outcomes {0, 1}, with probabilities (8/9, 1/9) and (4/9, 5/9)]

-\frac{8}{9}\log_2\frac{8}{9} - \frac{1}{9}\log_2\frac{1}{9} \approx \frac{1}{2} \quad \text{versus} \quad -\frac{4}{9}\log_2\frac{4}{9} - \frac{5}{9}\log_2\frac{5}{9} \approx 0.99
Measures the information content of each observation
Unit = bits
A fair coin flip has 1 bit of entropy
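As a quick numeric check of the two worked examples above (the `entropy` helper is our own):

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([8/9, 1/9]))   # ~0.50 bits: a peaked distribution, little surprise on average
print(entropy([4/9, 5/9]))   # ~0.99 bits: nearly uniform, close to a fair coin
print(entropy([1/2, 1/2]))   # exactly 1 bit: a fair coin flip
```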
Quantifying Uncertainty
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
[Figure: entropy of a coin flip as a function of the probability p of heads, peaking at 1 bit at p = 0.5]
“High Entropy”:
- Variable has a uniform-like distribution
- Flat histogram
- Values sampled from it are less predictable
“Low Entropy”:
- Distribution of variable has many peaks and valleys
- Histogram has many lows and highs
- Values sampled from it are more predictable
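To see the same behaviour numerically rather than from the plot, one can evaluate the binary entropy at a few values of p (a minimal sketch, with our own `binary_entropy` helper):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    print(f"p = {p:4}: H = {binary_entropy(p):.3f} bits")
# The curve peaks at 1 bit at p = 0.5 (flat histogram, least predictable)
# and drops toward 0 bits as p approaches 0 or 1 (peaked histogram, most predictable).
```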
[Table: joint distribution p(x, y), with columns Cloudy / Not Cloudy and cell probabilities 24/100, 1/100, 25/100, 50/100]
H(X, Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 p(x, y)
= -\frac{24}{100}\log_2\frac{24}{100} - \frac{1}{100}\log_2\frac{1}{100} - \frac{25}{100}\log_2\frac{25}{100} - \frac{50}{100}\log_2\frac{50}{100}
\approx 1.56 \text{ bits}
[Figure: splitting the examples into Cloudy and Not Cloudy branches]
Root entropy: H(Y) = -\frac{49}{149}\log_2\frac{49}{149} - \frac{100}{149}\log_2\frac{100}{149} \approx 0.91
Leaf entropies: H(Y|\text{left}) = 0, \; H(Y|\text{right}) \approx 1
IG(\text{split}) \approx 0.91 - \left(\tfrac{1}{3}\cdot 0 + \tfrac{2}{3}\cdot 1\right) \approx 0.24 > 0
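These numbers can be reproduced directly. The sketch below assumes, as in the figure, that the left branch holds one third of the examples with entropy 0 and the right branch the remaining two thirds with entropy roughly 1:

```python
import math

def entropy2(p):
    """Entropy (bits) of a two-outcome distribution with probabilities (p, 1 - p)."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

root = entropy2(49 / 149)                        # H(Y) ~ 0.91 at the root
cond = (1 / 3) * 0.0 + (2 / 3) * entropy2(0.5)   # weighted leaf entropies ~ 0.67
print(root - cond)   # information gain, about 0.25 (~0.24 with the rounded values above)
```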
Constructing Decision Trees
IG(Y) = H(Y) - H(Y|X)

IG(\text{Type}) = 1 - \left[\tfrac{2}{12} H(Y|\text{Fr.}) + \tfrac{2}{12} H(Y|\text{It.}) + \tfrac{4}{12} H(Y|\text{Thai}) + \tfrac{4}{12} H(Y|\text{Bur.})\right] = 0

IG(\text{Patrons}) = 1 - \left[\tfrac{2}{12} H(0, 1) + \tfrac{4}{12} H(1, 0) + \tfrac{6}{12} H\!\left(\tfrac{2}{6}, \tfrac{4}{6}\right)\right] \approx 0.541
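Both gains can be checked numerically. The sketch below simply plugs in the class proportions from the formulas above; for IG(Type) it uses the fact that a gain of exactly 0 under a 1-bit root entropy means every Type subset is itself a 50/50 split.

```python
import math

def H(*fracs):
    """Entropy (bits) of a node whose class proportions are the given fractions."""
    return -sum(f * math.log2(f) for f in fracs if f > 0)

# Type: each subset is still an even split, so the weighted leaf entropy equals
# the 1-bit root entropy and the gain is 0 (up to float rounding).
ig_type = 1 - (2/12 * H(1/2, 1/2) + 2/12 * H(1/2, 1/2)
               + 4/12 * H(1/2, 1/2) + 4/12 * H(1/2, 1/2))

# Patrons: the three subsets have class proportions (0, 1), (1, 0) and (2/6, 4/6).
ig_patrons = 1 - (2/12 * H(0, 1) + 4/12 * H(1, 0) + 6/12 * H(2/6, 4/6))

print(ig_type, ig_patrons)   # ~0.0 and ~0.541, so Patrons is the better split
```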
Decision Tree Miscellany
Problems:
- You have exponentially less data at lower levels
- Too big of a tree can overfit the data
- Greedy algorithms don't necessarily yield the global optimum
Handling continuous attributes:
- Split based on a threshold, chosen to maximize information gain (see the sketch below)
Decision trees can also be used for regression on real-valued outputs. Choose splits to minimize squared error, rather than maximize information gain.
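For continuous attributes, one common recipe (sketched below for a classification target; the helper names are our own) is to try thresholds at midpoints between adjacent sorted values and keep the one with the highest information gain:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the empirical class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) for the best split on one continuous attribute."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    root = label_entropy(ys)
    best = (None, 0.0)                   # no threshold beats zero gain yet
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                     # can't place a threshold between equal values
        thr = (xs[i] + xs[i - 1]) / 2    # candidate: midpoint between adjacent values
        left, right = ys[:i], ys[i:]
        cond = (len(left) * label_entropy(left) +
                len(right) * label_entropy(right)) / len(ys)
        gain = root - cond
        if gain > best[1]:
            best = (thr, gain)
    return best

# e.g. best_threshold([1.0, 2.0, 3.5, 4.0], ["F", "F", "T", "T"]) -> (2.75, 1.0)
```

For regression targets, the same loop would instead keep the threshold that minimizes the weighted squared error of the two children.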
Comparison to k-NN
Advantages of decision trees over k-NN:
- Good when there are lots of attributes, but only a few are important
- Good with discrete attributes
- Easily deals with missing values (just treat as another value)
- Robust to scale of inputs
- Fast at test time
- More interpretable
Advantages of k-NN over decision trees:
- Few hyperparameters
- Able to handle attributes/features that interact in complex ways (e.g., pixels)
- Can incorporate interesting distance measures (e.g., shape contexts)
- Typically makes better predictions in practice
  - As we'll see next lecture, ensembles of decision trees are much stronger, but they lose many of the advantages listed above.
Conditions
• A decision tree is a model composed of a collection of "questions" organized hierarchically in the shape of a tree.
• The questions are usually called a condition, a split, or a test.
• Axis-aligned vs. oblique conditions:
  • An axis-aligned condition involves only a single feature.
  • An oblique condition involves multiple features (see the example below).
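A tiny illustration of the two kinds of condition; the feature names and coefficients here are invented for the example:

```python
# Axis-aligned condition: tests a single feature against a threshold.
def axis_aligned_condition(example):
    return example["num_legs"] >= 3          # hypothetical feature name

# Oblique condition: tests a linear combination of several features.
def oblique_condition(example):
    return 0.4 * example["num_legs"] + 1.2 * example["num_eyes"] >= 5.0

print(axis_aligned_condition({"num_legs": 4, "num_eyes": 2}))   # True
print(oblique_condition({"num_legs": 4, "num_eyes": 2}))        # False (1.6 + 2.4 = 4.0 < 5.0)
```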
Conditions
• If the dataset contains noise, this tree will overfit the data and show poor test accuracy.
• A noisy dataset:
[Figure: a noisy dataset]
Overfitting and pruning
• For pruning, a common way to select which branches to remove is to use a validation dataset.
• If removing a branch improves the quality of the model on the validation dataset, then the branch is removed (a practical sketch follows below).
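One practical way to realize this idea is sketched below with scikit-learn's cost-complexity pruning used as a stand-in for the branch-by-branch procedure described above: compute the pruning strengths the library suggests from the training data, then keep whichever pruned tree scores best on the validation split. The data here is synthetic and purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration; substitute your own train/validation split.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=0)

# Candidate pruning strengths (alphas) computed from the training data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)     # accuracy of the pruned tree on validation data
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, best_score, pruned.get_n_leaves())
```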