4. Decision Trees
Overview
Decision Trees
- Simple but powerful learning algorithm
- One of the most widely used learning algorithms in Kaggle competitions
- Lets us introduce ensembles, a key idea in ML more broadly
- Useful information-theoretic concepts (entropy, mutual information, etc.)
[Figure: an example decision tree with Yes/No branches over the input attributes]
Decision Tree: Example with Discrete Inputs
The tree to decide whether to wait (T) or not (F)
Decision Trees
Classification tree:
- discrete output
- leaf value $y^m$ typically set to the most common value in $\{t^{(m_1)}, \ldots, t^{(m_k)}\}$
Regression tree:
- continuous output
- leaf value $y^m$ typically set to the mean value in $\{t^{(m_1)}, \ldots, t^{(m_k)}\}$
Note: We will focus on classification
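As a small, hedged illustration of the two leaf rules above (the helper names below are ours, not from the lecture):

```python
from collections import Counter

def classification_leaf_value(targets):
    """Most common target value among the training examples reaching this leaf."""
    return Counter(targets).most_common(1)[0][0]

def regression_leaf_value(targets):
    """Mean target value among the training examples reaching this leaf."""
    return sum(targets) / len(targets)

print(classification_leaf_value(["T", "F", "T"]))   # 'T'
print(regression_leaf_value([1.0, 2.0, 4.0]))       # 2.333...
```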
Expressiveness
Learning the simplest (smallest) decision tree is an NP-complete problem [if you are interested, check Hyafil & Rivest, 1976].
Resort to a greedy heuristic (sketched in code below):
- Start from an empty decision tree
- Split on the "best" attribute
- Recurse
Which attribute is the "best"?
- Choose based on accuracy?
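The greedy heuristic can be written as a short recursive sketch. This is illustrative Python under our own naming (`grow_tree`, `best_attribute`); the scoring of the "best" attribute is left abstract, since how to choose it is exactly the question discussed next.

```python
from collections import Counter

def grow_tree(examples, targets, attributes, best_attribute):
    """Greedy top-down construction: pick the 'best' attribute, split, recurse.

    examples:   list of dicts mapping attribute name -> discrete value
    targets:    list of class labels, aligned with examples
    attributes: attribute names still available for splitting
    """
    # Base case: pure node, or nothing left to split on -> majority-class leaf.
    if len(set(targets)) == 1 or not attributes:
        return {"leaf": Counter(targets).most_common(1)[0][0]}

    attr = best_attribute(examples, targets, attributes)  # e.g. highest information gain
    children = {}
    for value in set(ex[attr] for ex in examples):
        subset = [(ex, t) for ex, t in zip(examples, targets) if ex[attr] == value]
        sub_ex, sub_t = zip(*subset)
        children[value] = grow_tree(list(sub_ex), list(sub_t),
                                    [a for a in attributes if a != attr],
                                    best_attribute)
    return {"split": attr, "children": children}
```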
Choosing a Good Split
Note: Let’s take a slight detour and remember concepts from information theory
We Flip Two Different Coins
Sequence 1:
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ?
Sequence 2:
0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ... ?
[Figure: outcome histograms, 16 zeros and 2 ones for Sequence 1, versus 8 zeros and 10 ones for Sequence 2]
Quantifying Uncertainty
Entropy is a measure of expected “surprise”:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

[Figure: two distributions over outcomes {0, 1}, with probabilities (8/9, 1/9) and (4/9, 5/9)]

-\frac{8}{9}\log_2\frac{8}{9} - \frac{1}{9}\log_2\frac{1}{9} \approx \frac{1}{2} \quad \text{versus} \quad -\frac{4}{9}\log_2\frac{4}{9} - \frac{5}{9}\log_2\frac{5}{9} \approx 0.99
Measures the information content of each observation
Unit = bits
A fair coin flip has 1 bit of entropy
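As a quick numeric check of the two worked examples above (the `entropy` helper is our own):

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([8/9, 1/9]))   # ~0.50 bits: a peaked distribution, little surprise on average
print(entropy([4/9, 5/9]))   # ~0.99 bits: nearly uniform, close to a fair coin
print(entropy([1/2, 1/2]))   # exactly 1 bit: a fair coin flip
```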
Quantifying Uncertainty
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
[Figure: entropy of a coin flip as a function of the probability p of heads, peaking at 1 bit at p = 0.5]
“High Entropy”:
- Variable has a uniform-like distribution
- Flat histogram
- Values sampled from it are less predictable
“Low Entropy”:
- Distribution of variable has many peaks and valleys
- Histogram has many lows and highs
- Values sampled from it are more predictable
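To see the same behaviour numerically rather than from the plot, one can evaluate the binary entropy at a few values of p (a minimal sketch, with our own `binary_entropy` helper):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    print(f"p = {p:4}: H = {binary_entropy(p):.3f} bits")
# The curve peaks at 1 bit at p = 0.5 (flat histogram, least predictable)
# and drops toward 0 bits as p approaches 0 or 1 (peaked histogram, most predictable).
```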
[Table: joint distribution p(x, y), with columns Cloudy / Not Cloudy and cell probabilities 24/100, 1/100, 25/100, 50/100]
H(X, Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 p(x, y)
= -\frac{24}{100}\log_2\frac{24}{100} - \frac{1}{100}\log_2\frac{1}{100} - \frac{25}{100}\log_2\frac{25}{100} - \frac{50}{100}\log_2\frac{50}{100}
\approx 1.56 \text{ bits}
[Figure: splitting the examples into Cloudy and Not Cloudy branches]
Root entropy: H(Y) = -\frac{49}{149}\log_2\frac{49}{149} - \frac{100}{149}\log_2\frac{100}{149} \approx 0.91
Leaf entropies: H(Y|\text{left}) = 0, \; H(Y|\text{right}) \approx 1
IG(\text{split}) \approx 0.91 - \left(\tfrac{1}{3}\cdot 0 + \tfrac{2}{3}\cdot 1\right) \approx 0.24 > 0
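These numbers can be reproduced directly. The sketch below assumes, as in the figure, that the left branch holds one third of the examples with entropy 0 and the right branch the remaining two thirds with entropy roughly 1:

```python
import math

def entropy2(p):
    """Entropy (bits) of a two-outcome distribution with probabilities (p, 1 - p)."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

root = entropy2(49 / 149)                        # H(Y) ~ 0.91 at the root
cond = (1 / 3) * 0.0 + (2 / 3) * entropy2(0.5)   # weighted leaf entropies ~ 0.67
print(root - cond)   # information gain, about 0.25 (~0.24 with the rounded values above)
```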
Constructing Decision Trees
IG(Y) = H(Y) - H(Y|X)

IG(\text{Type}) = 1 - \left[\tfrac{2}{12} H(Y|\text{Fr.}) + \tfrac{2}{12} H(Y|\text{It.}) + \tfrac{4}{12} H(Y|\text{Thai}) + \tfrac{4}{12} H(Y|\text{Bur.})\right] = 0

IG(\text{Patrons}) = 1 - \left[\tfrac{2}{12} H(0, 1) + \tfrac{4}{12} H(1, 0) + \tfrac{6}{12} H\!\left(\tfrac{2}{6}, \tfrac{4}{6}\right)\right] \approx 0.541
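Both gains can be checked numerically. The sketch below simply plugs in the class proportions from the formulas above; for IG(Type) it uses the fact that a gain of exactly 0 under a 1-bit root entropy means every Type subset is itself a 50/50 split.

```python
import math

def H(*fracs):
    """Entropy (bits) of a node whose class proportions are the given fractions."""
    return -sum(f * math.log2(f) for f in fracs if f > 0)

# Type: each subset is still an even split, so the weighted leaf entropy equals
# the 1-bit root entropy and the gain is 0 (up to float rounding).
ig_type = 1 - (2/12 * H(1/2, 1/2) + 2/12 * H(1/2, 1/2)
               + 4/12 * H(1/2, 1/2) + 4/12 * H(1/2, 1/2))

# Patrons: the three subsets have class proportions (0, 1), (1, 0) and (2/6, 4/6).
ig_patrons = 1 - (2/12 * H(0, 1) + 4/12 * H(1, 0) + 6/12 * H(2/6, 4/6))

print(ig_type, ig_patrons)   # ~0.0 and ~0.541, so Patrons is the better split
```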
Decision Tree Miscellany
Problems:
- You have exponentially less data at lower levels
- Too big of a tree can overfit the data
- Greedy algorithms don't necessarily yield the global optimum
Handling continuous attributes:
- Split based on a threshold, chosen to maximize information gain (see the sketch below)
Decision trees can also be used for regression on real-valued outputs. Choose splits to minimize squared error, rather than maximize information gain.
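For continuous attributes, one common recipe (sketched below for a classification target; the helper names are our own) is to try thresholds at midpoints between adjacent sorted values and keep the one with the highest information gain:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the empirical class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) for the best split on one continuous attribute."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    root = label_entropy(ys)
    best = (None, 0.0)                   # no threshold beats zero gain yet
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                     # can't place a threshold between equal values
        thr = (xs[i] + xs[i - 1]) / 2    # candidate: midpoint between adjacent values
        left, right = ys[:i], ys[i:]
        cond = (len(left) * label_entropy(left) +
                len(right) * label_entropy(right)) / len(ys)
        gain = root - cond
        if gain > best[1]:
            best = (thr, gain)
    return best

# e.g. best_threshold([1.0, 2.0, 3.5, 4.0], ["F", "F", "T", "T"]) -> (2.75, 1.0)
```

For regression targets, the same loop would instead keep the threshold that minimizes the weighted squared error of the two children.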
Comparison to k-NN
Advantages of decision trees over k-NN:
- Good when there are lots of attributes, but only a few are important
- Good with discrete attributes
- Easily deals with missing values (just treat as another value)
- Robust to scale of inputs
- Fast at test time
- More interpretable
Advantages of k-NN over decision trees:
- Few hyperparameters
- Able to handle attributes/features that interact in complex ways (e.g., pixels)
- Can incorporate interesting distance measures (e.g., shape contexts)
- Typically makes better predictions in practice
  - As we'll see next lecture, ensembles of decision trees are much stronger, but they lose many of the advantages listed above.
Conditions
• A decision tree is a model composed of a collection of "questions" organized hierarchically in the shape of a tree.
• The questions are usually called a condition, a split, or a test.
• Axis-aligned vs. oblique conditions:
  • An axis-aligned condition involves only a single feature.
  • An oblique condition involves multiple features (see the example below).
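A tiny illustration of the two kinds of condition; the feature names and coefficients here are invented for the example:

```python
# Axis-aligned condition: tests a single feature against a threshold.
def axis_aligned_condition(example):
    return example["num_legs"] >= 3          # hypothetical feature name

# Oblique condition: tests a linear combination of several features.
def oblique_condition(example):
    return 0.4 * example["num_legs"] + 1.2 * example["num_eyes"] >= 5.0

print(axis_aligned_condition({"num_legs": 4, "num_eyes": 2}))   # True
print(oblique_condition({"num_legs": 4, "num_eyes": 2}))        # False (1.6 + 2.4 = 4.0 < 5.0)
```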
Conditions
• If the dataset contains noise, this tree will overfit the data and show poor test accuracy.
• A noisy dataset:
[Figure: a noisy dataset]
Overfitting and pruning
• For pruning, a common way to select which branches to remove is to use a validation dataset.
• If removing a branch improves the quality of the model on the validation dataset, then the branch is removed (a practical sketch follows below).
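One practical way to realize this idea is sketched below with scikit-learn's cost-complexity pruning used as a stand-in for the branch-by-branch procedure described above: compute the pruning strengths the library suggests from the training data, then keep whichever pruned tree scores best on the validation split. The data here is synthetic and purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration; substitute your own train/validation split.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=0)

# Candidate pruning strengths (alphas) computed from the training data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)     # accuracy of the pruned tree on validation data
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, best_score, pruned.get_n_leaves())
```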