
4. Decision Tree


Decision Tree

Overview

Decision Trees
- Simple but powerful learning algorithm
- One of the most widely used learning algorithms in Kaggle competitions
Lets us introduce ensembles, a key idea in ML more broadly
Useful information-theoretic concepts (entropy, mutual information, etc.)



Decision Trees

[Figure: a small decision tree whose internal nodes branch on Yes/No tests.]


Decision Trees
Decision trees make predictions by recursively splitting on different attributes according to a tree structure.

Example with Discrete Inputs
What if the attributes are discrete?

Attributes:

Decision Tree: Example with Discrete Inputs
The tree to decide whether to wait (T) or not (F)
Decision Trees

[Figure: a decision tree whose internal nodes branch on Yes/No tests.]

Internal nodes test attributes


Branching is determined by attribute value
Leaf nodes are outputs (predictions)
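
To make this structure concrete, here is a minimal, hand-built sketch of such a tree as nested attribute tests; the attribute names, values, and threshold are purely illustrative, not taken from the slides' figure:

```python
def predict_wait(example):
    """Toy hand-built decision tree for a 'wait for a table?' question.
    Internal nodes test attributes, branches follow the attribute's value,
    and leaves return the prediction."""
    if example["patrons"] == "none":            # internal node: test an attribute
        return False                            # leaf: prediction
    elif example["patrons"] == "some":
        return True
    else:                                       # patrons == "full": test another attribute
        return example["wait_estimate"] <= 30   # leaf depends on a second test

print(predict_wait({"patrons": "full", "wait_estimate": 20}))  # True
```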
Decision Tree: Classification and Regression

Each path from the root to a leaf defines a region R_m of input space
Let {(x^(m_1), t^(m_1)), ..., (x^(m_k), t^(m_k))} be the training examples that fall into R_m

Classification tree:
- discrete output
- leaf value y^m typically set to the most common value in {t^(m_1), ..., t^(m_k)}
Regression tree:
- continuous output
- leaf value y^m typically set to the mean value in {t^(m_1), ..., t^(m_k)}
Note: We will focus on classification
Expressiveness

Discrete-input, discrete-output case:
- Decision trees can express any function of the input attributes
- E.g., for Boolean functions, truth table row → path to leaf

Continuous-input, continuous-output case:
- Can approximate any function arbitrarily closely

Trivially, there is a consistent decision tree for any training set with one path to leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
How do we Learn a Decision Tree?

How do we construct a useful decision tree?


Learning Decision Trees

Learning the simplest (smallest) decision tree is an NP-complete problem [if you are interested, check: Hyafil & Rivest '76]
Resort to a greedy heuristic:
- Start from an empty decision tree
- Split on the "best" attribute
- Recurse
Which attribute is the "best"?
- Choose based on accuracy?
Choosing a Good Split

Why isn’t accuracy a good measure?

Is this split good? Zero accuracy gain.


Instead, we will use techniques from information theory
Idea: Use counts at leaves to define probability distributions, so we can measure
uncertainty



Choosing a Good Split
Which attribute is better to split on, X1 or X2?
- Deterministic: good (all are true or false; just one class in the leaf)
- Uniform distribution: bad (all classes in leaf equally probable)
- What about distributions in between?

Note: Let’s take a slight detour and remember concepts from information theory
We Flip Two Different Coins

Sequence 1:
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ?

Sequence 2:
0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ... ?
[Figure: count histograms of the two sequences: Sequence 1 has 16 zeros and 2 ones; Sequence 2 has 8 zeros and 10 ones.]
Quantifying Uncertainty
Entropy is a measure of expected "surprise":

H(X) = -Σ_{x∈X} p(x) log2 p(x)

[Figure: the two empirical coin distributions: 8/9 vs. 1/9 and 4/9 vs. 5/9.]

-(8/9) log2(8/9) - (1/9) log2(1/9) ≈ 1/2        versus        -(4/9) log2(4/9) - (5/9) log2(5/9) ≈ 0.99
Measures the information content of each observation
Unit = bits
A fair coin flip has 1 bit of entropy
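
As a quick numerical check of the two coins above, here is a minimal entropy computation; it assumes NumPy is available:

```python
import numpy as np

def entropy(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                     # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([8/9, 1/9]))   # ~0.50 bits: the heavily biased coin is less surprising
print(entropy([4/9, 5/9]))   # ~0.99 bits: the nearly fair coin
print(entropy([0.5, 0.5]))   # exactly 1 bit: a fair coin flip
```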
Quantifying Uncertainty

H(X) = -Σ_{x∈X} p(x) log2 p(x)

[Plot: entropy of a coin flip as a function of the probability p of heads; it rises from 0 at p = 0 or 1 to a maximum of 1 bit at p = 0.5.]



Entropy

"High Entropy":
- Variable has a uniform-like distribution
- Flat histogram
- Values sampled from it are less predictable
"Low Entropy":
- Distribution of variable has many peaks and valleys
- Histogram has many lows and highs
- Values sampled from it are more predictable

[Slide credit: Vibhav Gogate]


Entropy of a Joint Distribution

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

             Cloudy    Not Cloudy
Raining       24/100        1/100
Not Raining   25/100       50/100

H(X, Y) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(x, y)
        = -(24/100) log2(24/100) - (1/100) log2(1/100) - (25/100) log2(25/100) - (50/100) log2(50/100)
        ≈ 1.56 bits
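
The same number can be reproduced directly from the table; a small sketch (assuming NumPy):

```python
import numpy as np

# Joint distribution p(x, y): rows = {Raining, Not raining}, columns = {Cloudy, Not cloudy}
joint = np.array([[24/100,  1/100],
                  [25/100, 50/100]])

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy(joint))   # H(X, Y) ~ 1.56 bits, treating every cell as one outcome
```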



Specific Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

             Cloudy    Not Cloudy
Raining       24/100        1/100
Not Raining   25/100       50/100

What is the entropy of cloudiness Y, given that it is raining?

H(Y | X = x) = -Σ_{y∈Y} p(y | x) log2 p(y | x)
H(Y | X = raining) = -(24/25) log2(24/25) - (1/25) log2(1/25)
                   ≈ 0.24 bits

We used: p(y | x) = p(x, y) / p(x), and p(x) = Σ_y p(x, y) (sum over a row)
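
In code, conditioning on X = raining just means renormalizing that row of the table before taking its entropy; a minimal sketch:

```python
import numpy as np

joint = np.array([[24/100,  1/100],    # Raining
                  [25/100, 50/100]])   # Not raining

p_y_given_raining = joint[0] / joint[0].sum()      # p(y | raining) = [24/25, 1/25]
p = p_y_given_raining[p_y_given_raining > 0]
print(-np.sum(p * np.log2(p)))                     # H(Y | X = raining) ~ 0.24 bits
```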
Conditional Entropy

             Cloudy    Not Cloudy
Raining       24/100        1/100
Not Raining   25/100       50/100

The expected conditional entropy:

H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x)
         = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(y | x)
Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

             Cloudy    Not Cloudy
Raining       24/100        1/100
Not Raining   25/100       50/100

What is the entropy of cloudiness, given the knowledge of whether or not it is raining?

H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x)
         = (1/4) H(cloudy | is raining) + (3/4) H(cloudy | not raining)
         ≈ 0.75 bits
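
A small sketch of the expected conditional entropy, weighting each row's entropy by p(x):

```python
import numpy as np

joint = np.array([[24/100,  1/100],    # Raining
                  [25/100, 50/100]])   # Not raining

def conditional_entropy(joint):
    """H(Y | X) = sum_x p(x) H(Y | X = x), with rows of `joint` indexed by x."""
    p_x = joint.sum(axis=1)
    total = 0.0
    for x in range(joint.shape[0]):
        p_y_given_x = joint[x] / p_x[x]
        p = p_y_given_x[p_y_given_x > 0]
        total += p_x[x] * -np.sum(p * np.log2(p))
    return total

print(conditional_entropy(joint))      # ~0.75 bits
```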
Conditional Entropy

Some useful properties:
- H is always non-negative
- Chain rule: H(X, Y) = H(X | Y) + H(Y) = H(Y | X) + H(X)
- If X and Y are independent, then X doesn't tell us anything about Y: H(Y | X) = H(Y)
- But Y tells us everything about Y: H(Y | Y) = 0
- By knowing X, we can only decrease uncertainty about Y: H(Y | X) ≤ H(Y)
Information Gain

             Cloudy    Not Cloudy
Raining       24/100        1/100
Not Raining   25/100       50/100

How much information about cloudiness do we get by discovering whether it is raining?

IG(Y | X) = H(Y) - H(Y | X) ≈ 0.25 bits

This is called the information gain in Y due to X, or the mutual information of Y and X
If X is completely uninformative about Y: IG(Y | X) = 0
If X is completely informative about Y: IG(Y | X) = H(Y)
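
Putting the pieces together for the rain/cloud table (marginal entropy minus conditional entropy); a minimal sketch:

```python
import numpy as np

joint = np.array([[24/100,  1/100],    # Raining
                  [25/100, 50/100]])   # Not raining

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(joint):
    p_x = joint.sum(axis=1)
    return sum(p_x[x] * entropy(joint[x] / p_x[x]) for x in range(joint.shape[0]))

# IG(Y | X) = H(Y) - H(Y | X); the marginal p(y) is the column sum of the table
print(entropy(joint.sum(axis=0)) - conditional_entropy(joint))   # ~0.25 bits
```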
Revisiting Our Original Example

Information gain measures the informativeness of a variable, which is exactly what we desire in a decision tree attribute!
What is the information gain of this split?

Root entropy: H(Y) = -(49/149) log2(49/149) - (100/149) log2(100/149) ≈ 0.91
Leaf entropies: H(Y | left) = 0, H(Y | right) ≈ 1
IG(split) ≈ 0.91 - (1/3 · 0 + 2/3 · 1) ≈ 0.24 > 0
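
The same arithmetic from class counts, as a sketch; the exact left/right leaf counts below are an assumption chosen to match the slide's 1/3 : 2/3 weights (a pure left leaf and a roughly 50/50 right leaf):

```python
import numpy as np

def entropy_from_counts(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_information_gain(parent, children):
    """Parent entropy minus the size-weighted entropies of the child leaves."""
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) / n * entropy_from_counts(c) for c in children)
    return entropy_from_counts(parent) - weighted

print(entropy_from_counts([49, 100]))                           # root entropy ~ 0.91
print(split_information_gain([49, 100], [[0, 50], [49, 50]]))   # ~ 0.25 > 0 (assumed counts)
```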
Constructing Decision Trees

[Figure: a decision tree with Yes/No branches.]

At each level, one must choose:
1. Which variable to split.
2. Possibly where to split it.
Choose them based on how much information we would gain from the decision! (choose the attribute that gives the highest gain)
Decision Tree Construction Algorithm

Simple, greedy, recursive approach that builds up the tree node by node (a rough code sketch follows the steps below):

1. pick an attribute to split at a non-terminal node
2. split examples into groups based on attribute value
3. for each group:
   - if no examples – return majority from parent
   - else if all examples in same class – return class
   - else loop to step 1
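
As mentioned above, here is a rough, self-contained Python sketch of this greedy recursion for discrete attributes; the nested-dict tree representation and the tiny dataset at the bottom are illustrative choices, not part of the slides:

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    groups = {}
    for x, t in zip(examples, labels):
        groups.setdefault(x[attr], []).append(t)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(examples, labels, attrs, parent_majority=None):
    if not labels:                                    # no examples: return majority from parent
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:            # all examples in same class (or nothing left to split)
        return majority
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))   # greedy "best" attribute
    tree = {"attr": best, "children": {}}
    for value in set(x[best] for x in examples):      # one branch per attribute value
        subset = [(x, t) for x, t in zip(examples, labels) if x[best] == value]
        sub_x = [x for x, _ in subset]
        sub_t = [t for _, t in subset]
        tree["children"][value] = build_tree(sub_x, sub_t, attrs - {best}, majority)
    return tree

# Tiny illustrative dataset (attribute name and values are made up)
X = [{"patrons": "none"}, {"patrons": "some"}, {"patrons": "full"}, {"patrons": "some"}]
t = [False, True, False, True]
print(build_tree(X, t, {"patrons"}))
```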
Back to Our Example

[from: Russell & Norvig]


Attributes:
Attribute Selection

IG(Y | X) = H(Y) - H(Y | X)

IG(type) = 1 - [ (2/12) H(Y | Fr.) + (2/12) H(Y | It.) + (4/12) H(Y | Thai) + (4/12) H(Y | Bur.) ] = 0

IG(Patrons) = 1 - [ (2/12) H(0, 1) + (4/12) H(1, 0) + (6/12) H(2/6, 4/6) ] ≈ 0.541
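
As a numerical check, a small sketch using the class counts implied by the formulas above (6 "wait" vs. 6 "don't wait" examples at the root; the per-leaf counts follow the H(...) arguments shown):

```python
import numpy as np

def H(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_gain(parent, children):
    n = sum(sum(c) for c in children)
    return H(parent) - sum(sum(c) / n * H(c) for c in children)

root = [6, 6]                                              # H(root) = 1 bit

# Type: French, Italian, Thai, Burger leaves are each half wait / half don't wait
print(info_gain(root, [[1, 1], [1, 1], [2, 2], [2, 2]]))   # 0.0

# Patrons: None (all don't wait), Some (all wait), Full (2 wait / 4 don't wait)
print(info_gain(root, [[0, 2], [4, 0], [2, 4]]))           # ~0.541
```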
Decision Tree Miscellany

Problems:
- You have exponentially less data at lower levels
- Too big of a tree can overfit the data
- Greedy algorithms don't necessarily yield the global optimum
Handling continuous attributes
- Split based on a threshold, chosen to maximize information gain
Decision trees can also be used for regression on real-valued outputs. Choose splits to minimize squared error, rather than maximize information gain. (A small sketch of threshold selection follows below.)
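
A minimal sketch of threshold selection for a continuous attribute: scan the midpoints between sorted values and keep the one with the highest information gain. The 1-D data at the bottom is made up. For regression, the same scan would compare the children's squared errors around their means instead of entropies.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    """Return (threshold, information gain) for the binary test x <= threshold."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_gain, best_t = -1.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2                  # candidate: midpoint between neighbours
        left, right = y[x <= t], y[x > t]
        gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = [1.0, 2.0, 2.5, 7.0, 8.0, 9.0]   # made-up feature values
y = [0, 0, 0, 1, 1, 1]               # made-up labels
print(best_threshold(x, y))          # threshold 4.75, gain 1.0 bit
```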
Comparison to k-NN

Advantages of decision trees over k-NN
Good when there are lots of attributes, but only a few are important
Good with discrete attributes
Easily deals with missing values (just treat as another value)
Robust to scale of inputs
Fast at test time
More interpretable
Advantages of k-NN over decision trees
Few hyperparameters
Able to handle attributes/features that interact in complex ways (e.g. pixels)
Can incorporate interesting distance measures (e.g. shape contexts)
Typically make better predictions in practice
- As we'll see next lecture, ensembles of decision trees are much stronger. But they lose many of the advantages listed above.
Conditions

• A decision tree is a model composed of a collection of "questions" organized hierarchically in the shape of a
tree.
• The questions are usually called a condition, a split, or a test.
• Axis-aligned vs. oblique conditions
• An axis-aligned condition involves only a single feature.
• An oblique condition involves multiple features.
Conditions

• Binary vs. non-binary conditions


• Non-binary conditions have more discriminative power than binary conditions.
• They are also more likely to overfit.
• Binary decision trees are generally used in practice.
Conditions

• Common types of binary conditions.


Decision Tree Learning Algorithm
Overfitting and pruning

• If the dataset contains noise, this tree will overfit to the data and show poor test accuracy.
• A noisy dataset:
Overfitting and pruning

• Three main regularization techniques (a brief code sketch of the first two follows below):

• Set a maximum depth: Prevent decision trees from growing past a maximum depth, such as 10.
• Set a minimum number of examples in leaf: A leaf with fewer than a certain number of examples will not be considered for splitting.
• Selectively removing (pruning) certain branches: by converting certain non-leaf nodes to leaves.
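
As a concrete illustration of the first two knobs, a brief sketch using scikit-learn (the library, the synthetic data, and the specific hyperparameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a noisy dataset
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Fully grown tree vs. a tree with a depth cap and a minimum leaf size
for params in [{}, {"max_depth": 10, "min_samples_leaf": 5}]:
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train)
    print(params,
          "train acc:", round(tree.score(X_train, y_train), 3),
          "valid acc:", round(tree.score(X_valid, y_valid), 3))
```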
Overfitting

• The effect of varying the minimum number of examples per leaf:
Pruning

• For pruning, a common solution to select the branches to remove is to use a validation dataset.
• If removing a branch improves the quality of the model on the validation dataset, then the branch is removed.

Using 20% of the dataset to prune the decision tree
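
One concrete way to implement this validation-based pruning is cost-complexity pruning in scikit-learn: grow the full tree on 80% of the data, enumerate candidate pruned subtrees via ccp_alpha, and keep the one that scores best on the held-out 20%. This is a sketch of that substitute procedure on assumed synthetic data, not the exact method used for the figures above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
# Hold out 20% of the dataset as the pruning/validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate pruning strengths from the full tree's cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)     # keep the pruning level that validates best
    if score > best_score:
        best_alpha, best_score = alpha, score

print("chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```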
