Ensemble Learning and Random Forests

 If you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor.

 A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

 For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you just obtain the predictions of all the individual trees, then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a Random Forest.
Voting Classifiers
 A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier.
 If all classifiers are able to estimate class probabilities (i.e.,
they have a predict_proba() method), then you can tell Scikit-
Learn to predict the class with the highest class probability,
averaged over all the individual classifiers.

 This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.

 All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities.
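
A minimal sketch of a voting classifier along these lines (the moons dataset and the three base classifiers are illustrative choices, not prescribed by the slides):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy dataset standing in for the training set used on the slides
    X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    voting_clf = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression()),
            ("rf", RandomForestClassifier()),
            ("svc", SVC(probability=True)),  # probability=True enables predict_proba()
        ],
        voting="soft",  # use "hard" for plain majority voting
    )
    voting_clf.fit(X_train, y_train)
    print(voting_clf.score(X_test, y_test))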
Bagging and Pasting

 The approach of using the same training algorithm for every predictor, but training each one on a different random subset of the training set, where the sampling is performed with replacement, is called Bagging.
 If the sampling is performed without replacement it is called
Pasting.
Bagging and Pasting in Scikit-Learn
 Scikit-Learn offers a simple API for both bagging and pasting
with the BaggingClassifier class (or BaggingRegressor for
regression).

 The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging; if you want to use pasting instead, just set bootstrap=False).

 The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores).
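
A sketch of the code described above, reusing X_train, y_train, and X_test from the voting example (the variable names are assumptions, not from the slides):

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    bag_clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=500,   # 500 trees
        max_samples=100,    # each tree sees 100 sampled instances
        bootstrap=True,     # sample with replacement (bagging); False = pasting
        n_jobs=-1,          # use all available CPU cores
    )
    bag_clf.fit(X_train, y_train)
    y_pred = bag_clf.predict(X_test)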
Out-of-Bag Evaluation
 With bagging, some instances may be sampled several times
for any given predictor, while others may not be sampled at
all.

 By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set.

 This means that only about 63% of the training instances are
sampled on average for each predictor. The remaining 37% of
the training instances that are not sampled are called out-of-
bag (oob) instances.
 If you set oob_score=True when creating a BaggingClassifier, an automatic oob evaluation is performed after training and the resulting score is available through the oob_score_ variable.
 The oob decision function for each training instance is also available through the oob_decision_function_ variable. In this case the decision function returns the class probabilities for each training instance.

 For example, the oob evaluation estimates that the second training instance has a 60.6% probability of belonging to the positive class (and a 39.4% probability of belonging to the negative class).
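
A sketch of oob evaluation, continuing from the bagging example above:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    bag_clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=500,
        bootstrap=True,
        oob_score=True,   # evaluate each predictor on its out-of-bag instances
        n_jobs=-1,
    )
    bag_clf.fit(X_train, y_train)
    print(bag_clf.oob_score_)                  # oob accuracy estimate
    print(bag_clf.oob_decision_function_[:3])  # oob class probabilities per instance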
Random Patches and Random Subspaces

 The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features.
 They work the same way as max_samples and bootstrap, but for
feature sampling instead of instance sampling.
 Thus, each predictor will be trained on a random subset of the
input features.
 This is particularly useful when you are dealing with high-
dimensional inputs (such as images).
 Sampling both training instances and features is called the
Random Patches method. Keeping all training instances (i.e.,
bootstrap=False and max_samples=1.0) but sampling features
(i.e.,bootstrap_features=True and/or max_features smaller than
1.0) is called the Random Subspaces method.
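
A sketch of the Random Subspaces setup described above, reusing X_train and y_train from the earlier examples (max_features=0.5 is an illustrative value):

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    subspace_clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=500,
        bootstrap=False, max_samples=1.0,           # keep all training instances
        bootstrap_features=True, max_features=0.5,  # sample the features
        n_jobs=-1,
    )
    subspace_clf.fit(X_train, y_train)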
Random Forests

 A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting).

 Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees.

 The following code trains a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes), using all available CPU cores.
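
A sketch of that code, reusing X_train, y_train, and X_test from the earlier examples:

    from sklearn.ensemble import RandomForestClassifier

    rnd_clf = RandomForestClassifier(
        n_estimators=500,    # 500 trees
        max_leaf_nodes=16,   # each tree limited to 16 leaf nodes
        n_jobs=-1,           # use all available CPU cores
    )
    rnd_clf.fit(X_train, y_train)
    y_pred_rf = rnd_clf.predict(X_test)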
Dimensionality Reduction

 Many Machine Learning problems involve thousands or even millions of features for each training instance, which makes training extremely slow and can also make it much harder to find a good solution.

 This problem is often referred to as the curse of dimensionality.

 For example, consider the MNIST images: the pixels on the image borders are almost always white, so you could completely drop these pixels from the training set without losing much information.
Main Approaches for Dimensionality Reduction
Projection:
 Consider a 3D dataset in which all training instances lie close to a plane: this plane is a lower-dimensional (2D) subspace of the high-dimensional (3D) space.

 Now if we project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown in Figure 8-3.

 We have just reduced the dataset's dimensionality from 3D to 2D. Note that the axes correspond to new features z1 and z2 (the coordinates of the projections on the plane).
 However, projection is not always the best approach to
dimensionality reduction. In many cases the subspace may
twist and turn, such as in the famous Swiss roll toy dataset
represented in Figure 8-4.
Manifold Learning
PCA
 Principal Component Analysis (PCA) is by far the most popular
dimensionality reduction algorithm. First it identifies the hyperplane that
lies closest to the data, and then it projects the data onto it.

 Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane.

 For example, a simple 2D dataset is represented on the left of Figure 8-7, along with three different axes (i.e., one-dimensional hyperplanes). On the right is the result of the projection of the dataset onto each of these axes.

 As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance, and the projection onto the dashed line preserves an intermediate amount of variance.
Principal Components
 PCA identifies the axis that accounts for the largest amount of
variance in the training set.

 In Figure 8-7, it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance.

 In this 2D example there is no choice: it is the dotted line. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on, as many axes as the number of dimensions in the dataset.
 The unit vector that defines the ith axis is called the ith principal component (PC). In Figure 8-7, the 1st PC is c1 and the 2nd PC is c2.

 There is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the matrix product of three matrices, X = U Σ Vᵀ, where Vᵀ contains all the principal components that we are looking for, as shown in Equation 8-1.
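
A sketch of obtaining the principal components via NumPy's SVD (the small random dataset is a stand-in; PCA assumes the data is centered first):

    import numpy as np

    # Stand-in 3D dataset of 60 instances
    rng = np.random.default_rng(42)
    X = rng.normal(size=(60, 3))

    X_centered = X - X.mean(axis=0)       # center the data
    U, s, Vt = np.linalg.svd(X_centered)  # rows of Vt are the principal components
    c1 = Vt[0]  # 1st principal component
    c2 = Vt[1]  # 2nd principal component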
Projecting Down to d Dimensions

 Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components.

 Selecting this hyperplane ensures that the projection will preserve as much variance as possible.

 To project the training set onto the hyperplane, you can simply
compute the dot product of the training set matrix X by the
matrix Wd, defined as the matrix containing the first d
principal components.
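
Continuing the NumPy sketch above, projecting down to d = 2 dimensions:

    W2 = Vt.T[:, :2]          # Wd for d = 2: the first two principal components
    X2D = X_centered.dot(W2)  # project the training set onto the 2D hyperplane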
Using Scikit-Learn
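
The slide's code is not reproduced here; a minimal equivalent with Scikit-Learn's PCA class (which centers the data automatically), reusing X from the NumPy sketch, might be:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    X2D = pca.fit_transform(X)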
Explained Variance Ratio

 Another very useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ variable.

 It indicates the proportion of the dataset's variance that lies along the axis of each principal component.
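
Continuing the PCA sketch above (the printed values depend on the data):

    # Proportion of the dataset's variance along each principal component
    print(pca.explained_variance_ratio_)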
Choosing the Right Number of Dimensions

 Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%).

 The following code (sketched after this slide) computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance.
 However, there is a much better option: instead of specifying
the number of principal components you want to preserve, you
can set n_components to be a float between 0.0 and 1.0,
indicating the ratio of variance you wish to preserve:
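
A sketch of both approaches, using MNIST as on the later slides (fetching the dataset and the variable names are assumptions):

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split

    mnist = fetch_openml("mnist_784", as_frame=False)
    X_train, X_test = train_test_split(mnist.data, test_size=10000, random_state=42)

    # Option 1: fit a full PCA, then find the smallest d preserving 95% of the variance
    pca = PCA()
    pca.fit(X_train)
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    d = np.argmax(cumsum >= 0.95) + 1

    # Option 2: let PCA pick d by passing the variance ratio directly
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_train)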
PCA for Compression

 Obviously, after dimensionality reduction the training set takes up much less space. For example, try applying PCA to the MNIST dataset while preserving 95% of its variance.

 You should find that each instance will have just over 150 features, instead
of the original 784 features. So while most of the variance is preserved, the
dataset is now less than 20% of its original size.

 It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. Of course this won't give you back the original data, but it will likely be quite close to the original data.

 The mean squared distance between the original data and the reconstructed
data (compressed and then decompressed) is called the reconstruction error.
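
A sketch of compressing and decompressing MNIST, continuing from the sketch above (154 is roughly the d found for 95% variance):

    from sklearn.decomposition import PCA

    pca = PCA(n_components=154)
    X_reduced = pca.fit_transform(X_train)          # compress: 784 -> 154 features
    X_recovered = pca.inverse_transform(X_reduced)  # decompress: back to 784 features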
Incremental PCA

 One problem with the implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run.
 Incremental PCA (IPCA) algorithms have been developed to
split the training set into mini-batches and feed an IPCA
algorithm one mini-batch at a time.
 This is useful for large training sets, and also to apply PCA
online.
 The following code (sketched after this slide) splits the MNIST dataset into 100 mini-batches (using NumPy's array_split() function) and feeds them to Scikit-Learn's IncrementalPCA class to reduce the dimensionality of the MNIST dataset down to 154 dimensions.
 Note that you must call the partial_fit() method with each mini-
batch rather than the fit() method with the whole training set
 Alternatively, you can use NumPy’s memmap class, which
allows you to manipulate a large array stored in a binary file on
disk as if it were entirely in memory .The class loads only the
data it needs in memory, when it needs it.

 Since the IncrementalPCA class uses only a small part of the array at any given time, the memory usage remains under control.
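
A sketch of the mini-batch approach described above, reusing the MNIST X_train from the earlier sketch:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    n_batches = 100
    inc_pca = IncrementalPCA(n_components=154)
    for X_batch in np.array_split(X_train, n_batches):
        inc_pca.partial_fit(X_batch)   # partial_fit() per mini-batch, not fit()

    X_reduced = inc_pca.transform(X_train)

For the memmap alternative mentioned above, a np.memmap array can be passed directly to IncrementalPCA's fit(), with the batch_size parameter controlling how much of it is loaded at a time.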
Randomized PCA

 Scikit-Learn offers yet another option to perform PCA, called Randomized PCA.

 It is dramatically faster than the previous algorithms when d is much smaller than n (the number of features).
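
A sketch, reusing the MNIST X_train from the earlier sketches (in Scikit-Learn this is selected via the svd_solver hyperparameter):

    from sklearn.decomposition import PCA

    rnd_pca = PCA(n_components=154, svd_solver="randomized")
    X_reduced = rnd_pca.fit_transform(X_train)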
Kernel PCA

 The kernel trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA).
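
A sketch of kPCA on a Swiss-roll dataset (the dataset, the RBF kernel, and gamma=0.04 are illustrative choices, not values prescribed by the slides):

    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import KernelPCA

    X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
    rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
    X_reduced = rbf_pca.fit_transform(X)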
