UNIT-3
Ensemble Learning and Random Forests: Introduction, Voting Classifiers, Bagging and Pasting,
Random Forests, Boosting, Stacking.
Support Vector Machine: Linear SVM Classification, Nonlinear SVM Classification, SVM Regression,
Naïve Bayes Classifiers.
1. Ensemble Learning and Random Forests:
A group of predictors is called an ensemble; thus, this technique is called Ensemble
Learning, and an Ensemble Learning algorithm is called an Ensemble method.
Ensemble learning is one of the most powerful machine learning techniques: it
combines the outputs of two or more models (weak learners) to solve a
particular computational intelligence problem.
E.g., the Random Forest algorithm is an ensemble of many decision trees
combined.
Ensemble learning refers to algorithms that combine the predictions from two or
more models.
Ensemble methods combine several decision trees to deliver better predictive results
than utilizing a single decision tree.
The primary principle behind the ensemble model is that a group of weak learners comes
together to form a strong learner.
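To see why many weak learners can form a strong learner, consider this quick simulation (the 60% learner accuracy and the ensemble size are arbitrary illustrative values, not from the notes above):

import numpy as np

rng = np.random.default_rng(42)
n_learners, n_samples, p = 25, 10000, 0.6

# each weak learner is correct with probability p, independently of the others
votes = rng.random((n_learners, n_samples)) < p

# the ensemble is correct whenever a majority of the learners is correct
majority_correct = votes.sum(axis=0) > n_learners / 2

print("single weak learner accuracy: %.2f" % p)
print("majority-vote accuracy:       %.2f" % majority_correct.mean())

With 25 independent 60%-accurate voters, the majority vote is right roughly 85% of the time; real learners are correlated, so the gain is smaller, but the principle is the same.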
There are two techniques, given below, that are used to build an ensemble of decision trees:
i) Bagging
ii) Boosting
i) Bagging:
In bagging (bootstrap aggregating), each tree is trained on a random subset of the training
data drawn with replacement, and the trees' predictions are then aggregated by voting or
averaging.
Random Forest is an extension of bagging. It takes one additional step: besides training
each tree on a random subset of the data, it also makes a random selection of features at
each split rather than using all features to grow the trees. When we have numerous such
random trees, the ensemble is called a Random Forest.
These are the steps taken to implement a Random Forest:
1. Draw random subsets of the training data with replacement (bootstrap samples).
2. Grow a decision tree for each subset, considering only a random selection of features at each split.
3. Collect the prediction of every individual tree.
4. Take the majority vote (classification) or the mean (regression) of these predictions as the final output.
Since the final prediction depends on the mean of the predictions from the subset trees, it
will not give a precise continuous value for the regression model.
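A minimal sketch of these steps with scikit-learn's RandomForestClassifier (the iris dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# 100 trees, each grown on a bootstrap sample; max_features='sqrt' makes each
# split consider only a random subset of features, which is what distinguishes
# a Random Forest from plain bagging
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
forest.fit(X_train, y_train)

print("Random Forest accuracy: %.2f" % accuracy_score(y_test, forest.predict(X_test)))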
ii) Boosting:
Boosting Algorithms
There are several boosting algorithms.
The original ones, proposed by Robert Schapire and Yoav Freund,
were not adaptive and could not take full advantage of the
weak learners.
Schapire and Freund then developed AdaBoost, an adaptive boosting
algorithm that won the prestigious Gödel Prize.
AdaBoost was the first really successful boosting algorithm developed
for the purpose of binary classification.
AdaBoost is short for Adaptive Boosting and is a very popular boosting
technique that combines multiple “weak classifiers” into a single
“strong classifier”.
Algorithm:
1. Initialise the dataset and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weights of the wrongly classified data points and decrease
the weights of the correctly classified data points, then normalize the
weights of all data points.
4. if (required results obtained)
Go to step 5
else
Go to step 2
5. End
[Figure: an illustration presenting the intuition behind the boosting algorithm, consisting of the learners and the weighted dataset]
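This reweighting loop is implemented by scikit-learn's AdaBoostClassifier; a minimal sketch (the iris dataset and parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# the default weak learner is a depth-1 decision tree (a "decision stump");
# 50 stumps are trained sequentially on the reweighted data and combined
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)

print("AdaBoost accuracy: %.2f" % ada.score(X_test, y_test))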
Gradient Boosting:
It utilizes a gradient descent algorithm that can optimize any differentiable loss function.
The ensemble of trees is constructed one tree at a time, and the individual trees are summed
sequentially. Each next tree tries to recover the loss (the difference between the actual
and predicted values) left by the trees built so far.
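A minimal sketch with scikit-learn's GradientBoostingRegressor (the synthetic quadratic data and parameter values are illustrative assumptions):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# a small synthetic regression problem: y = x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# each new tree is fit to the residual errors of the current ensemble
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2)
gbr.fit(X, y)

print("prediction at x=2.0:", gbr.predict([[2.0]]))  # should be close to 4.0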
Bagging | Boosting
Various training data subsets are randomly drawn with replacement from the whole training dataset. | Each new subset contains the components that were misclassified by previous models.
If the classifier is unstable (high variance), then we need to apply bagging. | If the classifier is steady and straightforward (high bias), then we need to apply boosting.
Every model receives an equal weight. | Models are weighted by their performance.
The objective is to decrease variance, not bias. | The objective is to decrease bias, not variance.
Every model is constructed independently. | New models are affected by the performance of the previously developed models.
Stacking:
This ensemble technique works by feeding the combined predictions of multiple weak
learners to a meta-learner, so that a better output prediction model can be achieved.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn
how to best combine the input predictions to make a better output prediction.
Stacking is also known as stacked generalization. It is an extended form of the
Model Averaging Ensemble technique: rather than all sub-models participating equally,
their predictions are weighted by their performance to build a new model with better
predictions.
This new model is stacked on top of the others; this is the reason why it is named
stacking.
Architecture of Stacking:
The architecture of the stacking model is designed in such a way that it consists
of two or more base/learner models and a meta-model that combines the
predictions of the base models.
These base models are called level 0 models, and the meta-model is known as the
level 1 model.
So, the Stacking ensemble method includes original (training) data, primary
level models, primary level prediction, secondary level model, and final
prediction.
The basic architecture of stacking can be represented as shown in the image below.
o Original data: This data is divided into n folds and serves as the training data and
test data for the base models.
o Base models: These models are also referred to as level-0 models. These models use
training data and provide compiled predictions (level-0) as an output.
o Level-0 Predictions: Each base model is trained on part of the training data and provides
different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-model, which
helps to best combine the predictions of the base models. The meta-model is also known
as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the predictions of the
base models. It is trained on predictions made by the individual base models on data not
used to train them: these predictions, along with the expected outputs, provide the
input-output pairs of the training dataset used to fit the meta-model.
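A minimal sketch of this architecture with scikit-learn's StackingClassifier (the choice of base models and the logistic-regression meta-model are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# level-0 (base) models
base_models = [('dt', DecisionTreeClassifier()), ('svc', SVC())]

# the level-1 (meta) model learns how to combine the level-0 predictions;
# cross-validated predictions (cv=5) are used so the meta-model never sees
# predictions made on data the base models were trained on
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000), cv=5)
stack.fit(X_train, y_train)

print("Stacking accuracy: %.2f" % stack.score(X_test, y_test))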
………………………………………………………………………………………………………………..
The first layer consists of all the baseline models that are used to predict
the outputs on the test datasets.
Voting Classifiers:
A voting classifier combines several different classifiers and predicts the class with the
most votes (hard voting) or the highest average predicted probability (soft voting), as in
the example below.
# importing libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# loading the iris dataset
X, y = load_iris(return_X_y=True)
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# group of base estimators (probability=True lets SVC take part in soft voting)
estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('svc', SVC(probability=True)),
              ('dt', DecisionTreeClassifier())]
# hard voting: majority vote over the predicted class labels
hard_clf = VotingClassifier(estimators=estimators, voting='hard').fit(X_train, y_train)
print("Hard Voting Score %d" % accuracy_score(y_test, hard_clf.predict(X_test)))
# soft voting: average of the predicted class probabilities
soft_clf = VotingClassifier(estimators=estimators, voting='soft').fit(X_train, y_train)
# using accuracy_score
score = accuracy_score(y_test, soft_clf.predict(X_test))
print("Soft Voting Score %d" % score)
Output:
Hard Voting Score 1
Soft Voting Score 1
……………………………………………………………………………………………………………….
Support Vector Machine: Linear SVM Classification, Nonlinear SVM Classification, SVM Regression,
Naïve Bayes Classifiers
Working of SVM
SVM Kernels:
Linear Kernel
It can be used as a dot product between any two observations. The formula for the linear
kernel is as follows −
K(x, xi) = sum(x * xi)
From the above formula, we can see that the product between two vectors, say x
and xi, is the sum of the multiplication of each pair of input values.
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input spaces. Following is the formula for the polynomial kernel −
K(x, xi) = (1 + sum(x * xi))^d
Here, d is the degree of the polynomial, which we need to specify manually.
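As a quick worked check of this formula (the vectors and degree below are arbitrary illustrative values):

import numpy as np

x = np.array([1.0, 2.0])
xi = np.array([3.0, 0.5])
d = 2  # degree of the polynomial

linear = np.dot(x, xi)            # sum of pairwise products: 1*3 + 2*0.5 = 4.0
poly = (1 + np.dot(x, xi)) ** d   # (1 + 4.0)^2 = 25.0

print("linear kernel:", linear)
print("polynomial kernel (d=2):", poly)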
Example:
The following is an example for creating an SVM classifier by using kernels. We
will be using iris dataset from scikit-learn −
We will start by importing following packages −
import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt
Now, we need to load the input data −
iris = datasets.load_iris()
From this dataset, we are taking first two features as follows −
X = iris.data[:, :2]
y = iris.target
Next, we will create a linear-kernel SVM classifier, build a mesh grid of points, and plot
the SVM boundaries with the original data as follows −
C = 1.0  # SVM regularization parameter
svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)
# create a mesh grid covering the feature space
h = 0.02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]
# predict the class of every point in the mesh
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with linear kernel')
Output
Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')
For creating an SVM classifier with the RBF kernel, we can change the kernel to rbf as
follows −
svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
We set the value of gamma to 'auto', but you can also provide a value between 0 and 1.
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future.
This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed a
Support Vector Machine.
Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:
Example:
SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a
model that can accurately identify whether it is a cat or a dog, such a model can be
created by using the SVM algorithm.
We will first train our model with lots of images of cats and dogs so that it can learn
about different features of cats and dogs, and then we test it with this strange
creature.
The SVM creates a decision boundary between these two classes (cat and dog) and
chooses the extreme cases (support vectors) of cats and dogs.
On the basis of the support vectors, it will classify the creature as a cat. Consider the
below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be classified
into two classes using a single straight line, such data is termed linearly separable, and
the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data; if a dataset
cannot be classified using a straight line, such data is termed non-linear, and the
classifier used is called a Non-linear SVM classifier.
Hyperplane:
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), the hyperplane will be a straight line,
and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum
distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and affect its
position are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
1. Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features, x1
and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either
green or blue. Consider the below image:
As it is a 2-D space, by just using a straight line we can easily separate these two
classes. But there can be multiple lines that separate these classes. Consider the
below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
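A minimal sketch of this with scikit-learn (the synthetic two-blob dataset and the large C value are illustrative assumptions): the fitted model exposes exactly the support vectors described above.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated clusters of points (the green/blue tags in the discussion)
X, y = make_blobs(n_samples=60, centers=2, random_state=6)

svm_clf = SVC(kernel='linear', C=1000)  # a large C gives hard-margin-like behaviour
svm_clf.fit(X, y)

# the points closest to the hyperplane that define the margin
print("support vectors:\n", svm_clf.support_vectors_)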
2. Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we
convert it back to 2-D space with z = 1, it becomes a circumference of radius 1.
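This mapping to a higher dimension is what kernels do implicitly. As a minimal sketch (the concentric-circles dataset is an illustrative assumption), an RBF-kernel SVM separates data that no straight line can:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: not separable by any straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel='linear').fit(X, y)
rbf_clf = SVC(kernel='rbf', gamma='auto').fit(X, y)

print("linear kernel accuracy: %.2f" % linear_clf.score(X, y))  # poor
print("RBF kernel accuracy:    %.2f" % rbf_clf.score(X, y))     # near 1.0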
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = ( P(B|A) * P(A) ) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
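A minimal sketch applying this theorem with scikit-learn's GaussianNB (the iris dataset is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# GaussianNB models P(feature | class) as a Gaussian and applies Bayes' theorem,
# assuming the features are independent given the class (the "naïve" assumption)
nb = GaussianNB()
nb.fit(X_train, y_train)

print("Naive Bayes accuracy: %.2f" % nb.score(X_test, y_test))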
Important Questions
1. What is Ensemble Learning? Explain it.
2. Explain Ensemble Learning and Random Forests.
3. What is the difference between bagging and boosting?