
Unit-3 ML


Department of Artificial Intelligence & Data Science

UNIT-3
Ensemble Learning and Random Forests: Introduction, Voting Classifiers, Bagging and Pasting,
Random Forests, Boosting, Stacking.
Support Vector Machine: Linear SVM Classification, Nonlinear SVM Classifications Regression,
Naïve Bayes Classifiers.
1. Ensemble Learning and Random Forests:
 A group of predictors is called an Ensemble, thus, this technique is called Ensemble
Learning, and an Ensemble Learning algorithm is called an Ensemble method.
 Ensemble learning is one of the most powerful machine learning techniques; it
uses the combined output of two or more models/weak learners to solve a
particular computational intelligence problem.
E.g., a Random Forest algorithm is an ensemble of various decision trees
combined.
 Ensemble learning refers to algorithms that combine the predictions from two or
more models.

 Ensemble methods combine different decision trees to deliver better predictive results
than utilizing a single decision tree.
 The primary principle behind the ensemble model is that a group of weak learners come
together to form a strong learner.

There are two techniques, given below, that are used to build ensembles of decision trees.

i) Bagging and Pasting

ii) Boosting

i) Bagging and Pasting:

 Bagging is used when our objective is to reduce the variance of a decision
tree.
 Here the concept is to create a few subsets of data from the training sample,
which are chosen randomly with replacement.
 Each subset of data is then used to train its own decision tree; thus, we end
up with an ensemble of various models.
 The average of the predictions from the numerous trees is used, which is
more powerful than a single decision tree.
 Bootstrap Aggregating, also known as bagging, is a machine learning
ensemble meta-algorithm designed to improve the stability and
accuracy of machine learning algorithms used in statistical classification
and regression.
 It decreases the variance and helps to avoid overfitting. It is usually
applied to decision tree methods. Bagging is a special case of the model
averaging approach.
 Pasting follows the same procedure as bagging, except that the subsets are
sampled without replacement.

Description of the Technique:


 Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is
selected from D via row sampling with replacement (i.e., the same tuple can
appear more than once), which is the bootstrap sample.
 Then a classifier model Mi is learned for each training set Di. Each classifier
Mi returns its class prediction. The bagged classifier M* counts the votes and
assigns the class with the most votes to X (the unknown sample).
Implementation Steps of Bagging
 Step 1: Multiple subsets are created from the original data set with equal
tuples, selecting observations with replacement.
 Step 2: A base model is created on each of these subsets.
 Step 3: Each model is learned in parallel with each training set and
independent of each other.
 Step 4: The final predictions are determined by combining the predictions
from all the models.
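A minimal bagging sketch with scikit-learn is shown below; the dataset, base model, and parameter values are illustrative choices, not fixed by the steps above.

# Bagging sketch (illustrative): bootstrap subsets, one tree per subset, majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 1-3: 50 bootstrap subsets, a decision tree trained independently on each.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,   # number of subsets / base models
    max_samples=1.0,   # each subset has as many rows as the training set
    bootstrap=True,    # sample with replacement (bagging); False would give pasting
    random_state=42,
)
bag.fit(X_train, y_train)

# Step 4: the final prediction combines (votes over) the predictions of all trees.
y_pred = bag.predict(X_test)
print("Bagging accuracy:", accuracy_score(y_test, y_pred))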

Random Forest is an expansion over bagging. It takes one additional step: besides taking a
random subset of the data, it also makes a random selection of features, rather than using
all features, to develop each tree. When we have numerous such random trees, the result is
called a Random Forest.

These are the following steps which are taken to implement a Random forest:

o Let us consider X observations and Y features in the training data set. First, a
sample is taken randomly with replacement from the training data set.
o A tree is grown on this sample to the largest extent possible.
o The above steps are repeated, and the prediction is given based on the
collection of predictions from n number of trees.
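A hedged scikit-learn sketch of these steps is given below; the dataset and hyperparameter values are only illustrative.

# Random Forest sketch (illustrative): bagging plus random feature selection per split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # n trees, each grown on a bootstrap sample to the largest extent
    max_features="sqrt",  # random subset of features considered at each split
    random_state=42,
)
forest.fit(X_train, y_train)

# The final prediction aggregates the votes of all the trees.
print("Random Forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))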
Advantages of using Random Forest technique:

o It manages a higher-dimensional data set very well.

o It manages missing values and maintains accuracy for missing data.

Disadvantages of using Random Forest technique:

Since the final prediction is based on the mean of the predictions from the subset trees, it
will not give precise continuous values for the regression model.

ii) Boosting:

 Boosting is another ensemble procedure to make a collection of predictors.
 Boosting is an ensemble modeling technique that attempts to build
a strong classifier from the number of weak classifiers.
 It is done by building a model by using weak models in series.
 Firstly, a model is built from the training data.
 Then the second model is built which tries to correct the errors
present in the first model.
 This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum
number of models is added.

Boosting Algorithms
 There are several boosting algorithms.
 The original ones, proposed by Robert Schapire and Yoav
Freund, were not adaptive and could not take full advantage of the
weak learners.
 Schapire and Freund then developed AdaBoost, an adaptive boosting
algorithm that won the prestigious Gödel Prize.
 AdaBoost was the first really successful boosting algorithm developed
for the purpose of binary classification.
 AdaBoost is short for Adaptive Boosting and is a very popular boosting
technique that combines multiple “weak classifiers” into a single
“strong classifier”.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weight of the wrongly classified data points and decrease
the weights of correctly classified data points. And then normalize the
weights of all data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
[Figure: an illustration presenting the intuition behind the boosting algorithm, consisting of the
weak learners and the weighted dataset.]
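A short AdaBoost sketch following the algorithm above is given below; scikit-learn performs the weight updates internally, and the dataset and parameters are illustrative.

# AdaBoost sketch (illustrative): each new stump focuses on previously misclassified points.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak classifier: a decision stump
    n_estimators=50,                      # maximum number of weak models added in series
    learning_rate=1.0,
    random_state=42,
)
ada.fit(X_train, y_train)  # steps 1-4: reweight misclassified points, add models until done
print("AdaBoost accuracy:", accuracy_score(y_test, ada.predict(X_test)))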

Gradient Boosting is an expansion of the boosting procedure.

1. Gradient Boosting = Gradient Descent + Boosting

It utilizes a gradient descent algorithm that can optimize any differentiable loss function.
An ensemble of trees is constructed one at a time, and the individual trees are summed
successively. Each new tree tries to reduce the remaining loss (the difference between the
actual and predicted values) of the ensemble built so far.
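The following is a brief gradient boosting sketch for regression; the dataset and settings are illustrative, and different loss functions can be selected through the library's loss parameter.

# Gradient Boosting sketch (illustrative): trees are added sequentially to reduce the loss.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(
    n_estimators=200,    # number of sequential trees
    learning_rate=0.05,  # shrinks each tree's contribution
    max_depth=3,
    random_state=42,
)
gbr.fit(X_train, y_train)  # each tree is fitted to the residual errors of the ensemble so far
print("Test MSE:", mean_squared_error(y_test, gbr.predict(X_test)))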

Advantages of using Gradient Boosting methods:

o It supports different loss functions.


o It works well with interactions.

Disadvantages of using Gradient Boosting methods:

o It requires cautious tuning of different hyper-parameters.

Difference between Bagging and Boosting:

o Data sampling: In bagging, various training data subsets are randomly drawn with
replacement from the whole training dataset. In boosting, each new subset contains the
elements that were misclassified by previous models.
o Aim: Bagging attempts to tackle the over-fitting issue; boosting tries to reduce bias.
o When to apply: If the classifier is unstable (high variance), we apply bagging. If the
classifier is steady and straightforward (high bias), we apply boosting.
o Model weights: In bagging, every model receives an equal weight. In boosting, models
are weighted by their performance.
o Objective: Bagging aims to decrease variance, not bias. Boosting aims to decrease bias,
not variance.
o Combination: Bagging is the easiest way of combining predictions that belong to the
same type. Boosting combines predictions that belong to different types.
o Dependence: In bagging, every model is constructed independently. In boosting, new
models are affected by the performance of the previously developed models.

Stacking: (Method1)

 Stacking is one of the popular ensemble modeling techniques in machine learning.

 Various weak learners are ensembled in a parallel manner in such a way that, by
combining them with meta learners, we can obtain better predictions for the future.

 This ensemble technique works by feeding the combined predictions of multiple weak
learners to a meta learner so that a better output prediction model can be achieved.
 In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn
how to best combine the input predictions to make a better output prediction.
 Stacking is also known as stacked generalization and is an extended form of the
Model Averaging Ensemble technique, in which the sub-models participate according to
their performance weights to build a new model with better predictions.
 This new model is stacked up on top of the others; this is the reason why it is named
stacking.

Architecture of Stacking:

 The architecture of the stacking model is designed in such a way that it consists
of two or more base (learner) models and a meta-model that combines the
predictions of the base models.
 These base models are called level 0 models, and the meta-model is known as the
level 1 model.
 So, the Stacking ensemble method includes original (training) data, primary
level models, primary level prediction, secondary level model, and final
prediction.
 The basic architecture of stacking can be represented as described below.

o Original data: This data is divided into n-folds and is also considered test data or
training data.
o Base models: These models are also referred to as level-0 models. These models use
training data and provide compiled predictions (level-0) as an output.
o Level-0 Predictions: Each base model is trained on some training data and provides
different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-model, which
helps to best combine the predictions of the base models. The meta-model is also known
as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the predictions of the
base models and is trained on different predictions made by individual base models, i.e.,
data not used to train the base models are fed to the meta-model, predictions are made,
and these predictions, along with the expected outputs, provide the input and output
pairs of the training dataset used to fit the meta-model.

………………………………………………………………………………………………………………..

 Stacking (Method2) is a way of ensembling classification or regression
models. It consists of two layers of estimators.

 The first layer consists of all the baseline models that are used to predict
the outputs on the test datasets.

 The second layer consists of a Meta-Classifier or Regressor which takes
all the predictions of the baseline models as an input and generates new
predictions.
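A compact two-layer stacking sketch with scikit-learn's StackingClassifier follows; the base models and meta-learner are chosen only for illustration.

# Stacking sketch (illustrative): level-0 base models feed a level-1 meta-classifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# First layer: baseline (level-0) models.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),
]

# Second layer: the meta-classifier (level-1) is trained on the base models' predictions,
# which StackingClassifier generates internally using cross-validation.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=200), cv=5)
stack.fit(X_train, y_train)
print("Stacking accuracy:", accuracy_score(y_test, stack.predict(X_test)))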
……………………………………………………………………………………………………………………

Voting Classifiers:

 A Voting Classifier is a machine learning model that trains on an ensemble of
numerous models and predicts an output (class) based on the class with the
highest probability of being chosen as the output.
 It simply aggregates the findings of each classifier passed into the Voting Classifier
and predicts the output class based on the majority of the votes.
 The idea is that instead of creating separate dedicated models and finding the
accuracy for each of them, we create a single model which trains these models
and predicts output based on their combined majority of votes for each output
class.

Voting Classifier supports two types of voting.


1. Hard Voting:
o In hard voting, the predicted output class is the class with the highest
majority of votes, i.e., the class that was predicted most often by the
individual classifiers.
o Suppose three classifiers predicted the output class (A, A, B), so
here the majority predicted A as output.
o Hence A will be the final prediction.
2. Soft Voting:
o In soft voting, the output class is the prediction based on the
average of probability given to that class.
o Suppose given some input to three models, the prediction
probability for class A = (0.30, 0.47, 0.53) and B = (0.20, 0.32, 0.40).
o So the average for class A is 0.4333 and B is 0.3067, the winner is
clearly class A because it had the highest probability averaged by
each classifier.
Python code to implement Voting Classifier

# importing libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# loading iris dataset


iris = load_iris()
X = iris.data[:, :4]
Y = iris.target

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size =
0.20,random_state = 42)

# group / ensemble of models


estimator = []
estimator.append(('LR',LogisticRegression(solver ='lbfgs',multi_class
='multinomial',max_iter = 200)))
estimator.append(('SVC', SVC(gamma ='auto', probability = True)))
estimator.append(('DTC', DecisionTreeClassifier()))

# Voting Classifier with hard voting


vot_hard = VotingClassifier(estimators = estimator, voting ='hard')
vot_hard.fit(X_train, y_train)
y_pred = vot_hard.predict(X_test)

# using accuracy_score metric to predict accuracy


score = accuracy_score(y_test, y_pred)
print("Hard Voting Score % d" % score)
# Voting Classifier with soft voting
vot_soft = VotingClassifier(estimators = estimator, voting ='soft')
vot_soft.fit(X_train, y_train)
y_pred = vot_soft.predict(X_test)

# using accuracy_score
score = accuracy_score(y_test, y_pred)
print("Soft Voting Score % d" % score)
Output:
Hard Voting Score 1
Soft Voting Score 1
……………………………………………………………………………………………………………….
Support Vector Machine: Linear SVM Classification, Nonlinear SVM Classifications Regression,
Naïve Bayes Classifiers

2. Support Vector Machine:


 Support vector machines (SVMs) are powerful yet flexible supervised machine
learning algorithms which are used both for classification and regression.
 But generally, they are used in classification problems.
 SVMs were first introduced in the 1960s but were later refined in the 1990s.
 SVMs have their unique way of implementation as compared to other machine
learning algorithms.
 Lately, they are extremely popular because of their ability to handle multiple
continuous and categorical variables.

Working of SVM

 An SVM model is basically a representation of different classes in a
hyperplane in multidimensional space.
 The hyperplane will be generated in an iterative manner by SVM so that
the error can be minimized.
 The goal of SVM is to divide the datasets into classes to find a maximum
marginal hyperplane (MMH).

The following are important concepts in SVM −

 Support Vectors –
Data points that are closest to the hyperplane are called support vectors.
The separating line will be defined with the help of these data points.
 Hyperplane –
It is the decision plane or space that divides a set of objects belonging
to different classes.
 Margin –
It may be defined as the gap between the two lines on the closest data points of
different classes.
It can be calculated as the perpendicular distance from the line to the
support vectors.

A large margin is considered a good margin and a small margin is considered
a bad margin.
The main goal of SVM is to divide the datasets into classes to find a maximum
marginal hyperplane (MMH), and this can be done in the following two steps −
 First, SVM will generate hyperplanes iteratively that segregate the classes in the
best way.
 Then, it will choose the hyperplane that separates the classes correctly.

Implementing SVM in Python:

SVM Kernels:

 In practice, the SVM algorithm is implemented with a kernel that transforms the input
data space into the required form.
 SVM uses a technique called the kernel trick in which kernel takes a low
dimensional input space and transforms it into a higher dimensional space.
 In simple words, kernel converts non-separable problems into separable
problems by adding more dimensions to it.
 It makes SVM more powerful, flexible and accurate.

 The following are some of the types of kernels used by SVM.

Linear Kernel
It can be used as a dot product between any two observations. The formula of the
linear kernel is as below −
K(x, xi) = sum(x * xi)

From the above formula, we can see that the product between two vectors x
and xi is the sum of the multiplication of each pair of input values.
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input spaces. Following is the formula for the polynomial kernel −
K(x, xi) = (1 + sum(x * xi))^d

Here d is the degree of the polynomial, which we need to specify manually in the
learning algorithm.
Radial Basis Function (RBF) Kernel
The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-
dimensional space. The following formula explains it mathematically −
K(x, xi) = exp(-gamma * sum((x - xi)^2))

Here, gamma ranges from 0 to 1. We need to manually specify it in the learning
algorithm. A good default value of gamma is 0.1.
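The three kernel formulas above can be written directly as plain functions; the sketch below is only to make the notation concrete, and the values of d and gamma are arbitrary.

# Plain NumPy versions of the kernel formulas above (illustrative only).
import numpy as np

def linear_kernel(x, xi):
    # K(x, xi) = sum(x * xi), i.e. the dot product of the two observation vectors
    return np.sum(x * xi)

def polynomial_kernel(x, xi, d=3):
    # K(x, xi) = (1 + sum(x * xi))^d, with polynomial degree d
    return (1 + np.sum(x * xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    # K(x, xi) = exp(-gamma * sum((x - xi)^2))
    return np.exp(-gamma * np.sum((x - xi) ** 2))

x = np.array([1.0, 2.0])
xi = np.array([2.0, 0.5])
print(linear_kernel(x, xi), polynomial_kernel(x, xi), rbf_kernel(x, xi))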
As we implemented SVM for linearly separable data, we can implement it in
Python for the data that is not linearly separable. It can be done by using kernels.

Example:
The following is an example for creating an SVM classifier by using kernels. We
will be using iris dataset from scikit-learn −
We will start by importing following packages −

import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt
Now, we need to load the input data −
iris = datasets.load_iris()
From this dataset, we are taking first two features as follows −
X = iris.data[:, :2]
y = iris.target
Next, we will plot the SVM boundaries with original data as follows −

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100   # step size of the mesh grid
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]

Now, we need to provide the value of regularization parameter as follows −


C = 1.0

Next, SVM classifier object can be created as follows −


svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)

Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with linear kernel')
Output
Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')
For creating SVM classifier with rbf kernel, we can change the kernel to rbf as
follows −

svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)


Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap = plt.cm.tab10, alpha = 0.3)
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with rbf kernel')
Output
Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')

We set the value of gamma to 'auto', but you can also provide a value between 0 and 1.

Pros and Cons of SVM Classifiers

 Pros of SVM classifiers
SVM classifiers offer great accuracy and work well with high-dimensional
spaces. SVM classifiers basically use a subset of the training points and
therefore use very little memory.
 Cons of SVM classifiers
They have a high training time, so in practice they are not suitable for large
datasets. Another disadvantage is that SVM classifiers do not work well
with overlapping classes.

1.1 Support Vector Machine Algorithm:

 Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
 However, primarily, it is used for Classification problems in Machine Learning.
 The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future.
 This best decision boundary is called a hyperplane.
 SVM chooses the extreme points/vectors that help in creating the hyperplane.
 These extreme cases are called support vectors, and hence the algorithm is termed a
Support Vector Machine.
 Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:

Example:

 SVM can be understood with the example that we have used in the KNN classifier.
 Suppose we see a strange cat that also has some features of dogs; if we want a
model that can accurately identify whether it is a cat or a dog, such a model can be
created by using the SVM algorithm.
 We will first train our model with lots of images of cats and dogs so that it can learn
about different features of cats and dogs, and then we test it with this strange
creature.
 As the SVM creates a decision boundary between these two classes (cat and
dog) and chooses the extreme cases (support vectors), it will see the extreme cases of
cats and dogs.
 On the basis of the support vectors, it will classify it as a cat. Consider the below
diagram
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM:

 Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight
line, then such data is termed linearly separable data, and the
classifier used is called a Linear SVM classifier.

o Non-linear SVM:

 Non-Linear SVM is used for non-linearly separable data, which
means if a dataset cannot be classified by using a straight line, then
such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes
in n-dimensional space, but we need to find out the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which
means if there are 2 features (as shown in the image), then the hyperplane will be a straight
line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support
the hyperplane, they are called support vectors.

How does SVM work?

1. Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either
green or blue. Consider the below image:
As it is a 2-D space, by just using a straight line we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the
below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both the classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin. And the goal of SVM is to
maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.
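As a small illustration of these ideas, a fitted linear SVM exposes the support vectors it selected; the sketch below is hedged, using only two iris classes and two features so that the data is (roughly) linearly separable.

# Inspecting the support vectors and margin of a linear SVM (illustrative example).
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
mask = iris.target != 2                              # keep two classes only
X, y = iris.data[mask][:, :2], iris.target[mask]     # two features: sepal length and width

clf = SVC(kernel='linear', C=1.0).fit(X, y)

# The points closest to the decision boundary define the maximum marginal hyperplane.
print("Support vectors per class:", clf.n_support_)
print("Support vectors:\n", clf.support_vectors_)

# For a linear SVM, the margin width is 2 / ||w||, where w is the weight vector.
w = clf.coef_[0]
print("Margin width:", 2 / np.linalg.norm(w))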

2. Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the
below image:

Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we
convert it to 2-D space with z = 1, then it will become:

Hence we get a circle of radius 1 in the case of non-linear data.
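A small NumPy sketch of this trick is given below; the circular data is synthetic and purely illustrative, and the last line shows that an RBF kernel achieves the same separation without adding the dimension explicitly.

# Lifting 2-D circular data into 3-D with z = x^2 + y^2 (illustrative, synthetic data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
inner = rng.normal(scale=0.5, size=(50, 2))             # class 0: points near the origin
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)]   # class 1: a ring of radius 2
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# The third dimension z = x^2 + y^2 makes the two classes separable by a plane.
z = (X ** 2).sum(axis=1)
X3 = np.c_[X, z]

clf = SVC(kernel='linear').fit(X3, y)
print("Training accuracy after adding z:", clf.score(X3, y))

# An RBF kernel achieves the same effect implicitly on the original 2-D data.
print("RBF kernel on the raw 2-D data:", SVC(kernel='rbf', gamma='auto').fit(X, y).score(X, y))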

3. Naïve Bayes Classifier Algorithm:


o Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, and it helps in building fast machine learning models that can
make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a fruit is
identified on the basis of color, shape, and taste, then a red, spherical, and sweet
fruit is recognized as an apple. Hence each feature individually contributes to
identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = ( P(B|A) * P(A) ) / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the
observed event B.

P(B|A) is the Likelihood probability: the probability of the evidence B given that
hypothesis A is true.

P(A) is the Prior probability: the probability of the hypothesis before observing the
evidence.

P(B) is the Marginal probability: the probability of the evidence.
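A minimal Naïve Bayes classification sketch with scikit-learn is shown below; GaussianNB and the iris dataset are illustrative choices.

# Naive Bayes sketch (illustrative): GaussianNB applies Bayes' theorem with the
# assumption that the features are independent of each other.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = GaussianNB()
nb.fit(X_train, y_train)        # estimates P(class) and per-feature P(feature | class)

y_pred = nb.predict(X_test)     # picks the class with the highest posterior probability
print("Naive Bayes accuracy:", accuracy_score(y_test, y_pred))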

Important Questions
1. What is Ensemble learning? Explain it.
2. Explain about the Ensemble learning and Random Forest.
3. Difference between the bagging and boosting?

4. What is stacking? Explain it.


5. Explain about the Random Forest Algorithms?
6. Briefly explain about the SVM?
7. What is the fundamental idea behind Support Vector Machines?
8. Why is it important to scale the inputs when using SVMs?
9. Explain about the Naïve Bayes theorem.
