SANGAM UNIVERSITY, BHILWARA

Hands-on Lab Guide


(Lab Manual)

MACHINE LEARNING LAB


(Course Code: PEC-CS-8)
Session 2021-2022

Department of Computer Science & Engineering


SU, BHILWARA
SANGAM UNIVERSITY, BHILWARA
Department of Computer Science & Engineering
Tentative List of Practical Allotment (Machine Learning Lab PEC-CS-8)
Session-2021-22
List of Experiments

CLASS: V SEM    Department: CSE


Subject Name: Machine Learning Lab (PEC-CS-8)    Course: B. Tech.

LAB NO.   EXPERIMENT NO.   EXPERIMENT                                                      SCHEDULE DATE

1         1                How to Install Python & Anaconda-Jupyter Notebook on Windows & Linux OS?
1         2                Explain Some Machine Learning Python Libraries like Pandas, Numpy, Matplotlib, Sklearn etc.
2         3                Using Pandas & Matplotlib Libraries Know your Dataset, Use some statistic Functions on them, Plot the different Graphs.
2         4                Write & Implement Linear Regression Using Sklearn Library in Python.
3         5                Write & Implement Logistic Regression Using Sklearn Library in Python.
4         6                Write & Implement SVM Using Sklearn Library in Python.
5         7                Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.
6         8                Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print both correct and wrong predictions.
7         9                Write & Implement K-Means Clustering Using Sklearn Library in Python.
8         10               Write & Implement Apriori Algorithm in Python.
9         11               Mini Project
EXPERIMENT 1
How to Install Python & Anaconda- Jupyter Notebook on Windows & Linux OS?
Anaconda is a package manager used in scientific computing and data science. It’s designed
to provide scientific libraries and dependencies in the Python programming language.
Anaconda is commonly used for machine learning and artificial intelligence applications.

How to Install Python on Windows

1. Step 1: Select the version of Python to download and get the full installer.
2. Step 2: Run the downloaded Python executable installer.
3. Step 3: Wait for the installation process to complete.
4. Step 4: Verify the Python installation on Windows.

How to Install Anaconda Navigator on Windows

Download the Anaconda installer.

Double click the installer to launch.

Click Next.

Read the licensing terms and click “I Agree”.

Select an install for “Just Me” unless you’re installing for all users (which requires Windows
Administrator privileges) and click Next.
Select a destination folder to install Anaconda and click the Next button.

How to Install Python on Ubuntu

1. Open up your terminal by pressing Ctrl + Alt + T.


2. Update your local system's repository list by entering the following command: sudo apt-get
update.
3. Install the latest version of Python: sudo apt-get install python3 (on older releases the package may simply be named python).
4. Apt will automatically find the package and install it on your computer.

How to Install Anaconda Navigator on UBUNTU

Step 1: Update Local Package Manager

Start by updating the local package manager. Open a terminal window and enter the following:
sudo apt-get update
If your system doesn’t have curl, install it by entering:
sudo apt-get install curl

Step 2: Download the Latest Version of Anaconda

At the time this manual was written, the latest version of Anaconda was 2020.02. Check the
developer's download page to view the newest version.

Note the URL and use it to download the correct version.

Switch to the /tmp directory and use curl to download the installer using your command
terminal:
cd /tmp
curl -O https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

This version is designed for Python 3.7. If you are using Python 2.7, use the appropriate URL.

Step 3: Verify the Download Checksum

Checksum is a security tool used to verify the authenticity and integrity of a downloaded script.

Enter the following:


sha256sum Anaconda3-2020.02-Linux-x86_64.sh

Your system will display a series of letters and numbers:


69581cf739365ec7fb95608eef694ba959d7d33b36eb961953f2b82cb25bdf5a
Anaconda3-2019.07-Linux-x86_64.sh

Compare those to the appropriate checksum (or hash) in the Anaconda documentation. If you
have chosen a different version, make sure to check the documentation for that version’s
checksum.

Step 4: Run Anaconda Installation Script



The Anaconda installer is a bash script. To run the installation script, use the command:
bash Anaconda3-2020.02-Linux-x86_64.sh

A license agreement will appear. Use the Enter key to review the agreement.

At the bottom, type yes to agree to the terms.

The installer will prompt you to accept the default location, or install to a different location.
Use the default path unless you have a specific need to change it. (You may cancel the
installation here if needed.)

The installation will finish. After successful installation, the following will appear:

installation finished.
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /home/user/.bashrc ? [yes|no]

This determines if you want to use the conda command without changing the directory. Type
yes and hit enter, unless you have a specific need to do otherwise. The system will respond as
follows:
Appending source /home/user/anaconda3/bin/activate to
/home/linux4one/.bashrc
A backup will be made to: /home/user/.bashrc-anaconda3.bak
For this change to become active, you have to open a new terminal.
Thank you for installing Anaconda3!

(Optional) Step 5: Install VSCode Editor

The system will ask if you want to install Microsoft Visual Studio Code Editor. It will display
the following:

Anaconda is partnered with Microsoft! Microsoft VSCode is a


streamlined code editor with support for development operations like
debugging, task running and version control.
To install Visual Studio Code, you will need: - Administrator



Privileges - Internet connectivity
Visual Studio Code License: https://code.visualstudio.com/license
Do you wish to proceed with the installation of Microsoft VSCode?
[yes|no]

Decide by typing yes or no to continue.

Step 6: Activate and Test Installation

Once finished, activate the installation by entering:


source ~/.bashrc

Use the conda command to test the installation:


conda info

The system should display a list of data similar to:


active environment : None
user config file : /home/user/.condarc
populated config files :
conda version : 4.5.4
conda-build version : 3.10.5
python version : 3.7.0.final.0
base environment : /home/user/anaconda3 (writable)
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/pro/linux-64
https://repo.anaconda.com/pkgs/pro/noarch
package cache : /home/user/anaconda3/pkgs /home/user/.conda/pkgs
envs directories : /home/user/anaconda3/envs /home/user/.conda/envs
platform : linux-64
user-agent : conda/4.5.4 requests/2.18.4 CPython/3.6.5 Linux/4.15.0-22-generic ubuntu/18.04 glibc/2.27
UID:GID : 1000:1000
netrc file : None
offline mode : False

To update Anaconda on Ubuntu, start by updating the conda utility:


conda update conda

Next, run the Anaconda package update command:


conda update anaconda

Create and Activate Anaconda Environments

Create a Python 3 environment named test_environment by entering the following:


conda create --name test_environment python=3

Activate this environment:


conda activate test_environment

The command prompt will change. This indicates that you are now in a shell environment
using Python 3. You can now work in this environment.



EXPERIMENT 2
Explain Some Machine Learning Python Libraries like Pandas, Numpy, Matplotlib,
Sklearn etc.

NumPy

NumPy is a well known general-purpose array-processing package. An extensive collection of
complex mathematical functions makes NumPy powerful for processing large multi-dimensional
arrays and matrices. NumPy is very useful for handling linear algebra, Fourier
transforms, and random numbers. Other libraries like TensorFlow use NumPy at the backend
for manipulating tensors.

With NumPy, you can define arbitrary data types and easily integrate with most databases.
NumPy can also serve as an efficient multi-dimensional container for generic data of
any datatype. The key features of NumPy include a powerful N-dimensional array object,
broadcasting functions, and out-of-the-box tools to integrate C/C++ and Fortran code.
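
As a quick illustration of these features, here is a minimal sketch (not part of the original manual; the values are arbitrary) showing array creation, broadcasting, and a basic linear-algebra operation:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array (matrix)
b = np.array([10, 20, 30])             # 1-D array

print(a.shape)    # (2, 3)
print(a + b)      # broadcasting: b is added to every row of a
print(a.dot(b))   # matrix-vector product: [140 320]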

SciPy

With machine learning growing at supersonic speed, many Python developers were
creating Python libraries for machine learning, especially for scientific and analytical
computing. In 2001, Travis Oliphant, Eric Jones, and Pearu Peterson decided to merge most of
these bits and pieces of code and standardize them. The resulting library was named SciPy.

The current development of the SciPy library is supported and sponsored by an open
community of developers and distributed under the free BSD license.

The SciPy library offers modules for linear algebra, optimization, integration,
interpolation, special functions, Fast Fourier Transform, signal and image processing, Ordinary
Differential Equation (ODE) solving, and other computational tasks in science and analytics.

The underlying data structure used by SciPy is a multi-dimensional array provided by the
NumPy module. SciPy depends on NumPy for the array manipulation subroutines. The SciPy
library was built to work with NumPy arrays along with providing user-friendly and efficient
numerical functions.

Scikit-learn



In 2007, David Cournapeau developed the Scikit-learn library as part of a Google Summer
of Code project. In 2010 INRIA got involved and made the first public release in January 2010.
Scikit-learn was built on top of two Python libraries, NumPy and SciPy, and has become the most
popular Python machine learning library for developing machine learning algorithms.

Scikit-learn has a wide range of supervised and unsupervised learning algorithms that work
through a consistent interface in Python. The library can also be used for data mining and data
analysis. The main machine learning functions that the Scikit-learn library can handle are
classification, regression, clustering, dimensionality reduction, model selection, and
preprocessing.
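
To illustrate that consistent interface, here is a minimal sketch (the choice of estimator and dataset is just an example, not part of the original manual); every scikit-learn estimator exposes the same fit()/predict() pattern:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier()   # any other estimator could be swapped in here
model.fit(X, y)                    # learn from the data
print(model.predict(X[:3]))        # predict on some samples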

Pandas

Pandas has turned out to be the most popular Python library used for data analysis, with
support for fast, flexible, and expressive data structures designed to work with both "relational"
and "labeled" data. Pandas today is an indispensable library for solving practical, real-world data
analysis problems in Python. Pandas is highly stable, providing highly optimized performance. The
backend code is purely written in C or Python.

The two main types of data structures used by pandas are:


 Series (1-dimensional)
 DataFrame (2-dimensional)
These two put together can handle a vast majority of data requirements and use cases from
most sectors like science, statistics, social, finance, and of course, analytics and other areas of
engineering. 
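
A minimal sketch (with made-up values) of these two data structures:

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])                    # labeled 1-dimensional data
df = pd.DataFrame({"name": ["A", "B", "C"], "marks": [65, 63, 64]})   # labeled 2-dimensional data

print(s["b"])          # access by label
print(df.describe())   # quick statistical summary of the numeric columns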

Pandas supports and performs well with different kinds of data, including the following:
 Tabular data with columns of heterogeneous data. For instance, consider data coming from
a SQL table or an Excel spreadsheet.
 Ordered and unordered time-series data. The frequency of the time series need not be fixed, unlike
in other libraries and tools; Pandas is exceptionally robust in handling uneven time-series data.
 Arbitrary matrix data with homogeneous or heterogeneous types of data in the rows and
columns.
 Any other form of statistical or observational data sets. The data need not be labeled at all;
Pandas data structures can process it even without labeling.

Matplotlib

Matplotlib is a data visualization library that is used for 2D plotting to produce publication-quality
image plots and figures in a variety of formats. The library helps to generate
histograms, plots, error charts, scatter plots, and bar charts with just a few lines of code.

PyTorch



PyTorch has a range of tools and libraries that support computer vision, machine learning, and
natural language processing. The PyTorch library is open-source and is based on the Torch
library. The most significant advantage of the PyTorch library is its ease of learning and use.

PyTorch can smoothly integrate with the Python data science stack, including NumPy. You
will hardly notice a difference between NumPy arrays and PyTorch tensors. PyTorch also allows
developers to perform computations on tensors. PyTorch has a robust framework to build
computational graphs on the go and even change them at runtime. Other advantages of
PyTorch include multi-GPU support, simplified preprocessors, and custom data loaders.

Keras

Keras had over 200,000 users as of November 2017. Keras is an open-source library used for
neural networks and machine learning. Keras can run on top of TensorFlow, Theano, Microsoft
Cognitive Toolkit, R, or PlaidML. Keras can also run efficiently on CPU and GPU.

Keras works with neural-network building blocks like layers, objectives, activation functions,
and optimizers. Keras also has a bunch of features for working with image and text data that
come in handy when writing deep neural network code.

Apart from the standard neural network, Keras supports convolutional and recurrent neural
networks. 

TensorFlow

TensorFlow was developed for Google’s internal use by the Google Brain team. Its first release
came in November 2015 under Apache License 2.0. TensorFlow is a popular computational
framework for creating machine learning models. TensorFlow supports a variety of different
toolkits for constructing models at varying levels of abstraction.

TensorFlow exposes stable Python and C++ APIs. It can expose backward-compatible
APIs for other languages too, but they might be unstable. TensorFlow has a flexible
architecture with which it can run on a variety of computational platforms: CPUs, GPUs, and
TPUs. TPU stands for Tensor Processing Unit, a hardware chip built around TensorFlow for
machine learning and artificial intelligence.

Theano

Theano is a Python machine learning library that can act as an optimizing compiler for
evaluating and manipulating mathematical expressions and matrix calculations. Built on
NumPy, Theano exhibits a tight integration with NumPy and has a very similar interface.
Theano can work on Graphics Processing Unit (GPU) and CPU.



Working on GPU architecture yields faster results. Theano can perform data-intensive
computations up to 140x faster on GPU than on a CPU. Theano can automatically avoid errors
and bugs when dealing with logarithmic and exponential functions. Theano has built-in tools
for unit-testing and validation, thereby avoiding bugs and problems. 
EXPERIMENT 3
Using Pandas & Matplotlib Libraries Know your Dataset, Use some statistic Functions on
them, Plot the different Graphs.

What is Pandas?

Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel
Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze big data and make conclusions based on statistical theories. Pandas
can clean messy data sets and make them readable and relevant. Relevant data is very
important in data science.

What Can Pandas Do?

Pandas gives you answers about the data. Like:

 Is there a correlation between two or more columns?


 What is average value?
 Max value?
 Min value?

Pandas is also able to delete rows that are not relevant or contain wrong values, like empty
or NULL values. This is called cleaning the data.

Loading Data Using Pandas


Machine learning algorithms need data, and we must provide it. But how do we load data? Here are
some ways to load data using Pandas.

We can load a CSV file using Pandas with the pandas.read_csv() function. In the example below
we assume that the file iris.csv is in our current working directory.

from pandas import read_csv

data = read_csv("iris.csv")
data.head()

We can print the data using head(), which returns the first n rows of an object (default n=5). It
helps in getting to know the data and the datatypes of the object.

If the file iris.csv is not in our current working directory, we need to specify the path, as shown in
the example below.

from pandas import read_csv


data = read_csv(r"C:\Users\Admin\Desktop\a\iris.csv")
data.head()
If the file iris.csv is not on our system, we can still load the dataset by specifying its URL, as
shown in the example below.

import pandas as pd
data = pd.read_csv("https://5minutesengineering.com/wp-content/uploads/2021/05/iris.csv")
data.head()
Understanding Data using Statistics

Dimensions (Number of Rows and Columns) in our Dataset

Using dataset_name.shape gives us the dimensions as the result. In the output below we can see
(150, 5), which indicates that the iris dataset has 150 rows and 5 columns.

from pandas import read_csv


data = read_csv("iris.csv")
data.shape



Getting Each Attribute’s Data Type

Using dataset_name.dtypes gives us each attribute's data type. In the output
below we can observe the data type of each attribute in our iris dataset.

from pandas import read_csv


data = read_csv("iris.csv")
data.dtypes

Statistical Summary of Data

from pandas import read_csv


data = read_csv("iris.csv")
data.describe()

Visualization Of Data using Matplotlib

How to Plot a Line ?

To answer the above question we need to import the matplotlib library and use the plot() function.
We can make our diagram more informative by providing a title and labels on the x and y axes
using title(), xlabel() and ylabel() respectively. By executing the code below we can
generate a line.
from matplotlib import pyplot as plt
import pandas as pd
data = pd.read_csv("iris.csv")
plt.plot([1,2,3],[4,5,6])
plt.title("line")
plt.ylabel("y axis")
plt.xlabel("x axis")
plt.show()

In the above example we used plain lists; now let's consider actual dataset
attributes.

plt.plot(data.sepal_length,data.petal_width)
plt.ylabel("Petal_width")
plt.xlabel("Sepal_length")
plt.show()



Barplot

We can create a barplot by using bar().

plt.bar(data.species,data.sepal_width)
plt.xlabel("species")
plt.ylabel("sepal_width")
plt.title("Barplot")
plt.show()

Histogram

We can create a histogram for our iris dataset by using hist().

plt.hist(data.sepal_width)
plt.show()



Pie chart

We can create a pie chart by using pie().

a=[10,20,60,50]
label=["Maths","Hindi","English","Science"]
cols=["green","red","yellow","blue"]
plt.pie(a,labels=label,colors=cols,startangle=120,explode=(0,0,0,0))
plt.show()

We can adjust the values of startangle and explode; for example, let's consider startangle=90
and explode=(0.1, 0.2, 0, 0). The result can be seen below.
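
A minimal sketch of that adjusted pie chart, reusing the same data values as above:

from matplotlib import pyplot as plt

a = [10, 20, 60, 50]
label = ["Maths", "Hindi", "English", "Science"]
cols = ["green", "red", "yellow", "blue"]

# Different start angle, with the first two slices pulled away from the centre
plt.pie(a, labels=label, colors=cols, startangle=90, explode=(0.1, 0.2, 0, 0))
plt.show()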



Boxplot

We can create a Boxplot by using boxplot().

plt.boxplot(data['sepal_length'])
plt.show()



EXPERIMENT 4
Write & Implement Linear Regression Using Sklearn Library in Python.

Single Variable
We load a dataset containing single independent variable which is in our below mentioned
example is distance. The data set contains bike prices as per distance travelled.

Dataset Values abc.csv

distance,price
20,10000
30,15000
80,50000
50,30000
10,5000
45,25000
30,20000
55,35000
60,40000
70,45000

import pandas as pd
df = pd.read_csv("abc.csv")
df.head(10)

Now let's try to visualize the data using matplotlib. Here we are visualizing the relation between
distance and price.

import matplotlib.pyplot as plt
plt.scatter(df['distance'],df['price'])
plt.xlabel('distance')
plt.ylabel('price')
plt.show()



Now we need to import LinearRegression from sklearn.linear_model to perform linear
regression on our dataset. After that we consider x (distance) as the independent variable and
y (price) as the dependent variable. Use fit() to train the model.

from sklearn.linear_model import LinearRegression

x = df[['distance']]
y = df['price']
reg = LinearRegression()
reg.fit(x,y)

After the model is trained, we try to predict the price of a bike with distance travelled 25, which
comes out to be 14055.56.

reg.predict([[25]])

Now let's find out how we got that prediction; for that we need to compute the coefficient and
intercept of the model.

reg.coef_



reg.intercept_

Using the formula y = mx + c, we compute the y (price) value. We know x = 25, m (coefficient)
= 672.22 and c (intercept) = -2750.00.

25*672.22222222-2750.0000000000036

Multiple Variables

We now modify our dataset by considering multiple independent variables (distance and
years); here the bike price depends on the distance travelled and the years since it was bought.

Dataset values abcd.csv

distance,years,price
500,5,5000
600,3,4500
100,1,7000
200,2,6000
400,7,3500
800,9,2000

import pandas as pd
df = pd.read_csv("abcd.csv")
df.head()

X = df[['distance', 'years']]
y = df['price']



To split the data into training and testing sets we import train_test_split from
sklearn.model_selection; test_size=0.3 means 30% of the data will be treated as testing data and
the remaining 70% as training data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=10)
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf

clf.fit(X_train, y_train)
clf.predict(X_test)

clf.score(X_test, y_test)
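
For context, score() on a regression model returns the coefficient of determination (R squared); a minimal sketch (assuming the variables defined above) showing the same value computed explicitly:

from sklearn.metrics import r2_score

y_pred = clf.predict(X_test)
print(r2_score(y_test, y_pred))   # matches the value returned by clf.score(X_test, y_test)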

Now let's find the coefficient and intercept values and predict for distance = 350 and years = 4.



clf.coef_
clf.intercept_
clf.predict([[350,4]])



EXPERIMENT 5
Write & Implement Logistic Regression Using Sklearn Library in Python.

Logistic regression is used when we have categorical data
('yes'/'no', 'true'/'false', '0'/'1', 'positive'/'negative'), as in the example below where
people of different ages played a particular game and the results are recorded as '0' for a loss and '1'
for a win.

Dataset Values abcde.csv

age,results
22,0
25,0
47,1
52,0
46,1
56,1
55,0
60,1
62,1
61,1
18,0
28,0
27,0
29,0
49,1
55,1
25,1
58,1
19,0
18,0
21,0
26,0
40,1
45,1
50,1
54,1
23,0

import pandas as pd
df = pd.read_csv("abcde.csv")
df.head(10)



Now we need to split the data into training and testing sets, and for that we import
train_test_split from sklearn.model_selection with train_size specified as 0.8, i.e. 80%
training data and the remaining 20% testing data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['age']],df.results,train_size=0.8,random_state=10)

Then we import LogisticRegression from sklearn.linear_model and use fit() to train the model.

from sklearn.linear_model import LogisticRegression


model = LogisticRegression()
model.fit(X_train, y_train)

After the model is trained, we try to predict by providing the X_test values to the model, and the
model generates predictions as shown below.

y_predicted = model.predict(X_test)
y_predicted

model.score(X_test,y_test)
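
Besides the hard 0/1 predictions, logistic regression can also give class probabilities; a minimal sketch (assuming the model and X_test from above; the age value 35 is just an example):

# Probability of losing (class 0) and winning (class 1) for each test sample
print(model.predict_proba(X_test))

# Prediction for a new, hypothetical age value
print(model.predict([[35]]))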

EXPERIMENT 6
Write & Implement SVM Using Sklearn Library in Python.

Support vector machines were initially designed for the prediction of numeric and categorical
outcomes. Today SVM is mostly used as a classification technique for predicting categorical
outcomes.

The graph referred to above shows two categories/groups, one in blue and the other in green.
In SVM we need to draw a line, also called a decision boundary, to predict the test data
correctly, i.e. whether a test point belongs to the blue or the green category.

The problem here is that many such decision boundaries are possible, and we need to choose
the single best/optimal line.



Red is the best decision boundary because we want the distance between the line and the nearest
data points to be as large as possible to get better predictions. Under this condition the red line is the
clear winner. The distance between the decision boundary and the nearest data points (called support
vectors) is called the margin. The margin is considered on both sides of the decision boundary.

import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df.head()

df['target'] = iris.target
df.head()

from sklearn.model_selection import train_test_split


X = df.drop(['target'], axis='columns')
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=10)
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)

model.score(X_test, y_test)

model.predict([[5.1,3.7,2.5,0.9]])
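
SVC uses the RBF kernel by default; a minimal sketch (assuming the train/test split above) of trying a different kernel and regularization strength, two parameters that are commonly experimented with:

from sklearn.svm import SVC

# Linear kernel with an explicit regularization value C
linear_model = SVC(kernel='linear', C=1.0)
linear_model.fit(X_train, y_train)
print(linear_model.score(X_test, y_test))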



EXPERIMENT 7
Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.

Naive Bayes is one of the simplest and most powerful algorithms for classification, based
on Bayes' Theorem with an assumption of independence among predictors. The Naive Bayes model
is easy to build and particularly useful for very large data sets. There are two parts to this
algorithm:

 Naive
 Bayes
The Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to any
other feature. Even if these features depend on each other or upon the existence of the other
features, all of these properties independently contribute to the probability that a particular fruit
is an apple or an orange or a banana and that is why it is known as “Naive”.

What is Bayes Theorem?


In Statistics and probability theory, Bayes’ theorem describes the probability of an event, based
on prior knowledge of conditions that might be related to the event. It serves as a way to figure
out conditional probability.

Given a hypothesis H and evidence E, Bayes' Theorem states that the relationship between the
probability of the hypothesis before getting the evidence, P(H), and the probability of the
hypothesis after getting the evidence, P(H|E), is:

P(H|E) = P(E|H) * P(H) / P(E)

This relates the probability of the hypothesis before getting the evidence, P(H), to the
probability of the hypothesis after getting the evidence, P(H|E). For this reason, P(H) is called the
prior probability, while P(H|E) is called the posterior probability. The factor that relates the
two, P(E|H) / P(E), is called the likelihood ratio. Using these terms, Bayes' theorem can be
rephrased as:
"The posterior probability equals the prior probability times the likelihood ratio."
Bayes’ Theorem Example

Let’s suppose we have a Deck of Cards, we wish to find out the “Probability of the Card we
picked at random to be a King given that it is a Face Card“. So, according to Bayes Theorem,
we can solve this problem. First, we need to find out the probability

P(King) which is 4/52 as there are 4 Kings in a Deck of Cards.


P(Face|King) is equal to 1 as all the Kings are face Cards.
P(Face) is equal to 12/52 as there are 3 Face Cards in a Suit of 13 cards and there are 4 Suits in
total.



Now, putting all the values in the Bayes’ Equation we get the result as 1/3
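
A quick numeric check of the card example, using the values given above:

# P(King|Face) = P(Face|King) * P(King) / P(Face)
p_king = 4 / 52
p_face_given_king = 1.0
p_face = 12 / 52

print(p_face_given_king * p_king / p_face)   # 0.333..., i.e. 1/3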

Game Prediction using Bayes’ Theorem

Let's continue and predict whether the game will be played, using the weather data we have.

So here we have our data, which comprises the Day, Outlook, Humidity and Wind conditions,
with the final column being Play, which we have to predict.

First, we will create a frequency table using each attribute of the dataset.



For each frequency table, we will generate a likelihood table.

Likelihood of ‘Yes’ given ‘Sunny‘ is:


P(c|x) = P(Yes|Sunny) = P(Sunny|Yes)* P(Yes) / P(Sunny) = (0.3 x 0.71) /0.36 = 0.591

Similarly Likelihood of ‘No’ given ‘Sunny‘ is:


P(c|x) = P(No|Sunny) = P(Sunny|No)* P(No) / P(Sunny) = (0.4 x 0.36) /0.36 = 0.40

Now, in the same way, we need to create the Likelihood Table for other attributes as well.



Suppose we have a Day with the following values :

Outlook = Rain
Humidity = High
Wind = Weak
Play =?

So, with the data, we have to predict whether “we can play on that day or not”.

Likelihood of ‘Yes’ on that Day = P(Outlook = Rain|Yes)*P(Humidity= High|Yes)* P(Wind=


Weak|Yes)*P(Yes)

= 2/9 * 3/9 * 6/9 * 9/14 = 0.0199

Likelihood of ‘No’ on that Day = P(Outlook = Rain|No)*P(Humidity= High|No)* P(Wind=


Weak|No)*P(No)

= 2/5 * 4/5 * 2/5 * 5/14 = 0.0166

Now we normalize the values, then


P(Yes) = 0.0199 / (0.0199+ 0.0166) = 0.55

P(No) = 0.0166 / (0.0199+ 0.0166) = 0.45

Our model predicts that there is a 55% chance of a game on that day.

Types of Naive Bayes Classifiers

There are three types of Naive Bayes classifiers:


 Gaussian
 Multinomial
 Bernoulli



CODE:
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*
100)

OUTPUT:
Gaussian Naive Bayes model accuracy(in %): 95.0
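
To see which test samples were classified correctly and incorrectly, a confusion matrix can be added; a minimal sketch (assuming y_test and y_pred from the code above):

# Rows are actual classes, columns are predicted classes;
# off-diagonal entries count the misclassified test samples.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))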



EXPERIMENT 8

Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set. Print
both correct and wrong predictions.

KNN is a simple supervised learning algorithm used for both regression and classification
problems.
KNN basically stores all available cases and classifies new cases based on their similarity to the
stored cases.
Concept: The concept that KNN works on is basically similarity measurement. For
example, if you look at a mango, it is more similar to an apple than to a dog or a cat, so KNN will
put it in the category of fruits, not in the category of animals.

What is K in KNN

In KNN, we train the model and after that we want to test it, meaning we want to classify new
data (test data). For that we check some number (K) of neighbouring points around it
and assign the most common class to the test data.

K - Number of nearest neighbours

K=1 means the test data is given the same label as the closest example in the training set.

K=4 means the labels of the four closest neighbours are checked and the most common class is
assigned to the test data.

Data Set:

Iris Plants Dataset: the dataset contains 150 instances (50 in each of three classes). Number of
attributes: 4 numeric, predictive attributes and the class.



Sample Data

CODE:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset to pandas dataframe


dataset = pd.read_csv(url, names=names)
dataset.head()



X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# The following script performs feature scaling

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Training the data

from sklearn.neighbors import KNeighborsClassifier


classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

#Prediction of data
y_pred = classifier.predict(X_test)

# Evaluating the Algorithm

from sklearn.metrics import classification_report, confusion_matrix


print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[13 0 0]
[ 0 6 2]
[ 0 0 9]]
precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 13


Iris-versicolor 1.00 0.75 0.86 8
Iris-virginica 0.82 1.00 0.90 9

accuracy 0.93 30
macro avg 0.94 0.92 0.92 30
weighted avg 0.95 0.93 0.93 30
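
The experiment also asks to print both correct and wrong predictions explicitly; a minimal sketch (assuming X_test, y_test and y_pred from the code above):

# Print each test sample's true and predicted label,
# marking whether the prediction was correct or wrong.
for features, true_label, pred_label in zip(X_test, y_test, y_pred):
    status = "CORRECT" if pred_label == true_label else "WRONG"
    print(status, features, "predicted:", pred_label, "actual:", true_label)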

# Comparing Error Rate with the K Value


error = []

# Calculating error for K values between 1 and 40


for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

# The next step is to plot the error values against K values. Execute the following script to create the plot:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')

Text(0, 0.5, 'Mean Error')



EXPERIMENT 9
Write & Implement K-Means Clustering Using Sklearn Library in Python.

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. In this topic, we will learn what is K-means
clustering algorithm, how the algorithm works, along with the Python implementation of k-
means clustering.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories of
groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in this
algorithm.

The k-means clustering algorithm mainly performs two tasks:

Determines the best value for K center points or centroids by an iterative process.

Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:



How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid
of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
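
As a complement to the sklearn-based code used below, here is a minimal from-scratch sketch of the assign/update loop described in the steps above (function and variable names are illustrative; it assumes every cluster keeps at least one point):

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids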

Dataset Values Book1.csv

name,rollno,marks
A,40,65
B,41,63
C,43,64
D,39,80
E,36,156
F,35,130
G,37,137



H,26,45
I,27,48
J,28,51
K,29,49
L,32,53
M,27,70
N,29,90
O,29,61
P,28,60
Q,42,150
R,39,155
S,41,160
T,38,162

Initially we load some basic libraries and our dataset.

from sklearn.cluster import KMeans

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
df = pd.read_csv("Book1.csv")
df.head()

To understand our data we can use visualization and plot a graph to find out the groups/clusters
that could be formed. From the plot below it is clear that we can easily create 3 clusters.

plt.scatter(df.rollno,df['marks'])
plt.xlabel('rollno')
plt.ylabel('marks')



Hence we use the KMeans method and set n_clusters equal to 3. After that we use the fit_predict
method to find which sample is assigned to which cluster. Here cluster 1 is '0', cluster
2 is '1' and cluster 3 is '2'.

km = KMeans(n_clusters=3)
predicted = km.fit_predict(df[['rollno','marks']])
predicted

Now we will add these predictions to our original dataset, and after that we can split the main
data frame into 3 data frames based on the cluster (0, 1 and 2). Then we can plot those points on a
graph with different colors to see whether we got the predictions right. From the plot below we can
see that 2 sample points which should belong to the green cluster are actually colored blue, which
means they were placed in the wrong cluster.

df['cluster']=predicted
df.head()
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.rollno,df1['marks'],color='green')
plt.scatter(df2.rollno,df2['marks'],color='red')
plt.scatter(df3.rollno,df3['marks'],color='blue')
plt.xlabel('rollno')
plt.ylabel('marks')



The sample points were put into the wrong group because our data is not properly scaled, so
now we need to scale our data using normalization. For that we use MinMaxScaler
to scale our data to the range 0 to 1. After that we again take predictions using the
KMeans and fit_predict methods, as we did previously on the unscaled data.

scale = MinMaxScaler()

scale.fit(df[['marks']])
df['marks'] = scale.transform(df[['marks']])

scale.fit(df[['rollno']])
df['rollno'] = scale.transform(df[['rollno']])

km = KMeans(n_clusters=3)
predicted = km.fit_predict(df[['rollno','marks']])
predicted



We will now delete or drop the previously added cluster column and add a new one with
updated predictions (from the scaled data) to our dataset. Now when we plot the clusters, we
can observe that each sample point of our dataset is clustered correctly.

df = df.drop(['cluster'], axis='columns')

df['cluster']=predicted
df.head()

df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.rollno,df1['marks'],color='green')
plt.scatter(df2.rollno,df2['marks'],color='red')
plt.scatter(df3.rollno,df3['marks'],color='blue')
plt.xlabel('rollno')
plt.ylabel('marks')

We can also generate the centroids of our clusters by using cluster_centers_

km.cluster_centers_

The generated centroid values can also be shown on our plot using the following code.
The centroid of each cluster is represented by a black '*' marker.



plt.scatter(df1.rollno,df1['marks'],color='green')
plt.scatter(df2.rollno,df2['marks'],color='red')
plt.scatter(df3.rollno,df3['marks'],color='blue')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='black',marker='*')
plt.xlabel('rollno')
plt.ylabel('marks')

EXPERIMENT 10
Write & Implement Apriori Algorithm in Python.

Association rule mining is a technique to identify underlying relations between different items.
Take the example of a supermarket where customers can buy a variety of items. Usually, there is
a pattern in what the customers buy. For instance, mothers with babies buy baby products such
as milk and diapers, damsels may buy makeup items, and bachelors may buy beers and
chips. In short, transactions involve a pattern. More profit can be generated if the
relationships between the items purchased in different transactions can be identified.
For instance, if items A and B are bought together frequently, then several steps can be
taken to increase the profit. For example:
A and B can be placed together so that when a customer buys one of the products he doesn't
have to go far to buy the other.
People who buy one of the products can be targeted through an advertisement campaign to buy
the other.
Collective discounts can be offered on these products if the customer buys both of them.
Both A and B can be packaged together.
The process of identifying associations between products is called association rule mining.



Different statistical algorithms have been developed to implement association rule mining, and
Apriori is one such algorithm. In this article we will study the theory behind the Apriori
algorithm and will later implement Apriori algorithm in Python.

Theory of Apriori Algorithm


There are three major components of Apriori algorithm:
Support
Confidence
Lift

Suppose we have a record of 1 thousand customer transactions, and we want to find the
Support, Confidence, and Lift for two items e.g. burgers and ketchup. Out of one thousand
transactions, 100 contain ketchup while 150 contain a burger. Out of 150 transactions where a
burger is purchased, 50 transactions contain ketchup as well. Using this data, we want to find
the support, confidence, and lift.
Support
Support refers to the default popularity of an item and can be calculated by finding the number of
transactions containing a particular item divided by the total number of transactions. Suppose we
want to find the support for item B. This can be calculated as:
Support(B) = (Transactions containing (B))/(Total Transactions)
For instance, if out of 1000 transactions, 100 transactions contain Ketchup, then the support for
item Ketchup can be calculated as:
Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)

Support(Ketchup) = 100/1000
= 10%

Confidence
Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be
calculated by finding the number of transactions where A and B are bought together, divided
by total number of transactions where A is bought. Mathematically, it can be represented as:
Confidence(A→B) = (Transactions containing both (A and B))/(Transactions
containing A)
Coming back to our problem, we had 50 transactions where Burger and Ketchup were bought
together, while burgers were bought in 150 transactions. The likelihood of buying
ketchup when a burger is bought can be represented as the confidence of Burger -> Ketchup and
can be mathematically written as:

Confidence(Burger→Ketchup) = (Transactions containing both (Burger
and Ketchup))/(Transactions containing Burger)

Confidence(Burger→Ketchup) = 50/150
= 33.3%

Lift
Lift(A -> B) refers to the increase in the ratio of the sale of B when A is sold. Lift(A -> B) can be
calculated by dividing Confidence(A -> B) by Support(B). Mathematically it can be
represented as:
Lift(A→B) = (Confidence (A→B))/(Support (B))
Coming back to our Burger and Ketchup problem, the Lift(Burger -> Ketchup) can be
calculated as:
Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support
(Ketchup))

Lift(Burger→Ketchup) = 33.3/10
= 3.33
Lift basically tells us that the likelihood of buying a Burger and Ketchup together is 3.33 times
more than the likelihood of just buying the ketchup. A Lift of 1 means there is no association
between products A and B. Lift of greater than 1 means products A and B are more likely to be
bought together. Finally, Lift of less than 1 refers to the case where two products are unlikely
to be bought together.
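
A quick numeric check of the burger/ketchup example above:

total_transactions = 1000
ketchup = 100      # transactions containing ketchup
burger = 150       # transactions containing a burger
both = 50          # transactions containing both

support_ketchup = ketchup / total_transactions                       # 0.10  -> 10%
confidence_burger_ketchup = both / burger                            # 0.333 -> 33.3%
lift_burger_ketchup = confidence_burger_ketchup / support_ketchup    # 3.33

print(support_ketchup, confidence_burger_ketchup, lift_burger_ketchup)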
Steps Involved in Apriori Algorithm
For large sets of data, there can be hundreds of items in hundreds of thousands of transactions.
The Apriori algorithm tries to extract rules for each possible combination of items. For
instance, Lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4, then
item 2 and item 3, item 2 and item 4, and then combinations of items, e.g. item 1, item 2
and item 3; similarly item 1, item 2 and item 4, and so on.

As you can see from the above example, this process can be extremely slow due to the number
of combinations. To speed up the process, we need to perform the following steps:

Set a minimum value for support and confidence. This means that we are only interested in
finding rules for the items that have certain default existence (e.g. support) and have a
minimum value for co-occurrence with other items (e.g. confidence).
Extract all the subsets having higher value of support than minimum threshold.
Select all the rules from the subsets with confidence value higher than minimum threshold.
Order the rules by descending order of Lift.



NOTE: IF YOU GET THE ERROR "ModuleNotFoundError: No module
named 'apyori'", THEN GO TO THE COMMAND PROMPT AND RUN THE
COMMAND pip install apyori TO INSTALL THE APYORI MODULE, THEN PRESS
ENTER.

Follow these steps to implement Apriori algorithm in Python:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from apyori import apriori

store_data = pd.read_csv("C:/MJ/store_data.csv", header=None)


display(store_data.head())
print(store_data.shape)

# we need the data in the form of a list for the Apriori algorithm

# Preprocessing on Data
records = []
for i in range(1, 7501):
    records.append([str(store_data.values[i, j]) for j in range(0, 20)])

print(type(records))
<class 'list'>

association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)



association_results = list(association_rules)

for i in range(0, len(association_results)):
    print(association_results[i][0])

frozenset({'chicken', 'light cream'})


frozenset({'mushroom cream sauce', 'escalope'})
frozenset({'pasta', 'escalope'})
frozenset({'ground beef', 'herb & pepper'})
frozenset({'ground beef', 'tomato sauce'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'pasta', 'shrimp'})
frozenset({'nan', 'chicken', 'light cream'})
frozenset({'chocolate', 'shrimp', 'frozen vegetables'})
frozenset({'ground beef', 'spaghetti', 'cooking oil'})
frozenset({'mushroom cream sauce', 'nan', 'escalope'})
frozenset({'pasta', 'nan', 'escalope'})
frozenset({'ground beef', 'spaghetti', 'frozen vegetables'})
frozenset({'frozen vegetables', 'milk', 'olive oil'})
frozenset({'mineral water', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'frozen vegetables', 'olive oil'})
frozenset({'spaghetti', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'frozen vegetables', 'tomatoes'})
frozenset({'ground beef', 'grated cheese', 'spaghetti'})
frozenset({'mineral water', 'ground beef', 'herb & pepper'})
frozenset({'ground beef', 'herb & pepper', 'nan'})
frozenset({'ground beef', 'herb & pepper', 'spaghetti'})
frozenset({'ground beef', 'milk', 'olive oil'})
frozenset({'ground beef', 'nan', 'tomato sauce'})
frozenset({'ground beef', 'shrimp', 'spaghetti'})
frozenset({'spaghetti', 'milk', 'olive oil'})
frozenset({'mineral water', 'olive oil', 'soup'})
frozenset({'whole wheat pasta', 'nan', 'olive oil'})
frozenset({'pasta', 'shrimp', 'nan'})
frozenset({'pancakes', 'spaghetti', 'olive oil'})
frozenset({'nan', 'chocolate', 'shrimp', 'frozen vegetables'})
frozenset({'ground beef', 'spaghetti', 'nan', 'cooking oil'})
frozenset({'nan', 'ground beef', 'spaghetti', 'frozen vegetables'})
frozenset({'mineral water', 'spaghetti', 'frozen vegetables', 'milk'})
frozenset({'nan', 'frozen vegetables', 'milk', 'olive oil'})
frozenset({'mineral water', 'nan', 'shrimp', 'frozen vegetables'})
frozenset({'nan', 'spaghetti', 'frozen vegetables', 'olive oil'})
frozenset({'nan', 'spaghetti', 'shrimp', 'frozen vegetables'})
frozenset({'nan', 'spaghetti', 'frozen vegetables', 'tomatoes'})
frozenset({'ground beef', 'grated cheese', 'nan', 'spaghetti'})
frozenset({'mineral water', 'ground beef', 'herb & pepper', 'nan'})
frozenset({'ground beef', 'herb & pepper', 'nan', 'spaghetti'})
frozenset({'ground beef', 'nan', 'milk', 'olive oil'})
frozenset({'ground beef', 'shrimp', 'nan', 'spaghetti'})
frozenset({'spaghetti', 'nan', 'milk', 'olive oil'})
frozenset({'mineral water', 'nan', 'olive oil', 'soup'})
frozenset({'pancakes', 'spaghetti', 'nan', 'olive oil'})
frozenset({'mineral water', 'spaghetti', 'nan', 'frozen vegetables', 'milk'})

for item in association_results:

    # first index of the inner list
    # Contains base item and add item
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # second index of the inner list
    print("Support: " + str(item[1]))

    # third index of the list located at 0th
    # of the third index of the inner list
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Rule: chicken -> light cream


Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
=====================================
Rule: pasta -> escalope
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
=====================================
Rule: ground beef -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
=====================================
Rule: ground beef -> tomato sauce
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
=====================================
Rule: whole wheat pasta -> olive oil
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
=====================================
Rule: pasta -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848

Lift: 4.514493901473151
=====================================
Rule: nan -> chicken
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
=====================================
Rule: chocolate -> shrimp
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
=====================================
Rule: ground beef -> spaghetti
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
=====================================
Rule: mushroom cream sauce -> nan
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
=====================================
Rule: pasta -> nan
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
=====================================
Rule: ground beef -> spaghetti
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
=====================================
Rule: frozen vegetables -> milk
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
=====================================
Rule: mineral water -> shrimp
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
=====================================
Rule: spaghetti -> frozen vegetables
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174

=====================================
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
=====================================
Rule: spaghetti -> frozen vegetables
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
=====================================
Rule: ground beef -> grated cheese
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
=====================================
Rule: mineral water -> ground beef
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
=====================================
Rule: ground beef -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
=====================================
Rule: ground beef -> herb & pepper
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
=====================================
Rule: ground beef -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
=====================================
Rule: ground beef -> nan
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
=====================================
Rule: ground beef -> shrimp
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
=====================================

Rule: spaghetti -> milk
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
=====================================
Rule: mineral water -> olive oil
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
=====================================
Rule: whole wheat pasta -> nan
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
=====================================
Rule: pasta -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
=====================================
Rule: pancakes -> spaghetti
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
=====================================
Rule: nan -> chocolate
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
=====================================
Rule: ground beef -> spaghetti
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
=====================================
Rule: nan -> ground beef
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
=====================================
Rule: mineral water -> spaghetti
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
=====================================
Rule: nan -> frozen vegetables

Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
=====================================
Rule: mineral water -> nan
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
=====================================
Rule: nan -> spaghetti
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
=====================================
Rule: nan -> spaghetti
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
=====================================
Rule: nan -> spaghetti
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
=====================================
Rule: ground beef -> grated cheese
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
=====================================
Rule: mineral water -> ground beef
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
=====================================
Rule: ground beef -> herb & pepper
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
=====================================
Rule: ground beef -> nan
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
=====================================
Rule: ground beef -> shrimp
Support: 0.006

Confidence: 0.5232558139534884
Lift: 3.004914704939635
=====================================
Rule: spaghetti -> nan
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
=====================================
Rule: mineral water -> nan
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
=====================================
Rule: pancakes -> spaghetti
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
=====================================
Rule: mineral water -> spaghetti
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
===================================

Conclusion
Association rule mining algorithms such as Apriori are very useful for finding simple
associations between data items. They are easy to implement and highly explainable.
However, for more advanced insights, such as those used by Google or Amazon, more
complex techniques, such as recommender systems, are used. Still, this method is a very
simple way to get basic associations if that is all your use case needs.
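
Note that several of the rules above contain the item 'nan'. This happens because the transactions have different lengths, so pandas fills the empty cells of shorter rows with NaN, which then becomes the literal item 'nan' once the cells are converted to strings. The following is a minimal sketch (not the exact lab code) of how those entries can be filtered out while building the transaction list; it assumes the apyori package and a hypothetical file name store_data.csv with one transaction per row.

import pandas as pd
from apyori import apriori

df = pd.read_csv('store_data.csv', header=None)

transactions = []
for _, row in df.iterrows():
    # keep only non-missing cells so NaN never becomes an item
    transactions.append([str(v) for v in row if pd.notna(v)])

rules = apriori(transactions, min_support=0.0045,
                min_confidence=0.2, min_lift=3)

for rule in rules:
    print("Rule:", list(rule.items))
    print("Support:", rule.support)
    print("Confidence:", rule.ordered_statistics[0].confidence)
    print("Lift:", rule.ordered_statistics[0].lift)
    print("=====================================")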

VIVA QUESTIONS
1) What do you understand by Machine learning?

Machine learning is a form of Artificial Intelligence that automates data analysis to enable
computers to learn and act from experience without being explicitly programmed.

For example, Robots are coded in such a way that they can perform the tasks based on data
they collect from sensors. They automatically learn programs from data and improve with
experiences.

2) Differentiate between inductive learning and deductive learning?

In inductive learning, the model learns by example from a set of observed instances and draws a
generalized conclusion. In deductive learning, the model starts from already established
conclusions or rules and applies them to new observations.

Inductive learning is the method of using observations to draw conclusions.

Deductive learning is the method of using conclusions to form observations.

For example, if we have to explain to a kid that playing with fire can cause burns. There are
two ways we can explain this to a kid; we can show training examples of various fire accidents
or images of burnt people and label them as "Hazardous". In this case, a kid will understand
with the help of examples and not play with the fire. It is the form of Inductive machine
learning. The other way to teach the same thing is to let the kid play with the fire and wait to
see what happens. If the kid gets a burn, it will teach the kid not to play with fire and avoid
going near it. It is the form of deductive learning.

3) What is the difference between Data Mining and Machine Learning?

Data mining can be described as the process of abstracting knowledge or interesting unknown
patterns from structured data. During this process, machine learning algorithms may be used.

Machine learning represents the study, design, and development of algorithms which give
computers the ability to learn without being explicitly programmed.

4) What is the meaning of Overfitting in Machine learning?

Overfitting can be seen in machine learning when a statistical model describes random error or
noise instead of the underlying relationship. Overfitting is usually observed when a model is
excessively complex, because it has too many parameters relative to the number of training
data points. A model that has been overfitted displays poor performance on new data.

5) Why overfitting occurs?

Overfitting can occur when the criteria used to train the model are not the same as the criteria
used to judge its efficiency, or when the model learns the noise in a small training set instead of
the underlying pattern.

6) What is the method to avoid overfitting?

Overfitting occurs when we have a small dataset and a model tries to learn from it. It can be
avoided by using a large amount of data. But if we have a small dataset and are forced to build a
model on it, we can use a technique known as cross-validation. In this method, the model is
trained on a dataset of known data (the training set) and tested against a dataset of unknown
data (the validation set). The primary aim of cross-validation is to define a dataset to "test" the
model during the training phase. Regularization is another common way to prevent overfitting.
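
As a small illustration of cross-validation (a sketch only, using scikit-learn and the Iris toy dataset rather than any particular lab dataset), cross_val_score splits the data into k folds, trains on k-1 of them and validates on the remaining fold, repeating the process k times:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())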

7) Differentiate supervised and unsupervised machine learning.

In supervised machine learning, the machine is trained using labeled data. Then a new dataset
is given into the learning model so that the algorithm provides a positive outcome by analyzing
the labeled data. For example, we first require to label the data which is necessary to train the
model while performing classification.

In unsupervised machine learning, the machine is not trained on labeled data; the algorithm is
left to find structure and make decisions without any corresponding output variables.

8) How does Machine Learning differ from Deep Learning?

Machine learning is all about algorithms which are used to parse data, learn from that data, and
then apply whatever they have learned to make informed decisions.

Deep learning is a part of machine learning, which is inspired by the structure of the human
brain and is particularly useful in feature detection.

9) How is KNN different from k-means?

KNN, or k-nearest neighbours, is a supervised algorithm used for classification. In KNN, a test
sample is assigned the class of the majority of its nearest neighbours. K-means, on the other
side, is an unsupervised algorithm mainly used for clustering. K-means clustering needs only a
set of unlabeled points and the number of clusters k; the algorithm takes the unlabeled data and
learns how to group it into clusters by repeatedly computing the mean (centroid) of each cluster
and reassigning points to the nearest centroid.
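
The difference can be seen in a short sketch (for illustration only, using the Iris dataset): KNN is supervised and needs the labels y, while k-means is unsupervised and needs only the points X and the number of clusters k.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: KNN learns from labeled data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print("KNN prediction for first sample:", knn.predict(X[:1]))

# Unsupervised: k-means only sees X and groups it into k clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print("Cluster assigned to first sample:", labels[0])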

10) What are the different types of Algorithm methods in Machine Learning?

The different types of algorithm methods in machine learning are:

Supervised Learning

Semi-supervised Learning

Unsupervised Learning

Transduction

Reinforcement Learning

11) What do you understand by Reinforcement Learning technique?

Reinforcement learning is an algorithm technique used in Machine Learning. It involves an
agent that interacts with its environment by producing actions and discovering errors or rewards.
Reinforcement learning is employed by different software and machines to search for the best
suitable behavior or path it should follow in a specific situation. It usually learns on the basis of
reward or penalty given for every action it performs.

12) What is the trade-off between bias and variance?

Both bias and variance are errors. Bias is an error due to erroneous or overly simplistic
assumptions in the learning algorithm. It can lead to the model under-fitting the data, making it
hard to have high predictive accuracy and generalize the knowledge from the training set to the
test set.

Variance is an error due to too much complexity in the learning algorithm. It leads to the
algorithm being highly sensitive to high degrees of variation in the training data, which can
lead the model to overfit the data.

To optimally reduce the number of errors, we need to trade off bias and variance.

13) How do classification and regression differ?

Classification is the task of predicting a discrete class label, whereas regression is the task of
predicting a continuous quantity.

In a classification problem, the data is labeled into one of two or more classes, while a
regression problem requires the prediction of a quantity.

A classification problem with two classes is called binary classification, and one with more
than two classes is called multi-class classification. A regression problem with multiple input
variables is called a multivariate regression problem.

Classifying an email as spam or non-spam is an example of a classification problem; predicting
the price of a stock over a period of time is a regression problem.
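
A quick sketch of the contrast (illustration only, using scikit-learn toy datasets): a classifier predicts a discrete label, while a regressor predicts a continuous value.

from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: discrete class label
Xc, yc = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)
print("Predicted class:", clf.predict(Xc[:1]))     # e.g. 0, 1 or 2

# Regression: continuous quantity
Xr, yr = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(Xr, yr)
print("Predicted value:", reg.predict(Xr[:1]))     # a real number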

14) What are the five popular algorithms we use in Machine Learning?

Five popular algorithms are:

Decision Trees

Probabilistic Networks

Neural Networks

Support Vector Machines

Nearest Neighbor

15) What do you mean by ensemble learning?

Ensemble learning is the technique of strategically building and combining numerous models,
such as classifiers, to solve a particular computational problem. Ensemble methods are also
known as committee-based learning or learning multiple classifier systems. They train multiple
hypotheses to fix the same issue. One of the most suitable examples of ensemble modelling is
the random forest, where several decision trees are combined to predict outcomes. Ensemble
learning is used to improve the classification, function approximation, prediction, etc. of a model.
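
A minimal random forest sketch (illustration only, on the Iris dataset) shows the idea: many decision trees are trained and their votes are combined.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)   # 100 trees
forest.fit(X_train, y_train)
print("Ensemble accuracy:", forest.score(X_test, y_test))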

16) What is a model selection in Machine Learning?

The process of choosing a model among diverse mathematical models which are used to describe
the same data is known as model selection. Model selection is applied in the fields of statistics,
data mining, and machine learning.

17) What are the three stages of building the hypotheses or model in machine learning?

There are three stages to build hypotheses or a model in machine learning:

Model building

Choose a suitable algorithm for the model and train it according to the requirement of the
problem.

Model testing

Check the accuracy of the model using the test data.

Applying the model

Make the required changes after testing and apply the final model.

18) What according to you, is the standard approach to supervised learning?

In supervised learning, the standard approach is to split the set of examples into a training set
and a test set.

19) Describe 'Training set' and 'Test set'.

In machine learning, the set of data used to discover the potentially predictive relationship is
known as the 'training set'. The training set is the set of examples given to the learner. The 'test
set' is used to test the accuracy of the hypotheses generated by the learner; it is the set of
instances held back from the learner. Thus, the training set is distinct from the test set.
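
A short sketch of the split with scikit-learn (the 80/20 ratio is only a common convention, and Iris is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print("Training samples:", len(X_train))   # used to fit the model
print("Test samples:", len(X_test))        # held back to evaluate it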

20) What are the common ways to handle missing data in a dataset?

Missing data is a standard issue when working with and handling data, and it is considered one
of the greatest challenges faced by data analysts. There are many ways to impute missing
values. Some of the common methods to handle missing data in datasets are deleting the rows,
replacing values with the mean/median/mode, predicting the missing values, assigning a unique
category, and using algorithms that support missing values.
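
A hedged sketch of the common options with pandas, on a small made-up DataFrame (the column names are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 30, 22],
                   'city': ['Bhilwara', 'Jaipur', None, 'Udaipur']})

dropped = df.dropna()                               # delete rows with missing values
filled_mean = df.fillna({'age': df['age'].mean()})  # replace with the mean
filled_cat = df.fillna({'city': 'Unknown'})         # assign a unique category

print(dropped)
print(filled_mean)
print(filled_cat)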

21) What do you understand by ILP?

ILP stands for Inductive Logic Programming. It is a part of machine learning which uses logic
programming. It aims at searching patterns in data which can be used to build predictive
models. In this process, the logic programs are assumed as a hypothesis.

22) What are the necessary steps involved in Machine Learning Project?

There are several essential steps we must follow to achieve a good working model while doing
a Machine Learning Project. Those steps may include parameter tuning, data preparation, data
collection, training the model, model evaluation, and prediction, etc.

23) Describe Precision and Recall?

Precision and recall are both measures used in the information retrieval domain to measure how
well an information retrieval system retrieves the data relevant to a user's request.

Precision can be described as the positive predictive value. It is the fraction of relevant
instances among the retrieved instances.

Recall, on the other hand, is the fraction of relevant instances that have been retrieved out of the
total number of relevant instances. Recall is also known as sensitivity.
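
Both measures are available in scikit-learn; the sketch below uses made-up labels purely for illustration.

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))         # TP / (TP + FN)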

24) What do you understand by Decision Tree in Machine Learning?

A Decision Tree is a Supervised Machine Learning method in which the data is continuously
split according to a certain parameter. It builds classification or regression models in the form
of a tree structure, breaking the dataset into ever smaller subsets while the tree is developed.
The tree can be defined by two entities, namely decision nodes and leaves. The leaves are the
decisions or the outcomes, and the decision nodes are where the data is split. Decision trees can
handle both categorical and numerical data.
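
A minimal decision-tree sketch (illustration only, on the Iris dataset); export_text prints the learned decision nodes and leaves.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))        # the decision nodes and leaves
print(tree.predict(X[:1]))      # predicted class for the first sample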

25) What are the functions of Supervised Learning?

Classification

Speech Recognition

Regression

Predict Time Series

Annotate Strings

26) What are the functions of Unsupervised Learning?

Finding clusters of the data

Finding low-dimensional representations of the data

Finding interesting directions in data

Finding novel observations/ database cleaning

Finding interesting coordinates and correlations

27) What do you understand by algorithm independent machine learning?

Algorithm independent machine learning can be defined as machine learning where the
mathematical foundations are independent of any particular classifier or learning algorithm.

28) Describe the classifier in machine learning.

A classifier is a case of a hypothesis or discrete-valued function which is used to assign class
labels to particular data points. It is a system that inputs a vector of discrete or continuous
feature values and outputs a single discrete value, the class.

29) What do you mean by Genetic Programming?

Genetic Programming (GP) is almost similar to an Evolutionary Algorithm, a subset of
machine learning. Genetic programming software systems implement an algorithm that uses
random mutation, a fitness function, crossover, and multiple generations of evolution to resolve
a user-defined task. The genetic programming model is based on testing and choosing the best
option among a set of results.

30) What is SVM in machine learning? What are the classification methods that SVM can
handle?

SVM stands for Support Vector Machine. SVMs are supervised learning models with associated
learning algorithms that analyze data used for classification and regression analysis.

The classification methods that SVM can handle are:

Combining binary classifiers


Modifying binary to incorporate multiclass learning

31) How will you explain a linked list and an array?

An array is a datatype which is widely implemented as a default type, in almost all the modern
programming languages. It is used to store data of a similar type.

But there are many use-cases where we don't know the quantity of data to be stored. For such
cases, advanced data structures are required, and one such data structure is linked list.

There are some points which explain how the linked list is different from an array:

An array is a group of elements of a similar data type, whereas a linked list is an ordered group
of elements of the same type, connected using pointers.

In an array, elements are stored consecutively in memory; in a linked list, new elements can be
stored anywhere in memory.

An array supports random access: elements can be accessed directly using their index value,
like arr[0] for the 1st element or arr[5] for the 6th element, so accessing an element is fast, with
constant time complexity O(1). A linked list supports only sequential access: to reach a
particular node we have to traverse the list from the beginning, so accessing the nth element has
time complexity O(n).

For an array, memory is allocated at compile time, as soon as the array is declared (static
memory allocation). For a linked list, memory is allocated at runtime, whenever a new node is
added (dynamic memory allocation).

Insertion and deletion take more time in an array because the memory locations are consecutive
and fixed. In a linked list, a new element is stored at the first free available memory location, so
insertion and deletion are fast.

The size of an array must be declared at the time of declaration, whereas the size of a linked list
is variable and grows at runtime whenever nodes are added.

32) What do you understand by the Confusion Matrix?

A confusion matrix is a table which is used for summarizing the performance of a classification
algorithm. It is also known as the error matrix. For a binary classifier it has the following layout:

                    Predicted Positive      Predicted Negative
Actual Positive     TP (True Positive)      FN (False Negative)
Actual Negative     FP (False Positive)     TN (True Negative)

Where,

TN = True Negative
TP = True Positive
FN = False Negative
FP = False Positive

33) Explain True Positive, True Negative, False Positive, and False Negative in Confusion
Matrix with an example.

True Positive

When a model correctly predicts the positive class, it is said to be a true positive.

For example, Umpire gives a Batsman NOT OUT when he is NOT OUT.

True Negative

When a model correctly predicts the negative class, it is said to be a true negative.

For example, Umpire gives a Batsman OUT when he is OUT.

False Positive

When a model incorrectly predicts the positive class, it is said to be a false positive. It is also
known as 'Type I' error.

For example, Umpire gives a Batsman NOT OUT when he is OUT.

False Negative

When a model incorrectly predicts the negative class, it is said to be a false negative. It is also
known as 'Type II' error.

For example, Umpire gives a Batsman OUT when he is NOT OUT.
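
The same four quantities can be read off a confusion matrix computed with scikit-learn; the labels below are made up (1 = OUT, 0 = NOT OUT, following the umpire example).

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)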

34) What according to you, is more important between model accuracy and model
performance?

Model accuracy is a subset of model performance. The accuracy of the model is directly
proportional to the performance of the model. Thus, better the performance of the model, more
accurate are the predictions.

35) What is Bagging and Boosting?

Bagging is a process in ensemble learning which is used for improving unstable estimation or
classification schemes.

Boosting methods are used sequentially to reduce the bias of the combined model.

36) What are the similarities and differences between bagging and boosting in Machine
Learning?

Similarities of Bagging and Boosting

Both are the ensemble methods to get N learns from 1 learner.

Both generate several training data sets with random sampling.

Both generate the final result by taking the average of N learners.

Both reduce variance and provide higher scalability.

Differences between Bagging and Boosting

In Bagging, the individual models are built independently, whereas in Boosting, new models are
added sequentially so that they perform well where the previous models failed.

Only Boosting determines the weight for the data to tip the scales in favor of the most
challenging cases.

Only Boosting tries to reduce bias. Bagging may solve the problem of over-fitting, while
Boosting can increase it.
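
A hedged sketch of both methods in scikit-learn (illustration only, on the Iris dataset): BaggingClassifier trains its trees independently on bootstrap samples, while AdaBoostClassifier adds trees sequentially and re-weights the difficult cases.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
base = DecisionTreeClassifier(max_depth=1)

# note: recent scikit-learn versions use 'estimator'; older ones used 'base_estimator'
bagging = BaggingClassifier(estimator=base, n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0)

print("Bagging accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting accuracy:", cross_val_score(boosting, X, y, cv=5).mean())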

37) What do you understand by Cluster Sampling?

Cluster sampling is the process of randomly selecting intact groups within a defined population
that share similar characteristics. A cluster sample is a probability sample in which each
sampling unit is a collection or cluster of elements.

For example, if we are clustering the total number of managers in a set of companies, in that
case, managers (sample) will represent elements and companies will represent clusters.

38) What do you know about Bayesian Networks?

Bayesian Networks, also referred to as 'belief networks' or 'causal networks', are used to
represent a graphical model for the probability relationships among a set of variables.

For example, a Bayesian network can be used to represent the probabilistic relationships
between diseases and symptoms. As per the symptoms, the network can also compute the
probabilities of the presence of various diseases.

Efficient algorithms can perform inference or learning in Bayesian networks. Bayesian
networks which relate the variables (e.g., speech signals or protein sequences) are called
dynamic Bayesian networks.

39) Which are the two components of Bayesian logic program?

A Bayesian logic program consists of two components:

Logical

It contains a set of Bayesian Clauses, which capture the qualitative structure of the domain.

Quantitative

It is used to encode quantitative information about the domain.

40) Describe dimension reduction in machine learning.

Dimension reduction is the process used to reduce the number of random variables under
consideration.

Dimension reduction can be divided into feature selection and extraction.

41) Why is an instance-based learning algorithm sometimes referred to as a lazy learning algorithm?

In machine learning, lazy learning can be described as a method where induction and
generalization processes are delayed until classification is performed. Because of the same
property, an instance-based learning algorithm is sometimes called lazy learning algorithm.

42) What do you understand by the F1 score?

The F1 score represents a measurement of a model's performance. It is the weighted average
(harmonic mean) of the precision and recall of the model. Results tending to 1 are considered
the best, and those tending to 0 are the worst. It can be used in classification tests where true
negatives don't matter much.
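
In formula form, F1 = 2 x (precision x recall) / (precision + recall). A tiny sketch with made-up labels confirms that sklearn's f1_score matches the formula:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print("F1 (sklearn):", f1_score(y_true, y_pred))
print("F1 (formula):", 2 * p * r / (p + r))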

43) How is a decision tree pruned?

Pruning is said to occur in decision trees when the branches which may consist of weak
predictive power are removed to reduce the complexity of the model and increase the
predictive accuracy of a decision tree model. Pruning can occur bottom-up and top-down, with
approaches such as reduced error pruning and cost complexity pruning.

Reduced error pruning is the simplest version: starting from the leaves, each node is replaced
with its most popular class, and the change is kept only if it does not decrease predictive
accuracy. Although simple, it usually comes pretty close to an approach that would optimize for
maximum accuracy.

44) What are Recommender Systems?

A Recommender System is a subclass of information filtering systems. It predicts the
preferences or ratings that a user would give to a product and, based on these preferences,
provides similar recommendations to the user. Recommender systems are widely used for
movies, news, research articles, products, social tags, music, etc.

45) What do you understand by Underfitting?

Underfitting occurs when a model is too simple to capture the underlying pattern in the data, so
it shows high error on both the training set and the test set. Such models may work for simple
interpretations but fail to make good predictions.

46) When does regularization become necessary in Machine Learning?

Regularization becomes necessary whenever the model begins to overfit. It adds a cost term for
bringing in more features with the objective function; hence, it tries to push the coefficients of
many variables towards zero and reduce the cost term. This helps to reduce model complexity
so that the model can become better at predicting (generalizing).

47) What is Regularization? What kind of problems does regularization solve?

A regularization is a form of regression which constrains/regularizes or shrinks the coefficient
estimates towards zero. In other words, it discourages learning a more complex or flexible
model, to avoid the risk of overfitting. It reduces the variance of the model without a substantial
increase in its bias.

Regularization is used to address overfitting problems, as it penalizes the loss function by
adding a multiple of an L1 (LASSO) or an L2 (Ridge) norm of the weights vector w.
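
A minimal sketch of L2 (Ridge) and L1 (LASSO) regularization with scikit-learn, on the diabetes toy dataset chosen only for illustration; a larger alpha shrinks the coefficients more strongly towards zero.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # penalizes the L2 norm of w
lasso = Lasso(alpha=1.0).fit(X, y)   # penalizes the L1 norm of w

print("OLS coefficients:", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)   # many are pushed exactly to zero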

48) Why do we need to convert categorical variables into factor? Which functions are used to
perform the conversion?

Most machine learning algorithms require numbers as input, so we convert categorical values
into factors (categories) to obtain numerical values; this also saves us from creating dummy
variables by hand.

In R, the functions factor() and as.factor() are used to convert variables into factors. In Python,
which this lab uses, pandas offers the equivalent astype('category') and pd.get_dummies().
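
A small pandas sketch of the Python equivalent (the DataFrame and column name are hypothetical):

import pandas as pd

df = pd.DataFrame({'city': ['Bhilwara', 'Jaipur', 'Bhilwara', 'Udaipur']})

# equivalent of R's factor(): store the column as a categorical type
df['city'] = df['city'].astype('category')
df['city_code'] = df['city'].cat.codes          # integer code per category

# or one-hot encode it (dummy variables)
dummies = pd.get_dummies(df['city'], prefix='city')

print(df)
print(dummies)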

49) Do you think that treating a categorical variable as a continuous variable would result in a
better predictive model?

For a better predictive model, the categorical variable can be considered as a continuous
variable only when the variable is ordinal in nature.

50) How is machine learning used in day-to-day life?

Most people already use machine learning in their everyday life. When you engage with the
internet, you are actually expressing your preferences, likes, and dislikes through your searches.
All of this is picked up by cookies on your computer, and from it the behaviour of the user is
evaluated. This helps to improve the user's experience on the internet and to provide similar
suggestions.

Navigation systems are another example, where machine learning and optimization techniques
are used to calculate the distance and best route between two places. Surely, people are going
to engage even more with machine learning in the near future.

THE END

THANK YOU!
