
COMPARATIVE STUDY OF MACHINE LEARNING ALGORITHMS FOR HEART DISEASE PREDICTION


A Project Report Submitted to the

MOTHER TERESA WOMEN’S UNIVERSITY


In partial fulfillment of the requirement for the award of degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE


Submitted by
S.DIVYA
(Reg. No.: 214326ER007)

Under the guidance of


Dr. S.RAJATHI, MCA., M.Phil., Ph.D.,
Assistant Professor

PG AND RESEARCH DEPARTMENT OF COMPUTER SCIENCE


M.V.MUTHIAH GOVERNMENT ARTS COLLEGE FOR WOMEN
(Affiliated to Mother Teresa Women’s University, Kodaikanal)
Reaccredited with “A” Grade by NAAC
DINDIGUL – 624 001
APRIL - 2023
M.V.MUTHIAH GOVERNMENT ARTS COLLEGE FOR WOMEN
(Affiliated to Mother Teresa Women’s University, Kodaikanal)
Reaccredited with “A” Grade by NAAC

PG AND RESEARCH DEPARTMENT OF COMPUTER SCIENCE


DINDIGUL – 624 001

BONAFIDE CERTIFICATE

This is to certify that the project entitled “COMPARATIVE STUDY OF MACHINE
LEARNING ALGORITHMS FOR HEART DISEASE PREDICTION” is a bonafide
record of work carried out by S.DIVYA (Reg No: 214326ER007) in partial fulfillment for
the award of the degree of Master of Science in Computer Science at M.V.MUTHIAH
GOVERNMENT ARTS COLLEGE FOR WOMEN, DINDIGUL during the period
December 2022 to March 2023.

Internal Guide Head of the Department

Submitted for Viva-Voce Examination held on _______________

INTERNAL EXAMINER EXTERNAL EXAMINER


DECLARATION

I hereby declare that the project work entitled “COMPARATIVE STUDY OF
MACHINE LEARNING ALGORITHMS FOR HEART DISEASE PREDICTION”,
submitted to the PG AND RESEARCH DEPARTMENT OF COMPUTER SCIENCE,
M.V.MUTHIAH GOVERNMENT ARTS COLLEGE FOR WOMEN, DINDIGUL in
partial fulfillment of the requirements for the award of Master of Science in Computer
Science, is a record of original work done by me under the supervision and guidance of
Dr. S.RAJATHI, MCA., M.Phil., Ph.D.

Place: Dindigul Signature of the Candidate


Date: S.DIVYA
(Reg.No:214326ER007)
ACKNOWLEDGEMENT

I thank GOD for the love and blessings showered upon me in completing this
project successfully.

I take this opportunity, as an esteemed honor, to thank our beloved Principal
Dr. D.LAKSHMI, M.Sc., M.Phil., Ph.D., M. V. Muthiah Government Arts College for
Women, Dindigul, for effectively leading us to meet the requirements and necessities of
the project.

I ceremoniously thank Dr. I.RAJESHWARI, M.Sc., M.Phil., M.B.A., Ph.D.,
Associate Professor and Head, Department of Computer Science, who has been the soul
behind this effort.

I take immense pleasure in conveying my sincere thanks to my guide,
Dr. S.RAJATHI, MCA., M.Phil., Ph.D., Assistant Professor, PG and Research
Department of Computer Science, for her valuable suggestions and encouragement
throughout this project.

I sincerely thank all the faculty of the PG and Research Department of Computer
Science, M. V. Muthiah Government Arts College for Women, Dindigul, for their generous
help and encouragement in completing my project successfully, and I would also like to
express my deep sense of gratitude to all who have helped me to complete the project work.

DIVYA.S
CONTENTS

CHAPTER NO    TITLE

              ABSTRACT

1             INTRODUCTION

2             SYSTEM SPECIFICATION
              2.1 Hardware Specification
              2.2 Software Specification
              2.3 Problem Statement

3             PROBLEM DESCRIPTION
              3.1 Data Collection
              3.2 Data Preprocessing
              3.3 Exploratory Data Analysis
              3.4 Split the Dataset into Training and Testing
              3.5 Model Building
              3.6 Evaluation through Performance Metrics

4             SYSTEM ANALYSIS
              4.1 Existing System
              4.2 Proposed System
              4.3 Comparative Analysis of Machine Learning Algorithms

5             CONCLUSION

6             FUTURE ENHANCEMENT

7             BIBLIOGRAPHY

8             APPENDIX
              1. Sample Coding
              2. Screen Layout
ABSTRACT

Heart diseases, also known as cardiovascular diseases, encompass a wide range of
conditions that affect the heart, from blood vessel diseases and heart rhythm problems to
heart defects that one is born with. They have been the primary cause of death worldwide over
the past few decades. It is the need of the hour to obtain an accurate and reliable approach to
early diagnosis of the disease by automating the task, and hence realize efficient management
of it. Data Science plays an important role in processing large amounts of data in the field of
medical sciences. Researchers utilize several Data Mining and Machine Learning techniques
to analyze large sets of data and aid in the correct prediction of heart diseases. In this project,
machine learning algorithms, implemented in Python, were used for a comparative analysis to
predict the presence of heart disease in a patient. This is a classification problem, with a variety
of parameters as input features and a binary target variable predicting whether heart disease is
present or not. This project analyzes the supervised learning models Support Vector Machine,
K-Nearest Neighbors, Decision Tree, and Random Forest, and a comparative study is carried
out to find the most efficient algorithm. From this study it is found that Random Forest provides
the highest accuracy, 90.16%, in comparison to the other algorithms.

CHAPTER – 1
INTRODUCTION
The heart plays an important role in the human body by pumping blood, supplying blood
to all the parts of the body, and purifying blood. When the heart does not get the required amount
of blood, the result is heart failure and death. India also has a very high rate of death due to
heart diseases. Accurate and timely diagnosis of heart disease is necessary to improve the
security of the heart and of life. Heart diseases include rheumatic heart disease, valvular heart
disease, hypertensive heart disease, and cerebrovascular heart disease. Symptoms include chest
pain or discomfort, shortness of breath, fainting, swelling of the legs, abdomen, or areas around
the eyes, and tiring easily during exercise and activity. Machine learning algorithms and
techniques help us to predict and diagnose different heart diseases and help doctors to avoid
sudden death in such cases.

A huge number of patient details are collected and interpreted to predict the occurrence of the
disease. In this project, four algorithms have been used: Support Vector Machine, K-Nearest
Neighbors, Decision Tree, and Random Forest. The objective of this project is to compare the
accuracy of these four machine learning algorithms and conclude with the best algorithm
among them for heart disease prediction. It is very difficult to diagnose heart disease in
advance or at early stages even when symptoms are noticed, and commonly all of the
symptoms do not occur all of a sudden. So it is very important to diagnose heart disease in
time, which avoids death or reduces the mortality rate. Since diagnosing heart disease is very
expensive, the majority of people do not visit the doctor for consultation, for various reasons.
In this project, the prediction problem is based on the given information about each individual.

In order to achieve the objective, machine learning algorithms are used at different levels of
evaluation. Although machine learning algorithms are commonly used, heart disease prediction
is a vital task demanding the highest possible accuracy. Hence, these algorithms are evaluated
at numerous levels and with several evaluation strategies. The main contributions of this project
include extraction of classification accuracy useful for heart disease prediction, comparison of
different machine learning algorithms, and identification of the best performance-based
algorithm for heart disease prediction.

CHAPTER – 2
SYSTEM SPECIFICATION

2.1 HARDWARE SPECIFICATION

Processor : I5

Hard Disk : 500 GB

RAM : 4GB

Board : Mercury

Key Board : Logitech Key board

Monitor : 21 inch Samsung

Mouse : Logitech

2.2 SOFTWARE SPECIFICATION

• Operating system : Windows 7

• Coding Language : Python

• Data Set : CSV

2.3 PROBLEM STATEMENT


The major challenge in heart disease is its detection. There are instruments available
that can predict heart disease, but they are either expensive or not efficient at calculating the
chance of heart disease in a human. Early detection of cardiac diseases can decrease the mortality
rate and overall complications. However, it is not possible to monitor patients accurately every
day in all cases, and round-the-clock consultation of a patient by a doctor is not available, since
it requires considerable time and expertise. Since a good amount of data is available in today's
world, various machine learning algorithms can be used to analyze the data for hidden patterns.
The hidden patterns can be used for health diagnosis in medical data.

2.3.1 Aim and objective of the study
The main objective of the study is to detect and assess the classification of heart disease
prediction using machine learning.
The specific objectives of the study are:
• To check whether a patient is likely to be diagnosed with any cardiovascular heart
disease based on medical attributes such as gender, age, chest pain, fasting sugar
level, etc.
• To detect heart disease early using a machine learning system.

2.3.2 Dataset

Machine learning heavily depends on data. It is the most crucial aspect that makes
algorithm training possible, so this section describes the required dataset.
There are many popular open sources for collecting data, e.g. Kaggle.com, the UCI repository,
etc.
In this project the heart disease.csv dataset is used. This data is downloaded from
Kaggle.com.

2.3.3 Software used


Python is a widely used high-level programming language. To write and execute code
in Python, first install Python on your system. pip is the package-management system written
in Python and is used to install and manage software packages. Most distributions of Python
come with pip preinstalled; Python 2.7.9 and later (on the Python 2 series) and Python 3.4 and
later include pip (pip3 for Python 3) by default.
cURL is a tool for transferring data to and from a server. pip can be bootstrapped with: curl
https://bootstrap.pypa.io/get-pip.py -o get-pip.py and then python get-pip.py
Jupyter Notebooks are a spin-off project from the IPython project, which used to
have an IPython Notebook project itself. The name Jupyter comes from the core
programming languages that it supports: Julia, Python, and R. Jupyter ships with the
IPython kernel, which allows you to write your programs in Python, but there are currently
over 100 other kernels that you can also use. It is an open-source web application that allows us
to create and share documents which contain live code, equations, visualizations, and
narrative text. It can be used for data cleaning and transformation, numerical simulation,
statistical modeling, data visualization, and machine learning.

CHAPTER – 3
PROBLEM DESCRIPTION

PROJECT PLANNING STEPS


3.1 Data Collection
3.2 Data Preprocessing
3.3 Exploratory Data Analysis
3.4 Split the dataset into training and testing
3.5 Model Building
3.6 Evaluation through Performance Metrics

3.1 DATA COLLECTION


3.1.1 Dataset Information
Heart disease data was collected from the UCI machine learning repository. There are
four databases; the Cleveland database was selected for this research because it is a commonly
used database among ML researchers, with comprehensive and complete records. The dataset
contains 303 records. Although the Cleveland dataset has 76 attributes, the data set provided in
the repository furnishes information for a subset of only 14 attributes. The data source of the
Cleveland dataset is the Cleveland Clinic Foundation. The table below depicts the description
and type of attributes. There are 13 attributes that feature in the prediction of heart disease, and
only one attribute serves as the output, or predicted, attribute for the presence of heart disease
in a patient. The Cleveland dataset contains an attribute named num to show the diagnosis of
heart disease in patients on a scale from 0 to 4. In this scenario, 0 represents the absence
of heart disease and the values from 1 to 4 represent patients with heart disease, where the
scaling refers to the severity of the disease (4 being the highest). Figure 1 shows the distribution
of the num attribute among the 303 records.

Dataset : Heart disease prediction Dataset


Number of Instances : 303 (Records)
Number of Attributes : 14 (Attributes)
Associated Tasks : Classification
Missing Attribute values : None

3.1.2 Load and Read the Dataset
The dataset format might be .csv, Excel, .txt, .json, etc.; here the dataset is read with
the help of pandas.
pandas has a function called read_csv() to read the dataset. As a parameter to load the
dataset, give the directory of the csv file.
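A minimal sketch (the file name heart.csv, used later in the appendix, is assumed to sit in the working directory; substitute the actual path if it differs):

import pandas as pd

# Load the heart disease dataset (path assumed)
dataset = pd.read_csv("heart.csv")
print(dataset.shape)    # (rows, columns) -- expected (303, 14)
print(dataset.head())   # preview the first five records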
3.2 DATA PREPROCESSING
Data pre-processing is an important step in the creation of a machine learning model.
Initially, data may not be clean or in the required format for the model, which can cause
misleading outcomes. In pre-processing, data is transformed into the required format.
Pre-processing is used to deal with noise, duplicates, and missing values in the dataset.
It includes activities like importing datasets, splitting datasets, attribute scaling, etc.
Preprocessing of data is required for improving the accuracy of the model. The dataset
obtained is not completely accurate and error-free, hence the following operations are first
carried out on it.

Data Balancing: Data balancing is crucial for accurate results, and the data balancing graph
shows that both target classes are roughly equal, with "0" denoting patients without heart
disease and "1" denoting patients with heart disease.
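As a quick check of this balance, the class distribution can be printed directly (a sketch assuming the label column is named target, as in this dataset):

# Count how many records fall into each target class
print(dataset["target"].value_counts())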

Data Cleaning: NA values in the dataset are a major setback, as they reduce the accuracy of
the prediction profoundly. For fields which do not have values, the missing entries are
substituted with the mean value of the column. This way, all the NA values are removed from
the data set.

Feature Scaling: Since the range of values of raw data varies widely, in some machine learning
algorithms objective functions will not work properly without feature scaling. For example, the
majority of classifiers calculate the distance between two points by the Euclidean distance. If one
of the features has a broad range of values, the distance will be governed by this particular
feature. E.g. sex has just two values, 0 and 1, while cholesterol has high values like 100. So, in
order to bring them closer to each other, they need to be scaled.

Factorization: In this process a meaning is assigned to coded values so that the algorithm does
not confuse them. For example, assigning meaning to 0 and 1 in the sex column so that the
algorithm does not consider 1 as greater than 0 in that column.
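One way to realize this in pandas is one-hot encoding (an illustrative sketch, not necessarily the exact method used in this report; the chest pain column cp is chosen here as the example):

import pandas as pd

# One-hot encode chest pain type so code 3 is not treated as "greater than" code 1
dataset = pd.get_dummies(dataset, columns=["cp"], prefix="cp")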

3.2.1 Importing The Libraries
In Python, a library is a collection of modules that are used repeatedly in various
programs without the need to write them from scratch.
Modules, on the other hand, are Python files saved with the .py extension.
Modules often contain code such as functions, classes, and statements that can be imported
and used within other programs.
Import the necessary libraries shown below (see the sketch after the descriptions):
• Seaborn
• Numpy
• Pandas
• Matplotlib.pyplot
Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and
integrates closely with pandas data structures. Seaborn helps you to explore and understand
your data.
Numpy
NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds
powerful data structures to Python that guarantee efficient calculation with arrays and matrices,
and it supplies an enormous library of high-level mathematical functions that operate on these
arrays and matrices.
Pandas
Pandas is a python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive.
Matplotlib.pyplot
Matplotlib is a cross-platform data visualization and graphical plotting library for Python and
its numerical extension NumPy.
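The imports, under their conventional aliases, look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns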
3.3 EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is a pre-processing step to understand the data. There are
numerous methods and steps for performing EDA; however, most of them are specific, focusing
on either visualization or distribution, and are incomplete. Therefore, this chapter walks through,
step by step, how to understand, explore, and extract information from the data to answer
questions or test assumptions. There are no fixed structured steps or methods to follow; however,
this project will provide an insight into EDA. Initially, the dataset contains 76 features or
attributes from 303 patients; however, published studies chose only 14 features that are relevant
in predicting heart disease. Hence, the dataset used here consists of 303 patients with a set of
14 features.
3.3.1 Descriptive Statistics
Descriptive analysis studies the basic features of the data with statistical processes. pandas
has a worthy function called describe. With this describe function we can understand the
unique, top, and most frequent values of categorical features, and find the mean, std, min, max,
and percentile values of continuous features.
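For example:

print(dataset.describe())   # mean, std, min, max and percentiles per numeric column
dataset.info()              # column types and non-null counts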
3.3.2 Visual Analysis
Visual analysis is the process of using visual representations, such as charts, plots, and graphs,
to explore and understand data. It is a way to quickly identify patterns, trends, and outliers
in the data, which can help to gain insights and make informed decisions. Histograms are used
for both univariate and bivariate analysis.
3.3.2.1 Univariate analysis:
Univariate analysis is the simplest form of analysis, where a single variable is explored.
Univariate analysis is performed to describe the data in a better way. Univariate analysis of
numerical and categorical variables is performed differently, because they use different plots.
3.3.2.2 Bivariate and Multivariate analysis:
Various plots can explore a single categorical or numerical variable. Bivariate analysis is used
to explore the relationship between two different variables, and this matters because, in the
end, the main task is to explore relationships between variables to build a powerful model.
When more than two variables are analyzed together, it is known as multivariate analysis.
Different plots are used for bivariate as well as multivariate analysis.
3.3.2.3 Histogram:
A histogram is a value distribution plot for numerical columns. It creates bins over various
ranges of values and plots them so we can visualize how the values are distributed, and see
where more values lie: toward the positive side, the negative side, or at the center (mean).
Let's have a look at the age column (a sketch follows).
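A minimal sketch of the age histogram:

import matplotlib.pyplot as plt

# Distribution of patient ages, grouped into 20 bins
dataset["age"].hist(bins=20)
plt.xlabel("Age")
plt.ylabel("Number of patients")
plt.show()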

3.4 SPLIT THE DATASET INTO TRAINING AND TESTING
3.4.1 Scaling the Data
It is important to scale the data so the ML algorithms do not overfit to the wrong features.
Using MinMaxScaler(), the values are scaled per feature, based on each feature's minimum and
maximum, to lie between 0 and 1. This keeps the information from being lost but allows the ML
algorithms to correctly train with the data.
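A sketch with scikit-learn's MinMaxScaler, assuming the features have already been split into X_train and X_test as in the next subsection (the appendix code trains on unscaled features, so this step is illustrative):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                    # scales each feature to the [0, 1] range
X_train = scaler.fit_transform(X_train)    # learn min/max from the training data only
X_test = scaler.transform(X_test)          # apply the same scaling to the test data

Fitting the scaler on the training data only, and reusing it on the test data, avoids leaking information from the test set into training.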
3.4.2 Dataset Splitting
The data is normally split into two subsets: training data and testing data (and sometimes into
three: train, validate, and test). A dataset is split into train and test sets to evaluate how well the
machine learning model performs. The training dataset is the actual data set used to fit and
train the model, and the statistics of the train set are known. The test data set is the sample of
data used to evaluate predictions.
The train-test split is used to estimate the performance of machine learning algorithms on
prediction-based applications. This method is a fast and easy procedure, such that users can
compare their own machine learning model results against machine results. In this project the
data was split into 80% training (242 records) and 20% testing (61 records). This is a general
rule of thumb for splitting data to train ML algorithms with.
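The split mirrors the code in the appendix:

from sklearn.model_selection import train_test_split

predictors = dataset.drop("target", axis=1)   # the 13 input features
target = dataset["target"]                    # the binary output label

# 80% train / 20% test, with a fixed seed so the split is reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    predictors, target, test_size=0.20, random_state=0)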
3.5 MODEL BUILDING
3.5.1 Machine learning Algorithm
Machine learning algorithms are programs that can learn hidden patterns from data, predict
outputs, and improve their performance from experience on their own. Different algorithms
can be used in machine learning for different tasks, e.g. prediction problems like stock market
prediction, and classification problems.

Types of Machine Learning Algorithms

Machine learning algorithms can be broadly classified into three types:

1. Supervised Learning Algorithm


2. Unsupervised Learning Algorithm
3. Reinforcement Learning Algorithm

1. Supervised Learning Algorithm

Supervised learning is a type of machine learning in which the machine needs external
supervision to learn. Supervised learning models are trained using labeled datasets. Once
training and processing are done, the model is tested with sample test data to check
whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, the same as when a student learns under a teacher's
supervision. An example of supervised learning is spam filtering. Supervised learning
can be divided further into two categories of problem:

• Classification

• Regression

2. Unsupervised Learning Algorithm

Unsupervised learning is a type of machine learning in which the machine does not need
any external supervision to learn from the data, hence the name unsupervised learning.
Unsupervised models are trained using an unlabelled dataset that is neither classified nor
categorized, and the algorithm needs to act on the data without any supervision. In unsupervised
learning, the model does not have a predefined output; it tries to find useful insights from a
huge amount of data. These algorithms are used to solve association and clustering problems.
Hence, it can be classified further into two types:

• Clustering

• Association

Examples of unsupervised learning algorithms are K-means clustering, the Apriori
algorithm, Eclat, etc.

3. Reinforcement Learning Algorithm

Reinforcement learning is a feedback-based machine learning technique in which an
agent learns to behave in an environment by performing actions and seeing the results of those
actions. For each good action, the agent gets positive feedback, and for each bad action, the
agent gets negative feedback or a penalty.

In reinforcement learning, the agent learns automatically using feedback, without any
labeled data, unlike supervised learning. "Reinforcement learning is a type of machine learning
method where an intelligent agent (computer program) interacts with the environment and learns
to act within it." It is a core part of artificial intelligence, and all AI agents work on the
concept of reinforcement learning.

3.5.2 Methodologies used in this model

This model deals with binary classification of labeled data, and the algorithms were
chosen based on that fact.
The methodologies used in this model are mentioned below:
1. K Nearest Neighbors classifier
2. Support vector classifier
3. Decision tree classifier
4. Random forest classifier
3.5.2.1 K Nearest Neighbors classifier
The KNN algorithm is a simple but extremely powerful classification algorithm. The
name of the algorithm originates from the underlying philosophy of KNN, i.e. people having
a similar background or mindset tend to stay close to each other. In the KNN algorithm, the
unknown and unlabelled data that comes in for a prediction problem is judged against the
training data set elements most similar to it, and a class label is assigned on the basis of the
class labels of those similar training data set elements.

Algorithm:

Input: Training data set, test data set, value of 'k'

Steps:

Do for all test data points

Calculate the distance of the test data point from the different training data points

Find the closest 'k' training data points

If k = 1

Then assign the class label of the training data point to the test data point

Else

Whichever class label is predominantly present in the 'k' training data points, assign that
class label to the test data point

End do

Applying KNN Algorithm:

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)   # classify by the 7 nearest neighbours

knn.fit(X_train, Y_train)

Y_pred_knn = knn.predict(X_test)

Y_pred_knn.shape

score_knn = round(accuracy_score(Y_pred_knn, Y_test)*100, 2)

print("The accuracy score achieved using KNN is: " + str(score_knn) + " %")

Accuracy:

The accuracy score achieved using KNN is: 67.21 %

3.5.2.2 Support Vector Machine

Support Vector Machine, or SVM, is one of the most popular supervised learning
algorithms, used for classification as well as regression problems. Primarily, however,
it is used for classification problems in machine learning. The goal of the SVM algorithm is to
create the best line or decision boundary that can segregate n-dimensional space into classes, so
that a new data point can easily be put in the correct category in the future. This best decision
boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating
the hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed Support Vector Machine. SVMs are powerful yet flexible supervised machine learning
algorithms, used for both classification and regression, but generally they are used in
classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s.
SVMs have a unique way of implementation compared to other machine learning algorithms.
Lately, they have become extremely popular because of their ability to handle multiple
continuous and categorical variables.

Given a set of training examples, each marked as belonging to one of two categories,
an SVM training algorithm builds a model that assigns new examples to one category or the
other, making it a non-probabilistic binary linear classifier. An SVM model is a representation
of the examples as points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible. New examples are then mapped into that same
space and predicted to belong to a category based on which side of the gap they fall. The points
are separated by the hyperplane that lies between them. When data are not labeled, supervised
learning is not possible, and an unsupervised learning approach is required, which attempts to
find a natural clustering of the data into groups and then maps new data to these formed groups.
In this project, the algorithm is used to classify the patients into groups according to the risk
posed to them, based on the parameters provided.

The advantages of support vector machines are:

• Effective in high dimensional spaces.

• Still effective in cases where the number of dimensions is greater than the number of
samples.

• Uses a subset of training points in the decision function (called support vectors), so it is
also memory efficient.

• Versatile: different kernel functions can be specified for the decision function. Common
kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

• If the number of features is much greater than the number of samples, avoiding
over-fitting in the choice of kernel functions and regularization term is crucial.

• SVMs do not directly provide probability estimates; these are calculated using an
expensive five-fold cross-validation.

Algorithm:

Input: Heart disease input features; assign training and testing datasets for heart disease

Output: Classification of heart disease

Function: Support_Vector_Machine(Input features F, Label vector V = [1...n])

Steps:

Decide on the optimal cost and gamma values.

Perform while (condition holds):

For the set number of features in the input file, run the training step.

Run the classification step for the set number of features in the input file.

End while; submit the heart disease classification results.

Applying Support vector classifier:

from sklearn import svm

sv = svm.SVC(kernel='linear')   # linear-kernel SVM

sv.fit(X_train, Y_train)

Y_pred_svm = sv.predict(X_test)

Y_pred_svm.shape   # (61,)

score_svm = round(accuracy_score(Y_pred_svm, Y_test)*100, 2)

print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")

Accuracy:

The accuracy score achieved using Linear SVM is: 81.97 %

3.5.2.3 Decision tree classifier:

Decision tree learning is one of the most widely adopted algorithms for classification. As
the name indicates, it builds a model in the form of a tree structure. Its classification accuracy
is competitive with other strategies, and it is exceptionally efficient. A decision tree is used for
multi-dimensional analysis with multiple classes. It is characterized by fast execution time and
ease of interpretation of the rules. The goal of decision tree learning is to create a model that
predicts the value of the output variable based on the input variables in the feature vector. Each
node of a decision tree corresponds to one of the features in the feature vector. From every node
there are edges to children, one edge for each of the possible values of the feature associated
with the node. The tree terminates at leaf nodes, where each leaf node represents a possible
value of the output variable. The output variable is determined by following a path that starts
at the root and is guided by the values of the input variables.

Algorithm:

Input: Training data set, test data set

Steps:

Do for all attributes

Calculate the entropy Ei of the attribute Fi

if Ei < Emin

then Emin = Ei and Fmin = Fi

end if

End do

Split the data set into subsets using the attribute Fmin

Draw a decision tree node containing the attribute Fmin and split the data set into subsets

Repeat the above steps until the full tree is drawn covering all the attributes of the original table.

Applying Decision tree classifier:

from sklearn.tree import DecisionTreeClassifier

max_accuracy = 0

# Try 200 random seeds and keep the one that gives the best test accuracy
for x in range(200):
    dt = DecisionTreeClassifier(random_state=x)
    dt.fit(X_train, Y_train)
    Y_pred_dt = dt.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_dt, Y_test)*100, 2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x

# Refit with the best seed found
dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train, Y_train)
Y_pred_dt = dt.predict(X_test)
print(Y_pred_dt.shape)   # (61,)

score_dt = round(accuracy_score(Y_pred_dt, Y_test)*100, 2)

print("The accuracy score achieved using Decision Tree is: " + str(score_dt) + " %")

Accuracy:

The accuracy score achieved using Decision Tree is: 81.97 %

3.5.2.4 Random forest classifier:

Random forest is an ensemble classifier, i.e. a combining classifier that uses and
combines many decision tree classifiers. Ensembling is usually done using the concept of
bagging with different feature sets. The reason for using a large number of trees in a random
forest is to train the trees enough that the contribution from each feature appears in a number
of models. After the random forest is generated by combining the trees, a majority vote is
applied to combine the output of the different trees. The result from the ensemble model is
usually better than that from the individual decision tree models.

The random forest algorithm works as follows:

a. If there are N variables or features in the input data set, select a subset of 'm' (m < N)
features at random out of the N features. Also, the observations or data instances should
be picked randomly.
b. Use the best split principle on these 'm' features to calculate the number of nodes 'd'.
c. Keep splitting the nodes into child nodes till the tree is grown to the maximum possible
extent.
d. Select a different subset of the training data 'with replacement' to train another decision
tree following steps (a) to (c). Repeat this to build and train 'n' decision trees.
e. Final class assignment is done on the basis of the majority votes from the 'n' trees.

Advantages:

● Random Forest is capable of performing both Classification and Regression tasks.

● It is capable of handling large datasets with high dimensionality.

● It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages:

Although Random Forest can be used for both classification and regression tasks, it is less
suitable for regression tasks.

Applying the random forest algorithm:

from sklearn.ensemble import RandomForestClassifier

max_accuracy = 0

# Try 2000 random seeds and keep the one that gives the best test accuracy
for x in range(2000):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train, Y_train)
    Y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_rf, Y_test)*100, 2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x

# Refit with the best seed found
rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape   # (61,)

score_rf = round(accuracy_score(Y_pred_rf, Y_test)*100, 2)

print("The accuracy score achieved using Random Forest is: " + str(score_rf) + " %")

Accuracy:

The accuracy score achieved using Random Forest is: 90.16 %

3.6 EVALUATION THROUGH PERFORMANCE METRICS

In this project, machine learning algorithms like KNN, SVM, Decision Tree, and
Random Forest are used to predict heart disease. The Heart Disease UCI dataset has a total of 76
attributes, out of which only 14 are considered for the prediction of heart disease.
Various attributes of the patient like gender, chest pain type, fasting blood sugar, serum
cholesterol, exang, etc. are considered for this project. The accuracy of each individual
algorithm has to be measured, and whichever algorithm gives the best accuracy is considered
for heart disease prediction. For evaluating the experiment, various evaluation metrics like
accuracy, confusion matrix, precision, recall, and F1 score are considered. Accuracy is the ratio
of the number of correct predictions to the total number of inputs in the dataset.

3.6.1 Confusion Matrix:

The confusion matrix is one of the most intuitive and easiest metrics used for finding the
correctness and accuracy of the model. It is used for classification problems where the output
can be of two or more classes, which makes it perfect for this project. The table or matrix
layout helps to visualize the performance of an algorithm.

Terms Associated with Confusion Matrix:

1. True Positives (TP): True positives are the cases when the actual class of the data point was
True (1) and the predicted class is also True (1). Ex: the case where a person actually has a
malignant (1) tumor and the model classifies his case as malignant (1) comes under True
Positives.

2. True Negatives (TN): True negatives are the cases when the actual class of the data point was
False (0) and the predicted class is also False (0). Ex: the case where a person has a benign (0)
tumor and the model classifies his case as benign (0) comes under True Negatives.

3. False Positives (FP): False positives are the cases when the actual class of the data point was
False (0) and the predicted class is True (1). False because the model has predicted incorrectly,
and positive because the class predicted was the positive one (1). Ex: a person having a benign
(0) tumor whose case the model classifies as malignant (1) comes under False Positives.

4. False Negatives (FN): False negatives are the cases when the actual class of the data point was
True (1) and the predicted class is False (0). False because the model has predicted incorrectly,
and negative because the class predicted was the negative one (0). Ex: a person having a
malignant (1) tumor whose case the model classifies as benign (0) comes under False Negatives.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Table Format for Confusion Matrix

                              Predicted Values
                              0                       1

Actual Values    0     True Negative (TN)      False Positive (FP)

                 1     False Negative (FN)     True Positive (TP)
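A sketch of extracting these four counts with scikit-learn, assuming the test labels Y_test and a model's predictions (here the random forest's Y_pred_rf) are available:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, Y_pred_rf)   # rows = actual class, columns = predicted class
tn, fp, fn, tp = cm.ravel()                # unpack the four cells of the binary matrix
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("Accuracy:", (tp + tn) / (tp + tn + fp + fn))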

3.6.2 Accuracy

Accuracy is one of the metrics for evaluating classification models. Informally, accuracy is
the fraction of predictions the model got right. Formally, accuracy has the following
definition:

                 Number of Correct Predictions
Accuracy = -------------------------------------------
                 Total Number of Predictions

For binary classification, accuracy can be calculated in terms of positives and negatives as
follows:

                  TP + TN
Accuracy = --------------------------
             TP + FP + TN + FN

3.6.3 Precision
Precision is the ratio of True Positives to the sum of True Positives and False Positives.
Ex: precision is the proportion of patients diagnosed as having a malignant tumor who actually
had a malignant tumor. The predicted positives (people predicted as having a malignant tumor,
i.e. TP and FP) form the denominator, and the people among them who actually have a
malignant tumor are the TP.

                  TP
Precision = -------------
               TP + FP

3.6.4 Recall

Recall is a measure of the proportion of patients who actually had a malignant tumor
that the algorithm diagnosed as having a malignant tumor. The actual positives (people having
a malignant tumor, i.e. TP and FN) form the denominator, and the people the model correctly
diagnoses as having a malignant tumor are the TP. Therefore, to focus on minimizing False
Negatives, Recall should be as close to 100% as possible.

               TP
Recall = --------------

            TP + FN

3.6.5 F1 Score

The F1 Score combines the Precision and Recall scores of a model: it is the harmonic mean
of precision and recall. The range of the F1 Score is [0, 1]. It shows how precise the classifier
is and, at the same time, how robust it is.

              2 * Precision * Recall
F1 Score = ----------------------------

               Precision + Recall
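All three per-class metrics can be printed at once with scikit-learn's classification_report, as is done in the appendix (a sketch, again assuming Y_test and Y_pred_rf):

from sklearn.metrics import classification_report

# Per-class precision, recall, F1 score and support in one table
print(classification_report(Y_test, Y_pred_rf))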

CHAPTER-4

SYSTEM ANALYSIS

4.1 EXISTING SYSTEM

In the existing system there might be redundant variables, which must be handled. A heuristic
search method is used in mRMR to select optimum features that have maximum relevance and
minimum redundancy. It checks one feature per cycle and computes pairwise redundancy.
mRMR does not take care of the joint association of features.

4.2 PROPOSED SYSTEM

Heart disease prediction is a web-based machine learning application, trained on a UCI
dataset. The user inputs their specific medical details to get a prediction of heart disease for that
user. The algorithm calculates the probability of the presence of heart disease, and the result is
displayed on the webpage itself, thus minimizing the cost and time required to predict the
disease. The format of the data plays a crucial part in this application: at the time of uploading,
the application checks that the user data is in the proper file format, and if it is not as required,
an error dialog box is prompted. The main application is a web application which accepts the
various parameters from the user as input and computes the result. The result is displayed along
with the accuracy of the prediction, using attributes such as age, gender, resting blood pressure,
cholesterol, fasting blood sugar, old peak, etc.

Advantages of the Proposed System

• Easy to use
• No human intervention required
• Not only detects the heart disease type but also suggests precautions
• Efficient use of available annotated data samples

4.3 COMPARATIVE ANALYSIS OF MACHINE LEARNING ALGORITHMS

Support vector machine:

The accuracy of the support vector machine was 81.97%. Along with accuracy, the performance
metrics Precision, Recall, and F1 score were (0=83%, 1=81%), (0=74%, 1=88%), and
(0=78%, 1=78%) respectively.

K Nearest Neighbors:

The accuracy of K Nearest Neighbors was 67.21%. Along with accuracy, the performance
metrics Precision, Recall, and F1 score were (0=62%, 1=72%), (0=67%, 1=68%), and
(0=64%, 1=70%) respectively.

Decision tree classifier:

The accuracy of the Decision tree classifier was 81.97%. Along with accuracy, the performance
metrics Precision, Recall, and F1 score were (0=79%, 1=85%), (0=81%, 1=82%), and
(0=80%, 1=84%) respectively.

Random forest classifier:

The accuracy of the Random forest was 90.16%. Along with accuracy, the performance metrics
Precision, Recall, and F1 score were (0=92%, 1=89%), (0=85%, 1=94%), and (0=88%, 1=91%)
respectively.

Comparison of Scores of Various Models

After completing the implementation of all four models for detecting heart disease from the
dataset, the results can be compared in the tables below.

MODELS                        ACCURACY

Support vector machine        81.97 %

K Nearest Neighbors           67.21 %

Decision tree classifier      81.97 %

Random forest classifier      90.16 %

Table 4.1

Precision

MODELS                        0        1

Support vector machine        83%      81%

K Nearest Neighbors           62%      72%

Decision tree classifier      79%      85%

Random forest classifier      92%      89%

Table 4.2

Recall

MODELS                        0        1

Support vector machine        74%      88%

K Nearest Neighbors           67%      68%

Decision tree classifier      81%      82%

Random forest classifier      85%      94%

Table 4.3

F1 Score

MODELS                        0        1

Support vector machine        78%      78%

K Nearest Neighbors           64%      70%

Decision tree classifier      80%      84%

Random forest classifier      88%      91%

Table 4.4

CHAPTER-5

CONCLUSION

Identifying and processing raw healthcare data on heart information can help in the long-term
saving of human lives and early detection of abnormalities in heart conditions. Machine learning
techniques were used in this work to process raw data and provide a new and novel discernment
of heart disease. Heart disease prediction is challenging and very important in the medical
field. However, the mortality rate can be drastically controlled if the disease is detected at an
early stage and preventative measures are adopted as soon as possible. Further extension of this
study is highly desirable, directing the investigations to real-world datasets instead of just
theoretical approaches and simulations. This project has studied various classification
algorithms that can be used for classification of heart disease databases, along with different
techniques that can be used for classification and the accuracy obtained by them. This
investigation shows the different technologies used in different papers, with different numbers
of attributes and different accuracies, depending on the tools used for execution.

With the increasing number of deaths due to heart diseases, it has become mandatory to develop
a system to predict heart diseases effectively and accurately. The motivation for the study was to
find the most efficient ML algorithm for detection of heart diseases. This study compares the
accuracy scores of the Support Vector Machine, K Nearest Neighbors, Decision Tree, and
Random Forest algorithms for predicting heart disease using a UCI machine learning repository
dataset. The result of this study indicates that the Random Forest algorithm is the most efficient,
with an accuracy score of 90.16% for prediction of heart disease. In future, the work can be
enhanced by developing a web application based on the Random Forest algorithm, and by using
a larger dataset than the one used in this analysis, which will help to provide better results and
help health professionals predict heart disease effectively and efficiently.

CHAPTER-6

FUTURE ENHANCEMENT

In future, the structure can be further upgraded by creating various combinations of data
mining techniques and by parameter tuning. A prototype heart disease prediction system could
be developed using three data mining classification modeling techniques. Such a system extracts
hidden knowledge from a historical heart disease database. DMX query language and functions
are used to build and access the models. The models are trained and validated against a test
dataset. Lift chart and classification matrix methods are used to evaluate the effectiveness of the
models. All three models are able to extract patterns in response to the predictable state. The
most effective model to predict patients with heart disease appears to be Naïve Bayes, followed
by Neural Network and Decision Trees. Five mining goals are defined based on business
intelligence and data exploration, and the goals are evaluated against the trained models. All
three models can answer complex queries, each with its own strength with respect to ease of
model interpretation, access to detailed information, and accuracy. Naïve Bayes could answer
four out of the five goals; Decision Trees, three; and Neural Network, two. Although not the
most effective model, Decision Trees results are easier to read and interpret. The drill-through
feature to access detailed patients' profiles is only available in Decision Trees. Naïve Bayes
fared better than Decision Trees as it could identify all the significant medical predictors. The
relationships between attributes produced by Neural Network are more difficult to understand.

CHAPTER-7

BIBLIOGRAPHY

[1] V. Krishnaiah, G. Narasimha, N. Subhash Chandra, "Heart Disease Prediction System using
Data Mining Techniques and Intelligent Fuzzy Approach: A Review", IJCA, 2016.

[2] K. Sudhakar, Dr. M. Manimekalai, "Study of Heart Disease Prediction using Data Mining",
IJARCSSE, 2016.

[3] Naganna Chetty, Kunwar Singh Vaisla, Nagamma Patil, "An Improved Method for Disease
Prediction using Fuzzy Approach", ACCE, 2015.

[4] Vikas Chaurasia, Saurabh Pal, "Early Prediction of Heart Disease using Data Mining
Techniques", Caribbean Journal of Science and Technology, 2013.

[5] Shusaku Tsumoto, "Problems with Mining Medical Data", 0-7695-0792-1/00 © 2000 IEEE.

[6] Y. Alp Aslandogan et al., "Evidence Combination in Medical Data Mining", Proceedings of
the International Conference on Information Technology: Coding and Computing (ITCC'04),
0-7695-2108-8/04 © 2004 IEEE.

[7] Carlos Ordonez, "Improving Heart Disease Prediction Using Constrained Association Rules",
Seminar Presentation at University of Tokyo, 2004.

[8] Franck Le Duff, Cristian Munteanu, Marc Cuggia, Philippe Mabo, "Predicting Survival
Causes After Out of Hospital Cardiac Arrest using Data Mining Method", Studies in Health
Technology and Informatics, Vol. 107, No. Pt 2, pp. 1256-1259, 2004.

[9] Boleslaw Szymanski, Long Han, Mark Embrechts, Alexander Ross, Karsten Sternickel,
Lijuan Zhu, "Using Efficient Supanova Kernel for Heart Disease Diagnosis", Proc. ANNIE 06,
Intelligent Engineering Systems through Artificial Neural Networks, Vol. 16, pp. 305-310, 2006.

[10] Kiyong Noh, HeonGyu Lee, Ho-Sun Shon, Bum Ju Lee, Keun Ho Ryu, "Associative
Classification Approach for Diagnosing Cardiovascular Disease", Springer, 2006, Vol. 345,
pp. 721-727.

Dataset: https://www.kaggle.com/datasets/jillanisofttech/heart-disease-prediction-using-differenttechniques

CHAPTER-8
APPENDIX

1. SAMPLE CODING

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report

%matplotlib inline

import os
print(os.listdir())

import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv("heart.csv")
type(dataset)
dataset.shape
dataset.head(5)
dataset.sample(5)
dataset.describe()
dataset.info()
# Luckily, we have no missing values
info = ["age","1: male, 0: female","chest pain type, 1: typical angina, 2: atypical angina, 3: non-
anginal pain, 4: asymptomatic","resting blood pressure"," serum cholestoral in mg/dl","fasting
blood sugar > 120 mg/dl","resting electrocardiographic results (values 0,1,2)"," maximum heart
rate achieved","exercise induced angina","oldpeak = ST depression induced by exercise relative
to rest","the slope of the peak exercise ST segment","number of major vessels (0-3) colored by
flourosopy","thal: 3 = normal; 6 = fixed defect; 7 = reversable defect"]

for i in range(len(info)):
print(dataset.columns[i]+":\t\t\t"+info[i])
dataset["target"].describe()
dataset["target"].unique()
print(dataset.corr()["target"].abs().sort_values(ascending=False))
# This shows that most columns are moderately correlated with target, but 'fbs' is very weakly correlated
y = dataset["target"]

sns.countplot(y)

target_temp = dataset.target.value_counts()

print(target_temp)
print("Percentage of patience without heart problems: "+str(round(target_temp[0]*100/303,2)))
print("Percentage of patience with heart problems: "+str(round(target_temp[1]*100/303,2)))
dataset["sex"].unique()
sns.barplot(dataset["sex"],y)
dataset["cp"].unique()
sns.barplot(dataset["cp"],y)
dataset["fbs"].describe()
dataset["fbs"].unique()
sns.barplot(dataset["fbs"],y)
dataset["restecg"].unique()
sns.barplot(dataset["restecg"],y)
dataset["exang"].unique()
sns.barplot(dataset["exang"],y)
dataset["slope"].unique()
sns.barplot(dataset["slope"],y)
dataset["ca"].unique()
sns.countplot(dataset["ca"])

sns.barplot(dataset["ca"],y)
dataset["thal"].unique()
sns.barplot(dataset["thal"],y)
sns.distplot(dataset["thal"])
from sklearn.model_selection import train_test_split

predictors = dataset.drop("target",axis=1)
target = dataset["target"]

X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
X_train.shape
X_test.shape
Y_train.shape
Y_test.shape
from sklearn.metrics import accuracy_score
from sklearn import svm

sv = svm.SVC(kernel='linear')

sv.fit(X_train, Y_train)

Y_pred_svm = sv.predict(X_test)
Y_pred_svm.shape
score_svm = round(accuracy_score(Y_pred_svm,Y_test)*100,2)

print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")
Y_pred_svm = sv.predict(X_test)

print("support vector machine report:\n\n",classification_report(Y_test,Y_pred_svm))
import numpy
# Note: this confusion matrix is built from randomly generated labels to
# illustrate the plot, not from the model's actual predictions
actual = numpy.random.binomial(1, 0.9, size=61)
predicted = numpy.random.binomial(1, 0.9, size=61)
from sklearn import metrics
confusion_matrix = metrics.confusion_matrix(actual, predicted)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=confusion_matrix, display_labels=[False, True])
import matplotlib.pyplot as plt
cm_display.plot()
plt.show()
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,Y_train)
Y_pred_knn=knn.predict(X_test)
Y_pred_knn.shape
score_knn = round(accuracy_score(Y_pred_knn,Y_test)*100,2)

print("The accuracy score achieved using KNN is: "+str(score_knn)+" %")


print("k Nearest neighbors report:\n\n",classification_report(Y_test,Y_pred_knn))
import numpy
# Again randomly generated labels, used only to illustrate the plot (see note above)
actual = numpy.random.binomial(1, 0.9, size=61)
predicted = numpy.random.binomial(1, 0.9, size=61)
from sklearn import metrics
confusion_matrix = metrics.confusion_matrix(actual, predicted)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix,
display_labels = [False, True])
import matplotlib.pyplot as plt
cm_display.plot()
plt.show()
from sklearn.tree import DecisionTreeClassifier

max_accuracy = 0

for x in range(200):
    dt = DecisionTreeClassifier(random_state=x)
    dt.fit(X_train,Y_train)
    Y_pred_dt = dt.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
    if(current_accuracy>max_accuracy):
        max_accuracy = current_accuracy
        best_x = x

#print(max_accuracy)
#print(best_x)

dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train,Y_train)
Y_pred_dt = dt.predict(X_test)
print(Y_pred_dt.shape)
score_dt = round(accuracy_score(Y_pred_dt,Y_test)*100,2)

print("The accuracy score achieved using Decision Tree is: "+str(score_dt)+" %")
print("Decision tree report:\n\n",classification_report(Y_test,Y_pred_dt))
from sklearn.ensemble import RandomForestClassifier

max_accuracy = 0

for x in range(2000):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train,Y_train)
    Y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_rf,Y_test)*100,2)
    if(current_accuracy>max_accuracy):
        max_accuracy = current_accuracy
        best_x = x

#print(max_accuracy)
#print(best_x)

rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train,Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape
score_rf = round(accuracy_score(Y_pred_rf,Y_test)*100,2)

print("The accuracy score achieved using random forest Tree is: "+str(score_rf)+" %")
scores = [score_svm,score_knn,score_dt,score_rf,]
algorithms = ["Support Vector Machine","K-Nearest Neighbors","Decision Tree","Random
forest"]

for i in range(len(algorithms)):
    print("The accuracy score achieved using "+algorithms[i]+" is: "+str(scores[i])+" %")

sns.set(rc={'figure.figsize':(15,8)})
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score")

sns.barplot(x=algorithms, y=scores)

2. SCREEN LAYOUT

The results obtained from the project entitled "Comparative Study of Machine Learning
Algorithms for Heart Disease Prediction" are discussed as follows.

Step 1: Import the necessary libraries to implement the project.

Fig 4.1

Step 2: Load the dataset by using the directory

Fig 4.2

Step 2.1: Check the DataFrame for any missing values

Fig 4.3

Fig 4.4

Step 3: Visualize the dataset using histograms

Fig 4.5

Step 4: After the analysis of the dataset, split the data into training and testing

Fig 4.6

Step 4.1: Univariate analysis

Fig 4.7

Step 5: After splitting the dataset into training and testing, implement the machine learning
models.

Fig 4.8

Step 5.1: After implementing all four algorithms, the confusion matrices obtained for each
algorithm are used to evaluate performance.

Fig 4.9 (Support vector machine)

Fig 4.10 (K Nearest Neighbors algorithms)

Fig 4.11 (Decision tree algorithm)

Fig 4.12 (Random forest algorithm)

Fig 4.13 SVM (precision, recall, F1 score)

Fig 4.14 KNN (precision, recall, F1 score)

Fig 4.15 Decision tree (precision, recall, F1 score)

Fig 4.16 Random forest (precision, recall, F1 score)
