Prediction of Diabetes using Machine Learning
All content following this page was uploaded by Yalla Jeevan Nagendra Kumar on 01 June 2023.
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: G5764058719/19©BEIESP
Abdullah A. Aljumah et al [11] (2013) used regression-based machine learning techniques for predictive analysis of diabetes. They employed the Oracle Data Mining software tool. The algorithm used for the analysis was SVM.

Ioannis Kavakiotis et al [12] (2017) implemented the support vector machine (SVM) classification algorithm for their proposed work.

Dr. Y. Jeevan Nagendra Kumar et al [13] (2013) developed a greedy method for minimizing the number of points, with spatial mining used to identify a solution for it.

Nongyao Nai-arun and Rungruttikarn Moungmai [14] (2015) developed a system to determine the occurrence of diabetes. The algorithms they used were DT, ANN, Random Forest, LR and the NB classifier.

Sajida Perveen et al [15] (2016) implemented three classifiers, namely bagging, AdaBoost and the J48 decision tree, for prediction of diabetes. J48 produced better results.

D. Asir Antony Gnana Singh et al [16] (2017) used machine learning algorithms such as naïve Bayes, random forest and MPL for analysis and prediction of diabetes. Better results were obtained with the random forest classification algorithm.

III. METHODOLOGY

Diabetic dataset:
The dataset for prediction of diabetes is drawn from medical reports of patients and was obtained from the Kaggle community. It consists of 768 rows and 9 attributes which contribute to the prediction of diabetes. The attributes chosen from the dataset include Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes_Pedigree_Function, Age, etc.

[Figure: flowchart with the blocks Diabetes database; Data pre-processing; Applying various classification algorithms (KNN, SVM, Logistic Regression, Random Forest, Decision Tree, Gradient Boosting); Performance evaluation of algorithms based on accuracy; Comparative analysis of algorithms; Result]

Box Plot: It is a graphical method for describing numerical data through their quartiles. Such figures have vertical lines extending from the boxes, and the variability outside the upper and lower quartiles is displayed as shown below:

Fig-2: Box Plot

The plot above shows a few points that are outliers, as they lie outside the box containing the other observations.

ii. Multivariate plot:

1) Correlation Matrix Plot:
Correlation is a statistical tool that helps in understanding how two variables are related and the effect of one on the other, that is, whether they are positively or negatively correlated. If both variables move in the same direction, that is, an increase or decrease in one variable increases or decreases the other, they are said to be positively correlated, while if they move in opposite directions they are negatively correlated. If a change in one variable has no effect on the direction of the other, there is no correlation.

The correlation between each pair of attributes is calculated and represented in matrix form, and these values are then plotted to understand the degree of correlation. These techniques help in anticipating how machine learning algorithms will behave. It is important to note that some algorithms, such as linear or logistic regression, can show poor performance if there are highly correlated input variables in the data.
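As a minimal sketch of the two exploratory steps described above, the snippet below flags box-plot outliers and builds a pairwise correlation matrix in plain Python. The 1.5 x IQR fence is the conventional box-plot rule and is an assumption here, since the text does not state which rule was used. The function names and the sample values are hypothetical and are not taken from the paper's dataset.

```python
import math
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional box-plot fence (an assumption, not from the paper).
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical attribute values, for illustration only.
sample = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # moves with x: positive correlation
    "z": [5, 4, 3, 2, 1],    # moves against x: negative correlation
}
matrix = correlation_matrix(sample)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))
```

In practice a data-analysis library would likely be used for both steps; the plain-Python version is only meant to make the calculations explicit.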
International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-8 Issue-7 May, 2019
It is a method for standardizing the range of the independent variables in the data. The data is normalized during the pre-processing stage.

Min-Max Scaler Method:
It is a min-max normalization strategy. This method applies a linear transformation that maps x to y:
$y = \frac{x - m}{M - m}$
where m = min and M = max.

Z-score:
It is a measure of the number of standard deviations a value lies below or above the population mean.
The formula is:
$z = \frac{x - \mu}{\sigma}$

Algorithms: The following algorithms are used in the study, and some salient features of each algorithm are discussed in brief. Further details of each of the algorithms used in our proposed work can be found in the available sources.

1. KNN: The KNN model stores the available cases and classifies new cases based on a measure of similarity to their K nearest neighbours. This algorithm is mainly used for statistical estimation and pattern recognition. It is a non-parametric technique. A case is classified by a majority vote among its neighbours, which are identified by a distance function. The various distance functions available are:

Distance Function:
i. Euclidean
ii. Manhattan
iii. Minkowski

2. LR: This is used in the analysis of a dataset containing independent variables that determine an outcome. The outcome is measured by a variable having only two possible outcomes. The logistic function, in its standard form, is given below:
$f(x) = \frac{1}{1 + e^{-x}}$

$p_i \log_2 p_i$

IV. RESULTS AND DISCUSSION

This research work is implemented in the Anaconda Distribution, which is a free and open-source distribution. The programming languages used were Python and R, which serve scientific computing applications and analysis purposes (data science, machine learning applications, predictive analysis, etc.).

In this experiment, the classification algorithms are applied on all 9 attributes and the results are based on the following terms:

True Positives (TP): The correctly predicted positive values, that is, the cases where both the actual class and the predicted class are yes. It is denoted by tp.

True Negatives (TN): The correctly predicted negative values, that is, the cases where both the actual class and the predicted class are no. It is denoted by tn.

False Positives (FP): The cases where the actual class is no and the predicted class is yes. It is denoted by fp.

False Negatives (FN): The cases where the predicted class is no and the actual class is yes. It is denoted by fn.

i. Precision (p): It is defined as the number of correctly predicted positive observations divided by the total number of observations predicted as positive.
$p = \frac{tp}{tp + fp}$

ii. Recall (r): It is defined as the ratio of correctly predicted positive observations to the total number of observations in the actual positive class.
$r = \frac{tp}{tp + fn}$
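The four counts and the two measures defined above can be computed directly from lists of actual and predicted labels. The following is a minimal sketch, not the authors' implementation: the function names and the example labels are hypothetical, and the accuracy and F1-score formulas are the standard definitions, which this paper names only in its figure captions.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (tp, tn, fp, fn) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted yes, actually yes
        elif p != positive and a != positive:
            tn += 1          # predicted no, actually no
        elif p == positive:
            fp += 1          # predicted yes, actually no
        else:
            fn += 1          # predicted no, actually yes
    return tp, tn, fp, fn


def precision(tp, fp):
    """p = tp / (tp + fp); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """r = tp / (tp + fn); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Standard definition: share of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(p, r):
    """Standard definition: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical labels, for illustration only.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
p, r = precision(tp, fp), recall(tp, fn)
print("tp, tn, fp, fn =", tp, tn, fp, fn)
print("precision =", round(p, 3), "recall =", round(r, 3))
print("accuracy =", round(accuracy(tp, tn, fp, fn), 3), "F1 =", round(f1_score(p, r), 3))
```

The zero-denominator guards are a design choice for the sketch; a library implementation may instead warn or raise in those cases.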
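Returning to the pre-processing and KNN descriptions in the methodology, the sketch below shows the two normalization formulas and a majority-vote KNN classifier built on the Minkowski distance, which reduces to the Manhattan distance for p = 1 and the Euclidean distance for p = 2. It is an illustration under stated assumptions rather than the authors' code: all function names, the choice k = 3 and the sample points are hypothetical.

```python
import math
from collections import Counter


def min_max_scale(values):
    """Map each x to (x - m) / (M - m), with m = min and M = max."""
    m, M = min(values), max(values)
    if M == m:                      # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(x - m) / (M - m) for x in values]


def z_scores(values):
    """Map each x to (x - mu) / sigma, using the population standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    if sigma == 0:                  # constant column: every z-score is 0
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_points, train_labels, query, k=3, p=2):
    """Classify query by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: minkowski(pair[0], query, p),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled training points with two features each.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.9, 0.9), (0.9, 1.0), (1.0, 0.9)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

print(min_max_scale([2, 4, 6]))
print(knn_predict(points, labels, (0.95, 0.92), k=3, p=2))
print(knn_predict(points, labels, (0.05, 0.02), k=3, p=1))
```

Scaling matters for KNN because the distance functions sum differences across attributes, so an attribute with a large numeric range would otherwise dominate the vote.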
Fig-4: Accuracy

The figure shows a steep fall from gradient boosting to decision tree and a sharp, monotonic rise in the curve for each of the algorithms. From the plot it is clear that the KNN algorithm performed better on the chosen data set and gave a greater degree of accuracy.

Fig-5: Precision

[Figure: Recall for each algorithm (LR, SVM, DT, RF, GB, KNN)]

Fig-7: F1-score