Prediction of Diabetes using Machine Learning

Article in International Journal of Innovative Technology and Exploring Engineering · May 2019
1 author:
Yalla Jeevan Nagendra Kumar

Gokaraju Rangaraju Institute of Engineering & Technology
All content following this page was uploaded by Yalla Jeevan Nagendra Kumar on 01 June 2023.

International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-8 Issue-7 May, 2019

Y. Jeevan Nagendra Kumar, N. Kameswari Shalini, P.K. Abhilash, K. Sandeep, D. Indira

Revised Manuscript Received on May 06, 2019
Dr. Y. J. Nagendra Kumar, Professor, Department of Information Technology, Gokaraju Rangaraju Institute of Engineering and Technology, JNTUH, Hyderabad, India.
N. Kameswari Shalini, Department of Information Technology, Gokaraju Rangaraju Institute of Engineering and Technology, JNTUH, Hyderabad, India.
P.K. Abhilash, Department of Information Technology, Gokaraju Rangaraju Institute of Engineering and Technology, JNTUH, Hyderabad, India.
K. Sandeep, Department of Information Technology, Gokaraju Rangaraju Institute of Engineering and Technology, JNTUH, Hyderabad, India.
D. Indira, Department of Information Technology, Gokaraju Rangaraju Institute of Engineering and Technology, JNTUH, Hyderabad, India.
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number G5764058719/19©BEIESP

Abstract: Machine learning is an application of artificial intelligence that has proved to be a major breakthrough in the medical sciences, especially in the diagnosis of diseases. Recent studies have shown that a high percentage of the world's population suffers from diabetes, a metabolic disorder that results from high blood sugar levels. The rapid growth of diabetes is attributed to unhealthy lifestyles, processed food, lack of health awareness and lack of physical exercise. Diabetes can cause visual disturbances, pancreas malfunction, nerve damage, heart disease, kidney damage, fatigue and lack of energy, excessive urination, gastroparesis, damaged blood vessels, foot problems, dry and cracked skin, and a few other chronic conditions. It is therefore necessary to detect and diagnose diabetes at an early stage. This work deals with the classification of people diagnosed with diabetes using classification algorithms: Logistic Regression (LR), Random Forest, SVM, KNN, Gradient Boosting (GB) and Decision Tree (DT). In the experiment, the KNN algorithm gave better results than the other classification algorithms, achieving an accuracy of 85%.

Index Terms: Accuracy, Classification, Data mining, Algorithms

I. INTRODUCTION
Data science has played an important role in providing a vital source of information that has helped business establishments, IT sectors and other research organizations. The collection of large volumes of data (structured and unstructured) has given many organizations the opportunity to make a mark in research fields.
Data science solutions have brought a revolution in the healthcare sector, which has benefited from data science in exploring drugs, genetic diseases etc. Thus, there is a lot of potential in this area that needs to be explored further to benefit society as a whole.
Many machine learning algorithms are available that provide a greater degree of accuracy in predicting outcomes from the input data with the use of statistical analysis.
Classification divides things into groups or types. Various classification algorithms are used to predict the target class. Some of the commonly used algorithms are regression techniques, KNN, Gradient Boosting, Random Forest and SVM.
In this work, Logistic Regression, SVM, KNN, Random Forest, Gradient Boosting and Decision Tree are analysed to compare their classification accuracy in diagnosing diabetes.

II. LITERATURE SURVEY
Deepti Sisodia, Dilip Singh Sisodia [1] (2018): used classification algorithms such as naïve Bayes, decision tree and SVM to predict diabetes. The highest accuracy obtained was 76.3%, given by the naïve Bayes algorithm.
Tejas N. Joshi, Prof Pramila N. Chawan [2] (2018): proposed an effective system for early detection of diabetes and used algorithms such as SVM, ANN and Logistic Regression for prediction.
Dr. Y. Jeevan Nagendra Kumar et al [3] (2017): presented a map-centred spatial analysis of rainfall data of the AP and TS states using hybrid machine learning methods.
Priyanka Indoria, Yogesh Kumar Rathore [4]: proposed a system for prediction of diabetes using artificial neural networks and Bayesian networks. Their work focuses on recent developments in machine learning that have made a significant impact on the detection and diagnosis of diabetes.
Dr. Y. Jeevan Nagendra Kumar et al [5] (2016): addressed the supervision of knowledge discretion to obtain reasonable-grain access control.
Akansha Rathore, Simran Chauhan [6]: implemented decision tree and SVM classification algorithms for the analysis and prediction of diabetes. They used the R framework for their system.
Dr. Y. Jeevan Nagendra Kumar et al [7] (2014): proposed a new symbol-based tree traversal searching scheme.
Hrushikesh N. Mhaskar et al [8]: used a MATLAB data mining tool for the classification process to predict glucose levels in diabetic patients.
Aishwariya R, Gayathri P, Jaysankar N [9] (2013): proposed algorithms for prediction of diabetes. The promising machine learning algorithm they used was the Support Vector Machine (SVM).
Dr. Y. Jeevan Nagendra Kumar et al [10] (2013): provided a thorough survey of the efforts in this area, noting that the world is advancing with many concerns about bio-diversity, science and technology, and more.
Abdullah A. Aljumah et al [11] (2013): used regression-based machine learning techniques for predictive analysis of diabetes. They employed the Oracle Data Mining software tool. The algorithm used for analysis was SVM.
Ioannis Kavakiotis et al [12] (2017): implemented the support vector machine (SVM) classification algorithm for their proposed work.
Dr. Y. Jeevan Nagendra Kumar et al [13] (2013): developed a greedy method for minimizing the number of points, using spatial mining to find the solution.
Nongyao Nai-aruna, Rungruttikarn Moungmaia [14] (2015): developed a system to determine the occurrence of diabetes. The algorithms they used were DT, ANN, Random Forest, LR and the NB classifier.
Sajida Perveena et al [15] (2016): implemented three classifiers, bagging, AdaBoost and the J48 decision tree, for prediction of diabetes. J48 produced better results.
D. Asir Antony Gnana Singh et al [16] (2017): used machine learning algorithms such as naïve Bayes, random forest and MPL for analysis and prediction of diabetes. Better results were obtained with the random forest classification algorithm.

III. METHODOLOGY
Diabetic dataset:
The dataset for prediction of diabetes is obtained from medical reports of patients shared on the Kaggle community. The dataset consists of 768 rows and 9 attributes that contribute to the prediction of diabetes. The attributes include Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes_Pedigree_Function, Age etc.

[Flow diagram: Diabetes database → Data pre-processing → Applying various classification algorithms (KNN, SVM, Logistic Regression, Random Forest, Decision Tree, Gradient Boosting) → Performance evaluation of algorithms based on accuracy → Comparative analyses of algorithms → Result]
Fig 1: Model of the process

Data Pre-processing:
1. Outlier analysis:
An outlier is a point in the observations that is isolated from all other observations. Outliers are caused by measurement variability or by experimental error. There are two types of outliers: univariate and multivariate. Outliers observed in a single feature space are called univariate outliers, while those observed in n-dimensional space are called multivariate outliers.
2. Discover outliers with visualization tools:
i. Univariate plot:
Box Plot: It is a graphical method of describing numerical data through their quartiles. Such figures have vertical lines extending from the boxes, and the variability outside the upper and lower quartiles is displayed, as shown in Fig-2.
Fig-2: Box Plot
The plot shows a few points that are outliers, as they are not included in the box with the other observations.
ii. Multivariate plot:
1) Correlation Matrix Plot:
Correlation is a statistical tool that helps in understanding how two variables are related and the effect of one on the other, that is, whether they are positively or negatively correlated. If both variables move in the same direction, so that an increase or decrease in one accompanies an increase or decrease in the other, they are said to be positively correlated; if they move in opposite directions, they are negatively correlated. If a change in one variable has no effect on the direction of the other, there is no correlation.
The correlation between each pair of attributes is calculated and represented in matrix form, and these values are then plotted to understand the degree of correlation. This helps in anticipating how machine learning algorithms will behave. It is important to note that linear or logistic regression can perform poorly when the data contain highly correlated input variables.
Fig-3: Correlation Matrix
Feature Scaling Method:
Feature scaling is a method of standardizing the range of the independent variables. The data is normalized during the pre-processing stage.

Min-Max Scaler Method:
This is a min-max normalization strategy. It applies a linear transformation that maps x to y:
y = (x - m) / (M - m)
where m = min and M = max.

Z-score:
It measures the number of standard deviations by which a value lies above or below the population mean.
The formula is:
z = (x - μ) / σ
where μ is the population mean and σ is the standard deviation.

Algorithms: The following algorithms are used in the study, and some salient features of each are discussed in brief. Further details of each algorithm can be found in the available sources.
1. KNN: The KNN model stores all available cases and classifies new cases based on a similarity measure with their K nearest neighbours. The algorithm is mainly used for statistical estimation and pattern recognition. It is a non-parametric technique. A new case is classified by a majority vote among its neighbours, which are identified by a distance function. The distance functions available are:
i. Euclidean
ii. Manhattan
iii. Minkowski
2. LR: This is used to analyse a dataset containing independent variables that determine an outcome. The outcome is a variable having only two possible values. The logistic function is given below:
f(t) = 1 / (1 + e^(-t))
where t stands for the real input value.
3. SVM: It is a supervised learning algorithm. Each data point is plotted in a space whose dimension equals the number of features available. Classification is then done by finding the hyper-plane that separates the two classes.
4. Random Forest: This algorithm can be used for both regression and classification problems. It often produces good results even without hyper-parameter tuning.
5. DT: This algorithm falls under supervised learning. In this work DT is used to solve classification problems. First, a training model is created that can predict the target variable by learning decision rules inferred from the training data. Splits are evaluated using entropy:
Entropy = -Σ p_i log2(p_i)
where p_i is the proportion of samples belonging to class i.
6. GB: Unlike a single DT, this algorithm builds a predictive model by combining weak predictive models. The model is built in a stage-wise manner and optimizes a loss function.

IV. RESULTS AND DISCUSSION
This research work is implemented in the Anaconda Distribution, which is a free and open-source distribution. The programming languages used were Python and R, for scientific computing and analysis purposes (data science, machine learning applications, predictive analysis, etc.).
In this experiment, the classification algorithms are applied on all 9 attributes and the results are based on the following terms:
• True Positives (TP): correctly predicted positive values, that is, both the actual class and the predicted class are yes. Denoted by tp.
• True Negatives (TN): correctly predicted negative values, that is, both the actual class and the predicted class are no. Denoted by tn.
• False Positives (FP): the actual class is no and the predicted class is yes. Denoted by fp.
• False Negatives (FN): the predicted class is no and the actual class is yes. Denoted by fn.
i. Precision (p): the number of correctly predicted positive observations divided by the total number of predicted positive observations.
p = tp / (tp + fp)
ii. Recall (r): the fraction of correctly predicted positive observations out of all actual positive observations in the class.
r = tp / (tp + fn)
iii. F1 score: a measure of a test's accuracy that takes both false positives and false negatives into account. It is the harmonic mean of precision and recall.
F1 Score = 2pr / (p + r)
iv. Accuracy: the most intuitive performance measure, defined as the number of correctly predicted observations divided by the total number of observations.
Accuracy = (tp + tn) / (tp + tn + fp + fn)
Fig-4 shows the accuracy of each algorithm used.
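As a worked illustration of the four measures defined in terms of tp, tn, fp and fn, the following minimal Python sketch computes them from confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are assumptions made for illustration; they are not taken from the paper's experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 score and accuracy from confusion-matrix counts.

    A denominator of zero yields 0.0 for that measure (a convention assumed here).
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for illustration only (not results from the paper).
scores = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, precision is 40/45, recall is 40/50 and accuracy is 85/100, which shows how a model can have high accuracy while still missing a fifth of the positive cases.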
Fig-4: Accuracy
The figure shows a steep fall from gradient boosting to decision tree and a sharp rise in the curve for the other algorithms. From the plot it is clear that the KNN algorithm showed better results for the chosen data set and gave a greater degree of accuracy.
Fig-5: Precision
Fig-6: Recall (y-axis: recall, ticks from 0.7 to 0.85; x-axis: LR, SVM, DT, RF, GB, KNN)
Fig-7: F1-score
v. Classification report: The classification report presents, in tabular form, the scores for the parameters identified above for this study.
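A classification report of this kind can be sketched as follows. This is a minimal pure-Python illustration that tabulates precision, recall, F1 score and support for each class from lists of actual and predicted labels. The paper does not state which tool produced its report, so the helper names `per_class_report` and `format_report` and the toy labels are assumptions.

```python
def per_class_report(y_true, y_pred, labels=(0, 1)):
    """Return one (label, precision, recall, f1, support) row per class."""
    pairs = list(zip(y_true, y_pred))
    rows = []
    for label in labels:
        # Treat `label` as the positive class and count outcomes against it.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        rows.append((label, precision, recall, f1, tp + fn))
    return rows


def format_report(rows):
    """Render the rows as a fixed-width text table."""
    lines = [f"{'class':>5} {'precision':>9} {'recall':>6} {'f1':>6} {'support':>7}"]
    for label, precision, recall, f1, support in rows:
        lines.append(f"{label:>5} {precision:>9.2f} {recall:>6.2f} {f1:>6.2f} {support:>7}")
    return "\n".join(lines)


# Toy labels made up for illustration (1 = diabetic, 0 = non-diabetic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
rows = per_class_report(y_true, y_pred)
print(format_report(rows))
```

Reporting the scores per class, rather than only for the positive class, makes it visible when a classifier does well on one class at the expense of the other.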
AUC (Area under curve): AUC provides an aggregate measure of performance across all possible classification thresholds. It measures how well the predictions are ranked, rather than their absolute values. Fig-8 shows the AUC for each of the algorithms under study.
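The ranking interpretation of AUC can be made concrete with a short sketch. The code below assumes the standard pairwise formulation, in which AUC is the fraction of (positive, negative) pairs where the positive example receives the higher score, with ties counted as half. The function name `auc_score` and the toy scores are hypothetical and are not the paper's data.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.5, 0.2]
auc = auc_score(y_true, scores)

# Rescaling the scores keeps their order, so the AUC is unchanged.
auc_rescaled = auc_score(y_true, [10 * s for s in scores])
print(auc, auc_rescaled)
```

Because only the ordering of the scores matters, multiplying every score by ten leaves the AUC the same, which is what it means to measure how predictions are ranked rather than their absolute values. The pairwise loop is quadratic in the number of examples and is meant only as an illustration.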
Fig-8: AUC

V. CONCLUSION
Identifying diabetes at an early stage is the main objective of this work. A comparative analysis is performed on six different algorithms, and performance is evaluated using the set of parameters identified for the given data set. The analysis is done on a diabetes database collected from the Kaggle community. Our experimental results showed that the KNN classification algorithm produced better accuracy in predicting diabetes than the other algorithms analysed.
The work can be further improved in the future by identifying more machine learning algorithms to achieve better accuracy and results.

REFERENCES
1. Deepti Sisodia, Dilip Singh Sisodia, "Prediction of Diabetes using Classification Algorithms", Science direct
2. Tejas N. Joshi, Prof Pramila N. Chawan, "Diabetes Prediction Using Machine Learning Techniques", ISSN: 2248-9622, Vol. 8, Issue 1, IJERA.
3. Y. Jeevan Nagendra Kumar, Dr. T. V. Rajini Kanth, "GIS-MAP Based Spatial Analysis of Rainfall Data of Andhra Pradesh and Telangana States Using R", International Journal of Electrical and Computer Engineering (IJECE), Vol 7, No 1, February 2017, Scopus Indexed Journal, ISSN: 2088-8708.
4. Priyanka Indoria, Yogesh Kumar Rathore, "A Survey: Detection and Prediction of Diabetes Using Machine Learning Techniques", ISSN: 2278-0181, Vol.7, IJERT
5. Y. J. Nagendra Kumar, M Chander, "A Better Search Optimization for Multidimensional Queries over Cloud on Encrypted Data", International Journal for Research on Electronics and Computer Science (IJRECS), May-June 2014, V-1, I-2, ISSN: 2321-5484
6. Akansha Rathore, Simran Chauhan, "Detecting and Predicting Diabetes Using Supervised Learning: An Approach towards Better Healthcare for Women", ISSN No. 0976-5697, Volume 8, IJARCS.
7. Y. Jeevan Nagendra Kumar, M. Swetha, "An Encryption Scheme with Supportable Allocation in Cloud Computing", International Journal of Innovation Technology and Research (IJITR), Volume No. 4, Issue No. 6, October - November 2016, ISSN: 4783 - 4785.
8. Hrushikesh N. Mhaskar, Sergei V. Pereverzyev, Maria D. van der Walt, "A Deep Learning Approach to Diabetic Blood Glucose Prediction", frontiersin.
9. Aishwariya R, Gayathri P, Jaysankar N, "A Method for Classification Using Machine Learning Technique for Diabetes", International Journal of Engineering and Technology (IJET), 5 (2013), pp. 2903-2908
10. Y. Jeevan Nagendra Kumar, Dr. T.V. Rajini Kanth, "Web GIS Mining for Effective Retrieval of Remote Sensed Data: A Survey Paper", International Conference on Advanced Computing Methodologies 2013. ISBN No: 9789351071495
11. Abdullah A. Aljumah, Mohammed Gulam Ahamad, Mohammad Khubeb Siddiqui, "Application of data mining: Diabetes health care in young and old patients", Journal of King Saud University-Computer and Information Sciences, 25 (2013), pp. 127-136
12. Ioannis Kavakiotis, Olga Tsave, Athanasios Salifoglou, Nicos Maglaveras, Ioannis Vlahavas, Ioanna Chouvarda, "Machine Learning and Data Mining Methods in Diabetes Research", Computational and Structural Biotechnology Journal, 15 (2017), pp. 104-116
13. Y. Jeevan Nagendra Kumar, Dr. T. V. Rajinikanth, Mohana Krishna, "Optimal Distribution of mobile network towers: an approach through Geo Spatial Data Mining", International Conference on Advanced Computing Methodologies 2013. ISBN No: 9789351071495
14. Nongyao Nai-aruna, Rungruttikarn Moungmaia, "Comparison of Classifiers for the Risk of Diabetes Prediction", Procedia Computer Science, 69 (2015), pp. 132-14
15. Sajida Perveena, Muhammad Shahbaza, Aziz Guergachib, Karim Keshavjeec, "Performance Analysis of Data Mining Classification Techniques to Predict Diabetes", Procedia Computer Science, 82 (2016), pp. 115-121
16. Dr. D. Asir Antony Gnana Singh, Dr. E. Jebamalar Leavline, B. Shanawaz Baig, "Diabetes Prediction Using Medical Data", ISSN 0973-385X, Volume 10, Number 1 (2017), pp. 1-8

AUTHORS PROFILE
Dr. Y. Jeevan Nagendra Kumar obtained his Ph.D in Computer Science and Engineering from Acharya Nagarjuna University, Guntur, AP in 2017 and his M.Tech in Computer Science Technology from Andhra University in 2005. He has been working as Professor and Dean - Technology and Innovation Cell in GRIET since 2005.
He has about 12 research papers in international / national conferences and journals and has attended many FDP programs to enhance his knowledge. With his technical knowledge he has guided students in developing useful web applications and data mining related products. As a B O S member he was able to introduce new subjects and topics in UG / PG courses. Students are encouraged to work on research projects, engineering projects and industrial training.
He has acted as Coordinator for 3 international conferences and as Technical Committee member for several international conferences. He is Coordinator for J Lab under J Hub JNTUH and the Robotic Club, and also Coordinator for NBA and NAAC at college level.

Ms. Nistala Kameswari Shalini completed her intermediate education in 2014 and is currently pursuing the 4th year of a BTech in Information Technology at Gokaraju Rangaraju Institute of Engineering and Technology (GRIET). She has secured above 90% in academics to date. She has developed a number of projects and attended a few training programmes. She has also successfully completed two internship programs at Reliance Jio and TIFR. Her areas of interest are data mining and web development.