
International Journal of Engineering & Technology, 7 (2.7) (2018) 786-790

International Journal of Engineering & Technology


Website: www.sciencepubco.com/index.php/IJET

Research Paper

Classification of Imbalanced Malaria Disease Using Naïve Bayesian Algorithm
T. Sajana1*, M. R. Narasingarao2
1 Research Scholar, Department of CSE, K L E F, Vaddeswaram, Guntur, A.P.
2 Professor, Department of CSE, K L E F, Vaddeswaram, Guntur, A.P.
*Corresponding author E-mail: sajana.cse@kluniversity.in

Abstract

Malaria is rampant in semi-urban and non-urban areas, especially in resource-poor developing countries. In datasets for diseases such as malaria and dengue, there are usually far more negative patients (non-occurrence of the disease) than patients suffering from the disease (positive cases). Developing a model-based decision support system with such unbalanced datasets is a cause for concern, and a model that predicts the disease accurately is needed. Classification of imbalanced malaria disease data has become a crucial task in the medical application domain because most conventional machine learning algorithms perform very poorly when classifying whether a patient is affected by malaria or not. In imbalanced data, the majority (unaffected) class samples dominate the minority (affected) class samples, leading to class imbalance. To overcome the class imbalance problem, balancing the data samples is the best solution, producing better accuracy in the classification of minority samples. The aim of this research is to present a comparative study on classifying imbalanced malaria disease data using the Naive Bayesian classifier in two environments, Weka and the R language. We present a clinical descriptive study of 165 patients of different age groups, collected at medical wards of Narasaraopet from 2014-17. The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the class distribution, and we then performed a comparative study on the dataset using the Naïve Bayesian algorithm on both platforms. Of the balanced data, 70% was used to train the Naive Bayesian algorithm and the rest was used to test the model, in both the Weka and R programming environments. Experimental results indicate that classification of malaria disease data in the Weka environment achieves the higher accuracy, 88.5%, compared with 87.5% for the Naive Bayesian algorithm in the R programming language. The impact of vector-borne disease is very high in the medical domain. Prediction of diseases like malaria is the need of the hour, and it is possible only with a suitable model for a given dataset. Hence, we developed a model based on the Naive Bayesian algorithm for the current research.

Keywords: Imbalanced Data; Malaria; Naïve Bayesian; Weka; R.

1. Introduction

Malaria is one of the growing concerns among the vector-borne diseases in the medical domain [1]. It has become a global health problem, caused by a mosquito bite [1] [2]. Malaria has affected the rural population for many years. Even when people maintain a healthy lifestyle with good food habits and clean surroundings, many are still affected by malaria because of climate change or other reasons [1] [3].

According to the 2016 report of the World Health Organization (WHO), one million people die annually from vector-borne diseases like malaria [4]. Across all age groups, the estimated number of malaria deaths fell from 839,000 in 2000 (range: 653,000-1.1 million) to 438,000 in 2015 (range: 236,000-635,000), a decline of 48%. Overall, the malaria mortality rate is estimated to have decreased by 60% over the same period [5].

One way to address this problem is to properly identify patients with the disease and provide an accurate diagnosis. Data on malaria, a vector-borne disease, always creates a class imbalance problem because the samples of the negative class (unaffected patients) dominate the samples of the positive class (affected patients) [6] [7]. Handling this kind of imbalanced data is a critical task because predicting whether a patient has a disease is an important problem in the medical scenario [6] [7] [8].

Unbalanced data sets occur not only in the medical domain but also in domains such as credit card fraud detection [9], detection of problems in software [9] [10], detection of oil spills in satellite radar images [11] [12], fraud in telecommunications [9], and detection of fraud in the financial sector [9]. Analyzing an imbalanced data set is very pertinent for diseases like malaria [6]. Handling data of such a complex nature is an issue in the Data Mining and Machine Learning domains [6], and it has been observed that most traditional machine learning algorithms are very sensitive to imbalanced data [13] [14]. In general, the main goal of machine learning algorithms is to achieve good accuracy for classification or prediction. When dealing with imbalanced data, the classifier is trained with many more majority class examples than minority class examples, and hence its predictions are biased towards the majority class. Accurate classification of minority class samples is therefore more important than that of majority class samples, especially in medical diagnosis [6]. For example,
Copyright © 2018 Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

misclassification of an affected patient is more dangerous than misclassification of an unaffected patient [13] [14].

The rest of the paper is organized as follows: Section 2 reviews the literature on methods for classifying malaria disease data and on applications of imbalanced data. Section 3 presents the experimental framework for classification of imbalanced malaria disease data using the Naive Bayesian algorithm on different platforms, namely Weka and R programming, and Section 4 shows the experimental outcomes.

2. Literature Survey

Early identification of malaria is very important for an affected patient; otherwise, the disease may even lead to death. Many researchers have developed methods, mainly based on conventional machine learning algorithms, that produce accurate results with balanced data, as follows:

• Purnima Pandit et al. [15] analyzed Digital Holographic Interferometric Microscopic (DHM) images for detection of infected erythrocytes using Artificial Neural Networks.
• Francis Bbosa et al. [16] predicted the status of malaria patients using rule-based classification.
• Chunqing Wu et al. [17] developed a disease-free equilibrium model with Halanay inequalities for a discrete-time dynamic system of Neural Networks for diagnosing malaria disease.
• Meng-Hsiun Tsai et al. [18] proposed a method for detection of malaria parasites using adaptive histogram, threshold segmentation and K-means clustering techniques.
• Farah Zakiyah et al. [19] examined the classification of the Plasmodium vivax parasite using image processing and KNN techniques.
• Kshipra C. Charpe et al. [20] investigated malaria parasite stages using image processing and classification techniques.
• J. Somasekar et al. [21] examined a method for detection of affected erythrocytes based on adaptive median filter, edge enhancement and Fuzzy C-Means clustering techniques.
• Bruno B Andrade et al. [8] investigated active and inactive malaria disease patients based on laboratory assessment.

All of the work surveyed above deals with balanced malaria disease data only. However, identification of an affected patient (positive class sample) in imbalanced malaria disease data is very important [6] [7]. Because of the skewed distribution of data, classifying the positive class samples has in the past few years become a challenging issue for many researchers, giving rise to the class imbalance problem in many fields [9]. In the context of the class imbalance problem, many researchers have proposed methods to classify imbalanced data sets on different medical data sets, as stated below:

• Guo Haixiang et al. presented a plethora of methods for handling imbalanced data sets across many application domains [9].
• Salma Jamal et al. developed a predictive model of anti-malarial molecules inhibiting apicoplast formation for an imbalanced malaria disease dataset using cost-sensitive Naïve Bayesian, Random Forest and J48 algorithms and found that the Random Forest algorithm produces better accuracy [6].
• Bruno B Andrade et al. suggested that the severe stage of malaria disease causes a reduction of inflammatory cytokines, which is a high-level class imbalance problem in the medical domain [8].
• Rashmi Dubey et al. developed an ensemble framework of sampling and feature selection methods for classification of Alzheimer's disease, which occurs in elderly patients [22].
• Wing W. Y. Ng et al. proposed a Diversified Sensitivity-Based Undersampling method (DSUS) on different datasets and stated that DSUS is an efficient method when compared with other undersampling techniques [23].
• Yazan F. Roumani et al. conducted a comparative study on 2080 cardiac patients for classifying patient status using a set of classifiers, namely C4.5, logistic regression, discriminant analysis, Support Vector Machine (SVM) and Classification and Regression Tree (CART), and found that CART and SVM have the highest precision in diagnosing the disease [24].
• Jia Pengfei et al. proposed a new sampling method called the Distinct-Borderline algorithm for balancing the class distribution of imbalanced data sets [25].
• N. Poolsawad et al. conducted a comparative study of a set of classifiers, namely Multilayer Perceptron (MLP), RBFN, SVM, Decision Tree and Random Forest (RF), on the LIFELAB dataset and found that the RF algorithm produces better accuracy [26].
• V. Garcia et al. suggested various methods for handling imbalanced data sets [27].
• Jaree Thongkam et al. developed a model for prediction on an imbalanced breast cancer data set using the C-Support Vector Machine algorithm [28].
• Xing-Ming Zhao et al. suggested ensemble classifiers built from 1NN (Nearest Neighbor), SVM, C4.5 and logistic regression algorithms for protein classification and conducted a two-way comparative study: balanced versus unbalanced data, and individual versus ensemble classifiers. They concluded that balanced data is good for accurate prediction of samples and that ensemble classifiers are the best predictors among all the stated classifiers [29].

3. Experimental Framework

Dataset description: We collected data on 165 patients from the medical wards of Narasaraopet. The dataset consists of the attributes listed below:

Attributes      Actual range
Age             1-20 yrs
Haemoglobin     M 13.0-18.0 / F 11.0-16.5 gm%
RBC             3.80-5.80 millions/cumm
Hct             35.0-50.0 %
MCV             80-97 fl
MCH             26.5-33.5 pg/cell
MCHC            31.5-35.0 %
Platelets       1.50-3.90 lakhs/cumm
WBC             3,500-10,000 cells/cumm
Granulocytes    43.0-76.0 %
Lymphocytes     17.0-48.0 %
Monocytes       4.0-10.0 %
Malaria         Positive/Negative

The class imbalance distribution of the malaria disease dataset is as follows:

Total no. of instances              165
No. of attributes                   13
Minority class samples              5
Majority class samples              160
Minority class                      Positive
Majority class                      Negative
% of class (minority, majority)     3.0, 97.0
Imbalance ratio                     1:32

The best method for classifying the samples of imbalanced data is a two-step process: first balance the class distribution, and then apply the classifier to perform classification of samples.

Step-1: Balancing the class distribution: In order to balance the class distribution of the data, researchers have compared undersampling with oversampling and found that oversampling is the better choice. SMOTE is an oversampling technique in which minority class samples are oversampled in a synthetic manner.
Consider the algorithm of SMOTE:
Algorithm: SMOTE (T, N, k)
Input: Number of minority class samples T; amount of SMOTE N%; number of nearest neighbors k.
Output: (N/100) ∗ T synthetic minority class samples.
Method: (∗ If N is less than 100%, randomize the minority class samples as only a random percent of them will be SMOTEd. ∗)
If N < 100

    Then randomize the T minority class samples
    T = (N/100) ∗ T
    N = 100
End if
N = (int) (N/100) (∗ The amount of SMOTE is assumed to be in integral multiples of 100. ∗)
k = Number of nearest neighbors, numattrs = Number of attributes, Sample [ ] [ ]: array for original minority class samples, newindex: keeps a count of the number of synthetic samples generated, initialized to 0, Synthetic [ ] [ ]: array for synthetic samples.
(∗ Compute k nearest neighbors for each minority class sample only. ∗)
For i ← 1 to T
    Compute k nearest neighbors for i, and save the indices in narray
    Generate (N, i, narray)
End for
Generate (N, i, narray) (∗ Function to generate the synthetic samples. ∗)
While N ≠ 0
    Choose a random number between 1 and k, call it nn. (∗ This step chooses one of the k nearest neighbors of i. ∗)
    For attr ← 1 to numattrs
        Compute: dif = Sample [narray [nn]] [attr] − Sample [i] [attr]
        Compute: gap = random number between 0 and 1
        Synthetic [newindex] [attr] = Sample [i] [attr] + gap ∗ dif
    End for
    newindex++, N = N − 1
End while
Return (∗ End of Generate ∗)
End of algorithm.

Step-2: Classification: Unlike traditional statistical methods, machine learning models update their output as further data is fed into them. They do not presuppose the skills and insight needed to analyze the disease using statistical techniques.
It is recognized that translational research is essential to extract knowledge for diagnosing the disease. In fact, machine learning techniques and models are becoming a practical tool for predicting diseases like malaria, which is rampant in non-urban areas and remote villages. The best part of machine learning techniques is that, as more data is generated, the model improves in precision and can be widely employed in different clinical settings.
For the classification of samples, we propose the machine learning technique called the Naïve Bayesian algorithm, which uses class-conditional probabilities for classification of samples, and we conducted a comparative study in both the Weka and R programming environments.
Consider the Naive Bayesian classification algorithm:
Algorithm: classification of tuples for a given dataset D.
Input: A dataset D consisting of training tuples.
Output: Prediction of a class for a given tuple.
Method:
D: a training set of tuples
X = (x1, x2, x3, …, xn) is an n-dimensional vector representing a tuple of D defined over the n attributes A1, A2, A3, …, An.
C1, C2, …, Cm: set of classes
For each tuple X
    Predict that the tuple X belongs to the class Ci with the highest posterior probability, that is, if and only if
    P (Ci | X) > P (Cj | X) for 1 ≤ j ≤ m, j ≠ i
    By Bayes' theorem, calculate the maximum posterior hypothesis,
    P (Ci | X) = P (X | Ci) ∗ P (Ci) / P (X)
    where P (X) is constant for all classes.
    Maximize P (X | Ci) ∗ P (Ci):
    If the class prior probabilities are not known,
        Then assume P (C1) = P (C2) = … = P (Cm) and maximize P (X | Ci).
    Else
        Maximize P (X | Ci) ∗ P (Ci), where P (Ci) = |Ci, D| / |D|
        [|Ci, D| is the number of training tuples of class Ci in D]
    If the dataset D contains many attributes,
        Then assume class-conditional independence to compute P (X | Ci),
        i.e., P (X | Ci) = P (x1 | Ci) ∗ P (x2 | Ci) ∗ P (x3 | Ci) ∗ … ∗ P (xn | Ci)
        where xk is the value of attribute Ak for tuple X.
        If Ak is categorical, then
            P (xk | Ci) = [number of tuples of class Ci in D having the value xk for Ak] / |Ci, D|.
        If Ak is continuous valued, then
            P (xk | Ci) = g (xk, µCi, σCi), where g (x, µ, σ) is defined as
            g (x, µ, σ) = (1 / (sqrt(2π) ∗ σ)) ∗ exp(−(x − µ)^2 / (2 ∗ σ^2))
            [µCi (or µ) is the mean and σCi (or σ) is the standard deviation of the values of attribute Ak for tuples of class Ci]
    Predict that X belongs to class Ci if and only if
    P (X | Ci) ∗ P (Ci) > P (X | Cj) ∗ P (Cj) for 1 ≤ j ≤ m, j ≠ i.
End of algorithm.

4. Results

We conducted experiments on the collected imbalanced malaria disease data set: we first balanced the class distribution using the SMOTE algorithm and then performed a comparative study of the Naive Bayesian classifier on different platforms. The imbalanced malaria data set is shown in Figure 1:

Fig 1: Imbalanced Malaria disease data with majority (red) and minority (blue) class samples.

We then applied the SMOTE algorithm to obtain the balanced malaria dataset shown in Figure 2.

Fig 2: Balanced Malaria disease data with majority (red) and minority (blue) class samples after applying SMOTE algorithm
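The SMOTE balancing step (Step-1 in Section 3) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in Weka or R for the experiments: the function name `smote`, its parameter names, the fixed random seed, and the use of Euclidean distance for the nearest-neighbour search are all choices made here for illustration.

```python
import math
import random


def smote(minority, n_percent=100, k=5, seed=0):
    """Return synthetic minority samples (illustrative sketch of SMOTE).

    minority  : list of minority class samples, each a list of numbers
    n_percent : amount of SMOTE in percent (N in the pseudocode)
    k         : number of nearest neighbours
    seed      : hypothetical parameter, only for reproducible runs
    """
    rng = random.Random(seed)
    samples = [[float(v) for v in row] for row in minority]

    if n_percent < 100:
        # Only a random fraction of the minority samples is SMOTEd.
        rng.shuffle(samples)
        samples = samples[:max(1, len(samples) * n_percent // 100)]
        n_percent = 100
    per_sample = n_percent // 100  # synthetic samples generated per original

    if len(samples) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(samples) - 1)  # cannot have more neighbours than other samples

    synthetic = []
    for i, x in enumerate(samples):
        # Indices of the k nearest neighbours of sample i (Euclidean distance assumed).
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: math.dist(x, samples[j]))
        narray = others[:k]
        for _ in range(per_sample):
            nn = rng.choice(narray)  # pick one of the k nearest neighbours
            # For each attribute: original value + gap * dif, with gap in [0, 1).
            synthetic.append(
                [a + rng.random() * (b - a) for a, b in zip(x, samples[nn])]
            )
    return synthetic
```

As a usage illustration, calling `smote(positives, n_percent=3100, k=3)` on 5 positive samples would return 155 synthetic samples, bringing the positive class to 160. The value 3100 is only an example chosen to match the majority class size; it is not a setting reported in this study.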
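The Naive Bayesian classification procedure described in Section 3 (Step-2) can likewise be sketched in Python for continuous-valued attributes. This is a hedged illustration only: the function names `gaussian`, `fit` and `predict`, the log-space scoring, and the small constants guarding against zero spread and log(0) are assumptions introduced here, and the Weka and R implementations used in the experiments may differ in such details.

```python
import math
from collections import defaultdict


def gaussian(x, mu, sigma):
    """g(x, mu, sigma): normal density used for a continuous attribute."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def fit(rows, labels):
    """Estimate P(Ci) and the per-attribute (mean, std) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)  # P(Ci) = |Ci,D| / |D|
        stats = []
        for column in zip(*members):      # one column per attribute Ak
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column)
            # Assumption: replace a zero standard deviation with a tiny value.
            stats.append((mu, math.sqrt(var) or 1e-9))
        model[label] = (prior, stats)
    return model


def predict(model, x):
    """Return the class Ci that maximizes P(X|Ci) * P(Ci), scored in log space."""
    def score(label):
        prior, stats = model[label]
        # Class-conditional independence: sum of per-attribute log densities.
        return math.log(prior) + sum(
            math.log(gaussian(v, mu, sigma) + 1e-300)  # 1e-300 avoids log(0)
            for v, (mu, sigma) in zip(x, stats)
        )
    return max(model, key=score)
```

With a training set of rows such as `[haemoglobin, platelets]` and labels `"Positive"` or `"Negative"` (hypothetical column choice), `predict(fit(rows, labels), new_row)` returns the label with the larger posterior score. Working with log probabilities is a common design choice that avoids numerical underflow when many attributes are multiplied together.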

Training and test data after balancing: Of the balanced class distribution, 70% (112 patients' data) is used to train the Naive Bayesian algorithm, as shown in Fig 3, and 30% (48 patients' data) is used to test the Naive Bayesian classifier performance, as shown in Fig 4:

Fig 3: Training data of Naïve Bayesian algorithm

Fig 4: Test data of Naïve Bayesian algorithm

The performance of the Naive Bayesian classifier on classification of malaria disease data on different platforms is as follows:

Table 1: Comparison study of Naive Bayesian classifier performance on malaria disease data on different platforms

                                   R programming    Weka tool
Correctly classified samples       84               85
Incorrectly classified samples     12               11
Accuracy %                         87.5             88.5
Misclassification %                12.5             11.5

Where
Accuracy = [TP + TN] / [TP + FN + FP + TN]
which is defined from the confusion matrix as follows:

Table 2: Confusion Matrix

                         Predicted Positive Class    Predicted Negative Class
Actual Positive Class    TP                          FN
Actual Negative Class    FP                          TN

The classification performance of the Naïve Bayesian algorithm on malaria disease data on different platforms is shown below:

Fig 5: Naïve Bayesian Classifier performance on Malaria disease data set using R programming and Weka environments.

As Figure 5 shows, the Naive Bayesian algorithm classifies the samples with an accuracy of 88.5% in the Weka environment, whereas in R programming it achieves 87.5% accuracy in classifying the malaria disease data.

5. Discussions
We conducted a prospective study on clinically collected data from 165 patients, which has a skewed class distribution. We applied the SMOTE algorithm to balance the class distribution and then conducted a comparison study of the Naive Bayesian classifier in both the Weka and R programming environments. Experimental results show that the Naive Bayesian classifier in the Weka environment performed slightly better in classifying the imbalanced malaria disease data than in the R programming environment (88.5% versus 87.5% accuracy).

6. Conclusion
Malaria is one of the class imbalance problems in medical diagnosis. In a nutshell, classification of such imbalanced data, especially of minority class samples, is the key issue. Accurate prediction of affected patients and timely diagnosis are very important; otherwise the disease may lead to death. Many conventional classifiers favor the majority class samples only, so handling and classifying minority class samples is becoming a burning issue in the medical field. We balanced the class distribution of imbalanced malaria disease data using the SMOTE algorithm and then performed a comparative study of the Naive Bayesian classifier on different platforms, with the aim of a better prediction system.

References

[1] Thanh Quang Bui and Hai Minh Pham (2016). Web-based GIS for spatial pattern detection: application to malaria incidence in Vietnam. SpringerPlus 5:1014, 1-14.
[2] S.T. Khot and R.K. Prasad (2015). Optimal Computer Based Analysis for Detecting Malarial Parasites. Proc. of the 3rd Int.

Conf. on Front. of Intell. Comput. (FICTA), Advances in Intelligent Systems and Computing 327, vol 1: 69-80.
[3] Md Z Rahman, Leonid Roytman et al. (2015). Environmental Data Analysis and Remote Sensing for Early Detection of Dengue and Malaria. Proc. of SPIE Vol. 9112: 1-9.
[4] WHO Malaria Report (2016). http://www.who.int/mediacentre/factsheets/fs387/en/
[5] World Malaria Report (2015), pages x, xi. http://apps.who.int/iris/bitstream/10665/200018/1/9789241565158_eng.pdf
[6] Salma Jamal, Vinita Periwal et al. (2013). Predictive modeling of anti-malarial molecules inhibiting apicoplast formation. BMC Bioinformatics 2013, 14:55, 1-8.
[7] Tsige Ketema and Ketema Bacha (2013). Plasmodium vivax associated severe malaria complications among children in some malaria endemic areas of Ethiopia. BMC Public Health 2013, 13:637, 1-7.
[8] Bruno B Andrade, Antonio Reis-Filho et al. (2010). Severe Plasmodium vivax malaria exhibits marked inflammatory imbalance. Malaria Journal 2010, 9:13, 1-8.
[9] Guo Haixiang, Li Yijing et al. (2016). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications: 1-49.
[10] Bartosz Krawczyk (2016). Learning from imbalanced data: open challenges and future directions. Prog Artif Intell: 1-12.
[11] Xiaoheng Deng, Weijian Zhong et al. (2016). An Imbalanced Data Classification Method Based On Automatic Clustering Under-Sampling. IEEE transaction: 1-8.
[12] Aida Ali, Siti Mariyam Shamsuddin et al. (2013). Classification with class imbalance problem: a review. International Journal of Advances in Soft Computing and its Applications 5(3): 1-30.
[13] N. Poolsawad, C. Kambhampati et al. (2014). Balancing Class for Performance of Classification with a Clinical Dataset. Proceedings of the World Congress on Engineering vol 1: 1-6.
[14] M. Mostafizur Rahman and D. N. Davis (2013). Addressing the Class Imbalance Problem in Medical Datasets. International Journal of Machine Learning and Computing 3(2): 224-228.
[15] Purnima Pandit, A. Anand (2016). Artificial Neural Networks for Detection of Malaria in RBCs. arXiv: 1608.06627.
[16] Francis Bbosa, Ronald Wesonga et al. (2016). Clinical malaria diagnosis: rule-based classification statistical prototype. SpringerPlus 5:939.
[17] Chunqing Wu and Patricia JY Wong (2016). Multi-dimensional discrete Halanay inequalities and the global stability of the disease free equilibrium of a discrete delayed malaria model. Advances in Difference Equations 2016:113.
[18] Meng-Hsiun Tsai, Shyr-Shen Yu et al. (2015). Blood Smear Image Based Malaria Parasite and Infected-Erythrocyte Detection and Segmentation. Transactional Processing Systems. J Med Syst 39: 118. DOI 10.1007/s10916-015-0280-9.
[19] Farah Zakiyah Rahmanti, Sutojo et al. (2015). Plasmodium Vivax Classification from Digitalization Microscopic Thick Blood Film Using Combination of Second Order Statistical Feature Extraction and K-Nearest Neighbour (K-NN) Classifier Method. IEEE 4th International Conference on Instrumentation, Communications, Information Technology, and Biomedical Engineering (ICICI-BME), Bandung, November 2-3, 2015.
[20] Kshipra C. Charpe, Dr. V. K. Bairagi et al. (2015). Automated Malaria Parasite and there Stage Detection in Microscopic Blood Images. IEEE Sponsored 9th International Conference on Intelligent Systems and Control (ISCO).
[21] J. Somasekar, B. Eswara Reddy (2015). Segmentation of erythrocytes infected with malaria parasites for the diagnosis using microscopy imaging. Elsevier - Computers and Electrical Engineering. 336–51.
[22] Rashmi Dubey, Jiayu Zhou et al. (2014). Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. Elsevier NeuroImage 87: 220–241.
[23] Wing W. Y. Ng, Junjie Hu et al. (2015). Diversified Sensitivity-Based Undersampling for Imbalance Classification Problems. IEEE Transactions on Cybernetics: 1-11.
[24] Yazan F. Roumani et al. (2013). Classifying highly imbalanced ICU data. Health Care Manag Sci 16: 119-128.
[25] Jia Pengfei, Zhang Chunkai et al. (2014). A New Sampling Approach for classification of Imbalanced Data sets with High Density. IEEE transaction: 217-222.
[26] N. Poolsawad, C. Kambhampati et al. (2014). Balancing Class for Performance of Classification with a Clinical Dataset. Proceedings of the World Congress on Engineering vol 1: 1-6.
[27] V. Garcia, J.S. Sanchez et al. (2012). On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Elsevier - Knowledge-Based Systems. 13-21.
[28] Jaree Thongkam, Guandong Xu et al. (2009). Toward breast cancer survivability prediction model through improving training space. Elsevier - Expert Systems with Applications. 12200-12209.
[29] Xing-Ming Zhao, Xin Li et al. (2007). Protein classification with imbalanced data. Wiley InterScience. 1125-1132.
