Data Mining in Healthcare - A Literature Survey: Dr. B.R. Prakash Dr. M. Hanumanthappa Uma K

Proceeding of National Conference on Advances in Computer Applications NCACA-2015

Data Mining in Healthcare – A Literature Survey

Dr. B.R. Prakash
Department of Computer Science and Applications
Sri Siddhartha Institute of Technology, Tumkur, India.

Dr. M. Hanumanthappa
Department of Computer Science and Applications
Bangalore University, Bangalore, India.

Uma K
Department of Computer Science and Applications
Bangalore University, Bangalore, India.

Abstract- Today, a healthcare organization generates voluminous data, which results in a lack of usable information for making the right decision. Data mining techniques can be used to extract the needed information from healthcare organizations. Data mining is also widely used in healthcare for various applications such as detection of fraud and abuse, customer relationship management, detection of diseases and treatment effectiveness, and availability of healthcare services at lower cost. This survey explores the various data mining techniques applied in healthcare, such as classification, clustering and association rules. Thus, it is necessary to identify and evaluate the popular data mining algorithms implemented in healthcare services. These algorithms are needed to increase the accuracy and efficiency of healthcare services. This paper also explores the data mining challenges in healthcare.

Keywords- Data mining, Data mining techniques, Healthcare applications.

I. INTRODUCTION

Data mining is a thriving research area concerned with discovering meaningful information and finding patterns in data. The main objective of data mining is to discover the knowledge hidden in huge volumes of data. Today, with the rapid growth of data, data mining is becoming popular in almost all industries. Data mining is the most appropriate practice for analyzing and discovering useful information in various fields such as financial data, the retail industry, biological data, scientific and engineering applications, intrusion detection, telecommunication and healthcare. Among these, the healthcare industry is a vast area in which to study and apply data mining techniques to improve healthcare services [1].

Healthcare is the practice of caring for human health in terms of detecting diseases at an early stage, giving effective treatment, and preventing diseases and other physical and mental impairments before they occur [5]. Nowadays, data mining is broadly used in the healthcare industry, since the industry holds enormous amounts of data; however, this data needs an effective and efficient decision-making system to find the hidden patterns. This feature of healthcare motivates the use of data mining applications to support better healthcare services. Data mining applications are used in healthcare in four ways: i) strategic, ii) administrative, iii) clinical and iv) operational. Data mining techniques can be applied to strategic planning for the healthcare industry, to provide medical solutions to patients at lower cost and to make effective use of hospital resources. There is huge potential for data mining applications in healthcare. Generally, the applications of data mining in healthcare are grouped as healthcare management, customer relationship management, evaluation of treatment effectiveness, and health insurer detection of fraud and abuse. Beyond these applications, data mining in healthcare also includes classifying patients as high risk or low risk and grouping patients with similar types of diseases. The results are used to give the best treatment, to utilize resources effectively and to make healthcare policies.

In recent years, research has been conducted on various applications of healthcare such as detection of fraud and abuse, healthcare management, customer relationship management and treatment effectiveness. Researchers are most interested in treatment effectiveness, which involves identifying disease patterns and symptoms and giving better treatment. Examples include research on finding the causes of diabetes, heart diseases, chronic diseases and cancer, and on detecting patterns of disease spread in various states of India. The other healthcare applications also use data mining techniques to improve services, for example by discovering unusual patterns of claims by physicians, clinics, health insurers or others who benefit from them, by maintaining the healthcare industry, and by managing relationships with the customers of the healthcare industry.

The most popular data mining techniques used in healthcare are classification, clustering, decision trees, association rules, neural networks, support vector machines and many others. For example, a classification method can be used to classify patients as high risk or low risk, a clustering technique can be used to group patients who have a similar type of disease, and a decision tree algorithm can help to identify the causes of a disease efficiently.

II. DATA MINING AND HEALTHCARE

Data mining is the process of discovering previously unknown patterns in large datasets. Data mining is the most powerful and motivating concept of discovering the
hidden pattern from voluminous data. Presently, data mining is used significantly in the healthcare industry to transform complex data into useful information. Because of the huge amount of data in healthcare, a strong method is needed to handle the data and extract useful information from it. With the growth and maintenance of large repositories of structured and unstructured data, health organizations are increasingly using data analytics. This includes data mining to analyze and utilize the patterns and relationships found in the data in order to make improved clinical and other health-related decisions [6].

Healthcare providers can use data mining to uncover previously unknown patterns in vast data stores and then use this information to build predictive models. By examining and analyzing stored patient data, expert data miners can uncover important trends. Data mining techniques are becoming increasingly popular because they offer benefits to patients, healthcare providers, the healthcare industry, researchers and health insurers. Providers can use data analysis to identify effective treatments and best practices. Data mining can analyze patient records to compare causes and symptoms in order to provide effective treatments. It can also identify clinical best practices to help in developing guidelines and standards of care. Patients can receive better, more affordable healthcare services. This holds only when healthcare managers use data mining applications to identify chronic diseases and high-risk patients. These applications are used to design appropriate interventions and to reduce the number of hospital admissions and claims.

III. DATA MINING IN HEALTHCARE APPLICATIONS

Data mining techniques can help healthcare in various ways, including four important applications: detection of fraud and abuse, customer relationship management, healthcare management and treatment effectiveness.

Detection of fraud and abuse: Fraud and abuse in healthcare means claiming payment from healthcare programs through falsified statements. It includes intentionally soliciting, paying and accepting payment to induce referrals for services reimbursed by healthcare programs. Statistical methods can be used to identify general patterns of suspicious transactions in healthcare data sets. Data mining techniques are among the most valuable tools the healthcare industry can use to detect fraud and abuse. They can help to discover unusual patterns in healthcare in order to detect fraud and anomalies in health insurance and administration [5].

Customer relationship management: Customer relationship management is a basic approach for handling interactions between healthcare organizations and their customers. The points of contact with healthcare customers include physicians' offices, billing departments, inpatient and outpatient settings, healthcare call centers and ambulatory care settings [6]. The customers could be patients, pharmacists, physicians or clinics. The identification of usage and purchase patterns can be used to improve overall customer satisfaction. Medicine companies can profit from healthcare customer relationship management and data mining. Data mining techniques can help to track which physicians prescribe which drugs, which helps medicine companies decide whom to target.

Healthcare management: Data mining for healthcare management is an emerging field that draws researchers from both industry and academia, who recognize its potential to improve healthcare by discovering patterns and trends of diagnosis and treatment in the large amounts of complex data generated by healthcare transactions. Data mining has supported several aspects of healthcare management, including diagnosis of diseases, decision-making for treatments, detection of fraud in the medical field, fault detection in medical devices, healthcare quality improvement strategies and privacy. Data mining also helps to discover interesting business insights that support better business decisions, which can affect the cost efficiency of the business while retaining a high quality of care. Healthcare management faces the challenge of managing the resources of a hospital. This problem can be handled by using a data mining model. Fitness reports and demographic details of patients are also useful for utilizing the available hospital resources effectively [5]. One of the objectives of healthcare management is to improve the quality of healthcare while providing quality healthcare services at lower cost and reducing the patient's length of stay in hospital, as well as to manage diseases efficiently and improve outcomes by applying data mining techniques.

Treatment effectiveness: Every hospital collects a large amount of data about patients at the time of admission and during treatment. These data are stored in the form of Electronic Health Records. From these records, data mining helps physicians to discover disease patterns faster and prescribe better treatments. Data mining helps in identifying the best treatment for specific diseases by comparing the causes and symptoms of diseases.

IV. CHALLENGES

The healthcare industry generates a huge amount of data, which makes obtaining high-quality, relevant healthcare data a major challenge for data mining. It is hard to get accurate and complete healthcare data. Healthcare data are diverse and complex in nature because they are collected from different sources, such as conversations with patients, physicians' reviews and laboratory reports. It is necessary to maintain the quality of the data from the healthcare provider, because this data is useful for delivering cost-effective treatments to patients. Every healthcare organization collects a minimum set of data from each patient. This data contains missing, noisy and incomplete values. Data cleaning has to be done before applying any data mining technique in order to achieve better results [1], because the quality of data mining results depends on the quality of the data. The other problem with data mining is to know
the knowledge of applying data mining techniques on healthcare data.

V. DATA MINING TECHNIQUES IN HEALTHCARE

Before applying data mining techniques to healthcare data, researchers must know what kinds of data mining techniques exist and how these techniques work. Even though data mining comprises many techniques, only a few of them are used successfully in the healthcare field.

Classification

Classification is the process of finding a model that describes and distinguishes data classes and predicts the target class label. The resulting model is constructed based on the analysis of a set of training data [7]. Classification is a supervised learning method that divides the dataset into predefined categorical class labels. The "class" label is the deciding attribute in a data set, and the remaining attributes are used as inputs to the predictive model when testing future data. Classification is a two-step process. In the first step, a set of data is given as a training set to construct the classification model, called a classifier, which consists of classification rules. For example: IF age = senior AND income = high THEN loan_decision = safe. Classification rules are not necessarily 100% true; generally, rules with 90-95% accuracy are considered solid rules. In the second step, a set of data is used to examine the classification model for prediction accuracy and efficiency [1].

Classification is the basic data mining technique used in the healthcare domain. For example, it can help to classify a patient's syndrome as high, medium or low risk. Various classification algorithms are applied to obtain the required results from healthcare data. The widely used classification algorithms in healthcare include the decision tree, naïve Bayesian classifier, neural network, Support Vector Machine and K-Nearest Neighbor classifier.

A classification method succeeds in selecting the best attribute values to classify a given data set because, before any classification algorithm is applied, redundant and irrelevant attributes are identified using statistical methods such as correlation analysis. In other words, feature selection methods are widely used to identify the attributes relevant to the class. These approaches help to improve the accuracy and efficiency of classification. The cross-validation method is also used to assess and help improve the classification accuracy [6].

Clustering

Clustering is the process of grouping data into clusters or classes [7]. Clustering is an unsupervised learning method and, unlike classification, it has no predefined classes. It groups the data set based on similar features of the objects. The similarity of objects is measured by their attribute values. Objects within the same cluster have high similarity to one another compared with objects belonging to other clusters. Clustering techniques are categorized into the following types.

Partitioned clustering divides a huge data set into a predefined number of clusters. Based on the choice of cluster centroid and similarity measure, partitioned clustering is categorized into K-means and K-medoid algorithms. The K-means algorithm is used in healthcare to detect the pattern of disease spread in various places [4] and to cluster heart disease patients based on high blood pressure and cholesterol levels.

Hierarchical clustering decomposes a huge data set hierarchically into smaller subsets until a termination condition is reached. It takes two forms: (i) the agglomerative (bottom-up) approach and (ii) the divisive (top-down) approach. Belciug [6] has worked on grouping patients according to their length of stay in the hospital using a hierarchical clustering algorithm.

Density based clustering groups the data set based on the notion of density of objects. It discovers clusters of arbitrary shape instead of being restricted to spherical shapes as partitioned and hierarchical clustering are. For example, it can discover a pattern in an image of skin and separate a wound from healthy skin.

Clustering is generally done when very little or no information about the data is available. Grouping the objects into a number of clusters helps in understanding the data. Almost every clustering algorithm can only handle numeric data. However, most healthcare databases have a number of categorical features, and converting categorical data into numerical data causes problems for clustering. Very few clustering algorithms handle categorical data efficiently.

Association rules

Association is a method that discovers frequent patterns of items in a collection. The relationships between co-occurring items are expressed as association rules. This helps in business analysis such as market basket analysis, which finds frequently purchased items and relationships among a set of items in the market. For example, if a customer has a computer, then the chance of that customer buying antivirus software is high [1]. Association is one of the most important data mining techniques used in the healthcare field to detect relationships among various diseases and drugs. Association is also used to construct classifiers by discovering disease patterns, and it is helpful for identifying fraudulent patterns in healthcare. The most widely used algorithm for association is the Apriori algorithm, which finds the relationships between items in a data set. Hemlata Sharma and Pallavi Sharma used the Apriori algorithm to find the relationships between 5 diseases spread across various states of India [4]. To improve association rule mining, one can apply multi-level association.

Regression

Regression is a statistical analysis method for finding the functions that explain the relationship among different variables. A regression model is constructed using a training dataset. In this modeling, two types of variables are used; they
are (i) the dependent variable and (ii) the independent variable. Only one dependent variable exists in the model, but there may be one or more independent variables. Depending on the form of the relationship between the variables, regression is of two kinds: linear and non-linear. Regression methods are used in the healthcare field to predict the diseases of a patient. Both classification and regression predict a value from the attributes of the data: classification is used where the attribute to be predicted is categorical, whereas regression is used where it is continuous. The logistic regression method is widely used in the medical field to find the relative risk of conditions such as diabetes and pressure ulcers [1].

VI. LITERATURE REVIEW

A lot of work has been done related to healthcare services using data mining techniques.

In 2012, Dipti Patil et al. [8] proposed a parameter-free data mining approach for healthcare applications. In this paper, the authors worked on clustering patients according to their health status. The authors conducted experiments on data collected from Switzerland, consisting of 107 instances and 14 attributes. K-means and a data stream clustering algorithm were used to cluster the healthcare data. These clustering algorithms achieved 83.18% and 87.85% prediction accuracy, respectively.

In 2013, Fanil V. Gada, Kartik R. Yadav and Rupen D. Shah [3] proposed work on better detection of disease using data mining. In this work, the researchers used two methods to detect diseases. One of these is association rules, which are used to extract the transitive associations between diseases and symptoms. This helps to find the probability of occurrence of a disease and leads to better treatment. The other method, "TF-IDF", is used to identify the disease that is most likely to be present.

In 2013, Johannes K. Chiang and Sheng Yin Huang [11] worked on multidimensional data mining for healthcare service portfolio management. The main objective of their work is to discover the association rules that hold the useful data elements belonging to the healthcare domain. The authors used the Apriori algorithm on a data set of size 40K to achieve their goal and measured the performance using an entropy function. This article is related to customer relationship management.

In 2014, Dheeraj Raju, Xiaogang Su, Patricia A. Patrician, Lori A. Loan and Mary S. McCarthy [2] worked on a data mining approach for exploring factors associated with pressure ulcers. They took the full data set, consisting of 1653 patient records, and categorized the records based on demographics, lab values and Braden subscales. They then applied four important data mining models, namely Logistic Regression, Decision tree, Random Forest and Multivariate adaptive regression splines (MARS), and compared these techniques using C-statistic values. According to the C-statistic, the techniques achieved Random Forest 0.83, Logistic Regression 0.82, MARS 0.78 and Decision tree 0.63, and the authors concluded that the random forest model is among the best.

In 2014, Hemlata Sharma and Pallavi Sharma [4] proposed work on the application of data mining to detecting patterns of disease spread in various states of India. For this research, the authors took their data from INDIA STAT and conducted experiments on 5 diseases, namely “Hepatitis”, “Dengue”, “Acute respiratory”, “Kala-azar” and “Encephalitis”, across 29 states and union territories of India. The researchers used the K-means technique to discover similar patterns of diseases within clusters and rule-based classification (the Apriori algorithm) to find the relationships between the 5 diseases. The experiment was conducted using the TANAGRA tool. From this experiment, the researchers concluded that, based on the existence of diseases and the rate of mortality, states such as Himachal Pradesh, Nagaland and Sikkim are in the same cluster; Jharkhand, Karnataka, Maharashtra, Kerala, Madhya Pradesh, Manipur and Mizoram belong to another cluster; and the rest of the states are in different clusters.

In 2015, Umair Shafique, Fiaz Majeed, Haseeb Quaiser and Irfan Ul Mustafa [10] worked on heart disease to find interesting patterns in the data of heart patients. The authors collected the heart disease datasets available online from the University of California, Irvine (UCI) Machine Learning Repository. Only 14 of the 76 attributes, across 597 records, were selected by applying an attribute selection method in the WEKA tool to the different data. The authors chose classification as the data mining technique and implemented it with the following algorithms: Neural Network, Decision tree and Naïve Bayes. They conducted 4 experiments and achieved 82.914% accuracy with the Naïve Bayes classification algorithm.

In 2015, Srideivanai Nagarajan and R. M. Chandrasekaran [9] worked on diagnosing diabetes using data mining techniques. The researchers conducted experiments on 650 patient records collected from different sources. First, the records were successfully clustered into 3 types using the K-means algorithm. Afterwards, the data were classified into mild, moderate and severe types based on the level of risk, using classification algorithms such as naïve Bayes, Random tree, simple CART and simple Logistic. The accuracy values of these algorithms were 0.989, 0.975, 1 and 1, respectively.

VII. SUMMARY OF LITERATURE REVIEW

Many researchers have worked toward better healthcare service using popular data mining techniques and

