
Hidden Ciphertext Policy Attribute Based Encryption

Under Standard Assumptions


A main project report submitted to
Jawaharlal Nehru Technological University, Kakinada
in partial fulfillment of the requirements for the award of the
Degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING

Submitted by

AMBALA.MOUNIKA (19491A0505)
MUKKA.MANASA MANVITHA (19491A05F6)
SHAIK.SHALIMA (19491A05L7)
JAJULA.ANIL (19491B0523)

Under the Noble Guidance of:

Ms. T. Jayasri,
Assistant Professor, Department of CSE - QISCET

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

QIS COLLEGE OF ENGINEERING & TECHNOLOGY


(AUTONOMOUS)
(Approved by AICTE | Permanent Affiliation: JNTU-Kakinada | UGC-
Recognized) (Accredited by NBA | Accredited by NAAC | ISO
9001:2015 Certified)
VENGAMUKKALAPALEM, ONGOLE - 523272, A.P.
2019-2023
QIS COLLEGE OF ENGINEERING & TECHNOLOGY
(Approved by AICTE | Permanent Affiliation: JNTU-Kakinada | UGC-
Recognized) (Accredited by NBA | Accredited by NAAC | ISO
9001:2015 Certified)
VENGAMUKKALAPALEM, ONGOLE-523272, A.P

DEPARTMENT
OF
COMPUTER SCIENCE & ENGINEERING

BONAFIDE CERTIFICATE

This is to certify that the main project entitled Hidden Ciphertext Policy
Attribute Based Encryption Under Standard Assumptions is a bonafide
work of AMBALA.MOUNIKA (19491A0505), MUKKA.MANASA
MANVITHA (19491A05F6), SHAIK.SHALIMA (19491A05L7), and
JAJULA.ANIL (19491B0523),
in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in COMPUTER SCIENCE & ENGINEERING
for the academic year 2019-2023. This work was done under my
supervision and guidance.

Signature of the Guide                      Signature of the Head of the Department
Ms. T. Jayasri                              Dr. Y. Narasimha Rao, M.Tech., Ph.D.
Assistant Professor, CSE - QISCET

Signature of the External Examiner
ACKNOWLEDGMENT

“Task successful” makes everyone happy. But the happiness will be gold without
glitter if we do not mention the people who have supported us in making it a success.
We would like to place on record our deep sense of gratitude to the Hon’ble
Secretary & Correspondent Sri. N. SURYA KALYAN CHAKRAVARTHY
GARU, QIS Group of Institutions, Ongole, for providing the facilities necessary to
carry out the project work.
We express our gratitude to the Hon’ble Chairman Sri. N. NAGESWARA RAO
GARU, QIS Group of Institutions, Ongole, for his valuable suggestions and
advice during the B.Tech course.
We express our gratitude to Dr. C. V. SUBBARAO GARU, Ph.D., Principal of
QIS College of Engineering & Technology, Ongole, for his valuable
suggestions and advice during the B.Tech course.
We express our gratitude to the Head of the Department of CSE, Dr. Y.
NARASIMHA RAO GARU, M.Tech., Ph.D., QIS College of Engineering
& Technology, Ongole, for his constant supervision, guidance and co-operation
throughout the project.

We would like to express our thanks to our project guide Ms.T.Jayasri,


Assistant Professor - CSE, QIS College of Engineering & Technology,
Ongole, for her constant motivation and valuable help throughout the project
work.
Finally, we would like to thank our parents, family and friends for their co-
operation in completing this project.
TEAM MEMBERS

AMBALA.MOUNIKA (19491A0505)
MUKKA.MANASA MANVITHA (19491A05F6)
SHAIK.SHALIMA (19491A05L7)
JAJULA.ANIL (19491B0523)

DECLARATION
We hereby declare that the project work entitled “HIDDEN CIPHERTEXT
POLICY ATTRIBUTE BASED ENCRYPTION UNDER STANDARD
ASSUMPTIONS”, done under the guidance of Ms.T.Jayasri, Assistant
Professor - CSE, and being submitted to the Department of Computer Science &
Engineering, QIS College of Engineering & Technology, Ongole, is our own work
and has not been submitted to any other university or educational institution for
any degree.

Team Members

AMBALA.MOUNIKA (19491A0505)
MUKKA.MANASA MANVITHA (19491A05F6)
SHAIK.SHALIMA (19491A05L7)
JAJULA.ANIL (19491B0523)

Hidden Ciphertext Policy Attribute Based Encryption Under


Standard Assumptions

ABSTRACT

We propose two new ciphertext policy attribute based encryption (CP-ABE) schemes
where the access policy is defined by AND-gate with wildcard. In the first scheme, we present
a new technique that uses only one group element to represent an attribute, while the existing
ABE schemes of the same type need to use three different group elements to represent an
attribute for the three possible values (namely, positive, negative, and wildcard). Our new
technique leads to a new CP-ABE scheme with constant ciphertext size, which, however,
cannot hide the access policy used for encryption. The main contribution of this paper is to
propose a new CP-ABE scheme with the property of hidden access policy by extending the
technique we used in the construction of our first scheme. In particular, we show a way to
bridge ABE based on AND-gate with wildcard with inner product encryption and then use the
latter to achieve the goal of hidden access policy. We prove that our second scheme is secure
under the standard decisional linear and decisional bilinear Diffie–Hellman assumptions.

Contents
Abstract....................................................................................................................................................................... 1
List of Figures............................................................................................................................................................. 4
CHAPTER I – INTRODUCTION...............................................................................................................................5
1.1 General Terms...................................................................................................................................................5
Requirements and Installation.............................................................................................................................5
Managing Packages.............................................................................................................................................6
Machine Learning................................................................................................................................................7
SciKit-Learn........................................................................................................................................................8
Clustering............................................................................................................................................................8
Classification.......................................................................................................................................................9
Dimensionality Reduction.................................................................................................................................10
NEURAL NETWORKS AND DEEP LEARNING........................................................................12
1.2 PROBLEM STATEMENT:.............................................................................................................................14
1.3 EXISTING SYSTEM......................................................................................................................................15

1.3.1 DISADVANTAGES OF EXISTING SYSTEM......................................................................15
1.3.2 LITERATURE REVIEW.........................................................................................................15
1.4 PROPOSED SYSTEM:...................................................................................................................................18

1.4.1 ADVANTAGES OF PROPOSED SYSTEM:............................................................................................18


Chapter 2 - SYSTEM SPECIFICATION..................................................................................................................20
2.1 HARDWARE REQUIREMENTS...................................................................................................................20
2.2 SOFTWARE REQUIREMENTS....................................................................................................................20
CHAPTER 3 PROJECT DESCRIPTION.................................................................................................................21
3.1 PROJECT INTRODUCTION.........................................................................................................................21
Chapter 4 – FUNCTIONAL DESIGN.......................................................................................................................22
4.1 PROPOSED ARCHITECTURE......................................................................................................................22
4.2 Front End Module Diagrams:..........................................................................................................................22
4.3 Use Case Diagram...........................................................................................................................................23
4.4 State Diagram..................................................................................................................................................24
4.5 Machine Learning Project Cycle:....................................................................................................................25
4.6 Data Collection:...............................................................................................................................................26
4.7 Data Preprocessing:.........................................................................................................................................27
4.8 Text or Categorical data Featurization:............................................................................................................28
4.9 Data Splitting:..................................................................................................................................................29
4.10 Model Selection:............................................................................................................................................30
4.11 Hyper Parameter Tuning:..............................................................................................................................31
4.12 Model Evaluations:........................................................................................................................................32
CHAPTER 5 - MODULES.......................................................................................................................................33
5.1 Dataset preparation and preprocessing:...........................................................................................................33
5.2 Data collection:................................................................................................................33
5.3 Data preprocessing:.........................................................................................................................................34
5.4 Data formatting:..............................................................................................................................................34
5.5 Data cleaning:..................................................................................................................................................34
5.6 Data anonymization:........................................................................................................................................34
5.7 Data sampling:.................................................................................................................................................34
5.8 Featurization:...................................................................................................................................................35
5.9 Data splitting:..................................................................................................................................................35
5.10 Modeling:......................................................................................................................................................35
5.10.1 Model training:.......................................................................................................................................35
5.11 Hyper parameter Tuning:...............................................................................................................................36
5.12 Model Testing:..............................................................................................................................................37
5.13 Applying Machine Learning:.........................................................................................................................37
5.13.1 Random Forest:..........................................................................................................................................37
5.13.2 SVM Algorithm..........................................................................................................................................39
What are Support Vector Machines?.................................................................................................................39
Support Vector Machine for Multi-Class Problems:..........................................................................................39
SVM for complex (Non Linearly Separable):....................................................................................................39
Important Parameters in Kernelized SVC ( Support Vector Classifier).............................................................41
Pros of Kernelized SVM....................................................................................................................................41
Cons of Kernelized SVM...................................................................................................41
CHAPTER 6 : SOFTWARE DETAILS....................................................................................................................43

6.1 GENERAL ANACONDA...............................................................................................................................43


6.2 PYTHON.........................................................................................................................................................46
CHAPTER 7 IMPLEMENTATION.........................................................................................................................48
7.1 GENERAL......................................................................................................................................................49
7.2 Code................................................................................................................................................................ 49
CHAPTER 8 - OUTPUT SNAPSHOTS...................................................................................................................64
CHAPTER 9 CONCLUSION...................................................................................................................................73
REFERENCE:...........................................................................................................................................................74

List of Figures

1.1 Useful commands 3


1.2 Scatter plot 5
1.3 Dendrogram Diagramatic Representation 6
1.4 Classification using Clustering 9
1.5 Logistic Regression 10
1.6 AUC Curve 12
4.1 Architecture diagram 21
4.2 Front end module diagram 22
4.3 Usecase diagram 23
4.4 State diagram 24
4.5 Machine learning project life cycle 25
4.6 Data collection 26
4.7 Data preprocessing process 27
4.8 Data Featurization 28
4.9 Data Splitting 29
4.10 Model Selection 30
4.11 Parameter tuning 31
4.12 Model Evaluation 32
4.13 Random Forest Diagram 40
6.1 Anaconda Navigator 47
6.2 Visual studio projects 49
8.1 Count plot 1 69
8.2 Count plot 2 69
8.3 Count plot 3 70
8.4 Count plot 4 70
8.5 Density vs TrestBPS 71
8.6 Density vs Age 72
8.7 Density vs cholesterol 73
8.8 Density vs thalassemia 74
8.9 AUC vs Hyper parameter KNN 74
8.10 AUC vs Hyper parameter log Regression 75
8.11 AUC vs Hyper parameter linear svm 75
8.12 AUC vs Hyper parameter RBF svm 76
8.13 Decision Tree HeatMap 76
8.14 Random Forest Heatmap 77
8.15 XGBoost Cross-Validation heatmap 77
8.16 XGBoost training data heatmap 78
8.17 Accuracy table of algorithms 78

CHAPTER I – INTRODUCTION

What is Secure Computing?

Computer security (also known as cyber security or IT security) is information security as


applied to computers and networks. The field covers all the processes and mechanisms by
which computer-based equipment, information and services are protected from unintended or
unauthorized access, change or destruction. Computer security also includes protection from
unplanned events and natural disasters. In the computer industry, the term security
-- or the phrase computer security -- refers to techniques for ensuring that data stored in
a computer cannot be read or compromised by any individual without authorization. Most
computer security measures involve data encryption and passwords. Data encryption is the
translation of data into a form that is unintelligible without a deciphering mechanism.
A password is a secret word or phrase that gives a user access to a
particular program or system.

Figure: An overview of secure computing.


Working conditions and basic needs in secure computing:
If you don't take basic steps to protect your work computer, you put it and all the information
on it at risk. You can potentially compromise the operation of other computers on your
organization's network, or even the functioning of the network as a whole.

1. Physical security:

Technical measures like login passwords and anti-virus software are essential. (More about
those below.) However, a secure physical space is the first and most important line of defense.

Is the place you keep your workplace computer secure enough to prevent theft or access to it
while you are away? While the Security Department provides coverage across the Medical
Center, it only takes seconds to steal a computer, particularly a portable device like a laptop or
a PDA. A computer should be secured like any other valuable possession when you are not
present.

Human threats are not the only concern.  Computers can be compromised by environmental
mishaps (e.g., water, coffee) or physical trauma.  Make sure the physical location
of your computer takes account of those risks as well.   

2. Access passwords:

The University's networks and shared information systems are protected in part by login
credentials (user-IDs and passwords).  Access passwords are also an essential protection for
personal computers in most circumstances.  Offices are usually open and shared spaces, so
physical access to computers cannot be completely controlled.

To protect your computer, you should consider setting passwords for particularly


sensitive applications resident on the computer (e.g., data analysis software), if the software
provides that capability. 

3. Prying eye protection:

Because we deal with all facets of clinical, research, educational and administrative data here
on the medical campus, it is important to do everything possible to minimize exposure of data
to unauthorized individuals. 

4. Anti-virus software:

Up-to-date, properly configured anti-virus software is essential. While we have server-side
anti-virus software on our network computers, you still need it on the client side (your
computer).

5. Firewalls:

Anti-virus products inspect files on your computer and in email.  Firewall software and
hardware monitor communications between your computer and the outside world.  That is
essential for any networked computer.

6. Software updates:

It is critical to keep software up to date, especially the operating system, anti-virus and anti-
spyware, email and browser software.   The newest versions will contain fixes for discovered
vulnerabilities.

Almost all anti-virus products have automatic update features (including SAV). Keeping the
"signatures" (digital patterns) of malicious software detectors up to date is essential for these
products to be effective.

7. Keep secure backups:

Even if you take all these security steps, bad things can still happen.   Be prepared for the
worst by making backup copies of critical data, and keeping those backup copies in a separate,
secure location.  For example, use supplemental hard drives, CDs/DVDs, or flash drives to
store critical, hard-to-replace data.  

8. Report problems:

If you believe that your computer or any data on it has been compromised, you should file
an information security incident report. That is required by University policy for all data on
our systems, and legally required for health, education, financial and any other kind of record
containing identifiable personal information.

Benefits of secure computing:


 Protect your civil liability:
You may be held legally liable to compensate a third party should they experience
financial damage or distress as a result of their personal data being stolen from you or
leaked by you.
 Protect your credibility (compliance):
You may be required to comply with the Data Protection Act, the FSA, SOX or other
regulatory standards. Each of these bodies stipulates that certain measures be taken to
protect the data on your network.
 Protect your reputation (spam):
A common use for infected systems is to join them to a botnet (a collection of infected
machines which takes orders from a command server) and use them to send out spam. This
spam can be traced back to you, your server could be blacklisted and you could be unable
to send email.
 Protect your income (competitive advantage):
There are a number of “hackers-for-hire” advertising their services on the internet, selling
their skills in breaking into company servers to steal client databases, proprietary
software, merger and acquisition information, personnel details and so on.
 Protect your business (blackmail):
A seldom-reported source of income for “hackers” is to break into your server, change all
your passwords and lock you out of it. The password is then sold back to you. Note: the
“hackers” may implant a backdoor program on your server so that they can repeat the
exercise at will.
 Protect your investment (free storage):
Your server’s hard drive space is used (or sold on) to house the hacker's video clips, music
collections, pirated software or worse. Your server or computer then becomes continuously
slow and your internet connection speeds deteriorate due to the number of people
connecting to your server in order to download the offered wares.

SYSTEM ANALYSIS
EXISTING SYSTEM:
 In a CP-ABE, the user’s attributes used for key generation must satisfy the access policy
used for encryption in order to decrypt the ciphertext, while in a KP-ABE, the user can
only decrypt ciphertexts whose attributes satisfy the policy embedded in the key. We can
see that access control is an inherent feature of ABE, and by using some expressive
access structures, we can effectively achieve fine-grained access control.
 The fuzzy IBE given by Sahai and Waters, which can be treated as the first KP-ABE,
used a specific threshold access policy.
 Later, the Linear Secret Sharing Scheme (LSSS) realizable (or monotone) access
structure has been adopted by many subsequent ABE schemes.

 Cheung and Newport proposed another way to define access structure using AND-Gate
with wildcard. Cheung and Newport showed that by using this simple access structure,
which is sufficient for many applications, CP-ABE schemes can be constructed based on
standard complexity assumptions.
 Subsequently, several ABE schemes were proposed following this specific access
structure.

DISADVANTAGES OF EXISTING SYSTEM:


 The existing ABE schemes based on AND-Gate with wildcard cannot hide the access
policy used for encryption.
 Although a secure ABE can well protect the secrecy of the encrypted data against
unauthorised access, it does not protect the privacy of the receivers/decryptors by
default. That is, given the ciphertext, an unauthorised user may still be able to obtain
some information about the data recipients. For example, a health organization may want
to send a message to all the patients who carry certain diseases. Then the attribute universe
will contain all the diseases, and an access policy will have the format “++−∗∗+. . .”
where “+” (“−”) indicates positive (negative) for a particular disease. A minimal
satisfaction check for this notation is sketched after this list.
 If a CP-ABE cannot hide the access policy, then from whether a person can decrypt
the message or not, people can directly learn some sensitive information about the
user. Therefore, it is also very important to hide the access policy in such applications.
However, most of the existing ABE schemes based on AND-Gate with wildcard cannot
achieve this property.

PROPOSED SYSTEM:
 In this work, we explore new techniques for the construction of CP-ABE schemes based
on the AND-gate with wildcard access structure. The existing schemes of this type need
to use three different elements to represent the three possible values – positive, negative,
and wildcard – of an attribute in the access structure.
 In this paper, we propose a new construction which uses only one element to represent
one attribute. The main idea behind our construction is to use the “positions” of different
symbols to perform the matching between the access policy and user attributes.
 Specifically, we put the indices of all the positive, negative and wildcard attributes
defined in an access structure into three sets, and by using the technique of Viète’s
formulas, we allow the decryptor to remove all the wildcard positions and perform the
decryption correctly if and only if the remaining user attributes match those defined in
the access structure. A minimal numeric illustration of this cancellation idea appears
after this list.
 We further study the problem of hiding the access policy for CP-ABE based on AND-
Gate with wildcard. As the main contribution of this work, we extend the technique we
have used in the first construction to bridge ABE based on AND-Gate with wildcard
with Inner Product Encryption (IPE).
 Specifically, we present a way to convert an access policy containing positive, negative,
and wildcard symbols into a vector X, which is used for encryption, and the user’s
attributes containing positive and negative symbols into another vector Y, which is used
in key generation, and then apply the technique of IPE to do the encryption.
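The position-cancellation idea described above can be illustrated with a short, hedged sketch.
It builds the polynomial whose roots are the wildcard positions (its coefficients are exactly
what Viète’s formulas give) and shows that summing the polynomial’s values over a user's
positive positions silently drops every wildcard position. The policy and positions below are
hypothetical, and the arithmetic is done over plain integers rather than in the exponent of a
bilinear group, so this demonstrates only the cancellation step of the construction, not the full
scheme or its security.

import java.util.Arrays;
import java.util.List;

// Toy illustration of wildcard cancellation (not the pairing-based scheme).
// P(x) = product over wildcard positions w of (x - w); P vanishes on every
// wildcard, so wildcards contribute nothing when P is summed over a user's positions.
public class WildcardCancellationDemo {

    // Coefficients of P(x), lowest degree first (Viete's formulas in expanded form).
    static long[] coefficients(List<Integer> wildcardPositions) {
        long[] c = new long[] { 1 };                      // start with P(x) = 1
        for (int w : wildcardPositions) {
            long[] next = new long[c.length + 1];
            for (int k = 0; k < c.length; k++) {
                next[k] += -(long) w * c[k];              // multiply by (x - w)
                next[k + 1] += c[k];
            }
            c = next;
        }
        return c;
    }

    static long evalP(long[] c, int x) {
        long value = 0, power = 1;
        for (long coeff : c) { value += coeff * power; power *= x; }
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical 6-attribute policy "++-**+": wildcards at positions 4 and 5.
        List<Integer> wildcards = Arrays.asList(4, 5);
        List<Integer> policyPositives = Arrays.asList(1, 2, 6);
        // A user whose positive attributes sit at positions 1, 2, 4 and 6
        // (position 4 is a wildcard in the policy, so it must not matter).
        List<Integer> userPositives = Arrays.asList(1, 2, 4, 6);

        long[] c = coefficients(wildcards);
        long sumUser = userPositives.stream().mapToLong(i -> evalP(c, i)).sum();
        long sumPolicy = policyPositives.stream().mapToLong(i -> evalP(c, i)).sum();

        // P(4) = 0, so the wildcard position drops out and the two sums agree.
        System.out.println(sumUser + " == " + sumPolicy + " ? " + (sumUser == sumPolicy));
    }
}

In the actual construction this comparison happens component-wise in the exponent of a
bilinear group, so coincidental collisions between different attribute sets are not an issue the
way they could be in this integer toy.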

ADVANTAGES OF PROPOSED SYSTEM:


 Our new technique leads to a new CP-ABE scheme with constant ciphertext size.
 The technique used in the first construction is extended to bridge ABE based on AND-Gate
with wildcard with Inner Product Encryption (IPE).
 Our first scheme achieves constant ciphertext size.
 Our second scheme is secure under the Decisional Bilinear Diffie-Hellman and the
Decisional Linear assumptions.

LITERATURE SURVEY

1) Generalized key delegation for wildcarded identity-based and inner-product


encryption

AUTHORS: M. Abdalla, A. De Caro, and D. H. Phan

Inspired by the fact that many e-mail addresses correspond to groups of users, Abdalla et al.
introduced the notion of identity-based encryption with wildcards (WIBE), which allows a
sender to simultaneously encrypt messages to a group of users matching a certain pattern,
defined as a sequence of identity strings and wildcards. This notion was later generalized by
Abdalla, Kiltz, and Neven, who considered more general delegation patterns during the key
derivation process. Despite its many applications, current constructions have two significant
limitations: 1) they are only known to be fully secure when the maximum hierarchy depth is a
constant; and 2) they do not hide the pattern associated with the ciphertext. To overcome these,
this paper offers two new constructions. First, we show how to convert a WIBE scheme of
Abdalla into a (nonanonymous) WIBE scheme with generalized key delegation (WW-IBE)
that is fully secure even for polynomially many levels. Then, to achieve anonymity, we
initially consider hierarchical predicate encryption (HPE) schemes with more generalized
forms of key delegation and use them to construct an anonymous WW-IBE scheme. Finally, to
instantiate the former, we modify the HPE scheme of Lewko to allow for more general key
delegation patterns. Our proofs are in the standard model and use existing complexity
assumptions.

2) Expressive key-policy attribute-based encryption with constant-size ciphertexts

AUTHORS: N. Attrapadung, B. Libert, and E. de Panafieu

Attribute-based encryption (ABE), as introduced by Sahai and Waters, allows for fine-grained
access control on encrypted data. In its key-policy flavor, the primitive enables senders to
encrypt messages under a set of attributes and private keys are associated with access
structures that specify which ciphertexts the key holder will be allowed to decrypt. In most
ABE systems, the ciphertext size grows linearly with the number of ciphertext attributes and
the only known exceptions only support restricted forms of threshold access policies.
This paper proposes the first key-policy attribute-based encryption (KP-ABE) schemes
allowing for non-monotonic access structures (i.e., that may contain negated attributes) and
with constant ciphertext size. Towards achieving this goal, we first show that a certain class of
identity-based broadcast encryption schemes generically yields monotonic KP-ABE systems in
the selective set model. We then describe a new efficient identity-based revocation mechanism
that, when combined with a particular instantiation of our general monotonic construction,
gives rise to the first truly expressive KP-ABE realization with constant-size ciphertexts. The
downside of these new constructions is that private keys have quadratic size in the number of
attributes. On the other hand, they reduce the number of pairing evaluations to a constant,
which appears to be a unique feature among expressive KP-ABE schemes.

3) Ciphertext-policy attribute based encryption

AUTHORS: J. Bethencourt, A. Sahai, and B. Waters

In several distributed systems a user should only be able to access data if the user possesses a
certain set of credentials or attributes. Currently, the only method for enforcing such policies is
to employ a trusted server to store the data and mediate access control. However, if any server
storing the data is compromised, then the confidentiality of the data will be compromised. In
this paper we present a system for realizing complex access control on encrypted data that we
call Ciphertext-Policy Attribute-Based Encryption. By using our techniques encrypted data can
be kept confidential even if the storage server is untrusted; moreover, our methods are secure
against collusion attacks. Previous Attribute-Based Encryption systems used attributes to
describe the encrypted data and built policies into users' keys, while in our system attributes
are used to describe a user's credentials, and a party encrypting data determines a policy for
who can decrypt. Thus, our methods are conceptually closer to traditional access control
methods such as Role-Based Access Control (RBAC). In addition, we provide an
implementation of our system and give performance measurements.

4) Identity-based encryption from the Weil pairing



AUTHORS: D. Boneh and M. K. Franklin

We propose a fully functional identity-based encryption scheme (IBE). The scheme has chosen
ciphertext security in the random oracle model assuming an elliptic curve variant of the
computational Diffie-Hellman problem. Our system is based on the Weil pairing. We give
precise definitions for secure identity based encryption schemes and give several applications
for such systems.

5) Fully secure attribute-based systems with short ciphertexts/ signatures and threshold
access structures

AUTHORS: C. Chen et al.

It has been an appealing but challenging goal in research on attribute-based encryption (ABE)


and attribute-based signatures (ABS) to design a secure scheme with short ciphertexts and
signatures, respectively. While recent results show that some promising progress has been
made in this direction, they do not always offer a satisfactory level of security, i.e.
achieving selective rather than full security.
In this paper, we aim to achieve both full security and short ciphertexts/signatures
for threshold access structures in the ABE/ABS setting. Towards achieving this goal, we
propose generic property-preserving conversions from inner-product systems to attribute-based
systems. We first give concrete constructions of fully secure IPE/IPS with constant-size
ciphertexts/signatures in the composite order groups. By making use of our IPE/IPS schemes
as building blocks, we then present concrete constructions of fully secure key-policy ABE
(KP-ABE) and ciphertext-policy ABE (CP-ABE) with constant-size ciphertexts, and a fully
secure ABS with constant-size signatures with perfect privacy for threshold access structures.
These results give rise to the first constructions satisfying the aforementioned requirements.
Our schemes reduce the number of pairing evaluations to a constant, a very attractive property
for practical attribute-based systems. Furthermore, we show that our schemes can be extended
to support large attribute universes and more expressive access structures.

SYSTEM REQUIREMENTS

HARDWARE REQUIREMENTS:

• System : Pentium IV 2.4 GHz.


• Hard Disk : 40 GB.
• Floppy Drive : 1.44 MB.
• Monitor : 15" VGA Colour.
• Mouse : Logitech.
• RAM : 512 MB.

SOFTWARE REQUIREMENTS:

• Operating system : Windows XP/7.


• Coding Language : JAVA/J2EE
• Data Base : MYSQL

CHAPTER 3 PROJECT DESCRIPTION

3.1 INTRODUCTION
Java is one of the most popular and widely used programming languages and
platforms. A platform is an environment that helps to develop and run programs written in
any programming language. A platform is the hardware or software environment in which a
program runs. We’ve already mentioned some of the most popular platforms like Windows
2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the
operating system and hardware. The Java platform differs from most other platforms in that
it’s a software-only platform that runs on top of other hardware-based platforms.
Secure healthcare frameworks have been widely used.

CHAPTER 4 - SYSTEM DESIGN

SYSTEM ARCHITECTURE:

DATA FLOW DIAGRAM:

1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be
used to represent a system in terms of the input data to the system, the various processing
carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to
model the system components. These components are the system process, the data used
by the process, any external entity that interacts with the system and the information
flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a
series of transformations. It is a graphical technique that depicts information flow and
the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. DFDs may be
partitioned into levels that represent increasing information flow and functional detail.

Data flow diagram: the Owner logs in, encrypts the file using an attribute key, and uploads it to
the Cloud; the Admin activates the Owner and the User; the User logs in, verifies the attribute
key, views the files in decrypted form, and downloads them.
UML DIAGRAMS

UML stands for Unified Modeling Language. UML is a standardized general-purpose


modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-
oriented computer software. In its current form UML comprises two major components: a
Meta-model and a notation. In the future, some form of method or process may also be added
to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing,
constructing and documenting the artifacts of a software system, as well as for business
modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the
software development process. The UML uses mostly graphical notations to express the design
of software projects.

GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users with a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of OO tools market.
6. Support higher level development concepts such as collaborations, frameworks, patterns
and components.
7. Integrate best practices.

USE CASE DIAGRAM:


A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors in
the system can be depicted.

Use case diagram: the Owner, User and Admin actors interact with the system through the
Registration, Login, Activate User/Owner, File Upload, View File Details, Give Attribute Key,
View File, Download File, Owner Details, User Details and File Details use cases.

CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of
static structure diagram that describes the structure of a system by showing the system's
classes, their attributes, operations (or methods), and the relationships among the classes. It
explains which class contains information.

Class diagram: Owner and Admin classes, each exposing a Login operation.

SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram
that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event
scenarios, and timing diagrams.

Sequence diagram: the Owner uploads a file to the Cloud; the Admin activates the User and the
Owner; the User sends a request, the Owner provides the file details and an attribute key, and
the User accesses the file.

ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

Activity diagram: after Start, the Owner, Admin and User each log in; the Owner uploads a file
with attributes to the Cloud; the Admin activates the Owner and the User and views file details;
the User gives attributes, searches for the file, and accesses it.

CHAPTER 6: SOFTWARE ENVIRONMENTS

Java Technology

Java technology is both a programming language and a platform.

The Java Programming Language


The Java programming language is a high-level language that can be characterized by
all of the following buzzwords:

 Simple
 Architecture neutral
 Object oriented
 Portable
 Distributed
 High performance
 Interpreted
 Multithreaded
 Robust
 Dynamic
 Secure

With most programming languages, you either compile or interpret a program so that
you can run it on your computer. The Java programming language is unusual in that a program
is both compiled and interpreted. With the compiler, first you translate a program into an
intermediate language called Java byte codes —the platform-independent codes interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java byte code
instruction on the computer. Compilation happens just once; interpretation occurs each time
the program is executed. The following figure illustrates how this works.

You can think of Java byte codes as the machine code instructions for the Java Virtual
Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser
that can run applets, is an implementation of the Java VM. Java byte codes help make “write
once, run anywhere” possible. You can compile your program into byte codes on any platform
that has a Java compiler. The byte codes can then be run on any implementation of the Java
VM. That means that as long as a computer has a Java VM, the same program written in the
Java programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
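As a minimal, hypothetical illustration of that workflow: compiling the class below with javac
produces HelloPlatform.class (Java byte codes), and running it with java executes those byte
codes on whichever Java VM is installed. The class name and output are examples only.

// HelloPlatform.java -- a made-up example of "write once, run anywhere".
// javac HelloPlatform.java   -> HelloPlatform.class (platform-independent byte codes)
// java HelloPlatform         -> the local Java VM interprets the byte codes
public class HelloPlatform {
    public static void main(String[] args) {
        // Prints the operating system the VM happens to be running on.
        System.out.println("Running on: " + System.getProperty("os.name"));
    }
}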

The Java Platform


A platform is the hardware or software environment in which a program runs.
We’ve already mentioned some of the most popular platforms like Windows 2000,
Linux, Solaris, and MacOS. Most platforms can be described as a combination of the
operating system and hardware. The Java platform differs from most other platforms in
that it’s a software-only platform that runs on top of other hardware-based platforms.
The Java platform has two components:
 The Java Virtual Machine (Java VM)
 The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform
and is ported onto various hardware-based platforms.

The Java API is a large collection of ready-made software components that provide
many useful capabilities, such as graphical user interface (GUI) widgets. The Java API
is grouped into libraries of related classes and interfaces; these libraries are known as
packages. The next section, What Can Java Technology Do?, highlights what
functionality some of the packages in the Java API provide.
The following figure depicts a program that’s running on the Java platform. As the
figure shows, the Java API and the virtual machine insulate the program from the
hardware.

Native code is code that, once compiled, runs on a specific
hardware platform. As a platform-independent environment, the Java platform can be a
bit slower than native code. However, smart compilers, well-tuned interpreters, and just-
in-time byte code compilers can bring performance close to that of native code without
threatening portability.

What Can Java Technology Do?


The most common types of programs written in the Java programming language are
applets and applications. If you’ve surfed the Web, you’re probably already familiar
with applets. An applet is a program that adheres to certain conventions that allow it to
run within a Java-enabled browser.
However, the Java programming language is not just for writing cute, entertaining
applets for the Web. The general-purpose, high-level Java programming language is also
a powerful software platform. Using the generous API, you can write many types of
programs.
An application is a standalone program that runs directly on the Java platform. A special
kind of application known as a server serves and supports clients on a network.
Examples of servers are Web servers, proxy servers, mail servers, and print servers.
Another specialized program is a servlet. A servlet can almost be thought of as an applet
that runs on the server side. Java Servlets are a popular choice for building interactive
web applications, replacing the use of CGI scripts. Servlets are similar to applets in that
they are runtime extensions of applications. Instead of working in browsers, though,
servlets run within Java Web servers, configuring or tailoring the server.
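A hedged sketch of such a servlet is shown below; the class name and greeting are
hypothetical, and the javax.servlet API is supplied by the Java Web server (servlet container)
rather than by the core platform.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Runs inside a Java Web server rather than a browser; the container invokes
// doGet() for every matching HTTP GET request.
public class HelloServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");
        response.getWriter().println("<h1>Hello from a servlet</h1>");
    }
}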
How does the API support all these kinds of programs? It does so with packages of
software components that provide a wide range of functionality. Every full
implementation of the Java platform gives you the following features:
 The essentials: Objects, strings, threads, numbers, input and output, data
structures, system properties, date and time, and so on.
 Applets: The set of conventions used by applets.
 Networking: URLs, TCP (Transmission Control Protocol), UDP (User Data gram
Protocol) sockets, and IP (Internet Protocol) addresses.
 Internationalization: Help for writing programs that can be localized for users
worldwide. Programs can automatically adapt to specific locales and be displayed
in the appropriate language.
 Security: Both low level and high level, including electronic signatures, public
and private key management, access control, and certificates (a minimal signing
example follows this list).
 Software components: Known as JavaBeansTM, can plug into existing component
architectures.
 Object serialization: Allows lightweight persistence and communication via
Remote Method Invocation (RMI).
 Java Database Connectivity (JDBCTM): Provides uniform access to a wide range
of relational databases.
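As a small example of the security feature mentioned in the list above, the following sketch
generates a key pair and then signs and verifies a message with the standard java.security
classes; the algorithm names are standard JCA identifiers and the message and key size are
arbitrary choices.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Generate an RSA key pair, sign a message with the private key, and verify
// the signature with the public key.
public class SignatureDemo {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        KeyPair pair = generator.generateKeyPair();

        byte[] message = "attribute based encryption".getBytes();

        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(pair.getPrivate());
        signer.update(message);
        byte[] signature = signer.sign();

        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(pair.getPublic());
        verifier.update(message);
        System.out.println("signature valid? " + verifier.verify(signature));
    }
}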
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers,
collaboration, telephony, speech, animation, and more. The following figure depicts
what is included in the Java 2 SDK.

How Will Java Technology Change My Life?

We can’t promise you fame, fortune, or even a job if you learn the Java programming
language. Still, it is likely to make your programs better and requires less effort than
other languages. We believe that Java technology will help you do the following:

 Get started quickly: Although the Java programming language is a powerful


object-oriented language, it’s easy to learn, especially for programmers already
familiar with C or C++.
 Write less code: Comparisons of program metrics (class counts, method counts,
and so on) suggest that a program written in the Java programming language can
be four times smaller than the same program in C++.
 Write better code: The Java programming language encourages good coding
practices, and its garbage collection helps you avoid memory leaks. Its object
orientation, its JavaBeans component architecture, and its wide-ranging, easily
extendible API let you reuse other people’s tested code and introduce fewer bugs.
 Develop programs more quickly: Your development time may be as much as
twice as fast versus writing the same program in C++. Why? You write fewer
lines of code and it is a simpler programming language than C++.
 Avoid platform dependencies with 100% Pure Java: You can keep your
program portable by avoiding the use of libraries written in other languages. The
100% Pure JavaTM Product Certification Program has a repository of historical
process manuals, white papers, brochures, and similar materials online.

 Write once, run anywhere: Because 100% Pure Java programs are compiled into
machine-independent byte codes, they run consistently on any Java platform.
 Distribute software more easily: You can upgrade applets easily from a central
server. Applets take advantage of the feature of allowing new classes to be loaded
“on the fly,” without recompiling the entire program.
ODBC
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for
application developers and database systems providers. Before ODBC became a de facto
standard for Windows programs to interface with database systems, programmers had to use
proprietary languages for each database they wanted to connect to. Now, ODBC has made the
choice of the database system almost irrelevant from a coding perspective, which is as it should
be. Application developers have much more important things to worry about than the syntax
that is needed to port their program from one database to another when business needs
suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular
database that is associated with a data source that an ODBC application program is written to
use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a
particular database. For example, the data source named Sales Figures might be a SQL Server
database, whereas the Accounts Payable data source could refer to an Access database. The
physical database referred to by a data source can reside anywhere on the LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they
are installed when you setup a separate database application, such as SQL Server Client or
Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called
ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-
alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program
and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be
written to use the same set of function calls to interface with any data source, regardless of the
database vendor. The source code of the application doesn’t change whether it talks to Oracle
or SQL Server. We only mention these two as an example. There are ODBC drivers available
for several dozen popular database systems. Even Excel spreadsheets and plain text files can be
turned into data sources. The operating system uses the Registry information written by ODBC
Administrator to determine which low-level ODBC drivers are needed to talk to the data
source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is
transparent to the ODBC application program. In a client/server environment, the ODBC API
even handles many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there
must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking
directly to the native database interface. ODBC has had many detractors make the charge that
it is too slow. Microsoft has always claimed that the critical factor in performance is the quality
of the driver software that is used. In our humble opinion, this is true. The availability of good
ODBC drivers has improved a great deal recently. And anyway, the criticism about
performance is somewhat analogous to those who said that compilers would never match the
speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the
opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers
get faster every year.

JDBC
In an effort to set an independent database standard API for Java, Sun Microsystems
developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access
mechanism that provides a consistent interface to a variety of RDBMSs. This consistent
interface is achieved through the use of “plug-in” database connectivity modules, or drivers. If
a database vendor wishes to have JDBC support, he or she must provide the driver for each
platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As you
discovered earlier in this chapter, ODBC has widespread support on a variety of platforms.
Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much faster than
developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review that
ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon
after.

The remainder of this section will cover enough information about JDBC for you to know what
it is about and how to use it effectively. This is by no means a complete overview of JDBC.
That would fill an entire book.

JDBC Goals
Few software packages are designed without goals in mind, and JDBC is no exception: its
many goals drove the development of the API. These goals, in conjunction with early
reviewer feedback, shaped the JDBC class library into a solid framework for building
database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to why
certain classes and functionalities behave the way they do. The eight design goals for JDBC are
as follows:

1. SQL Level API


The designers felt that their main goal was to define a SQL interface for Java. Although
not the lowest database interface level possible, it is at a low enough level for higher-level
tools and APIs to be created. Conversely, it is at a high enough level for application
programmers to use it confidently. Attaining this goal allows for future tool vendors to
“generate” JDBC code and to hide many of JDBC’s complexities from the end user.
2. SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to
support a wide variety of vendors, JDBC will allow any query statement to be passed
through it to the underlying database driver. This allows the connectivity module to handle
non-standard functionality in a manner that is suitable for its users.
3. JDBC must be implementable on top of common database interfaces:
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal
allows JDBC to use existing ODBC level drivers by the use of a software interface. This
interface would translate JDBC calls to ODBC and vice versa.
4. Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers feel that they
should not stray from the current design of the core Java system.
5. Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception.
Sun felt that the design of JDBC should be very simple, allowing for only one method of
completing a task per mechanism. Allowing duplicate functionality only serves to confuse
the users of the API.
6. Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; also, fewer
errors appear at runtime.
7. Keep the common cases simple
Because more often than not, the usual SQL calls used by the programmer are simple
SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to
perform with JDBC, as the sketch after this list shows. However, more complex SQL
statements should also be possible.
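The sketch below shows that common case. The connection URL, table name and credentials
are hypothetical placeholders (a MySQL-style URL is assumed because the software
requirements name MySQL); only the DriverManager, PreparedStatement and ResultSet calls
are standard JDBC.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hedged JDBC sketch: open a connection, run a parameterized SELECT, and
// walk the result set. URL, credentials and table are placeholders.
public class JdbcSketch {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/securecloud";          // hypothetical database
        try (Connection connection = DriverManager.getConnection(url, "user", "password");
             PreparedStatement statement = connection.prepareStatement(
                     "SELECT file_name FROM files WHERE owner_id = ?")) {
            statement.setInt(1, 42);                                      // example owner id
            try (ResultSet results = statement.executeQuery()) {
                while (results.next()) {
                    System.out.println(results.getString("file_name"));
                }
            }
        }
    }
}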

Finally, we decided to proceed with the implementation using Java Networking.

For dynamically updating the cache table, we use the MS Access database.

Java has two things: a programming language and a platform.

Java is a high-level programming language that is all of the following:

 Simple
 Architecture-neutral
 Object-oriented
 Portable
 Distributed
 High-performance
 Interpreted
 Multithreaded
 Robust
 Dynamic
 Secure

Java is also unusual in that each Java program is both compiled and interpreted.
With a compiler, you translate a Java program into an intermediate language called
Java byte codes, the platform-independent codes that are then interpreted and run on the
computer.

Compilation happens just once; interpretation occurs each time the program is
executed. The figure illustrates how this works.

Figure: My Program is translated by the Java compiler into byte codes, which the Java
interpreter then runs.

You can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a Java development
tool or a Web browser that can run Java applets, is an implementation of the Java VM.
The Java VM can also be implemented in hardware.

Java byte codes help make “write once, run anywhere” possible. You can compile
your Java program into byte codes on any platform that has a Java compiler. The byte
codes can then be run on any implementation of the Java VM. For example, the same
Java program can run on Windows NT, Solaris, and Macintosh.

Networking

TCP/IP stack

The TCP/IP stack is shorter than the OSI one:



TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a
connectionless protocol.

IP datagram’s

The IP layer provides a connectionless and unreliable delivery system. It considers
each datagram independently of the others. Any association between datagrams must be
supplied by the higher layers. The IP layer supplies a checksum that includes its own
header. The header includes the source and destination addresses. The IP layer handles
routing through an Internet. It is also responsible for breaking up large datagrams into
smaller ones for transmission and reassembling them at the other end.

UDP

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the
contents of the datagram and port numbers. These are used to give a client/server
model - see later.

TCP

TCP supplies logic to give a reliable connection-oriented protocol above IP. It
provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address
scheme for machines so that they can be located. The address is a 32 bit integer which
gives the IP address. This encodes a network ID and more addressing. The network ID
falls into various classes according to the size of the network address.

Network address
Class A uses 8 bits for the network address with 24 bits left over for other
addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network
addressing and class D uses all 32.

Subnet address

Internally, the UNIX network is divided into sub networks. Building 11 is currently
on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address

8 bits are finally used for host addresses within our subnet. This places a limit of
256 machines that can be on the subnet.

Total address

The 32 bit address is usually written as 4 integers separated by dots.

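For illustration only (the address value below is an arbitrary example, not taken from the project), a 32-bit address can be rendered in this dotted notation with a few lines of Python:

# Illustrative only: converting a 32-bit IP address to dotted-decimal notation.
import socket
import struct

addr = 0xC0A80001                               # 192.168.0.1 stored as a 32-bit integer
dotted = socket.inet_ntoa(struct.pack("!I", addr))  # pack as 4 big-endian bytes, then format
print(dotted)                                   # -> 192.168.0.1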
Port addresses

A service exists on a host, and is identified by its port. This is a 16 bit number. To
send a message to a server, you send it to the port for that service of the host that it is
running on. This is not location transparency! Certain of these ports are "well known".

Sockets

A socket is a data structure maintained by the system to handle network
connections. A socket is created using the call socket. It returns an integer that is
like a file descriptor. In fact, under Windows, this handle can be used with the
ReadFile and WriteFile functions.

#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);

Here "family" will be AF_INET for IP communications, protocol will be zero,


and type will depend on whether TCP or UDP is used. Two processes wishing to
communicate over a network create a socket each. These are similar to two ends of a
pipe - but the actual pipe does not yet exist.
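The prototype above is the C/BSD form of the call. As a minimal sketch of the same idea in Python (the language used in the implementation chapter), with a placeholder server address and port:

# Minimal, illustrative client side of a TCP connection (placeholder host/port).
import socket

HOST, PORT = "127.0.0.1", 5000              # placeholder server address and "well known" port
client = socket.socket(socket.AF_INET,      # family: AF_INET for IP communications
                       socket.SOCK_STREAM)  # type: SOCK_STREAM selects TCP (SOCK_DGRAM would be UDP)
client.connect((HOST, PORT))                # the two sockets now form the two ends of the "pipe"
client.sendall(b"hello")
reply = client.recv(1024)
client.close()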

JFree Chart

JFreeChart is a free 100% Java chart library that makes it easy for developers to display
professional quality charts in their applications. JFreeChart's extensive feature set includes:
A consistent and well-documented API, supporting a wide range of chart types;
A flexible design that is easy to extend, and targets both server-side and client-side
applications;
Support for many output types, including Swing components, image files (including PNG
and JPEG), and vector graphics file formats (including PDF, EPS and SVG);
JFreeChart is "open source" or, more specifically, free software. It is distributed under the
terms of the GNU Lesser General Public Licence (LGPL), which permits use in proprietary
applications.

1. Map Visualizations
Charts showing values that relate to geographical areas. Some examples include: (a)
population density in each state of the United States, (b) income per capita for each country in
Europe, (c) life expectancy in each country of the world. The tasks in this project include:
Sourcing freely redistributable vector outlines for the countries of the world, states/provinces
in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and
integrating this with the existing XYPlot class in JFreeChart;
Testing, documenting, testing some more, documenting some more.

2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts --- to display a
separate control that shows a small version of ALL the time series data, with a sliding
"view" rectangle that allows you to select the subset of the time series data to display in the
main chart.

3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard
mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars,
and lines/time series) that can be delivered easily via both Java Web Start and an applet.

4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the
properties that can be set for charts. Extend (or reimplement) this mechanism to provide
greater end-user control over the appearance of the charts.

J2ME (Java 2 Micro edition):-

Sun Microsystems defines J2ME as "a highly optimized Java run-time environment targeting a
wide range of consumer products, including pagers, cellular phones, screen-phones, digital set-
top boxes and car navigation systems." Announced in June 1999 at the JavaOne Developer
Conference, J2ME brings the cross-platform functionality of the Java language to smaller
devices, allowing mobile wireless devices to share applications. With J2ME, Sun has adapted
the Java platform for consumer products that incorporate or are based on small computing
devices.

1. General J2ME architecture
J2ME uses configurations and profiles to customize the Java Runtime Environment (JRE). As
a complete JRE, J2ME is comprised of a configuration, which determines the JVM used, and a
profile, which defines the application by adding domain-specific classes. The configuration
defines the basic run-time environment as a set of core classes and a specific JVM that run on
specific types of devices. We'll discuss configurations in more detail below. The profile defines the
application; specifically, it adds domain-specific classes to the J2ME configuration to define
certain uses for devices. We'll cover profiles in more depth afterwards. The following graphic depicts the
relationship between the different virtual machines, configurations, and profiles. It also draws a
parallel with the J2SE API and its Java virtual machine. While the J2SE virtual machine is
generally referred to as a JVM, the J2ME virtual machines, KVM and CVM, are subsets of
JVM. Both KVM and CVM can be thought of as a kind of Java virtual machine -- it's just that
they are shrunken versions of the J2SE JVM and are specific to J2ME.

2. Developing J2ME applications


Introduction In this section, we will go over some considerations you need to keep in mind
when developing applications for smaller devices. We'll take a look at the way the compiler is
invoked when using J2SE to compile J2ME applications. Finally, we'll explore packaging and
deployment and the role preverification plays in this process.

3. Design considerations for small devices

Developing applications for small devices requires you to keep certain strategies in mind
during the design phase. It is best to strategically design an application for a small device
before you begin coding. Correcting the code because you failed to consider all of the
"gotchas" before developing the application can be a painful process. Here are some design
strategies to consider:
* Keep it simple. Remove unnecessary features, possibly making those features a separate,
secondary application.
* Smaller is better. This consideration should be a "no brainer" for all developers. Smaller
applications use less memory on the device and require shorter installation times. Consider
packaging your Java applications as compressed Java Archive (jar) files.
* Minimize run-time memory use. To minimize the amount of memory used at run time, use
scalar types in place of object types. Also, do not depend on the garbage collector. You should
manage the memory efficiently yourself by setting object references to null when you are
finished with them. Another way to reduce run-time memory is to use lazy instantiation, only
allocating objects on an as-needed basis. Other ways of reducing overall and peak memory use
on small devices are to release resources quickly, reuse objects, and avoid exceptions.

4. Configurations overview
The configuration defines the basic run-time environment as a set of core classes and a specific
JVM that run on specific types of devices. Currently, two configurations exist for J2ME,
though others may be defined in the future:
* Connected Limited Device Configuration (CLDC) is used specifically with the KVM for
16-bit or 32-bit devices with limited amounts of memory. This is the configuration (and the
virtual machine) used for developing small J2ME applications. Its size limitations make CLDC
more interesting and challenging (from a development point of view) than CDC. CLDC is also
the configuration that we will use for developing our drawing tool application. An example of
a small wireless device running small applications is a Palm hand-held computer.

* Connected Device Configuration (CDC) is used with the C virtual machine (CVM) and is
used for 32-bit architectures requiring more than 2 MB of memory. An example of such a
device is a Net TV box.

5. J2ME profiles
What is a J2ME profile?


As we mentioned earlier in this tutorial, a profile defines the type of device supported. The
Mobile Information Device Profile (MIDP), for example, defines classes for cellular phones. It
adds domain-specific classes to the J2ME configuration to define uses for similar devices. Two
profiles have been defined for J2ME and are built upon CLDC: KJava and MIDP. Both KJava
and MIDP are associated with CLDC and smaller devices. Profiles are built on top of
configurations. Because profiles are specific to the size of the device (amount of memory) on
which an application runs, certain profiles are associated with certain configurations.
A skeleton profile upon which you can create your own profile, the Foundation Profile, is
available for CDC.
Profile 1: KJava
KJava is Sun's proprietary profile and contains the KJava API. The KJava profile is built on
top of the CLDC configuration. The KJava virtual machine, KVM, accepts the same byte
codes and class file format as the classic J2SE virtual machine. KJava contains a Sun-specific
API that runs on the Palm OS. The KJava API has a great deal in common with the J2SE
Abstract Windowing Toolkit (AWT). However, because it is not a standard J2ME package, its
main package is com.sun.kjava. We'll learn more about the KJava API later in this tutorial
when we develop some sample applications.
Profile 2: MIDP
MIDP is geared toward mobile devices such as cellular phones and pagers. The MIDP, like
KJava, is built upon CLDC and provides a standard run-time environment that allows new
applications and services to be deployed dynamically on end user devices. MIDP is a common,
industry-standard profile for mobile devices that is not dependent on a specific vendor. It is a
complete and supported foundation for mobile application
development. MIDP contains the following packages, the first three of which are core CLDC
packages, plus three MIDP-specific packages.
* java.lang
* java.io
* java.util
* javax.microedition.io
* javax.microedition.lcdui
* javax.microedition.midlet
* javax.microedition.rms

CHAPTER 7 - IMPLEMENTATION
7.1 GENERAL

Python is a general-purpose, high-level programming language that has grown into one of the
most widely used languages for implementing numerical and machine learning algorithms for a
wide range of applications. Combined with libraries such as NumPy, pandas and scikit-learn
(which are used in the code below), it allows numerical routines to be written in a notation
very close to standard linear algebra, although a few of its conventions may take some
getting used to at first.
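As a small, illustrative sketch of this array-style notation (a toy system of equations, not part of the project code):

# Small illustration of NumPy's linear-algebra style notation.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)   # solve A x = b
print(x)                    # [0.8 1.4]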

7.2 Code

# list of useful imports that I will use

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import seaborn as sns
import random
import pickle
import math
import warnings
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, accuracy_score, log_loss, roc_auc_score
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

warnings.filterwarnings("ignore")

# Mount Google Drive and upload the dataset (Google Colab).
from google.colab import drive
drive.mount('/content/drive')
from google.colab import files
files = files.upload()

data = pd.read_csv("heart.csv")
data.head()
data.describe()
data.info()
data.isnull().sum()
data.target.value_counts()
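# Note: the resampling step below balances the two target classes by drawing
# 500 samples (with replacement) from each class, so the models are trained
# on a balanced 1000-row dataset.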

# Separate majority and minority classes
df_majority = data[data['target'] == 1]
df_minority = data[data['target'] == 0]

# Downsample the majority class and upsample the minority class
df_minority_upsampled = resample(df_minority, replace=True, n_samples=500, random_state=123)
df_majority_downsampled = resample(df_majority, replace=True, n_samples=500, random_state=123)

# Combine minority class with downsampled majority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority_downsampled])

# Display new class counts
df_upsampled['target'].value_counts()

# Shuffle the DataFrame rows
data = df_upsampled.sample(frac=1)
data.head()

# Count plots and distribution plots for selected attributes
sns.countplot(x=data['target'], data=data)
sns.countplot(x=data['restecg'], data=data)
sns.countplot(x=data['sex'], data=data)
sns.countplot(x=data['cp'], data=data)

plt.figure(figsize=(10, 8))
sns.distplot(data['trestbps'])

plt.figure(figsize=(10, 8))
sns.distplot(data['age'])

plt.figure(figsize=(10, 8))
sns.distplot(data['chol'])

plt.figure(figsize=(10, 8))
sns.distplot(data['thalach'])

plt.figure(figsize=(10, 8))
sns.countplot(data['thal'])

plt.figure(figsize=(10, 8))
sns.distplot(data['oldpeak'])

Data Splitting

y = np.array(data['target'])
x = data.drop(['target'], axis=1)
x.columns

# Breaking into train and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=42)
KNN Algorithm

k = list(range(1, 50, 4))
train_acc = []
test_acc = []

for i in k:
    clf = KNeighborsClassifier(n_neighbors=i, algorithm='brute')
    clf.fit(X_train, y_train)
    pred_test = clf.predict(X_test)
    test_acc.append(accuracy_score(y_test, pred_test))
    pred_train = clf.predict(X_train)
    train_acc.append(accuracy_score(y_train, pred_train))

optimal_k = k[test_acc.index(max(test_acc))]
k = [math.log(x) for x in k]

# plot accuracy vs hyperparameter
x = plt.subplot()
x.plot(k, train_acc, label='Train Accuracy')
x.plot(k, test_acc, label='Test Accuracy')
plt.title('Accuracy vs hyperparameter')
plt.xlabel('k')
plt.ylabel('Accuracy')
x.legend()
plt.show()

print('optimal alpha for which auc is maximum : ', optimal_k)

knn = KNeighborsClassifier(n_neighbors=optimal_k, algorithm='brute')
knn.fit(X_train, y_train)
filename = 'heart_knn.pkl'
pickle.dump(knn, open(filename, 'wb'))

pred_test = knn.predict(X_test)
# fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, pred_test)
pred_train = knn.predict(X_train)
# fpr2, tpr2, thresholds2 = metrics.roc_curve(le_y_train, pred_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

results = pd.DataFrame(columns=['model', 'Classifier', 'Train-Accuracy', 'Test-Accuracy'])
new = ['KNN Algorithm', 'KNeighborsClassifier', train, test]
results.loc[0] = new

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = knn.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
df

Logistic Regression

c = [10000, 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
train_auc = []
cv_auc = []

for i in c:
    clf = LogisticRegression(C=i)
    clf.fit(X_train, y_train)
    prob_cv = clf.predict(X_test)
    cv_auc.append(accuracy_score(y_test, prob_cv))
    prob_train = clf.predict(X_train)
    train_auc.append(accuracy_score(y_train, prob_train))

optimal_c = c[cv_auc.index(max(cv_auc))]
c = [math.log(x) for x in c]

# plot accuracy vs hyperparameter
x = plt.subplot()
x.plot(c, train_auc, label='AUC train')
x.plot(c, cv_auc, label='AUC CV')
plt.title('AUC vs hyperparameter')
plt.xlabel('c')
plt.ylabel('AUC')
x.legend()
plt.show()

print('optimal c for which auc is maximum : ', optimal_c)

# y = mx + c
# y = ax^2 + bx + c

# Testing AUC on Test data
log = LogisticRegression(C=optimal_c)
log.fit(X_train, y_train)

filename = 'heart_log.pkl'
pickle.dump(log, open(filename, 'wb'))

pred_test = log.predict(X_test)
# fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, pred_test)
pred_train = log.predict(X_train)
# fpr2, tpr2, thresholds2 = metrics.roc_curve(le_y_train, pred_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = log.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
df

new = ['LogisticRegression', 'LogisticRegression', train, test]
results.loc[1] = new
Applying Linear SVM

warnings.filterwarnings("ignore")

alpha = [10000, 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001]
train_auc = []
cv_auc = []

for i in alpha:
    model = SGDClassifier(alpha=i, loss="hinge")
    clf = CalibratedClassifierCV(model, cv=3)
    clf.fit(X_train, y_train)
    prob_cv = clf.predict(X_test)
    cv_auc.append(accuracy_score(y_test, prob_cv))
    prob_train = clf.predict(X_train)
    train_auc.append(accuracy_score(y_train, prob_train))

optimal_alpha = alpha[cv_auc.index(max(cv_auc))]
alpha = [math.log(x) for x in alpha]

# plot accuracy vs hyperparameter
x = plt.subplot()
x.plot(alpha, train_auc, label='AUC train')
x.plot(alpha, cv_auc, label='AUC CV')
plt.title('AUC vs hyperparameter')
plt.xlabel('alpha')
plt.ylabel('AUC')
x.legend()
plt.show()

print('optimal alpha for which auc is maximum : ', optimal_alpha)

# Testing AUC on Test data
model = SGDClassifier(alpha=optimal_alpha)
svm = CalibratedClassifierCV(model, cv=3)
svm.fit(X_train, y_train)

import pickle
filename = 'heart_svm.pkl'
pickle.dump(svm, open(filename, 'wb'))

pred_test = svm.predict(X_test)
# fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, pred_test)
pred_train = svm.predict(X_train)
# fpr2, tpr2, thresholds2 = metrics.roc_curve(le_y_train, pred_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = svm.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
df

new = ['Linear SVM', 'SGDClassifier', train, test]
results.loc[2] = new
Applying RBF SVM

from sklearn.svm import SVC

C = [10000, 1000, 100, 10, 1, 0.1, 0.01, 0.001, 0.0001]
train_auc = []
cv_auc = []

for i in C:
    model = SVC(C=i)
    clf = CalibratedClassifierCV(model, cv=3)
    clf.fit(X_train, y_train)
    prob_cv = clf.predict(X_test)
    cv_auc.append(accuracy_score(y_test, prob_cv))
    prob_train = clf.predict(X_train)
    train_auc.append(accuracy_score(y_train, prob_train))

optimal_C = C[cv_auc.index(max(cv_auc))]
C = [math.log(x) for x in C]

# plot accuracy vs hyperparameter
x = plt.subplot()
x.plot(C, train_auc, label='AUC train')
x.plot(C, cv_auc, label='AUC CV')
plt.title('AUC vs hyperparameter')
plt.xlabel('C')
plt.ylabel('AUC')
x.legend()
plt.show()

print('optimal C for which auc is maximum : ', optimal_C)

# Testing AUC on Test data
model = SVC(C=optimal_C)
clf = CalibratedClassifierCV(model, cv=3)
clf.fit(X_train, y_train)

import pickle
filename = 'heart_rbf.pkl'
pickle.dump(clf, open(filename, 'wb'))

pred_test = clf.predict(X_test)
# fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, pred_test)
pred_train = clf.predict(X_train)
# fpr2, tpr2, thresholds2 = metrics.roc_curve(le_y_train, pred_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = clf.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
df

new = ['RBF SVM', 'SVC', train, test]
results.loc[3] = new
Applying Decision Tree

dept = [1, 5, 10, 50, 100, 500, 1000]
min_samples = [5, 10, 100, 500]

param_grid = {'min_samples_split': min_samples, 'max_depth': dept}
clf = DecisionTreeClassifier()
model = GridSearchCV(clf, param_grid, scoring='accuracy', n_jobs=-1, cv=3)
model.fit(X_train, y_train)
print("optimal min_samples_split", model.best_estimator_.min_samples_split)
print("optimal max_depth", model.best_estimator_.max_depth)

# Testing AUC on Test data
dt = DecisionTreeClassifier(max_depth=10, min_samples_split=5)
dt.fit(X_train, y_train)

import pickle
filename = 'heart_dt.pkl'
pickle.dump(dt, open(filename, 'wb'))

pred_test = dt.predict(X_test)
# fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, pred_test)
pred_train = dt.predict(X_train)
# fpr2, tpr2, thresholds2 = metrics.roc_curve(le_y_train, pred_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = dt.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
df

new = ['Decision Tree', 'DecisionTreeClassifier', train, test]
results.loc[4] = new
Applying Random Forest

dept = [1, 5, 10, 50, 100, 500, 1000]
n_estimators = [20, 40, 60, 80, 100, 120]

param_grid = {'n_estimators': n_estimators, 'max_depth': dept}
clf = RandomForestClassifier()
model = GridSearchCV(clf, param_grid, scoring='accuracy', n_jobs=-1, cv=3)
model.fit(X_train, y_train)
print("optimal n_estimators", model.best_estimator_.n_estimators)
print("optimal max_depth", model.best_estimator_.max_depth)

# Testing AUC on Test data
rf = RandomForestClassifier(max_depth=model.best_estimator_.max_depth,
                            n_estimators=model.best_estimator_.n_estimators)
rf.fit(X_train, y_train)

import pickle
filename = 'heart_rf.pkl'
pickle.dump(rf, open(filename, 'wb'))

pred_test = rf.predict(X_test)
# fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, pred_test)
pred_train = rf.predict(X_train)
# fpr2, tpr2, thresholds2 = metrics.roc_curve(le_y_train, pred_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

!pip install wordcloud

# wordcloud of top important features
all_features = X_train.columns
data = ''
feat = rf.feature_importances_
features = np.argsort(feat)[::-1]
for i in features[0:20]:
    data += all_features[i]
    data += ' '

from wordcloud import WordCloud
wordcloud = WordCloud(background_color="white").generate(data)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = rf.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
df

new = ['Random Forest', 'RandomForestClassifier', train, test]
results.loc[5] = new
XGboost Algorithm

!pip install xgboost

dept = [1, 5, 10, 50, 100, 500, 1000]
n_estimators = [20, 40, 60, 80, 100, 120]

param_grid = {'n_estimators': n_estimators, 'max_depth': dept}
clf = XGBClassifier()
model = GridSearchCV(clf, param_grid, scoring='accuracy', n_jobs=-1, cv=3)
model.fit(X_train, y_train)
print("optimal n_estimators", model.best_estimator_.n_estimators)
print("optimal max_depth", model.best_estimator_.max_depth)

optimal_n_estimators = model.best_estimator_.n_estimators
optimal_max_depth = model.best_estimator_.max_depth

import seaborn as sns
X = []
Y = []
cv_auc = []
train_auc = []
for n in n_estimators:
    for d in dept:
        clf = XGBClassifier(max_depth=d, n_estimators=n)
        clf.fit(X_train, y_train)
        pred_cv = clf.predict(X_test)
        pred_train = clf.predict(X_train)
        X.append(n)
        Y.append(d)
        cv_auc.append(accuracy_score(y_test, pred_cv))
        train_auc.append(accuracy_score(y_train, pred_train))

optimal_depth = Y[cv_auc.index(max(cv_auc))]
optimal_n_estimator = X[cv_auc.index(max(cv_auc))]

# Heatmap for cross validation data
data = pd.DataFrame({'n_estimators': X, 'max_depth': Y, 'AUC': cv_auc})
data_pivoted = data.pivot("n_estimators", "max_depth", "AUC")
ax = sns.heatmap(data_pivoted, annot=True)
plt.title('Heatmap for cross validation data')
plt.show()

# Heatmap for training data
data = pd.DataFrame({'n_estimators': X, 'max_depth': Y, 'Accuracy': train_auc})
data_pivoted = data.pivot("n_estimators", "max_depth", "Accuracy")
ax = sns.heatmap(data_pivoted, annot=True)
plt.title('Heatmap for training data')
plt.show()

# training our model for max_depth=10, n_estimators=80
xgb = XGBClassifier(max_depth=10, n_estimators=80)
xgb.fit(X_train, y_train)

import pickle
filename = 'heart_xgb.pkl'
pickle.dump(xgb, open(filename, 'wb'))

pred_test = xgb.predict(X_test)
pred_train = xgb.predict(X_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)

print("Accuracy on Test data is " + str(accuracy_score(y_test, pred_test)))
print("Accuracy on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

new = ['XGBOOST', 'XGBClassifier', train, test]
results.loc[6] = new

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = xgb.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
df
Stacking Classifier

!pip install mlxtend

from mlxtend.classifier import StackingClassifier

KNC = KNeighborsClassifier(n_neighbors=optimal_k, algorithm='brute')  # initialising KNeighbors Classifier
XGB = XGBClassifier(max_depth=10, n_estimators=80)

clf_stack = StackingClassifier(classifiers=[KNC, XGB], meta_classifier=XGB,
                               use_probas=True, use_features_in_secondary=True)

# training the stacking classifier
clf_stack.fit(X_train, y_train)

import pickle
filename = 'heart_clf_stack.pkl'
pickle.dump(clf_stack, open(filename, 'wb'))

pred_test = clf_stack.predict(X_test)
pred_train = clf_stack.predict(X_train)

test = accuracy_score(y_test, pred_test)
train = accuracy_score(y_train, pred_train)

print("Accuracy on Test data is " + str(accuracy_score(y_test, pred_test)))
print("Accuracy on Train data is " + str(accuracy_score(y_train, pred_train)))
print(" ")

# Code for drawing seaborn heatmaps
class_names = ['negative', 'positive']
df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test.round()), index=class_names, columns=class_names)
fig = plt.figure()
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")

new = ['Stacking Classifier', 'StackingClassifier', train, test]
results.loc[7] = new

original = ["Positive" if x == 1 else "Negative" for x in y_test[:20]]
predicted = clf_stack.predict(X_test[:20])
pred = []

for i in predicted:
    if i == 1:
        k = "Positive"
        pred.append(k)
    else:
        k = "Negative"
        pred.append(k)

# Creating a data frame
df = pd.DataFrame(list(zip(original, pred)),
                  columns=['original_Classlabel', 'predicted_classlebel'])
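Each trained classifier above is also saved to a .pkl file with pickle. As a minimal, illustrative sketch (not part of the original listing), any of those files could later be reloaded to score a new sample; the choice of heart_rf.pkl and the reuse of a row of X_test are assumptions for demonstration only:

# Illustrative sketch: reloading one of the pickled models saved above.
import pickle
import pandas as pd

with open('heart_rf.pkl', 'rb') as f:            # any of the saved .pkl files would work
    model = pickle.load(f)

# The new record must carry the same 13 feature columns used for training;
# here we simply reuse the first row of X_test as a placeholder patient.
new_patient = pd.DataFrame([X_test.iloc[0]])
print(model.predict(new_patient))                # 1 = positive, 0 = negative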
CHAPTER 8 - OUTPUT SNAPSHOTS

Figure 8.1 Count Plot 1
Figure 8.2 Count Plot 2
Figure 8.3 Count Plot 3
Figure 8.4 Count Plot 4
Figure 8.5 Density vs Trest BPS
Figure 8.6 Density vs Age
Figure 8.7 Density vs Cholesterol
Figure 8.8 Density vs Thalassemia
Figure 8.9 AUC vs Hyperparameter KNN
Figure 8.10 AUC vs Hyperparameter Log Regression
Figure 8.11 AUC vs Hyperparameter Linear SVM
Figure 8.12 AUC vs Hyperparameter RBF SVM
Figure 8.13 Decision Tree Heatmap
Figure 8.14 Random Forest Heatmap
Figure 8.15 XGBoost Cross validation heatmap
Figure 8.16 XGBoost training data heatmap
Figure 8.17 Accuracy Table of algorithms

CHAPTER 9 CONCLUSION
CONCLUSION:-
Parental history and hereditary factors can lead to many chronic diseases, and heart
disease is one of them. If such chronic diseases are identified at an early stage, they
can be treated. A medical (hospital) dataset was therefore collected from the Kaggle web
repository and analysed with different algorithms to check the accuracy score, sensitivity
and specificity obtained from the key attributes of heart disease patients. The proposed
model was evaluated with various algorithms over these key attributes, and the Random
Forest algorithm was found to give very effective and efficient performance in terms of
accuracy score for heart disease prediction. From this inference on the customized model,
machine learning algorithms can provide very valuable knowledge for the analysis and
prediction of many chronic diseases, and in this regard the work is helpful to needy
persons, doctors and society.
REFERENCES:
[1]. Monika Gandhi, Shailendra Narayanan Singh, Predictions in heart disease using techniques of data mining (2015).
[2]. J Thomas, R Theresa Princy, Human heart disease prediction system using data mining techniques (2016).
[3]. Sana Bharti, Shailendra Narayan Singh, Amity University, Noida, India, Analytical study of heart disease prediction comparing with different algorithms (May 2015).
[4]. Purushottam, Kanak Saxena, Richa Sharma, Efficient heart disease prediction system using Decision tree (2015).
[5]. Sellappan Palaniyappan, Rafiah Awang, Intelligent heart disease prediction using data mining techniques (August 2008).
[6]. Himanshu Sharma, M A Rizvi, Prediction of Heart Disease using Machine Learning Algorithms: A Survey (August 2017).
[7]. Animesh Hazra, Subrata Kumar Mandal, Amit Gupta, Arkomita Mukherjee and Asmita Mukherjee, Heart Disease Diagnosis and Prediction Using Machine Learning and Data Mining Techniques: A Review (2017).
[8]. V. Krishnaiah, G. Narsimha, N. Subhash Chandra, Heart Disease Prediction System using Data Mining Techniques and Intelligent Fuzzy Approach: A Review (February 2016).
[9]. Ramandeep Kaur, Er. Prabhsharn Kaur, A Review - Heart Disease Forecasting Pattern using Various Data Mining Techniques (June 2016).
[10]. J. Vijayashree and N. Ch. SrimanNarayanaIyengar, Heart Disease Prediction System Using Data Mining and Hybrid Intelligent Techniques: A Review (2016).
EFFECTIVE HEART DISEASE PREDICTION USING HYBRID
MACHINE LEARNING TECHNIQUES
1P. Raguraman, 2Ch.V. Lavanya, 3K. R. Issac Samuel, 4M.V. Narayana, 5B. Sreeja Chowdary, 6P. Padmini
1Assistant Professor, QIS College Of Engineering & Technology, Ongole, India
2,3,4,5,6Student, QIS College Of Engineering & Technology

Abstract: - Machine learning and deep learning are playing a very vital role in the health domain and internet sector. In the course of the most recent couple of many years, heart disease is the most widely recognized reason for worldwide passing. Heart disease expectation is quite possibly the most confounded assignments in the clinical field. In the advanced period, roughly one individual bites the dust each moment because of heart disease. Information science assumes a vital part in handling gigantic measures of information in the field of medical services. As heart disease expectation is a mind-boggling task, there is a need to computerize the forecast cycle to keep away from chances related with it and caution the patient well ahead of time. This paper utilizes the heart disease dataset accessible in the UCI AI store. The proposed work predicts the odds of Heart Disease and groups the patient's danger level by executing distinctive information mining methods like Naive Bayes, Decision Tree, Logistic Regression, KNN, SVM, XGboost and Random Forest. Subsequently, this paper presents a relative report by dissecting the exhibition of various AI calculations. The preliminary outcomes confirm that the Random Forest calculation has accomplished the most noteworthy precision of 97% contrasted with the other ML calculations carried out.

1. INTRODUCTION
Throughout the last decade, coronary illness or cardiovascular disease remains the essential premise of death around the world. A gauge by the World Health Organization is that over 17.9 million passings happen each year overall in light of cardiovascular infection, and of these passings, 80% are a direct result of coronary course sickness and cerebral stroke. The tremendous number of passings is normal among low and center pay nations. Many inclining variables, for example individual and expert propensities and hereditary inclination, represent coronary illness. Different ongoing danger factors like smoking, abuse of liquor and caffeine, stress, and actual idleness, alongside other physiological variables like heftiness, hypertension, high blood cholesterol, and previous heart conditions, are inclining factors for coronary illness. The effective, exact and early clinical finding of coronary illness assumes an essential part in taking preventive measures to forestall demise.

Information mining alludes to the extraction of required data from gigantic datasets in different fields like the clinical field, business field, and instructive field. AI is quite possibly the most quickly advancing space of man-made brainpower. These calculations can break down tremendous information from different fields; one such significant field is the clinical field. It's anything but a substitute to the routine expectation demonstrating approach, utilizing a PC to acquire a comprehension of perplexing and non-direct co-operations among various elements by diminishing the blunders in anticipated and real results. Information mining is investigating enormous datasets to extricate covered up vital dynamic data from an assortment of a past storehouse for future examination. The clinical field involves huge information of patients. This information needs mining by different AI calculations. Medical care experts do examination of these information to accomplish compelling symptomatic choice. Clinical information mining utilizing characterization calculations gives clinical guide through examination. It tests the characterization calculations to anticipate coronary illness in patients.

Information mining is the way toward extricating important information and data from enormous data sets. Different information mining methods like relapse, bunching, affiliation rule and arrangement procedures like Naïve Bayes, choice tree, arbitrary timberland and K-closest neighbor are utilized to characterize different coronary illness credits in foreseeing coronary illness. A similar investigation of the arrangement procedures is utilized [5]. In this exploration, the dataset is taken from the UCI store. The order model is created utilizing arrangement calculations for expectation of coronary illness. In this exploration, a conversation of calculations utilized for coronary illness expectation and a correlation among the current frameworks is made. It likewise makes reference to additional examination and headway prospects in the paper.
2. LITERATURE SURVEY:-
[2]. Mohammed Abdul Khaleel has given a paper on the Survey of Techniques for mining of data on Medical Data for Finding Frequent Diseases locally. This paper centers around analyzing the data mining methodologies which are needed for therapeutic data mining, especially to discover locally frequent diseases, for instance heart illnesses, lung threat, chest infection and so on. Data mining is the route toward removing data for discovering inert models. Vembandasamy et al. performed a work to break down and recognize coronary illness. In this, the calculation utilized was the Naive Bayes calculation, in which they utilized Bayes hypothesis. Subsequently Naive Bayes has a very good capacity to make suspicions autonomously. The pre-owned informational collection is acquired from a diabetic examination establishment of Chennai, Tamilnadu, which is a driving foundation. There are more than 500 patients in the dataset. The device utilized is Weka and order is executed by utilizing a 70% Percentage Split. Naïve Bayes offers 86.4% accurate data.

[3]. Costas Sideris, Nabil Alshurafa, Haik Kalantarian and Mohammad Pourhomayoun have given a paper named Remote Health Monitoring Outcome Success prediction using First Month and Baseline Intervention Data. RHS frameworks are viable in saving expenses and decreasing sickness. In this paper, they depict an updated RHM system, Wanda-CVD, that is cell based and, what's more, planned to give far off educating and social assistance to individuals. CVD balancing activity measures are seen as an essential center by friendly protection relationships all throughout the planet.

[4]. L. Sathish Kumar and A. Padmapriya have given a paper named Prediction for similarities of disease by using ID3 algorithm in television and mobile phone. This paper gives a modified and disguised approach to manage perceive plans that are concealed of coronary disease. The given structure uses data mining techniques, for instance the ID3 calculation. This proposed technique helps individuals not exclusively to think about the sicknesses, yet it can likewise help to diminish the passing rate and check of infection influenced people.

[5]. Nishara banu M.A and B. Gomathy have proposed a Disease Predicting system using data mining techniques. In this paper they talk about MAFIA (Maximal Frequent Item set calculation) and K-Means bunching. As characterization is significant for forecast of an illness, the characterization dependent on MAFIA and K-Means brings about exactness.

[6]. Wiharto and Hari Kusnanto have given a paper named Intelligence System for Diagnosis Level of Coronary Heart Disease with K-Star Algorithm. In this paper they display an assumption structure for heart contamination using a Learning Vector Quantization neural framework computation. The neural framework in this structure recognizes 13 clinical incorporates as data and predicts that there is a closeness or nonattendance of coronary sickness in the patient, close by different execution measures.

[7]. D.R. PatiI and Jayshril S. Sonawane have given a paper named Prediction of Heart Disease Using Learning Vector Quantization Algorithm. In this paper they show an assumption structure for heart contamination using a Learning Vector Quantization neural framework estimation. The neural framework in this structure recognizes 13 clinical incorporates as data and predicts that there is a proximity or nonattendance of coronary ailment in the patient, close by different execution measures.

3. METHODOLOGY
3.1. Data Pre-Processing
Cleaning: Data that we need to deal with won't be perfect; that is, it may contain commotion or it might contain missing values, and if we measure it as it is we cannot get great outcomes, so to acquire great and amazing outcomes we need to take this out. The cycle to dispense with this is information cleaning. We will fill missing qualities and can eliminate commotion by utilizing a few strategies, like loading up with the most normal worth in the missing spot.
Change: This includes changing the information configuration from one structure to another, that is, making it generally reasonable by doing standardization, smoothing, speculation and total strategies on the information.
Integration: Data that we'd like to process might not be from one source; sometimes it can be from different sources. If we don't integrate them it's going to be a problem while processing, so integration is one of the important phases in data pre-processing and different issues are considered here to integrate.
Decrease: When we work on information it could be perplexing and it might be hard to see some of the time, so to make it justifiable to the framework we will decrease it to the required organization with the goal that we can accomplish great outcomes.

Table.1 Data description
S.NO  Attribute  Information
1.  age       age in years
2.  sex       sex (1 = male; 0 = female)
3.  cp        chest pain type
4.  trestbps  resting blood pressure (in mm Hg on admission to the hospital)
5.  chol      serum cholestoral in mg/dl
6.  fbs       fasting blood glucose > 120 mg/dl (1 = true; 0 = false)
7.  restecg   resting electrocardiographic results
8.  thalach   maximum pulse achieved
9.  exang     exercise induced angina (1 = yes; 0 = no)
10. oldpeak   ST depression induced by exercise relative to rest
11. slope     the slope of the peak exercise ST segment
12. ca        number of major vessels (0-3) colored by flourosopy
13. thal      3 = normal; 6 = fixed defect; 7 = reversable defect
14. target    the predicted attribute (diagnosis of heart condition)
3.2 KNN Algorithm:
K nearest neighbors is a straightforward calculation that stores every single accessible case and characterizes new cases dependent on a closeness measure (e.g., distance capacities). KNN has been utilized in measurable assessment and example acknowledgment effectively since the start of the 1970's as a non-parametric method.
The K-NN working can be clarified based on the underneath calculation:
Step-1: Select the K number of neighbors.
Step-2: Calculate the Euclidean distance of the K number of neighbors.
Step-3: Take the K closest neighbors according to the determined Euclidean distance.
Step-4: Among these k neighbors, check the quantity of the information focuses in every classification.
Step-5: Assign the new information focuses to that class for which the quantity of the neighbors is most extreme.
Step-6: Our model is prepared.
Advantages:-
 It is easy to carry out.
 It is powerful to the boisterous preparing information.
 It very well may be more compelling if the preparation information is enormous.
Disadvantages:-
 Continuously needs to decide the worth of K, which might be unpredictable some time.
 The calculation cost is high as a result of figuring the distance between the information focuses for all the preparation tests.

3.3 Naïve-Bayes Classification:
The Naïve-Bayesian classifier relies upon Bayes' speculation with autonomy suppositions among attributes [7-13]. A Naïve-Bayesian output is definitely not hard to run, with no entrapped repetitive parameter estimation, which makes it particularly supportive for broad datasets. In spite of its effortlessness, the Naive Bayesian classifier generally completes its job shockingly well and is broadly used in light of the fact that it frequently outflanks high order techniques which are complex. Naïve Bayes treats every variable as independent, which helps it to predict even if the variables don't have a proper relation [1].

P(A|B) = P(B|A) P(A) / P(B)    (1)

Where,
P(A|B) is the posterior probability of the class (target) given the predictor (attribute),
P(A) is the prior probability of the class,
P(B|A) is the likelihood, which is the probability of the predictor given the class,
P(B) is the prior probability of the predictor.

3.4 Logistic Regression Classification:
Logistic regression, also called the logit model or logistic model, is a widely used model to analyze the relationship between multiple independent variables and one categorical dependent variable, with a cost function of the form (2)

J(w) = -(1/m) * sum_{i=1..m} [ y_i log(h_w(X_i)) + (1 - y_i) log(1 - h_w(X_i)) ]    (2)

3.5 Support Vector Machine Classification:
SVM: Support Vector Machines (SVM) is a supervised learning model that is commonly used in classification problems. The possibility of the SVM calculation is to figure the ideal hyperplane that in a perfect world isolates all objects of one class from those objects of another class with the biggest edges between these two classes. The items that are a long way from the limit are disposed of from the estimation, while other information focuses that are situated on the limit will be kept up and decided as "support vectors" to get agreeable computational productivity. The SVM calculation has diverse kernel capacities: outspread premise work (RBF), straight, sigmoid, and polynomial. In this examination, the spiral premise work has been picked dependent on settled cross-validation results.

3.6 Decision Tree Classification:
Decision tree is a classification algorithm that chips away at downright just as mathematical information. Decision tree is utilized for making tree-like constructions. Decision tree is basic and generally used to deal with clinical datasets. It is not difficult to execute and break down the information in a tree-formed chart. The choice tree model makes examination dependent on three hubs.
Root node: fundamental hub; in view of this, any remaining hubs capacity.
Inside node: handles different properties.
Leaf node: addresses the consequence of each test.
This calculation parts the information into at least two comparable sets dependent on the main markers. The entropy of each trait is determined and afterward the information are partitioned, with indicators having most information gain or least entropy. The outcomes got are simpler to peruse and decipher. This calculation has higher exactness in contrast with different calculations as it examines the dataset in the tree-like diagram. Be that as it may, the information might be over characterized and just each trait is tried in turn for dynamic.

3.7 Random Forest Classification:
It is one of the supervised classification algorithmic techniques. In this calculation, a few trees make a forest. Every individual tree in a random forest lets out a class assumption, and the class with most votes transforms into the model's estimate. In the random forest classifier, a larger number of trees gives higher exactness. The three normal strategies are:
Forest RI (random input choice);
Forest RC (random blend);
A mix of forest RI and forest RC.
It is utilized for order too as a classification task, yet can do well with a regression task, and can beat missing qualities. Moreover, being delayed to acquire expectations as it requires enormous informational indexes and more trees, results are untouchable.

3.8 XGboost Classification:
Gradient boosting (GB) is an ensemble boosting technique that beginnings with a "regression tree" as "powerless students". All in all, the GB model adds an added substance model to limit the misfortune work by utilizing a phase shrewd inspecting methodology. The misfortune work estimates the sum at which the normal worth goes amiss from the genuine worth. The stage wise design puts more accentuation on examples that are troublesome to anticipate or misclassified. In contrast to irregular backwoods, in GB, tests that are misclassified have a higher shot at being chosen in the preparing information. GB lessens predisposition and change and frequently gives higher exactness, however the boundaries ought to be tuned cautiously to stay away from over fitting. Consequently, settled cross-validation has been applied.

Fig.1. Proposed Architecture

4. EVALUATION METRICS
Evaluation metrics were used to evaluate the performance of the classifiers. One of these measures is through the confusion matrix, from which the accuracy, precision, recall, and F1-score are extracted by computing the correctly classified samples (TP and TN) and the incorrectly classified samples (FP and FN), as shown in the following equations (3), (4), (5) & (6):

accuracy = (TP + TN) / (TP + TN + FP + FN) * 100%    (3)
precision = TP / (TP + FP) * 100%    (4)
recall = TP / (TP + FN) * 100%    (5)
F1-Score = 2 * (precision * recall) / (precision + recall) * 100    (6)

Where TN is True Negative, TP is True Positive, FN is False Negative, and FP is False Positive.
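As a minimal, illustrative check of equations (3)-(6), the same metrics can be computed with scikit-learn, which the implementation chapter already uses; the two label vectors below are made-up toy values, not project results:

# Toy illustration of equations (3)-(6) using scikit-learn.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # placeholder ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # placeholder model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3
print(accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN) = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print(f1_score(y_true, y_pred))         # 2 * precision * recall / (precision + recall) = 0.75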
5. RESULTS AND ANALYSIS:
The aim of this research is to predict whether or not a patient will develop heart disease. This examination was done on managed AI order procedures utilizing Naïve Bayes, Decision tree, Random Forest, K-NN, SVM and Logistic Regression on the UCI storehouse. Different analyses utilizing distinctive classifier calculations were directed through the WEKA device. Examination was performed on an eighth era Intel Core i7 8750H processor up to 4.1 GHz CPU with 16 GB RAM. The dataset was ordered and parted into a preparation set and a test set. Pre-handling of the information is done and administered arrangement strategies like Naïve Bayes, Decision tree, Random Forest, K-NN, SVM and Logistic Regression are applied to get the exactness score. The accuracy score consequences of the various arrangement procedures were noted utilizing Python programming for the preparing and test informational indexes. The accuracy scores for the various calculations, and an examination of the exactness score of heart disease expectation in the proposed model with various creators, are given in Table 2.

Model: Test-Accuracy
KNN Algorithm: 0.96
Naïve Bayes: 0.93
Logistic Regression: 0.89
Linear SVM: 0.72
Decision Tree: 0.97
Random Forest: 0.97
XGBOOST: 0.97
Table 2: Models and their test accuracies

6. FUTURE SCOPE:-
This work can be considered as a foundation for a healthcare system for heart disease. The enhancement of this work is to provide higher quality performance using Deep Learning.

7. CONCLUSION:-
With the expanding number of passings because of heart disease, it has gotten obligatory to foster a framework to anticipate heart disease successfully and precisely. The inspiration for the investigation was to track down the most proficient ML algorithms for the detection of heart disease. This study compares the accuracy scores of the KNN, SVM, Decision Tree, Logistic Regression, Random Forest, XGboost and Naive Bayes algorithms for predicting heart disease using the UCI machine learning repository dataset. The result of this study indicates that the Random Forest, Decision Tree and XGboost algorithms are the most efficient algorithms, with an accuracy score of 97% for prediction of heart disease.

8. REFERENCES:
1. Avinash Golande, Pavan Kumar T, Heart Disease Prediction Using Effective Machine Learning Techniques, International Journal of Recent Technology and Engineering, Vol. 8, pp. 944-950, 2019.
2. T. Nagamani, S. Logeswari, B. Gomathy, Heart Disease Prediction using Data Mining with Mapreduce Algorithm, International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-3, January 2019.
3. Fahd Saleh Alotaibi, Implementation of Machine Learning Model to Predict Heart Failure Disease, (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 6, 2019.
4. Anjan Nikhil Repaka, Sai Deepak Ravikanti, Ramya G Franklin, Design And Implementation Heart Disease Prediction Using Naives Bayesian, International Conference on Trends in Electronics and Information (ICOEI 2019).
5. Theresa Princy R, J. Thomas, Human Heart Disease Prediction System using Data Mining Techniques, International Conference on Circuit Power and Computing Technologies, Bangalore, 2016.
6. Nagaraj M Lutimath, Chethan C, Basavaraj S Pol., Prediction of Heart Disease using Machine Learning, International Journal of Recent Technology and Engineering, 8(2S10), pp. 474-477, 2019.
7. UCI, Heart Disease Data Set. [Online]. Available (Accessed on May 1, 2020): https://www.kaggle.com/ronitf/heart-disease-uci.