
Extreme Gradient Boosting Technique for Online Payment

Transactions Fraud Detection


ABSTRACT
In today’s economic scenario, credit card use has become extremely commonplace. These cards
allow the user to make payments of large sums of money without the need to carry large amounts
of cash. They have revolutionized cashless payments and made every sort of payment convenient
for the buyer. This electronic form of payment is extremely useful but comes with its own set of
risks. With the increasing number of users, credit card frauds are increasing at a similar pace. The
credit card information of a particular individual can be collected illegally and used for fraudulent
transactions. Machine learning algorithms can be applied to collected transaction data to tackle
this problem. This paper presents a comparison of some established supervised learning algorithms
for differentiating between genuine and fraudulent transactions.

INTRODUCTION
Fraud can be described as an intentional deception perpetrated for some kind of gain, usually
monetary. It is an unfair practice whose occurrences are increasing by the day. There has been a
sharp increase in the usage of electronic payment methods like credit and debit cards, and this has
in turn led to a rise in credit card frauds. These cards may be used in both online and offline modes
to make payments. In the online mode of payment, the card may not have to be physically
presented, and in such cases the card data is prone to attack by hackers or cyber criminals. These
kinds of frauds result in millions being lost every year. To overcome this obstacle, many algorithms
have been and continue to be developed, and various detection approaches are being worked on to
solve this issue as efficiently as possible.

OBJECTIVE
The aim of this study is to present a comparison of some established supervised learning algorithms
to differentiate between genuine and fraudulent transactions.

Existing System:

➢ Supervised Learning

This type of learning is also referred to as predictive learning, as it predicts the class of unknown
objects based on prior class-related information about similar objects. The main idea behind this
type of learning is to learn from information about the task that has been provided in the past. A
machine requires basic data about the task to be provided to it. This basic input, or experience, is
given to it in the form of ‘training data’: the past information or data for a particular task. In this
paper, we use the supervised approach to detect fraud and analyze various algorithms based on
supervised machine learning. In this approach, a database of past cases of fraudulent and genuine
transactions is stored, and this database acts as a reference point for the various algorithms. A
minimal sketch of this idea appears below.
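
The following toy sketch illustrates the supervised approach in Python with pandas and
scikit-learn: a classifier is fitted on a labelled history of past transactions (the stored
"reference" database) and then predicts the class of a new, unseen transaction. All column
names and values here are purely hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled history of past transactions: 1 = fraudulent, 0 = genuine.
history = pd.DataFrame({
    "amount":   [20.0, 15.5, 900.0, 12.0, 750.0],   # transaction amount
    "hour":     [10, 14, 3, 11, 2],                 # hour of day
    "is_fraud": [0, 0, 1, 0, 1],                    # class label
})

# Learn from the stored reference data.
model = LogisticRegression()
model.fit(history[["amount", "hour"]], history["is_fraud"])

# Predict the class of a new, unseen transaction.
new_txn = pd.DataFrame({"amount": [820.0], "hour": [4]})
print(model.predict(new_txn))
```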
LITERATURE SURVEY:

Title: Credit Card Fraud Detection: A Realistic Modelling and a Novel Learning Strategy.

Author:A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi and G. Bontempi.

Year: 2018

Description:

Detecting frauds in credit card transactions is perhaps one of the best testbeds for computational
intelligence algorithms. In fact, this problem involves a number of relevant challenges, namely:
concept drift (customers' habits evolve and fraudsters change their strategies over time), class
imbalance (genuine transactions far outnumber frauds), and verification latency (only a small set
of transactions are checked by investigators in a timely manner). However, the vast majority of
learning algorithms that have been proposed for fraud detection rely on assumptions that hardly
hold in a real-world fraud-detection system (FDS). This lack of realism concerns two main aspects:
1) the way and timing with which supervised information is provided and 2) the measures used to
assess fraud-detection performance. This paper has three major contributions. First, we propose,
with the help of our industrial partner, a formalization of the fraud-detection problem that
realistically describes the operating conditions of FDSs that analyze massive streams of credit card
transactions every day. We also illustrate the most appropriate performance measures to be used
for fraud-detection purposes. Second, we design and assess a novel learning strategy that
effectively addresses class imbalance, concept drift, and verification latency. Third, in our
experiments, we demonstrate the impact of class imbalance and concept drift in a real-world data
stream containing more than 75 million transactions, authorized over a time window of three years.

Title: Credit card fraud detection using machine learning techniques: A comparative analysis.
Author:J. O. Awoyemi, A. O. Adetunmbi and S. A. Oluwadare.

Year:2017

Description:

Fraud is any malicious activity that aims to cause financial loss to another party. As the use of
digital money or plastic money, even in developing countries, is on the rise, so is the fraud
associated with it. Frauds committed with credit cards have cost consumers and banks billions of
dollars globally. Even with numerous mechanisms in place to stop fraud, fraudsters continuously
try to find new ways and tricks to commit it. Thus, in order to stop these frauds, we need a powerful
fraud detection system which not only detects fraud but detects it accurately and before it takes
place. We also need to make our systems learn from past frauds and make them capable of adapting
to new methods of fraud in the future. In this paper we introduce the concept of credit card fraud
and its various types. We explain various techniques available for a fraud detection system, such
as Support Vector Machine (SVM), Artificial Neural Networks (ANN), Bayesian Network,
K-Nearest Neighbor (KNN), Hidden Markov Model, Fuzzy Logic Based System, and Decision
Trees. An extensive review of existing and proposed models for credit card fraud detection is
carried out, along with a comparative study of these techniques on the basis of quantitative
measurements such as accuracy, detection rate, and false alarm rate. The conclusion of our study
explains the drawbacks of existing models and provides a better solution to overcome them.
Title: Supervised Machine Learning Algorithms for Credit Card Fraudulent Transaction
Detection: A Comparative Study.

Author:S. Dhankhad, E. Mohammed and B. Far.

Year: 2018

Description:

The goal of data analytics is to delineate hidden patterns and use them to support informed
decisions in a variety of situations. Credit card fraud is escalating significantly with the
advancement of modern technology, and credit cards have become an easy target for fraud. The
publicly available credit card fraud datasets are highly imbalanced. In this paper, we apply many
supervised machine learning algorithms to detect credit card fraudulent transactions using a
real-world dataset. Furthermore, we employ these algorithms to implement a super classifier using
ensemble learning methods. We identify the most important variables that may lead to higher
accuracy in credit card fraudulent transaction detection. Additionally, we compare and discuss the
performance of various supervised machine learning algorithms that exist in the literature against
the super classifier that we implemented in this paper.
Title: Random forest for credit card fraud detection.

Author:S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang and C. Jiang.

Year: 2018

Description:

Credit card fraud events take place frequently and result in huge financial losses. Criminals can
use technologies such as Trojans or phishing to steal other people's credit card information.
Therefore, an effective fraud detection method is important, since it can identify fraud in time
when a criminal uses a stolen card. One method is to make full use of historical transaction data,
including both normal and fraudulent transactions, to obtain normal/fraud behavior features based
on machine learning techniques, and then use these features to check whether a transaction is
fraudulent or not. In this paper, two kinds of random forests are used to train the behavior features
of normal and abnormal transactions. We compare the two random forests, which differ in their
base classifiers, and analyze their performance on credit card fraud detection. The data used in our
experiments come from an e-commerce company in China.

Title: Data mining for credit card fraud: A comparative study.


Author:S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland.

Year: 2011

Description:

Credit card fraud is a serious and growing problem. While predictive models for credit card fraud
detection are in active use in practice, reported studies on the use of data mining approaches for
credit card fraud detection are relatively few, possibly due to the lack of available data for research.
This paper evaluates two advanced data mining approaches, support vector machines and random
forests, together with the well-known logistic regression, as part of an attempt to better detect (and
thus control and prosecute) credit card fraud. The study is based on real-life data of transactions
from an international credit card operation.
Proposed System
➢ Hybrid Models

Specific algorithms based on artificial intelligence and neural networks are also being proposed
and implemented to predict credit card frauds with increased accuracy. The distribution of the
datasets used for fraud detection is highly imbalanced, so under-sampling and oversampling
techniques are being designed to obtain comparatively balanced data; a sketch follows below. Data
mining techniques are also being implemented in order to create a more efficient Fraud Detection
System. Another important area of development is the emergence of new hybrid models, which
are derived from preexisting supervised as well as unsupervised machine learning techniques.
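
As a sketch of the resampling idea mentioned above, the snippet below chains oversampling
and under-sampling using the third-party imbalanced-learn package; the library choice and
the sampling ratios are assumptions, not fixed parts of the proposed system.

```python
# SMOTE synthesizes minority-class (fraud) samples; RandomUnderSampler then
# trims the majority class. The two are chained into one resampling pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

resampler = Pipeline(steps=[
    ("oversample", SMOTE(sampling_strategy=0.1, random_state=42)),
    ("undersample", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])

# X, y: the feature matrix and 0/1 fraud labels of the transaction data.
# X_res, y_res = resampler.fit_resample(X, y)
```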
PROJECT DESCRIPTION

Credit card transactions are extremely commonplace now, but they also come with their own set
of problems, and many of these problems surface during fraud detection. The process of accepting
or rejecting a transaction happens within a very small time frame, ranging from microseconds to
milliseconds, so the process adopted for detecting a fraudulent transaction has to be extremely
quick and effective. Another problem is that a vast number of similar transactions happen at the
same time, which makes it difficult to monitor each and every transaction individually and hence
to identify fraud.

METHODOLOGIES

MODULES:
1. Data Understanding

2. Exploratory Data Analytics

3. Train/Test Split

4. Model Building/Hyper Parameter Tuning

5. Model Evaluation
DATA UNDERSTANDING:

Here, we need to load the data and understand the features present in it. This helps us choose the
features we will need for our final model. The dataset contains transactions made with credit cards
in September 2013 by European cardholders. It covers transactions that occurred over two days,
with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class
(frauds) accounts for 0.172% of all transactions.

Before continuing with our analysis, it is important to remember that while the anonymized
features have been scaled and appear to be centered around zero, the time and amount features
have not. Leaving them unscaled would make machine learning algorithms that assign weights to
features (logistic regression) or rely on a distance measure (KNN) perform much worse. To avoid
this issue, we standardize both the time and amount columns. Fortunately, there are no missing
values, so we do not need to worry about missing-value imputation. A sketch of this step follows
below.
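
The snippet below sketches this loading and scaling step; it assumes the dataset is stored
locally as creditcard.csv with the usual Time, Amount, V1..V28 (anonymized PCA features),
and Class columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")
print(df["Class"].value_counts(normalize=True))  # ~0.172% positive class

# V1..V28 are already scaled; Time and Amount are not, so standardize them
# for algorithms that weight features or rely on distances.
scaler = StandardScaler()
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])

assert df.isna().sum().sum() == 0  # no missing values to impute
```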

EXPLORATORY DATA ANALYTICS (EDA):

Normally, in this step, we perform univariate and bivariate analyses of the data, followed by
feature transformations if necessary. For the current dataset, because Gaussian variables are used,
we do not need to perform Z-scaling. However, we can check whether there is any skewness in
the data and try to mitigate it, as it might cause problems during the model-building phase, as
shown in the sketch below.
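
One possible way to check for and mitigate skewness is sketched below; the |skew| > 1
threshold and the Yeo-Johnson power transform are illustrative choices, not steps mandated
by the dataset.

```python
# `df` is the DataFrame from the Data Understanding step above.
from sklearn.preprocessing import PowerTransformer

features = df.drop(columns=["Class"])
skew = features.skew().sort_values(ascending=False)
print(skew.head())  # inspect the most heavily skewed features

# Apply a power transform to features with pronounced skew.
skewed_cols = skew[skew.abs() > 1.0].index
df[skewed_cols] = PowerTransformer(method="yeo-johnson").fit_transform(df[skewed_cols])
```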

TRAIN/TEST SPLIT:

Now comes the challenging part: creating a training data set that allows our algorithms to pick up
the specific characteristics that make a transaction more or less likely to be fraudulent. Using the
original data set as-is would not be a good idea for a very simple reason: since over 99% of our
transactions are non-fraudulent, an algorithm that always predicts that a transaction is
non-fraudulent would achieve an accuracy higher than 99%. Yet that is the opposite of what we
want. We do not want a 99% accuracy that is achieved by never labeling a transaction as
fraudulent; we want to detect fraudulent transactions and label them as such.

We perform a train/test split in order to check the performance of our models on unseen data. For
validation, we can use the k-fold cross-validation method, choosing an appropriate k value so that
the minority class is correctly represented in the test folds. A sketch follows below.
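
The sketch below uses a stratified hold-out split plus stratified k-fold cross-validation, so
the 0.172% fraud rate is preserved in every partition; k = 5 is an illustrative choice that
leaves roughly 80-100 fraud cases per test fold for this dataset.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

X = df.drop(columns=["Class"])  # df from the Data Understanding step
y = df["Class"]

# Stratified hold-out split: the fraud rate is the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified folds keep the minority class represented in every test fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```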

MODEL-BUILDING/HYPERPARAMETER TUNING:

This is the final step, at which we try different models and fine-tune their hyperparameters until
we get the desired level of performance on the given dataset. We should also check whether we
get a better model with the various sampling techniques. A sketch of this tuning step follows
below.
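
The snippet below sketches hyperparameter tuning with XGBoost, the extreme gradient
boosting technique named in the title; the parameter grid is illustrative, and
scale_pos_weight is one common way to offset the class imbalance.

```python
# X_train, y_train, and cv come from the train/test split step above.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Up-weight the minority (fraud) class by the imbalance ratio.
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

param_grid = {
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}

search = GridSearchCV(
    XGBClassifier(scale_pos_weight=scale_pos_weight, eval_metric="aucpr"),
    param_grid,
    scoring="average_precision",  # accuracy is misleading on imbalanced data
    cv=cv,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
```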

MODEL EVALUATION:

In order to evaluate the performance of a particular model, we make use of various metrics. Each
trained model is applied to the data, and its outputs are compared systematically to those produced
by the other models. Based on these comparisons, a conclusion is formed as to which model is
best suited for a particular dataset or a particular type of problem.

We need to evaluate the models using appropriate evaluation metrics. Note that since the data is
imbalanced, it is more important to identify fraudulent transactions accurately than non-fraudulent
ones. We need to choose an evaluation metric that reflects this business goal; a sketch follows
below.
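
The sketch below evaluates the tuned model on the held-out test set; precision, recall, ROC
AUC, and average precision (area under the precision-recall curve) reflect the business goal
far better than raw accuracy.

```python
# `search`, X_test, and y_test come from the earlier steps.
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    roc_auc_score,
)

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)              # hard class labels
y_score = best_model.predict_proba(X_test)[:, 1] # fraud probabilities

print(classification_report(y_test, y_pred, digits=4))
print("ROC AUC:", roc_auc_score(y_test, y_score))
print("Average precision (AUPRC):", average_precision_score(y_test, y_score))
```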
REQUIREMENTS ENGINEERING

HARDWARE REQUIREMENTS

• PROCESSOR : Intel Core 2 Duo

• RAM : 4 GB DDR RAM
• HARD DISK : 250 GB

SOFTWARE REQUIREMENTS

• Operating System : Windows 7/8/10

• Platform : Spyder3

• Programming Language : Python, HTML

• Front End : Spyder3


DESIGN ENGINEERING

GENERAL

Design engineering deals with the various UML (Unified Modeling Language) diagrams used for
the implementation of the project. Design is a meaningful engineering representation of a thing
that is to be built. Software design is a process through which the requirements are translated into
a representation of the software. Design is where quality is rendered in software engineering, and
it is the means to accurately translate customer requirements into the finished product.
UML Diagrams

Use case diagram

[Use case diagram: a user actor interacts with use cases for Data Understanding, Predictive
Learning, Exploratory Data Analytics (EDA), Train/Test Split, Model Building, and Model
Evaluation over the dataset.]

EXPLANATION:

The main purpose of a use case diagram is to show which system functions are performed for
which actor and to depict the roles of the actors in the system. The above diagram has the user as
its actor; each use case plays a certain role in achieving the overall concept.
Class Diagram

EXPLANATION

This class diagram represents how the classes, with their attributes and methods, are linked
together to perform verification with security. The diagram shows the various classes involved in
our project.
Object Diagram

[Object diagram: objects flow from Data Understanding to Data Analysis (EDA), Train Data,
Model Building, and Model Evaluation.]

EXPLANATION:

The above diagram shows the flow of objects between the classes. An object diagram shows a
complete or partial view of the structure of a modeled system; here it represents how the classes,
with their attributes and methods, are linked together to perform verification with security.
Component Diagram

[Component diagram: components for Data Understanding, Data Analysis (EDA), the
Model-Building Phase, Train/Split Data, Model Building, and Model Evaluation exchange
datasets, training data, predictive learning output, and the trained dataset.]

EXPLANATION:

A component provides the set of interfaces that it realizes or implements. Component diagrams
are static diagrams of the Unified Modeling Language, used to represent the working and behavior
of the various components of a system.
Deployment Diagram

[Deployment diagram: nodes for Data Understanding, Data Analysis (EDA), Train Data, Model
Building, and Model Evaluation.]

EXPLANATION:

A UML deployment diagram shows the configuration of run-time processing nodes and the
components that live on them. Deployment diagrams are a kind of structure diagram used to model
the physical aspects of an object-oriented system; they are often used to model the static
deployment view of a system.
Sequence Diagram

[Sequence diagram: messages flow between Data Understanding, Data Analysis (EDA), Train
Data, Model Building, Model Evaluation, and the Dataset, covering dataset transfers, training
data, predictive learning, machine learning, the trained dataset, the model-building phase, and
split data.]

EXPLANATION:

UML sequence diagrams are interaction diagrams that detail how operations are carried out. They
capture the interaction between objects in the context of a collaboration. Sequence diagrams are
time-focused: they show the order of interaction visually, using the vertical axis of the diagram to
represent time and to indicate what messages are sent and when.
Collaboration Diagram

[Collaboration diagram: numbered messages (1: dataset transfers, 2: datasets, 3: training data,
4: predictive learning, 5: machine learning, 6: trained dataset, 7: model-building phase, 8: split
data, 9: particular data) exchanged between Data Understanding, Data Analysis (EDA), Train
Data, Model Building, Model Evaluation, and the Dataset.]

EXPLANATION:

Collaboration diagrams are used to show how objects interact to perform the behavior of a
particular use case, or part of one. Along with sequence diagrams, collaboration diagrams are used
by designers to define and clarify the roles of the objects that carry out a particular flow of events
of a use case. They are the primary source of information used to determine class responsibilities
and interfaces.
State Diagram

[State diagram: states for the Dataset, Data Understanding, Data Analysis (EDA), Train Data,
Model Building, and Model Evaluation, with transitions for dataset transfers, training data,
predictive learning, the trained dataset, the model-building phase, split data, and machine
learning.]

EXPLANATION:

A state diagram shows the states a system can occupy and the transitions between them. State
diagrams require that the system described is composed of a finite number of states; sometimes
this is indeed the case, while at other times it is a reasonable abstraction. Many forms of state
diagrams exist, which differ slightly and have different semantics.
Activity Diagram

[Activity diagram: activities flow from the Dataset through Data Understanding, Data Analysis
(EDA), Train Data, Model Building, and Model Evaluation, with actions for dataset transfers,
training data, predictive learning, the trained dataset, the model-building phase, split data, and
machine learning.]

EXPLANATION:

Activity diagrams are loosely defined diagrams that show workflows of stepwise activities and
actions, with support for choice, iteration, and concurrency. In UML, activity diagrams can be
used to describe the business and operational step-by-step workflows of components in a system,
and they can model the internal logic of a complex operation. In many ways, UML activity
diagrams are the object-oriented equivalent of flow charts and data flow diagrams (DFDs) from
structured development.
System Architecture
