
NSE Project


COURSE NAME: BUSINESS ANALYTICS

SKILL OFFERING ID: 2281

PROJECT TITLE: EXPLORATORY DATA ANALYSIS ON BANKING DATA USING EXCEL

PROJECT SUBMITTED TO: NSE ACADEMY

YEAR: III YEAR

DEPARTMENT: B.COM GENERAL

SEMESTER: VI

GROUP NUMBER: 017

MEMBERS OF THE GROUP: 017. MOHANA PRIYA M


017. MONIKA B
017. MONIKA H
017. PREETHIKA B
017. PRIYA DHARSHINI BAI S
017. SWATHI PRIYA G
017. THAMIL SELVI M

GUIDED BY: M HARINI

SPOC NAME: K R SUBAPRIYA


EXPLORATORY DATA ANALYSIS ON BANKING
DATA USING EXCEL

Introduction:
Exploratory Data Analysis (EDA) is a data analytics process used to understand a dataset in depth and learn its characteristics, often with visual means. It gives you a better feel for your data and helps you find useful patterns in it.

It is crucial to understand the data in depth before you perform data analysis and run it through an algorithm. You need to know the patterns in your data, determine which variables are important and which do not play a significant role in the output, and recognize which variables are correlated with one another. You also need to recognize errors in your data.

All of this can be done with Exploratory Data Analysis. It helps you gather insights and make better sense of the data, and it removes irregularities and unnecessary values from the data.
STEPS INVOLVED IN EXPLORATORY DATA ANALYSIS

I. Data Collection

Data collection is an essential part of exploratory data analysis. It refers to the process of finding and loading data into our system. Good, reliable data can be found on various public sites or bought from private organizations. Some reliable sites for data collection are Kaggle, GitHub, the UCI Machine Learning Repository, etc.

II. Data Cleaning

Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. Such anomalies can disproportionately skew the data and hence adversely affect the results. Some steps that can be taken to clean the data are:

• Removing missing values, outliers, and unnecessary rows/columns.

• Re-indexing and reformatting the data.
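The project carries out these steps in Excel; as a rough equivalent, here is a minimal pandas sketch (the small inline frame and the column names are illustrative, not taken from the project workbook):

    import pandas as pd

    # Illustrative frame standing in for a slice of the banking data
    df = pd.DataFrame({"age": [53, 28, None, 250, 30],
                       "duration": [138, 339, 185, 137, 68]})

    df = df.dropna()                                   # remove missing values
    q1, q3 = df["age"].quantile([0.25, 0.75])          # quartiles for the IQR rule
    iqr = q3 - q1
    df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]   # drop outlier rows
    df = df.reset_index(drop=True)                     # re-index after removals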

III. Univariate Analysis

In univariate analysis, you analyze the data of just one variable. A variable in your dataset refers to a single feature/column. You can do this with either graphical or non-graphical means, by finding specific mathematical values in the data. Some visual methods include:

• Histograms: bar plots in which the frequency of the data is represented with rectangular bars.
• Box-plots: here the information is represented in the form of boxes that summarize the median, quartiles, and outliers.
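The project draws these plots in Excel; a minimal Python sketch of the same two views, assuming the Banking.csv file described later in this report, might look like this:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("Banking.csv")        # dataset named in the Acknowledgements

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    df["age"].plot.hist(bins=20, ax=ax1, title="Histogram of age")   # frequency bars
    df["age"].plot.box(ax=ax2, title="Box-plot of age")              # median, quartiles, outliers
    plt.tight_layout()
    plt.show()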

IV. Bivariate Analysis

Here, you use two variables and compare them. This way, you can find how one feature affects the other. It is done with scatter plots, which plot the individual data points, or with correlation matrices, which plot the correlation between each pair of variables as hues. You can also use box-plots.
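A hedged sketch of such a bivariate view in pandas (the column pair is an illustrative assumption; in Excel the same comparison is done with a scatter chart and the CORREL function):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("Banking.csv")

    # Scatter plot of two numeric features and their Pearson correlation
    df.plot.scatter(x="age", y="duration", alpha=0.5, title="age vs. duration")
    print("correlation:", df["age"].corr(df["duration"]))
    plt.show()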
Objective:

• Helps you prepare your dataset for analysis.

• Allows a machine learning model to make better predictions on your dataset.

• Gives you more accurate results.

• Helps you choose a better machine learning model.

Acknowledgements:
Banking Data - Banking.csv.

• The dataset is taken from Kaggle.

About the File


• age
• job
• marital
• housing
• month
• duration
• campaign
• days
• outcome
• emp_var_rate

age  job          marital   housing  month  duration  campaign  days  outcome      emp_var_rate
53   technician   married   no       nov    138       1         999   nonexistent  -0.1
28   management   single    yes      jun    339       3         6     success      -1.7
39   services     married   no       apr    185       2         999   nonexistent  -1.8
55   retired      married   yes      aug    137       1         3     success      -2.9
30   management   divorced  yes      jul    68        8         999   nonexistent  1.4
37   blue-collar  married   yes      may    204       1         999   nonexistent  -1.8
39   blue-collar  divorced  yes      may    191       1         999   nonexistent  -1.8
36   admin.       married   no       jun    174       1         3     success      -2.9
27   blue-collar  single    yes      apr    191       2         999   failure      -1.8
34   housemaid    single    no       may    62        2         999   nonexistent  1.1


Milestones:
1. Data Wrangling: read the dataset; search for missing values, null values, repeated values and outliers; then treat the missing values, remove irrelevant data, replace null values, delete repeated values and treat the outliers. Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time on data wrangling rather than on the actual analysis of the data.

The process of data wrangling may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data wrangling typically follows a set of general steps which begin with extracting the data in a raw form from the data source, "munging" the raw data (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.
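A minimal pandas sketch of this wrangling step, assuming the Banking.csv file from the Acknowledgements (the choice of "duration" as the outlier column and the 1.5*IQR fences are illustrative assumptions, not the project's exact Excel procedure):

    import pandas as pd

    df = pd.read_csv("Banking.csv")                  # read the dataset

    print(df.isna().sum())                           # search for missing / null values
    print("repeated rows:", df.duplicated().sum())   # search for repeated values

    df = df.drop_duplicates()                        # delete repeated values
    df = df.dropna()                                 # treat missing / null values (here: drop them)

    # Treat outliers in one numeric column by clipping to the 1.5*IQR fences
    q1, q3 = df["duration"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["duration"] = df["duration"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    df.to_csv("Banking_clean.csv", index=False)      # deposit the result for the later steps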
2. Perform Descriptive Statistics by calculating the mean, median, mode, range, standard deviation, variance, standard error, skewness, kurtosis, maximum and minimum. A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the process of using and analyzing those statistics. Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample rather than use the data to learn about the population that the sample is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory and frequently involves nonparametric statistics. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in papers reporting on human subjects, a table is typically included giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related co-morbidities, etc.
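In Excel these values come from the Descriptive Statistics tool in the Data Analysis add-in or from the corresponding worksheet functions; a pandas sketch covering the same list of measures (the column choice is illustrative) might be:

    import pandas as pd

    s = pd.read_csv("Banking.csv")["duration"]       # any numeric column

    summary = pd.Series({
        "mean": s.mean(), "median": s.median(), "mode": s.mode().iloc[0],
        "range": s.max() - s.min(), "standard deviation": s.std(),
        "variance": s.var(), "standard error": s.sem(),
        "skewness": s.skew(), "kurtosis": s.kurt(),
        "maximum": s.max(), "minimum": s.min(),
    })
    print(summary)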

3. Create a correlation matrix for all possible factors, using the Data Analysis tool in Excel. A correlation matrix is simply a table which displays the correlation coefficients for different variables. The matrix depicts the correlation between all possible pairs of variables in the table. It is a powerful tool to summarize a large dataset and to identify and visualize patterns in the given data. A correlation matrix consists of rows and columns that show the variables, and each cell in the table contains the correlation coefficient between the corresponding pair.
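The equivalent computation in pandas, restricted to the numeric columns, is sketched below (a rough stand-in for the Excel Correlation tool, not the project's exact procedure):

    import pandas as pd

    df = pd.read_csv("Banking.csv")

    # Pairwise Pearson correlation coefficients between all numeric columns
    corr = df.select_dtypes("number").corr()
    print(corr.round(2))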

4. Analyze the data using Logistic Regression to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. This helps us understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed.

In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression (or logit regression) is the process of estimating the parameters of a logistic model (the coefficients in the linear combination). Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable whose two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value).
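A minimal scikit-learn sketch of this milestone, assuming the binary target is derived from the "outcome" column shown in the sample table above (the predictor set and the "success" encoding are illustrative assumptions):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Banking.csv")

    X = df[["age", "duration", "campaign", "emp_var_rate"]]       # numeric predictors
    y = (df["outcome"] == "success").astype(int)                  # assumed binary target: 1 = success

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    print("coefficients:", dict(zip(X.columns, model.coef_[0])))  # log-odds change per unit of each predictor
    print("test accuracy:", model.score(X_test, y_test))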
5. Visualize the data and provide charts, plots, etc. explaining the relationships between variables using Power BI or Tableau, and interpret the outcomes for business insights.
Outcome:

1) Can become a Data Engineer - BI, specifically a data wrangler who can transform data into business value and provide qualified data-based insights to various business verticals.

2) Helpful in fetching jobs as a Data Analyst who can inspect, clean, validate, update and transform data to provide the required information, and in becoming a Python and R programmer who builds and maintains systems for extracting, pre-processing and modelling different data streams, as well as integrating data across different sources.

3) Can develop the skill to become a Machine Learning Engineer who has machine learning skills and can build and deploy machine learning models.

Conclusion:

Data visualization is a fun and very important part of being a data scientist. Simplicity and the ability for others to quickly understand the message are the most important parts of exploratory data analysis. Before building any model, make sure you create a visualization to understand the data first! Exploratory data analysis is not a subject you can learn by reading a book. You need to go out, acquire data, and start plotting! The exploratory data analysis tutorial gets you started, and at the end of the tutorial there are various links to public datasets you can start exploring. After making your own visualizations, you can move on to describing the data using clustering.
