NSE Project
SEMESTER: VI
Introduction:
Exploratory Data Analysis (EDA) is a data-analytics process for understanding a
dataset in depth and learning its characteristics, often by visual means. It
gives you a better feel for your data and helps you find useful patterns in it.
It is crucial to understand your data before you analyze it or run it through an
algorithm. You need to know the patterns in your data and determine which
variables are important and which play no significant role in the output.
Further, some variables may be correlated with other variables. You also need to
recognize errors in your data.
All of this can be done with Exploratory Data Analysis. It helps you gather
insights and make better sense of the data, and it removes irregularities and
unnecessary values from the data.
STEPS INVOLVED IN EXPLORATORY DATA ANALYSIS
I. Data Collection
Data collection means gathering the required data from the available sources
before any analysis begins.
II. Data Cleaning
Data cleaning refers to the process of removing unwanted variables and values
from your dataset and getting rid of any irregularities in it. Such anomalies can
disproportionately skew the data and hence adversely affect the results. Typical
cleaning steps include handling missing values, removing duplicate records, and
treating outliers.
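The cleaning steps above can be sketched in pandas. This is a minimal sketch on a
small synthetic frame standing in for a slice of Banking.csv; the column names
and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of Banking.csv (column names assumed)
df = pd.DataFrame({
    "age":      [34, 51, 29, 51, np.nan, 42],
    "duration": [180, 93, 4000, 93, 210, 305],   # 4000 looks like an outlier
})

df = df.drop_duplicates()              # remove repeated rows
df = df.dropna(subset=["age"])         # drop rows with a missing age
cap = df["duration"].quantile(0.95)    # cap extreme durations at the 95th percentile
df["duration"] = df["duration"].clip(upper=cap)
```

After these steps the duplicate row, the row with a missing age, and the extreme
duration value have all been dealt with.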
III. Univariate Analysis
In univariate analysis, you analyze the data of just one variable at a time. A
variable in your dataset refers to a single feature/column. You can do this
either graphically or non-graphically, by computing summary statistics of the
data. Common visual methods include histograms, box plots, and frequency tables.
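A non-graphical univariate pass can be sketched in pandas. The "age" values here
are hypothetical stand-ins for one column of the banking data.

```python
import pandas as pd

# Hypothetical "age" column standing in for one feature of Banking.csv
age = pd.Series([29, 34, 42, 51, 51, 60], name="age")

summary = age.describe()   # count, mean, std, min, quartiles, max
mode = age.mode()[0]       # most frequent value
# age.plot.hist(bins=5)    # a histogram is the graphical counterpart
```

`describe()` gives the numeric summary in one call; `mode()` picks out the most
frequent value of the variable.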
IV. Bivariate Analysis
Here, you use two variables and compare them. This way, you can find how one
feature affects the other. It is done with scatter plots, which plot individual
data points, or with correlation matrices, which plot the correlations as hues.
You can also use box plots.
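A bivariate comparison of two numeric features can be sketched as follows; the
two columns here are hypothetical stand-ins for features of the banking data.

```python
import pandas as pd

# Hypothetical pair of features: call duration vs. number of contacts
df = pd.DataFrame({
    "duration": [80, 150, 220, 300, 410],
    "campaign": [5, 4, 3, 2, 1],
})

r = df["duration"].corr(df["campaign"])        # Pearson correlation coefficient
# df.plot.scatter(x="duration", y="campaign")  # scatter-plot view of the pair
```

A coefficient near -1, as in this toy data, indicates that one feature falls
almost linearly as the other rises.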
Objective:
Acknowledgements:
Banking Data - Banking.csv.
Columns: age, Job, marital, housing, month, duration, campaign, days, outcome, emp_var_rate
1. Perform data wrangling on the dataset. The process of data wrangling may
include further munging, data visualization, data aggregation, training a
statistical model, as well as many other potential uses. Data wrangling typically
follows a set of general steps which begin with extracting the data in a raw form
from the data source, "munging" the raw data (e.g. sorting) or parsing the data
into predefined data structures, and finally depositing the resulting content
into a data sink for storage and future use.
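The extract-munge-deposit cycle described above can be sketched in pandas. The
raw CSV text here is a hypothetical fragment standing in for Banking.csv.

```python
import io
import pandas as pd

# Raw extract standing in for Banking.csv (column names assumed)
raw = io.StringIO("age,job,duration\n51,admin,93\n29,technician,180\n")

df = pd.read_csv(raw)          # extract the data from the source
df = df.sort_values("age")     # "munge": sort into a predefined structure
sink = io.StringIO()
df.to_csv(sink, index=False)   # deposit into a data sink for future use
```

In practice the sink would be a file or database table rather than an in-memory
buffer, but the three stages are the same.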
2. Perform Descriptive Statistics by calculating mean, median, mode, range,
standard deviation, variance, standard error, skewness, kurtosis, maximum and
minimum. A descriptive statistic (in the count noun sense) is a summary statistic
that quantitatively describes or summarizes features from a collection of
information, while descriptive statistics (in the mass noun sense) is the process of
using and analyzing those statistics. Descriptive statistics is distinguished from
inferential statistics (or inductive statistics) by its aim to summarize a sample,
rather than use the data to learn about the population that the sample of data is
thought to represent. This generally means that descriptive statistics, unlike
inferential statistics, are not developed on the basis of probability theory and
are frequently nonparametric statistics. Even when a data analysis draws its main
conclusions using inferential statistics, descriptive statistics are generally also
presented. For example, in papers reporting on human subjects, typically a table
is included giving the overall sample size, sample sizes in important subgroups
(e.g., for each treatment or exposure group), and demographic or clinical
characteristics such as the average age, the proportion of subjects of each sex,
the proportion of subjects with related co-morbidities, etc.
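The descriptive statistics listed in this step can be computed directly in
pandas. The "duration" values below are hypothetical stand-ins for one column of
the banking data.

```python
import pandas as pd

# Hypothetical "duration" values from the banking data
duration = pd.Series([93, 180, 210, 305, 412])

stats = {
    "mean":     duration.mean(),
    "median":   duration.median(),
    "std":      duration.std(),                # sample standard deviation
    "range":    duration.max() - duration.min(),
    "skewness": duration.skew(),
    "kurtosis": duration.kurt(),
}
```

The same figures can, of course, be produced with Excel's built-in statistical
functions; the Python version is shown here only as a reproducible sketch.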
3. Create a correlation matrix for all possible factors, using the Data Analysis
tool in Excel. A correlation matrix is simply a table which displays the
correlation coefficients for different variables. The matrix depicts the correlation
between all the possible pairs of values in a table. It is a powerful tool to
summarize a large dataset and to identify and visualize patterns in the given data.
A correlation matrix consists of rows and columns that show the variables. Each
cell in the table contains the correlation coefficient for the corresponding pair.
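Although this step is performed in Excel, the same matrix can be produced in one
line with pandas; the three columns below are hypothetical stand-ins for factors
in the banking data.

```python
import pandas as pd

df = pd.DataFrame({
    "age":      [29, 34, 42, 51, 60],
    "duration": [410, 300, 220, 150, 80],
    "campaign": [1, 2, 3, 4, 5],
})

corr = df.corr()   # table of Pearson coefficients for every pair of columns
```

The diagonal of the matrix is always 1 (each variable is perfectly correlated
with itself), and the matrix is symmetric about that diagonal.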
4. Build a logistic regression model on the data. In statistics, the logistic
model (or logit model) is a statistical model that models
the probability of an event taking place by having the log-odds for the event be a
linear combination of one or more independent variables. In regression analysis,
logistic regression (or logit regression) means estimating the parameters of a logistic
model (the coefficients in the linear combination). Formally, in binary logistic
regression there is a single binary dependent variable, coded by an indicator
variable, where the two values are labeled "0" and "1", while the independent
variables can each be a binary variable (two classes, coded by an indicator
variable) or a continuous variable (any real value).
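Fitting a binary logistic model can be sketched with scikit-learn. The single
feature (call duration) and the 0/1 outcome below are hypothetical toy data, not
values from Banking.csv.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical single feature (call duration) and binary outcome (0/1)
X = np.array([[60], [90], [120], [300], [420], [600]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)                           # estimate the coefficients
p = model.predict_proba([[500]])[0, 1]    # estimated probability that outcome = 1
```

`predict_proba` returns the modelled probability for each class, which is the
log-odds linear combination described above passed through the logistic function.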
5. Visualize the data and provide charts, plots, etc., explaining the
relationships between variables using Power BI or Tableau, and interpret the
outcomes for business insights.
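The project uses Power BI or Tableau for this step; as an illustrative
alternative, an equivalent chart can be produced in matplotlib. The monthly call
counts here are hypothetical.

```python
import io

import matplotlib
matplotlib.use("Agg")   # non-interactive backend, renders to memory
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly call counts from the campaign data
df = pd.DataFrame({"month": ["jan", "feb", "mar"], "calls": [120, 95, 140]})

fig, ax = plt.subplots()
ax.bar(df["month"], df["calls"])
ax.set_title("Campaign calls per month")
ax.set_ylabel("calls")

buf = io.BytesIO()
fig.savefig(buf, format="png")   # export the chart for a report
```

In Power BI or Tableau the same bar chart is built interactively by dragging the
month field to the axis and the call count to the values.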
Outcome:
1) Can become a Data Engineer (BI), specifically a data wrangler who can
transform data into business value and provide qualified data-based insights to
various business verticals.
2) Helpful in fetching jobs as a Data Analyst who can inspect, clean, validate,
update and transform data to provide the required information, and who can become
a Python and R programmer building and maintaining systems for extracting,
pre-processing and modelling different data streams, as well as integrating data
across different sources.
3) Can develop the skill to become a Machine Learning Engineer who can build and
deploy machine learning models.
Conclusion:
Data visualization is a fun and very important part of being a data scientist.
Simplicity, and the ability of others to quickly understand the message, is the
most important part of exploratory data analysis. Before building any model, make
sure you create visualizations to understand the data first! Exploratory data
analysis is not a subject you can learn by reading a book; you need to go out,
acquire data, and start plotting! The exploratory data analysis tutorial gets you
started, and at the end of the tutorial there are various links to public
datasets you can start exploring. After making your own visualizations, you can
move on to describing data using clustering.