2) Theoretical Background: 2.1 EDA (Exploratory Data Analysis)
This chapter defines the following parts of the project: EDA (Exploratory Data Analysis), Feature
Engineering, Feature Selection, and Model Building. It forms the basis for the rest of the project.
2.1 EDA (Exploratory Data Analysis)
Exploratory Data Analysis refers to the critical process of performing initial investigations on data in
order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help
of summary statistics and graphical representations.
It is good practice to understand the data first and to gather as many insights from it as possible. EDA is
all about making sense of the data in hand before getting your hands dirty with it.
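As an illustration of the summary statistics mentioned above, the following is a minimal sketch using only the Python standard library. The `summarize` helper and the `ages` sample are hypothetical and are not taken from the project data; in practice a library such as pandas would usually be used for the same purpose.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    # "inclusive" treats the data as the full population when interpolating quartiles
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Hypothetical sample column; the last value looks suspicious next to the others.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 95]
summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A mean that sits well above the median, as it does here, is one of the simplest hints that a column contains extreme values worth inspecting.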
2.2 Feature Engineering
Feature engineering is the next step in a data science or machine learning project
after EDA (Exploratory Data Analysis).
Handling Outliers
Using Standard Deviation
Normal Distribution
IQR (Interquartile Range)
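The two outlier rules listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the thresholds `k` and `factor`, and the `ages` sample are hypothetical choices, not values from the project. The standard deviation rule assumes the data are roughly normally distributed, in which case about 99.7% of values fall within three standard deviations of the mean.

```python
import statistics

def outliers_by_std(values, k=3.0):
    """Flag values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]

def outliers_by_iqr(values, factor=1.5):
    """Flag values outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample column with one extreme value.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 95]
std_outliers = outliers_by_std(ages, k=2.0)
iqr_outliers = outliers_by_iqr(ages)
print("std rule (k=2):", std_outliers)
print("IQR rule:", iqr_outliers)
```

On a sample this small the extreme value inflates the standard deviation itself, so the usual `k=3.0` threshold flags nothing here, while the IQR rule still catches it. This is one reason the IQR rule is often preferred for skewed or small samples.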
2.3 Feature Selection
It is desirable to reduce the number of input variables to both reduce the computational cost of
modeling and, in some cases, to improve the performance of the model.
Statistical-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics and selecting those input variables that
have the strongest relationship with the target variable. These methods can be fast and
effective, although the choice of statistical measures depends on the data type of both the input
and output variables.
This section outlines how statistical measures are chosen for filter-based feature
selection with numerical and categorical data.
There are two main types of feature selection techniques: supervised and unsupervised,
and supervised methods may be divided into wrapper, filter and intrinsic.
Filter-based feature selection methods use statistical measures to score the correlation
or dependence between each input variable and the target variable; the scores are then
used to filter for the most relevant features.
Statistical measures for feature selection must be carefully chosen based on the data
type of the input variable and the output or response variable.
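A filter-based selection of this kind can be sketched as below. The example assumes numerical inputs and a numerical target, for which Pearson's correlation coefficient is a common scoring measure; other combinations of data types call for other measures (for example, a chi-squared test when both input and target are categorical). The `select_k_best` helper, the feature names, and the data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_k_best(features, target, k):
    """Score each feature by |correlation with target| and keep the top k."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical housing data: two informative columns and one arbitrary identifier.
features = {
    "size_m2": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "house_id": [3, 1, 5, 2, 4],
}
price = [150, 180, 240, 310, 360]

selected, scores = select_k_best(features, price, k=2)
print("selected:", selected)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Each feature is scored independently of the others, which is what makes filter methods fast, but it also means redundant features (here `size_m2` and `rooms`) can both be selected.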
2.4 Model Building
A machine learning model is built by learning and generalizing from training data, then
applying that acquired knowledge to new data it has never seen before to make predictions and
fulfill its purpose. A lack of data will prevent you from building the model, but access to data alone
is not enough.
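The idea of learning from training data and then predicting on unseen data can be sketched as follows. This is a minimal illustration, not the model used in the project: it assumes a single numerical input, fits a straight line by least squares, and measures the error on a held-out test set. The helper names, the split ratio, and the synthetic data are all hypothetical.

```python
import random

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, rows):
    """Average absolute difference between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in rows) / len(rows)

# Synthetic data: y = 3x + 5 with a small alternating disturbance of +1 / -1.
data = [(x, 3 * x + 5 + (-1) ** x) for x in range(20)]

train, test = train_test_split(data)
model = fit_line(train)            # learn from the training data only
test_error = mean_abs_error(model, test)  # evaluate on data the model has not seen
print("slope, intercept:", model)
print("mean absolute error on test set:", test_error)
```

Evaluating on the held-out rows rather than on the training rows is what gives an estimate of how well the model generalizes.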