Data Preprocessing


LEARNING PROGRESS REVIEW - Week 10
BTS

Data Preprocessing for Machine Learning

Data Preprocessing

Data is growing exponentially from multiple sources and in multiple formats. Real-world (raw) data is dirty and cannot be fed directly into a machine learning model, as it may contain errors and incomplete, noisy, or unstructured records.
Feature Selection

Unsupervised: Do not use the target variable (e.g. remove redundant variables).
Supervised: Use the target variable (e.g. remove irrelevant variables).

Feature
● Numerical Output: Regression predictive modeling problem.
● Categorical Output: Classification predictive modeling problem.
Predictor vs Target

Predictive Variable: One or more variables that are used to determine (predict) the 'Target Variable'.
Target Variable: The variable that needs to be predicted.
Standardization vs Normalization

Standardization: Rescales data to have a mean of 0 and a standard deviation of 1 (unit variance).
Normalization: Rescales the values into a range of [0,1].

[Figure: a salary range from 0 to 10,000,000 is mapped into [0,1] after normalization.]
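Written as formulas (standard definitions, added here for reference; they are not on the original slides):

$z = (x - \mu) / \sigma$ (standardization)
$x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ (min-max normalization)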
—Why Standardization & Normalization
➔ The algorithm will treat equally scaled values fairly.
➔ Centered and equally scaled data help the learning algorithm converge faster.
➔ Equally scaled data makes some ML models easier to interpret.

—When to use them?

➔ Use standardization when we know our data has a normal/Gaussian distribution.
➔ Otherwise, use normalization.
In Python, we can use sklearn to perform Standardization & Normalization

—Standardization

from sklearn.preprocessing import StandardScaler

# Rescale each column to mean 0 and unit variance.
# reshape(-1, 1) turns the 1-D column into the 2-D shape sklearn expects.
df['umur_std'] = StandardScaler().fit_transform(df['umur'].values.reshape(-1, 1))
df['gaji_std'] = StandardScaler().fit_transform(df['gaji'].values.reshape(-1, 1))
Standardized data

➔ The means are close to 0
➔ The standard deviations are 1
—Normalization

from sklearn.preprocessing import MinMaxScaler

# Rescale each column into the [0, 1] range.
df['berat_norm'] = MinMaxScaler().fit_transform(df['berat'].values.reshape(-1, 1))
df['tinggi_norm'] = MinMaxScaler().fit_transform(df['tinggi'].values.reshape(-1, 1))
Normalized data

➔ The means and standard deviations are now on the same scale
Feature Encoding
A process to transform categorical features into numerical features, because machine learning works only with numerical computation.

Types of
One-Hot encoding Feature Label encoding
Encoding
● Changing each categories to 0 or 1.
● Changing each categories to 1,2,3..
● 1 is for the designated category at
● Normally each categories will have
the row, and 0 is for another
order that will be changed to integer.
categories
● Can be used for ordinal data.
● Usually used for nominal data
—One-Hot Encoding

import pandas as pd

# The pd.get_dummies(...) call was garbled in extraction; this is a likely
# reconstruction that builds dummy columns for every categorical feature.
onehots = pd.concat([pd.get_dummies(df[cat], prefix=cat) for cat in cats], axis=1)
df_onehot = df.join(onehots)

The "jenis_kelamin" values are replaced: 1 for Male, 0 for Female.
—Label Encoding

from sklearn.preprocessing import OrdinalEncoder

# Map each category to an integer code, column by column.
encoder = OrdinalEncoder()
ordinals = pd.DataFrame(encoder.fit_transform(df[cats]), columns=cats)
df[cats] = ordinals

"Pekerjaan" and "provinsi" values are replaced with integer codes 0, 1, 2, ..., n-1 as the designated values.
Splitting the dataset into a Training Set and a Test Set is the step that lets us evaluate the machine learning algorithm's performance on unseen data.
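A minimal sketch with sklearn's train_test_split (the 'target' column name below is a hypothetical placeholder):

from sklearn.model_selection import train_test_split

# Assumes df has a 'target' column (hypothetical name for illustration).
X = df.drop(columns=['target'])
y = df['target']

# Hold out 20% of the rows as a test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)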
—Imbalance Data

Occurs when the counts of the target classes differ significantly. To handle this, we can use an undersampling or oversampling method.
ADVANCED DATA PREPROCESSING FOR MACHINE LEARNING
FEATURE ENGINEERING
Feature engineering is a method to create new features from existing features. Why does this matter? Data is limited and might hold potential information that is not directly provided, as illustrated below.
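As a made-up illustration (not from the original slides), a new feature can be derived from the 'berat' (weight) and 'tinggi' (height) columns seen earlier; the units are assumed to be kilograms and centimeters:

# Derive a new feature from existing ones: body mass index from weight and height.
# Assumes 'berat' is in kilograms and 'tinggi' in centimeters (assumed units).
df['bmi'] = df['berat'] / (df['tinggi'] / 100) ** 2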
ADVANCED LABEL ENCODER
When we have a feature that own multiple categories, normal label encoder might abuse the
memory of the engine. Let’s say city feature has 100 categories, should we encode it to 100
columns? It will be a burden for the engine and scalable. This method tackle the problem.

ADVANCED LABEL ENCODER EXAMPLE

[Steps 1-4 and their outputs: code screenshots not preserved in this extract.]
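Since the original screenshots are missing, here is a minimal sketch of one memory-friendly approach, using pandas category codes (an assumption; the slides' exact method is not preserved). It maps a high-cardinality 'city' column (hypothetical name) to a single integer column instead of one column per category:

import pandas as pd

df = pd.DataFrame({'city': ['Jakarta', 'Bandung', 'Surabaya', 'Jakarta']})

# One integer column instead of one column per category:
# each distinct city gets a stable integer code.
df['city_encoded'] = df['city'].astype('category').cat.codes
print(df)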
HANDLING TEXT DATA
Have you ever heard of sentiment analysis on Twitter? It basically gets people's opinions toward a specific topic, like politics, a brand, or even an individual. This kind of subject is called Natural Language Processing (NLP). NLP is the sub-field of Data Science that specializes in language in general.
Then how do we fit text data into an NLP machine learning model when it can only accept numerical data? This topic will help us transform text data into numbers.

HANDLING TEXT DATA EXAMPLE

[Steps 1-2: code screenshots not preserved in this extract.]
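Since the original screenshots are missing, one common way to turn text into numbers is a bag-of-words count matrix; a minimal sketch with sklearn's CountVectorizer (the sample sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["the product is great", "the service is bad"]

# Each row becomes a vector of word counts over the learned vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())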
TEXT CLEANSING

When we cleanse text data, symbols like . , : " ) ( are not utilized and must be cleaned out. That is why this step is important to start the NLP process.

Example:
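The original example screenshot is missing; a minimal sketch of symbol removal with a regular expression (the sample sentence is made up):

import re

text = 'Great product!!! (5/5), would buy again :)'

# Keep letters, digits, and spaces; drop every other symbol, then lowercase.
clean = re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()
print(clean)  # great product 55 would buy again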
STOP WORDS

Conjunctions and repetitive words (I, you, is, am, are, ...) need to be removed from the dataset. These words are called stop words.

Example:
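The original example screenshot is missing; a minimal sketch using NLTK's English stop-word list (assumes nltk is installed and the 'stopwords' corpus is downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word corpus

words = ['i', 'am', 'very', 'happy', 'with', 'the', 'product']

# Drop every token that appears in the English stop-word list.
stop_set = set(stopwords.words('english'))
filtered = [w for w in words if w not in stop_set]
print(filtered)  # ['happy', 'product']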
TOKENIZATION
Tokenization is an NLP method that changes a text into a list of words. This helps the machine separate each piece of content.

TOKENIZATION EXAMPLE

[Steps 1-5 and the output: code screenshots not preserved in this extract.]
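The original screenshots are missing; a minimal sketch with NLTK's word tokenizer (assumes the 'punkt' tokenizer data is downloaded; the sample sentence is made up):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer model

text = "the product is great and the delivery was fast"

# Split the sentence into a list of word tokens.
tokens = word_tokenize(text)
print(tokens)  # ['the', 'product', 'is', 'great', 'and', ...]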
IMBALANCED DATA

Why do we need to do this? Imagine we work at an e-commerce company and want to predict which accounts are fraudulent. The customer service team reports that several accounts are tagged as fraud, but fraud accounts make up only 0.0001% of total users. If we train a machine learning model on such a rare class, it will prefer to be lazy and ignore it, since it lacks the data. Then how do we solve it?

We can use undersampling and oversampling, even in combination. Undersampling is considerably easy because we simply choose data randomly. However, oversampling needs extra effort to generate new data points. That is why algorithms like SMOTE come to the rescue.
IMBALANCED DATA EXAMPLE

[Steps 1-3 and the output: code screenshots not preserved in this extract.]
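Since the original screenshots are missing, here is a minimal SMOTE sketch using the imbalanced-learn package (the dataset below is synthetic and made up for illustration; the slides' exact code may differ):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A made-up imbalanced dataset: roughly 1% positive class.
X, y = make_classification(n_samples=1000, weights=[0.99, 0.01], random_state=42)
print('before:', Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority-class neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print('after:', Counter(y_res))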
Thank you
