Data Preprocessing


LEARNING PROGRESS REVIEW - Week 10
BTS

Data Preprocessing for Machine Learning

Data Preprocessing

Data is growing exponentially from multiple sources and in multiple formats. Real-world (raw) data is dirty and cannot be fed directly into a machine learning model, as it may contain errors and incomplete, noisy, or unstructured records.
Feature Selection

Unsupervised: Do not use the target variable (e.g. remove redundant variables).
Supervised: Use the target variable (e.g. remove irrelevant variables).

Feature
● Numerical Output: Regression predictive modeling problem.
● Categorical Output: Classification predictive modeling problem.
Predictor vs Target

Predictive Variable: One or more variables that are used to determine (predict) the 'Target Variable'.
Target Variable: The variable that needs to be predicted.
Standardization vs Normalization

Standardization: Rescales data to have a mean of 0 and a standard deviation of 1 (unit variance).
Normalization: Rescales the values into a range of [0,1].

[Figure: a salary range from 0 to 10,000,000 is mapped into [0,1] after normalization.]
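Written as formulas (standard definitions, added here for reference; they are not on the original slides):

$z = (x - \mu) / \sigma$ (standardization)
$x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ (min-max normalization)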
—Why Standardization & Normalization
➔ The algorithm will treat equally scaled values fairly.
➔ Centered and equally scaled data help the learning algorithm converge faster.
➔ Equally scaled data makes some ML models easier to interpret.

—When to use them?

➔ Use standardization when we know our data has a normal/Gaussian distribution.
➔ Otherwise, use normalization.
In Python, we can use sklearn to perform Standardization & Normalization

—Standardization

from sklearn.preprocessing import StandardScaler

# Rescale each column to mean 0 and unit variance.
# reshape(-1, 1) turns the 1-D column into the 2-D shape sklearn expects.
df['umur_std'] = StandardScaler().fit_transform(df['umur'].values.reshape(-1, 1))
df['gaji_std'] = StandardScaler().fit_transform(df['gaji'].values.reshape(-1, 1))
Standardized data

➔ The means are close to 0
➔ The standard deviations are 1
—Normalization

from sklearn.preprocessing import MinMaxScaler

# Rescale each column into the [0, 1] range.
df['berat_norm'] = MinMaxScaler().fit_transform(df['berat'].values.reshape(-1, 1))
df['tinggi_norm'] = MinMaxScaler().fit_transform(df['tinggi'].values.reshape(-1, 1))
Normalized data

➔ The means and standard deviations are now on the same scale
Feature Encoding
A process to transform categorical features into numerical features, because machine learning works only with numerical computation.

Types of
One-Hot encoding Feature Label encoding
Encoding
● Changing each categories to 0 or 1.
● Changing each categories to 1,2,3..
● 1 is for the designated category at
● Normally each categories will have
the row, and 0 is for another
order that will be changed to integer.
categories
● Can be used for ordinal data.
● Usually used for nominal data
—One-Hot Encoding

import pandas as pd

# The pd.get_dummies(...) call was garbled in extraction; this is a likely
# reconstruction that builds dummy columns for every categorical feature.
onehots = pd.concat([pd.get_dummies(df[cat], prefix=cat) for cat in cats], axis=1)
df_onehot = df.join(onehots)

The "jenis_kelamin" values are replaced: 1 for Male, 0 for Female.
—Label Encoding

from sklearn.preprocessing import OrdinalEncoder

# Map each category to an integer code, column by column.
encoder = OrdinalEncoder()
ordinals = pd.DataFrame(encoder.fit_transform(df[cats]), columns=cats)
df[cats] = ordinals

"Pekerjaan" and "provinsi" values are replaced with integer codes 0, 1, 2, ..., n-1 as the designated values.
Splitting the dataset into a Training Set and a Test Set is the step that lets us evaluate the machine learning algorithm's performance on unseen data.
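A minimal sketch with sklearn's train_test_split (the 'target' column name below is a hypothetical placeholder):

from sklearn.model_selection import train_test_split

# Assumes df has a 'target' column (hypothetical name for illustration).
X = df.drop(columns=['target'])
y = df['target']

# Hold out 20% of the rows as a test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)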
—Imbalance Data

Occurs when the counts of the target classes differ significantly. To handle this, we can use an undersampling or oversampling method.
ADVANCED DATA PREPROCESSING FOR MACHINE LEARNING
FEATURE ENGINEERING
Feature engineering is a method to create new features from existing features. Why does this matter? Data is limited and might hold potential information that is not directly provided, as illustrated below.
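As a made-up illustration (not from the original slides), a new feature can be derived from the 'berat' (weight) and 'tinggi' (height) columns seen earlier; the units are assumed to be kilograms and centimeters:

# Derive a new feature from existing ones: body mass index from weight and height.
# Assumes 'berat' is in kilograms and 'tinggi' in centimeters (assumed units).
df['bmi'] = df['berat'] / (df['tinggi'] / 100) ** 2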
ADVANCED LABEL ENCODER
When we have a feature that own multiple categories, normal label encoder might abuse the
memory of the engine. Let’s say city feature has 100 categories, should we encode it to 100
columns? It will be a burden for the engine and scalable. This method tackle the problem.

ADVANCED LABEL ENCODER EXAMPLE

[Steps 1-4 and their outputs: code screenshots not preserved in this extract.]
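Since the original screenshots are missing, here is a minimal sketch of one memory-friendly approach, using pandas category codes (an assumption; the slides' exact method is not preserved). It maps a high-cardinality 'city' column (hypothetical name) to a single integer column instead of one column per category:

import pandas as pd

df = pd.DataFrame({'city': ['Jakarta', 'Bandung', 'Surabaya', 'Jakarta']})

# One integer column instead of one column per category:
# each distinct city gets a stable integer code.
df['city_encoded'] = df['city'].astype('category').cat.codes
print(df)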
HANDLING TEXT DATA
Have you ever heard of sentiment analysis on Twitter? It basically gets people's opinions toward a specific topic, like politics, a brand, or even an individual. This kind of subject is called Natural Language Processing (NLP). NLP is the sub-field of Data Science that specializes in language in general.
Then how do we fit text data into an NLP machine learning model when it can only accept numerical data? This topic will help us transform text data into numbers.

HANDLING TEXT DATA EXAMPLE

[Steps 1-2: code screenshots not preserved in this extract.]
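Since the original screenshots are missing, one common way to turn text into numbers is a bag-of-words count matrix; a minimal sketch with sklearn's CountVectorizer (the sample sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["the product is great", "the service is bad"]

# Each row becomes a vector of word counts over the learned vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())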
TEXT CLEANSING

When we cleanse text data, symbols like . , : " ) ( are not utilized and must be cleaned out. That is why this step is important to start the NLP process.

Example:
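The original example screenshot is missing; a minimal sketch of symbol removal with a regular expression (the sample sentence is made up):

import re

text = 'Great product!!! (5/5), would buy again :)'

# Keep letters, digits, and spaces; drop every other symbol, then lowercase.
clean = re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()
print(clean)  # great product 55 would buy again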
STOP WORDS

Conjunctions and repetitive words (I, you, is, am, are, ...) need to be removed from the dataset. These words are called stop words.

Example:
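The original example screenshot is missing; a minimal sketch using NLTK's English stop-word list (assumes nltk is installed and the 'stopwords' corpus is downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word corpus

words = ['i', 'am', 'very', 'happy', 'with', 'the', 'product']

# Drop every token that appears in the English stop-word list.
stop_set = set(stopwords.words('english'))
filtered = [w for w in words if w not in stop_set]
print(filtered)  # ['happy', 'product']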
TOKENIZATION
Tokenization is an NLP method that changes a text into a list of words. This helps the machine separate each piece of content.

TOKENIZATION EXAMPLE

[Steps 1-5 and the output: code screenshots not preserved in this extract.]
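The original screenshots are missing; a minimal sketch with NLTK's word tokenizer (assumes the 'punkt' tokenizer data is downloaded; the sample sentence is made up):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer model

text = "the product is great and the delivery was fast"

# Split the sentence into a list of word tokens.
tokens = word_tokenize(text)
print(tokens)  # ['the', 'product', 'is', 'great', 'and', ...]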
IMBALANCED DATA

Why do we need to do this? Imagine we work at an e-commerce company and want to predict which accounts are fraudulent. The customer service team reports that several accounts are tagged as fraud, but fraud accounts make up only 0.0001% of total users. If we train a machine learning model on such a rare class, it will prefer to be lazy and ignore it, since it lacks the data. Then how do we solve it?

We can use undersampling and oversampling, even in combination. Undersampling is considerably easy because we simply choose data randomly. However, oversampling needs extra effort to generate new data points. That is why algorithms like SMOTE come to the rescue.
IMBALANCED DATA EXAMPLE

[Steps 1-3 and the output: code screenshots not preserved in this extract.]
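Since the original screenshots are missing, here is a minimal SMOTE sketch using the imbalanced-learn package (the dataset below is synthetic and made up for illustration; the slides' exact code may differ):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A made-up imbalanced dataset: roughly 1% positive class.
X, y = make_classification(n_samples=1000, weights=[0.99, 0.01], random_state=42)
print('before:', Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority-class neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print('after:', Counter(y_res))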
Thank you
