Data Preprocessing
PROGRESS REVIEW - Week 10
BTS
Data Preprocessing for Machine Learning
Data Preprocessing
Data is growing exponentially, from multiple sources and in multiple formats. Real-world
raw data is too dirty to be fed directly into a machine learning model, because it may
contain errors as well as incomplete, noisy, and unstructured records.
Figure: Standardization (labels: Feature, Unsupervised, Supervised)
Types of Feature Encoding

One-Hot encoding
● Changes each category to 0 or 1.
● 1 marks the designated category for the row, and 0 marks every other category.
● Usually used for nominal data.

Label encoding
● Changes each category to an integer: 1, 2, 3, ...
● Normally the categories have an order, and that order is mapped to the integers.
● Can be used for ordinal data.
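The two encodings above can be sketched in plain Python. This is only an illustration under assumed toy data: the `colors` and `sizes` lists and the helper names `one_hot_encode` and `label_encode` are hypothetical and do not come from the slides.

```python
# Toy data: "colors" is nominal (no order), "sizes" is ordinal (small < medium < large).
colors = ["red", "green", "blue", "green"]
sizes = ["small", "large", "medium", "small"]


def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per category, sorted by name."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


def label_encode(values, ordered_categories):
    """Map each value to its 1-based position in an explicitly ordered list."""
    mapping = {category: index
               for index, category in enumerate(ordered_categories, start=1)}
    return [mapping[value] for value in values]


color_columns, color_rows = one_hot_encode(colors)
size_codes = label_encode(sizes, ["small", "medium", "large"])

print(color_columns)  # ['blue', 'green', 'red']
print(color_rows)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(size_codes)     # [1, 3, 2, 1]
```

Passing the category order explicitly to `label_encode` keeps the integers meaningful for ordinal data; an alphabetical order would put "large" before "small".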
Figure: One-Hot Encoding
Imbalanced data occurs when the number of rows in each target class differs significantly.
To handle this, we can use an undersampling or oversampling method.
ADVANCED DATA PREPROCESSING FOR MACHINE LEARNING
FEATURE ENGINEERING
Feature engineering is a method of creating new features from existing ones. Why does this
matter? Data is limited, and it may hold useful information that is not provided explicitly.
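As a minimal sketch of the idea, the snippet below derives three new features from a date, a price, and a quantity. The `orders` records, their field names, and the `add_features` helper are hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical order records; the field names are illustrative only.
orders = [
    {"order_date": date(2023, 1, 7), "price": 120.0, "quantity": 3},
    {"order_date": date(2023, 1, 9), "price": 45.0, "quantity": 1},
]


def add_features(order):
    """Return a copy of the record with three derived features added."""
    enriched = dict(order)
    enriched["total"] = order["price"] * order["quantity"]
    # weekday(): Monday is 0 and Sunday is 6, so 5 and 6 are the weekend.
    enriched["is_weekend"] = order["order_date"].weekday() >= 5
    enriched["month"] = order["order_date"].month
    return enriched


engineered = [add_features(order) for order in orders]
for row in engineered:
    print(row["total"], row["is_weekend"], row["month"])
```

None of the derived columns adds new raw data; each one only exposes information that was already hidden in the existing columns.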
ADVANCED LABEL ENCODER
When a feature has many categories, encoding each category into its own column can waste the
engine's memory. Say a city feature has 100 categories: should we encode it into 100
columns? That would be a burden for the engine and would not scale. This method tackles the problem.
ADVANCED LABEL ENCODER EXAMPLE
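The individual steps of this example are not recoverable here, so the sketch below uses frequency encoding as a stand-in. This is an assumption, not necessarily the method the slides show. It is one common way to keep a high-cardinality feature in a single numeric column. The `cities` data and the `frequency_encode` helper are hypothetical.

```python
from collections import Counter

# Hypothetical "city" column; a real one might have 100 distinct values.
cities = ["Jakarta", "Bandung", "Jakarta", "Surabaya", "Jakarta", "Bandung"]


def frequency_encode(values):
    """Replace each category with the share of rows that carry it."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]


city_encoded = frequency_encode(cities)
print(city_encoded)
```

The result stays one column wide no matter how many cities appear. The trade-off is that two cities with the same frequency receive the same value.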
HANDLING TEXT DATA
Have you ever heard of sentiment analysis on Twitter? It basically captures people's opinions
about a specific topic, such as politics, a brand, or even an individual. This kind of subject
is called Natural Language Processing (NLP). NLP is a subfield of Data Science that specializes
in language in general.
Then how do we fit text data into an NLP machine learning model when it can only accept
numerical data? This topic will help us turn text data into numbers.
HANDLING TEXT DATA EXAMPLE
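One simple way to turn text into numbers is a bag-of-words count vector: build a vocabulary, then count how often each vocabulary word appears in each text. The sketch below assumes this approach (the slides may use a different one), and the `documents` list and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical example tweets.
documents = [
    "i love this brand",
    "i hate this brand",
    "love love love",
]


def build_vocabulary(texts):
    """Collect every distinct word, sorted so the column order is stable."""
    return sorted({word for text in texts for word in text.split()})


def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word appears in one text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(text, vocabulary) for text in documents]

print(vocabulary)  # ['brand', 'hate', 'i', 'love', 'this']
for vector in vectors:
    print(vector)
```

Each text becomes a row of integers of the same length, which is a shape a numerical model can accept.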
TEXT CLEANSING
When we cleanse text data, symbols like . , : ” ) ( are not used by the model and must be
removed. That is why this step is an important start to the NLP process.
Example :
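A minimal cleansing sketch using Python's `re` module is shown below. It assumes English text where only the letters a to z, digits, and spaces should survive; the `cleanse` helper and the sample sentence are hypothetical.

```python
import re


def cleanse(text):
    """Lowercase the text and strip everything except letters, digits, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation and symbols become spaces
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


print(cleanse("Wow!! This phone is (really) good :)"))
# wow this phone is really good
```

Note that this pattern would also strip accented or non-Latin letters, so it would need adjusting for other languages.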
STOP WORDS
Example :
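Stop words are very common words, such as "the" or "is", that carry little meaning on their own and are often removed before modelling. The sketch below uses a tiny hand-picked list; `STOP_WORDS` and `remove_stop_words` are hypothetical names, and real stop word lists are much longer.

```python
# A tiny, hand-picked stop word list for illustration; real lists are far longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "this", "that", "to", "of", "and"}


def remove_stop_words(words):
    """Keep only the words that are not in the stop word list."""
    return [word for word in words if word not in STOP_WORDS]


words = "this phone is really good".split()
filtered = remove_stop_words(words)
print(filtered)  # ['phone', 'really', 'good']
```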
TOKENIZATION
Tokenization is an NLP method that turns a text into a list of words (tokens). This helps the
machine separate each unit of content.
TOKENIZATION EXAMPLE
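A simple tokenizer can be sketched with a regular expression. This is only one possible approach, assuming English text and treating apostrophes as part of a word; the `tokenize` helper and the sample sentence are hypothetical.

```python
import re


def tokenize(text):
    """Split a text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("The delivery was fast, and the price wasn't bad!")
print(tokens)
# ['the', 'delivery', 'was', 'fast', 'and', 'the', 'price', "wasn't", 'bad']
```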
IMBALANCED DATA
Why do we need to do this? Imagine we work at an e-commerce company and want to predict which
accounts are fraudulent. The customer service team has reported several accounts tagged as
fraud, but those fraud accounts make up only 0.0001% of all users. If we train a machine
learning model to predict such a small class, it will tend to be lazy and simply predict the
majority class, because it lacks data on the fraud class. Then how do we solve it?
IMBALANCED DATA EXAMPLE
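Random oversampling and random undersampling are the simplest forms of these two methods, and they can be sketched in plain Python as below. The toy data (5 fraud rows out of 100), the helper names, and the fixed `seed` are assumptions for illustration, not the slides' own steps.

```python
import random
from collections import Counter


def group_by_class(rows, labels):
    """Collect the rows of each class into its own list."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    new_rows, new_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        new_rows.extend(group + extra)
        new_labels.extend([label] * target)
    return new_rows, new_labels


def random_undersample(rows, labels, seed=0):
    """Drop random majority rows until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    new_rows, new_labels = [], []
    for label, group in groups.items():
        new_rows.extend(rng.sample(group, target))
        new_labels.extend([label] * target)
    return new_rows, new_labels


# Toy data: 5 fraud accounts out of 100.
rows = list(range(100))
labels = ["fraud" if i < 5 else "normal" for i in rows]

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(labels))        # 5 fraud, 95 normal
print(Counter(over_labels))   # 95 of each class
print(Counter(under_labels))  # 5 of each class
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real class proportions.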
Thank you