4 - Data Pre-Processing I
TIET, PATIALA
Data
▪ Data is an unprocessed fact, value, text, sound, or picture that has not yet been
interpreted or analyzed.
▪ Data is the most important part of Data Analytics, Machine Learning, and
Artificial Intelligence.
▪ Big enterprises spend large sums of money to gather as much of certain kinds of
data as possible.
In 2014, Facebook acquired WhatsApp for about $19 billion.
Structured vs. Unstructured Data in ML
Structured Data in ML
▪ Structured data in Machine Learning is stored in rows and columns.
▪ For instance, the well-known Iris dataset has five columns: four numeric
measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and the class
label Species.
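To make the row/column idea concrete, here is a minimal sketch that stores the first three Iris observations as records. The helper name `column` is illustrative only and not part of any library.

```python
# The first three observations of the Iris dataset as structured data.
COLUMNS = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]

rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
]

# Each row is one observation; each column is one attribute of it.
records = [dict(zip(COLUMNS, row)) for row in rows]


def column(records, name):
    """Return every value stored under one column name (hypothetical helper)."""
    return [record[name] for record in records]


print(column(records, "Sepal.Length"))
print(column(records, "Species"))
```

Because every record has the same columns, a whole column can be pulled out by name, which is what makes structured data convenient for learning algorithms.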
Types of Features
Quantitative (Numerical)
• Quantitative data that is measured.
• Can be Continuous (infinitely many values: length, mass, weight) or Discrete
(finite integer values: number of workers absent).
Binary
• Features that have two values (0/1, yes/no, true/false).
• Examples: Marital Status, Permanent Employee, etc.
Qualitative Nominal (Categorical)
• Categorical data where the order of categories is arbitrary.
• Example: account type (savings, current, fixed term, etc.).
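One common way to hand binary and nominal features to a learning algorithm is to encode them as numbers. The sketch below maps a binary feature to 0/1 and one-hot encodes a nominal feature so that no artificial order is implied. The function names and sample values are assumptions made for illustration.

```python
def encode_binary(values, positive):
    """Map a two-valued feature to 1 (positive) or 0 (anything else)."""
    return [1 if value == positive else 0 for value in values]


def one_hot(values):
    """One-hot encode a nominal feature.

    Categories are sorted only to get a reproducible column order;
    the order carries no meaning for nominal data.
    """
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, encoded


permanent_employee = ["yes", "no", "yes"]
account_type = ["savings", "current", "fixed term"]

print(encode_binary(permanent_employee, positive="yes"))
categories, encoded = one_hot(account_type)
print(categories)
print(encoded)
```

One-hot encoding is used here instead of integer codes because writing savings=0, current=1, fixed term=2 would suggest an ordering that a nominal feature does not have.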
Types of Features Contd.
Qualitative Ordinal (Ranked)
• Categorical data where there is some logical ordering of categories.
• Example: Size (S, M, L, XL, XXL, etc.), Likert Scale (Strongly Disagree,
Disagree, Neutral, Agree, etc.).
Interval
• Has meaningful intervals between measurements.
• No true starting point (zero).
• Example: Temperature (in °C or °F).
Ratio
• Has the highest level of measurement.
• Ratios between measurements and intervals are meaningful because there is a
true starting point (zero).
• Example: Weight, Age.
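The difference between these scales shows up in which operations make sense. The sketch below encodes an ordinal feature with an explicit rank order, then contrasts an interval scale (Celsius) with a ratio scale (Kelvin). The names `SIZE_ORDER`, `encode_ordinal`, and `celsius_to_kelvin` are illustrative assumptions.

```python
# The order must be supplied explicitly; it cannot be inferred from the labels.
SIZE_ORDER = ["S", "M", "L", "XL", "XXL"]


def encode_ordinal(values, order):
    """Replace each category by its rank in the given order."""
    rank = {category: index for index, category in enumerate(order)}
    return [rank[value] for value in values]


def celsius_to_kelvin(celsius):
    """Kelvin has a true zero, so ratios of Kelvin values are meaningful."""
    return celsius + 273.15


print(encode_ordinal(["M", "S", "XL"], SIZE_ORDER))

# Interval scale: the 10-degree difference is meaningful, but 20 °C is not
# "twice as hot" as 10 °C, because 0 °C is not a true zero.
warm, cool = 20.0, 10.0
print(warm - cool)
print(celsius_to_kelvin(warm) / celsius_to_kelvin(cool))
```

The last line prints a ratio close to 1.04 rather than 2, which is why ratios are reserved for features such as weight or age that start from a true zero.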
Data Pre-Processing
▪ Data Pre-processing: the phase of a Machine Learning process that transforms, or
encodes, the data into a state that the learning algorithm can easily interpret.
Median
(29 - 4) / 9 = 25 / 9 ≈ 2.78
Bin 1: 4, 8, 9
Bin 2: 21, 21, 24
Bin 3: 26, 28, 29
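The three bins above can be reproduced with equal-frequency binning. The sketch below rebuilds them and then replaces every value by its bin median, assuming the "Median" label on this slide refers to smoothing by bin medians. The function names are illustrative, and the input length is assumed to be divisible by the number of bins.

```python
from statistics import median


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins bins of equal size.

    Assumes len(values) is divisible by n_bins.
    """
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_median(bins):
    """Replace every value in a bin by that bin's median."""
    return [[median(b)] * len(b) for b in bins]


data = [4, 8, 9, 21, 21, 24, 26, 28, 29]
bins = equal_frequency_bins(data, 3)
smoothed = smooth_by_median(bins)

for number, (original, smooth) in enumerate(zip(bins, smoothed), start=1):
    print(f"Bin {number}: {original} -> {smooth}")
```

With three values per bin, the median is simply the middle value of each bin: 8, 21, and 28.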
Data Cleaning
▪ Central Tendency
▪ Hot Deck Imputation
▪ Cold Deck Imputation
▪ Nearest Neighbor
▪ Linear Models
▪ Proximity-Based
▪ Information Theoretic
▪ Extreme Value Analysis
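Two of the techniques named on this slide can be sketched in a few lines: filling missing values with a measure of central tendency, and a simple form of extreme value analysis. This is a minimal illustration only. The use of `None` for missing values, the z-score rule, and the threshold of 2.0 are assumptions, not details from the slide.

```python
from statistics import mean, median, pstdev


def impute_central(values, strategy="mean"):
    """Fill missing entries (None) with the mean or median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def extreme_values(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing is extreme
    return [v for v in values if abs(v - mu) / sigma > threshold]


print(impute_central([4, None, 8], strategy="mean"))
print(impute_central([1, None, 3, 100], strategy="median"))
print(extreme_values([10, 11, 9, 10, 12, 10, 11, 9, 10, 60]))
```

The median is often the safer fill value when the data contains extreme values, since a single large value such as 100 pulls the mean but not the median.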