CH - 4
(03105430)
Prof. Dheeraj Kumar Singh, Assistant Professor
Information Technology Department
CHAPTER-4
Data Pre-processing
No quality data results in no quality mining!
What is Data Preprocessing?
• Allows us to analyse and summarize the main characteristics of data sets.
• Refers to steps applied to make data more suitable for data mining.
[Figure: Raw data passes through data preprocessing (including data integration) before data mining.]
• Data quality measures include completeness and believability.
Measuring the Central Tendency: Mean
• Weighted arithmetic mean:
\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
• Trimmed Mean
– A problem with the mean is its sensitivity to extreme (e.g., outlier) values.
– To offset this effect, the trimmed mean is obtained after chopping off values
at the high and low extremes.
Measuring the Central Tendency: Median
• Median is a better measure of center of data for skewed (asymmetric) data.
• Given a dataset of N distinct values sorted in numerical order:
– If N is odd, then the median is the middle value,
– otherwise median is the average of the two middle values.
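As a minimal sketch of these central-tendency measures, the following computes the mean, median, and a trimmed mean on a small illustrative sample (the data and the 10% trim level are assumptions, not from the slides):

```python
# Sketch: mean, median, and trimmed mean on a sample with one outlier.
from statistics import mean, median

data = [2, 4, 4, 5, 7, 9, 10, 11, 12, 100]  # 100 is an outlier

avg = mean(data)      # sensitive to the outlier
med = median(data)    # robust: average of the two middle values (N is even)

def trimmed_mean(values, p=0.1):
    """Chop off a fraction p of values at each extreme, then average."""
    values = sorted(values)
    k = int(len(values) * p)  # number of values to drop per side
    return mean(values[k:len(values) - k])

tmean = trimmed_mean(data, p=0.1)  # drops 2 and 100
```

Note how the mean (16.4) is pulled far above the median (8.0) by the single outlier, while the trimmed mean (7.75) stays close to the median.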
Mean, median, and mode of symmetric versus positively and negatively skewed data
Image source : Data Mining Concepts and Techniques Book
Measuring the Dispersion of Data
• The degree to which numerical data tend to spread is called the dispersion,
or variance of the data.
• The kth percentile of a set of data in numerical order is the value xi having
property that k percent of the data entries lie at or below xi.
• Range, Quartiles, and Inter-quartile range(IQR):
– Range: Difference between the largest (max()) and smallest (min())
values.
– Quartiles: First quartile: Q1 (25th percentile), Third quartile: Q3 (75th
percentile)
– Inter-quartile range: Distance between the first and third quartiles, IQR = Q3 – Q1.
– Five-number summary: min, Q1, median, Q3, max.
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
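As a sketch of these dispersion measures, the following computes a five-number summary, the IQR, and the 1.5 × IQR outlier rule. The data are illustrative, and Q1/Q3 are taken as medians of the lower/upper halves, which is one common convention (quartile conventions vary):

```python
# Sketch: five-number summary, IQR, and the 1.5 * IQR outlier rule.
from statistics import median

def five_number_summary(values):
    s = sorted(values)
    n = len(s)
    half = n // 2
    q1 = median(s[:half])             # lower half (excludes median if n is odd)
    q3 = median(s[half + n % 2:])     # upper half
    return min(s), q1, median(s), q3, max(s)

data = [10, 7, 15, 9, 100, 11, 5, 13, 8, 16, 12, 14]
mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
# Outliers: values more than 1.5 * IQR below Q1 or above Q3.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```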
Measuring the Dispersion of Data: Boxplot Analysis
• A way of visualizing a distribution.
• A boxplot incorporates the five-number
summary:
– Data are represented with a box.
– The ends of the box are at the first and
third quartiles, i.e., the height of the box is the IQR.
– The median is marked by a line within the box.
– Whiskers: two lines outside the box extend
to the minimum and maximum.
[Figure: Boxplot of unit price data for items sold at four branches of a shop]
Image source : Data Mining Concepts and Techniques Book
Variance and Standard Deviation
• The variance of N observations, x1,x2, … xN, is
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \bar{x}^2
• The standard deviation, σ, is the square root of the variance, σ².
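A quick numeric check that the definitional and computational forms of the variance agree (the sample values are illustrative):

```python
# Sketch: the two equivalent population-variance formulas (divide by N).
data = [4, 8, 15, 21, 24, 28]
N = len(data)
x_bar = sum(data) / N

var_def = sum((x - x_bar) ** 2 for x in data) / N    # (1/N) * sum (xi - mean)^2
var_alt = sum(x * x for x in data) / N - x_bar ** 2  # (1/N) * sum xi^2 - mean^2

std = var_def ** 0.5  # standard deviation: square root of the variance
```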
Forms of Data Preprocessing
• Data Cleaning: fill missing data
• Data Integration
• Data Transformation: normalization, aggregation
• Data Reduction: dimensionality reduction
Data Cleaning : Fill missing Data
• Ignore the tuple: usually done when most of the attribute values are missing.
• Fill in the missing value manually: tedious and sometimes infeasible.
• Use a global constant to fill in the missing value: e.g., "unknown", a new class.
• Use the attribute mean to fill in the missing value: alternatively, use the
mean computed over all samples belonging to the same class.
• Use the most probable value to fill in the missing value: based on inference.
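A minimal sketch of the attribute-mean strategy, with `None` standing in for missing values (the attribute and sample values are assumptions for illustration):

```python
# Sketch: fill missing values (None) in one attribute with the attribute mean.
ages = [25, None, 30, 22, None, 35]

present = [a for a in ages if a is not None]
attr_mean = sum(present) / len(present)  # mean of the known values only

filled = [a if a is not None else attr_mean for a in ages]
```

The class-conditional variant would compute a separate mean per class label and fill each tuple from its own class.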
Data Cleaning : Noisy Data
• Meaningless data or unknown values
• Cannot be interpreted by machines
• Errors in a measured variable
• Incorrect attribute values
• Sources of such data:
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
Data Cleaning : Handling Noisy Data
• Binning method: sort the data and partition it into equal-size segments
called bins; then perform smoothing on each bin.
• Clustering: groups similar data into clusters and detects outliers.
• Regression: data are smoothed by fitting them to a regression function.
• Semi-automated method: humans detect suspicious values and update them
manually.
Data Cleaning : Binning Method
• Sort the input data
• Create bins according to given bin size
• Partition the data into equal segments and arrange the bins
• Apply smoothing by
– bin mean
– bin boundaries
– bin medians
Example: Bin 1: 4, 8, 15 → smoothed by bin boundaries: 4, 4, 15
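The binning steps above can be sketched as follows; the data and bin size are illustrative, and "boundary" smoothing replaces each value by the nearer bin edge:

```python
# Sketch: equal-frequency binning with three smoothing rules.
from statistics import mean, median

def smooth(values, bin_size, method="mean"):
    values = sorted(values)           # step 1: sort the input data
    out = []
    for i in range(0, len(values), bin_size):  # step 2: partition into bins
        b = values[i:i + bin_size]
        if method == "mean":
            out += [mean(b)] * len(b)          # replace values by the bin mean
        elif method == "median":
            out += [median(b)] * len(b)        # replace values by the bin median
        else:  # "boundary": replace each value by the nearer bin edge
            lo, hi = b[0], b[-1]
            out += [lo if v - lo <= hi - v else hi for v in b]
    return out

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
# smooth(data, 3, "boundary") reproduces the Bin 1 example: 4, 8, 15 -> 4, 4, 15
```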
• PCA computes k orthonormal vectors that provide a basis for normalized data.
• The principal components essentially serve as a new set of sorted axes, such
that the first axis captures the most variance in the data, the second axis the
next highest variance, and so on.
• The size of data can be reduced by eliminating weaker components with low
variance.
Principal Components Analysis Example
[Figure: The first two principal components, Y1 and Y2, for a data set
originally mapped to the axes X1 and X2.]
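A hedged sketch of PCA via eigendecomposition of the covariance matrix, assuming NumPy is available; the synthetic data and the helper name `pca` are illustrative, not from the slides:

```python
# Sketch: PCA by eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)            # center the data
    cov = np.cov(Xc, rowvar=False)     # covariance matrix
    vals, vecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]     # sort axes by variance, descending
    components = vecs[:, order[:k]]    # k orthonormal basis vectors
    return Xc @ components             # data projected onto the new axes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.5], [0.5, 1.0]])
Y = pca(X, 1)  # keep only the first (highest-variance) component
```

Dropping the weaker components (here, keeping 1 of 2) is exactly the size reduction the text describes.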
Numerosity Reduction
• Data are replaced or estimated by alternative, smaller data representations
using
– parametric models
Instead of the actual data, only the data parameters need to be stored.
Example: Regression and Log-linear models
– nonparametric methods
Used to store reduced representations of the data.
Example: histograms, clustering, and sampling.
Parametric models: Regression
• Linear regression: The data are modeled to fit a straight line.
For example, a random variable, y can be modeled as a linear function of
another random variable, x using following equation:
y = w x + b,
where w and b are regression coefficients used to specify the slope of the
line and the Y-intercept, respectively.
• Multiple linear regression: Allows a response variable, y to be modeled as a
linear function of two or more predictor variables.
– For example, a random variable, Y, can be modeled as a linear function of
two random variables, X1 and X2, using the following equation:
Y = b0 + b1 X1 + b2 X2 ,
where b0, b1, and b2 are regression coefficients.
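As a sketch of fitting this model, the following estimates b0, b1, b2 by least squares with NumPy; the synthetic data and the true coefficients (2.0, 0.5, 3.0) are assumptions chosen for the example:

```python
# Sketch: multiple linear regression Y = b0 + b1*X1 + b2*X2 via least squares.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.uniform(0, 10, 50)
X2 = rng.uniform(0, 10, 50)
Y = 2.0 + 0.5 * X1 + 3.0 * X2 + rng.normal(0, 0.1, 50)  # true b0=2, b1=0.5, b2=3

A = np.column_stack([np.ones_like(X1), X1, X2])  # design matrix with intercept
b0, b1, b2 = np.linalg.lstsq(A, Y, rcond=None)[0]
```

For data reduction, only the three fitted coefficients need to be stored in place of the 50 raw tuples.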
Parametric models: Log-Linear Models
• Approximate discrete multidimensional joint probability distributions.
• Useful in dimensionality reduction and data smoothing.
• Used to handle skewed data.
• Good scalability for up to 10 or so dimensions.
Nonparametric models: Histogram
• Histograms use binning to
approximate data distributions.
• Partitions the data distribution into
disjoint subsets also called buckets.
• Example: The following data are a sorted list of prices of commonly sold items:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
[Figure: Histogram using singleton buckets (each bucket represents one price
value/frequency pair)]
Image source : Data Mining Concepts and Techniques Book
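A minimal sketch of singleton buckets using the standard library's `Counter`, applied to the price list above:

```python
# Sketch: singleton buckets (one bucket per distinct value) via Counter.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12,
          14, 14, 14, 15, 15, 15, 15, 15, 15,
          18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25,
          28, 28, 30, 30, 30]

buckets = Counter(prices)  # maps each price value to its frequency
# Storing the 13 (value, frequency) pairs replaces the 52 raw values.
```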
Nonparametric models: Clustering
• Partition data set into groups called clusters, such that objects within a
cluster are similar to one another and dissimilar to objects in other
clusters.
• For data reduction, the cluster representations of data are used to replace
the actual data.
• It is more effective for data that can be organized into distinct clusters
than for smeared data.
• Quality of clusters measured by their diameter (max distance between any
two objects in cluster) or centroid distance (average distance of each
cluster object from its centroid).
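The cluster representation and the two quality measures named above can be sketched as follows; the 2-D points are illustrative, and Euclidean distance is assumed:

```python
# Sketch: replacing a cluster's points by its centroid, with the two
# cluster-quality measures: diameter and centroid distance.
from itertools import combinations

cluster = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.0)]

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Centroid: the reduced representation stored in place of the raw points.
centroid = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

# Diameter: maximum distance between any two objects in the cluster.
diameter = max(dist(p, q) for p, q in combinations(cluster, 2))

# Centroid distance: average distance of each object from the centroid.
centroid_dist = sum(dist(p, centroid) for p in cluster) / len(cluster)
```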
Nonparametric models: Sampling
• Choose a representative subset of the data.
• Sampling complexity is potentially sublinear to the size of the data.
• Other data reduction techniques can require at least one complete pass
through D.
• For a fixed sample size, sampling complexity increases only linearly as the
number of data dimensions increases, whereas techniques using
histograms, for example, increase exponentially in n.
• Common ways of Sampling:
– Simple random sample without replacement (SRSWOR)
– Simple random sample with replacement (SRSWR)
– Cluster sample
– Stratified sample
SRSWOR and SRSWR Sampling
• SRSWOR of size s is created by drawing s of the N tuples from dataset D (s <
N), such that all tuples are equally likely to be sampled.
• SRSWR is similar to SRSWOR, except that each time a tuple is drawn from
D, it is recorded and then placed back in D so that it may be drawn again.
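The two simple random sampling schemes can be sketched with the standard library's `random` module; the dataset D of 100 tuples and sample size s = 8 are illustrative:

```python
# Sketch: SRSWOR vs. SRSWR with the random module (seeded for reproducibility).
import random

D = list(range(1, 101))  # a dataset of N = 100 tuples
s = 8
rng = random.Random(42)

srswor = rng.sample(D, s)                  # without replacement: no repeats
srswr = [rng.choice(D) for _ in range(s)]  # with replacement: repeats possible
```

`random.sample` draws each tuple at most once, while drawing with `choice` in a loop "places the tuple back" each time, so the same tuple may appear more than once.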
Discretization approaches:
• Supervised vs. unsupervised discretization
• Top-down vs. bottom-up discretization
Concept Hierarchy Generation
• Reduce the data by collecting and replacing low-level concepts (such as
numeric values for the attribute age) with higher-level concepts (such as
young, middle-aged, or senior).
[Figure: A concept hierarchy for location: All → Country → India, Pakistan]
[Figure: A concept hierarchy for the attribute price, generated by recursively
partitioning the range (−$4,000 – $5,000) into progressively finer intervals,
e.g., (−$400 – 0), (0 – $1,000), ($1,000 – $2,000), ($2,000 – $5,000), and down
to intervals such as ($1,000 – $1,200) and ($1,200 – $1,400).]
Concept hierarchy generation for categorical data
• Categorical attributes have a finite number of distinct values, with no
ordering among the values.
• Methods:
– Specification of a partial ordering of attributes explicitly at the schema
level by users or experts.
– Specification of a portion of a hierarchy by explicit data grouping.
– Specification of a set of attributes, but not of their partial ordering.
– Specification of only a partial set of attributes.
Data Integration
• Combines data from multiple sources into a coherent store.
• These sources may include multiple databases, data cubes, or flat files.
• Issues:
– How can equivalent real-world entities from multiple data sources be
matched up? This is referred to as the entity identification problem.
Solution: Schema integration - integrate metadata from different sources
– How to deal with Redundancy?
Solution: Some redundancies can be detected by correlation analysis.
– How to detect and resolve data value conflicts?
Solution: Careful integration with special attention to structure of data.