DM UNIT-1 Question and Answer
A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in
which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell
stores the value of some aggregate measure such as count or sum(sales amount).
A data cube provides a multidimensional view of data and allows the precomputation and fast access
of summarized data.
3)Transactional Data.
In general, each record in a transactional database captures a transaction, such as a customer’s
purchase, a flight booking, or a user’s clicks on a web page.
A transaction typically includes a unique transaction identity number (trans ID) and a list of the items
making up the transaction, such as the items purchased in the transaction.
b) Noisy data:
Noise is a random error or variance in a measured variable. To remove
noise, data smoothing techniques are used.
Data Smoothing Techniques:
i) Binning:
Binning methods smooth a sorted data value by consulting its
neighborhood, i.e., the values around it.
The sorted values are distributed into a number of “buckets,” or
bins.
Smoothing by bin means: in this method, each value in a bin is
replaced by the mean value of the bin.
Smoothing by bin medians: in this method, each value in a bin is
replaced by the bin median.
Smoothing by bin boundaries: in this method, the minimum
and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest
boundary value.
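As a sketch, the three smoothers above can be written for equal-frequency bins (the bin size, function names, and price data are illustrative assumptions, not from the text):

```python
# Minimal sketch of smoothing by bin means and by bin boundaries,
# using equal-frequency (equal-depth) bins of an assumed size.

def make_bins(values, bin_size=3):
    """Sort the data and split it into equal-frequency bins."""
    s = sorted(values)
    return [s[i:i + bin_size] for i in range(0, len(s), bin_size)]

def smooth_by_means(values, bin_size=3):
    """Replace every value in a bin by the bin mean."""
    out = []
    for b in make_bins(values, bin_size):
        mean = sum(b) / len(b)
        out.extend([mean] * len(b))
    return out

def smooth_by_boundaries(values, bin_size=3):
    """Replace every value by the closest of the bin's min/max boundaries."""
    out = []
    for b in make_bins(values, bin_size):
        lo, hi = b[0], b[-1]
        out.extend([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_means(prices))       # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(smooth_by_boundaries(prices))  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```

Smoothing by bin medians follows the same pattern with the bin median in place of the mean.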
ii) Regression:
Data smoothing can also be done by regression, a technique
that conforms data values to a function.
Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to
predict the other.
Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit
to a multidimensional surface.
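A minimal sketch of smoothing by linear regression for the two-attribute case, using the closed-form least-squares line (the function name and sample data are illustrative assumptions):

```python
# Fit the least-squares line y = a + b*x for two attributes, then
# replace each y value with its fitted (smoothed) value.

def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
smoothed = [a + b * x for x in xs]  # values conformed to the fitted line
```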
4)Cluster Analysis :
• Clustering can be used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather
dissimilar to objects in other clusters.
• Each cluster so formed can be viewed as a class of objects, from which rules can be
derived.
5)Outlier Analysis
• A data set may contain objects that do not comply with the general behavior or model
of the data.
• These data objects are outliers. Many data mining methods discard outliers as noise or
exceptions.
• The analysis of outlier data is referred to as outlier analysis or anomaly mining.
c)Outlier analysis:
Outliers may be detected by clustering, for example, where
similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be
considered outliers
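As a toy illustration of this idea for 1-D values (the gap threshold, minimum cluster size, and function name are assumptions, not from the text): sorted values are grouped into clusters wherever neighboring values are close, and values in clusters that are too small fall outside the main groups and are flagged as outliers.

```python
# Cluster sorted 1-D values by gaps; values in undersized clusters
# (i.e., values far from every group) are reported as outliers.

def outliers_by_clustering(values, gap=5, min_size=2):
    s = sorted(values)
    clusters, cur = [], [s[0]]
    for v in s[1:]:
        if v - cur[-1] <= gap:
            cur.append(v)       # close to the current cluster: join it
        else:
            clusters.append(cur)  # large gap: start a new cluster
            cur = [v]
    clusters.append(cur)
    return [v for c in clusters if len(c) < min_size for v in c]

print(outliers_by_clustering([10, 12, 11, 13, 52, 14, 12]))  # [52]
```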
ii) Attribute construction: where new attributes are constructed and added
from the given set of attributes to help the mining process.
iii) Aggregation: where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute monthly
and annual total amounts.
iv) Normalization: where the attribute data are scaled so as to fall within a smaller
range, such as −1.0 to 1.0, or 0.0 to 1.0.
There are many methods for data normalization. Some are:
a) Min-max normalization:
Suppose that minA and maxA are the minimum and maximum values of an
attribute, A.
Min-max normalization maps a value, vi, of A to v'i in the range [new_minA,
new_maxA] by computing

v'i = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
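A direct sketch of this mapping (the function name is illustrative; the income example with minimum 12,000 and maximum 98,000 is the common textbook one):

```python
# Min-max normalization: scale each value into [new_min, new_max].

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Income 73,600 with min 12,000 and max 98,000 maps to about 0.716.
incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))  # [0.0, 0.716..., 1.0]
```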
4b) Describe the various phases in knowledge discovery process with a neat
diagram.
Ans: The knowledge discovery (KDD) process typically proceeds through the
following phases: data cleaning (removing noise and inconsistent data), data
integration (combining multiple data sources), data selection (retrieving
task-relevant data), data transformation (consolidating data into forms
appropriate for mining), data mining (applying intelligent methods to extract
patterns), pattern evaluation (identifying the truly interesting patterns), and
knowledge presentation (presenting the mined knowledge to the user).
4c) What are the typical methods for Data Discretization and Concept
Hierarchy Generation?
Ans Data Discretization and Concept Hierarchy Generation:
• Data discretization: Data discretization techniques can be used to reduce the number
of values for a given continuous attribute by dividing the range of the attribute into
intervals.
• Interval labels can then be used to replace actual data values.
• If the discretization process uses class information, then we say it is supervised
discretization.
• Otherwise, it is unsupervised. If the process starts by first finding one or a few points
(called split points or cut points) to split the entire attribute range, and then repeats
this recursively on the resulting intervals, it is called top-down discretization or
splitting.
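A minimal sketch of unsupervised discretization by equal-width binning (the number of intervals k, the sample ages, the function name, and the label format are all assumptions):

```python
# Split the attribute range into k equal-width intervals and replace
# each value with its interval label.

def equal_width_discretize(values, k=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp the max into the last bin
        labels.append(f"[{lo + idx * width:g}, {lo + (idx + 1) * width:g})")
    return labels

ages = [13, 15, 16, 19, 20, 21, 35, 40, 52, 70]
print(equal_width_discretize(ages, k=3))  # three interval labels over [13, 70]
```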
• Ans: Data integration combines data from multiple sources into a coherent data store.
iii)Clustering
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters.
• Similarity is commonly defined in terms of how “close” the objects are in space,
based on a distance function.
• The “quality” of a cluster may be represented by its diameter, the maximum distance
between any two objects in the cluster.
• Centroid distance is an alternative measure of cluster quality and is defined as the
average distance of each cluster object from the cluster centroid
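The two quality measures above can be sketched for 2-D points with Euclidean distance (the sample cluster of four corner points is illustrative):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def diameter(cluster):
    """Maximum distance between any two objects in the cluster."""
    return max(dist(p, q) for p in cluster for q in cluster)

def centroid_distance(cluster):
    """Average distance of each object from the cluster centroid."""
    n = len(cluster)
    centroid = tuple(sum(coord) / n for coord in zip(*cluster))
    return sum(dist(p, centroid) for p in cluster) / n

cluster = [(0, 0), (0, 2), (2, 0), (2, 2)]
print(diameter(cluster))           # 2.828... (the square's diagonal)
print(centroid_distance(cluster))  # 1.414... (each corner is sqrt(2) from (1, 1))
```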
iv)Sampling
• Sampling can be used as a data reduction technique because it allows a large data set
to be represented by a much smaller random sample of the data.
• Suppose that a large data set, D, contains N tuples. Let’s look at the most common
ways that we could sample D for data reduction:
• Simple random sample without replacement (SRSWOR) of size s: This is created
by drawing s of the N tuples from D (s < N), where the probability of drawing any
tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size s: This is similar to
SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn
again
• Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,”
then an SRS of s clusters can be obtained, where s < M. For example, tuples in a
database are usually retrieved a page at a time, so that each page can be considered a
cluster. A reduced data representation can be obtained by applying, say, SRSWOR to
the pages, resulting in a cluster sample of the tuples
• Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified
sample of D is generated by obtaining an SRS at each stratum.
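The four sampling schemes can be sketched with the standard-library random module (the function names and the toy cluster/strata layouts are assumptions):

```python
import random

def srswor(data, s):
    """Simple random sample without replacement: s distinct tuples."""
    return random.sample(data, s)

def srswr(data, s):
    """Simple random sample with replacement: a tuple may repeat."""
    return random.choices(data, k=s)

def cluster_sample(clusters, s):
    """SRSWOR of s whole clusters, flattened into one sample."""
    picked = random.sample(clusters, s)
    return [t for c in picked for t in c]

def stratified_sample(strata, s_per_stratum):
    """An SRS drawn independently from each stratum."""
    return [t for stratum in strata
            for t in random.sample(stratum, s_per_stratum)]

D = list(range(100))
print(len(srswor(D, 10)), len(srswr(D, 10)))  # 10 10
```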