Module 2
Syllabus
Data warehouse implementation & Data mining: Efficient Data Cube Computation: An Overview,
Indexing OLAP Data: Bitmap Index and Join Index, Efficient Processing of OLAP Queries, OLAP Server
Architecture: ROLAP versus MOLAP versus HOLAP. Introduction: What is Data Mining, Challenges,
Data Mining Tasks, Data: Types of Data, Data Quality, Data Preprocessing, Measures of Similarity and
Dissimilarity.
The following general optimization techniques are used for the efficient computation of data cubes:
Sorting, hashing, and grouping − Sorting, hashing, and grouping operations should be applied to the
dimension attributes in order to reorder and cluster related tuples. In cube computation, aggregation is
performed on the tuples (or cells) that share the same set of dimension values. It is therefore important to
exploit sorting, hashing, and grouping operations to access and group such data together and so facilitate the
computation of such aggregates.
For example, to compute total sales by branch, day, and item, it can be more efficient to sort tuples or cells by
branch, then by day, and then to group them according to the item name. Efficient implementations of such
operations in large data sets have been extensively studied in the database research community, and these
implementations can be extended to data cube computation. This technique can also be further extended to
perform shared-sorts (i.e., sharing sorting costs across multiple cuboids when sort-based methods are used)
or shared-partitions (i.e., sharing the partitioning cost across multiple cuboids when hash-based algorithms
are used).
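As a plain-SQL illustration of the branch/day/item example (table and column names here are assumed, not taken from the text), the detailed group-by below benefits from tuples being sorted on (branch, day, item); a cube algorithm that keeps this sort order can reuse a prefix of it, a shared-sort, when it later aggregates to (branch, day):

-- Assumed table: sales(branch, day, item, dollars_sold).
-- Sorting and grouping on (branch, day, item) produces the most detailed cuboid.
SELECT branch, day, item, SUM(dollars_sold) AS total_sales
FROM   sales
GROUP BY branch, day, item
ORDER BY branch, day, item;

-- A prefix of the same sort order also serves the (branch, day) roll-up,
-- which is how shared-sorts save work inside cube computation algorithms.
SELECT branch, day, SUM(dollars_sold) AS total_sales
FROM   sales
GROUP BY branch, day;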
Aggregation from the smallest child when there exist multiple child cuboids − When there exist
multiple child cuboids, it is usually more efficient to compute the desired parent (i.e., more generalized)
cuboid from the smallest previously computed child cuboid.
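For instance (a sketch with hypothetical cuboid tables), if both the (item, year) and (item, city) cuboids have already been materialized and there are far fewer distinct years than cities, the (item, year) cuboid is the smaller child, and the parent (item) cuboid is best derived from it:

-- Hypothetical materialized child cuboid cuboid_item_year(item, year, total_sales),
-- assumed to be much smaller than cuboid_item_city.
SELECT item, SUM(total_sales) AS total_sales
FROM   cuboid_item_year
GROUP BY item;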
The Apriori pruning method can be explored to compute iceberg cubes efficiently − The Apriori
property, in the context of data cubes, states the following: if a given cell does not satisfy minimum support,
then no descendant (i.e., more specialized) cell of it will satisfy minimum support either. This property can be
used to substantially reduce the computation of iceberg cubes.
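In SQL terms, an iceberg cube can be sketched as a cube computation with a HAVING threshold, so that cells (and, by the Apriori property, all of their descendants) failing the minimum support are pruned. The table, columns, and threshold below are illustrative, and GROUP BY CUBE syntax varies across database systems:

-- Illustrative iceberg cube sketch; names and the support value of 100 are assumed.
SELECT   month, city, customer_group, COUNT(*) AS support_count
FROM     salesInfo
GROUP BY CUBE (month, city, customer_group)
HAVING   COUNT(*) >= 100;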
One approach to cube computation extends SQL so as to include a compute cube operator. The compute
cube operator computes aggregates over all subsets of the dimensions specified in the operation. This can
require excessive storage space, especially for large numbers of dimensions. We start with an intuitive
look at what is involved in the efficient computation of data cubes.
Example 1: A data cube is a lattice of cuboids. Suppose that you want to create a data cube for
AllElectronics sales that contains the following: city, item, year, and sales in dollars. You want to be able
to analyze the data with queries such as the following: "Compute the sum of sales, grouping by city and
item," "Compute the sum of sales, grouping by city," and "Compute the sum of sales, grouping by item."
What is the total number of cuboids, or group-by's, that can be computed for this data cube? Taking the
three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars as the
measure, the total number of cuboids, or group-by's, that can be computed for this data cube is 2^3 = 8.
The possible group-by's are the following: {(city, item, year), (city, item), (city, year), (item, year), (city),
(item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are not grouped). These
group-by's form a lattice of cuboids for the data cube, as shown in Figure 1.
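Many relational systems expose this kind of functionality through the SQL GROUP BY CUBE extension, which computes all 2^n group-by's of n dimensions in a single statement. A minimal sketch, assuming a hypothetical sales table with the three dimensions above:

-- Computes all 8 cuboids of the (city, item, year) lattice in one statement.
-- Table and column names are assumed; CUBE support varies by database system.
SELECT   city, item, year, SUM(sales_in_dollars) AS total_sales
FROM     sales
GROUP BY CUBE (city, item, year);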
Bitmap Indexes
Bitmap indexes are widely used in data warehousing environments. These environments typically have
large amounts of data and many ad hoc queries, but a low level of concurrent DML transactions. For such
applications, bitmap indexing provides significant advantages.
Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of
space because the indexes can be several times larger than the data in the table. Bitmap indexes are
typically only a fraction of the size of the indexed data in the table.
A bitmap index is a special type of index built on a single key; however, it is designed to answer queries on
multiple keys quickly. The records need to be arranged in sequential order before bitmap indexing is
applied, which makes it simple to fetch a particular record from a block and easy to allocate records within
the blocks of a file.
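In systems that support it (Oracle, for example), a bitmap index is created with DDL of the following form; the table and its low-cardinality column are placeholders, not names from the text:

-- Oracle-style syntax; 'sales' and its low-cardinality column 'region' are assumed.
CREATE BITMAP INDEX sales_region_bix
ON sales (region);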
Disadvantages –
It is suitable only for large tables.
Building and maintaining bitmap indexes can be time consuming.
Join Index
The join indexing method gained popularity from its use in relational database query processing. Join
indexes are designed to permit queries (join queries in the case of multitable join indexes) to be resolved
by accessing the index instead of accessing, and possibly joining, their underlying base tables. Traditional
indexing maps the value in a given column to a list of rows having that value. In contrast, join indexing
registers the joinable rows of two relations from a relational database. For example, if two relations
R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID,
SID), where RID and SID are record identifiers from the R and S relations, respectively. Hence, the join
index records can identify joinable tuples without performing costly join operations. Join indexing is
especially useful for maintaining the relationship between a foreign key and its matching primary keys,
from the joinable relation.
Example: Join indexing. Earlier we defined a star schema for AllElectronics of the form "sales star [time, item,
branch, location]: dollars_sold = sum(sales_in_dollars)." An example of a join index relationship between
the sales fact table and the location and item dimension tables is shown in Figure 3. For example, the
"Main Street" value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact
table. Similarly, the "Sony-TV" value in the item dimension table joins with tuples T57 and T459 of the
sales fact table. The corresponding join index tables are shown in Figure 4.
Figure 4: Join index tables based on the linkages between the sales fact table and the location and item
dimension tables shown in Figure 3.
A JOIN INDEX is a materialized view. Its definition is permanently stored, and its data is updated
whenever the base tables referred to in the join index are updated. A JOIN INDEX may contain one or more
tables and may also contain pre-aggregated data. Join indexes are mainly used to improve query
performance.
There are different types of join indexes available.
Single-table join indexes are created from exactly one base table. Their purpose is to make the same table
available with a different primary index, a different partitioning, or as a smaller table (the join index table)
with fewer columns to be spooled. This improves the performance of joins because no redistribution or
duplication of rows is needed. The user still queries the base table, but the Parsing Engine (PE) decides
whether to access the base table or the single-table join index.
Single-Table Join Index: fewer columns from a single table, with a different primary index.
Example:
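A minimal Teradata-style sketch, assuming an employee table in tutorial_db; the join index keeps fewer columns and uses a different primary index than the base table:

-- Sketch only: tutorial_db.employee and its columns are assumed.
CREATE JOIN INDEX Single_Table_Join_Index AS
SELECT emp_no, emp_name, dept_no, salary
FROM   tutorial_db.employee
PRIMARY INDEX (dept_no);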
A multi-table join index is created by joining more than one table. A multi-table join index can be
used to store the result set of frequently joined tables to improve performance.
A multi-table join index holds a pre-joined result set from two or more tables. During join
processing, the PE may decide to access data from the multi-table join index rather than
joining the underlying base tables again. We should define a multi-table join index only after
careful analysis of the frequency and cost of the joins involved.
Multi-table join indexes allow us to move resource-intensive joins from the online window to the batch
window.
Shifting the workload does not reduce the total workload, but it moves it to a point in time that is
beneficial for the system's overall performance.
Multi-Table Join Index: pre-joins multiple base tables.
CREATE JOIN INDEX Multi_table_Join_Index_name AS
SELECT
emp.emp_name,
emp.job_title,
emp.salary,
emp.dept_no,
dept.department_name,
dept.loc_name
FROM tutorial_db.employee AS emp
-- The department table and join condition below are assumed in order to complete the example.
INNER JOIN tutorial_db.department AS dept
ON emp.dept_no = dept.dept_no;
** Only the COUNT and SUM aggregate functions are allowed in an aggregate join index.
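For reference, an aggregate join index can be sketched as follows (Teradata-style syntax; the grouping column and table are the same assumed names used above):

-- Sketch of an aggregate join index; pre-aggregates salary per department.
CREATE JOIN INDEX Dept_Salary_AJI AS
SELECT dept_no, SUM(salary) AS total_salary, COUNT(*) AS employee_count
FROM   tutorial_db.employee
GROUP BY dept_no;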
Efficient Processing of OLAP Queries
The purpose of materializing cuboids and constructing OLAP index structures is to speed up query
processing on data cubes. Given materialized views, query processing proceeds as follows.
Determine which operations should be performed on the available cuboids: This involves transforming
any selection, projection, roll-up (group-by), and drill-down operations specified in the query into
corresponding SQL and/or OLAP operations. For example, slicing and dicing a data cube may correspond
to selection and/or projection operations on a materialized cuboid.
Determine to which materialized cuboid(s) the relevant operations should be applied: This involves
identifying all of the materialized cuboids that may potentially be used to answer the query, pruning this
set using knowledge of "dominance" relationships among the cuboids, estimating the costs of using the
remaining materialized cuboids, and selecting the cuboid with the least cost.
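As a concrete sketch (table and column names are hypothetical), a query asking for total sales per city in a given year does not need the base fact table at all; it can be answered directly from a materialized (city, year) cuboid:

-- Hypothetical materialized cuboid table cuboid_city_year(city, year, total_sales).
SELECT city, SUM(total_sales) AS total_sales
FROM   cuboid_city_year
WHERE  year = 2010
GROUP BY city;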
OLAP Servers
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to
make business decisions. OLAP provides a platform for gaining insights from databases retrieved from
multiple database systems at the same time. It is based on a multidimensional data model, which enables
users to extract and view data from various perspectives. A multidimensional database is used to store
OLAP data. Many Business Intelligence (BI) applications rely on OLAP technology.
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational
database, where both the base data and dimension tables are stored as relational tables. ROLAP servers
are used to bridge the gap between the relational back-end server and the client's front-end tools.
ROLAP servers store and manage warehouse data using RDBMS, and OLAP middleware fills in the
gaps.
Benefits:
It is compatible with data warehouses and OLTP systems.
The data size limitation of ROLAP technology is determined by the underlying RDBMS. As
a result, ROLAP does not limit the amount of data that can be stored.
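In practice, a ROLAP engine answers multidimensional requests by generating SQL over the star schema; a roll-up of sales to the country level might be translated roughly as follows (the fact and dimension table names are assumed, not taken from the text):

-- Assumed star schema: fact table 'sales' and dimension table 'location'.
SELECT   l.country, SUM(s.dollars_sold) AS total_sales
FROM     sales    AS s
JOIN     location AS l
ON       s.location_key = l.location_key
GROUP BY l.country;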
MOLAP stores data on discs in the form of a specialized multidimensional array structure. It is used for
OLAP, which is based on the arrays' random access capability. Dimension instances determine array
elements, and the data or measured value associated with each cell is typically stored in the corresponding
array element. The multidimensional array is typically stored in MOLAP in a linear allocation based on
nested traversal of the axes in some predetermined order.
However, unlike ROLAP, which stores only records with non-zero facts, all array elements are defined in
MOLAP, and as a result, the arrays tend to be sparse, with empty elements occupying a larger portion of
them. MOLAP systems typically include provisions such as advanced indexing and hashing to locate data
while performing queries for handling sparse arrays, because both storage and retrieval costs are
important when evaluating online performance. MOLAP cubes are ideal for slicing and dicing data and
can perform complex calculations. When the cube is created, all calculations are pre-generated.
Benefits:
Suitable for slicing and dicing operations.
Outperforms ROLAP when data is dense.
Capable of performing complex calculations.
Limitations:
It is difficult to change the dimensions without re-aggregating.
Since all calculations are performed when the cube is built, a large amount of data cannot be
stored in the cube itself.
Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, typically storing detailed
data in a relational database while keeping aggregations in a multidimensional store.
Benefits:
HOLAP combines the benefits of MOLAP and ROLAP.
It provides quick access at all aggregation levels.
Limitations:
Because it supports both MOLAP and ROLAP servers, the HOLAP architecture is very complex.
There is a greater likelihood of overlap, particularly in their functionalities.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases, some
database system vendors implement specialized SQL servers that provide advanced query language and
query processing support for SQL queries over star and snowflake schemas in a read-only environment.
What is Data Mining?
Data mining is the process of extracting and discovering patterns in large data sets, involving methods at the
intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary
subfield of computer science and statistics, with the overall goal of extracting information (with intelligent
methods) from a data set and transforming that information into a comprehensible structure for further use.
Challenges in Data Mining
Security and Privacy:
Decision-making strategies depend on data collection and sharing, so they require considerable security.
Private information about individuals and other sensitive information is collected to build customer profiles
and to understand user behaviour patterns. Illegal access to this information, and the confidential nature of
the information itself, are becoming important issues.
User Interface:
The knowledge discovered using data mining tools is useful only if it is interesting and, above all,
understandable to the user. Good visualization eases the interpretation of data mining results and helps
users better understand their requirements. Much research on visualization is carried out for large data
sets in order to display and manipulate mined knowledge.
(i) Mining based on Level of Abstraction: The data mining process needs to be interactive, because this
allows users to concentrate on finding patterns and to present and refine data mining requests based on the
returned results.
(ii) Integration of Background Knowledge: Background knowledge may be used to guide the discovery
process and to express the discovered patterns.
These challenges are related to data mining approaches and their limitations.
Types of Data
Quantitative data − Quantitative data can be expressed as a number or can be quantified; simply put, it can
be measured by numerical variables. Quantitative data are easily amenable to statistical manipulation and
can be represented by a wide variety of statistical charts and graphs, such as line charts, bar graphs, and
scatter plots.
Qualitative data − Qualitative data can't be expressed as a number and can't be measured. Qualitative data
consist of words, pictures, and symbols, not numbers. Qualitative data is also called categorical data
because the information can be sorted by category, not by number.
Ordinal data − Ordinal data shows where a number is in order. This is the crucial difference from nominal
types of data. Ordinal data is data which is placed into some kind of order by its position on a scale.
Ordinal data may indicate superiority. Ordinal variables are considered to be "in between" qualitative and
quantitative variables. In other words, ordinal data is qualitative data for which the values are ordered.
Discrete data − Discrete data is a count that involves only integers. The discrete values cannot be
subdivided into parts. For example, the number of children in a class is discrete data. You can count
whole individuals. You can't count 1.5 kids.
Continuous data − Continuous data is information that can be meaningfully divided into finer levels. It
can be measured on a scale or continuum and can have almost any numeric value. For example, you can
measure your height at very precise scales (meters, centimeters, millimeters, and so on). You can record
continuous data at many different measurements: width, temperature, time, and so on. This is where the
key difference from discrete types of data lies.
Data Quality
Data quality refers to the state of qualitative or quantitative pieces of information. There are many
definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses
in operations, decision making and planning".
Data quality is the measure of how well suited a data set is to serve its specific purpose. Measures of data
quality are based on data quality characteristics such as accuracy, completeness, consistency, validity,
uniqueness, and timeliness.
Data quality refers to the development and implementation of activities that apply quality management
techniques to data in order to ensure the data is fit to serve the specific needs of an organization in a
particular context. Data that is deemed fit for its intended purpose is considered high quality data.
Examples of data quality issues include duplicated data, incomplete data, inconsistent data, incorrect data,
poorly defined data, poorly organized data, and poor data security.
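Many of these issues can be detected with simple profiling queries; for example, duplicated records can be flagged as sketched below (the table and key names are assumptions for illustration):

-- Flags customer identifiers that appear more than once in an assumed 'customers' table.
SELECT   customer_id, COUNT(*) AS occurrences
FROM     customers
GROUP BY customer_id
HAVING   COUNT(*) > 1;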
Data Preprocessing
Data preprocessing refers to the manipulation or dropping of data before it is used, in order to ensure or
enhance performance, and it is an important step in the data mining process. The phrase "garbage in,
garbage out" is particularly applicable to data mining and machine learning projects.
Data preprocessing is a step in the data mining and data analysis process that takes raw data and
transforms it into a format that can be understood and analyzed by computers and machine learning
algorithms. Data preprocessing is a broad area and consists of a number of different strategies and
techniques that are interrelated in complex ways. We will present some of the most important ideas and
approaches, and try to point out the interrelationships among them. Specifically, we will discuss the
following topics:
Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Variable transformation
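Two of these strategies map directly onto SQL: aggregation is a GROUP BY, and several systems offer a TABLESAMPLE clause for sampling (support and syntax vary by database). The table and columns below are hypothetical:

-- Aggregation: summarize transactions to monthly totals per store (assumed table/columns).
SELECT   store_id, sales_month, SUM(amount) AS monthly_sales
FROM     transactions
GROUP BY store_id, sales_month;

-- Sampling: draw an approximate 1 percent sample (TABLESAMPLE support varies by system).
SELECT *
FROM   transactions TABLESAMPLE SYSTEM (1);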
Measures of Similarity and Dissimilarity
Similarity and dissimilarity are important because they are used by a number of data mining techniques,
such as clustering, nearest neighbor classification, and anomaly detection. The term proximity is used to
refer to either similarity or dissimilarity.
Distance or similarity measures are essential in solving many pattern recognition problems such as
classification and clustering. Various distance/similarity measures are available in the literature to
compare two data distributions. As the name suggests, a similarity measure quantifies how close two
distributions are. For multivariate data, complex summary methods have been developed to answer this
question.
Similarity Measure
A numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1
(complete similarity).
Dissimilarity Measure
A numerical measure of how different two data objects are; it often ranges from 0 (objects are alike)
to ∞ (objects are completely different).
Here, p and q denote two data objects (or their attribute values).
Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
Common Properties of Dissimilarity Measures
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q,p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity) between
points (data objects), p and q.
A distance that satisfies these properties is called a metric. Following is a list of several common distance
measures to compare multivariate data. We will assume that the attributes are all continuous.
Minkowski Distance
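As commonly defined, the Minkowski distance between two n-dimensional data objects p and q generalizes several of the distances above:

d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r}

where r = 1 gives the Manhattan (city-block) distance and r = 2 gives the Euclidean distance.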