Unit1 (DW&DM)
Data Warehousing :
Data warehouse access tools are categorized into five groups: 1. Data reporting and query tools, 2. Application development tools, 3. EIS tools, 4. OLAP tools, 5. Data mining tools.
Single-tier architecture
The objective of a single-tier architecture is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-tier architecture physically separates the available data sources from the data warehouse. This architecture is not expandable and does not support a large number of end-users. It also suffers from connectivity problems caused by network limitations.
Three-tier architecture
This is the most widely used architecture. It consists of a Bottom, Middle and Top tier. The Bottom tier is the warehouse database server, the Middle tier is an OLAP server, and the Top tier is the front-end client layer: the tools and APIs through which you connect and get data out of the data warehouse. These can be query tools, reporting tools, managed query tools, analysis tools and data mining tools.
The data warehouse is based on an RDBMS server which is a central
information repository that is surrounded by some key components to make
the entire environment functional, manageable and accessible.
Extraction :
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in various formats such as relational databases, NoSQL stores, XML and flat files, into the staging area. It is important to extract the data from the source systems into the staging area first, rather than directly into the data warehouse, because the extracted data comes in various formats and can also be corrupted. Loading it directly into the data warehouse could damage it, and a rollback would be much more difficult. This makes extraction one of the most important steps of the ETL process.
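As an illustration of the extraction step, here is a minimal Python sketch. The source formats (a CSV flat file and a JSON export), field names and staging structure are all hypothetical; the point is that rows from heterogeneous sources land in one staging area untransformed:

```python
import csv
import io
import json

def extract_to_staging(csv_text, json_text):
    """Pull rows from heterogeneous sources (CSV, JSON) into one
    staging list, without transforming them yet."""
    staging = []

    # Source 1: a flat file in CSV format
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["_source"] = "csv"   # tag each row with its origin
        staging.append(row)

    # Source 2: a JSON export (e.g. from a NoSQL store)
    for row in json.loads(json_text):
        row["_source"] = "json"
        staging.append(row)

    return staging

csv_data = "id,name\n1,Alice\n2,Bob\n"
json_data = '[{"id": "3", "name": "Carol"}]'
staged = extract_to_staging(csv_data, json_data)
print(len(staged))  # 3 rows in the staging area
```

Note that the rows keep their source-specific shapes here; unifying them is deliberately left to the transformation step.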
Transformation :
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States and America to USA, etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
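The tasks above can be sketched in plain Python. The row layout and attribute names here are illustrative assumptions, not part of any particular warehouse schema:

```python
def transform(rows):
    """Apply the transformation tasks: filter, clean, split, join, sort."""
    country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    out = []
    for r in rows:
        # Filtering: load only the attributes the warehouse needs
        rec = {k: r.get(k) for k in ("id", "name", "city", "country", "age")}
        # Cleaning: fill NULL values with defaults, standardise spellings
        if rec["age"] is None:
            rec["age"] = 0
        rec["country"] = country_map.get(rec["country"], rec["country"])
        # Splitting: a single attribute becomes multiple attributes
        rec["first_name"], rec["last_name"] = rec.pop("name").split(" ", 1)
        # Joining: multiple attributes become one
        rec["location"] = f'{rec.pop("city")}, {rec["country"]}'
        out.append(rec)
    # Sorting: order tuples on the key attribute
    return sorted(out, key=lambda r: r["id"])

rows = [
    {"id": 2, "name": "Bob Jones", "city": "Austin",
     "country": "U.S.A", "age": None, "signup": "web"},
    {"id": 1, "name": "Alice Smith", "city": "Boston",
     "country": "United States", "age": 34},
]
clean = transform(rows)
print(clean[0]["country"], clean[0]["location"])  # USA Boston, USA
```

After this step every row has the same shape and vocabulary, which is exactly what "a single standard format" means.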
Loading :
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes data is loaded into the warehouse very frequently, and sometimes at longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
The ETL process can also use pipelining: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. Likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
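One simple way to realise this pipelining in Python is with generators, where each stage pulls rows from the previous one item by item. This is a minimal sketch, not a production ETL engine; the sample rows and the uppercase transform are arbitrary:

```python
def extract(source_rows):
    """Yield rows one at a time, as if read from a source system."""
    for row in source_rows:
        yield row

def transform(rows):
    """Transform each row as soon as it is extracted."""
    for row in rows:
        yield {**row, "name": row["name"].upper()}

def load(rows, warehouse):
    """Append transformed rows to the warehouse as they arrive."""
    for row in rows:
        warehouse.append(row)

warehouse = []
source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
# Because each stage is a generator, extraction, transformation and
# loading overlap: no stage waits for the previous one to finish all rows.
load(transform(extract(source)), warehouse)
print(warehouse[0]["name"])  # ALICE
```

In a real system the stages would typically run in separate processes connected by queues, but the overlap principle is the same.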
Data Mart
A data mart is focused on a single functional area of an organization and
contains a subset of data stored in a Data Warehouse.
A data mart usually draws data from only a few sources compared to a data warehouse, such as a central data warehouse or operational systems. Data marts are small in size and more flexible than a data warehouse.
Data mining is the process of looking for hidden, valid, and potentially useful patterns in huge data sets. Data mining is all about discovering unsuspected, previously unknown relationships amongst the data.
The insights derived via Data Mining can be used for marketing, fraud
detection, and scientific discovery, etc.
First, you need to understand the business and client objectives. You need to define what your client wants (which many times even they do not know themselves).
Take stock of the current data mining scenario. Factor resources, assumptions, constraints, and other significant factors into your assessment.
Using the business objectives and the current scenario, define your data mining goals.
A good data mining plan is very detailed and should be developed to
accomplish both business and data mining goals.
The data preparation process consumes about 90% of the time of the project.
Data cleaning is a process to "clean" the data by smoothing noisy data and
filling in missing values.
For example, in a customer demographics profile, the age data may be missing. The data is incomplete and should be filled. In some cases, there could be data outliers; for instance, an age with the value 300. Data could also be inconsistent; for instance, the name of the same customer is spelled differently in different tables.
The result of this process is a final data set that can be used in modeling.
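The cleaning steps just described can be sketched in Python. The customer profiles, the age cap of 120, and the median-fill strategy are illustrative assumptions; real projects choose fill values and outlier rules per attribute:

```python
def clean(profiles):
    """Fill missing ages, replace implausible outliers (e.g. age 300),
    and standardise inconsistent name spellings."""
    # Median of the plausible ages, used as the default fill value
    known = sorted(p["age"] for p in profiles
                   if p["age"] is not None and p["age"] <= 120)
    median_age = known[len(known) // 2]
    for p in profiles:
        if p["age"] is None:        # incomplete: fill the missing value
            p["age"] = median_age
        elif p["age"] > 120:        # outlier: smooth the noisy value
            p["age"] = median_age
        # inconsistent: same customer spelled differently across tables
        p["name"] = p["name"].strip().title()
    return profiles

profiles = [
    {"name": "alice smith", "age": 34},
    {"name": "BOB JONES", "age": None},
    {"name": " Carol Lee", "age": 300},
]
cleaned = clean(profiles)
print([p["age"] for p in cleaned])  # [34, 34, 34]
```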
Outliers are defined as data objects that do not comply with the general behaviour or model of the available data.
The term refers mainly to observations in a data set that do not match an expected pattern.
Anomalies are also known as outliers, novelties, noise, deviations, and exceptions, as these anomalies provide critical and actionable information.
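A common way to flag such observations is a z-score test: values far from the mean, measured in standard deviations, do not comply with the general behaviour of the data. The age data and the threshold of 2 are illustrative assumptions (many texts use 3 for larger samples):

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in
    standard deviations) exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

ages = [25, 31, 28, 45, 38, 29, 300]  # 300 does not match the pattern
print(find_outliers(ages))  # [300]
```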
6. Prediction Technique
Applications and their usage:
Service Providers – Service providers, such as mobile phone and utility companies, use data mining to predict when and why a customer leaves. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability score and offer incentives.
Bioinformatics – Data mining helps to mine biological data from massive data sets gathered in biology and medicine.