DWM - Te - Week-1
DWM - Te - Week-1
DWM - Te - Week-1
MINING
Unit No:1
Unit Name: INTRODUCTION TO DATA WAREHOUSING
2
Unit :1 Unit Name: Introduction to Data Warehousing
Lecture No: 1
Introduction to Data Warehouse,
Data warehouse architecture
What is Data Warehouse ?
Bill Inmon, considered to be the father of Data Warehousing provides the following
definition:
▪ “A Data Warehouse is a subject oriented, integrated, nonvolatile, and
time variant collection of data in support of management’s decisions.”
▪ Before the data from various disparate sources can be usefully stored in a data
warehouse, you have to:
▪ remove the inconsistencies;
▪ standardize the various data elements;
▪ make sure of the meanings of data names in each source application
▪ Before moving the data into the data warehouse, you have to go through a process
of transformation, consolidation, and integration of the source data
▪ For an operational system, the stored data contains the current values.
▪ The data in the data warehouse is meant for analysis and decision making.
▪ A data warehouse, because of the vary nature of its purpose, has to contain
historical data, not just current values
▪ Data is stored as snapshots over past and current periods
▪ Every data structure in the data warehouse contains the time element
Two-tier architecture
It separates physically available sources and data warehouse.
Not expandable and also not supporting a large number of end-users.
It also has connectivity problems because of network limitations.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server.
It is the relational database system. We use the back end tools and utilities to feed
data into the bottom tier. These back end tools and utilities perform the Extract, Clean,
Load, and refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in
either of the following ways.
● By Relational OLAP (ROLAP), which is an extended relational database management
system. The ROLAP maps the operations on multidimensional data to standard
relational operations.
● By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools,
reporting tools, analysis tools and data mining tools.
Lecture No: 2
Data warehouse versus Data Marts,
E-R Modelling versus Dimensional
Modelling
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable
to specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular
group. For example, the marketing data mart may contain data related to items,
customers, and sales. Data marts are confined to subjects.
20 Lecture no 2:Data warehouse versus Data Marts, E-R Modelling versus Dimensional Modelling
Data Mart
● The implementation data mart cycles is measured in short periods of time, i.e., in
weeks rather than months or years.
● Data Mart is smaller version of Data Warehouse which deals with single subject
● Data marts are focused on one area, hence they draw data from limited number of
sources
● Time taken to build data mart is very less compared to DWH
Lecture no 2:Data warehouse versus Data Marts, E-R Modelling versus Dimensional
22
Modelling
Types of Data Mart
● Data Warehouse design approaches are very important aspect of building data
warehouse.
● Selection of right data warehouse design could save lot of time and project
cost.
Lecture no 2:Data warehouse versus Data Marts, E-R Modelling versus Dimensional
24
Modelling
Top-Down Approach for Data warehouse Design
Fact
Facts are the measurements/metrics or facts from your business process. For a Sales
business process, a measurement would be quarterly sales number
Dimension
Dimension provides the context surrounding a business process event. In simple terms,
they give who, what, where of a fact. In the Sales business process, for the fact quarterly
sales number, dimensions would be
Attributes
The Attributes are the various characteristics of the dimension in dimensional data
modeling.
Attributes are used to search, filter, or classify facts. Dimension Tables contain
Attributes
Lecture No: 3
Information Package Diagram, Data
Warehouse Schemas; Star Schema
DEFINING THE BUSINESS REQUIREMENTS
Information Packages –
•Novel idea for determining and recording information requirements for a data
warehouse.
•The relevant dimension and measurements in that dimension are captured and
kept in a data warehouse
Business dimensions
•In requirements collection phase, the end users can provide the measurements
which are important to that department.
•They can also give insights of combining the various pieces of information for
strategic decision making.
•Managers think of business in terms of business dimensions
•The managers try to evaluate business in different dimensions.
• When a user analyzes the measurements along a business dimension, the user usually would
like to see the numbers first in summary and then at various levels of detail.
• What the user does here is to traverse the hierarchical levels of a business dimension for getting
the details at various levels.
• The hierarchy of the time dimension consists of the levels of year, quarter, and month.
• The dimension hierarchies are the paths for drilling down or rolling up in our analysis.
• Within each major business dimension there are categories of data elements that can also be
useful for analysis.
• Hierarchies and categories are included in the information packages for each dimension.
● Dimensional Data Modeling is one of the data modeling techniques used in data warehouse
design.
● The concept of Dimensional Modeling was developed by Ralph Kimball which is comprised
of facts and dimension tables
● Since the main goal of this modeling is to improve the data retrieval so it is optimized for
SELECT OPERATION
● The advantage of using this model is that we can store data in such a way that it is easier to
store and retrieve the data once stored in a data warehouse.
End Users fires a queries on these tables which contains descriptive information
● SALES is a fact table having attributes i.e. (Product ID, Order ID, Customer ID, Employer ID,
Total, Quantity, Discount) which references to the dimension tables(first 4) and next 3 are
measures.
● Employee dimension table contains the attributes: Emp ID, Emp Name, Title, Department
and Region.
● Product dimension table contains the attributes: Product ID, Product Name, Product
Category, Unit Price.
● Customer dimension table contains the attributes: Customer ID, Customer Name, Address,
City, Zip.
● Time dimension table contains the attributes: Order ID, Order Date, Year, Quarter, Month.
Sales price, sale quantity, distance, speed, weight, and weight measurements are few examples
of fact data in star schema.
Simpler Queries:
Join logic of star schema is quite cinch in compare to other join logic which are needed to fetch
data from a transactional schema that is highly normalized.
Feeding Cubes:
Star schema is widely used by all OLAP systems to design OLAP cubes efficiently. In fact, major
OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source
without designing a cube structure.
●Data integrity is not enforced well since in a highly de-normalized schema state.
●Star schemas don’t reinforce many-to-many relationships within business entities – at least
not frequently.