DWH Fundamentals
DWH Fundamentals
DWH Fundamentals
TCS Confidential
Course Roadmap
OPERATIONAL DATABASE:
Online Transaction Processing
Designed for running the business and not suitable for
analyzing the business in the prospect Of business executives
because data volatile nature (Keep on changing)
It does not maintain historical data.
It contains only current data.
If u insert any new values it will update
Eg:
Acnthno
Acnthsal
1072
13,000 20,000
hardware is different
Understanding The Differences Is The Key
Data Warehouse
Transaction Processing
Query Processing
Time Sensitive
History Oriented
Operator View
Managerial View
Normalized Efficient
Design for TP
Query Processing
Data Warehouse
OLTP Vs Warehouse
Operational System
Designed for Atmocity,
Consistency, Isolation and
Durability
Organized by transactions
(Order, Input, Inventory)
Organized by subject
(Customer, Product)
Volatile Data
Operational System
Data Warehouse
Performance Sensitive
Not Flexible
Flexible
Efficiency
Effectiveness
Subject Oriented
Entry
Sales Rep
Quantity Sold
Part Number
Date
Customer Name
Product Description
Unit Price
Mail Address
Transactional Storage
Sales
Sales
Customers
Customers
Products
Products
Data Warehouse Storage
10
Integration of Data
M, F
Unit of
Attributes
pipeline cm
Physical
Attributes
Naming
Conventions
Appl. A - bal-on-hand
Appl. B - current_balance
Appl. C - balance
Data
Consistency
Transactional Storage
Integration
Appl. A - M, F
Appl. B - 1, 0
Appl. C - X, Y
Encoding
balance dec(13, 2)
balance
date (Julian)
Volatility of Data
Volatile
Insert
Non-Volatile
Change
Access
Delete
Insert
Load
Change
Access
Record-by-Record Data Manipulation
Transactional Storage
Historical Data
Sales ( Region , Year - Year 97 - 1st Qtr)
20
15
Sales ( in lakhs
10
)
East
West
North
5
0
January
February
March
Year97
Transactional Storage
What is DSS?
Need for DSS
Comparison of OLTP & DSS
Transition from Data Processing to Information
Processing
14
What is DSS?
Decision
DecisionSupport
SupportSystems
Systems(DSS)
(DSS)are
areinteractive
interactivecomputercomputer-
based
basedsystems
systemsintended
intendedtotohelp
helpdecision
decisionmakers
makersutilize
utilizedata
dataand
and
models
modelstotoidentify
identifyand
andsolve
solveproblems
problemsand
andmake
makedecisions.
decisions.
Data
DataWarehouse
Warehouseisisthe
thefoundation
foundationof
ofDSS
DSSprocess.
process.ItItisisaaStrategy
Strategyand
and
aaProcess
Processfor
forStaging
StagingCorporate
CorporateData.
Data.
OLTP Environment
DSS Environment
get data IN
large volumes of simple
transaction queries
continuous data changes
low processing time
mode of processing
transaction details
data inconsistency
mostly current data
17
OLTP Environment
DSS Environment
18
DW Implementation
Approaches
Top Down
Bottom-up
Combination of both
Choices depend on:
current infrastructure
resources
architecture
ROI
Implementation speed
Bottom Up Implementation
DW Implementation Approaches
Top Down
Bottom Up
DW Implementation Approaches
Top Down
Consistent data definition
and enforcement of business
rules across enterprise
High cost, lengthy process,
time consuming
Works well when there is
centralized IS department
responsible for all H/W and
resources
Bottom Up
Data redundancy and
inconsistency between
data marts may occur
Integration requires
great planning
Less cost of H/W and
other resources
Faster pay-back
DW Architectures
24
ExtractPush/Pull
Source 1
Source 2
Source 3
Source n
Sources
Metadata
Staging
Layer
Canned
Reports
Detail Data
Summaries
/
Aggregatio
ns
Ad-hoc
analysis
CubesConformed
Dimensions
Transformatio
n
Summarization
Aggregation
ODS
Data
Warehouse
Data Marts
Reportin
g Layer
25
Benefits of DWH
To formulate effective business, marketing
and sales strategies.
To precisely target promotional activity.
To discover and penetrate new markets.
To successfully compete in the marketplace
from a position of informed strength.
To build predictive rather than retrospective models.
Data Modeling
Data Modeling
WHAT IS A DATA MODEL?
Analysis
Logical Database Design
Deciding Database
Physical Database design
Schema Generation
Levels of modeling
Conceptual modeling
Describe data requirements from a business
point of view without technical details
Logical modeling
Refine conceptual models
Data structure oriented, platform independent
Physical modeling
Detailed specification of what is physically
implemented using specific technology
Conceptual Model
A conceptual model shows data through
business eyes.
All entities which have business meaning.
Important relationships
Few significant attributes in the entities.
Few identifiers or candidate keys.
Logical Model
Replaces many-to-many relationships with
associative entities.
Defines a full population of entity attributes.
May use non-physical entities for domains
and sub-types.
Establishes entity identifiers.
Has no specifics for any RDBMS or
configuration.
Physical Model
A Physical data model may include
Referential Integrity
Indexes
Views
Alternate keys and other constraints
Tablespaces and physical storage objects.
Modeling Techniques
Entity-Relationship Modeling
Traditional modeling technique
Technique of choice for OLTP
Suited for corporate data warehouse
Dimensional Modeling
Analyzing business measures in the specific business context
Helps visualize very abstract business questions
End users can easily understand and navigate the data structure
Examples: ER Model
37
Poor Performance
Tend to be very complex and difficult to
navigate.
Dimensional Modeling
39
Dimensional Modeling
Dimensional modeling uses three basic
concepts : measures, facts, dimensions.
Is powerful in representing the requirements
of the business user in the context of
database tables.
Focuses on numeric data, such as values
counts, weights, balances and occurences.
Dimensional modeling
Must identify
What is a Facts
A fact is a collection of related data items,
consisting of measures and context data.
Each fact typically represents a business item,
a business transaction, or an event that can be
used in analyzing the business or business
process.
Facts are measured, continuously valued,
rapidly changing information. Can be
calculated and/or derived.
Types of Facts
Additive
Dimensions
A dimension is a collection of members or
units of the same type of views.
Dimensions determine the contextual
background for the facts.
Dimensions represent the way business
people talk about the data resulting from a
business process, e.g., who, what, when,
where, why, how
Dimensional Hierarchy
Geography Dimension
World Level
America
Europe
Asia
Pa
re
nt
Continent Level
Re
la
tio
n
World
Country Level
USA
State Level
FL
City Level
Miami
Attributes: Population,
Tourists Place
Canada
GA
Tampa
VA
CA
Orlando
Argentina
WA
NaplesDimension
Member / Business
Entity
45
Dimensions Types
Conformed Dimension
junk Dimension
Dirty Dimension
Monster Dimension
Slowly Changing Dimension
Degenerated Dimension
46
DM - Types
ODS
49
ODS - Types
ODS
50
OLTP
ODS
Data Warehouse
Data redundancy
Somewhat
redundant with
operational
databases
Managed
redundancy
Data stability
Non-redundant
within system;
Unmanaged
redundancy among
systems
Dynamic
Data update
Field by field
Field by field
Controlled batch
Data usage
Highly structured,
repetitive
Somewhat
structured, some
analytical
Database size
Moderate
Moderate
Highly
unstructured,
heuristic or
analytical
Large to very large
Somewhat stable
Dynamic
Database
Stable
structure stability
Snowflake Schema
Single fact table surrounded by normalized dimension
tables
Normalizes dimension table to save data storage space.
When dimensions become very very large
Less intuitive, slower performance due to joins
Snowflake - Disadvantages
Normalization of dimension makes it
difficult for user to understand
Decreases the query performance because it
involves more joins
Dimension tables are normally smaller than
fact tables - space may not be a major issue
to warrant snowflaking
OLAP Cubes
57
OLAP Features
Subject oriented approach to Decision Support
Calculations applied across dimensions, through hierarchies
and/or across members
Trend analysis over sequential time periods, What If
scenarios.
Slicing/Dicing subsets for on-screen viewing
Drill-down/up along the hierarchy
Reach-through to underlying detail data
Rotation to new dimensional comparisons in the viewing
area
OLAP Cubes
58
OLAP Categories
Multi-dimensional OLAP (MOLAP)
Relational OLAP (ROLAP)
Hybrid OLAP (HOLAP)
OLAP Cubes
59
MOLAP
Use pre-calculated data set CUBE
Cube contains all possible answers to given range of
questions
Features:
Very fast response
Ability to quickly write data into the cube
Downsides:
Limited Scalability
Inability to contain detailed data
Load time
OLAP Cubes
60
ROLAP
Do not use pre-calculated CUBE
Intercept query & pose it to the Relational DB
Features:
Ask any question (not limited to the contents of the
cube)
Ability to drill down
Downsides:
Slow Response
Some limitations on scalability
OLAP Cubes
61
HOLAP
Combines MOLAP & ROLAP
Utilizes both pre-calculated cubes & relational data
sources
Features:
For summary type info cube, (Faster response)
Ability to drill down relational data sources (drill
through detail to underlying data)
Source of data transparent to end-user
OLAP Cubes
62
Data Acquisation
Data Extraction
Data Transformation
Data Loading
63