Data Warehouse Notes
Data Warehouse Notes
Data Warehouse Notes
Business Intelligence
Combination of technologies like
Analytical Data :
Analytical Data is used to make business decisions
OLAP vs OLTP
Why Datawarehouse?
1. Ensure consistency: Standardizing data from different sources also reduces the
risk of error in interpretation and improves overall accuracy.
2. Make better business decisions: Data warehousing improves the speed and
efficiency of accessing different data sets and makes it easier for corporate
decision-makers to derive insights that will guide the business and marketing
strategies that set them apart from their competitors.
3. Improve their bottom line : Data warehousing improves the speed and
efficiency of accessing different data sets and makes it easier
for corporate decision-makers to derive insights that will guide the business and
marketing strategies that set them apart from their competitors.
Data warehouse
● A decision support database that is maintained separately from the organization’s
operational databases
● A Data Warehouse is an enterprise-wise collection of
● Subject oriented
● Integrated
● Time variant
● Non-volatile
*Integrated - Data in the warehouse is obtained from multiple sources and kept in a
Consistent format.
*Time-Varying - Every data component in the date warehouse associates itself with some
Point of time like weekly,monthly,quarterly, yearly
*Non-volatile - Dw stores historical data. Data does not change once it gets into the
warehouse. Only load/refresh.
Use of DWH
● Ad-hoc analyses and reports
● Data mining: identification of trends
● Management Information Systems
Datamart
● Datamart is a subset of data warehouse and it is designed for a particular line of
business, such as sales, marketing, or finance.
● In a dependent data mart, data can be derived from an enterprise-wide data
warehouse.
● In an independent data mart, data can be collected directly from sources
● Datamart is the data warehouse you really use
● Why Datamart?
1. Datawarehouse projects are very expensive and time taking.
2. Success rate of DWH projects is very less
To avoid single point of loss we identify department wise needs
and build Datamart. If succeeded we go for other departments and integrate all
datamarts into a Datawarehouse.
● Advantages
◼ Improve data access performance
1. Top-down approach:
1. Since the data marts are created from the Datawarehouse, provides consistent
dimensional view of data marts.
2. Also, this model is considered as the strongest model for business changes.
That’s why, big organisations prefer to follow this approach.
1. The cost, time taken in designing and its maintenance is very high.
2. Bottom-up approach:
1. First, the data is extracted from external sources (same as happens in top-down
approach).
2. Then, the data go through the staging area (as explained above) and loaded into
data marts instead of datawarehouse. The data marts are created first and provide
reporting capability. It addresses a single business area.
This approach is given by Kinball as – data marts are created first and provides a
thin view for analyses and datawarehouse is created after complete data marts have been
created.
1. As the data marts are created first, so the reports are quickly generated.
2. We can accommodate a greater number of data marts here and in this way
datawarehouse can be extended.
3. Also, the cost and time taken in designing this model is low comparatively.
Disadvantage of Bottom-Up Approach –
ETL
ETL Process
Etl is a process that involves the following tasks:
● extracting data from source operational or archive systems which are the
primary source of data for the data warehouse
● transforming the data - which may involve cleaning, filtering, validating and
applying business rules
● loading the data into a data warehouse or any other database or application that
houses data
Transform
1. Denormalize data
2. Data cleaning.
3. Case conversion
4. Data trimming
5. String concatenation
6. datatype conversion
7. Decoding
8. calculation
9. Data correction.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data,
typically as part of the ETL process.
Types of Dimensions
1. Normal Dimension
2. Confirmed Dimension
3. Junk Dimension
4. Degenerated Dimension
5. Role Playing Dimension
Confirmed Dimension
Dimension table used by more than one fact table is called Confirmed Dimensions
(dimensions that are linked to multiple fact tables)
D1 D2 D1 D2 D5
D3 D4 D3
Adv:
1. To avoid unnecessary space
2. Reduce time
3. Drill across fact table
Junk Dimension
● It is an abstract dimension it will remove number of foreign keys from fact table.
● This is achieved by combining 2 or more dimensions into a single dimension.
Degenerated Dimension
Means a key value or dimension table which does not have descriptive attributes.
i.e.) a non foreign key and non numerical measure column used for grouping purpose
Measure Types
Type 1 SCD :
● Used if history is not required
● Overwriting the old values.
Type 2 SCD:
● If history and current value needed
● Creating another additional record.(new record with new changes and new
surrogate key)
● Mostly preferred in dimensional modeling
Product
Product Effective Product Product Expiry
Year
ID(PK) DateTime(PK) Name Price DateTime
01-01-2004 12-31-2004
1 2004 Product1 $150
12.00AM 11.59PM
01-01-2005
1 2005 Product1 $250
12.00AM
Type 3 SCD:
● Used if changes are very less
● Previous one level of history available
● Creating new fields.
Product Price in 2005
Current Product Current Old Product
Product ID(PK) Old Year
Year Name Product Price Price
1 2005 Product1 $250 $150 2004
Surrogate keys
● Surrogate keys are always numeric and unique on a table level which makes it
easy to distinguish and track values changed over time.
● Surrogate keys are integers that are assigned sequentially as needed to populate a
dimension.
● Surrogate keys merely serve to join dimensional tables to the fact table.
● Surrogate keys are beneficial as the following reasons:
1. Faster retrieval of data (since alphanumerical retrieval is costlier than
numerical data)
2. Maintaining index is easier with numeric key.
3. Maintain all slowly changing dimension.
Star Schema
1. It is the simplest form of data warehouse schema that contains one or more
dimensions and fact tables
2. It is called a star schema because the entity-relationship diagram between
dimensions and fact tables resembles a star where one fact table is connected to
multiple dimensions
3. The center of the star schema consists of a large fact table and it points towards
the dimension tables
4. Fact Table = Highly Normalized
Dimension Table = Highly denormalized.
Advantages:
● Star schema is easy to define.
● It reduces the number of physical joins.
● Provides very simple metadata.
Drawbacks:
● Summary data in Fact tables (such as Sales amount by region, or district-wise, or
year-wise) yields poor performance for summary levels and huge dimension tables.
Advantages:
● Snowflake schema provides best performance when queries involve aggregation.
Disadvantages:
● Maintenance is complicated.
● Increase in the number of tables.
More joins will be needed
Star Schema vs Snowflake Schema
What Is Metadata?
Metadata is information about other data that is contained inside a collection of data.
Metadata summarizes essential facts about data, making it easy to search and deal with
specific instances of data. Metadata can be generated both manually and automatically.
FAQ
Hierarchy
1. Hierarchies are logical structures that use ordered levels as a means of organizing
data.
2. A hierarchy can be used to define data aggregation.
Example
● country>city>state>zip
● in a time dimension, a hierarchy might be used to aggregate data from the Month
level to the Quarter level, from the Quarter level to the Year level.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that
represents data at the Month, Quarter, and Year levels.
Degenerated Dimension
Operational control numbers such as invoice numbers, order numbers and bill of lading
numbers looks like dimension key in a fact table but do not join to any actual dimension
table. They give rise to empty dimension hence we refer them as Degenerated
Dimension(DD).
ODS stands for the operational data store, and it stores the most
Recent data transfers from various operational sources and activities over data and
monitoring.
A central archive that provides a snapshot of the most recent data from multiple
transactional processes for operational monitoring is known as an operational data store
(ODS).
It allows businesses to consolidate data in its original format from several sources into a
single destination for market reporting.
What Are the Different Kinds of Dimensional Modeling?
Dimensional Modeling is classified into three groups, which are as follows:
1. Conceptual Modeling
2. Logical Modeling
3. Physical Modeling
4.
In Data Warehousing, What Is the Concept of a Cube?
Cubes are used to describe multidimensional data logically. The dimension members are
located on the cube’s edge, and the data values are located on the cube’s body.