DW Unit-1 (1) XXXXXXXX
DW Unit-1 (1) XXXXXXXX
DW Unit-1 (1) XXXXXXXX
Overview
The term "Data Warehouse" was first coined by
Bill Inmon in 1990.
According to Inmon, a data warehouse is a
subject-oriented, integrated, time-variant, and
non-volatile collection of data.
This data helps analysts to take informed
decisions in an organization.
Data, Data everywhere yet ...
A data warehouse is a
• subject-oriented
• Integrated
• time-varying
• non-volatile collection of data that is used primarily in
organizational decision making.
The four keywords, subject-oriented, integrated, time-
variant, and nonvolatile, distinguish data warehouses
from other data repository systems, such as relational
database systems, transaction processing systems,
and file systems.
Subject-oriented
A data warehouse is organized around major subjects,
such as customer, supplier, product, and sales.
Rather than concentrating on the day-to-day
operations and transaction processing of an
organization, a data warehouse focuses on the
modeling and analysis of data for decision makers.
Hence, data warehouses typically provide a simple and
concise view around particular subject issues by
excluding data that are not useful in the decision
support process.
Integrated:
A data warehouse is usually constructed by integrating
multiple heterogeneous sources, such as relational
databases, flat files, and on-line transaction records.
Data cleaning and data integration techniques are
applied to ensure consistency in naming conventions,
encoding structures, attribute measures, and so on..
Time-variant:
Data are stored to provide information from a
historical perspective (e.g., the past 5–10 years). Every
key structure in the data warehouse contains, either
implicitly or explicitly, an element of time.
Nonvolatile:
A data warehouse is always a physically separate store
of data transformed from the application data found in
the operational environment.
Due to this separation, a data warehouse does not
require transaction processing, recovery, and
concurrency control mechanisms.
It usually requires only two operations in data
accessing: initial loading of data and access of data.
Differences between Operational Database Systems and
DataWarehouses
View:
An OLTP system focuses mainly on the current data within an
enterprise or department, without referring to historical data
or data in different organizations.
In contrast, an OLAP system often spans multiple versions of
a database schema, due to the evolutionary process of an
organization. OLAP systems also deal with information that
originates from different organizations, integrating information
from many data stores. Because of their huge volume, OLAP
data are stored on multiple storage media.
Access patterns:
The access patterns of an OLTP system consist mainly of
short, atomic transactions. Such a system requires
concurrency control and recovery mechanisms.
However, accesses to OLAP systems are mostly read-only
operations (because most data warehouses store historical
rather than up-to-date information), although many could be
complex queries.
OLAP Operations in the Multidimensional Data Model
Implementation steps
Requirements analysis and capacity planning:
The first step in data warehousing involves defining
enterprise needs, defining architecture, carrying out
capacity planning and selecting the hardware and
software tools.
This step will involve consulting senior management as
well as the various stakeholders.
Hardware integration:
Once the hardware and software have been selected, they
need to be put together by integrating the servers, the
storage devices and the client software tools
Modelling:
Modelling is a major step that involves designing the
warehouse schema and views. This may involve using a
modelling tool if the data warehouse is complex.
Physical modelling:
For the data warehouse to perform efficiently, physical
modelling is required. This involves designing the physical
data warehouse organization, data placement, data
partitioning, deciding on access methods and indexing.
Sources
The data for the data warehouse is likely to come from a
number of data sources. This step involves identifying and
connecting the sources using gateways, ODBC drives or other
wrappers.
ETL:
The data from the source systems will need to go through an
ETL process. The step of designing and implementing the ETL
process may involve identifying a suitable ETL tool vendor and
This may include customizing the tool to suit the needs of the
enterprise.
Populate the data warehouse:
Once the ETL tools have been agreed upon, testing the tools
will be required, perhaps using a staging area.
Once everything is working satisfactorily, the ETL tools may be
used in populating the warehouse given the schema and view
definitions.
User applications:
For the data warehouse to be useful there must be end-user
applications. This step involves designing and implementing
applications required by the end users.
Roll-out the warehouse and applications:
Once the data warehouse has been populated and the end-user
applications tested, the warehouse system and the applications
may be rolled out for the user community to use.
Implementation Guidelines
Build incrementally:
The data quality in the source systems is not always high and
often little effort is made to improve data quality in the
source systems. Improved data quality, when recognized by
Corporate strategy:
Business plan:
Adaptability:
Categories of Metadata
Data Warehouse Metadata
Metadata can be broadly categorized into three categories:
Business Metadata - It has the data ownership
information, business definition, and changing policies.
Technical Metadata - It includes database system names,
table and column names and sizes, data types and
allowed values. Technical metadata also includes
structural information such as primary and foreign key
attributes and indices.
Operational Metadata - It includes currency of data and
data lineage. Currency of data means whether the data is
active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
Data Warehouse Metadata
The Kimball technical system architecture separates the
data and processes comprising the DW/BI system into
the backroom extract, transformation and load (ETL)
environment and the front room presentation area, as
illustrated in the following diagram
Data Warehouse Metadata
Backroom ETL system
The front room is the public face of the DW/BI system; it’s
what business users see and work with day-to-day.
There’s a broad range of BI applications supported by BI
management services in the front room, including ad hoc
queries, standardized reports, dashboards and scorecards,
and more powerful analytic or mining/modeling applications.
Metadata
5) Client/server architecture
8) Multi-user support
.
Characteristics of OLAP
11) Flexible reporting
Fact constellation: