Unit1 (DW&DM)
Data Warehousing :
Data warehouse access tools are categorized into five groups: 1. Data reporting and query tools, 2. Application development tools, 3. EIS tools, 4. OLAP tools, 5. Data mining tools.
Single-tier architecture
The objective of a single-tier architecture is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-tier architecture physically separates the available data sources from the data warehouse. This architecture is not expandable and does not support a large number of end-users. It also suffers from connectivity problems caused by network limitations.
Three-tier architecture
This is the most widely used architecture. It consists of a Bottom, Middle and Top tier. The Bottom tier is the warehouse database server, the Middle tier is an OLAP server, and the Top tier is the front-end client layer: the tools and APIs through which you connect and get data out of the data warehouse. These can be query tools, reporting tools, managed query tools, analysis tools and data mining tools.
The data warehouse is based on an RDBMS server which is a central
information repository that is surrounded by some key components to make
the entire environment functional, manageable and accessible.
Extraction :
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in various formats such as relational databases, NoSQL stores, XML and flat files, into the staging area. It is important to extract the data from the source systems into the staging area first, rather than directly into the data warehouse, because the extracted data comes in various formats and can also be corrupted. Loading it directly into the data warehouse could damage it, and a rollback would be much more difficult. This makes extraction one of the most important steps of the ETL process.
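As an illustration of the extraction step, here is a minimal Python sketch. The source formats (a CSV flat file and a JSON export), field names and staging structure are all hypothetical; the point is that rows from heterogeneous sources land in one staging area untransformed:

```python
import csv
import io
import json

def extract_to_staging(csv_text, json_text):
    """Pull rows from heterogeneous sources (CSV, JSON) into one
    staging list, without transforming them yet."""
    staging = []

    # Source 1: a flat file in CSV format
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["_source"] = "csv"   # tag each row with its origin
        staging.append(row)

    # Source 2: a JSON export (e.g. from a NoSQL store)
    for row in json.loads(json_text):
        row["_source"] = "json"
        staging.append(row)

    return staging

csv_data = "id,name\n1,Alice\n2,Bob\n"
json_data = '[{"id": "3", "name": "Carol"}]'
staged = extract_to_staging(csv_data, json_data)
print(len(staged))  # 3 rows in the staging area
```

Note that the rows keep their source-specific shapes here; unifying them is deliberately left to the transformation step.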
Transformation :
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States and America to USA, etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
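The tasks above can be sketched in plain Python. The row layout and attribute names here are illustrative assumptions, not part of any particular warehouse schema:

```python
def transform(rows):
    """Apply the transformation tasks: filter, clean, split, join, sort."""
    country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    out = []
    for r in rows:
        # Filtering: load only the attributes the warehouse needs
        rec = {k: r.get(k) for k in ("id", "name", "city", "country", "age")}
        # Cleaning: fill NULL values with defaults, standardise spellings
        if rec["age"] is None:
            rec["age"] = 0
        rec["country"] = country_map.get(rec["country"], rec["country"])
        # Splitting: a single attribute becomes multiple attributes
        rec["first_name"], rec["last_name"] = rec.pop("name").split(" ", 1)
        # Joining: multiple attributes become one
        rec["location"] = f'{rec.pop("city")}, {rec["country"]}'
        out.append(rec)
    # Sorting: order tuples on the key attribute
    return sorted(out, key=lambda r: r["id"])

rows = [
    {"id": 2, "name": "Bob Jones", "city": "Austin",
     "country": "U.S.A", "age": None, "signup": "web"},
    {"id": 1, "name": "Alice Smith", "city": "Boston",
     "country": "United States", "age": 34},
]
clean = transform(rows)
print(clean[0]["country"], clean[0]["location"])  # USA Boston, USA
```

After this step every row has the same shape and vocabulary, which is exactly what "a single standard format" means.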
Loading :
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes data is loaded into the warehouse very frequently, and sometimes at longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
The ETL process can also use pipelining: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. Likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
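One simple way to realise this pipelining in Python is with generators, where each stage pulls rows from the previous one item by item. This is a minimal sketch, not a production ETL engine; the sample rows and the uppercase transform are arbitrary:

```python
def extract(source_rows):
    """Yield rows one at a time, as if read from a source system."""
    for row in source_rows:
        yield row

def transform(rows):
    """Transform each row as soon as it is extracted."""
    for row in rows:
        yield {**row, "name": row["name"].upper()}

def load(rows, warehouse):
    """Append transformed rows to the warehouse as they arrive."""
    for row in rows:
        warehouse.append(row)

warehouse = []
source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
# Because each stage is a generator, extraction, transformation and
# loading overlap: no stage waits for the previous one to finish all rows.
load(transform(extract(source)), warehouse)
print(warehouse[0]["name"])  # ALICE
```

In a real system the stages would typically run in separate processes connected by queues, but the overlap principle is the same.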
Data Mart
A data mart is focused on a single functional area of an organization and
contains a subset of data stored in a Data Warehouse.
A data mart usually draws data from only a few sources compared to a data warehouse, such as a central data warehouse or operational systems. Data marts are small in size and more flexible than a data warehouse.
Data mining is the process of looking for hidden, valid, and potentially useful patterns in huge data sets. Data mining is all about discovering unsuspected, previously unknown relationships amongst the data.
The insights derived via Data Mining can be used for marketing, fraud
detection, and scientific discovery, etc.
First, you need to understand the business and client objectives. You need to define what your client wants (which many times even they do not know themselves).
Take stock of the current data mining scenario. Factor resources, assumptions, constraints, and other significant factors into your assessment.
Using the business objectives and the current scenario, define your data mining goals.
A good data mining plan is very detailed and should be developed to
accomplish both business and data mining goals.
The data preparation process consumes about 90% of the time of the project.
Data cleaning is a process to "clean" the data by smoothing noisy data and
filling in missing values.
For example, in a customer demographics profile, the age data may be missing. The data is incomplete and should be filled. In some cases, there could be data outliers; for instance, an age with the value 300. Data could also be inconsistent; for instance, the name of the same customer is spelled differently in different tables.
The result of this process is a final data set that can be used in modeling.
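The cleaning steps just described can be sketched in Python. The customer profiles, the age cap of 120, and the median-fill strategy are illustrative assumptions; real projects choose fill values and outlier rules per attribute:

```python
def clean(profiles):
    """Fill missing ages, replace implausible outliers (e.g. age 300),
    and standardise inconsistent name spellings."""
    # Median of the plausible ages, used as the default fill value
    known = sorted(p["age"] for p in profiles
                   if p["age"] is not None and p["age"] <= 120)
    median_age = known[len(known) // 2]
    for p in profiles:
        if p["age"] is None:        # incomplete: fill the missing value
            p["age"] = median_age
        elif p["age"] > 120:        # outlier: smooth the noisy value
            p["age"] = median_age
        # inconsistent: same customer spelled differently across tables
        p["name"] = p["name"].strip().title()
    return profiles

profiles = [
    {"name": "alice smith", "age": 34},
    {"name": "BOB JONES", "age": None},
    {"name": " Carol Lee", "age": 300},
]
cleaned = clean(profiles)
print([p["age"] for p in cleaned])  # [34, 34, 34]
```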
Outliers are defined as data objects that do not comply with the general behaviour or model of the available data.
The term refers mainly to observations in a data set that do not match an expected pattern.
Anomalies are also known as outliers, novelties, noise, deviations, and exceptions, as these anomalies provide critical and actionable information.
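A common way to flag such observations is a z-score test: values far from the mean, measured in standard deviations, do not comply with the general behaviour of the data. The age data and the threshold of 2 are illustrative assumptions (many texts use 3 for larger samples):

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in
    standard deviations) exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

ages = [25, 31, 28, 45, 38, 29, 300]  # 300 does not match the pattern
print(find_outliers(ages))  # [300]
```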
6. Prediction Technique
Applications and their usage:
Service Providers – Service providers, such as mobile phone and utility companies, use data mining to predict when and why a customer leaves. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability score and offer incentives.
Bioinformatics – Data mining helps to mine biological data from massive data sets gathered in biology and medicine.