
1.3 Tasks of Data Mining

Data mining involves six common classes of tasks (a brief code sketch follows the list):
Anomaly detection (Outlier/change/deviation detection) – the identification of unusual data records that might be interesting, or of data errors that require further investigation.
Association rule learning (Dependency modelling) – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including visualization and report generation.
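
To make these tasks concrete, here is a minimal illustrative sketch using scikit-learn; the toy data, model choices, and parameters are assumptions for demonstration, not part of the original notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy feature matrix

# Clustering: discover "similar" groups without using known labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Anomaly detection: flag unusual records for further investigation (-1 = outlier).
flags = IsolationForest(random_state=0).fit_predict(X)

# Regression: fit a function that models the data with least (squared) error.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y)

print(clusters[:5], flags[:5], reg.coef_)
```
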
1.9.1 Data Warehouse Design Process:
A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.
The warehouse design process consists of the following steps (a star-schema sketch follows the list):
Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
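
Below is a hypothetical sqlite3 sketch (not from the original notes) of a star schema reflecting these four choices for a sales process; all table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions chosen in step 3 (hypothetical names).
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_id INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);

-- Fact table at the chosen grain (one row per sales transaction line),
-- with the step-4 measures as numeric additive columns.
CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time(time_id),
    item_id      INTEGER REFERENCES dim_item(item_id),
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")
```
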
1.10 OLAP (Online Analytical Processing):
OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.
OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing, and data mining.
OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.
OLAP consists of three basic analytical operations (illustrated in the sketch below):
Consolidation (Roll-Up)
Drill-Down
Slicing and Dicing
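
As a rough illustration of roll-up and drill-down, here is a small pandas sketch; the sales table and its columns are invented for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "city":    ["Burla", "Cuttack", "Burla", "Burla"],
    "amount":  [100, 150, 120, 130],
})

# Roll-up (consolidation): aggregate upward along a dimension hierarchy.
by_year = sales.groupby("year")["amount"].sum()

# Drill-down: move to finer-grained data along the same hierarchy.
by_quarter = sales.groupby(["year", "quarter"])["amount"].sum()

print(by_year, by_quarter, sep="\n")
```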

2. Multidimensional OLAP (MOLAP):


MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
MOLAP stores data in an optimized multi-dimensional array storage, rather than
in a relational database. It therefore requires the pre-computation and storage of
information in the cube, an operation known as processing.
MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
The data cube contains all the possible answers to a given range of questions.
MOLAP tools have a very fast response time and the ability to quickly write back
data into the data set.
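
The following toy sketch (an assumption, not the notes' implementation) shows the idea behind MOLAP-style processing: every aggregate of a tiny cube is pre-computed up front, so later queries become simple lookups.

```python
from itertools import combinations
from collections import defaultdict

# Facts: (city, item, quarter) -> units sold; dimension names are invented.
facts = {("Burla", "pen", "Q1"): 10, ("Burla", "ink", "Q1"): 4,
         ("Cuttack", "pen", "Q2"): 7}
dims = ("city", "item", "quarter")

cube = defaultdict(int)
for key, units in facts.items():
    # Aggregate over every subset of dimensions ('*' marks a rolled-up dimension).
    for k in range(len(dims) + 1):
        for kept in combinations(range(len(dims)), k):
            cell = tuple(key[i] if i in kept else "*" for i in range(len(dims)))
            cube[cell] += units

print(cube[("Burla", "*", "*")])   # all sales in Burla -> 14
print(cube[("*", "*", "*")])       # grand total -> 21
```
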

1.11.3 Data Transformation:


In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following (a short sketch of two of these follows the list):
Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
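
As a brief illustration of two of these transformations, here is a pandas sketch of min-max normalization into [0.0, 1.0] and of aggregating daily sales into monthly totals; the data and column names are invented.

```python
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2024-01-01", "2024-01-15", "2024-02-03"]),
    "sales": [200.0, 350.0, 125.0],
})

# Normalization: scale the attribute into the range [0.0, 1.0].
s = daily["sales"]
daily["sales_norm"] = (s - s.min()) / (s.max() - s.min())

# Aggregation: roll daily figures up to monthly totals.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()

print(daily, monthly, sep="\n")
```
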
2.2 Market basket analysis:
This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
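
The milk-and-bread example can be made concrete with a small plain-Python sketch that computes the support and confidence of the rule {milk} -> {bread}; the baskets below are invented.

```python
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "jam"},
]

n = len(baskets)
milk       = sum("milk" in b for b in baskets)
milk_bread = sum({"milk", "bread"} <= b for b in baskets)

support    = milk_bread / n     # fraction of all baskets containing both items
confidence = milk_bread / milk  # of the milk buyers, how many also buy bread

print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.50, 0.67
```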

3.1.2 Comparing Classification and Prediction Methods:


Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data (i.e., tuples without class label
information).
The accuracy of a predictor refers to how well a given predictor can guess the value of
the predicted attribute for new or previously unseen data.
Speed:
This refers to the computational costs involved in generating and using the
given classifier or predictor.
Robustness:
This is the ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values.
Scalability:
This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.
Interpretability:
This refers to the level of understanding and insight that is provided by the classifier or predictor.
Interpretability is subjective and therefore more difficult to assess.
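
As an illustration of the accuracy criterion above, here is a minimal scikit-learn sketch that measures a classifier's accuracy on previously unseen tuples; the dataset and model choice are assumptions for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Hold out 30% of the tuples so accuracy is measured on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # build the classifier
acc = accuracy_score(y_te, clf.predict(X_te))                 # accuracy on held-out tuples
print(f"accuracy = {acc:.2f}")
```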

1.6 Classification of Data Mining Systems:
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
DEPT OF CSE & IT
VSSUT, Burla
Some Other Classification Criteria:
Classification according to kind of databases mined
Classification according to kind of knowledge mined
Classification according to kinds of techniques utilized
Classification according to applications adapted
Classification according to kind of databases mined
We can classify the data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
Classification according to kind of knowledge mined
We can classify the data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Classification according to kinds of techniques utilized
We can classify the data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.
Classification according to applications adapted
We can classify the data mining system according to the applications adapted. These applications are as follows:
Finance
Telecommunications
DNA
Stock Markets
E-mail
1.7 Major Issues in Data Mining:
Mining different kinds of knowledge in databases. - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction. - The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.
Incorporation of background knowledge. - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining. - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once patterns are discovered, they need to be expressed in high-level languages or visual representations that are easily understandable by users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of discovered patterns: patterns may be uninteresting because they represent common knowledge or lack novelty.
Efficiency and scalability of data mining algorithms. - In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions that are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mined knowledge as the database changes, without mining the data again from scratch.
What is Data Warehousing?
Data warehousing is defined as a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components which aids the strategic use of data.

It is the electronic storage of a large amount of information by a business, designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.


The decision support database (Data Warehouse) is maintained separately from the
organization's operational database. However, the data warehouse is not a product but
an environment. It is an architectural construct of an information system which provides
users with current and historical decision support information which is difficult to access
or present in the traditional operational data store.

The data warehouse is the core of the BI system which is built for data analysis and
reporting.

You may know that a 3NF-designed database for an inventory system may have many tables related to each other. For example, a report on current inventory information can involve more than 12 join conditions. This can quickly slow down the response time of the query and report. A data warehouse provides a new design which can help to reduce the response time and enhance the performance of queries for reports and analytics.

A data warehouse system is also known by the following names:


 Decision Support System (DSS)
 Executive Information System
 Management Information System
 Business Intelligence Solution
 Analytic Application
 Data Warehouse

History of the Data Warehouse
A data warehouse helps users to understand and enhance their organization's performance. The need to warehouse data evolved as computer systems became more complex and needed to handle increasing amounts of information. However, data warehousing is not a new thing.

Here are some key events in the evolution of the data warehouse:

 1960 - Dartmouth and General Mills, in a joint research project, develop the terms dimensions and facts.

 1970 - ACNielsen and IRI introduce dimensional data marts for retail sales.
 1983 - Teradata Corporation introduces a database management system specifically designed for decision support.

 Data warehousing started in the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the Business Data Warehouse.

 However, the real concept was given by Bill Inmon, who is considered the father of the data warehouse. He has written about a variety of topics on building, usage, and maintenance of the warehouse and the Corporate Information Factory.

How Does a Data Warehouse Work?


A Data Warehouse works as a central repository where information arrives from one or
more data sources. Data flows into a data warehouse from the transactional system and
other relational databases.

Data may be:

1. Structured
2. Semi-structured
3. Unstructured

The data is processed, transformed, and ingested so that users can access the
processed data in the Data Warehouse through Business Intelligence tools, SQL
clients, and spreadsheets. A data warehouse merges information coming from different
sources into one comprehensive database.

By merging all of this information in one place, an organization can analyze its
customers more holistically. This helps to ensure that it has considered all the
information available. Data warehousing makes data mining possible. Data mining is
looking for patterns in the data that may lead to higher sales and profits.
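
A toy sketch of this flow (extract from several sources, transform to one shape, load into a comprehensive central store) is shown below; the source formats, schema, and names are all invented for illustration.

```python
import sqlite3

source_a = [{"cust": "A1", "spend": "120.50"}]     # e.g., a transactional system
source_b = [{"customer_id": "B7", "total": 80.0}]  # e.g., another relational database

def transform(rec):
    # Unify the two source schemas into one comprehensive record format.
    cid = rec.get("cust") or rec.get("customer_id")
    spend = float(rec.get("spend") or rec.get("total"))
    return (cid, spend)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_spend (customer_id TEXT, spend REAL)")

# Ingest the processed records into the central repository.
rows = [transform(r) for r in source_a + source_b]
warehouse.executemany("INSERT INTO customer_spend VALUES (?, ?)", rows)

print(warehouse.execute("SELECT SUM(spend) FROM customer_spend").fetchone())
```
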

Types of Data Warehouse


Three main types of Data Warehouses are:

1. Enterprise Data Warehouse:

An Enterprise Data Warehouse is a centralized warehouse. It provides decision support service across the enterprise. It offers a unified approach for organizing and representing data. It also provides the ability to classify data according to the subject and gives access according to those divisions.

2. Operational Data Store:


An Operational Data Store, also called an ODS, is a data store required when neither the data warehouse nor the OLTP systems support an organization's reporting needs. In an ODS, the data is refreshed in real time. Hence, it is widely preferred for routine activities like storing records of the employees.

3. Data Mart:

A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from sources.

1.9.3 Data Warehouse Models:


There are three data warehouse models.
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems
or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
superservers, or parallel architecture platforms. It requires extensive business modeling
and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years. However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.
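
As a minimal sketch of this idea (all names are assumed for illustration), the "warehouse" below is just a summary view defined directly over an operational table via sqlite3.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'east', 5.0), (3, 'west', 7.5);

-- The virtual warehouse is a view over operational data; only views that are
-- queried often would be worth materializing.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

print(db.execute("SELECT * FROM sales_by_region").fetchall())
```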
