ADBMS Chapter No. 6

MCA Knowledge Base Systems & Data Mining
NOTES:
1. Subject Code : IT -34 Subject Name : Advanced Database management System
2. Learning Objectives of the Course ADBMS :
To know about different database handling techniques. To gain an awareness of the basic issues in object-oriented data models, learn about the Web-DBMS integration technology and XML for Internet database applications, and become familiar with data-warehousing and data-mining techniques and other advanced topics.
3. Unit Name : Knowledge Base Systems & Data Mining
4. Contents of the Unit
6.1 Data mining as a part of the Knowledge Discovery process, Introduction to machine learning & data mining
6.2 Association rules
6.3 Market-basket Model, support & confidence - Apriori Algorithm - Sampling Algorithm - Frequent-pattern Tree Algorithm - Partition Algorithm - Other types of Association rules
6.4 Classification: Decision tree induction, Bayesian classifiers
6.5 Clustering: k-means Algorithm
6.6 Approaches to other data mining problems: Discovery of sequential patterns, Discovery of patterns in time series, Regression, Neural Networks, Genetic Algorithms, Text mining, Data visualization
6.7 Applications of Data Mining
Learning Objectives of the Unit : to study different algorithms for performing the analysis of data.
5. Key Definitions, Key Words in the definitions

Data Mining : The process of discovering interesting, hidden & previously unknown patterns from vast amounts of data held in a store such as a data warehouse.
6. Key Concepts
Prof. Khandagale S P UNIT NO. 6

KBS, KDD, Apriori Algorithm, K-means Algorithm, Bayesian Algorithm, FPT Algorithm, Applications of Data Mining
7. Questions Asked in the University Exam
Q.1 Write Note On : Apriori Algorithm (Nov. 2009 4M, Nov. 2010 4M, Nov. 2012 10M)
Q.2 Explain K-means Algorithm in data mining. (Nov. 2009 4M, Apr. 2010 5M, Apr. 2011 10M)
Q.3 Write Short Note On : Machine Learning (Apr. 2010 5M, Apr. 2012 10M, Apr. 2013 4M)
Q.4 Write Short Note On : KBS (Nov. 2012 5M, Apr. 2011 5M)
Q.5 Explain Outlier Analysis in Data Mining (Nov. 2010 4M, Apr. 2013 4M)
Q.6 Explain Text Mining (Nov. 2010 6M, Apr. 2013 6M)
Q.7 Explain association rules for data mining with the help of an algorithm (Nov. 2012 10M)
Question For Practice :
Q.1 Write Short Note On :
1. KDD
2. Bayesian Classifier
3. Sampling Algorithm
Q.2 Explain Various Data mining Applications
Q.3 Explain the architecture of Data mining
8. Learning Resources :
Reference Book :
1. Data Mining: Concepts and Techniques, Jiawei Han & Micheline Kamber, Second Edition (Elsevier)
2. Database System Concepts, 6th Edition, Abraham Silberschatz, Henry Korth, S. Sudarshan (McGraw Hill International)
3. Database Systems: Design, Implementation and Management, Rob, Coronel, 4th Edition (Thomson Learning Press)
4. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, Second Edition (McGraw Hill International)
Reference Link :
http://www.oracle.com/technetwork/articles/sql/11g-dw-olap-100058.html
http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-3.html
Knowledge Base Systems & Data Mining

Data Mining :
The discovery of new information in terms of patterns or rules from vast amounts of data.
The process of finding interesting structure in data.
For some experts, data mining is the process of knowledge discovery.
The process of discovering interesting, hidden & previously unknown patterns from vast amounts of data held in a store such as a data warehouse.
Knowledge Discovery from Data

Another popular term for data mining is knowledge discovery from data (KDD).
Knowledge discovery in data is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Knowledge Discovery in Databases (KDD)

Data mining is actually one step of a larger process known as knowledge discovery in
databases (KDD).
The KDD process model comprises six phases:
Data selection
Data cleaning
Enrichment
Data transformation or encoding
Data mining
Reporting and displaying discovered knowledge
Goals of Data Mining and Knowledge Discovery (PICO)

Prediction:
Determine how certain attributes will behave in the future.
Identification:
Identify the existence of an item, event, or activity.
Classification:
Partition data into classes or categories.
Optimization:
Optimize the use of limited resources.
Knowledge Discovery
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the context
of large databases.
It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using
a database along with any required preprocessing, sub sampling, and transformations
of that database.
Data Mining as a Step in the KDD Process
Interacting with a user / expert in KDD

KDD is not a fully automatic method of analysis.
The user is an important element in the KDD process.
The user should decide on, e.g., the choice of task and algorithms, and the selections made in preprocessing.
The user also interprets and evaluates the patterns.

Architecture of a typical data mining system
[Figure: layered architecture, from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleansing, data integration and filtering; database and data warehouse. The knowledge base sits alongside the pattern evaluation and data mining engine layers.]
Information repositories : These include databases, data warehouses, or other repositories such as spreadsheets.
Database or data warehouse server : It fetches the relevant data as per the user's request.
Knowledge base : It consists of domain knowledge, which is used to guide the search and to evaluate the interestingness of patterns. The knowledge base holds past experience as well as user beliefs, on the basis of which certain conclusions can be drawn.
Data mining engine : It is the most important part of a data mining system. It consists of a set of functional modules such as characterization, association, correlation analysis, classification, prediction, cluster analysis, outlier analysis, etc.
Pattern evaluation module : It searches for interesting patterns, for which it communicates with the data mining modules.
User interface : It acts as the interface between the user & the data mining system, allowing the user to specify a data mining query, provide information to help focus the search, etc.
Knowledge Based System(KBS)
A KBS works as an artificial intelligence tool that provides intelligent decisions with appropriate justification.
KBS are systems based on the methods & techniques of artificial intelligence.
A knowledge base depends on the following concepts : hypotheses, rules, objects, attributes, relations, definitions, events, processes, facts, etc.
Their core components are : knowledge base, acquisition mechanisms, inference mechanism.
KBS architecture
The typical architecture of a KBS is often described as follows:
Diagnosis : Problems are identified from a number of symptoms or failures.
Interpretation : To provide an understanding of a situation from available information.
Prediction : To predict a future state from a set of data or observations.
Design : To develop a configuration that satisfies the constraints of a design problem.
Control : To collect & evaluate evidence & form opinions on that evidence.
Instruction : To train students & correct their performance.
Debugging : To identify & prescribe remedies for malfunctions.
Planning : Both short term & long term, in project management.
Monitoring : To check performance & flag exceptions.
Advantages of KBS :

Documentation of knowledge
Intelligent Decision Support
Self learning reasoning & explanation.
Increase availability of expert knowledge
Efficient & cost effective.
Consistency of answers
Explanation of solution
Deal with the uncertainty
Limitations of KBS :
Lack of common sense
Inflexible & difficult to modify
Restricted Domain of Expertise
Lack of learning ability
Not always reliable
Applications Of KBS
Retail: Market basket analysis, Customer relationship management (CRM)
Finance: Credit scoring, fraud detection
Manufacturing: Optimization, troubleshooting
Medicine: Medical diagnosis
Telecommunications: Quality of service optimization
Bioinformatics: Motifs, alignment
Web mining: Search engines
What is intelligence ?

Intelligence = Knowledge + ability to perceive, feel, comprehend, process, communicate, judge, learn.
Artificial Intelligence is the design, study and construction of computer programs that
behave intelligently. -- Tom Dean.
Examples of intelligent agents:
Knowledge-based systems: capture knowledge that people have which is relevant to a problem.
Common sense reasoning systems: capture knowledge that people commonly hold and which, for that reason, is not explicitly communicated.
Learning systems: possess the ability to expand their knowledge based on accumulated experience.
Natural language understanding systems: support dialog in English/French/Japanese/
Game playing systems.
Intelligent robots.
Speech and vision recognition systems.
What is Machine Learning?
Machine learning is the study of algorithms that improve their performance at some task with experience.
It optimizes a performance criterion using example data or past experience.
Role of statistics: inference from a sample.
Role of computer science: efficient algorithms to solve the optimization problem, and to represent and evaluate the model for inference.
Machine learning is a process which causes systems to improve with experience.
Compared with data mining, machine learning:
is more heuristic
is focused on improving the performance of a learning agent
also looks at real-time learning and robotics, areas that are not part of data mining
Types of Machine Learning

Some of the main types of machine learning are:
Supervised learning, in which the training data is labeled with the correct answers, e.g. spam or ham. The two most common types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).
Unsupervised learning, in which we are given a collection of unlabeled data, which we wish to analyze and discover patterns within. The two most important examples are dimension reduction and clustering.
Reinforcement learning, in which an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based on the outcomes of past actions.
There are many other types of machine learning as well, for example:
1. Semi-supervised learning, in which only a subset of the training data is labeled
2. Time-series forecasting, such as in financial markets
3. Anomaly detection, such as that used for fault detection in factories and in surveillance
4. Active learning, in which obtaining data is expensive, and so an algorithm must determine which training data to acquire
and many others.
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data

Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
Association Rules
Association rules are frequently used to generate rules from market-basket data.
A market basket corresponds to the set of items a consumer purchases during one visit to a supermarket.
The set of items purchased by customers is known as an itemset.
An association rule is of the form X => Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with xi and yj being distinct items for all i and all j.
For an association rule to be of interest, it must satisfy a minimum support and confidence.
Retail shops are often interested in associations between the different items that people buy.
The Market-Basket Model
A large set of items, e.g., things sold in a supermarket.
A large set of baskets, each of which is a small set of the items, e.g., the things one
customer buys on one day.
Market basket analysis is a mathematical modeling technique based on the assumption that a customer who buys a certain product is likely to buy another product or group of products along with it.
Support is the percentage of transactions that contain all of the items in the itemset.
Milk -> Screwdriver : low support
Milk -> Bread : high support
If the support is low, the rule may not be statistically significant.
Confidence : It is the probability that the items in the RHS will be purchased given that the items in the LHS are purchased by the customer.
Confidence and Support
Support:
The minimum percentage of instances in the database that contain all items
listed in a given association rule.
Support is the percentage of transactions that contain all of the items in the
itemset, LHS U RHS.
Confidence:
Given a rule of the form A=>B, rule confidence is the conditional probability that
B is true when A is known to be true.
Confidence can be computed as
support(LHS U RHS) / support(LHS)
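The two measures above can be checked on a handful of baskets. The following is a minimal sketch in Python; the basket contents and the function names support and confidence are illustrative assumptions, not part of the notes.

```python
# Hypothetical toy baskets; the item names are illustrative only.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "screwdriver"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(lhs, rhs, baskets):
    """support(LHS U RHS) / support(LHS) for the rule LHS => RHS."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)


milk_bread_support = support({"milk", "bread"}, baskets)
milk_bread_confidence = confidence({"milk"}, {"bread"}, baskets)
milk_screwdriver_support = support({"milk", "screwdriver"}, baskets)
```

In this toy data, Milk => Bread is supported by two of the four baskets, while Milk => Screwdriver is supported by only one, which mirrors the high-support and low-support examples given earlier.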
APriori Algorithm :
A two-pass approach called a-priori limits the need for main memory.
Key idea: monotonicity : if a set of items appears in at least s baskets, so does every subset.
Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
Pass 1: Read baskets and count in main memory the occurrences of each item.
Requires only memory proportional to the number of items.
Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to have occurred at least s times.
Requires memory proportional to the square of the number of frequent items only.
Apriori Algorithm
Lk: set of frequent itemsets of size k (with min support)
Ck: set of candidate itemsets of size k (potentially frequent itemsets)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;
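The level-wise loop above can be sketched in Python as follows. This is a minimal illustration, assuming that min_support is an absolute transaction count and that the whole database fits in memory; the function name apriori and the sample transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at least
    `min_support` transactions (an absolute count, assumed here)."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}

    frequent = dict(current)
    k = 1
    while current:  # loop until Lk is empty
        # Join step: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Count the surviving candidates with one scan of the transactions.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


example = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "screwdriver"},
]
result = apriori(example, min_support=2)
```

The prune step is where monotonicity is used: a candidate is dropped without being counted as soon as one of its subsets is known to be infrequent.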

The Sampling Algorithm
The sampling algorithm selects samples from the database of transactions that
individually fit into memory. Frequent itemsets are then formed for each sample.
If the frequent itemsets form a superset of the frequent itemsets for the entire
database, then the real frequent itemsets can be obtained by scanning the
remainder of the database.
In some rare cases, a second scan of the database is required to find all frequent
itemsets.
Frequent-Pattern Tree Algorithm

The Frequent-Pattern Tree Algorithm reduces the total number of candidate itemsets
by producing a compressed version of the database in terms of an FP-tree.
The FP-tree stores relevant information and allows for the efficient discovery of
frequent itemsets.
The algorithm consists of two steps:
Step 1 builds the FP-tree.
Step 2 uses the tree to find frequent itemsets.
Step 1: Building the FP-Tree
First, the frequent 1-itemsets, along with the count of transactions containing each item, are computed.
The frequent 1-itemsets are sorted in non-increasing order of their counts.
The root of the FP-tree is created with a null label.
For each transaction T in the database, place the frequent 1-itemsets in T in sorted order. Designate T as consisting of a head and the remaining items, the tail.
Insert itemset information recursively into the FP-tree as follows:
if the current node, N, of the FP-tree has a child with an item name = head, increment the count associated with that child by 1; else create a new child node with a count of 1, link it to its parent N and link it with the item header table. The child then becomes the current node.
if tail is nonempty, repeat the above step using only the tail, i.e., the old head is removed and the new head is the first item from the tail and the remaining items become the new tail.
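The construction steps above can be sketched in Python. This is an illustrative version under stated assumptions: the class name FPNode, the function name build_fp_tree, the absolute min_support count, the alphabetical tie-break between equally frequent items, and the sample transactions are all hypothetical. The head/tail recursion is written as a loop over the sorted items, which visits the same nodes.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and links."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Return (root, header); header maps each frequent item to its nodes."""
    # Count the transactions containing each item; keep frequent items only.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}

    root = FPNode(None, None)  # root carries a null label
    header = {item: [] for item in freq}
    for t in transactions:
        # Frequent items of T in non-increasing count order (ties by name).
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:  # each iteration handles the current head
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)  # link into the header table
            child.count += 1
            node = child
    return root, header


example_transactions = [
    ["a", "b"],
    ["b", "c", "d"],
    ["a", "c", "d", "e"],
    ["a", "d", "e"],
    ["a", "b", "c"],
]
root, header = build_fp_tree(example_transactions, min_support=2)
```

Because transactions that share a prefix of frequent items share a path in the tree, the sum of the counts of all nodes for an item equals that item's support in the database.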
Step 2: The FP-growth Algorithm For Finding Frequent Itemsets
Input: FP-tree and minimum support, mins
Output: frequent patterns (itemsets)
procedure FP-growth (tree, alpha);
Begin
    if tree contains a single path P then
        for each combination, beta, of the nodes in the path
            generate pattern (beta U alpha)
            with support = minimum support of nodes in beta
    else
        for each item, i, in the header of the tree do
        begin
            generate pattern beta = (i U alpha) with support = i.support;
            construct beta's conditional pattern base;
            construct beta's conditional FP-tree, beta_tree;
            if beta_tree is not empty then
                FP-growth(beta_tree, beta);
        end;
End;
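The recursion in the procedure above can be illustrated with a simplified Python sketch. It is not a full FP-growth implementation: as an assumption made for brevity, each conditional pattern base is kept as a plain list of (items, count) pairs instead of a linked FP-tree, and the single-path shortcut is omitted. The names pattern_growth and mine are hypothetical, and min_support is an absolute count.

```python
from collections import Counter


def pattern_growth(weighted, min_support, suffix=frozenset()):
    """weighted: list of (frozenset of items, count) pairs.
    Returns {itemset: support} for all frequent itemsets ending in suffix."""
    freq = Counter()
    for items, count in weighted:
        for item in items:
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_support}

    # Rank items by non-increasing support (ties broken by name).
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}

    patterns = {}
    for item in order:
        beta = suffix | {item}
        patterns[beta] = freq[item]
        # Conditional pattern base of beta: for each transaction containing
        # the item, keep only the frequent items ranked before it.
        base = []
        for items, count in weighted:
            if item in items:
                prefix = frozenset(i for i in items
                                   if i in rank and rank[i] < rank[item])
                if prefix:
                    base.append((prefix, count))
        patterns.update(pattern_growth(base, min_support, beta))
    return patterns


def mine(transactions, min_support):
    """Entry point: every transaction starts with a count of 1."""
    return pattern_growth([(frozenset(t), 1) for t in transactions],
                          min_support)


fp_example = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "screwdriver"},
]
fp_result = mine(fp_example, min_support=2)
```

No candidate itemsets are generated and tested against the database here; each pattern is grown from a suffix using only the transactions that already contain it, which is the main point of the FP-growth approach.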
The Partition Algorithm

Divide the database into non-overlapping subsets.
Treat each subset as a separate database where each subset fits entirely into main
memory.
Apply the Apriori algorithm to each partition.
Take the union of all frequent itemsets from each partition.
These itemsets form the global candidate frequent itemsets for the entire database.
Verify the global set of itemsets by having their actual support measured for the entire
database.
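The two phases above can be sketched in Python. This is a minimal illustration under stated assumptions: min_fraction is a relative support threshold, the database is a non-empty list, partitions are simple consecutive slices, and the helper local_frequent is a compact Apriori-style level-wise search standing in for a full Apriori implementation. All names are hypothetical.

```python
from itertools import combinations
from math import ceil


def local_frequent(part, min_fraction):
    """Level-wise search for itemsets frequent within one partition."""
    def is_frequent(itemset):
        return sum(itemset <= t for t in part) / len(part) >= min_fraction

    level = {frozenset([i]) for t in part for i in t}
    level = {s for s in level if is_frequent(s)}
    found = set()
    while level:
        found |= level
        size = len(next(iter(level))) + 1
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        level = {c for c in candidates if is_frequent(c)}
    return found


def partition_mine(transactions, min_fraction, n_parts):
    """Return {itemset: global count} using the two-phase partition idea."""
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]

    # Phase 1: the union of locally frequent itemsets is the candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_fraction)

    # Phase 2: one scan of the whole database measures actual support.
    result = {}
    for c in candidates:
        count = sum(c <= t for t in transactions)
        if count / len(transactions) >= min_fraction:
            result[c] = count
    return result


part_example = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "screwdriver"},
]
part_result = partition_mine(part_example, min_fraction=0.5, n_parts=2)
```

The approach is safe because an itemset that is frequent in the whole database must be frequent in at least one partition, so phase 1 cannot miss it; phase 2 only removes itemsets that were frequent locally but not globally.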
Complications with Association Rules
The cardinality of itemsets in most situations is extremely large.
Association rule mining is more difficult when transactions show variability in factors such as geographic location and seasons.
Item classifications exist along multiple dimensions.
Data quality is variable; data may be missing, erroneous, conflicting, as well as redundant.
Other Association Rules
Association Rules Among Hierarchies
Multidimensional Association
Negative Association
Classification
Classification is the process of learning a model that is able to describe different classes of data.
Learning is supervised, as the classes to be learned are predetermined.
Learning is accomplished by using a training set of pre-classified data.
The model produced is usually in the form of a decision tree or a set of rules.
Classification : A data mining technique
Decision trees & neural networks are examples of classification techniques.
Classification groups objects that share a certain kind of similarity into classes, much like clustering, except that the classes are known in advance.
Decision Tree
A decision tree is a flowchart-like tree structure.
Each internal node denotes a test on an attribute.
Each branch represents an outcome of the test.
A decision tree with two branches at each node is called a binary tree, & one with multiple branches is called a multiway tree.

Advantages:
The learning & classification steps are simple & fast.
They provide good accuracy.
Bayesian Classifiers
It is a probabilistic approach based on applying Bayes' theorem.
It is based on a strong independence assumption.
A Bayesian classifier is supported by a probability model.
Advantage : It works well in complex real-world situations.
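A classifier that combines Bayes' theorem with a strong independence assumption is usually called a naive Bayes classifier. The following is a minimal sketch for categorical attributes, assuming Laplace (add-one) smoothing, which is a common choice that the notes do not mention; the function names and the small weather-style data set are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """rows: tuples of categorical values; labels: the class of each row."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> values
    values = defaultdict(set)            # distinct values seen per attribute
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, label)][v] += 1
            values[j].add(v)
    return class_counts, value_counts, values


def predict(model, row):
    """Return the class with the largest P(class) * product of P(value | class)."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for j, v in enumerate(row):
            # Laplace-smoothed P(value | class); attributes are treated as
            # independent given the class (the strong assumption).
            score *= (value_counts[(j, label)][v] + 1) / (n + len(values[j]))
        if score > best_score:
            best, best_score = label, score
    return best


# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
rainy_cool = predict(model, ("rainy", "cool"))
sunny_hot = predict(model, ("sunny", "hot"))
```

The denominator P(row) of Bayes' theorem is the same for every class, so it can be left out when only the most probable class is needed.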
Outlier Analysis & Clustering

Outlier Analysis : When a data object's behavior does not match that of similar objects, the object is called an outlier.
Outliers are objects that are dissimilar to or inconsistent with the rest of the data.
The main causes of outliers are measurement errors, execution errors, or incorrect assumptions.
Applications :
Fraud Detection
Customized marketing
Medical Analysis
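As an illustration of outlier detection, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This particular statistical test is an assumption chosen for simplicity and is not described in the notes; the function name, the threshold of 2.0 and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; 95 is inconsistent with the rest.
amounts = [10, 12, 11, 13, 12, 11, 95]
flagged = zscore_outliers(amounts)
```

In a fraud detection setting, the flagged values would be candidates for closer inspection rather than automatic rejection.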
Clustering: Clustering is the process where the data objects similar to each other are
placed together in a cluster & dissimilar objects into other clusters.
Applications:
Data Mining
Statistics
Biology

Machine learning
Clustering Requirements :
Scalability
Ability to deal with Different types of attributes
Discovery of cluster with arbitrary shape
Ability to deal with noisy Data
Incremental clustering
High Dimensionality
Constraint based clustering
Minimal Requirement for domain knowledge to determine input parameter
K-means Clustering Algorithm

Algorithm: The k-means algorithm for partitioning, based on the mean value of the objects in the cluster.
Input: k : number of clusters, and
D : database containing n objects.
Output: A set of k clusters that minimizes the squared-error criterion.
1) Randomly choose k objects from D as the initial cluster centers (centroids);
2) Repeat
3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
4) Update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5) Until the centroids (center points) no longer change;
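The numbered steps above can be sketched in Python. This is a minimal illustration, assuming numeric points given as tuples, squared Euclidean distance as the similarity measure, and a fixed random seed for repeatability; the function name k_means, the max_iter safeguard and the sample points are hypothetical.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centers
    for _ in range(max_iter):          # step 2: repeat
        # Step 3: assign each object to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Step 4: recompute each mean (an empty cluster keeps its centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Two visibly separated groups of hypothetical 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
```

The result can depend on the initial choice of centers, so in practice the algorithm is often run several times with different seeds and the clustering with the smallest squared error is kept.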

Approaches to other data mining problems
Discovery of sequential patterns :
The technique finds sequences of objects or events, called sequential patterns. These help to predict strong dependencies among events.
The sequence may be, for example, the goods purchased by a customer or the medical treatments given to a patient.
It is often observed that the occurrence of a particular event depends on previous events.
Discovery of patterns in Time Series
The approach is based on the identification of similarities between time series of data.
Many transactions are performed at regular time intervals, such as a weekly report.
Applications :
Financial Market
Medical diagnosis
Market basket data analysis
Regression : It deals with the prediction of a value rather than a class.
If Y is the value to be predicted and x1, x2, ..., xn are the attributes, the regression function f takes the form :
Y = f(x1, x2, x3, ..., xn)
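The simplest case of such a function is a straight line with one input variable, Y = slope * x + intercept, fitted by least squares. The sketch below illustrates this; restricting f to a line is an assumption made for illustration, and the function name fit_line and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # predict a value for an unseen x
```

Unlike classification, the output here is a continuous value, which is the distinction drawn in the definition above.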
Neural Network : It is a technique that uses generalized regression & provides an iterative method to carry it out.
Types of Neural Network :
Supervised neural Network
Unsupervised neural Network
Characteristics of Neural Network
Self Adaptive
Classification Tasks

Highly quantitative Output
No unique internal representation
Time Series Data
Genetic Algorithm : Genetic algorithms are randomized search procedures that are adaptive & robust in nature.
Applications of GA
Image Analysis
Scheduling
Engineering Design
Text Mining
Text data is everywhere: books, news, articles, financial analysis, blogs, social networking, etc.
According to estimates, 80% of the world's data is in unstructured text format.
We need methods to extract, summarize, and analyze useful information from unstructured text data.
Text mining seeks to automatically discover useful knowledge from this massive amount of data.
Active research on text mining is going on in industry and academia.
Data Visualization / Visual Data Mining (VDM)
It is a recent approach for exploring very large datasets, which combines traditional mining methods and information visualization techniques.
It is required since the size of the data is very large, & if the data is not displayed in an organized manner, trends & patterns cannot be recognized properly.
It combines automated calculation with human perception to observe trends & patterns.

There are different methods :
Univariate
Bivariate
Multivariate
a) Icon based method
b) Pixel based method
c) Dynamic parallel coordinate system
Applications Of data Mining :

Data mining is widely used in diverse areas. There are a number of commercial data mining systems available today, yet there are many challenges in this field.
FINANCIAL DATA ANALYSIS
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Here are a few typical cases:
Design and construction of data warehouses for multidimensional data analysis and
data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
TELECOMMUNICATION INDUSTRY
Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important for helping to understand the business.

Data mining in the telecommunication industry helps to identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service. Here is a list of examples for which data mining improves telecommunication services:
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
Data Mining Applications in Health Care and Insurance
The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors and markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully.
Data Mining Applications in Transportation
Data mining helps determine the distribution schedules among warehouses and
outlets and analyze loading patterns.
Data Mining Applications in Medicine
Data mining makes it possible to characterize patient activities in order to anticipate incoming office visits.
Data mining helps identify the patterns of successful medical therapies for different illnesses.
Data mining applications are continuously developing in various industries to provide more hidden knowledge that increases business efficiency and grows businesses.
Identification of the most frequently stolen products : The products that are stolen most often are identified, so that security measures for them can be planned & thieves can be detected.

Data Mining for Retail Industry
Retail industry: huge amounts of data on sales, customer shopping history, etc.
Applications of retail data mining
Identify customer buying behaviors
Discover customer shopping patterns and trends
Improve the quality of customer service
Achieve better customer retention and satisfaction
Enhance goods consumption ratios
Design more effective goods transportation and distribution policies
