ADBMS Chapter No. 6
NOTES:
Data Mining : The process of discovering interesting, hidden & previously
unknown patterns in vast amounts of data held in a data store such as a data warehouse.
6. Key Concepts
Q.1 Write Note On : Apriori Algorithm (Nov. 2009 4M, Nov. 2010 4M, Nov. 2012 10M)
Q.2 Explain K-means Algorithm in data mining. (Nov. 2009 4M, Apr. 2010 5M, Apr. 2011 10M)
Q.3 Write Short Note On : Machine Learning (Apr. 2010 5M, Apr. 2012 10M, Apr. 2013 4M)
Q.4 Write Short Note On : KBS (Nov. 2012 5M, Apr. 2011 5M)
Q.5 Explain Outlier Analysis in Data Mining (Nov. 2010 4M, Apr. 2013 4M)
Q.6 Explain Text Mining (Nov. 2010 6M, Apr. 2013 6M)
Q.7 Explain association rules for data mining with help of an algorithm (Nov. 2012 10M)
1. KDD
2. Bayesian Classifier
3. Sampling Algorithm
8. Learning Resources :
Reference Book :
Reference Link :
http://www.oracle.com/technetwork/articles/sql/11g-dw-olap-100058.html
http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-3.html
Knowledge Discovery
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the context
of large databases.
It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using
a database along with any required preprocessing, subsampling, and transformations
of that database.
[Figure: architecture of a data mining system. Labels: Pattern evaluation, Knowledge base, Data cleansing, Data Integration, Filtering]
Database or Data warehouse server : It fetches the relevant information as per the
user's requirement.
Knowledge Base : The knowledge base consists of domain knowledge, which is used to guide
the search and to explore & evaluate the interestingness of patterns. It also holds past
experience as well as user beliefs, based on which certain conclusions can be drawn.
Data Mining Engine : It is the most important part of a data mining system. It consists of a set of
functional modules such as : Characterization
User Interface : It acts as the interface between the user & the data mining system, letting the user
specify a data mining query, provide information to help the search, etc.
KBS (knowledge-based systems) are systems based on the methods & techniques of Artificial Intelligence.
KBS architecture
Control : To collect & evaluate evidence & form opinions on that evidence.
Advantages of KBS :
Documentation of knowledge
Consistency of answers
Explanation of solution
Limitations of KBS :
Applications Of KBS
What is intelligence ?
Artificial Intelligence is the design, study and construction of computer programs that
behave intelligently. -- Tom Dean.
Knowledge-based systems: capture the knowledge people have that is relevant to
a problem.
Common sense reasoning systems: capture knowledge that people commonly hold
and that is therefore not explicitly communicated.
Learning systems: possess the ability to expand their knowledge based on
accumulated experience.
Intelligent robots.
Machine Learning
A program learns if it improves its performance at some task
with experience.
Machine learning, compared with data mining:
is more heuristic
also looks at real-time learning and robotics, areas that are not part of data mining
Supervised Learning, in which the training data is labeled with the correct answers, e.g.
spam or ham. The two most common types of supervised learning are classification
(where the outputs are discrete labels, as in spam filtering) and regression (where the
outputs are real-valued).
There are many other types of machine learning as well, for example:
Association Rules
Association rules are frequently used to generate rules from market-basket data.
An association rule is of the form X=>Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym}
are sets of items, with xi and yj being distinct items for all i and all j.
Retail shops are often interested in associations between the different items that people buy.
Market-basket data is a large set of baskets, each of which is a small set of items, e.g., the things one
customer buys on one day.
Support is the percentage of transactions that contain all of the items in the itemset.
If the support value is low, the rule may not be statistically significant.
Confidence : It is the probability that the items in the RHS will be purchased given that the
items in the LHS are purchased by a customer.
Support:
The minimum percentage of instances in the database that contain all items
listed in a given association rule.
Support is the percentage of transactions that contain all of the items in the
itemset, LHS U RHS.
Confidence:
Given a rule of the form A=>B, rule confidence is the conditional probability that
B is true when A is known to be true.
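The two measures above can be computed directly from a list of baskets. The following is a minimal Python sketch, not a reference implementation; the function names and the sample baskets are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs U rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical market-basket data, one set per customer visit.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "juice"},
]

# Rule milk => bread: both items appear together in 2 of the 4 baskets,
# and in 2 of the 3 baskets that contain milk.
rule_support = support(baskets, {"milk", "bread"})
rule_confidence = confidence(baskets, {"milk"}, {"bread"})
```

Here the rule milk => bread has support 2/4 and confidence 2/3, since milk appears in three baskets and bread accompanies it in two of them.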
Apriori Algorithm :
main memory.
Apriori Algorithm
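Ck: Set of candidate itemsets of size k
Lk: Set of frequent itemsets of size k (with min support)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction, count the candidates in Ck+1 that it contains;
    Lk+1 = candidates in Ck+1 with min support;
return ∪k Lk;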
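The level-wise search of Apriori can be sketched in Python as follows. This is an illustrative version under simple assumptions (all transactions fit in memory, support is given as a fraction); the function name apriori and the sample baskets are hypothetical, and the join step is written for clarity rather than speed.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # One pass over the data: keep candidates that reach min_support.
        result = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                result[c] = s
        return result

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([item]) for item in items})  # L1
    all_frequent = dict(level)
    k = 1
    while level:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = frequent(candidates)  # L(k+1)
        all_frequent.update(level)
        k += 1
    return all_frequent


# Hypothetical market-basket data.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "juice"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
```

With a minimum support of 0.5, the candidate {milk, bread, butter} is pruned without counting, because its subset {milk, butter} is not frequent.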
The sampling algorithm selects samples from the database of transactions that
individually fit into memory. Frequent itemsets are then formed for each sample.
If the frequent itemsets form a superset of the frequent itemsets for the entire
database, then the real frequent itemsets can be obtained by scanning the
remainder of the database.
In some rare cases, a second scan of the database is required to find all frequent
itemsets.
The Frequent-Pattern Tree Algorithm reduces the total number of candidate itemsets
by producing a compressed version of the database in terms of an FP-tree.
The FP-tree stores relevant information and allows for the efficient discovery of
frequent itemsets.
First, frequent 1-itemsets along with the count of transactions containing each item
are computed.
For each transaction T in the database, place the frequent 1-itemsets in T in sorted
order. Designate T as consisting of a head and the remaining items, the tail.
if the current node, N, of the FP-tree has a child with an item name = head,
increment the count associated with that child by 1; else create a new child node with a
count of 1, link it to its parent N and link it with the item header table.
if tail is nonempty, repeat the above step using only the tail, i.e., the old head is
removed and the new head is the first item from the tail and the remaining
items become the new tail.
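The construction steps above can be sketched in Python. This is a minimal illustration, assuming items are sorted by descending count with ties broken by item name; the class FPNode, the function build_fp_tree, and the sample baskets are hypothetical names, and the head/tail recursion is written as a loop.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Build an FP-tree; returns (root, header) where header maps item -> nodes."""
    # Pass 1: count the transactions containing each item; keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = {item: [] for item in frequent}

    # Pass 2: insert each transaction's frequent items in a fixed order.
    for t in transactions:
        # Descending count, ties broken by item name (an assumption).
        path = sorted((i for i in set(t) if i in frequent),
                      key=lambda i: (-counts[i], i))
        node = root
        for head in path:  # each step consumes the head; the rest is the tail
            child = node.children.get(head)
            if child is None:
                child = FPNode(head, node)
                node.children[head] = child
                header[head].append(child)  # link into the item header table
            child.count += 1
            node = child
    return root, header


# Hypothetical market-basket data.
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "juice"],
]
root, header = build_fp_tree(baskets, min_count=2)
```

Transactions that share a prefix share a path in the tree, which is where the compression comes from: three of the four baskets pass through the same bread node.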
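procedure FP-growth(tree, alpha);
Begin
  if tree contains a single path P then
    for each combination, beta, of the nodes in P
      generate pattern (beta U alpha) with support = minimum support of nodes in beta
  else for each item, i, in the header of the tree do
    begin
      generate pattern beta = (i U alpha) with support = i.support;
      construct beta's conditional pattern base and conditional FP-tree, beta_tree;
      if beta_tree is not empty then FP-growth(beta_tree, beta);
    end;
End;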
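Divide the database into non-overlapping subsets.
Treat each subset as a separate database where each subset fits entirely into main
memory, and find the frequent itemsets of each subset.
These itemsets form the global candidate frequent itemsets for the entire database,
since an itemset that is frequent overall must be frequent in at least one subset.
Verify the global set of itemsets by having their actual support measured for the entire
database.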
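The partition approach can be sketched in two phases. This is only an illustration: the local miner below is a brute-force enumeration that is practical for tiny partitions only, and the names local_frequent, partition_mine, and the sample baskets are hypothetical.

```python
from itertools import combinations


def local_frequent(transactions, min_support):
    """Brute-force frequent itemsets of one in-memory partition."""
    items = sorted({i for t in transactions for i in t})
    found = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            c = frozenset(combo)
            s = sum(1 for t in transactions if c <= t) / len(transactions)
            if s >= min_support:
                found.add(c)
    return found


def partition_mine(transactions, num_partitions, min_support):
    transactions = [frozenset(t) for t in transactions]
    size = -(-len(transactions) // num_partitions)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase 1: the union of local frequent itemsets is the global candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_support)

    # Phase 2: one scan of the whole database to measure actual support.
    n = len(transactions)
    result = {}
    for c in candidates:
        s = sum(1 for t in transactions if c <= t) / n
        if s >= min_support:
            result[c] = s
    return result


# Hypothetical market-basket data, split into two partitions of two baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "juice"},
]
partition_result = partition_mine(baskets, num_partitions=2, min_support=0.5)
```

Local candidates such as {juice} are frequent inside one partition but fail the final verification scan, which is why the second phase is needed.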
Association Rules
Association rule mining is more difficult when transactions show variability in factors
such as geographic location and seasons.
Multidimensional Association
Negative Association
Classification
The model produced is usually in the form of a decision tree or a set of rules.
Classification segments objects that share certain similarities into classes,
much like clustering.
Decision Tree
A decision tree in which each node has two branches is called a Binary tree & one with more than
two branches per node is called a Multiway tree.
Advantages:
Bayesian Classifiers
It is a probabilistic approach based on applying Bayes' theorem.
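As a small illustration of the theorem behind such classifiers, the sketch below computes a posterior probability for a two-class problem. The function name posterior and the spam-filter numbers are made-up assumptions, not values from these notes.

```python
def posterior(prior, likelihood_pos, likelihood_neg):
    """P(class | evidence) for a two-class problem, by Bayes' theorem.

    prior          : P(class)
    likelihood_pos : P(evidence | class)
    likelihood_neg : P(evidence | not class)
    """
    evidence = likelihood_pos * prior + likelihood_neg * (1 - prior)
    return likelihood_pos * prior / evidence


# Hypothetical numbers: 20% of mail is spam; a given word appears in
# 50% of spam messages and in 5% of non-spam messages.
p_spam_given_word = posterior(prior=0.2, likelihood_pos=0.5, likelihood_neg=0.05)
```

A classifier built this way assigns the object to the class with the highest posterior probability.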
Outliers are those objects which are dissimilar to or inconsistent with the rest of the objects.
The main causes of outliers are measurement errors, execution errors, or incorrect assumptions.
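One simple statistical way to flag such objects is to mark values that lie far from the mean in units of standard deviation. The notes do not prescribe a method, so the z-score test, the threshold of 2.0, the function name zscore_outliers, and the sample readings below are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold * std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical readings with one suspicious measurement.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = zscore_outliers(readings)
```

Because a large outlier inflates the mean and standard deviation it is measured against, this test can miss outliers in small samples; median-based variants are more robust.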
Applications :
Fraud Detection
Customized marketing
Medical Analysis
Clustering: Clustering is the process in which data objects similar to each other are
placed together in a cluster & dissimilar objects are placed in other clusters.
Applications:
Data Mining
Statistics
Biology
Machine learning
Clustering Requirements :
Scalability
Incremental clustering
High Dimensionality
2) Repeat
3) (re)assign each object to the cluster to which the object is the most similar, based on the
mean value of the objects in the cluster;
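The repeat/reassign loop of k-means can be sketched as follows. This is a minimal version under stated assumptions: the first k points seed the clusters (real implementations usually pick seeds at random), similarity is squared Euclidean distance, and the names kmeans, sq_dist, and the sample points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Cluster points; returns (centers, labels)."""
    centers = [tuple(p) for p in points[:k]]  # assumption: first k points are the seeds
    labels = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the clusters are stable
        labels = new_labels
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centers, labels = kmeans(points, k=2)
```

The loop stops as soon as an iteration leaves every assignment unchanged, which corresponds to the "until no change" condition of the algorithm.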
The technique identifies objects called sequential objects. These help to predict the
strong dependencies among events.
A sequential object may be : goods purchased by a customer, medical treatment given to a
patient, etc.
It is observed that the occurrence of a particular event often depends on a previous event.
Many transactions are performed at regular time intervals, such as a weekly report.
Applications :
Financial Market
Medical diagnosis
Y = f(x1, x2, x3, ..., xn)
Neural Network : It is a technique which uses generalized regression & provides an
iterative method to carry it out repeatedly.
Self Adaptive
Classification Tasks
Applications of GA
Image Analysis
Scheduling
Engineering Design
Text Mining
Text data is everywhere: books, news, articles, financial analysis, blogs, social
networking, etc.
Text mining seeks to automatically discover useful knowledge from this massive
amount of data.
Active research is going on in the area of text mining in industry and academia.
It is a recent approach for exploring very large datasets which combines traditional mining
methods and information visualization techniques.
It is required since the size of the data is very large & if it is not displayed in an
organized manner, trends & patterns cannot be recognized appropriately.
It allows the user to combine automated calculations with human perception to observe trends
& patterns.
Univariate :
Bivariate
Multivariate
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Here are a few
typical cases:
Design and construction of data warehouses for multidimensional data analysis and
data mining.
TELECOMMUNICATION INDUSTRY
The growth of the insurance industry depends entirely on the ability to convert data
into knowledge, information or intelligence about customers, competitors and
markets. Data mining has been applied in the insurance industry only lately, but it has brought tremendous
competitive advantages to the companies that have implemented it successfully.
Data mining helps determine the distribution schedules among warehouses and
outlets and analyze loading patterns.
Data mining makes it possible to characterize patient activities to see incoming office visits.
Data mining helps identify the patterns of successful medical therapies for different
illnesses.
Identification of the most-stolen products : the products which are stolen most frequently are
identified. A mechanism for their security can then be planned & thieves can be detected.
Retail industry: huge amounts of data on sales, customer shopping history, etc.