DMDW Chapter 4
Frequent Pattern Analysis
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Frequent Pattern Mining Important?
• Frequent pattern: An intrinsic (fundamental) and important property of datasets
• Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential and structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications
Association Rule Mining
• Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets.
• Frequent patterns are mined first; association rules are then derived from them
• Initially used for Market Basket Analysis to analyze customer buying habits by finding
associations between the different items that customers place in their “shopping
baskets”
Association Rule Mining
• The results can be used for
– Designing marketing and advertising strategies
– Shelf arrangement
– Designing different shop layouts
Association Rule Mining
• Association Rule:
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Rule Evaluation Metrics
– Support (s): the probability that a transaction contains A ∪ B
s = support(A → B) = P(A ∪ B)
– Confidence (c): the conditional probability that a transaction containing A also contains B
c = confidence(A → B) = P(B|A)
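To make these metrics concrete, here is a minimal Python sketch (using the five transactions from the table above; the helper name support is my own) that computes s and c for the rule {Milk, Diaper} → {Beer}:

    # Transactions from the table above (TID 1-5)
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    A, B = {"Milk", "Diaper"}, {"Beer"}
    s = support(A | B)               # P(A ∪ B) = 2/5 = 0.40
    c = support(A | B) / support(A)  # P(B|A) = (2/5)/(3/5) = 2/3 ≈ 0.67
    print(f"support = {s:.2f}, confidence = {c:.2f}")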
Basic Concepts: Association Rules
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Basic Concepts: Frequent Patterns and Association Rules
• Association rule mining is a two-step process (find all frequent itemsets, then generate strong rules from them); the overall performance is determined by the first step
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Mining Association Rules—an Example
The Apriori Algorithm
• The name, Apriori, is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties
• Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets
– The first pass determines the frequent 1-itemsets, denoted L1
– Each subsequent pass k consists of two phases:
– First, the frequent itemsets Lk-1 are used to generate the candidate itemsets Ck
– Next, the database is scanned and the support of the candidates in Ck is counted
– The candidates that satisfy minimum support form the frequent itemsets Lk
Apriori Property
• Apriori property: any subset of a frequent (large) itemset must also be frequent
– If {beer, diaper, nuts} is frequent, so is {beer, diaper}
– Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Anti-monotone: if a set cannot pass a test, all of its supersets will fail the same test as
well
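As a sanity check on this property, here is a tiny Python sketch (the transactions are made up for illustration) showing that a superset's support count can never exceed its subset's:

    # Made-up transactions for illustration
    transactions = [
        {"beer", "diaper", "nuts"},
        {"beer", "diaper"},
        {"beer", "milk"},
        {"diaper", "nuts"},
    ]

    def support_count(itemset):
        """Number of transactions containing every item of `itemset`."""
        return sum(itemset <= t for t in transactions)

    print(support_count({"beer", "diaper"}))          # 2
    print(support_count({"beer", "diaper", "nuts"}))  # 1 -- never above the subset's count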
Apriori: A Candidate Generation-and-test Approach
• Apriori pruning principle: If there is any itemset which is infrequent, its superset should
not be generated/tested!
• Method: join and prune steps
– Generate candidate (k+1)-itemsets Ck+1 from frequent k-itemsets Lk
– If any k-subset of a candidate (k+1)-itemset is not in Lk, then the candidate cannot be frequent either, and so can be removed from Ck+1
– Test the candidates against DB to obtain Lk+1
The Apriori Algorithm—Example
• Let the minimum support be 20%
The Apriori Algorithm
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    end
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
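The pseudo-code above translates directly into a short runnable version. The following Python sketch is a minimal, unoptimized rendering of the level-wise search (the function and variable names are my own, and min_support_count is an absolute count rather than a percentage):

    from itertools import combinations

    def apriori(transactions, min_support_count):
        """Minimal level-wise Apriori; returns {frozenset: support count}."""
        transactions = [frozenset(t) for t in transactions]

        def frequent(candidates):
            # Scan the database and keep candidates meeting min_support_count
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:          # candidate contained in transaction
                        counts[c] += 1
            return {c: n for c, n in counts.items() if n >= min_support_count}

        # L1: frequent 1-itemsets
        Lk = frequent({frozenset([i]) for t in transactions for i in t})
        result = dict(Lk)
        k = 1
        while Lk:
            # Join step: unions of frequent k-itemsets that form (k+1)-itemsets
            Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # Prune step: drop candidates with an infrequent k-subset
            Ck1 = {c for c in Ck1
                   if all(frozenset(s) in Lk for s in combinations(c, k))}
            Lk = frequent(Ck1)
            result.update(Lk)
            k += 1
        return result

    # The five transactions from the earlier table; a support count of 2
    # (40%) is used here to keep the output small
    tx = [{"Bread", "Milk"},
          {"Bread", "Diaper", "Beer", "Eggs"},
          {"Milk", "Diaper", "Beer", "Coke"},
          {"Bread", "Milk", "Diaper", "Beer"},
          {"Bread", "Milk", "Diaper", "Coke"}]
    for itemset, n in sorted(apriori(tx, 2).items(),
                             key=lambda kv: (len(kv[0]), -kv[1])):
        print(sorted(itemset), n)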
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
– Pruning:
– acde is removed because ade is not in L3
– C4 = {abcd} (see the sketch below)
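This join-and-prune example can be checked mechanically. A minimal Python sketch (itemsets written as sorted tuples; the names are my own) reproducing the L3 → C4 steps above:

    from itertools import combinations

    # L3 from the slide: {abc, abd, acd, ace, bcd}, items kept in sorted order
    L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]

    # Self-join: merge pairs that agree on their first k-1 = 2 items
    C4 = [p + (q[2],) for p in L3 for q in L3
          if p[:2] == q[:2] and p[2] < q[2]]
    print(C4)   # [('a','b','c','d'), ('a','c','d','e')], i.e. abcd and acde

    # Prune: remove candidates with a 3-subset not in L3
    L3_set = set(L3)
    C4 = [c for c in C4 if all(s in L3_set for s in combinations(c, 3))]
    print(C4)   # [('a','b','c','d')] -- acde removed because ade is not in L3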
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1 (join pairs of itemsets in Lk-1 that agree on their first k-2 items)
• Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
• The association rules that can be generated from the frequent itemset l = {I1, I2, I5} (from L3):
…
– The nonempty subsets of l are …
– The resulting association rules are as shown below, each listed with its confidence:
• If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong (see the sketch after this list)
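The rule-generation step can be sketched in a few lines of Python. The support counts below are assumptions: they match the standard textbook example this slide appears to follow (Han & Kamber), but are not shown on the slide itself.

    from itertools import combinations

    # Generate rules from the frequent itemset l = {I1, I2, I5}.
    # Assumed support counts (not reproduced on the slide):
    support_count = {
        frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
        frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
        frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2,
    }

    l = frozenset(["I1", "I2", "I5"])
    min_conf = 0.70
    for r in range(len(l) - 1, 0, -1):             # proper nonempty subsets X of l
        for X in map(frozenset, combinations(sorted(l), r)):
            conf = support_count[l] / support_count[X]   # conf(X -> l-X) = P(l-X | X)
            verdict = "strong" if conf >= min_conf else "weak"
            print(f"{sorted(X)} -> {sorted(l - X)}: {conf:.0%} ({verdict})")

With these counts the second, third, and last rules printed reach 100% confidence and are the strong ones, consistent with the slide's conclusion.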
Mining Frequent Itemsets without Candidate Generation
Algorithm of FP-growth
Construct FP-tree from a Transaction Database
Let the minimum support be 20%
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order, f-list
3. Scan DB again and construct the FP-tree (see the sketch below)
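A minimal Python sketch of these three steps (class and function names are my own, and the sample transactions are illustrative rather than taken from the slide):

    from collections import Counter

    class Node:
        """One FP-tree node: an item, its count, a parent link, and children."""
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}                    # item -> Node

    def build_fp_tree(transactions, min_support_count):
        # Pass 1: count items; frequent ones, in descending frequency, form the f-list
        freq = Counter(i for t in transactions for i in t)
        f_list = [i for i, n in freq.most_common() if n >= min_support_count]
        rank = {item: r for r, item in enumerate(f_list)}

        # Pass 2: insert each transaction, filtered to frequent items and
        # sorted in f-list order, so common prefixes share tree paths
        root = Node(None, None)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    node.children[item] = Node(item, node)
                node = node.children[item]
                node.count += 1
        return root, f_list

    def show(node, depth=0):
        if node.item is not None:
            print("  " * (depth - 1) + f"{node.item}:{node.count}")
        for child in node.children.values():
            show(child, depth + 1)

    # Illustrative transactions; minimum support count = 3
    tx = [{"f", "a", "c", "m", "p"},
          {"f", "a", "c", "b", "m"},
          {"f", "b"},
          {"c", "b", "p"},
          {"f", "a", "c", "m", "p"}]
    root, f_list = build_fp_tree(tx, 3)
    print("f-list:", f_list)
    show(root)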
Additional references
• https://www.youtube.com/watch?v=yCbankIouUU
• http://home.etf.rs/~vm/os/dmsw/FP-Growth%20algorithm.ppt
Thank You