Association Rules and Correlations
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

Itemset X = {x1, …, xk}
Find all the rules X ⇒ Y with minimum support and confidence
- support, s: probability that a transaction contains X ∪ Y
- confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of "Customer buys milk", "Customer buys bread", and "Customer buys both"]

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
- A ⇒ D (support 60%, confidence 100%)
- D ⇒ A (support 60%, confidence 75%)
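To make the definitions concrete, here is a minimal Python sketch (not from the original slides) that computes support and confidence for the example database above:

transactions = [
    {"A", "B", "D"},            # Tid 10
    {"A", "C", "D"},            # Tid 20
    {"A", "D", "E"},            # Tid 30
    {"B", "E", "F"},            # Tid 40
    {"B", "C", "D", "E", "F"},  # Tid 50
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """P(t contains Y | t contains X) = support(X u Y) / support(X)."""
    return support(x | y, db) / support(x, db)

print(support({"A", "D"}, transactions))       # 0.6  -> rule support 60%
print(confidence({"A"}, {"D"}, transactions))  # 1.0  -> A => D, 100%
print(confidence({"D"}, {"A"}, transactions))  # 0.75 -> D => A, 75%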
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
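A small sketch (not from the slides) of how these two definitions can be checked, assuming the frequent itemsets and support counts from the earlier example are given:

# Input: frequent itemsets with support counts, from the example above.
freq = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset("AD"): 3,   # frozenset("AD") == {"A","D"}
}

def is_closed(x, freq):
    # Closed: no frequent proper superset has the same support as x.
    return all(freq[y] != freq[x] for y in freq if x < y)

def is_max(x, freq):
    # Maximal: no frequent proper superset exists at all.
    return not any(x < y for y in freq)

for x in freq:
    print(sorted(x), freq[x],
          "closed" if is_closed(x, freq) else "",
          "max" if is_max(x, freq) else "")
# A is not closed (AD has the same support 3); D is closed but not maximal;
# B, E, and AD are both closed and maximal.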
Scalable Methods for Mining Frequent Patterns
Scalable mining methods: Three major approaches
- Apriori (Agrawal & Srikant @VLDB’94)
- Frequent pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
- Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori Algorithm
Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994
for mining frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm
uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise (breadth-first) search, where frequent k-itemsets are used to explore (k+1)-itemsets. This is called a generate-and-test strategy.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted by L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is
used to find L3, and so on, until no more frequent k-itemsets can be
found.
The finding of each Lk requires one full scan of the database.
Apriori Property
All non-empty subsets of a frequent itemset must also be frequent. For example, if {bread, milk} is frequent, then {bread} and {milk} must be frequent as well; equivalently, if an itemset has an infrequent subset, the itemset itself cannot be frequent.
Apriori Algorithm: Method
- Initially, scan the DB once to get the frequent 1-itemsets (L1).
- Join step: generate the length-(k+1) candidate itemsets Ck+1 from the length-k frequent itemsets by joining Lk with itself; Ck+1 is a superset of Lk+1.
- Prune step: to reduce the size of Ck+1, the Apriori property is used: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset. Then scan the DB, check the support of the remaining candidates, and delete all itemsets below min_support. This gives Lk+1 (see the sketch below).
- Terminate when no frequent or candidate set can be generated; otherwise repeat from the join step.
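A minimal sketch of the join and prune steps, assuming Lk is given as a set of frozensets (the helper name apriori_gen is hypothetical):

from itertools import combinations

def apriori_gen(Lk, k):
    """Join Lk with itself, then prune using the Apriori property."""
    # Join step: unite pairs of frequent k-itemsets that share k-1 items.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # Prune step: drop any candidate having an infrequent k-subset.
    return {c for c in candidates
            if all(frozenset(s) in Lk for s in combinations(c, k))}

# Hypothetical L2, matching the worked example further below:
L2 = {frozenset(p) for p in [("A","C"), ("B","C"), ("B","E"), ("C","E")]}
print(apriori_gen(L2, 2))   # {frozenset({'B', 'C', 'E'})}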
Flow Chart
[Figure: flow chart of the Apriori algorithm]
The Apriori Algorithm—An Example (Supmin = 2)

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2 generated from L1: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2:
Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3: {B, C, E}

3rd scan → L3:
Itemset | sup
{B, C, E} | 2
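The trace above can be verified by brute force, counting the support of every candidate directly (this sketch enumerates all combinations rather than generating candidates level-wise):

from itertools import combinations

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
min_sup = 2
items = sorted(set().union(*db))

for k in (1, 2, 3):
    # Count how many transactions contain each k-combination of items.
    Lk = {c: sum(set(c) <= t for t in db) for c in combinations(items, k)}
    Lk = {c: n for c, n in Lk.items() if n >= min_sup}
    print(f"L{k}:", Lk)
# L1: A:2, B:3, C:3, E:3;  L2: AC:2, BC:2, BE:3, CE:2;  L3: BCE:2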
Problem
Find all the frequent itemsets using the Apriori algorithm.
Generating Association Rules
Once the frequent itemsets have been found, rules are generated as follows: for each frequent itemset l, generate all non-empty proper subsets s of l; for each such s, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf. Because the rules come from frequent itemsets, they automatically satisfy minimum support (a sketch follows below).
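A minimal sketch of this procedure, using support counts taken from the worked example above (the helper name rules_from is hypothetical):

from itertools import combinations

# Support counts from the worked Apriori example.
support = {frozenset("BCE"): 2, frozenset("BC"): 2, frozenset("BE"): 3,
           frozenset("CE"): 2, frozenset("B"): 3, frozenset("C"): 3,
           frozenset("E"): 3}

def rules_from(l, support, min_conf):
    # Every non-empty proper subset s of l yields a candidate rule
    # s => (l - s) with confidence support(l) / support(s).
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = support[l] / support[s]
            if conf >= min_conf:
                yield set(s), set(l - s), conf

for lhs, rhs, conf in rules_from(frozenset("BCE"), support, 0.7):
    print(lhs, "=>", rhs, f"conf={conf:.2f}")
# {B, C} => {E} conf=1.00 and {C, E} => {B} conf=1.00 survive min_conf=0.7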
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
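A runnable Python version of this pseudo-code, as a minimal sketch; transactions are assumed to be given as sets and min_sup as an absolute count:

from itertools import combinations
from collections import Counter

def apriori(db, min_sup):
    # L1: count single items in one scan.
    counts = Counter(frozenset([i]) for t in db for i in t)
    Lk = {c: n for c, n in counts.items() if n >= min_sup}
    result = dict(Lk)
    k = 1
    while Lk:
        # Join: unite pairs of frequent k-itemsets that share k-1 items.
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: drop candidates with an infrequent k-subset (Apriori property).
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One scan of db counts all surviving candidates.
        Lk = {c: sum(c <= t for t in db) for c in cands}
        Lk = {c: n for c, n in Lk.items() if n >= min_sup}
        result.update(Lk)
        k += 1
    return result

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
for itemset, n in apriori(db, 2).items():
    print(sorted(itemset), n)   # reproduces L1, L2, L3 from the example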
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Improving the Efficiency of the Apriori Algorithm
Several variations of Apriori have been proposed, based on the following considerations.
Partitioning: it consists of two phases. In Phase I, the transactions are divided into n non-overlapping partitions and the local frequent itemsets of each partition are found; any itemset that is frequent in the whole database must be frequent in at least one of the partitions, so the union of the local frequent itemsets forms the global candidate set. In Phase II, a second scan of the database determines the actual support of each candidate. Only two database scans are needed (see the sketch below).
Dynamic itemset counting: the database is divided into blocks marked by start points, and new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.
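A minimal sketch of the two-phase partitioning idea, using brute-force local mining for brevity (the function names are hypothetical):

from itertools import combinations

def local_frequent(part, min_frac, max_k=3):
    # Brute-force local mining, capped at size max_k for the sketch.
    # Any itemset frequent in the whole database (relative threshold
    # min_frac) is frequent in at least one partition.
    items = sorted(set().union(*part))
    return {frozenset(c)
            for k in range(1, max_k + 1)
            for c in combinations(items, k)
            if sum(set(c) <= t for t in part) >= min_frac * len(part)}

def partitioned_mine(db, min_frac, n_parts=2):
    size = (len(db) + n_parts - 1) // n_parts
    # Phase I: union of local frequent itemsets = global candidate set.
    cands = set()
    for i in range(0, len(db), size):
        cands |= local_frequent(db[i:i + size], min_frac)
    # Phase II: one full scan determines the actual global supports.
    return {c for c in cands
            if sum(c <= t for t in db) >= min_frac * len(db)}

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(partitioned_mine(db, 0.5))   # same frequent itemsets as Apriori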
Benefits of the FP-tree Structure
- Completeness: preserves complete information for frequent pattern mining
- Compactness: reduces irrelevant info—infrequent items are gone
Partition patterns and databases: the frequent patterns can be partitioned into subsets according to the frequency-ordered item list, e.g., patterns containing p, patterns containing m but not p, …, and finally the single pattern f.
[Figure: FP-tree for the example database. Root {} with children f:4 and c:1; f:4 → c:3 → a:3, with a:3 → m:2 → p:2 and a:3 → b:1 → m:1; f:4 also has child b:1; the right branch is c:1 → b:1 → p:1.]

Header Table:
Item | frequency
f | 4
c | 4
a | 3
b | 3
m | 3
p | 3

Conditional pattern bases:
item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1
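The tables above can be reproduced in code. The sketch below assumes the transaction database this classic example is usually drawn from (the slide shows only the derived tables) and takes the item order directly from the header table:

from collections import Counter, defaultdict

# Assumed input: the transaction database commonly used with this example.
db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
flist = list("fcabmp")   # item order taken from the header table above

def ordered(t):
    """Drop infrequent items; sort the survivors in f-list order."""
    return [i for i in flist if i in t]

# Conditional pattern base of item x: for each transaction containing x,
# the prefix of items preceding x in f-list order, with counts aggregated.
cpb = defaultdict(Counter)
for t in db:
    o = ordered(t)
    for pos, x in enumerate(o):
        if pos:
            cpb[x]["".join(o[:pos])] += 1

for x in flist[1:]:
    print(x, dict(cpb[x]))
# c {'f': 3}; a {'fc': 3}; b {'fca': 1, 'f': 1, 'c': 1};
# m {'fca': 2, 'fcab': 1}; p {'fcam': 2, 'cb': 1}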
Problem
Find all the frequent itemsets using the FP-Growth algorithm.
Mining Frequent Itemsets Using the Vertical Data Format
Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions given in TID–itemset format (i.e., {TID : itemset}), known as the horizontal data format.
Alternatively, data can be presented in item–TID_set format (i.e., {item : TID_set}), where item is an item name and TID_set is the set of identifiers of the transactions that contain the item. This is known as the vertical data format.
Steps in Mining Frequent Patterns in the Vertical Data Format
1. Transform the horizontally formatted data into the vertical data format (one database scan needed).
2. Calculate the support count of each 1-itemset: it is simply the length of its TID_set.
3. Starting with k = 1, construct the frequent (k+1)-itemsets from the frequent k-itemsets based on the Apriori property, by intersecting the TID_sets of the frequent k-itemsets to compute the TID_sets of the (k+1)-itemsets.
4. Repeat with k = k + 1 until no frequent itemsets or candidate itemsets can be found (see the sketch after this list).
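A minimal sketch of these steps via TID-set intersection, reusing the toy database from the Apriori example:

from collections import defaultdict
from itertools import combinations

db = {10: {"A","C","D"}, 20: {"B","C","E"},
      30: {"A","B","C","E"}, 40: {"B","E"}}
min_sup = 2

# Step 1: one scan transforms horizontal -> vertical {item : TID_set}.
vert = defaultdict(set)
for tid, items in db.items():
    for i in items:
        vert[i].add(tid)

# Step 2: the support of a 1-itemset is just len(TID_set).
Lk = {frozenset([i]): tids for i, tids in vert.items()
      if len(tids) >= min_sup}

# Steps 3-4: intersect TID_sets of frequent k-itemsets to build level k+1.
while Lk:
    print({tuple(sorted(c)): sorted(t) for c, t in Lk.items()})
    next_level = {}
    for a, b in combinations(Lk, 2):
        c = a | b
        if len(c) == len(a) + 1:
            tids = Lk[a] & Lk[b]
            if len(tids) >= min_sup:
                next_level[c] = tids
    Lk = next_level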
Analysis
No database scan is needed to compute the support of a (k+1)-itemset: it is simply the length of its TID_set. However, if the TID_sets grow long, they take substantial memory space, and the intersection operations become expensive to compute.
Strong Rules Are Not Necessarily Interesting
The previous example illustrates that the confidence of a rule A ⇒ B can be deceiving: it does not measure the real strength (or lack of strength) of the correlation and implication between A and B. Hence, alternatives to the support–confidence framework can be useful in mining interesting data relationships.
One such measure is lift:

  lift(A, B) = P(A ∪ B) / (P(A) P(B))    (Eq. 6.8)

If the resulting value of Eq. (6.8) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one likely leads to the absence of the other. If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
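A minimal sketch computing this lift measure from contingency counts (the counts below are hypothetical):

def lift(n, n_a, n_b, n_ab):
    """lift(A, B) = P(A u B) / (P(A) * P(B)), estimated from counts:
    n transactions, n_a containing A, n_b containing B, n_ab containing both."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# Hypothetical: 10000 transactions, 6000 contain A, 7500 contain B,
# 4000 contain both.
print(lift(10000, 6000, 7500, 4000))   # ~0.89 < 1: negatively correlated,
                                       # even though conf(A => B) = 66.7%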