Data Mining Association Rules
Outline
Transactions
Frequent itemsets
Subset Property
Association rules
Applications
Association rules
Transaction data
Market basket analysis
TID   Produce
1     MILK, BREAD, EGGS
2     BREAD, SUGAR
3     BREAD, CEREAL
4     MILK, BREAD, SUGAR
5     MILK, CEREAL
6     BREAD, CEREAL
7     MILK, CEREAL
8     MILK, BREAD, CEREAL, EGGS
9     MILK, BREAD, CEREAL
Applications
Broad applications
Transactions Example
Instances = Transactions
ITEMS: A = milk, B = bread, C = cereal, D = sugar, E = eggs

TID   A  B  C  D  E
1     1  1  0  0  1
2     0  1  0  1  0
3     0  1  1  0  0
4     1  1  0  1  0
5     1  0  1  0  0
6     0  1  1  0  0
7     1  0  1  0  0
8     1  1  1  0  1
9     1  1  1  0  0
Definitions
Item: an attribute=value pair, or simply a value
(usually attributes are converted to binary flags, one per value; e.g. product=A is written simply as A)
Itemset: a set of items
Support sup(I): the number of transactions that contain all items of itemset I

In the example database:
sup({A,B,E}) = 2, sup({B,C}) = 4
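The support counts above can be checked with a short sketch (the transaction encoding and the helper name `sup` are written out here for illustration):

```python
# The nine-transaction example database from the slides,
# with items encoded as A=milk, B=bread, C=cereal, D=sugar, E=eggs.
DB = [
    {"A", "B", "E"},         # 1: milk, bread, eggs
    {"B", "D"},              # 2: bread, sugar
    {"B", "C"},              # 3: bread, cereal
    {"A", "B", "D"},         # 4: milk, bread, sugar
    {"A", "C"},              # 5: milk, cereal
    {"B", "C"},              # 6: bread, cereal
    {"A", "C"},              # 7: milk, cereal
    {"A", "B", "C", "E"},    # 8: milk, bread, cereal, eggs
    {"A", "B", "C"},         # 9: milk, bread, cereal
]

def sup(itemset):
    """Support: number of transactions containing every item of `itemset`."""
    return sum(1 for t in DB if set(itemset) <= t)

print(sup({"A", "B", "E"}))  # 2
print(sup({"B", "C"}))       # 4
```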
Association Rules
Association rule R : Itemset1 => Itemset2
Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
meaning: if a transaction includes Itemset1, then it also includes Itemset2
Examples
A,B => E,C
A => B,C
Association Rules
Measures: support, confidence (also called accuracy), lift
confidence(I => J) = sup(I ∪ J) / sup(I)
Don't qualify (minconf = 50%):
A => B,E : conf = 2/6 = 33% < 50%
B => A,E : conf = 2/7 = 28% < 50%
__ => A,B,E : conf = 2/9 = 22% < 50%
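The rejected rules can be verified directly: confidence of I => J is sup(I ∪ J) / sup(I), and an empty left-hand side is contained in every transaction. A minimal sketch over the same nine-transaction database:

```python
# Example database: A=milk, B=bread, C=cereal, D=sugar, E=eggs.
DB = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def sup(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)

def confidence(lhs, rhs):
    """conf(LHS => RHS) = sup(LHS | RHS) / sup(LHS)."""
    return sup(lhs | rhs) / sup(lhs)

print(round(confidence({"A"}, {"B", "E"}), 2))       # 0.33 -> fails minconf 50%
print(round(confidence({"B"}, {"A", "E"}), 2))       # 0.29
print(round(confidence(set(), {"A", "B", "E"}), 2))  # 0.22 (empty antecedent)
```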
Problem:
Find all association rules with given minsup and
minconf
Another example
Given: five frequent three-item sets
(A B C), (A B D), (A C D), (A C E), (B C D)
The join step yields two four-item candidates:
(A B C D) — OK? Yes: all of its three-item subsets (A B C), (A B D), (A C D), (B C D) are frequent.
(A C D E) — OK? No: (A D E) and (C D E) are not frequent, so it is pruned.
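The join-and-prune candidate generation above can be sketched as follows (itemsets as sorted tuples; the function name `apriori_gen` is illustrative):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Candidates of size k+1 from frequent k-itemsets (sorted tuples)."""
    Lk = sorted(Lk)
    k = len(Lk[0])
    # Join step: merge pairs that agree on their first k-1 items.
    joined = [a + (b[-1],)
              for i, a in enumerate(Lk)
              for b in Lk[i + 1:]
              if a[:-1] == b[:-1]]
    # Prune step: every k-subset of a candidate must itself be frequent.
    frequent = set(Lk)
    return [c for c in joined
            if all(s in frequent for s in combinations(c, k))]

L3 = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
print(apriori_gen(L3))  # [('A', 'B', 'C', 'D')] -- (A C D E) is pruned
```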
Transaction database:

TID   Items bought
10    A, B, C
20    A, C
30    A, D
40    B, E, F

Frequent itemsets (min_sup = 50%):

Itemset   Support
{A}       75%
{B}       50%
{C}       50%
{A, C}    50%

For rule A => C:
support = sup({A, C}) = 50%
confidence = sup({A, C}) / sup({A}) = 66.6%
Database TDB (min_sup = 2):

TID   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan — C1 with support counts:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan — support counts:
{A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
L2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2

C3 (generated from L2): {B,C,E}
3rd scan — L3: {B,C,E}: 2
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
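The pseudocode above, including candidate generation with pruning, can be sketched in Python (function and variable names are illustrative, not from a specific library):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {k: sorted frequent k-itemsets} for a list of transaction sets."""
    def count(cands):
        return {c: sum(1 for t in db if set(c) <= t) for c in cands}

    items = sorted({i for t in db for i in t})
    counts = count([(i,) for i in items])
    Lk = sorted(c for c, n in counts.items() if n >= min_sup)
    L, k = {}, 1
    while Lk:
        L[k] = Lk
        frequent = set(Lk)
        # Candidate generation: join on common (k-1)-prefix, then prune.
        cands = [a + (b[-1],)
                 for i, a in enumerate(Lk) for b in Lk[i + 1:]
                 if a[:-1] == b[:-1]]
        cands = [c for c in cands
                 if all(s in frequent for s in combinations(c, k))]
        counts = count(cands)
        Lk = sorted(c for c, n in counts.items() if n >= min_sup)
        k += 1
    return L

TDB = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(apriori(TDB, 2)[3])  # [('B', 'C', 'E')]
```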
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
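A minimal stand-in for this method: index the candidates in a dict keyed by frozenset, and let the subset function enumerate each transaction's k-subsets and probe the index (same interface as the hash-tree, but a deliberately simpler structure):

```python
from itertools import combinations

def count_candidates(db, candidates):
    """Support counting for candidate k-itemsets: enumerate each
    transaction's k-subsets and probe an index (a plain dict is used
    here as a simple stand-in for the hash-tree)."""
    k = len(next(iter(candidates)))
    index = {frozenset(c): 0 for c in candidates}
    for t in db:
        for s in combinations(sorted(t), k):  # all k-subsets contained in t
            if frozenset(s) in index:
                index[frozenset(s)] += 1
    return index

TDB = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
C2 = [("A","B"), ("A","C"), ("A","E"), ("B","C"), ("B","E"), ("C","E")]
print(count_candidates(TDB, C2)[frozenset({"B", "E"})])  # 3
```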
Seven rules can be generated from the itemset {Humidity=Normal, Windy=False, Play=Yes} (support 4), with accuracies:

If Humidity=Normal and Windy=False then Play=Yes             4/4
If Humidity=Normal and Play=Yes then Windy=False             4/6
If Windy=False and Play=Yes then Humidity=Normal             4/6
If Humidity=Normal then Windy=False and Play=Yes             4/7
If Windy=False then Humidity=Normal and Play=Yes             4/8
If Play=Yes then Humidity=Normal and Windy=False             4/9
If True then Humidity=Normal and Windy=False and Play=Yes    4/12
Rules with 100% confidence in the weather data:

Association rule                                           Sup.   Conf.
Humidity=Normal Windy=False => Play=Yes                    …      100%
Temperature=Cool => Humidity=Normal                        …      100%
Outlook=Overcast => Play=Yes                               …      100%
Temperature=Cool Play=Yes => Humidity=Normal               …      100%
...                                                        ...    ...
(rule 58) Outlook=Sunny Temperature=Hot => Humidity=High   …      100%
Better algorithms

Fast algorithms:
Hash-based (Park et al., SIGMOD'95)
Partitioning
Parallel mining (Agrawal et al., TKDE'96)

Problem extensions:
Generalized A.R.
Quantitative A.R. (Srikant et al., SIGMOD'96)
N-dimensional A.R.
Meta-rule-guided mining
Bottleneck: candidate generation and test
Can we avoid candidate generation?
Another algorithm: the FP-tree (FP-growth)
Figure: run time (sec.) vs. support threshold (%) on dataset D1, comparing D1 FP-growth runtime with D1 Apriori runtime.
Weka associations
File: weather.nominal.arff
MinSupport: 0.2
Interpretation:
if lift > 1, then I and J are positively correlated
(lift < 1: negatively correlated; lift = 1: independent)

corr(A, B) = P(A ∪ B) / ( P(A) · P(B) )
              Basketball   Not basketball   Sum (row)
Cereal           2000          1750           3750
Not cereal       1000           250           1250
Sum (col.)       3000          2000           5000
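Plugging the contingency-table counts into the lift formula shows why the high-confidence rule basketball => cereal is misleading:

```python
def lift(n_both, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    p_both = n_both / n_total
    return p_both / ((n_a / n_total) * (n_b / n_total))

# 2000 students play basketball AND eat cereal; 3000 play basketball,
# 3750 eat cereal, out of 5000 students (numbers from the table above).
print(round(lift(2000, 3000, 3750, 5000), 3))  # 0.889 -> < 1: negatively correlated
print(round(lift(1000, 3000, 1250, 5000), 3))  # 1.333 -> basketball vs NOT cereal: > 1
```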
Uniform support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced support:
Level 1 (min_sup = 5%)
Level 2 (min_sup = 3%)
Age    Salary
44     30 000
55     45 000
45     50 000
34     44 000
45     38 000
33     44 000

Sample Rules                                          Support   Confidence
<age: 44..55> and <status: married> => <numCars: 2>   50%       100%
<NumCars: 0..1> => <Married: No>                      33%       66.7%
Multi-dimensional Association
Single-dimensional rules:
buys(X, milk) => buys(X, bread)
Categorical Attributes
finite number of possible values, no ordering among values
Quantitative Attributes
numeric, implicit ordering among values
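Before mining, quantitative attributes are typically discretized into interval items. A minimal sketch applied to the Age/Salary rows above (the interval widths and the item-naming scheme are arbitrary choices for illustration):

```python
def bin_item(name, value, width):
    """Map a numeric value to an interval item, e.g. age=34 -> 'age:30..39'."""
    lo = (value // width) * width
    return f"{name}:{lo}..{lo + width - 1}"

rows = [(44, 30000), (55, 45000), (45, 50000),
        (34, 44000), (45, 38000), (33, 44000)]
transactions = [{bin_item("age", a, 10), bin_item("salary", s, 10000)}
                for a, s in rows]
print(sorted(transactions[0]))  # ['age:40..49', 'salary:30000..39999']
```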
A sequence database

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
A subsequence that is frequent in the database is called a sequential pattern.
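The containment check above (elements of the candidate matched, in order, to supersets among the sequence's elements) can be sketched greedily:

```python
def is_subsequence(sub, seq):
    """sub and seq are lists of item sets; sub is contained in seq if its
    elements can be matched, in order, to supersets among seq's elements.
    Greedy matching (take the earliest possible element) is safe here."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# <a(bc)dc>  vs.  <a(abc)(ac)d(cf)>  from the sequence database above
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
print(is_subsequence(sub, seq))  # True
```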
Applications
Market basket analysis
Store layout, client offers
Conclusions
Association rule mining
probably the most significant contribution from the database community to KDD
A large number of papers have been published
Summary
Frequent itemsets
Association rules
Subset property
Apriori algorithm
Extensions of this algorithm
Sequence patterns
Applications