04 AssociationRules
Apriori Algorithm
Advanced Techniques
Rule form: Antecedent => Consequent [support, confidence]

Examples

Applications

The retailer could move diapers and beer to separate places and position high-profit items of interest to young fathers along the path.
A => B [s, c]

Given:
support(A => B [s, c]) = p(A ∪ B)
confidence(A => B [s, c]) = p(B | A)

Find: all rules A => B whose support and confidence exceed user-specified minimum thresholds.
Example

Trans. Id | Purchased Items
1         | A, D
2         | A, C
3         | A, B, C
4         | B, E, F

Itemset: e.g., {A, B} or {B, E, F}

Support of an itemset:
Sup(A,B) = 1
Sup(A,C) = 2

Frequent pattern: given min. sup = 2, {A,C} is a frequent pattern.
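The support counts above can be checked with a short sketch (Python is assumed here; the transaction list mirrors the table):

```python
# Count the support of candidate itemsets over the example transactions.
transactions = [{"A", "D"}, {"A", "C"}, {"A", "B", "C"}, {"B", "E", "F"}]

def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

print(support({"A", "B"}, transactions))  # 1
print(support({"A", "C"}, transactions))  # 2

min_sup = 2
print(support({"A", "C"}, transactions) >= min_sup)  # True -> {A,C} is frequent
```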
Each transaction is represented by a Boolean vector.

TID  | Items
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A,C}            | 50%

For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
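These two numbers can be reproduced directly from the definitions (transactions as in the table above):

```python
# Support and confidence of the rule A => C over the four transactions.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
n = len(transactions)

def sup(items):
    """Fraction of transactions containing all of `items`."""
    return sum(1 for t in transactions if items <= t) / n

support_AC = sup({"A", "C"})               # 2/4 = 0.5
confidence = sup({"A", "C"}) / sup({"A"})  # 0.5 / 0.75

print(f"support = {support_AC:.0%}")       # support = 50%
print(f"confidence = {confidence:.1%}")    # confidence = 66.7%
```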
Apriori principle
Itemset Lattice
If an itemset is infrequent, then all of its supersets must also be infrequent.

[Figure: itemset lattice over {A, B, C, D, E}, from null down to ABCDE; once an itemset is found to be infrequent, its entire sub-lattice of supersets is pruned.]
Database D
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D -> C1
itemset | sup.
{1}     | 2
{2}     | 3
{3}     | 3
{4}     | 1
{5}     | 3

L1
itemset | sup.
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D -> C2
itemset | sup
{1 2}   | 1
{1 3}   | 2
{1 5}   | 1
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

L2
itemset | sup
{1 3}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

C3 (generation is straightforward)
itemset: {2 3 5}

Scan D -> L3
itemset | sup
{2 3 5} | 2
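The whole trace can be reproduced with a compact, unoptimized Apriori sketch (item values and the min_sup = 2 threshold follow the example; the function name is illustrative):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # C1 -> L1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 1
    while L:
        # Candidate generation: join L_k with itself, prune by the Apriori principle.
        candidates = set()
        for a in L:
            for b in L:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in L for s in combinations(u, k)):
                    candidates.add(u)
        # Count the surviving candidates with one scan of D.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(D, min_sup=2)
print(freq[frozenset({2, 3, 5})])  # 2
```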
Step 1: self-joining Lk

insert into Ck+1
select p.item1, p.item2, ..., p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk

Step 2: pruning

for all itemsets c in Ck+1 do
    for all k-subsets s of c do
        if (s is not in Lk) then delete c from Ck+1
Example of candidate generation, with L3 = {abc, abd, acd, ace, bcd}:

Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace

Pruning:
acde is removed because ade is not in L3

C4 = {abcd}

The Apriori pseudo-code:

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return L = ∪k Lk;
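The join-and-prune step can be sketched as follows (itemsets as sorted tuples; L3 as in the example):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Generate C(k+1) from L(k): self-join on the first k-1 items, then prune."""
    Lk = {tuple(sorted(s)) for s in Lk}
    k = len(next(iter(Lk)))
    candidates = set()
    for p in Lk:
        for q in Lk:
            # Join: equal on the first k-1 items, p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every k-subset of c must itself be frequent.
                if all(s in Lk for s in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))  # {('a', 'b', 'c', 'd')}
```

Here acde is generated by the join of acd and ace but discarded by the prune step, exactly as in the example.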
Method:
For every frequent itemset l and every nonempty proper subset A of l, output the rule A => (l - A) if its confidence meets min_confidence, where

confidence(A => B) = support_count({A,B}) / support_count({A})
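This rule-generation step can be sketched directly from the confidence formula (the transactions mirror the five-transaction example used later in the deck; names are illustrative):

```python
from itertools import combinations

def gen_rules(itemset, transactions, min_conf):
    """All rules A => B with A ∪ B = itemset whose confidence meets min_conf."""
    itemset = frozenset(itemset)
    count = lambda s: sum(1 for t in transactions if s <= t)
    rules = []
    for r in range(1, len(itemset)):  # every nonempty proper subset as antecedent
        for A in map(frozenset, combinations(itemset, r)):
            conf = count(itemset) / count(A)
            if conf >= min_conf:
                rules.append((set(A), set(itemset - A), conf))
    return rules

T = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}, {"A","B","C","E"}]
for A, B, conf in gen_rules({"B", "C", "E"}, T, min_conf=1.0):
    print(A, "=>", B, f"{conf:.0%}")
```

With min_conf = 1.0 only {B,C} => {E} and {C,E} => {B} survive; lowering it to 0.75 yields all six rules of the example.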
For example, from the frequent itemset {A, B, E} the candidate rules are:
A => B, E
A, B => E
A, E => B
B => A, E
B, E => A
E => A, B

Example over five transactions:

Trans-ID | Items
1        | A, C, D
2        | B, C, E
3        | A, B, C, E
4        | B, E
5        | A, B, C, E

Rules generated from the frequent itemset {B, C, E}:

Rule        | Conf.
{BC} => {E} | 100%
{BE} => {C} | 75%
{CE} => {B} | 100%
{B} => {CE} | 75%
{C} => {BE} | 75%
{E} => {BC} | 75%

Exercise

TID | Bread | Milk | Chips | Mustard | Beer | Diaper | Eggs | Coke
1   | 1     | 1    | 1     | 1       | 0    | 0      | 0    | 0
2   | 1     | 0    | 0     | 0       | 1    | 1      | 1    | 0
3   | 0     | 1    | 0     | 0       | 1    | 1      | 0    | 1
4   | 1     | 1    | 1     | 0       | 1    | 1      | 0    | 0
5   | 1     | 1    | 0     | 0       | 0    | 1      | 0    | 1
6   | 1     | 1    | 0     | 1       | 1    | 1      | 0    | 0
7   | 1     | 1    | 0     | 0       | 0    | 1      | 0    | 1

Min_support: 40% (0.4 * 7 = 2.8, so an itemset must appear in at least 3 transactions)
Min_confidence: 75%

C1: Bread 6, Milk 6, Chips 2, Mustard 2, Beer 4, Diaper 6, Eggs 1, Coke 3

L1: Bread 6, Milk 6, Beer 4, Diaper 6, Coke 3

C2: Bread,Milk 5; Bread,Beer 3; Bread,Diaper 5; Bread,Coke 2; Milk,Beer 3; Milk,Diaper 5; Milk,Coke 3; Beer,Diaper 4; Beer,Coke 1; Diaper,Coke 3

L2: Bread,Milk 5; Bread,Beer 3; Bread,Diaper 5; Milk,Beer 3; Milk,Diaper 5; Milk,Coke 3; Beer,Diaper 4; Diaper,Coke 3

C3: Bread,Milk,Beer 2; Bread,Milk,Diaper 4; Bread,Beer,Diaper 3; Milk,Beer,Diaper 3; Milk,Beer,Coke 1; Milk,Diaper,Coke 3

L3: Bread,Milk,Diaper 4; Bread,Beer,Diaper 3; Milk,Beer,Diaper 3; Milk,Diaper,Coke 3

Without pruning, C(8,1) + C(8,2) + C(8,3) = 8 + 28 + 56 = 92 candidate itemsets of size up to 3 would have to be counted; Apriori generates only 8 + 10 + 6 = 24.
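The exercise counts can be double-checked by brute force (the Boolean matrix mirrors the table; the 40% threshold gives a minimum count of 2.8):

```python
from itertools import combinations

items = ["Bread", "Milk", "Chips", "Mustard", "Beer", "Diaper", "Eggs", "Coke"]
rows = [
    [1,1,1,1,0,0,0,0],
    [1,0,0,0,1,1,1,0],
    [0,1,0,0,1,1,0,1],
    [1,1,1,0,1,1,0,0],
    [1,1,0,0,0,1,0,1],
    [1,1,0,1,1,1,0,0],
    [1,1,0,0,0,1,0,1],
]
transactions = [{i for i, v in zip(items, r) if v} for r in rows]
min_count = 0.4 * len(transactions)  # 2.8

def frequent(k):
    """All k-itemsets whose count meets the minimum support count."""
    out = {}
    for c in combinations(items, k):
        n = sum(1 for t in transactions if set(c) <= t)
        if n >= min_count:
            out[c] = n
    return out

print(len(frequent(1)), len(frequent(2)), len(frequent(3)))  # 5 8 4
print(frequent(3)[("Bread", "Milk", "Diaper")])              # 4
```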
Challenges
Builds a storage set C^k that stores in memory the frequent sets per transaction.
Database D
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

C^1
TID | Set-of-itemsets
100 | { {1},{3},{4} }
200 | { {2},{3},{5} }
300 | { {1},{2},{3},{5} }
400 | { {2},{5} }

L1
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C^2
TID | Set-of-itemsets
100 | { {1 3} }
200 | { {2 3},{2 5},{3 5} }
300 | { {1 2},{1 3},{1 5},{2 3},{2 5},{3 5} }
400 | { {2 5} }

L2
Itemset | Support
{1 3}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

C3: {2 3 5}

C^3
TID | Set-of-itemsets
200 | { {2 3 5} }
300 | { {2 3 5} }

L3
Itemset | Support
{2 3 5} | 2
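A minimal sketch of one such transaction-wise pass (a simplification of the C^k idea: a candidate is kept for a transaction when all of its (k-1)-subsets appear in that transaction's previous set, so counting needs no rescan of the raw items):

```python
from itertools import combinations

def tid_step(Cbar_prev, candidates):
    """From per-TID (k-1)-itemsets and k-candidates, build the per-TID
    set C^k and the candidate support counts in a single sweep."""
    Cbar, counts = {}, {c: 0 for c in candidates}
    for tid, sets in Cbar_prev.items():
        present = {c for c in candidates
                   if all(s in sets
                          for s in map(frozenset, combinations(c, len(c) - 1)))}
        if present:
            Cbar[tid] = present
            for c in present:
                counts[c] += 1
    return Cbar, counts

# C^1 built from database D.
Cbar1 = {100: {frozenset({1}), frozenset({3}), frozenset({4})},
         200: {frozenset({2}), frozenset({3}), frozenset({5})},
         300: {frozenset({1}), frozenset({2}), frozenset({3}), frozenset({5})},
         400: {frozenset({2}), frozenset({5})}}
C2 = [frozenset(p) for p in [(1,2), (1,3), (1,5), (2,3), (2,5), (3,5)]]
Cbar2, counts = tid_step(Cbar1, C2)
print(counts[frozenset({2, 5})])  # 3
```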
Phase I
Divide D into n non-overlapping partitions.
Find the frequent itemsets local to each partition (parallel alg.).

Phase II
Combine the local results to form a global set of candidate itemsets.
Scan D once more to find the frequent itemsets in D among the global candidates.

DIC: at each start point, DIC estimates the support of all itemsets that are currently being counted, and adds new candidate itemsets when all of their subsets are estimated to be frequent.
Comment
Interestingness Measurements
Support
Confidence or strength
Conviction
Subjective measures
Leverage or Piatetsky-Shapiro
Coverage
Lift of a Rule

lift(A => B) = sup(A, B) / (sup(A) * sup(B)) = P(B | A) / P(B)

Example 1:

           | basketball | not basketball | sum(row)
cereal     | 2000       | 1750           | 3750 (75%)
not cereal | 1000       | 250            | 1250 (25%)
sum(col.)  | 3000 (60%) | 2000 (40%)     | 5000
Example over 8 baskets:

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

rule   | Support | Lift
X => Y | 25%     | 2.00
X => Z | 37.50%  | 0.86
Y => Z | 12.50%  | 0.57
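These lift values follow directly from the definition (same three indicator vectors):

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]
n = len(X)

def sup(*cols):
    """Fraction of baskets where all of the given indicator columns are 1."""
    return sum(all(vals) for vals in zip(*cols)) / n

def lift(a, b):
    return sup(a, b) / (sup(a) * sup(b))

print(f"X=>Y: sup={sup(X, Y):.2%} lift={lift(X, Y):.2f}")  # sup=25.00% lift=2.00
print(f"X=>Z: sup={sup(X, Z):.2%} lift={lift(X, Z):.2f}")  # sup=37.50% lift=0.86
print(f"Y=>Z: sup={sup(Y, Z):.2%} lift={lift(Y, Z):.2f}")  # sup=12.50% lift=0.57
```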
Lift of a Rule: Example 1 (cont.)

basketball => cereal [40%, 66.7%]:
LIFT = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89

basketball => not cereal [20%, 33.3%]:
LIFT = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33

Rules that hold 100% of the time may not have the highest lift.
Conviction of a Rule

Conv(A => B) = sup(A) * (1 - sup(B)) / (sup(A) - sup(A, B))

Conv(basketball => cereal) = (3000/5000) * (1 - 3750/5000) / (3000/5000 - 2000/5000) = 0.75

Conv(basketball => not cereal) = (3000/5000) * (1 - 1250/5000) / (3000/5000 - 1000/5000) = 1.125

Leverage of a Rule

PS (or Leverage): PS(A => B) = sup(A, B) - sup(A) * sup(B)
It is the proportion of additional elements covered by both the premise and consequence above that expected if they were independent.
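Both measures can be checked numerically for the cereal example (counts from the 5000-row contingency table):

```python
# Conviction and leverage for the cereal example.
n = 5000
sup_A = 3000 / n   # antecedent (basketball column total)
sup_B = 3750 / n   # cereal
sup_AB = 2000 / n  # basketball and cereal

conviction = sup_A * (1 - sup_B) / (sup_A - sup_AB)
leverage = sup_AB - sup_A * sup_B

print(round(conviction, 3))  # 0.75
print(round(leverage, 3))    # -0.05
```

The negative leverage matches the lift below 1: the two items co-occur less often than independence would predict.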
Coverage of a Rule

coverage(A => B) = sup(A)
A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only items that occur sufficiently frequently are shown.

An arrow between two nodes represents the rule implication between the two items. An arrow is drawn only when the support of the rule is no less than the minimum support.
Apriori Algorithm
FP-tree Algorithm
Advanced Techniques
Single-dimensional rules:

[Figure: product concept hierarchy — categories such as Frozen, Refrigerated, Fresh, Bakery, Dairy, etc.; subcategories such as Vegetable and Fruit; items such as Banana, Apple, Orange, etc.]
Application Difficulties

Some Suggestions
Special promotions for Barbie dolls with candy at a slightly higher margin. By increasing the price of the Barbie doll and giving the candy bar free, Wal-Mart can reinforce the buying habits of that particular type of buyer.