Data Mining: Association Analysis

Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar
Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Bread} -> {Milk}
{Diaper} -> {Beer}
Definition: Frequent Itemset

Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset: an itemset that contains k items

Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule

Association Rule
- An implication expression of the form X -> Y, where X and Y are itemsets
- Example: {Milk, Diaper} -> {Beer}

Rule Evaluation Metrics
- Support (s): fraction of transactions that contain both X and Y
- Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} -> {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
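To make these definitions concrete, here is a minimal Python sketch (an illustration, not part of the original slides) that reproduces the numbers above on the market-basket transactions:

```python
# Support and confidence on the five market-basket transactions above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y) / support_count(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y))    # 0.4
print(confidence(X, Y))  # 0.666...
```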
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold
- confidence >= minconf threshold

Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
Mining Association Rules

Example of Rules (from the transactions above):
{Milk, Diaper} -> {Beer}  (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk}  (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer}  (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Association Rules Mining (cont.)

Problem Decomposition
1. Find all itemsets with support larger than s% (the large itemsets Lk)
2. Use the large itemsets to generate the rules

Example: if ABCD and AB are large itemsets, then the rule AB -> CD holds if support(ABCD)/support(AB) >= c% (see the sketch below).

Existing mining algorithms
- focus on sub-problem 1
- features: read the database in several passes; focus on reducing the computation cost
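A minimal sketch of sub-problem 2, mirroring the AB -> CD example above; the support values are hypothetical stand-ins for what step 1 would produce:

```python
# Rule generation from already-mined large-itemset supports.
supports = {
    frozenset("AB"): 0.50,    # hypothetical support(AB)
    frozenset("ABCD"): 0.30,  # hypothetical support(ABCD)
}
minconf = 0.60

conf = supports[frozenset("ABCD")] / supports[frozenset("AB")]
if conf >= minconf:
    print(f"AB -> CD holds (confidence = {conf:.2f})")  # 0.60
```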
Frequent Itemset Generation

(Figure: the itemset lattice for d = 5 items, from the empty set (null) at the top, through the 1-itemsets A-E, the 2-itemsets AB through DE, and so on, down to ABCDE at the bottom.)

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width => expensive, since M = 2^d !!!
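As a sketch of why this blows up, the brute force fits in a few lines of Python; even the d = 6 items of the running example already yield 63 candidates:

```python
# Brute-force frequent itemset generation: enumerate all 2^d - 1 candidates
# and count each one against every transaction (the O(NMw) loop above).
from itertools import chain, combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))  # d = 6 items -> 63 candidates

candidates = chain.from_iterable(
    combinations(items, k) for k in range(1, len(items) + 1)
)
minsup = 3 / 5
frequent = {
    c for c in candidates
    if sum(set(c) <= t for t in transactions) / len(transactions) >= minsup
}
print(len(frequent))  # 8 itemsets with support >= 0.6
```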
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M

Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction
Reducing Number of Candidates

Apriori principle:
- If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

∀ X, Y: (X ⊆ Y) => s(X) >= s(Y)

- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support
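A quick numerical check of the anti-monotone property on the running example (illustrative code, not from the slides):

```python
# Support never increases as an itemset grows.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

nested = [{"Milk"}, {"Milk", "Diaper"}, {"Milk", "Diaper", "Beer"}]
print([s(x) for x in nested])  # [0.8, 0.6, 0.4] -- non-increasing
```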
Illustrating Apriori Principle

(Figure: itemset lattice in which one itemset is found to be infrequent; all of its supersets are pruned from the search.)
Illustrating Apriori Principle

Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  2

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm

Method:
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Prune candidate itemsets containing subsets of length k that are infrequent
  - Count the support of each candidate by scanning the DB
  - Eliminate candidates that are infrequent, leaving only those that are frequent
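A compact sketch of this loop, assuming transactions are given as sets; this is illustrative code (counts via full scans, no hash tree or other optimizations):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return all frequent itemsets (frozensets) with count >= minsup_count."""
    transactions = [set(t) for t in transactions]
    # k = 1: count individual items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup_count}
    frequent, k = set(Lk), 1
    while Lk:
        # Generate (k+1)-candidates by merging frequent k-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune candidates with an infrequent k-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count the survivors with one scan of the DB
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= minsup_count}
        frequent |= Lk
        k += 1
    return frequent
```

Run on the five market-basket transactions with minsup_count = 3, this returns the four frequent items and four frequent pairs from the illustration above.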
Association Rules Mining

Generation of Candidate Itemsets Ck
1) Join: Lk-1 x Lk-1
2) Prune: delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1
   (e.g., if {A, B} is large, then both {A} and {B} are large)

Example
Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
1) Join: C4 = L3 x L3 = {{1 2 3 4}, {1 3 4 5}}
2) Prune: {1 3 4 5} is deleted because {1 4 5} is not in L3
=> C4 = {{1 2 3 4}}
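The example can be reproduced with a small sketch. Note one simplification: the classic join merges only itemsets sharing their first k-2 items, whereas this version unions any two (k-1)-itemsets whose union has size k; the prune step still yields the same C4 here:

```python
from itertools import combinations

def gen_candidates(L_prev, k):
    # Join (simplified): union any two (k-1)-itemsets into a k-itemset
    joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # Prune: keep only candidates whose every (k-1)-subset is in L_prev
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L3 = {frozenset(s) for s in [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]}
print(gen_candidates(L3, 4))  # {frozenset({1, 2, 3, 4})}
```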
Example of Apriori Mining

Trans #  Items
1        A B C D E
2        A B D
3        A C F
4        D G
5        A B H

Minimum Support = 2/5

Step 1: Find L1
L1 = (A B C D)

Step 2: Find L2
(i) Generate candidates: C2 = L1 x L1 = {(A B), (A C), (A D), ...}
(ii) Count: S((A B)) = 3/5; S((A C)) = S((A D)) = S((B D)) = 2/5;
     S((B C)) = S((C D)) = 1/5
L2 = {(A B), (A C), (A D), (B D)}

Step 3: Find L3
(i) C3 = L2 x L2 = {(A B C), (A B D), (A C D)}
    Prune: delete (A B C) and (A C D), since (B C) and (C D) are not in L2
(ii) Count: S((A B D)) = 2/5
L3 = {(A B D)}
Factors Affecting Complexity

Choice of minimum support threshold
- lowering the support threshold results in more frequent itemsets
- this may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
- more space is needed to store the support count of each item
- if the number of frequent items also increases, both computation and I/O costs may also increase

Size of database
- since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
- transaction width increases with denser data sets
- this may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

(Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just inside the border.)
Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset    Support
{A}        4
{B}        5
{C}        3
{D}        4
{A,B}      4
{A,C}      2
{A,D}      3
{B,C}      3
{B,D}      4
{C,D}      3
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
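A sketch that recomputes this table and classifies itemsets as closed and maximal; the minsup count of 3 used for the maximal computation is an assumed threshold for illustration (the slide above does not fix one):

```python
from itertools import combinations

transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = sorted(set().union(*transactions))

lattice = [frozenset(c) for k in range(1, len(items) + 1)
           for c in combinations(items, k)]
support = {s: sum(s <= t for t in transactions) for s in lattice}

# Closed: no immediate superset has the same support
closed = {s for s in lattice
          if all(support[s | {i}] < support[s] for i in items if i not in s)}
print(sorted(map(set, closed), key=len))
# {B}, {A,B}, {B,D}, {A,B,D}, {B,C,D}, {A,B,C,D}

# Maximal frequent (assumed minsup count = 3): no frequent immediate superset
frequent = {s for s in lattice if support[s] >= 3}
maximal = {s for s in frequent
           if not any(s | {i} in frequent for i in items if i not in s)}
print(sorted(map(set, maximal), key=len))  # {A,B,D}, {B,C,D}
```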
Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

(Figure: the itemset lattice annotated with the IDs of the transactions supporting each itemset, e.g. A -> 124, B -> 123, C -> 1234, ABC -> 12; itemsets near the bottom of the lattice, such as ABCDE, are not supported by any transactions.)
Maximal vs Closed Frequent Itemsets

Minimum support = 2

(Figure: the same lattice with the frequent itemsets marked; some are closed but not maximal, others are closed and maximal.)

# Closed = 9
# Maximal = 4
Association Rules Discovery

Research Issues
- efficient mining methods (e.g., no candidate generation; I/O reduction)
- variations of association rules: multiple support, multi-level rules
- rule interestingness & pruning
- incremental mining
- accuracy adjustment
- new constraints & measures
- integration considerations with RDBMS
- fuzzy sets, rough sets, etc.
Discussions

What more applications are there for association rules mining?
- fields, industry
- problems
Variations of Association Rules

- Multiple Support
- Multi-Level Association Rules
Effect of Support Distribution

Many real data sets have a skewed support distribution.

(Figure: support distribution of a retail data set.)
Effect of Support Distribution

How to set the appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
- If minsup is set too low, it is computationally expensive and the number of itemsets is very large

Using a single minimum support threshold may not be effective.
Multiple Minimum Support

How to apply multiple minimum supports?
- MS(i): minimum support for item i
- e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
- MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: support is no longer anti-monotone
- Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
- Then {Milk, Coke} is infrequent (1.5% < MS({Milk, Coke}) = 3%), but {Milk, Coke, Broccoli} is frequent (0.5% >= MS({Milk, Coke, Broccoli}) = 0.1%)!
Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support (in ascending order)
- e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
- Ordering: Broccoli, Salmon, Coke, Milk

Need to modify Apriori such that:
- L1: set of frequent items
- F1: set of items whose support is >= MS(1), where MS(1) = min_i(MS(i))
- C2: candidate itemsets of size 2, generated from F1 instead of L1
Multiple Minimum Support (Liu 1999)

Modifications to Apriori:
- In traditional Apriori,
  - a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
  - the candidate is pruned if it contains any infrequent subsets of size k
- The pruning step has to be modified (see the sketch below):
  - prune only if the subset contains the first item
  - e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
  - {Broccoli, Coke} and {Broccoli, Milk} are frequent, but {Coke, Milk} is infrequent
  - the candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli
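A minimal sketch of the modified pruning rule, using the MS values and the {Broccoli, Coke, Milk} example above (the set of frequent 2-itemsets is stipulated as in the slide):

```python
from itertools import combinations

MS = {"Broccoli": 0.001, "Salmon": 0.005, "Coke": 0.03, "Milk": 0.05}
frequent_2 = {frozenset({"Broccoli", "Coke"}), frozenset({"Broccoli", "Milk"})}
# {Coke, Milk} is infrequent, but must NOT trigger pruning.

def keep_candidate(candidate, frequent_k):
    first = min(candidate, key=lambda i: MS[i])  # item with the lowest MS
    for subset in combinations(candidate, len(candidate) - 1):
        # Prune only on subsets that contain the first item
        if first in subset and frozenset(subset) not in frequent_k:
            return False
    return True

print(keep_candidate({"Broccoli", "Coke", "Milk"}, frequent_2))  # True
```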
Multi-level Association Rules

(Figure: a concept hierarchy. Food divides into Bread (Wheat, White) and Milk (Skim, 2%), with brands such as Foremost and Kemps at the lowest level; Electronics divides into Computers (Desktop, Laptop, Accessory: Printer, Scanner) and Home (DVD, TV).)
Multi-level Association Rules

Why should we incorporate a concept hierarchy?
- Rules at lower levels may not have enough support to appear in any frequent itemsets
- Rules at lower levels of the hierarchy are overly specific
  - e.g., skim milk -> white bread, 2% milk -> wheat bread, skim milk -> wheat bread, etc. are all indicative of an association between milk and bread
Multi-level Association Rules

How do support and confidence vary as we traverse the concept hierarchy?
- If X is the parent item for both X1 and X2, then σ(X) >= σ(X1) + σ(X2)
- If σ(X1 ∪ Y1) >= minsup, and X is the parent of X1 and Y is the parent of Y1,
  then σ(X ∪ Y1) >= minsup, σ(X1 ∪ Y) >= minsup, and σ(X ∪ Y) >= minsup
- If conf(X1 -> Y1) >= minconf, then conf(X1 -> Y) >= minconf
Multi-level Association Rules

Approach 1:
- Extend the current association rule formulation by augmenting each transaction with higher-level items
  - Original transaction: {skim milk, wheat bread}
  - Augmented transaction: {skim milk, wheat bread, milk, bread, food}

Issues:
- Items that reside at higher levels have much higher support counts
  - if the support threshold is low, too many frequent patterns involve items from the higher levels
- Increased dimensionality of the data
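A minimal sketch of the augmentation step; the child-to-parent map below is a fragment of the example hierarchy:

```python
# Augment each transaction with all ancestors of its items.
parent = {
    "skim milk": "milk", "2% milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
    "milk": "food", "bread": "food",
}

def augment(transaction):
    augmented = set(transaction)
    for item in transaction:
        while item in parent:      # walk up to the root
            item = parent[item]
            augmented.add(item)
    return augmented

print(augment({"skim milk", "wheat bread"}))
# {'skim milk', 'wheat bread', 'milk', 'bread', 'food'}
```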
Multi-level Association Rules

Approach 2:
- Generate frequent patterns at the highest level first
- Then generate frequent patterns at the next highest level, and so on

Issues:
- I/O requirements will increase dramatically, because we need to perform more passes over the data
- May miss some potentially interesting cross-level association patterns

Other approaches: [Paper Study]
Pattern Evaluation

Association rule algorithms tend to produce too many rules
- many of them are uninteresting or redundant
- redundant if {A,B,C} -> {D} and {A,B} -> {D} have the same support & confidence

Interestingness measures can be used to prune/rank the derived patterns.

In the original formulation of association rules, support & confidence are the only measures used.
Application of Interestingness Measure

(Figure: where interestingness measures fit in the mining pipeline. Data -> selection -> selected data -> preprocessing -> preprocessed data (a feature/product matrix) -> mining -> patterns -> postprocessing -> knowledge; the interestingness measures are applied during postprocessing.)
Computing Interestingness Measure

Given a rule X -> Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X -> Y:

       Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Drawback of Confidence

       Coffee  ¬Coffee
Tea    15      5        20
¬Tea   75      5        80
       90      10       100

Association Rule: Tea -> Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75
but P(Coffee) = 0.9, and P(Coffee|¬Tea) = 75/80 = 0.9375

Although confidence is high, the rule is misleading.
Statistical Independence

Population of 1000 students
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S,B)

P(S,B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S,B) = P(S) × P(B) => statistical independence
P(S,B) > P(S) × P(B) => positively correlated
P(S,B) < P(S) × P(B) => negatively correlated
Statistical-based Measures

Measures that take into account statistical dependence:

Lift = P(Y|X) / P(Y)
Interest = P(X,Y) / (P(X) P(Y))
PS = P(X,Y) - P(X) P(Y)
φ-coefficient = (P(X,Y) - P(X) P(Y)) / sqrt( P(X)[1 - P(X)] P(Y)[1 - P(Y)] )
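As a sketch, all four measures can be computed from the 2x2 contingency table; the counts below are the Tea/Coffee numbers from the confidence example above:

```python
from math import sqrt

f11, f10, f01, f00 = 15, 5, 75, 5  # Tea/Coffee contingency table
N = f11 + f10 + f01 + f00

p_xy = f11 / N                     # P(X,Y)
p_x = (f11 + f10) / N              # P(X)
p_y = (f11 + f01) / N              # P(Y)

lift = (p_xy / p_x) / p_y          # P(Y|X) / P(Y) = 0.8333
interest = p_xy / (p_x * p_y)      # same value as lift
ps = p_xy - p_x * p_y              # -0.03
phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))  # -0.25
print(lift, interest, ps, phi)
```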
Example: Lift

       Coffee  ¬Coffee
Tea    15      5        20
¬Tea   75      5        80
       90      10       100

Association Rule: Tea -> Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9
Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
Example: Lift (cont.)

      Y    ¬Y              Y    ¬Y
X     10   0    10    X    90   0    90
¬X    0    90   90    ¬X   0    10   10
      10   90   100        90   10   100

Lift = 0.1 / (0.1 × 0.1) = 10        Lift = 0.9 / (0.9 × 0.9) = 1.11

Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.
Some Issues

- There are lots of measures proposed in the literature
- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?