
Data Mining Association Rules


Association rules

Lecturer: JERZY STEFANOWSKI


Institute of Computing Sciences
Poznan University of Technology
Poznan, Poland
Lecture 10
SE Master Course
2008/2009

This lecture is based on the following resources (slides):
G. Piatetsky-Shapiro: Association Rules and Frequent Item Analysis,
partly on J. Han: Mining Association Rules in Large Databases,
and my other notes.

Outline
Transactions
Frequent itemsets
Subset Property
Association rules
Applications

Association rules
Transaction data
Market basket analysis

TID  Products
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Association rule:
{Cheese, Milk} => Bread  [sup = 5%, conf = 80%]

80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all these products together.

What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
  What products were often purchased together? Beer and diapers?!
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?
Applications
  Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click stream) analysis, and DNA sequence analysis

Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?

Foundation for many essential data mining tasks:
  association, correlation, causality
  sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  associative classification, cluster analysis, fascicles (semantic data compression)
DB approach to efficient mining of massive data
Broad applications:
  basket data analysis, cross-marketing, catalog design, sales campaign analysis
  web log (click stream) analysis, DNA sequence analysis, etc.

Transactions Example

TID  Products
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Transaction database: Example, 1

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Instances = transactions

ITEMS: A = milk, B = bread, C = cereal, D = sugar, E = eggs

Transaction database: Example, 2

Attributes converted to binary flags (same transactions as in Example 1):

TID  A  B  C  D  E
1    1  1  0  0  1
2    0  1  0  1  0
3    0  1  1  0  0
4    1  1  0  1  0
5    1  0  1  0  0
6    0  1  1  0  0
7    1  0  1  0  0
8    1  1  1  0  1
9    1  1  1  0  0

Definitions

Item: attribute=value pair, or simply a value
  usually attributes are converted to binary flags for each value; e.g. product=A is written as A
Itemset I: a subset of possible items
  Example: I = {A, B, E} (order unimportant)
Transaction: (TID, itemset), where TID is the transaction ID

Support and Frequent Itemsets

Support of an itemset:
  sup(I) = number of transactions t that support (i.e. contain) I
In the example database:
  sup({A, B, E}) = 2, sup({B, C}) = 4
A frequent itemset I is one with at least the minimum support count:
  sup(I) >= minsup
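A minimal sketch (my own illustration, not part of the original slides) of how these support counts can be checked in Python; the names db and sup are mine and are reused in later snippets:

```python
# Example database from the slides: items A..E, one set per transaction.
db = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def sup(itemset, transactions):
    """Support count: number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

print(sup({"A", "B", "E"}, db))  # 2, as on the slide
print(sup({"B", "C"}, db))       # 4, as on the slide
```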

SUBSET PROPERTY (Agrawal et al.)

Every subset of a frequent set is frequent!
Q: Why is it so?
A: Suppose {A, B} is frequent. Since each occurrence of {A, B} includes both A and B, both A and B must also be frequent.
A similar argument holds for larger itemsets.
Almost all association rule algorithms are based on this subset property.

Association Rules

Association rule R: Itemset1 => Itemset2
  Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
  Meaning: if a transaction includes Itemset1, then it also includes Itemset2
Examples:
  A, B => E, C
  A => B, C

From Frequent Itemsets to Association Rules

Q: Given frequent set {A, B, E}, what are the possible association rules?
  A => B, E
  A, B => E
  A, E => B
  B => A, E
  B, E => A
  E => A, B
  __ => A, B, E (empty rule), or true => A, B, E

Rule Support and Confidence

Suppose R: I => J is an association rule:
  sup(R) = sup(I ∪ J) is the support count
    the support of the itemset I ∪ J, i.e. the number of transactions containing both I and J
  conf(R) = sup(I ∪ J) / sup(I) is the confidence of R
    the fraction of transactions containing I that also contain J (a small sketch follows)
Association rules with minimum support and confidence are sometimes called strong rules.
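Continuing the earlier sketch (again my own illustration), rule support and confidence follow directly from sup():

```python
# Rule R: I => J over the example database; db and sup() as defined earlier.
def rule_support(I, J, transactions):
    return sup(I | J, transactions)              # sup(I ∪ J)

def confidence(I, J, transactions):
    return sup(I | J, transactions) / sup(I, transactions)

I, J = {"A", "B"}, {"E"}
print(rule_support(I, J, db))    # 2
print(confidence(I, J, db))      # 0.5, i.e. conf(A,B => E) = 50%
```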

Classification vs Association Rules

Classification Rules               Association Rules
Focus on one target field          Many target fields
Specify class in all cases         Applicable in some cases
Measures: Accuracy                 Measures: Support, Confidence, Lift

Association Rules Example, 1

Q: Given frequent set {A, B, E}, what association rules have minsup = 2 and minconf = 50%?
  A, B => E : conf = 2/4 = 50%

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Association Rules Example, 2

Q: Given frequent set {A, B, E}, what association rules have minsup = 2 and minconf = 50%?
  A, B => E : conf = 2/4 = 50%
  A, E => B : conf = 2/2 = 100%
  B, E => A : conf = 2/2 = 100%
  E => A, B : conf = 2/2 = 100%
Don't qualify:
  A => B, E : conf = 2/6 = 33% < 50%
  B => A, E : conf = 2/7 = 28% < 50%
  __ => A, B, E : conf = 2/9 = 22% < 50%

(Transaction database as in Example 1 above.)

Find Strong Association Rules

A rule has the parameters minsup and minconf:
  sup(R) >= minsup and conf(R) >= minconf
Problem: find all association rules with given minsup and minconf
First, find all frequent itemsets

Finding Frequent Itemsets

Start by finding one-item sets (easy)
Q: How?
A: Simply count the frequencies of all items (see the sketch below)
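A minimal sketch (mine, not from the slides) of this counting step, using collections.Counter over the example database:

```python
from collections import Counter

# db as defined in the earlier snippet; minsup = 2 as in the examples.
counts = Counter(item for t in db for item in t)
minsup = 2
L1 = [item for item, c in counts.items() if c >= minsup]
print(counts)      # B: 7, A: 6, C: 6, D: 2, E: 2
print(sorted(L1))  # ['A', 'B', 'C', 'D', 'E'] -- all items are frequent here
```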

Finding itemsets: next level

Apriori algorithm (Agrawal & Srikant 94)
Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
  If {A, B} is a frequent itemset, then {A} and {B} have to be frequent itemsets as well!
  In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
  Compute k-item sets by merging (k-1)-item sets

Another example

Given: five three-item sets
  (A B C), (A B D), (A C D), (A C E), (B C D)

Lexicographic order improves efficiency!

Candidate four-item sets:
  (A B C D)   Q: OK?   A: yes, because all 3-item subsets are frequent
  (A C D E)   Q: OK?   A: no, because (C D E) is not frequent
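The join-and-prune step can be sketched directly on these five three-item sets (my own Python, relying on the lexicographic order mentioned above; itemsets are kept as sorted tuples):

```python
from itertools import combinations

L3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
      ("A", "C", "E"), ("B", "C", "D")]

def gen_candidates(Lk):
    """Join k-item sets sharing the first k-1 items, then prune."""
    Lk_set = set(Lk)
    k = len(Lk[0])
    candidates = []
    for p in Lk:
        for q in Lk:
            # join: equal prefixes, p's last item lexicographically before q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every k-subset of c must itself be frequent
                if all(s in Lk_set for s in combinations(c, k)):
                    candidates.append(c)
    return candidates

print(gen_candidates(L3))  # [('A', 'B', 'C', 'D')] -- (A C D E) is pruned
```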

Pruning search space

[figure omitted]

Apriori: A Candidate Generation-and-Test Approach - Summary

Any subset of a frequent itemset must be frequent
  if {beer, diaper, nuts} is frequent, so is {beer, diaper}
  every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
Method:
  generate length (k+1) candidate itemsets from length k frequent itemsets, and
  test the candidates against the DB
Performance studies show its efficiency and scalability (Agrawal & Srikant 1994; Mannila et al. 1994)

Mining Association Rules - an Example

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F

Min. support 50%, min. confidence 50%

Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A => C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori Algorithm - An Example

Database TDB:

Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:

Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (minsup = 2):

Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1), 2nd scan → counts:

Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:

Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}; 3rd scan → L3:

Itemset    sup
{B, C, E}  2

The Apriori Algorithm

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
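A runnable rendering of this pseudo-code (my sketch, under the assumptions of the earlier snippets: transactions are sets, and gen_candidates() is the join-and-prune helper defined above):

```python
from collections import Counter

def apriori(transactions, minsup):
    # L1: frequent single items, as sorted 1-tuples
    counts = Counter(i for t in transactions for i in t)
    Lk = sorted((i,) for i, c in counts.items() if c >= minsup)
    frequent = list(Lk)
    while Lk:
        candidates = gen_candidates(Lk)        # Ck+1 from Lk
        cand_counts = dict.fromkeys(candidates, 0)
        for t in transactions:                 # one database pass
            for c in candidates:
                if set(c) <= t:
                    cand_counts[c] += 1
        Lk = sorted(c for c, n in cand_counts.items() if n >= minsup)
        frequent.extend(Lk)                    # accumulate ∪k Lk
    return frequent

print(apriori(db, 2))
# 1-, 2- and 3-item sets; e.g. ('A','B','C') and ('A','B','E') both have sup 2
```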

How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
  The total number of candidates can be huge
  One transaction may contain many candidates
Method:
  candidate itemsets are stored in a hash-tree
  a leaf node of the hash-tree contains a list of itemsets and counts
  an interior node contains a hash table
  subset function: finds all the candidates contained in a transaction (a simplified sketch follows)
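As a simplified stand-in for the hash-tree (my own sketch; a real hash-tree prunes the enumeration far more aggressively), candidates can be hashed into a dict and each transaction's k-subsets looked up:

```python
from itertools import combinations

def count_supports(candidates, transactions):
    """Count each candidate k-itemset by enumerating transaction k-subsets."""
    k = len(next(iter(candidates)))
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:
        for s in combinations(sorted(t), k):   # sorted tuples match candidates
            if s in counts:
                counts[s] += 1
    return counts

print(count_supports({("A", "B"), ("B", "C")}, db))  # both get count 4
```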

Generating Association Rules

Two-stage process:
  1. Determine frequent itemsets, e.g. with the Apriori algorithm.
  2. For each frequent itemset I:
       for each non-empty proper subset J of I:
         determine all association rules of the form I - J => J
Main idea used in both stages: the subset property
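Stage two can be sketched as follows (my own code, reusing sup() and db from the earlier snippets); it reproduces the rules of the slide "Association Rules Example, 2":

```python
from itertools import combinations

def gen_rules(I, transactions, minconf):
    """All rules (I - J) => J from frequent itemset I meeting minconf."""
    I = frozenset(I)
    rules = []
    for r in range(1, len(I)):                        # consequent size
        for J in map(frozenset, combinations(I, r)):
            conf = sup(I, transactions) / sup(I - J, transactions)
            if conf >= minconf:
                rules.append((sorted(I - J), sorted(J), conf))
    return rules

for lhs, rhs, conf in gen_rules({"A", "B", "E"}, db, 0.5):
    print(lhs, "=>", rhs, f"conf = {conf:.0%}")
# ['A', 'B'] => ['E'] conf = 50%, ['A', 'E'] => ['B'] conf = 100%, ...
```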

Example: Generating Rules from an Itemset

Frequent itemset from the golf data:
  Humidity = Normal, Windy = False, Play = Yes (4)

Seven potential rules:
  If Humidity = Normal and Windy = False then Play = Yes                4/4
  If Humidity = Normal and Play = Yes then Windy = False                4/6
  If Windy = False and Play = Yes then Humidity = Normal                4/6
  If Humidity = Normal then Windy = False and Play = Yes                4/7
  If Windy = False then Humidity = Normal and Play = Yes                4/8
  If Play = Yes then Humidity = Normal and Windy = False                4/9
  If True then Humidity = Normal and Windy = False and Play = Yes       4/12

Rules for the weather data

Rules with support > 1 and confidence = 100%:

      Association rule                                   Sup.  Conf.
1     Humidity=Normal, Windy=False => Play=Yes           4     100%
2     Temperature=Cool => Humidity=Normal                4     100%
3     Outlook=Overcast => Play=Yes                       4     100%
4     Temperature=Cool, Play=Yes => Humidity=Normal      3     100%
...   ...                                                ...   ...
58    Outlook=Sunny, Temperature=Hot => Humidity=High    2     100%

In total: 3 rules with support four, 5 with support three, and 50 with support two.

Association Rule Mining - research overview

Mining association rules (Agrawal et al., SIGMOD'93)

Better algorithms:
  Fast algorithm (Agrawal et al., VLDB'94)
  Hash-based (Park et al., SIGMOD'95)
  Partitioning (Navathe et al., VLDB'95)
  Direct Itemset Counting (Brin et al., SIGMOD'97)
  Parallel mining (Agrawal et al., TKDE'96)
  Distributed mining (Cheung et al., PDIS'96)
  Incremental mining (Cheung et al., ICDE'96)

Problem extensions:
  Generalized A.R. (Srikant et al.; Han et al., VLDB'95)
  Quantitative A.R. (Srikant et al., SIGMOD'96)
  N-dimensional A.R. (Lu et al., DMKD'98)
  Meta-rule-guided mining

Bottleneck of Frequent-pattern Mining with Apriori

Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
  To find the frequent itemset i1 i2 ... i100:
    number of scans: 100
    number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 !
Bottleneck: candidate generation and test
Can we avoid candidate generation? Alternative algorithms: FP-tree / FP-growth

Mining Frequent Patterns Without Complete Candidate Generation

Grow long patterns from short ones using local frequent items:
  abc is a frequent pattern
  get all transactions having abc: DB|abc
  if d is a local frequent item in DB|abc, then abcd is a frequent pattern

FP-Growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time (sec., 0-100) vs. support threshold (%, 0-2.5) for D1 FP-growth runtime and D1 Apriori runtime on data set T25I20D10K.]

Weka associations

File: weather.nominal.arff
MinSupport: 0.2

Weka associations: output

[screenshots omitted]

Presentation of Association Rules (Table Form)

[screenshot omitted]

Visualization of Association Rules: Plane Graph

[screenshot omitted]

Filtering Association Rules

Problem: any large dataset can lead to a very large number of association rules, even with reasonable minimum confidence and support
Confidence by itself is not sufficient:
  e.g. if all transactions include Z, then any rule I => Z will have confidence 100%
Other measures are needed to filter rules

Association Rule LIFT

The lift of an association rule I => J is defined as:
  lift = P(J|I) / P(J)
  note: P(J) = (support count of J) / (number of transactions)
  i.e. the ratio of confidence to expected confidence
Interpretation:
  lift > 1: I and J are positively correlated
  lift < 1: I and J are negatively correlated
  lift = 1: I and J are independent

Interestingness Measure: Correlations and Lift

play basketball => eat cereal [40%, 66.7%] is misleading:
  the overall percentage of students eating cereal is 75%, which is higher than 66.7%
play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
Measure of dependent/correlated events, lift or corr:

  corr(A, B) = P(A ∪ B) / ( P(A) · P(B) )

(in itemset notation, A ∪ B means a transaction containing both A and B)

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000
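A quick check of these numbers (my arithmetic, applying the lift definition above to the 5000-student table):

```python
n = 5000
p_basketball = 3000 / n                    # P(play basketball)
p_cereal     = 3750 / n                    # P(eat cereal)
p_both       = 2000 / n                    # P(basketball and cereal)

print(p_both / (p_basketball * p_cereal))  # 0.89 < 1: negatively correlated

p_not_cereal       = 1250 / n
p_basket_no_cereal = 1000 / n
print(p_basket_no_cereal / (p_basketball * p_not_cereal))  # 1.33 > 1
```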

Beyond Binary Data

Hierarchies:
  drink → milk → low-fat milk → Stop&Shop low-fat milk
  find associations on any level
Sequences over time
...

Multiple-level Association Rules

Items often form a hierarchy
Flexible support settings:
  items at a lower level are expected to have lower support
Transaction databases can be encoded based on dimensions and levels
Explore shared multi-level mining

Example: Milk [support = 10%]; its children 2% Milk [support = 6%] and Skim Milk [support = 4%]
  uniform support:  Level 1 min_sup = 5%, Level 2 min_sup = 5%
  reduced support:  Level 1 min_sup = 5%, Level 2 min_sup = 3%
With uniform support, Skim Milk (4% < 5%) is lost at level 2; with reduced support (3%), it is kept.

Quantitative Association Rules

ID   Age  Salary   Marital Status  NumCars
100  44   30 000   married         2
200  55   45 000   married         3
300  45   50 000   divorced        1
400  34   44 000   single          0
500  45   38 000   married         2
600  33   44 000   single          2

Sample rules:                                          Support  Confidence
<Age: 44..55> and <Status: married> => <NumCars: 2>    50%      100%
<NumCars: 0..1> => <Married: No>                       33%      66.7%

Multi-dimensional Association

Single-dimensional rules:
  buys(X, "milk") => buys(X, "bread")
Multi-dimensional rules: 2 or more dimensions or predicates
  inter-dimension assoc. rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  hybrid-dimension assoc. rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values
Quantitative attributes: numeric, implicit ordering among values

Sequence Databases and Sequential Pattern Analysis

Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining:
  customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
  medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
  telephone calling patterns, weblog click streams
  DNA sequences and gene structures

What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set of frequent subsequences

A sequence: <(ef)(ab)(df)cb>

A sequence database:

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

An element may contain a set of items; items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
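Subsequence containment can be checked greedily (my helper, not from the slides): match each element of the candidate, in order, to a superset element of the data sequence:

```python
def is_subsequence(s, t):
    """True if sequence s (list of itemsets) is contained in sequence t."""
    i = 0
    for element in t:
        if i < len(s) and s[i] <= element:   # s[i] must be a subset
            i += 1
    return i == len(s)

seq_10 = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]      # <a(bc)dc>
print(is_subsequence(sub, seq_10))           # True, as on the slide
```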

Challenges in Sequential Pattern Mining

A huge number of possible sequential patterns is hidden in databases
A mining algorithm should:
  find the complete set of patterns satisfying the minimum support (frequency) threshold, when possible
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints

Applications

Market basket analysis
  store layout, client offers
This analysis is applicable whenever a customer purchases multiple things in proximity:
  telecommunications (each customer is a transaction containing the set of phone calls)
  weather analysis (each time interval is a transaction containing the set of observed events)
  credit cards
  banking services
  medical treatments
Finding unusual events:
  WSARE - What's Strange About Recent Events

Conclusions

Association rule mining:
  probably the most significant contribution from the database community to KDD
  a large number of papers have been published
  many interesting issues have been explored
An interesting research direction:
  association analysis in other types of data: sequence data, spatial data, multimedia data, time-series data, etc.

Summary
Frequent itemsets
Association rules
Subset property
Apriori algorithm
Extensions of this algorithm
Sequence patterns
Applications

Any questions, remarks?
