Lecture 8-9 Association Rule Mining
Association Rule Mining
Frequent Itemset Mining
Support and Confidence
Apriori Approach
A → B
Read as A implies B, where A and B are sets of binary valued attributes represented in a data set. Association Rule Mining (ARM) is then the process of finding all the ARs in a given DB.
Applications
Home electronics (what other products should the store stock up on?)
Attached mailing in direct marketing
Web page navigation in search engines (first page a -> page b)
Text mining (e.g. IT companies -> Microsoft)
Some Notation
D = a data set comprising n records and m binary valued attributes.
I = the set of m attributes, {i1, i2, ..., im}, represented in D.
Itemset = some subset of I. Each record in D is an itemset.
Example DB
I = {a,b,c,d,e}, D = {{a,b,c},{a,b,d},{a,b,e},{a,c,d}, {a,c,e},{a,d,e},{b,c,d},{b,c,e}, {b,d,e},{c,d,e}}
Given attributes which are not binary valued (i.e. either nominal or ranged), the attributes can be discretised so that they are represented by a number of binary valued attributes.
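As a sketch (the attribute names and bin edges here are invented for illustration, not from the lecture), such discretisation might look like:

```python
# Map a record with nominal/ranged attributes to a set of binary attributes.
# 'colour' and 'age' are hypothetical attributes chosen for illustration.

def discretise(record):
    items = set()
    items.add("colour=" + record["colour"])   # nominal: one binary attribute per value
    age = record["age"]                       # ranged: one binary attribute per interval
    if age < 30:
        items.add("age<30")
    elif age < 40:
        items.add("age=30..39")
    else:
        items.add("age>=40")
    return items

print(discretise({"colour": "red", "age": 34}))
# -> {'colour=red', 'age=30..39'} (set print order may vary)
```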
TID   Atts
1     abc
2     abd
3     abe
4     acd
5     ace
6     ade
7     bcd
8     bce
9     bde
10    cde
A → B
Read as A implies B, such that A ⊂ I, B ⊂ I, A ∩ B = ∅ (A and B are disjoint) and A ∪ B ⊆ I. In other words an AR is made up of an itemset of cardinality 2 or more.
Given m attributes, the number of possible itemsets of cardinality 2 or more is 2^m - m - 1 (all 2^m subsets, less the empty set and the m singletons).
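A quick illustrative check of this count for the m = 5 attribute set used above:

```python
from itertools import combinations

I = ["a", "b", "c", "d", "e"]   # the m = 5 attributes of the example DB
m = len(I)

# All subsets of I with cardinality 2 or more: every subset except the
# empty set and the m singletons.
itemsets = [c for r in range(2, m + 1) for c in combinations(I, r)]

print(len(itemsets), 2**m - m - 1)   # both print 26
```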
Itemset Support
Support: A measure of the frequency with which an itemset occurs in a DB.
supp(A) = (# records that contain A) / n

If an itemset has support higher than some specified threshold we say that the itemset is supported or frequent (some authors use the term large). The support threshold is normally set reasonably low, say 1%.
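A minimal sketch of the support computation over the example DB above (the supp helper is illustrative, not lecture code):

```python
# The example DB: 10 records over attributes a..e.
D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]

def supp(itemset):
    # Fraction of the n records that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= record for record in D) / len(D)

print(supp("a"))    # 0.6 -- 'a' occurs in 6 of the 10 records
print(supp("ab"))   # 0.3
```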
Confidence
Confidence: A measure, expressed as a ratio, of the support for an AR compared to the support of its antecedent.

conf(A → B) = supp(A ∪ B) / supp(A)

We say that we are confident in a rule if its confidence exceeds some threshold (normally set reasonably high, say 80%).
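Extending the same sketch with a hypothetical conf helper:

```python
D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]

def supp(itemset):
    itemset = set(itemset)
    return sum(itemset <= record for record in D) / len(D)

def conf(A, B):
    # conf(A -> B) = supp(A u B) / supp(A)
    return supp(set(A) | set(B)) / supp(A)

print(conf("a", "b"))   # 0.5, i.e. supp(ab) / supp(a) = 0.3 / 0.6
```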
Find all rules X & Y ⇒ Z with minimum support and confidence:
support, s: probability that a transaction contains {X, Y, Z}
confidence, c: conditional probability that a transaction having {X, Y} also contains Z
Let minimum support = 50% and minimum confidence = 50%.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

We obtain:
A ⇒ C (support 50%, confidence 66.6%)
C ⇒ A (support 50%, confidence 100%)
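Both rules can be verified with a few lines of illustrative Python over the four transactions:

```python
D = {2000: {"A", "B", "C"}, 1000: {"A", "C"},
     4000: {"A", "D"}, 5000: {"B", "E", "F"}}

def supp(items):
    return sum(items <= t for t in D.values()) / len(D)

def conf(A, B):
    return supp(A | B) / supp(A)

print(supp({"A", "C"}))    # 0.5       -> 50% support
print(conf({"A"}, {"C"}))  # 0.666...  -> 66.6% confidence
print(conf({"C"}, {"A"}))  # 1.0       -> 100% confidence
```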
BRUTE FORCE
Enumerate every possible itemset and count how many records of D contain it:

a(6)  b(6)  c(6)  d(6)  e(6)
ab(3) ac(3) ad(3) ae(3) bc(3) bd(3) be(3) cd(3) ce(3) de(3)
abc(1) abd(1) abe(1) acd(1) ace(1) ade(1) bcd(1) bce(1) bde(1) cde(1)
abcd(0) abce(0) abde(0) acde(0) bcde(0)
abcde(0)
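A sketch of the brute-force count itself, enumerating all 2^m - 1 non-empty itemsets against D (illustrative code, not from the lecture):

```python
from itertools import combinations

D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]
I = sorted(set().union(*D))

# One counter per possible non-empty itemset, filled by checking every record.
counts = {}
for r in range(1, len(I) + 1):
    for itemset in combinations(I, r):
        counts[itemset] = sum(set(itemset) <= record for record in D)

print(counts[("a",)], counts[("a", "b")], counts[("a", "b", "c")])   # 6 3 1
```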
Support threshold = 15.5% (a count of 1.55)

Frequent sets (F): a(6) b(6) c(6) d(6) e(6) ab(3) ac(3) bc(3) ad(3) bd(3) cd(3) ae(3) be(3) ce(3) de(3)
Rules:
a → b: conf = 3/6 = 50%
b → a: conf = 3/6 = 50%
etc.
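A sketch of how these rules fall out of the frequent pairs (illustrative count helper; the count threshold 1.55 is the one set above):

```python
from itertools import combinations

D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]

def count(items):
    return sum(set(items) <= record for record in D)

# For every frequent pair {x, y}, emit x -> y and y -> x with its confidence.
for x, y in combinations("abcde", 2):
    n_pair = count({x, y})
    if n_pair > 1.55:                    # the pair is frequent
        for a, b in ((x, y), (y, x)):
            n_a = count({a})
            print(f"{a} -> {b}  conf = {n_pair}/{n_a} = {n_pair / n_a:.0%}")
```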
BRUTE FORCE
Advantages:
1) Very efficient for data sets with small numbers of attributes (<20).
Disadvantages:
1) Given 20 attributes, the number of combinations is 2^20 - 1 = 1,048,575; an array of 4-byte counters therefore requires about 4.2MB of storage. 2) Given a data set with (say) 100 attributes, it is likely that many combinations will not be present in the data set -- therefore store only those combinations that are present in the data set! (See the sketch below.)
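A sketch of the fix suggested in point 2: key the counters by the itemsets that actually occur, in a dictionary, instead of allocating an array slot for all 2^m - 1 combinations (illustrative code):

```python
from collections import Counter
from itertools import combinations

D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]

# Only itemsets contained in at least one record ever get an entry.
counts = Counter()
for record in D:
    for r in range(1, len(record) + 1):
        for itemset in combinations(sorted(record), r):
            counts[itemset] += 1

print(len(counts))          # 25 entries here, vs 31 possible; the gap grows with m
print(counts[("a", "b")])   # 3
```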
Example rules, annotated with [support, confidence]:
buys(x, SQLServer) ^ buys(x, DMBook) -> buys(x, DBMiner) [0.2%, 60%]
age(x, 30..39) ^ income(x, 42..48K) -> buys(x, PC) [1%, 75%]
For rule A ⇒ C:
support = supp({A, C}) = 50%
confidence = supp({A, C}) / supp({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must itself be frequent.
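The principle holds because a subset is contained in every record that contains its superset, so its count can never be smaller. A quick illustrative check on the earlier example DB:

```python
from itertools import combinations

D = [set(r) for r in ("abc", "abd", "abe", "acd", "ace",
                      "ade", "bcd", "bce", "bde", "cde")]

def count(items):
    return sum(set(items) <= record for record in D)

itemset = ("a", "b", "c")
for r in range(1, len(itemset)):
    for subset in combinations(itemset, r):
        # Every proper subset occurs at least as often as the itemset itself.
        assert count(subset) >= count(itemset)
print("all subsets of", itemset, "are at least as frequent")
```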
L1 (frequent 1-itemsets):
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (candidates generated from L1):
{1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

Scan D to count the C2 candidates:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2 (frequent 2-itemsets):
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (candidates generated from L2): {2 3 5}

Scan D to count C3.

L3 (frequent 3-itemsets):
itemset  sup
{2 3 5}  2
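A compact sketch of the whole level-wise loop. The four-transaction database below is an assumption: it is not shown in the tables above, but it is the classic example they are consistent with.

```python
from itertools import combinations

# Assumed database (not on the slide): four transactions over items 1..5.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUP = 2   # minimum support as an absolute count

def apriori(D, min_sup):
    items = sorted(set().union(*D))
    # L1: frequent 1-itemsets.
    L = [frozenset([i]) for i in items if sum(i in t for t in D) >= min_sup]
    k = 1
    while L:
        print(f"L{k}:", {tuple(sorted(s)): sum(s <= t for t in D) for s in L})
        # Join: merge frequent k-itemsets into (k+1)-candidates...
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # ...then prune candidates that have an infrequent k-subset.
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
        # Scan D to keep only the frequent candidates.
        L = [c for c in C if sum(c <= t for t in D) >= min_sup]
        k += 1

apriori(D, MIN_SUP)
# L1: {(1,): 2, (2,): 3, (3,): 3, (5,): 3}
# L2: {(1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2}
# L3: {(2, 3, 5): 2}
```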
Example: suppose L3 = {abc, abd, acd, ace, bcd}.

Step 1 (join): self-joining L3 gives the candidates abcd (from abc and abd) and acde (from acd and ace).

Step 2 (pruning): acde is removed because ade is not in L3.

C4 = {abcd}
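The join and prune steps of this example as an illustrative sketch (the contents of L3 are inferred from the example; they are not listed on the slide):

```python
from itertools import combinations

# L3 implied by the example: abcd survives the prune, acde does not.
L3 = [frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")]

def joinable(p, q):
    # Two k-itemsets join if they agree on their first k-1 items.
    p, q = sorted(p), sorted(q)
    return p[:-1] == q[:-1] and p[-1] < q[-1]

# Step 1 (join):
C4 = {a | b for a in L3 for b in L3 if joinable(a, b)}
print(sorted("".join(sorted(c)) for c in C4))   # ['abcd', 'acde']

# Step 2 (prune): drop candidates with a 3-subset missing from L3.
C4 = {c for c in C4 if all(frozenset(s) in L3 for s in combinations(c, 3))}
print(sorted("".join(sorted(c)) for c in C4))   # ['abcd']
```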