
DWDM Unit 4 (R22)


Q)Frequent Itemset Generation

To generate the list of frequent itemsets, one must avoid the brute-force approach described in the previous section, because searching the whole data set to find the support count of every itemset is very expensive. Some of the strategies used to address this problem are:

1. Reduce the number of candidates: use pruning techniques such as the Apriori
   principle to eliminate some of the candidate itemsets without counting their
   support values.
2. Reduce the number of transactions: as the size of the candidate itemsets grows,
   transactions that can no longer contain any frequent itemset can be skipped in
   later scans, reducing the number of transactions considered.
3. Reduce the number of comparisons: use efficient data structures to store the
   candidates, thereby eliminating the need to match every candidate against every
   transaction.

The Apriori Principle:

States that if an itemset is frequent, then all of its subsets must also be frequent. This
principle holds true because of the anti-monotone property of support.

Support is anti-monotone (or downward closed): if X is a subset of Y, then the support of Y, s(Y), cannot exceed the support of X, s(X). For example, if {Bread, Milk} appears in 30% of transactions, then {Bread} alone must appear in at least 30% of transactions.
Conversely, a measure f is monotone (or upward closed) if, whenever X is a subset of Y, f(X) does not exceed f(Y).

Apriori Algorithm:

The algorithm proceeds level-wise: from the frequent itemsets of length k it generates candidate itemsets of length k+1, prunes the candidates that contain an infrequent subset, and then scans the database to count the support of the remaining candidates.
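A minimal Python sketch of this level-wise loop, assuming transactions are given as sets of items (the function name and the toy basket data are illustrative, not from the notes):

def apriori(transactions, min_support):
    """Level-wise search for frequent itemsets (illustrative sketch)."""
    from itertools import combinations
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = set(current)

    k = 1
    while current:
        k += 1
        # Candidate generation: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Candidate pruning (Apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Support counting: one scan over the transactions for the surviving candidates.
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current
    return frequent

# Illustrative usage on a hypothetical basket dataset.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(apriori(baskets, min_support=0.5))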



Apriori Algorithm Deconstructed

There are three main parts of the algorithm that need to be explained a little bit further.
The candidate generation, candidate pruning, and support counting stages, which take place after the frequent itemsets of length k have been generated, are crucial to understanding the complexity of the Apriori algorithm.

Candidate Generation: this part of the algorithm does as the name suggests and generates
new candidates of length (k+1) from frequent itemsets of length k. For the sake of
efficiency there are some requirements that need to be met in order for us to have effective
candidate generation.

1. Avoid generating too many unnecessary candidates


2. Ensure that all frequent itemsets are included
3. Avoid repetitious generation of candidates

Three candidate generation procedures are widely used: brute-force generation, the F(k-1) × F(1) merge, and the F(k-1) × F(k-1) merge.
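As an illustration, the F(k-1) × F(k-1) procedure merges two frequent (k-1)-itemsets only when their first k-2 items (in sorted order) agree, which avoids generating the same candidate more than once. The sketch below is a minimal Python rendering of this idea; the function name is ours, not from the notes.

def merge_candidates(frequent_k_minus_1):
    """F(k-1) x F(k-1) candidate generation (illustrative sketch).

    frequent_k_minus_1: frequent (k-1)-itemsets, each a sorted tuple of items.
    Returns candidate k-itemsets; a separate pruning step would still check
    that every (k-1)-subset of each candidate is frequent.
    """
    candidates = set()
    itemsets = sorted(frequent_k_minus_1)
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = itemsets[i], itemsets[j]
            # Merge only if the first k-2 items are identical; this guarantees
            # each candidate is generated exactly once.
            if a[:-1] == b[:-1]:
                candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

# Example: frequent 2-itemsets -> candidate 3-itemsets.
f2 = [("bread", "juice"), ("bread", "milk"), ("juice", "milk")]
print(merge_candidates(f2))   # {('bread', 'juice', 'milk')}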
………………………… end………………

Q) Association Rule?

Association rule learning is an unsupervised learning technique that checks how one data item depends on another and maps those dependencies so that they can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, using rules to discover the interesting relations between variables in the database.

Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it by taking the example of a supermarket: products that are frequently purchased together are placed together.

For example, if a customer buys bread, they are also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Association rule learning can be divided into three types of algorithms:

1. Apriori
2. Eclat
3. F-P Growth Algorithm

…………………… end…………..

Q)How does Association Rule Learning work?


Association rule learning works on the concept of if-then rules, such as: if A, then B.

Here, the "if" part is called the antecedent, and the "then" part is called the consequent. A relationship in which we find an association between two single items is known as single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used. These metrics
are given below:

o Support
o Confidence
o Lift

Let's understand each of them:

Support
Support is the frequency of an itemset, i.e. how often it appears in the dataset. It is defined as
the fraction of the transactions T that contain the itemset X:

Support(X) = Freq(X) / T

Confidence
Confidence indicates how often the rule has been found to be true, i.e. how often the items
X and Y occur together given that X occurs. It is the ratio of the number of transactions that
contain both X and Y to the number of transactions that contain X:

Confidence(X -> Y) = Freq(X ∪ Y) / Freq(X)


Lift
Lift measures the strength of a rule. It is defined as:

Lift(X -> Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

That is, it is the ratio of the observed support to the support that would be expected if X and Y
were independent of each other. It has three possible cases:

o If Lift = 1: the antecedent and the consequent occur independently of each other.
o If Lift > 1: the two itemsets are positively dependent on each other; the higher the
  value, the stronger the association.
o If Lift < 1: one item is a substitute for the other, which means one item has a
  negative effect on the occurrence of the other.
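As a quick illustration, the snippet below computes the three metrics for a rule on a small hypothetical basket dataset (the item names and values are made up for the example):

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimate of P(consequent | antecedent) from the data.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    # Observed support of the pair divided by the support expected under independence.
    return (support(antecedent | consequent, transactions)
            / (support(antecedent, transactions) * support(consequent, transactions)))

baskets = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "eggs"}, {"milk"}]
rule = (frozenset({"bread"}), frozenset({"butter"}))
print(support(rule[0] | rule[1], baskets))   # 0.5
print(confidence(*rule, baskets))            # ~0.67
print(lift(*rule, baskets))                  # ~1.33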

………………. End………….

Q) Types of Association Rule Learning?


Association rule learning can be divided into three algorithms:
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a hash tree to count candidate itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for patients.

Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm uses a depth-first search technique to find frequent itemsets in a transaction database, and it generally executes faster than the Apriori algorithm.

F-P Growth Algorithm


The FP-Growth algorithm stands for Frequent Pattern Growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.

………………… end…………

Q)Applications of Association Rule Learning?


It has various applications in machine learning and data mining. Below are some popular
applications of association rule learning:

o Market Basket Analysis: this is one of the most popular examples and applications of
  association rule mining. The technique is commonly used by big retailers to
  determine the associations between items.
o Medical Diagnosis: association rules help in diagnosing patients by identifying the
  probability of a particular illness given the observed symptoms.
o Protein Sequence: association rules help in determining the synthesis of
  artificial proteins.
o It is also used for catalog design, loss-leader analysis, and many other
  applications.

………………………… end……………..
Q) Compact Representations of Frequent Itemsets?
The number of frequent itemsets can be very large. For instance, suppose a store is trying to find the relationships among more than 100 items. According to the Apriori principle, if an itemset is frequent, all of its subsets must be frequent, so a single frequent 100-itemset has
C(100, 1) = 100 frequent 1-itemsets,
C(100, 2) = 4950 frequent 2-itemsets,
C(100, 3) = 161700 frequent 3-itemsets,
and so on. The total number of frequent itemsets that are subsets of this one 100-itemset is therefore close to 2^100, far more than can be stored or support-counted on an average computer. For this reason, alternative representations have been derived which reduce the initial set but can still be used to generate all other frequent itemsets. The maximal and closed frequent itemsets are two such representations; both are subsets of the full collection of frequent itemsets and are discussed in this section, along with how they can be identified.

Maximal and Closed Frequent Itemsets

A maximal frequent itemset is a frequent itemset none of whose immediate supersets is frequent. A closed frequent itemset is a frequent itemset none of whose immediate supersets has exactly the same support count; from the closed frequent itemsets and their supports, the support of every frequent itemset can be derived.

Relationship between Frequent Itemset Representations:

In conclusion to this section, it is important to point out the relationship between frequent itemsets, closed frequent itemsets, and maximal frequent itemsets. As mentioned earlier, closed and maximal frequent itemsets are subsets of the frequent itemsets, and the maximal frequent itemsets are the more compact representation because they are a subset of the closed frequent itemsets. In summary: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets. Closed frequent itemsets are more widely used than maximal frequent itemsets because, when efficiency is more important than space, they provide the support of their subsets, so no additional pass over the data is needed to find this information.
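A minimal sketch of how the two representations can be identified from a table of frequent itemsets and their support counts (the itemsets and counts below are illustrative assumptions, not from the notes):

def maximal_and_closed(frequent):
    """frequent: dict mapping frozenset itemsets to their support counts."""
    maximal, closed = set(), set()
    for itemset, count in frequent.items():
        supersets = [s for s in frequent if itemset < s]
        # Maximal: no frequent superset exists at all.
        if not supersets:
            maximal.add(itemset)
        # Closed: no frequent superset has the same support count.
        if all(frequent[s] != count for s in supersets):
            closed.add(itemset)
    return maximal, closed

# Hypothetical frequent itemsets with support counts.
freq = {
    frozenset({"a"}): 4, frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,
}
maximal, closed = maximal_and_closed(freq)
print(maximal)  # {frozenset({'a', 'b'})}
print(closed)   # {frozenset({'a'}), frozenset({'a', 'b'})}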

………………… end……………….
Q) Apriori Algorithm?
The Apriori algorithm is an influential algorithm that is widely used in data mining and association rule learning. It is used to identify frequent itemsets in a dataset and to generate association rules based on those itemsets.
Imagine you have a database of the items customers purchase from a store. The Apriori algorithm helps uncover interesting relationships and patterns in this data by finding the sets of items that frequently occur together.
For example, the algorithm might discover that when a customer buys bread, they often buy butter and eggs as well. This indicates a strong association between these items, and such associations help businesses make decisions that improve sales, customer satisfaction, etc.

The following are the main steps of the Apriori algorithm in data mining (a sketch of the rule-generation step follows the list):

1. Set the minimum support threshold - the minimum frequency required for an
   itemset to be considered "frequent".
2. Identify frequent individual items - count the occurrence of each
   individual item.
3. Generate candidate itemsets of size 2 - create pairs of the frequent items
   discovered.
4. Prune infrequent itemsets - eliminate itemsets that do not meet the support
   threshold.
5. Generate itemsets of larger sizes - combine frequent itemsets to form candidates
   of size 3, 4, and so on.
6. Repeat the pruning process - keep eliminating the itemsets that do not
   meet the threshold.
7. Iterate until no more frequent itemsets can be generated.
8. Generate association rules that express the relationships between items -
   calculate measures such as confidence and lift to evaluate the strength and significance of these rules.
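Step 8 is the only step not sketched earlier; the snippet below generates rules from a dictionary of frequent itemsets and their supports, keeping those that meet a confidence threshold. The input values are illustrative assumptions, not taken from the notes.

from itertools import combinations

def generate_rules(frequent, min_confidence):
    """frequent: dict mapping frozenset itemsets to their support (a fraction).

    Assumes every subset of a frequent itemset is also present in the dictionary,
    which the Apriori principle guarantees.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as the antecedent.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules

# Hypothetical frequent itemsets and supports.
freq = {
    frozenset({"bread"}): 0.8, frozenset({"juice"}): 0.8,
    frozenset({"bread", "juice"}): 0.6,
}
for a, c, conf in generate_rules(freq, min_confidence=0.7):
    print(f"{a} -> {c} (confidence {conf:.0%})")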

advantages of Apriori Algorithm :

1. Simplicity & ease of implementation
2. The rules are human-readable & interpretable
3. Works well on unlabelled data
4. Flexibility & customisability
5. Extensions for multiple use cases can be created easily
6. The algorithm is widely used & studied

disadvantages of Apriori Algorithm

1. Computational complexity
2. Time & space overhead
3. Difficulty handling sparse data
4. Limited discovery of complex patterns
5. Higher memory usage
6. Bias of minimum support threshold
7. Inability to handle numeric data
8. Lack of incorporation of context

………………………. End…………………..

Q)How can we improve the Apriori Algorithm's efficiency?


Here are some methods to improve the efficiency of the Apriori algorithm:

1. Hash-Based Technique: this method uses a hash-based structure called a hash
   table for generating the k-itemsets and their corresponding counts. It uses a
   hash function for generating the table (a small sketch follows this list).
2. Transaction Reduction: this method reduces the number of transactions scanned
   in later iterations. Transactions which do not contain any frequent items are
   marked or removed.
3. Partitioning: this method requires only two database scans to mine the frequent
   itemsets. For any itemset to be potentially frequent in the database, it must be
   frequent in at least one of the partitions of the database.
4. Sampling: this method picks a random sample S from database D and then searches
   for frequent itemsets in S. Some globally frequent itemsets may be missed; this
   risk can be reduced by lowering min_sup.
5. Dynamic Itemset Counting: this technique can add new candidate itemsets at any
   marked start point of the database during the scanning of the database.
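A minimal sketch of the hash-based idea for 2-itemsets: pairs from each transaction are hashed into buckets, and any pair whose bucket count falls below the minimum support count cannot be frequent, so it is discarded before exact counting. The bucket count, function name, and dataset are arbitrary assumptions.

from itertools import combinations

def hash_bucket_filter(transactions, min_support_count, num_buckets=7):
    """Keep only the candidate pairs whose hash bucket is sufficiently full."""
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1

    candidates = set()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            # A pair can only be frequent if its bucket count reaches the threshold.
            # Collisions may let some infrequent pairs through; the exact support
            # count in the full algorithm removes them later.
            if buckets[hash(pair) % num_buckets] >= min_support_count:
                candidates.add(pair)
    return candidates

baskets = [{"bread", "juice", "milk"}, {"bread", "juice"}, {"cheese", "juice"}]
print(hash_bucket_filter(baskets, min_support_count=2))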

……………………….. end………….
Q)What are the components of the Apriori algorithm?
There are three major components of the Apriori algorithm in data mining
which are as follows.

1. Support
2. Confidence
3. Lift

For example, suppose you have 5000 customer transactions in a Zara store and you want to
calculate the Support, Confidence, and Lift for two products, say Men's Wear and Women's Wear.
Out of the 5000 transactions, 300 contain men's wear and 700 contain women's wear, and 250
transactions contain both men's and women's wear.

1. Support
Support denotes the popularity of a product or data item in the data set: the number of
transactions containing that product divided by the total number of transactions.
Support (Men's wear) = (transactions containing men's wear) / (total transactions)
= 300/5000
= 6%
Likewise, Support (Women's wear) = 700/5000 = 14%, and Support (Men's & Women's wear) = 250/5000 = 5%.

2. Confidence
Confidence measures how often the consequent appears in transactions that contain the
antecedent. To find the confidence of the rule Men's wear -> Women's wear, we divide the
number of transactions containing both men's and women's wear by the number of
transactions containing men's wear.
Hence,
Confidence (Men's wear -> Women's wear) = (transactions with both) / (transactions with men's wear)
= 250/300
≈ 83.3%
3. Lift
Lift tells us how much more likely women's wear is to be bought together with men's wear
than would be expected if the two were independent.
Lift (Men's wear -> Women's wear) = Confidence (Men's wear -> Women's wear) / Support (Women's wear)
= 0.833 / 0.14
≈ 5.95
A lift well above 1 indicates a strong positive association between the two products.
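A few lines of arithmetic to check the figures above, using the counts given in the example:

total, mens, womens, both = 5000, 300, 700, 250

support_mens = mens / total          # 0.06  -> 6%
support_womens = womens / total      # 0.14  -> 14%
support_both = both / total          # 0.05  -> 5%
confidence = both / mens             # ~0.833 -> 83.3%
lift = confidence / support_womens   # ~5.95

print(support_mens, support_both, confidence, lift)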
………………………….. end……………

Q)What are the applications of Apriori Algorithm in data mining?


The Apriori algorithm has picked up pace in recent years and is used in different industries for data
mining and data handling.
Some fields where Apriori is used:

1. Medical
Hospitals are flooded with data every day and need to retrieve a lot of past data for existing
patients. The Apriori algorithm helps hospitals manage each patient's records without mixing them
up with those of other patients.

2. Education
Educational institutes can use the Apriori algorithm to store and monitor students' data such as age,
gender, traits, characteristics, parents' details, etc.

3. Forestry
Along the same lines as the education and medical industries, forestry can also use the Apriori algorithm
to store, analyze, and manage details of the flora and fauna of a given territory.

4. New Tech Firms


Tech firms use the Apriori algorithm to maintain records of the products purchased by various
customers and feed them into recommender systems.

5. Mobile Commerce
Big data can help mobile e-commerce companies to deliver an easy, convenient and personalized
shopping experience. With the Apriori algorithm, the real-time product recommendation accuracy
increases, which creates an excellent customer experience and increases sales for the company.

………………………….. end………………
Q) FP Growth Algorithm?
The FP-Growth algorithm is an alternative way to find frequent itemsets without using
candidate generation, thus improving performance. To do so, it uses a divide-and-conquer
strategy. The core of this method is a special data structure called the frequent-pattern
tree (FP-tree), which retains the itemset association information.

This algorithm works as follows:

o First, it compresses the input database, creating an FP-tree instance to represent
  the frequent items.
o After this first step, it divides the compressed database into a set of conditional
  databases, each associated with one frequent pattern.
o Finally, each such database is mined separately.

Using this strategy, the FP-Growth reduces the search costs by recursively looking for short
patterns and then concatenating them into the long frequent patterns.

In large databases, holding the FP tree in the main memory is impossible. A strategy to cope with
this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.

FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then mapped onto
a path in the FP-tree. This is done until all transactions have been read. Different transactions
with common subsets allow the tree to remain compact because their paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP
tree is to mine the most frequent pattern. Each node of the FP tree represents an item of the item
set.

The root node represents null, while the lower nodes represent the item sets. The associations of
the nodes with the lower nodes, that is, the item sets with the other item sets, are maintained
while forming the tree.

Han defines the FP-tree as the tree structure given below:


1. One root is labelled as "null" with a set of item-prefix subtrees as children and a
frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the path
reaching the node;
o Node-link: links to the next node in the FP-tree carrying the same item
name or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
   o Item-name: the same as in the node;
   o Head of node-link: a pointer to the first node in the FP-tree carrying the
     item name.
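The following is a minimal Python sketch of such a tree, assuming the transactions have already been re-ordered by descending item frequency; the class and method names are ours, and the header-table threading is simplified compared to Han's definition.

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item-name ("null" for the root)
        self.count = 1            # transactions represented by the path reaching this node
        self.parent = parent
        self.children = {}        # item -> FPNode
        self.node_link = None     # next node in the tree carrying the same item

class FPTree:
    def __init__(self):
        self.root = FPNode("null", None)
        self.header = {}          # frequent-item-header table: item -> a node carrying it

    def insert(self, ordered_items):
        """Map one ordered transaction onto a path of the tree."""
        node = self.root
        for item in ordered_items:
            if item in node.children:
                node.children[item].count += 1      # shared prefix: just bump the count
            else:
                child = FPNode(item, node)          # start a new branch
                node.children[item] = child
                # Thread the node-link chain through the header table (simplified).
                child.node_link = self.header.get(item)
                self.header[item] = child
            node = node.children[item]

# Ordered-item sets from the worked FP-Growth example later in this unit.
tree = FPTree()
for t in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
          ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]]:
    tree.insert(t)
print(tree.root.children["K"].count)   # 5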

Advantages of FP Growth Algorithm


Here are the following advantages of the FP growth algorithm, such as:

o This algorithm needs to scan the database only twice, whereas Apriori scans the
  transactions once for every iteration.
o The pairing of items is not done in this algorithm, making it faster.
o The database is stored in a compact version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm


This algorithm also has some disadvantages, such as:

o FP Tree is more cumbersome and difficult to build than Apriori.


o It may be expensive.
o The algorithm may not fit in the shared memory when the database is large.

…………………………………. End………………
Q)Difference between Apriori and FP Growth Algorithm?
Apriori and FP-Growth are the most basic frequent itemset mining (FIM) algorithms. There are some
basic differences between them:

1. Apriori generates frequent patterns by building itemsets in stages (single itemsets,
   double itemsets, triple itemsets, and so on), whereas FP-Growth builds an FP-tree and
   mines frequent patterns from it.
2. Apriori uses candidate generation, where frequent subsets are extended one item at a
   time, whereas FP-Growth generates a conditional FP-tree for every item in the data.
3. Apriori scans the database at every step, so it becomes time-consuming when the
   number of items is large, whereas FP-Growth scans the database only twice at the
   beginning, so it consumes less time.
4. With Apriori, a converted version of the database is saved in memory, whereas with
   FP-Growth a set of conditional FP-trees, one per item, is saved in memory.
5. Apriori uses a breadth-first search, whereas FP-Growth uses a depth-first search.

…………………………….. end…………..
Association Rule Mining--Apriori Algorithm Solved Problems

Q.1) For the following given Transaction Data-set, Generate Rules using Apriori Algorithm.

Consider the values as Support=50% and Confidence=75%

Transaction ID Items Purchased

1 Bread, Cheese, Egg, Juice

2 Bread, Cheese, Juice

3 Bread, Milk, Yogurt

4 Bread, Juice, Milk

5 Cheese, Juice, Milk

Answer:

Given Support=50% and Confidence=75%

Step 1) Find Frequent Item Set and their support

Item Frequency Support (in %)

Bread 4 4/5=80%

Cheese 3 3/5=60%

Egg 1 1/5=20%

Juice 4 4/5=80%

Milk 3 3/5=60%

Yogurt 1 1/5=20%

Support (item) = Frequency of item/Number of transactions

Step 2) Remove all the items whose support is below given minimum support.

Item Frequency Support (in %)



Bread 4 4/5=80%

Cheese 3 3/5=60%

Juice 4 4/5=80%

Milk 3 3/5=60%

Step 3) Now form the two items candidate set and write their frequencies.

Items Pair Frequency Support (in %)

Bread, Cheese 2 2/5=40%

Bread, Juice 3 3/5=60%

Bread, Milk 2 2/5=40%

Cheese, Juice 3 3/5=60%

Cheese, Milk 1 1/5=20%

Juice, Milk 2 2/5=40%

Step 4) Remove all the items whose support is below given minimum support.

Items Pair Frequency Support (in %)

Bread, Juice 3 3/5=60%

Cheese, Juice 3 3/5=60%

Step 5) Generate rules

For Rules we consider item pairs:

a) (Bread, Juice)
Bread->Juice and Juice->Bread
b) (Cheese, Juice)
Cheese->Juice and Juice->Cheese

Confidence (A->B) = support (A ∪ B) / support (A)



Therefore,

1. Confidence (Bread->Juice) = support (Bread ∪ Juice) / support (Bread)
   = (3/5) / (4/5) = 3/4 = 75%
2. Confidence (Juice->Bread) = support (Juice ∪ Bread) / support (Juice)
   = (3/5) / (4/5) = 3/4 = 75%
3. Confidence (Cheese->Juice) = support (Cheese ∪ Juice) / support (Cheese)
   = (3/5) / (3/5) = 1 = 100%
4. Confidence (Juice->Cheese) = support (Juice ∪ Cheese) / support (Juice)
   = (3/5) / (4/5) = 3/4 = 75%

All the above rules are good because the confidence of each rule is greater than or equal to
the minimum confidence given in the problem.
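The counts and confidences above can be cross-checked with a few lines of Python; the transactions are exactly those given in the problem statement.

from itertools import combinations

transactions = [
    {"Bread", "Cheese", "Egg", "Juice"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk", "Yogurt"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

items = sorted({i for t in transactions for i in t})
for a, b in combinations(items, 2):
    s = support({a, b})
    if s >= 0.5:    # minimum support from the problem
        print(f"{{{a}, {b}}}: support={s:.0%}, "
              f"conf({a}->{b})={s / support({a}):.0%}, "
              f"conf({b}->{a})={s / support({b}):.0%}")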

Q.2) For the following given transaction data set, generate rules using Apriori
Algorithm. Consider the values as Support=22% and Confidence= 70%

Transaction ID Items Purchased

1 I1,I2,I5

2 I2,I4

3 I2,I3

4 I1,I2,I4

5 I1,I3

6 I2,I3

7 I1,I3

8 I1,I2,I3,I5

9 I1,I2,I3

Answer:



Q)Frequent Pattern Growth Algorithm

Consider a hypothetical dataset of transactions, with each letter representing an item. The frequency of each individual item is computed first.

Let the minimum support be 3. A Frequent Pattern set is built which will
contain all the elements whose frequency is greater than or equal to the
minimum support. These elements are stored in descending order of their
respective frequencies. After insertion of the relevant items, the set L looks like
this:-
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}

Now, for each transaction, the respective Ordered-Item set is built. It is done
by iterating the Frequent Pattern set and checking if the current item is
contained in the transaction in question. If the current item is contained, the
item is inserted in the Ordered-Item set for the current transaction. The
following table is built for all the transactions:
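As a small illustration of this step, the snippet below derives the ordered-item sets by keeping only the frequent items of each transaction, listed in the order given by L. The raw transactions shown are hypothetical, chosen only so that the resulting frequencies and ordered-item sets match those used in this example.

from collections import Counter

# Hypothetical raw transactions consistent with the frequencies quoted above.
transactions = [
    {"E", "K", "M", "N", "O", "Y"},
    {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"},
    {"C", "K", "M", "U", "Y"},
    {"C", "E", "I", "K", "O"},
]
min_support = 3

freq = Counter(item for t in transactions for item in t)
# Frequent items in descending order of frequency (ties broken alphabetically): the set L.
L = [i for i in sorted(freq, key=lambda i: (-freq[i], i)) if freq[i] >= min_support]

# Ordered-item set of each transaction: keep only frequent items, in the order given by L.
ordered = [[item for item in L if item in t] for t in transactions]
print(L)        # ['K', 'E', 'M', 'O', 'Y']
print(ordered)  # [['K','E','M','O','Y'], ['K','E','O','Y'], ['K','E','M'], ['K','M','Y'], ['K','E','O']]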

Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of their occurrence in the set, and the support count of each item is initialized to 1.

b) Inserting the set {K, E, O, Y}:


Up to the insertion of the elements K and E, the support counts are simply increased by 1. On inserting O, we see that there is no existing link between E and O, so a new node for item O is initialized with support count 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for item Y with support count 1 and link the new node of O to the new node of Y.

c) Inserting the set {K, E, M}:


Here simply the support count of each element is increased by 1.
d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for
M and Y are initialized and linked accordingly.

e) Inserting the set {K, E, O}:


Here the support counts of the respective elements are simply increased. Note that it is the O node created in step (b), the child of E, whose count is increased.
