DWDM Unit 4 (R22)
In order to generate the list of frequent itemsets, one must avoid the brute-force approach described in the previous section, because searching through the whole data set to find the support count of every candidate itemset is very expensive. Some of the strategies used to address this problem are:
1. Reduce the number of candidates: use pruning techniques such as the Apriori
principle to eliminate some of the candidate itemsets without counting their
support values
2. Reduce the number of transactions: by combining transactions together we can
reduce the total number of transactions
3. Reduce the number of comparisons: use efficient data structures to store the
candidates thereby eliminating the need to match every candidate against every
transaction.
The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent; equivalently, if an itemset is infrequent, then all of its supersets must be infrequent. This principle holds because of the anti-monotone property of support: when X is a subset of Y, the support of Y, s(Y), must not exceed the support of X, s(X), since every transaction containing Y also contains X.
The converse holds for a monotone (or upward closed) measure f: when X is a subset of Y, f(X) must not exceed f(Y).
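The anti-monotone property can be checked directly on a toy transaction set. The transactions below and the `support` helper are made-up illustrations, not data from the notes:

```python
# Anti-monotone property of support: for any X subset of Y,
# support(Y) <= support(X), because every transaction that
# contains Y necessarily contains X as well.

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

x = {"bread"}          # subset
y = {"bread", "milk"}  # superset of x

print(support(x, transactions))  # 0.75
print(support(y, transactions))  # 0.5
assert support(y, transactions) <= support(x, transactions)
```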
Apriori Algorithm:
There are three main parts of the algorithm that need to be explained a little further. The candidate generation, candidate pruning and support counting stages, which take place after we generate frequent itemsets of length k, are crucial to grasping the complexity of the Apriori algorithm.
Candidate Generation: this part of the algorithm does as the name suggests and generates
new candidates of length (k+1) from frequent itemsets of length k. For the sake of
efficiency there are some requirements that need to be met in order for us to have effective
candidate generation.
The table below outlines three candidate generation procedures that are widely used.
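One of the most widely used procedures, the F(k-1) × F(k-1) method, can be sketched as follows: two frequent (k-1)-itemsets are merged only if their first k-2 items agree, and the resulting candidate is pruned if any of its (k-1)-subsets is infrequent. The function name and the tiny example set `f2` are illustrative assumptions:

```python
from itertools import combinations

def generate_candidates(frequent_prev):
    """Merge frequent (k-1)-itemsets (given as frozensets) into k-candidates."""
    prev = [tuple(sorted(s)) for s in frequent_prev]
    k = len(prev[0]) + 1
    candidates = set()
    for a, b in combinations(sorted(prev), 2):
        if a[:-1] == b[:-1]:                  # first k-2 items must agree
            cand = frozenset(a) | frozenset(b)
            # candidate pruning via the Apriori principle:
            # every (k-1)-subset must itself be frequent
            if all(frozenset(sub) in frequent_prev
                   for sub in combinations(sorted(cand), k - 1)):
                candidates.add(cand)
    return candidates

# {B, C, D} is never generated because its subset {C, D} is not frequent
f2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(generate_candidates(f2))  # only the candidate {A, B, C} survives
```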
Q) What is an association rule?
Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps those dependencies so that they can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, applying different rules to discover such relations between variables in a database.
Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket, where products that are frequently purchased together are placed together.
For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
An association rule has the form "If A, then B". The "If" element (A) is called the antecedent, and the "then" statement (B) is called the consequent. A relationship that associates two single items is said to have single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used. These metrics are given below:
o Support
o Confidence
o Lift
Support
Support is the frequency of an itemset, i.e. how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For a set of transactions T, it can be written as:
Support(X) = freq(X) / |T| = (number of transactions containing X) / (total number of transactions)
Confidence
Confidence indicates how often the rule has been found to be true, i.e. how often items X and Y occur together in the dataset given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X → Y) = freq(X ∪ Y) / freq(X) = Support(X ∪ Y) / Support(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support of X and Y together to the support expected if X and Y were independent of each other:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y)) = Confidence(X → Y) / Support(Y)
It has three possible ranges of values:
o Lift = 1: X and Y are independent; there is no association between them.
o Lift > 1: X and Y are positively correlated and are likely to be bought together.
o Lift < 1: X and Y are negatively correlated and are unlikely to be bought together.
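The three metrics can be computed directly from transaction data. The following minimal sketch uses a made-up five-transaction dataset; the helper names are assumptions for illustration:

```python
# Support, confidence and lift for a rule X -> Y on a toy dataset.
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

def lift(x, y):
    return confidence(x, y) / support(y)

x, y = {"bread"}, {"butter"}
print(round(support(x | y), 2))    # 0.6  -> 3 of 5 transactions contain both
print(round(confidence(x, y), 2))  # 0.75 -> 3 of 4 bread transactions have butter
print(round(lift(x, y), 2))        # 1.25 -> above 1: positively correlated
```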
The Apriori algorithm is mainly used for market basket analysis and helps to understand which products may be bought together. It can also be used in the healthcare field, for example to find drug reactions for patients.
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. It uses a depth-first search over a vertical (transaction-id list) representation of the database to find frequent itemsets, and it generally executes faster than the Apriori algorithm.
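The core idea of Eclat can be sketched as follows: each item carries the set of transaction IDs (its "tidset") that contain it, the support of an itemset is the size of the intersection of its members' tidsets, and the search proceeds depth-first. The data and function below are an illustrative assumption, not a full implementation:

```python
def eclat(prefix, items, min_count, out):
    """Depth-first search; `items` is a list of (item, tidset) pairs."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_count:          # support = size of the tidset
            out[tuple(sorted(prefix + [item]))] = len(tids)
            # intersect tidsets to extend the current prefix
            suffix = [(i, t & tids) for i, t in items]
            eclat(prefix + [item], suffix, min_count, out)
    return out

tidsets = {"A": {1, 2, 3}, "B": {1, 2}, "C": {2, 3}}
frequent = eclat([], list(tidsets.items()), 2, {})
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
# ('A',) 3
# ('A', 'B') 2
# ('A', 'C') 2
# ('B',) 2
# ('C',) 2
```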
Q) Apriori Algorithm?
The Apriori algorithm is an influential algorithm that is widely used in data mining and association rule learning. It is used to identify frequent itemsets in a dataset and to generate association rules based on those itemsets.
Imagine you have a database about the items a customer purchases from the
store. The Apriori algorithm helps to uncover interesting relationships &
patterns in this data. It does that by finding the sets of items that occur
together, frequently.
For example, the algorithm might discover that customers who buy bread often buy butter and eggs as well, indicating a strong association between these items. Such associations help businesses make decisions to improve sales, customer satisfaction, etc.
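The level-wise search behind Apriori can be sketched in a few lines of Python. This is a minimal illustration (the dataset, the threshold, and the simplified join step are assumptions), not an optimized implementation:

```python
from collections import defaultdict

def apriori(transactions, min_support):
    n = len(transactions)
    counts = defaultdict(int)
    for t in transactions:               # pass 1: count single items
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s for s, c in counts.items() if c / n >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # naive join step: union any two frequent (k-1)-itemsets of size k
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = defaultdict(int)
        for t in transactions:           # support-counting pass
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s for s, c in counts.items() if c / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

txns = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"},
        {"milk"}, {"bread", "butter"}]
for s in sorted(apriori(txns, 0.6), key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
# ['bread']
# ['butter']
# ['milk']
# ['bread', 'butter']
```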
The following are the main limitations of the Apriori algorithm in data mining:
1. Computational complexity
2. Time & space overhead
3. Difficulty handling sparse data
4. Limited discovery of complex patterns
5. Higher memory usage
6. Bias of minimum support threshold
7. Inability to handle numeric data
8. Lack of incorporation of context
Q)What are the components of the Apriori algorithm?
There are three major components of the Apriori algorithm in data mining
which are as follows.
1. Support
2. Confidence
3. Lift
For example, suppose you have 5000 customer transactions in a Zara store and you have to calculate the Support, Confidence, and Lift for two products, say men's wear and women's wear.
Out of 5000 transactions, 300 contain men's wear and 700 contain women's wear, and 250 transactions contain both men's and women's wear.
1. Support
Support denotes the popularity of a product or data item in the data set: the number of transactions containing that product divided by the total number of transactions.
Support (Men's wear) = (transactions containing men's wear) / (total transactions)
= 300/5000
= 6%
2. Confidence
Confidence is how often women's wear is bought when men's wear is bought: the number of transactions containing both men's and women's wear divided by the number of transactions containing men's wear.
Confidence (Men's wear → Women's wear) = (transactions with both) / (transactions with men's wear)
= 250/300
≈ 83.33%
3. Lift
Lift tells us how much more likely women's wear is to be bought when men's wear is bought, compared with its general popularity. It is the confidence of the rule divided by the support of women's wear:
Lift = Confidence (Men's wear → Women's wear) / Support (Women's wear)
= 0.8333 / (700/5000)
≈ 5.95
A lift well above 1 indicates a strong positive association between the two products.
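As a quick check, the figures can be recomputed directly from the counts stated in the question (300, 700, and 250 out of 5000 transactions), using the standard definitions of the three metrics:

```python
# Recomputing support, confidence and lift from the given counts.
total = 5000
mens = 300     # transactions containing men's wear
womens = 700   # transactions containing women's wear
both = 250     # transactions containing both

support_mens = mens / total              # 300/5000 -> 6%
conf = both / mens                       # 250/300  -> about 83.33%
lift = conf / (womens / total)           # 0.8333 / 0.14 -> about 5.95

print(round(support_mens * 100, 2))  # 6.0
print(round(conf * 100, 2))          # 83.33
print(round(lift, 2))                # 5.95
```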
Q) Applications of the Apriori algorithm?
1. Medical
Hospitals generate large amounts of data every day and need to retrieve a lot of past data for existing patients. The Apriori algorithm helps hospitals manage each patient's database without mixing it up with the records of other patients.
2. Education
Educational institutes can use the Apriori algorithm to store and monitor students' data such as age, gender, traits, characteristics, parents' details, etc.
3. Forestry
Along the same lines as the education and medical industries, forestry can use the Apriori algorithm to store, analyse and manage details of the flora and fauna of a given territory.
4. Mobile Commerce
Big data can help mobile e-commerce companies deliver an easy, convenient and personalized shopping experience. With the Apriori algorithm, real-time product-recommendation accuracy increases, which creates an excellent customer experience and increases sales for the company.
Q) FP Growth Algorithm?
The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
Using this strategy, the FP-Growth reduces the search costs by recursively looking for short
patterns and then concatenating them into the long frequent patterns.
In large databases, holding the FP tree in the main memory is impossible. A strategy to cope with
this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then mapped onto
a path in the FP-tree. This is done until all transactions have been read. Different transactions
with common subsets allow the tree to remain compact because their paths overlap.
A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP
tree is to mine the most frequent pattern. Each node of the FP tree represents an item of the item
set.
The root node represents null, while the lower nodes represent the item sets. The associations of
the nodes with the lower nodes, that is, the item sets with the other item sets, are maintained
while forming the tree.
Advantages of the FP-Growth algorithm:
o This algorithm needs to scan the database only twice, whereas Apriori scans the transactions once for each iteration.
o The pairwise generation of candidate itemsets is not done in this algorithm, making it faster.
o The database is stored in a compact, compressed version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.
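The construction of the FP-tree described above can be sketched as follows. Only the tree-building step is shown (mining via conditional trees is omitted), and the five transactions are an illustrative assumption, chosen so that the item frequencies come out as K:5, E:4, M:3, O:3, Y:3:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # pass 1: global item frequencies; keep only frequent items
    freq = Counter(item for t in transactions for item in t)
    keep = {i: c for i, c in freq.items() if c >= min_count}
    rank = {item: r for r, item in
            enumerate(sorted(keep, key=lambda i: (-keep[i], i)))}
    root = Node(None, None)
    # pass 2: insert each transaction as a frequency-ordered path,
    # so that transactions with common prefixes share tree nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

txns = [{"K", "E", "M", "O", "Y"}, {"K", "E", "O", "Y"}, {"K", "E", "M"},
        {"K", "M", "Y"}, {"K", "E", "O"}]
root = build_fp_tree(txns, 3)
# every transaction shares the prefix K, so the root has a single child
print({item: n.count for item, n in root.children.items()})  # {'K': 5}
```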
Q)Difference between Apriori and FP Growth Algorithm?
Apriori and FP-Growth are the most basic frequent itemset mining (FIM) algorithms. There are some basic differences between these algorithms, such as:
Apriori:
o Generates frequent patterns by forming candidate itemsets: single itemsets, then double itemsets, then triple itemsets, and so on.
o Scans the database at every step, so it becomes time-consuming when the number of items is large.
o A converted version of the database is saved in memory.
FP Growth:
o Generates an FP-tree and mines the frequent patterns from it.
o Requires only two database scans at the beginning, so it consumes less time.
o A set of conditional FP-trees, one for every item, is saved in memory.
Association Rule Mining--Apriori Algorithm Solved Problems
Q.1) For the following given Transaction Data-set, Generate Rules using Apriori Algorithm.
Answer:
Step 1) Count the frequency of each item and compute its support.
Bread 4 4/5 = 80%
Cheese 3 3/5 = 60%
Egg 1 1/5 = 20%
Juice 4 4/5 = 80%
Milk 3 3/5 = 60%
Yogurt 1 1/5 = 20%
Step 2) Remove all the items whose support is below the given minimum support.
Bread 4 4/5 = 80%
Cheese 3 3/5 = 60%
Juice 4 4/5 = 80%
Milk 3 3/5 = 60%
Step 3) Now form the two items candidate set and write their frequencies.
Step 4) Remove all the items whose support is below given minimum support.
The surviving frequent two-itemsets, each with support 3/5 = 60%, give the following rules:
a) (Bread, Juice)
Bread->Juice and Juice->Bread
Confidence(Bread->Juice) = support(Bread, Juice) / support(Bread) = (3/5)/(4/5) = 3/4 = 75%
Confidence(Juice->Bread) = support(Bread, Juice) / support(Juice) = (3/5)/(4/5) = 3/4 = 75%
b) (Cheese, Juice)
Cheese->Juice and Juice->Cheese
Confidence(Cheese->Juice) = support(Cheese, Juice) / support(Cheese) = (3/5)/(3/5) = 100%
Confidence(Juice->Cheese) = support(Cheese, Juice) / support(Juice) = (3/5)/(4/5) = 3/4 = 75%
All the above rules are good because the confidence of each rule is greater than or equal to the minimum confidence given in the problem.
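The rule confidences above can be double-checked by dividing each pair's support count by the support count of the rule's antecedent; the helper below exists only for this arithmetic:

```python
def confidence(pair_count, antecedent_count):
    """Confidence = support count of the pair / support count of the antecedent."""
    return pair_count / antecedent_count

print(confidence(3, 4))  # Bread -> Juice : 3/4 = 0.75
print(confidence(3, 4))  # Juice -> Bread : 3/4 = 0.75
print(confidence(3, 3))  # Cheese -> Juice: 3/3 = 1.0
print(confidence(3, 4))  # Juice -> Cheese: 3/4 = 0.75
```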
Q.2) For the following given transaction data set, generate rules using the Apriori Algorithm. Consider the values as Support = 22% and Confidence = 70%.
TID Items
1 I1, I2, I5
2 I2, I4
3 I2, I3
4 I1, I2, I4
5 I1, I3
6 I2, I3
7 I1, I3
8 I1, I2, I3, I5
9 I1, I2, I3
Answer:
Let the minimum support count be 3. A Frequent Pattern set is built containing all the items whose frequency is greater than or equal to the minimum support; these items are stored in descending order of their frequencies. After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built by iterating over the Frequent Pattern set and checking whether each item is contained in the transaction in question. If it is, the item is inserted into the Ordered-Item set for that transaction. The following table is built for all the transactions:
Now, all the Ordered-Item sets are inserted into a trie data structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after another in their order of occurrence in the set, and the support count of each item is initialized to 1.