
U2 - Apriori - 5th Sem - DS


Unit 2 - Apriori

Prepared by: Varun Rao (Dean, Data Science & AI)


For: Data Science - 3rd years

What is the Apriori Algorithm?


The Apriori algorithm is used for frequent itemset mining and association rule
learning over transactional databases. It proceeds by identifying the frequent
individual items in the database and extending them to larger and larger
itemsets as long as those itemsets appear sufficiently often in the database. The
frequent itemsets determined by Apriori can be used to derive association
rules which highlight general trends in the database; this has applications in
domains such as market basket analysis.

The primary objective of the Apriori algorithm is to create association rules
between different objects. An association rule describes how two or more
objects are related to one another. The task Apriori performs is also called
frequent pattern mining. Generally, you operate the Apriori algorithm on a
database that consists of a huge number of transactions. To understand the
Apriori algorithm with the help of an example, suppose you go to Big Bazar and
buy different products. Mining the association rules between those products
helps customers buy their products with ease and increases the sales
performance of Big Bazar.

Apriori uses a "bottom-up" approach, where frequent subsets are extended
one item at a time (a step known as candidate generation), and groups of
candidates are tested against the data. The algorithm terminates when no
further successful extensions are found.
Using breadth-first search and a Hash tree structure, Apriori counts candidate
item sets efficiently. It generates candidate item sets of length k from item sets
of length k-1. Then it prunes the candidates which have an infrequent sub-
pattern. According to the downward closure lemma, the candidate set
contains all frequent k-length item sets. After that, it scans the transaction
database to determine frequent item sets among the candidates.
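
To make the join and prune steps concrete, here is a minimal Python sketch of candidate generation; the function name gen_candidates and the sample itemsets are illustrative assumptions, not part of these notes.

from itertools import combinations

def gen_candidates(frequent_k_minus_1, k):
    # Join step: union pairs of frequent (k-1)-itemsets that give a k-itemset.
    candidates = {a | b
                  for a in frequent_k_minus_1
                  for b in frequent_k_minus_1
                  if len(a | b) == k}
    # Prune step (downward closure): keep a candidate only if every
    # (k-1)-subset of it is frequent.
    return {c for c in candidates
            if all(frozenset(sub) in frequent_k_minus_1
                   for sub in combinations(c, k - 1))}

# Example: frequent 2-itemsets -> candidate 3-itemsets.
frequent_2 = {frozenset(p) for p in [("milk", "bread"),
                                     ("milk", "eggs"),
                                     ("bread", "eggs")]}
print(gen_candidates(frequent_2, 3))   # one candidate: {'milk', 'bread', 'eggs'}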

What are the steps of the Apriori Algorithm?
The following are the main steps of the algorithm:
1. Calculate the support of the itemsets of size k = 1 in the transactional
database (note that support is the frequency of occurrence of an itemset).
2. In the first iteration of the algorithm, each item is taken as a candidate
1-itemset. The algorithm counts the occurrences of each item. This is
called generating the candidate set.
3. Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets
whose occurrence satisfies min_sup is determined. Only those candidates
whose count is greater than or equal to min_sup are taken ahead to the next
iteration; the others are pruned.
4. In other words, prune the candidate set by eliminating items with a support
less than the given threshold.
5. Next, the frequent 2-itemsets satisfying min_sup are discovered. In the join
step, the candidate 2-itemsets are generated by combining the frequent
1-itemsets with each other.
6. Join the frequent itemsets to form sets of size k + 1, and repeat the above
steps until no more itemsets can be formed. This happens when all the sets
that can be formed have a support less than the given threshold.
7. The candidate 2-itemsets are pruned using the min_sup threshold, so the
table now contains only 2-itemsets that satisfy min_sup.
8. The next iteration forms 3-itemsets using the join and prune steps.
9. The following step forms 4-itemsets by joining the 3-itemsets with each
other and pruning any candidate whose subsets do not meet the min_sup
criterion. The algorithm stops when no more frequent itemsets can be
generated.
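
The steps above can be sketched end-to-end in a few lines of Python; the toy transactions, the absolute min_sup of 2, and all names below are illustrative assumptions, not part of these notes.

from itertools import combinations

# Toy transactional database and absolute minimum support (illustrative only).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "cheese"},
    {"milk", "bread", "cheese"},
]
min_sup = 2

def support_count(itemset, transactions):
    # Number of transactions containing every item of the itemset (steps 1-2).
    return sum(1 for t in transactions if frozenset(itemset) <= t)

# Steps 1-4: candidate 1-itemsets, counted and pruned against min_sup.
items = {item for t in transactions for item in t}
frequent = {frozenset([i]) for i in items
            if support_count([i], transactions) >= min_sup}

all_frequent = set(frequent)
k = 2
while frequent:
    # Steps 5-8: join the frequent (k-1)-itemsets to form k-itemset candidates...
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # ...prune candidates that have an infrequent (k-1)-subset...
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    # ...and keep only the candidates that still meet min_sup after one scan.
    frequent = {c for c in candidates
                if support_count(c, transactions) >= min_sup}
    all_frequent |= frequent
    k += 1   # Step 9: repeat until no further frequent itemsets are produced.

for itemset in sorted(all_frequent, key=len):
    print(sorted(itemset), support_count(itemset, transactions))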

What are the advantages of the Apriori Algorithm?
The Apriori algorithm advantages are as follows:
1. This is the simplest and easiest-to-understand algorithm among
association rule learning algorithms
2. The resulting rules are intuitive and easy to communicate to an end-
user
3. It doesn't require labeled data as it is fully unsupervised; as a result, you
can use it in many different situations because unlabeled data is often
more accessible
4. Many extensions were proposed for different use cases based on this
implementation—for example, there are association learning algorithms
that take into account the ordering of items, their number, and
associated timestamps
5. The algorithm is exhaustive, so it finds all the rules with the specified
support and confidence

What are the disadvantages of the Apriori Algorithm?
One of the biggest limitations of the Apriori algorithm is that it is slow. The
main causes are:
1. A large number of itemsets in the dataset.
2. A low minimum support threshold.
3. The time needed to hold a large number of candidate sets when there are
many frequent itemsets.
Together these make the algorithm inefficient on large volumes of data.
As an example, if there are 10^4 frequent 1-itemsets, the algorithm needs to
generate more than 10^7 candidate 2-itemsets, which must then be counted and
collected. To detect a frequent pattern of size 100 (containing v1, v2, ... v100),
the algorithm would have to generate on the order of 2^100 candidate itemsets.
Hence, the costs escalate and a great deal of time is spent in candidate
generation. In addition, to check the many candidate itemsets against the data,
the algorithm scans the database repeatedly, which is expensive. This hurts most
when system memory is insufficient and there are a large number of frequent
itemsets, which is why the algorithm becomes inefficient and slow on large
databases.
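
As a quick check of those figures: the number of candidate 2-itemsets that can be formed from 10^4 frequent 1-itemsets is C(10^4, 2) = 10^4 × (10^4 - 1) / 2 ≈ 5 × 10^7, which is indeed well above 10^7. Likewise, a frequent pattern of length 100 has 2^100 - 1 non-empty subsets, each of which is itself frequent and may appear as a candidate along the way.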

How can we improve the Apriori Algorithm's efficiency?
Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure
called a hash table for generating the k-itemsets and their
corresponding count. It uses a hash function for generating the table.
2. Transaction Reduction: This method reduces the number of
transactions scanned in iterations. The transactions which do not
contain frequent items are marked or removed.
3. Partitioning: This method requires only two database scans to mine the
frequent itemsets. It says that for any itemset to be potentially frequent
in the database, it should be frequent in at least one of the partitions of
the database.
4. Sampling: This method picks a random sample S from Database D and
then searches for a frequent itemset in S. It may be possible to lose a
global frequent itemset. This can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate
itemsets at any marked start point of the database during the scanning
of the database.

Two of these methods are worth describing in a little more detail.

Hash-based itemset counting

In hash-based itemset counting, a k-itemset whose corresponding hashing bucket count
is below the threshold cannot be frequent, so it can be excluded from the candidate set.
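
A rough Python sketch of the hash-based idea for candidate 2-itemsets is given below; the transactions, the number of buckets, the hash function, and the threshold are all illustrative assumptions. Because a bucket count is an upper bound on the support of every pair hashed into it, filtering on the bucket count never discards a truly frequent pair.

from itertools import combinations

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "cheese"},
    {"milk", "cheese"},
]
min_sup = 2
n_buckets = 7
buckets = [0] * n_buckets

def bucket_of(pair):
    # Any deterministic hash of the sorted pair works; this one is arbitrary.
    return hash(tuple(sorted(pair))) % n_buckets

# While scanning for 1-itemset counts, also hash every 2-itemset that occurs.
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[bucket_of(pair)] += 1

# A candidate pair whose bucket total is below min_sup cannot be frequent,
# so it is filtered out before the expensive counting pass.
candidate_pairs = [("milk", "bread"), ("bread", "eggs"), ("milk", "cheese")]
survivors = [p for p in candidate_pairs if buckets[bucket_of(p)] >= min_sup]
print(survivors)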

Transaction Reduction

In transaction reduction, a transaction that does not contain any frequent k-itemset
cannot contain any frequent (k+1)-itemset either, so it is not useful in subsequent
scans and can be skipped.

What are the components of the Apriori algorithm?
There are three major components of the Apriori algorithm which are as
follows.
1. Support
2. Confidence
3. Lift

For example, suppose you have 5,000 customer transactions in a Zara store and you
have to calculate the Support, Confidence, and Lift for two products, say men's
wear and women's wear.
Out of the 5,000 transactions, 300 contain men's wear and 700 contain women's
wear, and 250 transactions contain both men's and women's wear.

1. Support
Support denotes the popularity of a product or item in the dataset: the fraction
of all transactions that contain it. We divide the number of transactions
containing that product by the total number of transactions.
Support (men's wear) = (transactions containing men's wear) / (total transactions)
= 300/5000
= 6%

2. Confidence
Confidence measures how often women's wear is bought when men's wear is bought.
To find the confidence of the rule men's wear → women's wear, we divide the number
of transactions that contain both men's and women's wear by the number of
transactions that contain men's wear.
Hence,
Confidence (men's wear → women's wear) = (transactions with both men's & women's wear) / (transactions with men's wear)
= 250/300
≈ 83.3%

3. Lift
Lift tells us how much more likely customers are to buy women's wear when they buy
men's wear, compared with how often women's wear is bought overall. It is the ratio
of the confidence of the rule to the support of women's wear.
Lift (men's wear → women's wear) = Confidence (men's wear → women's wear) / Support (women's wear)
= 0.833 / 0.14
≈ 5.95
A lift greater than 1 means the two products are bought together more often than
would be expected if they were independent.
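
The same three quantities can be computed directly from the figures above; the variable names are illustrative.

# Zara-store figures: 5000 transactions, 300 with men's wear,
# 700 with women's wear, 250 with both.
total = 5000
mens = 300
womens = 700
both = 250

support_mens = mens / total               # 0.06  -> 6%
confidence = both / mens                  # 0.833 -> 83.3%
lift = confidence / (womens / total)      # ~5.95

print(f"Support(men's wear)          = {support_mens:.2%}")
print(f"Confidence(men's -> women's) = {confidence:.2%}")
print(f"Lift(men's -> women's)       = {lift:.2f}")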
Mining association rules from frequent item sets

Association Rules Mining General Concepts


This is an example of Unsupervised Data Mining. You are not trying to predict a
variable.

All previous classification algorithms are considered Supervised techniques.

Given a set of transactions, find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.

Nominal attributes are required.

Affinity Analysis is the process of determining which things go together. This is
also called market basket analysis.

For example, we may have the following products: Milk, Cheese, Bread, Eggs

Possible associations include:

1. if customers purchase milk they also purchase bread: {milk} → {bread}
2. if customers purchase bread they also purchase milk: {bread} → {milk}
3. if customers purchase milk and eggs they also purchase cheese and bread: {milk, eggs} → {cheese, bread}
4. if customers purchase milk, cheese, and eggs they also purchase bread: {milk, cheese, eggs} → {bread}

These associations are derived from a set of customer transactions.

Association Rules Mining Approach


Given a set of transactions T, the goal of association rule mining is to find all rules having

● support ≥ minSup threshold
● confidence ≥ minConf threshold

Brute-force approach:

● List all possible association rules
● Compute the support and confidence for each rule
● Prune rules that fail the minSup and minConf thresholds
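
A small Python sketch of this brute-force approach, assuming a toy transaction set and arbitrary thresholds (every name and number below is illustrative):

from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "cheese"},
    {"milk", "bread", "cheese"},
]
min_sup, min_conf = 0.5, 0.7
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

items = sorted({i for t in transactions for i in t})
rules = []
# List all possible rules: for every itemset of size >= 2, try every way of
# splitting it into an antecedent X and a consequent Y.
for size in range(2, len(items) + 1):
    for itemset in combinations(items, size):
        itemset = frozenset(itemset)
        for ante_size in range(1, size):
            for ante in combinations(itemset, ante_size):
                X = frozenset(ante)
                Y = itemset - X
                sup = support(itemset)
                conf = sup / support(X) if support(X) > 0 else 0.0
                # Prune rules that fail either threshold.
                if sup >= min_sup and conf >= min_conf:
                    rules.append((set(X), set(Y), sup, conf))

for X, Y, sup, conf in rules:
    print(f"{X} -> {Y}  support={sup:.2f}  confidence={conf:.2f}")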

Another application of association rule mining is finding patterns in congressional voting records.

Association rule mining finds all sets of items (itemsets) that have
support greater than the minimum support and then uses those
large itemsets to generate the desired rules that have
confidence greater than the minimum confidence. The lift of a
rule is the ratio of the observed support to that expected if X and
Y were independent. A typical and widely used application of
association rules is market basket analysis.
AIS Algorithm

1. Candidate itemsets are generated and counted on-the-fly as the database is scanned.
2. For each transaction, it is determined which of the large itemsets of the previous pass are contained in this transaction.
3. New candidate itemsets are generated by extending these large itemsets with other items in this transaction.

The disadvantage of the AIS algorithm is that it unnecessarily generates and counts
too many candidate itemsets that turn out to be small (i.e., infrequent).

SETM Algorithm

1. Candidate itemsets are generated on-the-fly as the database is scanned, but counted at the end of the pass.
2. New candidate itemsets are generated the same way as in the AIS algorithm, but the TID of the generating transaction is saved with the candidate itemset in a sequential structure.
3. At the end of the pass, the support count of candidate itemsets is determined by aggregating this sequential structure.

The SETM algorithm has the same disadvantage as the AIS algorithm. A further
disadvantage is that, for each candidate itemset, the sequential structure holds
as many entries as the itemset's support value.

Apriori Algorithm

1. Candidate itemsets are generated using only the large itemsets of the previous pass, without considering the transactions in the database.
2. The large itemsets of the previous pass are joined with themselves to generate all itemsets whose size is larger by 1.
3. Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidates.
The Apriori algorithm takes advantage of the fact that any subset of a frequent
itemset is also frequent. The algorithm can therefore reduce the number of
candidates being considered by exploring only the itemsets whose support count
meets the minimum support count: any candidate that has an infrequent subset
can be pruned without counting it.

AprioriTid Algorithm

1. The database is not used at all for counting the support of candidate itemsets after the first pass.
2. The candidate itemsets are generated the same way as in the Apriori algorithm.
3. Another set C’ is generated, in which each member holds the TID of a transaction and the large itemsets present in that transaction. This set is used to count the support of each candidate itemset.
The advantage is that the number of entries in C’ may be
smaller than the number of transactions in the database,
especially in the later passes.
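
A much-simplified Python sketch of the AprioriTid idea follows: support is counted against a per-TID set of itemsets (the set C’ above) instead of rescanning the raw database. The data, names, and bookkeeping details are illustrative approximations, not the exact algorithm of the original paper.

from itertools import combinations

# Toy database keyed by TID, and an absolute minimum support (illustrative).
transactions = {
    100: {"milk", "bread", "eggs"},
    200: {"milk", "bread"},
    300: {"bread", "cheese"},
    400: {"milk", "bread", "cheese"},
}
min_sup = 2

# First pass over the raw database: frequent 1-itemsets, plus the initial C'
# mapping each TID to the frequent 1-itemsets contained in that transaction.
counts = {}
for t in transactions.values():
    for item in t:
        counts[item] = counts.get(item, 0) + 1
frequent = {frozenset([i]) for i, c in counts.items() if c >= min_sup}
c_prime = {tid: {s for s in frequent if s <= t} for tid, t in transactions.items()}

k = 2
while frequent:
    # Candidates are generated exactly as in Apriori (join + prune).
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    counts = {c: 0 for c in candidates}
    new_c_prime = {}
    for tid, itemsets in c_prime.items():
        # The raw transaction is never touched again: a candidate is counted
        # for this TID if all of its (k-1)-subsets appear in the TID's entry.
        present = {c for c in candidates
                   if all(frozenset(s) in itemsets for s in combinations(c, k - 1))}
        for c in present:
            counts[c] += 1
        if present:                      # entries can shrink or disappear,
            new_c_prime[tid] = present   # which is where the savings come from
    frequent = {c for c, n in counts.items() if n >= min_sup}
    c_prime = new_c_prime
    if frequent:
        print(f"pass {k}:", [sorted(c) for c in frequent])
    k += 1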
