Keywords

1 Introduction

Over the years, pattern mining has played an imperative role in solving many data mining tasks. For a considerable period of time, pattern mining research was restricted only to the extraction of frequent patterns disregarding the mining of rare patterns. Rare patterns have proved to be of vital importance in a wide range of applications ranging from network anomaly detection, equipment failure, medicine to fraud detection. Considering the significance of rare patterns, research on rare pattern mining is increasing rapidly and a considerable amount of work has already been carried out for the extraction of these momentous patterns. The techniques of rare pattern mining available in the literature either follow a level-wise approach and generate candidates like Apriori [2] or use efficient data structure and extract rare patterns without generating candidates like FP-Growth [4]. Among the two, tree based approaches have been found to be more efficient in terms of performance compared to level-wise approaches.

Tree based approaches achieve good compression in case of dense datasets. As dense dataset contains many frequent items, sharing of nodes having common item prefixes will be more and hence compactness of the tree structure will also be higher. On the contrary, sparse datasets contain lesser number of frequent items that reduces the compression ratio of the tree structure to a great extent. Also, it has been found that tree based approaches work well with data having long patterns but fails miserably when the data have short patterns. In this paper, an efficient rare pattern mining approach has been proposed that generates appreciable results in case of sparse datasets and data with long patterns and also requires less memory compared to the eminent tree based rare pattern mining approaches.

The remaining paper is systematized as follows:- Some prior works in the area of rare pattern mining have been discussed in Sect. 2. The proposed approach for solving the bottleneck of existing algorithms is presented in Sect. 3 followed by the experimental results in Sect. 4. Section 5 finally concludes the paper with some feasible future perspectives.

2 Related Work

Since its inception, an appreciable amount of work has been done for the extraction of rare patterns. This section illustrates the prior works in the area of rare pattern mining.

Liu et al. [6] made the initial attempt using an Apriori based approach to generate the rare patterns by assigning a minimum support to each item individually. ARIMA [8] and AfRIM [1] on the other hand, employed a single support threshold for extracting the rare patterns. Considering the shortcomings of Apriori based approaches, few other techniques have adopted a tree based approach. The primer method in this category is the Compressed FP-Growth (CFP-Growth) algorithm proposed in [5] where the extraction of rare patterns is based on a multiple minimum support framework. The most popular and efficient pattern growth approach for mining rare patterns is the Rare Pattern Tree (RP-Tree) Mining algorithm proposed in [9]. The algorithm uses two support thresholds and takes into account only those transactions that posses at least one rare item. RP-Tree algorithm is further enhanced for better performance using multiple support thresholds in [3].

3 Rare Pattern Mining Using Hyper-Linked Data Structure

As explained in previous sections, handling sparse datasets is a severe research issue that needs utmost attention. This work is therefore an attempt to resolve this issue in context of rare pattern mining. This section presents the proposed rare pattern mining approach called Hyper-Linked Rare Pattern Mining (Hyper- Linked RPM) with the help of a suitable example for better understandability.

The proposed approach employs a hyper-linked queue based data structure for extracting rare patterns [7]. Experimental results illustrate that it is faster and outperforms the pattern growth and level-wise approaches in many cases. This is because the proposed technique employs memory-based queue instead of a tree data structure that is more efficient in mining sparse datasets and has less space overhead. Also, the computational overhead for constructing tree structures is more as compared to simple queue data structures. The proposed technique of Hyper-Linked RPM is illustrated in Algorithm 1.

figure a

The technique generates the support count or frequency of occurrence of every item during the initial scan of database. During the second scan, for building the hyper-linked data structure, only those transactions will be considered that involves at least one rare item. The reason being, transactions containing only frequent items have no contribution in the rare pattern mining process. To distinguish between frequent and rare items, two support thresholds freqSup and rareSup have been used. The support count of rare itemset (SupRareItem) must lie between these two support thresholds, i.e. freqSup> SupRareItem >rareSup. The items of the reduced database will be stored in a header table H having three fields: item id, hyper-link and support count of items. It is to be noted that the support counts of the items in the header table will be their support counts in the original database. Support counts of items in the reduced database will not be considered. While loading the transactions into memory, transactions with the same first item are stored in queues and linked together using hyper-links. The heads of the queues will be represented by the items of the header table.

For example, a, b, c, d and e are the items of the reduced database. The data structure will perform mining on the following subsets of rare patterns: (i) patterns having only item a, (ii) patterns having item b but not a, (iii) patterns having item c but neither of a and b, (iv) patterns having item d but neither of a, b and c (v) patterns having item e but neither of the former items. The first subset of rare patterns are obtained by mining the a-projected database having all the rare item projections that contain item a and are connected together through links in the a-queue. Mining of a-projected database is done by first creating a header table H\(_a\), where the support count of each item is its corresponding frequency of occurrence with item a in the reduced database. Next the items satisfying freqSup and rareSup in the a-projected database are obtained upon traversal of the a-queue. The same procedure continues to mine the ab-projected database by creating Header Table H\(_a\) \(_b\) and traversing the ab-queue. If none of the items in H\(_a\) \(_b\) could satisfy the thresholds, no rare patterns will be generated and the search terminates for this path. Next the ac-projected database is considered to mine the rare patterns by traversing the ac-queue. If the items in the ac-projected database could not satisfy the thresholds, the search again terminates. The same process continues for other items of the a-projected database. Once the mining of a-projected database is complete, the procedure again starts for mining the next subset of rare patterns i.e. the b-projected database by adjusting the links.

4 Experimental Evaluation of Proposed Approach

To gauge the effectiveness of proposed approach, this section illustrates its performance evaluation on both real-life and synthetic sparse datasets. Real-life datasets Gazelle and Retail have been obtained from UCI repository while synthetic datasets T25I15D10k and T40I10D15K have been generated using synthetic dataset generation method described in [2]. Characteristics of each dataset are given in Table 1. The implementation has been done in Java on a machine of 2.90 GHz Intel i5 processor having 4 GB main memory and 64-bit Windows 8 Pro operating system.

Table 1. Datasets used
Fig. 1.
figure 1

Experimental evaluation

The comparison has been carried out with two tree based approaches: a rare pattern mining algorithm called RP-Tree and a modified frequent pattern mining algorithm called FP-Growth. To generate only rare itemsets, FP-Growth algorithm has been revised by considering only those itemsets that satisfy rareSup and pruning those that exceeds freqSup. Keeping rareSup constant at 5% and varying freqSup threshold, the results obtained for the above datasets are shown in Fig. 1. From the figure, it is evident that the proposed approach is faster and has less space requirement compared to FP-Growth and RP-tree.

5 Conclusion and Future Work

Rare pattern mining has established its significance in front of the data mining community. Pattern growth approaches are the most effective ones among the rare pattern mining techniques. However, they suffer from performance drawback while dealing with sparse data and data with short patterns. This paper, introduces a rare pattern mining method that employs a queue based hyper-linked data structure. The data structure automatically adjusts links while mining rare patterns and greatly reduces the memory overhead encountered by tree based approaches. Performance evaluation given in Sect. 4 elicits the fact that the proposed method is more space efficient and faster in dealing with sparse datasets unlike pattern growth approaches. However, for dense datasets, pattern growth approaches outperforms the proposed approach.

As a future work, the proposed approach can be integrated with some tree based approach to effectively handle dense datasets as well. Further study may also incorporate some database partitioning approach to enhance scalability of the proposed algorithm for handing large datasets.