Nothing Special   »   [go: up one dir, main page]

Academia.eduAcademia.edu
Mining Induced and Embedded Subtrees in Ordered, Unordered, and Partially-Ordered Trees Aı́da Jiménez, Fernando Berzal and Juan-Carlos Cubero Dept. Computer Science and Artificial Intelligence, ETSIIT - University of Granada, 18071, Granada, Spain {aidajm|jc.cubero|fberzal}@decsai.ugr.es Abstract. Many data mining problems can be represented with nonlinear data structures like trees. In this paper, we introduce an scalable algorithm to mine partially-ordered trees. Our algorithm, POTMiner, is able to identify both induced and embedded subtrees and, as special cases, it can handle both completely ordered and completely unordered trees (i.e. the particular situations existing algorithms address). 1 Introduction Non-linear data structures are becoming more and more common in data mining problems. Graphs, for instance, are commonly used to represent data and their relationships in different problem domains, ranging from web mining and XML documents to bioinformatics and computer networks. Trees, in particular, are amenable to efficient mining techniques and they have recently attracted the attention of the research community. The aim of this paper is to present a new algorithm, POTMiner, to identify frequent patterns in partially-ordered trees, a particular kind of trees that is present in several problems domains. However, existing tree mining algorithms cannot be directly applied to this important kind of trees. Our paper is organized as follows. We introduce the idea of partially-ordered trees as well as some standard terms in Section 2. Section 3 describes the state of the art in tree mining algorithms. Our algorithm is presented in Section 4. Section 5 shows some experimental results. Finally, in Section 6, we present some conclusions and pointers to future work in this area. 2 Background We will first review some basic concepts related to labeled trees using the notation from [1]. A tree is a connected and acyclic graph. A tree is rooted if its edges are directed and a special node, called root, can be identified. The root is the node from which it is possible to reach all the other vertices in the tree. In contrast, a tree is free if its edges have no direction, that is, when it is an undirected graph. Therefore, a free tree has no predefined root. Rooted trees can be classified as: – Ordered trees, when there is a predefined order within each set of siblings. – Unordered trees, when there is not such a predefined order among siblings. In this paper, we consider partially-ordered trees, witch can be defined as trees that have both ordered and unordered sets of siblings. They can be useful when the order within some sets of siblings is important but it is not necessary to establish an order relationship among all the tree nodes. In Figure 1, we show a dataset example with different kinds of rooted trees. In this figure, circled nodes represent ordered nodes, while squared nodes represent unordered ones. 1 3 1 3 2 2 3 1 2 3 3 2 2 1 2 1 Fig. 1. Example dataset with different kinds of rooted trees (from left to right): (a) completely ordered tree, (b) and (c) partially-ordered trees, (d) completely unordered tree. Different kinds of subtrees can be mined from a set of trees depending on the way we define the matching function between a pattern and a tree. We can obtain induced subtrees from a tree T by repeatedly removing leaf nodes from a bottom-up subtree of T (i.e. the subtree obtained by taking one vertex from T and all its descendants). We can also obtain embedded subtrees by removing nodes from a bottom-up subtree provided that we do not to break the ancestor relationship among the vertices of T. Figure 2 shows the induced subtrees of size 3 that appear at least 3 times in the example dataset from Figure 1. In this example, this would also be the set of frequent embedded subtrees of size 3. 1 3 3 2 2 1 Fig. 2. Frequent subtrees of size 3 found in the dataset from Figure 1. 3 Tree Pattern Mining Once we have introduced some basic terminology, we will survey the state of the art in tree pattern mining. The goal of frequent tree mining is the discovery of all the frequent subtrees in a large database of trees D, also referred to as forest, or in a unique large tree. Let δT (S) be the occurrence count of a subtree S in a tree T and dT a variable such that dT (S)=0 if δT (S) P = 0 and dT (S)=1 if δT (S) > 0. We define the support of a subtree as σ(S) = T ∈D dT (S), i.e, the number of trees in D that include at least one occurrence of the subtree P S. Analogously, the weighted support of a subtree is defined as σw (S) = T ∈D δT (S), i.e., the total number of occurrences of S within all the trees in D. We say that a subtree S is frequent if its support is greater than or equal to a predefined minimum support threshold. We define Fk as the set of all frequent subtrees of size k. Before we proceed to introduce the algorithms proposed in the literature, we will describe the most common tree representations used by such algorithms. 3.1 Tree Representation A canonical tree representation is a unique way of representing a labeled tree. This representation makes the problems of tree comparison and subtree enumeration easier. Three alternatives have been proposed in the literature to represent rooted ordered trees as strings: – Depth-first codification: The string representing the tree is built by adding the label of the tree nodes in a depth-first order. A special symbol ↑, which is not in the label alphabet, is used when the sequence comes back from a child to his parent. In Figure 1, the depth-first codification of tree (b) is 32↑12↑3↑↑ while the codification of tree (c) is 132↑1↑↑2↑. – Breadth-first codification: Using this codification scheme, the string is obtained by traversing the tree in a breadth-first order, i.e., level by level. Again, we need an additional symbol $, which is not in the label alphabet, in order to separate sibling families. The breadth-first codifications for trees (b) and (c) in the previous example are 3$21$$23 and 1$32$21, respectively. – Depth-sequence-based codification: This codification scheme is also based on a depth-first traversal of the tree, but it explicitly stores the depth of each node within the tree. The resulting string is built with pairs (d, l) where the first element (d) is the depth of the node and the second one (l) is the node label. The depth sequence codifications of the previous examples are (3,0)(2,1)(1,1)(2,2)(3,2) for tree (b) and (1,0)(3,1)(2,2)(1,2)(2,1) for tree (c). The canonical representation for unordered trees can be defined as the minimum codification, in lexicographical order, of all the ordered trees that can be derived from it. You can use any of the codification schemes described above. Free trees have no predefined root, but it is possible to select one node as the root in order to get a unique canonical representation of the tree, as described in [2]. 3.2 Tree Mining Algorithms Several frequent tree pattern mining algorithms have been proposed in the literature. These algorithms are mainly derived from two well-known frequent pattern mining algorithms: Apriori [3] and FP-Growth [4]. Many algorithms follow the well-known Apriori [3] iterative pattern mining strategy, where each iteration is broken up into two distinct phases: – Candidate Generation: Potentially frequent candidates are generated from the frequent patterns discovered in the previous iteration. Most algorithms generate candidates of size k + 1 by merging two patterns of size k having k − 1 elements in common. There are several strategies for candidate subtree generation: • The rightmost expansion strategy generates subtrees of size k+1 from frequent subtrees of size k by adding nodes only to the rightmost branch of the tree. This technique is used in algorithms like FREQT [5], uFreqt [6], and UNOT [7]. • The equivalence class-based extension technique is based on the depth-first canonical representation of trees. It generates a candidate (k + 1)-subtree through joining two frequent k-subtrees with (k − 1) nodes in common. Zaki used this extension method in his TreeMiner [8] and SLEUTH [9] algorithms. • The right-and-left tree join method was proposed with the AMIOT algorithm [10]. In candidate generation, this algorithm consideres the rightmost and the leftmost leaves of a tree. • The extension and join technique is based on the breadth-first codification of trees and defines two extension mechanisms and a join operation to generate candidates. This method is used by HybridTreeMiner [2]. – Support Counting: Given the set of potentially frequent candidates, this phase consist of determining their support and keeping only those candidates that are actually frequent. Other algorithms are derived from the FP-Growth [4] algorithm. For instance, the PathJoin algorithm [11] uses compacted structures called FP-Trees to encode input data, while CHOPPER and XSpanner [12] use a sequence codification for trees and extract subtrees using frequent subsequences. Table 1 summarizes some frequent tree mining algorithms that have been proposed in the literature, pointing out the kind of input trees they can be applied to (ordered, unordered, or free) and the kind of subtrees they are able to identify (induced, embedded, or maximal). Input trees Discovered patterns Algorithm Ordered Unordered Free Induced Embedded Maximal FreqT[5] • • AMIOT [10] • • uFreqT [6] • • TreeMiner [8] • • CHOPPER [12] • • XSpanner [12] • • SLEUTH [9] • • Unot [7] • • TreeFinder [13] • • • PathJoin [11] • • • CMTreeMiner [14] • • • • FreeTreeMiner [15] • • HybridTreeMiner [2] • • • Table 1. Some frequent tree mining algorithms. 4 Mining Partially-Ordered Trees The algorithms described in the previous section can extract frequents patterns from trees that are completely ordered or completely unordered, but none of them works with partially ordered trees. In this section, we describe how Zaki’s TreeMiner [8] and SLEUTH [9] algorithms can be adapted to mine partially ordered trees. The algorithm we have devised, POTMiner (Partially-Ordered Tree Miner), is able to identify frequent subtrees, both induced and embedded, in ordered, unordered, and partiallyordered trees. Our algorithm is based on Apriori [3]. Therefore, it has two phases: candidate generation and support counting. 4.1 Candidate Generation for Partially-Ordered Trees We use a depth-first codification to represent trees and generate (k + 1)-subtree candidates by joining two frequent k-subtrees with k − 1 elements in common. We use Zaki’s class extension method to generate candidates. Two k-subtrees are in the same equivalence class [P ] if they share the same codification string until the node k − 1. Each element of the class can then be represented by a single pair (x, p) where x is the k-th node label and p specifies the depth-first position of its parent. TreeMiner [8] and SLEUTH [9] use the class extension method to mine ordered and unordered embedded subtrees, respectively. The main different between TreeMiner and SLEUTH is that, in SLEUTH, which works with unordered trees, only those extensions that produce canonical subtrees are allowed in order to avoid the duplicate generation of candidates. In POTMiner, as in TreeMiner, all extensions are allowed because tree nodes might be ordered or unordered. The candidate generation method is defined as follows: Let (x, i) and (y, j) denote two elements in the same class [P ], and [Pxi ] be the set of candidate trees derived from the tree that is obtained by adding the element (x, i) to P . The join procedure considers two scenarios [9]: 1. Cousin extension, when the element (y,j) is a right relative of the element (x,i): If j ≤ i and |P | = k − 1 ≥ 1, then (y, j) ∈ [Pxi ]. 2. Child extension, when the element (y,j) is a descendant of the element (x,i): If j = i then (y, k − 1) ∈ [Pxi ]. 4.2 Support Counting for Induced and Embedded Subtrees For the support counting phase, we define the scope of a node [9] as a pair [l,u] where l is the position of the node in depth-first order and u is the position of its rightmost descendant. Our algorithm preserves each occurrence of a pattern X in each database tree using a tuple (t, m, s, d ) where t is the tree identifier, m stores which nodes of the tree match those of the (k-1) prefix of the pattern X in depth-first order, s is the scope of the last node in the pattern X, and d is a depth-based parameter used for mining induced subtrees (it is not needed when mining embedded subtrees). The scope list of a pattern X is the list of all the tuples (t, m, s, d ) representing the occurrences of X in the database. For patterns of size 1, m is a empty list and the element d is initialized with the depth of the pattern only node in the database tree. The scope list for a new candidate is built by joining the scope lists of the subtrees involved in the generation of the candidate. Let [tx , mx , sx , dx ] and [ty , my , sy , dy ] be the scope lists of the subtrees involved in the generation of the candidate. The scope list for this candidate is built by a join operation that depends on the candidate extension method used, i.e. whether it has been generated by cousin extension or child extension. 1. In-scope test (used in conjunction with child extension): If (a) tx = ty = t and (b) mx = my = m and (c) sy ⊂ sx , (i.e., S lx ≤ ly and ux ≥ uy ) then add [t, m {lx }, sy , dy −dx ] to the scope list of the generated candidate. 2. Out-scope test (used in conjunction with cousin extension): If (a) tx = ty = t and (b) mx = my = m and (c) If the node is ordered and sx < sy (i.e. ux < ly ) or if the node is unordered and S either sx < sy or sy < sx , (i.e. either ux < ly or uy < lx ) then add [t, m {lx }, sy ,dy ] to the scope list of the generated candidate. The weighted support of an embedded pattern is the number of elements in its scope list. In the case of induced patterns, their weighted support is the number of elements within its scope list whose d parameter equals 1. Intuitively, d represents the distance between the last node in the pattern and its prefix m. This parameter is needed to properly perform the support counting phase for induced patterns, since only the occurrences with dx = 1 have to be considered when joining scope lists. 5 Experimental Results All the experiments described in this section have been performed on a 2GHz Core 2 Duo processor with 2GB of main memory running on Windows Vista. POTMiner has been implemented in Java using Sun Microsystems JDK 5.0, while Zaki’s TreeMiner and SLEUTH C++ implementations were obtained from http://www.cs.rpi.edu/∼zaki/. The experiments were performed with 5 synthetic datasets generated by the tree generator available at http://www.cs.rpi.edu/∼zaki/software/TreeGen.tgz. The datasets were obtained using the generator default values and varying the number of trees from 10 to 100000. In our first experiments, we have compared POTMiner and TreeMiner / SLEUTH execution times. POTMiner and TreeMiner [8] are compared for (completely) ordered trees, while POTMiner and SLEUTH [9] are compared for unordered trees. The results we have obtained are summarized in Figure 3. Looking at the charts in Figure 3, which use a logarithmic scale for their y axis, we can see that TreeMiner, SLEUTH, and POTMiner are efficient, scalable algorithms for mining induced and embedded subtrees in ordered (TreeMiner and POTMiner) and unordered (SLEUTH and POTMiner) trees. The observed differences are probably due to the different programming platforms used in their implementation (Java for POTMiner, C++ for TreeMiner and SLEUTH). We have also performed some experiments with partially-ordered trees. In this case, since TreeMiner [8] and SLEUTH [9] cannot be applied to partiallyordered trees, we have studied the behavior of POTMiner when dealing with this kind of trees. Starting from the same datasets used in the previous experiments, we have varied the percentage of ordered nodes in the datasets. The results we have obtained are shown in Figure 4. As expected, we have found that execution times slightly decrease when the percentage of ordered nodes is increased, since ordered trees are easier to mine than unordered trees. 6 Conclusions and Future Work There are many different tree mining algorithms that work either on ordered or unordered trees, but none of them, to our knowledge, works with partiallyordered trees, that is, trees that have both ordered and unordered nodes. We have 1000 1000 Time (ms) 10000 Time (ms) 10000 100 100 10 10 1 1 10 100 1000 10000 100000 10 100 Number of trees TreeMiner SLEUTH POTMiner 10000 1000 1000 Time (ms) 10000 Time (ms) 1000 10000 Number of trees 100 10 100000 POTMiner 100 10 1 1 10 100 1000 10000 100000 10 Number of trees TreeMiner 100 1000 10000 100000 Number of trees POTMiner SLEUTH POTMiner Fig. 3. Execution times of POTMiner and Zaki’s TreeMiner and SLEUTH algorithms for extracting induced (top) and embedded (down) subtrees in completely ordered (left) and completely unordered (right) trees. 1000 1000 Time (ms) 10000 Time (ms) 10000 100 100 10 10 1 0 25 50 75 Proportion of ordered nodes (%) 10 1000 100000 100 1 0 25 50 75 Proportion of ordered nodes (%) 10 1000 100 100000 Fig. 4. POTMiner execution times varying the percentage of ordered nodes. The charts display POTMiner behavior when mining induced (left) and embedded (right) subtrees. The series show execution times for different dataset sizes (10, 1000, and 100000 trees). devised a new algorithm to address this situation that is as efficient and scalable as existing algorithms that exclusively work on either ordered or unordered trees. Partially-ordered trees are important because they appear in different application domains. In the future, we expect to apply our tree mining algorithm to some of these domains. In particular, we believe that our algorithm for identifying frequent subtrees in partially ordered trees can be useful in the following contexts: – XML documents [16], due to their hierarchical structure, are directly amenable to tree mining techniques. Since XML documents can contain both ordered and unordered sets of nodes, partially-ordered trees provide a better representation model for them and they are better suited for knowledge discovery than existing ordered (or unordered) tree mining techniques. – Multi-relational data mining [17] is another emerging research area where tree mining techniques can be useful. They might help improve existing multi-relational classification [18] and clustering [19] algorithms. – In Software Engineering, it is usually acknowledged that mining the wealth of information stored in software repositories can ”support the maintenance of software systems, improve software design/reuse, and empirically validate novel ideas and techniques” [21]. For instance, there are hierarchical program representations, such as dependence higraphs [20], that can be viewed as partially-ordered trees, hence the potential of tree mining techniques in software mining. We also have the goal of making some refinements to our POTMiner algorithm, such as extending it for dealing with partial or approximate tree matching, a feature that would be invaluable for many real-world problems, from entity resolution in XML documents to program element matching in software mining. References 1. Chi, Y., Muntz, R.R., Nijssen, S., Kok, J.N.: Frequent subtree mining - an overview. Fundamenta Informaticae 66(1-2) (2005) 161–198 2. Chi, Y., Yang, Y., Muntz, R.R.: HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical form. In: SSDBM ’04: Proceedings of the 16th International Conference on Scientific and Statistical Database Management. (2004) 11–20 3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: VLDB’1994: Proceedings of the 20th International Conference on Very Large Data Bases. (1994) 487–499 4. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data. (2000) 1–12 5. Abe, K., Kawasoe, S., Asai, T., Arimura, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. In: SDM’2002: Proceedings of the Second SIAM International Conference on Data Mining. (2002) 6. Nijssen, S., Kok, J.N.: Efficient discovery of frequent unordered trees. In: First International Workshop on Mining Graphs, Trees and Sequences (MGTS2003), in conjunction with ECML/PKDD’2003. (2003) 7. Asai, T., Arimura, H., Uno, T., ichi Nakano, S.: Discovering frequent substructures in large unordered trees. In: DS’2003, Discovery Science, LNAI 2843. (2003) 47–61 8. Zaki, M.J.: Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Transactions on Knowledge and Data Engineering 17(8) (2005) 1021–1035 9. Zaki, M.J.: Efficiently mining frequent embedded unordered trees. Fundamenta Informaticae 66(1-2) (2005) 33–52 10. Hido, S., Kawano, H.: AMIOT: induced ordered tree mining in tree-structured databases. In: ICDM ’05: Proceedings of the Fifth IEEE International Conference on Data Mining. (2005) 170–177 11. Xiao, Y., Yao, J.F., Li, Z., Dunham, M.H.: Efficient data mining for maximal frequent subtrees. In: ICDM ’03: Proceedings of the Third IEEE International Conference on Data Mining. (2003) 379–386 12. Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., Shi., B.: Efficient pattern-growth methods for frequent tree pattern mining. In: PAKDD’04: The Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining. (2004) 13. Termier, A., Rousset, M.C., Sebag, M.: TreeFinder: a first step towards xml data mining. In: ICDM ’02: Proceedings of the 2002 IEEE International Conference on Data Mining. (2002) 450–457 14. Chi, Y., Xia, Y., Yang, Y., Muntz, R.R.: Mining closed and maximal frequent subtrees from databases of labeled rooted trees. IEEE Transactions on Knowledge and Data Engineering 17(2) (2005) 190–202 15. Chi, Y., Yang, Y., Muntz, R.R.: Indexing and mining free trees. In: ICDM ’03: Proceedings of the Third IEEE International Conference on Data Mining. (2003) 509–512 16. Nayak, R., Zaki, M.J., eds.: Knowledge Discovery from XML Documents, First International Workshop, KDXD 2006, Singapore, April 9, 2006, Proceedings. In Nayak, R., Zaki, M.J., eds.: KDXD. Volume 3915 of Lecture Notes in Computer Science., Springer (2006) 17. Džeroski, S.: Multi-relational data mining: An introduction. SIGKDD Explorations Newsletter 5(1) (2003) 1–16 18. Yin, X., Han, J., Yang, J., Yu, P.S.: CrossMine: efficient classification across multiple database relations. In: ICDE ’04: Proceedings of the 20th International Conference on Data Engineering. (2004) 399–410 19. Yin, X., Han, J., Yu, P.S.: Cross-relational clustering with user’s guidance. In: KDD ’05: Proceeding of the eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (2005) 344–353 20. Berzal, F., Cubero, J.C., Jimenez, A.: Hierarchical program representation for program element matching. In: IDEAL’07: 8th International Conference on Intelligent Data Engineering and Automated Learning. (2007) 21. Gall, H., Lanza, M., Zimmermann, T.: 4th International Workshop on Mining Software Repositories (MSR 2007). In: ICSE COMPANION ’07: Companion to the proceedings of the 29th International Conference on Software Engineering. (2007) 107–108