article

Finding Frequent Patterns in a Large Sparse Graph*

Authors:

Michihiro Kuramochi,

George KarypisAuthors Info & Claims

Data Mining and Knowledge Discovery, Volume 11, Issue 3

Pages 243 - 271

https://doi.org/10.1007/s10618-005-0003-9

Published: 01 November 2005 Publication History

Abstract

Graph-based modeling has emerged as a powerful abstraction capable of capturing in a single and unified framework many of the relational, spatial, topological, and other characteristics that are present in a variety of datasets and application areas. Computationally efficient algorithms that find patterns corresponding to frequently occurring subgraphs play an important role in developing data mining-driven methodologies for analyzing the graphs resulting from such datasets. This paper presents two algorithms, based on the horizontal and vertical pattern discovery paradigms, that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods for determining the number of edge-disjoint embeddings of a subgraph and employ novel algorithms for candidate generation and frequency counting, which allow them to operate on datasets with different characteristics and to quickly prune unpromising subgraphs. Experimental evaluation on real datasets from various domains show that both algorithms achieve good performance, scale well to sparse input graphs with more than 120,000 vertices or 110,000 edges, and significantly outperform previously developed algorithms.

References

[1]

Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of 1998 ACM SIGMOD International Conference on Management of Data, pp. 94-105.

[2]

Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases (VLDB), J.B. Bocca, M. Jarke, and C. Zaniolo (Eds.), Morgan Kaufmann, pp. 487-499.

[3]

Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proc. of the 11th International Conference on Data Engineering (ICDE), P.S. Yu and A.L.P. Chen (Eds.), IEEE Press, pp. 3-14.

[4]

Berendt, B., Hotho, A., and Stumme, G. 2002. Towards semantic web mining. In International Semantic Web Conference (ISWC), pp. 264-278.

[5]

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H. et al. 2000. The protein data bank. Nucleic Acids Research, 28:235-242.

[6]

Berman, P. and Fujito, T. 1995. On the approximation properties of independent set problem in degree 3 graphs. In Proc. of Workshop on Algorithms and Data Structures, pp. 449-460.

[7]

Blake, C.L. and Merz, C.J. 1998. UCI Repository of Machine Learning Databases.

[8]

Borgelt, C. and Berthold, M.R. 2002. Mining molecular fragments: Finding relevant substructures of molecules. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 51-58.

[9]

Chew, L.P., Huttenlocher, D., Kedem, K., and Kleinberg, J. 1999. Fast detection of common geometric substructure in proteins. In Proc. of the 3rd ACM RECOMB International Conference on Computational Molecular Biology, pp. 104-114.

[10]

Cohen, M. and Gudes, E. 2004. Diagonally subgraphs pattern mining. In Proc. of the 9th ACM SIGMOD workshop on research issues in Data Mining and Knowledge Discovery DMKD '04, pp. 51-58.

[11]

Cook, D.J. and Holder, L.B. 1994. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231-255.

Digital Library

[12]

Cook, D.J. and Holder, L.B. 2000. Graph-based data mining. IEEE Intelligent Systems, 15(2):32-41.

Digital Library

[13]

Cook, D.J., Holder, L.B., and Djoko, S. 1995. Knowledge discovery from structural data. Journal of Intelligent Information Systems, 5(3):229-245.

Digital Library

[14]

Dehaspe, L., Toivonen, H., and King, R.D. 1998. Finding frequent substructures in chemical compounds. In Proc. of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-98), R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro (Eds.), AAAI Press, pp. 30-36.

[15]

De Raedt, L. and Kramer, S. 2001. The level-wise version space algorithm and its application to molecular fragment finding. In Proc. of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), pp. 853-862.

[16]

Deshpande, M., Kuramochi, M., and Karypis, G. 2002. Automated approaches for classifying structures. In Proc. of the 2nd Workshop on Data Mining in Bioinformatics (BIOKDD '02) pp. 11-18.

[17]

Deshpande, M., Kuramochi, M., and Karypis, G. 2003. Frequent sub-structure based approaches for classifying chemical compounds. In Proc. of 2003 IEEE International Conference on Data Mining (ICDM), pp. 35-42.

[18]

Feige, U., Goldwasser, S., Lovasz, L., Safra, S., and Szegedy, M. 1991. Approximating clique is almost NP-complete. In Proc. of the 32nd IEEE Symposium on Foundations of Computer Science (FOCS), pp. 2-12.

[19]

Fortin, S. 1996. The Graph Isomorphism Problem (Tech. Rep. No. TR96-20). Department of Computing Science, University of Alberta.

[20]

Garey, M.R. and Johnson, D.S. 1979. Computers and Intractability: A Guide to the Theory of np-completeness. New York: W. H. Freeman and Company.

Digital Library

[21]

Ghazizadeh, S. and Chawathe, S. 2002a. Discovering Freuqent Structures Using Summaries (Tech. Rep. No. CS-TR-4364). Department of Computer Science, University of Maryland.

[22]

Ghazizadeh, S. and Chawathe, S. 2002b. SEuS: Structure extraction using summaries. In Proc. of the 5th International Conference on Discovery Science, pp. 71-85.

[23]

Gonzalez, J., Holder, L.B. and Cook, D.J. 2001. Application of graph-based concept learning to the predictive toxicology domain. In Proc. of the Predictive Toxicology Challenge Workshop.

[24]

Grindley, H.M., Artymiuk, P.J., Rice, D.W., and Willett, P. 1993. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. Journal of Molecular Biology, 229:707- 721.

[25]

Guralnik, V. and Karypis, G. 2001. A scalabale algorithm for clustering sequence datasets. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM).

[26]

Halldórsson, M.M., and Radhakrishnan, J. 1997. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. Algorithmica, 18(1):145-163.

Digital Library

[27]

Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD International Conference on Management of Data Dallas, TX, pp. 1-12.

[28]

Hochbaum, D.S. 1983. Efficient bounds for the stable set, vertex cover, and set packing problems. Discrete Applied Mathematics, 6:243-254.

[29]

Holder, L.B., Cook, D.J., and Djoko, S. 1994. Substructure discovery in the SUBDUE system. In Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pp. 169-180.

[30]

Hong, M., Zhou, H., Wang, W., and Shi, B. 2003. An efficient algorithm of frequent connected subgraph extraction. In Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-03), Springer-Verlag, Vol. 2637, pp. 40-51.

[31]

Huan, J., Wang, W., and Prins, J. 2003. Efficient mining of frequent subgraph in the presence of isomophism. In Proc. of 2003 IEEE International Conference on Data Mining (ICDM'03), pp. 549- 552.

[32]

Inokuchi, A., Washio, T., and Motoda, H. 2000. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'00), Lyon, France, pp. 13-23.

[33]

Inokuchi, A., Washio, T., and Motoda, H. 2003. Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3):321-354.

Digital Library

[34]

Inokuchi, A., Washio, T., Nishimura, K., and Motoda, H. 2002. A Fast Algorithm for Mining Frequent Connected Subgraphs (Tech. Rep. No. RT0448). IBM Research, Tokyo Research Laboratory.

[35]

Jensen, D. and Goldberg, H. (Eds.). 1998. Artificial Intelligence and Link Analysis Papers from the 1998 Fall Symposium. AAAI Press.

[36]

Jonyer, I., Cook, D.J., and Holder, L.B. 2001a. Discovery and evaluation of graph-based hierarchical conceptual clusters. Journal of Machine Learning Research, 2:19-43.

[37]

Jonyer, I., Holder, L.B., and Cook, D.J. 2001b. Hierarchical conceptual structural clustering. International Journal on Artificial Intelligence Tools, 10(1-2):107-136.

[38]

Khanna, S., Motwani, R., Sudan, M., and Vazirani, U.V. 1994. On syntactic versus computational views of approximability. In Proc. of IEEE Symposium on Foundations of Computer Science, pp. 819-830.

[39]

Kleinberg, J.M. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604-632.

Digital Library

[40]

Kleinberg, J.M., Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A.S. 1999. The Web as a graph: Measurements, models and methods. Lecture Notes in Computer Science, 1627:1-17.

Digital Library

[41]

Ko, C. 2000. Logic induction of valid behavior specifications for intrusion detection. In IEEE Symposium on Security and Privacy (S&P), pp. 142-155.

[42]

Koch, I., Lengauer, T., and Wanke, E. 1996. An algorithm for finding maximal common subtopoloties in a set of protein structures. Journal of Computational Biology, 3(2):289-306.

[43]

Kramer, S., De Raedt, L., and Helma, C. 2001. Molecular feature mining in HIV data. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pp. 136-143.

[44]

Kuramochi, M., Deshpande, M., and Karypis, G. 2004. Data mining: Next generation challenges and future directions. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), AAAI Press (in press), pp. 315-334.

[45]

Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM), pp. 313-320.

[46]

Kuramochi, M. and Karypis, G. 2002. An Efficient Algorithm for Discovering Frequent Subgraphs (Tech. Rep. No. 02-026). University of Minnesota, Department of Computer Science.

[47]

Kuramochi, M. and Karypis, G. 2004a. An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1038-1051.

Digital Library

[48]

Kuramochi, M. and Karypis, G. 2004b. Finding frequent patterns in a large sparse graph. In Proc. of the 2004 SIAM International Conference on Data Mining (SDM04).

[49]

Lee, W. and Stolfo, S.J. 2000. A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security, 3(4):227-261.

Digital Library

[50]

Leibowitz, N., Fligelman, Z.Y., Nussinov, R., and Wolfson, H.J. 1999. Multiple structural alignment and core detection by geometric hashing. In Proc. of the 7th international conference on intelligent systems in molecular biology, Heidelberg: Germany, pp. 169-177.

[51]

Leibowitz, N., Nussinov,R., and Wolfson, H.J. 2001. MUSTA--a general, efficient, automated method for multiple structure alignment and detection of common motifs: Application to proteins. Journal of Computational Biology, 8(2):93-121.

[52]

Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM), pp. 369-376.

[53]

Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proc. of the 4th Internation Conference on Knowledge Discovery and Data Mining (kdd-98), pp. 80-86.

[54]

McKay, B.D. (n.d.). Nauty users guide. http://cs.anu.edu.au/~bdm/nauty/

[55]

McKay, B.D. 1981. Practical graph isomorphism. Congressus Numerantium, 30:45-87.

[56]

Mitchell, E.M., Artymiuk, P.J., Rice, D.W., and Willett, P. 1989. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. Journal of Molecular Biology, 212:151-166.

[57]

Mooney, R.J., Melville, P., Tang, L.R., Shavlik, J., Castro Dutra, I. de, Page, D. et al. 2004. Relational data mining with inductive logic programming for link discovery. In AAAI Press/The MIT Press, pp. 239-255.

[58]

Muggleton, S.H. 1999. Scientific knowledge discovery using Inductive Logic Programming. Communications of the ACM, 42(11):42-46.

Digital Library

[59]

Östergård, P.R.J. 2002. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, 120:195-205.

Digital Library

[60]

Palmer, C.R., Gibbons, P.B., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proc. of the 8th ACM SIGKDD Internal Conference on Knowlege Discovery and Data Mining (KDD-2002) Edmonton, AB, Canada, pp. 81-90.

[61]

Pennec, X. and Ayache, N. 1998. A geometric algorithm to find small but highly simialar 3D substructures in proteins. Bioinformatics, 14(6):516-522.

[62]

Raymond, J.W. 2002. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J. Chem. Inf. Comput. Sci., 42:305-316.

[63]

Read, R.C. and Corneil, D.G. 1977. The graph isomorph disease. Journal of Graph Theory, 1:339-363.

[64]

Robson, J.M. 1986. Algorithms for maximum independent sets. Journal of Algorithms, 7:425-440.

[65]

Srinivasan, A., King, R.D., Muggleton, S.H., and Sternberg, M.J.E. 1997. Carcinogenesis predictions using ILP. In Proc. of the 7th International Workshop on Inductive Logic Programming S. D¿eroski and N. Lavra¿ (Eds.), Springer-Verlag, (Vol. 1297, pp. 273-287).

[66]

Vanetik, N., Gudes, E., and Shimony, S.E. 2002. Computing frequent graph patterns from semistructured data. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 458-465.

[67]

Wang, X., Wang, J.T.L., Shasha, D., Shapiro, B.A., Rigoutsos, I., and Zhang, K. 2002. Finding patterns in three dimensional graphs: Algorithms and applications to scientific data mining. IEEE Transactions on Knowledge and Data Engineering, 14(4):731-749.

Digital Library

[68]

Wasserman, S., Faust, K., and Iacobucci, D. 1994. Social network analysis : Methods and applications. Cambridge University Press.

[69]

Yan, X. and Han, J. 2002. gSpan: Graph-based substructure pattern mining. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 721-724.

[70]

Yan, X. and Han, J. 2003. CloseGraph: Mining closed frequent graph patterns. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 286-295.

[71]

Yoshida, K. and Motoda, H. 1995. CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1):63-92.

Digital Library

[72]

Yoshida, K., Motoda, H., and Indurkhya, N. 1994. Graph-based induction as a unified learning framework. Journal of Applied Intelligence, 4:297-328.

[73]

Zaki, M.J. and Gouda, K. 2003. Fast vertical mining using diffsets. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 326-335.

Cited By

Jazayeri AYang C(2024)Frequent Pattern Mining in Continuous-Time Temporal NetworksIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.332479946:1(305-321)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TPAMI.2023.3324799
Nguyen TNguyen TNguyen TYin HNguyen TJo JNguyen Q(2023)Isomorphic Graph Embedding for Progressive Maximal Frequent Subgraph MiningACM Transactions on Intelligent Systems and Technology10.1145/363063515:1(1-26)Online publication date: 19-Dec-2023
https://dl.acm.org/doi/10.1145/3630635
Preti GDe Francisci Morales GRiondato M(2023)MaNIACS: Approximate Mining of Frequent Subgraph Patterns through SamplingACM Transactions on Intelligent Systems and Technology10.1145/358725414:3(1-29)Online publication date: 13-Apr-2023
https://dl.acm.org/doi/10.1145/3587254
Show More Cited By

Index Terms

Finding Frequent Patterns in a Large Sparse Graph*

Recommendations

Mining of Frequent Externally Extensible Outerplanar Graph Patterns
ICMLA '08: Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications

An outerplanar graph is a planar graph which can be embedded in the plane in such a way that all of vertices lie on the outer boundary. Many chemical compounds are known to be expressed by outerplanar graphs. In this paper, firstly, we introduce an ...
Learning block-preserving graph patterns and its application to data mining

Recently, due to the rapid growth of electronic data having graph structures such as HTML and XML texts and chemical compounds, many researchers have been interested in data mining and machine learning techniques for finding useful patterns from graph-...
Mining of frequent block preserving outerplanar graph structured patterns
ILP'07: Proceedings of the 17th international conference on Inductive logic programming

An outerplanar graph is a planar graph which can be embedded in the plane in such a way that all of vertices lie on the outer boundary. Many semi-structured data like the NCI dataset having about 250,000 chemical compounds can be expressed by ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Data Mining and Knowledge Discovery

Data Mining and Knowledge Discovery Volume 11, Issue 3

November 2005

108 pages

ISSN:1384-5810

Issue’s Table of Contents

Copyright © Copyright © 2005 Springer Science+Business Media, Inc.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 November 2005

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

88
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 17 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Jazayeri AYang C(2024)Frequent Pattern Mining in Continuous-Time Temporal NetworksIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.332479946:1(305-321)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TPAMI.2023.3324799
Nguyen TNguyen TNguyen TYin HNguyen TJo JNguyen Q(2023)Isomorphic Graph Embedding for Progressive Maximal Frequent Subgraph MiningACM Transactions on Intelligent Systems and Technology10.1145/363063515:1(1-26)Online publication date: 19-Dec-2023
https://dl.acm.org/doi/10.1145/3630635
Preti GDe Francisci Morales GRiondato M(2023)MaNIACS: Approximate Mining of Frequent Subgraph Patterns through SamplingACM Transactions on Intelligent Systems and Technology10.1145/358725414:3(1-29)Online publication date: 13-Apr-2023
https://dl.acm.org/doi/10.1145/3587254
Le NVo BYun ULe B(2023)A novel approach to discover frequent weighted subgraphs using the average measureApplied Intelligence10.1007/s10489-023-04501-y53:16(19491-19504)Online publication date: 8-Mar-2023
https://dl.acm.org/doi/10.1007/s10489-023-04501-y
Wu YHuang JWu DJensen CLu K(2023)Mining Frequent Geo-Subgraphs in a Knowledge GraphWeb and Big Data10.1007/978-981-97-2303-4_2(16-31)Online publication date: 6-Oct-2023
https://dl.acm.org/doi/10.1007/978-981-97-2303-4_2
Liu MLi P(2022)SATMargin: Practical Maximal Frequent Subgraph Mining via Margin Space SamplingProceedings of the ACM Web Conference 202210.1145/3485447.3512196(1495-1505)Online publication date: 25-Apr-2022
https://dl.acm.org/doi/10.1145/3485447.3512196
Nguyen LNguyen LVo BZelinka ILin JYun UNguyen H(2022)An efficient and scalable approach for mining subgraphs in a single large graphApplied Intelligence10.1007/s10489-022-03164-552:15(17881-17895)Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.1007/s10489-022-03164-5
Banerjee S(2022)A survey on mining and analysis of uncertain graphsKnowledge and Information Systems10.1007/s10115-022-01681-w64:7(1653-1689)Online publication date: 1-Jul-2022
https://dl.acm.org/doi/10.1007/s10115-022-01681-w
Nguyen LLe NNguyen HPham TVo B(2022)Frequent Closed Subgraph Mining: A Multi-thread ApproachIntelligent Information and Database Systems10.1007/978-3-031-21743-2_6(64-77)Online publication date: 28-Nov-2022
https://dl.acm.org/doi/10.1007/978-3-031-21743-2_6
Kim KKim HFletcher GHan WLi GLi ZIdreos SSrivastava D(2021)Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality EstimationProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457246(964-976)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457246
Show More Cited By

View Options

View options

Figures

Tables

Media

View Issue’s Table of Contents