Nothing Special   »   [go: up one dir, main page]

skip to main content
article

Finding Frequent Patterns in a Large Sparse Graph*

Published: 01 November 2005 Publication History

Abstract

Graph-based modeling has emerged as a powerful abstraction capable of capturing in a single and unified framework many of the relational, spatial, topological, and other characteristics that are present in a variety of datasets and application areas. Computationally efficient algorithms that find patterns corresponding to frequently occurring subgraphs play an important role in developing data mining-driven methodologies for analyzing the graphs resulting from such datasets. This paper presents two algorithms, based on the horizontal and vertical pattern discovery paradigms, that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods for determining the number of edge-disjoint embeddings of a subgraph and employ novel algorithms for candidate generation and frequency counting, which allow them to operate on datasets with different characteristics and to quickly prune unpromising subgraphs. Experimental evaluation on real datasets from various domains show that both algorithms achieve good performance, scale well to sparse input graphs with more than 120,000 vertices or 110,000 edges, and significantly outperform previously developed algorithms.

References

[1]
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of 1998 ACM SIGMOD International Conference on Management of Data, pp. 94-105.
[2]
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases (VLDB), J.B. Bocca, M. Jarke, and C. Zaniolo (Eds.), Morgan Kaufmann, pp. 487-499.
[3]
Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proc. of the 11th International Conference on Data Engineering (ICDE), P.S. Yu and A.L.P. Chen (Eds.), IEEE Press, pp. 3-14.
[4]
Berendt, B., Hotho, A., and Stumme, G. 2002. Towards semantic web mining. In International Semantic Web Conference (ISWC), pp. 264-278.
[5]
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H. et al. 2000. The protein data bank. Nucleic Acids Research, 28:235-242.
[6]
Berman, P. and Fujito, T. 1995. On the approximation properties of independent set problem in degree 3 graphs. In Proc. of Workshop on Algorithms and Data Structures, pp. 449-460.
[7]
Blake, C.L. and Merz, C.J. 1998. UCI Repository of Machine Learning Databases.
[8]
Borgelt, C. and Berthold, M.R. 2002. Mining molecular fragments: Finding relevant substructures of molecules. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 51-58.
[9]
Chew, L.P., Huttenlocher, D., Kedem, K., and Kleinberg, J. 1999. Fast detection of common geometric substructure in proteins. In Proc. of the 3rd ACM RECOMB International Conference on Computational Molecular Biology, pp. 104-114.
[10]
Cohen, M. and Gudes, E. 2004. Diagonally subgraphs pattern mining. In Proc. of the 9th ACM SIGMOD workshop on research issues in Data Mining and Knowledge Discovery DMKD '04, pp. 51-58.
[11]
Cook, D.J. and Holder, L.B. 1994. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231-255.
[12]
Cook, D.J. and Holder, L.B. 2000. Graph-based data mining. IEEE Intelligent Systems, 15(2):32-41.
[13]
Cook, D.J., Holder, L.B., and Djoko, S. 1995. Knowledge discovery from structural data. Journal of Intelligent Information Systems, 5(3):229-245.
[14]
Dehaspe, L., Toivonen, H., and King, R.D. 1998. Finding frequent substructures in chemical compounds. In Proc. of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-98), R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro (Eds.), AAAI Press, pp. 30-36.
[15]
De Raedt, L. and Kramer, S. 2001. The level-wise version space algorithm and its application to molecular fragment finding. In Proc. of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), pp. 853-862.
[16]
Deshpande, M., Kuramochi, M., and Karypis, G. 2002. Automated approaches for classifying structures. In Proc. of the 2nd Workshop on Data Mining in Bioinformatics (BIOKDD '02) pp. 11-18.
[17]
Deshpande, M., Kuramochi, M., and Karypis, G. 2003. Frequent sub-structure based approaches for classifying chemical compounds. In Proc. of 2003 IEEE International Conference on Data Mining (ICDM), pp. 35-42.
[18]
Feige, U., Goldwasser, S., Lovasz, L., Safra, S., and Szegedy, M. 1991. Approximating clique is almost NP-complete. In Proc. of the 32nd IEEE Symposium on Foundations of Computer Science (FOCS), pp. 2-12.
[19]
Fortin, S. 1996. The Graph Isomorphism Problem (Tech. Rep. No. TR96-20). Department of Computing Science, University of Alberta.
[20]
Garey, M.R. and Johnson, D.S. 1979. Computers and Intractability: A Guide to the Theory of np-completeness. New York: W. H. Freeman and Company.
[21]
Ghazizadeh, S. and Chawathe, S. 2002a. Discovering Freuqent Structures Using Summaries (Tech. Rep. No. CS-TR-4364). Department of Computer Science, University of Maryland.
[22]
Ghazizadeh, S. and Chawathe, S. 2002b. SEuS: Structure extraction using summaries. In Proc. of the 5th International Conference on Discovery Science, pp. 71-85.
[23]
Gonzalez, J., Holder, L.B. and Cook, D.J. 2001. Application of graph-based concept learning to the predictive toxicology domain. In Proc. of the Predictive Toxicology Challenge Workshop.
[24]
Grindley, H.M., Artymiuk, P.J., Rice, D.W., and Willett, P. 1993. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. Journal of Molecular Biology, 229:707- 721.
[25]
Guralnik, V. and Karypis, G. 2001. A scalabale algorithm for clustering sequence datasets. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM).
[26]
Halldórsson, M.M., and Radhakrishnan, J. 1997. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. Algorithmica, 18(1):145-163.
[27]
Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD International Conference on Management of Data Dallas, TX, pp. 1-12.
[28]
Hochbaum, D.S. 1983. Efficient bounds for the stable set, vertex cover, and set packing problems. Discrete Applied Mathematics, 6:243-254.
[29]
Holder, L.B., Cook, D.J., and Djoko, S. 1994. Substructure discovery in the SUBDUE system. In Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pp. 169-180.
[30]
Hong, M., Zhou, H., Wang, W., and Shi, B. 2003. An efficient algorithm of frequent connected subgraph extraction. In Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-03), Springer-Verlag, Vol. 2637, pp. 40-51.
[31]
Huan, J., Wang, W., and Prins, J. 2003. Efficient mining of frequent subgraph in the presence of isomophism. In Proc. of 2003 IEEE International Conference on Data Mining (ICDM'03), pp. 549- 552.
[32]
Inokuchi, A., Washio, T., and Motoda, H. 2000. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'00), Lyon, France, pp. 13-23.
[33]
Inokuchi, A., Washio, T., and Motoda, H. 2003. Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3):321-354.
[34]
Inokuchi, A., Washio, T., Nishimura, K., and Motoda, H. 2002. A Fast Algorithm for Mining Frequent Connected Subgraphs (Tech. Rep. No. RT0448). IBM Research, Tokyo Research Laboratory.
[35]
Jensen, D. and Goldberg, H. (Eds.). 1998. Artificial Intelligence and Link Analysis Papers from the 1998 Fall Symposium. AAAI Press.
[36]
Jonyer, I., Cook, D.J., and Holder, L.B. 2001a. Discovery and evaluation of graph-based hierarchical conceptual clusters. Journal of Machine Learning Research, 2:19-43.
[37]
Jonyer, I., Holder, L.B., and Cook, D.J. 2001b. Hierarchical conceptual structural clustering. International Journal on Artificial Intelligence Tools, 10(1-2):107-136.
[38]
Khanna, S., Motwani, R., Sudan, M., and Vazirani, U.V. 1994. On syntactic versus computational views of approximability. In Proc. of IEEE Symposium on Foundations of Computer Science, pp. 819-830.
[39]
Kleinberg, J.M. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604-632.
[40]
Kleinberg, J.M., Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A.S. 1999. The Web as a graph: Measurements, models and methods. Lecture Notes in Computer Science, 1627:1-17.
[41]
Ko, C. 2000. Logic induction of valid behavior specifications for intrusion detection. In IEEE Symposium on Security and Privacy (S&P), pp. 142-155.
[42]
Koch, I., Lengauer, T., and Wanke, E. 1996. An algorithm for finding maximal common subtopoloties in a set of protein structures. Journal of Computational Biology, 3(2):289-306.
[43]
Kramer, S., De Raedt, L., and Helma, C. 2001. Molecular feature mining in HIV data. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pp. 136-143.
[44]
Kuramochi, M., Deshpande, M., and Karypis, G. 2004. Data mining: Next generation challenges and future directions. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), AAAI Press (in press), pp. 315-334.
[45]
Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM), pp. 313-320.
[46]
Kuramochi, M. and Karypis, G. 2002. An Efficient Algorithm for Discovering Frequent Subgraphs (Tech. Rep. No. 02-026). University of Minnesota, Department of Computer Science.
[47]
Kuramochi, M. and Karypis, G. 2004a. An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1038-1051.
[48]
Kuramochi, M. and Karypis, G. 2004b. Finding frequent patterns in a large sparse graph. In Proc. of the 2004 SIAM International Conference on Data Mining (SDM04).
[49]
Lee, W. and Stolfo, S.J. 2000. A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security, 3(4):227-261.
[50]
Leibowitz, N., Fligelman, Z.Y., Nussinov, R., and Wolfson, H.J. 1999. Multiple structural alignment and core detection by geometric hashing. In Proc. of the 7th international conference on intelligent systems in molecular biology, Heidelberg: Germany, pp. 169-177.
[51]
Leibowitz, N., Nussinov,R., and Wolfson, H.J. 2001. MUSTA--a general, efficient, automated method for multiple structure alignment and detection of common motifs: Application to proteins. Journal of Computational Biology, 8(2):93-121.
[52]
Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM), pp. 369-376.
[53]
Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proc. of the 4th Internation Conference on Knowledge Discovery and Data Mining (kdd-98), pp. 80-86.
[54]
McKay, B.D. (n.d.). Nauty users guide. http://cs.anu.edu.au/~bdm/nauty/
[55]
McKay, B.D. 1981. Practical graph isomorphism. Congressus Numerantium, 30:45-87.
[56]
Mitchell, E.M., Artymiuk, P.J., Rice, D.W., and Willett, P. 1989. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. Journal of Molecular Biology, 212:151-166.
[57]
Mooney, R.J., Melville, P., Tang, L.R., Shavlik, J., Castro Dutra, I. de, Page, D. et al. 2004. Relational data mining with inductive logic programming for link discovery. In AAAI Press/The MIT Press, pp. 239-255.
[58]
Muggleton, S.H. 1999. Scientific knowledge discovery using Inductive Logic Programming. Communications of the ACM, 42(11):42-46.
[59]
Östergård, P.R.J. 2002. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, 120:195-205.
[60]
Palmer, C.R., Gibbons, P.B., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proc. of the 8th ACM SIGKDD Internal Conference on Knowlege Discovery and Data Mining (KDD-2002) Edmonton, AB, Canada, pp. 81-90.
[61]
Pennec, X. and Ayache, N. 1998. A geometric algorithm to find small but highly simialar 3D substructures in proteins. Bioinformatics, 14(6):516-522.
[62]
Raymond, J.W. 2002. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J. Chem. Inf. Comput. Sci., 42:305-316.
[63]
Read, R.C. and Corneil, D.G. 1977. The graph isomorph disease. Journal of Graph Theory, 1:339-363.
[64]
Robson, J.M. 1986. Algorithms for maximum independent sets. Journal of Algorithms, 7:425-440.
[65]
Srinivasan, A., King, R.D., Muggleton, S.H., and Sternberg, M.J.E. 1997. Carcinogenesis predictions using ILP. In Proc. of the 7th International Workshop on Inductive Logic Programming S. D¿eroski and N. Lavra¿ (Eds.), Springer-Verlag, (Vol. 1297, pp. 273-287).
[66]
Vanetik, N., Gudes, E., and Shimony, S.E. 2002. Computing frequent graph patterns from semistructured data. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 458-465.
[67]
Wang, X., Wang, J.T.L., Shasha, D., Shapiro, B.A., Rigoutsos, I., and Zhang, K. 2002. Finding patterns in three dimensional graphs: Algorithms and applications to scientific data mining. IEEE Transactions on Knowledge and Data Engineering, 14(4):731-749.
[68]
Wasserman, S., Faust, K., and Iacobucci, D. 1994. Social network analysis : Methods and applications. Cambridge University Press.
[69]
Yan, X. and Han, J. 2002. gSpan: Graph-based substructure pattern mining. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 721-724.
[70]
Yan, X. and Han, J. 2003. CloseGraph: Mining closed frequent graph patterns. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 286-295.
[71]
Yoshida, K. and Motoda, H. 1995. CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1):63-92.
[72]
Yoshida, K., Motoda, H., and Indurkhya, N. 1994. Graph-based induction as a unified learning framework. Journal of Applied Intelligence, 4:297-328.
[73]
Zaki, M.J. and Gouda, K. 2003. Fast vertical mining using diffsets. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 326-335.

Cited By

View all
  • (2024)Frequent Pattern Mining in Continuous-Time Temporal NetworksIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.332479946:1(305-321)Online publication date: 1-Jan-2024
  • (2023)Isomorphic Graph Embedding for Progressive Maximal Frequent Subgraph MiningACM Transactions on Intelligent Systems and Technology10.1145/363063515:1(1-26)Online publication date: 19-Dec-2023
  • (2023)MaNIACS: Approximate Mining of Frequent Subgraph Patterns through SamplingACM Transactions on Intelligent Systems and Technology10.1145/358725414:3(1-29)Online publication date: 13-Apr-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Data Mining and Knowledge Discovery
Data Mining and Knowledge Discovery  Volume 11, Issue 3
November 2005
108 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 November 2005

Author Tags

  1. frequent subgraph
  2. graph mining
  3. pattern discovery

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 17 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Frequent Pattern Mining in Continuous-Time Temporal NetworksIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.332479946:1(305-321)Online publication date: 1-Jan-2024
  • (2023)Isomorphic Graph Embedding for Progressive Maximal Frequent Subgraph MiningACM Transactions on Intelligent Systems and Technology10.1145/363063515:1(1-26)Online publication date: 19-Dec-2023
  • (2023)MaNIACS: Approximate Mining of Frequent Subgraph Patterns through SamplingACM Transactions on Intelligent Systems and Technology10.1145/358725414:3(1-29)Online publication date: 13-Apr-2023
  • (2023)A novel approach to discover frequent weighted subgraphs using the average measureApplied Intelligence10.1007/s10489-023-04501-y53:16(19491-19504)Online publication date: 8-Mar-2023
  • (2023)Mining Frequent Geo-Subgraphs in a Knowledge GraphWeb and Big Data10.1007/978-981-97-2303-4_2(16-31)Online publication date: 6-Oct-2023
  • (2022)SATMargin: Practical Maximal Frequent Subgraph Mining via Margin Space SamplingProceedings of the ACM Web Conference 202210.1145/3485447.3512196(1495-1505)Online publication date: 25-Apr-2022
  • (2022)An efficient and scalable approach for mining subgraphs in a single large graphApplied Intelligence10.1007/s10489-022-03164-552:15(17881-17895)Online publication date: 1-Dec-2022
  • (2022)A survey on mining and analysis of uncertain graphsKnowledge and Information Systems10.1007/s10115-022-01681-w64:7(1653-1689)Online publication date: 1-Jul-2022
  • (2022)Frequent Closed Subgraph Mining: A Multi-thread ApproachIntelligent Information and Database Systems10.1007/978-3-031-21743-2_6(64-77)Online publication date: 28-Nov-2022
  • (2021)Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality EstimationProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457246(964-976)Online publication date: 9-Jun-2021
  • Show More Cited By

View Options

View options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media