Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2815400.2815410acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article

Arabesque: a system for distributed graph mining

Published: 04 October 2015 Publication History

Abstract

Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics.
In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.

Supplementary Material

MP4 File (p425.mp4)

References

[1]
Aggarwal, C. C., and Wang, H. Managing and mining graph data. Springer, 2010.
[2]
Anchuri, P., Zaki, M. J., Barkol, O., Golan, S., and Shamy, M. Approximate graph mining with label costs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013).
[3]
Avery, C. Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit (2011).
[4]
Bahmani, B., Kumar, R., and Vassilvitskii, S. Densest subgraph in streaming and MapReduce. Proceedings of the VLDB Endowment 5, 5 (2012).
[5]
Bhuiyan, M. A., and Hasan, M. An iterative MapReduce based frequent subgraph mining algorithm. IEEE Transactions on Knowledge and Data Engineering 27, 3 (March 2015).
[6]
Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30, 1 (1998).
[7]
Bringmann, B., and Nijssen, S. What is frequent in a single graph? In Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2008), Springer-Verlag.
[8]
Bron, C., and Kerbosch, J. Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM 16, 9 (1973).
[9]
Cheng, J., Zhu, L., Ke, Y., and Chu, S. Fast algorithms for maximal clique enumeration with limited memory. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (2012).
[10]
Cheng, X., Dale, C., and Liu, J. Dataset for "statistics and social network of youtube videos". http://netsg.cs.sfu.ca/youtubedata/.
[11]
Cordella, L., Foggia, P., Sansone, C., and Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 10 (2004).
[12]
Dahlhaus, E., and Karpinski, M. A fast parallel algorithm for computing all maximal cliques in a graph and the related problems. In Proceedings of SWAT, Lecture Notes in Computer Science. Springer, 1988.
[13]
Di Fatta, G., and Berthold, M. R. Dynamic load balancing for the distributed mining of molecular structures. IEEE Transactions on Parallel and Distributed Systems 17, 8 (2006).
[14]
Elseidy, M., Abdelhamid, E., Skiadopoulos, S., and Kalnis, P. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment (2014).
[15]
Eppstein, D., and Strash, D. Listing all maximal cliques in large sparse real-world graphs. In Experimental Algorithms. Springer, 2011.
[16]
Garey, M. R., and Johnson, D. S. Computers and intractability, vol. 29. WH Freeman, 2002.
[17]
Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (2012).
[18]
H., H. B., Jaffe, A. B., and Trajtenberg, M. The NBER patent citation data file: Lessons, insights and methodological tools. http://www.nber.org/patents/, 2001.
[19]
Hill, S., Srichandan, B., and Sunderraman, R. An iterative MapReduce approach to frequent subgraph mining in biological datasets. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (2012).
[20]
Junttila, T., and Kaski, P. Engineering an efficient canonical labeling tool for large and sparse graphs. In Proceedings of the SIAM Workshop on Algorithm Engineering and Experiments (2007).
[21]
Kessl, R., Talukder, N., Anchuri, P., and Zaki, M. J. Parallel graph mining with GPUs. In Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (2014).
[22]
Kuramochi, M., and Karypis, G. Finding frequent patterns in a large sparse graph. Data mining and knowledge discovery 11, 3 (2005).
[23]
Lin, W., Xiao, X., and Ghinita, G. Large-scale frequent subgraph mining in MapReduce. In Proceedings of the IEEE International Conference on Data Engineering (2014).
[24]
Lu, W., Chen, G., Tung, A. K., and Zhao, F. Efficiently extracting frequent subgraphs using MapReduce. In Proceedings of the IEEE International Conference on Big Data (2013).
[25]
Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2010).
[26]
McCune, R. R., Weninger, T., and Madey, G. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. arXiv:1507.04405 (2015).
[27]
McSherry, F., Isard, M., and Murray, D. G. Scalability! but at what cost. In Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (2015).
[28]
Mejova, Y., Haddadi, H., Noulas, A., and Weber, I. #Foodporn: Obesity patterns in culinary interactions. In Proceedings of the ACM conference on Digital Health 2015 (2015).
[29]
Oliveira Aparicio, D., Pinto Ribeiro, P. M., and Da Silva, F. M. A. Parallel subgraph counting for multicore architectures. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (2014).
[30]
Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007).
[31]
Ribeiro, P., and Silva, F. G-Tries: A data structure for storing and finding subgraphs. Data Mining and Knowledge Discovery 28, 2 (2014).
[32]
Schmidt, M. C., Samatova, N. F., Thomas, K., and Park, B.-H. A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing 69, 4 (2009).
[33]
Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., and Xu, N. Parallel subgraph listing in a large-scale graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2014).
[34]
Slota, G. M., and Madduri, K. Complex network analysis using parallel approximate motif counting. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2014).
[35]
Teixeira, C. H. C., Fonseca, A. J., Serafini, M., Siganos, G., Zaki, M. J., and Aboulnaga, A. Arabesque: A system for distributed graph mining - Extended version. Technical Report, Qatar Computing Research Institute, 2015.
[36]
Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., and McPherson, J. From "think like a vertex" to "think like a graph". Proceedings of the VLDB Endowment 7, 3 (2013).
[37]
Uno, T. MACE: Maximal clique enumerator, ver 2.0. http://research.nii.ac.jp/~uno/code/mace.html.
[38]
Valiant, L. G. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990).
[39]
Wang, C., and Parthasarathy, S. Parallel algorithms for mining frequent structural motifs in scientific data. In Proceedings of the Annual International Conference on Supercomputing (2004).
[40]
Wu, B., Yang, S., Zhao, H., and Wang, B. A distributed algorithm to enumerate all maximal cliques in MapReduce. In Proceedings of the International Conference on Frontier of Computer Science and Technology (2009).
[41]
Xiang, J., Guo, C., and Aboulnaga, A. Scalable maximum clique computation using MapReduce. In Proceedings of the IEEE International Coference on Data Engineering (2013).
[42]
Yan, D., Cheng, J., Lu, Y., and Ng, W. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment 7, 14 (2014).
[43]
Yan, X., and Han, J. gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining (2002).
[44]
Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. A., and Marathe, M. V. Sahad: Subgraph analysis in massive networks using Hadoop. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2012).

Cited By

View all
  • (2024)A Survey of Distributed Graph Algorithms on Massive GraphsACM Computing Surveys10.1145/3694966Online publication date: 5-Sep-2024
  • (2024)A Comprehensive Survey and Experimental Study of Subgraph Matching: Trends, Unbiasedness, and InteractionProceedings of the ACM on Management of Data10.1145/36393152:1(1-29)Online publication date: 26-Mar-2024
  • (2024)Systems for Scalable Graph Analytics and Machine Learning: Trends and MethodsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671472(6627-6632)Online publication date: 25-Aug-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
October 2015
499 pages
ISBN:9781450338349
DOI:10.1145/2815400
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 October 2015

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

SOSP '15
Sponsor:

Acceptance Rates

SOSP '15 Paper Acceptance Rate 30 of 181 submissions, 17%;
Overall Acceptance Rate 131 of 716 submissions, 18%

Upcoming Conference

SOSP '25
ACM SIGOPS 31st Symposium on Operating Systems Principles
October 13 - 16, 2025
Seoul , Republic of Korea

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)77
  • Downloads (Last 6 weeks)17
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Survey of Distributed Graph Algorithms on Massive GraphsACM Computing Surveys10.1145/3694966Online publication date: 5-Sep-2024
  • (2024)A Comprehensive Survey and Experimental Study of Subgraph Matching: Trends, Unbiasedness, and InteractionProceedings of the ACM on Management of Data10.1145/36393152:1(1-29)Online publication date: 26-Mar-2024
  • (2024)Systems for Scalable Graph Analytics and Machine Learning: Trends and MethodsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671472(6627-6632)Online publication date: 25-Aug-2024
  • (2024)Exploiting Fine-Grained Redundancy in Set-Centric Graph Pattern MiningProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638507(175-187)Online publication date: 2-Mar-2024
  • (2024)A Survey on Concurrent Processing of Graph Analytical Queries: Systems and AlgorithmsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.339393636:11(5508-5528)Online publication date: Nov-2024
  • (2024) G 2 -AIMD: A Memory-Efficient Subgraph-Centric Framework for Efficient Subgraph Finding on GPUs 2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00245(3164-3177)Online publication date: 13-May-2024
  • (2024)Faster Depth-First Subgraph Matching on GPUs2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00244(3151-3163)Online publication date: 13-May-2024
  • (2024)DuMato: An Efficient Warp-Centric Subgraph Enumeration System for GPUJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104903(104903)Online publication date: Apr-2024
  • (2024)Balanced parallel triangle enumeration with an adaptive algorithmDistributed and Parallel Databases10.1007/s10619-023-07437-x42:1(103-141)Online publication date: 1-Mar-2024
  • (2024)An effective algorithm for genealogical graph partitioningApplied Intelligence10.1007/s10489-023-05265-154:2(1798-1817)Online publication date: 22-Jan-2024
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media