research-article

Arabesque: a system for distributed graph mining

Authors:

Carlos H. C. Teixeira,

Alexandre J. Fonseca,

Marco Serafini,

Georgos Siganos,

Mohammed J. Zaki,

Ashraf AboulnagaAuthors Info & Claims

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

Pages 425 - 440

https://doi.org/10.1145/2815400.2815410

Published: 04 October 2015 Publication History

Abstract

Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics.

In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.

Supplementary Material

MP4 File (p425.mp4)

Download
1942.07 MB

References

[1]

Aggarwal, C. C., and Wang, H. Managing and mining graph data. Springer, 2010.

Digital Library

[2]

Anchuri, P., Zaki, M. J., Barkol, O., Golan, S., and Shamy, M. Approximate graph mining with label costs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013).

Digital Library

[3]

Avery, C. Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit (2011).

[4]

Bahmani, B., Kumar, R., and Vassilvitskii, S. Densest subgraph in streaming and MapReduce. Proceedings of the VLDB Endowment 5, 5 (2012).

Digital Library

[5]

Bhuiyan, M. A., and Hasan, M. An iterative MapReduce based frequent subgraph mining algorithm. IEEE Transactions on Knowledge and Data Engineering 27, 3 (March 2015).

[6]

Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30, 1 (1998).

Digital Library

[7]

Bringmann, B., and Nijssen, S. What is frequent in a single graph? In Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2008), Springer-Verlag.

Digital Library

[8]

Bron, C., and Kerbosch, J. Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM 16, 9 (1973).

Digital Library

[9]

Cheng, J., Zhu, L., Ke, Y., and Chu, S. Fast algorithms for maximal clique enumeration with limited memory. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (2012).

Digital Library

[10]

Cheng, X., Dale, C., and Liu, J. Dataset for "statistics and social network of youtube videos". http://netsg.cs.sfu.ca/youtubedata/.

[11]

Cordella, L., Foggia, P., Sansone, C., and Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 10 (2004).

Digital Library

[12]

Dahlhaus, E., and Karpinski, M. A fast parallel algorithm for computing all maximal cliques in a graph and the related problems. In Proceedings of SWAT, Lecture Notes in Computer Science. Springer, 1988.

Digital Library

[13]

Di Fatta, G., and Berthold, M. R. Dynamic load balancing for the distributed mining of molecular structures. IEEE Transactions on Parallel and Distributed Systems 17, 8 (2006).

Digital Library

[14]

Elseidy, M., Abdelhamid, E., Skiadopoulos, S., and Kalnis, P. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment (2014).

Digital Library

[15]

Eppstein, D., and Strash, D. Listing all maximal cliques in large sparse real-world graphs. In Experimental Algorithms. Springer, 2011.

Digital Library

[16]

Garey, M. R., and Johnson, D. S. Computers and intractability, vol. 29. WH Freeman, 2002.

[17]

Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (2012).

Digital Library

[18]

H., H. B., Jaffe, A. B., and Trajtenberg, M. The NBER patent citation data file: Lessons, insights and methodological tools. http://www.nber.org/patents/, 2001.

[19]

Hill, S., Srichandan, B., and Sunderraman, R. An iterative MapReduce approach to frequent subgraph mining in biological datasets. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (2012).

Digital Library

[20]

Junttila, T., and Kaski, P. Engineering an efficient canonical labeling tool for large and sparse graphs. In Proceedings of the SIAM Workshop on Algorithm Engineering and Experiments (2007).

Digital Library

[21]

Kessl, R., Talukder, N., Anchuri, P., and Zaki, M. J. Parallel graph mining with GPUs. In Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (2014).

[22]

Kuramochi, M., and Karypis, G. Finding frequent patterns in a large sparse graph. Data mining and knowledge discovery 11, 3 (2005).

Digital Library

[23]

Lin, W., Xiao, X., and Ghinita, G. Large-scale frequent subgraph mining in MapReduce. In Proceedings of the IEEE International Conference on Data Engineering (2014).

[24]

Lu, W., Chen, G., Tung, A. K., and Zhao, F. Efficiently extracting frequent subgraphs using MapReduce. In Proceedings of the IEEE International Conference on Big Data (2013).

[25]

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2010).

Digital Library

[26]

McCune, R. R., Weninger, T., and Madey, G. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. arXiv:1507.04405 (2015).

[27]

McSherry, F., Isard, M., and Murray, D. G. Scalability! but at what cost. In Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (2015).

Digital Library

[28]

Mejova, Y., Haddadi, H., Noulas, A., and Weber, I. #Foodporn: Obesity patterns in culinary interactions. In Proceedings of the ACM conference on Digital Health 2015 (2015).

Digital Library

[29]

Oliveira Aparicio, D., Pinto Ribeiro, P. M., and Da Silva, F. M. A. Parallel subgraph counting for multicore architectures. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (2014).

Digital Library

[30]

Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007).

Digital Library

[31]

Ribeiro, P., and Silva, F. G-Tries: A data structure for storing and finding subgraphs. Data Mining and Knowledge Discovery 28, 2 (2014).

Digital Library

[32]

Schmidt, M. C., Samatova, N. F., Thomas, K., and Park, B.-H. A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing 69, 4 (2009).

Digital Library

[33]

Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., and Xu, N. Parallel subgraph listing in a large-scale graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2014).

Digital Library

[34]

Slota, G. M., and Madduri, K. Complex network analysis using parallel approximate motif counting. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2014).

Digital Library

[35]

Teixeira, C. H. C., Fonseca, A. J., Serafini, M., Siganos, G., Zaki, M. J., and Aboulnaga, A. Arabesque: A system for distributed graph mining - Extended version. Technical Report, Qatar Computing Research Institute, 2015.

[36]

Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., and McPherson, J. From "think like a vertex" to "think like a graph". Proceedings of the VLDB Endowment 7, 3 (2013).

Digital Library

[37]

Uno, T. MACE: Maximal clique enumerator, ver 2.0. http://research.nii.ac.jp/~uno/code/mace.html.

[38]

Valiant, L. G. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990).

Digital Library

[39]

Wang, C., and Parthasarathy, S. Parallel algorithms for mining frequent structural motifs in scientific data. In Proceedings of the Annual International Conference on Supercomputing (2004).

Digital Library

[40]

Wu, B., Yang, S., Zhao, H., and Wang, B. A distributed algorithm to enumerate all maximal cliques in MapReduce. In Proceedings of the International Conference on Frontier of Computer Science and Technology (2009).

Digital Library

[41]

Xiang, J., Guo, C., and Aboulnaga, A. Scalable maximum clique computation using MapReduce. In Proceedings of the IEEE International Coference on Data Engineering (2013).

Digital Library

[42]

Yan, D., Cheng, J., Lu, Y., and Ng, W. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment 7, 14 (2014).

Digital Library

[43]

Yan, X., and Han, J. gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining (2002).

Digital Library

[44]

Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. A., and Marathe, M. V. Sahad: Subgraph analysis in massive networks using Hadoop. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2012).

Digital Library

Cited By

Meng LShao YYuan LLai LCheng PLi XYu WZhang WLin XZhou J(2024)A Survey of Distributed Graph Algorithms on Massive GraphsACM Computing Surveys10.1145/3694966Online publication date: 5-Sep-2024
https://dl.acm.org/doi/10.1145/3694966
Zhang ZLu YZheng WLin X(2024)A Comprehensive Survey and Experimental Study of Subgraph Matching: Trends, Unbiasedness, and InteractionProceedings of the ACM on Management of Data10.1145/36393152:1(1-29)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639315
Yan DYuan LAhmad AZheng CChen HCheng JBaeza-Yates RBonchi F(2024)Systems for Scalable Graph Analytics and Machine Learning: Trends and MethodsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671472(6627-6632)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671472
Show More Cited By

Index Terms

Arabesque: a system for distributed graph mining

Recommendations

AutoMine: harmonizing high-level abstraction and high performance for graph mining
SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles

Graph mining algorithms that aim at identifying structural patterns of graphs are typically more complex than graph computation algorithms such as breadth first search. Researchers have implemented several systems with high-level and flexible interfaces ...
Peregrine: a pattern-aware graph mining system
EuroSys '20: Proceedings of the Fifteenth European Conference on Computer Systems

Graph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. General purpose graph mining systems provide a generic runtime to explore subgraph structures of interest with the help of user-defined ...
Fractal: A General-Purpose Graph Pattern Mining System
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

In this paper we propose Fractal, a high performance and high productivity system for supporting distributed graph pattern mining (GPM) applications. Fractal employs a dynamic (auto-tuned) load-balancing based on a hierarchical and locality-aware work ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

October 2015

499 pages

ISBN:9781450338349

DOI:10.1145/2815400

General Chair:
Ethan Miller
UC Santa Cruz
,
Program Chair:
Steven Hand
Google

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SSRC: Storage Systems Research Center, UC Santa Cruz
SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 October 2015

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Conference

SOSP '15

Sponsor:

SSRC
SIGOPS

SOSP '15: ACM SIGOPS 25th Symposium on Operating Systems Principles

October 4 - 7, 2015

California, Monterey

Acceptance Rates

SOSP '15 Paper Acceptance Rate 30 of 181 submissions, 17%;

Overall Acceptance Rate 131 of 716 submissions, 18%

Upcoming Conference

SOSP '25

Sponsor:
sigops

ACM SIGOPS 31st Symposium on Operating Systems Principles

October 13 - 16, 2025

Seoul , Republic of Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

173
Total Citations
View Citations
2,288
Total Downloads

Downloads (Last 12 months)77
Downloads (Last 6 weeks)17

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Meng LShao YYuan LLai LCheng PLi XYu WZhang WLin XZhou J(2024)A Survey of Distributed Graph Algorithms on Massive GraphsACM Computing Surveys10.1145/3694966Online publication date: 5-Sep-2024
https://dl.acm.org/doi/10.1145/3694966
Zhang ZLu YZheng WLin X(2024)A Comprehensive Survey and Experimental Study of Subgraph Matching: Trends, Unbiasedness, and InteractionProceedings of the ACM on Management of Data10.1145/36393152:1(1-29)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639315
Yan DYuan LAhmad AZheng CChen HCheng JBaeza-Yates RBonchi F(2024)Systems for Scalable Graph Analytics and Machine Learning: Trends and MethodsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671472(6627-6632)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671472
Lin ZMeng KShui CZhang KXiao JTan GLee IChabbi MSteuwer M(2024)Exploiting Fine-Grained Redundancy in Set-Centric Graph Pattern MiningProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638507(175-187)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638507
Li YSun SXiao HYe CLu SHe B(2024)A Survey on Concurrent Processing of Graph Analytical Queries: Systems and AlgorithmsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.339393636:11(5508-5528)Online publication date: Nov-2024
https://doi.org/10.1109/TKDE.2024.3393936
Yuan LAhmad AYan DHan JAdhikari SYu XZhou Y(2024) G 2 -AIMD: A Memory-Efficient Subgraph-Centric Framework for Efficient Subgraph Finding on GPUs 2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00245(3164-3177)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00245
Yuan LYan DHan JAhmad AZhou YJiang Z(2024)Faster Depth-First Subgraph Matching on GPUs2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00244(3151-3163)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00244
Ferraz SDias VTeixeira CParthasarathy STeodoro GMeira W(2024)DuMato: An Efficient Warp-Centric Subgraph Enumeration System for GPUJournal of Parallel and Distributed Computing10.1016/j.jpdc.2024.104903(104903)Online publication date: Apr-2024
https://doi.org/10.1016/j.jpdc.2024.104903
Farouzi AZhou XBellatreche LMalki MOrdonez C(2024)Balanced parallel triangle enumeration with an adaptive algorithmDistributed and Parallel Databases10.1007/s10619-023-07437-x42:1(103-141)Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1007/s10619-023-07437-x
Sheng SZhang ZZhou PWu X(2024)An effective algorithm for genealogical graph partitioningApplied Intelligence10.1007/s10489-023-05265-154:2(1798-1817)Online publication date: 22-Jan-2024
https://doi.org/10.1007/s10489-023-05265-1
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents