Scalable subgraph enumeration in MapReduce

Published: 01 June 2015

Abstract

Subgraph enumeration, which aims to find all subgraphs of a large data graph that are isomorphic to a given pattern graph, is a fundamental graph problem with a wide range of applications. However, existing sequential algorithms for subgraph enumeration fall short in handling large graphs because they rely on computationally intensive subgraph isomorphism operations. Thus, recent research has focused on solving the problem using MapReduce. Nevertheless, existing MapReduce approaches do not scale to very large graphs, since they either produce a huge number of partial results or consume a large amount of memory. Motivated by this, in this paper we propose a new algorithm, TwinTwigJoin, based on a left-deep-join framework in MapReduce, in which the basic join unit is a TwinTwig (an edge or two incident edges of a node). We show that in the Erdős-Rényi random-graph model, TwinTwigJoin is instance optimal in the left-deep-join framework under reasonable assumptions, and we devise an algorithm to compute the optimal join plan. Three optimization strategies are explored to improve our algorithm. Furthermore, we discuss how our approach can be adapted to the power-law random-graph model. We conduct extensive performance studies on several real graphs, one of which contains billions of edges. Our approach significantly outperforms existing solutions in all tests.
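
To make the join unit concrete, the following is a minimal single-machine sketch of a left-deep join over TwinTwigs. It is not the paper's MapReduce implementation: the function names (`match_twin_twig`, `join`, `left_deep_enumerate`), the in-memory nested-loop join, and the hand-chosen decomposition of the triangle pattern are all assumptions made for illustration, and no attempt is made to compute an optimal join plan.

```python
from itertools import permutations


def build_adjacency(edges):
    """Build undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def match_twin_twig(adj, center, leaves):
    """Yield partial mappings {pattern vertex: data vertex} for one TwinTwig.

    A TwinTwig is a center with one or two leaves, i.e. an edge or two
    incident edges of a node.
    """
    assert 1 <= len(leaves) <= 2
    for v, nbrs in adj.items():
        # Ordered choices of distinct neighbours, one per leaf.
        for chosen in permutations(nbrs, len(leaves)):
            mapping = {center: v}
            mapping.update(zip(leaves, chosen))
            yield mapping


def join(left, right):
    """Join partial mappings on shared pattern vertices (nested loops)."""
    out = []
    for a in left:
        for b in right:
            if any(a[k] != b[k] for k in a.keys() & b.keys()):
                continue
            merged = {**a, **b}
            # Injectivity: distinct pattern vertices map to distinct data vertices.
            if len(set(merged.values())) == len(merged):
                out.append(merged)
    return out


def left_deep_enumerate(data_edges, twin_twigs):
    """Join TwinTwig matches one at a time, always extending the left result."""
    adj = build_adjacency(data_edges)
    result = list(match_twin_twig(adj, *twin_twigs[0]))
    for center, leaves in twin_twigs[1:]:
        result = join(result, list(match_twin_twig(adj, center, leaves)))
    return result


# Hypothetical example: the triangle pattern a-b-c split by hand into
# a two-edge TwinTwig centred at "a" and a single-edge TwinTwig for b-c.
triangle_plan = [("a", ["b", "c"]), ("b", ["c"])]
data_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
matches = left_deep_enumerate(data_edges, triangle_plan)
print(len(matches))  # 6: one triangle {1, 2, 3}, counted once per automorphism
```

The sketch reports each embedding once per automorphism of the pattern, so the single triangle appears six times; it does no symmetry breaking and no partitioning of work across machines.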

Information

Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 10
June 2015
168 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 June 2015
Published in PVLDB Volume 8, Issue 10

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 29
  • Downloads (last 6 weeks): 8
Reflects downloads up to 20 Nov 2024

Cited By

  • (2024) Fast Local Subgraph Counting. Proceedings of the VLDB Endowment, 17:8 (1967-1980). DOI: 10.14778/3659437.3659451. Online publication date: 1-Apr-2024
  • (2024) A Survey of Distributed Graph Algorithms on Massive Graphs. ACM Computing Surveys, 57:2 (1-39). DOI: 10.1145/3694966. Online publication date: 10-Oct-2024
  • (2024) Speeding Up Subgraph Matching Queries with Schema Guided Index. Proceedings of the 2024 3rd International Conference on Networks, Communications and Information Technology (34-38). DOI: 10.1145/3672121.3672129. Online publication date: 7-Jun-2024
  • (2024) Brief Announcement: Improved Massively Parallel Triangle Counting in O(1) Rounds. Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing (519-522). DOI: 10.1145/3662158.3662819. Online publication date: 17-Jun-2024
  • (2024) Understanding High-Performance Subgraph Pattern Matching: A Systems Perspective. Proceedings of the 7th Joint Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA) (1-12). DOI: 10.1145/3661304.3661897. Online publication date: 14-Jun-2024
  • (2024) A Comprehensive Survey and Experimental Study of Subgraph Matching: Trends, Unbiasedness, and Interaction. Proceedings of the ACM on Management of Data, 2:1 (1-29). DOI: 10.1145/3639315. Online publication date: 26-Mar-2024
  • (2024) Optimizing subgraph retrieval and matching with an efficient indexing scheme. Knowledge and Information Systems, 66:11 (6815-6843). DOI: 10.1007/s10115-024-02175-7. Online publication date: 1-Nov-2024
  • (2024) Parallelization of butterfly counting on hierarchical memory. The VLDB Journal, 33:5 (1453-1484). DOI: 10.1007/s00778-024-00856-x. Online publication date: 1-Sep-2024
  • (2024) Temporal graph patterns by timed automata. The VLDB Journal, 33:1 (25-47). DOI: 10.1007/s00778-023-00795-z. Online publication date: 1-Jan-2024
  • (2023) BEEP: Balanced Efficient subgraph Enumeration in Parallel. Proceedings of the 52nd International Conference on Parallel Processing (142-152). DOI: 10.1145/3605573.3605653. Online publication date: 7-Aug-2023
