Abstract
We outline a method for constructing in parallel a collection of local clusters for a massive distributed graph. For a given input set of (vertex, cluster size) tuples, we compute approximations of personal PageRank vectors in parallel using Pregel, and sweep over the results to create clusters using MapReduce. We show that our method converges to the serial approximate PageRank, and perform an experiment that illustrates the speed up over the serial method. We also outline a random selection and de-confliction procedure to cluster a distributed graph, and perform experiments to determine the quality of clusterings returned.
Similar content being viewed by others
Notes
Available: http://snap.stanford.edu/data/index.html.
References
Ahn Y, Bagrow J, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761–764
Andersen R, Peres Y (2008) Finding sparse cuts locally using evolving sets. CoRR abs/0811.3779
Andersen R, Chung F, Lang K (2006) Local graph partitioning using pagerank vectors. In: Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, IEEE, pp 475–486
Apache Giraph (2012) Apache giraph. http://incubator.apache.org/giraph/
Bahmani B, Chakrabarti K, Xin D (2011) Fast personalized pagerank on mapreduce. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’11, pp 973–984, doi:10.1145/1989323.1989425
Blondel V, Guillaume J, Lambiotte R, Mech E (2008) Fast unfolding of communities in large networks. J Stat Mech p P10008
Chung FRK (1996) Spectral graph theory (CBMS Regional Conference Series in Mathematics, No. 92) American Mathematical Society
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113 doi:10.1145/1327452.1327492
Fogaras D, Rácz B, Csalogány K, Sarlós T (2005) Towards scaling fully personalized pagerank: algorithms, lower bounds, and experiments. Internet Math 2(3):333–358
Gleich DF, Seshadhri C (2012) Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In: KDD, pp 597–605
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) Powergraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX conference on operating systems design and implementation, USENIX Association, Berkeley, CA, USA, OSDI’12, pp 17–30
Good BH, de Montjoye YA, Clauset A (2010) Performance of modularity maximization in practical contexts. Phys Rev E 81(4):046,106
Kannan R, Vempala S, Vetta A (2004) On clusterings: good, bad and spectral. J ACM 51(3):497–515 doi:10.1145/990308.990313
Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2008) Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th international conference on World Wide Web, ACM, New York, NY, USA, WWW ’08, pp 695–704, doi:10.1145/1367497.1367591
Lin J, Dyer C (2010) Data-intensive text processing with mapreduce. Synth Lect Hum Lang Technol 3(1):1–177
Lin J, Schatz M (2010) Design patterns for efficient graph algorithms in mapreduce. In: Proceedings of the eighth workshop on mining and learning with graphs, ACM, pp 78–85
Lovász L, Simonovits M (1993) Random walks in a convex body and an improved volume algorithm. Random Struct Algorithms 4(4):359–412 doi:10.1002/rsa.3240040402
Mahoney MW, Orecchia L, Vishnoi NK (2012) A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. JMLR 13:2339–2365
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’10, pp 135–146, doi:10.1145/1807167.1807184
Massie ML, Chun BN, Culler DE (2004) The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput 30(7):817–840
McCubbin C, Perozzi B, Levine A, Rahman A (2011) Finding the ’needle’: locating interesting nodes using the k-shortest paths algorithm in mapreduce. 2011 IEEE international conference on data mining workshops 0:180–187, doi:10.1109/ICDMW.2011.84
Neal P (2008) The generalised coupon collector problem. J Appl Probab 45(3):621–629 doi:10.1239/jap/1222441818
Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026,113
Newman MEJ (2006) Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23): 8577–8582 doi:10.1073/pnas.0601602103
Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64
Spielman D, Teng S (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the thirty-sixth annual ACM symposium on theory of computing, ACM, pp 81–90
Spielman DA, Teng SH (2008) A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning. CoRR abs/0809.3232
White T (2009) Hadoop: the definitive guide, 1st edn. O’Reilly Media, Inc., Sebastopol, California, USA
Yang J, Leskovec J (2012) Structure and overlaps of communities in networks. CoRR abs/1205.6228
Zhao Z, Wang G, Butt A, Khan M, Kumar V, Marathe M (2012) Sahad: subgraph analysis in massive networks using hadoop. In: parallel and distributed processing symposium (IPDPS), 2012 IEEE 26th international, IEEE, pp 390–401
Acknowledgments
The authors would like to thank Spencer Beecher and Jason Morris for their help throughout this work, and the anonymous reviewers for their helpful comments.
Author information
Authors and Affiliations
Corresponding author
Additional information
This article is part of the Topical Collection on Social Systems as Complex Networks.
Rights and permissions
About this article
Cite this article
Perozzi, B., McCubbin, C. & Halbert, J.T. Scalable graph clustering with parallel approximate PageRank. Soc. Netw. Anal. Min. 4, 179 (2014). https://doi.org/10.1007/s13278-014-0179-3
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s13278-014-0179-3