Scalable graph clustering with parallel approximate PageRank

Bryan Perozzi¹,
Christopher McCubbin² &
J. T. Halbert³

421 Accesses
6 Citations
1 Altmetric
Explore all metrics

Abstract

We outline a method for constructing in parallel a collection of local clusters for a massive distributed graph. For a given input set of (vertex, cluster size) tuples, we compute approximations of personal PageRank vectors in parallel using Pregel, and sweep over the results to create clusters using MapReduce. We show that our method converges to the serial approximate PageRank, and perform an experiment that illustrates the speed up over the serial method. We also outline a random selection and de-confliction procedure to cluster a distributed graph, and perform experiments to determine the quality of clusterings returned.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Scalable Graph Clustering with Pregel

Distributed Algorithms for Finding Local Clusters Using Heat Kernel Pagerank

Scalable Data-Driven PageRank: Algorithms, System Issues, and Lessons Learned

Notes

Available: http://snap.stanford.edu/data/index.html.

References

Ahn Y, Bagrow J, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761–764
Article Google Scholar
Andersen R, Peres Y (2008) Finding sparse cuts locally using evolving sets. CoRR abs/0811.3779
Andersen R, Chung F, Lang K (2006) Local graph partitioning using pagerank vectors. In: Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, IEEE, pp 475–486
Apache Giraph (2012) Apache giraph. http://incubator.apache.org/giraph/
Bahmani B, Chakrabarti K, Xin D (2011) Fast personalized pagerank on mapreduce. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’11, pp 973–984, doi:10.1145/1989323.1989425
Blondel V, Guillaume J, Lambiotte R, Mech E (2008) Fast unfolding of communities in large networks. J Stat Mech p P10008
Chung FRK (1996) Spectral graph theory (CBMS Regional Conference Series in Mathematics, No. 92) American Mathematical Society
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113 doi:10.1145/1327452.1327492
Article Google Scholar
Fogaras D, Rácz B, Csalogány K, Sarlós T (2005) Towards scaling fully personalized pagerank: algorithms, lower bounds, and experiments. Internet Math 2(3):333–358
Article MathSciNet MATH Google Scholar
Gleich DF, Seshadhri C (2012) Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In: KDD, pp 597–605
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) Powergraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX conference on operating systems design and implementation, USENIX Association, Berkeley, CA, USA, OSDI’12, pp 17–30
Good BH, de Montjoye YA, Clauset A (2010) Performance of modularity maximization in practical contexts. Phys Rev E 81(4):046,106
Google Scholar
Kannan R, Vempala S, Vetta A (2004) On clusterings: good, bad and spectral. J ACM 51(3):497–515 doi:10.1145/990308.990313
Article MathSciNet MATH Google Scholar
Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2008) Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th international conference on World Wide Web, ACM, New York, NY, USA, WWW ’08, pp 695–704, doi:10.1145/1367497.1367591
Lin J, Dyer C (2010) Data-intensive text processing with mapreduce. Synth Lect Hum Lang Technol 3(1):1–177
Article Google Scholar
Lin J, Schatz M (2010) Design patterns for efficient graph algorithms in mapreduce. In: Proceedings of the eighth workshop on mining and learning with graphs, ACM, pp 78–85
Lovász L, Simonovits M (1993) Random walks in a convex body and an improved volume algorithm. Random Struct Algorithms 4(4):359–412 doi:10.1002/rsa.3240040402
Article MATH Google Scholar
Mahoney MW, Orecchia L, Vishnoi NK (2012) A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. JMLR 13:2339–2365
Google Scholar
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’10, pp 135–146, doi:10.1145/1807167.1807184
Massie ML, Chun BN, Culler DE (2004) The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput 30(7):817–840
Article Google Scholar
McCubbin C, Perozzi B, Levine A, Rahman A (2011) Finding the ’needle’: locating interesting nodes using the k-shortest paths algorithm in mapreduce. 2011 IEEE international conference on data mining workshops 0:180–187, doi:10.1109/ICDMW.2011.84
Neal P (2008) The generalised coupon collector problem. J Appl Probab 45(3):621–629 doi:10.1239/jap/1222441818
Article MathSciNet MATH Google Scholar
Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026,113
Google Scholar
Newman MEJ (2006) Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23): 8577–8582 doi:10.1073/pnas.0601602103
Article Google Scholar
Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64
Article MathSciNet MATH Google Scholar
Spielman D, Teng S (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the thirty-sixth annual ACM symposium on theory of computing, ACM, pp 81–90
Spielman DA, Teng SH (2008) A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning. CoRR abs/0809.3232
White T (2009) Hadoop: the definitive guide, 1st edn. O’Reilly Media, Inc., Sebastopol, California, USA
Yang J, Leskovec J (2012) Structure and overlaps of communities in networks. CoRR abs/1205.6228
Zhao Z, Wang G, Butt A, Khan M, Kumar V, Marathe M (2012) Sahad: subgraph analysis in massive networks using hadoop. In: parallel and distributed processing symposium (IPDPS), 2012 IEEE 26th international, IEEE, pp 390–401

Download references

Acknowledgments

The authors would like to thank Spencer Beecher and Jason Morris for their help throughout this work, and the anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Bryan Perozzi
Sqrrl Data, Inc., Boston, MA, USA
Christopher McCubbin
TASC Inc., Chantilly, VA, USA
J. T. Halbert

Authors

Bryan Perozzi
View author publications
You can also search for this author in PubMed Google Scholar
Christopher McCubbin
View author publications
You can also search for this author in PubMed Google Scholar
J. T. Halbert
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bryan Perozzi.

Additional information

This article is part of the Topical Collection on Social Systems as Complex Networks.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Perozzi, B., McCubbin, C. & Halbert, J.T. Scalable graph clustering with parallel approximate PageRank. Soc. Netw. Anal. Min. 4, 179 (2014). https://doi.org/10.1007/s13278-014-0179-3

Download citation

Received: 14 September 2013
Revised: 10 January 2014
Accepted: 04 February 2014
Published: 04 March 2014
DOI: https://doi.org/10.1007/s13278-014-0179-3

Scalable graph clustering with parallel approximate PageRank

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Scalable Graph Clustering with Pregel

Distributed Algorithms for Finding Local Clusters Using Heat Kernel Pagerank

Scalable Data-Driven PageRank: Algorithms, System Issues, and Lessons Learned

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Scalable graph clustering with parallel approximate PageRank

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Scalable Graph Clustering with Pregel

Distributed Algorithms for Finding Local Clusters Using Heat Kernel Pagerank

Scalable Data-Driven PageRank: Algorithms, System Issues, and Lessons Learned

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation