Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

Scalable graph clustering with parallel approximate PageRank

  • Original Article
  • Published:
Social Network Analysis and Mining Aims and scope Submit manuscript

Abstract

We outline a method for constructing in parallel a collection of local clusters for a massive distributed graph. For a given input set of (vertex, cluster size) tuples, we compute approximations of personal PageRank vectors in parallel using Pregel, and sweep over the results to create clusters using MapReduce. We show that our method converges to the serial approximate PageRank, and perform an experiment that illustrates the speed up over the serial method. We also outline a random selection and de-confliction procedure to cluster a distributed graph, and perform experiments to determine the quality of clusterings returned.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

Notes

  1. Available: http://snap.stanford.edu/data/index.html.

References

  • Ahn Y, Bagrow J, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761–764

    Article  Google Scholar 

  • Andersen R, Peres Y (2008) Finding sparse cuts locally using evolving sets. CoRR abs/0811.3779

  • Andersen R, Chung F, Lang K (2006) Local graph partitioning using pagerank vectors. In: Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, IEEE, pp 475–486

  • Apache Giraph (2012) Apache giraph. http://incubator.apache.org/giraph/

  • Bahmani B, Chakrabarti K, Xin D (2011) Fast personalized pagerank on mapreduce. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’11, pp 973–984, doi:10.1145/1989323.1989425

  • Blondel V, Guillaume J, Lambiotte R, Mech E (2008) Fast unfolding of communities in large networks. J Stat Mech p P10008

  • Chung FRK (1996) Spectral graph theory (CBMS Regional Conference Series in Mathematics, No. 92) American Mathematical Society

  • Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113 doi:10.1145/1327452.1327492

    Article  Google Scholar 

  • Fogaras D, Rácz B, Csalogány K, Sarlós T (2005) Towards scaling fully personalized pagerank: algorithms, lower bounds, and experiments. Internet Math 2(3):333–358

    Article  MathSciNet  MATH  Google Scholar 

  • Gleich DF, Seshadhri C (2012) Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In: KDD, pp 597–605

  • Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) Powergraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX conference on operating systems design and implementation, USENIX Association, Berkeley, CA, USA, OSDI’12, pp 17–30

  • Good BH, de Montjoye YA, Clauset A (2010) Performance of modularity maximization in practical contexts. Phys Rev E 81(4):046,106

    Google Scholar 

  • Kannan R, Vempala S, Vetta A (2004) On clusterings: good, bad and spectral. J ACM 51(3):497–515 doi:10.1145/990308.990313

    Article  MathSciNet  MATH  Google Scholar 

  • Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2008) Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th international conference on World Wide Web, ACM, New York, NY, USA, WWW ’08, pp 695–704, doi:10.1145/1367497.1367591

  • Lin J, Dyer C (2010) Data-intensive text processing with mapreduce. Synth Lect Hum Lang Technol 3(1):1–177

    Article  Google Scholar 

  • Lin J, Schatz M (2010) Design patterns for efficient graph algorithms in mapreduce. In: Proceedings of the eighth workshop on mining and learning with graphs, ACM, pp 78–85

  • Lovász L, Simonovits M (1993) Random walks in a convex body and an improved volume algorithm. Random Struct Algorithms 4(4):359–412 doi:10.1002/rsa.3240040402

    Article  MATH  Google Scholar 

  • Mahoney MW, Orecchia L, Vishnoi NK (2012) A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. JMLR 13:2339–2365

    Google Scholar 

  • Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’10, pp 135–146, doi:10.1145/1807167.1807184

  • Massie ML, Chun BN, Culler DE (2004) The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput 30(7):817–840

    Article  Google Scholar 

  • McCubbin C, Perozzi B, Levine A, Rahman A (2011) Finding the ’needle’: locating interesting nodes using the k-shortest paths algorithm in mapreduce. 2011 IEEE international conference on data mining workshops 0:180–187, doi:10.1109/ICDMW.2011.84

  • Neal P (2008) The generalised coupon collector problem. J Appl Probab 45(3):621–629 doi:10.1239/jap/1222441818

    Article  MathSciNet  MATH  Google Scholar 

  • Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026,113

    Google Scholar 

  • Newman MEJ (2006) Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23): 8577–8582 doi:10.1073/pnas.0601602103

    Article  Google Scholar 

  • Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64

    Article  MathSciNet  MATH  Google Scholar 

  • Spielman D, Teng S (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the thirty-sixth annual ACM symposium on theory of computing, ACM, pp 81–90

  • Spielman DA, Teng SH (2008) A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning. CoRR abs/0809.3232

  • White T (2009) Hadoop: the definitive guide, 1st edn. O’Reilly Media, Inc., Sebastopol, California, USA

  • Yang J, Leskovec J (2012) Structure and overlaps of communities in networks. CoRR abs/1205.6228

  • Zhao Z, Wang G, Butt A, Khan M, Kumar V, Marathe M (2012) Sahad: subgraph analysis in massive networks using hadoop. In: parallel and distributed processing symposium (IPDPS), 2012 IEEE 26th international, IEEE, pp 390–401

Download references

Acknowledgments

The authors would like to thank Spencer Beecher and Jason Morris for their help throughout this work, and the anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Bryan Perozzi.

Additional information

This article is part of the Topical Collection on Social Systems as Complex Networks.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Perozzi, B., McCubbin, C. & Halbert, J.T. Scalable graph clustering with parallel approximate PageRank. Soc. Netw. Anal. Min. 4, 179 (2014). https://doi.org/10.1007/s13278-014-0179-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13278-014-0179-3

Keywords

Navigation