Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/2387880.2387884acmotherconferencesArticle/Chapter ViewAbstractPublication PagesosdiConference Proceedingsconference-collections
Article

GraphChi: large-scale graph computation on just a PC

Published: 08 October 2012 Publication History

Abstract

Current systems for graph computation require a distributed computing cluster to handle very large real-world problems, such as analysis on social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains challenging, especially to non-experts.
In this work, we present GraphChi, a disk-based system for computing efficiently on graphs with billions of edges. By using a well-known method to break large graphs into small parts, and a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumer-level computer. We further extend GraphChi to support graphs that evolve over time, and demonstrate that, on a single computer, GraphChi can process over one hundred thousand graph updates per second, while simultaneously performing computation. We show, through experiments and theoretical analysis, that GraphChi performs well on both SSDs and rotational hard drives.
By repeating experiments reported for existing distributed systems, we show that, with only fraction of the resources, GraphChi can solve the same problems in very reasonable time. Our work makes large-scale graph computation available to anyone with a modern PC.

References

[1]
A. Aggarwal, J. Vitter, et al. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, 1988.
[2]
L. Arge. The buffer tree: A new technique for optimal i/o-algorithms. Algorithms and Data Structures, pages 334-345, 1995.
[3]
L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. The 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'06. ACM, 2006.
[4]
A. Badam and V. S. Pai. Ssdalloc: hybrid ssd/ram memory management made easy. In Proc. of the 8th USENIX conference on Networked systems design and implementation, NSDI'11, pages 16-16, Boston, MA, 2011. USENIX Association.
[5]
M. Bender, G. Brodal, R. Fagerberg, R. Jacob, and E. Vicari. Optimal sparse matrix dense vector multiplication in the i/o-model. Theory of Computing Systems, 47(4):934-962, 2010.
[6]
J. Bennett and S. Lanning. The netflix prize. In Proc. of the KDD Cup Workshop 2007, pages 3-6, San Jose, CA, Aug. 2007. ACM.
[7]
D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods. Prentice-Hall, Inc., 1989.
[8]
D. K. Blandford, G. E. Blelloch, and I. A. Kash. Compact representations of separable graphs. In In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 679-688, 2003.
[9]
G. Blelloch, H. Simhadri, and K. Tangwongsan. Parallel and i/o efficient set covering algorithms. In Proc. of the 24th ACM symposium on Parallelism in algorithms and architectures, pages 82-90, 2012.
[10]
G. E. Blelloch. Prefix sums and their applications. Synthesis of Parallel Algorithms, 1990.
[11]
P. Boldi, M. Santini, and S. Vigna. A large time-aware graph. SIGIR Forum, 42(2):33-38, 2008.
[12]
P. Boldi and S. Vigna. The webgraph framework i: compression techniques. In Proc. of the 13th international conference on World Wide Web, pages 595-602. ACM, 2004.
[13]
R. Chen, X. Weng, B. He, and M. Yang. Large graph processing in the cloud. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD'10, pages 1123-1126, Indianapolis, Indiana, USA, 2010. ACM.
[14]
Y. Chen, Q. Gan, and T. Suel. I/O-efficient techniques for computing pagerank. In Proc. of the eleventh international conference on Information and knowledge management, pages 549-557, McLean, Virginia, USA, 2002. ACM.
[15]
R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In Proc. of the 7th ACM european conference on Computer Systems, EuroSys'12, pages 85-98, Bern, Switzerland, 2012. ACM.
[16]
Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In Proc. of the sixth annual ACM-SIAM symposium on Discrete algorithms, SODA'95, pages 139-149, Philadelphia, PA, 1995. Society for Industrial and Applied Mathematics.
[17]
F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In Proc. of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 219-228, Paris, France, April 2009. ACM.
[18]
S. Chu and J. Cheng. Triangle listing in massive networks and its applications. In In Proc. of the 17th ACM SIGKDD international conf. on Knowledge discovery and data mining, pages 672-680, 2011.
[19]
J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In Proc. of the 6th USENIX conference on Operating systems design and implementation, OSDI'04, pages 10-10, San Francisco, CA, 2004. USENIX.
[20]
J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proc. of the 10th USENIX conference on Operating systems design and implementation, OSDI'12, Hollywood, CA, 2012.
[21]
T. Haveliwala. Efficient computation of pagerank. Technical report, Stanford University, 1999.
[22]
U. Kang, D. Chau, and C. Faloutsos. Inference of beliefs on billion-scale graphs. In The 2nd Workshop on Large-scale Data Mining: Theory and Applications, Washington, D.C., 2010.
[23]
U. Kang and C. Faloutsos. Beyond'caveman communities': Hubs and spokes for graph compression and mining. In 11th International Conference on Data Mining (ICDM'11), pages 300-309, Vancouver, Canada, 2011.
[24]
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system implementation and observations. ICDM'09. IEEE Computer Society, 2009.
[25]
G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput., 48(1):96-129, 1998.
[26]
H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proc. of the 19th international conference on World wide web, pages 591-600. ACM, 2010.
[27]
J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29-123, 2009.
[28]
X. Liu and T. Murata. Advanced modularity-specialized label propagation algorithm for detecting communities in networks. Physica A: Stat. Mechanics and its Applications, 389(7):1493-1500, 2010.
[29]
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, CA, July 2010.
[30]
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 2012.
[31]
G. Malewicz, M. H. Austern, A. J. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. SIGMOD 10: Proc. of the 2010 international conference on Management of data, Indianapolis, IN, 2010.
[32]
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999.
[33]
D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. of the 1988 ACM SIGMOD international conference on Management of data, SIGMOD'88, pages 109-116, Chicago, IL, 1988.
[34]
R. Pearce, M. Gokhale, and N. Amato. Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory. In SuperComputing, 2010.
[35]
J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles, 1982.
[36]
R. Power and J. Li. Piccolo: building fast, distributed programs with partitioned tables. In Proc. of the 9th USENIX conference on Operating systems design and implementation, OSDI'10, pages 1-14, 2010.
[37]
S. Salihoglu and J. Widom. GPS: a graph processing system. Technical report, Stanford University, 2012.
[38]
I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. Technical report, Microsoft Research, 2012.
[39]
S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In In Proc. of the 20th international conference on World wide web, pages 607-614, Lyon, France, 2011. ACM.
[40]
S. Toledo. A survey of out-of-core algorithms in numerical linear algebra. External Memory Algorithms and Visualization, 50:161-179, 1999.
[41]
L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.
[42]
J. Vitter. External Memory Algorithms. ESA, 1998.
[43]
D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684): 440-442, 1998.
[44]
Yahoo WebScope. Yahoo! altavista web page hyper-link connectivity graph, circa 2002, 2012. http://webscope.sandbox.yahoo.com/.
[45]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.
[46]
Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In In Proc. of the 4th international conference on Algorithmic Aspects in Information and Management, AAIM'08, pages 337-348, Berlin, Heidelberg, 2008. Springer-Verlag.
[47]
X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University, 2002.

Cited By

View all
  • (2024)BYO: A Unified Framework for Benchmarking Large-Scale Graph ContainersProceedings of the VLDB Endowment10.14778/3665844.366585917:9(2307-2320)Online publication date: 1-May-2024
  • (2024)Load Balanced PIM-Based Graph ProcessingACM Transactions on Design Automation of Electronic Systems10.1145/365995129:4(1-22)Online publication date: 21-Jun-2024
  • (2024)CPMA: An Efficient Batch-Parallel Compressed Set Without PointersProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638492(348-363)Online publication date: 2-Mar-2024
  • Show More Cited By

Recommendations

Reviews

Jose F Rodrigues

The analytical processing of large graphs has increasingly been in demand for academic and industrial applications. Due to their magnitude, large graphs are typically processed in computing clusters—systems with many computers that work together—at the cost of greater complexity to build, configure, and manage them. Data analysts prefer to avoid problems like these. To address this, the authors of this paper propose the GraphChi system to allow for the processing of large graphs on a single computer. GraphChi introduces the parallel sliding windows technique, inspired by the asynchronous model of computation [1], which processes the graph data according to P disjoint subsets of vertices defined so that the induced subgraph (plus out-edges) of each subset will fit into memory. This scheme ensures that the entire graph will be processed in P steps. Furthermore, vertex-ordered disk organization (preprocessed) guarantees that only P sequential disk accesses are necessary in each step. The result is an efficient and scalable framework. GraphChi cannot be used for common operations like graph traversals and vertex queries, but it is quite suitable for problems in which the neighborhood of the vertices is sufficient. Examples of these problems include sparse-matrix dense-vector multiplication (SpMV); connected components, community detection, and triangle counting algorithms; collaborative filtering; and probabilistic graphical models. GraphChi rivals the principal frameworks presented in recent literature; for specific problems, it outperforms clusters with dozens of machines. I recommend this paper to graph researchers and practitioners who can benefit from this novel and useful paradigm. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
OSDI'12: Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
October 2012
362 pages
ISBN:9781931971966

Sponsors

  • Infosys
  • EMC2: EMC2
  • Microsoft Reasearch: Microsoft Reasearch
  • ORACLE: ORACLE
  • USENIX Assoc: USENIX Assoc

In-Cooperation

Publisher

USENIX Association

United States

Publication History

Published: 08 October 2012

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)BYO: A Unified Framework for Benchmarking Large-Scale Graph ContainersProceedings of the VLDB Endowment10.14778/3665844.366585917:9(2307-2320)Online publication date: 1-May-2024
  • (2024)Load Balanced PIM-Based Graph ProcessingACM Transactions on Design Automation of Electronic Systems10.1145/365995129:4(1-22)Online publication date: 21-Jun-2024
  • (2024)CPMA: An Efficient Batch-Parallel Compressed Set Without PointersProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638492(348-363)Online publication date: 2-Mar-2024
  • (2024)TGLite: A Lightweight Programming Framework for Continuous-Time Temporal Graph Neural NetworksProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640414(1183-1199)Online publication date: 27-Apr-2024
  • (2023)Learning to Drive Software-Defined Solid-State DrivesProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3614281(1289-1304)Online publication date: 28-Oct-2023
  • (2023)NosWalker: A Decoupled Architecture for Out-of-Core Random Walk ProcessingProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3582016.3582025(466-482)Online publication date: 25-Mar-2023
  • (2023)DGAP: Efficient Dynamic Graph Analysis on Persistent MemoryProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607106(1-13)Online publication date: 12-Nov-2023
  • (2022)BlazeProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.5555/3571885.3571943(1-15)Online publication date: 13-Nov-2022
  • (2022)An I/O-efficient disk-based graph system for scalable second-order random walk of large graphsProceedings of the VLDB Endowment10.14778/3529337.352934615:8(1619-1631)Online publication date: 1-Apr-2022
  • (2022)LargeEAProceedings of the VLDB Endowment10.14778/3489496.348950415:2(237-245)Online publication date: 4-Feb-2022
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media