Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2987550.2987552acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments

Published: 05 October 2016 Publication History

Abstract

Many graph algorithms are iterative in nature and can be supported by distributed memory-based systems in a synchronous manner. However, an asynchronous model has been recently proposed to accelerate iterative computations. Nevertheless, it is challenging to recover from failures in such a system, since a typical checkpointing based approach requires many expensive synchronization barriers that largely offset the gains of asynchronous computations.
This paper first proposes a fault-tolerant framework that performs recovery by leveraging surviving data, rather than checkpointing. Our fault-tolerant approach guarantees the correctness of computations. Additionally, a novel asynchronous checkpointing method is introduced to further boost the recovery efficiency at the price of nearly zero overhead. Our solutions are implemented on a prototype system, Faiter, to facilitate tolerating failures for asynchronous computations. Also, Faiter performs load balancing on recovery by re-assigning lost data onto multiple machines. We conduct extensive experiments to show the effectiveness of our proposals using a broad spectrum of real-world graphs.

References

[1]
Giraph. http://giraph.apache.org/.
[2]
Apache spark. http://spark.apache.org/.
[3]
S. Baluja, R. Seth, D. Sivakumar, Y. Jing, J. Yagnik, S. Kumar, D. Ravichandran, and M. Aly. Video suggestion and discovery for youtube: taking random walks through the view graph. In Proc. of the 17th international conference on World Wide Web, pages 895--904. ACM, 2008.
[4]
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. 7th Intl. Conf. on the World Wide Web, pages 107--117. Elsevier, 1998.
[5]
C. Cao, T. Herault, G. Bosilca, and J. Dongarra. Design for a soft error resilient dynamic task-based runtime. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pages 765--774. IEEE, 2015.
[6]
K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems (TOCS), 3(1):63--75, 1985.
[7]
Z. Chen. Algorithm-based recovery for iterative methods without checkpointing. In Proc. of HPDC, pages 73--84. ACM, 2011.
[8]
J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Proc. of OSDI, volume 12, page 2, 2012.
[9]
J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataflow framework. In Proc. of OSDI, pages 599--613, 2014.
[10]
Z. Guan, J. Wu, Q. Zhang, A. Singh, and X. Yan. Assessing and ranking structural correlations in graphs. In Proc. of SIGMOD, pages 937--948. ACM, 2011.
[11]
L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39--43, 1953.
[12]
Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In Proc. of Eurosys, pages 169--182. ACM, 2013.
[13]
D. LaSalle and G. Karypis. Multi-threaded graph partitioning. In Proc. of IPDPS, pages 225--236. IEEE, 2013.
[14]
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc. of the VLDB Endowment, 5(8):716--727, 2012.
[15]
G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. of SIGMOD, pages 135--146. ACM, 2010.
[16]
T. Martsinkevich, O. Subasi, O. Unsal, F. Cappello, and J. Labarta. Fault-tolerant protocol for hybrid task-parallel message-passing applications. In Cluster Computing (CLUSTER), 2015 IEEE International Conference on, pages 563--570. IEEE, 2015.
[17]
M. Pundir, L. M. Leslie, I. Gupta, and R. H. Campbell. Zorro: Zero-cost reactive failure recovery in distributed graph processing. In Proc. of SoCC, pages 195--208. ACM, 2015.
[18]
S. Salihoglu and J. Widom. Gps: A graph processing system. In Proc. of SSDBM, page 22. ACM, 2013.
[19]
S. Schelter, S. Ewen, K. Tzoumas, and V. Markl. All roads lead to rome: optimistic recovery for distributed iterative data processing. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 1919--1928. ACM, 2013.
[20]
Y. Shen, G. Chen, H. Jagadish, W. Lu, B. C. Ooi, and B. M. Tudor. Fast failure recovery in distributed graph processing systems. Proc. of the VLDB Endowment, 8(4):437--448, 2014.
[21]
I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In Proc. of SIGKDD, pages 1222--1230. ACM, 2012.
[22]
J. Wang, M. Balazinska, and D. Halperin. Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. Proc. of the VLDB Endowment, 8(12):1542--1553, 2015.
[23]
Z. Wang, Y. Gu, Y. Bao, G. Yu, and J. X. Yu. Hybrid pulling/pushing for i/o-efficient distributed and iterative graph computing. In Proc. of SIGMOD, pages 479--494. ACM, 2016.
[24]
C. Xu, M. Holzemer, M. Kaul, and V. Markl. Efficient fault-tolerance for iterative graph processing on distributed dataflow systems. In Proc. of ICDE, pages 613--624. IEEE, 2016.
[25]
J. Xue, Z. Yang, Z. Qu, S. Hou, and Y. Dai. Seraph: an efficient, low-cost system for concurrent graph processing. In Proc. of HPDC, pages 227--238. ACM, 2014.
[26]
J. Yin and L. Gao. Scalable distributed belief propagation with prioritized block updates. In Proc. of CIKM, pages 1209--1218, 2014.
[27]
J. Yin and L. Gao. Asynchronous distributed incremental computation on evolving graphs. In ECML/PKDD'16, 2016.
[28]
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. of NSDI, pages 2--2. USENIX Association, 2012.
[29]
C. Zhang, L. Shou, K. Chen, G. Chen, and Y. Bei. Evaluating geo-social influence in location-based social networks. In Proc. of CIKM, pages 1442--1451. ACM, 2012.
[30]
Y. Zhang, Q. Gao, L. Gao, and C. Wang. Priter: A distributed framework for prioritizing iterative computations. Parallel and Distributed Systems, IEEE Transactions on, 24(9):1884--1893, 2013.
[31]
Y. Zhang, Q. Gao, L. Gao, and C. Wang. Maiter: an asynchronous graph processing framework for delta-based accumulative iterative computation. TPDS, 25(8):2091--2100, 2014.
[32]
C. Zhou, J. Gao, B. Sun, and J. X. Yu. Mocgraph: Scalable distributed graph processing using message online computing. Proc. of the VLDB Endowment, 8(4):377--388, 2014.

Cited By

View all
  • (2021)A Fault-Tolerant Distributed Framework for Asynchronous Iterative ComputationsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.305942032:8(2062-2073)Online publication date: 1-Aug-2021
  • (2021)FreeLauncher: Lossless Failure Recovery of Parameter Servers with Ultralight Replication2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS51616.2021.00052(472-482)Online publication date: Jul-2021
  • (2020)Comprehensive and Systematic Study on the Fault Tolerance Architectures in Cloud ComputingJournal of Circuits, Systems and Computers10.1142/S021812662050240029:15(2050240)Online publication date: 22-Jun-2020
  • Show More Cited By

Index Terms

  1. A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
      October 2016
      534 pages
      ISBN:9781450345255
      DOI:10.1145/2987550
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 October 2016

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. asynchronous computations
      2. distributed memory-based systems
      3. fault-tolerance
      4. iterative algorithms

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • the National Basic Research Program of China (973 Program)
      • US NSF
      • the National Natural Science Foundation of China
      • China Scholarship Council

      Conference

      SoCC '16
      Sponsor:
      SoCC '16: ACM Symposium on Cloud Computing
      October 5 - 7, 2016
      CA, Santa Clara, USA

      Acceptance Rates

      SoCC '16 Paper Acceptance Rate 38 of 151 submissions, 25%;
      Overall Acceptance Rate 169 of 722 submissions, 23%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)1
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 13 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2021)A Fault-Tolerant Distributed Framework for Asynchronous Iterative ComputationsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.305942032:8(2062-2073)Online publication date: 1-Aug-2021
      • (2021)FreeLauncher: Lossless Failure Recovery of Parameter Servers with Ultralight Replication2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS51616.2021.00052(472-482)Online publication date: Jul-2021
      • (2020)Comprehensive and Systematic Study on the Fault Tolerance Architectures in Cloud ComputingJournal of Circuits, Systems and Computers10.1142/S021812662050240029:15(2050240)Online publication date: 22-Jun-2020
      • (2020)Accelerating Large-Scale Prioritized Graph Computations by Hotness Balanced PartitionIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2020.3032709(1-1)Online publication date: 2020
      • (2020)HBP: Hotness Balanced Partition for Prioritized Iterative Graph Computations2020 IEEE 36th International Conference on Data Engineering (ICDE)10.1109/ICDE48307.2020.00209(1942-1945)Online publication date: Apr-2020
      • (2019)Multiple Fault-Tolerance Mechanisms in Cloud Systems: A Systematic Review2019 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)10.1109/ISSREW.2019.00104(414-421)Online publication date: Oct-2019
      • (2018)A Distributed Snapshot Protocol for Efficient Artificial Intelligence Computation in Cloud Computing EnvironmentsSymmetry10.3390/sym1001003010:1(30)Online publication date: 17-Jan-2018
      • (2018)A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud EnvironmentsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2018.280851929:8(1678-1692)Online publication date: 1-Aug-2018
      • (2018)FBSGraph: Accelerating Asynchronous Graph Processing via Forward and Backward SweepingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2017.278124130:5(895-907)Online publication date: 1-May-2018
      • (2017)An I/O-efficient and adaptive fault-tolerant framework for distributed graph computationsDistributed and Parallel Databases10.1007/s10619-017-7192-235:2(177-196)Online publication date: 1-Jun-2017
      • Show More Cited By

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media