Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2505515.2505753acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

"All roads lead to Rome": optimistic recovery for distributed iterative data processing

Published: 27 October 2013 Publication History

Abstract

Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point.
We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converge to the correct solution from various intermediate consistent states. In the case of a failure, we apply a user-defined compensate function that algorithmically creates such a consistent state, instead of rolling back to a previous checkpointed state. Our optimistic recovery does not checkpoint any state and hence achieves optimal failure-free performance with respect to the overhead necessary for guaranteeing fault tolerance.
We illustrate the applicability of this approach for three wide classes of problems. Furthermore, we show how to implement the proposed optimistic recovery mechanism in a data flow system. Similar to the Combine operator in MapReduce, our proposed functionality is optional and can be applied to increase performance without changing the semantics of programs.
In an experimental evaluation on large datasets, we show that our proposed approach provides optimal failure-free performance. In the absence of failures our optimistic scheme is able to outperform a pessimistic approach by a factor of two to five. In presence of failures, our approach provides fast recovery and outperforms pessimistic approaches in the majority of cases.

References

[1]
Apache Giraph, http://giraph.apache.org.
[2]
Apache Hadoop, http://hadoop.apache.org.
[3]
D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010.
[4]
D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation. Athena Scientific, 1997.
[5]
P. Boldi and S. Vigna. The Webgraph Framework I: Compression techniques. In WWW, pp. 595--602, 2004.
[6]
P. Bonacich. Power and Centrality: A family of measures. American Journal of Sociology, pp. 1170--1182, 1987.
[7]
V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011.
[8]
Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling datalog for machine learning on big data. CoRR, abs/1203.0160, 2012.
[9]
M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, pp. 335--350, 2006.
[10]
M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. Measuring user influence in Twitter: The million follower fallacy. In ICWSM, 2010.
[11]
A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In WWW, pp. 271--280, 2007.
[12]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008.
[13]
E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375--408, 2002.
[14]
S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning fast iterative data flows. PVLDB, 5(11):1268--1279, 2012.
[15]
H. Garcia-Molina and K. Salem. Sagas. In SIGMOD, pp. 249--259, 1987.
[16]
R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, pp. 69--77, 2011.
[17]
S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, pp. 29--43, 2003.
[18]
G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
[19]
M. Gondran and M. Minoux. Graphs, Dioids and Semirings - New Models and Algorithms. Springer, 2008.
[20]
Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, pp. 263--272, 2008.
[21]
U. Kang, S. Papadimitriou, J. Sun, and H. Tong. Centralities in large networks: Algorithms and observations. In SDM, pp. 119--130, 2011.
[22]
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In ICDM, pp. 229--238, 2009.
[23]
L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39--43, 1953.
[24]
Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 2009.
[25]
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8):716--727, 2012.
[26]
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD pp. 135--146, 2010.
[27]
F. McSherry, D. G. Murray, R. Isaacs, and M. Isard. Differential dataflow. In CIDR, 2013.
[28]
S. R. Mihaylov, Z. G. Ives, and S. Guha. Rex: Recursive, delta-based data-centric computation. PVLDB, 5(11):1280--1291, 2012.
[29]
M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, 2006.
[30]
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab, 1999.
[31]
A. Reuter and F. Schwenkreis. ConTracts - A low-level mechanism for building general-purpose workflow management-systems. IEEE Data Eng. Bull., 18(1):4--10, 1995.
[32]
M. Zaharia, M. Chowdhury, T. Das, D. Ankur, M. McCauley, M. Franklin, S. Shenker and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. NDSI, pp. 2--2 2012.
[33]
L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, 1990.
[34]
M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica.Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In HotCLoud, 2012.
[35]
Y. Zhou, D. M. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In AAIM, pp. 337--348, 2008.

Cited By

View all
  • (2024)A Fuzzy PID-Incorporated Stochastic Gradient Descent Algorithm for Fast and Accurate Latent Factor AnalysisIEEE Transactions on Fuzzy Systems10.1109/TFUZZ.2024.338973332:7(4049-4061)Online publication date: 16-Apr-2024
  • (2024)Adaptive Divergence-Based Non-Negative Latent Factor Analysis of High-Dimensional and Incomplete Matrices From Industrial ApplicationsIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2023.33325508:2(1209-1222)Online publication date: Apr-2024
  • (2024)A Nonlinear PID-Incorporated Adaptive Stochastic Gradient Descent Algorithm for Latent Factor AnalysisIEEE Transactions on Automation Science and Engineering10.1109/TASE.2023.328481921:3(3742-3756)Online publication date: Jul-2024
  • Show More Cited By

Index Terms

  1. "All roads lead to Rome": optimistic recovery for distributed iterative data processing

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
    October 2013
    2612 pages
    ISBN:9781450322638
    DOI:10.1145/2505515
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. fault-tolerance
    2. iterative algorithms
    3. optimistic recovery

    Qualifiers

    • Research-article

    Conference

    CIKM'13
    Sponsor:
    CIKM'13: 22nd ACM International Conference on Information and Knowledge Management
    October 27 - November 1, 2013
    California, San Francisco, USA

    Acceptance Rates

    CIKM '13 Paper Acceptance Rate 143 of 848 submissions, 17%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    CIKM '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)18
    • Downloads (Last 6 weeks)5
    Reflects downloads up to 13 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)A Fuzzy PID-Incorporated Stochastic Gradient Descent Algorithm for Fast and Accurate Latent Factor AnalysisIEEE Transactions on Fuzzy Systems10.1109/TFUZZ.2024.338973332:7(4049-4061)Online publication date: 16-Apr-2024
    • (2024)Adaptive Divergence-Based Non-Negative Latent Factor Analysis of High-Dimensional and Incomplete Matrices From Industrial ApplicationsIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2023.33325508:2(1209-1222)Online publication date: Apr-2024
    • (2024)A Nonlinear PID-Incorporated Adaptive Stochastic Gradient Descent Algorithm for Latent Factor AnalysisIEEE Transactions on Automation Science and Engineering10.1109/TASE.2023.328481921:3(3742-3756)Online publication date: Jul-2024
    • (2022)An α–β-Divergence-Generalized Recommender for Highly Accurate Predictions of Missing User PreferencesIEEE Transactions on Cybernetics10.1109/TCYB.2020.302642552:8(8006-8018)Online publication date: Aug-2022
    • (2022)ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph ProcessingWeb and Big Data10.1007/978-3-031-25158-0_5(45-59)Online publication date: 11-Aug-2022
    • (2022)Data Management in Machine Learning SystemsundefinedOnline publication date: 26-Feb-2022
    • (2021)Handling Iterations in Distributed Dataflow SystemsACM Computing Surveys10.1145/347760254:9(1-38)Online publication date: 8-Oct-2021
    • (2021)A Fault-Tolerant Distributed Framework for Asynchronous Iterative ComputationsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.305942032:8(2062-2073)Online publication date: 1-Aug-2021
    • (2021)Continuous Distributed Processing of Software Defined Radar2021 CIE International Conference on Radar (Radar)10.1109/Radar53847.2021.10028245(2876-2880)Online publication date: 15-Dec-2021
    • (2021)SmartDL: energy-aware decremental learning in a mobile-based federation for geo-spatial systemNeural Computing and Applications10.1007/s00521-021-06378-935:5(3677-3696)Online publication date: 9-Aug-2021
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media