Article

Evaluation of the QoS of crash-recovery failure detection

Authors:

Stuart Anderson

SAC '07: Proceedings of the 2007 ACM symposium on Applied computing

Pages 538 - 542

https://doi.org/10.1145/1244002.1244127

Published: 11 March 2007

Abstract

Crash failure detection is a key topic in fault tolerance, and it is important to be able to assess the quality of service (QoS) of failure detection services. Most previous work on crash failure detectors has been based on the crash-stop or fail-free assumption. In this paper we study and model a crash-recovery service, that is, a service that can recover from the crash state. We analyse the QoS bounds for such a crash-recovery failure detection service. Our results show that the dependability metrics of the monitored service affect the QoS of the failure detection service. Simulation results corroborate the analysis and illustrate the bounds on the QoS.
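
The dependence of detection QoS on the monitored service's dependability metrics can be illustrated with a small Monte Carlo sketch. It is not the model analysed in the paper: the exponential up and down times, the loss-free and delay-free heartbeats, and the names `simulate_detected_fraction`, `mttf`, `mttr` and `eta` are all assumptions made for illustration.

```python
import math
import random


def simulate_detected_fraction(mttf, mttr, eta, n_failures=10_000, seed=0):
    """Estimate the fraction of crashes that a heartbeat detector notices.

    Illustrative assumptions (not taken from the paper):
    - up and down periods are exponential with means mttf and mttr;
    - the target emits a heartbeat at every multiple of eta while it is up;
    - heartbeats are never lost or delayed, so a crash is noticed exactly
      when at least one heartbeat instant falls inside the down period.
    """
    rng = random.Random(seed)
    now = 0.0
    detected = 0
    for _ in range(n_failures):
        # The target runs until it crashes, then stays down until it recovers.
        crash = now + rng.expovariate(1.0 / mttf)
        recover = crash + rng.expovariate(1.0 / mttr)
        # First heartbeat instant at or after the crash.
        next_heartbeat = math.ceil(crash / eta) * eta
        # The heartbeat is missed only if the target is still down at that time.
        if next_heartbeat < recover:
            detected += 1
        now = recover
    return detected / n_failures


if __name__ == "__main__":
    for mttr in (10.0, 1.0, 0.1):
        fraction = simulate_detected_fraction(mttf=100.0, mttr=mttr, eta=1.0)
        print(f"mttr={mttr}: detected fraction = {fraction:.3f}")
```

Under these assumptions, a target that recovers quickly relative to the heartbeat interval hides many of its crashes from the detector, which is one simple way the target's recovery behaviour can bound the achievable QoS.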

Index Terms

Evaluation of the QoS of crash-recovery failure detection
1. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Organizing principles for web applications

Published In

SAC '07: Proceedings of the 2007 ACM symposium on Applied computing

March 2007

1688 pages

ISBN:1595934804

DOI:10.1145/1244002

Conference Chairs:
Yookun Cho, Seoul National University, Seoul, Korea
Roger L. Wainwright, University of Tulsa, Tulsa, Oklahoma
Hisham M. Haddad, Kennesaw State University, Kennesaw, Georgia
Sung Y. Shin, South Dakota State University, Brookings, South Dakota

Program Chair:
Yong Wan Koo, The University of Suwon, Gyeonggi-do, Korea

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SAC07

Sponsor:

SIGAPP

SAC07: The 2007 ACM Symposium on Applied Computing

March 11 - 15, 2007

Seoul, Korea

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Cited By

Bandara L, Rathnasinghe H, Kavinda E, De. Seram C, Hansika M (2021). Intelligent Crowd-Sourced 5G Heat-map with Event-driven Architecture. 2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0982-0987. Online publication date: 27-Oct-2021.
https://doi.org/10.1109/IEMCON53756.2021.9623247
Pezoa J, Hayat M (2014). Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures. IEEE Transactions on Parallel and Distributed Systems 25:4, pp. 1034-1043. Online publication date: 1-Apr-2014.
https://dl.acm.org/doi/10.1109/TPDS.2013.78
Pezoa J, Hayat M (2012). Performance and Reliability of Non-Markovian Heterogeneous Distributed Computing Systems. IEEE Transactions on Parallel and Distributed Systems 23:7, pp. 1288-1301. Online publication date: 1-Jul-2012.
https://dl.acm.org/doi/10.1109/TPDS.2011.285
Ma T, Hillston J, Anderson S (2010). On the Quality of Service of Crash-Recovery Failure Detectors. IEEE Transactions on Dependable and Secure Computing 7:3, pp. 271-283. Online publication date: 1-Jul-2010.
https://dl.acm.org/doi/10.1109/TDSC.2009.35
Zhang D, Qi Y, Yang S, Hou F (2009). Structure-Based Mesh Completion. 2009 International Conference on Computational Intelligence and Software Engineering, pp. 1-4. Online publication date: Dec-2009.
https://doi.org/10.1109/CISE.2009.5366092
Zhang D, Liu F (2009). Research and Design of Redundant Cluster-Head Model in WIA-PA Based on Adaptive Heartbeat Protocol. 2009 International Conference on Computational Intelligence and Software Engineering, pp. 1-4. Online publication date: Dec-2009.
https://doi.org/10.1109/CISE.2009.5365032
Zhang Y, Zhou J (2009). Research and Implementation of EAI Based on SOA. 2009 International Conference on Computational Intelligence and Software Engineering, pp. 1-4. Online publication date: Dec-2009.
https://doi.org/10.1109/CISE.2009.5362586
Ma T, Hillston J, Anderson S (2007). On the Quality of Service of Crash-Recovery Failure Detectors. Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 739-748. Online publication date: 25-Jun-2007.
https://dl.acm.org/doi/10.1109/DSN.2007.70
