research-article

Towards transparent hardening of distributed systems

Authors:

Christof Fetzer,

Flavio P. Junqueira,

Marco Serafini

HotDep '13: Proceedings of the 9th Workshop on Hot Topics in Dependable Systems

Article No.: 4, Pages 1 - 6

https://doi.org/10.1145/2524224.2524230

Published: 03 November 2013

Abstract

In distributed systems, errors such as data corruption or arbitrary changes to the flow of programs might cause processes to propagate incorrect state across the system. To prevent error propagation in such systems, an efficient and effective technique is to harden processes against Arbitrary State Corruption (ASC) faults through local detection, without replication. For distributed systems designed from scratch, dealing with state corruption can be made fully transparent, but this requires that developers follow a few concrete design patterns. In this paper, we discuss the problem of transparently hardening the existing code bases of distributed systems. Existing systems have not been designed with ASC hardening in mind, so they do not necessarily follow the required design patterns. For such systems, we focus here on both performance and the number of changes to the existing code base. Using memcached as an example, we identify and discuss three areas of improvement: reducing the memory overhead, improving access to state variables, and supporting multi-threading. Our initial evaluation of memcached shows that our ASC-hardened version obtains roughly 76% of the throughput of stock memcached with 128-byte and 1k-byte messages.
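
As a rough illustration of what "local detection, without replication" can look like, the sketch below guards a single state variable with a checksum and refuses to produce a reply once the check fails. It is a minimal sketch under stated assumptions, not the hardening scheme used in the paper: the names `GuardedVar`, `CorruptionDetected`, and `handle_request` are hypothetical, and CRC32 over a pickled value is only one possible integrity check.

```python
import pickle
import zlib


class CorruptionDetected(Exception):
    """Raised when a guarded value no longer matches its stored checksum."""


class GuardedVar:
    """Hypothetical wrapper: stores a value's bytes together with their CRC32."""

    def __init__(self, value):
        self.write(value)

    def write(self, value):
        # Serialize the value and record a checksum of the serialized bytes.
        self._blob = pickle.dumps(value)
        self._crc = zlib.crc32(self._blob)

    def read(self):
        # Verify the checksum before the value is used anywhere else.
        if zlib.crc32(self._blob) != self._crc:
            raise CorruptionDetected("state variable failed its integrity check")
        return pickle.loads(self._blob)


def handle_request(counter, increment):
    """Process one message; a failed check aborts instead of sending a reply."""
    counter.write(counter.read() + increment)
    return counter.read()
```

In this sketch a process that detects a mismatch stops before it sends anything, so the corrupted value is never propagated to other processes. The sketch covers only data corruption of a stored value; it does not address control-flow errors or multi-threading, both of which the abstract names as part of the problem.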

Cited By

A. Martens, C. Borchert, M. Nieke, O. Spinczyk, and R. Kapitza (2016). CrossCheck: A Holistic Approach for Tolerating Crash-Faults and Arbitrary Failures. 2016 12th European Dependable Computing Conference (EDCC), pages 65--76. Online publication date: Sep-2016.
https://doi.org/10.1109/EDCC.2016.29

D. Behrens, M. Serafini, S. Arnautov, F. Junqueira, C. Fetzer, P. Barham, and A. Krishnamurthy (2015). Scalable error isolation for distributed systems. Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, pages 605--620. Online publication date: 4-May-2015.
https://dl.acm.org/doi/10.5555/2789770.2789812

A. Martens, C. Borchert, T. Geißler, D. Lohmann, O. Spinczyk, and R. Kapitza (2014). Crosscheck. Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 648--653. Online publication date: 23-Jun-2014.
https://dl.acm.org/doi/10.1109/DSN.2014.98

Index Terms

Towards transparent hardening of distributed systems
1. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
    2. Software system structures
      1. Distributed systems organizing principles

Published In

HotDep '13: Proceedings of the 9th Workshop on Hot Topics in Dependable Systems

November 2013

64 pages

ISBN:9781450324571

DOI:10.1145/2524224

Program Chairs:

Christian Cachin (IBM Research - Zürich, Switzerland),

Robbert van Renesse (Cornell University, Ithaca, NY)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

SOSP '13

Sponsor:

SIGOPS

SOSP '13: ACM SIGOPS 24th Symposium on Operating Systems Principles

November 3, 2013

Farmington, Pennsylvania

Acceptance Rates

HotDep '13 Paper Acceptance Rate 11 of 21 submissions, 52%;

Overall Acceptance Rate 11 of 21 submissions, 52%
