Research article. DOI: 10.1145/2524224.2524230

Towards transparent hardening of distributed systems

Published: 03 November 2013

Abstract

In distributed systems, errors such as data corruption or arbitrary changes to the flow of programs might cause processes to propagate incorrect state across the system. To prevent error propagation in such systems, an efficient and effective technique is to harden processes against Arbitrary State Corruption (ASC) faults through local detection, without replication. For distributed systems designed from scratch, dealing with state corruption can be made fully transparent, but requires that developers follow a few concrete design patterns. In this paper, we discuss the problem of hardening existing code bases of distributed systems transparently. Existing systems have not been designed with ASC hardening in mind, so they do not necessarily follow required design patterns. For such systems, we focus here on both performance and number of changes to the existing code base. Using memcached as an example, we identify and discuss three areas of improvement: reducing the memory overhead, improving access to state variables, and supporting multi-threading. Our initial evaluation of memcached shows that our ASC-hardened version obtains a throughput that is roughly 76% of the throughput of stock memcached with 128-byte and 1k-byte messages.
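
To make the idea of local detection concrete, the following is a minimal sketch of a state variable guarded by a checksum, so that corruption is caught before the value can propagate to other processes. It is not the mechanism used in the paper: the names `CheckedVar`, `CorruptionDetected`, and `demo` are hypothetical, and the use of CRC32 over a pickled value is an assumption chosen only for illustration.

```python
import pickle
import zlib


class CorruptionDetected(Exception):
    """Raised when a stored value no longer matches its checksum."""


class CheckedVar:
    """Hypothetical wrapper that keeps a CRC32 next to the serialized value."""

    def __init__(self, value):
        self.write(value)

    def write(self, value):
        # Serialize once and remember the checksum of the exact bytes stored.
        self._blob = bytearray(pickle.dumps(value))
        self._crc = zlib.crc32(self._blob)

    def read(self):
        # Local detection: verify before the value can influence any output.
        if zlib.crc32(self._blob) != self._crc:
            raise CorruptionDetected("state variable failed its integrity check")
        return pickle.loads(bytes(self._blob))


def demo():
    counter = CheckedVar(41)
    counter.write(counter.read() + 1)   # normal read-modify-write
    ok_value = counter.read()
    counter._blob[-2] ^= 0x01           # simulate a single bit flip in memory
    try:
        counter.read()
        detected = False
    except CorruptionDetected:
        detected = True
    return ok_value, detected
```

A process built this way would stop (or discard the request) when `read` raises, rather than sending a message derived from corrupted state. The sketch also hints at why memory overhead and access cost to state variables are natural areas for optimization: every value carries extra bytes and every access pays for a checksum computation.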

Cited By

  • (2016) CrossCheck: A Holistic Approach for Tolerating Crash-Faults and Arbitrary Failures. 2016 12th European Dependable Computing Conference (EDCC), pages 65-76. DOI: 10.1109/EDCC.2016.29. Online publication date: Sep-2016
  • (2015) Scalable error isolation for distributed systems. Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, pages 605-620. DOI: 10.5555/2789770.2789812. Online publication date: 4-May-2015
  • (2014) Crosscheck. Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 648-653. DOI: 10.1109/DSN.2014.98. Online publication date: 23-Jun-2014

Published In

HotDep '13: Proceedings of the 9th Workshop on Hot Topics in Dependable Systems
November 2013
64 pages
ISBN: 9781450324571
DOI: 10.1145/2524224

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data corruption
  2. distributed systems
  3. fault-tolerance

Qualifiers

  • Research-article

Conference

SOSP '13

Acceptance Rates

HotDep '13 Paper Acceptance Rate 11 of 21 submissions, 52%;
Overall Acceptance Rate 11 of 21 submissions, 52%
