Supporting nondeterministic execution in fault-tolerant systems

Authors:

E. N. Elnozahy

FTCS '96: Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)

Page 250

Published: 25 June 1996

Abstract

We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. This technique relies on using a software counter to compute the number of instructions between nondeterministic events in normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The execution of the application thus can be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. Implementation on a DEC Alpha processor shows that this support has a low overhead, typically less than 6% increase in running time for the applications we studied.
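
The record-and-replay idea in the abstract can be illustrated with a small sketch. This is not the paper's DEC Alpha implementation: it is a toy model in which one loop iteration stands in for one instruction, and the names `run`, `random_source`, and `replay_source` are hypothetical. During normal operation a software counter records how many instructions ran since the previous asynchronous event; during recovery the logged counts are turned back into delivery points, so each event is replayed at the same execution point.

```python
import random

MODULUS = 1_000_003  # keeps the toy state small


def run(program_steps, deliver_at):
    """Execute a toy program of `program_steps` instructions.

    Before each instruction, `deliver_at(count)` returns an event payload
    to deliver at that point, or None. Returns (final_state, log), where
    each log entry is (instructions_since_previous_event, payload).
    """
    state = 0
    log = []
    since_last = 0  # software instruction counter, reset at every event
    for count in range(program_steps):
        event = deliver_at(count)
        if event is not None:
            log.append((since_last, event))
            since_last = 0
            # Event handler: the update is order-sensitive, so the final
            # state depends on exactly where the event is delivered.
            state = (state * 31 + event) % MODULUS
        state = (state + count) % MODULUS  # one "instruction"
        since_last += 1
    return state, log


def random_source(seed, rate=0.05):
    """Normal operation: events arrive at uncontrolled points."""
    rng = random.Random(seed)

    def deliver_at(count):
        return rng.randint(1, 100) if rng.random() < rate else None

    return deliver_at


def replay_source(log):
    """Recovery: turn logged instruction counts into delivery points."""
    schedule = {}
    position = 0
    for delta, payload in log:
        position += delta
        schedule[position] = payload
    return schedule.get  # returns None when no event is due


if __name__ == "__main__":
    original_state, event_log = run(1000, random_source(seed=7))
    replayed_state, replayed_log = run(1000, replay_source(event_log))
    print(original_state == replayed_state, event_log == replayed_log)
```

In this sketch the log stores only a count and a payload per event, which mirrors the abstract's point that instruction counts are enough to force each event back to its original execution point. A real system would also have to count actual machine instructions rather than loop iterations.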

Published In

FTCS '96: Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)

June 1996

ISBN: 0818672617

Copyright © 1996 Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Publisher

IEEE Computer Society, United States

Cited By

Psychou G, Rodopoulos D, Sabry M, Gemmeke T, Atienza D, Noll T, Catthoor F (2017). Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems. ACM Computing Surveys 50:4 (1-38). Online publication date: 4-Oct-2017
https://dl.acm.org/doi/10.1145/3092699
Chen Y, Zhang S, Guo Q, Li L, Wu R, Chen T (2015). Deterministic Replay. ACM Computing Surveys 48:2 (1-47). Online publication date: 24-Sep-2015
https://dl.acm.org/doi/10.1145/2790077
Viennot N, Nair S, Nieh J (2013). Transparent mutable replay for multicore debugging and patch validation. ACM SIGPLAN Notices 48:4 (127-138). Online publication date: 16-Mar-2013
https://dl.acm.org/doi/10.1145/2499368.2451130
Viennot N, Nair S, Nieh J (2013). Transparent mutable replay for multicore debugging and patch validation. ACM SIGARCH Computer Architecture News 41:1 (127-138). Online publication date: 16-Mar-2013
https://dl.acm.org/doi/10.1145/2490301.2451130
Viennot N, Nair S, Nieh J, Sarkar V, Bodik R (2013). Transparent mutable replay for multicore debugging and patch validation. Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (127-138). Online publication date: 16-Mar-2013
https://dl.acm.org/doi/10.1145/2451116.2451130
Bergan T, Hunt N, Ceze L, Gribble S, Arpaci-Dusseau R, Chen B (2010). Deterministic process groups in dOS. Proceedings of the 9th USENIX conference on Operating systems design and implementation (177-191). Online publication date: 4-Oct-2010
https://dl.acm.org/doi/10.5555/1924943.1924956
Laadan O, Viennot N, Nieh J (2010). Transparent, lightweight application execution replay on commodity multiprocessor operating systems. ACM SIGMETRICS Performance Evaluation Review 38:1 (155-166). Online publication date: 14-Jun-2010
https://dl.acm.org/doi/10.1145/1811099.1811057
Laadan O, Viennot N, Nieh J, Misra V, Barford P, Squillante M (2010). Transparent, lightweight application execution replay on commodity multiprocessor operating systems. Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems (155-166). Online publication date: 14-Jun-2010
https://dl.acm.org/doi/10.1145/1811039.1811057
Pokam G, Pereira C, Danne K, Kassa R, Adl-Tabatabai A, Albonesi D, Martonosi M, August D, Martínez J (2009). Architecting a chunk-based memory race recorder in modern CMPs. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (576-585). Online publication date: 12-Dec-2009
https://dl.acm.org/doi/10.1145/1669112.1669183
Zagorodnov D, Marzullo K, Alvisi L, Bressoud T (2009). Practical and low-overhead masking of failures of TCP-based servers. ACM Transactions on Computer Systems 27:2 (1-39). Online publication date: 29-May-2009
https://dl.acm.org/doi/10.1145/1534909.1534911