Supporting nondeterministic execution in fault-tolerant systems

Published: 25 June 1996

Abstract

We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. This technique relies on using a software counter to compute the number of instructions between nondeterministic events in normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The execution of the application thus can be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. Implementation on a DEC Alpha processor shows that this support has a low overhead, typically less than 6% increase in running time for the applications we studied.
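The record-and-replay idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's DEC Alpha implementation: the loop index stands in for the software instruction counter, and `run`, `handle_event`, the event probability, and the state-update arithmetic are all hypothetical choices made only for illustration. In normal operation, asynchronous events arrive at unpredictable points and are logged together with the current instruction count; in replay, the logged counts are used to force each event at the same execution point.

```python
import random


def handle_event(state, payload):
    # Hypothetical event handler: its effect depends on when it runs,
    # so delivery timing matters for the final state.
    return state ^ payload


def run(steps, deliver_at=None, seed=None):
    """Execute `steps` toy instructions and return (final_state, event_log).

    Normal mode (deliver_at is None): events arrive at random points and
    are logged as (instruction_count, payload).
    Replay mode: events in `deliver_at` are forced at the logged counts.
    """
    rng = random.Random(seed)
    pending = list(deliver_at) if deliver_at is not None else None
    log = []
    state = 0
    for count in range(steps):  # `count` models the instruction counter
        if pending is None:
            # Normal operation: an event may arrive before any instruction.
            if rng.random() < 0.05:
                payload = rng.randint(1, 100)
                log.append((count, payload))
                state = handle_event(state, payload)
        else:
            # Replay: deliver every event recorded at this exact count.
            while pending and pending[0][0] == count:
                _, payload = pending.pop(0)
                log.append((count, payload))
                state = handle_event(state, payload)
        # One deterministic "instruction" of the application.
        state = (state * 31 + count) % 1_000_003
    return state, log


if __name__ == "__main__":
    original_state, event_log = run(1000, seed=7)
    replayed_state, replayed_log = run(1000, deliver_at=event_log)
    print(original_state == replayed_state, event_log == replayed_log)
```

Replaying from the log needs only the instruction counts and event payloads, not the random source that produced the original timing, which is the property the abstract relies on to recreate the pre-failure state.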

    Published In

    FTCS '96: Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
    June 1996
    ISBN:0818672617

    Publisher

    IEEE Computer Society

    United States

    Author Tags

    1. DEC Alpha processor
    2. asynchronous events
    3. fault tolerant computing
    4. fault-tolerant systems
    5. log-based rollback-recovery protocols
    6. memory protocols
    7. multithreading
    8. nondeterministic execution
    9. pre-failure state
    10. shared memory systems
    11. software counter
    12. software fault tolerance
    13. system recovery
    14. uncontrolled nondeterminism

    Qualifiers

    • Article
