Abstract
A process is said to be fault tolerant if the system provides proper service despite the failure of the process. To support fault-tolerant processes, measures must be provided to recover messages lost because of the failure. One approach to recovering messages is to use message-logging techniques. In this paper, we present a model for message-logging-based schemes that support fault-tolerant processes, and we develop conditions for proper message recovery in asynchronous systems. We show that requiring messages to be recovered in the same order in which they were received before the failure is stricter than necessary. We then propose a distributed scheme to support fault-tolerant processes that can also handle multiple process failures.
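The general idea of message logging can be illustrated with a small sketch. It is not the scheme proposed in the paper, whose model and recovery conditions are not reproduced here. It assumes a toy process whose state is a set of per-sender counters and a log that survives the failure. The names `LoggedProcess`, `receive`, `apply`, and `recover` are hypothetical. Because the handler in this toy case is commutative, replaying the log in a different order rebuilds the same state. This is only one simple situation in which the original receive order is not needed.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class LoggedProcess:
    """Toy process (hypothetical): state is a dict of per-sender totals."""
    state: dict = field(default_factory=dict)
    # Assumption: this log is kept on stable storage and survives a failure.
    log: list = field(default_factory=list)

    def receive(self, sender, amount):
        # Log the message before processing it, so it can be replayed later.
        self.log.append((sender, amount))
        self.apply(sender, amount)

    def apply(self, sender, amount):
        # Deterministic, commutative handler: add the amount to the sender's total.
        self.state[sender] = self.state.get(sender, 0) + amount


def recover(logged_messages):
    """Rebuild the state of a failed process by replaying logged messages."""
    replacement = LoggedProcess()
    for sender, amount in logged_messages:
        replacement.apply(sender, amount)
    return replacement.state


# Usage: run a process, then rebuild its state from the log alone.
original = LoggedProcess()
original.receive("a", 1)
original.receive("b", 2)
original.receive("a", 3)

# Replay in the original receive order.
same_order_state = recover(original.log)

# Replay in every possible order; for this commutative handler all agree.
all_orders_agree = all(
    recover(order) == original.state for order in permutations(original.log)
)
```

For a handler that is not commutative, some orderings would yield a different state, so a real scheme has to state which orderings are acceptable. The abstract indicates that the paper develops such conditions.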
Author information
Pankaj Jalote received the Bachelor of Technology degree in electrical engineering from the Indian Institute of Technology, Kanpur, India, in 1980, the M.S. degree in computer science from Pennsylvania State University, University Park, in 1982, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 1985. From August 1985 to July 1989 he was an Assistant Professor in the Department of Computer Science at the University of Maryland, College Park. Currently he is an Assistant Professor in the Department of Computer Science and Engineering at IIT Kanpur, India. His research interests include fault-tolerant computing, distributed systems, and software engineering.
This work was supported in part by NSF grant DCI-8610337.
Cite this article
Jalote, P. Fault tolerant processes. Distrib Comput 3, 187–195 (1989). https://doi.org/10.1007/BF01784887