Nothing Special   »   [go: up one dir, main page]

skip to main content
article
Free access

Implementing fault-tolerant services using the state machine approach: a tutorial

Published: 01 December 1990 Publication History

Abstract

The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models—Byzantine and fail stop. Systems reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.

References

[1]
AIZIKOWITZ, J. 1989. Designing distributed services using refinement mappings. Ph.D. dissertation, Computer Science Dept., Cornell Univ., Ithaca, New York. Also available as Tech. Rep. TR 89-1040.
[2]
BERNSTEIN, A. J. 1985. A loosely coupled system for reliably storing data. IEEE Trans. Softw. Eng. SE-11, 5 (May), 446-454.
[3]
BIRMAN, K. P. 1985. Replication and fault tolerance in the ISIS system. In Proceedings of the lOth A CM Symposium on Operating Systems Principles (Orcas Island, Washington, Dec. 1985), A CM, pp. 79-86.
[4]
BIRMAN, K. P., AND JOSEPH, T. 1987. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76.
[5]
CRISTIAN, F., AGHILI, H., STRONG, H. R., AND DOLEV, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Conference on Fault-tolerant Computing (Ann Arbor, Mich., June 1985), IEEE Computer Society.
[6]
DIJKSTRA, E. W. 1974. Self stabilization in spite of distributed control. Commun. A CM I7, 11 (Nov.), 643-644.
[7]
FISCHER, M., LYNCH, N., AND PATERSON, M. 1985. Impossibility of distributed consensus with one faulty process, d. ACM 32, 2 (Apr. 1986), 374-382.
[8]
GARCIA-MOLINA, H., PITTELLI, F., AND DAVIDSON, S. 1986. Application of Byzantine agreement in database systems. ACM TODS 11, 1 (Mar. 1986), 27-47.
[9]
GOPAL, A., STRONG, R., TOUEG, S., AND CRISTIAN, F., 1990. Early-delivery atomic broadcast. To appear in Proceedings of the 9th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990).
[10]
GRAY, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, Lecture Notes in Computer Science. Vol. 60. Springer- Verlag, New York, pp. 393-481.
[11]
HALPERN, J., SIMONS, B., STRONG, R., AND DOLEV, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 89-102.
[12]
HUTCHINSON, N., AND PETERSON, L. 1988. Design of the x-kernel. In Proceedings of SIGCOMM '88--Symposium on Communication Architectures and Protocols (Stanford, Calif., Aug.), pp. 65-75.
[13]
LAMPORT, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565.
[14]
LAMPORT, L. 1979b. The implementation of reliable distributed multiprocess systems. Comput. Networks 2, 95-114.
[15]
LAMPORT, L. 1984. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (Apr.), 254-280.
[16]
LAMPORT, L. 1989. The part-time parliament. Tech. Rep. 49. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif.
[17]
LAMPORT, L., AND MELLIAR-SMITH, P. M. 1984. Byzantine clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), 68-74.
[18]
LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM TOPLAS 4, 3 (July), 382-401.
[19]
LISKOV, B., AND LADIN, R. 1986. Highly available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th A CM Symposium on Principles of Distributed Computing (Calgary, Alberta, Canada, Aug.), ACM, pp. 29-39.
[20]
MANCINI, L., AND PAPPALARDO, G. 1988. Towards a theory of replicated processing. Formal Techniques in Real-Time and Fault-Tolerant Systems. Lecture Notes in Computer Science, Vol. 331. Springer-Verlag, New York, pp. 175-192.
[21]
MARZULLO, K. 1989. Implementing fault-tolerant sensors. Tech. Rep. TR 89-997. Computer Science Dept., Cornell Univ., Ithaca, New York.
[22]
MARZULLO, K., AND SCHMUCK, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems (San Jose, CA, June), IEEE Computer Society, pp. 447-455.
[23]
PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHT- ING, R. D. 1989. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (Aug.), 217-246.
[24]
PITTELLI, F. M., AND GARCIA-MOLINA, S. 1989. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb.), 25-60.
[25]
SCHLICHTING, R. D., AND SCHNEIDER, F. B. 1983. Fail-Stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS I, 3 (Aug.), 222-238.
[26]
SCHNEIDER, F. B. 1980. Ensuring consistency on a distributed database system by use of distributed semaphores. In Proceedings of International Symposium on Distributed Data Bases (Paris, France, Mar.), INRIA, pp. 183-189.
[27]
SCHNEIDER, F. B. 1982. Synchronization in distributed programs. ACM TOPLAS 4, 2 (Apr.), 179-195.
[28]
SCHNEIDER, F. B. 1984. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May), 145-154.
[29]
SCHNEIDER, F. B. 1985. Paradigms for distributed programs. Distributed Systems. Methods and Tools for Specification. Lecture Notes in Computer Science, Vol. 190. Springer-Verlag, New York, pp. 343-430.
[30]
SCHNEIDER, F. B. 1986. A paradigm for reliable clock synchronization. In Proceedings of the Advanced Seminar on Real-Time Local Area Networks (Bandol, France, Apr.), INRIA, pp. 85-104.
[31]
SCHNEIDER, F. B., GRIES, D., AND SCHLICHTING, R. D. 1984. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1-15.
[32]
SIEWIOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.
[33]
SKEEN, D. 1982. Crash recovery in a distributed database system. Ph.D. dissertation, Univ. of California at Berkeley, May.
[34]
STRONG, H. R., AND DOLEV, D. 1983. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers. (Compcon 83, IEEE Computer Society, Mar.), IEEE Computer Society, pp. 77-82.
[35]
WENSLEY, J., WENSKY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P. M., SHOSTAK, R. E., and WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.

Cited By

View all
  • (2025)A survey on scalable consensus algorithms for blockchain technologyCyber Security and Applications10.1016/j.csa.2024.1000653(100065)Online publication date: Dec-2025
  • (2024)Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned CheckpointJournal of Internet Services and Applications10.5753/jisa.2024.389115:1(194-211)Online publication date: 26-Jul-2024
  • (2024)IONIAProceedings of the 22nd USENIX Conference on File and Storage Technologies10.5555/3650697.3650711(225-242)Online publication date: 27-Feb-2024
  • Show More Cited By

Recommendations

Reviews

Valentin Cristea

Distributed software structured in terms of clients and servers is considered. Replicas of a single server are executed on separate processors of a distributed system, and protocols coordinate client interactions with these replicas. The paper describes how a system can be viewed in terms of a state machine, clients, and output devices. In this context, Schneider considers two representative classes of faulty behavior: Byzantine failures and fail-stop failures. The core sections of the paper present algorithms that cope with these failures. An important class of optimizations and the dynamic reconfiguration are also tackled. A separate section discusses related work . The paper is intended for people working in the domain of distributed systems and real-time systems. It systematically presents protocols that involve replication of components using the state machine approach, although few of these protocols were obtained in this manner. The paper was received in November 1987 and the final revision was accepted in January 1990. Unfortunately, this long delay is easily perceived by the reader.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Computing Surveys
ACM Computing Surveys  Volume 22, Issue 4
Dec. 1990
109 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/98163
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 1990
Published in CSUR Volume 22, Issue 4

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,145
  • Downloads (Last 6 weeks)155
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2025)A survey on scalable consensus algorithms for blockchain technologyCyber Security and Applications10.1016/j.csa.2024.1000653(100065)Online publication date: Dec-2025
  • (2024)Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned CheckpointJournal of Internet Services and Applications10.5753/jisa.2024.389115:1(194-211)Online publication date: 26-Jul-2024
  • (2024)IONIAProceedings of the 22nd USENIX Conference on File and Storage Technologies10.5555/3650697.3650711(225-242)Online publication date: 27-Feb-2024
  • (2024)Modelling the Raft Distributed Consensus Protocol in mCRL2Electronic Proceedings in Theoretical Computer Science10.4204/EPTCS.399.4399(7-20)Online publication date: 27-Mar-2024
  • (2024)RHCA: Robust HCA via Consistent RevotingMathematics10.3390/math1204059312:4(593)Online publication date: 17-Feb-2024
  • (2024)Asynchronous Consensus Quorum Read: Pioneering Read Optimization for Asynchronous Consensus ProtocolsElectronics10.3390/electronics1303048113:3(481)Online publication date: 23-Jan-2024
  • (2024)A Survey of Consortium Blockchain and Its ApplicationsCryptography10.3390/cryptography80200128:2(12)Online publication date: 22-Mar-2024
  • (2024)PALF: Replicated Write-Ahead Logging for Distributed DatabasesProceedings of the VLDB Endowment10.14778/3685800.368580317:12(3745-3758)Online publication date: 1-Aug-2024
  • (2024)Rashnu: Data-Dependent Order-FairnessProceedings of the VLDB Endowment10.14778/3665844.366586117:9(2335-2348)Online publication date: 1-May-2024
  • (2024)Towards Full Stack Adaptivity in Permissioned BlockchainsProceedings of the VLDB Endowment10.14778/3641204.364121617:5(1073-1080)Online publication date: 1-Jan-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media