survey

Fault-Tolerant Systems

Published: 01 December 1976

Abstract

This paper discusses the basic concepts, motivation, and techniques of fault tolerance. Topics include fault classification, redundancy techniques, reliability modeling and prediction, examples of fault-tolerant computers, and some approaches to tolerating design faults.


Published In

IEEE Transactions on Computers, Volume 25, Issue 12
December 1976
178 pages

Publisher

IEEE Computer Society

United States

Author Tags

  1. fault classification
  2. fault tolerance
  3. fault-tolerant computer design
  4. redundancy techniques
  5. reliability modeling
  6. reliable computing

Qualifiers

  • Survey
