Abstract
This paper analyses the problems involved in achieving very high reliability from complex computing systems, and discusses the relationship between system structuring techniques and fault tolerance techniques. Topics covered include (i) differing types of reliability requirement, (ii) forms of protective redundancy in hardware and software systems, (iii) methods of structuring the activity of a system, using atomic actions, so as to limit information flow, (iv) error detection techniques, (v) strategies for locating and dealing with faults, and for assessing the damage they have caused, and (vi) forward and backward error recovery techniques, based on the concepts of recovery line, commitment, exception and compensation. A set of appendices provides summary descriptions and analyses of a number of computing systems that have been specifically designed to achieve very high reliability.
Copyright information
© 1978 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Randell, B. (1978). Reliable computing systems. In: Bayer, R., Graham, R.M., Seegmüller, G. (eds) Operating Systems. Lecture Notes in Computer Science, vol 60. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-08755-9_8
DOI: https://doi.org/10.1007/3-540-08755-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-08755-7
Online ISBN: 978-3-540-35880-0
eBook Packages: Springer Book Archive