Reliable computing systems

B. Randell¹

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 60)

Abstract

The paper presents an analysis of the various problems involved in achieving very high reliability from complex computing systems, and discusses the relationship between system structuring techniques and techniques of fault tolerance. Topics covered include (i) differing types of reliability requirement, (ii) forms of protective redundancy in hardware and software systems, (iii) methods of structuring the activity of a system, using atomic actions, so as to limit information flow, (iv) error detection techniques, (v) strategies for locating and dealing with faults, and for assessing the damage they have caused, and (vi) forward and backward error recovery techniques, based on the concepts of recovery line, commitment, exception and compensation. A set of appendices provides summary descriptions and analyses of a number of computing systems that have been specifically designed with the aim of achieving very high reliability.

Author information

Authors and Affiliations

University of Newcastle upon Tyne, Newcastle upon Tyne, England
B. Randell

Editor information

R. Bayer, R. M. Graham, G. Seegmüller

About this chapter

Cite this chapter

Randell, B. (1978). Reliable computing systems. In: Bayer, R., Graham, R.M., Seegmüller, G. (eds) Operating Systems. Lecture Notes in Computer Science, vol 60. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-08755-9_8

DOI: https://doi.org/10.1007/3-540-08755-9_8
Published: 25 May 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-08755-7
Online ISBN: 978-3-540-35880-0
eBook Packages: Springer Book Archive
