
CSC310

Fault Tolerance
SPRING 2023

Lecture 01 - Preliminaries


Instructor: Dr. Belal Badawy

Introduction
 In recent years, the computer has become an essential component of
human life: not only the computers visible as desktops, laptops, and
PDAs, but also the invisible computers embedded everywhere.

 Computers (hardware and software) are quite likely the most complex
systems ever created by human beings, and with that complexity comes
an increased propensity to failure.

 Computer scientists and engineers have responded to the challenge of
designing complex systems with a variety of tools and techniques to
reduce the number of faults in the systems they build, and to tolerate
the faults that remain: fault tolerance.
Introduction
 Fault Classification
o Fault: a condition that causes the hardware or software to fail to
perform its required function.
o Error: the difference between the actual output and the expected
output (it is a manifestation of the fault).
o Failure: the inability of a system or component to perform its
required function.

Fault → Error → Failure (the FEF chain)
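
To make the chain concrete, here is a small, hypothetical Python
illustration (not from the slides): the bug in the code is the fault,
the wrong value it produces is the error, and the incorrect result
delivered to the user is the failure.

```python
# Hypothetical illustration of the fault -> error -> failure chain.

def average(values):
    # FAULT: the divisor should be len(values); using len(values) - 1
    # is a latent defect sitting in the code.
    return sum(values) / (len(values) - 1)

data = [10, 20, 30]
result = average(data)  # ERROR: internal state is wrong (30.0, not 20.0)
print(result)           # FAILURE: the wrong value reaches the user
```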
Introduction
 Hardware Fault Classification (regarding their duration)

o Permanent: reflects the permanent going out of commission of a
component.

o Transient: causes a component to malfunction for some time, after
which it disappears.

o Intermittent: never quite goes away entirely; it oscillates between
being quiescent and active.


Introduction
 Hardware Fault Classification (regarding their effect)

o Benign: a fault that just causes a unit to go dead. Such faults are
the easiest to deal with.

o Malicious: far more insidious are the faults that cause a unit to
produce reasonable-looking but incorrect output, or that make a
component “act maliciously” and send differently valued outputs to
different receivers.

Introduction
 Fault Tolerance: a fault-tolerant system is one that continues to
perform at the desired level of service in spite of failures in some
of the components that constitute the system.

 All of fault tolerance is an exercise in exploiting and managing
redundancy.

 Redundancy: the use of some additional elements within the system
that would not be required in a system free from all faults. (It is
needed to detect or mask a fault so that the system can continue to
operate even when some component has failed.)
Redundancy

 Types of Redundancy

 Hardware redundancy

 Software redundancy

 Information redundancy

 Time redundancy

Redundancy
 Hardware redundancy

Is provided by incorporating extra hardware into the design to either
detect or override the effects of a failed component.

 Types of Hardware Redundancy Techniques:

 Static: utilizes fault masking rather than fault detection (see the
sketch below).

 Dynamic: depends on the detection of faults and on the system taking
appropriate actions to nullify their effects (this involves
reconfiguration).

 Hybrid: a combination of static and dynamic redundancy techniques.
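
As a software analogue of the static approach, the sketch below models
triple modular redundancy (TMR), the classic static technique: three
replicas of a module feed a majority voter, which masks the output of
a single faulty replica. The replica outputs here are hypothetical,
for illustration only.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Masks one faulty replica out of three; raises if no majority
    exists (e.g., two replicas failing with different wrong values).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise RuntimeError("no majority: the fault cannot be masked")

# Three hypothetical replicas of the same module; one has failed.
replica_outputs = [42, 42, 99]         # the third replica is faulty
print(majority_vote(replica_outputs))  # -> 42: the fault is masked
```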

Redundancy
 Software redundancy

 Is used mainly against software faults. Dealing with such faults can
be expensive.

 One way is to independently produce two or more versions of the
software (preferably by disjoint teams of programmers).

 The multiple versions of the program can be executed either
concurrently (requiring redundant hardware as well) or sequentially
(requiring extra time redundancy) upon the detection of a failure.
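
A minimal sketch of the two-version idea, assuming two hypothetical,
independently written implementations of the same routine: running
both and comparing their outputs detects a failure in either version.

```python
# Hypothetical two-version scheme: both versions compute a square root.

def sqrt_v1(x):
    return x ** 0.5              # version written by team A

def sqrt_v2(x, iterations=40):
    guess = max(x, 1.0)          # version written by team B
    for _ in range(iterations):  # (Newton's method)
        guess = 0.5 * (guess + x / guess)
    return guess

def checked_sqrt(x, tolerance=1e-9):
    a, b = sqrt_v1(x), sqrt_v2(x)
    if abs(a - b) > tolerance:   # disagreement signals a software fault
        raise RuntimeError(f"version mismatch: {a!r} vs {b!r}")
    return a

print(checked_sqrt(2.0))         # both versions agree: 1.41421356...
```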

Redundancy
 Information redundancy

 Uses error-detecting and error-correcting coding: extra bits (called
check bits) are added to the original data bits.

 Used in memory units and various storage devices to protect against
benign failures.

 Error-detecting and error-correcting codes are also used to protect
data communicated over noisy channels.
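
As a toy instance of check bits, the sketch below appends a single
even-parity bit to a data word; any single flipped bit is then
detectable (though not correctable, and an even number of flips would
go unnoticed).

```python
def add_parity(data_bits):
    """Append an even-parity check bit to a list of data bits."""
    return data_bits + [sum(data_bits) % 2]

def parity_ok(word):
    """True if the word (data bits plus check bit) has even parity."""
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])  # -> [1, 0, 1, 1, 1]
assert parity_ok(word)

word[2] ^= 1                     # a single bit flip (e.g., a memory fault)
print(parity_ok(word))           # -> False: the error is detected
```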

Redundancy

 Time redundancy

 Computing nodes can also exploit time redundancy through the
re-execution of the same program on the same hardware.

 It is effective mainly against transient faults, because the majority
of hardware faults are transient.

 Compared with the other forms of redundancy, it has much lower
hardware and software overhead, but it incurs a high performance
penalty.
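
A minimal sketch of the idea, assuming a hypothetical computation
whose transient faults corrupt at most an occasional run: the program
is executed repeatedly until two consecutive runs agree.

```python
def run_with_time_redundancy(computation, max_attempts=5):
    """Re-execute `computation` until two consecutive runs agree.

    Effective against transient faults, which are unlikely to corrupt
    two re-executions in the same way; useless against permanent ones.
    """
    previous = computation()
    for _ in range(max_attempts - 1):
        current = computation()
        if current == previous:  # two runs agree: accept the result
            return current
        previous = current       # mismatch: assume a transient fault, retry
    raise RuntimeError("no two consecutive runs agreed")

# Usage with a hypothetical (here fault-free) computation:
print(run_with_time_redundancy(lambda: 2 + 2))  # -> 4
```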

Basic Measures of Fault Tolerance

 Because fault tolerance is about making machines more dependable, it
is important to have proper metrics with which to measure such
dependability. These measures fall into two categories:

 Traditional Measures

 Network Measures

Basic Measures of Fault Tolerance

 Traditional Measures

 These describe the dependability of a single computer. Two of the
most basic attributes of a system are reliability and availability.

 Reliability, denoted by R(t), is the probability (as a function of
the time t) that the system has been operating correctly throughout
the interval [0, t].

 One example is computers that control physical processes such as
aircraft, for which failure would result in catastrophe.

Basic Measures of Fault Tolerance

 Reliability is closely related to the average time the system
operates until a failure occurs, the Mean Time To Failure (MTTF), and
to the average time between two consecutive failures, the Mean Time
Between Failures (MTBF).

 The difference between the two is the time needed to repair the
system following the first failure. Denoting this by the Mean Time To
Repair (MTTR), we obtain MTBF = MTTF + MTTR.

Basic Measures of Fault Tolerance

 Availability, denoted by A(t), is the average fraction of time over
the interval [0, t] that the system is up.

 This measure is appropriate for applications in which continuous
performance is not vital but for which it would be expensive to have
the system down for a significant amount of time.

 One example is an airline reservation system, which needs to be
highly available because downtime can put off customers and lose
sales; its short-duration failures, however, can be well tolerated.
Basic Measures of Fault Tolerance
 The long-term availability, denoted by A, is defined as
A = lim(t→∞) A(t).

 The long-term availability can be calculated from MTTF, MTBF, and
MTTR as follows:

A = MTTF / MTBF = MTTF / (MTTF + MTTR)

 It is possible for a low-reliability system to have high
availability.

 Example: a system that fails every hour on the average but comes back
up after only one second has an MTBF of just 1 hour (a low
reliability), but its availability is high: A = 3599/3600 ≈ 0.99972.
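
The slide's arithmetic, restated as a small sketch of the formula
above:

```python
def long_term_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF (times in the same unit)."""
    return mttf / (mttf + mttr)

# Fails every hour on average (MTBF = 3600 s), repaired in one second:
print(long_term_availability(mttf=3599, mttr=1))  # -> 0.9997222...
```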
Basic Measures of Fault Tolerance

 Network Measures.

 There are more specialized measures, focusing on the network that
connects the processors together. The simplest of these are the
classical node and line connectivities, defined as the minimum number
of nodes and lines, respectively, that have to fail before the network
becomes disconnected.

Basic Measures of Fault Tolerance
 Network Measures.

 Classical connectivity is a very basic measure of network
reliability: it distinguishes between only two network states,
connected and disconnected. It says nothing about how the network
degrades as nodes fail before, or after, it becomes disconnected.

 To express “connectivity robustness,” we can use additional
measures: the average node-pair distance and the network diameter
(the maximum node-pair distance), both calculated given the
probability of node and/or link failure.
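
For a given topology these measures are easy to compute; below is a
sketch using the networkx library (an assumption, any graph library
would do) on a hypothetical five-node ring, evaluated here in the
fault-free state rather than under probabilistic failures.

```python
import networkx as nx

# Hypothetical topology: five processors connected in a ring.
G = nx.cycle_graph(5)

print(nx.node_connectivity(G))             # 2: min node failures to disconnect
print(nx.edge_connectivity(G))             # 2: min line failures to disconnect
print(nx.average_shortest_path_length(G))  # 1.5: average node-pair distance
print(nx.diameter(G))                      # 2: network diameter
```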
Motivation

 Who is concerned about fault tolerance?
   System users – irrespective of the application, but some are a lot
   more concerned than others
 Who is concerned at the design stages?
   Universities: R, d, and a (Research, development, applications)
   Industry: r, D, and A (research, Development, Applications)
 Issues
   Design, Analysis/Validation, Implementation, Testing/Validation,
   Evaluation