
CSC310

Fault Tolerance
SPRING 2023

Lecture 01 - Preliminaries


Instructor: Dr. Belal Badawy

Introduction
 In recent years, the computer has become an essential component of
human life: not only the computers visible as desktops, laptops, and
PDAs, but also the invisible computers embedded everywhere.

 Computers (hardware and software) are quite likely the most complex
systems ever created by human beings, and with that complexity comes
an increased propensity to failure.

 Computer scientists and engineers have responded to the challenge of
designing complex systems with a variety of tools and techniques to
reduce the number of faults in the systems they build, and to tolerate
the faults that remain: fault tolerance.
Introduction
 Fault Classification
o Fault: a condition that causes the hardware or software to fail to
perform its required function.
o Error: the difference between the actual output and the expected
output (it is a manifestation of the fault).
o Failure: the inability of a system or component to perform its
required function.

Fault → Error → Failure (the FEF chain)
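
To make the chain concrete, here is a small, hypothetical Python
illustration (not from the slides): the bug in the code is the fault,
the wrong value it produces is the error, and the incorrect result
delivered to the user is the failure.

```python
# Hypothetical illustration of the fault -> error -> failure chain.

def average(values):
    # FAULT: the divisor should be len(values); using len(values) - 1
    # is a latent defect sitting in the code.
    return sum(values) / (len(values) - 1)

data = [10, 20, 30]
result = average(data)  # ERROR: internal state is wrong (30.0, not 20.0)
print(result)           # FAILURE: the wrong value reaches the user
```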
Introduction
 Hardware Fault Classification (regarding their duration)

o Permanent: reflects the permanent going out of commission of a
component.

o Transient: causes a component to malfunction for some time, after
which it disappears.

o Intermittent: never quite goes away entirely; it oscillates between
being quiescent and active.


Introduction
 Hardware Fault Classification (regarding their effect)

o Benign: a fault that just causes a unit to go dead. Such faults are
the easiest to deal with.

o Malicious: far more insidious are the faults that cause a unit to
produce reasonable-looking but incorrect output, or that make a
component “act maliciously” and send differently valued outputs to
different receivers.

Introduction
 Fault Tolerance: a fault-tolerant system is one that continues to
perform at the desired level of service in spite of failures in some
of the components that constitute the system.

 All of fault tolerance is an exercise in exploiting and managing
redundancy.

 Redundancy: the use of some additional elements within the system
that would not be required in a system free from all faults. (It is
needed to detect or mask a fault so that the system can continue to
operate even when some component has failed.)
Redundancy

 Types of Redundancy

 Hardware redundancy

 Software redundancy

 Information redundancy

 Time redundancy

Redundancy
 Hardware redundancy

Is provided by incorporating extra hardware into the design to either
detect or override the effects of a failed component.

 Types of Hardware Redundancy Techniques:

 Static: utilizes fault masking rather than fault detection (see the
sketch below).

 Dynamic: depends on the detection of faults and on the system taking
appropriate actions to nullify their effects (this involves
reconfiguration).

 Hybrid: a combination of static and dynamic redundancy techniques.
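
As a software analogue of the static approach, the sketch below models
triple modular redundancy (TMR), the classic static technique: three
replicas of a module feed a majority voter, which masks the output of
a single faulty replica. The replica outputs here are hypothetical,
for illustration only.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Masks one faulty replica out of three; raises if no majority
    exists (e.g., two replicas failing with different wrong values).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise RuntimeError("no majority: the fault cannot be masked")

# Three hypothetical replicas of the same module; one has failed.
replica_outputs = [42, 42, 99]         # the third replica is faulty
print(majority_vote(replica_outputs))  # -> 42: the fault is masked
```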

Redundancy
 Software redundancy

 Is used mainly against software faults. Dealing with such faults can
be expensive.

 One way is to independently produce two or more versions of the
software (preferably by disjoint teams of programmers).

 The multiple versions of the program can be executed either
concurrently (requiring redundant hardware as well) or sequentially
(requiring extra time redundancy) upon the detection of a failure.
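
A minimal sketch of the two-version idea, assuming two hypothetical,
independently written implementations of the same routine: running
both and comparing their outputs detects a failure in either version.

```python
# Hypothetical two-version scheme: both versions compute a square root.

def sqrt_v1(x):
    return x ** 0.5              # version written by team A

def sqrt_v2(x, iterations=40):
    guess = max(x, 1.0)          # version written by team B
    for _ in range(iterations):  # (Newton's method)
        guess = 0.5 * (guess + x / guess)
    return guess

def checked_sqrt(x, tolerance=1e-9):
    a, b = sqrt_v1(x), sqrt_v2(x)
    if abs(a - b) > tolerance:   # disagreement signals a software fault
        raise RuntimeError(f"version mismatch: {a!r} vs {b!r}")
    return a

print(checked_sqrt(2.0))         # both versions agree: 1.41421356...
```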

Redundancy
 Information redundancy

 Uses error-detecting and error-correcting coding: extra bits (called
check bits) are added to the original data bits.

 Used in memory units and various storage devices to protect against
benign failures.

 Error-detecting and error-correcting codes are also used to protect
data communicated over noisy channels.
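
As a toy instance of check bits, the sketch below appends a single
even-parity bit to a data word; any single flipped bit is then
detectable (though not correctable, and an even number of flips would
go unnoticed).

```python
def add_parity(data_bits):
    """Append an even-parity check bit to a list of data bits."""
    return data_bits + [sum(data_bits) % 2]

def parity_ok(word):
    """True if the word (data bits plus check bit) has even parity."""
    return sum(word) % 2 == 0

word = add_parity([1, 0, 1, 1])  # -> [1, 0, 1, 1, 1]
assert parity_ok(word)

word[2] ^= 1                     # a single bit flip (e.g., a memory fault)
print(parity_ok(word))           # -> False: the error is detected
```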

Redundancy

 Time redundancy

 Computing nodes can also exploit time redundancy through the
re-execution of the same program on the same hardware.

 It is effective mainly against transient faults, because the majority
of hardware faults are transient.

 Compared with the other forms of redundancy, it has much lower
hardware and software overhead, but it incurs a high performance
penalty.
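
A minimal sketch of the idea, assuming a hypothetical computation
whose transient faults corrupt at most an occasional run: the program
is executed repeatedly until two consecutive runs agree.

```python
def run_with_time_redundancy(computation, max_attempts=5):
    """Re-execute `computation` until two consecutive runs agree.

    Effective against transient faults, which are unlikely to corrupt
    two re-executions in the same way; useless against permanent ones.
    """
    previous = computation()
    for _ in range(max_attempts - 1):
        current = computation()
        if current == previous:  # two runs agree: accept the result
            return current
        previous = current       # mismatch: assume a transient fault, retry
    raise RuntimeError("no two consecutive runs agreed")

# Usage with a hypothetical (here fault-free) computation:
print(run_with_time_redundancy(lambda: 2 + 2))  # -> 4
```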

Basic Measures of Fault Tolerance

 Because fault tolerance is about making machines more dependable, it
is important to have proper metrics with which to measure such
dependability. These measures fall into two categories:

 Traditional Measures

 Network Measures

Basic Measures of Fault Tolerance

 Traditional Measures

 These describe the dependability of a single computer. Two of the
most basic attributes of a system are reliability and availability.

 Reliability, denoted by R(t), is the probability (as a function of
the time t) that the system has been operating correctly throughout
the interval [0, t].

 One example is computers that control physical processes such as
aircraft, for which failure would result in catastrophe.

Basic Measures of Fault Tolerance

 Reliability is closely related to the average time the system
operates until a failure occurs, the Mean Time To Failure (MTTF), and
to the average time between two consecutive failures, the Mean Time
Between Failures (MTBF).

 The difference between the two is the time needed to repair the
system following the first failure. Denoting this by the Mean Time To
Repair (MTTR), we obtain MTBF = MTTF + MTTR.

Basic Measures of Fault Tolerance

 Availability, denoted by A(t), is the average fraction of time over
the interval [0, t] that the system is up.

 This measure is appropriate for applications in which continuous
performance is not vital but for which it would be expensive to have
the system down for a significant amount of time.

 One example is an airline reservation system, which needs to be
highly available because downtime can put off customers and lose
sales; its short-duration failures, however, can be well tolerated.
Basic Measures of Fault Tolerance
 The long-term availability, denoted by A, is defined as
A = lim(t→∞) A(t).

 The long-term availability can be calculated from MTTF, MTBF, and
MTTR as follows:

A = MTTF / MTBF = MTTF / (MTTF + MTTR)

 It is possible for a low-reliability system to have high
availability.

 Example: a system that fails every hour on the average but comes back
up after only one second has an MTBF of just 1 hour (a low
reliability), but its availability is high: A = 3599/3600 ≈ 0.99972.
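
The slide's arithmetic, restated as a small sketch of the formula
above:

```python
def long_term_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF (times in the same unit)."""
    return mttf / (mttf + mttr)

# Fails every hour on average (MTBF = 3600 s), repaired in one second:
print(long_term_availability(mttf=3599, mttr=1))  # -> 0.9997222...
```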
Basic Measures of Fault Tolerance

 Network Measures.

 There are more specialized measures, focusing on the network that
connects the processors together. The simplest of these are the
classical node and line connectivities, defined as the minimum number
of nodes and lines, respectively, that have to fail before the network
becomes disconnected.

Basic Measures of Fault Tolerance
 Network Measures.

 Classical connectivity is a very basic measure of network
reliability: it distinguishes between only two network states,
connected and disconnected. It says nothing about how the network
degrades as nodes fail before, or after, it becomes disconnected.

 To express “connectivity robustness,” we can use additional
measures: the average node-pair distance and the network diameter
(the maximum node-pair distance), both calculated given the
probability of node and/or link failure.
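
For a given topology these measures are easy to compute; below is a
sketch using the networkx library (an assumption, any graph library
would do) on a hypothetical five-node ring, evaluated here in the
fault-free state rather than under probabilistic failures.

```python
import networkx as nx

# Hypothetical topology: five processors connected in a ring.
G = nx.cycle_graph(5)

print(nx.node_connectivity(G))             # 2: min node failures to disconnect
print(nx.edge_connectivity(G))             # 2: min line failures to disconnect
print(nx.average_shortest_path_length(G))  # 1.5: average node-pair distance
print(nx.diameter(G))                      # 2: network diameter
```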
Motivation

 Who is concerned about fault tolerance?
   System users – irrespective of the application, but some are a lot
   more concerned than others
 Who is concerned at the design stages?
   Universities: R, d, and a (Research, development, applications)
   Industry: r, D, and A (research, Development, Applications)
 Issues
   Design, Analysis/Validation, Implementation, Testing/Validation,
   Evaluation