Fault Lecture 01 - Introduction
Fault Tolerance
SPRING 2023
Introduction
In recent years, the computer has become an essential component of
human life: not only the computers visible as desktops, laptops, and PDAs,
but also the many invisible ones embedded everywhere.
Computers (hardware and software) are quite likely the most complex
systems ever created by human beings, and with that complexity comes
an increased propensity to failure.
Fault → Error → Failure (FEF chain)
Introduction
Hardware Fault Classification (Regarding their duration)
commission of a component.
Introduction
Fault-tolerant system: one that continues to perform at a desired level of
redundancy.
system which would not be required in a system that was free from all
faults. (It is needed to detect or mask a fault and to continue operating
even if some redundant component has failed.)
Redundancy
Types of Redundancy
Hardware redundancy
Software redundancy
Information redundancy:
Time redundancy
Redundancy
Hardware redundancy
Redundancy
Software redundancy
Is used mainly against software failures. Dealing with such faults can be
expensive.
Redundancy
Information redundancy:
Using error detection and correction coding. Extra bits (called check
benign failures.
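A minimal sketch of information redundancy, assuming the simplest possible code: a single even-parity check bit appended to a data word. The function names and the sample word are illustrative and not taken from the lecture.

```python
def add_even_parity(bits):
    """Append one check bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(codeword):
    """Return True if the codeword still contains an even number of 1s."""
    return sum(codeword) % 2 == 0


# Hypothetical 4-bit data word used only for illustration.
word = [1, 0, 1, 1]
codeword = add_even_parity(word)

# Simulate a single-bit fault by flipping one bit of the stored codeword.
corrupted = codeword.copy()
corrupted[2] ^= 1

print(parity_ok(codeword))   # check passes on the intact codeword
print(parity_ok(corrupted))  # check fails after the single-bit flip
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them; correction requires more check bits.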
Redundancy
Time redundancy:
Computing nodes can also exploit time redundancy through the re-
Basic Measures of Fault Tolerance
Traditional Measures.
Network Measures.
Basic Measures of Fault Tolerance
Traditional Measures.
Basic Measures of Fault Tolerance
The difference between the two is due to the time needed to repair the
Network Measures.
Basic Measures of Fault Tolerance
available, because downtime can put off customers and lose sales;
short-duration failures can be tolerated well.
Basic Measures of Fault Tolerance
The long-term availability, denoted by A, is defined as $A = \lim_{t \to \infty} A(t)$.
The long-term availability can be calculated from MTTF, MTBF, and
MTTR as follows:
availability,
Example: if a system fails every hour on the average but comes back
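A minimal sketch of the availability calculation, assuming the commonly used relations MTBF = MTTF + MTTR and A = MTTF / MTBF. The function name and the sample figures are hypothetical and are not taken from the lecture.

```python
def long_term_availability(mttf, mttr):
    """Long-term availability from mean time to failure and mean time to repair.

    Assumes MTBF = MTTF + MTTR and A = MTTF / MTBF; both arguments must use
    the same time unit.
    """
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("mttf and mttr must be non-negative and not both zero")
    mtbf = mttf + mttr
    return mttf / mtbf


# Hypothetical figures: 999 hours of operation per failure, 1 hour to repair.
a = long_term_availability(mttf=999.0, mttr=1.0)
print(f"A = {a:.3f}")
```

Under these assumptions, availability depends only on the ratio of MTTR to MTTF, so a system that fails often can still be highly available if repair is fast enough.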
Network Measures.
Basic Measures of Fault Tolerance
Network Measures.