Achieving Hardware Fault Tolerance

Session Ten: Achieving Compliance in Hardware Fault Tolerance

Achieving Compliance in Hardware Fault Tolerance
Mirek Generowicz
FS Senior Expert (TÜV Rheinland #183/12)
Engineering Manager, I&E Systems Pty Ltd

The functional safety standards ISA S84/IEC 61511 and IEC 61508 both set
out requirements for hardware fault tolerance or architectural constraints.
The method specified in ISA S84 and IEC 61511 for assessing hardware fault
tolerance has often proven to be impracticable for SIL 3 in the process sector.
Many users in the process sector have not been able to comply fully with the
requirements.
Further confusion has been created because there are many SIL certificates in
circulation that are undeniably incorrect and misleading.
This paper describes common problems and misunderstandings in assessing
hardware fault tolerance.
The 2010 edition of IEC 61508 brought in a new and much simpler and more
practicable method for assessing hardware fault tolerance. The method is
called Route 2H.
This paper explains how Route 2H overcomes the problems with the earlier
methods.
The proposed new edition of IEC 61511 will be based on Route 2H.

Why do we need Hardware Fault Tolerance?

The standards impose architectural constraints to compensate for the
uncertainty in the failure rates and the assumptions made in the design.
AS IEC 61508 and AS IEC 61511 both set architectural constraints according
to the required integrity level.
The architectural constraints are characterised by hardware fault tolerance
(HFT), the ability to perform a required function in the presence of a fault.
Hardware fault tolerance is the ability of a component or subsystem to
continue to be able to undertake the required safety instrumented
function in the presence of one or more dangerous faults in hardware. A
hardware fault tolerance of 1 means that there are, for example, two
devices and the architecture is such that the dangerous failure of one of
the two components or subsystems does not prevent the safety action
from occurring.
AS IEC 61511 explains that hardware fault tolerance is necessary:
to alleviate potential shortcomings in SIF design that may result due to
the number of assumptions made in the design of the SIF, along with
uncertainty in the failure rate of components or subsystems used in
various process applications.
The simple calculation of probability of failure is not enough. If we don't have
sufficient confidence in the failure rate data, the calculated probability may be
unrealistically optimistic.
We need to have a certain minimum level of fault tolerance in addition to
showing that the calculated probability of failure is low enough to meet the SIL
target. Exactly what level of fault tolerance we need depends on:
The level of confidence we have in the failure rate data,
The dominant failure behaviour (safe or dangerous)
Whether we can detect and respond to failures.
Fault tolerance may be achieved by using redundant elements in a fault
tolerant circuit architecture, for instance 2 valves in series:

The AS IEC 61511-1 method for HFT can only be used for relatively simple
architectures. The AS IEC 61508-2 methods can be applied to assess
hardware fault tolerance requirements for complex architectures.

What is the problem?

AS IEC 61511 sets requirements for HFT in Sub-clause 11.4.
Table 6 specifies the level of HFT for sensors and final elements. The level of
HFT required increases with SIL.
The basic table shows the requirement provided that the dominant failure mode
is to the safe state, or dangerous failures are detected:

[Table: minimum HFT required for each SIL]

If the dominant failure is to a dangerous state, and if we don't have effective
diagnostics, we need to increase the HFT:

[Table: minimum HFT required for each SIL, with one column for dominant failure
to a dangerous state and one for dominant failure to a safe state]

Actuated shutdown valves generally have dominant failure to a dangerous
state. They tend to jam or stick in an open position. They tend to leak.
These failure modes are not only dangerous, they are also undetected.
The basic requirement is 4 valves in series to achieve SIL 3!

The standard allows us to reduce the HFT requirement if we can demonstrate
limited adjustment and prior use (with extensive evidence):


[Table: minimum HFT required for each SIL with prior use and limited adjustment,
with one column for dominant failure to a dangerous state and one for dominant
failure to a safe state]

The bare minimum requirement for SIL 3 is therefore to have 3 valves in series:

This is simply NOT practicable. Installing 3 valves rather than 2 increases
capital cost as well as maintenance costs and it reduces reliability.

An alternative approach: AS IEC 61508 Route 1H

AS IEC 61508 Route 1H is allowed as an alternative to AS IEC 61511 to
determine the HFT required.
Route 1H distinguishes between simple Type A devices and complex Type B
devices.
Type A devices have:
Well defined failure modes
Deterministic behaviour
Sufficient dependable failure rate data
Other devices are classified as Type B. These are devices with complex behaviour
and failure modes, typically devices containing software.
Route 1H requires comprehensive data and documentation for every element
and rigorous quality management and configuration management. Safety
manuals must be provided for every element to demonstrate compliance to
AS IEC 61508.
The requirements for Type A are equivalent to those for limited adjustment
and prior use in AS IEC 61511.

Route 1H applies the concept of Safe Failure Fraction (SFF). This is another
way of assessing whether the dominant failure is to the safe state. The
maximum SIL that can be claimed depends on the HFT.
The results are very similar to those of the AS IEC 61511 method.
The following table shows the maximum SIL that can be claimed for Type A
elements under Route 1H, depending on the HFT and SFF:
[Table: maximum SIL that can be claimed, by Safe Failure Fraction
(rows: SFF < 60%; 60% ≤ SFF < 90%; 90% ≤ SFF < 99%; SFF ≥ 99%)
and Hardware Fault Tolerance (columns)]

If the SFF < 60% then the dominant failure mode is not to the safe state and to
claim SIL 3 we still need HFT 2, requiring 3 valves in series:

To claim SIL 3 with only 2 valves we need to prove that SFF ≥ 60%:

Common errors in trying to show compliance

The total failure rate $\lambda$ is the sum of the rate of safe failures, those causing
a trip ($\lambda_S$), plus the rate of dangerous failures detected by on-line diagnostics
($\lambda_{DD}$) and the rate of dangerous failures that remain undetected ($\lambda_{DU}$):

$\lambda = \lambda_S + \lambda_{DD} + \lambda_{DU}$

The SFF is the proportion of failures that are either safe ($\lambda_S$) or are
dangerous but detected by on-line diagnostics ($\lambda_{DD}$):

$\mathrm{SFF} = (\lambda_S + \lambda_{DD}) / \lambda$
Understandably, equipment suppliers and designers have been creative in
trying to prove that SFF ≥ 60%.

Error No. 1: No-effect Failures

The first trick is to add in irrelevant no-effect failures as if they were safe. For
example, a typical valve and actuator assembly will have:
$\lambda_S \approx 0.5 \times 10^{-6}$ failures per hour
$\lambda_D \approx 1 \times 10^{-6}$ failures per hour,
and no diagnostic functions, so
$\lambda_{DD} = 0$
$\mathrm{SFF} \approx 0.5 / 1.5 \approx 33\%$
No-effect failures have absolutely no effect on the safety function. A typical
no-effect failure might be a faulty position switch on the actuator.
Adding in the no-effect failures increases the SFF:
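$\lambda_S \approx 0.5 \times 10^{-6}$ failures per hour
$\lambda_{NE} \approx 1 \times 10^{-6}$ failures per hour,
$\lambda_D \approx 1 \times 10^{-6}$ failures per hour,
$\lambda_{DD} = 0$
$\mathrm{SFF} \approx 1.5 / 2.5 \approx 60\%$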
The formula for SFF given in AS IEC 61508 has never allowed the inclusion of
no-effect failures. Edition 2 of the standard (2010) added
specific clarification that no-effect failures must be excluded from SFF.
Beware that there are many certificates in circulation that are invalid because
they take credit for no-effect failures.
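
The two SFF calculations for the example valve and actuator assembly can be checked with a few lines of Python. This is only a minimal sketch: the function name `safe_failure_fraction` and the variable names are illustrative assumptions, and the rates are the example figures quoted in the text. The function has no input for no-effect failures, so the only way to reach 60% is to mislabel them as safe failures.

```python
def safe_failure_fraction(lam_s, lam_dd, lam_du):
    """SFF = (lam_s + lam_dd) / (lam_s + lam_dd + lam_du).

    No-effect failures are not part of the formula and must not be passed in.
    """
    total = lam_s + lam_dd + lam_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lam_s + lam_dd) / total


# Example rates from the text, in failures per hour.
LAM_S = 0.5e-6   # safe failures
LAM_D = 1.0e-6   # dangerous failures, all undetected (no diagnostics)
LAM_NE = 1.0e-6  # no-effect failures, e.g. a faulty position switch

# Correct calculation: no-effect failures are excluded.
correct_sff = safe_failure_fraction(LAM_S, 0.0, LAM_D)

# Invalid calculation: no-effect failures are counted as if they were safe.
inflated_sff = safe_failure_fraction(LAM_S + LAM_NE, 0.0, LAM_D)

print(f"SFF excluding no-effect failures: {correct_sff:.0%}")    # 33%
print(f"SFF wrongly including them:       {inflated_sff:.0%}")   # 60%
```

The dangerous failure rate is identical in both cases. Only the reported SFF changes, which is why the practice gives no real improvement in safety.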

This example certificate is no longer valid and has been withdrawn:

Any certificate taking credit for no-effect failures is invalid.

Beware that some of the commercial software packages commonly used for
SIL calculations take credit for no-effect failures (also called residual failures)
if the IEC 61508-2000 method is selected.
The 3rd edition of the SERH Safety Equipment Reliability Handbook published
by exida in 2007 takes credit for so-called residual failures in the calculation of
SFF. Users should recalculate the SFF excluding the residual failures.

Error No. 2: Partial Stroke Testing

The following example certificate takes credit for partial stroke testing in the
calculation of SFF:

Partial stroke testing can be claimed as a diagnostic if it is sufficiently frequent.

AS IEC 61508-2 defines the requirements for the frequency of
diagnostic functions.
In low demand mode
credit shall only be taken for the diagnostics if the sum of the diagnostic
test interval and the time to perform the repair of a detected failure is
less than the MTTR used in the calculation to determine the achieved
safety integrity for that safety function.
The diagnostic interval must be included in the MTTR that is used in calculating
probability of failure:

If the MTTR is extended to periods measured in months it will lead to a
significant increase in the probability of failure of the safety function.
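
A rough numerical sketch shows the size of the effect. It assumes the commonly used simplified single-channel (1oo1) approximation, in which detected dangerous failures contribute `lam_dd * MTTR` and undetected dangerous failures contribute `lam_du * T / 2` to the average probability of failure on demand. That formula, the function names, the 60% coverage, the 8 hour repair time and the yearly proof test are all assumptions chosen for illustration; they are not taken from this paper or from any certificate.

```python
HOURS_PER_YEAR = 8760.0


def minimum_mttr(diagnostic_interval_h, repair_time_h):
    """Lower bound on the MTTR that must be used if diagnostic credit is taken."""
    return diagnostic_interval_h + repair_time_h


def pfd_avg_1oo1(lam_dd, lam_du, mttr_h, proof_test_interval_h):
    """Simplified 1oo1 approximation (an assumption, see the lead-in)."""
    return lam_dd * mttr_h + lam_du * proof_test_interval_h / 2.0


LAM_D = 1.0e-6             # dangerous failure rate per hour (example valve)
COVERAGE = 0.6             # hypothetical fraction detected by the test
LAM_DD = COVERAGE * LAM_D
LAM_DU = (1.0 - COVERAGE) * LAM_D
REPAIR_TIME_H = 8.0        # hypothetical repair time
PROOF_TEST_H = HOURS_PER_YEAR

results = {}
for label, interval_h in (("daily", 24.0), ("6-monthly", HOURS_PER_YEAR / 2)):
    mttr_h = minimum_mttr(interval_h, REPAIR_TIME_H)
    results[label] = pfd_avg_1oo1(LAM_DD, LAM_DU, mttr_h, PROOF_TEST_H)
    print(f"{label:10s} MTTR >= {mttr_h:7.0f} h  PFDavg ~ {results[label]:.2e}")
```

With a daily test the detected failures add almost nothing to the result. With a 6-monthly test the term for detected failures becomes larger than the term for undetected failures.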
The same requirement applies to high demand mode and continuous mode
functions that have HFT > 0.
For high demand mode and continuous mode functions with HFT = 0, either:

The diagnostic interval + time for safety action response must be less
than the process safety time OR

The diagnostic test rate must be at least 100 times more frequent than
the demand rate.

Automatic weekly or daily testing might be sufficiently frequent for low demand
applications in the process sector but it is usually impractical.
6-monthly testing cannot be classed as a diagnostic and does not
contribute to improving SFF.
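
The frequency rules described above can be written as two simple checks. This is a sketch of the rules as paraphrased in this paper, not the normative wording of the standard; the function and parameter names are hypothetical and all times are in hours.

```python
def credit_by_mttr_rule(diagnostic_interval_h, repair_time_h, assumed_mttr_h):
    """Low demand mode, and high demand/continuous mode with HFT > 0.

    Credit is allowed only if test interval + repair time is less than the
    MTTR actually used in the probability of failure calculation.
    """
    return diagnostic_interval_h + repair_time_h < assumed_mttr_h


def credit_by_hft0_rule(diagnostic_interval_h, response_time_h,
                        process_safety_time_h, demand_interval_h):
    """High demand/continuous mode with HFT = 0: either condition is enough."""
    fast_enough = diagnostic_interval_h + response_time_h < process_safety_time_h
    # Test rate at least 100 times the demand rate, expressed with intervals.
    frequent_enough = diagnostic_interval_h * 100.0 <= demand_interval_h
    return fast_enough or frequent_enough


# A 6-monthly partial stroke test cannot be credited if the calculation
# assumes a 24 hour MTTR (hypothetical figures).
print(credit_by_mttr_rule(4380.0, 8.0, 24.0))

# A weekly automatic test can be, provided the calculation uses a matching MTTR.
print(credit_by_mttr_rule(168.0, 8.0, 200.0))
```

The second function is included only to show that the HFT = 0 case is an either/or test. For process sector valves in low demand mode it is the first check that matters.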
TÜV Rheinland has published a statement clarifying how these certificates
should be interpreted:

Error No. 3: Assuming prior use without evidence

AS IEC 61511-1 Sub-clause 11.5.3 stipulates rigorous documentary requirements to
support claims for prior use.
The requirements are onerous and difficult to achieve in practice. Most users
find it easier to demonstrate compliance to AS IEC 61508-2 and/or AS IEC
61508-3 by sourcing independently certified components.

Error No. 4: Assuming compliance to AS IEC 61508

AS IEC 61508-2 requires suppliers to provide a safety manual
for each item that is claimed to be in compliance with the IEC 61508 series.
Annex D describes very detailed requirements for what should be included in a
safety manual.
Compliance cannot be claimed unless the safety manuals are provided. The
information required in the manuals is similar to what is required to support
claims of prior use.

The solution: AS IEC 61508 Route 2H

AS IEC 61508 Route 1H and AS IEC 61511 are based on having failure rates
with a confidence level of at least 70%.
This means that 70% of the recorded time intervals between failures are longer
than the MTBF to be used in the calculations.
In other words, the calculations are based on the failure rate $\lambda_{70\%}$, which is at
least as high as 70% of the failure rates recorded.
The purpose of HFT is to compensate for uncertainty in the failure rate data
and assumptions. If we can reduce the uncertainty we can reduce the HFT.
Route 2H is based on a confidence level increased to 90%.

The requirement for Route 2H is very simple. If the confidence level can be
demonstrated then HFT of 1 is sufficient for SIL 3, and HFT of 0 is acceptable
for SIL 2.
There is no need to consider SFF for Type A elements.
The requirement for Type B elements is simply that
All type B elements used in Route 2H shall have, as a minimum, a
diagnostic coverage of not less than 60 %.
Failure rates with a confidence level of 90% can be expected to be
approximately 0.8 standard deviations ($0.8\sigma$) higher than failure rates with a
confidence level of 70%.
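
The figure of about 0.8 standard deviations can be checked against a normal distribution. The sketch below assumes that recorded failure rates are normally distributed and that each confidence level is a one-sided upper bound; both are simplifying assumptions, and the function name `one_sided_z` is illustrative.

```python
from statistics import NormalDist


def one_sided_z(confidence):
    """Standard deviations above the mean for a one-sided upper bound."""
    return NormalDist().inv_cdf(confidence)


z70 = one_sided_z(0.70)
z90 = one_sided_z(0.90)
offset_in_sigma = z90 - z70

print(f"z70 = {z70:.2f}, z90 = {z90:.2f}, "
      f"difference = {offset_in_sigma:.2f} sigma")
# z70 = 0.52, z90 = 1.28, difference = 0.76 sigma
```

The difference of roughly 0.76 rounds to the 0.8 quoted above.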

Finding data
Two dependable sources: OREDA and exida SERH
The OREDA Offshore Reliability Handbook published by SINTEF gives the
standard deviation and the mean for failure rates of components commonly
applied in the hydrocarbons industry.
OREDA is based on extensive field experience, though in limited applications.
The SERH Safety Equipment Reliability Handbook is published by exida.
The failure rates in exida SERH are calculated using FMEDA, but are based on
extensive datasets for individual component parts.
The results are broadly consistent, though OREDA includes some site specific
failures and OREDA failure rates may be twice as high as corresponding exida
failure rates.

Differing treatment of systematic failures

One of the reasons for the differences between sources is in how the decision
is made whether to include or exclude failures from the datasets.
Failures of non-electronic components such as valves are always systematic
but can be treated as quasi-random.
The standards require that systematic failures should be avoided or controlled
through the application of appropriate techniques and measures. However,
many systematic failures cannot be eliminated easily. The intention of the
standards is that these quasi-random failures should be included in the
probability of failure calculations.
Judgement is needed in deciding which failures to exclude.

Confidence levels
The confidence level in exida SERH is stated as 70%.
OREDA shows full details of the spread of failure rates recorded, including the
mean and the standard deviation.
The standard deviation allows us to estimate failure rates with 90% confidence
level ($\lambda_{90\%}$) from failure rates with 70% confidence level ($\lambda_{70\%}$).
In a normal distribution approximately 90% of the population lies within $1.6\sigma$ of the mean.
Typically failure rates are distributed over one or two orders of magnitude.
According to OREDA, the following failure rates are typical for actuated ball
valves:
$\lambda_{50\%} \approx 2.3$ per $10^6$ hours
$\sigma \approx 2.7$ per $10^6$ hours
$\lambda_{70\%} \approx 3.6$ per $10^6$ hours
$\lambda_{90\%} \approx 5.8$ per $10^6$ hours
$\lambda_{90\%} / \lambda_{70\%} \approx 1.6$
This value of the ratio $\lambda_{90\%} / \lambda_{70\%}$ is typical.
We might infer that the calculated probability of failure for designs relying on
Route 2H will typically be around 60% higher than calculations based on Route
1H.
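
The OREDA figures for actuated ball valves can be used as a quick consistency check on the rule of thumb that the 90% rate is about 0.8 standard deviations above the 70% rate. The sketch assumes that the 2.7 per 10^6 hours value listed with the OREDA rates is the standard deviation; the symbol is not legible in the source, so treat that as an assumption. The function name is illustrative.

```python
def lambda_90_from_70(lambda_70, sigma, offset_in_sigma=0.8):
    """Estimate the 90% confidence failure rate from the 70% rate."""
    return lambda_70 + offset_in_sigma * sigma


# Values quoted from OREDA in the text, per 10^6 hours.
LAMBDA_70 = 3.6
SIGMA = 2.7  # ASSUMED to be the standard deviation

lambda_90_estimate = lambda_90_from_70(LAMBDA_70, SIGMA)
ratio = lambda_90_estimate / LAMBDA_70

print(f"estimated 90% rate: {lambda_90_estimate:.1f} per 10^6 hours")  # 5.8
print(f"ratio 90% / 70%:    {ratio:.1f}")                              # 1.6
```

In simplified calculations the probability of failure scales in proportion to the failure rate, so a ratio of about 1.6 corresponds to a calculated probability of failure about 60% higher.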

Less dependable: Studies based on vendor returns

Many SIL certificates have been published that show failure rates up to 50 times
lower than those in SERH or OREDA while claiming a 90% confidence level.
The example certificates shown above have:
$3 \times 10^{-8}$ per hour for a ball valve
$3 \times 10^{-8}$ per hour for a pneumatic actuator
$6 \times 10^{-8}$ per hour for the assembly
For similar equipment, SERH has
$1.4 \times 10^{-6}$ per hour
$3.6 \times 10^{-6}$ per hour.
Note that confidence levels are related to the spread of data in a given
dataset.
90% confidence levels may be claimed with small datasets; the confidence
level is not related to the validity or applicability of the measured failure rates.

Studies based on vendor returns may inadvertently exclude many failures that
were not reported to the vendor. They may also exclude failures considered to
be systematic or outside the design envelope.
Low failure rates from restricted datasets may be unrealistically optimistic.

Most dependable: The user's own data

The difficulty is in the large volume of operating experience required. It needs
the equivalent of decades of experience with a sizeable population of devices.
Analysis of failure causes is just as important as failure rates.
Systematic causes must be controlled.


IEC 61511 Edition 2 proposed for release in 2015

The proposed next edition of IEC 61511-1 specifies HFT requirements based
on Route 2H.
HFT of 1 will be sufficient for SIL 3.

[Table: minimum required HFT for each SIL, distinguishing low demand mode
from high demand/continuous mode]

The proposed draft excludes the requirement for 90% confidence level.

The HFT methods in AS IEC 61511 and AS IEC 61508 Route 1H do not work
well in practice for the process sector. These methods require 3 valves in
series (1 out of 3) to achieve SIL 3.
IEC 61508 Route 2H is based on a confidence level increased to 90%. It is much
simpler and easier to apply. It allows SIL 3 to be achieved with only 2 valves
as final elements.
The new edition of IEC 61511 will apply Route 2H though without an explicit
requirement for 90% confidence levels.
OREDA and exida SERH provide failure rate data that are widely accepted as
being dependable. These references provide enough information to allow us to
infer failure rates with 90% confidence levels.
There are many certificates in circulation that claim failure rates that are much
lower than the rates published by OREDA and exida.
Users should collect their own data. Requirements for collection of evidence
are onerous. A large volume of evidence is required. Users should compare
their failure rates with those in OREDA and SERH.
Failure rates from different sources should always be compared and assessed
for plausibility. For Route 2H a conservative approach should be taken: the
complete spread of failure rates should be taken into account.
Published failure rates for valves all include systematic failures. All valve
failures are essentially systematic in nature and can be avoided or controlled to
some extent. In evaluating failure rates the effectiveness of the planned
operation and maintenance should be considered. Particular attention should
be given to identifying and controlling common cause failures as these will
almost always dominate in the calculated probability of failure.
There are some certificates in circulation that take credit for no-effect failures
or for partial stroke testing in determining SFF. These certificates must be
interpreted with caution. It is not valid to claim SFF > 60% for valves by:
Taking credit for no-effect failures
Taking credit for infrequent partial stroke testing as a diagnostic
Certificates on their own are not sufficient as evidence of compliance to
AS IEC 61508-2 and AS IEC 61508-3. Detailed safety manuals must be
provided in accordance with AS IEC 61508-2 Annex D.

AS IEC 61511.1-2004 Functional safety: Safety instrumented systems for the
process industry sector
Part 1: Framework, definitions, systems, hardware and software requirements
AS IEC 61508.2-2011 Functional safety of electrical/electronic/programmable
electronic safety-related systems
Part 2: Requirements for electrical/electronic/programmable electronic safety-related systems
SINTEF 2009, OREDA Offshore Reliability Handbook 5th Edition
Volume 1 Topside Equipment
exida L.L.C. 2007, Safety Equipment Reliability Handbook 3rd Edition
Volume 3 Final Elements
YouTube video:
The exida FMEDA Process: Accurate Failure Data for the Process Industries
Dr. William M. Goble, CFSE, exida Consulting, February 2012, Field Failure
Data: the Good, the Bad and the Ugly

