Fault Detection and Diagnostics in Equipment Maintenance

Understanding equipment failures and developing strategies to detect and diagnose them is one
of the key elements of equipment maintenance. The purpose of this lesson is to present an
overview of Fault Detection and Diagnostics as they are applied to improve the equipment
maintenance process and boost asset reliability.

1. States and signals

Fault

An unpermitted deviation of at least one characteristic property or parameter of the system from
the acceptable, usual or standard condition.

Failure

A permanent interruption of a system’s ability to perform a required function under
specified operating conditions.

Malfunction

An intermittent irregularity in the fulfilment of a system’s desired function.

Error

A deviation between a measured or computed value of an output variable and its
true or theoretically correct one.

Disturbance

An unknown and uncontrolled input acting on a system.

Residual

A fault indicator, based on a deviation between measurements and model-equation-based
computations.

Symptom

A change of an observable quantity from normal behaviour.


2. Functions

Fault detection

Determination of faults present in a system and the time of detection.

Fault isolation

Determination of the kind, location and time of detection of a fault. Follows fault
detection.

Fault identification

Determination of the size and time-variant behaviour of a fault. Follows fault
isolation.

Fault diagnosis

Determination of the kind, size, location and time of detection of a fault. Follows
fault detection. Includes fault isolation and identification.

Monitoring

A continuous real-time task of determining the conditions of a physical system, by
recording information, recognizing and indicating anomalies in the behaviour.

Supervision

Monitoring a physical system and taking appropriate actions to maintain the operation in
the case of faults.

3. Models

Quantitative model

Use of static and dynamic relations among system variables and parameters in
order to describe a system’s behaviour in quantitative mathematical terms.

Qualitative model

Use of static and dynamic relations among system variables in order to describe a
system’s behaviour in qualitative terms such as causalities and IF–THEN rules.

Diagnostic model

A set of static or dynamic relations which link specific input variables, the
symptoms, to specific output variables, the faults.

Analytical redundancy
Use of more (not necessarily identical) ways to determine a variable, where one
way uses a mathematical process model in analytical form.
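
As a rough illustration of analytical redundancy, the sketch below determines the same output in two ways: once from a sensor reading and once from a process model. The difference between the two is the residual, and a residual that exceeds a threshold becomes a symptom. The static model `model_output`, its gain, the sample data and the threshold are all assumptions made for the example, not values from any real process.

```python
def model_output(u, gain=2.0):
    """Hypothetical static process model: expected output for input u."""
    return gain * u


def residuals(inputs, measurements, gain=2.0):
    """Residual = measured value minus model-based value, sample by sample."""
    return [y - model_output(u, gain) for u, y in zip(inputs, measurements)]


def symptoms(res, threshold=0.5):
    """Indices of samples whose residual magnitude exceeds the assumed threshold."""
    return [i for i, r in enumerate(res) if abs(r) > threshold]


inputs = [1.0, 2.0, 3.0, 4.0]
measured = [2.1, 3.9, 7.5, 8.0]  # the third sample deviates strongly from the model
res = residuals(inputs, measured)
flagged = symptoms(res)
print(flagged)  # sample index 2 is flagged
```

In practice the model would usually be dynamic and the threshold would be tuned against measurement noise, so this only shows the principle.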

4. System properties

Reliability

Ability of a system to perform a required function under stated conditions, within a
given scope, during a given period of time.

Safety

Ability of a system not to cause danger to persons or equipment or the
environment.

Availability

Probability that a system or equipment will operate satisfactorily and effectively at
any point of time.

5. Time dependency of faults

Abrupt fault

Fault modelled as stepwise function. It represents bias in the monitored signal.

Incipient fault

Fault modelled by using ramp signals. It represents drift of the monitored signal.

Intermittent fault

Combination of impulses with different amplitudes.
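
The three time profiles can be sketched as simple functions of time. This is a minimal illustration only: the function names, the discrete time base and the amplitudes are assumptions chosen for the example.

```python
def abrupt_fault(t, t_fault, bias):
    """Step: zero before t_fault, constant bias afterwards."""
    return bias if t >= t_fault else 0.0


def incipient_fault(t, t_fault, slope):
    """Ramp: zero before t_fault, then a drift growing linearly with time."""
    return slope * (t - t_fault) if t >= t_fault else 0.0


def intermittent_fault(t, impulses):
    """Impulses: `impulses` maps a time instant to the amplitude seen at that instant."""
    return impulses.get(t, 0.0)


times = range(8)
abrupt = [abrupt_fault(t, 3, 1.5) for t in times]
incipient = [incipient_fault(t, 3, 0.5) for t in times]
intermittent = [intermittent_fault(t, {2: 1.0, 5: -0.7}) for t in times]

print(abrupt)        # bias of 1.5 from t = 3 onwards
print(incipient)     # drift starting at t = 3
print(intermittent)  # isolated impulses at t = 2 and t = 5
```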

6. Fault terminology

Additive fault

Influences a variable by the addition of the fault itself. Additive faults may represent, e.g.,
offsets of sensors.

Multiplicative fault

Is represented by the product of a variable with the fault itself. Multiplicative faults can appear
as parameter changes within a process.
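
The difference between the two fault types shows up in how the deviation depends on the operating point. The sketch below assumes a simple proportional process y = a·u. The parameter `a`, the offset `f` and the parameter change `delta_a` are illustrative values only.

```python
def healthy_output(u, a=2.0):
    """Fault-free output of an assumed proportional process y = a * u."""
    return a * u


def output_with_additive_fault(u, f, a=2.0):
    """Additive fault: the offset f is added to the output (e.g. a sensor offset)."""
    return a * u + f


def output_with_multiplicative_fault(u, delta_a, a=2.0):
    """Multiplicative fault: the parameter changes by delta_a, so the fault multiplies u."""
    return (a + delta_a) * u


inputs = [1.0, 2.0, 4.0]
additive_dev = [output_with_additive_fault(u, 0.5) - healthy_output(u) for u in inputs]
multiplicative_dev = [output_with_multiplicative_fault(u, 0.5) - healthy_output(u) for u in inputs]

print(additive_dev)        # the same deviation at every operating point
print(multiplicative_dev)  # the deviation grows with the input
```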

The story behind fault detection and diagnostics

In the early days, equipment maintenance was restricted to repairing faulty assets and performing
basic routine maintenance based on rigid time intervals. Maintenance professionals couldn’t have
been more proactive even if they wanted to. Their capability to collect, store and analyze data on
equipment health and performance was simply too limited.

However, steady advances in microprocessor-based controls, automation, real-time data
acquisition, and systems like Fault Detection and Diagnostics (FDD) have significantly
transformed the way in which we perform equipment maintenance.

FDD in equipment maintenance

The objective of Fault Detection and Diagnostics in the context of equipment maintenance is
to optimize maintenance costs while still improving the reliability, availability, maintainability
and safety (RAMS) of the equipment.

An FDD system works by continuously monitoring and analyzing condition monitoring data and
detecting any anomalies that are present. The equipment condition datasets are then processed by
fault diagnostics algorithms, sometimes embedded within the equipment itself, to produce failure
alerts for the equipment operators and enable timely maintenance intervention.

In some cases, the algorithms are sophisticated enough to even initiate failure containment
actions to auto-correct the failure itself and restore the equipment to its healthy condition.

Key elements of the fault detection and diagnostics system

FDD, as the name implies, covers the detection and diagnosis of equipment failures. The
diagnosis of a failure can be broken down into failure isolation and identification. Failure
evaluation is often added to the scope of FDD as it helps to understand the severity of the
failure’s effect on system performance, an important aspect of maintenance management.

In any case, the Fault Detection and Diagnostics algorithm for any equipment should contain at
least the four key processes listed below (the sequence does not have to be strictly linear, since
some steps can happen at the same time):

1. Fault detection
2. Fault isolation
3. Fault identification
4. Fault evaluation

We need to discuss each element in more detail to really understand how fault detection and
diagnostics work.

1. Fault detection
Fault detection is the process of discovering the presence of a fault in any equipment before it
manifests itself in the form of a breakdown. It is the most important stage of FDD as all of the
downstream processes depend on its accuracy.
If the equipment is unable to discover the right failure mode (or if detection is incorrect and
triggers false alarms), then the isolation, identification and evaluation will also be ineffective.

There are two main approaches to fault detection:

1. Model-based fault detection: It is carried out through mathematical modelling of signals
and processes.
2. Knowledge-based fault detection: It is a method that leverages historical data on
equipment performance.

Model-based fault detection

In model-based detection, we define a set of engineering rules that are written in line with
physical laws that define the relationships of subsystems and components within the equipment.
Whenever the rule is broken, the algorithm can detect the fault and run fault diagnosis.

One example of model-based fault detection is the use of time-domain reflectometry (TDR) to
detect faults in underground cables. In TDR, a signal is sent along the test cable and is
received after being reflected from the point of fault. If the cable has a discontinuity or a
change in impedance, a portion of the signal will be reflected back to the test equipment or
receiver. By analyzing the return time of the signal and its propagation velocity, the test
equipment can locate the fault, while the polarity and size of the reflection indicate its nature,
for example an open-circuit fault or a short-circuit fault.

Another simple rule-based detection example comes from a bottle filling, capping, and packaging
line operating in series on a conveyor belt. Simple rules can be established that reflect the
hierarchy of processes, such as:
• a bottle cannot be capped until it is filled with liquid
• a bottle cannot be packaged unless it is filled and capped
In case of a fault in the bottle capping mechanism, the algorithm will detect the incoming
disruption in the packaging system. It will notify the packaging operator well ahead of time. The
necessary preparation can be made to minimize operational losses on the packaging side of the
conveyor belt.
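
The bottle-line rules can be sketched as simple checks on the state of each bottle. This is only an illustration: the `Bottle` class, the function names and the alert threshold are hypothetical and not taken from any real control system.

```python
from dataclasses import dataclass


@dataclass
class Bottle:
    filled: bool = False
    capped: bool = False
    packaged: bool = False


def rule_violations(bottle):
    """Return the sequencing rules that this bottle's state breaks."""
    violations = []
    if bottle.capped and not bottle.filled:
        violations.append("capped before being filled")
    if bottle.packaged and not (bottle.filled and bottle.capped):
        violations.append("packaged before being filled and capped")
    return violations


def uncapped_bottles_upstream(bottles):
    """Count filled but uncapped bottles that are heading for packaging."""
    return sum(1 for b in bottles if b.filled and not b.capped and not b.packaged)


line = [Bottle(filled=True, capped=True), Bottle(filled=True), Bottle(filled=True)]
if uncapped_bottles_upstream(line) >= 2:  # alert threshold is an assumed value
    print("Notify packaging operator: capping fault suspected upstream")
```

Counting uncapped bottles upstream is one simple way to warn the packaging operator before the disruption reaches the packaging station.
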
Knowledge-based fault detection
For knowledge-based fault detection to work, we first need to establish a baseline. This is done
by retrieving the parameters of equipment performance such as voltage, current, vibration,
temperature, pressure and other relevant process variables – while the equipment is working
under normal conditions.

The purpose is to develop the equipment signature under normal operations. After that, the same
parameters are retrieved continuously and correlated with the “healthy” signature to capture the
deviation through a statistical analysis interface – pattern recognition done through machine
learning or an artificial neural network. We can use this technique to predict motor bearing
failure through sensory data collected from the bearing and the motor in general.
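
A very small stand-in for the baseline comparison is a z-score check against the healthy signature. Note that this swaps the machine learning or neural network pattern recognition mentioned above for a much simpler statistical test. The readings, the function names and the limit of three standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev


def build_baseline(healthy_readings):
    """Summarize the healthy signature as (mean, standard deviation)."""
    return mean(healthy_readings), stdev(healthy_readings)


def deviation_score(reading, baseline):
    """Number of standard deviations between a reading and the healthy mean."""
    mu, sigma = baseline
    return (reading - mu) / sigma


def is_anomalous(reading, baseline, limit=3.0):
    """Flag readings further than `limit` standard deviations from the baseline."""
    return abs(deviation_score(reading, baseline)) > limit


# Hypothetical bearing vibration readings recorded under normal conditions
healthy_vibration = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
baseline = build_baseline(healthy_vibration)

print(is_anomalous(2.1, baseline))  # within the healthy spread
print(is_anomalous(3.0, baseline))  # far outside the healthy spread
```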

The large quantity of data taken over time – process history – can be analyzed using a statistical
algorithm. This helps us understand the impact of the different conditions the motor is subjected
to, such as thermal rating, mechanical stress, or some other operating conditions that occur in
special circumstances.

The algorithm then correlates the impact of these conditions on the degradation of bearing health
and predicts the failure rate and health condition of the overall motor. Based on these data
signatures, the analysis can be made to predict the future health of the equipment. Moreover, the
necessary alarms can be triggered and fault diagnosis can be conducted, so the
operator/technician can take appropriate action.

The same data can be used to establish a predictive maintenance strategy over the remaining life
of the motor.

2. Fault isolation
The goal of the fault isolation process is to localize the fault to the lowest component that can be
replaced. Fault detection and isolation can be separate modules of the process, but in some
applications they go hand in hand: detecting and localizing the fault happen at essentially the
same time, and both are done by a single Fault Detection and Isolation (FDI) algorithm.

For instance, consider the example of TDR testing for underground cable. The returned pulse
signal from the cable simultaneously indicates the presence and the location of fault through time
and velocity of the returned pulsed signal.

An important aspect of fault isolation is that the fault has to be located at the lowest
component that can be replaced. This is done to improve the accuracy of isolation and reduce
the impact of downtime.

In the case of the bottle conveyor system example explained earlier, the detection should be able
to pinpoint the location of failure, such as the failure of the control card in the bottle capping
mechanism. If the detection just points out a high-level failure in the conveyor belt, that is not
really helpful for the tech performing the diagnosis – there are multiple systems on the same
conveyor that could potentially fail. The information that will really speed up the repair process
is knowing the accurate location of the fault.

3. Fault identification
The purpose of fault identification is to understand the underlying failure mode, determine the
size of the fault, and find its root cause. Fault diagnosis methods may differ, but the steps to
follow are generally the same.

Understanding the underlying failure mode

An in-depth understanding of the failure mode requires work:
• we need to analyze how the fault behaves at different times
• so that we can develop the time-variant signature of the failure mode
• and classify it into different categories

Determining the size of the fault

Regardless of the fault detection method applied, the size or magnitude of the fault plays an
important role in defining what is the desired level of fault tolerance that needs to be built into
the design of the equipment.

If the fault magnitude is low, the system just needs to be able to endure the fault for a short
time until the fault clears by itself. A typical example is permitting temporary switching
overcurrents in electrical appliances, as long as they don’t significantly impact the equipment.

Now, if the fault magnitude is really high, a different methodology is required: engineers have to
use active or passive redundancies to enhance fault tolerance on their devices.

Finding root causes

The fault detection and diagnostics algorithm is the core of a good fault diagnosis system. It is
based on machine learning principles and can be used to identify anomalies in the data streams
originating from the equipment and to determine the root cause behind them. Identifying some
failure modes is really straightforward, while others can be challenging and require extensive
mathematical computations.

Let’s use a high voltage and high power three-phase AC induction motor as an example.
More often than not, the underlying failure modes are mechanical in nature and associated with
the rotary part of the motor: shorted rotor windings, bearing failures, and rotor breakdown. Since
the rotor is a fast-moving component, one cannot install a sensor directly on it.
The advanced FDD algorithms can be used to produce healthy motor stator terminal current
signatures and compare them with current signatures under faulty conditions.

For instance, when rotor bars break, sideband components appear in the stator current at
(1 ± 2s) times the supply frequency, where s is the rotor slip. There is an indirect correlation
between the mechanical breaking of rotor bars and the fluctuations in the stator current. Such
emerging trends are analyzed by Fault Detection and Diagnostic algorithms and can be used to
find possible root causes, which are derived and displayed in real time on live dashboards.

The use of such fault identification algorithms has significantly reduced the amount of time
techs need to troubleshoot equipment and reach the root cause of failures. Automatic root cause
diagnostics have contributed tremendously to reducing equipment downtime, improving mean
time to repair, and enhancing the overall reliability of the plant.

4. Fault evaluation

Once the failure modes and the associated root causes are identified, the next step is to evaluate
the impact of that fault type on the overall performance of the system.
We need to consider factors such as:
• the impact of the fault on the environment and the rest of the system
• the impact of the fault on system safety
• the financial loss due to downtime
• the need to make capital replacement decisions (in case the severity of failure is enough to
warrant the replacement of equipment as opposed to fixing it)

Fault evaluation is a significant element of the overall process as it aims to understand the
severity of the fault. This helps reliability engineers provide equipment validation and calculate
the risk of failures, which will both have a big impact on maintenance requirements,
recommendations, and optimization.
For example, the result of FDD for one piece of equipment could indicate a rapidly
increasing failure rate. However, the impact of that fault on overall system performance could be
minimal, making the overall risk moderate. In this case, a less stringent maintenance strategy
such as run-to-failure or preventive maintenance could be sufficient to manage the risk.

Fault Detection and Diagnostics for another piece of equipment might indicate an increasing
failure rate along with a high impact of failure on overall system performance. In this case, a
more stringent predictive maintenance program should be adopted despite its high cost, because
the increased cost of maintenance is warranted to prevent a major failure that would be far
more costly.
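
The trade-off between failure likelihood and impact can be sketched as a simple risk score. The 1 to 5 rating scales, the score thresholds and the function names below are assumptions for illustration. A real program would calibrate them against its own risk matrix.

```python
def risk_score(likelihood, impact):
    """Risk as likelihood times impact, each rated on an assumed 1 to 5 scale."""
    return likelihood * impact


def suggest_strategy(likelihood, impact):
    """Map the risk score to a maintenance strategy using assumed thresholds."""
    score = risk_score(likelihood, impact)
    if score >= 15:
        return "predictive maintenance"
    if score >= 5:
        return "preventive maintenance"
    return "run-to-failure"


# Rising failure rate but minimal system impact
print(suggest_strategy(likelihood=4, impact=1))
# Rising failure rate and high system impact
print(suggest_strategy(likelihood=4, impact=5))
```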

Optimizing maintenance with FDD

In short, fault detection and diagnostics play a decisive role in optimizing the maintenance
regime for any piece of equipment, across its lifecycle.

With the advent of fast computing technologies, big data processing, and advanced learning
algorithms, traditional fault detection has evolved into automatic fault management systems that
not only detect faults, but also identify the root cause and implement corrective actions to avoid
future recurrence.

Such automation of a series of manual processes has enabled reliability and maintenance
engineers to apply predictions on equipment health, derive future equipment performance, and
shape optimal maintenance intervals.

The only thing they have left to do is fire up their computerized maintenance management
software (CMMS), track the condition of their critical assets, and schedule appropriate
maintenance work.

Machine failure, once an accepted part of life for manufacturers and OEMs, has met its match
with modern technology using IoT devices, the cloud, and edge computing. In order to pre-empt
and prevent machine failure, it’s first important to understand what it is and why it happens in an
industrial environment.

We can also review existing strategies for dealing with equipment failure including reactive
maintenance, diagnostic analytics, and preventive maintenance. In understanding where these
strategies fail, we can learn why manufacturers are moving toward predictive maintenance,
which resolves the issues of each of its three predecessors.

Here is what we will review:

• What is Machine Failure?
• What are the Types of Equipment Failure?
• The Most Common Causes of Machine Failure
• How Can You Prevent Equipment Failure?
• The Importance of IoT in Preventing Equipment Failure

What is Machine Failure?

Machine failure, or equipment failure, is any event in which a piece of industrial machinery
underperforms, whether entirely or partially, or stops functioning in the way it was
intended to. The term “machine failure” can encompass differing scenarios and levels of
severity.

A failure, in this context, is not only a critical show-stopper issue that halts production
entirely, but also any loss of usefulness within a machine. The tolerance threshold for
machine failure will vary based on circumstances, since all systems degrade and lose
effectiveness in some form or another over time. Even seemingly minor losses of usefulness can
lead to huge resource waste at scale.

For our purposes, any malfunction that causes a piece of industrial machinery to underperform
in its duties, whether entirely or partially, is considered a machine failure.

What are the Types of Equipment Failure?

Machine failure is a spectrum, and many failures can’t be attributed to a specific point in time.
While some are apparent failures that render equipment defunct, others creep up insidiously,
and others still steadily drain effectiveness the longer they are left ill-maintained. There are
three main classifications of machine failure:

Sudden Failure
This is what most people think of when they hear machine failure. The production line is
humming along when an unexpected (but obvious) breakdown happens. Things like a shattered
tool, snapped band, melted wire, etc. fall into this category.

Intermittent Failure
Think of this like a sputtering engine in your production line. It’ll go a little bit, then quit. You
start it back up, and it keeps working as intended a little longer, but then it starts failing again.
Intermittent failures come and go, usually on their way to a “full” machine failure. These
sporadic or random failures can, by their nature, be difficult to identify. Intermittent failures can
frequently be prevented with maintenance.

Gradual Failure
These are the failures you can see over time as a machine’s usefulness takes a steady decline.
This includes things like a belt that’s slowly shredding, blades that get duller, and pipes that
eventually clog with residue buildup. Most gradual failures can be prevented through regular
maintenance, armed with an understanding of the expected lifetime of the parts at hand.

The Most Common Causes of Machine Failure
Failure starts somewhere. The following are some of the most frequent causes of machine failure
and can be used to analyze, prepare, and prevent future instances of malfunction.

Operator Error
Despite extensive training, humans are still prone to making errors, whether through forgetting
important principles from training, laziness, tiredness, or plain old forgetfulness. Sometimes
misuse and abuse of equipment by machine operators is to blame for failure. This can also
include simple accidents, like dropping a piece of equipment.

Wrong Amount of Maintenance

Too little maintenance can lead to machine failure, but so can maintenance that is too frequent.
Maintenance that happens too infrequently can let problems go unnoticed, which can then lead
to a domino effect of failure. Frequent maintenance, on the other hand, essentially
introduces chaos into the system each time. Whenever a technician opens up a piece of
machinery, there is always the potential for risk and for failure, whether that is breaking a panel,
losing a screw, accidentally jiggling a wire the wrong way, or stripping a bolt. The possibilities
are endless and increase the more times the equipment is touched.

Physical Wear and Tear

This cause of industrial machine failure includes things like bearing failure, metal fatigue,
corrosion, misalignment, and general surface degradation.

Reliability Culture Failings

If operators are pushed as hard as the equipment, and production goals are so tight that they feel
like they can’t take a minute to breathe or to resolve an issue safely and to completion, then
failures are inevitable. “Band-aid fixes” eventually wear out, and a widespread culture of
quick-and-dirty resolutions can lead to compounding problems and massive machine failure
headaches down the line (ultimately resulting in lower overall production, in most cases).

How Can You Prevent Equipment Failure?

There are multiple strategies you can use to prevent equipment failure, and choosing the right
one depends on the criticality of the machine, the predictability of its failures, and the budget and
monitoring infrastructure available. The following methods to handle machine failure in an
industrial environment are listed from least to most complex.

Reactive Maintenance
This is the traditional maintenance paradigm. When it breaks, we fix it. It doesn’t prevent the
machine from failing so much as it offers a route to resolving the problem once the malfunction
has occurred.

Diagnostic Analytics
This requires a little more digging. Within this maintenance structure, machine data and root
cause analysis are deployed to determine why the machine failed in the first place. This
information can then be used within a preventive maintenance strategy.

Preventive Maintenance
Preventive maintenance includes regularly inspecting machines prior to use, establishing and
sticking to a maintenance schedule, regularly replacing components before their average lifespan
is over, and anything else that tries to ward off the failure before it happens. Think of it like
changing the oil in your car every few thousand miles. We don’t wait until the oil is muck and has
clogged the rest of our equipment. Instead, we maintain it preventively, based on our
expectations of when failure would otherwise occur.

Predictive Maintenance
Predictive maintenance uses past machine performance to model asset behavior. With enough
data, algorithms can work to predict equipment failures based on real-time data off of machines
that are IoT-connected. This means that preventive maintenance tasks don’t happen
unnecessarily—like replacing perfectly good parts—but instead are based on a deeper and more
customized analysis of when failure is impending or most likely to occur.
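
As a minimal sketch of the idea, the example below fits a straight line to past condition readings and extrapolates it to an alarm threshold to estimate the remaining time before intervention. The linear trend, the readings and the threshold are assumptions for illustration. Real predictive maintenance models are usually far richer than this.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean


def time_to_threshold(times, values, threshold):
    """Estimated time from the last reading until the trend reaches the threshold."""
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None  # no upward trend, so no crossing is predicted
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - times[-1])


# Hypothetical vibration readings (mm/s) logged every 100 operating hours
hours = [0, 100, 200, 300, 400]
vibration = [2.0, 2.5, 3.0, 3.5, 4.0]
remaining = time_to_threshold(hours, vibration, threshold=6.0)
print(remaining)  # approximately 400 hours for this data
```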

The Importance of IoT in Preventing Equipment Failure

IoT devices offer unprecedented insight to manufacturers and OEMs thanks to the data they
provide. IoT-connected machinery can operate within an intelligent network that monitors
machine data to identify bottlenecks, notify operators of impending failures, and—when paired
with machine learning— even offer suggestions for next actions based on KPIs, e.g. “Should we
stop the machine for ten minutes to replace this bit and proceed at normal speed? Would we
derive greater value from running the machine at 80% capacity for the next two hours, leaving it
with only a 10% chance of complete failure vs. the 60% failure likelihood during the same
period when running at 100% capacity?”
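
The kind of comparison described in that example can be sketched as an expected-output calculation. The model is deliberately crude and rests on stated assumptions: a failure during the period is treated as losing the whole period's output, output is measured in full-capacity machine-hours, and the machine is assumed not to fail after the bit is replaced. The function name and option labels are hypothetical.

```python
def expected_output(capacity, hours, p_failure):
    """Expected output in full-capacity machine-hours, assuming a failure loses the period."""
    return capacity * hours * (1.0 - p_failure)


options = {
    "run at 100% capacity": expected_output(1.0, 2.0, 0.60),
    "run at 80% capacity": expected_output(0.8, 2.0, 0.10),
    # Ten minutes of downtime, then normal speed with no failure assumed
    "stop 10 minutes and replace the bit": expected_output(1.0, 2.0 - 10 / 60, 0.0),
}

best = max(options, key=options.get)
for name, value in options.items():
    print(f"{name}: {value:.2f}")
print("Best option under these assumptions:", best)
```

Changing the failure cost or the downtime estimate can change the ranking, which is why such suggestions depend on the KPIs chosen.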

The real boon of IoT vs. more traditional data-gathering and analytics methods is its real-time
collection capacity. While historical data can offer great insight for preventive maintenance
strategies, IoT-enabled predictive maintenance offers a competitive edge to manufacturers by
increasing uptime, reducing resource waste, and providing strategic insights that can extend
beyond maintenance schedules into process optimization and more. Plus, IoT-connected
machinery has the potential to utilize the cloud for deep, rich analysis as well as edge computing
for lightning-fast insights, even in secure and air-gapped environments.

What is a Time Domain Reflectometer (TDR)?
Time domain reflectometers are used for testing cables such as twisted pairs and coaxial cable,
where they can locate the position of faults.

Time domain reflectometers (TDRs) are used for testing cable systems and other forms of feeder,
where they are able to detect and pinpoint issues. As a result, TDRs are widely used in any area
where there are long or inaccessible lengths of cable that may need to be tested or may have
faults. Time domain reflectometry can also be used on printed circuit boards to locate issues
that arise there.

TDR applications
Time domain reflectometers (TDRs) are used in a variety of applications, some obvious and
others less so.
Some of the TDR applications include:
• Telecommunications cable landlines: TDRs are an invaluable tool for
telecommunications field engineers who need to repair telephone and broadband
landlines. They can be used for testing very long cable runs, where it is impractical
to dig up or remove what may be a kilometers-long cable. If a break occurs, a TDR is
able to locate the position of the break with considerable accuracy.
• Landline preventive maintenance: TDRs are used for preventive maintenance of
telecommunication lines. They can detect increasing resistance on joints and connectors
as they corrode. TDRs can also detect increasing insulation leakage as the insulation
degrades and absorbs moisture. Ultimately this can lead to catastrophic failure, but the
TDR is able to detect the problem before this point is reached.
• Landline security surveillance: Time domain reflectometers can detect the existence
and location of wiretaps. A wiretap introduces a slight change in line impedance,
and this can be seen on the TDR when it is connected to a phone line.
• Circuit board testing: Specialized time domain reflectometers can be used for
failure detection on modern high-frequency printed circuit boards, especially on tracks
designed to emulate transmission lines. The reflections seen by the TDR reveal any
unsoldered pins of a ball grid array device, short-circuited pins, etc.
• Industrial applications: Time domain reflectometry is used in a variety of industrial
applications, including the testing of integrated circuit packages, where failing areas of
an IC can be detected. TDR technology can even be used for measuring liquid levels.

Time domain reflectometer basics

The basis of time domain reflectometry is to treat a cable as a transmission line and look at its
properties in this manner.

Although it is possible to use instruments such as network analyzers and the like to check the
integrity of cables this way, these test instruments are very expensive and not easy to use. A
much better approach for many applications is to use time domain reflectometry techniques and a
specific test instrument. This considerably simplifies the operation as well as reducing the cost of
the test instrument. Also, many time domain reflectometers are specifically made for portable
operation, enabling them to be used far more easily in the scenarios where they are required, i.e.,
for telecommunications cables that may be running under roads, paths, etc.

The time domain reflectometer operates by sending a short pulse along the line in question. With
the far end terminated in the required impedance, i.e., that of the line, if there are no problems
with the line, then all the energy in the pulse will travel along the line at the propagation velocity
and be dissipated in the load and no reflection will be observed.
However, if there is a discontinuity in the line, energy will be reflected back to the reflectometer,
where it is detected.

Basic block diagram of a time domain reflectometer, TDR

From this it can be seen that the time domain reflectometer consists of a pulse generator and a
sampler. The sampler could be an oscilloscope that displays the waveforms on the line. In reality,
a little more signal processing is often included to help locate problems and issues with the line.

Within the reflectometer it is possible to analyze the returned pulse assuming that the voltage of
the outgoing pulse level is Ei, and the reflected pulse has a level Er.

Idealized waveforms seen by a time domain reflectometer, TDR

It can be seen that the outgoing pulse registers on the sampler screen. It then takes a finite time
for the pulse to travel along the line. If all the power from the pulse is absorbed, then nothing
will be returned and the display on the sampler will not show any change. However, if power is
returned it will alter the overall shape of the waveform seen at the test instrument.

The power return may occur for a variety of reasons, from a break somewhere in the cable to a
poor match at the remote end. The time delay, T, will be twice the time taken for the wave to
travel to the mismatch point, i.e. the out and return times together.

The sampler will be able to detect not only the level change, from which the mismatch can be
calculated, but also the time difference, from which the distance along the line to the
discontinuity can be calculated.

Locating cable faults

One of the key points of a time domain reflectometer is that it is able to locate failures within a
cable. This is a key issue where the cable may be sealed, as in the case of coaxial cable, and it may
not be possible to see inside it. Also, where cables are buried underground, any failures
can be located and the required holes dug at the point where the cable problem has
occurred. The distance to the fault is:

D = Vp × (T / 2)

Where:
D = distance in metres
Vp = velocity of propagation in metres per second
T = round-trip transit time from the monitoring point to the mismatch and back, in seconds

This is a straightforward calculation to make and is normally made within the time domain
reflectometer, giving the user a good indication of where the fault may be located.
The main issue is the propagation velocity within the cable. This can be determined by testing a
known length of the cable under test and leaving the remote end open.
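
The calculation can be sketched in a few lines, including the calibration step on a known length of cable. The function names and the numerical values are illustrative assumptions. The times are round-trip times, i.e. out and return together.

```python
def propagation_velocity(known_length_m, round_trip_s):
    """Calibrate the velocity of propagation from a known cable length with an open far end."""
    return 2.0 * known_length_m / round_trip_s


def fault_distance(velocity_m_per_s, round_trip_s):
    """Distance to the mismatch: velocity times half the round-trip time."""
    return velocity_m_per_s * round_trip_s / 2.0


# Hypothetical calibration: the reflection from the end of a 100 m reel returns after 1 microsecond
vp = propagation_velocity(100.0, 1.0e-6)

# Hypothetical measurement on the cable under test: reflection after 0.4 microseconds
distance = fault_distance(vp, 0.4e-6)
print(f"Velocity of propagation: {vp:.3e} m/s")
print(f"Estimated distance to fault: {distance:.1f} m")
```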

Nature of mismatch
Not only can the time domain reflectometer discover where along the cable the fault or problem
has occurred, it can also reveal much about its nature. The reflected pulse enables the test
instrument to see both the nature and magnitude of the mismatch.

ρ = Er / Ei = (ZL − Z0) / (ZL + Z0)

ρ = reflection coefficient
Er = reflected voltage
Ei = incident voltage
ZL = load impedance in ohms
Z0 = line impedance in ohms

From a knowledge of the reflection coefficient as well as either Z0 or ZL, the other impedance
can be determined.

For any test, a known load ZL can be placed at the end of a good line, e.g. a spare length of
coax, and the cable impedance Z0 determined from the resulting reflection. With a knowledge of
the cable impedance, it is then possible to apply this to the cable under test.
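
As a rough sketch of the relationship above, the reflection coefficient can be computed from the
two impedances, and the expression can be rearranged to recover ZL from a measured reflection
coefficient and a known Z0. The function names and the 50/75 ohm values are assumptions chosen
for illustration.

```python
def reflection_coefficient(z_load: float, z_line: float) -> float:
    """rho = (ZL - Z0) / (ZL + Z0)."""
    return (z_load - z_line) / (z_load + z_line)


def load_impedance(rho: float, z_line: float) -> float:
    """Rearranged form: ZL = Z0 * (1 + rho) / (1 - rho)."""
    if rho >= 1.0:
        # rho = 1 corresponds to an open circuit (infinite load impedance).
        return float("inf")
    return z_line * (1.0 + rho) / (1.0 - rho)


# Hypothetical example: a 75 ohm load on a 50 ohm line.
rho = reflection_coefficient(z_load=75.0, z_line=50.0)
recovered = load_impedance(rho, z_line=50.0)
print(f"rho = {rho:.2f}, recovered ZL = {recovered:.1f} ohms")
```

A matched load gives a coefficient of 0, a short circuit gives -1, and an open circuit gives +1,
which is why the sign and size of the reflected pulse indicate the nature of the fault.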

Although it may appear to be a specialist test instrument, the TDR is widely used in a variety of
industries, but particularly within the telecommunications industry where it is an invaluable tool.
Without the time domain reflectometer, locating problems with long inaccessible lines would be
very difficult and costly.

Fault Tolerance and Its Impact on System Reliability

Equipment and systems designed without fault tolerance in mind often have poorer reliability.
This is why a fault-tolerant system design is an obvious choice for most reliability and design
engineers, especially for critical equipment whose failure can compromise the reliability,
availability, maintainability, and safety (RAMS) of the whole system it is a part of.

What is fault tolerance?

Fault tolerance represents the capability of any system or equipment to sustain its operation
in the presence of a fault.

Systems and equipment with high fault tolerance, depending upon the adopted fault tolerance
mechanism, are able to completely or partially sustain their operation upon the occurrence of a
fault. For this to work in practice, such systems can’t have a single point of failure (SPOF).

The essence of fault-tolerant designs
The development of a fault-tolerant design requires careful consideration of the failures that can
manifest throughout the equipment life cycle, along with their probable causes and
consequences. However, design engineers must also consider the cost and resources
needed to achieve the required level of tolerance, reliability, and dependability of the equipment.

A common misconception is that a fault-tolerant design should provide complete tolerance to all
types of faults. A good design should instead match the degree of tolerance to the criticality of
the fault, so that cost and resource use are optimized overall. For example, it might not be
cost-effective to spend money on product redesign just to address a fault that has an extremely
low chance of occurring.

Characteristics of fault-tolerant systems

To create a fault-tolerant system, efforts are required at every stage of the equipment life cycle.
This includes but is not limited to the specification and design phase (incorporating fault
detection controls in the design), validation and verification (V&V), maintenance and operation
(using OEM-approved replacement parts and guidelines for routine maintenance), and even
the disposal stage.

Each stage may adopt combinations of the below-stated techniques to develop new designs or
improve current ones to enhance their level of fault tolerance:
1. fault detection and display
2. fault diagnosis and containment
3. fault masking and compensation

1) Fault detection and display

Fault detection refers to the capability of the system/equipment to sense and display the fault. It
is the fundamental aspect of any fault-tolerant system, since all other aspects are contingent upon
the effectiveness of the fault detection process. If the system is not designed to detect its faults,
or detects a fault incorrectly, the remaining aspects will also be ineffective.

For example, a simple air pressure sensor in a car tire pressure monitoring system (TPMS) can
detect the air overfill and notify the driver via the car dashboard.
A representation of TPMS activation

In this case, detection and display are the only tolerance provided for this fault event.
The driver can safely disengage the air hose before the tire ruptures. If the pressure detection
is inaccurate, the driver may disengage the hose too soon or too late and experience tire failure
while driving. Since there is no automatic correction of air pressure, the tolerance for this fault
is restricted to detection and display.

2) Fault diagnosis and containment

In more sophisticated systems, additional layers are often added in the product design stage.
Their purpose is to diagnose and perform containment on top of detection and display. These
additional layers are warranted due to the criticality of the system or because of various safety
concerns. For example, a Distributed Control System (DCS) – a control system for process plants
– not only monitors critical process parameters through a set of sensors but also performs a
diagnosis to detect the location of the fault and perform necessary containment.
A representation of the DCS system

For instance, in the case of overpressure of petroleum products in a vessel, the system is
triggered by the relevant pressure sensors. It opens the safety pressure valve and vents the vapors
to the flare stack. In this example, containment is carried out by diverting the high-pressure
flammable vapor to the flare stack, protecting the system from fire or explosion.

3) Fault masking and compensation

Another effective approach to fault tolerance is masking the fault state. It is very effective
for equipment that can be monitored and controlled through the Internet of Things (IoT).

With such equipment, one of the most significant challenges comes in the form of cybersecurity
threats. These types of threats can attempt to induce the fault by altering the state of the
equipment through the injection of false equipment data into the server.

With incorrect equipment state records, the very control and monitoring system originally
intended to protect the asset can instead cause its failure. Alternatively, it can be “tricked”
into thinking the asset is in good condition when it is actually not – letting the deterioration lead
to failure without triggering any alerts.

By incorporating fault-masking, the system is designed in a way that it can recognize and mask
those incorrect values. For example, in the electricity grids, the circuit breakers are often
controlled and monitored through Supervisory Control and Data Acquisition (SCADA).

A representation of the SCADA system

Such a system closely monitors the voltage and frequency parameters of the electrical equipment
and causes the circuit breakers to close or open to maintain power network stability.

An incoming cyberattack could alter the voltage and frequency limits on the equipment.
Consequences? The system could cause a power breakdown instead of preventing it.

Fault masking is often carried out through algorithms that detect anomalous data streams and
inject false data with the purpose of masking the data that represents the faulty state of the
equipment. This prevents the bad data from spreading the fault and further degrading
the grid’s reliability.

Improving fault tolerance through redundant designs
One simple way to increase fault tolerance is to incorporate redundancies in the design.
Redundancy simply means the presence of an alternate system or solution that can take over the
intended function should the primary system fail.

While redundancy improves fault tolerance, haphazardly adding systems should not be the
objective, as the cost of adding any new system can significantly outweigh the attainable
reliability benefit.

From the perspective of physical equipment, redundancies can be broadly classified as either
active or passive.

Active redundancies
Active redundancies can be established when multiple pieces of equipment are operated
simultaneously. In this configuration, each piece of equipment contributes its share towards
attaining the intended function while still acting as redundancy for each other.

A simple example of active redundancy is the parallel operation of two pumps at half of their
rated capacities. Both pumps jointly operate to achieve the desired discharge pressure. If one
pump fails, the other can be boosted to its rated capacity to attain the intended discharge
pressure on its own. To attain economy of design, reliability engineers have come up with various
more complicated ways to achieve active redundancy, such as K of N redundancies and graceful
degradation.

In K of N redundancies, a given subset of equipment is always in operation. This increases
the reliability of the system, as some of the equipment remains on hot standby and can join the
operation upon failure of some equipment. This can give greater reliability than the simple
parallel operation of two pumps, as there is a larger number of smaller pumps.

Graceful degradation is an alternative to adding costly identical and parallel systems. It ensures
that the features or functionality of the overall equipment degrade proportionally to the number
of failed components. To achieve such scalable degradation, an examination of all possible
failures within all components should be carried out. Their impact on the overall system’s
performance should be analyzed and documented.

Such techniques provide tolerance to partial failures and enable the system to continue its
function at a degraded capacity.
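
To make the K of N idea described above more concrete, the sketch below estimates the
reliability of a system that needs at least K of its N units working. It assumes the units are
identical and fail independently, which is a simplification, and the 0.9 unit reliability is a
hypothetical figure.

```python
from math import comb


def k_of_n_reliability(k: int, n: int, unit_reliability: float) -> float:
    """Probability that at least k of n independent, identical units work."""
    r = unit_reliability
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


# Hypothetical unit reliability of 0.9 over the period of interest.
two_pumps = k_of_n_reliability(1, 2, 0.9)   # either pump can carry the full duty
four_pumps = k_of_n_reliability(2, 4, 0.9)  # any two of four smaller pumps suffice
print(f"1 of 2: {two_pumps:.4f}, 2 of 4: {four_pumps:.4f}")
```

Whether a K of N arrangement actually beats a simple pair depends on K, N, and the unit
reliability, so the comparison is worth running for the specific design being considered.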

Passive redundancies
Passive redundancy is the standby redundancy where the alternate equipment is present – but it
can only take over the intended function upon failure of the primary equipment.
We can differentiate two types of passive redundancies:
1. operating passive redundancies
2. non-operating passive redundancies

Operating passive redundancies are the ones where the alternative equipment is present as a
hot spare. The standby equipment is hot because it could be operating under no-load conditions.
In some cases, it may be serving a function that is outside the definition of primary equipment’s
function. Upon failure of the primary equipment, the operating standby equipment can be
automatically transitioned into performing the function of primary equipment.

An example of operating passive redundancies can be a secondary alternator that operates under
no-load conditions and meets all other paralleling conditions such as the same terminal voltage,
frequency, and phase sequence. Upon failure of the primary alternator, the secondary alternator
can be automatically synchronized with the system and take over the load.

In the case of non-operating passive redundancies, the standby equipment is powered down.
Upon failure of primary equipment, the standby equipment can be automatically or manually set
to operating conditions and take over the functionality of primary equipment.

A good example of non-operating passive redundancy is a standby municipal water pump which
can be started and operated manually to deliver water to residents if the primary water pump
malfunctions. Since the restoration of operation is not critical, an operator can go and start the
pump (and synchronize it with the system later, as needed).

Reliability techniques for analyzing fault tolerance
Fault tolerance is a part of reliability engineering efforts and requires careful examination of all
possible failures that can happen within the equipment. Failure Mode and Effects Analysis
(FMEA) and Fault Tree Analysis (FTA) are two well-known techniques for analyzing system
design using bottom-up and top-down approaches respectively.

To better understand tolerance, the failure sequence and dependencies must be analyzed and
investigated. A particularly useful technique for analyzing dependencies and sequence is
the Markov model, in which the probability of the next failure event depends only on the
current state of the system.
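
A minimal sketch of a Markov model is a single piece of equipment with two states, working and
failed. The per-step failure and repair probabilities below are hypothetical, and a real model
would usually have more states (for example, one per level of degraded operation).

```python
def availability_after(steps: int, p_fail: float, p_repair: float) -> float:
    """Probability of being in the working state after a number of time steps."""
    up = 1.0  # assume the equipment starts in the working state
    for _ in range(steps):
        # Stay up if no failure occurs, or return to up if a repair completes.
        up = up * (1.0 - p_fail) + (1.0 - up) * p_repair
    return up


# Hypothetical per-step probabilities, for illustration only.
p_fail, p_repair = 0.01, 0.2
long_run = availability_after(200, p_fail, p_repair)
steady_state = p_repair / (p_fail + p_repair)
print(f"after 200 steps: {long_run:.4f}, steady state: {steady_state:.4f}")
```

The next state depends only on the current one, which is the defining property of the model; the
iteration settles at the steady-state availability p_repair / (p_fail + p_repair).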

Similarly, another powerful technique is Monte Carlo simulation, which can be used to model the
impact of uncertainties in any failure event on the system performance.
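
As an illustrative sketch of a Monte Carlo approach, the code below estimates the chance that a
pair of redundant units survives a mission when the mean time between failures (MTBF) is only
known to lie within a range. The exponential lifetime model, the uniform MTBF range, and all
figures are assumptions made for the example.

```python
import random


def simulate_survival(trials: int, mission_h: float,
                      mtbf_low_h: float, mtbf_high_h: float, seed: int = 0) -> float:
    """Estimate the probability that at least one of two units outlasts the mission."""
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    survived = 0
    for _ in range(trials):
        # Uncertainty: the true MTBF is unknown, so draw it afresh for each trial.
        mtbf = rng.uniform(mtbf_low_h, mtbf_high_h)
        life_a = rng.expovariate(1.0 / mtbf)
        life_b = rng.expovariate(1.0 / mtbf)
        if max(life_a, life_b) >= mission_h:
            survived += 1
    return survived / trials


estimate = simulate_survival(trials=20000, mission_h=1000.0,
                             mtbf_low_h=4000.0, mtbf_high_h=6000.0, seed=1)
print(f"estimated mission reliability: {estimate:.3f}")
```

Increasing the number of trials narrows the statistical spread of the estimate at the cost of run
time.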

Fault tolerance and maintenance operations

Do fault-tolerant systems need less maintenance? Well, yes and no.

Because of redundancies and other characteristics we discussed earlier, such systems can usually
take on more faults before their functionality is compromised. However, if the issues aren’t
addressed, the accumulation of faults will eventually lead to a system or equipment breakdown.
Therefore, maintenance teams should use a computerized maintenance management system
(CMMS) to make sure corrective maintenance actions are taken in due time.

In some sense, fault tolerance gives maintenance and support teams more breathing room. They
still need to deal with the problem, but maybe not right away.

While fault-tolerant designs have their challenges in terms of increased costs and complexity,
they make up for it in the form of improved equipment reliability.

Root Cause Analysis (RCA): Steps, Tools, and Examples

What is root cause analysis?

By definition, root cause analysis is the process of finding the underlying cause for an effect we
observe or experience. In the context of failure analysis, RCA is used to find the root cause of
frequent machine malfunctions or a significant machine breakdown.

You’ll use your skills to determine:

• what happened
• why it happened
• how to prevent it from happening again

RCA is a reactive process, meaning it’s performed after the event occurs. But once a root cause
analysis is done, it takes the shape of a proactive mechanism, since its findings can be used to
anticipate and prevent problems before they occur.

If you fix a symptom of the problem, but you don’t fix the actual cause of the problem, there’s a
high chance the failure will happen again.

For example, suppose you replace the broken belt but don’t change the misaligned part causing
the belt to overheat and break. In that case, you could bet your paycheck that the belt is going to
fail again. RCA tries to follow the chain of causes and effects to pinpoint the problem that, once
eliminated, will make all the other faults disappear.

The RCA process and outcomes

Conducting root cause analysis can be very complicated. It involves a vast amount of data
collection and review. The result of a root cause analysis isn’t always black and white. It can’t
always tell you if the problem you identified is the root cause.

You will often get only a strong correlation between cause and effect and not the exact cause.
From there, you’ll have to use your experience and professional knowledge to judge whether to
investigate further or not.

RCA is a craft that requires specialized knowledge and in-the-field experience, which means
you’re likely the best person for the job here. Without that expertise, any fixes implemented will
likely be just a cosmetic solution to the problem. In the worst-case scenario, the changes made
could actually make the situation worse. Despite these limitations, RCA is still a powerful tool for
understanding and improving the fundamental nature of systems and procedures.

Industry applications
Over the years, RCA has evolved to work within various fields, each with its own unique needs
and approach. The most apparent use of RCA is in the medical field. Aside from the healthcare
field, many other industries use root cause analysis regularly. Some of them are:
• manufacturing (machine failure analysis)
• industrial engineering and robotics
• industrial process control and quality control
• information technology (software testing, incident management, cybersecurity analysis)
• complex event processing
• disaster management and accident analysis
• pharmaceutical research
• change management
• risk and safety management

These industries will generally use one specific type of root cause analysis that fits their situation
best. Below are some examples of different types of RCA methodologies used by various fields
and industries.

Different types of RCA

RCA comes in different forms depending on the problem you’re trying to solve. Here’s what
they look like:
• Safety-based RCA comes from the field of occupational safety and health, as well as
accident analysis. This type of root cause analysis is used to determine why an accident
happened at work (e.g. why someone cut themselves or why a part was accidentally dropped
by a worker at heights).
• Production-based RCA is used in the field of manufacturing to ensure quality control.
You might use this to find out why the injection-molded plastic parts are coming off the line
with defects.
• Process-based RCA is used in business and manufacturing to determine the fault in a
process or a system. This might be used in accounting to determine why vendors aren’t
getting paid on time.
• Failure-based RCA is used in engineering and maintenance to determine the root cause of
any type of equipment failure.
• Systems-based RCA combines two or more of the root cause analysis methods listed
above. It can be used in a wide variety of fields and applications.

When to perform a root cause analysis

When you’re doing an RCA to determine the source of a fault, you’ll usually find 3 basic types
of causes:
• physical causes
• human causes
• organizational causes

You can also do a root cause analysis if you want to drill down and find out exactly why a
process or procedure is producing better-than-average results. By identifying the cause of a
positive event, you could presumably replicate it and see those results elsewhere. Even if it’s
time-intensive, one round of RCA can mean a lot of bang for your buck.

Keep in mind that RCA requires a significant investment of time, manpower, and money. And it
will likely cause further disruption in the specific production line or the system you’re working
on. So, bearing that in mind, you don’t need to (and you shouldn’t) do RCA for every single
fault.

Unfortunately, there is no cut-and-dried rule for when to run an RCA and when not to. As the
expert and the experienced professional, you’re generally the best person to determine whether
or not to run a root cause analysis.

Persistent faults
If the same fault occurs over and over, it’s worth investigating. If the same defect is repeatedly
happening, you can assume that it won’t be cleared simply by fixing the visible problem. There
is an underlying reason for the recurring faults. These types of incidents need to be investigated
with RCA.

Critical failure
To determine if a failure is critical, you can look at the cost to the plant or the total downtime due
to the particular failure. When a critical failure occurs, it needs to be investigated to identify the
root cause to help avoid this situation in the future. Explosions at an oil rig and airplane crashes
are examples of critical failures that need to be investigated.

Failure impact
There are critical machines and critical subprocesses in any system. A failure of these types of
machines will halt the entire operation because there may not be a backup or mitigation plan for
that particular machine. In this case, how critical the machine is will determine whether or not to
do RCA.

The 3 Rs of Root Cause Analysis

No doubt you’ve heard these 3 Rs: “reduce, reuse, recycle” or maybe even “reading, writing,
arithmetic.” But RCA also has its own system of 3 Rs: Recognize, Rectify, Replicate.

The actual cause of a problem is not always apparent, and simple cosmetic fixes usually don’t do
much to correct the underlying fault. Even though RCA can be an elaborate time-consuming
exercise, we do it to pinpoint the actual cause so we can take corrective actions that will
eliminate future issues. As mentioned earlier, RCA can also be done to identify the reason for an
unexpected positive outcome.

Recognize
This first step is when you notice something’s not working quite right. The machine is
leaking fluid, making a weird sound, or not running as productively as it usually does. This is
when it’s time to put on your detective cap and find out what’s going on.

Rectify
Once you’ve recognized the root cause, it’s time to start a corrective course of action. If the
root cause is addressed, the same problem should not crop up again. If the same problem
reappears, it’s likely because the cause you identified was not actually the root cause. In this
case, you might have to go through the RCA process again to make sure that you get to the actual
root cause.

For example, you notice the machine is leaking fluid, so you patch the hole in the metal. If you
stop seeing fluid on the ground under the machine, you’ve solved the problem, and you’ve taken
care of the root issue. But if a leak crops up again in a week, it’s time to run another RCA to find
out if there are other holes in the metal or if gaskets are failing.

Replicate
Once you’ve identified and rectified the root cause, your next step is to ensure it will not happen
again at any point in the process or system. Sometimes you’ll want to do an RCA to get to
the bottom of an unexpectedly good outcome. In that case, you will test whether the same factors
can be replicated in other scenarios and environments.

Suppose there were issues with faulty parts coming off the line, but you’ve since fixed the issue.
The next step would be to replicate the conditions under which the problem occurred, to test
whether you actually fixed the root issue and got to the bottom of it.

RCA is about solving problems. But one of the most significant benefits for you is that being
skilled at RCA makes you look good. When you’re good at what you do, you can get
management on your side (which usually means an easier time getting the budget you need). And
it can even make a big enough impression that it can change your career trajectory for the better.

How to do a root cause analysis

RCA can be accomplished using many different tools and techniques. And even though those
processes may look different, they all arrive at the same end goal: fixing the root cause of the
problem.

To do a root cause analysis the right way, you should follow four basic steps.

Step 1: Define the problem

Start with the obvious: What is the problem? By defining the problem, the symptoms, and what
you can see happening, you set the scope and direction of the analysis.

Without a specific problem statement, it’s hard to create a path to a solution. A well-defined
problem statement also helps determine the scale and scope of the potential solution to be
implemented. When you’re writing your problem statement, keep these three pieces in mind:
• How would you describe the problem at hand?
• What do you see happening?
• What are the specific symptoms?

Step 2: Collect the data

Collect all available data related to the incident. Ask yourself, “What proof is there? How long
has this problem existed? What is the impact of the problem?” Be sure to record any other
data you think might help you determine the issue.

Take, for example, machine failure in a manufacturing plant. These are examples of the types of
information you’ll want to document:
• the age of the machine
• time of continuous operation
• operating patterns
• maintenance schedule
• operators handling the machine
• specifications of the machine
• schematic of the plant infrastructure
• operating characteristics of the machine
• characteristics of the operating environment

Inspecting the machine in person also provides information that could be beneficial for root
cause analysis. Facilities that run predictive maintenance will find it easy to collate this data.
Step 3: Map out the events

Establish a timeline of events. This will help you determine which factors among the data
collected are worth investigating. RCA needs data points that potentially lead to the root
cause. Putting events and data in chronological order helps to differentiate causal events
from non-causal events.

From the data collected, you can identify correlations between various events, their timing, and
other data collected. Remember that correlation does not mean causation.

Questions to ask yourself when looking for correlations:
• What sequence of events allowed this to happen?
• What conditions were present that allowed this to happen?
• What other problems surround the occurrence of the main problem?

The next step is to map out a causal graph. These graphs are used to represent the relationship
between events that happened and the data collected.

But it’s important not to stop investigating when you find a correlation between events.
Correlation means there is a link between two events, but it doesn’t automatically mean that one
event caused the other. That’s why it’s essential to continue your sleuthing until you find a
causal relationship. Find out which event caused another. This will help you find the actual
root cause.

From the data collected, chronological sequencing, and clustering, you should be able to create
a causal graph (or use one of the root cause analysis tools discussed later). The different paths
in the graph are given different probability weights, and the graph can serve as a visual tool to
track down the root cause.
Example of a causal graph. Source: Adam Kelleher on Medium
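
One possible way to work with a weighted causal graph in code is sketched below. It reuses the
belt example from earlier; the events, the weights, and the idea of scoring a path by multiplying
its weights are all assumptions made for illustration, and the graph is assumed to contain no
cycles.

```python
# edges[cause] = list of (effect, weight); the weights are assumed probabilities.
edges = {
    "misaligned pulley": [("belt overheating", 0.8)],
    "worn bearing": [("belt overheating", 0.3), ("vibration", 0.9)],
    "belt overheating": [("belt breaks", 0.7)],
    "vibration": [("belt breaks", 0.2)],
}


def ranked_paths(effect, edges):
    """Return (path, weight) for each causal chain ending at `effect`, strongest first."""
    found = []

    def walk(node, path, weight):
        for nxt, w in edges.get(node, []):
            if nxt == effect:
                found.append((path + [nxt], weight * w))
            else:
                walk(nxt, path + [nxt], weight * w)

    # Candidate root causes are events that are never the effect of another event.
    all_effects = {e for outs in edges.values() for e, _ in outs}
    for root in set(edges) - all_effects:
        walk(root, [root], 1.0)
    return sorted(found, key=lambda pw: pw[1], reverse=True)


paths = ranked_paths("belt breaks", edges)
for path, weight in paths:
    print(f"{weight:.2f}  " + " -> ".join(path))
```

The highest-weighted path is only the most likely candidate. As noted above, it still has to be
confirmed as a causal relationship rather than a correlation.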

Step 4: Solve the root of the problem

Once you’ve identified the root cause, you can quickly determine the best solution to fix it. You
can then map it against the scope defined in your initial problem statement. If the solution works
with your available resources, it can be implemented.
Fixing the root cause should eliminate the issues. If the symptoms occur again, it’s time to
return to the drawing board and conduct RCA again.

Once the problem is solved, you will need to take proactive steps to ensure it doesn’t happen
again. There can be multiple solutions applied to solve a single issue.

For example, the root cause could be the wear of a bearing, which happened much earlier than
expected. In this case, the procedure has to be adjusted to change the bearing at an earlier time.
Similar steps to avoid recurrence of fault can be changes in the maintenance schedule, different
modes of maintenance, changes in design, different OEM vendors, etc.

The implemented solution will have to be in line with the available resources. So, if the root
cause is pushing the machine too hard, the obvious answer is to shorten the machine run time.
However, if the production schedule doesn’t allow for shortened runtimes, another solution
might be scheduling more preventive maintenance.

Tried-and-true RCA tools and techniques

There are many tried and trusted frameworks available to execute RCA. None of these methods
are foolproof, but they provide a solid base for how to go about root problem investigation. Each
method has its own list of benefits and shortfalls. Some methods are more suitable for different
industries and types of problems.

You and your company should have your own unique protocol when conducting RCA. In
some instances, external consultants might be brought in to conduct RCA. In such cases, the
consultants will generally have their own preferred technique or a combination of techniques
they use. This is one of the reasons why it is hard to create a universal template for RCA that
everyone can follow.

Let’s look at the different forms of root cause analyses.

5 Why analysis
5 Whys is a technique originally developed by Sakichi Toyoda for root cause analysis at Toyota
factories. It means questioning everything with a ‘why’, just like a curious child. Keep asking
‘why’ until you reach a stage where there is no need to ask ‘why’ again. At that point, you
should have reached the root cause of the problem.

As a rule of thumb, asking and answering 5 successive ‘whys’ should be more than
enough to reveal the root cause of most problems. Hence the name ‘5 Whys’ analysis.

Benefits of the 5 Whys:
• helps identify the root cause of a problem
• offers an understanding of how one process can cause a chain of problems
• helps determine the relationship between different root causes
• highly effective without complicated evaluation techniques

When to use the 5 Whys:

• for simple to moderately complex problems
• for more complex issues, in conjunction with another method
• any time human error is involved in the issue

Fishbone diagram (a.k.a. Ishikawa diagram)

The Ishikawa method for root cause analysis emerged from quality control techniques employed
in the Japanese shipbuilding industry by Kaoru Ishikawa. The shape of the resulting diagram
looks like a fishbone, which is why it is called a fishbone diagram. This diagram is built on the
idea that multiple factors can lead to a failure/event/effect.

The 5 M framework (shown above) from the Toyota Production System uses RCA with the
Ishikawa method. The 5 Ms are:
• man/mind power
• machines
• measurement
• methods
• material

The problem or fault is written down at the far right end, where the fish head would be. The
cause of the problem is represented along the horizontal line. Further effects and their respective
causes are written down along the fish bones representing each of the 5 Ms. This process
continues until the team is convinced that the root cause is identified.

Benefits of the fishbone diagram:

• a good way to brainstorm within a defined structure
• helps to visually diagram a problem or condition’s root cause
• helps to show bottlenecks in the process
• helps to find ways to improve the process

When to use a fishbone diagram:
• to analyze a complex problem with many causes
• when you need a different view of the issue
• to identify root causes
• to identify bottlenecks and identify issues where a process doesn’t work

Failure mode and effects analysis (FMEA)

FMEA is a proactive approach to root cause analysis, preventing potential failures of a machine
or system. It is a combination of reliability engineering, safety engineering, and quality control
efforts. It tries to predict future failures and defects by analyzing past data.

A diverse cross-functional team is essential when using FMEA. You will need to clearly define
and communicate the scope of the analysis to your team members. Each subsystem, design, and
process is closely reviewed. The purpose, need, and function of each system are questioned.
Potential failure modes are brainstormed. Failure of similar processes and products in the past
can also be analyzed.

The potential effects and disruptions that could be caused by each of the identified failure modes
are assessed and used to calculate its risk priority number (RPN), typically the product of its
severity, occurrence, and detection ratings.
If a failure mode has a higher RPN than the company is comfortable with, you can address this
by changing one or more of the factors outlined in the image above.
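
As a small sketch of the RPN calculation, the example below assumes the common convention that
the RPN is the product of severity, occurrence, and detection ratings, each scored from 1 to 10.
The failure modes, their scores, and the threshold of 100 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always detected) to 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes and ratings, for illustration only.
modes = [
    FailureMode("bearing wear", severity=7, occurrence=5, detection=4),
    FailureMode("belt misalignment", severity=5, occurrence=6, detection=3),
    FailureMode("seal leak", severity=8, occurrence=2, detection=2),
]
threshold = 100  # assumed limit the company is comfortable with

ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
needs_action = [m.name for m in ranked if m.rpn > threshold]
for m in ranked:
    print(f"{m.name}: RPN = {m.rpn}")
```

Lowering any one of the three ratings, for example by adding an inspection that improves
detection, reduces the RPN of that failure mode.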

Benefits of FMEA:
• enables early identification of a failure point
• captures the collective knowledge of a team
• improves the quality, reliability, and safety of the process
• a logical, structured approach for identifying process areas of concern
• reduces process development time and cost
• documents and tracks risk reduction activities

When to use the FMEA methodologies:

• when designing a new product, process, or service (DFMEA)
• when you’re going to update a current way of doing things
• when you have a plan for quality improvement
• when you need to understand the failures in a process and improve upon them (PFMEA)

Fault tree analysis (FTA)
Fault tree analysis is a method for root cause analysis that uses Boolean logic (using AND, OR,
and NOT) to figure out the cause of failure. It was developed at Bell Laboratories to evaluate an
Intercontinental Ballistic Missile (ICBM) launch control system for the U.S. Air Force.
Fault tree analysis example. Source: Six Sigma Study Guide

Fault tree analysis tries to map the logical relationships between faults and the subsystems
of a machine. The fault you are analyzing is placed at the top of the chart. If two causes have a
logical OR combination causing effect, they are combined with a logical OR operator. For
example, if a machine can fail while in operation or while under maintenance, it is a logical OR

If two causes need to occur simultaneously for the fault to happen, the relationship is represented
with a logical AND. For example, if a machine only fails when the operator pushes the wrong button
AND a relay fails to activate, it is a logical AND relationship, drawn using the Boolean AND
symbol. In the image above, AND is the blue symbol, and OR is the purple symbol.
The fault tree created for a failure is analyzed for possible improvements and risk management.
This is an effective tool to conduct RCA for automated machines and systems.
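
The Boolean logic of a fault tree can be sketched in a few lines of code. The example below reuses the two scenarios from the text; the event names and the way the two branches are joined into one tree are assumptions made for illustration.

```python
def AND(*inputs: bool) -> bool:
    # The output event occurs only if every input event occurs.
    return all(inputs)


def OR(*inputs: bool) -> bool:
    # The output event occurs if at least one input event occurs.
    return any(inputs)


def machine_fails(wrong_button: bool, relay_fails: bool,
                  maintenance_error: bool) -> bool:
    # Hypothetical tree: the top event is reached either through an
    # operating fault (wrong button AND relay failure) OR through an
    # error made during maintenance.
    operating_fault = AND(wrong_button, relay_fails)
    return OR(operating_fault, maintenance_error)


print(machine_fails(True, False, False))  # wrong button alone
print(machine_fails(True, True, False))   # both operating causes together
print(machine_fails(False, False, True))  # maintenance error alone
```

Evaluating the tree for different combinations of basic events shows which combinations are enough to trigger the top-level fault.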

Benefits of using a fault tree analysis:

 uses deduction to find the causes of each event, like the 5 whys
 highlights the critical elements related to system failure
 creates a visual representation for analysis
 can focus on one area of failure at a time
 exposes system behavior and possible interactions
 accounts for human error
 promotes effective communication

When to use a fault tree analysis:

 when the effect of a failure is known — to find out how it might be caused by a
combination of other factors
 when designing a solution — to identify ways it may fail in order to make the solution more
 to identify risks in a system
 to find failures that can cause the failure of all parts of a “fault-tolerant system”

Pareto charts
A Pareto chart shows the frequency of defects and their cumulative effect. Italian economist
Vilfredo Pareto recognized a common theme in almost all the frequency distributions he could
observe: a vast imbalance between causes and the effects they produce.
He proposed that in most systems, roughly 80% of the results (or failures) are caused by 20% of all
potential reasons.

The principle is dubbed the Pareto principle (some know it as the 80-20 rule). This skew between
cause and effect is evident in many different distributions, from wealth distribution among
people to failures in a machine.
Pareto chart for shirt defects. Source:

With the 80-20 principle in mind, you can use Pareto analysis to dig into failures and possible
causes. To start, draw a bar graph that includes the frequency of faults and causes. With this
graph, it’s easier to see the skew between causes and failures. Usually, you’ll see how a small
percentage of factors cause the majority of faults.
Next, you’ll analyze the causes that contribute to the largest number of faults and take corrective
action to eliminate the most common defects.
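
The procedure above can be sketched with a small script that counts faults per cause, sorts them, and accumulates percentages. The fault log is made-up data, and the 80% cut-off is simply the rule of thumb from the Pareto principle; real data rarely splits exactly 80-20.

```python
from collections import Counter

# Hypothetical fault log: one entry per recorded fault, labelled by cause.
fault_log = (
    ["misalignment"] * 42 + ["worn bearing"] * 31 + ["loose wiring"] * 12
    + ["operator error"] * 8 + ["power surge"] * 4 + ["other"] * 3
)

counts = Counter(fault_log)
total = sum(counts.values())

# Sort causes by frequency and accumulate their share of all faults.
pareto = []
cumulative = 0
for cause, count in counts.most_common():
    cumulative += count
    pareto.append((cause, count, round(100 * cumulative / total, 1)))

for cause, count, cum_pct in pareto:
    print(f"{cause:15s} {count:3d} {cum_pct:5.1f}%")

# The "vital few": the shortest list of top causes covering at least 80%.
vital_few = []
for cause, _, cum_pct in pareto:
    vital_few.append(cause)
    if cum_pct >= 80:
        break
print("Investigate first:", vital_few)
```

The bars of a Pareto chart correspond to the counts, and the line drawn over them corresponds to the cumulative percentage column.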

Benefits of using Pareto charts:

 defects are ranked in order of frequency or impact, with the most significant handled first
 can determine the cumulative impact of the defect
 offers a better explanation of defects that need to be resolved first

When to use a Pareto chart:

 to analyze problems or causes in a process that involves the frequency of occurrence, time,
or cost
 to narrow down a list of problems to find the most significant
 to analyze a problem with a broad list of causes to identify specific components
Pareto charts work well for deciding which problems to take up first in root cause analysis.
According to the Pareto principle, eliminating the most common 20% of failure causes can reduce
the overall number of malfunctions by roughly 80%. Pareto charts indicate the top failure
causes to be further investigated and addressed, according to the criticality of the machine, the
impact of the failure of a specific part, or a combination of the two.

Honorable mentions
Root cause analysis is very open-ended and has a lot of widely used tools in various industries.
We covered the major ones in the sections above, but these systems also deserve some
recognition. A few honorable mentions:
 Cause and effect diagrams. The Fishbone diagram is an example of cause and effect
diagrams. Many similar tools try to map the relationship between causes and effects in a
 Kaizen is another tool from the stable of Japanese process improvements. It is a continuous
process improvement method. Root cause analysis is embedded within the structure of
 Barrier analysis is an RCA technique commonly used for safety incidents. It is based on
the idea that a barrier between personnel and potential hazards can prevent most safety
 Change analysis is used when a potential incident occurs due to a single element or factor
 A scatter diagram is a statistical tool that plots the relationship between two variables in a
two-dimensional chart. It can also be used as an RCA tool.

Root cause analysis examples

RCA example #1: The case of the faulty parts
Injection molding machines are widely used around the world to create plastic parts in almost any
shape or form. The part the machine produces should match specifications within the allowable
Let’s say there is a high incidence rate of faulty products, and we need to get to the bottom of it.
First, the problem needs to be well defined. This includes describing the exact defect in the plastic
output. By observing the output, we can determine whether it is one of the four primary
defects in injection molding. They are:
1. flash
2. gassing & venting
3. part distortion
4. short mold

Let’s presume that the defect is part distortion. First, write down the problem, including the
defect rate as a percentage of parts produced. Once that is completed, collect all the available
data. Pull any maintenance logs from your CMMS, review manuals from the
injection molding machine manufacturer, etc.

Collect information on each defective product. From this, measure the deviation from
specifications. Take the heat signature of the product once it comes out of the mold, then
measure the temperature of molten plastic in the barrel.

We know that part distortion almost always occurs due to temperature problems. But we cannot
be sure where the temperature problem is: is it in the barrel while heating or in the mold while
cooling?

By analyzing the data you collected, you will be able to answer that question. For this example, we’ll
assume the heat signature of the finished product is different from the expected one.
This indicates that the problem is in the cooling process. Further investigation concludes that
the root problem is the wrong spatial arrangement of the cooling liquid conduits.

Changing the conduit arrangement to one that best fits the mold currently in use will solve the
problem of part distortion.
RCA example #2: The mystery of the blown fuse
Next, let’s say a machine stopped because it overloaded and the fuse blew.
Investigation shows that the machine was overloaded because it had a bearing that wasn’t being
sufficiently lubricated.

Your investigation continues, and you find that the automatic lubrication mechanism had a pump
that was not pumping sufficiently. A review of the pump shows that it has a worn shaft.
Investigation of why the shaft was worn discovers that there isn’t an adequate mechanism in
place to prevent metal scraps from getting into the pump. This enabled scraps to get into the
pump and damage it.

The apparent root cause of the problem is metal scrap contaminating the lubrication system.
Fixing this problem should prevent the whole sequence of events from happening again. The real
root cause could be a design issue if there is no filter to prevent the metal scrap from getting into
the system. If there is a filter but it was blocked due to a lack of routine maintenance, then the
actual root cause is a maintenance issue.

Compare this with an investigation that does not find the causal factor: replacing the fuse, the
bearing, or the lubrication pump will probably allow the machine to go back into operation for a
while. But there is a risk that the problem will keep recurring until the root cause is dealt with.
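
The chain of "why?" questions in this example can be written down as a simple lookup from each problem to the cause found for it. This is only a sketch of the reasoning; the dictionary and the function name `trace_root_cause` are illustrative, not part of any RCA tool.

```python
# Each observed problem maps to the cause found for it in the example above.
cause_of = {
    "fuse blew": "machine overloaded",
    "machine overloaded": "bearing not sufficiently lubricated",
    "bearing not sufficiently lubricated": "lubrication pump not pumping sufficiently",
    "lubrication pump not pumping sufficiently": "pump shaft worn",
    "pump shaft worn": "metal scrap getting into the pump",
}


def trace_root_cause(symptom: str) -> list[str]:
    # Keep asking "why?" until no deeper cause is recorded.
    chain = [symptom]
    while chain[-1] in cause_of:
        chain.append(cause_of[chain[-1]])
    return chain


chain = trace_root_cause("fuse blew")
for depth, event in enumerate(chain):
    print("  " * depth + event)
```

Stopping at any earlier link (the fuse, the bearing, or the pump) fixes a symptom; only the last entry in the chain addresses what started the sequence.
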
5 Steps to Troubleshooting That Will Fix Just About Anything

Everything breaks eventually. When rebooting doesn’t solve the problem, we brainstorm causes
and test them to find the issue. That is troubleshooting in a nutshell.
This lesson looks at:
 What troubleshooting is
 Some common causes
 How to streamline the process using your CMMS (computerized maintenance management system)

What is troubleshooting?
Troubleshooting is a step-by-step approach to finding the root cause of an issue and deciding the
best way to fix it and get the equipment back in operation. Troubleshooting is not just for equipment
that has completely broken down. We also use it when a machine is just not working as expected.
Efficient troubleshooting is an essential part of asset management, diagnosis, and repair.
Machines that are properly operated and regularly maintained are less likely to suffer major
breakdowns. Still, there will never be a zero chance of failure. If you are using equipment, it will,
at some point, need repairing.

When and why to troubleshoot?

It may seem obvious that troubleshooting occurs whenever there is any kind of, well, trouble.
But anticipating the different types of problems that may arise can help you streamline your
response. Broadly speaking, troubleshooting is done in the following instances:
1) Device failure
This is the big one: the most urgent reason to troubleshoot. The machine is broken, entirely out
of commission, and needs to be fixed pronto so that work can continue. A failure like this can have a
knock-on effect across a company, bringing all operations to a grinding halt.

The fact is, unplanned downtime is expensive for companies; by some industry estimates, it can cost
hundreds of thousands of dollars per hour. If you’ve got a capable maintenance team that knows
how to troubleshoot effectively, you can reduce high-severity outages and save
the company money.

2) Unexpected operation
Every machine has a defined set of functions it can perform. Most devices don’t do things
exactly the same way every time because of limitations in engineering and human error (as hard
as we may try to avoid it). Even with these slight variations in performance, the machine can
operate smoothly. This is considered its normal operation range.

If the machine starts to run outside these ranges, we may have a problem, and it needs to be on
your crew’s radar. These situations are not as urgent as a total failure. Still, unexpected
operation should be reported so the problem can be fixed before it turns into a failure.
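
A minimal sketch of checking readings against a normal operating range is shown below. The parameter names and limits are hypothetical; real limits would come from the OEM manual or from the machine's own history.

```python
# Hypothetical normal operating ranges; real limits come from the OEM manual.
NORMAL_RANGES = {
    "temperature_c": (40.0, 75.0),
    "vibration_mm_s": (0.0, 4.5),
    "pressure_bar": (5.5, 6.5),
}


def out_of_range(readings: dict[str, float]) -> dict[str, float]:
    # Return only the readings that fall outside their normal range.
    flagged = {}
    for name, value in readings.items():
        low, high = NORMAL_RANGES[name]
        if not low <= value <= high:
            flagged[name] = value
    return flagged


readings = {"temperature_c": 81.2, "vibration_mm_s": 3.1, "pressure_bar": 6.0}
print(out_of_range(readings))
```

Anything the check flags is a candidate for a fault report, even though the machine is still running.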

Take the cooling fans in your plant, for example. Imagine they are running and pushing out cool
air, but every so often, they stop blowing for a few minutes (or the air isn’t as cold as it should
be). Other equipment might overheat because of that malfunction and eventually start to break
down. Fixing the fan as soon as you know about it will save the company time and a lot of
money.

Getting operators to log faults when they come up can be a great way to get to the
problems early and avoid total failure. Using your CMMS to log the problem will give you a
written history of what happened and how it was fixed, making future troubleshooting
that much easier.

3) Other anomalies
The machine is working within the ideal operating range and is delivering the expected output.
However, an operator has spotted some anomaly. It could be a strange sound, a weird smell,
visible smoke, excessive vibration, etc. Such anomalies should also be investigated within an
appropriate time window.

The process for reporting problems should never be made into a tedious task. Keeping it simple is
the only way to ensure people use it.

With detailed asset history logs and troubleshooting experience, users can take care of things
independently. This will free up more time for your team to focus on things that matter more.

What are the benefits of troubleshooting?

There are a lot of costs that come with reactive maintenance and a lack of troubleshooting know-
how. What we don’t always consider is that these costs go beyond pure dollars and cents.
A penny saved is a penny earned
The immediate costs are the most apparent costs linked to maintenance and repairs. These are the
actual, unplanned dollars that it costs to repair broken and faulty equipment. Expenses like these
often put the finance team up in arms and get them wondering why maintenance is so
expensive.

In the long term, repeated breakdowns, failures, and stops in production can lead to the need to
bring in expensive vendors for repairs and replacement of the asset.
Being able to troubleshoot well and having all the information you need at your fingertips
will give you the leverage to reframe the conversation and relationship. Instead of Finance
coming to you wondering why everything you need costs so much money, you can say, “Hey,
look at how much we’ve saved you. This could have cost hundreds, if not thousands, more.”
Now, as far as Finance is concerned, you’re the hero instead of the villain.

The show must go on

Downtime is expensive — more expensive than just the cost of fixing the machine. When you’ve
got equipment that’s broken down, it stops your revenue-producing activities in their tracks.
Every minute you can’t operate is more money out the window. The faster your maintenance
crew can get things running again, the more money you stand to save.
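
To make the arithmetic concrete, the sketch below adds lost production revenue to the repair bill. The function name and all dollar figures are hypothetical, chosen only to show how repair speed changes the total.

```python
def downtime_cost(minutes_down: float, revenue_per_minute: float,
                  repair_cost: float) -> float:
    # Total cost = production revenue lost while stopped + cost of the repair.
    return minutes_down * revenue_per_minute + repair_cost


# Hypothetical figures: the same $1,200 repair, finished in 4 hours vs. 1 hour.
slow = downtime_cost(minutes_down=240, revenue_per_minute=150, repair_cost=1200)
fast = downtime_cost(minutes_down=60, revenue_per_minute=150, repair_cost=1200)
print(slow, fast, slow - fast)
```

With these assumed numbers, the repair itself is a small part of the bill; most of the difference comes from how long production is stopped.
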
Who performs troubleshooting?
Often, the most experienced technicians are the ones doing the troubleshooting. Unfortunately,
60% of these maintenance professionals are retiring in the coming few years.

What makes these technicians so good at what they do? Many of them have learned through trial
and error which troubleshooting techniques work best for each piece of equipment. There is
massive value in having those senior technicians run the troubleshooting teams and
create checklists that cover the most common issues.

The problem is that when all these experienced technicians retire, they take their knowledge with
them. There is already a big labor shortage in the industry. If we haven’t codified the
information into a central hub (like Limble), we risk losing valuable historical info
when they leave.

Depending on the complexity of the machine, your maintenance crew can train experienced users
to handle straightforward troubleshooting tasks. These users will need to perform visual checks,
general troubleshooting, and other basic maintenance tasks. It is an approach known as autonomous
maintenance.

If users or operators are troubleshooting, you need an easy-to-understand, user-friendly method
for collecting and saving as much information as possible. This can make current and future
repairs far less complicated.

Troubleshooting steps

Troubleshooting is a step-by-step process. Below, we break it down into five simple-to-follow
steps, plus a bonus step. It doesn’t matter if you are a seasoned or inexperienced professional; you
will follow the same systematic approach every time.
Step 1: Define the problem
The first step of solving any problem is to know what type of problem it is and define it well. A
clear definition is fundamental when troubleshooting. When looking at a problem, you need to
know what you are up against and the possible causes. Is it machine failure, an unexpected
operation, user error, or a random anomaly? What happened that alerted you to the problem?
Some equipment will have built-in ways of letting you know; alarms can sound, red lights flash,
or a warning can go off when certain parts overheat. These signals can help with problem-
solving. Other equipment just stops working. Whatever the case may be, you have to identify and
define the problem before you can move forward.

Step 2: Collect relevant information

You need to gather all the available information about the machine and its operations. You’ll
need the machine manual and any data regarding operations: for example, how often the machine is
used, by whom, for what, and for how long. You will also need the maintenance history, problem
reports, etc. A modern CMMS like Limble should have the option to keep a digital copy of all
the documents, history, and information. If communication with the Original Equipment
Manufacturer (OEM) is possible, the maintenance crew can discuss the issue with them first. Sometimes
calling the OEM is the fastest and easiest way to get the right help.

Step 3: Analyze collected data

Using all the information you have gathered, available checklists, and as much technical know-
how as you can muster, you can now try to determine the root cause of the problem. Seek out
expertise from other maintenance troubleshooters or the person who reported the fault. It’s much
easier to solve a problem that you have seen before.
Think about recent changes to the asset. Ask yourself:
 Did we use new replacement parts?
 Has there been an upgrade lately?
 Did we change the type of input material we use?
 Has the device been used in a different way than usual?
 Has there been an electrical surge?
Recent changes to the system or environment can often explain why the problem has come up.
If you still have no clue what caused the problem after analyzing the data, you need to go back to
Step 2 and collect more info. It is possible to overlook things or disregard something as
unimportant during the first round of the information-gathering process.
After this exercise, the person performing troubleshooting should form an educated guess and
put forward some solutions.
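
One way to act on the "recent changes" questions is to filter the asset's change history to the period just before the fault. The sketch below assumes a simple list of dated entries and a 30-day window; both the data and the function name `recent_changes` are made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical change log for one asset; in practice this would come from
# the maintenance history kept in a CMMS.
change_log = [
    (date(2024, 1, 10), "annual inspection"),
    (date(2024, 5, 2), "firmware upgrade"),
    (date(2024, 5, 20), "replacement belt from new supplier"),
    (date(2024, 5, 27), "switched input material"),
]


def recent_changes(log, fault_date: date, window_days: int = 30):
    # Keep changes made within the window leading up to the fault,
    # listed most recent first.
    start = fault_date - timedelta(days=window_days)
    hits = [(d, what) for d, what in log if start <= d <= fault_date]
    return sorted(hits, reverse=True)


for changed_on, what in recent_changes(change_log, date(2024, 5, 30)):
    print(changed_on, what)
```

If nothing in the window explains the fault, widening `window_days` is the code equivalent of going back to Step 2 for more information.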

Step 4: Propose a solution and test it

Using what you know from above, you can create your plan of attack. You will get to the
solution through a process of elimination and trial and error. In some cases, you may be able to
test your theory on a smaller scale asset. You may have multiple options to try. Start with the
simplest one first and work from there.
Take the following into account:
 potential safety concerns
 all the required resources and associated costs
 how complex the implementation will be
 the long-term outlook for the machine
 any personal biases the person performing the troubleshooting may have

Keep testing until you are sure that you have found the right solution. If nothing works, you will
need to rethink what the actual cause is.
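
The "simplest first" process of elimination can be sketched as a loop over candidate fixes sorted by complexity. The candidates, their complexity scores, and the stand-in test functions are all hypothetical; in practice each test is a technician trying the fix and checking the machine.

```python
from typing import Callable

# Hypothetical candidate fixes: (description, complexity 1-5, test function).
# Each test function stands in for actually trying the fix and checking
# whether the machine then behaves normally.
candidates: list[tuple[str, int, Callable[[], bool]]] = [
    ("replace control board", 5, lambda: True),
    ("reseat loose connector", 1, lambda: False),
    ("clean clogged filter", 2, lambda: True),
]


def first_working_fix(options):
    # Try the simplest option first and stop at the first one that works.
    tried = []
    for description, _, test in sorted(options, key=lambda o: o[1]):
        tried.append(description)
        if test():
            return description, tried
    return None, tried  # nothing worked: rethink the suspected cause


fix, tried = first_working_fix(candidates)
print("Tried:", tried)
print("Fix:", fix)
```

Keeping the list of options already tried is useful in its own right, since it becomes part of the record for the next person who sees the same fault.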

Step 5: Implement the solution

Once you have accurately diagnosed the problem, found the solution, and tested it, it’s time to
get your hands dirty and fix it. Even if your solution worked during testing, it is important to test
it again. Ensure the asset is working the way it should before you pack up and sign off. You’ll
also want to make a note of all the steps you take as you make them, so you don’t forget what
you’ve done.
Bonus step: It’s fixed! You’re a hero! Now what?
It sounds obvious, but it is crucial to document the solution and add it to the asset log in your
CMMS. It’s easy to get carried away working and forget to document your findings. “Ah, I’ll do
it next time,” you might think. But what if you don’t remember next time? Then we’re in
trouble. As you’re going through the process, take the time to do it right and save yourself the
trouble next time.

A practical maintenance toolkit holds as much information about an asset as possible. In Limble,
tracking an asset’s history is ridiculously easy. You can see all related Work Orders, Parts, who
did work most recently – you can even manually add notes and images taken with your phone.

Example of an asset log entry in Limble CMMS

By keeping a record of every step, from reporting the fault or failure to the five steps above, you
can create a clear path through the troubleshooting journey to repair or, in some cases, show the
need to replace the asset.
Imagine how easy it will be to fix if the problem happens again!
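
As an illustration of what such a record might hold, here is a minimal sketch of an asset log entry and a lookup for earlier fixes. The field names and the `past_fixes` helper are hypothetical and do not reflect the data model of Limble or any other CMMS.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LogEntry:
    # Field names are hypothetical, not those of any particular CMMS.
    asset_id: str
    reported_on: date
    problem: str
    root_cause: str
    solution: str
    steps_taken: list[str] = field(default_factory=list)


asset_log: list[LogEntry] = []

asset_log.append(LogEntry(
    asset_id="PUMP-07",
    reported_on=date(2024, 3, 14),
    problem="machine overloaded and fuse blew",
    root_cause="metal scrap in lubrication pump",
    solution="installed inlet filter, replaced worn pump shaft",
    steps_taken=["checked bearing", "inspected pump", "found worn shaft"],
))


def past_fixes(log: list[LogEntry], asset_id: str, keyword: str) -> list[LogEntry]:
    # Look up earlier entries for the same asset that mention the keyword.
    return [e for e in log
            if e.asset_id == asset_id and keyword.lower() in e.problem.lower()]


for entry in past_fixes(asset_log, "PUMP-07", "fuse"):
    print(entry.reported_on, entry.root_cause, "->", entry.solution)
```
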
Ways to make troubleshooting easier
We are here to make your job easier. Troubleshooting can feel overwhelming
and disorganized, but there are many tools available to help you and your crew get to the bottom of
any problem. Below are a few of the commonly used tools and resources for effective
troubleshooting.

Troubleshooting checklists
Checklists are a great way to approach common problems methodically and help standardize the
process. They do the heavy lifting for you. When you’ve got a lot going on it can be risky to rely
on your own brain to remember all of the steps. Having a checklist means that you don’t have to.
Maintenance platforms like Limble also let you create and store troubleshooting checklists that
can be accessed on mobile devices and used in the field.

Maintenance engineers can work with experienced technicians to identify problematic assets and
create step-by-step troubleshooting instructions that include warnings and images for specific
assets/issues. When you finish, you can attach each checklist to the corresponding piece of
equipment.

A modern CMMS
Having the right CMMS can streamline, organize, and automate your maintenance
operations. A modern CMMS will save you and your team time and your company a lot of
money.

As a centralized repository of maintenance data, a CMMS keeps a lot of helpful information used
during the troubleshooting process, like:
 OEM manuals
 contact information for machine and parts vendors
 maintenance logs and reports
 details of the work request sent to report the problem
 troubleshooting and other maintenance checklists
 past and current machine-condition and performance data gathered through CBM sensors

Limble CMMS uses QR codes to give your users easy access to all the information about the
equipment with a simple scan from their phone. They can scan the code on the side of the
equipment and quickly report faults to your team with the correct asset already attached to the
work order.

Having quick and easy access to this information can significantly speed up the troubleshooting
process and reduce the loss of institutional knowledge when technicians retire or move on. These
are just a few of many reasons why more and more organizations are implementing cloud-based
maintenance solutions.

The future of troubleshooting

Factories are becoming more automated, and machines need fewer operators. Because of these
changes, the number of technicians required for troubleshooting and equipment maintenance is
Luckily, technology is making troubleshooting easier, faster, and less dangerous. Here are some
solutions that are making their way to many plant floors.

A robot with a crystal ball

Can you imagine a world where computers fix themselves? Machine learning is a step towards
this. It gives systems the ability to learn and get better at things without being explicitly programmed. It
can help predict possible problems and is a big part of predictive maintenance.

When it comes to troubleshooting, machine learning is helping us analyze large amounts of data
and identify/predict possible causes of faults and failures.
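
The idea of learning "normal" from data and flagging departures from it can be shown with a very simple statistical check. This is a stand-in for illustration, not a machine learning model: it flags readings more than three standard deviations from the historical mean, and the vibration figures are made up.

```python
from statistics import mean, stdev


def find_anomalies(history: list[float], new_readings: list[float],
                   threshold: float = 3.0) -> list[float]:
    # Learn what "normal" looks like from historical readings, then flag
    # new readings more than `threshold` standard deviations from the mean.
    mu = mean(history)
    sigma = stdev(history)
    return [x for x in new_readings if abs(x - mu) > threshold * sigma]


# Hypothetical vibration readings (mm/s) from a healthy machine.
history = [2.1, 2.0, 2.2, 1.9, 2.1, 2.0, 2.2, 2.1, 1.9, 2.0]
new_readings = [2.1, 2.3, 3.4, 2.0]

print(find_anomalies(history, new_readings))
```

Production systems typically combine many sensor signals and use trained models, but the principle of comparing new data against learned normal behaviour is the same.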

Some organizations are already taking things a step further and testing something called
prescriptive analytics. In the context of troubleshooting, prescriptive analytics aims to help
machines diagnose themselves and then present possible solutions based on that self-diagnosis.
Enhancing the real world with AR
Augmented reality (AR) combines computer-generated imagery with the actual equipment to
give an additional layer of information. You can overlay parts and look into things that you
ordinarily wouldn’t be able to.

All you need is a phone or tablet loaded with the software. Hold it over the machine, and the
program will pull up all the different layers for you to look at.
If you are in the middle of a diagnosis, this can be a great way to check if everything is where it
should be or make sure that it is in good working order.

Augmented reality in quality control. Source:

AR allows your maintenance team to see all the information about a component on the screen. It
can also show you tips, warnings, and next steps, improving quality and safety during the
troubleshooting process.

