
CPS Mid Sem


GSFCU B.Tech (Chem.) - Final Year
Lecture 1 & 2
09-06-2021, 09:00 to 11:00 AM

Unit-1

ALARM MANAGEMENT- Best Practices


Alarm Management Best Practices
The Alarm Problem

People who can help

The ANSI/ISA 18.2 Alarm Management Standard

Seven Steps to a Highly Effective Alarm System

Summary
The Alarm Problem
"My definition of an expert in any field is a person who knows enough about what is really going on to be
scared" P.J.Plauger

What is an Alarm?
"An alarm is an audible and/or visible means of indicating to the operator an equipment malfunction, process deviation, or abnormal condition requiring a response."
Poorly performing alarm systems have contributed to serious incidents and accidents (for example, the Bhopal disaster in India and the Three Mile Island accident in the US).
What keeps the process within limits?
Can alarms alone maintain chemical process safety?
Alarm Purpose (exida slide: a typical process upset, showing the process value leaving the normal band, the alarm point, the operator's corrective response, and the shutdown limit).

Purpose: to notify the operator of an equipment malfunction, process deviation, or abnormal condition that requires a timely response (to prevent an impending consequence).
Keep the process within normal operating limits.
Maintain plant safety.

Is an alarm a perfect Independent Protection Layer (IPL)?

Alarms are One of the First Protection Layers
Layers of protection, from outermost to innermost:
- Community emergency response
- Plant emergency response
- Passive / active protection
- Safety Instrumented System (SIS) trip
- Operator intervention (alarm)
- Process control loop
- Process design (process value)

OLD is GOLD!
(No programming skill needed; easy to maintain due to the limited number of alarms.)
The operator easily detects, diagnoses, and responds!
Adding an alarm, however, needs wiring from the sensor to a panel lamp, through the annunciator panel to its dedicated window!

The Good Old Days (Rockwell Automation slide: photograph of a hardwired annunciator lightbox with one engraved window per alarm, e.g. turbine relay running, bypass on, cooler temperature too low, coolant fault, seal gas vibration too high).
The result in a computerised system is too many alarms, unmanageable by the operator.
(Example: a single pressure transmitter signal can be configured for 15 types of alarms, such as H, HH, HHH, L, LL, LLL, rate-of-change (ROC), discrepancy, bad value, and diagnostic alarms, using DCS/PLC/SCADA software.)
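As a rough illustration (not any particular DCS vendor's configuration syntax), the sketch below shows how a single analog point can carry many configured alarm types; the tag, limits, and range are invented for the example.

# Minimal sketch: one pressure transmitter, many configured alarm types.
# Hypothetical limits for a tag "PT-101", in bar (illustrative values only).
LIMITS = {
    "HHH": 9.0, "HH": 8.5, "H": 8.0,
    "L": 2.0, "LL": 1.5, "LLL": 1.0,
}
ROC_LIMIT = 0.5      # bar change per scan, rate-of-change alarm
RANGE = (0.0, 10.0)  # transmitter range; readings outside it raise a bad-value alarm

def evaluate_alarms(pv, previous_pv):
    """Return the list of alarm types active for one scan of the point."""
    active = []
    if not (RANGE[0] <= pv <= RANGE[1]):
        return ["BAD_VALUE"]            # out-of-range reading
    for name, limit in LIMITS.items():
        if name.startswith("H") and pv >= limit:
            active.append(name)
        if name.startswith("L") and pv <= limit:
            active.append(name)
    if abs(pv - previous_pv) >= ROC_LIMIT:
        active.append("ROC")
    return active

print(evaluate_alarms(8.7, 8.0))   # ['HH', 'H', 'ROC']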

Misapplication of Modern Technology (Rockwell Automation slide: screenshot of a computerised alarm and event list flooded with near-simultaneous entries, e.g. dust cyclone solenoid valves, coal weighfeeder zero-speed switches and flow totalizers, baghouse fans, and sump pumps, together with a PLC alarm data-type table of BOOL flags).
Who can help ? (Competent People)

People who are acknowledged experts in the alarm management field, with in-
depth understanding of the historical and current problem, the science and
literature, the studies and standards, and the range of solutions
People with in-depth knowledge of process control, distributed control systems
(DCS/PLC), human-machine interfaces (HMI), process networks, and critical
condition management
People with experience in every stage of a successful alarm system improvement
project, along with many examples of successful projects
People who understand work processes based on successful experience in
different PROCESS industry segments. You want to know what your industry is
doing, what are the best and most efficient practices, and frankly, what the worst
practices are.
The ANSI/ISA-18.2-2009 (2016)
Alarm Management Standard

Emergence of Industry Standards (timeline):
1994 - Abnormal Situation Management (ASM) Consortium formed
1999 - EEMUA 191, 1st Edition; NAMUR 102 on Alarm Management
2003 - ASM Alarm Management Guidelines
2007 - EEMUA 191, 2nd Edition
2009 - ISA-18.2 standard; 49 CFR 195.446 regulations
Drivers: ineffective alarm management (major accidents and losses, excessive numbers of alarms, lost production, alarm meaning ambiguous to the operator, operator burnout, alarms and notifications often disabled) pushed the industry toward establishing good engineering practice and understanding the costs of poor alarming.
Various phases of alarm management as per ISA Standard 18.2 (2009, revised 2016) for the process industries (chemical, fertilizer, petrochemicals, refinery, pharma, power).

The ISA-18.2 Alarm Management Lifecycle

Philosophy: document guidelines, practices, and procedures for the alarm system.
Identification: identify potential (candidate) alarms.
Rationalization: determine which alarms are necessary, establish their design settings (priority, limit) and document the basis (cause, consequence, corrective action, time to respond, etc.) in a Master Alarm Database (a sketch of such a record follows this list).
Detailed Design: design the system to meet the rationalization requirements; includes basic alarm design, HMI design, and advanced alarming design.
Implementation: the alarm system is put into operation (installation and commissioning, initial testing, and initial training).
Operation: the operator responds to alarms; shelving and suppression are used as necessary.
Maintenance: alarms are taken out of service for repair, replacement, and periodic testing.
Monitoring & Assessment: measure alarm system performance, compare it to KPIs, and identify alarm system issues.
Management of Change: review and authorization of changes to the alarm system.
Audit: periodic check to ensure the alarm system is meeting objectives and procedures are being followed.
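As a minimal sketch of what one rationalization record might capture, assuming a simple Python representation (the field names and example values are illustrative, not mandated by ISA-18.2):

# One Master Alarm Database record, holding the documented basis for an alarm.
from dataclasses import dataclass

@dataclass
class MasterAlarmRecord:
    tag: str                  # point name in the DCS
    alarm_type: str           # e.g. "PVHH"
    setpoint: float           # engineering-unit limit
    priority: str             # e.g. "P1", "P2", "P3"
    cause: str                # likely cause of the abnormal condition
    consequence: str          # what happens if the operator does not act
    corrective_action: str    # what the operator should do
    time_to_respond_min: int  # minutes available before the consequence

example = MasterAlarmRecord(
    tag="LIC301", alarm_type="PVHH", setpoint=110.0, priority="P2",
    cause="Brine feed valve stuck open",
    consequence="Tank overflow to the dike",
    corrective_action="Close the feed valve and stop the transfer pump",
    time_to_respond_min=15,
)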
Seven Steps to a Highly Effective Alarm System
Always needed steps:
Step 1: Develop, Adopt, and Maintain an Alarm Philosophy - "Here's how to do alarms right!"
Step 2: Collect Data and Benchmark Your Systems - where do we stand?
Step 3: Perform Bad Actor Alarm Resolution - nuisance alarms
Steps to implement to improve alarm system performance:
Step 4: Perform Alarm Documentation and Rationalization (D&R) - a Master Alarm Database holding setpoints, priorities, and the post-rationalised configuration
Step 5: Implement Alarm Audit and Enforcement Technology - checking actual settings against the Master Alarm Database and ensuring secured access
Step 6: Implement Real-time Alarm Management - shelving, state-based alarms, alarm flood suppression
Step 7: Control and Maintain Your Improved System - ageing effects lead to nuisance alarms
Copyrighted Materials. Copyright © 2011 ISA. Retrieved from www.knovel.com

CHAPTER 1

Alarm Management
Best Practices:
Highly Condensed

"My definition of an expert in any field is a person who knows


enough about what's really going on to be scared."
-P.J. Plauger

1.1 The Alarm Problem


A poorly functioning alarm system is often noted as a contributing fac-
tor to the seriousness of upsets, incidents, and major accidents. Signifi-
cant alarm system improvement is needed in most industries utilizing
computer-based SCADA or distributed control systems; it is a massively
common and serious problem. Most companies have become aware of
the need to thoroughly investigate and understand their alarm system
performance. Alarm management is a fast-growing, high profile topic in
the process industries. It is the subject of constant articles in the trade
journals and at various technical society meetings and symposia.

Having decided to investigate this area, how do you proceed? Your time
and resources are always limited. The subject is complex. Alarm system
improvement involves an interlinked combination of technology and
work processes.


1.2 People Who Can Help


You should seek help from the best experts in the field. You want information, advice, products, and services from:
People who are acknowledged experts in the alarm management
field, with in-depth understanding of the historical and current
problem, the science and literature, the studies and standards, and
the range of solutions
People with in-depth knowledge of process control, distributed
control systems, human-machine interfaces, process networks,
and critical condition management
People with experience in every stage of a successful alarm system im-
provement project, along with many examples of successful projects
People who understand work processes based on successful experi-
ence in different industry segments. You want to know what your
industry is doing, what are the best and most efficient practices,
and frankly, what the worst practices are.

1.3 The ANSI/ISA-18.2-2009 Alarm Management Standard


In 2003, ISA began developing a standard on alarm management. Doz-
ens of contributors (including the authors) from many industry segments
spent thousands of person-hours participating in the development. After
six years of work, the new standard "ANSI/ISA-18.2-2009 Management of
Alarm Systems for the Process Industries" is now available at www.isa.org.

The issuance of ISA-18.2 is a significant and important event for the pro-
cess industries. It sets forth the work processes for designing, implement-
ing, operating, and maintaining a modern alarm system, presented in a
lifecvcle format. This standard will definitely have a regulatory impact,
but more on that later.

This second edition contains a lengthy chapter on understanding and


implementing this standard. Readers of this book should not expect to
learn much that is basically new or different from reading ISA-18.2. Stan-
dards intentionally limit and concern themselves with what to do rather
than how to go about doing it in an effective and efficient manner. By
design, standards contain the minimum acceptable and not the optimum.
This book exists to provide detailed guidance and impart detailed knowl-
edge far exceeding the content of a standard.

There is no conflict between this book's seven step approach and the
ISA-18.2 life cycle approach-there is only some different nomenclature

and arrangement of the topics. The seven step approach is well proven
for efficiency and effectiveness.

1.4 Seven Steps to a Highly Effective Alarm System


Here is a brief outline of a best practices approach in a typical alarm man-
agement project. These straightforward steps can be easily implemented
in any work process framework, such as Six Sigma. The first three steps
are universally needed for the improvement of an alarm system. They
are often done simultaneously at the start of a project.

Always needed steps:


Step 1: Develop, Adopt, and Maintain an Alarm Philosophy
Step 2: Collect Data and Benchmark Your Systems
Step 3: Perform Bad Actor Alarm Resolution

These first three steps are placed first in the process because they collec-
tively provide the most improvement for the least expenditure of effort.
They provide the best possible start and the fundamental underpinnings
for the remainder of steps necessary for effective alarm management.

Steps to implement to improve alarm system performance:


Step 4: Perform Alarm Documentation and Rationalization (D&R)
Step 5: Implement Alarm Audit and Enforcement Technology
Step 6: Implement Real-time Alarm Management
Step 7: Control and Maintain Your Improved System

Step 1: Develop, Adopt, and Maintain an Alarm Philosophy


An Alarm Philosophy is a comprehensive guideline for the develop-
ment, implementation, and modification of alarms. The philosophy says
"Here's how to do alars right!" It provides an optimum basis for alarm
selection, priority setting, configuration, response, handling methods,
system monitoring, and many other topics. In this book, you will learn
exactly how to develop an Alarm Philosophy, complete with examples.
An Alarm Philosophy will be an immediately useful document covering
the entire range of alarm topics. It will reflect a full understanding of the
alarm problem and the proper practices to follow.

Step 2: Collect Data and Benchmark Your Systems


Analysis is fundamental to improvement. You must analyze your alarm
system to improve it. You should look for alarm analysis software with
full graphical and tabular output, easy access to the full control system
event journal entries, automatic report generation, web-based report

viewing, and so forth. You want a comprehensive and complete set of


alarm analyses to enable you to pinpoint your exact problems and apply
the most efficient solutions.

Since operator changes (e.g., controller setpoints, modes, and outputs)


are recorded by most DCSs in a similar fashion to alarm events, you will
want software that includes the analysis of such events. The results can be
amazingly useful, and can point out areas where control schemes are not
working as designed or where operating procedures or operator training
need improvement. While this book is focused on alarm management, we
include a section on the benefit of these operator change analyses.

There can be no improvement without an understanding of your start-


ing point. A comprehensive Baseline Report sets your benchmark and
will enable you to target your resources to get the most improvement
possible for the minimum cost and effort. The start of an improvement
effort requires an examination of your actual data.
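As a minimal sketch of the Step 2 arithmetic, assuming the event journal has already been reduced to a plain list of alarm timestamps (the journal data below is invented):

# Average annunciated alarm rate per hour and per 10-minute period.
from datetime import datetime

def average_rates(times):
    """times: list of datetime objects, one per annunciated alarm."""
    if len(times) < 2:
        return None
    hours = (max(times) - min(times)).total_seconds() / 3600.0
    return {
        "alarms_per_hour": len(times) / hours,
        "alarms_per_10_min": len(times) / (hours * 6.0),
    }

journal = [datetime(2021, 7, 16, 9, m) for m in range(0, 60, 5)]  # 12 alarms in ~1 h
print(average_rates(journal))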

Step 3: Perform Bad Actor Alarm Resolution


Based on the analysis of hundreds of systems, there are always several
varieties of nuisance or Bad Actor alarms. This book contains an efficient
and effective process for analyzing these and provides exact recommen-
dations for configuration changes to improve their performance. The
average improvement is over a 50% reduction in overall alarm events
for a relatively minimal effort. While on some systems this result may
not meet an overall improvement goal, it is a great first step, providing
much-needed immediate relief. It also establishes the credibility of the
alarm management effort with an immediate early success.

These first three steps are universally needed for the improvement of an
alarm system. The following steps generally involve more time, resourc-
es, and expense. Some of them may or may not be needed depending on
the performance characteristics of your system.

Step 4: Perform Alarm Documentation and Rationalization (D&R)


Many existing systems need a total rework-a review of the configuration
and purpose of every alarm. We call this Alarm Documentation and Ratio-
nalization (D&R), also commonly called Alarm Objective Analysis, among
other terms. You will want to use a software-assisted methodology to make
D&R fast and efficient. Besides just having software, there is an art to per-
forming a D&R in an efficient manner. The knowledge herein is based upon
participation in the rationalization of hundreds of thousands of points. This
experience provides detailed knowledge of the common problems and the

best solutions, which are provided here in this book. One result of a D&R
effort is the creation of a Master Alarm Database, which contains the post-
rationalized alarm configuration with changed setpoints, priorities, and so
forth. A Master Alarm Database has several uses.

Step 5: Implement Alarm Audit and Enforcement Technology


Once your alarm system is improved, it is essential to ensure the con-
figuration does not change over time unless the changes are specifically
authorized. DCS and SCADA systems are notoriously easy to change,
which is why software mechanisms that frequently audit (and enforce)
the current configuration versus the Master Alarm Database are needed.
Paper-based Management of Change solutions for DCS configuration
(alarm or otherwise) have a wide and consistent history of failure.
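As a minimal sketch of this audit-and-enforce idea, assuming both the Master Alarm Database and the running configuration can be read into simple Python dictionaries (the tags and values are illustrative):

# Compare live alarm settings against the Master Alarm Database and report drift.
def audit(master, live):
    """Both arguments map (tag, alarm_type) -> {"setpoint": ..., "priority": ...}."""
    findings = []
    for key, approved in master.items():
        actual = live.get(key)
        if actual is None:
            findings.append((key, "alarm missing from running configuration"))
        elif actual != approved:
            findings.append((key, f"approved {approved} but found {actual}"))
    for key in live.keys() - master.keys():
        findings.append((key, "unauthorized alarm not in Master Alarm Database"))
    return findings

master = {("LIC301", "PVHH"): {"setpoint": 110.0, "priority": "P2"}}
live   = {("LIC301", "PVHH"): {"setpoint": 120.0, "priority": "P2"}}
print(audit(master, live))  # reports the changed setpoint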

Step 6: Implement Real Time Alarm Management


Based on the performance you need your alarm system to achieve and
the nature of your process, you may want to implement more advanced
alarm handling solutions, such as the following:
Alarm Shelving: A safe, secure way to temporarily disable a nuisance
alarm until the underlying problem can be corrected. Most control
systems have inadequate mechanisms to properly control temporary
alarm suppression. Computerized lists of shelved alarms, with time
limits, reminders, and auto-re-enabling are necessary (a sketch of such a
mechanism follows this list). It must be impossible to temporarily suppress
an alarm and then forget about it, a very common and dangerous occurrence
throughout industry.
State-based Alarming and Alarm Flood Suppression: Algorithms
detect when the plant changes operating state (such as startup, shut-
down, different products, rates, feedstocks, etc.) and dynamically al-
ter the alarm settings to conform to the proper settings for each state.
State-based settings for inadvertent shutdown of a piece of equipment
have proven to be effective in managing most alarm flood situations.
Operator Alert Systems: Once the alarm system has been properly
reserved for things meeting the requirements of what should actually be
an alarm, there may remain a need for an operator-configurable noti-
fication tool explicitly separate from the alarm system. Such operator
alert systems are a best practice and are described later in this book.
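As a minimal sketch of the shelving bookkeeping referred to above, assuming a simple in-memory store; a real implementation would live in the control system or an add-on package:

# Every shelved alarm carries an expiry time; a periodic check re-enables anything
# whose shelving period has lapsed, so a suppressed alarm cannot be forgotten.
from datetime import datetime, timedelta

shelved = {}  # tag -> (reason, expiry time)

def shelve(tag, reason, hours=2.0):
    shelved[tag] = (reason, datetime.now() + timedelta(hours=hours))

def is_suppressed(tag):
    return tag in shelved

def unshelve_expired():
    """Call periodically; returns the tags whose shelving just expired."""
    now = datetime.now()
    expired = [tag for tag, (_, until) in shelved.items() if until <= now]
    for tag in expired:
        del shelved[tag]          # the alarm annunciates normally again
    return expired

shelve("PT101_HH", reason="chattering, maintenance planned", hours=2.0)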

Step 7: Control and Maintain Your Improved System


Processes and sensors change over time, and alarm behavior will change
with them. Alarms working correctly now may become nuisances or
malfunction in the future. Effective management of change methodolo-

gies, and an ongoing program of system analysis and correction of prob-


lems as they occur, is needed for an effective alarm system.

1.5 Summary
If you know or suspect you have an alarm problem, read this book and
begin doing the things it recommends.
16-07-2021, 09:00 to 11:00
Operator Response Time Cycle
- What makes an alarm a problem?
- When is an alarm system considered "overloaded" as per the standard?
- Definition of "alarm flood" and its permitted range as per the standard
- Definition of "nuisance alarm"
- Stale alarms and their recommended range as per the ISA-18.2 standard
The operator needs to detect easily (Detect), think about the reasons why (Diagnose), and take an action (Respond).

What makes a successful operator response? (exida slide: Sensor, Logic Solver, Final Element; the operator must DETECT, DIAGNOSE, and RESPOND within the operator response time.)
On average, how many alarms can an operator respond to in a 10-minute period?
What makes alarms a problem?
Common Alarm Management "Villains" (exida slides by Todd Stauffer):
- Alarm overload (steady state)
- Alarm floods (after an upset)
- Nuisance alarms: chattering alarms, standing/stale alarms
- Bad actors / frequently occurring alarms
- Redundant alarms
- Alarms which have no response
- Alarm priority that is not meaningful
The presence of these "villains" diminishes the usefulness of the alarm system and "robs" the operator of an important tool.
When is an alarm system considered overloaded? What is the recommended practice per ISA-18.2 for alarms per hour?
Alarm Overload (Too Many Alarms):
- Very likely to be acceptable: about 6 alarms per hour (average), i.e. about 1 alarm per 10 minutes
- Maximum manageable: about 12 alarms per hour (average), i.e. about 2 alarms per 10 minutes
When does an overload or alarm flood occur? Can a flood be contained?
What is the permitted range of alarm floods, in percentage terms, as per the standard?
Alarm Flood (Alarm Shower)
Definition: a condition during which the alarm rate is greater than the operator can effectively manage (e.g., more than 10 alarms per 10 minutes).
- Usually triggered by a single event (plant upset)
- The most complex alarm problem to solve
Results:
- Increased likelihood of the operator missing an alarm
- Potential to overwhelm the operator
- Occurs during the most critical time for the operator
Metrics (with target values per ISA-18.2): percentage of 10-minute periods containing more than 10 alarms; maximum number of alarms in a 10-minute period; percentage of time the alarm system is in a flood condition.
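As a minimal sketch of how the flood metrics above could be computed from a list of alarm timestamps (using the 10-alarms-per-10-minutes threshold from the definition):

# Bucket alarms into 10-minute periods, then report the share of periods in flood
# and the worst period seen.
from collections import Counter
from datetime import timedelta

def flood_metrics(times, threshold=10):
    """times: non-empty list of datetime objects, one per annunciated alarm."""
    buckets = Counter(t.replace(minute=t.minute - t.minute % 10,
                                second=0, microsecond=0) for t in times)
    first, last = min(buckets), max(buckets)
    total = int((last - first) / timedelta(minutes=10)) + 1  # all periods in the span
    flooded = sum(1 for n in buckets.values() if n > threshold)
    return {
        "pct_periods_in_flood": 100.0 * flooded / total,
        "max_alarms_in_10_min": max(buckets.values()),
    }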
"The Boy Who cried Wolf "- an universally known story
What is moral of story w.r.to Nuisance Alarm bells?
Definition of "Nuisance Alarm"

Nuisance Alarm
An alarm that annunciates excessively, unnecessarily, or does not return to normal after the correct response is taken (e.g., chattering, fleeting, or stale alarms).
Results:
- Desensitizes the operator
- Distracts the operator
- Leads to missed alarms
- Increases operator stress
Ref: ISA-18.2, IEC 62682
What happens if a level high alarm (LAH) is set at 70% and the level repeatedly moves down and up between 69.9% and 70.1%?
The ISA-18.2 standard recommends that the quantity of chattering alarms be zero.
Chattering Alarm
An alarm that repeatedly transitions between the alarm state and the normal state in a short period of time (e.g., 3x per minute).
- Clears quickly whether the operator responds or not
- May make up a large percentage of a system's total alarm loading
- A single chattering alarm may produce thousands of events
- The most common nuisance alarm
Metric: quantity of chattering and fleeting alarms. Target value: zero, with action plans to correct any that occur. (Ref: ANSI/ISA-18.2)
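As a minimal sketch of a chattering test along the lines of the definition above (three or more activations of the same alarm within a minute); the window and count are adjustable assumptions:

# Detect chattering: repeated activations of one alarm within a short window.
from datetime import timedelta

def is_chattering(activations, window=timedelta(minutes=1), count=3):
    """activations: sorted list of datetimes when the alarm became active."""
    for i in range(len(activations) - count + 1):
        if activations[i + count - 1] - activations[i] <= window:
            return True
    return False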
If a temperature high alarm (TAH) on cooling tower water is set at 33 degrees Centigrade, it will stand as a STALE alarm throughout the summer months (about 4 months), since the summer temperature is above 35 degrees Centigrade.
Standard ISA-18.2 recommends stale alarms be less than 5% on any day (fewer than 15 alarms out of 300 alarms per day on the alarm summary).
Stale Alarm
An alarm that remains in the alarm state for an extended period of time (e.g., 24 hours).
- No operator action is required, or the alarm doesn't clear after the operator action has been taken
- Often occurs when the alarm is not relevant in the current operating state (not an abnormal condition)
Results:
- Distracts the operator by filling up the alarm summary screen
- Interferes with detecting new alarms
Metric: stale alarms. Target value: fewer than 5 present on any day, with action plans to address. (Ref: ANSI/ISA-18.2)
Definition of bad actors (in terms of alarms)
The 80/20 rule for alarm generation
What is the recommended range permitted for "bad actors"?

Bad Actors (aka Frequently Occurring Alarms)
Definition: a tag that produces a large number of alarm events.
- 80/20 rule: "10-20 tags produce 80% of the alarms"
- Often caused by system / instrument diagnostic alarms
Metric: percentage contribution of the top 10 most frequent alarms to the overall alarm load. Target value: 1% to 5% maximum, with action plans to address deficiencies.
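As a minimal sketch of the bad-actor metric, computed from a list of alarm events where each entry is simply the tag that alarmed:

# Share of the total alarm load contributed by the ten most frequent tags
# (target roughly 1% to 5%).
from collections import Counter

def top10_contribution(alarm_tags):
    """alarm_tags: list with one entry (the tag name) per alarm event."""
    counts = Counter(alarm_tags)
    top10 = sum(n for _, n in counts.most_common(10))
    return 100.0 * top10 / len(alarm_tags)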
Alarm priority: be selective in prioritisation.
Target distribution: 5% High priority, 15% Medium priority, 80% Low priority.
High / Medium / Low corresponds to a required response within 5 min. / 15 min. / 30 min.

Incorrect Alarm Priority
Priority is a measure of criticality and urgency. Alarms with the wrong priority, or too many high priority alarms:
- Reduce operator confidence in the system
- Make it difficult to gauge the urgency of response, through inconsistent use of priority
(Figure: alarm priority distribution pie chart - High 5%, Medium 15%, Low 80%.)
Metric: annunciated priority distribution. Target value: with 3 priorities, ~80% low, ~15% medium, ~5% high; with 4 priorities, ~80% low, ~15% medium, ~5% high, <1% highest (other special-purpose priorities are excluded from the calculation).
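As a minimal sketch of the annunciated priority distribution KPI against the 80/15/5 guideline, assuming diagnostic or other special-purpose priorities are tagged so they can be excluded:

# Percentage of annunciated alarms per priority, excluding special-purpose priorities.
from collections import Counter

def priority_distribution(priorities, exclude=("DIAG",)):
    kept = [p for p in priorities if p not in exclude]
    counts = Counter(kept)
    return {p: 100.0 * n / len(kept) for p, n in counts.items()}

print(priority_distribution(["LOW"] * 78 + ["MEDIUM"] * 16 + ["HIGH"] * 6))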
17-07-2021, 09:00 to 11:00
Common DCS and Alarm Displays
Modern large screens in the control room in addition to table-top computers
Alarm Management Life Cycle
Key Performance Indicators
Six DCS screenshots (operation and engineering configuration)
Actual Alarm Summary Display, explained for the topmost alarm dated 14th March 2013:
15:15:10 | LIC 301 | PVHH | 110 (cm) | 56 TPD Polish Brine Tank Level | 110.505 (cm)
(Time | Tag no. | Alarm type | HH set point | Description of equipment | Actual value)
A PLC manufacturer's claimed features: "More Than a DCS!"
Rockwell Automation PlantPAx: "It's a process automation system that gives you everything you want in a world-class, contemporary DCS, plus:
- Plant-wide control capabilities
- Open, flexible architecture
- Integrated control, power, safety, and information
- Support by a global network of local experts"
The authors of the book, Bill Hollifield and Eddie Habibi, show an overview of a modern control room on the cover page.
Expanded view of the computer screen onto a large screen in front of the operator in the control room:
- Group trends
- Group display of controllers; individual controller; operator graphic for a pump system
- A group display of six controllers with set point (SP) and process value (PV) bar graphs
Alarm Management Lifecycle (Rockwell Automation slide)
Identify, Rationalize, and Design (Philosophy): What should alarm? When? To whom should it alarm? How are they notified? How is the operator to respond? How should the alarm be configured?
Operate and Maintain: (example table relating a potential scrubber chemical breakthrough and a failed scrubber instrument to operator verification, manual readings, and initiating repair of the instrument)
Monitor and Assess:
- Alarm system configuration as intended
- All alarms in service or with action plans for repair
- Most frequent alarms / systemic issues addressed
- Rate of alarms appropriate for the operator
How good is your alarm management, and how is its performance measured?
Key Performance Indicators (8 KPIs) - useful alarm system KPIs:
- Top 10 most frequently occurring alarms
- Number of alarm peaks per time period (alarm floods)
- Number of long-standing / stale alarms
- Priority distribution of alarms
- Unauthorized alarm property changes
- Number of alarms per operating position
- Chattering alarms
- Suppressed alarms outside of approved methodologies
Target benchmark numbers for the 8 KPIs, and some recommended benchmarks (measured over 30 days):
- Alarm rate: 6-12 alarms per hour per operating position
- Alarm floods (periods with more than 10 alarms per 10 minutes): less than about 1%
- Contribution of the most frequent alarms: less than about 5% of the total
- Stale alarms: fewer than 5, with plans to address
- Priority distribution: 80% Low, 15% Medium, 5% High
In work done in 1994/96 by GACL/UPL, about 15 alarms/hour was accepted, versus the 12 alarms/hour ISA benchmark of 2009.
Mobile HMI and Alarm Displays on DCS Screens: screenshots of InTouch Access Anywhere (dated 14 Mar '14) showing alarm entries such as 56 TPD Polish Brine Level (110.585), Clarified Brine Level, bleach storage pressure, and stopped transfer pumps.
PlantPAx Library Alarms and FactoryTalk Alarms and Events (Rockwell Automation / Allen-Bradley PLC with FactoryTalk software; also Honeywell): screenshots of the Event and Alarm Summary, the Alarm Banner, and HMI-on-mobile displays, annotated with the ISA-18.2 HMI requirements (clauses 11.2.1, 11.2.3, and 11.3.1), which are quoted in the slides that follow.


About 15 alarms/hour was found workable, versus the 12 alarms/hour ISA benchmark (2009), in the work done in 1994/96 by GACL/UPL.
(Screenshot of an actual alarm summary dated 14 Mar '14, listing entries such as 56 TPD Polish Brine Level 110.585 high-high, cooling water supply temperature high, storage-5 pressure, HCl head tank level, clarified brine level, and stopped transfer pumps.)
PN Parikh, "Achievement & Way Forward", 25th March '14.
PlantPAx Library Alarms (Rockwell Automation)
From ISA-18.2, 11.2.1 HMI Information Requirements - the interface shall clearly indicate:
- Tag in alarm
- Alarm states
- Alarm priorities
- Alarm types
Example faceplate annotation ("HH 94.43 GPM"): Tag in Alarm - description of the object in alarm; Alarm Priority - symbol and color indicate High priority; Alarm State - solid color, not flashing = acknowledged; Alarm Type - HH = High High.
From ISA-18.2, 11.3.1 Required Alarm State Indications - a combination of visual indications, audible indications, or both, shall be used to distinguish the following alarm states:
a) Normal
b) Unacknowledged
c) Acknowledged

FactoryTalk Alarms and Events (Rockwell Automation) - Alarm Banner
- Shows up to the 5 most current, highest priority alarms
- The new FactoryTalk View docking feature allows it to be stationed as a permanent fixture on the HMI client
- Launch the Alarm Summary directly from the bottom of the Banner for more details
- Docked in the client window to always appear at the top or bottom of any graphic or screen
FactoryTalk Alarms and Events (Rockwell Automation) - Alarm Summary
- Provides all the details (acknowledge, print, select, filter)
- No HMI effort required, configuration only
(Screenshot of the Alarm Summary: rows with condition, message, event time, and alarm class, plus counters for events, in-alarm unacknowledged, in-alarm acknowledged, normal unacknowledged, and faults; list sorted by event time.)


PlantPAx Library Alarms (Rockwell Automation)
From ISA-18.2, 11.2.3 HMI Display Requirements - the interface shall provide capability for the following:
a) At least one alarm summary display
b) Alarm indications on process displays (e.g., "HH 94.43 GPM")
c) Alarm indications on tag detail displays
(Screenshot of a product transfer pump tag detail display showing interlock trip and level low-low alarm entries with timestamps.)
Alarm configuration for eliminating "Chattering"
Dead Band ( Percentage %) & Time Delay (Seconds)

alm 2

95tu
1
PN Parikh 25 th March 14 37
way Forward:Alarm to be set for Dead Band of Time & Alarm value at DVACL
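As a minimal sketch of the two remedies named above, a deadband and a time delay (on-delay), applied to a high alarm; the setpoint, deadband, and delay values below are illustrative only, not DVACL settings:

# High alarm with a deadband (clears only below setpoint minus deadband) and an
# on-delay (the PV must stay above the setpoint for several scans before annunciating).
class HighAlarm:
    def __init__(self, setpoint, deadband, delay_scans):
        self.sp, self.db, self.delay = setpoint, deadband, delay_scans
        self.active = False
        self._over = 0  # consecutive scans with PV above the setpoint

    def update(self, pv):
        if pv >= self.sp:
            self._over += 1
            if not self.active and self._over >= self.delay:
                self.active = True          # annunciate only after the on-delay
        else:
            self._over = 0
            if self.active and pv < self.sp - self.db:
                self.active = False         # clear only once past the deadband
        return self.active

lah = HighAlarm(setpoint=70.0, deadband=2.0, delay_scans=3)  # illustrative values
for pv in [69.9, 70.1, 69.9, 70.1, 70.2, 70.3, 70.1, 69.5, 67.9]:
    lah.update(pv)   # chatter around 70% no longer toggles the alarm on every scan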
Alarm Shelving
Reason for shelving: nuisance alarm. Shelving period: 2 hours.
Way forward: alarm shelving at DVACL. (PN Parikh, 26th March '14.)
Copyrighted Materials. Copyright © 2011 ISA. Retrieved from www.knovel.com

CHAPTER 4

Common DCS and


SCADA Alarm
Display Capabilities
-and Their Misuse
"In a way, staring into a computer screen is like staring into an eclipse.
It's brilliant and you don't realize the damage until it's too late."
-Bruce Sterling

There are typically three methods by which alarms are displayed to a DCS
or SCADA console operator. (The term "DCS" is used to include SCADA
systems, since their alarm-related functionality is essentially identical.)
These methods are:
The alarm display functionality provided by the DCS manufacturer
Custom graphics created by the owning company
External lightbox annunciators added to the DCS

These capabilities will be individually discussed.

4.1 DCS and SCADA Alarm System Capabilities


External sensors connected to a DCS are represented as points. Points
of different types can have various built-in or custom alarm functions.
The most common example is of an analog signal, such as a pressure.
The standard analog point type will generally provide the capability for
several alarms to be configured on the single pressure reading, with little
or no effort by the engineer-simply fill in the blanks and the alarm
is turned on. There are typically alarms for pressure low or high, pres-
sure low-low or high-high, pressure rate-of-change low or high, pressure
reading out of range or bad value, and so forth. Digital input signals

from switches, and other more complex point types have many addi-
tional alarm types and choices. Logic points can be constructed to create
special-purpose alarms under a variety of Boolean conditions. Program
code can be written to create quite complex alarms.

Alarms are assigned an attribute of priority. The priority of an alarm determines various on-screen alarm depiction behaviors, such as color, sound,
or symbology. Most systems have at least three available priorities; many
have dozens. This doesn't mean using dozens of priorities is a good idea
though. In general, the use of many of the possible alarm-related func-
tions supplied by a control system manufacturer is often a bad idea!

When alarms occur, their status is depicted on the control system screens.
New alarms can be acknowledged by the operator, which generally alters
their appearance in some way. When the alarm condition is no longer
in effect, the alarm clears and either automatically disappears from the
displays or can be manually dismissed by the operator. Time-stamped
electronic records of new alarms, alarm acknowledgement, and alarm
clearing are automatically created and saved.

There are usually means by which an alarm can be temporarily suppressed, some with better control than others.

This is by no means an exhaustive list of alarm system capabilities; the


reader is assumed to be familiar with these basics. Different vendors ac-
complish these basic functions in different ways and with different capa-
bilities and restrictions, but the general functionality is as stated.

4.2 The Alarm Display


All DCSs come from the manufacturer with an Alarm Display. This is
not a dedicated piece of hardware; it is a pre-configured graphic basically
showing a scrolling list or multiple pages of alarms. Often, the operator
selects one physical screen (CRT or LCD) from the several they have
available, and keeps this display up most or all of the time. The usual
capabilities of these displays include:
Sorting by alarm priority
Sorting by chronological order
Sorting by predetermined process area
Color coding by priority
Ability to temporarily freeze the display list during periods of high
alarm actuation

Ability to temporarily silence the alarm horn based on alarm priority


Color and alarm symbology choices
Displaying the measurement and the alarm setpoint violated.
Many will have a portion of this feature, and the best will have a
live updating of the measurement value.
Guiding the operator in responding to the alarm, by linking the alarm
to the display used to control the measurement or system in alarm
Other capabilities may exist, depending on the manufacturer. Most are
quite similar, but there are important differences. From an alarm man-
agement point of view, the important message is to thoroughly under
stand every option regarding this display the DCS manufacturer gives
you. Make thoughtful decisions about these options as you select them;
do not just use the default settings. Alas, this will involve the dreaded
task of reading the system documentation.
If you are purchasing a new control system, be sure to make needed
alarm display capabilities part of your specification. We still see many
alarm system design omissions from the DCS manufacturers. Proper de-
sign should include the following elements:
Priority systems allowing independent priority settings for each alarm
Alarm summaries that update the alarm list or measurement values
dynamically
Ability to temporarily suppress the alarm sound for some priorities
Navigation ability to go, in one click, from an alarm on the dis-
play to the proper graphic for diagnosing the relevant situation
Temporary alarm scroll freezing to aid readability
The delivered systems will not improve without pressure from potential
buyers. It is a best practice that the Alarm Display screen be configured to
show alarms first sorted by priority (highest priority at the top or earliest
page), then reverse-chronologically (most recent at the top) within each
priority section.
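As a minimal sketch of that sorting rule (highest priority first, most recent at the top within each priority), assuming each alarm record carries a priority and a timestamp:

# Sort an alarm list for display: by priority, then reverse-chronologically.
PRIORITY_ORDER = {"CRITICAL": 0, "P1": 1, "P2": 2, "P3": 3}

def sort_for_display(alarms):
    """alarms: list of dicts with 'priority' and 'time' (a datetime) keys."""
    return sorted(alarms, key=lambda a: (PRIORITY_ORDER.get(a["priority"], 99),
                                         -a["time"].timestamp()))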
4.3 Custom Graphics-Alarm-Related Guidelines
DCS operating graphic displays should act to always effectively help the op-
erator control the process in the best possible way. Custom graphics are the
most common method for conveying process information on a DCS. The
proper design of such graphics has become a book all by itself! We will summarize
here just a few basic alarm-related principles. See Appendix 4 on High Performance
HMI for a more lengthy discussion on effective operator graphics.

Keystrokes: The DCS operator interface system should be de-


signed to minimize the number of keystrokes required to identify,
verify, and assess an alarm. The system and graphics should be
configured so it is never necessary for the operator to type in a
point name or graphic name.
Associated Graphic: Every point with a configured alarm should
have an associated graphic display on the DCS. This associated
display should aid the operator in the proper diagnosis and miti-
gation of the event causing the alarm. Methods by which the op-
erator is quickly directed with a single keystroke or button-click
(i.e., one-touch access) to the associated display should be used.
Many DCSs have this capability, but it must be configured.
Inherited Alarm Behavior: Graphics should not be hard-coded
with alarm behavior for points; the behavior should be consistent
based on the configuration of a point's alarm and should change
if the configuration changes. For example, if the practice is that a
certain type of alarm indicator is displayed based on the alarm's
priority, the graphic should detect the priority currently in effect
on the alarm and display the correct indicator.
Alarm Status Indication: A process graphic should visually and
consistently highlight points in alarm, whether or not the alarm is
acknowledged, and the priority of the alarm. Alarms should always
be the most prominent information or object on the display.
Colors: Alarm colors are used ONLY to depict alarm-related func-
tionality and not for anything else. If yellow is an alarm color,
then yellow is never used as a text label, line color, border, or any
other non-alarm-related element.
"Fat Finger" Contingencies: Techniques should be used to mini-
mize the possibility of operator mistakes, and provide validation
and security measures. For example, a graphic element pushbut
ton that initiates an infrequent shutdown action should also re-
quire a step of confirmation of operator intention. Major process
upsets have occurred by mistyping an input, for example, open-
ing a slide valve to 47% instead of 4.7%. DCSs using membrane
keyboards are particularly susceptible to this type of error.
Single Alarm Interface: A single alarm interface should be used,
namely that of the DCS. If alarms can come from sources nomi-
nally outside of the DCS, those should be brought into the DCS
if the DCS is used in any way to respond to the alarm. All alarms
should be acknowledged only once; it should never be required to
acknowledge the same alarm in more than one place.

4.4 The Nature of Alarm Priority


Alarm priority is a means to convey the seriousness of a specific process
condition to the operator and drives the operator's responses. For higher
priorities to be effective, they should be small in number compared to
the lowest priority in order to give them proper significance. The prior-
ity of an alarm is solely to act to help the operator differentiate alarm
importance. It is a human-interface factor.

Annunciated alarms are those communicated to the operator through an


operator display and generally an audible notification. DCSs generally al-
low for multiple alarm priorities to distinguish alarms, as well as a separate
alarm priority assignment for each alarmable parameter of a point.

The best practice principles of alarm management require every indi-


vidual alarm to be assigned a priority using a logical and consistent ap-
proach. It is important for the DCS to present alarms to the operator with
a priority that has a consistent meaning. This means separate alarms on
the same point should often have different priorities.

The best practice is to use three primary levels of annunciated DCS alarm
priority. Your DCS may allow many more than that. Do not succumb
to the temptation of using them! Humans are wonderfully able to put
things in three categories and to understand items in three categories.
Four or five categories are about the maximum; more than that will get
cognitively blurred together and become confusing rather than helpful.
(Quick! What is the difference between Priority 17 and Priority 18?)

Alarm systems from different DCSs may have differing nomenclature


for priority levels. In this book, the levels of alarm priority will be desig-
nated as:
Critical (rarely used in practice)
Priority 1 (P1)-normally the highest DCS alarm priority
Priority 2 (P2)-the second highest DCS alarm priority
Priority 3 (P3)-the third highest DCS alarm priority
Priority 4 (P4)-used for diagnostic-type alarms
The vast majority of alarms will be assigned to the P1, P2, and P3 priori-
ties, via the principles contained in the Alarm Documentation and Ratio-
nalization section. Critical alarms and diagnostic alarms are thoroughly
discussed in the Rationalization and Philosophy chapters, respectively.
Copyrighted Materials. Copyright © 2011 ISA. Retrieved from www.knovel.com

CHAPTER 12

Understanding and
Applying ISA-18.2:
Management of Alarms for
the Process Industries

"Laws are like sausages. It is better not to see them being made."
-Otto von Bismarck (1815-1898)

Ditto for standards! Over the last several years, alarm management has
become a highly important topic, and the subject of a number of ar-
ticles, technical symposia, and books. In response to this, in 2003 ISA be-
gan developing an alarm management standard. Dozens of contributors,
from a variety of industry segments, spent thousands of person-hours
participating in the development. The authors of this book participated
in the roles of section editor and voting member. After six years of work,
the new standard, ANSI/ISA-18.2-2009, Management of Alarm Systems for
the Process Industries, is now available at www.isa.org.

The issuance of ISA-18.2 is a significant and important event for the


chemical, petrochemical, refining, power generation, pipeline, mining
and metals, pharmaceutical, and similar industries using modern con-
trol systems with alarm functionality. It sets forth the work processes for
designing, implementing, operating, and maintaining a modern alarm
system presented in a life cycle format. As previously discussed, it will
have regulatory impact.

ISA-18.2 is quite different from the usual ISA standard. It is not about
specifying how some sort of hardware talks to other hardware, or the
detailed design of control components. It is about work processes of
people. Alarm management isn't really about hardware or software; it's
about work processes. (Poorly performing alarm systems do not create

themselves!) ISA-18.2 is a consensus standard developed per stringent


methods based on openness, balancing of interests, due process, and
consensus. These make it a "recognized and generally accepted good en
gineering practice" from the regulatory point of view.

In this section, we will review the most important aspects about the scope,
requirements, recommendations, and other contents of ISA-18.2. But,
there is no substitute for obtaining and understanding the full document!
Additionally, you will learn a bit about what it is like to be on a standards
committee, and why a standard requires several years to develop.

12.1 Purpose and Scope


The basic intent of ISA-18.2 is to improve safety. Ineffective alarm sys-
tems have often been documented as being contributing factors to ma-
jor process accidents.

There are several common misconceptions about standards. Standards in-


tentionally describe the minimum acceptable, and not the optimum. By
design, they focus on what to do rather than how to do it. Also by design,
standards do not have detailed or specific "how-to" guidance. ISA-18.2
does not contain examples of specific proven methodologies or of de-
tailed practices. The standard focuses on both work process requirements
("shalls") and recommendations ("shoulds") for effective alarm manage-
ment. Being a consensus document, many aspects of standards appear to
the knowledgeable reader to often be somewhat watered down.

Readers familiar with alarm management literature (such as this book)


should not expect to learn much new or different information from read-
ing ISA-18.2. The most significant difference is that ISA-18.2 is a stan-
dard, not a guideline or a recommended practice or a book, and it was
developed in accordance with stringent ANSI methodologies. As such, it
will be regarded as "recognized and generally accepted good engineering
practice" (RAGAGEP) by regulatory agencies. The regulatory aspect of
this was covered in Chapter 2.

In early 2010, the ISA-18.2 committee began working on creating ad-


ditional explanatory and methodology information in six follow-on ISA
technical reports. The contents of these reports will address some of the
same topics covered in this book.

12.2 Does ISA-18.2 Apply to You?


The focus of ISA-18.2 is on alarm systems that are part of modern control
systems, such as DCSs, SCADA systems, PLCs, and safety systems; basically

anything in which you have an operator that responds to alarms de-


picted on a computer-type screen and/or an annunciator. This includes
the bulk of all processes operating today, specifically:
Petrochemical
Chemical
Refining
Platform
Pipelines
Power Plants
Pharmaceuticals
Mining & Metals

Additionally, it applies whether your process is continuous, batch, semi-


batch, or discrete. The reason for this commonality is that alarm response
is really not a function of the specific process being controlled; it is a
human-machine interaction. The steps for detecting an alarm, analyzing
the situation, and making a response are human steps in which there is
little difference if you are making (or moving) gasoline, plastics, mega-
watts, or aspirin. While many industries feel "We're Different!" that is
simply not the case when it comes to alarm response. Many different
industries participated in the development of ISA-18.2, recognized this
similarity, and the resulting standard has overlapping applicability.
ISA-18.2 does indicate the boundaries of the alarm system relative to terms
used in other standards, such as BPCS (Basic Process Control System), SIS
(Safety Instrumented System), and so forth. Several exclusions are listed so
as to not contradict or overlap content existing in other standards.

12.3 Definitions in ISA-18.2


A lot of work was done in researching various definitions and ensuring
consistency between ISA-18.2 and other references. The definitions were
carefully crafted.
An alarm is defined as "an audible and/or visible means of indicating to the
operator an equipment malfunction, process deviation, or abnormal condition
requiring a response,"

12.4 Alarm State Transitions


ISA-18.2 includes a moderately complex diagram depicting the alarm states
and sub-states of Normal, Unacknowledged, Acknowledged, Returned-to-
Normal, and Latched. Of particular interest are the states of Shelved, Sup-
pressed by Design, and Out of Service. These have specific meanings:

Shelved: An alarm that is temporarily suppressed, usually via a


manual initiation by the operator, using a method meeting a va-
riety of administrative requirements to ensure that the shelved
status is known and tracked.
Suppressed by Design: An alarm intentionally suppressed due to
a designed condition. This is a generic description that includes
things as simple as logic-based alarms up to advanced state-based
alarming techniques.
Out of Service: An alarm that is not functioning, usually for rea-
sons associated with the Maintenance stage of the life cycle. An
Out of Service alarm is also tracked via similar administrative re-
quirements to a shelved alarm.

The terms "suppress" and "alarm suppression" are intentionally generic


and non-DCS-specific. They are used to indicate that the alarm function-
ality is not working (generally through an override mechanism of some
sort). It is possible, and unfortunately common, to suppress an alarm
outside of the proper work practices, and the detection of such undesirable situations is part of an ISA-18.2 requirement.
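As a minimal sketch (not the full ISA-18.2 state diagram), the states named above can be listed together with the two administrative transitions just discussed, shelving and return to service:

# Alarm states named in ISA-18.2 and two simple administrative transitions.
from enum import Enum, auto

class AlarmState(Enum):
    NORMAL = auto()
    UNACKNOWLEDGED = auto()
    ACKNOWLEDGED = auto()
    RETURNED_TO_NORMAL = auto()
    LATCHED = auto()
    SHELVED = auto()
    SUPPRESSED_BY_DESIGN = auto()
    OUT_OF_SERVICE = auto()

def shelve(state):
    """Operator-initiated temporary suppression; must be tracked and time-limited."""
    return AlarmState.SHELVED

def return_to_service(state):
    """Maintenance complete or shelving expired: the alarm resumes normal behaviour."""
    if state in (AlarmState.OUT_OF_SERVICE, AlarmState.SHELVED):
        return AlarmState.NORMAL
    return state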

Figure 12-1: The ISA-18.2 Life Cycle Diagram (boxes for Philosophy, Identification, Rationalization, Detailed Design, Implementation, Operation, Maintenance, Monitoring & Assessment, Management of Change, and Audit).



12.5 The Alarm Management Life Cycle


ISA-18.2 is written with a life cycle structure containing ten stages. They are:
A. Alarm Philosophy: Documents the objectives of the alarm sys-
tem and the work processes to meet those objectives.
B. Identification: Work processes that determine which alarms are
necessary.
C. Rationalization: The process of ensuring an alarm meets the re-
quirements set forth in the alarm philosophy, including the tasks
of prioritization, classification, settings determination, and docu-
mentation.
D.Detailed Design: The process of designing the aspects of the alarm
to meet the requirements determined in rationalization and in the
philosophy. This includes some HMI depiction decisions and can
include the use of special or advanced techniques.
E. Implementation: The alarm design is brought into operational status.
This may involve commissioning, testing, and training activities.
F. Operation: The alarm is functional. This stage includes refresher
training if required.
G.Maintenance: The alarm is non-functional, due to either test or
repair activities. (Do not equate this life cycle stage with the Main-
tenance Department or function!)
H.Monitoring and Assessment: The alarm system's performance
is continuously monitored and reported against the goals in the
alarm philosophy.
I. Management of Change: Changes to the alarm system follow a
defined process.
J. Audit: Periodic reviews are conducted to maintain the integrity of
the alarm system and alarm management work processes.
The structure is shown in a typical diagram involving boxes, lines, and
arrows. Literally months were spent in committee discussions about
things like arrow direction and dotted vs. solid lines. More months were
spent in similar discussions as to such minutiae as, "Can you enter the
life cycle at more than one place?" This is a major reason why ISA-18.2
took over six years to develop, which is by no means an unusual dura-
tion. (See the opening quote to this chapter!)

12.6 Life Cycle Stages vs. Activities


Do not confuse a life cycle stage with an activity. The life cycle is a struc-
ture for the content of the ISA-18.2 document. It is not specifically or
necessarily a list of activities to be accomplished in a particular order.

This sounds confusing and is one aspect of "standard-speak"-the use of


common words in standards to mean things that are different than their
normal usage! Misinterpretations and misunderstandings result when
people read standards without understanding this phenomenon. It is es-
sential to carefully read the exact definitions of terms used in a standard,
and think of those terms only in the context of the specific definitions
rather than their common usage.

For example, in a few minutes an engineer could sit down and resolve a
single nuisance chattering alarm. This simple task could involve going
through several different life cycle stages as part of performing the activi-
ties associated with a simple task. Follow along with this engineer as he
thinks through the process of resolving a single nuisance alarm:

Monitoring Stage:
"Well today, I am spending some time fixing nuisance alarms.
Which of my alarms are on the most frequent alarm list? Ah, there's
one-a chattering high-value alarm on the column pressure."

Identification Stage
"Ah yes, I happen to remember that we need this alarm as part of
our quality program. My job today, though, is to make it work cor-
rectly and eliminate the chattering behavior, not to decide wheth-
er to get rid of it or not. So I don't have to research as to whether it
was originally specified by some particular process like a PHA."

Detailed Design Stage:


"Let's check the configuration of this alarm.. Nothing unusual
about it. Hmmmm, I see that the alarm deadband on this point
is set to zero. That's certainly not a proper thing, and it could eas-
ily lead to chattering behavior. Let's examine some process and
alarm history, and consult a good book on alarm management to
determine a more appropriate deadband setting."

Operation Stage and Maintenance Stage:


"So now I am going to alter the alarm deadband to a new setting.
Hmmm, do I have to take the point off-scan for this? Not in this
case, on this DCS. If I did, I would have to tell the operator first.
But I can make this change without that and the alarm will re-
main online throughout."

Management of Change Stage:


"So far, I haven't actually changed anything. Before I type in and
activate this new deadband value, I mentally review the manage-
ment of change requirements for doing so. This specific type of
change is covered in our alarm philosophy, and our site procedures
empower me to make this change as part of my authorized job du-
ties. I do not have to seek any approval or signatures. I will have to
document this change though, in the Master Alarm Database."

Implementation Stage:
"Now I actually change the deadband. I type in the new number
and hit Enter. Done!"

Rationalization Stage: (Note, documentation is a part of the Rationaliza-


tion stage of the life cycle.)
"Since I have the proper security access, I will add this new dead-
band setting into the master alarm database along with my name,
date, and rationale. I will also make a note in the weekly nuisance
alarm tracking report about this one. As long as I am here look-
ing at this alarm, I note it is configured as a Priority 3. That seems
reasonable, but let's just check the online Master Alarm Database
for the reasons that resulted in that priority assignment. Hmmm,
they look pretty good. (If they did not, I could not change them
myself. I would need the Prioritization team to take a look at it. Any
change in priority requires notification to the operators.)"

Monitoring Stage:
"Part of my work process for this is to continue to look at the
alarm data to see if this deadband setting change solved the prob-
lem. I will add this one to my tracking and follow up list."

In a few minutes, several different life cycle stages were briefly visited in
accomplishing this one example task. So in understanding and applying
ISA-18.2, don't get overwrought about trying to figure out which life
cycle stage you are in at any point in time. It is a requirements structure,
not a work process sequential checklist!

12.7 Seven Steps vs. Life Cycle Stages


A good document with a good explanatory structure is not necessarily a
good plan for improvement! The seven-step approach is specifically de-
signed for ensuring fast and cost-effective improvements, primarily for

existing alarm systems. It is also applicable, with common-sense modifi-


cations, for new systems. There are thousands of existing alarms systems
for every planned new one, but it is desirable for a standard to be written
in a way that applies similarly to both.

There is no conflict between the seven-step approach presented in this


book and the ISA-18.2 life cycle methodology-there is only some differ-
ent nomenclature and task arrangement.

12.8 The Alarm Philosophy Life Cycle Stage


ISA-18.2 recognizes the alarm philosophy document as a key require-
ment for effective alarm management, and requires one. A table lists
topics which are noted as either mandatory or recommended for in-
clusion. The list is short compared to the comprehensive philosophy
contents discussed in Chapter 5 and Appendix 3. Remember, a standard
describes the minimum acceptable, not the optimum.

The major mandatory contents of the alarm philosophy include roles


and responsibilities, alarm definition, the basis for alarm prioritization,
HMI guidance, performance monitoring, management of change, train-
ing, and a few others.

There are no surprises in the list except for two concepts not seen before
in the alarm management lexicon-namely "Alarm Classification" and
"Highly Managed Alarms."

12.9 Alarm Classification


Alarm classification is a term used in ISA-18.2 with a limited, specific,
and unusual meaning. This is yet another instance of standard-speak.
In ISA-18.2, alarm classification is a specific method for assigning and
keeping track of various requirements for alarms (mostly administrative
ones). For example, some alarms may require periodic refresher training
and others may not. The same could be true for testing, maintenance,
reporting, HMI depiction, and similar aspects. Alarm classes are defined
and used to keep track of these requirements. It is mandatory in ISA-18.2
to define alarm classes.

This is a slightly unusual thing for a standard! Normally, standards tell


you what to do, but not how to do it, or to require a specific method.
For example, the standard could have simply stated, "Identify and track
alarms that require periodic testing." There are a variety of methods to
do this successfully and an alarm classification structure is only one of

them. But the committee elected to require a classification structure, al-


though it need not be an onerous one.
There are no specific classes required and no minimum number of class
definitions specified. We recommend the keep-it-simple approach by
having a simple class structure with minimum variations.
12.10 Highly Managed Alarms
The committee thought it desirable to specifically define one class of
alarms. A variety of designations were considered, from "critical" to "vi-
tal" to "special" to "super-duper." "Highly Managed Alarms" or HMAs
was chosen as the term. The intent is to identify the alarms that must
have a significantly high level of administrative requirements.

Now, there is no requirement to have or use this classification! But if


you do-if you state "This classification in my philosophy is per the ISA-
18.2 usage of Highly Managed"-then you must document and handle
a multitude of special administrative requirements in a very specific way
according to the standard.

The various mandatory requirements for HMAs are spread out in several
sections throughout ISA-18.2. These include:
- Specific Shelving requirements, such as access control with audit trail
- Specific Out of Service requirements, such as interim protection, access control, and audit trail
- Mandatory initial and refresher training with specific content and documentation
- Mandatory initial and periodic testing with specific documentation
- Mandatory training around maintenance requirements with specific documentation
- Mandatory audit requirements
Our advice is to specifically avoid the usage of this alarm classification!
You might choose to have your own similar classification, and then
choose only the administrative requirements you deem specifically nec-
essary for those alarms. These will probably be only a subset of the ISA
18.2 listing for HMAs.

12.11 The Alarm System Requirements Specification (ASRS)


This non-mandatory section basically states that if you are buying a new
control system, it is a good idea to write down your requirements and
evaluate vendor offerings and capabilities against them. Specific defi-

- Any modifications needed to the alarm, such as introduction of logic, reconfiguration of alarm type, alarm message rewording, DCS graphic changes, building of new common alarms, and so forth.

Optional items (time permitting) to document include:
- Method of alarm verification (often this is assumed to be a capability of a trained operator and specific steps are not individually documented)
- Other points likely to be involved with the alarm
- Relevant Operating Procedure, PHA, or other references
- Relevant P&ID or hazard scenario identification

8.16 The Master Alarm Database


The output of the D&R is a Master Alarm Database, the collection of
proper alarm settings for the system. The management of change system
must treat the Master Alarm Database as controlled information that
must be kept up-to-date.
The Master Alarm Database has several important uses:
- It is the reference database for the alarm system. Manual or automated audit/enforce mechanisms compare the actual alarm settings in effect on the DCS to these reference values; mismatches indicate an improper alarm change (a small sketch of this comparison follows the list).
- It is a reference for state-based alarm management, alarm shelving, and alarm flood suppression.
- It contains the documentation of alarm cause, response, and consequences.

Operators should have electronic access to alarm documentation (paper


printouts are hopeless). Methods of making the documentation available through the company intranet via a web browser are common.

Even better is the integration of relevant alarm data available via a single
click call-up within the DCS graphics themselves, i.e. within the opera-
tor's HMI. This method provides for the quickest access by the operator
to alarm information at the time and at the point where they need it. It
is an important adjunct to operator effectiveness and abnormal situation
response.

8.17 Alarm Classification


There are a variety of administrative requirements most companies have
around the alarm system. For example, some alarms may require:
- Periodic testing
- Periodic operator or staff training
- Special reporting requirements when they occur
- Restrictions on shelving or taking out-of-service

The ISA-18.2 standard requires alarm classifications be set up document-


ing such requirements, and rationalized alarms be assigned to one or
more classes as they are defined in the alarm philosophy. We recom-
mend that class structures should be kept very simple. More information
on the ISA-18.2 standard can be found in Chapter 12.

8.18 After the D&R-Implementation of Changes


D&R is an off-line exercise. The revised alarm settings are collected in a
database but not implemented onto the control system every day! At
the end of the D&R it is necessary to alter the on-line control system so
it matches the configuration specified by the D&R. This is a significant
undertaking that requires planning.

The changed alarm system will usually involve a major shift in operat-
ing methodology for most operators. It can be quite uncomfortable for
them, as well as staff engineers, to accept. There are several considerations and methods to accomplish effective implementation.

AD&Rwill usually produce several hundred or more desired changes in alarm


configuration. Most DCSs have the ability to accomplish such changes in
bulk, given a properly formatted file. Export of the desired configuration data
from the D&R software is the logical starting point for this method. Proper
site MOC procedures must be followed for implementing these changes.

It is possible to implement many changes with the use of the Alarm


Audit and Enforcement software. Activate it with the new Master Alarm
Database and the changes will occur in one cycle of the software.

Note, not every desired alarm change can be accomplished in such an


on-line fashion. Sometimes, depending on the point type, the specific
change, and the DCS, a point may have to be taken off-line to accom-
plish the change, and then reactivated. Care must be taken so the pro-
cess is not disturbed during these operations.

5.18 DCS System Diagnostic Alarms


Alarms specific to the internal workings of a DCS (redundant cable faults,
module errors, communication errors, etc.) should be absent under nor-
mal operating conditions, and they should not be tolerated when they
occur. You don't just drive around with the "Check Engine" light on!

System diagnostic alarms are generally configured by the control system


manufacturer and are not subject to change by the end user. They are
usually rare in occurrence. They are very similar in their operator alarm
response considerations to instrument diagnostic alarms. The primary
issue with them is they are often cryptic in nature.

System diagnostic alarms should be presented in ways to make them easily


understandable by the operator. Clear explanations and guidance should
be provided within the operator HMI, not contained in volumes of dusty
control system reference books on closet shelves. In particular, it should be
very clear which system diagnostic alarms require immediate resolution,
compared to those that can be handled on a routine basis. The guidance
should include identifying the functional group to contact for assistance.

5.19 Point and Program References to Alarms


There are some poor (but common) DCS programming/configuration
practices with serious consequences if they are not dealt with correctly.
These practices involve programming the DCS to take actions based spe-
cifically on alarm behavior.

For example, consider a simple interlock that closes a feed valve based
on a high level of 80% in a tank.

Poor Practice: Configure the logic element with the occurrence of the
high alarm (often via a flag) as the input to cause the valve to close. This
is poor because:
The alarm setpoint parameter, or even the existence of the alarm,
is subject to change from a variety of places. Years of history have
led many to believe that the change of alarm settings is not a
significant action, regardless of procedures or MOC policies. A
change to the alarm setpoint will change the functionality of the
interlock, and this will likely not be obvious!
In some DCSs you have many obscure choices and methods as to
suppression options on an alarm, some of which could negate the
flag chosen to close the valve. So a suppressed alarm could prevent
the safety function of an interlock.

The alarm occurs simultaneously with the activation of the interlock


and provides no warning that the tank level is approaching the valve
closure value. This could result in an upset of the upstream facility.

Better Practice: Configure such logic elements with the process value (PV) as an input, and compare it to a numeric (80%) contained within the logic construct. This is better because:
- Even though the numeric could be changed, logic elements are far more obscure control system constructs and are much less likely to be changed by the non-expert. The logic will activate and the valve will close based on the PV, whether the alarm occurs or not.
- A separate alarm can be configured to provide warning of the impending interlock action.
- This design leaves the flexibility for adjusting, resetting, shelving, or otherwise modifying the alarm appropriately, without inadvertently changing the performance characteristics of the interlock. (A brief sketch contrasting the two approaches follows this list.)
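To make the contrast concrete, here is a purely illustrative Python sketch of the two practices; it is not DCS configuration code, and the tag names and the read_pv() / high_alarm_active() / close_feed_valve() callables are hypothetical placeholders.

```python
# Illustrative sketch only (hypothetical tags and helper callables).

HIGH_LEVEL_TRIP = 80.0   # trip value held as a numeric inside the logic construct

def interlock_poor_practice(high_alarm_active, close_feed_valve):
    # Poor: the trip follows the alarm flag, so re-ranging, suppressing, or
    # shelving the alarm silently changes (or defeats) the interlock.
    if high_alarm_active("LI101.PVHI"):
        close_feed_valve()

def interlock_better_practice(read_pv, close_feed_valve):
    # Better: the trip compares the process value itself to its own numeric, so
    # the alarm can be tuned, shelved, or reset without touching the interlock.
    if read_pv("LI101.PV") >= HIGH_LEVEL_TRIP:
        close_feed_valve()

if __name__ == "__main__":
    interlock_better_practice(lambda tag: 83.5, lambda: print("feed valve closed"))
```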

DCS systems should be surveyed to see if this poor programming practice


has been used. Any alarm change on such systems should be checked to
ensure interlock functionality has not been altered. DCS logic points
are not the only ones to check; programs and signals into PLC logic and
other similar things should be as well. We have found these poor pro-
gramming practices to be pretty common.

The alarm philosophy should specify the particular, site-desired method-


ology for interlock construction, plus any desired safeguards or special
HMI depiction.

Some control systems have an end-user available programming language


which can be used to accomplish a variety of tasks based on process
readings. The activation of programmatic functionality based on alarms
is an identical issue to be addressed.

Our conclusion is, if you want something to happen based on the pro-
cess attaining a certain value, then program it or configure it based on
reading the value itself, not on whether an alarm occurs at that value.
Exceptions deserve careful evaluation.

5.20 Operator Messaging Systems


The principles covered in this book apply to some aspects of a DCS other
than the alarm system, particularly any type of operator messaging sys-
tem used. (Do not confuse a messaging system with an Operator Alert

system. For details, see Chapter 10.) If the messaging system attracts the
operator's attention by sounding tones or flashing lights, and requires
acknowledgement, then the messaging system has a similar effect as the
alarm system in loading the operator. Therefore, the use of such mes-
sages should meet many of the same principles as alarms.

Operator messaging systems were originally provided for use in batch


production processes. They allow batch sequential programs to prompt
the operator to do any manual steps (physical field operations, deci-
sions, enter lab results) necessary to move the sequential operations of
the batch forward. When invoked, a confirmation option required the
operator to confirm the manual steps had been completed so the pro-
gram could resume. They were also used to notify (without requiring
confirmation) the operator of significant batch milestones (e.g., end of
batch).

Use of a messaging system other than for data input or confirmation


prompting should be avoided. There are other ways to announce status
to the operator; for example, graphic elements displaying sequence sta-
tus without generating messages. Only status changes requiring opera-
tor action, such as entering needed data before proceeding, should use
messages.

An example of a common misuse of messages would be announcing that one sequence has successfully completed and the next sequence has started, as per plan or normal conditions. The operator is better served by having a graphic showing sequence state and progress rather than individual messages coming in reflecting normal progress. The operator will ignore the entire message system if it mostly announces that everything is progressing OK.

The usual worst-case scenario if the operator message is ignored should


be delayed production, not a process condition that will worsen. Alarms
should be used for process conditions that will worsen if ignored. Some
DCS vendors allow more than one message priority. There should be no
need to assign a higher priority to a message. If a condition has more
severe consequences or should be responded to more quickly, then the
condition should be alarmed and not sent via the messaging system.

Any messaging system should use a separate visual and audible interface
(different tones) than the alarm system.

CHAPTER 6
Step 2: Baseline and Benchmarking of Alarm System Performance

"If you torture data sufficiently, it will confess to almost anything."
-Fred Menger

An initial alarm system baseline and benchmark against industry best


practices is essential to planning the improvement process. This is not
difficult; a few simple and straightforward analyses will provide an excel-
lent picture of the current performance level. A proper baseline should
use at least eight weeks of continuous alarm system data. The data for
each analysis must be based upon the alarms assigned to the span of
control of a single operating position.

Good alarm analysis software should be able to perform all of the analy-
ses in this chapter, and many others. It is possible to do these in a spread-
sheet, although the data parsing and reduction will become tedious,
speed is quite slow, and spreadsheet page size limits are easily exceeded
when importing alarm journals. Frankly, using a spreadsheet to analyze
alarm events is like using a water hose to fill an Olympic-sized swim-
ming pool! The proper tool for alarm analysis is a real database.

6.1 Operator Alarm Handling Capacity


If operators could effectively handle thousands of alarms per day, there
Would be no need for alarm management. But they cannot. The question
arises-what can they handle? A variety of research studies have been con-
ducted, including but not limited to those performed under the auspices
of the Abnormal Situation Management Consortium and subsequently

published in a variety of articles and publications. More studies are ongo-


ing, but a common-sense approach can be quite enlightening as well.
The human factors issues involved in alarm response are subject to many
variables, and firm, fixed performance numbers cannot be established.
Alarm response is not an automated process involving deterministic ma-
chines; it is a human cognitive process involving thought and analysis.
Operator response to an alarm consists of several steps:
1. Detecting the alarm.
2. Silencing and/or acknowledging the alarm.
3. Navigating to the appropriate screen to obtain contextual infor-
mation from the process of which the alarm is a part.
4. Verifying that the alarm is valid and not a malfunction.
5. Analyzing the process situation to determine the alarm's cause,
and deciding on the proper action(s) to take in response to the
alarm. This may involve consultation with other people.
6. Implementing the chosen action(s), generally through manipula-
tion of the control system, contacting and directing other people
to perform tasks, leaving the console to take action that cannot be
accomplished without doing so, or a combination of all of these.
7. Continuing to monitor the system to ensure the action(s) per-
formed correct the situation causing the alarm.
It is clear from these steps that alarm response cannot be instantaneous!
Several of these steps can only be accomplished sequentially. Some of
the steps can be performed in parallel as part of responding to several
simultaneous alarms.
Given these cognitive tasks, it is obvious that an alarm handling rate of
one alarm per second is untenable, but one alarm per hour is certainly
possible. The maximum rate that can be handled lies somewhere in be-
tween. The EEMUA 191 and ISA-18.2 documents use the terms "likely
to be acceptable", "maximum manageable", "likely to be over-demand-
ing", and "likely to be unacceptable." These have become part of the
alarm management lexicon. Research indicates:
- Handling one alarm in ten minutes, involving these steps, can generally be accomplished without the significant sacrifice of other operational duties, and is considered likely to be acceptable. More than this rate (~150 per day) begins to enter a problematic zone.
- Up to 2 alarms per 10 minutes are termed maximum manageable (~300 alarms per day). More may be unmanageable. The possibility of effective response to higher alarm rates is very highly affected by the particular alarms, the complexity of the situations indicated by the alarms, the complexity of the responses, the operator's HMI, and several other factors.
- Higher numbers represent thresholds above which proper alarm response becomes less likely, alarms are likely to be missed, and operational performance is potentially affected.
- Between two and five alarms per ten minutes can be characterized as possibly over-demanding.
- More than five but less than ten alarms per ten minutes becomes likely to be over-demanding.
- It has been demonstrated that alarm response rates of ten alarms per ten minutes can possibly be achieved for short periods of time; this is highly dependent upon the specific alarms (i.e., they had better be simple ones!). And this does not mean such a rate can be sustained for many ten minute periods in a row.
- More than ten alarms in ten minutes are considered likely to be unacceptable.
Extrapolation to hourly and daily amounts greatly aids in the visualization of performance, and these rates are best shown via trends rather than by averages. Averages by themselves can be highly misleading, a subject covered in more detail later in this section.

But whenever the operator's handling capacity is exceeded, then the operator is (like it or not) ignoring alarms. Not because they want to, but because they have to. The average, mean, median, standard deviation, Roche limit, or whatever other measure doesn't matter-at that point management has no assurance that the "right alarms" are being ignored. This is the stuff of major accidents.

6.2 Operator Span of Control and Multiple Operators


In most cases, a single operator is assigned an area of control authority
and responsibility for the process-an operating position. The control
console provided can manipulate a certain part of the process, and usu-
ally not other parts. The alarms annunciated on the console are relevant
to the specific operating position, and with a few exceptions, do not
include alarms from other operating positions.

In some situations, an extra operator is assigned to the console, usually


temporarily. This can be for startups or shutdowns, or similar complex
tasks or modes. In some countries, this is a more common practice, even
for normal operations. The operators choose their responsibility divi-
sion-"You take feed systems now and I'll take the reactors. We'll switch

in the afternoon." The control console is not logically split in such situ-
ations, nor are the alarms segregated. The question arises-since more
than one person is monitoring them, are substantially higher alarm rates
(perhaps doubled) possible to be handled successfully?

Answering this requires understanding how the operators will have to


interact. Either:
both operators will still have to evaluate each new alarm to at least
determine if it is applicable to their current portion of the process, or
one operator will be assigned to monitor all alarms, respond to his/her
own, and tell the other operator, "This one is yours," every time it is.

While some minor alarm handling rate increase might be possible, there
is no documented research or testing available about this situation. It is
obvious that doubled rates would not be achievable.

6.3 Alarms Are Not Created Equally


In discussing acceptable alarm rates for small periods of time (such as
ten minutes or an hour) the specific nature of the alarms becomes much
more of a determining factor than does the raw count of alarms. The na-
ture of the response is highly variable in terms of demand upon the op-
erator's time. There is no such thing as a single number that represents
a time quantity or duration of, "In general, how much time does it take
for an operator to handle an alarm?" That's like saying "How much time
does it take to talk your spouse into getting a boat?" The answers depend
upon the alarm, the boat, and the spouse!
As an example, consider a simple tank with three inputs and three outputs.
The tank sounds a high level alarm. Now consider all of the possible things
possibly causing the alarm and what the operator has to figure out:
Too much flow on Inlet Stream A
Too much flow on Inlet Stream B
Too much flow on Inlet Stream C
Where would you even keep a boat?
Too much flow on Inlet Streams A and B combined
Too much flow on Inlet Streams B and C combined
Too much flow on Inlet Streams A and C combined
Too much flow on Inlet Streams A and B and C combined
You have to get insurance for a boat, you know.
Not enough flow on Outlet Stream D
Not enough flow on Outlet Stream E
Not enough flow on Outlet Stream F
Our neighbors have a boat and have to spend money on it all the time.

Not enough flow on Outlet Streams D and E combined


Not enough flow on Outlet Streams D and F combined
Not enough flow on Outlet Streams E and F combined
Not enough flow on Outlet Streams D and E and F combined
A particular blockage or mis-valving event that may have occurred
in the field.
Or several more additional combinations of the above inlet and
outlet possibilities...

The situation can take quite a while to figure out, involving looking per-
haps at trends of all of these flows and comparing them to the proper
numbers for the current process situation. The correct action to take var-
ies highly with the proper determination of the cause(s). The diagnosis
time is highly variable based upon the experience of the operator and
whether the operator has been in the situation before.

The HMI plays a major role in effective abnormal situation detection


and response, directly affecting the ability of the operator to quickly and
properly ascertain the cause and corrective action for an abnormal situ-
ation. The quality of the HMI varies widely throughout industry. Some
HMI implementations make the problem diagnosis quite easy, but most
are little more than a collection of numbers sprinkled on a screen show-
ing a P&lD, making diagnosis much more difficult. (See Appendix 4 on
High Performance HMI for additional discussion.)

The result is that the diagnosis and response to a simple high tank level
alarm becomes not quite so simple at all. Given the tasks involved, certainly
much less than ten such alarms can be handled in a ten minute
period. Or, sixty in an hour.

Compare and contrast the above simple "high level tank" alarm to anoth-
er, different simple alarm stating "Pump 14 is supposed to be running but
has kicked off." The needed action is very direct: "Restart the pump or if it
won't, start the spare." Operators can handle several such alarms as these in
ten minutes. The time required to figure out the situation is much less.

The real concern is to get the alarm rates down to a level so there is a
low likelihood an alarm will be missed. Remember, when alarms indi-
cate a situation requiring an operator action, missing an alarm means an
avoidable consequence will occur. Alarm rate also then indirectly indi-
cates control system effectiveness-its ability to keep the process within
bounds that do not require manual operator intervention to avoid con-
sequences of differing severity!

Alarm rates are thus controlled by indirect means rather than direct
means. The solution to an alarm rate problem may lie in control im-
provements rather than in directly addressing the alarm system.

6.4 The History of Alarm Analysis


Alarm analysis really began in the early 1990s and corresponded to indus-
trial adoption of personal computer technology. Prior to the PC, control
rooms had alarm printers. These bulky, noisy devices ate large boxes of pa-
per and took up lots of space. They often needed expensive replacement
ribbons. The printed records were not very useful. It was said that it would
save time to feed the output of the printer directly into a shredder.

Alarm analysis capabilities were not supplied by the DCS manufacturer.


Third-party companies specialized in filling gaps in DCS capability. The
replacement of expensive alarm printers with conmparatively cheap PCs
was an early use of PC technology in the control room environment. En-
hancements soon made the alarm data searchable. Reasonably capable PC
databases made more detailed alarm analysis possible -and very surpris-
ing things were then learned about actual alarm system performance.

In the early 1990s, control systems were generally big, expensive, closed, pro-
prietary boxes. They were not designed to connect to alien systems like PCs.
The printer port was one of the few standard interfaces available. The DCS
manufacturer wanted you to buy their equipment for anything you needed.
A simple replacement keyboard could cost $5,000 (but it was "certified!")

The closed nature of DCSs meant that any advanced methods of collect-
ing alarm events for analysis were very DCS-specific, which made multi-
DCS commercial solutions uneconomic. Many home-brewed solutions
began to appear from innovative end-users and third parties. In the late
1990s and early 2000s, DCSs became more "open," generally beginning
to support Microsoft-based technologies. A major advance came about
with the support of the OPC standard by several DCS manufacturers.
OPC stands for Object Linking and Embedding (OLE) for Process Control.

The OPC Foundation (www.opcfoundation.org) is an industry consor-


tium that specifies open connectivity solutions for industrial control.
The advent of the OPC Alarm and Events and Data Access standards
make it much easier to create standardized collection tools for alarm
occurrences and configuration files-and to both read and write such
information to the control system. This has made it possible to much
more easily create solutions to very difficult real-time alarm problems-
such as the ones addressed later in this book.

For further details about the computational methods used in extracting


and analyzing alarm data, see Appendix 2.

How Far We Have Come


Most relatively young engineers have no idea about what the early years of the personal computer revolution were like. For example, did you know?...

No one owned a telephone. The telephone company owned the phone and you paid rent for it every month. You were not allowed to provide your own phone and indeed there were none for sale. You chose from the very few offered by the phone company.

When modem technology was invented, it was illegal to directly connect non-telephone-company devices to the phone line. To get around this, early modems used acoustic couplers. They had rubber cups you attached to a conventional telephone handset. A small speaker was in one and a microphone in the other. Data rates using this technology were very low, e.g., 10 characters per second!

For storing information, disk drives were expensive and of low capacity. An early 1980s five megabyte (mega, not giga!) personal computer hard drive cost $5,000. The same physical drive when supplied to you as "certified" by the DCS manufacturer for their equipment could cost you $30,000.

RAM memory was expensive. In 1981, a 16K memory card for an Apple II+ computer cost about $180. In September 2009, a 4 gigabyte USB flash drive was about $10. At 1980 prices, that much memory would have cost more than $4.5 million dollars. Actually, a lot more, because $180 was worth a lot more in the year 1980. (A Camaro Z28 was $7,200.)

6.5 Alarm System Key Performance Indicators (KPls)


Measurement is fundamental to control and improvement, and improvement is best measured against a pre-determined goal. The following alarm performance targets are achievable goals. Based on our experience, the chasm between the initial baseline of a system and these targets may seem too wide to cross. You may have numbers 10X or 100X as large as these! But the methods covered in this book, and particularly in the next chapter, will result in major improvement.

Figure 6-1 is from the ISA-18.2 Alarm Management Standard document


(see Chapter 12), with some annotations. ISA-18.2 notes:

The target metrics in the following sections are approximate and


depend upon many factors (e.g., process type, operator skill, HMI,
degree of automation, operating environment, types and signifi-
cance of the alarms produced). Maximum acceptable numbers
could be significantly lower or perhaps slightly higher depending
upon these factors. Alarm rate alone is not an indicator of accept-
ability.

The question sometimes arises about "normalization." The 150/300


alarms per day are in fact normalized, because they are based upon the
span of control of a single human operator. Since alarms are a human-
machine interaction, this is the most consistent possible method of nor-
malization. If company A has a process with 1,000 loops successfully
controlled by a single human, and that works for them, great. If com-
pany B has a more operator-intensive process where they require or have
a human operating only 300 loops, that is fine for them. The alarm rate
we are concerned with is per human, not per loop; we are measuring the
alarm load on a person, not on the DCS. After all, the alarm system ac-
complishes absolutely nothing unless there is a human there to perceive
it! So all alarm rate measures are calculated per human responsible for
doing something with the alarms.

Alarm Performance Metrics (based upon at least 30 days of data)

Metric: Annunciated Alarms per Time per Operating Position
                           Target Value:                   Target Value:
                           Very Likely to be Acceptable    Maximum Manageable
  Alarms Per Day           ~150 alarms per day             ~300 alarms per day
  Alarms Per Hour          ~6 (average) (Note 1)           ~12 (average)
  Alarms Per 10 Minutes    ~1 (average)                    ~2 (average)

Metric                                                     Target Value
  Percentage of hours containing more than 30 alarms       <1%
  Percentage of 10-minute periods containing more than     <1%
    10 alarms
  Maximum number of alarms in a 10 minute period           10
  Percentage of time the alarm system is in a flood        <1%
    condition
  Percentage contribution of the top 10 most frequent      <1% to 5% maximum, with action plans
    alarms to the overall alarm load                       to address deficiencies
  Quantity of chattering and fleeting alarms               Zero; action plans to correct any that occur
  Stale Alarms                                             Less than 5 present on any day, with action
                                                           plans to address
  Annunciated Priority Distribution                        ~80% P3, ~15% P2, ~5% P1 (Note 2)
  Unauthorized Alarm Suppression                            Zero alarms suppressed outside of controlled
                                                           or approved methodologies
  Unauthorized Alarm Attribute Changes (alarm types,       Zero alarm attribute changes outside of
    setpoints, priorities, deadbands, etc.)                approved methodologies or MOC

Note: Designed state-based alarming, flood suppression, alarm shelving, logic-based alarming, etc. would be approved methodologies.

Figure 6-1: Recommended Alarm System Key Performance Indicators
Note 1: Averages can be misleading! (See Section 6.9.)
Note 2: If a 4th "Highest" or "Critical" type of priority is used, its occurrence rate should be much less than 1%. If a 4th "lowest" priority is used for diagnostics, there is no set percentage rate because there is no set desirable rate of instrument malfunction. If other special-purpose priorities are used (e.g., "Journal"), they also have no specifically desired percentage distribution. (See Section 4.8.)

All of the examples in this book are of real data, but slightly disguised to
protect the embarrassed.

6.6 Alarms per Day


The most important analysis is simple: Alarms per Day (for a single operat-
ing position, as stated above). The number of alarms per day is a good in-
dicator of the overall health of the alarm system, and is the place to start.

Figure 6-2: An Example Alarms per Day Graph
[Chart: annunciated alarms per day over roughly 30 days, with daily counts ranging into the thousands (up to about 7,000-8,000), plotted against reference lines for "Maximum Manageable" (300/day) and "Likely Acceptable" (150/day).]

Recommendations from ISA-18.2, EEMUA 191, and other published


studies are used to produce the two straight lines labeled as "Maximum
Manageable (300)" and "Likely Acceptable (150)." Alarm rates above ap-
proximately 300 alarms per day place the operator in the unenviable
position of being forced to ignore many alarms-the quantity simply
overwhelms their ability to analyze each one.

In the above (and quite typical) data, this alarm system produces alarms
at rates far beyond the operator's abilities of evaluation and response,
and for days at a time. Such an alarm system is not a useful tool to help
the operator perform the right action at the right time! In fact, it is much
more of a distraction or a hindrance to the operator.
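As an illustration of how little computation the basic analysis needs, the following Python sketch counts annunciated alarms per day for one operating position from an exported alarm journal. The CSV layout (an ISO-format "timestamp" column, journal-only events already filtered out) is an assumption; real journals vary by DCS.

```python
# Minimal sketch: alarms per day from an exported alarm journal (assumed CSV layout).
import csv
from collections import Counter
from datetime import datetime

def alarms_per_day(journal_csv):
    """Return {date: annunciated alarm count} for a single operating position."""
    counts = Counter()
    with open(journal_csv, newline="") as f:
        for row in csv.DictReader(f):
            counts[datetime.fromisoformat(row["timestamp"]).date()] += 1
    return dict(counts)

if __name__ == "__main__":
    for day, n in sorted(alarms_per_day("alarm_journal.csv").items()):
        band = "" if n <= 150 else (" > likely acceptable" if n <= 300 else " > maximum manageable")
        print(f"{day}: {n} alarms{band}")
```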

Journal-only alarms, which are intentionally not annunciated to the op-


erator, should not be a part of these analyses. As stated before, they are
not really alarms at all and do not enter into the operator loading and

response aspects of these analyses. However, an occasional separate look


into their amount and rate of production should be made, because DCS
performance can actually be negatively affected if it is generating and
recording thousands of these "invisible" events-and this is not an unusual circumstance.

Alarms per hour can be similarly analyzed and plotted, although it is a


bit redundant. The alarms per day, alarms per ten minutes, and alarm
flood measures provide the best alarm rate performance indications.

6.7 Alarms per Ten Minutes


Burst rates of alarms are quite important. Looking at alarms in ten min-
ute time slices gives a better picture of this than the daily amounts.

Figure 6-3: Example Graph of Alarms per 10 Minutes
[Chart: annunciated alarms per fixed 10-minute interval over 8 weeks; the highest 10-minute rate is 144, and an alarm flood is marked as 10 or more alarms in 10 minutes.]



Fixed ten minute intervals are used (e.g., 1:00:00 pm through 1:09:59
pm). An alarm rate of ten or more alarms in ten minutes defines the be-
ginning of an alarm flood. In the chart above, rates often exceed 20, 40,
60, or more alarms in ten minutes. Such rates can continue for hours.
During such periods, the likelihood of an operator missing an important
alarm increases, as has been shown many times in the analysis of major
accidents.

6.8 Alarm Floods


A refinement of the alarms per ten minutes analysis is the alarm flood
analysis. A good default definition of an alarm flood is that it begins when
the alarm rate exceeds ten or more alarms occurring in ten minutes, and
ends when the rate drops below five alarms in ten minutes. Again, fixed
ten minute intervals are used (e.g., 1:00:00 pm through 1:09:59 pm).
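A minimal Python sketch of that default definition, assuming the counts for consecutive fixed ten-minute intervals have already been computed, might look like this:

```python
# Illustrative flood detection: a flood starts at >= 10 alarms in a fixed
# 10-minute interval and ends when the rate drops below 5 alarms in 10 minutes.

def find_floods(counts_per_10min):
    """Return (start_index, end_index, total_alarms) for each flood."""
    floods, start, total = [], None, 0
    for i, n in enumerate(counts_per_10min):
        if start is None:
            if n >= 10:                       # flood begins
                start, total = i, n
        elif n < 5:                           # flood ends
            floods.append((start, i - 1, total))
            start, total = None, 0
        else:
            total += n
    if start is not None:                     # flood still open at end of data
        floods.append((start, len(counts_per_10min) - 1, total))
    return floods

print(find_floods([0, 3, 22, 32, 48, 16, 3, 0]))   # -> [(2, 5, 118)]
```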

An alarm flood can last for many hours and include hundreds or thou-
sands of alarm events. Alarm floods can make a difficult process situation
much worse. This analysis depicts the alarm floods occurring during an
eight week analysis period, showing a breakdown by alarm count during
the flood. Only alarms annunciated to the operator are included.

Alarm floods are a significant problem for this system. Most alarms pro-
duced by the system are during flood periods. Flood magnitude is very
high, generally hundreds of alarms contained in each flood. There are
over fourteen floods per day on average.

Figure 6-4: Example Graph of Alarm Floods
[Chart: alarm count per flood over 8 weeks; 820 separate floods, several peaks above 1,000 alarms, highest count in a single flood 2,771, longest flood duration 19 hours.]

Alarm Flood Analysis Table

  Number of Floods                                             820
  Average Floods per Day                                       14.6
  Total Alarms in All Floods                                   58,376
  Average Alarms per Flood                                     71
  Highest Alarm Count in a Flood                               2,771
  Percentage of Alarms in Floods vs. All Annunciated Alarms    73.8%
  Total Duration of Floods (in hours)                          676.5
  Percentage of Time Alarm System is in a Flood Condition      50.3%

Figure 6-5: Example Table of Alarm Floods

6.9 Alarms Likely to Have Been Missed


Having covered the basics of alarm rate analysis, here is why it is im-
portant to look at alarm performance in terms of graphs rather than as
simple averages. Consider this example. What if your system showed the
following averages for a one-week period?
Average Alarms per day: 138
Average alarms per 10 minutes: 0.96

Upon first examination, this might seem like very good performance,
and our work is done! The average per day (138) is less than even the
"Likely Acceptable" value of 150. However, a more detailed look at the
data is needed, preferably involving trends. The question is-regardless
of my averages-how many alarms were likely to have been missed?

The alarms per day chart for this week looks like this:

Figure 6-6: Alarms per Day for a Pretty Good Week
[Chart: annunciated alarms per day over 7 days; five days fall below the "Likely Acceptable" line (150), two days exceed it but stay well under the "Maximum Manageable" line (300), and the weekly average of 138 is marked.]

Five of my days were less than the 150 "Likely Acceptable" value, and
although two days exceeded it, they were still well under the 300 "Maxi-
mum Manageable" value. Is this the end of the story?

The alarms per ten minute chart (averaging only 0.96) looks like this:

Figure 6-7: Alarms per 10 Minutes for a Pretty Good Week
[Chart: annunciated alarms per 10 minutes over 7 days; two floods stand out, one of 118 alarms over 40 minutes and one of 134 alarms over 30 minutes.]

There were only two fairly minor floods. The peak rate during one flood
was 48 alarms in ten minutes and the other had sixty alarms in one ten
minute period. The flood breakdowns were:

  Flood 1 Alarms                   Flood 2 Alarms
  Day 2, 7:40                      Day 7, 8:30
  Day 2, 7:50    22                Day 7, 8:40    18
  Day 2, 8:00    32                Day 7, 8:50    56
  Day 2, 8:10    48                Day 7, 9:00    60
  Day 2, 8:20    16                Day 7, 9:10
  Day 2, 8:30     3

Figure 6-8: Alarm Flood Breakdown


Flood 1 lasted forty minutes with 118 alarms. Flood 2 lasted thirty min-
utes with 134 alarms. So, how many alarms were likely to have been
missed? A simplistic answer (and even so, good enough for this illustra-
tive purpose) is to count the alarms exceeding ten in any ten minute
period for the duration of the flood.

By this method:
  Flood 1: 78 alarms were likely to have been missed.
  Flood 2: 104 alarms were likely to have been missed.
  This week: 182 alarms were likely to have been missed.
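That simplistic counting method is a one-liner; the sketch below applies it to the interval counts of Figure 6-8 and recovers the 78, 104, and 182 figures.

```python
# For each fixed 10-minute interval in a flood, count the alarms in excess of ten.

def alarms_likely_missed(interval_counts):
    return sum(max(n - 10, 0) for n in interval_counts)

flood_1 = [22, 32, 48, 16]   # Day 2, 7:50-8:20 (Figure 6-8)
flood_2 = [18, 56, 60]       # Day 7, 8:40-9:00 (Figure 6-8)
print(alarms_likely_missed(flood_1))            # 78
print(alarms_likely_missed(flood_2))            # 104
print(alarms_likely_missed(flood_1 + flood_2))  # 182 for the week
```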

In other words, despite these great averages (and actually this is pretty good alarm system performance, and most sites would be happy to achieve it!) we still put the operators in the position of being likely to miss almost 200 alarms. Almost 200 cases where the failure of the operator to take proper corrective action could have resulted in a consequence, perhaps a quite significant one.

The result of this simple analysis is plotted in Figure 6-9, which is taken
from the same data as Figure 6-3, Alarms per Ten Minutes.

Figure 6-9: Alarm Counts Exceeding 10 in 10 Minutes
[Chart: alarms per day likely to have been missed, over 8 weeks. Weekly totals: Week 1: 3,885; Week 2: 2,281; Week 3: 2,728; Week 4: 1,903; Week 5: 2,173; Week 6: 1,443; Week 7: 2,253; Week 8: 4,260.]

In the eight week period of Figure 6-3, almost 21,000 alarms were very
likely to have been missed! This is great for the proverbial elevator speech

where you have one opportunity to state your case in-between floors. A
weekly view of such data can really get the attention of management. So,
don't rely on averages to tell the whole story!

6.10 Most Frequent Alarms


The next analysis to perform is a simple ranking of the most frequent to
least frequent alarms during the analysis period. The following chart and
table are highly typical.

As is often the case, only ten alarms are a significant fraction of the en-
tire system alarm load, in this case 55%. (The analysis of hundreds of
systems shows the number is often over 80%, rarely is it less than 20%.)
In fact, the top four alarms are over 40% of the load! Were they inten-
tionally designed to annunciate so frequently? Of course not! Are they
performing a useful function in their current configuration? Doubtful.
The beauty of this analysis is that it can direct improvement efforts to
where they will do the most good. Imagine finding the time and making
the effort to improve only one alarm per week-to make it work as it was
intended to work. In four weeks, this system would be improved by over
40%. Someone would be a hero!

Figure 6-10: An Example of a Top 10 Most Frequent Alarms Chart
[Pareto chart: alarm count for each of the top 10 tags (left axis) with the cumulative percentage of the total alarm load (right axis).]


23-07-2021: 09:00~11:00
Balance topics of Unit-1 of Syllabus & Miscellaneous examples
- Future of Alarm Management
- Linking Operator Graphic page to individual alarm tag-line
- Advanced Alarm Management features (industry awareness required): Shelving, Suppression, State-based alarm, Rate-of-change alarm
- Alarm Annunciator Sequence
- Associated Graphic page to individual alarm on Alarm Summary page
- Alarm Response graph as per standard 18.2
- Case Study
- Miscellaneous Topics
- A few Acronyms - worth remembering!
- MCQs - Online Test, 45 minutes, 30 MCQs, 30 marks (30-07-2021 tentatively)
Future of Alarm Management in your Palm!
- Mobile HMI (e.g., InTouch Access Anywhere)

Awareness about how to improve operator response time and avoid nuisance alarms:
- Access to Operator Graphic page from Alarm-Summary page (slide follows)
- Dead Band, Dead Band Time configuration awareness (slide follows)
- Shelving feature awareness (slide follows)
- Suppression of alarm(s) - when and by whom?
- Difference between Shelving & Suppression


Recommended Standard for Audio-Visual Alarms - ISA-18.2

  Alarm State                                    Audible      Visual Indications
                                                 Indication   Color     Symbol    Blinking
  Normal                                         No           No        No        No
  Unacknowledged Alarm                           Yes          Yes       Yes       Yes
  Acknowledged Alarm                             No           Yes       Yes       No
  Return to Normal State Unacknowledged Alarm    No           Optional  Optional  Optional
  Latched Unacknowledged Alarm                   Yes          Yes       Yes       Yes
  Latched Acknowledged Alarm                     No           Yes       Yes       No
  Shelved Alarm                                  No           Optional  Optional  No
  Suppressed by Design Alarm                     No           Optional  Optional  No
  Out-of-Service Alarm                           No           Optional  Optional  No

  Note 1: Yes signifies an indication that is different from the normal state indication.

Figure 11 - Recommended Alarm State Indications
Operator Access to respond timely by a single click
- Tag no. A-16-105 B has an alarm (Cell Gas Pressure) on the Alarm Summary page (Honeywell DCS, DVACL-11).
- Actions possible if configured (not done at the time): jump to the Associated Graphic page, or shelve the alarm.
- The corresponding graphic page is available so the operator can jump from the Alarm-Summary page to the Operational Graphic page, where he gets more details about the same tag number: A-16-105 B.
5.5 Alarm Response Timeline

Figure 6 represents a process measurement that increases from a normal condition to an abnormal condition, and the two possible scenarios based on whether or not the operator takes the corrective action. Using Figure 5, it is possible to map some states to this timeline to clarify the definition of terms related to time.

Figure 6 - Alarm Timeline
[Diagram: the process measurement rises past the alarm setpoint toward the consequence threshold. Phases shown: Normal (A), Unacknowledged Alarm (B), Acknowledged Alarm & Response (C), Return to Normal (D). Time elements shown: alarm deadtime and delay, operator (acknowledge and response) delay, process deadtime and process response delay, alarm deadband, the process response to operator action, and the process response without operator action, which crosses the consequence threshold.]

General Configuration Recommendations
Deadbands & Counters & Time Delays, Oh My!

  Signal Type     Delay Time       Signal Type     Deadband
  Flow            15 seconds       Flow            5% of span
  Level           60 seconds       Level           5% of span
  Pressure        15 seconds       Pressure        2% of span
  Temperature     60 seconds       Temperature     1% of span
  Other           5 seconds        Other           Depends!
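To see how these settings suppress chattering, here is an illustrative Python sketch (not vendor code) of one alarm evaluated with a deadband and on/off time delays. The sample values and settings are invented, loosely following the level row of the table above (5% deadband, 60-second delays).

```python
# Illustrative alarm filter: on-delay before annunciating, deadband plus
# off-delay before clearing. samples are PV readings taken every dt seconds.

def alarm_states(samples, setpoint, deadband, on_delay, off_delay, dt):
    states, active, timer = [], False, 0.0
    for pv in samples:
        if not active:
            timer = timer + dt if pv >= setpoint else 0.0
            if timer >= on_delay:                     # condition held long enough
                active, timer = True, 0.0
        else:
            timer = timer + dt if pv < setpoint - deadband else 0.0
            if timer >= off_delay:                    # clearly back below the deadband
                active, timer = False, 0.0
        states.append(active)
    return states

# A noisy level hovering around its 80% setpoint would chatter with zero deadband
# and no delay; with a 5% deadband and 60 s delays it gives one clean alarm cycle.
noisy = [79, 81, 79, 82, 80, 85, 86, 87, 84, 83, 74, 73, 72, 71, 70]
print(alarm_states(noisy, setpoint=80, deadband=5, on_delay=60, off_delay=60, dt=30))
```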


What is Pareto Analysis? (80/20 ratio)
Pareto Report: The Pareto report lists the most frequently alarming tags. The data is presented both in a graphical and a tabular way.
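A minimal Python sketch of such a Pareto report, assuming the alarm journal has already been reduced to one tag name per alarm occurrence (the tags shown are hypothetical):

```python
# Rank tags by alarm count and show each tag's cumulative share of the total load.
from collections import Counter

def pareto_report(alarm_tags, top_n=10):
    counts = Counter(alarm_tags)
    total = sum(counts.values())
    running = 0
    for tag, n in counts.most_common(top_n):
        running += n
        print(f"{tag:<12} {n:>6}  {100 * running / total:5.1f}% cumulative")

pareto_report(["LI101.HI"] * 400 + ["FIC205.LO"] * 250 + ["TI330.HI"] * 90 + ["PI410.HI"] * 60)
```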
Key actions which can reduce Alarm-Flood by 50%
- Dead Band adjustment
- Time Delay adjustment (off-time delay / on-time delay)
- Recommended DB & TD settings for Temperature, Pressure, Flow and Level
- Safety alarms and Process alarms
- Alarm bypassing by software and/or hardware
- Safety Interlock Alarms - not allowed to be bypassed (loss of job?)
- Who is authorised to bypass safety interlocks/alarms?
Alarm Management Data of DVACL Plant - Before

Outcome of the Study:
- Data for 1 month consists of more than 13,000 alarms
- Average 19 alarms per hour
- Average 456 alarms per day
- Lots of unwanted alarm interference
- Hard for the operator to concentrate
- Need for detailed analysis

[Chart: breakdown of the alarm load - Process Alarms 62%, Operator Actions 38%, System Alarms 0%]

Action Plan:
- Phase 1: Quick clean-up of bad-acting alarms; discussion with plant people about the alarms.
- Phase 2: Removal of unnecessary alarms; optimize alarm settings dynamically to match the granularity of the plant operations; optimization of PID settings; Dead Band / Dead Time for alarms set as per the ISA standard; Shelving and Discrepancy Alarms discussed with process & instrument professionals.

Alarm Management Data of DVACL Plant - After completion of Phase 1

- Bad-acting alarm counts by type: Process Alarms 2,924; Operator Actions 658
- Phase 2 (continued): removal of unnecessary alarms; alarm settings optimized dynamically with the help of standard data; Dead Band and (Dead) Time delay set for Temperature, Pressure, Level & Flow setpoints; optimization of PID settings

[Chart residue: alarm-rate benchmark bands (Overloaded, Reactive, Stable, Robust, Predictive) versus average and maximum alarms per hour - largely unrecoverable]

Outcome of the Completion:
- Data for 1 month consists of 3,600 alarms
- Average 5 alarms per hour
- Average 120 alarms per day


Miscellaneous topics missed
- Common DCS and SCADA Alarm Display
- Meaning & understanding of DCS & SCADA
- What is the Master (Alarm) Database? Its uses? ("Alarm Audit" justification)
- Alarm capabilities and their misuse
- How alarm software can help to reduce the response time of the operator
- Interlock and other safeguard systems; the Future of Alarm Management

The Future of Alarm Management - advanced management by:
- Flood suppression / first-out alarm annunciation / standard alarm audio-visual Test / Acknowledge / Reset sequence
- State-based alarms - example: a low-pressure alarm for pump dry-run protection
- Operator-configurable alerts - to remind the operator to take timely action (example: an alert at 85% level on an overhead tank so that he stays alert to manually trip the pump on a need basis); alert settings are made by the operator for his own convenience
A few Acronyms ... worth remembering
- SCADA: Supervisory Control & Data Acquisition (software engineering)
- RAGAGEP: Recognised And Generally Accepted Good Engineering Practice. Standards and codes are RAGAGEP. OSHA (the Occupational Safety and Health Administration) fined the Texas City Refinery in 2009 - an 87 million USD fine for not following the ASME code and the ISA standard, resulting in accidents with fatalities (a level high alarm did not function).
- ANSI: American National Standards Institute
- ISA: International Society of Automation
- IEC: International Electrotechnical Commission (Europe)
- BIS: Bureau of Indian Standards (Indian standards start with "IS"; IS/IEC 61511 is accepted)
- ISA/IEC 61511 is a SIS safety standard for the process industry, including chemical plants.
Unit 2: Hazard and Operability Studies (HAZOP)
30-07-2021, Friday, Lectures 8 & 9: 09:00~11:00 AM
31-07-2021, Saturday, Lectures 10 & 11: 09:00~11:00 AM
(Reference: textbook Chemical Process Safety by Crowl & Louvar, 3rd Edition, 2011)
- Hazard analysis: Why needed? What is required? How done?
- Definitions: Hazard, Risk, Failures - Safe & Dangerous (Hidden), Human Error, Safeguards
- Fault tree analysis
- Event tree analysis
- Failure modes and effect analysis (FMEA)
- HAZOP (Hazard and Operability Studies; Chemical Process Safety by Crowl & Louvar, 3rd Edition, Ch. 11, Section 11.3, page 510)


History of FTA

Fault-Tree Analysis (FTA)

Fault trees originated in the aerospace industry and have been used extensively by the nuclear power industry to qualify and quantify the hazards and risks associated with nuclear power plants.

This approach is becoming more popular in the chemical process industries, mostly as a result of the successful experiences demonstrated by the nuclear industry. A fault tree for anything but the simplest of plants can be large, involving thousands of process events. Fortunately, this approach lends itself to computerization, with a variety of computer programs commercially available to draw fault trees based on an interactive session.

Fault trees are a deductive method for identifying ways in which hazards can lead to accidents. The approach starts with a well-defined accident, or top event, and works backward toward the various scenarios that can cause the accident.


Fault Tree Analysis

Data, Knowledge, & Skill Requirements:
a. A complete understanding of how the (chemical) plant / (process) system functions - and fails too!
b. Knowledge of the plant/system equipment failure modes and their effects on the plant/system.
c. Symbols knowledge (AND gate, OR gate, Final Event, Intermediate Event).
d. Logical thinking & skill expressed in terms of the above gate(s): the Logic Diagram.

Analysis starts from the Top Event down to individual Basic Event(s) through Intermediate Events.


Understanding Gate Symbols & Logics

  Gate Name      Causal Relation
  AND gate       Output event occurs if all input events occur simultaneously.
  OR gate        Output event occurs if any one of the input events occurs.
  Inhibit gate   Input produces output when the conditional event occurs.

Table 2.1 Gate Symbols


Event Symbols (Primary & Intermediate)

  Event Symbol   Meaning
  Circle         Basic event with sufficient data
  Diamond        Undeveloped event
  Rectangle      Event represented by a gate

Table 2.2 Event Symbols


Fault Tree Analysis (Reactor Explosion)

[Fault tree diagram:
  REACTOR EXPLOSION, 3.6 x 10^-4 F/YR
    = AND gate of:
      RUNAWAY REACTION, 1.8 x 10^-2 F/YR
        = AND gate of:
          FLOW CONTROL LOOP FAILS, 0.3 F/YR
            = OR gate of: FLOW CONTROLLER FAILS (0.2 F/YR), VALVE STICKS OPEN (0.1 F/YR)
          TEMPERATURE INTERLOCK FAILS, 0.06 probability of failure on demand
            = OR gate of: VALVE FAILS TO CLOSE (0.05), THERMOCOUPLE & RELAY FAIL (0.01),
              both probabilities of failure on demand
      BURSTING DISC FAILS, 0.02 probability of failure on demand]
Fault Tree Analysis

Purpose: Identify combinations of equipment failures and human errors that can result in an accident event.

When to Use:
a. Design: FTA can be used in the design phase of the chemical plant to uncover hidden failure modes that result from combinations of equipment failures.
b. Operation: FTA including operator and procedure characteristics can be used to study an operating plant to identify potential combinations of failures for specific accidents.


Fault Tree Analysis

Type of Results: A listing of sets of equipment and/or operator failures that can result in a specific accident. These sets can be qualitatively ranked by importance.

Nature of Results: Qualitative with quantitative potential. Quantitative: the fault tree can be evaluated quantitatively when probabilistic data are available.


Fault Tree Analysis

Staffing Requirements
Single Individual:
One analyst should be responsible for a single fault tree, with frequent consultation with the engineers, operators, and other personnel who have experience with the systems/equipment included in the analysis.
Team:
A team approach is desirable if multiple fault trees are needed, with each team member concentrating on one individual fault tree. Interactions between team members and other experienced personnel are necessary for completeness in the analysis process.

04-08-2021 Pramod N Parikh


Fault Tree Analysis

Time and Cost Requirements:


Time and cost requirements for FTA are highly dependent on the complexity of the systems involved.
Modeling a small process unit could require a day or less with an experienced team. Large problems, with many potential accident events and complex systems, could require several weeks even with an experienced analysis team.

04-08-2021 Pramod N Parikh 10


Material A is transferred to an Exothermic Chemical Reactor along with Material B through a Flow Controller (FRC), a Flow Control Valve (FCV) and an Emergency Shut-Off Valve (ON-OFF).

If a fault/failure takes place in the FRC/FCV, the reactor temperature may increase to a hazardous level, and hence a Temperature Indicating Switch (TIS) is provided by the process design engineer to stop the flow of Material A by closing the Emergency Shut-Off Valve (XV). This is termed the high-temperature interlock through the TIS.

Further, if this high-temperature interlock fails, the high temperature may cause high pressure, and hence a Bursting Disc is installed on the reactor to release the pressure out of the reactor vessel. This prevents rupture of the reactor and harm to the operators in the area, which could be fatal if the impact is extensive.

Figure: Exothermic chemical reactor fed with Material A (through the flow controller FRC, flow control valve FCV and emergency shut-off valve XV) and Material B, protected by the high-temperature interlock (TIS) and a bursting disc.

04-08-2021 Pramod N Parikh


Fault Tree Analysis of Exothermic Reactor Explosion
Figure: Fault tree for the exothermic reactor explosion, as reconstructed on the earlier slide (top event REACTOR EXPLOSION, 3.6 x 10^-4 F/yr, from RUNAWAY REACTION AND BURSTING DISC FAILS).

04-08-2021 Pramod N Parikh 12


Fault Trees: AND & OR Gates - Fault Tree Examples

Power supply fails if Mains power AND the Standby generator both fail (AND gate).
Fire-water deluge fails if the fire detector OR the fire panel OR the fire pump fails (OR gate).

Figure: example fault trees for the power supply unit (PSU) and the fire-water deluge. Circles represent basic events; rectangular boxes serve as descriptions.

04-08-2021 Pramod N Parikh 13


The basic 3 elements of a closed safety loop fail on OR logic of failure.
Fault tree for process control failure in a chemical process: OR gate.
In other words, if you are not even aware of a particular failure or problem, you subsequently won't include it in the model; we can't model unknown unknowns.

Figure 9-5: Fault tree for a safety loop. Sensor OR Logic solver OR Final element (detector, panel, valve): failure of any one fails the loop.

Markov Models

04-08-2021 Pramod N Parikh 14


Flat-Tire Example of a Fault Tree
(Ref: Textbook of Crowl & Louvar)

Figure 11-12: A fault tree describing the various events contributing to a flat tire.
Top event: Flat Tire.
OR gate inputs: Road Debris (basic event) and Tire Failure (intermediate event).
Tire Failure = Defective Tire OR Worn Tire (basic events).
04-08-2021 Pramod N Parikh 15
Failure Analysis of a Flat Tire

FTA for the flat tire (continued):
For instance, a flat tire on an automobile is caused by two possible events.
In one case the flat is due to driving over debris on the road, such as a nail. The other possible cause is tire failure. The flat tire is identified as the top event.
The two contributing causes are either basic or intermediate events. The basic events are events that cannot be defined further, and intermediate events are events that can be defined further.
For this example, driving over the road debris is a basic event because no further definition is possible.

04-08-2021 Pramod N Parikh 16


Use of OR & AND gate symbols to express the logic of the Top Event

FTA for the flat tire (continued):

The tire failure is an intermediate event because it results from either a defective tire or a worn tire.
The circles denote basic events and the rectangles denote intermediate events.
The fish-like symbol represents the OR logic function.
It means that either of the input events will cause the output state to occur. As shown in Figure 11-12, the flat tire is caused by either debris on the road or tire failure.
Similarly, the tire failure is caused by either a defective tire or a worn tire.

04-08-2021 Pramod N Parikh 17


Problem: Draw the high-pressure failure/fault tree analysis diagram for the chemical reactor shown below.
Chemical reactor feed flows through a solenoid valve at normal pressure in the reactor.
High-pressure damage is preventable by closing the reactor feed: the operator closes the feed on a high-pressure alarm, and the shutdown system closes the solenoid valve automatically.
PIA - Pressure Indicating Alarm (Pressure Switch 1): switches on a light/lamp to alert the operator for action.
PIC - Pressure Indicating Controller (Pressure Switch 2): closes the solenoid valve to avoid over-pressurisation of the reactor.

Chemical reactor with over-pressure scenario

Figure 12-5: A chemical reactor with an alarm (PIA, pressure switch 1) and an inlet feed solenoid valve closed by PIC (pressure switch 2). The alarm and the shutdown system are linked in parallel.

04-08-2021 Pramod N Parikh 18


Narrative of the process system shown in the previous slide

Example 12-5
Consider again the alarm indicator and emergency shutdown system shown in Figure 12-5. Draw a fault tree for this system.
Solution
The first step is to define the problem.

1. Top event: damage to the reactor as a result of overpressuring.
2. Existing event: high process pressure.
3. Unallowed events: failure of mixer, electrical failures, wiring failures, tornadoes, hurricanes, electrical storms.
4. Physical bounds: the equipment shown in Figure 12-5.
5. Equipment configuration: solenoid valve open, reactor feed flowing.
6. Level of resolution: equipment as shown in Figure 12-5.

The top event is written at the top of the fault tree (see Figure 12-14).
Two events must occur for overpressuring: failure of the alarm indicator and failure of the emergency shutdown system. These events must occur together, so they must be connected by an AND function. The alarm indicator can fail by a failure of either pressure switch 1 or the alarm indicator light; these must be connected by an OR function. The emergency shutdown system can fail by a failure of either pressure switch 2 or the solenoid valve; these must also be connected by an OR function.
The complete fault tree is shown in Figure 12-14.

Determining the Minimal Cut Sets


04-08-2021 Pramod N Parikh 19
Understand when and why OR/AND gate(s) are used
Q-1. When can the Alarm Indicator (PIA) fail?
Q-2. When can the Emergency Shutdown (PIC, through pressure switch 2 and the solenoid valve) fail?

FTA diagram for the chemical reactor over-pressure scenario

Figure 12-14: Fault tree for Example 12-5.
Top event: Overpressuring of Reactor = Failure of Alarm Indicator AND Failure of Emergency Shutdown.
Failure of Alarm Indicator = Pressure Switch 1 failure OR Pressure Indicator (lamp) failure (OR gate).
Failure of Emergency Shutdown = Pressure Switch 2 failure OR Solenoid Valve failure (OR gate).

04-08-2021 Pramod N Parikh 20


Probability Maths & OR/AND gate outcomes

Simple Probability Concepts:
The probability of any event occurring can be expressed in the range 0 to 1.
Zero percent = 0.00 and 100 percent = 1; therefore probability, expressed mathematically, runs from zero percent to 100 percent (0 to 1).

In cricket, an expert predicts the probability of winning on the basis of past winning records.
e.g. the probability of winning the IPL by Dhoni / Kohli is 45% / 55% (i.e. 0.45 / 0.55, and 0.45 + 0.55 = 1).

OR-gate events: the probabilities of the input events are ADDED (an approximation, valid for small probabilities) to obtain the output event probability.
Example: P(event A) + P(event B) + P(event C) = probability of the output event.
P(A or B) = P(A) + P(B)
AND-gate events: the probabilities of the input events are MULTIPLIED.
P(A and B) = P(A) x P(B)
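A minimal numerical sketch of the same gate arithmetic (not from the textbook; the input numbers are illustrative). Note that the OR-gate addition is an approximation; the exact rule for independent events is P(A or B) = P(A) + P(B) - P(A)P(B), which the helper below uses.

```python
def p_or(*probs):
    # OR gate: exact combination for independent events, 1 - (1-p1)(1-p2)...
    # (the simple sum p1 + p2 + ... is a good approximation when the p's are small)
    p = 0.0
    for x in probs:
        p = p + x - p * x
    return p

def p_and(*probs):
    # AND gate: probabilities of independent input events are multiplied
    p = 1.0
    for x in probs:
        p *= x
    return p

print(p_or(0.13, 0.04))    # ~0.165 (simple sum would give 0.17)
print(p_and(0.165, 0.426)) # ~0.070, both inputs must occur
```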

04-08-2021 Pramod N Parikh 21


Probability Maths

Review of Probability Theory

P(t) = 1 - R(t) = 1 - e^(-u t)

The overall probability of failure in a process depends on the nature of the interaction of each component and the process environment.
Data are collected on the failure rate of each hardware component.
With adequate data it can be shown that, on average, a component fails after a certain period of time, characterised by the average failure rate u, with units of faults/time.
The probability that the component will NOT fail during the time interval (0, t) is given by the Poisson distribution:
R(t) = e^(-u t)
where R is the reliability.
This equation assumes a constant failure rate u.
As t tends to infinity, the reliability goes to zero.
The higher the failure rate, the faster the reliability decreases.
The complement of reliability is called the failure probability P, given by
P(t) = 1 - R(t) = 1 - e^(-u t)
A typical bathtub failure-rate curve for process hardware, indicating a period of constant failure rate,
is shown on the next slide.
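As a quick numerical check of these formulas, the sketch below (illustrative, not part of the lecture) evaluates R(t) and P(t) for a component with a constant failure rate, using the 0.14 faults/year value of a pressure switch over a one-year interval.

```python
import math

def reliability(mu, t):
    # R(t) = exp(-mu * t): probability the component has NOT failed in (0, t)
    return math.exp(-mu * t)

def failure_probability(mu, t):
    # P(t) = 1 - R(t) = 1 - exp(-mu * t)
    return 1.0 - reliability(mu, t)

mu, t = 0.14, 1.0   # faults/year, over one year
print(round(reliability(mu, t), 2))          # 0.87
print(round(failure_probability(mu, t), 2))  # 0.13
```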

04-08-2021 Pramod N Parikh 22


The bathtub curve indicates the failure-rate behaviour of process safety devices such as sensors, shut-off valves and logic solvers (DCS/PLC).

Figure 12-2: A typical bathtub failure-rate curve for process hardware. Failure rate (faults/time) vs time: infant mortality (decreasing), a period of approximately constant failure rate over the mid-life of the component (where the constant-failure-rate equations above are valid), and old age (increasing).

04-08-2021 Pramod N Parikh 23


Failure rate (faults/year) data for the controller, control valve, pressure switch, solenoid valve, indicator lamp, pressure relief valve, etc. are important for deriving the probability of failure of a safety loop/layer.

Table 12-1 Failure Rate Data for Selected Process Components

Instrument: Faults/year
Controller: 0.29
Control valve: 0.60
Flow measurement (fluids): 1.14
Flow measurement (solids): 3.75
Flow switch: 1.12
Gas-liquid chromatograph: 30.6
Hand valve: 0.13
Indicator lamp: 0.044
Level measurement (liquids): 1.70
Level measurement (solids): 6.86
Oxygen analyzer: 5.65
pH meter: 5.88
Pressure measurement: 1.41
Pressure relief valve: 0.022
Pressure switch: 0.14
Solenoid valve: 0.42
Stepper motor: 0.044
Strip chart recorder: 0.22
Thermocouple temperature measurement: 0.52
Thermometer temperature measurement: 0.027
Valve positioner: 0.44

Selected from Frank P. Lees, Loss Prevention in the Process Industries (London: Butterworths, 1986), p. 343.
04-08-2021 Pramod N Parikh 24
Calculations for the reactor overpressure/rupture scenario of the reactor protected by two safety loops, PIA & PIC (slide 18)

Solution: refer to Table 12-1 for the failure rates of the pressure switches, indicator lamp and solenoid valve.

Component | Failure rate (faults/year) | Reliability R = e^(-u t) | Failure probability P = 1 - R (for t = 1 year)
Pressure switch 1 (for alarm): 0.14 | 0.87 | 0.13
Alarm lamp indicator: 0.044 | 0.96 | 0.04
Pressure switch 2: 0.14 | 0.87 | 0.13
Solenoid valve: 0.42 | 0.66 | 0.34

A dangerous high-pressure reactor rupture can occur only when both the alarm system and the shutdown system fail: these two protection systems are in parallel. Within the alarm system the two components are in series, so
R = (0.87)(0.96) = 0.835 ... reliability of the alarm system (multiplication of reliabilities)
P = 1 - R = 1 - 0.835 = 0.165 ... probability of failure of the alarm system
u = failure rate = -ln R = -ln(0.835) = 0.180 faults/year
MTBF = 1/u = 1/0.180 = 5.56 years for the alarm system

04-08-2021 Pramod N Parikh 25


Probability & Reliability are complementary

Solution (continued)
For the shutdown system the components are also in series:
R = (0.87)(0.66) = 0.574
P = 1 - R = 1 - 0.574 = 0.426 ... probability of failure
u = -ln R = -ln(0.574) = 0.555 faults/year
MTBF = 1/u = 1/0.555 = 1.8 years

Two systems combined (the hazard requires both the alarm system and the shutdown system to fail):
Probabilities are multiplied for the overall performance = 0.165 x 0.426 = 0.070.
This is the probability of failure of both systems, i.e. of the hazard occurring.
System reliability = 1 - 0.070 = 0.93 (93%)
Failure rate = -ln R = -ln(0.93) = 0.073 faults/year
MTBF = 1/u = 1/0.073 = 13.7 years (mean time between reactor-rupture hazards)
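The whole worked example can be reproduced in a few lines. The sketch below is only an illustration of the arithmetic above (not from the textbook); it computes the per-component reliabilities without rounding, so the results differ slightly from the slide values that use rounded intermediates.

```python
import math

# Per-component reliabilities over one year, R = exp(-mu*t)
R_ps1, R_lamp = math.exp(-0.14), math.exp(-0.044)   # ~0.87, ~0.96
R_ps2, R_sol  = math.exp(-0.14), math.exp(-0.42)    # ~0.87, ~0.66

# Series components: reliabilities multiply
R_alarm = R_ps1 * R_lamp          # ~0.832 (slide, with rounded inputs: 0.835)
R_shut  = R_ps2 * R_sol           # ~0.571 (slide: 0.574)
P_alarm, P_shut = 1 - R_alarm, 1 - R_shut

# Parallel protection layers: the hazard needs BOTH to fail, so probabilities multiply
P_both   = P_alarm * P_shut       # ~0.072 (slide: 0.070)
R_system = 1 - P_both             # ~0.93
mu_sys   = -math.log(R_system)    # ~0.075 faults/year (slide: 0.073)
print("MTBF =", round(1 / mu_sys, 1), "years")   # ~13.4 (slide: 13.7)
```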

04-08-2021 Pramod N Parikh 26


06-08-2021 Lecture 12: 09:00 ~ 11:00 AM (+ Test)
07-08-2021 Lecture 13: 09:00 ~ 11:00 AM

06-08-2021
09:05 - 09:15 :: MCQ instructions and guidance
09:15 - 09:30 :: MCQ test: 15 MCQs in 15 minutes
09:45 - 10:00 :: Revision of failure rate (ref. table from book, slides 1-4);
Reliability vs probability (Reliability + Probability = 1);
OR logic of A & B: P = P(A) + P(B) approx.;
AND logic of A & B: P = P(A) x P(B)
10:00 - 11:00 :: Exothermic reactor: explain safety functions of components with respect to failure/demand and the initiating event, branching

07-08-2021
Explain operating modes of controllers (Auto/Manual)
How the calculation is done for the various (9) results in Event Tree Analysis
Basic understanding of the calculations with data available for the IE in occurrences/year and the failure rate of safety functions
Relationship between FTA & ETA

07-08-2021 Pramod N Parikh


Table 12-1 Failure Rate Data for Selected Process Components (repeated from the earlier slide for revision; see the cleaned table above). Failure rate (faults/year) data for the controller, control valve, pressure switch, solenoid valve, indicator lamp, pressure relief valve, etc. are important for deriving the probability of failure of a safety loop/layer.
07-08-2021 Pramod N Parikh
Probability & Reliability are complementary - Solution (continued), repeated from the earlier slide for revision: shutdown-system reliability 0.574, failure probability 0.426, MTBF 1.8 years; combined alarm + shutdown failure probability 0.070, system reliability 0.93, failure rate 0.073 faults/year, MTBF 13.7 years.

07-08-2021 Pramod N Parikh


Calculations for the reactor overpressure/rupture scenario protected by two safety loops, PIA & PIC (repeated from the earlier slide for revision): alarm-system reliability 0.835, failure probability 0.165, failure rate 0.180 faults/year, MTBF 5.56 years.

07-08-2021 Pramod N Parikh


Chemical Reactor with Cooling Coil, TIC & TIA
(Ref: Chemical Process Safety: Fundamentals with Applications, Crowl & Louvar, Section 11-2 Event Trees, p. 487)

Figure 11-8: Reactor with high-temperature alarm and temperature controller. Reactor feed enters the vessel; cooling water flows in and out of the cooling coil; a thermocouple feeds the temperature controller (TC) on the cooling-water valve, and a high-temperature alarm (TA) alerts the operator.

07-08-2021 Pramod N Parikh
Event Tree with 4 Safety Functions for an Initiating Event
(Initiating event in occurrences/year; safety functions in probability of failure on demand)

Figure 11-9: Event tree for the loss-of-coolant accident for the reactor of Figure 11-8.
Initiating event: loss of cooling, 1 occurrence/yr.
Safety functions (identifier, failure/demand): B high-temperature alarm alerts operator (0.01); C operator notices high temperature (0.25); D operator re-starts cooling (0.25); E operator shuts down reactor (0.1).
The branches lead to continued operation, safe shutdown, or runaway, with frequencies ranging from 0.7425 occurrences/yr (continue) down to 0.0000625 occurrences/yr (runaway on the last branch).

07-08-2021 Pramod N Parikh


Full picture of the previous slide with the calculation explained

Figure (11-9 / 12-9): Event tree for the loss-of-coolant accident for the reactor of Figure 11-8, with branch identifiers and frequencies (occurrences/yr):
A: continue operation, 0.7425
AD: shutdown, 0.22275; ADE: runaway, 0.02475
AB: continue operation, 0.005625; ABD: shutdown, 0.0016875; ABDE: runaway, 0.0001875
ABC: continue operation, 0.001875; ABCD: shutdown, 0.0005625; ABCDE: runaway, 0.0000625

Shutdown total: 0.22275 + 0.0016875 + 0.0005625 = 0.2250 occurrences/yr
Runaway total: 0.02475 + 0.0001875 + 0.0000625 = 0.02500 occurrences/yr

07-08-2021 Pramod N Parikh
Shutdown: total of all 3 shutdown results = 0.2250 occurrences/year (A)
Runaway: total of all 3 runaway results = 0.02500 occurrences/year (B)
Continued: total of all 3 continued-operation results = 0.7500 occurrences/year (C)
Total occurrences per year = A + B + C = 1 occurrence/year per reactor

Figure 11-9: Event tree for the loss-of-coolant accident for the reactor of Figure 11-8 (as on the previous slide).
07-08-2021 Pramod N Parikh
Basic Understanding

1. Resulting frequency (occurrences per year)
= frequency of the initiating event x failure probability of the safety function(s) on demand
= (occurrences per year) x (failures per demand)

Example: if an operator shuts down the reactor with a failure rate of 0.1 per demand,
failure/demand = 0.1 means success/demand = 0.9 (= 1 - 0.1);
the operator shuts down the reactor successfully 9 times out of 10 (0.9).

2. Total of all resulting frequencies
= sum of all individual event frequencies (A + B + C), as verified on the previous slide.

3. Failure rate of a safety function (safeguard)
This can be considered as failures per demand (failure/demand) or interpreted as the probability of failure on demand (PFD). Probability has no units, for ease of understanding (LOPA concept).

07-08-2021 Pramod N Parikh


Relationship between FTA & ETA
Initiating Event & Cause (IE & Cause)
Top Event & Consequence (Result and Hazard)

ETA: begins with the IE and works forward toward the final outcomes (induction).
FTA: begins with the Top Event and works backward toward the IE (deduction).
IE: cause of the incident (accident).
Top Event: final outcome of the incident (hazard).

The Top Events in FTA are the IEs in ETA.

Both are used together to obtain a complete analysis of the incident (accident).
Probabilities and frequencies of occurrence are attached to these diagrams/analyses.

07-08-2021 Pramod N Parikh 10


13-08-2021 Lectures 15, 16 :: 09:00 ~ 11:00 AM

Last 4 slides of Lecture 14: revision of Event Tree Analysis

Lecture 15 from slide no. 5
Risk: acceptable & not acceptable risk
ALARP region concept: As Low As Reasonably Practicable, between the above two extremes of risk
How risk gets reduced by the LOPA method
Process risk without the use of any safeguards/protection layers is not acceptable risk
Tolerable risk as defined by the company, and what is residual risk?
Protection layers provide risk reduction to bring the risk down to acceptable risk.
Layers of Protection Analysis (LOPA) is the current methodology for designing a safe chemical plant.
Prior to LOPA, a HAZOP study is to be completed: "Hazard and Operability Study", as per the title in the syllabus.
HAZOP in the next lecture: most important for chemical process safety and for chemical engineers.
Chemical Reactor with Cooling Coil, TIC & TIA (Figure 11-8, repeated from the earlier slide for revision): reactor with high-temperature alarm and temperature controller.

13-08-2021 Pramod N Parikh
Shutdown: total of all 3 shutdown results = 0.2250 occurrences/year (A)
Runaway: total of all 3 runaway results = 0.02500 occurrences/year (B)
Continued: total of all 3 continued-operation results = 0.7500 occurrences/year (C)
Total occurrences per year = A + B + C = 1 occurrence/year per reactor

Figure 11-9 (repeated): Event tree for the loss-of-coolant accident for the reactor of Figure 11-8.
Basic Understanding

1. Resulting frequency (occurrences per year)
= frequency of the initiating event x probability of failure of the safety function per demand
= (occurrences per year) x (probability of failure per demand)
Example: if an operator shuts down the reactor with a failure probability of 0.1 per demand,
failure/demand = 0.1 means success/demand = 0.9 (out of 10 demands, the operator fails in one).
The operator shuts down the reactor successfully 9 times out of 10 (0.9) and fails one time (0.1).
An initiating frequency of 1 occurrence per year means the cooling-water-failure event occurs once a year.

2. Total of all resulting frequencies, which is the sum of all individual products of IE frequency and probability of failure
= sum of the individual event frequencies (shutdown frequency + runaway frequency + continued-operation frequency)
= (A + B + C), as verified on the previous slide.

3. Failure probability of a safety function (safeguards, which are nothing but protection layers)
This can be considered/expressed as failures per demand (failure/demand) or interpreted as the probability of failure on demand (PFD).
Note: probability has no units, for ease of understanding (PFD).
The IE has units of occurrences per year (example: a frequency of 0.1 per year = once in 10 years, or one occurrence per 10 years).
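The branch arithmetic of Figure 11-9 can be reproduced directly: each outcome frequency is the initiating-event frequency multiplied, along its branch, by the failure probability of every safety function that fails and the success probability (1 - PFD) of every one that works. The sketch below is illustrative only and simply re-computes the numbers quoted above.

```python
# Loss-of-cooling event tree (Figure 11-9)
IE = 1.0    # initiating event, occurrences/year
B  = 0.01   # high-temperature alarm fails to alert the operator
C  = 0.25   # operator fails to notice the high temperature
D  = 0.25   # operator fails to re-start cooling
E  = 0.10   # operator fails to shut down the reactor

ok = lambda p: 1.0 - p   # success probability of a safety function

# Branches grouped by outcome (products of probabilities along each branch)
continue_op = IE * (ok(B)*ok(D) + B*ok(C)*ok(D) + B*C*ok(D))
shutdown    = IE * (ok(B)*D*ok(E) + B*ok(C)*D*ok(E) + B*C*D*ok(E))
runaway     = IE * (ok(B)*D*E + B*ok(C)*D*E + B*C*D*E)

print(continue_op, shutdown, runaway)      # 0.75, 0.225, 0.025 occurrences/yr
print(continue_op + shutdown + runaway)    # 1.0: all branches sum to the IE frequency
```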

13-08-2021 Pramod N Parikh


13-08-2021: Lecture 15: 09:00 ~ 11:00 AM
Disadvantages of Fault Trees:
- Enormous tree for a complicated process (thousands of gates!)
- Time & cost (may take a huge amount of time for a complex process plant)
- No certainty that all types of failures (safe/dangerous) have been considered
- FTA is subjective to the individual analyst and hence differs in logic structure
- Failure probabilities are needed for all the events in the fault tree to know the probability of the top event
Advantage of Fault Trees: user-specific selection of the top event as needed
Disadvantage of Event Trees:
Multiple events resulting from a single failure might not be the events of specific interest to the user.
Advantage of Event Trees:
The concept is extended in Layer of Protection Analysis (LOPA) for determining the risk-reduction requirement.
Risk: Acceptable, Not Acceptable
LOPA: Layer of Protection Analysis
Advantages and Disadvantages of Fault Trees
Relationship between Fault Trees and Event Trees
Risk:
Risk is defined as the combination of the probability of occurrence of harm (frequency/likelihood) and the severity of that harm (consequence).
Risk could be loss of human life, loss of costly assets, or harm to the environment and all living things (PPE).
Acceptable risk: refer to Figure 11-15 (frequency & consequence) on page 499 of the textbook.
Further risk reduction is NOT required at additional cost for acceptable risk.
Not acceptable risk: refer to Figure 11-15 on page 499 (frequency & consequence).
Consider re-design of the process or protection system to reduce the risk to the acceptable zone by adding layers of protection and/or layers of mitigation in the process design.
LOPA: Layers of Protection Analysis: refer to Figure 11-16 on page 501 of the textbook.
The next slides make for easy understanding of the above risk concepts and LOPA:
How is Not Acceptable (intolerable) risk reduced to Acceptable risk? (By HAZOP study, controllers in DCS/PLC & LOPA)
Who decides the acceptable risk?

HAZOP, LOPA & RISK REDUCTION

Risk levels: intolerable risk level, ALARP or tolerable risk region, acceptable risk.
Workflow: HAZOP identifies the hazard scenarios; SIL selection sets the SIL of the SIF; the safety requirements are implemented; the safety loop implementation is then validated.
Onion Diagram for Protection Layers: Preventive & Mitigation Layers (please refer to the textbook of Crowl & Louvar)

Layers of Protection - Onion Model (IEC 61508/61511, layers of protection for avoiding hazards and consequences):
8. Community emergency response
7. Plant emergency response
6. Passive physical protection - mitigation/containment (dikes)
5. Active physical protection (relief devices)
4. Automatic Safety Instrumented System (SIS)
3. Critical alarms, operator supervision, manual intervention
2. Basic process controls, alarms, operator supervision (prevention)
1. Process design
SIS is one of the layers of protection, and event tree analysis is deployed in LOPA analysis.

Community Emergency Response
Plant Emergency Response
Physical Protection (Dikes)
Physical Protection (Relief Devices)
Safety Instrumented System
Alarms, Operator Intervention
Basic Process Control
Process

Defense in depth, or: don't put all your eggs in one basket.
Is alarm a perfect (IPL)Independent Protection Layer ?

xida
Alarms are One of the First Protection Layers

Community
Emergency response
Plant
Emergency response
ve protectio
Lsa

Safety InstrunmentedSystem
-

(Ls
Trip
Operator lntervention
Alarm
Process Control
LoOp
Process Design
Process value

3 major layers: BPCS with alarms, SIS, and the mechanical Pressure Safety Valve (PSV), shown with a process value (PV) trend graph. At the High alarm (H) the operator takes action; at the higher High-High (HH) trip point (e.g. PAH, PAHH) the SIF acts, as shown in the figure below, tripping the process to prevent a further rise in pressure.

Onion Model (continued)

Figure: process value rising from the normal condition, past the high level and the high alarm level (alarm condition: operator takes action), to the trip level (SIF action), and finally to the mechanical safety device at the unsafe condition.
Chain of Events with failure of the 3 layers (BPCS, SIS, PSV), with release of toxic, explosive or flammable gas leading to harm to humans and fire damage to assets.

Chain of Events:
Process under control -> initiating event (control loop failure, ESD valve fails to close, human error, pump malfunction, etc.) -> process deviation or disturbance -> process out of control -> hazardous situation: the SIF must prevent the hazardous situation; if the SIF fails on demand, the mechanical safeguard (PSV) must act; if that also fails, the hazard is released -> hazardous event -> consequences.
LOPA Event Tree: the risk arrow reduces in size due to each protection layer's effectiveness in reducing risk.

LOPA Frequency Model:
Initiating event (estimated frequency) -> IPL 1 (PFD1) -> IPL 2 (PFD2) -> IPL 3 (PFD3) -> impact event occurs.
At each IPL, success gives a safe outcome; failure passes the demand on to the next layer. The impact event frequency is the initiating-event frequency multiplied by the PFDs of the layers that fail.
Tolerable risk is the same as acceptable risk.
Example: DuPont story in the USA (1 death in 14 months).

Figure: Risk reduction. Increasing risk from left to right: residual risk < tolerable risk < process risk. The necessary risk reduction is the gap between the process risk and the tolerable risk; the actual risk reduction achieved by all protection layers must be at least this large. Part of the risk is covered by non-SIF protection layers and part by the SIF (the instrumented contribution, e.g. reduced from SIL 2 to SIL 1 when other layers are credited).

13-08-2021 Pramod N Parikh 14


Accidents are always avoidable!
Which is the recent major accident in India / in Gujarat?

Why do accidents happen?

Texas City Refinery, USA, 2005
Chernobyl, Russia, 1986
Bhopal, India, 1984
Piper Alpha, UK, 1988
Simple Example of a Hexane Tank Overflow Scenario
Determine the frequency of the fire event and of operator fatality by the LOPA method, in the case where:
- the LIC (level control loop) fails once in 10 years (this is the initiating event in the event tree)
- the overflow has passive protection by a dike, so spilled hexane is normally contained safely.

HAZOP identified tank overflow with possible fatality.
Figure 7-3: Sample process for the LOPA example: hexane storage tank with level controller (LIC), vent, and dike.

Data: Dike failure probability = 0.01 (1%)
Probability of ignition = 1 (100%)
Probability of presence of the operator in the area = 0.5 (50%)
Probability of fatality = 0.5 (50%)
Fire-fatality occurrence frequency = 0.1 x (0.01 x 1 x 0.5 x 0.5) = 0.00025 /yr by LOPA
= (IE frequency) x (probability of failure of all layers and conditional modifiers: AND gates)
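A minimal sketch of the LOPA multiplication for this scenario (illustrative only; the variable names are mine, the numbers are from the slide):

```python
# LOPA: mitigated consequence frequency = initiating-event frequency x PFDs of the
# independent protection layers x conditional modifiers (all AND-ed together).
ie_frequency = 0.1    # /yr, level control loop (LIC) fails once in 10 years
pfd_dike     = 0.01   # passive IPL: dike
p_ignition   = 1.0    # conditional modifier: probability of ignition
p_person     = 0.5    # probability an operator is present in the area
p_fatality   = 0.5    # probability the exposure is fatal

fire_fatality_frequency = ie_frequency * pfd_dike * p_ignition * p_person * p_fatality
print(fire_fatality_frequency)   # 0.00025 /yr, i.e. 2.5 x 10^-4 per year
```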

Figure 7-4: Event tree for the LOPA example.
Initiating event: BPCS loop failure, 0.1/yr.
Dike: success P = 0.99 (no significant event); failure P = 0.01.
Ignition: no P = 0 (no significant event); yes P = 1.0.
Personnel in area: no P = 0.5 (fire, no fatality); yes P = 0.5.
Fatality: no P = 0.5 (fire, no fatality); yes P = 0.5 (fire with fatality).
Fire-with-fatality frequency = 0.1 x 0.01 x 1.0 x 0.5 x 0.5 = 2.5 x 10^-4 /yr.
MHRD Scheme on Global Initiative on Academic Network (GIAN) & Commissionerate of Technical Education, Gujarat State
Initiating Event Frequency example (Ref: AIChE CCPS LOPA Textbook)

Initiating Event: Frequency (per year)
Gasket / packing blowout: 1 x 10^-2
Lightning strike: 1 x 10^-3
BPCS loop failure (control valve, logic solver, sensor; HW/SW): 1 x 10^-1
Safety relief valve opens spuriously: 1 x 10^-2
Regulator failure: 1 x 10^-1
Procedure failure (per opportunity): 1 x 10^-3
Operator failure (per opportunity): 1 x 10^-2


MHRD Scheme on Global Initiative on Academic Network (GIAN) & Commissionerate of Technical Education, Gujarat State
Probability of Failure on Demand (PFD) example (Ref: AIChE CCPS LOPA Textbook)

Passive / Active IPL: PFD
Dike (passive layer): 1 x 10^-2
Fireproofing (passive layer): 1 x 10^-2
Blast wall / bunker (passive layer): 1 x 10^-3
Flame / detonation arrester (passive layer): 1 x 10^-2
Safety relief valve (active layer): 1 x 10^-2
Rupture disk (active layer): 1 x 10^-2
Basic Process Control System (active layer): 1 x 10^-1


MHRD Scheme on Global Initiative on Academic Network (GIAN) & Commissionerate of Technical Education, Gujarat State
Benchmark Risk Tolerance Criteria for a single fatality per 100 million man-hours (Ref: AIChE CCPS LOPA Textbook)

Company: Max tolerable risk for workforce (all scenarios) | Negligible risk for workforce (all scenarios)
Shell: 10^-3 | 10^-6
BP: 10^-3 | 10^-6
ICI (onshore): 3.3 x 10^-5 | -
Rohm & Haas: 2.5 x 10^-5 | -
Typical: 10^-4 | 10^-6

Similar tables can be developed for financial losses and environmental releases.
Quantitative LOPA Initiating Cause Guidance Table (example)

Initiating Event: Likelihood of Failure (events per year)
BPCS instrument loop failure: 10^-1 (Note: IEC 61511 limits the failure likelihood claimed for the BPCS to no less than about 10^-5 per hour, i.e. 8.76 x 10^-2 per year) (IEC, 2001)
Regulator failure: 10^-1
Fixed equipment failure (e.g. exchanger tube failure): 10^-2
Pump failure (single pump normally running): 10^-1
Compressor or blower failure: 10
Cooling water failure (redundant CW pumps, diverse drivers): 10
Loss of power (redundant power supplies): 10^-1
Human error (routine task, once-per-day opportunity): 10^0
Human error (routine task, once-per-month opportunity): 10^-1
Human error (non-routine task, low stress): 10^-1
Human error (non-routine task, high stress): 10^0
Human error, inadvertent opening/closing of valve (no unique exposure): 10 per valve
Quantitative LOPA IPL Probabilities Guidance Table (example)

Independent Protection Layer (IPL): Probability of Failure on Demand (PFD)
*Basic Process Control System, if not associated with the initiating event being considered (assume high-demand BPCS): 1 x 10^-1
*Operator response to an audible alarm with at least 10 minutes response time: 1 x 10^-1
Critical operating procedure: 1 x 10
Relief valve (non-dirty service): 1 x 10
Relief valve (dirty service): 1 x 10
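Putting the guidance numbers together, a LOPA scenario is closed out by comparing the mitigated frequency with the company's tolerable frequency and, where a gap remains, computing the additional risk-reduction factor an extra IPL or SIF must provide. The sketch below is a generic illustration under assumed numbers (the initiating event, dike PFD and tolerable target are assumptions for this example, not figures from the lecture).

```python
def mitigated_frequency(ie_per_year, ipl_pfds):
    # Frequency remaining after credit for the existing independent protection layers
    f = ie_per_year
    for pfd in ipl_pfds:
        f *= pfd
    return f

def required_risk_reduction(mitigated, tolerable):
    # Risk-reduction factor any additional IPL/SIF must provide to meet the target
    return mitigated / tolerable if mitigated > tolerable else 1.0

# Assumed scenario: BPCS loop failure at 0.1/yr, credit for a dike with PFD 0.01,
# and an assumed tolerable frequency of 1e-5 /yr for this consequence.
f_mit = mitigated_frequency(0.1, [0.01])        # 1e-3 /yr
rrf   = required_risk_reduction(f_mit, 1e-5)    # 100, i.e. an extra IPL with PFD <= 0.01
print(f_mit, rrf)
```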
HAZOP:Hazard and Operability Study

Next session on 20-08-2021

13-08-2021 Pramod N Parikh 23


21-08-2021 Lectures 16 & 17, 09:00 ~ 11:00 AM

Revision of the last 3 slides (benchmark tolerable risk, initiating frequency, PFD for devices)

HAZOP: Hazards and Operability Studies

- Basics and definition
- 3 key factors on which HAZOP depends
- History of HAZOP and how it is conducted (as per the standard)
- Pre-requisites of a HAZOP study
- HAZOP study objectives

26-08-2021 Pramod N Parikh


Benchmark Risk Tolerance Criteria for a single fatality per 100 million man-hours (Ref: AIChE CCPS LOPA Textbook) - repeated from the earlier slide for revision: Shell 10^-3 / 10^-6, BP 10^-3 / 10^-6, ICI (onshore) 3.3 x 10^-5, Rohm & Haas 2.5 x 10^-5, typical 10^-4 / 10^-6.
Quantitative LOPA Initiating Cause Guidance Table (example) - repeated from the earlier slide for revision.
Quantitative LOPA IPL Probabilities Guidance Table (example) - repeated from the earlier slide for revision.
HAZOP: Hazard and Operability Study

Definition

"A HAZOP study is a systematic review of a process, by a group of experienced people, to identify deviations from the normal design intent which may lead to hazardous events or significant operability problems."

It is used to identify hazards in a chemical process facility; HAZOP is a well-established method!

It relies on:

Systematic identification
Methodical brainstorming
Creative interaction of diverse disciplines

26-08-2021 Pramod N Parikh


HAZOP History and the HAZOP Standard

1960s: invented at ICI (Bert Lawley), UK
1970s: further development by the CIA, UK
1980s & 90s: adopted by the major petroleum and chemical industries
1990s: joined by the mineral processing, food and water industries
1990-2000: expanded from process facilities to control systems, electrical systems, materials handling, etc.
2000s: expanded to construction and road design
2001: IEC 61882 (HAZOP Studies Application Guide) standard
2021: ISA is conducting an OnPoint recording for PHA/HAZOP in Oct. 2021 to discuss frequently asked questions (FAQ), a virtual meeting in which your teacher shall be a moderator. The method keeps improving, to reduce time and increase efficiency and effectiveness, through software from exida (USA) and other companies from the UK, Brazil, the Netherlands, etc.
The basic idea is to let the mind go free in a controlled fashion to consider all possible ways that process equipment failures and operational failures can occur: a brainstorming session!

26-08-2021 Pramod N Parikh


HAZOP Study Objectives

To facilitate safe plant start-up
To minimise the need for modifications
To maintain on-line time

And:

To identify all process safety, health and environmental hazards (SH & E)
And wherever possible, to determine:

THE INHERENT STRUCTURE OF SUCH HAZARDS / HAZARDOUS PROCESSES


Pre-requisites for HAZOP

Detailed process information / description
Updated PFDs
Updated P&IDs
Specifications of equipment used in the process design (reactors, pumps, compressors, instruments, piping as applicable, with MOC)
Mass and energy balance
MSDS: Material Safety Data Sheets for all chemicals
A cross-functional team of experienced professionals (process design, operation, maintenance; chemical, mechanical, instrumentation & control, electrical and safety engineers), with an independent sector expert of the process who shall conduct and control the meeting proceedings and have them recorded by a scribe

26-08-2021 Pramod N Parikh


GSFCU-B.Tech (Chem.)-VII Sem.

27-08-2021

27-08-2021 Pramod N Parikh


Pre-requisites for HAZOP

Detailed process information / description
Updated PFDs
Updated P&IDs
Cause & effect table / diagram for safety interlocks
Specifications of equipment used in the process design (reactors, pumps, compressors, instruments, piping as applicable, with MOC)
Mass and energy balance
MSDS: Material Safety Data Sheets for all chemicals
A cross-functional team of experienced professionals (process design, operation, maintenance; chemical, mechanical, instrumentation & control, electrical and safety engineers), with an independent sector expert of the process who shall conduct and control the meeting proceedings and have them recorded by a scribe

27-08-2021 Pramod N Parikh


27-08-2021: Lectures 18 & 19: 09:00 ~ 11:00 AM
HAZOP - II

HAZOP study objectives
HAZOP team composition
Where do the hazards live?
How are they identified before the plant is built, and how are they documented as a record?
HAZOP: a "bottom-up" technique of event analysis by an experienced team
HAZOP study general principles
Deviation column & two example sets of guidewords (process and electrical)
Simple example of a node
HAZOP preparation
HAZOP timing

27-08-2021 Pramod N Parikh


HAZOP Study Objectives

To facilitate safe plant start-up


To minimise the need for modifications
To maintain on-line time

And:

To identify all process safety, health and environmental hazards (SH & E)
And wherever possible, to determine:

THE INHERENT STRUCTURE OF SUCH HAZARDS /HAZARDOUS PROCESS


HAZOP Team Composition

Independent leader
Project manager / engineer
Operation / maintenance representative(s)
Discipline engineer(s) / specialists: process, mechanical, electrical / instrument
Others, e.g. chemist, vendor representative
HAZOP minutes recorder
Team size: 4 to 8 people (10 to 12 people if the design is out-sourced)

27-08-2021 Pramod N Parikh


Where do the hazards live in a chemical process plant?
IDENTIFICATION OF inherent hazards, known and unknown, in:
Equipment:
Storage tanks, reaction vessels (reactors), distillation/drying towers, pumps, blowers, compressors, turbines, etc.
Instruments:
Sensors: for flow, pressure, temperature, level, pH, conductivity, corrosion, etc.
Final elements: control valves, ON-OFF valves, motor start-stop contactors.
High-tech controllers: DCS/PLC-based process controllers (PIC, TIC, LIC, FIC) made up of hardware and software.

When any of the above fails to function as per design, it initiates a hazardous event (initiating event).
This initiating event may result in a hazardous outcome if the process design does not have adequate protection.

In a HAZOP study the cause of the initiating event is identified as a deviation from the normal design intent of operating in the normal range.
Also, the consequence (the resulting outcome) is recorded.
The HAZOP team reviews and records the safeguards, if any, that are in place to prevent or mitigate the consequence.

27-08-2021 Pramod N Parikh


How is the hazard addressed in the HAZOP method?
Typical HAZOP study record sheet / work-sheet:
The HAZOP expert (leader) asks the diverse team whether the existing safeguards are adequate or more are needed.
Brainstorming takes place, in a controlled environment managed by the leader, to come to a conclusion.
Any extra protection device/system gets recorded as a Recommendation in the HAZOP work-sheet.

ORICA HAZOP STUDY RECORD SHEET
PROJECT: | TEAM MEMBERS: | DATE:
NODE: | LEADER:
DRAWING: | MINUTES BY:
Columns: NO. | GUIDE WORDS | CAUSES | CONSEQUENCES | EXISTING SAFEGUARDS | RECOMMENDED ACTION | DONE

27-08-2021 Pramod N Parikh


HAZOP: A "Bottom-up" Technique

POTENTIAL CONSEQUENCES

INTERMEDIATE EVENTS

INITIATING EVENTS
HAZOP Study General Principles

GUIDEWORDS:
NO, NOT, NONE
MORE
LESS
AS WELL AS
PART OF
REVERSE
OTHER THAN
Plus special application guide words: SOONER / FASTER, LATER / SLOWER, WHERE ELSE

INTENTION -> DEVIATION -> CAUSE/CONSEQUENCE -> EVALUATION -> SAFEGUARDS


Two Example Sets of Guidewords

Process HAZOP:
High level / high flow; low level / low flow; zero flow / empty; reverse flow; high / low pressure; high / low temperature; impurities; changes in concentration; two-phase flow; reactions; testing - equipment / product; plant items - operable / maintainable; electrical and instruments.

Electrical HAZOP:
High / low / no / reverse current; high / low voltage; high / low potential differences; high / low frequency; phase - lead / lag / loss of; start-up / shutdown - common mode failures / spurious failures; synchronization / sequencing; maintenance safety; testing / commissioning; interference; emergencies - power failure; timing effects - too early or too late.
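Teams sometimes pre-generate the deviation checklist for a node by crossing each guideword with each process parameter. The sketch below is a simple, hypothetical illustration of that bookkeeping (the guideword and parameter lists are abbreviated examples, and meaningless combinations would still be screened out by the team, as in Tables 10-4 and 10-5 of the textbook).

```python
# Generate candidate deviations for one study node: guideword x process parameter.
# Lists are abbreviated examples; the team screens out meaningless combinations.
guidewords = ["NO/NONE", "MORE", "LESS", "REVERSE", "AS WELL AS", "OTHER THAN"]
parameters = ["FLOW", "TEMPERATURE", "PRESSURE", "LEVEL", "AGITATION"]

def deviations(node, guidewords, parameters):
    for p in parameters:
        for g in guidewords:
            yield f"{node}: {g} {p}"

for d in deviations("Cooling coil", guidewords, parameters):
    print(d)   # e.g. "Cooling coil: NO/NONE FLOW"
```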
Simple Example of a Node

The first "node": the line used for filling the tank.

Figure: tank with its fill line, level controller (LC), and outlet line to the plant.
HAZOP Preparation

Decide appropriate team membership
Suitable meeting facilities
Pre-training of team members
Order of systems to be reviewed
Time requirements for meetings
Guide word development
Quality of documentation
Meeting documentation requirements
Minute recording
Personal preparation by the participating team members and the leader

27-08-2021 Pramod N Parikh 12


HAZOP Timing

"HAZOP is an audit of a completed design" (Frank Lees)

Hence, it should be conducted at the end of the detailed process design stage, proving you are ready to construct.
Based on a number of design reviews, including:
Basis of design, including design flows, pressures, temperatures
Materials of construction (MOC), including gaskets, beyond piping
Layout (2D/3D) reviews, preferably including operability/maintainability access reviews
Major protection features detailed, including:
o Functional description incl. Cause & Effects diagram, process and emergency shutdown, etc.
o Pressure relief and blowdown
o Fire safety
o Containment
o Etc.

27-08-2021 Pramod N Parikh 13


HAZOP for a Chemical Reactor as per the textbook
NODE: Cooling coil (process parameters: flow and temperature) and stirrer (process parameter: agitation)

Figure 10-8: An exothermic reaction controlled by cooling water (reactor with monomer feed, cooling coils, cooling water in/out, thermocouple and temperature controller TC).

27-08-2021 Pramod N Parikh 14


Hazard Scenario for HAZOP

Example 10-2
Consider the reactor system shown in Figure 10-8. The reaction is exothermic, so a cooling system is provided to remove the excess energy of reaction. In the event that the cooling function is lost, the temperature of the reactor would increase. This would lead to an increase in reaction rate, leading to additional energy release. The result would be a runaway reaction with pressures exceeding the bursting pressure of the reactor vessel.
The temperature within the reactor is measured and is used to control the cooling water flow rate by a valve.
Perform a HAZOP study on this unit to improve the safety of the process. Use as study nodes the cooling coil (process parameters: flow and temperature) and the stirrer (process parameter: agitation).
Solution
The guide words are applied to the study nodes of the cooling coils and the stirrer with the designated process parameters.
The HAZOP results are shown in Table 10-7, which is only a small part of the complete analysis.

27-08-2021 Pramod N Parikh 15


HAZOP STUDY RECORD SHEET

Process: Reactor of Example 10-2 | Team members: virtual class of B.Tech (Chem.) Sem. VII students | Date: 27-08-2021
Node description: cooling coil used for circulating cooling water in the exothermic reactor, tag no. 27-R-108 | Ref. drg.: Figure 10-8
HAZOP leader: P N Parikh | HAZOP scribe: PNP | Target date / completed: -

Guide word: NO / LESS FLOW
Cause 1: Control valve fails closed (TCV-108). Consequence: loss of cooling, possible RUNAWAY. Existing safeguard: nil. Recommendation: select valve to fail open (FO) (design safeguard / protection layer). Action by: Paresh (Process Design) / Project Dept.
Cause 2: Plugged cooling coils. Existing safeguard: nil. Recommendations: (1) install filter with maintenance procedure (SOP) - Prakash (Maintenance); (2) install cooling water flow meter and low-flow alarm - P N Parikh (Instrument); (3) install high-temperature alarm to alert the operator - Project Dept.
Cause 3: Cooling water service failure. Existing safeguard: nil. Recommendation: check and monitor reliability of the water service - Raj Mehta (Operation).
Cause 4: Temperature controller (TC-R-108) fails and closes valve (TCV-108). Existing safeguard: nil. Recommendation: place the temperature controller on the critical instrumentation list, maintain its spare parts and do preventive maintenance - P N Parikh (Instrument).
Cause 5: Instrument air pressure fails, closing valve (TCV-108). Existing safeguard: nil. Recommendation: select FO valve (design safeguard / protection layer) - Paresh (Process Design).
Cause 6: Sensor / power failure. Existing safeguard: nil. Recommendation: redundant sensor, redundant power supply - P N Parikh (Instrument / Project Dept.).

27-08-2021 Pramod N Parikh 16


-
The End -

27-08-2021 Pramod N Parikh 17


448 Chapter 10 Hazards Identification

10-3 Hazards and Operability Studies


The HAZOP study is a formal procedure to identify hazards in a chemical process facility5 The
procedure is effective in identifying hazards and is well accepted by the chemical industry.
The basic idea is to let the mind go free in a controlled fashion in order to consider all the
possible ways that process and operational failures can occur.
Before the HAZOP study is started, detailed information on the process must be avail-
able. This includes up-to-date process flow diagrams (PFDs). process and instrumentation di-
agrams (P&IDs), detailed equipment specifications, materials of construction, and mass and
energy balances.
The full HAZOP study requires a committee composed of a cross-section of experienced
plant, laboratory, technical, and safety professionals. One individual must be a trained HA-
ZOP leader and serves as the committee chair. This person leads the discussion and must be
experienced with the HAZOP procedure and the chemical process under review. One individ-
ual must also be assigned the task of recording the results, although a number of vendors pro-
vide software to perform this function on a personal computer. The committee meets on a reg-
ular basis for a few hours each time. The meeting duration must be short enough to ensure
continuing interest and input from all committee members. A large process might take several
months of biweekly meetings to complete the HAZOP study. Obviously, a complete HAZOP
study requires a large investment in time and effort, but the value of the result is well worth the
effort.
The HAZOP procedure uses the following steps to complete an analysis:

1. Begin with a detailed flow sheet. Break the fiow sheet into a number of process units. Thus
the reactor area might be one unit, and the storage tank another. Select a unit for study.
2. Choose a study node (vessel, line, operating instruction).
3. Describe the design intent of the study node. For example, vessel V-1 is designed to store
the benzene feedstock and provide it on demand to the reactor.
4. Pick a process parameter: flow, level, temperature, pressure, concentration, pH, viscosity,
state (solid, liquid, or gas), agitation, volume, reaction, sample, component, start, stop,
stability, power, inert
5. Apply a guide word to the process parameter to suggest possible deviations. A list of guide
words is shown in Table 10-3. Some of the guide word process parameter combinations
are meaningless, as shown in Tables 10-4 and 10-5 for process lines and vessels.
6. If the deviation is applicable, determine possible causes and note any protective systems.
7. Evaluate the consequences of the deviation (if any).
8. Recommend action (what? by whom? by when?)
9. Record all information.

Guidelines for Hazard Evaluation Procedures, 2d ed. (New York: American Institute of Chemical Engi-
neers, 1992).
10-3 Hazards and Operability Studies 449

Table 10-3 Guide Words Used for the HAZOP Procedure

Guide words | Meaning | Comments
NO, NOT, NONE | The complete negation of the intention | No part of the design intention is achieved, but nothing else happens.
MORE, HIGHER, GREATER | Quantitative increase | Applies to quantities such as flow rate and temperature and to activities such as heating and reaction.
LESS, LOWER | Quantitative decrease | Applies to quantities such as flow rate and temperature and to activities such as heating and reaction.
AS WELL AS | Qualitative increase | All the design and operating intentions are achieved along with some additional activity, such as contamination of process streams.
PART OF | Qualitative decrease | Only some of the design intentions are achieved, some are not.
REVERSE | The logical opposite of the intention | Most applicable to activities such as flow or chemical reaction. Also applicable to substances, for example, poison instead of antidote.
OTHER THAN | Complete substitution | No part of the original intention is achieved; the original intention is replaced by something else.
SOONER THAN | Too early or in the wrong order | Applies to process steps or actions.
LATER THAN | Too late or in the wrong order | Applies to process steps or actions.
WHERE ELSE | In additional locations | Applies to process locations, or locations in operating procedures.

Table 10-4 Valid Guide Word and Process Parameter Combinations


for Process Lines (x's represent valid combinations)

No, More, As
Process not, higher, Less, well Part Other Sooner, Later, Where
parameters none greater lower as of Reverse than faster slower else
Flow X X X X X

Temperature
Pressure
Concentration X X
pH x X
Viscosity X X X

State X
450 Chapter 10 Hazards ldentification

Table 10-5 Valid Guide Word and Process Parameter Combinations


for Process Vessels (x's represent valid combinations)

No, More, As
Process not, higher, Less, well Part Other Sooner, Later, Where
Parameters none greater lower as of Reverse than faster slower else
Level X X x X

Temperature X

Pressure X

Concentration
pH X X
Viscosity X X

Agitation X X
Volume
Reaction
State X X
Sample X X X X

10. Repeat steps 5 through 9 until all applicable guide words have been applied to the cho-
sen process parameter.
11. Repeat steps 4 through 10 until all applicable process parameters have been considered
for the given study node.
12. Repeat steps 2 through 11 until all study nodes have been considered for the given sec-
tion and proceed to the next section on the flow sheet.

The guide words AS WELL AS, PART OF, and oTHER THAN can sometimes be conceptually dif-
ficult to apply. As WELL AS means that something else happens in addition to the intended de-
sign intention. This could be boiling of a liquid, transfer of some additional component, or the
transfer of some fluid somewhere else than expected. PART OF means that one of the compo-
nents is missing or the stream is being preferentially pumped to only part of the process.
OTHER THAN applies to situations in which a material is substituted for the expected material,
is transferred somewhere else, or the material solidifies and cannot be transported. The guide
words soONER THAN, LATER THAN, and WHERE ELSE are applicable to batch processing.
An important part of the HAZOP procedure is the organization required to record and
use the results. There are many methods to accomplish this and most companies customize
their approach to fit their particular way of doing things.
Table 10-6 presents one type of basic HAZOP form. The first column, denoted "Item,"
is used to provide a unique identifier for each case considered. The numbering system used is
a number-letter combination. Thus the designation "1A" would designate the first study node
and the first guide word. The second column lists the study node considered. The third column
lists the process parameter, and the fourth column lists the deviations or guide words. The next
three columns are the most important results of the analysis. The first column lists the possible
Table 10-6 HAZOP Form for Recording Data

Hazards and Operability Review
Project name: ___   Date: ___   Page __ of __   Completed: ___   No action: ___   Reply date: ___
Process: ___
Section: ___   Reference drawing: ___
Columns: Item | Study node | Process parameters | Deviations (guide words) | Possible causes | Possible consequences | Action required | Assigned to
452 Chapter 10 Hazards ldentification

Figure 10-8 An exothermic reaction controlled by cooling water (reactor with monomer feed, cooling coils, cooling water in/out, thermocouple and temperature controller TC).

causes. These causes are determined by the committee and are based on the specific devia-
tion-guide word combination. The next column lists the possible consequences of the devia-
tion. The last column lists the action required to prevent the hazard from resulting in an acci
dent. Notice that the items listed in these three columns are numbered consecutively. The last
several columns are used to track the work responsibility and completion of the work.

Example 10-2
Consider the reactor system shown in Figure 10-8. The reaction is exothermic, so a cooling system
is provided to remove the excess energy of reaction. In the event that the cooling function is lost,
the temperature of the reactor would increase. This would lead to an increase in reaction rate, lead-
ing to additional energy release. The result would be a runaway reaction with pressures exceeding
the bursting pressure of the reactor vessel.
The temperature within the reactor is measured and is used to control the cooling water flow
rate by a valve.
Perform a HAZOP study on this unit to improve the safety of the process. Use as study nodes the cooling coil (process parameters: flow and temperature) and the stirrer (process parameter: agitation).

Solution
The guide words are applied to the study nodes of the cooling coils and the stirrer with the designated process parameters.
The HAZOP results are shown in Table 10-7, which is only a small part of the complete
analysis.
Table 10-7 HAZOP Study Applied to the Exothermic Reactor of Example 10-2

Hazards and Operability Review
Project name: Example 10-2   Date: 1/1/93   Page 1 of 2
Process: Reactor of Example 10-2
Section: Reactor shown in Example 10-2   Reference drawing: Figure 10-8
Columns: Item | Study node | Process parameters | Deviations (guide words) | Possible causes | Possible consequences | Action required | Assigned to
1A Cooling Flow No 1. Control valve fails closed 1. Loss of cooling, possible 1. Select valve to fail open DAC 1/993
coils 2. Plugged cooling coils runaway 2. Install fiter with maintenance DAC 1/93
2. procedure
Install cooling water flow meter DAC |2/93
and low flow alarm
Install high temperature alarm DAC 2/93
to alert operator
3. Cooling water service failure 3. 3. Check and monitor reliability of |DAC 2/93
water service
Controller fails and closes valve 4. 4. Place controller on critical DAC 1/93
instrumentation list
5. Air pressure fails, closing valve 5. See 1A.1
1B High 1. Control valve fails open 1. Reactor cools, reactant 1. Instruct operators and update JFL 1/93
conc. builds, possible procedures
runaway on heating
2. Controiler fails and opens valve 2. 2. See 1A.4
C LoW 1.Partially plugged cooling line 1.Diminished cooling, 1. See 1A.2
possible runaway
2. Partial water source failure 2. See 1A.2
3. Control vaive fails to respond 3. Place valve on critical JFL 1/93
instrumentation list
1D As well as, Contamination of water supply 1. Not possible here 1. None
1E part of, 1. Covered under 1C
1F reverse 1. Failure of water source resulting in 1. Loss of cooling, possible 1. See 1A.2
backflow runaway
2. Backflow due to high backpressure 2. Install check valve JFL 2/93
1G Other than, 1. Not considered possible
1H soonerthan,| 1. Cooling normally started early 1. None
later than 1.Operatorerror 1. Temperature rises, 1. Interlock between cooling flow JW 1/93
possible runaway and reactor feed
1J
1K Temp.
Where else 1.1. Not consideredpossible
Low Low watersupply temperature 1. None-controiler handles 1. None
L High 1. High water supply temperature 1.Cooling system capacity 1. Install high flow alarm and/or JW 1/93
limited, temp. increases cooling water high temp. alarm
2A StirrerAgitation | No
1. Stirrer motor malfunction 1. No mixing, possible 1. Interlock with feed line 1/93
accumulation of unreacted JW
materialss 2/93
2. Power failure 2. Monomer feed continues, 2. Monomer feed valve must fail JW
possible accumulation of closed on power loss
unreacted materials
2B More Stirrer motor controller fails, 1. None
resulting in high motor speed
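The bookkeeping behind a table such as Table 10-7 is just the cross product of the guide words with the process parameters of each study node; every pairing is a candidate deviation for the committee to screen for credible causes and consequences. A minimal Python sketch of that enumeration is given below; the guide-word list is a representative subset of the ones appearing in the table, and the helper is illustrative only, not part of any formal HAZOP tool.

    from itertools import product

    # Guide words and study nodes as used in Table 10-7 (representative subset)
    guide_words = ["no", "high", "low", "as well as", "part of", "reverse",
                   "other than", "sooner than", "later than", "where else"]
    study_nodes = {"cooling coils": ["flow", "temperature"],
                   "stirrer": ["agitation"]}

    # Pair every guide word with every process parameter of every study node;
    # each pairing is a deviation to be examined by the review committee.
    for node, parameters in study_nodes.items():
        for parameter, guide_word in product(parameters, guide_words):
            print(f"{node}: {guide_word} {parameter}")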

The potential process modifications resulting from this study (Example 10-2) are the
following:

install a high-temperature alarm to alert the operator in the event of cooling function loss,
install a high-temperature shutdown system (this system would automatically shut down
the process in the event of a high reactor temperature; the shutdown temperature would
be higher than the alarm temperature to provide the operator with the opportunity to
restore cooling before the reactor is shut down),
install a check valve in the cooling line to prevent reverse flow (a check valve could be
installed both before and after the reactor to prevent the reactor contents from flowing
upstream and to prevent the backflow in the event of a leak in the coils),
periodically inspect the cooling coil to ensure its integrity,
study the cooling water source to consider possible contamination and interruption of
supply,
install a cooling water flow meter and low-flow alarm (which will provide an immediate
indication of cooling loss).

In the event that the cooling water system fails (regardless of the source of the failure),
the high-temperature alarm and emergency shutdown system prevents a runaway reaction. The
review committee performing the HAZOP study decided that the installation of a backup controller
and control valve was not essential. The high-temperature alarm and shutdown system
prevents a runaway reaction in this event. Similarly, a loss of coolant water source or a plugged
cooling line would be detected by either the alarm or the emergency shutdown system. The review
committee suggested that all coolant water failures be properly reported and that if a particular
cause occurred repeatedly, then additional process modifications were warranted.
Example 10-2 demonstrates that the number of suggested process changes is great, although
only a single process intention is considered.
The advantage to this approach is that it provides a more complete identification of the
hazards, including information on how hazards can develop as a result of operating procedures
and operational upsets in the process. Companies that perform detailed HAZOP studies find
that their processes operate better and have less downtime, that their product quality is improved,
that less waste is produced, and that their employees are more confident in the safety
of the process. The disadvantages are that the HAZOP approach is tedious to apply, requires
considerable staff time, and can potentially identify hazards independent of the risk.

10-4 Safety Reviews


Another method that is commonly used to identify safety problems in laboratory and process
areas and to develop solutions is the safety review. There are two types of safety reviews: the
informal and the formal.

Figure 10-9 Original design of phosgene reactor before informal safety review. (Phosgene, COCl2, is fed to the reactor; the reactor vent passes through a reflux condenser to a caustic NaOH scrubber.)

The informal safety review is used for small changes to existing processes and for small
bench-scale or laboratory processes. The informal safety review procedure usually involves
just two or three people. It includes the individual responsible for the process and one or two
others not directly associated with the process but experienced with proper safety procedures.
The idea is to provide a lively dialogue where ideas can be exchanged and safety improvements
can be developed.
The reviewers simply meet in an informal fashion to examine the process equipment and
operating procedures and to offer suggestions on how the safety of the process might be im-
proved. Significant improvements should be summarized in a memo for others to reference in
the future. The improvements must be implemented before the process is operated.

Example 10-3
Consider the laboratory reactor system shown in Figure 10-9. This system is designed to react phosgene
(COCl2) with aniline to produce isocyanate and HCl. The reaction is shown in Figure 10-10.
The isocyanate is used for the production of foams and plastics.
Phosgene is a colorless vapor with a boiling point of 46.8°F. Thus it is normally stored as a
liquid in a container under pressure above its normal boiling point temperature. The TLV for phosgene
is 0.1 ppm, and its odor threshold is 0.5-1 ppm, well above the TLV.
Aniline is a liquid with a boiling point of 364°F. Its TLV is 2 ppm. It is absorbed through
the skin.

C6H5NH2 (aniline) + COCl2 (phosgene) → C6H5NCO (isocyanate) + 2 HCl

Figure 10-10 Reaction stoichiometry for phosgene reactor.

Figure 10-11 Final design of phosgene reactor after informal safety review. (Additions visible in the figure include vacuum control, a COCl2 flow indicator, a relief line with a trap, 50% NaOH and 20% NH4OH bubblers on the vent, and a pail of caustic solution.)

In the process shown in Figure 10-9 the phosgene is fed from the container through a valve
into a fritted glass bubbler in the reactor. The reflux condenser condenses aniline vapors and re-
turns them to the reactor. A caustic scrubber is used to remove the phosgene and HCl vapors from
the exit vent stream. The complete process is contained in a hood.
Conduct an informal safety review on this process.

Solution
The safety review was completed by two individuals. The final process design is shown in Fig-
ure 10-11. The changes and additions to the process are as follows:

1. vacuum is added to reduce the boiling temperature,
2. relief system is added with an outlet to a scrubber to prevent hazards resulting from a plugged
fritted glass bubbler,
3. flow indicator provides visual indication of flow,
4. bubblers are used instead of scrubbers because they are more effective,
5. ammonium hydroxide bubbler is more effective for absorbing phosgene,
6. trap catches liquid phosgene,
7. pail of caustic is added (the phosgene cylinder would be dumped into this pail in the event of
a cylinder or valve leak; the caustic would absorb the phosgene).

In addition, the reviewers recommended the following: (1) Hang phosgene indicator paper around
the hood, room, and operating areas (this paper is normally white but turns brown when exposed
to 0.1 ppm of phosgene), (2) use a safety checklist, daily, before the process is started, and (3) post
an up-to-date process sketch near the process.

The formal safety review is used for new processes, substantial changes in existing processes,
and processes that need an updated review. The formal safety review is a three-step procedure.
This consists of preparing a detailed formal safety review report, having a committee
review the report and inspect the process, and implementing the recommendations. The formal
safety review report includes the following sections:

I. Introduction
A. Overview or summary: Provides a brief summary of the results of the formal safety
review. This is done after the formal safety review is complete.
B. Process overview or summary: Provides a brief description of the process with an emphasis
on the major hazards in the operation.
C. Reactions and stoichiometry: Provides the chemical reaction equations and stoichiometry.
D. Engineering data: Provides operating temperatures, pressures, and relevant physical
property data for the materials used.
II. Raw materials and products: Refers to specific hazards and handling problems associated
with the raw materials and products. Discusses procedures to minimize these hazards.
III. Equipment setup
A. Equipment description: Describes the configuration of the equipment. Sketches of
the equipment are provided.
B. Equipment specifications: Identifies the equipment by manufacturer name and model
number. Provides the physical data and design information associated with the
equipment.
IV. Procedures
A. Normal operating procedures: Describes how the process is operated.
B. Safety procedures: Provides a description of the unique concerns associated with the
equipment and materials and specific procedures used to minimize the risk. This in-
cludes:
1. Emergency shutdown: Describes the procedure used to shut down the equipment
if an emergency should occur. This includes major leaks, reactor runaway, and loss
of electricity, water, and air pressure.
2. Fail-safe procedures: Examines the consequences of utility failures, such as loss of
steam, electricity, water, air pressure, or inert padding. Describes what to do for
each case so that the system fails safely.
3. Majorrelease procedures: Describes what to do in the event of a major spill of toxic
or flammable material.
C. Waste disposal procedure: Describes how toxic or hazardous materials are collected,
handled, and disposed.
D. Cleanup procedures: Describes how to clean the process after use.
V. Safety checklist: Provides the complete safety checklist for the operator to complete be-
fore operation of the process. This checklist is used before every startup.
VI. Material safety data sheets: Provided for each hazardous material used.

Figure 10-12 Toluene water wash process before formal safety review. (Dirty toluene storage feeds the Podbielniak centrifuge, or Pod; water enters as the heavy phase and clean toluene leaves as the light phase.)

Example 10-4
A toluene water wash process is shown in Figure 10-12. This process is used to clean water-soluble
impurities from contaminated toluene. The separation is achieved with a Podbielniak centrifuge,
or Pod, because of a difference in densities. The light phase (contaminated toluene) is fed to the
periphery of the centrifuge and travels to the center. The heavy phase (water) is fed to the center
and travels countercurrent to the toluene to the periphery of the centrifuge. Both phases are mixed
within the centrifuge and separated countercurrently. The extraction is conducted at 190°F.
The contaminated toluene is fed from a storage tank into the Pod. The heavy liquid out (contaminated
water) is sent to waste treatment and the light liquid out (clean toluene) is collected in a
55-gal drum.
Perform a formal safety review on this process.
Solution
The complete safety review report is provided in appendix D. Figure 10-13 shows the modified pro-
cess after the formal safety review has been completed. The significant changes or additions added
as a result of the review are as follows:

1. add grounding and bonding to all collection and storage drums and process vessels,
2. add inerting and purging to all drums,
3. add elephant trunks at all drums to provide ventilation,
4. provide dip legs in all drums to prevent the free fall of solvent resulting in the generation and
accumulation of static charge,
5. add a charge drum with grounding, bonding, inerting, and ventilation,
6. provide a vacuum connection to the dirty toluene storage for charging,
7. add a relief valve to the dirty toluene storage tank,
8. add heat exchangers to all outlet streams to cool the exit solvents below their flash point (this
must include temperature gauges to ensure proper operation), and

Figure 10-13 Toluene water wash process after formal safety review. (Additions visible in the figure include nitrogen inerting, a dirty toluene charge drum, a vacuum connection to the dirty toluene storage, and heat exchangers cooling the exit streams to 77-90°F.)

9. provide a waste water collection drum to collect all waste water that might contain substan-
tial amounts of toluene from upset conditions.

Additional changes were made in the operating and emergency procedure. They included

1. checking the room air periodically with colorimetric tubes to determine whether any toluene
vapors are present and
2. changing the emergency procedure for spills to include (a) activating the spill alarm, (b) in-
creasing the ventilation to high speed, and (c) throwing the sewer isolation switch to prevent
solvent from entering the main sewer lines.

The formal safety review can be used almost immediately, is relatively easy to apply, and is
known to provide good results. However, the committee participants must be experienced in
identifying safety problems. For less experienced committees, a more formal HAZOP study
may be more effective in identifying the hazards.

10-5 Other Methods


Other methods that are available for identifying hazards are the following:
1. "What if" analysis: This less formal method of identifying hazards applies the words "what
if" to a number of areas of investigation. For instance, the question might be, What if the
flow stops? The analysis team then decides what the potential consequences might be and
how to solve any problems.

2. Human error analysis: This method is used to identify the parts and the procedures of a
process that have a higher than normal probability of human error. Control panel layout
is an excellent application for human error analysis because a control panel can be designed
in such a fashion that human error is inevitable.
3. Failure mode, effects, and criticality analysis (FMECA): This method tabulates a list of
equipment in the process along with all the possible failure modes for each item. The effect
of a particular failure is considered with respect to the process.

Suggested Reading
Dow's Fire and Explosion Index Hazard Classification Guide, 7th ed. (New York: American Institute of
Chemical Engineers, 1994).
Guidelines for Hazard Evaluation Procedures, 2d ed. (New York: American Institute of Chemical Engi-
neers, 1992).
Trevor A. Kletz, HAZOP and HAZAN, 3d ed. (Warwickshire, England: Institution of Chemical Engineers,
1992).
Frank P. Lees, Loss Prevention in the Process Industries, 2d ed. (London: Butterworths, 1996), ch. 8.

Problems
10-1. The hydrolysis of acetic anhydride is being studied in a laboratory-scale continuously
stirred tank reactor (CSTR). In this reaction acetic anhydride [(CH3CO)2O] reacts with
water to produce acetic acid (CH3COOH).
The concentration of acetic anhydride at any time in the CSTR is determined by
titration with sodium hydroxide. Because the titration procedure requires time (rela-
tive to the hydrolysis reaction time), it is necessary to quench the hydrolysis reaction as
soon as the sample is taken. The quenching is achieved by adding an excess of aniline
to the sample. The quench reaction is

(CH3CO)2O + C6H5NH2 → CH3COOH + C6H5NHCOCH3

The quenching reaction also forms acetic acid, but in a different stoichiometric ratio
than the hydrolysis reaction. Thus it is possible to determine the acetic anhydride con-
centration at the time the sample was taken.
The initial experimental design is shown in Figure 10-14. Water and acetic anhy-
dride are gravity-fed from reservoirs and through a set of rotameters. The water is
mixed with the acetic anhydride just before it enters the reactor. Water is also circulated
by a centrifugal pump from the temperature bath through coils in the reactor vessel.
This maintains the reactor temperature at a fixed value. A temperature controller in the
water bath maintains the temperature to within 1°F of the desired temperature.
CHAPTER 11

Risk Assessment

Risk assessment includes incident identification and consequence analysis. Incident identification describes how an accident occurs. It frequently
includes an analysis of the probabilities. Consequence analysis describes the expected damage.
This includes loss of life, damage to the environment or capital equipment, and days outage.
The hazards identification procedures presented in chapter 10 include some aspects of
risk assessment. The Dow F&EI includes a calculation of the maximum probable property damage
(MPPD) and the maximum probable days outage (MPDO). This is a form of consequence
analysis. However, these numbers are obtained by some rather simple calculations involving
published correlations. Hazard and operability (HAZOP) studies provide information on how
a particular accident occurs. This is a form of incident identification. No probabilities or numbers
are used with the typical HAZOP study, although the experience of the review committee
is used to decide on an appropriate course of action.
In this chapter we will
In this chapter we will

review probability mathematics, including the mathematics of equipment failure,
show how the failure probabilities of individual hardware components contribute to the
failure of a process,
describe two probabilistic methods (event trees and fault trees),
describe the concepts of layer of protection analysis (LOPA), and
describe the relationship between quantitative risk analysis (QRA) and LOPA.

We focus on determining the frequency of accident scenarios. The last two sections show
how the frequencies are used in QRA and LOPA studies; LOPA is a simplified QRA. It should
be emphasized that the teachings of this chapter are all easy to use and to apply, and the results
are often the basis for significantly improving the design and operation of chemical and petrochemical
plants.

11-1 Review of Probability Theory


Equipment failures or faults in a process occur as a result of a complex interaction of the indi-
vidual components. The overall probability of a failure in a process depends highly on the na-
ture of this interaction. In this section we define the various types of interactions and describe
how to perform failure probability computations.
Data are collected on the failure rate of a particular hardware component. With adequate
data it can be shown that, on average, the component fails after a certain period of time.
This is called the average failure rate and is represented by \mu with units of faults/time. The
probability that the component will not fail during the time interval (0, t) is given by a Poisson
distribution:

R(t) = e^{-\mu t},    (11-1)

where R is the reliability. Equation 11-1 assumes a constant failure rate \mu. As t \to \infty, the reliability
goes to 0. The speed at which this occurs depends on the value of the failure rate \mu. The
higher the failure rate, the faster the reliability decreases. Other and more complex distributions
are available. This simple exponential distribution is the one that is used most commonly
because it requires only a single parameter, \mu. The complement of the reliability is called the
failure probability (or sometimes the unreliability), P, and it is given by

P(t) = 1 - R(t) = 1 - e^{-\mu t}.    (11-2)

The failure density function is defined as the derivative of the failure probability:

f(t) = \frac{dP(t)}{dt} = \mu e^{-\mu t}.    (11-3)

The area under the complete failure density function is 1.


The failure density function is used to determine the probability P of at least one failure
in the time period t_0 to t:

P(t) = \int_{t_0}^{t} f(t)\,dt = \mu \int_{t_0}^{t} e^{-\mu t}\,dt = 1 - e^{-\mu t}.    (11-4)

B. Roffel and J. E. Rijnsdorp, Process Dynamics, Control, and Protection (Ann Arbor, MI: Ann Arbor
Science, 1982), p. 381.

Figure 11-1 Typical plots of (a) the failure rate \mu, (b) the failure density f(t), (c) the failure probability P(t), and (d) the reliability R(t).

The integral represents the fraction of the total area under the failure density function between
time t_0 and t.
The time interval between two failures of the component is called the mean time between
failures (MTBF) and is given by the first moment of the failure density function:

E(t) = MTBF = \int_0^{\infty} t f(t)\,dt = \frac{1}{\mu}.    (11-5)

Typical plots of the functions \mu, f, P, and R are shown in Figure 11-1.
Equations 11-1 through 11-5 are valid only for a constant failure rate \mu. Many components
exhibit a typical bathtub failure rate, shown in Figure 11-2. The failure rate is highest
when the component is new (infant mortality) and when it is old (old age). Between these two
periods (denoted by the vertical lines in Figure 11-2), the failure rate is reasonably constant and Equations
11-1 through 11-5 are valid.
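To make Equations 11-1 through 11-5 concrete, the short Python sketch below evaluates the reliability, failure probability, failure density, and MTBF for an assumed constant failure rate. The numerical value of mu is taken from Table 11-1 (control valve) purely as an illustration.

    import math

    def reliability(mu, t):
        # Equation 11-1: R(t) = exp(-mu*t) for a constant failure rate mu
        return math.exp(-mu * t)

    def failure_probability(mu, t):
        # Equation 11-2: P(t) = 1 - R(t)
        return 1.0 - reliability(mu, t)

    def failure_density(mu, t):
        # Equation 11-3: f(t) = dP/dt = mu*exp(-mu*t)
        return mu * math.exp(-mu * t)

    def mtbf(mu):
        # Equation 11-5: the first moment of f(t) is 1/mu
        return 1.0 / mu

    mu = 0.60   # faults/yr, control valve from Table 11-1 (illustrative choice)
    print(reliability(mu, 1.0))          # about 0.55
    print(failure_probability(mu, 1.0))  # about 0.45
    print(mtbf(mu))                      # about 1.7 yr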

Figure 11-2 A typical bathtub failure rate curve for process hardware. The failure rate (faults/time) is approximately constant over the midlife of the component, rising during infant mortality and old age.

Interactions between Process Units

Accidents in chemical plants are usually the result of a complicated interaction of a number
of process components. The overall process failure probability is computed from the individual
component probabilities.
Process components interact in two different fashions. In some cases a process failure requires
the simultaneous failure of a number of components in parallel. This parallel structure
is represented by the logical AND function. This means that the failure probabilities for the individual
components must be multiplied:

P = \prod_{i=1}^{n} P_i,    (11-6)

where

n is the total number of components and
P_i is the failure probability of each component.

This rule is easily memorized because for parallel components the probabilities are multiplied.
The total reliability for parallel units is given by

R = 1 - \prod_{i=1}^{n} (1 - R_i),    (11-7)

where R_i is the reliability of an individual process component.
Process components also interact in series. This means that a failure of any single component
in the series of components will result in failure of the process. The logical OR function
represents this case. For series components the overall process reliability is found by multiplying
the reliabilities for the individual components:

R = \prod_{i=1}^{n} R_i.    (11-8)

The overall failure probability is computed from

P = 1 - \prod_{i=1}^{n} (1 - P_i).    (11-9)

For a system composed of two components A and B, Equation 11-9 is expanded to

P(A or B) = P(A) + P(B) - P(A)P(B).    (11-10)

The cross-product term P(A)P(B) compensates for counting the overlapping cases twice. Consider
the example of tossing a single die and determining the probability that the number of
points is even or divisible by 3. In this case

P(even or divisible by 3) = P(even) + P(divisible by 3)- P(even and divisible by 3).

The last term subtracts the cases in which both conditions are satisfied.
If the failure probabilities are small (a common situation), the term P(A)P(B) is negligible,
and Equation 11-10 reduces to

P(A or B) = P(A) + P(B).    (11-11)

This result is generalized for any number of components. For this special case Equation 11-9
reduces to

P = \sum_{i=1}^{n} P_i.
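Equations 11-6 through 11-11 translate directly into two small helper functions. The sketch below is a generic illustration of the AND (parallel) and OR (series) combinations, with arbitrary example numbers; it is not code from the text.

    import math

    def and_combination(failure_probs):
        # Parallel (AND) interaction, Equation 11-6: all components must fail,
        # so the failure probabilities multiply.
        return math.prod(failure_probs)

    def or_combination(failure_probs):
        # Series (OR) interaction, Equation 11-9: any single failure fails the
        # process, so P = 1 - product of (1 - Pi).
        return 1.0 - math.prod(1.0 - p for p in failure_probs)

    # Two-component check against Equation 11-10, using illustrative numbers
    P_A, P_B = 0.1, 0.2
    print(or_combination([P_A, P_B]))   # 0.28
    print(P_A + P_B - P_A * P_B)        # 0.28, the same result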
Failure rate data for a number of typical process components are provided in Table 11-1.
These are average values determined at a typical chemical process facility. Actual values would
depend on the manufacturer, materials of construction, the design, the environment, and other
factors. The assumptions in this analysis are that the failures are independent, hard, and not
intermittent and that the failure of one device does not stress adjacent devices to the point that
the failure probability is increased.
A summary of computations for parallel and series process components is shown in
Figure 11-3.
Table 11-1 Failure Rate Data for Various Selected Process Components¹

Instrument                                  Faults/year
Controller                                  0.29
Control valve                               0.60
Flow measurement (fluids)                   1.14
Flow measurement (solids)                   3.75
Flow switch                                 1.12
Gas-liquid chromatograph                    30.6
Hand valve                                  0.13
Indicator lamp                              0.044
Level measurement (liquids)                 1.70
Level measurement (solids)                  6.86
Oxygen analyzer                             5.65
pH meter                                    5.88
Pressure measurement                        1.41
Pressure relief valve                       0.022
Pressure switch                             0.14
Solenoid valve                              0.42
Stepper motor                               0.044
Strip chart recorder                        0.22
Thermocouple temperature measurement        0.52
Thermometer temperature measurement         0.027
Valve positioner                            0.44

¹Selected from Frank P. Lees, Loss Prevention in the Process Industries (London: Butterworths, 1986), p. 343.

Figure 11-3 Computations for various types of component linkages. For a series link of components (OR), the failure of either component adds to the total system failure: P = 1 - \prod_{i=1}^{n}(1 - P_i) and R = \prod_{i=1}^{n} R_i. For a parallel link of components (AND), the failure of the system requires the failure of both components: P = \prod_{i=1}^{n} P_i and R = 1 - \prod_{i=1}^{n}(1 - R_i); note that there is no convenient way to combine the failure rates.


Example 11-1
The water flow to a chemical reactor cooling coil is controlled by the system shown in Figure 11-4.
The flow is measured by a differential pressure (DP) device, the controller decides on an appropriate
control strategy, and the control valve manipulates the flow of coolant. Determine the overall
failure rate, the unreliability, the reliability, and the MTBF for this system. Assume a 1-yr period of
operation.

Figure 11-4 Flow control system. The components of the control system (flow meter, controller FIC, and control valve) are linked in series.

Solution
These process components are related in series. Thus, if any one of the components fails, the entire
system fails. The reliability and failure probability are computed for each component using
Equations 11-1 and 11-2. The results are shown in the following table. The failure rates are from
Table 11-1.

                    Failure rate        Reliability           Failure probability
Component           (faults/yr)         R = e^{-\mu t}        P = 1 - R
Control valve       0.60                0.55                  0.45
Controller          0.29                0.75                  0.25
DP cell             1.41                0.24                  0.76

The overall reliability for components in series is computed using Equation 11-8. The result is

R = \prod_i R_i = (0.55)(0.75)(0.24) = 0.10.

The failure probability is computed from

P = 1 - R = 1 - 0.10 = 0.90.

The overall failure rate is computed using the definition of the reliability (Equation 11-1):

0.10 = e^{-\mu},
\mu = -\ln(0.10) = 2.30 failures/yr.

The MTBF is computed using Equation 11-5:

MTBF = 1/\mu = 0.43 yr.

This system is expected to fail, on average, once every 0.43 yr.
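The series-system arithmetic of Example 11-1 can be checked with a few lines of Python. This is only a sketch of Equations 11-1, 11-8, and 11-5 applied to the three tabulated failure rates, not a substitute for the hand calculation.

    import math

    # Failure rates (faults/yr) from Table 11-1 for the three series components
    rates = {"control valve": 0.60, "controller": 0.29, "DP cell": 1.41}
    t = 1.0   # 1-yr period of operation

    # Component reliabilities, Equation 11-1
    R_i = {name: math.exp(-mu * t) for name, mu in rates.items()}

    # Series (OR) combination, Equation 11-8: reliabilities multiply
    R = math.prod(R_i.values())      # about 0.10
    P = 1.0 - R                      # about 0.90
    mu_overall = -math.log(R) / t    # about 2.30 failures/yr
    mtbf = 1.0 / mu_overall          # about 0.43 yr

    print(R_i, R, P, mu_overall, mtbf)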



Figure 11-5 A chemical reactor with an alarm and an inlet feed solenoid. The alarm and feed shutdown systems are linked in parallel. (A pressure switch drives the alarm at P_A; a second pressure switch drives the solenoid valve in the reactor feed line.)

Example 11-2
A diagram of the safety systems in a certain chemical reactor is shown in Figure 11-5. This reactor
contains a high-pressure alarm to alert the operator in the event of dangerous reactor pressures. It
consists of a pressure switch within the reactor connected to an alarm light indicator. For additional
safety an automatic high-pressure reactor shutdown system is installed. This system is activated at
a pressure somewhat higher than the alarm system and consists of a pressure switch connected to a
solenoid valve in the reactor feed line. The automatic system stops the flow of reactant in the event
of dangerous pressures. Compute the overall failure rate, the failure probability, the reliability, and
the MTBF for a high-pressure condition. Assume a 1-yr period of operation. Also, develop an expression
for the overall failure probability based on the component failure probabilities.

Solution
Failure rate data are available from Table 11-1. The reliability and failure probabilities of each component
are computed using Equations 11-1 and 11-2:

                        Failure rate        Reliability           Failure probability
Component               (faults/yr)         R = e^{-\mu t}        P = 1 - R
1. Pressure switch 1    0.14                0.87                  0.13
2. Alarm indicator      0.044               0.96                  0.04
3. Pressure switch 2    0.14                0.87                  0.13
4. Solenoid valve       0.42                0.66                  0.34

A dangerous high-pressure reactor situation occurs only when both the alarm system and the shutdown
system fail. These two subsystems are in parallel. For the alarm system the components are
in series:

R = \prod_{i=1}^{2} R_i = (0.87)(0.96) = 0.835,
P = 1 - R = 1 - 0.835 = 0.165,
\mu = -\ln R = -\ln(0.835) = 0.180 faults/yr,
MTBF = 1/\mu = 5.56 yr.

For the shutdown system the components are also in series:

R = \prod_{i=1}^{2} R_i = (0.87)(0.66) = 0.574,
P = 1 - R = 1 - 0.574 = 0.426,
\mu = -\ln R = -\ln(0.574) = 0.555 faults/yr,
MTBF = 1/\mu = 1.80 yr.

The two systems are combined using Equation 11-6:

P = \prod_{i=1}^{2} P_i = (0.165)(0.426) = 0.070,
R = 1 - P = 0.930,
\mu = -\ln R = -\ln(0.930) = 0.073 faults/yr,
MTBF = 1/\mu = 13.7 yr.

For the alarm system alone a failure is expected once every 5.56 yr. Similarly, for a reactor with a high-pressure
shutdown system alone, a failure is expected once every 1.80 yr. However, with both systems
in parallel the MTBF is significantly improved and a combined failure is expected every 13.7 yr.
The overall failure probability is given by

P = P(A)P(S),

where P(A) is the failure probability of the alarm system and P(S) is the failure probability of the
emergency shutdown system. An alternative procedure is to invoke Equation 11-9 directly. For the
alarm system

P(A) = P_1 + P_2 - P_1 P_2.

For the shutdown system

P(S) = P_3 + P_4 - P_3 P_4.

The overall failure probability is then

P = P(A)P(S) = (P_1 + P_2 - P_1 P_2)(P_3 + P_4 - P_3 P_4).

Substituting the numbers provided in the example, we obtain

P = [0.13 + 0.04 - (0.13)(0.04)][0.34 + 0.13 - (0.34)(0.13)]
  = (0.165)(0.426) = 0.070.

This is the same answer as before.
If the products P_1 P_2 and P_3 P_4 are assumed to be small, then

P(A) = P_1 + P_2,
P(S) = P_3 + P_4,

and

P = P(A)P(S) = (P_1 + P_2)(P_3 + P_4) = 0.080.

The difference between this answer and the answer obtained previously is 14.3%. The component
probabilities are not small enough in this example to assume that the cross-products are negligible.
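As a numerical check of Example 11-2, the sketch below combines the two series subsystems with Equation 11-10 and then multiplies them with Equation 11-6. The component probabilities are the ones tabulated above, and the exact and approximate answers reproduce the 14% gap discussed in the text.

    # Component failure probabilities from the table in Example 11-2
    P1, P2 = 0.13, 0.04   # pressure switch 1, alarm indicator (alarm system, series)
    P3, P4 = 0.13, 0.34   # pressure switch 2, solenoid valve (shutdown system, series)

    # Series (OR) combination within each subsystem, Equation 11-10
    P_alarm = P1 + P2 - P1 * P2        # about 0.165
    P_shutdown = P3 + P4 - P3 * P4     # about 0.426

    # Parallel (AND) combination, Equation 11-6: both subsystems must fail
    P_exact = P_alarm * P_shutdown     # about 0.070

    # Approximation neglecting the cross-product terms (Equation 11-11)
    P_approx = (P1 + P2) * (P3 + P4)   # about 0.080

    print(P_exact, P_approx)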

Revealed and Unrevealed Failures


Example 11-2 assumes that all failures in either the alarm or the shutdown system are
immediately obvious to the operator and are fixed in a negligible amount of time. Emergency
alarms and shutdown systems are used only when a dangerous situation occurs. It is possible
for the equipment to fail without the operator being aware of the situation. This is called an un-
revealed failure. Without regular and reliable equipment testing, alarm and emergency sys-
tems can fail without notice. Failures that are immediately obvious are called revealed failures.
A flat tire on a car is immediately obvious to the driver. However, the spare tire in the
trunk might also be flat without the driver being aware of the problem until the spare is needed.
Figure 11-6 shows the nomenclature for revealed failures. The time that the component
is operational is called the period of operation and is denoted by \tau_o. After a failure occurs, a
period of time, called the period of inactivity or downtime (\tau_r), is required to repair the component.
The MTBF is the sum of the period of operation and the downtime, as shown.

Figure 11-6 Component cycles for revealed failures. A failure requires a period of time for repair. (The component alternates between operational periods \tau_o and repair periods \tau_r; the MTBF spans one full cycle.)

For revealed failures the period of inactivity or downtime for a particular component is
computed by averaging the inactive period for a number of failures:

\tau_r = \frac{1}{n} \sum_{i=1}^{n} \tau_{r,i},    (11-12)

where

n is the number of times the failure or inactivity occurred and
\tau_{r,i} is the period for repair for a particular failure.

Similarly, the time before failure or period of operation is given by

\tau_o = \frac{1}{n} \sum_{i=1}^{n} \tau_{o,i},    (11-13)

where \tau_{o,i} is the period of operation between a particular set of failures.
The MTBF is the sum of the period of operation and the repair period:

MTBF = \frac{1}{\mu} = \tau_r + \tau_o.    (11-14)

It is convenient to define an availability and unavailability. The availability A is simply
the probability that the component or process is found functioning. The unavailability U is the
probability that the component or process is found not functioning. It is obvious that

A + U = 1.    (11-15)

The quantity \tau_o represents the period that the process is in operation, and \tau_r + \tau_o represents
the total time. By definition, it follows that the availability is given by

A = \frac{\tau_o}{\tau_r + \tau_o},    (11-16)

and, similarly, the unavailability is

U = \frac{\tau_r}{\tau_r + \tau_o}.    (11-17)

By combining Equations 11-16 and 11-17 with the result of Equation 11-14, we can write the
equations for the availability and unavailability for revealed failures:

U = \mu \tau_r,    A = \mu \tau_o.    (11-18)

For unrevealed failures the failure becomes obvious only after regular inspection. This
situation is shown in Figure 11-7. If \tau_u is the average period of unavailability during the inspection
interval and if \tau_i is the inspection interval, then

U = \frac{\tau_u}{\tau_i}.    (11-19)

The average period of unavailability is computed from the failure probability:

\tau_u = \int_0^{\tau_i} P(t)\,dt.    (11-20)

Combining with Equation 11-19, we obtain

U = \frac{1}{\tau_i} \int_0^{\tau_i} P(t)\,dt.    (11-21)

Figure 11-7 Component cycles for unrevealed failures. (The failure is not noticed until the next inspection; \tau_u is the period of unavailability within the inspection interval \tau_i.)

The failure probability P(t) is given by Equation 11-2. This is substituted into Equation 11-21
and integrated. The result is

U = 1 - \frac{1}{\mu \tau_i}\left(1 - e^{-\mu \tau_i}\right).    (11-22)

An expression for the availability is

A = \frac{1}{\mu \tau_i}\left(1 - e^{-\mu \tau_i}\right).    (11-23)

If the term \mu \tau_i \ll 1, then the failure probability is approximated by

P(t) \approx \mu t,    (11-24)

and Equation 11-21 is integrated to give, for unrevealed failures,

U = \frac{1}{2} \mu \tau_i.    (11-25)

This is a useful and convenient result. It demonstrates that, on average, for unrevealed failures
the process or component is unavailable during a period equal to half the inspection interval. A
decrease in the inspection interval is shown to increase the availability of an unrevealed failure.
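The step from Equation 11-21 to Equation 11-25 is a single integration. Under the small-failure-rate approximation of Equation 11-24, the worked step is

U = \frac{1}{\tau_i}\int_0^{\tau_i} P(t)\,dt \approx \frac{1}{\tau_i}\int_0^{\tau_i} \mu t\,dt = \frac{1}{\tau_i}\cdot\frac{\mu \tau_i^2}{2} = \frac{1}{2}\mu\tau_i.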

Equations 11-19 through 11-25 assume a negligible repair time. This is usually a valid assumption
because on-line process equipment is generally repaired within hours, whereas the
inspection intervals are usually monthly.

Example 11-3
Compute the availability and the unavailability for both the alarm and the shutdown systems of Example
11-2. Assume that a maintenance inspection occurs once every month and that the repair time
is negligible.

Solution
Both systems demonstrate unrevealed failures. For the alarm system the failure rate is \mu = 0.18
faults/yr. The inspection period is \tau_i = 1/12 = 0.083 yr. The unavailability is computed using Equation
11-25:

U = \frac{1}{2}\mu\tau_i = (1/2)(0.18)(0.083) = 0.0075,
A = 1 - U = 0.992.

The alarm system is available 99.2% of the time. For the shutdown system \mu = 0.55 faults/yr. Thus

U = \frac{1}{2}\mu\tau_i = (1/2)(0.55)(0.083) = 0.023,
A = 1 - U = 0.977.

The shutdown system is available 97.7% of the time.

Probability of Coincidence
All process components demonstrate unavailability as a result of a failure. For alarms
and emergency systems it is unlikely that these systems will be unavailable when a dangerous
process episode occurs. The danger results only when a process upset occurs and the emergency
system is unavailable. This requires a coincidence of events.
Assume that a dangerous process episode occurs p_d times in a time interval T_i. The frequency
of this episode is given by

\lambda = \frac{p_d}{T_i}.    (11-26)

For an emergency system with unavailability U, a dangerous situation will occur only when the
process episode occurs and the emergency system is unavailable. This is every p_d U episodes.

The average frequency of dangerous episodes \lambda_d is the number of dangerous coincidences divided
by the time period:

\lambda_d = \frac{p_d U}{T_i} = \lambda U.    (11-27)

For small failure rates U = \frac{1}{2}\mu\tau_i and p_d = \lambda T_i. Substituting into Equation 11-27 yields

\lambda_d = \frac{1}{2}\lambda\mu\tau_i.    (11-28)

The mean time between coincidences (MTBC) is the reciprocal of the average frequency of
dangerous coincidences:

MTBC = \frac{1}{\lambda_d} = \frac{2}{\lambda\mu\tau_i}.    (11-29)

Example 11-4
For the reactor of Example 11-3 a high-pressure incident is expected once every 14 months. Compute
the MTBC for a high-pressure excursion and a failure in the emergency shutdown device. Assume
that a maintenance inspection occurs every month.

Solution
The frequency of process episodes is given by Equation 11-26:

\lambda = 1 episode/[(14 months)(1 yr/12 months)] = 0.857/yr.

The unavailability is computed from Equation 11-25:

U = \frac{1}{2}\mu\tau_i = (1/2)(0.55)(0.083) = 0.023.

The average frequency of dangerous coincidences is given by Equation 11-27:

\lambda_d = \lambda U = (0.857)(0.023) = 0.020.

The MTBC is (from Equation 11-29)

MTBC = \frac{1}{\lambda_d} = \frac{1}{0.020} = 50 yr.

It is expected that a simultaneous high-pressure incident and failure of the emergency shutdown
device will occur once every 50 yr.
If the inspection interval \tau_i is halved, then U = 0.012, \lambda_d = 0.010, and the resulting MTBC is
100 yr. This is a significant improvement and shows why a proper and timely maintenance program
is important.
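The unavailability and coincidence arithmetic of Examples 11-3 and 11-4 is easy to script. The sketch below simply applies Equations 11-25 through 11-29 to the numbers used in those examples.

    # Unrevealed-failure unavailability, Equation 11-25: U = 0.5*mu*tau_i
    def unavailability(mu, tau_i):
        return 0.5 * mu * tau_i

    mu_shutdown = 0.55    # faults/yr, shutdown system of Example 11-2
    tau_i = 1.0 / 12.0    # monthly inspection interval, yr

    U = unavailability(mu_shutdown, tau_i)    # about 0.023

    # Coincidence with a process upset, Equations 11-26, 11-27, and 11-29
    lam = 12.0 / 14.0     # high-pressure episodes per yr (once every 14 months)
    lam_d = lam * U       # dangerous coincidences per yr, about 0.020
    mtbc = 1.0 / lam_d    # about 50 yr

    # Halving the inspection interval roughly doubles the MTBC
    mtbc_halved = 1.0 / (lam * unavailability(mu_shutdown, tau_i / 2.0))  # about 100 yr

    print(U, lam_d, mtbc, mtbc_halved)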

Redundancy²
Systems are designed to function normally even when a single instrument or control
function fails. This is achieved with redundant controls, including two or more measurements,
processing paths, and actuators that ensure that the system operates safely and reliably. The
degree of redundancy depends on the hazards of the process and on the potential for economic
losses. An example of a redundant temperature measurement is an additional temperature
probe. An example of a redundant temperature control loop is an additional temperature probe,
controller, and actuator (for example, cooling water control valve).

Common Mode Failures


Occasionally an incident occurs that results in a common mode failure. This is a single
event that affects a number of pieces of hardware simultaneously. For example, consider several
flow control loops similar to Figure 11-4. A common mode failure is the loss of electrical
power or a loss of instrument air. A utility failure of this type can cause all the control loops to
fail at the same time. The utility is connected to these systems via OR gates. This increases the
failure rate substantially. When working with control systems, one needs to deliberately design
the systems to minimize common cause failures.

11-2 Event Trees


Event trees begin with an initiating event and work toward a final result. This approach is inductive.
The method provides information on how a failure can occur and the probability of
occurrence.
When an accident occurs in a plant, various safety systems come into play to prevent the
accident from propagating. These safety systems either fail or succeed. The event tree approach
includes the effects of an event initiation followed by the impact of the safety systems.
The typical steps in an event tree analysis are³

1. identify an initiating event of interest,


2. identify the safety functions designed to deal with the initiating event,
3. construct the event tree, and
4. describe the resulting accident event sequences.

If appropriate data are available, the procedure is used to assign numerical values to the vari-
ous events. This is used effectively to determine the probability of a certain sequence of events
and to decide what improvements are required.
²S. S. Grossel and D. A. Crowl, eds., Handbook of Highly Toxic Materials Handling and Management
(New York: Marcel Dekker, 1995), p. 264.
³Guidelines for Hazard Evaluation Procedures, 2d ed. (New York: American Institute of Chemical Engineers,
1992).

Figure 11-8 Reactor with high-temperature alarm and temperature controller. (Reactor feed, cooling coils with cooling water in and out, thermocouple, temperature controller, and a high-temperature alarm at T_A.)

Consider the chemical reactor system shown in Figure 11-8. This system is identical to the
system shown in Figure 10-6, except that a high-temperature alarm has been installed to warn
the operator of a high temperature within the reactor. The event tree for a loss-of-coolant initiating
event is shown in Figure 11-9. Four safety functions are identified. These are written
across the top of the sheet. The first safety function is the high-temperature alarm. The second
safety function is the operator noticing the high reactor temperature during normal inspection.
The third safety function is the operator reestablishing the coolant flow by correcting the problem
in time. The final safety function is invoked by the operator performing an emergency shutdown
of the reactor. These safety functions are written across the page in the order in which they
logically occur.
The event tree is written from left to right. The initiating event is written first in the center
of the page on the left. A line is drawn from the initiating event to the first safety function.
At this point the safety function can either succeed or fail. By convention, a successful operation
is drawn by a straight line upward and a failure is drawn downward. Horizontal lines are
drawn from these two states to the next safety function.
If a safety function does not apply, the horizontal line is continued through the safety
function without branching. For this example, the upper branch continues through the second
function, where the operator notices the high temperature. If the high-temperature alarm operates
properly, the operator will already be aware of the high-temperature condition. The sequence
description and consequences are indicated on the extreme right-hand side of the event
tree. The open circles indicate safe conditions, and the circles with the crosses represent unsafe
conditions.
Figure 11-9 Event tree for a loss-of-coolant accident for the reactor of Figure 11-8. (Initiating event A: loss of cooling, 1 occurrence/yr. Safety functions and failures per demand: B, high-temperature alarm alerts operator, 0.01; C, operator notices high temperature, 0.25; D, operator restarts cooling, 0.25; E, operator shuts down reactor, 0.1. Summing the outcome branches gives approximately 0.2258 shutdown occurrences/yr and 0.025 runaway occurrences/yr.)

Figure 11-10 The computational sequence across a safety function in an event tree. (For an incoming branch of 0.5 occurrences/yr and a safety function with 0.01 failures/demand, the success branch carries (1 - 0.01)(0.5) = 0.495 occurrences/yr and the failure branch carries (0.01)(0.5) = 0.005 occurrences/yr.)

The lettering notation in the sequence description column is useful for identifying the particular
event. The letters indicate the sequence of failures of the safety systems. The initiating
event is always included as the first letter in the notation. An event tree for a different initiating
event in this study would use a different letter. For the example here, the lettering sequence
ADE represents initiating event A followed by failure of safety functions D and E.
The event tree can be used quantitatively if data are available on the failure rates of the
safety functions and the occurrence rate of the initiation event. For this example assume that
the loss-of-cooling event occurs once a year. Let us also assume that the hardware safety functions
fail 1% of the time they are placed in demand. This is a failure rate of 0.01 failure/demand.
Also assume that the operator will notice the high reactor temperature 3 out of 4 times
and that 3 out of 4 times the operator will be successful at reestablishing the coolant flow. Both
of these cases represent a failure rate of 1 time out of 4, or 0.25 failure/demand. Finally, it is estimated
that the operator successfully shuts down the system 9 out of 10 times. This is a failure
rate of 0.10 failure/demand.
The failure rates for the safety functions are written below the column headings. The occurrence
frequency for the initiating event is written below the line originating from the initiating
event.
The computational sequence performed at each junction is shown in Figure 11-10. Again,
the upper branch, by convention, represents a successful safety function and the lower branch
represents a failure. The frequency associated with the lower branch is computed by multiplying
the failure rate of the safety function times the frequency of the incoming branch. The frequency
associated with the upper branch is computed by subtracting the failure rate of the
safety function from 1 (giving the success rate of the safety function) and then multiplying by
the frequency of the incoming branch.
The net frequency associated with the event tree shown in Figure 11-9 is the sum of the
frequencies of the unsafe states (the states with the circles and x's). For this example the net
frequency is estimated at 0.025 failure per year (sum of failures ADE, ABDE, and ABCDE).
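The branch arithmetic of Figures 11-9 and 11-10 is just repeated multiplication of the incoming frequency by either the demand failure rate or its complement. The sketch below reproduces the runaway frequency for the three unsafe sequences ADE, ABDE, and ABCDE using the demand failure rates quoted above.

    # Demand failure probabilities for the safety functions of Figure 11-9
    B = 0.01    # high-temperature alarm fails to alert the operator
    C = 0.25    # operator fails to notice the high temperature
    D = 0.25    # operator fails to restart the cooling
    E = 0.10    # operator fails to shut down the reactor
    init = 1.0  # loss-of-cooling initiating events per year

    # Each unsafe sequence multiplies the initiating frequency by the failure
    # probability of every failed function and by (1 - x) for every function
    # that worked.
    f_ADE = init * (1 - B) * D * E        # alarm works; restart and shutdown fail
    f_ABDE = init * B * (1 - C) * D * E   # alarm fails; operator notices; D and E fail
    f_ABCDE = init * B * C * D * E        # every safety function fails

    runaway = f_ADE + f_ABDE + f_ABCDE
    print(runaway)   # about 0.025 runaway occurrences per year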
This event tree analysis shows that a dangerous runaway reaction will occur on average
0.025 time per year, or once every 40 years. This is considered too high for this installation. A
possible solution is the inclusion of a high-temperature reactor shutdown system. This control

system would automatically shut down the reactor in the event that the reactor temperature exceeds
a fixed value. The emergency shutdown temperature would be higher than the alarm
value to provide an opportunity for the operator to restore the coolant flow.
The event tree for the modified process is shown in Figure 11-11. The additional safety
function provides a backup in the event that the high-temperature alarm fails or the operator
fails to notice the high temperature. The runaway reaction is now estimated to occur
0.00025 time per year, or once every 4000 years. This is a substantial improvement obtained by
the addition of a simple redundant shutdown system.
The event tree is useful for providing scenarios of possible failure modes. If quantitative
data are available, an estimate can be made of the failure frequency. This is used most successfully
to modify the design to improve the safety. The difficulty is that for most real processes
the method can be extremely detailed, resulting in a huge event tree. If a probabilistic computation
is attempted, data must be available for every safety function in the event tree.
An event tree begins with a specified failure and terminates with a number of resulting
consequences. If an engineer is concerned about a particular consequence, there is no certainty
that the consequence of interest will actually result from the selected failure. This is perhaps
the major disadvantage of event trees.

11-3 Fault Trees


Fault trees originated in the aerospace industry and have been used extensively by the nuclear
power industry to qualify and quantify the hazards and risks associated with nuclear power
plants. This approach is becoming more popular in the chemical process industries, mostly as
a result of the successful experiences demonstrated by the nuclear industry.
A fault tree for anything but the simplest of plants can be large, involving thousands of
process events. Fortunately, this approach lends itself to computerization, with a variety of
computer programs commercially available to draw fault trees based on an interactive session.
Fault trees are a deductive method for identifying ways in which hazards can lead to accidents.
The approach starts with a well-defined accident, or top event, and works backward toward
the various scenarios that can cause the accident.
For instance, a flat tire on an automobile is caused by two possible events. In one case the
flat is due to driving over debris on the road, such as a nail. The other possible cause is tire failure.
The flat tire is identified as the top event. The two contributing causes are either basic or
intermediate events. The basic events are events that cannot be defined further, and intermediate
events are events that can. For this example, driving over the road debris is a basic event
because no further definition is possible. The tire failure is an intermediate event because it results
from either a defective tire or a worn tire.
The flat tire example is pictured using a fault tree logic diagram, shown in Figure 11-12.
The circles denote basic events and the rectangles denote intermediate events. The fishlike
symbol represents the OR logic function. It means that either of the input events will cause the
output state to occur. As shown in Figure 11-12, the flat tire is caused by either debris on the road
or tire failure. Similarly, the tire failure is caused by either a defective tire or a worn tire.

Figure 11-12 A fault tree describing the various events contributing to a flat tire. (The top event, flat tire, is the output of an OR gate whose inputs are road debris, a basic event, and tire failure, an intermediate event; tire failure is in turn the output of an OR gate with basic events defective tire and worn tire.)

Events in a fault tree are not restricted to hardware failures. They can also include software,
human, and environmental factors.
For reasonably complex chemical processes a number of additional logic functions are
needed to construct a fault tree. A detailed list is given in Figure 11-13. The AND logic function
is important for describing processes that interact in parallel. This means that the output
state of the AND logic function is active only when both of the input states are active. The INHIBIT
function is useful for events that lead to a failure only part of the time. For instance, driving
over debris in the road does not always lead to a flat tire. The INHIBIT gate could be used
in the fault tree of Figure 11-12 to represent this situation.
Before the actual fault tree is drawn, a number of preliminary steps must be taken:

1. Define precisely the top event. Events such as "high reactor temperature" or "liquid level
too high" are precise and appropriate. Events such as "explosion of reactor" or "fire in
process" are too vague, whereas an event such as "leak in valve" is too specific.
2. Define the existing event. What conditions are sure to be present when the top event
occurs?
3. Define the unallowed events. These are events that are unlikely or are not under con-
sideration at the present. This could include wiring failures, lightning, tornadoes, and
hurricanes.
4. Define the physical bounds of the process. What components are to be considered in the
fault tree?

AND gate: The resulting output event requires the simultaneous occurrence of all input events.
OR gate: The resulting output event requires the occurrence of any individual input event.
INHIBIT event: The output event will occur if the input occurs and the inhibit condition occurs.
BASIC event: A fault event that needs no further definition.
INTERMEDIATE event: An event that results from the interaction of a number of other events.
UNDEVELOPED event: An event that cannot be developed further because of a lack of suitable information.
EXTERNAL event: An event that is a boundary condition to the fault tree.
TRANSFER symbols: Used to transfer the fault tree into and out of a sheet of paper.

Figure 11-13 The logic transfer components used in a fault tree.

5. Define the equipment configuration. What valves are open or closed? What are the liq-
uid levels? Is this a normal operation state?
6. Define the level of resolution. Will the analysis consider just a valve, or will it be neces-
sary to consider the valve components?

The next step in the procedure is to draw the fault tree. First, draw the top event at the
top of the page. Label it as the top event to avoid confusion later when the fault tree has spread
out to several sheets of paper.

Second, determine the major events that contribute to the top event. Write these down
as intermediate, basic, undeveloped, or external events on the sheet. If these events are related
in parallel (all events must occur in order for the top event to occur), they must be connected
to the top event by an AND gate. If these events are related in series (any event can occur in
order for the top event to occur), they must be connected by an OR gate. If the new events can-
not be related to the top event by a single logic function, the new events are probably improp-
erly specified. Remember, the purpose of the fault tree is to determine the individual event
steps that must occur to produce the top event.
Now consider any one of the new intermediate events. What events must occur to con-
tribute to this single event? Write these down as either intermediate, basic, undeveloped, or ex-
ternal events on the tree. Then decide which logic function represents the interaction of these
newest events.
Continue developing the fault tree until all branches have been terminated by basic, un-
developed, or external events. All intermediate events must be expanded.

Example 11-5
Consider again the alarm indicator and emergency shutdown system of Example 11-2. Draw a fault
tree for this system.

Solution
The first step is to define the problem.

1. Top event: Damage to reactor as a result of overpressuring.


2. Existing event: High process pressure.
3. Unallowed events: Failure of mixer, electrical failures, wiring failures, tornadoes, hurricanes,
electrical storms.
4. Physical bounds: The equipment shown in Figure 11-5.
5. Equipment configuration: Solenoid valve open, reactor feed flowing.
6. Level of resolution: Equipment as shown in Figure 11-5.

The top event is written at the top of the fault tree and is indicated as the top event (see Figure 11-14).
Two events must occur for overpressuring: failure of the alarm indicator and failure of the emergency
shutdown system. These events must occur together so they must be connected by an AND function.
The alarm indicator can fail by a failure of either pressure switch 1 or the alarm indicator light.
These must be connected by OR functions. The emergency shutdown system can fail by a failure of
either pressure switch 2 or the solenoid valve. These must also be connected by an OR function.
The complete fault tree is shown in Figure 11-14.

Determining the Minimal Cut Sets


Once the fault tree has been fully drawn, a number of computations can be performed. The
first computation determines the minimal cut sets (or min cut sets). The minimal cut sets are

Figure 11-14 Fault tree for Example 11-5. (Top event: overpressuring of reactor, P = 0.0702, R = 0.9298, the output of AND gate A. Gate B, failure of alarm indicator, P = 0.1648, R = 0.8352, is the OR of basic event 1, pressure switch 1 failure, P = 0.13, and basic event 2, alarm indicator light failure, P = 0.04. Gate C, failure of emergency shutdown, P = 0.4258, R = 0.5742, is the OR of basic event 3, pressure switch 2 failure, P = 0.13, and basic event 4, solenoid valve failure, P = 0.34.)

the various sets of events that could lead to the top event. In general, the top event could occur
through a variety of different combinations of events. The different unique sets of events leading
to the top event are the minimal cut sets.
The minimal cut sets are useful for determining the various ways in which a top event
could occur. Some of the minimal cut sets have a higher probability than others. For instance,
a set involving just two events is more likely than a set involving three. Similarly, a set involving
human interaction is more likely to fail than one involving hardware alone. Based on these
simple rules, the minimal cut sets are ordered with respect to failure probability. The higher
probability sets are examined carefully to determine whether additional safety systems are
required.
The minimal cut sets are determined using a procedure developed by Fussell and Vesely.⁴
The procedure is best described using an example.

⁴J. B. Fussell and W. E. Vesely, "A New Methodology for Obtaining Cut Sets for Fault Trees," Transactions
of the American Nuclear Society (1972), 15.

Example 11-6
Determine the minimal cut sets for the fault tree of Example 11-5.

Solution
The first step in the procedure is to label all the gates using letters and to label all the basic events
using numbers. This is shown in Figure 11-14. The first logic gate below the top event is written:

AND gates increase the number of events in the cut sets, whereas OR gates lead to more sets. Logic
gate A in Figure 11-14 has two inputs: one from gate B and the other from gate C. Because gate A
is an AND gate, gate A is replaced by gates B and C

AB C

Gate B has inputs from event 1 and event 2. Because gate B is an OR gate, gate B is replaced by
adding an additional row below the present row. First, replace gate B by one of the inputs, and then
create a second row below the first. Copy into this new row all the entries in the remaining column
of the first row:

AB1 C
C

Note that the C in the second column of the first row is copied to the new row.
Next, replace gate C in the first row by its inputs. Because gate C is also an OR gate, replace
Cby basic event 3 and then create a third row with the other event. Be sure to copy the 1 from the
other column of the first row:

1 3
2 C
1 4

Finally, replace gate C in the second row by its inputs. This generates a fourth row:

1 3
2 3
1 4
2 4
The cut sets are then

1,3
2,3
1,4
2,4

This means that the top event occurs as a result of any one of these sets of basic events.

The procedure does not always deliver the minimal cut sets. Sometimes a set might be of the
following form:
1,2,2
This is reduced to simply 1, 2. On other occasions the sets might include supersets. For instance,
consider
1,2
1,2,4
1,2,3
The second and third sets are supersets of the first basic set because events 1 and 2 are in common.
The supersets are eliminated to produce the minimal cut sets.
For this example there are no supersets.
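This top-down replacement procedure, together with the duplicate and superset reduction just described, is mechanical enough to automate. The short Python sketch below is illustrative only (the gate dictionary layout and function name are mine, not from the text); it encodes the fault tree of Figure 11-14 and reproduces the four minimal cut sets of Example 11-6.

```python
def minimal_cut_sets(gates, top):
    """Top-down (Fussell and Vesely style) expansion of a fault tree.

    `gates` maps a gate name to ("AND" or "OR", list of inputs); anything
    not listed in `gates` is treated as a basic event.
    """
    rows = [[top]]
    expanded = True
    while expanded:
        expanded = False
        new_rows = []
        for row in rows:
            gate = next((x for x in row if x in gates), None)
            if gate is None:                    # row holds only basic events
                new_rows.append(row)
                continue
            expanded = True
            kind, inputs = gates[gate]
            rest = [x for x in row if x != gate]
            if kind == "AND":                   # AND gate: enlarge the same set
                new_rows.append(rest + list(inputs))
            else:                               # OR gate: one new row per input
                new_rows.extend(rest + [inp] for inp in inputs)
        rows = new_rows
    # Duplicate events within a row collapse in the frozenset; supersets of
    # any smaller set are then discarded, leaving the minimal cut sets.
    sets_ = {frozenset(r) for r in rows}
    return [s for s in sets_ if not any(t < s for t in sets_)]

# Fault tree of Figure 11-14: AND gate A over OR gates B (events 1, 2) and C (events 3, 4).
gates = {"A": ("AND", ["B", "C"]),
         "B": ("OR", ["1", "2"]),
         "C": ("OR", ["3", "4"])}

for cut_set in sorted(minimal_cut_sets(gates, "A"), key=sorted):
    print(sorted(cut_set))      # ['1', '3'], ['1', '4'], ['2', '3'], ['2', '4']
```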

Quantitative Calculations Using the Fault Tree


The fault tree can be used to perform quantitative calculations to determine the proba-
bility of the top event. This is accomplished in two ways.
With the first approach the computations are performed using the fault tree diagram it-
self. The failure probabilities of all the basic, external, and undeveloped events are written on
the fault tree. Then the necessary computations are performed across the various logic gates.
Remember that probabilities are multiplied across an AND gate and that reliabilities are multiplied
across an OR gate. The computations are continued in this fashion until the top event
is reached. INHIBIT gates are considered a special case of an AND gate.
The results of this procedure are shown in Figure 11-14. The symbol P represents the
probability and R represents the reliability. The failure probabilities for the basic events were
obtained from Example 11-2.
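This gate-by-gate computation can be sketched in a few lines of Python (an illustration, not code from the text; the variable names are mine and the probabilities are those written on Figure 11-14). Reliabilities are multiplied across each OR gate and probabilities are multiplied across the AND gate, reproducing the 0.0702 top-event value.

```python
# Basic-event failure probabilities from Figure 11-14 (Example 11-2 data).
p1, p2 = 0.13, 0.04      # pressure switch 1, alarm indicator light
p3, p4 = 0.13, 0.34      # pressure switch 2, solenoid valve

def or_gate(*probs):
    """Reliabilities multiply across an OR gate: P = 1 - (1 - p1)(1 - p2)..."""
    reliability = 1.0
    for p in probs:
        reliability *= (1.0 - p)
    return 1.0 - reliability

def and_gate(*probs):
    """Probabilities multiply across an AND gate."""
    result = 1.0
    for p in probs:
        result *= p
    return result

p_alarm = or_gate(p1, p2)              # 0.1648: failure of alarm indicator
p_shutdown = or_gate(p3, p4)           # 0.4258: failure of emergency shutdown
p_top = and_gate(p_alarm, p_shutdown)  # overpressuring of reactor
print(round(p_top, 4))                 # 0.0702
```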
The other procedure is to use the minimal cut sets. This procedure approaches the exact
result only if the probabilities of all the events are small. In general, this result provides a num-
ber that is larger than the actual probability. This approach assumes that the probability cross-
product terms shown in Equation 11-10 are negligible.
The minimal cut sets represent the various failure modes. For Example 11-6 events 1,3
or 2,3 or 1, 4 or 2, 4 could cause the top event. To estimate the overall failure probability, the
probabilities from the cut sets are added together. For this case

P(1 AND 3) = (0.13)(0.13) = 0.0169

P(2 AND 3) = (0.04)(0.13) = 0.0052

P(1 AND 4) = (0.13)(0.34) = 0.0442

P(2 AND 4) = (0.04)(0.34) = 0.0136

Total = 0.0799

This compares to the exact result of 0.0702 obtained using the actual fault tree. The cut sets are
related to each other by the OR function. For Example 11-6 all the cut set probabilities were
added. This is an approximate result, as shown by Equation 11-10, because the cross-product
terms were neglected. For small probabilities the cross-product terms are negligible and the
addition will approach the true result.
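The cut-set approximation above is equally simple to script. The hypothetical snippet below (same basic-event probabilities as before) sums the product of basic-event probabilities over each minimal cut set and reproduces the approximate value of 0.0799.

```python
probs = {"1": 0.13, "2": 0.04, "3": 0.13, "4": 0.34}
cut_sets = [("1", "3"), ("2", "3"), ("1", "4"), ("2", "4")]

approx = 0.0
for cs in cut_sets:
    p_cs = 1.0
    for event in cs:
        p_cs *= probs[event]      # probability of one minimal cut set
    approx += p_cs                # cut sets combined by OR, cross terms neglected

print(round(approx, 4))           # 0.0799, versus the exact value of 0.0702
```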

Advantages and Disadvantages of Fault Trees


The main disadvantage of using fault trees is that for any reasonably complicated process
the fault tree will be enormous. Fault trees involving thousands of gates and intermediate
events are not unusual. Fault trees of this size require a considerable amount of time, measured
in years, to complete.
Furthermore, the developer of a fault tree can never be certain that all the failure modes
have been considered. More complete fault trees are usually developed by more experienced
engineers.
Fault trees also assume that failures are "hard," that a particular item of hardware does not
fail partially. A leaking valve is a good example of a partial failure. Also, the approach assumes
that a failure of one component does not stress the other components, resulting in a change in
the component failure probabilities.
Fault trees developed by different individuals are usually different in structure. The different
trees generally predict different failure probabilities. This inexact nature of fault trees is
a considerable problem.
If the fault tree is used to compute a failure probability for the top event, then failure
probabilities are needed for all the events in the fault tree. These probabilities are not usually
known or are not known accurately.
A major advantage of the fault tree approach is that it begins with a top event. This top
event is selected by the user to be specific to the failure of interest. This is opposed to the event
tree approach, where the events resulting from a single failure might not be the events of spe-
cific interest to the user.
Fault trees are also used to determine the minimal cut sets. The minimal cut sets provide
enormous insight into the various ways for top events to occur. Some companies adopt a control
strategy to have all their minimal cut sets be a product of four or more independent failures.
This, of course, increases the reliability of the system significantly.
Finally, the entire fault tree procedure enables the application of computers. Software is
available for graphically constructing fault trees, determining the minimal cut sets, and calcu-
lating failure probabilities. Reference libraries containing failure probabilities for various types
of process equipment can also be included.

Relationship between Fault Trees and Event Trees


Event trees begin with an initiating event and work toward the top event (induction).
Fault trees begin with a top event and work backward toward the initiating events (deduction).

Figure 11-15 General description of risk. (The figure plots frequency against consequence, with a boundary separating the acceptable region from the not acceptable region.)

The initiating events are the causes of the incident, and the top events are the final outcomes.
The two methods are related in that the top events for fault trees are the initiating events for
the event trees. Both are used together to produce a complete picture of an incident, from its
initiating causes all the way to its final outcome. Probabilities and frequencies are attached to
these diagrams.

11-4 QRA and LOPA

Risk is the product of the probability of a release, the probability of exposure, and the conse-
quences of the exposure. Risk is usually described graphically, as shown in Figure 11-15. All
companies decide their levels of acceptable risk and unacceptable risk. The actual risk of a pro-
cess or plant is usually determined using quantitative risk analysis (QRA) or a layer of protection
analysis (LOPA). Other methods are sometimes used; however, QRA and LOPA are the
methods that are most commonly used. In both methods the frequency of the release is deter-
mined using a combination of event trees, fault trees, or an appropriate adaptation.

Quantitative Risk Analysis


QRA is a method that identifies where operations, engineering, or management systems
can be modified to reduce risk. The complexity of a QRA depends on the objectives of the study
and the available information. Maximum benefits result when QRAs are used at the beginning

⁵CCPS, Guidelines for Chemical Process Quantitative Risk Analysis, 2d ed. (New York: Center for Chemical Process Safety, AIChE, 2000).

of a project (conceptual review and design phases) and are maintained throughout the facility's
life cycle.
The QRA method is designed to provide managers with a tool to help them evaluate the
overall risk of a process. QRAs are used to evaluate potential risks when qualitative methods
cannot provide an adequate understanding of the risks. QRA is especially effective for evalu-
ating alternative risk reduction strategies.
The major steps of a QRA study include

1. defining the potential event sequences and potential incidents,


2. evaluating the incident consequences (the typical tools for this step include dispersion
modeling and fire and explosion modeling),
3. estimating the potential incident frequencies using event trees and fault trees,
4. estimating the incident impacts on people, environment, and property, and
5. estimating the risk by combining the impacts and frequencies, and recording the risk us-
ing a graph similar to Figure 11-15.

In general, QRA is a relatively complex procedure that requires expertise and a sub-
stantial commitment of resources and time. In some instances this complexity may not be war-
ranted; then the application of LOPA methods may be more appropriate.

Layer of Protection Analysis


LOPA is a semi-quantitative tool for analyzing and assessing risk. This method includes
simplified methods to characterize the consequences and estimate the frequencies. Various lay-
ers of protection are added to a process, for example, to lower the frequency of the undesired
consequences. The protection layers may include inherently safer concepts; the basic process
control system; safety instrumented functions; passive devices, such as dikes or blast walls; ac-
tive devices, such as relief valves; and human intervention. This concept of layers of protection
is illustrated in Figure 11-16. The combined effects of the protection layers and the conse-
quences are then compared against some risk tolerance criteria.
In LOPA the consequences and effects are approximated by categories, the frequencies
are estimated, and the effectiveness of the protection layers is also approximated. The approximate
values and categories are selected to provide conservative results. Thus the results of a
LOPA should always be more conservative than those from a QRA. If the LOPA results are
unsatisfactory or if there is any uncertainty in the results, then a full QRA may be justified. The
results of both methods need to be used cautiously. However, the results of QRA and LOPA
studies are especially satisfactory when comparing alternatives.
Individual companies use different criteria to establish the boundary between acceptable
and unacceptable risk. The criteria may include frequency of fatalities, frequency of fires, maxi-
mum frequency of a specific category of a consequence, and required number of independent
layers of protection for a specific consequence category.
⁶CCPS, Layer of Protection Analysis: Simplified Process Risk Assessment, D. A. Crowl, ed. (New York: Center for Chemical Process Safety, AIChE, 2001) (in press).

Figure 11-16 Layers of protection to lower the frequency of a specific accident scenario. From the process outward, the layers shown are: process design; basic process control systems; critical alarms and human intervention; safety instrumented functions (SIFs); physical protection (relief devices); post-release physical protection (dikes); plant emergency response; and community emergency response. (The process at the center is drawn as a reactor with monomer feed, steam, cooling water, and polymer product streams.)

The primary purpose of LOPA is to determine whether there are sufficient layers of pro-
tection against a specific accident scenario. As illustrated in Figure 11-16, many types of pro-
tective layers are possible. Figure 11-16 does not include all possible layers of protection. A scenario
may require one or many layers of protection, depending on the process complexity and
potential severity of an accident. Note that for a given scenario only one layer must work suc-
cessfully for the consequence to be prevented. Because no layer is perfectly effective, however,
sufficient layers must be added to the process to reduce the risk to an acceptable level.
The major steps of a LOPA study include

1. identifying a single consequence (a simple method to determine consequence categories


is described later),
2. identifying an accident scenario and cause associated with the consequence (the scenario
consists of a single cause-consequence pair),
3. identifying the initiating event for the scenario and estimating the initiating event frequency (a simple method is described later),

4. identifying the protection layers available for this particular consequence and estimating
the probability of failure on demand for each protection layer,
5. combining the initiating event frequency with the probabilities of failure on demand for
the independent protection layers to estimate a mitigated consequence frequency for this
initiating event,
6. plotting the consequence versus the consequence frequency to estimate the risk (the risk
is usually shown in a figure similar to Figure 11-15), and
7. evaluating the risk for acceptability (if unacceptable, additional layers of protection are
required).
This procedure is repeated for other consequences and scenarios. A number of variations on
this procedure are used.

Consequence
The most common scenario of interest for LOPA in the chemical process industry is loss
of containment of hazardous material. This can occur through a variety of incidents, such as a
leak from a vessel, a ruptured pipeline, a gasket failure, or release from a relief valve.
In a QRA study the consequences of these releases are quantified using dispersion mod-
eling and a detailed analysis to determine the downwind consequences as a result of fires, ex-
plosions, or toxicity. In a LOPA study the consequences are estimated using one of the follow-
ing methods: (1) semi-quantitative approach without the direct reference to human harm, (2)
qualitative estimates with human harm, and (3) quantitative estimates with human harm. See
footnote 6 for the detailed methods.
When using the semi-quantitative method, the quantity of the release is estimated using
source models, and the consequences are characterized with a category, as shown in Table 11-2.
This is an easy method to use compared with QRA.
Although the method is easy to use, it clearly identifies problems that may need addi-
tional work, such as a QRA. It also identifies problems, which may be deemphasized because
the consequences are insignificant.

Frequency
When conducting a LOPA study, several methods can be used to determine the frequency.
One of the less rigorous methods includes the following steps:
1. Determine the failure frequency of the initiating event.
2. Adjust this frequency to include the demand, for example, a reactor failure frequency is
divided by 12 if the reactor is used only 1 month during the entire year. The frequencies
are also adjusted (reduced) to include the benefits of preventive maintenance. If, for ex-
ample, a control system is given preventive maintenance 4 times each year, then its fail-
ure frequency is divided by 4.
3. Adjust the failure frequency to include the probabilities of failure on demand (PFDs) for
each independent layer of protection.
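As a rough illustration of these adjustments, the hypothetical numbers below simply follow the pattern of step 2 (a base frequency of 1/yr, one month of service per year, quarterly preventive maintenance) and then apply two assumed IPL PFDs of 10^-2 as in step 3; none of these values come from a worked example in the text.

```python
base_failure_freq = 1.0      # hypothetical initiating-event frequency, per yr
months_in_service = 1        # equipment used only 1 month per year
pm_per_year = 4              # preventive maintenance 4 times per year

# Step 2: adjust for demand and for the benefit of preventive maintenance.
adjusted = base_failure_freq * (months_in_service / 12) / pm_per_year
print(round(adjusted, 4))    # 0.0208 per yr before any IPL credit

# Step 3: apply the PFD of each independent protection layer.
pfds = [1e-2, 1e-2]          # two assumed IPLs at the common 10^-2 value
mitigated = adjusted
for pfd in pfds:
    mitigated *= pfd
print(f"{mitigated:.1e}")    # 2.1e-06 per yr
```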

Table 11-3 Typical Frequency Values Assigned to Initiating Events

Initiating event | Frequency range from literature (per yr) | Example of a value chosen by a company for use in LOPA (per yr)
Pressure vessel residual failure | 10^-5 to 10^-7 | 1 x 10^-6
Piping residual failure, 100 m, full breach | 10^-5 to 10^-6 | 1 x 10^-5
Piping leak (10% section), 100 m | 10^-3 to 10^-4 | 1 x 10^-3
Atmospheric tank failure | 10^-3 to 10^-5 | 1 x 10^-3
Gasket/packing blowout | 10^-2 to 10^-6 | 1 x 10^-2
Turbine/diesel engine overspeed with casing breach | 10^-3 to 10^-4 | 1 x 10^-4
Third-party intervention (external impact by backhoe, vehicle, etc.) | 10^-2 to 10^-4 | 1 x 10^-2
Crane load drop | 10^-3 to 10^-4 per lift | 1 x 10^-4 (per lift)
Lightning strike | 10^-3 to 10^-4 | 1 x 10^-3
Safety valve opens spuriously | 10^-2 to 10^-4 | 1 x 10^-2
Cooling water failure | 1 to 10^-2 | 1 x 10^-1
Pump seal failure | 10^-1 to 10^-2 | 1 x 10^-1
Unloading/loading hose failure | 1 to 10^-2 | 1 x 10^-1
BPCS instrument loop failure | 1 to 10^-2 | 1 x 10^-1
Regulator failure | 1 to 10^-1 | 1 x 10^-1
Small external fire (aggregate causes) | 10^-1 to 10^-2 | 1 x 10^-1
Large external fire (aggregate causes) | 10^-2 to 10^-3 | 1 x 10^-2
LOTO (lock-out tag-out) procedure failure (overall failure of a multiple element process) | 10^-3 to 10^-4 per opportunity | 1 x 10^-3 (per opportunity)
Operator failure (to execute routine procedure; well trained, unstressed, not fatigued) | 10^-2 to 10^-3 per opportunity | 1 x 10^-2 (per opportunity)

Note: Individual companies choose their own values, consistent with the degree of conservatism or the company's risk tolerance criteria. Failure rates can also be greatly affected by preventive maintenance routines.

The failure frequencies for the common initiating events of an accident scenario are
shown in Table 11-3.
The PFD for each independent protection layer (IPL) varies from 10^-1 to 10^-5 for a weak
IPL and a strong IPL, respectively. The common practice is to use a PFD of 10^-2 unless expe-
screening are given in Tables 11-4 and 11-5. There are three rules for classifying a specific sys-
tem or action of an IPL:

1. The IPL is effective in preventing the consequence when it functions as designed.


2. The IPL functions independently of the initiating event and the components of all other
IPLs that are used for the same scenario.
3. The IPL is auditable, that is, the PFD of the IPL must be capable of validation including
review, testing, and documentation.

Table 11-4 PFDs for Passive IPLs


Passive IPL | Comments (assuming an adequate design basis, inspections, and maintenance procedures) | PFDs from industry | PFDs from CCPS
Dike | Reduces the frequency of large consequences (widespread spill) of a tank overfill, rupture, spill, etc. | 1 x 10^-2 to 1 x 10^-3 | 1 x 10^-2
Underground drainage system | Reduces the frequency of large consequences (widespread spill) of a tank overfill, rupture, spill, etc. | 1 x 10^-2 to 1 x 10^-3 | 1 x 10^-2
Open vent (no valve) | Prevents overpressure | 1 x 10^-2 to 1 x 10^-3 | 1 x 10^-2
Fireproofing | Reduces rate of heat input and provides additional time for depressurizing, fire fighting, etc. | 1 x 10^-2 to 1 x 10^-3 | 1 x 10^-2
Blast wall or bunker | Reduces the frequency of large consequences of an explosion by confining blast and by protecting equipment, buildings, etc. | 1 x 10^-2 to 1 x 10^-3 | 1 x 10^-3
Inherently safer design | If properly implemented, can eliminate scenarios or significantly reduce the consequences associated with a scenario | 1 x 10^-1 to 1 x 10^-6 | 1 x 10^-2
Flame or detonation arrestors | If properly designed, installed, and maintained, can eliminate the potential for flashback through a piping system or into a vessel or tank | 1 x 10^-1 to 1 x 10^-3 | 1 x 10^-2

CCPS values from: CCPS, Simplified Process Risk Assessment: Layer of Protection Analysis, D. A. Crowl, ed. (New York: American Institute of Chemical Engineers, 2001) (in press).

The frequency of a consequence of a specific scenario endpoint is computed using

f_i^C = f_i^I x ∏_j PFD_ij    (11-30)

where
f_i^C is the mitigated consequence frequency for a specific consequence C for an initiating event i,
f_i^I is the initiating event frequency for the initiating event i, and
PFD_ij is the probability of failure of the jth IPL that protects against the specific consequence and the specific initiating event i. The PFD is usually 10^-2, as described previously.
When there are multiple scenarios with the same consequence, each scenario is evalu-
ated individually using Equation 11-30. The frequency of the consequence is subsequently de-
termined using

f^C = Σ_{i=1}^{I} f_i^C    (11-31)

Table 11-5 PFDs for Active IPLs and Human Actions


Active IPL or human action | Comments [assuming an adequate design basis, inspections, and maintenance procedures (active IPLs) and adequate documentation, training, and testing procedures (human action)] | PFDs from industry | PFDs from CCPS
Relief valve | Prevents system from exceeding specified overpressure. Effectiveness of this device is sensitive to service and experience. | 1 x 10^-1 to 1 x 10^-5 | 1 x 10^-2
Rupture disc | Prevents system from exceeding specified overpressure. Effectiveness of this device can be sensitive to service and experience. | 1 x 10^-1 to 1 x 10^-5 | 1 x 10^-2
Basic process control system (BPCS) | Can be credited as an IPL if not associated with the initiating event being considered. See IEC (1998, 2001). | 1 x 10^-1 to 1 x 10^-2 | 1 x 10^-1
Safety instrumented functions (interlocks) | See IEC 61508 (IEC, 1998) and IEC 61511 (IEC, 2001) for life-cycle requirements and additional discussion. | -- | --
Human action with 10 min response time | Simple well-documented action with clear and reliable indications that the action is required. | 1 to 1 x 10^-1 | 1 x 10^-1
Human action with 40 min response time | Simple well-documented action with clear and reliable indications that the action is required. | 1 x 10^-1 to 1 x 10^-2 | 1 x 10^-1

CCPS values from: CCPS, Simplified Process Risk Assessment: Layer of Protection Analysis, D. A. Crowl, ed. (New York: American Institute of Chemical Engineers, 2001) (in press).
IEC (1998), IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems, Parts 1-7, Geneva: International Electrotechnical Commission.
IEC (2001), IEC 61511, Functional Safety: Safety Instrumented Systems for the Process Industry Sector, Parts 1-3 (Draft in Process), Geneva: International Electrotechnical Commission.

where
f^C is the frequency of the Cth consequence for all initiating events, and
I is the total number of initiating events for the same consequence.

Example 11-7
Determine the consequence frequency for a cooling water failure if the system is designed with two
IPLs. The IPLs are human interaction with 10-min response time and a basic process control sys-
tem (BPCS).

Solution
The frequency of a cooling water failure is taken from Table 11-3, that is, f^I = 10^-1/yr. The PFDs are
estimated from Tables 11-4 and 11-5. The human response PFD is 10^-1 and the PFD for the BPCS
is 10^-1. The consequence frequency is found using Equation 11-30:

f^C = f^I x ∏_j PFD_j = (10^-1) x (10^-1)(10^-1) = 10^-3 failure/yr.

As illustrated in Example 11-7, the failure frequency is determined easily by using LOPA
methods.
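The same arithmetic is easy to wrap in a small helper; the sketch below only restates Equations 11-30 and 11-31 with the Example 11-7 numbers (the function and variable names are mine, not the text's).

```python
def mitigated_frequency(initiating_freq, pfds):
    """Equation 11-30: f_i^C = f_i^I x (product of the IPL PFDs)."""
    f = initiating_freq
    for pfd in pfds:
        f *= pfd
    return f

# Example 11-7: cooling water failure (10^-1/yr) protected by human action
# with 10 min response time (PFD 10^-1) and the BPCS (PFD 10^-1).
f_cw = mitigated_frequency(1e-1, [1e-1, 1e-1])
print(f"{f_cw:.0e}")          # 1e-03 failure/yr

# Equation 11-31: scenarios sharing the same consequence are summed.
f_total = sum([f_cw])         # only one initiating event in this example
print(f"{f_total:.0e}")
```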
The concept of PFD is also used when designing emergency shutdown systems called
safety instrumented functions (SIFs). A SIF achieves low PFD figures by

using redundant sensors and final redundant control elements,


using multiple sensors with voting systems and redundant final control elements,
testing the system components at specific intervals to reduce the probability of failures
on demand by detecting hidden failures, and
using a deenergized trip system (i.e., a relayed shutdown system).

There are three safety integrity levels (SILs) that are generally accepted in the chemical
process industry for emergency shutdown systems:

1. SIL1 (PFD = 10^-1 to 10^-2): These SIFs are normally implemented with a single sensor, a single logic solver, and a single final control element, and they require periodic proof testing.
2. SIL2 (PFD = 10^-2 to 10^-3): These SIFs are typically fully redundant, including the sensor, logic solver, and final control element, and they require periodic proof testing.
3. SIL3 (PFD = 10^-3 to 10^-4): SIL3 systems are typically fully redundant, including the sensor, logic solver, and final control element; and the system requires careful design and frequent validation tests to achieve the low PFD figures. Many companies find that they have a limited number of SIL3 systems because of the high cost normally associated with this architecture.
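These PFD bands can be captured in a trivial classification helper; the sketch below is not from the text, and the band boundaries simply restate the SIL1-SIL3 ranges listed above.

```python
def sil_for_pfd(pfd):
    """Map an average probability of failure on demand to a SIL band."""
    if 1e-2 <= pfd < 1e-1:
        return "SIL1"
    if 1e-3 <= pfd < 1e-2:
        return "SIL2"
    if 1e-4 <= pfd < 1e-3:
        return "SIL3"
    return "outside SIL1-SIL3"

for pfd in (5e-2, 5e-3, 5e-4):
    print(pfd, sil_for_pfd(pfd))   # -> SIL1, SIL2, SIL3
```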

Suggested Reading
CCPS, Guidelines for Consequence Analysis of Chemical Releases (New York: American Institute of
Chemical Engineers, 1999).
Guidelines for Hazard Evaluation Procedures, 2d ed. (New York: American Institute of Chemical Engi-
neers, 1992).
J. B. Fussell and W. E. Vesely, "A New Methodology for Obtaining Cut Sets for Fault Trees," Transactions
of the American Nuclear Society (1972), 15.
F.P. Lees, Loss Prevention in the Process Industries, 2d ed. (London: Butterworths, 1996).
J.F. Louvar and B. D. Louvar, Health and Environmental Risk Analysis: Fundamentals with Applications
(Upper Saddle River, NJ: Prentice Hall PTR, 1998).
B. Roffel and J. E. Rijnsdorp, Process Dynamics, Control, and Protection (Ann Arbor, MI: Ann Arbor
Science, 1982), ch. 19.

Figure 11-17 Fault tree gates.

Problems
11-1. Given the fault tree gates shown in Figure 11-17 and the following set of failure
probabilities:
Component | Failure probability
1 | 0.1
2 | 0.2
3 | 0.3
4 | 0.4

a. Determine an expression for the probability of the top event in terms of the component failure probabilities.
b. Determine the minimal cut sets.
c. Compute a value for the failure probability of the top event. Use both the expression
of part a and the fault tree itself.
11-2. The storage tank system shown in Figure 11-18 is used to store process feedstock. Over-
filling of storage tanks is a common problem in the process industries. To prevent overfill-
ing, the storage tank is equipped with a high-level alarm and a high-level shutdown sys-
tem. The high-level shutdown system is connected to a solenoid valve that stops the flow
of input stock.
a. Develop an event tree for this system using the "failure of level indicator" as the ini-
tiating event. Given that the level indicator fails 4 times/yr, estimate the number of
overflows expected per year. Use the following data:
