STN HiAv V2.0

How can I
System Technical Note
increase the availability of a system?

Security solutions
Design your architecture
Disclaimer
This document is not comprehensive for any systems using the given architecture and does not absolve users of their duty to uphold the safety requirements for the equipment used in their systems or compliance with both national or international safety laws and regulations. Readers are considered to already know how to use the products described in this document. This document does not replace any specific product documentation.
The STN Collection

The implementation of an automation project includes five main phases: Selection, Design, Configuration, Implementation and Operation. To help you develop a project based on these phases, Schneider Electric has created the System Technical Guide (STG) and System Technical Note (STN). A System Technical Guide provides technical guidelines and recommendations for implementing technologies to address your needs and requirements. This guide covers the entire scope of the projects life cycle, from the Selection to the Operation phase, providing design methodology and source code examples for all components of a sub-system. A System Technical Note provides a more theoretical approach by focusing on a particular system technology. These notes describe complete solution offers for a system, and therefore support you in the Selection phase of a project. The STGs and STNs are related and complementary. In short, you will find technology fundamentals in a STN, and their corresponding applications in one or several STGs.
Development Environment
PlantStruxure, the Process Automation System from Schneider Electric, is a collaborative system that allows industrial and infrastructure companies to meet their automation needs while also addressing growing energy management requirements. Within a single environment, measured energy and process data can be analyzed to yield a holistically optimized plant.
Table of Contents
1. Introduction to High Availability .........................................7

1.1. Purpose ........................................................................................................................................................ 7 1.2. Introduction.................................................................................................................................................. 7 1.3. Document Overview ..................................................................................................................................... 8
2. High Availability Theoretical Basics.................................11

2.1. Fault Tolerant System ................................................................................................................................ 11 2.2. Lifetime and Failure Rate .......................................................................................................................... 11 2.3. RAMS (Reliability, Availability, Maintainability, Safety) .......................................................................... 12 2.4. Reliability Block Diagrams (RBD)............................................................................................................. 21
3. High Availability with PlantStruxure.................................35

3.1. Operating and Monitoring ......................................................................................................................... 36 3.2. Redundant Network System........................................................................................................................ 46 3.3. Redundant Control System......................................................................................................................... 59 3.4. Redundant Field Bus .................................................................................................................................. 74
4. Conclusion ..........................................................................77
4.1. Benefits....................................................................................................................................................... 78 4.2. More Detailed RAMS Investigation ........................................................................................................... 79
5. Appendix .............................................................................81
5.1. Glossary ..................................................................................................................................................... 81 5.2. Standards ................................................................................................................................................... 87
1-Introduction to High Availability
1. Introduction to High Availability

1.1. Purpose
The intent of this System Technical Note (STN) is to describe the capabilities of different Schneider Electric solutions that answer many critical application requirements, and thus increase the availability of a Process Automation System. It provides a description of a common, readily understandable reference point for end users, system integrators, salespeople, business support and other parties.
1.2. Introduction
Increasingly, process applications require a high availability automation system. Before deciding to implement a high availability automation system in your installation, you need to consider the following key questions: What security level is needed? This concerns the security of both persons and hardware. For instance, the complete start/stop sequences that manage the kiln in a cement plant include a key condition: the most powerful equipment must be the last to start and stop. What is the criticality of the process? This point concerns all the processes that involve a reaction (mechanical, chemical, etc.). Consider the example of the kiln again. To avoid its destruction, the complete stop sequence includes a slow cooling of the constituent material. Another typical example is the biological treatment in a wastewater plant, which cannot be stopped every day. What is the environmental criticality? Consider the example of an automation system in a tunnel. PACs (default and redundant) would be installed on both sides of the tunnel as a safety precaution in case of a fire. Furthermore, the system may be subjected to harsh environmental factors such as vibration, extreme temperatures, shock and so on. Which other particular circumstances does the system need to address? This last area includes additional questions such as: Does the system really need redundancy if the electrical network becomes inoperative in a specific layer of the installation? What is the criticality of the data in the event of a loss of communication?

Availability is a term that is increasingly used to qualify a process asset or system. Additionally, Reliability and Maintainability are terms often used in analyses of the usefulness of design improvements and in production diagnostic issues. Accordingly, the design of automation system architectures must take into consideration these types of questions into consideration. Schneider Electrics PlantStruxure offers various levels of redundancy, allowing you to design an effective high availability system.
1.3. Document Overview

This document contains the following chapters: The High Availability Theoretical Basics chapter describes the fundamentals of High Availability. It presents the theory and basics, and explains a method of conceptualizing series and parallel architectures. Calculation examples are used to illustrate this approach. The Process Automation System Availability chapter provides a number of High Availability solutions, especially for PlantStruxure, ranging from the information management to the I/O modules. The Conclusion chapter summarizes the customer benefits provided by Schneider Electrics High Availability solutions, and provides additional information and references.

The following diagram illustrates various levels of PlantStruxure automation system architecture:
As shown in the following chapters, redundancy is a convenient means of elevating global system reliability and availability. The High Availability of a Process Automation System, in terms of redundancy, can be addressed at different levels of the architecture: Single or redundant I/O modules (depending on sensor/actuator redundancy). Depending on the I/O handling philosophy (for example conventional remote I/O stations, or I/O islands distributed on Ethernet), different scenarios can be applied: dual communication medium I/O bus or self-healing ring, single or dual. Programmable controller CPU redundancy (Hot Standby PAC station). Communication network and ports redundancy. SCADA system - dedicated approaches with multiple operator station locations and resource redundancy capabilities.
10
2-High Availability Theoretical Basics
2. High Availability Theoretical Basics

This section describes basic high availability terms, concepts and formulas, and includes examples for typical applications.
2.1. Fault Tolerant System

A Fault Tolerant System is a system that can operate even though a hardware component becomes inoperative. The redundancy principle is often used to implement a Fault Tolerant System. An alternate component takes over the task transparently.
2.2. Lifetime and Failure Rate

For a given system or device, its Lifetime corresponds to the number of hours it can function under normal operating conditions. This number is the result of the individual life expectancy of the components used in its assembly. Lifetime comprises a sequence of three successive phases: the early life (or infant mortality), the "useful life and the "wear-out period. Failure Rate () is defined as the average (mean) rate probability at which a system will become inoperative. When considering events occurring on a series of systems, for example a group of light bulbs, the units to be used should normally be failures-per-unit-of-system-time. Examples include failures per machine-hour or failures per system-year. Because the scope of this document is limited to single repairable entities, we will focus on failures-per-unit-of-time. Failure Rate Example: For a human being, Failure Rate () measures the probability of death occurring in the next hour. Stating (20 years) = 10-6 per hour would mean that the probability for someone age 20 dying in the next hour is 10-6.
11

2.2.1. Bathtub Curve
The following figure shows a bathtub curve, which represents the Failure Rate () according to the Lifetime (t): Consider the relationship between the Failure Rate and Lifetime for a device consisting of assembled electronic parts. This relationship is The Bathtub Curve also represented by the bathtub curve. In "early life", this system exhibits a high Failure Rate which gradually diminishes until it approaches a constant value, which is maintained during its "useful life. The system finally enters the wear-out stage of its life, during which Failure Rate increases exponentially. Note: Useful life normally starts at the beginning of system use and ends at the beginning of its wear-out phase. Assuming that "early life" corresponds to the burnin period indicated by the manufacturer, we generally consider that Useful Life starts with the beginning of system use by the end user.
2.3. RAMS (Reliability, Availability, Maintainability, Safety)

The following text, from the MIL-HDBK-338B standard, defines the RAM criteria and their probability aspect: "For the engineering specialties of Reliability, Availability and Maintainability (RAM), the theories are stated in the mathematics of probability and statistics. The underlying reason for the use of these concepts is the inherent uncertainty in predicting a failure. Even given a failure model based on physical or chemical reactions, the results will not be the time a part will fail, but rather the time a given percentage of the parts will fail or the probability that a given part will fail in a specified time." Along with Reliability, Availability and Maintainability, Safety is the fourth metric of a meta-domain that specialists have named RAMS (also sometimes referred to as dependability).
12

2.3.1. Metrics
RAMS metrics relate to time allocation and depend on the operational state of a given system. The following curve defines the state linked to each term:
MUT: Mean Up Time
MUT qualifies the average duration of the system being in an operational state. MDT: Mean Down Time
MDT qualifies the average duration of the system not being in an operational state. It comprises the different portions of time required to detect the reason for the nonoperation state, fix it, and restore the system to its operational state. MTBF: Mean Time Between Failure
MTBF is defined by the MIL-HDBK-338 standard as follows: "A basic measure of reliability for repairable items. The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions." Thus, for repairable systems, MTBF is a metric commonly used to appraise Reliability, and corresponds to the average time interval (normally specified in hours) between two consecutive occurrences of inoperative states. Put simply: MTBF = MUT + MDT MTBF can be calculated (provisional reliability) using data books such as UTE C80810 (RDF2000), MIL HDBK-217F, FIDES, RDF 93, or BELLCORE. Other inputs include field feedback, laboratory testing, or demonstrated MTBF (Operational Reliability), or a combination of these. Remember that MTBF only applies to repairable systems.
13

MTTF (or MTTFF): Mean Time To First Failure
MTTF is the mean time before the occurrence of the first failure. MTTF (and MTBF by extension) is often confused with useful life, even though these two concepts are not related in any way. For example, a battery may have a useful life of 4 hours and have a MTTF of 100,000 hours. These figures indicate that for a population of 100,000 batteries, there will be approximately one battery failure every hour (defective batteries being replaced).
Consider a repairable system with an exponential distribution Reliability and a constant Failure Rate (), MTTF = 1 / . Mean Down Time is usually very low compared to Mean Up Time. This equivalence is extended to MTBF, assimilated to MTTF, resulting in the following relationship: MTBF = 1 / . This relationship is widely used in additional calculations. Example: Given the MTBF of a communication adapter, 618,191 hours, what is the probability for that module to operate without failure for 5 years? Calculate the module Reliability over a 5-year time period: R(t) = e -t = e -t / MTBF a) Divide 5 years, that is 8,760 * 5 = 43,800 hours, by the given MTBF: 43,800 / 618,191 = 0.07085 b) Then raise e to the power of the negative value of that number: e -.07085 = .9316 Thus, there is a 93.16% probability that the communication module will not fail over a 5-year period.
FIT: Failures In Time
Typically used as the Failure Rate measurement for non-repairable electronic components, FIT is the number of failures in one billion hours. FIT= 109 / MTBF or MTBF= 109 / FIT
14

2.3.2. Safety
Definition Safety refers to the protection of people, assets and the environment. For example, if an installation has a tank with an internal pressure that exceeds a given threshold, there is a high chance of explosion and the possible destruction of the installation (with injury or death of people and damage to the environment). In this example, the safety system put in place will open a valve to the atmosphere in order to prevent the pressure threshold from being crossed.
2.3.3. Maintainability
Definition Maintainability refers to the ability of a system to be maintained in an operational state. This, once again, relates to probability. Maintainability corresponds to the probability for an inoperative system being repaired in a given time interval. If design considerations may impact the maintainability of a system, then the maintenance organization will also have a major impact on its maintainability. Having the right number of people trained to observe and react with the proper methods, tools and spare parts are considerations that usually depend more on the customer organization than on the automation system architecture. Mathematics Basics Equipment shall be maintainable on-site by trained personnel according to the maintenance strategy. A common metric named Maintainability, M(t), gives the probability that a required given active maintenance operation can be accomplished within a given time interval. The relationship between Maintainability and repair is similar to the relationship between reliability and failure, with the Repair Rate (t) defined in a way analogous to the Failure Rate. When this Repair Rate is considered as a constant, it implies an exponential distribution for Maintainability: M(t) = e -t. The maintainability of a given equipment is reflected in MTTR, which is usually considered as the sum of the individual times required for administrative, transport and repair times.
15

2.3.4. Availability
Definition The term High Availability is often used when discussing Fault Tolerant Systems. For example, your telephone line is supposed to offer you a high level of availability: The service you are paying for has to be effectively accessible and dependable. Your line availability is related to the continuity of the service which you are provided. As an example, assume you are living in a remote area with occasional violent storms. Because of your location and the damage these storms can cause, long delays are required to fix your line once it is out of order. In these conditions, if on average your line appears to be usable only 50% of the time, you have poor availability. By contrast, if on average each of your attempts to place a call is 100% satisfied, then your line has high availability. This example demonstrates that Availability is the key metric to measure a systems tolerance level, that is typically expressed in percentage (for example 99.999%), and that it belongs to the domain of probability. Mathematics Basics The Instantaneous Availability of a device is the probability that this device will be in the functional state for which it was designed, under given conditions and at a given time (t), with the assumption that the required external conditions are met. Besides Instantaneous Availability, other variants have been specified, each corresponding to a specific definition, including Asymptotic Availability, Intrinsic Availability and Operational Availability.
16

Asymptotic (or Steady State) Availability: A Asymptotic Availability is the limit of the Instantaneous Availability function as time approaches infinity,
A = 1 MDT MUT + MDT = MUT MUT+MDT = MUT MTBF
Downtime includes here all repair time (corrective and preventative maintenance time), administrative time and logistic time. The following curve shows an example of asymptotic behavior:
Intrinsic (or Inherent) Availability: Ai Intrinsic Availability does not include administrative time and logistic time, and usually does not include preventative maintenance time, either. This is primarily a function of the basic equipment/system design.
Ai = MTBF MTBF + MTTR
We will consider Intrinsic Availability in our Availability calculations. Operational Availability Operational Availability corresponds to the probability that an item will operate satisfactorily at a given point in time when used in an actual or realistic operating and support environment.
A0 = Uptime Operating Cycle
Operational Availability includes logistics time, ready time, and waiting or administrative downtime and both preventative and corrective maintenance downtime. This is the availability that the customer actually experiences. It is essentially the a posteriori availability based on actual events that happened to the system.
17

Classification A quite common way used to classify a system in terms of Availability consists of counting the number of 9s in its availability percentage. The following table defines types of Availability: Class Type of Availability 1 2 3 4 5 6 7 Unmanaged Managed Well Managed Tolerant High Availability Very High Ultra High Availability (%) 90 99 99.9 99.99 99.999 99.9999 99.99999 Downtime per Year 36.5 days 3.65 days 8.76 hours 52.6 minutes 5.26 minutes 30.00 seconds 3 seconds Number of Nines 1-nine 2-nines 3-nines 4-nines 5-nines 6-nines 7-nines
For example, a system that has a five-nine availability rating means that the system is 99.999 % available, with a system downtime of approximately 5.26 minutes per year.
2.3.5. Reliability
Definition A fundamental associated metric is Reliability. Return to the example of your telephone line. If we consider that the wired network is very old, having suffered from many years without proper maintenance, it may frequently be out of order. Even if the current maintenance team is doing his best to repair it within a minimum time, it can be said to have poor reliability if, for example, it has experienced 10 losses of communication during the last year. Notice that Reliability necessarily refers to a given time interval, typically one year. Therefore, Reliability will account for the absence of shutdown of a considered system in operation over a given time interval. As with Availability, we consider Reliability in terms of perspective (a prediction), and within the domain of probability. Mathematics Basics Fortunately, in many situations, a detected disruption does not necessarily mean the end of a devices life. This is usually the case with the automation and control systems being discussed, which are repairable entities. As a result, the ability to 18

predict the number of shutdowns due to a detected disruption over a specified period of time is useful to estimate the budget required for the replacement of inoperative parts. In addition, knowing this figure can help you maintain an adequate inventory of spare parts. Put simply, the question "Will a device work for a particular period?" can only be answered as a probability; hence the concept of Reliability. According to the MIL-STD-721C standard, the definition of Reliability R(t) of a given system is the probability of that system performing its intended function under stated conditions for a stated period of time. As an example, a system with 0.9999 reliability over a year has a 99.99% probability of functioning properly throughout the entire year. Note: Reliability is systematically indicated for a given period of time, for example one year. Referring to the system model we considered with the bathtub curve, one characteristic is its constant Failure Rate during the useful life. During that portion of its lifetime, the Reliability of the considered system will follow an exponential law, reflected in the following formula: R(t) = e -t, where stands for the Failure Rate. The following figure illustrates the exponential law, R(t) = e Failure Rate: As shown in the diagram, Reliability starts with a value of 1 at time zero, which represents the moment the system is put into operation. Reliability then drops gradually down to zero, following the exponential law. Therefore, Reliability is about 37% at t=1/. As an example, assume a given system experiences an average number of 0.5 inoperative states per 1-year time unit. The exponential law indicates that such a system would have about a 37% chance of remaining in operation, reaching 1 / 0.5 = 2 years of service. Note: Considering the flat portion of the bathtub curve model, where the Failure Rate is constant over time and remains the same for a unit regardless of this units age, the system is said to be "memoryless.
-t
, where stands for the
19

2.3.6. Reliability versus Availability
Reliability is one of the factors influencing Availability, but must not be confused with Availabilty: In other words, 99.99% Reliability does not mean 99.99% Availability. Reliability measures the ability of a system to function without interruptions, while Availability measures the ability of this system to provide a specified application service level.
Although Availability is a function of Reliability, it is possible for a system with poor Reliability to achieve high Availability. For example, consider that a system averages 4 failures a year, and for each failure, this system can be restored with an average outage time of 1 minute. Over the specified period of time, MTBF is 131,400 minutes (4 minutes of downtime out of 525,600 minutes per year). In that one-year period: Reliability R(t) = e -t = e - 4 = 1.83%, very poor Reliability (Inherent) Availability A i = very good Availability
MTBF MTBF + MTTR = 131 400 131 400 + 1 = 99.99924% ,
Higher Reliability reduces the frequency of inoperative states, while increasing overall Availability. There is a difference between Hardware MTBF and System MTBF. The mean time between hardware component failures occurring on an I/O module, for example, is referred to as the Hardware MTBF. Mean time between failures occurring on a system considered as a whole, a PAC configuration for example, is referred to as the System MTBF. As will be demonstrated, hardware component redundancy provides an increase in the overall system MTBF, even though the individual components MTBF remains the same.
20
2-High Availability Theoretical Basics 2.4. Reliability Block Diagrams (RBD)

Based on basic probability computation rules, RBDs are convenient tools to represent a simple system and its components to determine the Reliability of the system.
2.4.1. Series-Parallel Systems

The target system, for example a PAC rack, must first be interpreted in terms of series and parallel arrangements of elementary parts. The following figure shows a representation of a serial architecture:
PAC Rack - Single Power Supply - 4 modules POWER SUPPLY MODULE A MODULE B MODULE C MODULE D
POWER SUPPLY
MODULE A
MODULE B
MODULE C
MODULE D
We make the assumption that one of the 5 modules (1 power supply and 4 other modules) that populate the PAC rack becomes inoperative. As a consequence, the entire rack is affected, as it is no longer 100% capable of performing its assigned mission regardless of which module is inoperative. Thus, each of the 5 modules is considered as a participant member of a 5-part series. Note: When considering Reliability, two components are described as in series if both are necessary to perform a given function.
21

The following figure shows a representation of a parallel-series architecture:
PAC Rack - Dual Power Supply - 4 modules POWER SUPPLY 1 MODULE A MODULE A MODULE B MODULE B MODULE C MODULE C MODULE D MODULE D POWER SUPPLY 2
POWER SUPPLY 1 MODULE A POWER SUPPLY 2 MODULE B MODULE C MODULE D
We now assume that the PAC rack contains redundant power supply modules, in addition to the 4 other modules. If one power supply becomes inoperative, then the other supplies power for the entire rack. These 2 power supplies would be considered as a parallel sub-system, in turn coupled in series with the sequence of the 4 other modules. Note: Two components are in parallel, from a reliability standpoint, when the system works if at least one of the two components operates properly. In this example, Power Supply 1 and 2 are said to be in active redundancy. The redundancy would be described as passive if one of the parallel components is turned on only if the other is inoperative, for example in the case of auxiliary power generators.
22

Serial RBD Reliability Serial system Reliability is equal to the product of the individual elements reliability.
n R S (t) = R1(t) R2 (t) R3 (t) ... R n (t) = R i (t) i=1

where: Rs(t) = System Reliability Ri(t) = Individual Element Reliability n = Total number of elements on the serial system Assuming constant individual Failure Rates:
R (t) = R (t) R (t) R (t) ... R (t) = e S 1 2 3 n t ( + + + ... + t t t 3 ... e 1 2 3 n 1 e 2 e n =e
That is
n = : Equivalent Failure Rate for n serial elements is equal to the sum S i i =1

St
of the individual Failure Rate of these elements, with R S (t) = e Example 1:
Consider a system with 10 elements, each of them required for the proper operation of the system, for example a 10 modules rack. To determine RS(t), the Reliability of that system over a given time interval t, if each of the considered elements shows an individual Reliability Ri(t) of 0.99:
10 R S (t) = R i (t) = (0.99) x (0.99) x (0.99) . . . (0.99) = (0.99)10 = 0.9044 i=1
Thus, the targeted system Reliability is 90.44%. Example 2: Consider two elements with the following Failure Rates:
1 = 120 x 10-6 h-1 and 2 = 180 x 10-6 h-1
Element 1 1 = 120.10-6h-1
Element 2 2 = 180.10-6h-1
23

The system Reliability, over a 1,000- hour mission, is:
n S = i = 1 + 2 = 120 x 10-6 + 180 x 10-6 = 300 x 10-6 = 0.3 x 10-3 h-1 i=1
R S (1000 h) = e S t =e 0.3 x 103 x 103 =e 0.3 = 0.7408
Thus the targeted system Reliability over the considered time period is 74.08%. Availability Serial system Availability is equal to the product of the individual elements Availability.
n A S = A1 . A 2 . A 3 .... .A n = A i i=1
where:
As= system (asymptotic) availability Ai = Individual element (asymptotic) availability n = Total number of elements on the serial system
24

Calculation Example In this example, we calculate the availability of a PAC station using shared distributed I/O islands. The following illustration shows the final configuration:
This calculation applies the equations given by basic probability analysis. To do this calculation, a spreadsheet was developed. These are the figures applied in the spreadsheet: Failure Rate: = 1/MTBF Reliability: R(t) = e -t Total serial systems Failure Rate
n = S i i =1
Total MTBF serial systems = 1 / s Availability = MTBF/ MTBF+ MTTR Unavailability = 1 Availability Unavailability over years: Unavailability hours (one year = 8460 hours) The following table shows the method to perform the calculation: Step 1 2 3 Action Perform the calculation of the standalone populated CPU rack. Perform the calculation of a distributed island. Based on the serial structure, add up the results from Steps 1 and 2.
Note: A common variant of in-rack I/O module stations are I/O islands distributed on an Ethernet communication network. Schneider Electric offers a versatile family named Advantys STB, which can be used to define such architectures. Step 1: Calculation linked to the standalone CPU rack. The following figures represent the standalone CPU rack:
25
The following screenshot is the spreadsheet corresponding to this analysis.
Step 2: Calculation linked to the STB island. The following figures represent the distributed I/O on STB island:
The following screenshot is the spreadsheet corresponding to this analysis:
26

Step 3: Calculation of the entire installation. Assume that the communication network impact on reliability/availability calculations is negligible. The following figures represent the final distributed architecture:
The following screenshot is the spreadsheet corresponding to the entire analysis:
Note: The highlighted values were calculated in the two previous steps.
Considering the results of this serial system (Rack # 1+ Islands # 1 ... #4), Reliability over one year is approximately 82 % (the probability that this system will encounter one failure during one year is approximately 18%). System MTBF itself is approximately 44,000 hours (about 5 years) Considering the Availability, with a 8-hour Mean Time To Repair (which is a typical figure with a proper logistics and maintenance organization), the system would achieve a 3-nines Availability, an average probability of approximately 95 minutes downtime per year.
27

Parallel RBD Reliability The theory of Probability provides expression for the Reliability of a parallel system with Qi(t) (unreliability) being the reverse of Ri(t).
Q i (t) = R i (t) = 1 - R i (t) :

R (t ) = 1 [ Q1 (t ) x Q2 (t ) x Q3 (t ) x s
with:
....
n n x Qn (t ) ] = 1 Qi (t ) = 1 1 Ri (t ) i =1 i =1
Rs = Reliability of the simple parallel system

n Q i = Probability of failure of the system i=1
Ri = Probability of non-failure of an individual parallelized element Qi = Probability of failure of an individual parallelized element n = Total number of parallelized elements Example: Considering two elements with the following Failure Rates:
1 = 120 x 10-6 h-1 and 2 = 180 x 10-6 h-1

The system Reliability, over a 1,000 hour mission, is:
Reliability of elements 1 and 2 over the 1,000 hour period:
R1 (1,000 h) = e 1t = e 120 x 10
x 103 x 10
3
= 0.8869 = 0.8353
R2 (1,000 h) = e 2t = e 180 x 10
Unreliability of elements 1 and 2 over the 1,000 hour-period:
Q1 (1,000 h) = 1 R1 (1,000 h) = 1 0.8869 = 0.1130 Q2 (1,000 h) = 1 R2 (1,000 h) = 1 0.8353 = 0.1647

28

Redundant system Reliability over the 1,000 hour period:
R12 (t = 1,000 h) = 1 [ Q1 (t = 1,000 h) . Q2 (1,000 h) ] = 1 (0.1130 x 0.1647) = 0.9814

Thus, with individual elements Reliability of 88.69% and 83.53% respectively, the targeted parallel system Reliability is 98.14% Availability Parallel system Unavailability is equal to the product of the individual elements Unavailability. Thus, the parallel system Availability, given the individual parallelized elements Availability, is:
AS = 1 [ (1 A1 ) . (1 A2 ) . ... . (1 An ) ] = 1 (1 Ai )
i =1
where:
As= system (asymptotic) availability Ai = Individual element (asymptotic) availability n = Total number of elements on the serial system
29

Calculation Example To illustrate a parallel-series system, we perform the calculation of the Availability of a PAC system controlling distributed I/O islands, its local rack being allotted a redundant power supply. The following illustration shows the target configuration:
The formulas are the same as the ones used in the previous calculation example, except for the calculation of the Reliability of a parallel system, which is as follows:
Rs (t ) = 1 [ Q1 (t ) x Q2 (t ) x Q3 (t ) x
....
n n x Qn (t ) ] = 1 Qi (t ) = 1 1 Ri (t ) i =1 i =1
This formula is used to consider the power supply subsystem. The following table shows the method to perform the calculation: Step 1 2 Action Perform the calculation of the redundant power supplies sub-system Perform the calculation for the local CPU rack with its redundant power supplies 3 4 Perform the calculation of a distributed island Concatenate the results from Steps 2 and 3
Note: The previous results from the serial analysis regarding the calculation linked to the distributed islands are reused.
30

Step 1: Calculation of the redundant power supply sub-system.
Step 2: Calculation of the local CPU rack with its redundant power supplies. The following figure show the local CPU rack:
The following screenshot is the spreadsheet corresponding to this analysis:
31

Step 3: Calculation linked to the STB island. Because the analysis is identical to that of the serial case, the following screenshot shows the spreadsheet corresponding to the final results only:
Step 4: Calculation of the entire installation. The following screenshot is the spreadsheet corresponding to the entire analysis:
Looking at the results of this system, it should be noted that the enhancement brought by the power supply modules redundancy only makes sense if doing a comparison between single and redundant power supply figures. So, Power Supply MTBF itself would increase from 485,873 hours (approximately 55 years) to 27,434,080 hours (approximately 3131 years).
Note: The previous examples cover the PAC station only. To extend the calculation to an entire system, the MTBF of the network components and the SCADA systems (PC, servers) must be taken into account.
32

Conclusion Serial System The above computations demonstrate that the combined availability of two components in a series is always lower than the availability of its individual components. The following table gives an example of combined availability in serial system: Component X Y X and Y Combined Availability 99% (2-nines) 99.99% (4-nines) 98.99% Downtime 3.65 days/year 52 minutes/year 3.69 days/year
This table indicates that even though the very high availability of Part Y was used, the overall availability of the system was reduced by the low availability of Part X. It is generally accepted that "a chain is as strong as its weakest link". However, in this instance, the chain is actually weaker than its weakest link. Parallel System The above computations indicate that the combined availability of two components in parallel is always higher than the availability of its individual components. The following table gives an example of combined availability in a parallel system: Component X Two X components operating in parallel Three X components operating in parallel 99.9999% (6-nines) 31 seconds /year ! Availability 99% (2-nines) 99.99% (4-nines) Downtime 3.65 days/year 52 minutes/year
This indicates that even though a very low availability Part X was used, the overall availability of the system is much higher. Thus, redundancy provides a powerful mechanism for making a highly reliable system from low reliability components
33

Redundant System Considering a Hot Standby system, a common temptation would consist in using parallel formulas to calculate dependability figures. Unfortunately, this approach leads to erroneous results, as far as the nice exponential relation linking Reliability to a Constant Failure rate doesn't apply anymore. In other words, parallel formulas do not take in account such tricks as common mode failures, or undetected failures. Besides RBDs, some other analytical methods have been designed, such as the well known Markov chains. Anyway, Markov modelization appears to become very complicated as the complexity of the considered system increases. And simplified forms of this type of model, applied to a Hot Standby configuration give even more optimistic figures, compared to parallel formulas. It finally appears that, having to produce realistic figures for a redundant architecture, there is no real other choice that using a dedicated professional software tool (what deserves skills, time and budget), or looking for some help. Note that this type of service can be provided by Schneider Electric. Last, please remember that focusing on producing a single figure such as a MTBF for a whole architecture is not necessarily the most adapted approach. A criticality assessment would rather try to identify the "undesired events" (for example, loosing a pump motor control). Then, for each of these undesired events, it would attempt to evaluate the Reliability and Availability of the associated "control chain". Finally, these figures would help to decide whether or not a solution would have to be provided to enhance the availability of a given "control chain", for example by implementing a redundant PAC system.
34
3-High Availability with PlantStruxure
3. High Availability with PlantStruxure

In an automation system, how can you reach the level of availability required to keep the plant in operation? By what means should you enforce the system architecture, providing and maintaining access to the information required to monitor and control the process? This chapter provides answers to these questions, and reviews the system architecture from top to bottom, that is to say, from operator stations and data servers (Information Management) to controllers and devices (Control System Level), via communication networks (Communication Infrastructure Level). The following figure illustrates the PlantStruxure system:
Ethernet
Ethernet
PAC
PAC
PAC
Ethernet
Profibus DP Ethernet
This system architecture drawing above shows the various layers where redundancy capabilities have to be proposed:
Operating and Monitoring System Control Network Control System Field Network
35
3-High Availability with PlantStruxure 3.1. Operating and Monitoring

3.1.1. Redundancy Level
Key Features The key features Vijeo Citect SCADA software has to handle relate to:
Data acquisition Graphics (displays and user interfaces) Events and alarms (including time-stamping) Measurements and trends Recipes Reports
In addition, any current SCADA package provides access to a treatment/calculation module (openness module), allowing users to edit program code (IML/CML, Cicode, Script VB, etc.). Note: This model is applicable for a single station, including small applications. The synthesis between the stakes and the key features will help to determine the most appropriate redundant solution. Stakes Considering the previously defined key features, stakes when designing a SCADA system include: Risk Analysis Linked to the previous stakes, the risk analysis is essential to defining the redundancy level. Consider the events the SCADA system will face, that is the risk, in terms of: Inoperative hardware Inoperative power supply Environmental events (natural disaster, fire, etc.) 36 Is a changeover possible? Does a time limit exist? What are the critical data? Has cost reduction been considered?

that can imply loss of data or operator screen or connection with devices and so on. Level Definition Finally, the redundancy level is defined as the compilation of the key features, the stakes, and the risk analysis with customer expectations related to the data criticality level. The following table illustrates the flow from the process analysis to the redundancy level:
Process Analysis - Key Features - Stakes - Risk Analysis
Customer Expectations Data loss without importance Data loss allowed Data must not be lost
Redundancy Level - No Redundancy - Cold or Warm Standby - Hot Standby
The table below explains the redundancy levels: Redundancy Level No redundancy Cold Standby State of the standby system No standby system The standby system is only powered up if the default system becomes inoperative. Warm Standby The standby system switches from normal to backup mode. Hot Standby The Standby system runs together with the default system. Switchover performance Not applicable Several minutes Large amount of lost data Several seconds Small amount of lost data Several milliseconds No lost data
37

3.1.2. Principles
Display Client
The Vijeo Citect functional organization corresponds directly to a Client/Server philosophy. An example of a Client/Server topology is shown in the diagram on the left: a single display client operator station with a single I/O server in charge of device data
I/O Server
(PAC) communication.
I/O Device
Vijeo Citect architecture is a combination of several operational entities that handle Alarms, Trends and Reports. In addition, this functional architecture includes at least one I/O server. The I/O server acts as a client to the peripheral devices (PAC) and as a server to Alarms, Trends and Reports (ATR) entities. ATR and I/O server(s) act either as a client or as a server, depending on the designated relationship. The default mechanism linking these clients and servers is based on a publisher/subscriber relation. As shown in the following screenshot, because of its client/server model, Vijeo Citect can create a dedicated server, depending on the application requirements: for example for ATR or for I/O server:
Display Clients
I/O Server Alarm Server Report Server Trend Server I/O Devices
An example of redundancy is a complete duplication of the first server. Basically, if a system becomes inoperative, for example Server A, Server B would retain the job and respond to the service requests made by the clients.
38

Based on data criticality, the second level duplicates all the servers, ATR and I/O. Identical data is maintained on both servers. The following diagram illustrates this configuration.
Display Clients
Primary I/O Server
Standby I/O Server
Primary servers Alarm/Trend/Report
Standby servers Alarm/Trend/Report
I/O Devices
3.1.3. I/O Server Redundancy

Vijeo Citect can define Primary and Standby servers within a project, with each element of a pair being
Display Client
held by different hardware. The first level of redundancy duplicates the I/O server, as shown in the illustration. In this case, a Standby server is maintained in parallel to the Primary server. In the event of a detected interruption in the hardware, the Standby server will assume control of the communication with the devices according to the priority allocation. When the primary server is
I/O Servers
I/O Device
returned to service, the SCADA system
automatically returns control back to the primary server. This is done by synchronization from the operating standby server and allowing the clients to reconnect. With this system, you can use redundant I/O servers to share the normal operational processing load. This would allow higher performance as the I/O servers would be running in parallel when servicing the I/O devices. An I/O Device can be hosted by one or several I/O servers but is accessed by one single I/O server at a time. If the Primary I/O Server fails to access the I/O Device, active path is allotted to the first Standby server and so on.
39

A given I/O server is able to handle
I/O Server 1 (Primary) I/O Server 2 (Standby)
designated pairs of devices, Primary and Standby. This device redundancy does not rely on a PAC Hot Standby mechanism: Primary and Standby
1 2
3 4
devices are assumed to be concurrently acting on the same process, but no assumption is made concerning the
I/O Device
relationship between the two devices.
Seen from the I/O server, this redundancy offers access only to an alternate device in case the first device becomes inoperative. The I/O device is an extension of I/O device redundancy, providing for more than one Standby I/O device. Depending on the user configuration, a given order of priority applies when an I/O server (potentially a redundant one) needs to switch to a Standby I/O device. For example, in the figure above, a single I/O Device is seen as 4 different devices from the SCADA system, the Device 1 is set up as the Primary and the other ones as Standby. Thus, it is possible to affect different priority to the standby devices. In our example, I/O Device 3 would be allotted the highest priority, then I/O Device 2, then finally I/O Device 4. In those conditions, in the case of a detected interruption occurring on Primary I/O Device 1, a switchover would take place, with I/O Server 2 handling communications, and with Standby I/O Device 3. If an interruption is then detected on I/O Device 3, a new switchover would take place, with I/O Server 1 handling communications, with Standby I/O Device 2. Finally, if there is an interruption on I/O Device 2, yet another switchover would take place, with I/O Server 2 handling communications, with Standby I/O Device 4.
3.1.4. I/O Data Path Redundancy

I/O Server
Data path redundancy does not involve alternative device(s), but alternative data paths between the I/O server and connected I/O devices. Thus, if one data path becomes inoperative, the other is used. Note: Vijeo Citect reconnects through the primary data path when it is
Primary Data Path
Standby Data Path
returned into service.
I/O Device
40

On a larger Vijeo Citect system, you can also use data path redundancy to maintain device communications through multiple I/O server redundancy, as shown in the diagram on the right.
Primary Data Path Standby Data Path Primary I/O Server LAN Standby I/O Server
I/O Device
3.1.5. LAN Redundancy

Each SCADA system component in the cluster can be connected to a second TCP/IP LAN (LAN2). Starting with Vijeo Citect V7.0, LAN Redundancy can now be implemented very simply just by specifying multiple IP addresses for each server As previously indicated, redundancy of Alarms, Reports, Trends, and I/O Servers is achieved by adding standby servers. Vijeo Citect can also use the dual end point (or multiple network interfaces) potentially available on each server, enabling the specification of a complete and unique network connection between a Client and a Server.
Primary Servers
LAN 1 I/O Alarm Report Trend
Display Clients
I/O LAN 2
Alarm
Report
Trend
Standby Servers
41

3.1.6. Alarm Server Redundancy
It is possible to configure two alarm servers in a SCADA project. That is a primary alarm server and a standby alarm server, as described in the following diagram:
Primary I/O Server
Standby I/O Server
Primary Alarm Server
Standby Alarm Server
I/O Device
The alarm servers perform parallel data read from I/O server and parallel alarms processing. If the primary server fails, the standby alarm server starts to log alarms to devices. When an alarm server first starts up, it tries to establish a connection to the redundant alarm server. If it can connect, it transfers the dynamic alarm data from the running server to the other alarm server. If the connection cannot be established, the alarm server opens the save file and restore data from the file. The Alarm On/Off status is not exchanged between servers. Information is transmitted from Primary to Standby alarm server when operator acts on alarms (for example, acknowledge, disable, enable, add comments and so on). On startup, all Clients try to establish a connection with Primary alarm server, if the connection cannot be established, another attempt takes place, trying to reach Standby alarm server. If Primary alarm server responds and is available, any client connected to Standby alarm server remains connected
3.1.7. Trend Server Redundancy

It is possible to configure two trend servers in a SCADA project. That is a primary trend server and a standby trend server. When both trends servers are in operation, they execute parallel processing. The historical archive is written to disk and each server must write to its own local disk or on its own private area on the file server.
42

When a trend server starts up or after a network disconnection, it tries to establish a connection to the other trend server. If it can establish a connection, it will transfer all the trend data from the last tile it was shutdown until the current time in order to backfill missing data. On startup, all Clients try to establish a connection with Primary trend server, if the connection cannot be established, another attempt takes place, trying to reach Standby trend server. If Primary trend server responds and is available, any client connected to Standby trend server remains connected.
3.1.8. Report Server Redundancy

It is possible to configure two report servers in a SCADA project. That is a primary report server and a standby report server. The scheduled reports usually run on Primary report server only. If the Primary report server fails, scheduled reports will then run on Standby report server. If required, the Standby report server can also be configured to run the scheduled reports in parallel with Primary report server. On startup, all Clients try to establish a connection with Primary report server, if the connection cannot be established, another attempt takes place, trying to reach Standby report server. If Primary report server responds and is available, any client connected to Standby report server remains connected.
3.1.9. File Server Redundancy

The SCADA solution also allows redundancy for a File server. If the SCADA system cannot find a runtime file, the backup path is parsed. If the file is found in the backup path, the SCADA system assumes that the run path has failed (that is the file server has failed). File server redundancy removes any dependence on network file location and as a result, control systems dont have to be dependant on one file server for its operation ensuring stability in the event of network failure.
43

3.1.10. Clustering
A cluster may contain several possibly redundant I/O servers (maximum of one per machine), and standalone or redundant ATR servers; these latter servers being implemented either on a common or on separate machines.
Server
Primary I/O Server
Server
Primary Trend Server Primary Alarm Server Primary Report Server
Server
Standby Trend Server Standby Alarm Server Standby Report Server
Client
Display Client
Server
Standby I/O Server
Client
Display Client
Cluster
The cluster concept offers a response to a typical scenario of a system separated into several sites, with each of these sites controlled by local operators and supported by local redundant servers. The clustering model can concurrently address an additional level of management that requires all sites across the system to be monitored simultaneously from a central control room. With this scenario, each site is represented by a separate cluster, grouping its primary and standby servers. Clients on each site are interested only in the local cluster, whereas clients in the central control room are able to view all clusters. Based on cluster design, each site can then be addressed independently within its own cluster. As a result, deployment of a control room scenario is fairly straightforward, with the control room itself only needing display clients.
- Alarm 1 - Trend 1
- Report 1 - I/O 1
- Alarm 1 - Trend 1 - Report 1 - I/O 1
Primary 1 Cluster 1
Standby 1 & 2
Primary 2 Cluster 2
44

The cluster concept does not actually provide an additional level of redundancy. Regarding data criticality, clustering organizes servers and, consequently, provides additional flexibility. Each cluster contains only one pair of ATR servers. Those pairs of servers, redundant to each other, must be on different machines. Each cluster can contain an unlimited number of I/O servers; those servers must also be on different machines to increase the level of system availability.
3.1.11. Conclusion
The following illustration shows a complete installation. Redundant solutions previously discussed can be identified: Scada Clients
Data Servers
Control Network
Targeted Devices
PAC
PAC
45
3-High Availability with PlantStruxure 3.2. Redundant Network System

The previous sections reviewed various aspects of enhanced availability at the Information Management level. This section covers High Availability concerns between the Information level and the Control level. A proper design at the communication infrastructure level must include: Analysis of the plant topology Localization of the critical process steps The definition of network topologies The appropriate use of communication protocols
There are different features needed to increase the resilience of a redundant network: Manageable switches - Support redundancy management protocol - Able to find and use an alternate path to access a target device - Clever enough to avoid creating loops - Generally offer a redundant DC power supply input Network topology - Ring (single or dual) - Daisy chaining loop - Mesh Network Redundancy Management Protocols - Rapid Spanning Tree (RSTP) - Ring management protocols (MRP, HYPERring)
3.2.1. Plant Topology

The first step of the communication infrastructure level definition is the plant topology analysis. From this survey, the goal is to gather information to develop a networking system diagram prior to defining the network topologies. This plant topology analysis must be done as a top-down process: Breakdown of the plant into selected areas Localization of the areas to be connected
46

Localization of the hazardous areas Localization of the station and the nodes included in the areas to be connected Localization of the existing networks and cabling paths, in the event of expansion or redesign Before defining the network topologies, the following project requirements must be considered: Expectation of availability, according to the criticality of the process or data Cost constraints Operator skill level
From the project and the plant analyses, identify the different functional areas and the communication between them. The following diagram shows the result of such an analysis applied on a water treatment plant.
Mechanical Treatment
Control and PAC Rooms
Sludge
Biological Treatment
47

3.2.2. Network Topology
Topologies Following the process criticality analysis, the networking diagram can be defined by selecting the relevant network topology. The following table describes the main topologies from which to choose: Architecture Bus Limitations The traffic must flow serially, therefore the bandwidth is not used efficiently. Star Efficient use of the bandwidth, as the traffic is spread across Cable ways and distances Tree the star. Preferred topology when there is no need for redundancy. Ring Auto-configuration if used with self-healing protocol. Behavior quite similar to Bus. Dual Ring Possible to couple others rings for increasing redundancy. The auto-configuration depends on the protocol used. Advantages Cost-effective solution Disadvantages If a switch becomes inoperative, the communication is lost. If the main switch becomes inoperative, the communication is lost.
Note: These different topologies can be mixed to define the plant network diagram.
48

The following diagram shows the level of availability based on topology:
In automation architecture, ring (and dual ring) topologies are the most commonly used to increase the availability of a system. Mesh architecture is less used in process applications; therefore we will not discuss it in detail. All these topologies are feasible using Schneider Electric ConneXium switches. Ring and Multiple Ring Principles In a ring topology, four events can occur that would lead to a loss of communication: 1) Broken ring line 2) Inoperative ring switch 3) Inoperative ended-line devices 4) Inoperative ended-line devices switch The following diagram illustrates these four occurrences:
49

To protect the network architecture from these events, several communication protocols such as HIPER-Ring, MRP, Fast HIPER-Ring and RSTP are proposed and described in the following section. The use of at least one active component, usually named Redundancy Manager (RM), allows enhancing the availability of a ring topology. Such an Ethernet network is distributed as a bus but built as a ring. The two figures below illustrate the RM reaction facing a broken line event.
Consider an Ethernet loop designed with a RM switch. In normal conditions, this RM switch will RM open the loop which prevents Ethernet frames from circulating endlessly in a loop.
PAC PAC
If a break occurs, the Redundancy Manager RM switch reacts immediately and closes the Ethernet loop, bringing the network back to full operating condition.
PAC PAC
A mix of dual networking and network redundancy is possible.
Redundant SCADA System
Dual Ring Control Network
Hot-Standby PACs
Topological factors may lead to consideration of a network layout aggregating satellite rings or segments around a backbone network (itself designed as a ring or as a segment).
50

This may be an effective design, considering the junction between trunk and satellites, especially if backbone and satellite networks have been designed as ring networks to provide for High Availability. With its Connexium product line, Schneider Electric offers switches that may afford a redundant coupling. Several variations allow connection to the network. Ring Coupling This architecture is used with Standalone or Hot-Standby PAC and handles a capability of dual access of target devices or servers. The following drawing illustrates a common ring coupling architecture:
Ring coupling capabilities increase the level of networking availability by allowing different paths to access targeted devices. New generation Schneider ConneXium switches authorize different architectures based on dual ring. A unique switch is able to couple two Ethernet rings, extending the capabilities of the Ethernet architecture.
51

One Switch Coupling The following illustration shows this architecture, where 2 ports of a ring switch are coupled to 1 port each of 2 backbone switches. Two different paths are available to link the 2 rings: one primary link and one redundant link, blocking traffic during normal operation. When primary link becomes inoperable, redundant link is activated. When primary link becomes functional again, redundant link is blocked for normal operation. The protocols available for this architecture are: HIPER-Ring, MRP and Fast HIPERRing.
RM
Main link
RM
Backbone Ring
Redundant link
Ring
PAC
Two Switches Coupling The following illustration shows this architecture, where 1 port of each of the 2 ring switches is connected 1 port of each of the 2 backbone switches. Two different paths are available to link the 2 rings: one primary link and one redundant link, blocking traffic during normal operation. When primary link becomes inoperable, redundant link is activated. When primary link becomes functional again, redundant link is blocked for normal operation. During operation, Homothetic switches exchange control packets to inform each other about their operational state. The protocols available for this architecture are: HIPER-Ring, MRP and Fast HIPERRing.
RM
Main link
RM
Backbone
Redundant link
Ring
PAC
52

Ring Nesting This architecture allows the aggregation of multiple sub-rings on a primary ring. It can be used with Standalone or Hot-Standby PAC. Primary ring supports HIPER-ring, MRP, or Fast HIPER-ring protocol and sub-rings only support MRP protocol. Note: This architecture is only available with the new Connexium extended switches (TCSESM-E). The following drawing illustrates a common ring nesting architecture:
As seen on the figure below, this architecture is built with one Ring Manager (RM) switch on the Primary Ring and a pair of Sub-Ring Managers (SRM) switches for each Sub-Ring.
SRM 1 RM
Primary Ring
Sub-Ring
SRM 2
The sub-rings can overlap, but cannot be cascaded:

SRM 1 RM SRM 2
Primary Ring
Sub-Ring SRM 3
Sub-Ring SRM 4
53

One given Sub-ring Manager switch can be shared by several Sub-Rings:
SRM 1 RM
Primary Ring
Sub-Ring SRM 2
Sub-Ring SRM 3
Independently from the management protocol, the Ring Manager switch normally opens the Primary Ring. It closes it in case of a Ring discontinuity, and automatically reopens when the situation is fixed The Sub-Ring Manager switches operate in a default/backup mode: the second one blocking the frames circulation in normal mode. The Backup Sub-Ring manager switch takes the lead when the default one fails. It automatically returns to standby mode when default manager is back Dual Ring
This architecture allows a significant increase of the level of Availability. The implementation of such topology implies:
Servers with dual communication cards PAC with two Ethernet modules I/O devices with two Ethernet interfaces
The protocols used for both rings can be chosen from: MRP, HIPER-Ring, Fast HIPER-Ring. Note that in such a design, a SCADA I/O server has to be equipped with
54

two communication boards, and reciprocally, each device (PAC) has to be allotted two Ethernet ports. The Dual Ring topology just replicates the chosen type of single architecture (ring, ring coupling, ring nesting), therefore, each terminal node gets a dual network access. The figure below presents a typical dual ring architecture:
3.2.3. Redundancy Management Protocols

The management of an Ethernet ring requires dedicated communication protocols as described in the following table:
Fast HIPERRing
HIPERRing Structure
MRP
RSTP
Mesh Ring Tree
802.1W 802.1D
Ring
Ring
IEC 62439
Ring
Standard Mix fr. Diff Manufact. Support Mult. Fail.

Size Limit
Proprietary
Licenced
Proprietary
(Mesh)
50 switches in a ring 80 ms @ 50 sw
39 Hops
Typical Recovery time Worst Case Recovery time
<1s
(1)
3/500 ms
@ 50 sw
2/500 ms
@ 50 sw
< 30 s
55

Each protocol is characterized by different performance criteria in terms of fault detection and global system recovery time. Rapid Spanning Tree Protocol (RSTP) RSTP stands for Rapid Spanning Tree Protocol, described in IEEE 802.1w standard
(The new edition of the 802.1D standard, IEEE 802.1D-2004, incorporates IEEE
802.1t-2001 and IEEE 802.1w standards). Based on STP, RSTP introduces some additional parameters that must be entered during the switch configuration. These parameters are used by the RSTP protocol during the path selection process. Because of these, the reconfiguration time is much faster than with STP (typically less than one second).
The TCSESM ConneXium switches allow good performance of RSTP management with a detection time of 15 ms and a propagation time of 15 ms for each switch. For a 6-switches configuration, the recovery time is about 105 ms. HIPER-Ring (Version 1)
RM HIPER-Ring
Version 1 of the HIPER-Ring networking strategy has been available for approximately 10 years. It applies to a Self Healing Ring networking layout. Such a ring structure may include up to 50 switches. When configuring a Connexium TCS ESM switch for HIPER-Ring V1, the user is asked to choose between a maximum Standard Recovery Time, which is 500 ms, and a maximum Accelerated Recovery Time, which is 300 ms. As a result, if an issue occurs on a link cable or on
56

one of the switches populating the ring, the network will take about 150 ms to detect this event, and cause the Redundancy Manager switch to close the loop. Note: The Redundancy Manager switch is said to be active when it opens the network. Note: If a recovery time of 500 ms is acceptable, then no switch redundancy configuration is needed and only dip switches have to be set up. MRP
RM MRP
MRP is an IEC 62439 industry standard protocol based on HIPER-ring. Therefore, all switch manufacturers can implement MRP if they so choose. This allows a mix of different manufacturers in an MRP configuration. Schneiders switches support a selectable maximum recovery time of 200ms or 500ms and a 50-switch maximum ring configuration. TCESM switches also support redundant coupling of MRP rings. MRP rings can easily be used instead of HIPER-Ring. MRP requires that all switches are configured via Web pages and allows for recovery times of 200ms or 500ms. Additionally, the I/O network can be a MRP redundant network and the control network HIPER-Ring, or vice versa.
57

Fast HIPER-Ring A family of Connexium switches named TCS ESM Extended offers a third version of HIPER-Ring strategy, named Fast HIPER-ring. Featuring a guaranteed recovery time of less than 10 milliseconds, the fast HIPERRing structure allows both a cost optimized implementation of a redundant network as well as maintenance and network extension during operation. This makes fast HIPER-Ring especially suitable for complex applications such as a combined transmission of video, audio and data information.
3.2.4. Selection
To conclude the communication level section, the following table presents all the communication protocols to help you select the most appropriate installation for your High Availability solution:
Selection Criteria Ease of configuration or installed base
Solution HIPER-Ring
Comments If a recovery time of 500 ms is acceptable, then no switch redundancy configuration is needed. Dip switches have to be set up only.
New installation
MRP
All switches are configured via Web pages. Installation with one MRM (Media Ring Manager) and X MRCs (Media Ring Client).
Open architecture with multiple vendor switches Complex architecture
RSTP
Reconfiguration time: 15 ms (detected fault) + 15 ms per switch.
MRP, RSTP or FAST HIPERRing
We recommend MRP or RSTP for High Availability with dual ring, and FAST HIPER-Ring for high performance.
58
3-High Availability with PlantStruxure 3.3. Redundant Control System

Having detailed High Availability aspects at the Information Management level, focused on SCADA architecture, represented by Vijeo Citect, we will now concentrate on High Availability concerns at the Control level. Specific discussion will focus on PAC Hot Standby architecture.
3.3.1. Redundancy Principles

Modicon Quantum and Premium PAC provide Hot Standby capabilities and have several shared principles:
1. The type of architecture is shared. A Primary unit executes the program, with a Standby unit ready but not executing the program (apart from the first section of it). By default, these two units contain an identical application program. 2. The units are synchronized. The Standby unit is aligned with the Primary unit. Also, on each scan, the Primary unit transfers its "database to the Standby unit, that is, the application variables (located or not located) and internal data. The entire database is transferred, except the "Non-Transfer Table", which is a sequence of Memory Words (%MW). The benefit of this transfer is that, in case of a switchover, the new Primary unit will continue to handle the process, starting with updated variables and data values. This is referred to as a "bumpless" switchover. 3. The Hot Standby redundancy mechanism is controlled via the "Command Register" (accessed thru %SW60 system word); reciprocally, this Hot Standby redundancy mechanism is monitored via the "Status Register" (accessed thru %SW61 system word). As a result, as long as the application creates links between these system words and located memory words, any HMI can receive feedback regarding Hot Standby system operating conditions, and, if necessary, address these operating conditions.
59

4. For any Ethernet port acting as a server (Modbus/TCP or HTTP protocol) on the Primary unit, its IP address is implicitly incremented by one on the Standby unit. In case a switchover occurs, homothetic addresses are automatically exchanged. The benefit of this feature, from a SCADA/HMI perspective, is that the "active" unit is still accessed at the same IP address. No specific adaptation is required at the development stage of the SCADA / HMI application. 5. The common programming environment used with both solutions is Unity Pro. No particular restrictions apply when using the standardized (IEC 1131-3) instruction set. In addition, the portion of code specific to the Hot Standby system is optional, and is used primarily for monitoring purposes. That means that with any given application, the difference between its implementation on standalone architecture and its implementation on a Hot Standby architecture is mainly cosmetic. Consequently, a user familiar with one type of Hot Standby system does not have to start from scratch if he has to use a second type. The initial investment is preserved and re-usable and the differences between the two technologies are not significant.
3.3.2. PAC Hot Standby Architectures

Depending on project constraints or customer requirements (performance, installed base or project specifications), a specific Hot Standby PAC station topology can be selected: Hot Standby PAC with in-rack I/O or remote I/O Hot Standby PAC with distributed I/O, connected on standard Ethernet or connected to another device bus, such as Profibus DP. Hot Standby PAC mixing different I/O topologies
The following table presents the available configurations with either a Quantum or Premium PAC:
In Rack & Remote I/O Quantum PAC Premium PAC Configuration 1 Configuration 2 Distributed I/O Ethernet Profibus Configuration 3 Configuration 5 Configuration 6 Not Applicable
Note: A sixth configuration may be considered which combines all other configurations listed above
60

3.3.3. In-Rack and Remote I/O Architectures
Quantum Hot Standby Remote I/O: Configuration 1 With Quantum Hot Standby, in-rack I/O modules are located in the remote I/O racks. They are "shared" by both Primary and Standby CPUs, but only the Primary unit actually handles the I/O communications at any given time. In case of a switchover, the control takeover executed by the new Primary unit occurs in a bumpless way (the holdup time parameter of the rack has to be greater than the communication gap during the switchover). The module population of a Quantum CPU rack in a Hot Standby configuration is very similar to a standalone PAC. All I/O modules are accepted on remote I/O racks, except 140 HLI 340 00 (Interrupt module). Among Ethernet adapters currently available, the 140 NWM 100 00 communication module is not compatible with a Hot Standby system. Also EtherNet/IP adapter 140 NOC 771 00 is not compatible with Quantum Hot Standby in Step 1. The only specific requirement is that the CPU module must be a 140 CPU 671 60. For redundant PAC architecture, each unit requires two interlinks to execute different types of diagnostics - to orient the election of the Primary unit and achieve synchronization between the two machines. The first of these "Sync Links, the CPU Sync Link, is a dedicated optic fiber link anchored on the Ethernet port on the CPU module. This port is dedicated exclusively for this use on Quantum Hot Standby architecture. The second of these Sync Links, Remote I/O Sync Link, is not an additional one: the Hot Standby system uses the existing remote I/O medium, hosting both machines, thus providing them with an opportunity to communicate. The use of 62.5/125 multimode optic fiber on the CPU optic fiber port lets you install the two units up to 2 km apart. The Remote I/O Sync Link can also run through optic fiber, as long as the that Remote I/O Communication Processor modules are coupled on the optic fiber. Up to 31 remote I/O stations can be handled from a Quantum CPU rack, whether standalone or Hot Standby. Note that the remote I/O service payload on scan time is approximately 3 to 4 ms per station.
61

Redundant Device Implementation Using redundant in-rack I/O modules on Quantum Hot Standby and interfacing redundant sensors/actuators, will require redundant input/output channels. Preferably, homothetic channels should be installed on different modules and different I/O stations. Even a simple transfer of information to both sides of the outputs requires that the application define and implement rules to select and process the proper input signals. In addition to information transfer, the application must address diagnostic requirements. Single Device Implementation Assuming a Quantum Hot Standby application is required to handle redundant in-rack I/O channels without redundant sensors and actuators, special devices are used to handle the associated wiring requirements. Any given sensor signal, either digital or analog, goes through such a dedicated device, which replicates it and passes it on to homothetic input channels. Reciprocally, any given pair of homothetic output signals, either digital or analog, are provided to a dedicated device that selects and transfers the proper signal (i.e. the one taken on Primary output) to the target actuator.
Depending on the selected I/O Bus technology, a specific layout may result in enhanced availability. Dual Coaxial Cable Coaxial cable can be installed with either a single or redundant design. With a redundant design, communications are duplicated on both channels, providing a massive communication redundancy. Either the remote I/O processor or remote I/O adapters are equipped with a pair of connectors, with each connector attached to a separate coaxial distribution.
62

Self Healing Optical Fiber Ring Remote I/O stations can be installed as terminal nodes of a fiber optic segment or self-healing ring. The Schneider Electric catalog offers a transceiver (490 NRP 954) applicable for 62.5/125 multimode optic fiber: 5 transceivers maximum, 10 km maximum ring circumference. When the number of transceivers is greater than 5, precise optic fiber ring diagnostics or single mode fiber are required. Configuration Change on the Fly A High-level feature is provided to Quantum Hot Standby application designs. This is CCTF, or Configuration Change on the Fly. This new feature lets you modify the configuration of the existing and running PAC application program without having to stop the PAC. As an example, consider the addition of a new discrete or analog module on a remote Quantum I/O station. For the CPU Firmware version upgrade executed on a Quantum Hot Standby architecture, this CCTF will be sequentially executed, one unit at a time. This is critical for applications that cannot afford to stop and now becomes available for architecture modification or extensions. Premium Hot Standby in Rack I/O: Configuration 2 Premium Hot Standby can handle in-rack I/O modules, installed on Bus-X racks. The initial version of Premium Hot Standby handles only one Bus-Xrack on both units.
PAC
The Primary unit acquires its inputs, executes the logic, and updates its outputs. As a result of cyclical Primary to
Standby data transfer, the Standby unit provides local outputs that are the image of the outputs decided on the Primary unit. In case of a switchover, the control takeover executed by the new Primary unit occurs in a bumpless fashion. The module population of a Premium CPU rack, in a Hot Standby configuration, is very similar to that of a standalone PAC Note: Counting, motion, weighing and safety modules are not accepted. On the communication side, except for Modbus modules TSX SCY 11 601/21 601, only currently available Ethernet TCP/IP modules are accepted. Also, the EtherNet/IP adapter (TSX ETC 100) is not compatible with Premium Hot Standby in Step 1. Two types of CPU modules are available: TSX H57 24M and TSX H57 44M, which differ mainly in memory and communication resources.
63

The first of the two Sync Links, the CPU Sync Link, is a dedicated copper link anchored on the Ethernet port and local to the CPU module. With Premium Hot Standby architecture, the second Sync Link, Ethernet Sync Link, is established using a standard Ethernet TCP/IP module (TSX ETY 4103 or TSX ETY 5103 communication module). It corresponds to the communication adapter elected as the "monitored" adapter. This link is selected during the Unity Pro configuration. The following picture illustrates the Premium Ethernet configuration, with the CPU and the ETY SyncLinks:
Primary Standby
PAC
CPU Sync Link
ETY Sync Link
This Ethernet configuration is detailed in the following section. Redundant Device Implementation In-rack I/O module implementation on Premium Hot Standby corresponds by default to a massive redundancy layout: each input and each output has a
PAC
physical connection on both units. Redundant sensors and actuators do not require additional hardware. For a simple transfer of information to both sides of the outputs, the application must define and
implement rules for selecting and treating the proper input signals. In addition to information transfer, the application must address diagnostic requirements.
64

Single Device Implementation Assuming a Premium Hot Standby application is required to handle redundant in-rack I/O channels, but without redundant sensors and actuators, special
PAC
devices are used to handle the associated wiring requirements. Any given sensor signal, either digital or analog, passes through such a dedicated device, which replicates it and passes it on to homothetic input channels. Reciprocally, any given pair of homothetic output signals, either digital or analog, are provided to a dedicated device that selects and transfers the proper signal (i.e. the one taken on
Primary output) to the target actuator.
3.3.4. Ethernet Distributed I/O Architectures

Ethernet TCP/IP: Configurations 3&4 Schneider Electric has supported Ethernet strategy for several years. In addition to SCADA and HMI, variables speed drives, power meters and a wide range of gateways, distributed I/Os with Ethernet connectivity, such as Advantys STB, are also being proposed,. In addition, many manufacturers are offering devices capable of communicating on Ethernet using Modbus TCP8 protocol. These different contributions using the Modbus protocol design legacy have helped make Ethernet a general purpose preferred communication support for automation architectures. Note: In step 1, support of Ethernet/IP is not available with Hot Standby. In addition to Ethernet messaging services solicited through application program function blocks, a communication service is available on Schneider Electric PACs: the I/O Scanner. The I/O Scanner makes a PAC Ethernet adapter/Copro act as a Modbus/TCP client, periodically launching a sequence of requests on the network. These requests correspond to standard Modbus function codes, asking for Registers (Data Words), Read, Write or Read/Write operations. This sequence is determined by a list of individual contracts, specified in a table defined during the PAC Unity Pro configuration. The following screenshot shows the result of this I/O scanner.
65
The I/O Scanner service may also be used to implement data exchanges with any type of equipment, including another PAC, provided that equipment can behave as a Modbus/TCP server, and respond to multiple word access requests. Ethernet I/O scanner service is compatible with a Hot Standby implementation, whether Premium or Quantum. The I/O Scanner is active only on the Primary unit. In case of a controlled switchover, Ethernet TCP/IP connections handled by the former Primary unit are properly closed, and new ones are reopened once the new Primary gains control. In case of a sudden switchover, resulting, for example, from a power cut, the former Primary may not be able to close the connections it had opened. These connections will be closed after expiration of a Keep Alive timeout. In case of a switchover, proper communications will typically recover after one initial cycle of I/O scanning.
66

Note: The automatic IP address swap capability is a property inherited by every Ethernet TCP/IP adapter installed in the CPU rack. Self Healing Ring As demonstrated in the previous chapter, Ethernet TCP/IP used with products like Connexium offers opportunities to design enhanced availability architectures, handling communication between the Information Management level and the Control level. Such architectures, based on a Self Healing Ring topology, are also applicable when using Ethernet TCP/IP as a fieldbus. Note that Connexium accepts copper or optic fiber rings. In addition, dual networking is also applicable at the fieldbus level.
3.3.5. Profibus Distributed I/O Architecture

I/O Devices Distributed on PROFIBUS DP/PA: Configuration 5 A PROFIBUS DP V1 Master communication module for Quantum is available. It handles cyclic and acyclic data
Ethernet
exchanges, and accepts FDT/DTM Asset Management System data flow through its local Ethernet port.
Quantum + PTQ
Profibus DP Profibus PA
Advantys STB
The Profibus network is set up with Configuration Builder software, which supplies the Unity Pro application program with data structures corresponding to cyclic data exchanges and diagnostic information. The Configuration Builder can also be
configured to pass Unity Pro a set of DFBs, allowing easy implementation of acyclic operations. Each Quantum PAC can accept up to 6 of these DP Master modules (each of them handling its own PROFIBUS network). Also, the PTQ PDPM V1 Master Module is compatible with a Quantum Hot Standby implementation. Only the Master Module in the Primary unit is active on the PROFIBUS network; the Master Module on the Standby unit stays in a dormant state unless awakened by a switchover.
67

PROFIBUS DP V1 Remote Master and Hot Standby PAC
PAC Primary
PAC Standby
Plant Asset Management
With a smart device such as PROFIBUS Remote Master, an I/O Scanner stream is handled by the PAC application (M340, Premium or Quantum) and forwarded to Remote Master via Ethernet TCP/IP. In turn, Remote Master handles the corresponding cyclic exchanges with the devices populating the PROFIBUS network. Remote Master can also handle acyclic data exchanges.
Ethernet Profibus DP Profibus PA
The PROFIBUS network is configured with Unity Pro, which also acts as an FDT container, able to host manufacturer device DTMs. In addition, Remote Master offers a comDTM to work with third party FDT/DTM Asset Management Systems. Automatic symbol generation provides Unity Pro with data structures corresponding to data exchanges and diagnostic information. A set of DFBs is delivered that allows an easy implementation of acyclic operations. Remote Master is compatible with a Quantum or Premium Hot Standby implementation.
3.3.6. Redundancy and Safety

Availability and Safety requirements are often considered to be at odds with each other. Safety can follow the slogan: "stop if any potential danger arises, whereas Availability follows the slogan: "Produce in spite of everything. Two CPU models are available to design a Quantum Safety configuration: the first model (140 CPU 651 60S) is dedicated to standalone architectures, whereas the second model (140 CPU 671 60S) is dedicated to redundant architectures. The Quantum Safety PAC has some exclusive features: a specific hardware design for safety modules (CPU and I/O modules) and a dedicated instruction set.
68

Otherwise, a Safety Quantum Hot Standby configuration has much in common with a regular Quantum Hot Standby configuration. The configuration windows, for example, are almost the same, and the Ethernet communication adapters inherit the IP address automatic swap capability. Thus, the safety Quantum Hot Standby helps to reconcile and integrate the concepts of safety and availability.
3.3.7 Mixed I/O Configuration: Configuration 6

Whether Premium or Quantum, application requirements such as topology, environment, periphery, time criticality and so on may influence the final architecture design to adopt both types of design strategies concurrently, i.e. in-rack and distributed I/Os, depending on individual subsystem constraints.
3.3.8. Premium / Quantum Hot Standby Switchover Conditions

System Health Diagnostics Systematic checks are executed cyclically by any running CPU in order to detect a potential hardware corruption, such as a change affecting the integrity of the Copro, the sub-part of the CPU module that hosts the integrated Ethernet port. Another example of a systematic check is the continuous check of the voltage levels provided by the power supply module(s). In case of a negative result during these hardware health diagnostics, the tested CPU will usually switch to a Stop state.
69

When the unit in question is part of a Hot Standby System, in addition to these standard hardware tests separately executed on both machines, more specific tests are conducted between the units. These additional tests involve both Sync Links. The basic objective is to confirm that the Primary unit is effectively operational, executing the application program and controlling the I/O exchanges. In addition, the system must verify that the current Standby unit is able to assume control after a switchover. If an abnormal situation occurs on the current Primary unit, it gives up control and switches either to Off-Line state (the CPU is not a part of the Hot Standby system coupling) or to Stop state, depending on the event. The former Standby unit takes control as the new Primary unit. Controlled Switchover As previously indicated, the Hot Standby system is controlled through the %SW61 system register. Each unit owns an individual bit on the system Command Register that decides whether or not that particular unit will make it possible to "hook" to the other unit. An operational hooked redundant Hot Standby system requires both units to indicate this intent. Consequently, executing a switchover controlled by the application on a hooked system is straightforward; it requires briefly toggling the decision bit that controls the current Primary units "hooking" intent. The first toggle transition switches the current Primary unit to Off-Line state, and makes the former Standby unit take control. The next toggle transition makes the former Primary unit return and hook as the new Standby unit. An example of this function is a controlled switchover resulting from diagnostics conducted at the application level. A Quantum Hot Standby, Premium Hot Standby or Monitored Ethernet Adapter system does not handle a Modbus Plus, Ethernet or Profibus communication adapter malfunction as a condition that would implicitly force a switchover. As a result, these communication modules must be cyclically tested by the application, both on Standby and on Primary. Diagnostic results elaborated on Standby are usually transferred to the Primary unit because of the Reverse Transfer Registers. Finally, if the application reports a non-fugitive inconsistency affecting the Primary unit, while the Standby unit is fully operational, the Control Register will force a switchover. Hence, the application program can decide on a Hot Standby switchover, having registered a steady state negative diagnostic on the Ethernet adapter linking the Primary unit to the "Process Network," while also being informed that the Standby unit is fully operational. Note: There are two ways to implement a controlled switchover: automatically, through configuration of a default DFB, or customized, with the creation of a DFB with its own switchover conditions. 70

Switchover Latencies The following table details the typical and maximum swap time delays encountered when reestablishing Ethernet services during a switchover event. (Premium and Quantum configurations) Service Swap IP Address I/O Scanning Client Messaging Server Messaging 6 ms 1 initial cycle of I/O scanning 1 MAST task cycle 1 MAST task cycle + the time required by the client to reestablish its connection with the server HTTP Server
(1)
Typical Swap Time 500 ms
Maximum Swap Time
500 ms + 1 initial cycle of I/O scanning 500 ms + 1 MAST task cycle 500 ms + the time required by the client to reestablish its connection with the server (1) 500 ms + the time required by the client to reestablish its connection with the server (1)
The time required by the client to reestablish its connection with the server
(1)
(1)
The time the client requires to reconnect with the server depends on the client communication loss
timeout settings.
3.3.9. Selection
To conclude the Control level section, the table below presents the four main criteria to help you select the most appropriate configuration for your high availability solution:
Criteria Switchover Performance
Cost Effective Premium In Rack Architecture
High Criticality Quantum In Rack Architecture Quantum Distributed Architecture
Openness
Premium Distributed Architecture
71

3.3.10. Premium/ Quantum Hot Standby Solution Reminder
The following tables provide a brief reminder of essential characteristics for Premium and Quantum Hot Standby solutions, respectively:
Premium Hot Standby Essential Characteristics
72
Quantum Hot Standby Essential Characteristics
73
4-Conclusion 3.4. Redundant Field Bus

Generally, the network topologies described in chapter 3.X, ring, dual ring and so on, can all be used on the field bus to connect the PACs and field devices. Nevertheless, the daisy chain topology is widely used on a field bus, and thus we describe the redundancy aspects of this architecture.
3.4.1 Daisy Chain Topology

An Ethernet "daisy chain" refers to the integration of a switch function inside a communicating device. The resulting daisy-chainable device offers two Ethernet ports, for example one "in" port and one "out" port. The advantage of such a daisychainable device is that its installation inside an Ethernet network requires only two cables.
PAC
The chain can receive up to 32 devices. This topology is tipically used when the interaction of all devices is linked, and the loss of 1 device will require the process to pause.
3.4.2 Daisy Chain Loop Topology

In addition, to provide redundancy, a daisy-chain can be set-up as a network loop.
PAC
Without redundancy management protocol In this case, the loop is handled by a Manageable Connexium switch, featuring Ring Manager capability, used as a HIPER-ring redundancy manager.
74
4-Conclusion
The current recovery times can take up to 30s. With RSTP embedded in all daisy chain devices The use of RSTP protocol allows improved capabilities:
Max. 15ms detection Max. 15ms propagation No disturbances during the recovery period
The figure below illustrates the example of the Modicon M340 Ethernet module with an embedded 4-port switch. Two ports support RSTP to ensure redundant architecture.
The recovery time for a 10-device chain is approximately 165 ms. The first daisy-chainable devices Schneider Electric plans to offer are: Advantys STB dual port Ethernet communication adapter (STB NIP 2311) Advantys ETB IP67 dual port Ethernet Motor controller TeSys T Variable speed drive ATV 61/71 (VW3A3310D) PROFIBUS DP V1 Remote Master ETG 30xx Factorycast gateway
75
4-Conclusion
3.4.3. Conclusion
This chapter has covered functional and architectural redundancy aspects, from the Information Management level to the Control level, and up through the Communication Infrastructure level.
76
4-Conclusion
4. Conclusion
This section summarizes the main characteristics and properties of Availability for Collaborative Control automation architectures. Chapter 1 demonstrated that Availability is dependent not only on Reliability, but also on Maintenance as it is provided to a given system. The first level of contribution, Reliability, is primarily a function of the system design and components. Component and device manufacturers thus have a direct but not exclusive influence on system Availability. The second level of contribution, Maintenance and Logistics, is totally dependent on end customer behavior. Chapter 2 presented some simple Reliability and Availability calculation examples, and demonstrated that beyond basic use cases, dedicated skills and tools are required to extract figures from real cases. Chapter 3 explored a central focus of this document, Redundancy, and its application at the Information Management level, Control System level and Communication Infrastructure level. This final chapter summarizes the customer benefits of Schneider Electric High Availability solutions, as well as additional information and references.
77
4-Conclusion 4.1. Benefits

Schneider Electric currently offers a wide range of solutions, providing the best designs to meet specific customer needs for Redundancy and Availability in automation and control systems.
4.1.1 Standard Offer

One key concept of High Availability is that redundancy is not a default design characteristic at any system level. Also, redundancy can be added locally, using standard components in most cases.
4.1.2. Simplicity of Implementation

At any level, intrusion of redundancy into system design and implementation is minimal, compared to a non-redundant system. For SCADA implementation, network active components selection or PAC programming, most of the software contributions to redundancy are dependent on selections executed during the configuration phase. Also, redundancy can be applied selectively.
4.1.3 System Transparency

The transparency of a redundant system, compared to a standalone one, is a customer requirement. With the Schneider Electric automation offer, this transparency is present at each level of the system. Information Management Level For client display stations, dual path supervisory networks, redundant I/O servers, or dual access to process networks, each redundant contribution is handled separately by the system. For example, concurrent display client communication flow will be transparently re-routed to the I/O server by the supervisory network in case of a cable disruption. This flow will also be transparently routed to the alternative I/O server in the event of a sudden malfunction of the first server. Finally, the I/O server may transparently leave the communication channel it is using by default, if that channel ceases to operate properly, or if the target PAC does not respond through this channel.
78
4-Conclusion
Communication Infrastructure Level Whether utilized as a process network or as a Fieldbus network, currently available active network components can easily participate in an automatically reconfigured network. With continuous enhancements, HIPER-Ring strategy not only offers simplicity, but also a level of performance compatible with a high-reactivity demand. Control System Level The IP address automatic switch" for a SCADA application communicating through Ethernet is an important feature of Schneider Electric PACs. In addition to simplifying the design of the SCADA application implementation, which may reduce delays and costs, this feature also contributes to reducing the payload of a communication context exchange on a PAC switchover.
4.1.4 Ease of Use

As previously stated, increased effort has been made to make the implementation of a redundant feature simple and straightforward. Vijeo Citect, ConneXium Web pages and Unity Pro software environments offer clear and accessible configuration windows, along with a dedicated selective help, in order to execute the required parameterization.
4.2. More Detailed RAMS Investigation

In case of a specific need for detailed dependability (RAMS) studies for any type of architecture, contact the Schneider Electric Safety Competency Center. This center has skilled and experienced individuals ready to help you with all of your needs.
79
4-Conclusion
80
5-Appendix
5. Appendix
5.1. Glossary
Note: The references in brackets refer to standards, which are specified at the end of this glossary. 1) Active Redundancy Redundancy where the different means required to accomplish a given function are present simultaneously [5] 2) Availability Ability of an item to be in a state to perform a required function under given conditions, at a given instant in time or over a given time interval, assuming that the required external resources are provided [IEV 191-02-05] (performance) [2] 3) Common Mode Failure Failure that affects all redundant elements for a given function at the same time. [2] 4) Complete failure Failure which results in the complete inability of an item to perform all required functions. [IEV 191-04-20] [2] 5) Dependability Collective term used to describe availability performance and its influencing factors: reliability performance, maintainability performance and maintenance support performance [IEV 191-02-03] [2] Note: Dependability is used only for general descriptions in non-quantitative terms. 6) Dormant A state in which an item is able to function but is not required to function. Not to be confused with downtime. [4] 7) Downtime Time during which an item is in an operational inventory but is not in condition to perform its required function. [4] 8) Failure Termination of the ability of an item to perform a required function [IEV 191-04-01] [2] Note 1: After failure, the item detects a fault. Note 2: "Failure" is an event, as distinguished from "fault," which is a state. 81
5-Appendix
9) Failure Analysis The act of determining the physical failure mechanism resulting in the functional failure of a component or piece of equipment [1] 10) Failure Mode and Effects Analysis (FMEA) Procedure for analyzing each potential failure mode in a product to determine the results or effects on the product. When the analysis is extended to classify each potential failure mode according to its severity and probability of occurrence, it is called a Failure Mode, Effects and Criticality Analysis (FMECA). [6] 11) Failure Rate Total number of failures within an item population, divided by the total number of life units expended by that population during a particular measurement period under stated conditions. [4] 12) Fault State of an item characterized by its inability to perform a required function, excluding this inability during preventive maintenance or other planned actions, or due to lack of external resources [IEV 191-05-01] [2] Note: A fault is often the result of a failure of the item itself, but may exist without prior failure. 13) Fault- tolerance Ability to tolerate and accommodate a fault with or without performance degradation 14) Fault Tree Analysis (FTA) Method used to evaluate reliability of engineering systems. FTA is concerned with fault events. A fault tree may be described as a logical representation of the relationship of primary or basic fault events that lead to the occurrence of a specified undesirable fault event known as the top event. A fault tree is depicted using a tree structure with logic gates such as AND and OR [7] See diagram on the following page.
82
5-Appendix
FTA illustration [7]
15) Hidden Failure A failure that is not detectable by or evident to the operating crew. [1] 16) Inherent Availability (Intrinsic Availability) : Ai A measure of availability that includes only the effects of an item design and its application, and does not account for effects of the operational and support environment. Sometimes referred to as "intrinsic" availability. [4] 17) Integrity Reliability of data which is being processed or stored. 18) Maintainability Probability that an item can be retained in, or restored to, a specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources, at each prescribed level of maintenance and repair. [4] 19) Markov Method A Markov process is a mathematical model that is useful in the study of the availability of complex systems. The basic concepts of the Markov process are those of the state of the system (for example operating, non-operating) and state of transition (from operating to non-operating due to failure, or from non-operating to operating due to repair). [4] See illustration on next page.
83
5-Appendix
Markov Graph illustration [2] 20) MDT: Mean Downtime Average time a system is unavailable for use due to a failure. Time includes the actual repair time plus all delay time associated with a repair person arriving with the appropriate replacement parts. [4] 21) MOBF: Mean Operating Time Between Failure Expected operating time between failures [IEV 191-12-09] [2] 22) MTBF: Mean Time Between Failure A basic measure of reliability for repairable items. The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions. [4] 23) MTTF : Mean Time To Failure A basic measure of reliability for non-repairable items. The total number of life units of an item population divided by the number of failures within that population, during a particular measurement interval under stated conditions. [4] Note: Used with repairable items, MTTFF stands for Mean Time To First Failure 24) MTTR : Mean Time To Repair A basic measure of Maintainability. The sum of corrective maintenance times at any specific level of repair, divided by the total number of failures within an item repaired at that level, during a particular interval under stated conditions. [4] 25) MTTR : Mean Time To Recovery Expectation of the time to recovery [IEV 191-13-08] [2]
84
5-Appendix
26) Non-Detectable Failure Failure at the component, equipment, subsystem or system (product) level that is identifiable by analysis but cannot be identified through periodic testing or revealed by an alarm or an indication of an anomaly. [4] 27) Redundancy Existence in an item of two or more means of performing a required function [IEV 191-15-01] [2] Note: In this standard, the existence of more than one path (consisting of links and switches) between end nodes. Existence of more than one means for accomplishing a given function. Each means of accomplishing the function need not necessarily be identical. The two basic types of redundancy are Active and Standby. [4] 28) Reliability Ability of an item to perform a required function under given conditions for a given time interval [IEV 191-02-06] [2] Note 1: It is generally assumed that an item is in a state to perform this required function at the beginning of the time interval. Note 2: The term reliability is also used as a measure of reliability performance (see IEV 191-12-01). 29) Repairability Probability that a failed item will be restored to operable condition within a specified time of active repair [4] 30) Serviceability Relative ease with which an item can be serviced (i.e. kept in operating condition). [4] 31) Standby Redundancy Redundancy whereby a part of the means for performing a required function is intended to operate, while the remaining part(s) of the means are inoperative until needed [IEV 19-5-03] [2] Note: This is also known as dynamic redundancy. Redundancy in which some or all of the redundant items are not operating continuously but are activated only upon failure of the primary item performing the function(s). [4]
85
5-Appendix
32) System Downtime Time interval between the commencement of work on a system (product) malfunction and the time when the system has been repaired and/or checked by the maintenance person, and no further maintenance activity is executed. [4] 33) Total System Downtime Time interval between the reporting of a system (product) malfunction and the time when the system has been repaired and/or checked by the maintenance person, and no further maintenance activity is executed. [4] 34) Unavailability State of an item of being unable to perform its required function [IEV 603-05-05] [2] Note: Unavailability is expressed as the fraction of expected operating life in which an item is not available, for example minutes per year Ratio: downtime/(uptime + downtime) [3] Often expressed as a maximum period of time during which the variable is unavailable, for example 4 hours per month 35) Uptime The element of Active Time during which an item is in condition to perform its required functions. (Increases availability and dependability). [4] [1] Maintenance & Reliability terms - Life Cycle Engineering [2] IEC 62439: High Availability Automation Networks [3] IEEE Std C37.1-2007: Standard for SCADA and Automation System [4] MIL-HDBK-338B - Military Handbook - Electronic Reliability Design Handbook [5] IEC-271-194 [6] The Certified Quality Engineer Handbook - Connie M. Borror, Editor [7] Reliability, Quality and Safety for Engineers - B.S. Dhillon - CRC Press
86
5-Appendix 5.2. Standards

This section contains a selected, non-exhaustive list of reference documents and standards related to Reliability and Availability:
General purpose
IEC 60050 (191):1990 - International Electrotechnical Vocabulary (IEV)
FMEA/FMECA
IEC 60812 (1985) - Analysis techniques for system reliability - Procedures for failure mode and effect analysis (FMEA) MIL-STD 1629A (1980) Procedures for performing a failure mode, effects and criticality analysis
Reliability Block Diagrams

IEC 61078 (1991) Analysis techniques for dependability - Reliability block diagram method
Fault Tree Analysis

NUREG-0492 - Fault Tree Handbook - US Nuclear Regulatory Commission
Markov Analysis
IEC 61165 (2006) Application of Markov Techniques
RAMS
IEC 60300-1 (2003) - Dependability management - Part 1: Dependability management systems IEC 62278 (2002) - Railway applications - Specification and demonstration of Reliability, Availability, Maintainability and Safety (RAMS)
Functional Safety
IEC 61508 - Functional safety of electrical/electronic/programmable electronic safety related systems (7 parts) IEC 61511 (2003) Functional safety - Safety instrumented systems for the process industry sector.
87
Schneider Electric Industries SAS

89, bd Franklin Roosvelt 92506 Rueil-Malmaison Cedex FRANCE www.schneider-electric.com Version 2.0 - 03 2010 Due to evolution of standards and equipment, characteristics indicated in texts and images in this document are binding only after confirmation by our departments Print:
88

STN HiAv V2.0

Uploaded by

Copyright:

Available Formats

STN HiAv V2.0

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

STN HiAv V2.0

Uploaded by

Copyright:

Available Formats

How can I

System Technical Note

increase the availability of a system?

Design your architecture

The STN Collection

1. Introduction to High Availability .........................................7

2. High Availability Theoretical Basics.................................11

3. High Availability with PlantStruxure.................................35

1-Introduction to High Availability

1. Introduction to High Availability

1-Introduction to High Availability

1.3. Document Overview

1-Introduction to High Availability

1-Introduction to High Availability

2-High Availability Theoretical Basics

2. High Availability Theoretical Basics

2.1. Fault Tolerant System

2.2. Lifetime and Failure Rate

2-High Availability Theoretical Basics

2.3. RAMS (Reliability, Availability, Maintainability, Safety)

2-High Availability Theoretical Basics

MUT: Mean Up Time

2-High Availability Theoretical Basics

FIT: Failures In Time

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics

, where stands for the

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics 2.4. Reliability Block Diagrams (RBD)

2.4.1. Series-Parallel Systems

2-High Availability Theoretical Basics

POWER SUPPLY 1 MODULE A POWER SUPPLY 2 MODULE B MODULE C MODULE D

2-High Availability Theoretical Basics

n R S (t) = R1(t) R2 (t) R3 (t) ... R n (t) = R i (t) i=1

n = : Equivalent Failure Rate for n serial elements is equal to the sum S i i =1

of the individual Failure Rate of these elements, with R S (t) = e Example 1:

1 = 120 x 10-6 h-1 and 2 = 180 x 10-6 h-1

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics

The following screenshot is the spreadsheet corresponding to this analysis.

The following screenshot is the spreadsheet corresponding to this analysis:

2-High Availability Theoretical Basics

The following screenshot is the spreadsheet corresponding to the entire analysis:

2-High Availability Theoretical Basics

Q i (t) = R i (t) = 1 - R i (t) :

Rs = Reliability of the simple parallel system

1 = 120 x 10-6 h-1 and 2 = 180 x 10-6 h-1

Reliability of elements 1 and 2 over the 1,000 hour period:

Unreliability of elements 1 and 2 over the 1,000 hour-period:

Q1 (1,000 h) = 1 R1 (1,000 h) = 1 0.8869 = 0.1130 Q2 (1,000 h) = 1 R2 (1,000 h) = 1 0.8353 = 0.1647

2-High Availability Theoretical Basics

R12 (t = 1,000 h) = 1 [ Q1 (t = 1,000 h) . Q2 (1,000 h) ] = 1 (0.1130 x 0.1647) = 0.9814

2-High Availability Theoretical Basics

2-High Availability Theoretical Basics