How to Avoid Alarm Overload with Centralized Alarm Management

In 1999, the Engineering Equipment and Materials Users’ Association (EEMUA) released its general guide to the design, management, and procurement of alarm systems for industrial plants. The guidance document (EEMUA 191), however, is vague about applications to specific facilities, such as electric power plants. This article specifies EEMUA 191 standards and practices applicable to the electric power industry and spells out specific variations in alarming practices that are tailored for today’s power plants.

Awareness of one area of power plant automation surpasses all others: alarm management. This heightened awareness can be attributed to a few key factors: risk reduction/mitigation, personnel safety, changes in the power plant chemical hazard profile, and a desire to improve/optimize plant automation systems’ alarm management.

In the past, alarm management was not the grave concern that it is today. In the 1960s and 1970s, it was not uncommon for approximately 60 to 100 alarms to be configured per operator. Each was designed to be highly reliable and available; however, because of limitations in component design, these alarm components were also bulky, relatively costly, and highly proprietary. The result was an alarm system that was manageable in its quantity of alarms but not very flexible.

Today’s systems consist of a combination of some proprietary, but mostly standard, software components. These components enable us to create alarm systems that are flexible, small, and relatively inexpensive. However, this combination of flexibility and economy has resulted in one unintentional byproduct — alarm overload.

Instead of the previously manageable 60 to 100 alarms per operator, modern alarm systems are generating more than 1,000 alarms per operator. This overload effect can create several undesirable outcomes, including:

Causing a process disturbance to last longer than necessary.
Causing the original process disturbance to become worse than necessary.
Causing a catastrophic equipment, system, and/or plant failure, as happened in the BP Texas City refinery explosion in 2005.
Creating increased stress for operators that can lead to poor judgment, decreased morale, and higher attrition rates.

To optimize plant operations, a centralized alarm system is recommended for logging plant alarms and presenting them to the operator in a unified and coordinated view. When developing a centralized system, a common basis of design must be applied to all control subsystems that constitute the overall plant automation system. All of the control subsystems in the plant must follow the same design philosophies for the centralized alarm system view to be meaningful and useful.

Eliminate "False" Alarms

According to the Engineering Equipment and Materials Users’ Association (EEMUA), "The purpose of an alarm system is to direct the operator’s attention towards plant conditions requiring timely assessment or action."

This basic message provides plant owners and designers with the first principle of alarm development: Alarms should not exist for something that does not require operator action.

If a signal needs to be collected historically, or if a signal has some diagnostic value but does not satisfy the litmus test of requiring operator action, then it may be treated as a "journal event" or an "alert." In turn, in order for a signal to be an alert, it should be capable of being ignored without negative repercussions to the equipment, system, or plant operation.

Alarm suppression should be evaluated as an addendum to prudent alarm system design rather than as a means of keeping alarms that do not meet the original alarm design principles. The following five levels of suppression are described in EEMUA 191.

Redundant Alarm Suppression. This level may be applied when several input/output (I/O) points are used for a single status. This situation may occur in safety instrumented systems, protection systems, and the like.

If this situation occurs, the alarm state for the redundant I/O should be designed appropriately to reduce the number of alarms and to retain the integrity of the system logic design intentions. For example, if a burner management system has redundant inputs for tripping (two out of three energize to trip), then an alarm associated with that condition should be conditioned so that it is only active when two out of the three inputs are active.
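
The two-out-of-three conditioning described above can be sketched as a simple voting function. This is an illustrative sketch only, not burner management system code; the function name `voted_alarm` and its `required` parameter are assumptions made for the example.

```python
from typing import Sequence

def voted_alarm(inputs: Sequence[bool], required: int = 2) -> bool:
    """Return True only when at least `required` redundant inputs are active."""
    return sum(bool(x) for x in inputs) >= required

# Hypothetical redundant trip inputs (two out of three to trip).
print(voted_alarm([True, False, False]))  # False: a single input does not alarm
print(voted_alarm([True, True, False]))   # True: two of three inputs are active
```

With this conditioning, one alarm is raised for the voted condition instead of one alarm per redundant input.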

Eclipsing Logic. This level may be added to the alarm system to address situations in which several alarm points are generated from one process measurement. An example of this situation would be a vessel level. If the high-high level is reached in a vessel, then it is obvious that the high level has also been passed. Therefore, in this case, the alarm system should be designed so that the high level is masked by the actuation of the high-high level alarm.
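
One way to picture eclipsing is a function that returns only the most severe active level alarm. The sketch below assumes hypothetical high and high-high setpoints of 80% and 90% of vessel level; actual setpoints come from the process design.

```python
def level_alarms(level: float, high: float = 80.0, high_high: float = 90.0) -> list[str]:
    """Return the active level alarms, masking HIGH once HIGH-HIGH is active."""
    if level >= high_high:
        return ["HIGH-HIGH"]  # the HIGH alarm is eclipsed
    if level >= high:
        return ["HIGH"]
    return []

print(level_alarms(85.0))  # ['HIGH']
print(level_alarms(95.0))  # ['HIGH-HIGH']
```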

Out-of-Service Conditions. An out-of-service condition at the equipment, system, or plant level can often create unnecessary alarms. Hence, the alarm system should be designed to include masking conditions that address an out-of-service state. An example of a condition requiring out-of-service suppression would be low-flow monitoring on a pump (or a set of pumps). When the pump (or set of pumps) is not running, the low-flow alarm should be masked.
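
The pump example can be sketched as follows. The function name, the flow units, and the low limit are hypothetical; the only rule taken from the text is that the low-flow alarm is masked when no pump in the set is running.

```python
from typing import Iterable

def low_flow_alarm(flow: float, low_limit: float, pumps_running: Iterable[bool]) -> bool:
    """Low-flow alarm that is masked whenever the whole pump set is stopped."""
    if not any(pumps_running):
        return False  # masked: low flow is expected with every pump stopped
    return flow < low_limit

print(low_flow_alarm(5.0, 20.0, [False, False]))  # False: pumps out of service
print(low_flow_alarm(5.0, 20.0, [True, False]))   # True: a pump runs but flow is low
```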

Operating Mode Alarm Suppression. This level is somewhat similar to the out-of-service suppression but is addressed separately. Operating mode alarm suppression can be applied so that particular alarms associated with a mode of operation are enabled or masked according to the real-time operating conditions of the plant, system, or equipment. Some recommended operating modes that may be used for suppression include start-up, shutdown, steady state operation, plant maintenance, or load change (such as increase, decrease, or runback).

Because some operating states may overlap, great care should be taken to fully analyze how the operating state suppression logic is formulated.
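
A table-driven sketch of operating mode suppression is shown here. The alarm tags, the table contents, and the overlap rule are all assumptions for illustration: when several modes are active at once, this sketch enables an alarm if any active mode enables it, which errs toward showing the alarm. A real design would settle that rule through the analysis described above.

```python
from enum import Enum

class Mode(Enum):
    STARTUP = "start-up"
    SHUTDOWN = "shutdown"
    STEADY_STATE = "steady state operation"
    MAINTENANCE = "plant maintenance"
    LOAD_CHANGE = "load change"

# Hypothetical table: alarm tag -> operating modes in which it is enabled.
ENABLED_IN = {
    "CONDENSER_VACUUM_LOW": {Mode.STARTUP, Mode.STEADY_STATE, Mode.LOAD_CHANGE},
    "MAIN_STEAM_TEMP_LOW": {Mode.STEADY_STATE, Mode.LOAD_CHANGE},
}

def is_enabled(tag: str, active_modes: set[Mode]) -> bool:
    """Return True if the alarm should be annunciated in the active modes."""
    modes = ENABLED_IN.get(tag)
    if modes is None:
        return True  # assumed default: unlisted alarms are never mode-suppressed
    return bool(modes & active_modes)

print(is_enabled("MAIN_STEAM_TEMP_LOW", {Mode.STARTUP}))                    # False
print(is_enabled("MAIN_STEAM_TEMP_LOW", {Mode.STARTUP, Mode.LOAD_CHANGE}))  # True
```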

Major Event Alarm Suppression. This level may be applied to help reduce alarm floods during periods of time in which alarm traffic may be drastically increased. For example, during a plant trip condition, several actions should be automatically taken to ensure that failsafe operation is achieved (such as master fuel trip relay tripped, combustion air removed, and forced draft retained at the National Fire Protection Association’s prescribed levels).

Given all of this activity, it is possible to experience an alarm flood. Therefore, it is useful and possible to design suppression logic that, based upon operational state, would essentially evaluate the inverted alarm conditions. For example, in the case of a coal-fired plant trip, it may be most useful for the operator to know what fuel path equipment is still in operation, which is the direct opposite of what would be of interest during normal operation. This form of suppression would likely take the most effort to implement.
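
The inverted condition can be illustrated with a small sketch. The equipment names and the function name are hypothetical; the idea taken from the text is that, after a trip, the operator is shown the fuel path equipment that is still running rather than the equipment that has stopped.

```python
def equipment_to_report(running: dict[str, bool], plant_tripped: bool) -> list[str]:
    """Select the fuel path equipment to bring to the operator's attention.

    Normal operation: report equipment that has stopped.
    After a plant trip: report equipment that is still running (inverted).
    """
    return sorted(name for name, is_running in running.items()
                  if is_running == plant_tripped)

# Hypothetical fuel path status for a coal-fired unit.
fuel_path = {"coal feeder A": False, "pulverizer A": True, "primary air fan A": False}
print(equipment_to_report(fuel_path, plant_tripped=True))   # ['pulverizer A']
print(equipment_to_report(fuel_path, plant_tripped=False))  # ['coal feeder A', 'primary air fan A']
```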

With false alarms eliminated and unnecessary alarms suppressed, plant operators will be able to focus on priority and critical alarms more effectively and efficiently.

Prioritize and Group Your Alarms

Alarms must be designed so that they are useful and relevant to the operator based upon prioritization, grouping, and mode of operation of the plant. Furthermore, the design should be such that when an alarm is active, it tells the operator the level of priority and the functional system or process generating the alarm. This is achieved by spending more coordination time and effort in designing and developing consistent application of priorities, graphic coloration, and alarm audible alerts.

Alarm prioritization should be designed using a three- to five-level system. Three levels of priority (1-2-3 or high-medium-low) are the most popular, according to The Alarm Management Handbook by Bill Hollifield and Eddie Habibi.

Within the priority system, alarms should be distributed so that only a select few are designated "high" or "critical," thereby truly differentiating these alarms from lower-priority alarms. For example, if using a three-level system, alarms should be distributed so that 80% are low priority, 15% are medium priority, and only 5% are high priority.
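
A configured alarm list can be checked against the 80/15/5 example with a short script. The sketch below is one possible check; the `tolerance` value and the sample configuration are assumptions, not figures from any guidance document.

```python
from collections import Counter

# Target shares for a three-level system, taken from the 80/15/5 example.
TARGET = {"low": 0.80, "medium": 0.15, "high": 0.05}

def priority_deviations(priorities: list[str], tolerance: float = 0.02) -> dict[str, float]:
    """Return (actual share - target share) for each level outside `tolerance`."""
    if not priorities:
        raise ValueError("no configured alarms to evaluate")
    counts = Counter(priorities)
    total = len(priorities)
    result = {}
    for level, target in TARGET.items():
        delta = counts[level] / total - target
        if abs(delta) > tolerance:
            result[level] = round(delta, 3)
    return result

# Hypothetical configuration with too many high-priority alarms.
configured = ["low"] * 60 + ["medium"] * 15 + ["high"] * 25
print(priority_deviations(configured))  # {'low': -0.2, 'high': 0.2}
```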

Each level of priority should have its own unique audible tone and unique graphic color. Note that the color should not be repeated for use in any other function on the control system graphics.

Alarms should also be grouped by process. Any alarm that is active within a specified alarm group would cause that group’s visual alarm to activate, as necessary. Some suggested groupings include fuel handling and preparation, turbine, and generator and electrical auxiliaries.
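
Group annunciation of this kind reduces to a membership test. In the sketch below the group names follow the suggestions above, while the alarm tags are hypothetical placeholders.

```python
# Hypothetical alarm tags assigned to the suggested process groups.
GROUPS = {
    "fuel handling and preparation": {"FEEDER_A_TRIP", "PULVERIZER_A_OUTLET_TEMP_HIGH"},
    "turbine": {"TURBINE_VIBRATION_HIGH", "LUBE_OIL_PRESSURE_LOW"},
    "generator and electrical auxiliaries": {"GENERATOR_STATOR_TEMP_HIGH"},
}

def active_groups(active_alarms: set[str]) -> list[str]:
    """Return the groups whose visual alarm should be activated."""
    return [name for name, tags in GROUPS.items() if tags & active_alarms]

print(active_groups({"LUBE_OIL_PRESSURE_LOW"}))  # ['turbine']
```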

In addition to grouping alarms by process, some alarms may be grouped by criticality of subsequent action. These alarm groups may be treated as first-out groups. First-out groups may include unit protection or a turbine trip summary. Alarm relevance should be built into the alarm design to the greatest extent practical.

Alarm relevance should take into account the operating mode of the equipment, system, and plant in order to provide more intelligent alarm masking. For example, an alarm that is typically generated on a "main steam header temperature low" condition may not be relevant when the source of steam energy (boiler or heat-recovery steam generator) is offline. Instead of merely alarming on low steam temperature, the design should evaluate the relevance of low steam temperature and generate an alarm only when it is of concern, such as when steam turbine stop valves are not closed.
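
The steam temperature example can be written as a relevance check. The function name, the temperature values, and the use of one status flag per stop valve are assumptions made for this sketch.

```python
from typing import Iterable

def main_steam_temp_low_alarm(temp: float, low_limit: float,
                              stop_valves_closed: Iterable[bool]) -> bool:
    """Alarm on low main steam temperature only while it is of concern."""
    if all(stop_valves_closed):
        return False  # not relevant: every turbine stop valve is closed
    return temp < low_limit

# Hypothetical values; units are whatever the plant uses for steam temperature.
print(main_steam_temp_low_alarm(850.0, 900.0, [True, True]))   # False
print(main_steam_temp_low_alarm(850.0, 900.0, [True, False]))  # True
```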

Define Alarm Response

The EEMUA says that every alarm should have a defined response and that responses may be designed to be reactive, proactive, or cognitive:

Reactive alarms require direct action to something that has happened.
Proactive alarms, in contrast, ask operators to take action before something happens.
Cognitive alarms don’t necessarily require direct action, but they do require operators to change their reference or frame of mind.

An example of a cognitive alarm would be the starting of a standby pump based on either a primary pump trip or on a process condition requiring the standby start. Even though no direct action is required by the operator, the operator may need to change his or her frame of reference in order to fully understand the present plant configuration and plan future actions accordingly. Such actions might include directing plant maintenance to correct the tripped equipment or having the plant technician evaluate a process transmitter for faulty calibration that might lead to a false start of the standby equipment.

Note that cognitive alarms are very easy to abuse, both in the quantity configured and in their true value to plant operations. Therefore, most cognitive alarms may be best treated as a journal event or an alert.

The EEMUA states that "Adequate time should be allowed for the operator to carry out a defined response." At first glance, this alarm design principle may appear to be pure common sense. However, based upon findings in the industry, this may not be the case. Each alarm should be evaluated with respect to the logic that drives the alarm as well as the process reacting behind the alarm. The alarm threshold developed must provide the operator enough time to adequately react to the alarm (process) scenario based upon the designed alarm response. For instance, it is common for tanks, sumps, and other vessels to have level alarms based purely on a level value.

For most applications, this may be sufficient due to the volume of the vessel as well as the service of the process. However, for some processes, further evaluation of the rate of increase/decrease may also need to be applied so that in times of a higher rate of change, the normal alarm threshold limits are adjusted to provide the operator enough time to react accordingly.
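
One way to adjust a threshold for rate of change is to work backward from the limit that must not be reached. The sketch below assumes a rising level, a hypothetical `trip_limit`, and a required operator `response_time` in minutes; it lowers the high-level alarm threshold only when the normal threshold would leave too little time to react.

```python
def adjusted_high_threshold(normal_threshold: float, trip_limit: float,
                            rate_per_min: float, response_time_min: float) -> float:
    """Return the high-level alarm threshold for the current rate of rise.

    At a given rate, the level climbs rate_per_min * response_time_min during
    the operator's response time, so the alarm must activate at least that
    far below the trip limit.
    """
    if rate_per_min <= 0:
        return normal_threshold  # level steady or falling: no adjustment
    needed = trip_limit - rate_per_min * response_time_min
    return min(normal_threshold, needed)

# Hypothetical vessel: normal alarm at 80%, trip at 95%, 10 minutes to respond.
print(adjusted_high_threshold(80.0, 95.0, 1.0, 10.0))  # 80.0
print(adjusted_high_threshold(80.0, 95.0, 3.0, 10.0))  # 65.0
```

In the first case the normal threshold already leaves 15 minutes before the trip limit is reached; in the second, the faster rise moves the threshold down to 65% to preserve the 10-minute response time.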

Quality and Quantity

The quantity of alarms that should be presented to the operator is somewhat dependent upon the level of performance that is being sought for the alarm system. EEMUA 191 describes five levels of alarm system performance. They range from Level 1 to Level 5 and are described as Overloaded, Reactive, Stable, Robust, and Predictive, respectively. The recommended goal, based on industry publications, is to set alarm system performance at Robust rather than Predictive, because the additional effort, software, and cost required may not be worth the return.

With a goal of robust performance, the key performance indicators (KPIs) should line out as follows:

Average alarms per day = 1,440
Average standing alarms per shift = 9
Peak alarms per 10-minute period = 100
Average alarms per 10-minute period = 10
Percentage of time alarm rates outside of target = 5%
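
KPIs such as these are typically computed from the alarm journal. The following sketch shows one possible calculation from a list of alarm timestamps; it assumes fixed, non-overlapping 10-minute windows and a hypothetical `target_per_10_min` value, and it does not cover standing alarms.

```python
from collections import Counter
from datetime import datetime, timedelta

def alarm_kpis(timestamps: list[datetime], start: datetime,
               n_windows: int, target_per_10_min: int) -> dict[str, float]:
    """Summarize alarm rates over n_windows consecutive 10-minute windows."""
    if n_windows <= 0:
        raise ValueError("n_windows must be positive")
    window = timedelta(minutes=10)
    # Index of the 10-minute window each alarm falls into.
    counts = Counter((t - start) // window for t in timestamps)
    per_window = [counts.get(i, 0) for i in range(n_windows)]
    over = sum(1 for c in per_window if c > target_per_10_min)
    return {
        "average_per_10_min": sum(per_window) / n_windows,
        "peak_per_10_min": max(per_window),
        "pct_time_outside_target": 100.0 * over / n_windows,
    }

# Hypothetical journal: 2 alarms in the first window, 12 in the second.
start = datetime(2024, 1, 1)
journal = [start + timedelta(minutes=3)] * 2 + [start + timedelta(minutes=14)] * 12
print(alarm_kpis(journal, start, n_windows=2, target_per_10_min=10))
# {'average_per_10_min': 7.0, 'peak_per_10_min': 12, 'pct_time_outside_target': 50.0}
```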

As a point of comparison, if an alarm system’s performance goal is set at predictive, then the KPIs would line out as:

Average alarms per day = 144
Average standing alarms per shift = 9
Peak alarms per 10-minute period = 10
Average alarms per 10-minute period = 1
Percentage of time alarm rates outside of target = 1%

You might observe that 1,440 alarms per day is more than expected for a high-performing alarm system. However, you may also realize that fully investing in a predictive alarm system is not economically feasible. This is one point of tension in the EEMUA 191 document, as well as in other available documents: the desire to keep the number of alarms minimal set against the capital expenditure required to achieve it.

This is where some balance may be needed. Industry documentation, as well as the newly released standard for alarm management (ISA 18.2), recommends 300 alarms per day as the maximum number of alarms per operator and 150 alarms per day as a manageable number per operator. Therefore, if a control room staffing plan includes two operators, each handling a manageable 150 alarms per day, then the total number of alarms per day should be approximately 300 or fewer. This level of performance could be stated as Robust+ or could also be addressed as transitory (Robust-Predictive). For this example of two control room operators, the KPIs would line out as:

Average alarms per day = 300
Average standing alarms per shift = 9
Peak alarms per 10-minute period = 4
Average alarms per 10-minute period = 2
Percentage of time alarm rates outside of target = 2% to 3%
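
The daily and 10-minute figures in the KPI lists are linked by simple arithmetic: a day contains 144 ten-minute periods. The short sketch below reproduces the conversions; the function names are illustrative only.

```python
PERIODS_PER_DAY = 24 * 60 // 10  # 144 ten-minute periods in a day

def alarms_per_day(avg_per_10_min: float) -> float:
    """Convert an average 10-minute alarm rate to a daily count."""
    return avg_per_10_min * PERIODS_PER_DAY

def avg_per_10_min(per_day: float) -> float:
    """Convert a daily alarm count to an average 10-minute rate."""
    return per_day / PERIODS_PER_DAY

print(alarms_per_day(10))                 # 1440 (Robust)
print(alarms_per_day(1))                  # 144 (Predictive)
print(round(avg_per_10_min(2 * 150), 2))  # 2.08 (two operators at 150 alarms per day each)
```

The last figure rounds to the average of 2 alarms per 10-minute period listed for the two-operator example.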

Steps to Improve Alarm System Performance

Based on the information available in the process controls industry, there are some clear and distinct measures that can be taken to improve alarm system performance within the power industry, including those that follow.

Design Each Alarm. Every alarm should have a defined purpose, required response, defined priority, and known consequences. If no operator action is required, it is not an alarm.

Reduce Clutter. The alarm system’s defined audible and visual cues need to be distinct to that alarm or that alarm’s priority alone.

Apply Alarm Suppression (Masking). Some areas of opportunity may not be known during preliminary design; however, this should not provide license to delay the evaluation of alarm suppression as a whole.

Coordination. Every control system within a facility needs to be coordinated in alarm system design approach. Priorities of alarms and defined operator action should be consistent in context and meaning between all plant control systems.

—Brandon Parker (parkerbs@bv.com) is plant automation systems section head at Black & Veatch in Overland Park, Kan.