Whether or not your power plant falls under the new Critical Infrastructure Protection standards (and many more will than in the past), you should be adding security controls. Here are some lessons learned to help you manage that process.
On August 14, 2003, large parts of the Northeast and Midwest of the U.S. and the Canadian province of Ontario experienced one of the largest blackouts in history: 61,000 MW of electric load were lost, affecting the lives and businesses of about 50 million people. The U.S. and Canada created a joint commission to study the blackout, determine its causes, and develop recommendations to reduce the potential for future outages. The causes the commission found for the Northeast Blackout were condensed into four groups (summarized from Northeast Blackout report, pp. 18–19):
- Failure of entities to assess and understand the inadequacies of their power system.
- Inadequate situational awareness leading to lack of recognition that the system had deteriorated.
- Inadequate vegetation management along transmission rights-of-way.
- Failure of reliability organizations to provide effective real-time diagnostic support.
The commission made 46 recommendations to reduce the possibility of future outages and the severity of any that might still occur. Most interesting, given the causes identified, is that 15 recommendations directly addressed physical and cybersecurity issues, even though neither physical nor cybersecurity was identified as a major cause of the blackout. The Security Working Group ruled out a cyber attack and unintentional consequences of a then-active virus (SQL Slammer), but it identified how technology had contributed to the blackout through degraded situational awareness.
The implication here is clear: A third of the Northeast Blackout recommendations were related to improved cyber and physical security because the investigation showed how poor cybersecurity was in the EMS/SCADA (energy management system/supervisory control and data acquisition) systems used to manage the grid.
These recommendations led directly to development of the North American Electric Reliability Corp. (NERC) Urgent Action 1200 standards during 2004–2005 and the creation of the NERC Critical Infrastructure Protection (CIP) standards, which we have today.
Generation was never a part of the Urgent Action 1200 standards, but attention has shifted over the past several years due to awareness that generators are a crucial part of a reliable grid.
NERC CIP Version 5
The latest edition of the NERC CIP standards should be NERC CIP Version 5. I say “should” because the Federal Energy Regulatory Commission has directed NERC to make changes to the standards that will likely result in a Version 6 in 2015. These changes cover four major areas, one of which addresses a lack of objective requirements for “Low”-impact assets. The majority of North American generating facilities that have not had specific technical and procedural regulations so far will fall into that category. (For more on determining facility and asset classifications, see “Identifying CIP Version 5 Assets in Generation” in this issue.)
Generation facilities will likely be classified as either Medium or Low impact, based on a set of bright-line criteria identified in CIP-002-5. If your facility is Medium impact, you’ll have a set of prescriptive requirements to follow. A Low-impact facility, by contrast, currently has a requirement for a security policy and not much else until the NERC development efforts are complete.
(Often Painful) Lessons Learned
When applying cybersecurity controls, you are going to run into problems. Many of the major control systems in power generation plants are older and haven’t been updated, both because of cost concerns and because they remain sufficiently reliable. Although network perimeter strategies will generally have little effect on older systems, the CIP-007-5 protections are another matter entirely.
Below I discuss two common challenges I’ve encountered while involved in cybersecurity upgrades at generation plants, so that engineers responsible for their own upgrades can plan appropriately to either avoid them or understand how to resolve them.
Lesson Learned: Old Code Can Be Incompatible with Modern Cybersecurity Tools. The origins of many plant control systems, especially distributed control systems (DCSs), go back to the mid-1980s, and a lot of the code from that era can still be found in our systems and devices. There has simply not been a major reason to change, as the hardware platforms for these control systems are controlled by the same companies that write the code. This code persists, and can form an impediment to some cybersecurity controls because it was simply not designed from a maintainability and security perspective the way more modern code must be.
Old code affects systems in various ways when adding cybersecurity. One way is that older code wasn’t developed to take advantage of multi-core systems. This can lead to performance bottlenecks that inhibit the functionality of cybersecurity controls running alongside the DCS software and that degrade the performance of some control systems to an unacceptable level.
Another is that older code often uses hooks and operating system features that were once commonplace but are no longer acceptable, and systems continue to use these features in an unsupported manner. The most common example is old code running in kernel mode instead of user mode.
Kernel mode is a highly privileged mode of operation that allows full control of a system, but it runs the risk of crashing system processes. The modern convention is to write user-mode programs that make use of carefully constructed kernel interfaces, so that crashes are limited (and for various other reasons). Using kernel mode was once a necessary evil when processing and memory resources were far more scarce, and the DCS needed everything to maintain a high level of functionality. The argument now is that processing and memory are far more plentiful, so a new design decision is needed that prioritizes reliability and maintainability in an environment of resource abundance.
Here’s an anecdotal example. I found several years ago that a major power DCS vendor was shipping modern workstations with quad-core processors, but it had disabled all but one core to ensure that its code ran. This was because the old code was designed to run as a single process; it could not be swapped between the different cores because of concurrency issues.
When the time came for generators to add a basic cybersecurity protection (anti-virus), the system wasn’t capable of running both the anti-virus and the DCS at the required efficiency. This led to hangs, slowdowns, and other problems that were quickly blamed on the anti-virus, as it was the last thing added to the system. Making the situation worse was the tendency of anti-virus to scan each and every file as it was opened. Anti-virus will intercept calls to open files and will lock them until scanning is complete. DCSs, often because of old code, make significant use of file operations and are capable of opening and closing hundreds or thousands of files in the course of a day. An assumption I made while looking at the problem was that the old code was written with an older design philosophy that assumed exclusive access to the file system, and it reacted badly when this assumption was challenged by the anti-virus.
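One way to separate anti-virus overhead from other causes is to measure file-open latency on the same set of files with the scanner disabled and then enabled. The following is a minimal Python sketch of that idea, not a vendor tool. The function name `time_file_opens` and the choice to read a single byte are assumptions made for illustration, and a measurement like this belongs on a test system or in a FAT, not on a live DCS.

```python
import statistics
import time
from pathlib import Path
from typing import Iterable


def time_file_opens(paths: Iterable[Path]) -> dict:
    """Open each file, read one byte, and time the open/read/close cycle.

    Hypothetical helper: on-access scanners typically do their work when a
    file is opened, so the elapsed time includes any scanning delay.
    """
    samples = []
    for path in paths:
        start = time.perf_counter()
        with open(path, "rb") as handle:
            handle.read(1)
        samples.append(time.perf_counter() - start)

    if not samples:
        return {"count": 0, "mean_s": 0.0, "max_s": 0.0}

    return {
        "count": len(samples),
        "mean_s": statistics.mean(samples),
        "max_s": max(samples),
    }
```

Comparing the mean and maximum from the two runs gives a rough, repeatable number to bring to the DCS and anti-virus vendors, instead of a general impression that the system "got slower."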
Many generators disregard this old code argument and maintain that their DCS is functional and that it performs its duties to the level of expectation. I would argue that the DCS may be functional, but it is rapidly losing its functionality as the rest of the world moves past it.
A good analogy is the use of hot sticks in power. Hot sticks used to be made of wood; they were functional and got the job done. However, the excessive maintenance and upkeep needed to keep them safe, along with pressure from U.S. Occupational Safety and Health Administration requirements and insurance companies, made wooden hot sticks obsolete, and they have been replaced with fiberglass. In a similar way, older code is being rendered obsolete due to the pressures of cybersecurity and NERC CIP regulations; consequently, newer code that can be better maintained should be the new specification going forward.
Lesson Learned: Controls Upgrades Hold Surprises. Vendors to the generation industry have stepped up in the past five years, and most have support for NERC CIP activities. Generally, this starts with a CIP-007 R2 port/service specification, and slowly increases in scope as the vendor gets more involved in how its systems should be secured. These vendor-supported activities rely on having control system software that is at a specific patch and revision level and having a generator commit to maintaining that level.
This often results in a required control system upgrade, where workstations, servers, network equipment, and potentially DCS controllers must be replaced or modified to ensure effective operation of those controls. Considering the planning and testing that must go into upgrading control systems, this will not be a quick change-out, and it never has been. What’s different today is that plants that have procrastinated will have far more systems to upgrade.
For many generators, engaging in a controls upgrade for the sole purpose of adding cybersecurity controls will be daunting, so don’t delay. Determine your exposure, and make your plans far ahead of when NERC requires compliance, especially if you already know you will be a Medium-impact facility with Medium bulk electric system cyber systems.
I’ve had the opportunity to talk to the critical infrastructure group at Burns & McDonnell (disclosure: I’m a former employee-owner), who have supervised many of these upgrades over the past several years. They often recommend a full factory acceptance test (FAT) to ensure the functionality of the control system and to exercise the cybersecurity controls in a consequence-free environment. This helps to identify problems in the control system and to ensure corrections are made before a system reaches the site.
Because there are often multiple control systems that must integrate with one another, and may even share a common set of security controls, Burns & McDonnell often recommends an “integrated FAT” to bring all the vendors into the same space to test the integrated components of the control systems. This is good practice, as the integration points between control systems are often the most prone to incompatibility problems.
Specific tasks during the FAT should include:
- Testing functionality of the control system and the security system.
- Validating that ports and services meet specifications.
- Testing anti-virus while in operating conditions.
- Patching systems, and validating patch management procedures on the system.
- Creating, testing, and securing user and machine accounts.
- Validating that security monitoring is capturing relevant logs.
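The port and service check in the list above amounts to comparing what is actually listening on a system against the approved specification. A minimal Python sketch of that comparison follows. The port numbers, the `PortKey` layout, and the function name `compare_to_baseline` are illustrative assumptions, not values from any vendor's specification, and the observed set would come from whatever scan or host inventory tool the plant already uses.

```python
from typing import NamedTuple, Set, Tuple

PortKey = Tuple[str, int]  # (protocol, port number), e.g. ("tcp", 443)


class BaselineResult(NamedTuple):
    unexpected: Set[PortKey]  # listening but not in the approved specification
    missing: Set[PortKey]     # approved but not observed listening


def compare_to_baseline(approved: Set[PortKey], observed: Set[PortKey]) -> BaselineResult:
    """Compare observed listening ports against the approved list."""
    return BaselineResult(
        unexpected=observed - approved,
        missing=approved - observed,
    )


# Hypothetical values for illustration only.
approved_ports = {("tcp", 443), ("tcp", 3389), ("udp", 123)}
observed_ports = {("tcp", 443), ("udp", 123), ("tcp", 21)}

result = compare_to_baseline(approved_ports, observed_ports)
print("Unexpected:", sorted(result.unexpected))
print("Missing:", sorted(result.missing))
```

Anything reported as unexpected would need to be disabled or documented with a justification, while a missing entry may point to a service that failed to start during the test.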
Redesigning One-offs and Ancillary Systems
Every generator has a one-off system or two, or an ancillary system that has been basically ignored for years because it just worked. These systems will be surprises when you add cybersecurity controls, as they will likely have little vendor support in the way of cybersecurity and an unknown upgrade path. Most of these discoveries follow the rule of “the last 20% of the project takes 80% of the time.”
Encountering one of these systems will likely cause a flurry of activity, because it won’t fit into the cybersecurity model you’ve created for the rest of your control system. This is likely to be a system with limited connection to a good, monitored network. Such systems will usually be older, missing patches, and sometimes have no upgrade path.
It will be tempting to leave these systems operating as one-offs and not do the redesign to bring them into your cybersecurity model. Resist this temptation, as manually conducting cybersecurity activities is a human performance issue that increases the risk of both noncompliance and compromise. Take the effort to identify these systems as quickly as possible, make the necessary network adjustments, and work with your security and control system vendors to include these systems in the cybersecurity model.
Share Lessons Learned
The addition of cybersecurity at your facility is going to bring with it a lot of changes, and it’s going to introduce problems as well. At best, you will encounter changes in systems, in procedures, and in how you look at maintaining your control systems in the future. The key to successfully adding these controls is first ensuring that you have a plan and then executing that plan while watching for problems to develop.
This article mentions just a few of the common problems likely to surface when adding security to an existing system. To gain a more comprehensive understanding of your situation, network with your peers and user groups to identify other potential problems as well. A larger community-based effort will be needed over the next few years to increase communication between generators who have cybersecurity and CIP responsibilities.
We are in the business of producing competitive and efficient electric power, but reliability is everyone’s responsibility, and security is a component of reliability.
— Michael Toecker, PE (toecker @context-is.com) is a consulting engineer at Context Industrial Security and has extensive experience in NERC CIP compliance, control system security, and how to implement both in the context of power generation.