ITIL Risk response measures and recovery options from catastrophic events

IT Service Continuity is one of those ITIL chapters that emphasize the strong relationship between business needs (and requirements) and the IT service provider, because once a worst-case scenario happens, everything after that will look like either a well-performed ballet or someone’s worst nightmare.

One of the key areas requiring close alignment between business and IT is the handling of those events that will “never happen,” but if they do, there is no more business or IT. Within ITIL, IT Service Continuity Management is part of the Service Design stage of the service lifecycle. It is responsible for aligning the IT service continuity strategy with the business continuity strategy, ensuring that the required IT technical and service facilities (including computer systems, networks, applications, data repositories, telecommunications, environment, technical support, and Service Desk) can be resumed within the required and agreed business timescales.

Risk response measures

We are all well aware of the typical risks that exist within our line of business, and some of them could probably be found in your IT environment as well. Such risks generally fall under Incident Management and Availability Management, because they are simply part of everyday life. However, if none of the following risk response measures is in place, these everyday risks can have the same effect as a larger-scale catastrophic event.

Examples of typical risk response measures include:

  • Installation of UPS (Uninterruptible Power Supply)
  • Installation of fault-tolerant systems for critical business applications
  • Configuration of disk drives in RAID arrays with mirroring
  • Stocking critical spare parts for quick replacement
  • Designing systems with no SPoF (Single Point of Failure) – e.g., redundant internet links
  • Implementation of resilient IT systems and networks
  • Outsourcing services to more than one provider
  • Implementation of greater physical and IT-based security controls
  • Integration of Cloud and Cloud-based services
  • Implementation of fault detection and monitoring services
  • Implementation of automated fire detection & suppression systems
  • Implementation of comprehensive backup & recovery strategy
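As a rough illustration of the "no SPoF" item above, the following Python sketch flags every role in an infrastructure inventory that is served by only one component. The inventory format, the component names, and the function name are hypothetical and chosen only for this example; a real environment would take this information from its configuration management data.

```python
from collections import Counter


def single_points_of_failure(inventory):
    """Return the roles served by exactly one component, sorted by name.

    `inventory` is a list of (component_name, role) pairs. This format is
    an assumption made for the sketch, not something ITIL prescribes.
    """
    counts = Counter(role for _name, role in inventory)
    return sorted(role for role, count in counts.items() if count == 1)


# Hypothetical inventory, for illustration only.
inventory = [
    ("isp-a", "internet link"),
    ("isp-b", "internet link"),
    ("ups-1", "power backup"),
    ("db-1", "database"),
    ("db-2", "database"),
]

print(single_points_of_failure(inventory))  # prints ['power backup']
```

Here the internet link and the database each have two components, so only the single UPS is reported as a candidate for an additional risk response measure.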

On top of the typical response measures, one of the most popular additional measures is off-site storage: keeping all the relevant data needed for recovery at a separate location in case something happens to the primary location.
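A minimal sketch of the off-site idea, assuming the separate location is reachable as a directory path (for example, a mounted remote share): the file is copied and then verified by comparing SHA-256 checksums. The function names and the path-based setup are assumptions made for illustration; real off-site storage often relies on dedicated backup tooling or physical media instead.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_off_site(source: Path, off_site_dir: Path) -> Path:
    """Copy `source` into `off_site_dir` and verify that the copy matches.

    `off_site_dir` stands in for the separate location (an assumption of
    this sketch). Raises OSError if the checksums differ after copying.
    """
    off_site_dir.mkdir(parents=True, exist_ok=True)
    target = off_site_dir / source.name
    shutil.copy2(source, target)  # copies file content and metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Checksum mismatch after copying {source} to {target}")
    return target
```

Verifying the copy matters because an off-site copy that cannot be read back offers no protection when the primary location is lost.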

Recovery options

When designing the IT Service Continuity Strategy, note that there may be several feasible recovery options, depending on the root cause of service unavailability. Since these recovery options vary in complexity and duration required for service restoration, they must be planned and put into operation well ahead of any event that may cause the service to halt.

Manual workaround – In some cases, manual workarounds may provide cheap, fast, and “good enough” solutions for an extremely complex situation, but only for a limited time.

Reciprocal arrangements – These represent a method of ensuring contingency for business services, where similar organizations agree to share resources in case of a catastrophic event. Due to the specific nature of IT, such arrangements are rarely seen in practice.

Gradual recovery – Also known as “cold standby,” this includes provision of empty facilities, infrastructure (such as cabling and telecommunication), and power, but with no actual computing equipment. This recovery method is feasible if service recovery is expected in days (even weeks), as computing hardware has to be purchased, installed and set up.

Intermediate recovery – Also called “warm standby,” this includes everything listed under “gradual recovery,” with the addition of the necessary computing equipment. This equipment still has to be set up and configured, but recovery is much faster than with gradual recovery.

Fast recovery – Sometimes referred to as “hot standby,” in reality this recovery option may consist of everything listed in “intermediate recovery,” but data and equipment are not mirror images of the production site. It takes a shorter time to prepare, but some services or data may be missing (e.g., data is copied from the main site to off-site every night).

Immediate recovery – This can truly be called “hot standby,” as the secondary location is a mirror image of the primary location in terms of equipment, services, and data. Such locations are often used for load balancing (split site) and provide no loss of service in case of a catastrophic event. Because everything is doubled, immediate recovery is the most expensive recovery option.
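The four site-based options above can be sketched as a simple selection rule: pick the cheapest option that still meets the agreed restoration time. The thresholds of 24 and 72 hours are assumptions chosen for illustration (this article only states that gradual recovery takes days or even weeks), and the function and parameter names are hypothetical. Manual workarounds and reciprocal arrangements are left out because they do not depend on a standby site.

```python
from enum import Enum


class RecoveryOption(Enum):
    IMMEDIATE = "immediate recovery (mirrored hot standby)"
    FAST = "fast recovery (hot standby)"
    INTERMEDIATE = "intermediate recovery (warm standby)"
    GRADUAL = "gradual recovery (cold standby)"


# Assumed thresholds in hours, for illustration only. The real values
# must come from the timescales agreed with the business.
FAST_MAX_HOURS = 24
INTERMEDIATE_MAX_HOURS = 72


def choose_recovery_option(agreed_restore_hours: float) -> RecoveryOption:
    """Return the least demanding option that meets the agreed timescale."""
    if agreed_restore_hours < 0:
        raise ValueError("agreed_restore_hours must not be negative")
    if agreed_restore_hours == 0:
        # No loss of service is tolerated: only a mirrored site qualifies.
        return RecoveryOption.IMMEDIATE
    if agreed_restore_hours <= FAST_MAX_HOURS:
        return RecoveryOption.FAST
    if agreed_restore_hours <= INTERMEDIATE_MAX_HOURS:
        return RecoveryOption.INTERMEDIATE
    return RecoveryOption.GRADUAL


print(choose_recovery_option(48).value)
# prints: intermediate recovery (warm standby)
```

The rule deliberately returns the slowest acceptable option, since cost rises with each step toward immediate recovery.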

Are there any other options?

Believe it or not, not doing anything is also an option, but not one that IT can choose on its own. If the business evaluates its data stored within information systems as not worth extra safekeeping (and paying for any costs that may arise), then there is nothing IT can do, other than the typical risk responses listed within this article. But this must be a business decision, and a business decision only.

While ITIL 2011 may look pretty fresh, it is merely an update to ITIL 2007 (V3). At that point in time, terms like “Cloud” were not this widespread, and there were few public (or private) Cloud providers that could offer IaaS (Infrastructure as a Service), PaaS (Platform as a Service), or SaaS (Software as a Service) to organizations regardless of their size. So, each and every one of the mentioned Cloud services may well fit into any recovery option. Nowadays even Cloud providers offer recovery options to their clients: if anything happens to a whole world region, they can move the data and services to another region, effectively ensuring service continuity.

When everything runs smoothly, it’s hard to talk about worst-case scenarios, but trust me, it’s the best possible time to do so.
