Failure — And What Happens Immediately After

Prescient article by analyst Jim Grogan of 451 Research.
"There are those who have, and those who will."
Emphasis in red added by me.
Brian Wood, VP Marketing

Microsoft hosting failure is a reminder to all: failures happen, be prepared

Microsoft recently suffered an outage that serves as a reminder: everything with an on/off switch will one day have a failure. Best practices require constant monitoring, maintenance – and plans for when that failure occurs.
The customer message indicated that the root cause of an outage experienced by some Microsoft customers on March 14 was a firmware upgrade that did not go as planned. Without many details, Microsoft stated that the upgrade failure caused heat to rise rapidly, and we all know that electronic equipment and heat don't get along well.
Failures happen – period. We can talk about resilience for hours and days, but even with the best planning and operational procedures, IT equipment will fail. In some sense, the critical task at hand is not limited to anticipating the failure, but also planning how to respond.

Change control or change management?

Nearly 25 years ago, in a conversation about change control, an industry colleague said that we can't control change, but can only hope to manage it well. This analyst has always remembered the wisdom in that perspective; how we respond to failures that will occur in reality defines how customers see companies. Whether the trigger event is a power failure, cyber attack, system upgrade or a Hurricane Sandy or Katrina, outages demand a focused response. Contract renewals, customer retention and business growth are all tied to the reliability of products and services as perceived by the customer; resilience is not an option, it is a requirement.
Those involved with multi-tenant datacenters (MTDCs) over any length of time have likely experienced the unexpected: software upgrades gone awry; database conversions done poorly; construction/installation/development projects behind schedule; floods, fires, earthquakes, ice storms and hurricanes. Each incident requires a business response, but the challenge remains that the specific event will probably differ from the scenario planning, and the response will necessarily need adjustment in real time.

Identifying risks

Risk management programs can be extremely complex, but a basic understanding points to a few major elements:

  • Risk assessment – Conducting an assessment helps to identify and name risks, and attempts to classify them with probability of occurrence and potential business impact. Under many regulatory environments and within most industry standards, regular periodic risk assessments are required elements.
  • Inherent risk – Risk exists whether identified or not, and inherent risk collects all risks under one term. In one sense, the lack of detail is helpful in speaking of the breadth of potential issues, but it could lead to some level of paranoia. Just because a risk has not been identified doesn't remove the threat and responsibilities; also, once identified, standards of due care demand that the risk is dealt with appropriately.
  • Residual risk – Risk managers speak of residual risk as that which remains after all mitigation steps have been taken. Understanding residual risk requires recognizing that it also includes any unnamed inherent risk that has not been mitigated by other steps.
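
As a rough illustration of the elements above, a risk register can score each named risk by probability and impact and then discount it by whatever mitigation is in place. This is only a minimal sketch: the `Risk` class, the field names, the multiplicative scoring and all of the sample numbers are assumptions made for illustration, not a method described in the article. It also covers only named risks; unnamed inherent risk, by definition, never appears in such a register.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One named entry in a hypothetical risk register."""
    name: str
    probability: float        # assumed chance of occurrence per year, 0.0 to 1.0
    impact: float             # assumed business impact, in arbitrary cost units
    mitigation: float = 0.0   # assumed fraction of exposure removed by controls

    def exposure(self) -> float:
        # Exposure before any mitigation: probability times impact.
        return self.probability * self.impact

    def residual(self) -> float:
        # What remains after the mitigation steps have been applied.
        return self.exposure() * (1.0 - self.mitigation)


def rank_by_residual(risks):
    """Return the risks ordered from highest to lowest residual exposure."""
    return sorted(risks, key=lambda r: r.residual(), reverse=True)


# Illustrative sample data only; none of these figures come from the article.
risks = [
    Risk("flood", probability=0.01, impact=500.0),
    Risk("regional power outage", probability=0.05, impact=800.0, mitigation=0.75),
    Risk("firmware upgrade failure", probability=0.10, impact=500.0, mitigation=0.5),
]

for risk in rank_by_residual(risks):
    print(f"{risk.name}: exposure={risk.exposure():.1f}, residual={risk.residual():.1f}")
```

Ranking by residual rather than raw exposure keeps attention on what is still unaddressed, which is the part of the register that a periodic assessment is meant to revisit.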

Operational risks are those that will occur in the normal course of conducting business; MTDC providers address and deal with operational risk components 24/7. Engineering resilient operations minimizes operational risks, and automated monitoring and response tools make this job more effective.
But failures will happen, usually at the worst time.

All-stars, first responders and escalation

Every star player wants to have the chance to control the outcome when the game is on the line; the same is true for IT first responders. Business leaders hold the responsibility of attempting to get star performers working 24/7, something that is unlikely; an overly tired star in the midst of a prolonged event will not remain the best decision maker in the room, no matter how well he or she normally functions. Star performing companies, then, need to raise the response bar at the organizational level rather than at the individual level; resilience and crisis response must be a team effort to be perceived by customers as world-class, which includes appropriate and timely escalation.
Crisis events rarely are a moment in time, but will occur over hours or days. Those events that occurred momentarily and were handled by the automated monitoring and response systems are not crisis events – like the tens of thousands of virus or malware attacks that hit every interconnected device daily and are stopped by firewalls and antivirus software. Such events simply reflect operational risk management working as planned. Crisis events, in contrast, will require escalation and management oversight. Even if the event occurred and was resolved, there may be an ongoing business response related to SLA satisfaction, regulatory oversight and reporting, and simply the critical internal reviews of how a good response could be made better the next time.

MTDC discipline

In the MTDC world of colocation, managed and cloud services, there is a significant opportunity for crisis response to be stronger than might be achieved from internal enterprise operations. This strength results from the relationship between the provider and the customers, and the discipline of change management that relationship demands. SLAs may require MTDC providers to communicate to customers certain changes to their infrastructure, related to upgrades and expansions. Similarly, customers will have a layer of formality about anticipated changes when dealing with an outside vendor – the MTDC provider – that may not be typical between internal departments; change management discussed at the coffee machine will usually be less effective than what is handled in a vendor conference call. In a recent conversation with an MTDC executive, it was pointed out that the company never wants to issue SLA credits – it wants to make sure that its operation is reliable and robust, and would rather invest in being able to deliver world-class service than pay for a past failure event.
Additionally, MTDC providers and 451 Research indicate that more customers are engineering resilient systems through the use of multiple, geographically separated instances – typically seen in 30-40% of colocation customers today. Whether planned between dual colocation sites, or taking advantage of cloud availability options, MTDC options serve as important tools in building resilient information systems.
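
To see why geographically separated instances help, consider a back-of-the-envelope calculation. The sketch below assumes that sites fail independently of one another and that any single surviving site can carry the full load; both are simplifying assumptions, and the function name and the 99% figure are illustrative rather than taken from the article.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` sites is up.

    Assumes every site has the same availability and that site failures
    are statistically independent (a simplification: a shared software
    fault or a bad upgrade rolled out everywhere would break it).
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0.0 and 1.0")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every site is down at the same time.
    return 1.0 - (1.0 - site_availability) ** sites


# Illustrative figure only: a single site assumed to be up 99% of the time.
for n in (1, 2, 3):
    print(f"{n} site(s): {combined_availability(0.99, n):.6f}")
```

Under these assumptions each added site multiplies the chance of a total outage by the single-site failure probability. Correlated failures, such as the same faulty upgrade applied to every site, erode that benefit, which is one reason change management still matters in a multi-site design.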

The 451 Take

From the moment the 'on' button is pressed on any device, it begins to wear out; that is change, and it must be effectively and appropriately managed. Risk and crisis management call for layers of monitoring, preparation, practice of likely response options and anticipation of failure. The most recent outage should serve as a wake-up call to ensure that preparations, employee training and plans remain current. Best-in-class enterprises and MTDC providers learn from every internal IT failure, as well as from what others have experienced. International standards, as well as federal continuity of operations (COOP) procedures, call for some form of an 'after action' report to capture lessons learned, something that we recommend all organizations heed – public, private, commercial, non-profit or governmental. Business growth demands resilience, which in turn demands that lessons be learned every day. First responders may sprint to the fire or downed server, but business leaders know you are training for a marathon.