Downtime, Outages and Failures - Understanding Their True Costs
This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes carried out in the enterprise cloud environment. Evolven helps leading enterprises cut the number of incidents, slash troubleshooting time, and eliminate unauthorized changes.
When it comes to mission-critical applications or data-center performance quality, enterprises are willing to make huge investments. Unfortunately, these investments don’t always fully deliver.
Confronting system downtime
Despite the efforts invested in infrastructure robustness, many IT organizations continue to deal with database, hardware, and software downtime incidents that last from just a few minutes to several days, completely incapacitating the business and causing tremendous losses.
The world of IT failure can sometimes seem awkward.
Despite the variety of advanced solutions and the mounting data collected by major enterprise software vendors and IT departments (from ERP to CRM and more), outages are still a real and terrifying threat to the industry.
On the other hand, IT failures have somehow become an inherently accepted, even expected, part of enterprise life.
This is counterintuitive…
IT downtime revisited
While IT professionals confront downtime from time to time and are then fully focused on getting on top of it, the business as a whole suffers the ‘financial pain’ side effects, which tend to be very significant.
In the past, we took an in-depth look at the multiple ways in which IT downtime can impact enterprises’ bottom line (you can read more about it here - Cost and Scope of Unplanned Outages). We looked at different aspects, from direct loss of revenue through reputation damage to indirect effects such as a decrease in productivity.
Now, I wish to revisit the issue and examine how organizations should address and assess threats to their IT operations, including systems, applications and data, by analysing solid (and established) benchmarks that represent the potential costs behind downtime and outages.
Measuring big brand failures
When should the industry start measuring the financial impact of big brand outages, such as the one that recently hit Facebook, the one that hit hundreds of thousands of Lloyds Bank customers, or the Jetstar outage that resulted in hundreds of flight delays?
In other words, at what point is an outage ‘significant enough’ so that a cost analysis becomes valuable to the industry in order to learn from it and predict the impact of future outage incidents?
Well, apparently at some point the outage creates an impact that can’t be ignored, PR-wise. That’s the point of no return, which is followed by financial impact estimations.
Downtime costs vary significantly between industries. The affected business size is obviously a critical factor, but it is not the only major one. The role of the IT systems in the business is also key.
Setting a numerical value behind an IT outage means predefining its implications across multiple business and organizational aspects, so that the whole industry can learn and optimize accordingly.
A failure of a critical application can lead to two distinct types of losses:
- Loss of the application service – the impact of downtime varies according to the application and the business;
- Loss of data – the potential loss of data due to a system outage can have significant legal and financial implications.
Now, I am sure that you would agree that today's data centers should never go down; applications must stay available 24/7, and internal (let alone external) end-users worldwide must be able to rely on data centers’ availability (for critical data and application availability) at all times.
Well, reality bites. In the back office (meaning inside the data center) this is not the case. No organization enjoys 100% uptime. Should you aspire to reach 100%? Sure. But you should also develop a deep understanding of downtime implications and ways to minimize it.
The worst outage nightmare ever? Probably the one that happened to you…
Some past outage incidents turned into PR catastrophes, like the notorious Virgin Blue debacle from 2010, or the recent one that affected Facebook.
Why? The mass impact probably had something to do with it.
As a reminder, the Virgin Blue outage prevented passengers from boarding flights for 11 days (!!) resulting in negative press, damaged reputation, and millions of dollars lost.
To be more accurate: Virgin Blue's reservations management company, Navitaire, ended up compensating Virgin Blue for more than $20 million (Navitaire booking glitch earns Virgin $20M in Compo).
There are many other incidents that still manage to capture the attention of the media. Here’s just one recent article by USA Today about the Wells Fargo outage that prevented customers from accessing their accounts for many hours.
I can safely say that anyone in the IT industry would agree that outages or downtimes are VERY bad for business. They are unwanted, very harmful financially, and must be fought against using all available resources.
Misconfigurations are key
The IT Process Institute's Visible Ops Handbook reported in the past that "80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers" (Visible Ops).
Enterprise Management Associates reported that 60% of availability and performance errors are the result of misconfigurations.
What’s the cost?
Downtime can cost companies $5,600 per minute, which adds up to more than $300,000 per hour of web application downtime (according to a 2014 Gartner analysis).
[Figure: the average hourly cost of enterprise server downtime, worldwide, 2017-2018]
Application maintenance costs are increasing at an annual rate of 20%. But that can’t solve all of your problems. A past industry survey revealed that at least one-quarter of the downtime reported by respondents was caused by configuration errors (How much will you spend on application downtime this year?).
How common are downtimes or outages?
Ok, downtime can be a financial nightmare. That part is clear. But if you wish to properly estimate the risk potential of outages to your business, the immediate question should be “how likely is it to happen?”
Source: Data Center Knowledge
Ok, so outages are way too common to be ignored by thinking “I am not likely to experience a major outage”. Now comes the question of how to calculate their specific risk to your business.
Production and application downtime costs made clear
Unplanned outages are up to IT to resolve. Nevertheless, and as I already mentioned, at the end of the day these outages impact the entire organization.
An important part of a thorough outage risk evaluation process is estimating how much money you will lose per hour (or minute, or any other time increment of your choice) in the event of downtime.
For enterprises that depend solely on data centers' ability to deliver IT and networking services to customers – such as telecommunications service providers or e-commerce companies – downtime can be particularly costly, with the highest cost of a single event topping $1 million (more than $11,000 per minute) according to estimations by experts.
In a USA Today survey of 200 data center managers, over 80% reported that their downtime costs exceeded $50,000 per hour. Over 25% reported downtime costs of over $500,000 per hour (!!).
According to another survey, while companies can't achieve zero downtime, one in every 10 companies said that their availability must be greater than 99.999%.
Source: Searchcio Techtarget
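To put an availability target such as 99.999% into perspective, it helps to translate it into a yearly downtime budget. The following is a minimal sketch, assuming a 365-day year of round-the-clock operation; the function name and the constant are illustrative and do not come from the survey.

```python
# Minutes in a non-leap year of 24/7 operation (an assumption of this sketch).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable share of the year is whatever the target leaves over.
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, a "five nines" target leaves only a little over five minutes of downtime per year, which is why so few organizations commit to it.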
To get a firm understanding of the implications of production and release downtime, let's take a look at how the consequences of downtime are manifested.
Downtime cost - per year or per incident?
A 2017 study revealed that out of 400 IT decision makers, 46% experienced more than four hours of IT-related downtime over 12 months; 23% said that they incurred costs ranging from $12,000 up to more than $1 million per hour.
Over 35% admitted that they are unsure of the cost of an outage to their business.
If you ask Delta Air Lines, which had to cancel 280 flights due to an outage in 2017, the losses of a single outage incident can reach over $150 million.
A couple of years ago, Dun & Bradstreet reported that 59% of Fortune 500 companies experience a minimum of 1.6 downtime hours per week.
If you take the average Fortune 500 company (or a company that employs at least 10,000 employees) and assume that it pays its IT team members an average of $56 per hour, then (assuming the entire IT team is busy solving the downtime) just the labor part of downtime for an organization of this size would reach $896,000 per week (1.6 hours x $56 x 10,000 people), translating to more than $46 million per year (Assessing The Financial Impact Of Downtime).
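The labor estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch rather than the cited article's method: it assumes that the $896,000 figure multiplies the weekly downtime hours by the hourly rate and by a headcount of 10,000, and that a year has 52 weeks. The function name is hypothetical.

```python
def weekly_labor_cost(headcount: int, hourly_rate: float, downtime_hours_per_week: float) -> float:
    """Labor cost of downtime per week, assuming every counted person is affected."""
    return headcount * hourly_rate * downtime_hours_per_week


if __name__ == "__main__":
    # Inputs taken from the text: 10,000 people, $56 per hour, 1.6 downtime hours per week.
    weekly = weekly_labor_cost(10_000, 56.0, 1.6)
    yearly = weekly * 52  # assumes 52 working weeks per year
    print(f"Weekly labor cost: ${weekly:,.0f}")
    print(f"Yearly labor cost: ${yearly:,.0f}")
```

Swapping in your own headcount, rate, and weekly downtime gives a first-order estimate for your organization.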
Of course, reality is more complicated, as you need to take into consideration many parameters like the time of the event (mid-week or weekend? Day or night time?) and more. Still, understanding the costs of outages will significantly help estimate your risk potential and the ROI of tools that can help minimize the effect of downtime incidents.
Has the industry managed to learn from the past and to minimize the collateral damage during an outage?
How have things changed from the past?
So, we already know that downtimes and outage incidents still happen today, and the industry has yet to successfully abolish them. But how has their cost changed over time? Are these incidents less harmful today?
In 2010, research by Coleman Parkes found that IT downtime incidents collectively cost businesses more than 127 million man-hours per year - an average of 545 man-hours per company - in employee productivity.
In 2009, it was reported that the average downtime costs vary considerably across industries, from approximately $90,000 per hour in the media sector to about $6.48 million per hour for large online brokerages (How to quantify downtime).
According to a survey of IT managers conducted during those years, companies are becoming more aware of the direct financial costs of computer downtime. The survey revealed that one in every five businesses loses $12,000 an hour through systems downtime (How to quantify downtime).
As mentioned above, a later analysis performed by Gartner in 2014 reported an average cost of $5,600 per minute and over $300k per hour.
Even as early as 2004, a conservative estimate from Gartner pegged the hourly cost of downtime for computer networks at $42,000. Accordingly, a company that suffers from a worse-than-average downtime of 175 hours per year can lose more than $7 million annually. However, each outage affects each company differently, so it's important to know how to calculate the precise financial impact (How to quantify downtime).
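The annual figure follows directly from multiplying the hourly cost by the yearly downtime hours. Here is a minimal sketch of that arithmetic; the function names are illustrative, and the per-minute conversion assumes the cost accrues evenly over the hour.

```python
def annual_downtime_cost(hourly_cost: float, downtime_hours_per_year: float) -> float:
    """Yearly cost of downtime for a given hourly cost and number of downtime hours."""
    return hourly_cost * downtime_hours_per_year


def hourly_from_per_minute(cost_per_minute: float) -> float:
    """Convert a per-minute downtime cost into an hourly cost (60 minutes per hour)."""
    return cost_per_minute * 60


if __name__ == "__main__":
    # Figures quoted in the text: $42,000 per hour and 175 downtime hours per year.
    print(f"Annual cost: ${annual_downtime_cost(42_000, 175):,.0f}")
    # The $5,600 per minute figure expressed per hour.
    print(f"Hourly cost: ${hourly_from_per_minute(5_600):,.0f}")
```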
It makes sense to believe that the cost of outage only gets higher with time (since we all lean more on data systems today). You can therefore understand why past data can be multiplied by a significant number in order to reflect today’s reality…
Every minute counts
Over ten years ago, the average cost of data center downtime across industries was valued at approximately $5,600 per minute (Unplanned IT Outages Cost More than $5,000 per Minute), a figure which, according to Gartner, remained the same until 2014. A past study by the Ponemon Institute calculated the minimum, median, mean and maximum cost per minute of unplanned outages, based on input from 41 data centers. The greatest cost of an unplanned outage was found to exceed $11,000 per minute.
On average, the cost of an unplanned outage is likely to exceed $5,000 per minute.
It only gets more significant
A 2013 study saw an uplift of over 41% from the past averages described above, to an average cost of more than $7,900 per minute.
An ITIC survey from 2015 clearly showed that the hourly cost (compared to data from 2008) has increased by between 25% and 30%.
Downtime impact per year
A past Gartner analysis calculated that downtime can add up to 87 hours per year, on average. Obviously, that's the sum of many outages, each lasting anywhere from a few minutes to several hours (Average large corporation experiences 87 hours of network downtime a year).
How have things changed?
Later research from 2011 revealed that although the industry has managed to fight the downtime epidemic and reduce the number of incidents, we are still seeing significant downtime hours and huge revenue losses.
The impact on reputation and loyalty
How much is your business reputation worth? This may be extremely difficult to assess, as is the long-term effect of a damaged reputation on revenue and profitability.
In this case, downtime costs include lost customers (both short and long term), and other tangible elements that reflect the costs of reputation impairment like stock downturns, marketing hours (crisis and brand recovery management) and media budget required to reboot and polish up an organization's profile.
What parameters should impact your calculation?
When trying to estimate the cost of downtime, there are the obvious direct costs (such as loss of business during downtime). However, many indirect costs, such as the employee overhead or reputation issues discussed above, should be factored in as well.
Workforce overhead is derived from the cost of burning ‘war-room’ tasks that focus on getting the IT systems back up and running, the cost of being delayed with all other planned tasks, the cost of employee overtime expenses (if applicable), and more. Then there’s the value of data loss, emergency maintenance fees (particularly if the outage occurs during off hours), and additional repair costs that may continue long after service has been restored.
Needless to say, you must calculate these costs when you estimate the implication of downtime, as they are usually very significant; but even a rough guesstimate can prove to be extremely beneficial for understanding the risks and deciding on the required level of technology you should lean on, in order to fight it.
There’s also the impact of lost sales. To have an accurate assessment of the total lost sales, the impact percentage must be increased to reflect the real lifetime value of customers who permanently defect to a competitor (Cost-Unconscious: Denying the True Cost of Network Downtime). For instance, the Facebook (and WhatsApp) outage that I mentioned earlier led to over 3 million users (apparently WhatsApp users) migrating to Telegram. What is the revenue loss derived from the fact that these users will generate fewer billable ad impressions?
Stock dropped by 25%
Although it's hard to put a number on so many parameters, they are still substantial and significant. For instance, when Amazon.com went offline for several hours during its early days, its stock dropped by 25% in a single day (Cost-Unconscious: Denying the True Cost of Network Downtime)!
In the Amazon cloud outage, the company scrambled to get its cloud services back online. As a result, many customers questioned the reliability of its cloud and Amazon’s communication surrounding the outage. Other customers thought they should be compensated for the downtime as part of their SLA.
I know you are curious: As for the SLA, despite the almost-four-day outage, Amazon's EC2 SLA was not breached (Seven lessons to learn from Amazon's outage).
The cost of downtime: Calculating it yourself
How much are you bound to lose from an unexpected downtime of your servers or business applications?
According to multiple sources, the simplest way to calculate potential revenue losses during an outage is by using this equation:
LOST REVENUE = (GR / TH) x I x H
where:
GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact
H = number of hours of outage
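The equation is easy to turn into a small calculator. This is a minimal sketch, treating I as the impact percentage expressed as a fraction between 0 and 1 (an assumption about how the factor is meant to be used); the function name and the sample figures are hypothetical and only illustrate the arithmetic.

```python
def lost_revenue(
    gross_yearly_revenue: float,
    total_yearly_business_hours: float,
    impact: float,
    outage_hours: float,
) -> float:
    """Estimate lost revenue as (GR / TH) x I x H."""
    if total_yearly_business_hours <= 0:
        raise ValueError("total_yearly_business_hours must be positive")
    if not 0 <= impact <= 1:
        raise ValueError("impact must be a fraction between 0 and 1")
    revenue_per_hour = gross_yearly_revenue / total_yearly_business_hours
    return revenue_per_hour * impact * outage_hours


if __name__ == "__main__":
    # Hypothetical company: $100M yearly revenue, 2,000 business hours per year,
    # half of the revenue affected, and a four-hour outage.
    estimate = lost_revenue(100_000_000, 2_000, 0.5, 4)
    print(f"Estimated lost revenue: ${estimate:,.0f}")
```

Note that this covers only direct revenue; the indirect costs discussed earlier (labor, reputation, data loss) would have to be added on top.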
How to minimize outage and downtime risk?
Downtime and outages are catastrophic, but they don’t have to be that impactful. By utilizing solutions that focus on getting to the root of the problem, outages can be prevented before they even occur.
Evolven Change Analytics developed a unique AIOps solution that focuses on changes - the true root cause of performance incidents. Evolven helps enterprise IT and Cloud Ops teams prevent and troubleshoot incidents before the trouble starts.
Contact us to see how we help leading enterprises slash the number of incidents and MTTR.