Top Industry Performers in Unplanned Server Downtime | Q1 The Uptime Report

Can you be incompetent and still stay in business? Not as far as your web infrastructure is concerned. All the studies show that when a website is unavailable, or even just slow to load, customers go elsewhere—and often they don’t come back. After all, if you can’t keep a website up and running, why should people trust you to deliver any other product or service?

So it’s worth asking: how reliable is your website relative to the top brands in your industry? Do the biggest firms have the best infrastructure? Or is there a potential competitive edge to be had for smaller concerns here, simply by minimizing their downtime? And how much downtime is acceptable in a given period before your brand starts to suffer?

Industries by downtime minutes.

If you had to guess, how much unplanned server downtime would you expect the average business to experience in a given quarter? What about the average length of an incident? And is downtime more or less evenly distributed, or are some sites very much less reliable than the average? And the big question: how much does downtime cost? Read on to find the answers.

The Uptime.com report for Q1 2020 includes data from the top performers in each industry we track. For this first-quarter report, we analyzed only the top industry performers from a pool of 6,378 of the world’s top websites for the period, as ranked by Alexa.

How the Uptime.com Report Data is Collected

Uptime.com monitors sites from multiple locations around the globe, making an HTTP(S) connection to each site’s main landing or login page every three minutes, and recording server performance metrics along with any downtime incidents (whether planned or unplanned downtime). Status pages used for this report are available for view below:
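To make the collection method described above concrete, here is a minimal Python sketch of a single probe of that kind. It is an illustration only, not Uptime.com’s implementation: the names `CheckResult`, `http_status`, and `check_once` are hypothetical, and treating any status below 400 as “up” is an assumption.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional

CHECK_INTERVAL_SECONDS = 180  # "every three minutes", per the report


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]     # HTTP status code, or None if no response arrived
    response_seconds: float   # wall-clock time taken by the request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still useful.
        return err.code


def check_once(url: str, fetch: Callable[[str], int] = http_status) -> CheckResult:
    """Run one probe. 'Up' means a status below 400 (an assumption)."""
    started = time.monotonic()
    try:
        status = fetch(url)
    except OSError:
        # DNS failure, refused connection, timeout, and similar network errors.
        status = None
    elapsed = time.monotonic() - started
    return CheckResult(url, status is not None and status < 400, status, elapsed)
```

A monitor would call `check_once` in a loop, sleeping `CHECK_INTERVAL_SECONDS` between probes and recording each result. The `fetch` parameter is there so the probe logic can be exercised without touching the network.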

Destimate Risks for Unplanned Downtime from the Business Sector

Rapid Incident Response is Critical

On average, the business sector outages we tracked lasted less than one hour. Six of the businesses we track showed total downtime for the quarter of more than one hour. When the emphasis is on rapid response, total downtime decreases.

Unplanned Server Downtime is Common in Business

If you think it won’t happen to you, probability says it simply hasn’t happened yet. The majority of businesses we tracked experienced three or fewer outages in the first quarter of 2020.

Does Redundancy Help Unplanned Downtime?

One common error we tracked was 502 (Bad Gateway), highlighting a need for better infrastructure. A 502 means a gateway or proxy received an invalid response from the server behind it, which can point to problems with the server, performance, traffic volume, and similar factors. Users often make the problem worse without meaning to, by refreshing repeatedly in the hope of getting a connection.

The big takeaway from continued unplanned downtime is that it compounds. If you cannot get back up, you pay an even greater price.

SRE consultant and writer John Arundel, author of ‘Cloud Native DevOps with Kubernetes’, is an expert on monitoring and resilience, and works with firms worldwide to help them improve reliability and reduce downtime. Arundel says:

“S*** breaks; that’s just the way it is. The question is, how do you respond? How comprehensively do you monitor every feature, endpoint, and microsite across your web estate? When problems happen, how soon do you know about it, and what’s your procedure to triage and fix it? Do you drill and refine those procedures? Do you have metrics on how quickly and how well you respond to server downtime incidents? When those metrics look good, do you analyze why you succeeded? When they don’t, are you able to figure out what you’re doing wrong and address it?

One of the best ways to wreck your website’s reliability is to react in a panicky, ad-hoc fashion to incidents and to start flailing around changing things at random, in the hope of getting back online as quickly as possible. But that’s a catastrophe, not a strategy. If you don’t have the instrumentation and data to understand what went wrong in the first place, you’re just digging a deeper hole—and that’s where you’ll find your cashflow if you can’t deliver a reliable online service.

It never ceases to amaze me that businesses often pour huge effort and resources into developing an online offering, and then just put it out there and leave it flapping in the wind, with minimal monitoring and observability. If you don’t care whether your site is up or down, why should anyone else? When I go to a site and it’s not there, or I get a timeout or a 500 error, I tend to assume the firm has gone out of business. Because even if that’s not true right now, it soon will be.”

Destimate Risks for Unplanned Downtime from the eCommerce Sector

Double-Digit Server Outage Incidents

A number of eCommerce businesses we tracked had double-digit outage numbers with hours of downtime. The worst offender had 59 downtime incidents, while another site had 4 hours and 35 minutes of downtime.

Unplanned Downtime from Spikes in Traffic

Q1 has been a lesson in disaster planning and adaptation in real time. Traffic spikes can occur suddenly and last for days at a time, driven by trends outside your organization’s control. Now is the best time to adopt a status page for incident communication, especially as your business grows in size and media coverage. Transparency within your organization

Users are Understanding Server Downtime Happens

As in our annual report, eCommerce response time rose and was the highest in our testing (1.74 seconds on average). eCommerce demand also skyrocketed during this period, with 20% growth in e-commerce revenue in Q1 2020 versus 12% in Q1 2019. Mobile e-commerce traffic also grew by 25% across all industries.

There is no longer any excuse for ignoring mobile performance and experience.

Destimate Risks for Unplanned Downtime from the Financial Services Sector

One Hour or Less of Unplanned Outages

The majority of the providers managed to keep downtime to less than one hour, even as the number of outages rose. When the proper support is in place, downtime incidents are manageable. It’s not the volume of incidents but their length that impacts SLA fulfilment and organization growth.

Second Highest Response Time

Financial industry players may have a handle on incident response, but their websites’ response times were far higher than those of most other industries. eCommerce sites are still slower than finance sites, but both industries would benefit from Real User Monitoring for insights that would improve the user experience.

High Uptime Score

Financial service companies had an uptime score of 99.83% for the quarter despite the high number of incidents. According to our data, 500 errors are a frequent problem for banking. This is consistent with our observations regarding high response time and a high number of incidents.

Memory errors can also be a source for 500 codes. Keep an eye on internal processes to stay informed of spikes that might indicate a service is unusable. Combined with external monitoring, you get the complete picture of downtime and its cause.
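To put the 99.83% uptime figure in perspective, the short Python sketch below converts an uptime percentage into minutes of downtime for Q1 2020. The function names are hypothetical, and applying the sector-wide percentage to a single site over the whole quarter is an assumption made purely for illustration.

```python
from datetime import date


def minutes_in_period(start: date, end: date) -> int:
    """Minutes between two dates, with `end` exclusive."""
    return (end - start).days * 24 * 60


def downtime_minutes(uptime_percent: float, period_minutes: int) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    return period_minutes * (100.0 - uptime_percent) / 100.0


# Q1 2020 runs from 1 January to 31 March: 91 days, since 2020 is a leap year.
q1_2020 = minutes_in_period(date(2020, 1, 1), date(2020, 4, 1))
down = downtime_minutes(99.83, q1_2020)
print(f"{q1_2020} minutes in the quarter, about {down:.0f} minutes of downtime")
```

On these assumptions, 99.83% uptime still leaves roughly 223 minutes, or a little under four hours, of downtime across the quarter.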

Destimate Risks for Unplanned Downtime from the Health Sector

Fluttering Server Performance

Downtime adds up, and flapping downtime, where a site goes down and recovers before your team can respond, still hurts your SLA. We tracked some businesses with a high number of outages but low downtime per incident. These undetected outages, or outages that need no immediate action to fix, tend to indicate a greater problem and should be observed in aggregate.
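One way to observe short outages in aggregate is to flag a site as flapping when several brief incidents cluster within a time window. The Python sketch below shows the idea. The `Incident` type, the `is_flapping` function, and the thresholds (three incidents of five minutes or less within 24 hours) are all hypothetical choices for illustration, not values taken from the report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime
    duration: timedelta


def is_flapping(
    incidents: List[Incident],
    window: timedelta = timedelta(hours=24),   # hypothetical threshold
    short: timedelta = timedelta(minutes=5),   # hypothetical threshold
    min_count: int = 3,                        # hypothetical threshold
) -> bool:
    """True if at least `min_count` short incidents start within one window."""
    starts = sorted(i.started for i in incidents if i.duration <= short)
    # Slide over the sorted start times: if the first and last of any
    # `min_count` consecutive incidents fit inside the window, they cluster.
    for idx in range(len(starts) - min_count + 1):
        if starts[idx + min_count - 1] - starts[idx] <= window:
            return True
    return False
```

Longer incidents are deliberately ignored here, on the assumption that they already trigger a normal incident response; the point is to surface the short ones that would otherwise pass unnoticed.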

Backend Changes Affect Server Performance

The user experience for health websites often doesn’t change drastically in the same way as the backend. Eyes on infrastructure become a critical facet of monitoring the user experience. Wherever possible, health services should incorporate RUM and Synthetic Monitoring to capture that user experience and ping important pieces of infrastructure.

Geography Matters in Server Performance

Health infrastructure in a place like America can mean one company serving specific states. Redundancy and geography matter in content delivery and uptime. A critical question for health service firms is whether response time can be improved with better delivery.

Destimate Risks for Unplanned Downtime from the Social Sector

Most Resilient

Parts of social media may go down, but you are almost always able to reach a login page and at least access portions of the application. So if your services go down, you will always have a place to vent about it.

Unplanned Ad Outages

One side-effect of this fragmented infrastructure is that some services can go down while others remain up. If that service happens to be the underlying ad infrastructure, it can affect your business outlook and your marketing prospects.

Conclusions and Takeaways

The Last Miles Matter

Right now, every part of the network is stressed with unprecedented traffic numbers across websites that might not be accustomed to high volume. Even those that are used to big numbers will find that the last few miles of cable could be the difference between your customer seeing your service or an error message.

Infrastructure Versus People

Uptime and SLA fulfillment don’t really mean much when your user can’t access your server. The equivalent of pointing at a contract and saying “I’m not legally obligated to do more” is a recipe for disaster and not the best use of an SLA. SLAs should guide development and growth. If you’re performing well, invest a little in growth using the same principles that got you where you are. Great code and a good development flow tend to have a better impact on uptime than more “boxes”.

Downtime Adds Up Quickly

We tracked a lot of small incidents at about 5 minutes apiece. If you have 10 of these in a three-month period, you have accumulated almost an hour of downtime for the quarter.

The number of outages does not need to be high for downtime minutes to exceed industry standards. When incident response is slow, downtime increases.
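The arithmetic behind that example is easy to check. The Python sketch below totals ten five-minute incidents and expresses the result as an uptime percentage. The function names are hypothetical, and the 90-day quarter is an assumption.

```python
from typing import Iterable

QUARTER_MINUTES = 90 * 24 * 60  # assumes a 90-day quarter


def total_downtime(incident_minutes: Iterable[float]) -> float:
    """Sum the duration of every incident, in minutes."""
    return sum(incident_minutes)


def uptime_percent(downtime: float, period_minutes: int = QUARTER_MINUTES) -> float:
    """Uptime as a percentage of the period, given total downtime in minutes."""
    return 100.0 * (1 - downtime / period_minutes)


incidents = [5] * 10                 # ten incidents of about 5 minutes each
total = total_downtime(incidents)    # 50 minutes
print(f"{total} minutes down, {uptime_percent(total):.2f}% uptime")
```

Fifty minutes is small as a share of a quarter (uptime stays at about 99.96%), but it is still nearly an hour during which customers could not reach the site.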

But it’s not the actual downtime that kills you, necessarily. Downtime doesn’t work on a schedule, and it doesn’t conform to policies. When it happens at 3 AM, there’s a built-in response time lag as engineers rouse themselves out of bed and examine whether the incident is serious. If you can cut that time down, say with technical data that will point them to the source of the problem once they are awake, you can save unnecessary outage time.

Downtime incidents tend to get worse when we don’t follow up with root cause analysis. Things do sometimes fix themselves, but that’s not the point: recurring incidents are indicative of a bigger problem.

Business Losses from Unplanned Server Downtime for Q1 2020

We estimate that, on average, businesses faced a destimate risk for Q1 of between 7% and 9%. This risk is elevated due to the coronavirus pandemic, which has stressed networks down to the very last mile of infrastructure. Businesses would do well to focus on tracking downtime incidents, building out transparency to explain unplanned downtime, and equipping lower-level engineers with the tools necessary to respond. These will be the critical deciding factors in keeping SLAs fulfilled and total downtime low, while ensuring customer satisfaction reigns supreme.

The best in the industry are investing in automation for every level of infrastructure to get a fuller picture of what went wrong. Even starting small and monitoring a few critical processes will increase visibility and have a net positive effect on destimate risk to your organization.


John Arundel is a well-known Go developer and mentor, and an expert on DevOps and infrastructure. He is the author of several technical books, most recently For the Love of Go: Fundamentals, and the bestselling Cloud Native DevOps with Kubernetes. Follow him on Twitter at @bitfield.
