Semiannual Report of Unplanned Server Downtime | 2020 Q1 + Q2

Dark matter discovered!

No, this isn’t physics news; hold that Nobel Prize. This is about downtime: the dark matter of the web. It’s invisible to most of us, but its gravity has huge effects on commerce, companies, and markets. For everybody who does business online, unplanned website or service outages drag down their revenue, drag down their profits, and drag down their brand. In some cases, really disastrous downtime incidents can crush a company into the economic equivalent of a black hole, never to be seen again.

In its first semiannual report, covering the first half of 2020, website monitoring specialist Uptime.com is publishing hard data on outages and downtime, as well as website performance, for over six thousand of the world’s top online properties, from Apple to Zoom.

Industries by Total Downtime Hours

The results reveal that, like the dark matter swirling around our galaxy, a vast amount of previously-unsuspected server downtime is out there to be discovered. Every minute your website is offline has a measurable business cost in dollars: the only question is how much? Only you can answer that question precisely, but isn’t it time you started tuning your detectors in to look for the black hole in your balance sheet?

Introduction

Uptime.com monitors sites from multiple locations around the globe, making an HTTP(S) connection to each site’s main landing or login page every three minutes, and recording server performance metrics along with any downtime incidents (whether planned or unplanned). The status pages used for this report are available to view.
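To illustrate how checks taken at a fixed interval turn into downtime figures, here is a minimal Python sketch that groups consecutive failed checks into incidents. It is not Uptime.com’s implementation: the `Check` and `Incident` types, the function names, and the assumption that each failed check accounts for one full three-minute interval are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The report states a three-minute check interval.
CHECK_INTERVAL = timedelta(minutes=3)


@dataclass
class Check:
    at: datetime  # when the check ran
    ok: bool      # True if the page responded successfully


@dataclass
class Incident:
    start: datetime      # time of the first failed check in the run
    duration: timedelta  # assumed downtime for the run


def group_incidents(checks, interval=CHECK_INTERVAL):
    """Group runs of consecutive failed checks into incidents.

    Assumption: every failed check counts as one full interval of downtime.
    """
    incidents = []
    current = None
    for check in sorted(checks, key=lambda c: c.at):
        if check.ok:
            current = None  # a successful check closes any open incident
        elif current is None:
            current = Incident(start=check.at, duration=interval)
            incidents.append(current)
        else:
            current.duration += interval
    return incidents


def total_downtime(incidents):
    """Sum the durations of all incidents."""
    return sum((i.duration for i in incidents), timedelta())
```

With six checks in the pattern up, down, down, up, down, up, this yields two incidents (six minutes and three minutes) and nine minutes of total downtime. A real monitor would also need to decide how to treat failures seen from only some locations, which this sketch ignores.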

Unplanned Server Downtime for Business

Certificate Anxiety and Preventable Failures

What is your business doing about preventable failure? When you know that there is a possibility of certificates lapsing or external systems failing, are you taking the necessary measures to create backup systems?

“Some holes you fall in by accident; others you dig for yourself,” says John Arundel, an expert on infrastructure reliability and consultant at BitfieldConsulting.com. “It’s crazy to me how many companies aren’t taking the most basic precautions, like monitoring their SSL/TLS certificates for expiry. I mean, you know it’s going to expire. You even know the date it’ll happen. So why are there so many outages related to certificate expiry?”

Arundel explains that many engineers aren’t used to thinking about failure. “We’re builders; we like to make things. Naturally, we expect them to stay up. You don’t build a bridge expecting it to collapse the first time someone drives a bus over it. But we need to temper that optimism with at least a small dose of reality.”

Indeed, every major bridge or tunnel is studded with instrumentation, constantly monitoring stress and strain, twisting and flexing, and ready to alert engineers at the slightest sign of trouble. Arundel wants to know why we don’t have the same attitude towards online infrastructure. “It’s actually easier to build in resilience online. You can’t very well build a backup bridge in case the primary one fails, but you can totally do that with servers and load balancers. Why don’t we do it? Fundamentally, we just don’t expect things to go wrong.”
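On the certificate point raised earlier in this section, checking an expiry date is a small job. The following Python sketch uses only the standard library; the function names and the 30-day warning threshold are illustrative assumptions, not a recommendation from the report.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the 'notAfter' field of the certificate a host presents.

    The default context verifies the certificate, so one that has already
    expired makes the handshake raise ssl.SSLCertVerificationError, which
    is itself worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days left before the given 'notAfter' timestamp (may be fractional)."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job could call `needs_renewal(fetch_not_after("example.com"))` once a day, where `example.com` is a placeholder hostname, and raise an alert well before the known expiry date arrives.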

Unplanned Server Downtime for eCommerce

User Confidence and Uptime

Ultimately, eCommerce is about trust and timing. You need that fundamental trust between you and your user to get them to swipe a credit card with you, and you build that trust with reliability. Reliability plays into timing. If your service is alive, and the user is ready to buy, you can earn the sale. If it’s not, the user will go down the list until they find the holy grail item they seek.

eCommerce had only 2 days and 17 hours of downtime, with 99.96% uptime for the period. Those are impressive scores given the pandemic we lived through.

eCommerce has a lot of balancing work to do between serving assets, completing transactions, registering users, and marketing-related functions like special sales and landing pages. As DevOps looks ahead, it’s worth asking, now that stability is strong, what can be done to improve response time. 1.74 seconds is respectable, but there is always room for improvement, or at least for maintaining the status quo as you expand.

Unplanned Server Downtime for Financial Companies

The year opened with some worrying downtime trends that financial firms should be prepared to face. Namely:

  • Interconnected services
  • Cyberattacks

We learned that Nigerian banks spent ₦200 billion preventing cyberattacks, and expected costs to rise higher. If your bank or firm dealt with Travelex in any form, you also most likely saw how third-party services can affect your business. Currency freezes were in effect at one point, meaning stranded travellers could not get their money: the worst-case scenario come to life.

The Robinhood outage is in itself an incident report: an example of how to manage PR and what not to do, as well as a thoughtful exercise in what went wrong and what your team can learn from it. Finance, more than most industries, must move fast.

Firms need to take these warnings seriously. Today it is the major players failing in high-profile cases, but our data suggests these failures are a widespread problem brewing on the horizon.

Unplanned Server Downtime for Health Companies

Ransomware a Concern for Health Services

Across the world, we’re all concerned with maintaining patient records and improving health services. That requires a connection to the internet, which is where the more unscrupulous players take advantage of the poorly prepared.

We see Health having a relatively low number of outages but high overall downtime, indicating that incident response time is a widely overlooked factor in Health DevOps. Either the infrastructure isn’t there to support a service under stress, or there are no effective measures in place to reduce the time to respond.
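The reasoning here is simple arithmetic: dividing total downtime by the number of outages gives the average length of an outage, a rough stand-in for how long it takes to respond and recover. A short Python sketch follows; the function name and the figures in the usage note are invented for illustration and are not taken from the report’s data.

```python
from datetime import timedelta


def mean_outage_duration(total_downtime, outage_count):
    """Average length of one outage: total downtime divided by outage count."""
    if outage_count <= 0:
        raise ValueError("outage_count must be positive")
    return total_downtime / outage_count
```

For example, two hypothetical industries could each log 60 hours of downtime. With 120 outages the average outage lasts 30 minutes; with only 12 outages it lasts 5 hours. Few outages combined with high total downtime therefore points to slow recovery rather than frequent failure.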

Ransomware is expected to cost internet-connected businesses $20 billion worldwide, and we see high health downtime as a sign that this industry is susceptible to these kinds of attacks. Organizations should take extra steps to safeguard patient data and give personnel the means to respond.

Unplanned Server Downtime for Social and Tech

When It Rains, It Pours: Downtime for Social

Social media is highly susceptible to trend-driven downtime. You can plan for Christmas or New Year’s; you cannot plan for spontaneous health crises or worldwide protests with media uploads and rapid messaging. You cannot plan for surges of traffic from stay-at-home parents or college kids bingeing Netflix. You can try, and you can shore up your infrastructure, but doing so responsibly is akin to madness.

The best of social tend to have great infrastructure with lots of cash powering those servers and workers. The rest of the pack don’t always fare so well. When we start to break down the wider field of social media sites, we see a lot of smaller players with large amounts of downtime.

General Takeaways

DDoS a Growing Concern

If you have been following these reports, you will see that DDoS attacks have grown over the period and are expected to reach 15 million by 2023. Businesses that are not investing in mitigation are, statistically speaking, just biding their time before they are attacked.

30% Already Have 100 or More Hours of Downtime

COVID and the general stress of work-from-home conditions have taxed the web. Based on the Alexa top sites we survey, 29.4% have more than 100 hours of downtime, a worrying sign as we move into the latter half of 2020.
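To put 100 hours in perspective, it can be converted into an uptime percentage. The Python sketch below assumes the reporting period runs from 1 January to 30 June 2020 inclusive (182 days); the report gives only "Q1 + Q2", so the exact dates and the function names are assumptions.

```python
from datetime import date, timedelta

# Assumed reporting period: 1 January 2020 up to, but not including, 1 July 2020.
PERIOD_START = date(2020, 1, 1)
PERIOD_END = date(2020, 7, 1)
PERIOD = PERIOD_END - PERIOD_START  # 182 days, since 2020 is a leap year


def uptime_percent(downtime, period=PERIOD):
    """Uptime as a percentage of the period, given total downtime."""
    return 100.0 * (1 - downtime / period)


def downtime_allowed(uptime_pct, period=PERIOD):
    """Downtime that fits within an uptime target over the period."""
    return period * (1 - uptime_pct / 100.0)
```

Under that assumption, `uptime_percent(timedelta(hours=100))` comes to about 97.71%, while a "three nines" target of 99.9% leaves a budget of roughly 4.4 hours for the whole half-year.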

The Best of the Best Make Up Less than 15% of the Population

The best of the best make up only 13% of the sites we survey. Many services either aren’t taking downtime seriously or don’t have the tools on hand to deal with a major outage. Some social sites we observed, for instance, just have a tough time staying online, with high numbers of outages.

If your business had to call someone else to get the site back online this year, that is a good sign you need monitoring.

Conclusion

Okay, it’s not been a typical year so far. That’s a given. Maybe ‘unplanned downtime’ has a new context for many of us, and right now nobody knows how the global economy is going to shake out. But one thing that seems a good bet is that we’re all going to be doing more stuff online from now on, both as consumers and as businesses.

That means the reliability of online services is more critical than ever, and more closely tied to the bottom line. We should be asking ourselves some key questions:

  • How much downtime do we have, and how many outages?
  • Is that figure getting better, or worse? Why?
  • How much does downtime cost us?
  • How does it affect our brand and reputation?
  • What can we do about it?

This report may give you a little help as you start to try to answer those questions for your own business. Keep an eye on the Uptime.com blog for regular monthly updates on outages and site reliability news, and there’ll be more data-driven reports to come. Maybe the clouds of dark matter hanging over 2020 will start to clear in the next few months. We sure hope so. Stay safe.

John Arundel, principal consultant at Bitfield Consulting, is a well-known expert on DevOps, infrastructure, Kubernetes, Puppet, and the Go programming language. He has been writing software for 40 years, managing Unix systems for nearly three decades, and working on infrastructure from nuclear power stations to Netflix since he was knee-high to a login prompt. He is the author of several technical books, most recently Cloud Native DevOps with Kubernetes. When not writing or consulting, he tweets as @bitfield.
