Internet Downtime in Q3 2020 | The Uptime.com Report
We once asked whether it was possible to remain both incompetent and in business. We concluded that while incompetency is not a desirable trait, it is surprisingly possible to remain in business for a fairly short while. What we have learned in Q3 is that complacency allows internet downtime to fester.
Our Destimate for Q3 of 2020 is fairly high. We expect that the average business with monitoring and minimal IT resources stands to lose 17.4% of revenue on the conservative side, and as much as 22.35%. We primarily considered downtime events that occurred outside the organization’s control. Additional losses are possible if the organization is not investing in automation, DDoS mitigation, and IT resources.
Because the cost of attacks is low, and internet usage and reliance are high, nearly every business with a significant market presence is at risk for an unplanned downtime event outside its control.
Response time can mean a difference of hours, or even a day’s worth, of downtime in a quarter. With some industries seeing 10+ days of downtime, Uptime.com recommends an audit of your infrastructure and necessary upgrades as soon as possible.
The optimistic among us might call this an adjustment period, but it is starting to seem as though stronger infrastructure and an emphasis on uptime are becoming the new normal.
Whereas Health websites felt the pain in Q2, Business, eCommerce, and Banking bore the brunt in Q3. Banking suffered a staggering 79+ days of downtime, the most of any industry we survey. To give an idea of scope, the best in the industries we surveyed accounted for only 0.28% of all outages, and only 0.0007% of actual downtime hours.
With more than 287 days of downtime recorded for Q3 alone, it’s safe to say that we have felt the effects of COVID and businesses will need to deal with this issue of usage in some form or another.
Not only is our usage of the Internet changing in response to lockdowns and COVID life, but previously low-traffic time periods for the web have also changed dramatically. For DevOps, this means constant vigilance over the third-party services you rely upon, and more direct oversight of available resources. Prompt alerting at any time of day will be critical to your daily operations.
Consider what happens if your team’s primary communication channel goes down. If an outage occurs during that time period, how long will it take for that alert to get to someone who can do something about it? Distributed workforces are also complicating this issue. It’s great we all work from home, but a communication outage amplifies the feeling of isolation for everyone involved. It’s difficult for DevOps to coordinate over video if video is broken, or if a natural disaster has reduced service.
COVID Has Completely Changed Network Outages, Monitoring, and Even Hacking
It’s fairly safe to say much of what we know about incident response will be turned on its head over the next year.
We have also seen that as costly as security can be, not having it is much worse. We reported earlier this year that DDoS attacks were on the rise. With those increases come the added threats of extortion and DDoS for hire.
What makes DDoS the tool of choice? Deterrence for DDoS attacks essentially does not exist. The attacks are distributed by nature, making the source difficult to pinpoint without significant investment of resources. That’s why we see large companies fighting back, while smaller companies more or less suffer the slings or try their best to adapt.
It is cheap and profitable to level multiplayer gaming servers, eCommerce platforms, or generally any business the hacker is determined to take down. Sustained attacks are very cost efficient. So much so that hackers are finding themselves in the financial position to outsource their work.
Who is being targeted? We mentioned that in Q2 health suffered. In Q3, finance and education are big targets. IT needs to remain vigilant for security risks and stay on top of infrastructure upgrades. Scale up before you need to if you want to stay ahead.
Big Outages Happen to Other Providers You Don’t Control
We all know that monitoring is helpful as a first-response mechanism, but we don’t often think about it as a function of customer service. This quarter offered some teachable moments in the form of a major Cloudflare outage.
The downtime happened at the end of August, and you likely felt it in some form or another. A third-party transit provider, later identified as CenturyLink, went down, taking half of the internet with it.
Cloudflare is set up to avoid this scenario. So what happened?
Simply put, there are faults you can tolerate and those you cannot. When a major provider goes down, you can get swept up in that outage.
You pretty much have two options: move to a backup provider (if one is available), or wait out the outage. Most likely you’re stuck with the second option, especially if the outage was on the massive scale of 2020’s Cloudflare outage. All the backups in the world can’t help if you have nowhere to host them and deliver to your users.
Having an early warning system can signal the extent of the problem if you can track the timeline of an outage. If a customer calls in to report the issue before you’re aware of it, you have already lost time. How long was it down before you got the ticket?
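To make that last question concrete, here is a minimal sketch of how the gap could be measured, assuming you can export timestamped check results from your monitoring tool and know when the first customer ticket arrived. The names CheckResult, first_failure, and detection_lag are hypothetical and are not part of any Uptime.com API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class CheckResult:
    """One monitoring probe result (hypothetical structure)."""
    at: datetime
    ok: bool


def first_failure(results: Sequence[CheckResult]) -> Optional[datetime]:
    """Return the timestamp of the earliest failed check, or None if all passed."""
    failures = [r.at for r in results if not r.ok]
    return min(failures) if failures else None


def detection_lag(results: Sequence[CheckResult], ticket_at: datetime) -> Optional[timedelta]:
    """Time between the first failed check and the customer ticket.

    Returns None when no check failed, which suggests the checks
    did not cover whatever the customer ran into.
    """
    started = first_failure(results)
    if started is None:
        return None
    return ticket_at - started
```

If checks started failing at 09:00 and the ticket arrived at 09:12, detection_lag returns 12 minutes: the time you would have gained had the alert reached someone first.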
And if you do have a backup provider, flipping that switch takes time. Anywhere from minutes to hours. If you have early detection, you can make a data-informed judgment call about making that move.
Automation As a Means to Counter Big Downtime Events
What’s faster than instructions via runbook? Automating your response, or even automating changes in status. We find that removing human intervention is not only about shaving off response time, but reducing human error and adding a layer of credibility to your outage.
If your system can reboot itself when a problem is detected, and the problem is still reported afterward, you know the initial steps have already been taken.
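A rough sketch of that restart-then-escalate flow is below. It assumes you supply your own health check, restart, and escalation functions; the name remediate_then_escalate and its parameters are illustrative, not a real product API.

```python
from typing import Callable


def remediate_then_escalate(
    check: Callable[[], bool],
    restart: Callable[[], None],
    escalate: Callable[[str], None],
    max_restarts: int = 1,
) -> str:
    """Run a health check, restart on failure, escalate if still failing.

    Returns "healthy", "recovered", or "escalated".
    All three callables are supplied by the caller (hypothetical hooks).
    """
    if check():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if check():
            return "recovered"
    # A human is paged only after the automated first steps have been tried.
    escalate(f"still failing after {max_restarts} automated restart(s)")
    return "escalated"
```

Whoever receives the escalation already knows a restart did not help, so they can skip straight to the next step in the runbook.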
Additionally, monitoring offers first alerting capabilities. UiPath noted in a recent post mortem:
“In this case, the synthetic tests caught the problem first, creating an alert less than one minute after the maintenance window started. These tests run continuously and simulate common user activity to ensure that UiPath is available from the public internet.”
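As an illustration of the idea (not UiPath's or Uptime.com's actual implementation), a synthetic test can be thought of as an ordered list of user actions that stops at the first failure. In the sketch below, each step is a plain function that raises an exception when it fails; Step, SyntheticResult, and run_synthetic_check are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A step is a label plus a function that raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


@dataclass
class SyntheticResult:
    ok: bool
    failed_step: Optional[str]
    error: Optional[str]
    elapsed_seconds: float


def run_synthetic_check(steps: List[Step]) -> SyntheticResult:
    """Run user-like steps in order and stop at the first one that fails."""
    start = time.monotonic()
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Record which user action broke, so the alert is specific.
            return SyntheticResult(False, name, str(exc), time.monotonic() - start)
    return SyntheticResult(True, None, None, time.monotonic() - start)
```

In practice the steps would be real requests or browser actions (load the home page, log in, add an item to the cart); run on a schedule, the first failing step tells you where users are getting stuck.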
If a breach happens, you need to shut it down 5 minutes ago. You have no idea how long attackers have had access to your systems, and outages might only be the tip of the iceberg. If that breach is detected at 4 AM, do you really think you have time to rouse someone out of bed to flip an off switch?
Automation requires its own setup costs, but thinking about systems you can automate today will help bake this process into your everyday alert handling. With dedicated effort, most issues can be resolved before you ever hear about them.
Looking Ahead for Q4 Internet Downtime
What can businesses do to prevent these downtime events from wreaking havoc on the bottom line? If you’re reading this report, it’s a good start. The users we have surveyed have reported taking the following steps:
Add More Checks
Users are adding more checks, and more complexity to those checks. Our users see the most benefit from checks that mimic user actions closely.
Escalations and Alert Audits
Our users report that auditing their alert system ensures data arrives promptly during an outage, that escalations require minimal human intervention, and that only the intended user receives the intended alert.
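One way to audit an escalation policy is to write it down as data and ask who would be notified after a given number of minutes without acknowledgment. This is a minimal sketch under that assumption; Tier and contacts_to_notify are hypothetical names, and the contacts are placeholders.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Tier:
    """Contacts to notify once an alert has gone unacknowledged this long."""
    after_minutes: int
    contacts: Sequence[str]


def contacts_to_notify(policy: Sequence[Tier], minutes_unacknowledged: int) -> List[str]:
    """Return every contact whose tier has been reached, in order, without duplicates."""
    notified: List[str] = []
    for tier in sorted(policy, key=lambda t: t.after_minutes):
        if minutes_unacknowledged >= tier.after_minutes:
            for contact in tier.contacts:
                if contact not in notified:
                    notified.append(contact)
    return notified


# Example policy with placeholder contacts.
policy = [
    Tier(0, ["on-call"]),
    Tier(15, ["team-lead"]),
    Tier(30, ["on-call", "cto"]),
]
```

Walking the policy at 0, 20, and 45 minutes shows exactly who gets paged at each stage, which makes it easy to spot a tier that notifies the wrong person or nobody at all.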
Public Status Pages
Status pages offer a level of transparency that can help businesses get ahead of an outage. Providing incident updates to end users builds trust. Users know you are down, so it’s OK to talk about why while protecting your in-house operations.
Alert Analysis and Runbooks
Work on analysis, create runbooks and keep records. Data is king in these times because every outage is a learning experience. Frequent outages over a sustained period? You may need a new provider. Something broken? Who can fix it and what steps did they take? The more you can document, the more specific steps you can point to when disaster strikes again. Lightning always strikes twice in this business.
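As a small example of what that record keeping enables, the sketch below totals outage count and downtime per provider from a list of outage records. The Outage structure and summarize_by_provider function are hypothetical; real records would come from your own incident log or monitoring exports.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Outage:
    """One recorded outage (hypothetical structure)."""
    provider: str
    start: datetime
    end: datetime


def summarize_by_provider(outages: Iterable[Outage]) -> Dict[str, Tuple[int, timedelta]]:
    """Map each provider to (number of outages, total downtime)."""
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, timedelta] = defaultdict(timedelta)
    for outage in outages:
        counts[outage.provider] += 1
        totals[outage.provider] += outage.end - outage.start
    return {provider: (counts[provider], totals[provider]) for provider in counts}
```

A provider that keeps showing up with a high count quarter after quarter is the one to question first.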
This next quarter, you are warned: there will be outages outside of your control. You will feel them. Now, what are you going to do about it?