April 2020 Outage Report
We will always remember April 2020 as the month that a DDoS attack took the world’s most expensive bottle of whiskey offline. We barely knew ye.
— Uptime.com (@uptimemon) May 13, 2020
But other notable outages taught us a lot about which threats dominate our landscape, namely DDoS attacks, which are exposing weaknesses in how organizations handle redundancy and threat mitigation. You might know something is down, but what can you do about it?
First, we will dive into our industry reports for April, then we’ll discuss the notable outages and takeaways for devops. Let’s get to work.
Industry Reports for April
Financial Sector Suffers Equivalent of 3 Days of Downtime
The financial sector suffered the equivalent of 3 days of downtime, or roughly 10% of the month of April, across a total of 82 outages. Average response time was 1.24 seconds, indicating banks were generally serving users quickly and efficiently throughout the period, though the strain was overwhelming at times.
eCommerce Major Players: Near Perfect
eCommerce major players had an uptime percentage of 99.95%, with a total of just 17 hours of accumulated downtime. That is an excellent goal to shoot for, and admirable considering the strain these sites have come under as more buyers have shifted to online orders. However, the higher response time of 1.54 seconds indicates this resilience does come at a cost.
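To put figures like "3 days" and "99.95%" side by side, it helps to convert between downtime and uptime percentage. The sketch below is a minimal illustration, assuming a 30-day month; the function names are ours and not part of any Uptime.com tooling. Note that accumulated downtime in a report like this is presumably summed across every monitored site, so it will not match the allowance for a single site.

```python
HOURS_PER_DAY = 24


def downtime_share(downtime_hours: float, period_days: int = 30) -> float:
    """Return downtime as a percentage of the whole period."""
    return 100.0 * downtime_hours / (period_days * HOURS_PER_DAY)


def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return how many minutes one site can be down and still hit uptime_percent."""
    period_minutes = period_days * HOURS_PER_DAY * 60
    return (1.0 - uptime_percent / 100.0) * period_minutes


if __name__ == "__main__":
    # 3 days (72 hours) of downtime measured against a 30-day April.
    print(f"{downtime_share(72):.1f}% of the month")
    # Downtime a single site could absorb at 99.95% uptime.
    print(f"{allowed_downtime_minutes(99.95):.1f} minutes")
```

Run as a script, this prints 10.0% for the 3-day figure and 21.6 minutes for a single site targeting 99.95% over 30 days.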
Social Looks Strong in April
Social media had our highest uptime percentage for the month at 99.99%. It’s important to note that social media is a complicated beast, where disruptions are frequent and parts of a service can fail while others remain online. However, there is a takeaway here: resilience and components are linked. Building your applications so that one failing service can’t take others down with it pays off.
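One common way to keep a failing component from dragging others down is a circuit breaker. The report does not say how any social platform does this, so treat the following as a minimal sketch of the general pattern: the `CircuitBreaker` class, its thresholds, and the fallback value are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, func: Callable[[], T], fallback: T) -> T:
        if self.is_open():
            return fallback  # skip the dependency entirely while open
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0  # any success resets the count
        return result
```

A page that wraps a recommendations call in `breaker.call(fetch_recommendations, fallback=[])` keeps rendering with an empty list while the recommendations service is down, instead of failing along with it. Here `fetch_recommendations` stands in for whatever function calls your dependency.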
Uptime doesn’t matter if your service is unavailable, unreachable or returning the wrong responses 😬
— Jamie Allen (@jamie_allen) May 13, 2020
Spanish Television Series Causes 47% Spike in Nokia Network Traffic
The fourth season of the popular La Casa De Papel series, which translates to “Money Heist” in English, was released on Friday 4th. Nokia reported a 47% peak traffic increase compared to the Friday a week before. Netflix streams were up 36% compared to the previous Friday as well. The company was able to bump up the average bitrate, increasing quality by as much as 11%. This was huge considering that Netflix itself lowered streaming quality to allow for uninterrupted access.
Nokia also said the volume of DDoS traffic has risen 40% above pre-pandemic levels. “We noticed a steady increase in the overall volume of DDoS traffic – with amounts exceeding the pre-pandemic levels by 40%,” said Craig Labovitz, CTO of Nokia Deepfield. “This increase may be related to the significant rise in gaming-related DDoS attacks; we continue to investigate this issue – so more to come on this topic.”
Gmail Service Disruption for Almost 60 Minutes
Gmail went down for almost an hour, but worldwide productivity did not come grinding to a halt. The outage was centered primarily in the ET and CT time zones (New York, Chicago, etc.), and the disruption did not fully stop service. User reports include:
- Unable to receive messages
- Unable to log in
- Unable to fully render the website
Google’s response was swift, and we can glean two primary takeaways from the incident:
- Respond to Users: Don’t be shy. Acknowledge the outage and let users know you are working on it. That’s all they want to know. Keep that as your primary focus: don’t explain, just apologize and reassure.
- Implement Planning: You did practice those gameday exercises we wrote about, right? Now is the time to put that hard work in planning to use. Your war-ready devops team can respond to anything if you take the time to train them. Time to respond means everything in SLA fulfillment.
Snapchat Has Provided a Glimpse into COVID-19
Who has a tougher job right now than Snapchat devops? Snapchat saw an additional 11 million daily active users in the first quarter of 2020. People are using the service to connect in these stay-at-home conditions, and the company’s support Twitter provides a unique glimpse into the effect that has had on the team and its service infrastructure:
This is just two days’ worth of service interruptions that are still ongoing.
When demand is reaching unprecedented levels, and service disruptions become more frequent, you need to consider two strategies moving forward. The first is to improve the time it takes to diagnose and resolve the issue, which requires automated monitoring at every level. Your internal processes can provide insights about failures your customer-facing processes are experiencing.
Striving to increase observability is a net positive for your organization. Like a commander over a battlefield, you benefit most when you can see what is coming and prepare your team to face it. You can better communicate issues as they unfold and you can direct the right people to fix the services in question.
The second is to re-evaluate deployments. It is a marvel that Snapchat has continued to release new features even as the service faces unprecedented traffic levels. That said, Service Level Agreements typically require that you stop and audit yourself in the face of major events. Build some error budgeting into your SLA so you are continually evaluating your development pipeline.
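As a rough illustration of error budgeting, the sketch below turns an uptime target into a downtime budget and flags when deployments should pause. The 99.9% target, the 25% reserve, and the `ErrorBudget` name are assumptions chosen for the example, not figures from any SLA mentioned here.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_percent: float     # hypothetical uptime target, e.g. 99.9
    period_minutes: float  # length of the measurement window in minutes

    @property
    def budget_minutes(self) -> float:
        """Total downtime the target allows over the window."""
        return (1.0 - self.slo_percent / 100.0) * self.period_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime already recorded."""
        return self.budget_minutes - downtime_minutes

    def should_freeze_deploys(self, downtime_minutes: float,
                              reserve: float = 0.25) -> bool:
        """Freeze releases once less than `reserve` of the budget is left."""
        return self.remaining_minutes(downtime_minutes) < reserve * self.budget_minutes


if __name__ == "__main__":
    budget = ErrorBudget(slo_percent=99.9, period_minutes=30 * 24 * 60)
    print(f"budget: {budget.budget_minutes:.1f} minutes")
    print("freeze after 40 minutes down:", budget.should_freeze_deploys(40))
```

The point of the reserve is that feature releases stop before the budget is fully spent, which leaves room to absorb an incident you did not cause.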
Quibi Proves You Can Actually Get What You Pay For
No doubt you have seen Quibi’s ads by now (or you will after reading this, we’re sure). The company has gone full speed ahead with its marketing push, which bought the service exactly what it wanted: an influx of users.
Quibi is a video service specifically designed for mobile devices. Videos are short, typically under ten minutes, and the focus is on-the-go viewing. It has rekindled cult favorites like Reno 911 in service of its launch, but the application went down for about an hour on its launch day.
This is really a question of PR more than reliability. We all know streaming services mostly work, and that’s part of the reason most of us are subscribed to so many. There’s always something to watch, and if one goes down, well, there are other channels, aren’t there?
So Quibi’s launch foibles (since we’re using creative language) are really about egg on the face. The service wanted to position itself as the next big thing, just like Disney+ and HBO and so many other streaming services that started the race with a face plant.
When faced with these outages, the most important element for devops is to remain calm. Everyone will be screaming from every direction because launch time is important for the financial people and the team as a whole. Your job is a simple one: make it work.
So go back to basics: what’s working and what’s not. If you don’t have visibility on your systems, and you still have a job, let this be your first important life lesson.
April Ends with Outage of Virgin Media
Virgin Media suffered an intermittent outage that caused service disruptions for multiple users. The disruption occurred for users at the same time, as reported by Express, and Virgin acted quickly to resolve the issue.
Brief outages like this are big problems when your entire business relies on that last set of cables beaming your products and the user’s money back and forth across the intertubes.
A ten-minute outage for someone watching Netflix has a very different effect than a ten-minute outage at a hospital.
These outages highlight a need for businesses to consider multiple service providers, and other forms of redundancy. Is your business worldwide? Do you have servers in each country that can quickly and reliably deliver content?
What about third-party services your business relies on? Do you know when your mail provider or your code repository goes down? Do you have true visibility on your infrastructure?
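A first step toward that visibility can be as small as a script that polls the services you depend on. The sketch below uses only the Python standard library; the dependency names and URLs are placeholders, and a hosted monitoring service would add scheduling, alerting, and checks from multiple locations on top of this.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, and so on


def check_dependencies(urls: Dict[str, str],
                       fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each dependency name to True when its status code is 2xx or 3xx."""
    return {name: 200 <= fetch(url) < 400 for name, url in urls.items()}


if __name__ == "__main__":
    # Placeholder endpoints: replace with the health URLs of your own providers.
    dependencies = {
        "mail": "https://mail.example.com/health",
        "code-repo": "https://git.example.com/health",
    }
    for name, healthy in check_dependencies(dependencies).items():
        print(f"{name}: {'up' if healthy else 'DOWN'}")
```

The `fetch` parameter exists so the check can be exercised with a stub in tests, without touching the network.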
It’s Down, So What?
There are only so many outcomes when a service goes down:
- You can fix it fast
- It can fix itself
- You can’t fix it and you have to wait for someone else to do it
- You can migrate
What you can’t do is throw your arms up and give up. Some outages we tracked this month no doubt felt to devops like there was just nothing that could save them from the onslaught. Those poor Snapchat engineers. But devops is about inventing survivability, and that’s why we signed up for the job.