March 2020 Outage Report
It’s pretty safe to say that March was the month where everything changed for most of us. By now, enough has been said on coronavirus and we need not add to the pile. Our concern remains continuous uptime, and reporting on outages as teachable moments.
During this time of heightened tensions, let’s take a few moments to do some post mortem work and see what we can learn from March’s outages.
This crisis will improve our infrastructure and drive completely new ways of commerce and daily life. How are we going to build and maintain that infrastructure?
This month’s outages showed us the importance of transparency in downtime, as well as the downtime cost of sudden surges in traffic for reasons outside our control. Let’s dive into stats from March 2020.
Uptime.com and COVID-19
First, a look at how Uptime.com is faring. Overall, we’ve seen an increase in requests that have taken the form of alerts and additional data. We have added some redundancy to cope with increased loads.
We also track more than 6,000 websites from Alexa’s Top 10k list, where we have noted an increase of roughly 10% in alerts in March alone.
Internally, our users have seen major surges in traffic across food and health sectors.
Massive Robinhood Outage | Down for 7 Days
Robinhood, the stock trading app popular with investors of all skill levels, went down on a particularly bad day: one on which the Dow saw its biggest gain since 2009. The company was also slow to acknowledge and report on causes for the outage, issuing a vague statement along the lines of:
Our systems are currently experiencing downtime. We’re determined to restore full functionality as soon as possible. We’ll be sharing updates here and on https://t.co/ZS733Gooqj.
— Robinhood Help (@AskRobinhood) March 3, 2020
Conspiracy theories abound, most based around the idea that the company made some flub in the code regarding a leap year. For its part, Robinhood, which has raised $912 million in funding, said it will offer members three months of Robinhood Gold, a premium service for borrowing money and accessing additional research.
The company’s founders did attempt to explain the outage as a series of factors, including an unprecedented load on the system and highly volatile market conditions.
GlobalSign Certificate Outage
Some of you may have been affected by this outage at GlobalSign, which provides fully trusted certificates. GlobalSign suffered an OCSP server failure, and service was restored after ten hours. During that time, Uptime.com detected a number of SSL alerts related to these certificates, which underscores the value of automated certificate monitoring.
We helped a number of users navigate this outage with technical alerts on what was happening behind the scenes.
Understanding the nature of a certificate error is as important as identifying it as the root cause. The possible responses range from doing nothing to reconfiguring your servers, and deciding among them adds to the time it takes to respond to this kind of outage.
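To make the idea of telling one certificate error from another more concrete, here is a minimal Python sketch. It assumes a standard-library TLS handshake and maps the OpenSSL verify code to a first response. The `TRIAGE` table, the function names, and the suggested actions are hypothetical illustrations, not how Uptime.com classifies alerts. Note also that Python’s default TLS context does not query OCSP responders, so an OCSP failure like this one would not surface through this probe.

```python
import socket
import ssl
from typing import Tuple

# Hypothetical triage table: OpenSSL verify code -> (label, suggested first step).
# The suggested steps are illustrative assumptions, not official guidance.
TRIAGE = {
    10: ("expired", "renew and redeploy the certificate"),
    18: ("self-signed", "install a certificate from a trusted CA"),
    20: ("unknown issuer", "check that the server sends its intermediate chain"),
    23: ("revoked", "reissue the certificate"),
    62: ("hostname mismatch", "compare the certificate names with the DNS record"),
}


def triage_verify_code(code: int) -> Tuple[str, str]:
    """Map an OpenSSL verify code to a label and a suggested first step."""
    return TRIAGE.get(code, ("unclassified", "escalate for manual review"))


def probe(host: str, port: int = 443, timeout: float = 5.0) -> Tuple[str, str]:
    """Attempt a verified TLS handshake and triage any verification failure."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return ("ok", "no action needed")
    except ssl.SSLCertVerificationError as exc:
        return triage_verify_code(exc.verify_code)
```

Calling `probe("example.com")` performs a live handshake, while `triage_verify_code` can be exercised on its own with a known code.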
We are currently experiencing an outage in one of our data centres, which affects the following services:
– Issuance (GCC)
– OCSP (+ Microsoft OCSP)
– Timestamping services (non DSS)
Additionally, some customers may experience issues with intermittent HTTP status errors.
— GlobalSign Alerts (@GSSystemAlerts) March 11, 2020
Microsoft Teams Goes Down Under Work From Home Strain
The first major COVID-related downtime we tracked in March was a Microsoft Teams outage caused by the strain of so many people working from home. The service had been growing in adoption before this crisis, and Microsoft had promised free service to the NHS to support workers.
This outage follows last month’s SSL outage, but service was restored fairly quickly. The emphasis here is on time to respond. The more a user depends on your service, the more your SLA becomes a point of contention. When you’re in a supporting role, the severity of an outage from your customer’s perspective is proportional to their needs.
They don’t really care what the problem is. They want to know whether it’s up or down and, if it’s down, when it will be up again.
Staggering: Ransomware Predicted to Cost $20 Billion Worldwide
Security training company KnowBe4 has released a shocking report predicting that ransomware will cost businesses $20 billion globally in 2021. Of course, this highlights the need for security awareness and training, but it’s also important to keep infrastructure impact in mind.
Are you actively monitoring for changes to your services that are outside your control, such as unexpected content changes or missing page elements? In other words, are you looking at your website from the user’s perspective? More than just pinging a server to see if it’s up, you need to simulate user behavior, track performance, and look at multiple data points to help judge whether services are compromised.
When disaster does strike, you need to have a plan in place to restore service. Ransomware presents a particularly difficult problem to overcome without some form of backup capability. Layers of security and redundancy help improve the odds of surviving a ransomware attack with data intact and without paying for the privilege.
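As a rough illustration of checking a page from the user’s perspective rather than just pinging the server, the following Python sketch fetches a page and reports which expected pieces of content are absent. The function names and the marker strings are hypothetical, and a real check would likely drive a browser to simulate full user flows.

```python
from typing import Dict, List
from urllib.request import urlopen


def missing_markers(html: str, required: List[str]) -> List[str]:
    """Return the required strings that do not appear in the page (case-insensitive)."""
    lowered = html.lower()
    return [marker for marker in required if marker.lower() not in lowered]


def check_url(url: str, required: List[str], timeout: float = 10.0) -> Dict[str, object]:
    """Fetch a page and report its HTTP status plus any missing content markers."""
    # urlopen raises an exception on 4xx/5xx responses; callers should treat that as "down".
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return {"status": response.status, "missing": missing_markers(body, required)}
```

A page that returns a 200 status but is missing its checkout button, for example, would show up here as a non-empty `missing` list even though a simple ping would report it as up.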
GitHub’s Post Mortem on February’s Outage
GitHub updated the company blog with a diagnosis and post mortem of the incident we tracked in February. The outages were numerous, but never long lasting, and the company committed to improving oversight of services and optimizing code for stability.
We like this post mortem a lot. It takes the time to provide background and depth to the incident, showing us the state of the company before and a roadmap guiding future actions.
Any outage is less than ideal, but a thorough examination and some public accountability go a long way in building customer loyalty. These teachable moments also have the benefit of getting audiences to think about their own infrastructure.
Are you safe from surging traffic? You might consider allocating additional server resources, or building out some redundancy for services hardest hit in the surge.
Key Takeaways from March
February would have been a better time to prepare for this surge in traffic, but we’re in April now, so let’s audit our alert systems and make sure our important infrastructure is covered. Set escalations, if you have not already, so your higher support tiers respond when necessary. They likely have a lot on their plate just keeping the lights on and plugging holes where they find them.
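One way to picture an escalation policy is as a list of time thresholds, each of which brings in another support tier the longer an outage lasts. The Python sketch below assumes hypothetical tier names and thresholds; actual values depend on your team and your SLA.

```python
from typing import List, Tuple

# Hypothetical policy: (minutes of downtime before this tier is notified, tier name).
ESCALATION_POLICY: List[Tuple[int, str]] = [
    (0, "on-call engineer"),
    (15, "team lead"),
    (60, "engineering manager"),
]


def tiers_to_notify(
    minutes_down: float, policy: List[Tuple[int, str]] = ESCALATION_POLICY
) -> List[str]:
    """Return every tier whose threshold has been reached, lowest tier first."""
    return [tier for threshold, tier in sorted(policy) if minutes_down >= threshold]
```

With this policy, an outage that has lasted 20 minutes involves the on-call engineer and the team lead, while the engineering manager is only pulled in once the hour mark passes.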
These first two months will be the test. If we pass, the internet will collectively be more resilient than ever before.
Hang in there, and remember that Uptime.com has your back. We’re operating at full system strength and have already made adjustments for some of the surges we’ve outlined above. Meanwhile, we’re working on some big ticket items. Stay tuned.