January 2020 Outage Report
Welcome to 2020, where Google Drive can fail for some of you but not others, you can’t access your passwords, and you can’t withdraw cash on vacation. This stranded-on-a-desert-isle dream was reality in the month of January, which saw drama in the financial services and internet infrastructure sectors. January’s downtime reinforces just how connected we have become, and how reliant we are on infrastructure that can seemingly fail on a whim.
These outages also underscore what SREs already know: everything is a potential threat to uptime. Any small change, planned event, surge in traffic, or byte consumed holds the potential for chaos.
One of our lingering questions this year is what uptime means in the age of the cloud. When redundancy is built into everything we do, what does downtime really mean? The answer is deceptively simple: downtime affects our clients and/or interrupts our work. But what is an interruption when our job is the maintenance of–well, everything?
This month, we’ve teamed up with our resident SRE expert, John Arundel, to develop key takeaways for what to do when you feel stranded in this interconnected system of internet tubes we’ve built.
Google Drive Went Down for Some
Google never goes down, until it does. An issue with the tech giant’s cloud storage service caused widespread outages for users of Google Drive, Docs, Sheets, Slides, and other products.
Me right now because Google Drive is down and I have work to do pic.twitter.com/SFQae2hvln
— emma benson (@Bensonville18) January 27, 2020
Google announced the problem on its G Suite Status Dashboard, saying that it was investigating reports of an issue that was preventing users from accessing Google Drive.
Although the problem was resolved for most users within an hour or so, Google has so far made no announcement about the root cause of the issue, saying only “Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.” However, many organisations who rely on the cloud file storage service lost at least an hour’s work:
Remote workers when google drive is down: pic.twitter.com/I4RuljqfsN
— Kevin Crawford (@K_Crawf) January 27, 2020
Uptime.com’s in-house site reliability consultant, John Arundel, explains:
“Big firms like Google use a common infrastructure for file storage across all their services, so Google Drive, Gmail, Photos, and so on, all use the same underlying platform. It’s essentially a giant database: what’s called a ‘blob storage system’. Not gelatinous monsters from outer space, but ‘binary large objects’—chunks of data to you and me. Your vacation pictures, my meeting agenda, and your boss’s emails are all blobs in Google’s database.”
So when an outage happens, it can affect multiple services, and this is one reason that Google segments its data into multiple independent shards, or regions. This is designed to limit the blast radius from service outages. Arundel says that this is why Google Docs could be fine for you, but down for the person at the next desk. “Segmenting infrastructure is a good idea, and you don’t need to be Google to do it. The less tightly-coupled the different parts of your system, the less likely a problem with any individual component is to affect the others.”
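To make the blast-radius idea concrete, here is a minimal sketch of how users might be assigned to independent shards. It is an illustration only, not a description of Google's actual design: the shard count, the function names `shard_for` and `affected_users`, and the choice of a SHA-256 hash are all our own assumptions.

```python
import hashlib

# Hypothetical shard count, chosen only for illustration.
NUM_SHARDS = 8


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user to a shard using a stable hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it
    # to a shard index. The same user always lands on the same shard.
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_users(users, failed_shard: int, num_shards: int = NUM_SHARDS):
    """Return only the users whose data lives on the failed shard."""
    return [u for u in users if shard_for(u, num_shards) == failed_shard]


if __name__ == "__main__":
    users = [f"user{i}@example.com" for i in range(1000)]
    down = affected_users(users, failed_shard=0)
    print(f"{len(down)} of {len(users)} users are on the failed shard")
```

With a layout like this, a failure in one shard touches only the users mapped to it, which is why two people at neighboring desks can see different results during the same incident.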
LastPass Outage: Website and Chrome Extension Go Down
The LastPass login page went down for three days before the company first acknowledged a downtime event, creating some frustrating inconsistencies in the user experience. As usual, social media was the recipient of venting (and a critical channel for information about the outage).
On January 19th, at 5:14 PM UTC, the company finally acknowledged the outage with a tweet and a status update.
We are aware of and actively investigating reports from some LastPass customers who are experiencing issues and receiving errors when attempting to log in. At this time no service issues have been identified.
— LastPass Status (@LastPassStatus) January 20, 2020
The outage was resolved within three days. LastPass reported: “After a thorough investigation, we’ve identified and resolved the login errors caused by a bug in a recent release impacting a small set of users. This has been resolved and all services are now functional.”
Enterprise operations should care about their status page, both to showcase good numbers and to improve the not-so-great ones. When one of your services goes down, users will search for your status page first. If you don’t provide a clear incident report that explains the downtime, their next search will take them somewhere to vent about your site reliability.
When is the best time to alert users of downtime? When the outage affects the user experience. Even when priority one in-house is fixing the issue, user transparency turns your potential naysayers into product evangelists. Reporting on an outage should be built into your incident management protocol. Everyone understands mistakes happen, and can plan around them. The cardinal sin is failing to let them know.
98.12% – you have a few problems
99.65% – you have no major problems
100.00% – your monitoring is broken
— John Arundel (@bitfield) March 4, 2015
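To put numbers like these in perspective, it can help to convert an uptime percentage into the downtime it implies. The following is a minimal sketch; the function name `downtime_minutes` and the 30-day reporting period are our own assumptions for illustration, not something stated in the tweet.

```python
def downtime_minutes(uptime_percent: float, period_days: float = 30) -> float:
    """Return the minutes of downtime implied by an uptime percentage.

    period_days is an assumed reporting window (30 days by default).
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for pct in (98.12, 99.65, 100.00):
        print(f"{pct:.2f}% uptime -> {downtime_minutes(pct):.1f} minutes down")
```

Under these assumptions, 98.12% uptime over 30 days works out to roughly 812 minutes (about 13.5 hours) of downtime, while 99.65% is roughly 151 minutes (about 2.5 hours).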
Disaster struck again when the LastPass Chrome extension disappeared from the Chrome Web Store. Apparently, it was accidentally removed by the LastPass team. Web login had already been restored by then, so users had some recourse.
The LastPass extension in the Chrome Web Store was accidentally removed by us and we are working with the Google team to restore it ASAP. You can still access your Vault by signing in on our website. Thank you for your understanding and patience in the meantime.
— LastPass Status (@LastPassStatus) January 22, 2020
Even simple inconveniences that add an extra step to a user’s work add up to lost productivity. It’s a good idea to build some redundancy into your operations when you rely on password managers, intranets, and single sign-on (SSO) services. Make sure users have backup access to passwords hosted within your network, ideally in a location that will be unaffected by routine outages—and, of course, with appropriate security.
Local password storage is almost never a good idea, and there may be security concerns using browser-based storage (as good and diverse as the options are). Security is about secure access as much as preventing the wrong hands from finding your data. Make sure you’re not locked out of the services you rely on.
Take a minute to think about the online services you and your team use every day. Would an outage or data loss event on one of them stop you from getting work done, or affect your own customers? Are there actions you can take now to avoid this? For example, you could maintain backup copies of critical information, or provide manual workarounds for automated actions. Does everybody know how to find this information and these backup procedures? Run an exercise to find out, simulating an outage of one of your critical vendors, and see what lessons you can learn from it as a team.
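A small version of that exercise can even be rehearsed in code. The sketch below is illustrative only: `fetch_with_fallback`, `vendor_down`, and `local_copy` are hypothetical names, and the simulated outage simply raises Python's built-in `ConnectionError` in place of a real vendor call.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def fetch_with_fallback(
    primary: Callable[[], T], backup: Callable[[], T]
) -> Tuple[T, str]:
    """Try the primary source first; use the backup if it is unreachable.

    Returns the value plus a label saying which source answered, so the
    team can see when they are working from the backup copy.
    """
    try:
        return primary(), "primary"
    except (ConnectionError, TimeoutError):
        return backup(), "backup"


def vendor_down() -> str:
    """Hypothetical stand-in for a critical vendor during an outage."""
    raise ConnectionError("simulated vendor outage")


def local_copy() -> str:
    """Hypothetical stand-in for a backup copy kept outside the vendor."""
    return "runbook (local backup copy)"


if __name__ == "__main__":
    value, source = fetch_with_fallback(vendor_down, local_copy)
    print(f"Served from {source}: {value}")
```

The point of a drill like this is less the code than the question it forces: does a backup path exist, and does everyone know when they are on it?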
Travelex Taken Down by Malware
On New Year’s Eve, travelers hoping to withdraw a bit of extra cash for a night on the town to ring in the new year were met with error messages. Currency exchange firm Travelex appeared to suffer a hit by the Sodinokibi ransomware, also known as REvil.
With user data at stake, Travelex made the difficult decision to take its service offline. This choice had a ripple effect on financial services that failed when Travelex went down. Some stores were able to process transactions, but the website and mobile apps were completely unreachable.
To compound this challenging situation, some evidence suggests that Travelex was notified of a vulnerability on September 13, 2019:
We notified Travelex about their vulnerable Pulse Secure VPN servers on September 13, 2019.
No response. pic.twitter.com/lCjk7IY3OM
— Bad Packets Report (@bad_packets) January 4, 2020
And another security expert suggests the company was not properly equipped to receive incident notifications:
Day 6 of Travelex woes. Nobody has any kind of cybersecurity incident notification. pic.twitter.com/1qBGCP4jRt
— Kevin Beaumont (@GossiTheDog) January 6, 2020
In spite of this catalogue of (avoidable) catastrophe, Travelex may have made the right decision given the scope of the outage. When things are going really wrong, sometimes the best action you can take is to pull the entire operation offline, as painful as that may be. This is especially true when there’s potential for a user data breach.
“When you’re in downtime, you have everyone’s attention,” adds consultant John Arundel. “During an incident there can be huge pressure from management, the business, customers, and the media to get back online, no matter what. But sometimes that can make things worse. If you start flailing around, changing things at random to try to make them work, you can turn a routine outage into an extinction-level event.”
Instead, go dark. Quit Slack and put your phone in airplane mode. Get the key people in a room, and use whatever resources you have to protect that team from all outside pressure until they’ve figured out what’s really causing the issue, and come up with a workable plan to resolve it. “Until you truly understand what’s going on,” adds Arundel, “the best thing you can do is nothing. This goes double for security incidents.”
There’s a lesson as well in how Travelex brought its services back online, focusing on internal and processing services first before rolling out customer-facing fixes. The company granted refunds “where appropriate”, and contacted users directly to arrange for alternative methods to retrieve money.
Protecting the end user’s information doesn’t just make sense from a liability standpoint; building that trust in your user base is worth the effort of complying with the often tight regulations governing business on the internet.
Other Notable Outage News
There’s a dark side to competition, and this month it reared its ugly head. Tucker Preston, just 22 years old and residing in Georgia, admitted to hiring criminals in an effort to carry out cyber-attacks against others. The only thing we know about the victim is that it was a business that operated servers in New Jersey.
In another DDoS-for-hire story, the attacks targeted Ubisoft in September of 2019. A DDoS attack cripples the multiplayer gaming experience and crashes servers under the stress. The defendants here ran multiple services offering lifetime memberships to players looking to engage in DDoS activity for the purposes of disabling or crippling the game experience. The company sued in January and is seeking damages and fees, as well as the complete shutdown of these malicious sites.
Perhaps the clearest example of unanticipated traffic since Black Friday was the NCAA National Title Game, which completely crashed the ESPN app for many frustrated viewers. For perspective, more than 6.6 million viewers subscribe to ESPN+.
It’s a wonder Twitter didn’t crash under the volume of users angry-tweeting about the issue.
That’s a wrap for January’s downtime events. We’ll see you again in March for another breakdown of outages and key takeaways. What did you learn in January?