December 2019 Outage Report
December was a busy month, with server downtime hitting systems we don’t normally see affected. Our first story, in particular, is an excellent example of how complicated monitoring can get as infrastructure grows.
We saw every level get hit, from the government to big-name players, with ransomware being one of the major thorns in our collective sides. But we also bring you the heartwarming tale of the little Minecraft Server that could.
As we wrap up January and look forward to a new year full of possibility, we should all have security firmly in our minds. If last year was any indication, managing credentials and access points will be of critical importance in reducing server downtime.
Read on to see December’s biggest outages.
Table of Contents
- Dexcom Outage Leaves Diabetic Patients in Limbo
- Yahoo! and AOL Outages
- Minecraft Trolls Take Down Vatican Priest’s Server
- Google’s GCP Appears to be at Fault in Discord Outage
- Eastern Band of Cherokee Indians Suffers Major Cyberattack
- Ransomware Cripples School District
- Microsoft Confirms Skype for Business Outage
- Echocash Suffers Major System Outage
- Virgin Media Outage Leaves UK With Limited Access
Dexcom Server Downtime Leaves Diabetic Patients in Limbo
Our big story this month is an outage on the Dexcom G6 system, which was not detecting drops in blood glucose levels. Dexcom’s wearable system allows patients to easily track their glucose levels. Alerts push to a mobile device to warn of levels that spike too high or low, but during the early morning hours Saturday, 11/30, the outage struck.
Several patients were angry, and multiple posts on social media and blogs warned users of the server downtime. Dexcom itself remained silent on the issue until the following Monday morning.
Dexcom said its servers became overloaded. The company had suffered an outage this past year around the same time, which interrupted data sharing, but that incident was announced and resolved on the same day. The lack of communication led to some understandably angry patients. CNBC reports that the company did have monitoring in place, but did not realize the scope of the server downtime.
There are two key takeaways from this outage. The first is that your alert system must take customer impact into account. You can know there is an outage without seeing its full scope, and the scope is critical to how you allocate your resources. One small adjustment teams can make is to attach a note to each outage alert stating customer impact:
“Notification queue is CRITICAL. Customer impact level 1 (critical): customers will not get glucose level alerts. Some may die, with a corresponding adverse effect on brand image.”
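A note like this can be attached automatically if every check carries an impact level and a short description. The following is a minimal sketch, not a description of Dexcom's tooling: the `Alert` class, the `IMPACT_LABELS` table, and the level numbering are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical impact scale; the levels and labels are assumptions for this sketch.
IMPACT_LABELS = {1: "critical", 2: "major", 3: "minor"}


@dataclass
class Alert:
    check_name: str    # what is being monitored, e.g. "Notification queue"
    status: str        # monitoring status, e.g. "CRITICAL"
    impact_level: int  # 1 is the most severe on this assumed scale
    impact_note: str   # plain-language description of what customers experience


def format_alert(alert: Alert) -> str:
    """Build an alert message that states customer impact alongside the status."""
    label = IMPACT_LABELS.get(alert.impact_level, "unknown")
    return (
        f"{alert.check_name} is {alert.status}. "
        f"Customer impact level {alert.impact_level} ({label}): {alert.impact_note}"
    )


if __name__ == "__main__":
    alert = Alert(
        check_name="Notification queue",
        status="CRITICAL",
        impact_level=1,
        impact_note="customers will not get glucose level alerts.",
    )
    print(format_alert(alert))
```

Writing the impact note when the check is created, rather than during the incident, means whoever gets paged sees the stakes immediately.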
The second takeaway is in how you monitor. Ideally, a push notification system sends messages continuously to a receiving system, which alerts your team when it stops seeing those messages. You might also consider monitoring the queue of messages to check that it does not grow beyond a certain size. Either approach carries some logistical challenges.
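Both ideas can be sketched in a few lines. This is an illustration under stated assumptions, not a production monitor: `HeartbeatMonitor`, `queue_too_deep`, and the thresholds are hypothetical, and a real setup would run the receiver on separate infrastructure from the system it watches.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Flags a problem when expected messages stop arriving."""

    def __init__(self, max_silence_seconds: float):
        self.max_silence_seconds = max_silence_seconds
        self.last_seen: Optional[float] = None  # time of the most recent heartbeat

    def record_heartbeat(self, now: Optional[float] = None) -> None:
        # Call this each time a test notification arrives at the receiver.
        self.last_seen = time.monotonic() if now is None else now

    def is_silent(self, now: Optional[float] = None) -> bool:
        # True if no heartbeat has ever arrived, or the last one is too old.
        current = time.monotonic() if now is None else now
        if self.last_seen is None:
            return True
        return current - self.last_seen > self.max_silence_seconds


def queue_too_deep(queue_length: int, max_length: int) -> bool:
    """True when the backlog of unsent messages exceeds the allowed size."""
    return queue_length > max_length


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_silence_seconds=60)
    monitor.record_heartbeat(now=100.0)
    print(monitor.is_silent(now=130.0))  # heartbeat seen 30 seconds ago
    print(monitor.is_silent(now=161.0))  # heartbeat seen 61 seconds ago
    print(queue_too_deep(queue_length=1001, max_length=1000))
```

The heartbeat check catches a pipeline that has gone quiet, while the queue check catches one that is still running but falling behind. Using both covers more failure modes than either alone.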
Dexcom said that, along with fixing its servers, it wants to fix its response time to users. “Ultimately we need to be able to communicate faster, and in a broader way,” Jake Leach (CTO) said. “Clearly, there’s an opportunity there.”
Yahoo! and AOL Outages Confirm Even Large Players are Not Immune
The two ’90s Internet titans found themselves out of commission, as Yahoo! saw various pieces of infrastructure fall victim to an outage that was traced to the deprecation of Yahoo! Mail Classic.
Meanwhile, AOL was serving customers a 504 and a 502 instead of the classic “You’ve Got Mail” we all know and love.
To operate at scale, these big-time players need a great deal of redundancy. That redundancy takes the form of many identical servers and systems, so a change made to any one system means many more moving parts to deal with. If either company has even ten thousand servers (not uncommon for a global organization), human operators can no longer work with those systems directly. Instead, they have to work through one or more layers of automation, which can itself lead to server downtime.
A common pattern is to roll out changes to 1% of users, wait an hour to see if there are problems, roll out another 1%, and so on, until you’ve completed the rollout. Of course, this means you’re almost always in the middle of a complex rollout, which is a bad time to have an unrelated incident or failure, because your systems are not in a consistent state.
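The rollout pattern described above can be sketched as follows. This is a minimal illustration, not how Yahoo! or AOL deploy changes: the function names are hypothetical, hashing the user ID is one common way to pick a stable slice of users, and rolling back to 0% on a failed health check is an assumption of this sketch.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(user_id: str, percent: int) -> bool:
    """True if this user should receive the change at the current rollout percentage."""
    # Buckets below the threshold are included, so raising the percentage
    # only ever adds users; nobody flips back and forth between versions.
    return rollout_bucket(user_id) < percent


def next_stage(percent: int, healthy: bool, step: int = 1) -> int:
    """Advance the rollout by one step if the last stage looked healthy.

    Assumption: an unhealthy stage rolls the change back to 0%.
    """
    if not healthy:
        return 0
    return min(100, percent + step)


if __name__ == "__main__":
    percent = 0
    # Simulate three healthy observation windows (an hour each in the text above).
    for _ in range(3):
        percent = next_stage(percent, healthy=True)
    print(percent)
    print(in_rollout("user-12345", percent))
```

Because the bucket depends only on the user ID, the same users stay in the rollout from one stage to the next, which makes problems easier to trace while the fleet is in a mixed state.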
Minecraft Trolls Take Down Vatican Priest’s Server
Tech blogger and priest, Father Robert Ballecer, created a Minecraft server after polling his followers to see what game the Vatican should host. The idea was all in good fun, but Father Ballecer poked the bear, seemingly, when he declared his server:
“A little less ‘toxic’ and a bit more ‘community’”
To his credit, Father Ballecer cleverly used initial attempts to DDoS the server as a “honeypot” of sorts to capture and ban IP addresses that could be used against him. But alas, the valiant fight was for naught as the servers were completely overrun.
That said, there is a happy note to this story: more experienced IT and tech users have joined the struggle to keep the Vatican’s Minecraft server alive.
As of writing, it does appear the Vatican has taken steps to whitelist users who wish to try the server. A public server is also available, but users are encouraged to apply for whitelisting at the Discord.
If you want to get into the main Minecraft server (https://t.co/o0o2kLzbFu), join the discord below, get a “MC Player” role, and request access to the whitelist.
— Fr. Robert R. Ballecer, SJ (@padresj) December 17, 2019
Google’s GCP Appears to be at Fault in Discord Outage
There are as many Discord users as Fortnite players. They send roughly 100 million messages a day according to company estimates, and that is a lot of pissed off folks.
According to a Discord statement, the source was Google: “Oh gosh~! We’re seeing some issues with messages sending & failure to connect due to an issue with Google compute platform & are waiting for a resolution!! Sorry for the inconvenience. Unfortunately there is a known issue affecting the timestamp for countries which do not observe DST. We’re aware of the problem and hope to push out a fix for it in a future update. Apologies for the inconvenience in the meantime!”
Our research seemed to indicate this outage may have been related to I/O performance on some cloud disks. Here is the issue from the GCP Status Page.
Whatever the true source of the issue, Google resolved it within two hours.
Eastern Band of Cherokee Indians Suffers Major Cyberattack
Tribal networks suffered a ransomware attack on 12/8 that took systems down completely for a period of 9 days. The attack was treated as domestic terrorism, and an arrest was made. Benjamin Cody Long, 36, was charged in connection with the attack. A tribal member and a former IT worker, he was suspended without pay two days before the attack occurred.
Although critical emergency services were working throughout the outage, this attack is a reminder that provisioning and de-provisioning access is critical when dealing with sensitive systems.
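One way to make de-provisioning routine is to regularly compare HR status against the accounts that are still enabled. The sketch below is purely illustrative and says nothing about the tribe's actual systems: the `Account` class, the `"active"` status string, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Account:
    username: str
    system: str    # e.g. "vpn" or "domain-admin" (hypothetical system names)
    enabled: bool


def accounts_to_revoke(hr_status: Dict[str, str], accounts: List[Account]) -> List[Account]:
    """Return enabled accounts whose owner is not marked 'active' in HR records.

    Assumption: a user missing from the HR records is treated as inactive,
    so orphaned accounts are flagged too.
    """
    return [
        account
        for account in accounts
        if account.enabled and hr_status.get(account.username) != "active"
    ]


if __name__ == "__main__":
    hr_status = {"asmith": "active", "bjones": "suspended"}
    accounts = [
        Account("asmith", "vpn", True),
        Account("bjones", "vpn", True),
        Account("bjones", "domain-admin", True),
        Account("oldsvc", "backup", True),  # no HR record at all
    ]
    for account in accounts_to_revoke(hr_status, accounts):
        print(f"revoke {account.username} on {account.system}")
```

Running a check like this on a schedule, and on every HR status change, narrows the window between a suspension and the loss of access.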
Ransomware Leads to School District Server Downtime
The Claremont Unified School District first detected a ransomware attack in its system on 11/21. Server downtime was extensive, leaving schools without internet entirely as the network was being worked on.
The school’s Blackboard system was operational throughout the outage, so parents and students had access to some critical information. On campus, it was a different story. The outage prompted the school to disable the entire system, leading to delays in school work and general disruptions to learning.
Microsoft Confirms Skype for Business Outage
Microsoft had a small Skype for Business outage, which it attributed to a maintenance issue. The Microsoft incident report describes the impact:
“Users already signed into the service are able to make or receive voice calls and send peer to peer messages, however, they may be unable to perform certain write actions. Some of the scenarios impacted for write actions include, but are not limited to; Creating meetings, adding contacts, activating meetings, or creating groups.”
While the incident was resolved the same day, this brings up an interesting point about fall-back solutions. When your chief method of communication no longer works, do you have a back-up plan or are your operations paralyzed until it returns?
Echocash Suffers Major System Outage
Echocash suffered a massive outage that seems to have lasted at least one month. The cause was an issue with a server migration, which did not appear to affect physical locations but devastated the online service.
When a major upgrade goes really wrong, things can get worse and worse. When you don’t really understand the problem, increasingly frantic attempts to fix it with short-term measures can create further problems, and make it harder to get things stable again. The bigger the customer impact, of course, the greater the pressure on staff to resolve the issue, and the greater the temptation for ‘quick fixes’ and sticking plasters. You can rapidly get into a situation where multiple unrelated issues are making it more or less impossible to devise a coherent strategy for fixing the system.
Sometimes the right response is to take the system down completely, for as long as it takes to fully analyse what’s gone wrong, fix it properly for the long term, and roll out the changes in a safe and disciplined way. However, that may take a few days, which is a long time to be down in the face of pressure from angry customers. But the overall downtime and customer impact can be less than trying to grapple with the incident ‘on the fly’. Having plans in place for major incidents before they happen is very helpful here. Design a process, and train your staff on it before an incident happens. Then you won’t be groping in the dark to figure out what’s wrong.
Virgin Media Outage Leaves UK With Limited Access
We end with a classic example of what we in the business affectionately call a “backhoe incident”. “During construction work carried out by a third party, a significant number of cables were pulled out of the ground at a building site severing thousands of fibres.”
What can you do?
In this instance, as in the ransomware incidents and the Skype outage earlier in our report, we’ve seen how overreliance on fragile infrastructure can hurt operations.
The best steps you can take involve creating infrastructure that can’t be brought down by any single incident. If your services are redundant across two geographically separated data centers, for example, it’s unlikely that a backhoe incident at one of them can take you down altogether.
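A quick calculation shows why a second site helps. This sketch assumes failures at each site are independent, which is roughly what geographic separation is meant to achieve. It does not hold if both sites share a dependency such as the same upstream fibre. The function name and figures are illustrative only.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of several independent sites is up.

    Assumption: sites fail independently and any single site can carry the service.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every site is down at the same time.
    return 1.0 - (1.0 - site_availability) ** sites


if __name__ == "__main__":
    for sites in (1, 2, 3):
        print(sites, f"{combined_availability(0.99, sites):.6f}")
```

Under these assumptions, two sites that are each available 99% of the time give roughly 99.99% combined availability, because both must fail at once to cause an outage.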
Key Takeaways from December 2019
Not all outages are created equal. The consequences of a particular website or service being down can range from not-so-serious to… well, serious.
“Think about phone apps, for example, and the backend services that power them. People rely on their phone for things like navigating in strange places, or making payments, and the consequences of having those services fail can be serious,” Arundel says. And it’s clear that with outages of healthcare systems (such as this month’s Dexcom incident), there can be risks to patient safety, or even to life.
But when your SREs get a page that ‘Service X is down’, none of that comes across. Context is everything, and it’s impossible to make good decisions about how to handle an incident if you don’t understand who’s affected, and how.
Put customer impact at the heart of your monitoring strategy, because even if lives don’t depend on your service, your customers sure do.
Check out the past outage reports from Uptime.com.