February 2020 Downtime Report
February kicked 2020 off with a terrifying glimpse into what happens when the Internet of Things stops Internetting things. If we consider our central question this year of uptime in the age of always-connected, then we start to see the impact of hidden failures. All the stuff we don’t know we don’t know impacts the end-user.
Someone forgets to renew a TLS certificate, half the business world can’t collaborate. Someone else flubs an update? Users who rely on connected cameras to guard their homes are suddenly less protected than the sewers of Helm’s Deep.
We live in a world that starts and stops according to nines.
This month we’ll learn how to overcome certificate anxiety, and look at what the Nigerian banking industry spent N200 billion defending against.
Microsoft Teams Down Thanks to SSL Mixup
Microsoft’s Teams application went down for about three hours after the team failed to renew its SSL certificate. “We’ve determined that an authentication certificate has expired causing users to have issues using the service,” Microsoft stated in an outage notification.
A few important notes arise from this incident. First, Microsoft is a very large organization and it’s not too hard to imagine that someone could have grabbed a certificate and forgotten to document it. It also shows that no one is immune to “certificate anxiety”.
‘Certificate Anxiety’ and How to Avoid It: Three Tips
An expired TLS certificate used to be a minor embarrassment, but it’s now a serious matter, as the Microsoft Teams outage amply demonstrates. Most browsers will block access to sites and services with invalid certificates, so from your customer’s point of view, if your cert has expired, your site may as well be down. What can you do to avoid this?
Infrastructure expert and consultant John Arundel, of Bitfield Consulting, has three key tips for eliminating certificate anxiety.
1. Automate the renewal
Any manual task can be overlooked, especially if it’s a long way into the future. If you use a certificate provider that supports automation, such as Let’s Encrypt, there’s basically no manual task for you to do. Once you’ve provisioned the cert, you can tell the Let’s Encrypt client to auto-renew it as soon as it becomes necessary. Let’s Encrypt certs normally last 90 days, and will be automatically renewed when there are 30 days to go. Problem solved!
2. Automate the reminder
If you can’t automate the cert renewal, you can at least automate the reminder. You can usually renew a cert up to 90 days before it’s due to expire. Using whatever project management or task tracking system you use for the rest of your work, create an automated reminder that emails you or creates a ticket to renew the cert, as soon as it can be renewed. If the cert lasts 12 months, set the reminder to trigger every 12 months.
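As a rough sketch of the date arithmetic behind that reminder, the snippet below works out the first day a cert becomes renewable. The 90-day window comes from the tip above; the function names are hypothetical, and your certificate provider may use a different window.

```python
from datetime import date, timedelta

# Assumption: the provider lets you renew up to 90 days before expiry,
# as described in the tip above. Adjust for your own provider.
RENEWAL_WINDOW_DAYS = 90


def reminder_date(expires_on: date, window_days: int = RENEWAL_WINDOW_DAYS) -> date:
    """Return the first day on which the cert can be renewed."""
    return expires_on - timedelta(days=window_days)


def is_renewable(today: date, expires_on: date,
                 window_days: int = RENEWAL_WINDOW_DAYS) -> bool:
    """True once the renewal window has opened."""
    return today >= reminder_date(expires_on, window_days)
```

Feed the result of `reminder_date` into whatever ticketing or calendar tool you already use, so the reminder fires when the window opens rather than when the cert is about to lapse.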
3. Automate the monitoring
Whatever your TLS renewal process, manual or automatic, you still need monitoring as a backstop. Most automated monitoring services can not only check that your TLS cert is valid, but also check its expiry date. Get your monitoring system to alert you as soon as the cert is renewable. If you use auto-renew, set the alert for one day after the renew date. Then if there’s any problem preventing the cert from auto-renewing, you’ll know about it in time to fix it.
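If you want to see what such an expiry check might look like, here is a minimal sketch using only Python’s standard library. The function names and the 29-day threshold are assumptions for illustration: the threshold stands for “one day after the renew date” when auto-renewal is expected with 30 days to go.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the cert's 'notAfter' string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    The default context verifies the cert, so an already expired cert
    raises ssl.SSLCertVerificationError here, which is itself worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days between 'now' (epoch seconds) and the cert's expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def should_alert(not_after: str, threshold_days: float = 29,
                 now: Optional[float] = None) -> bool:
    """True if fewer than threshold_days remain (assumed threshold)."""
    return days_remaining(not_after, now) < threshold_days
```

A scheduled job could call `should_alert(fetch_not_after("example.com"))` once a day and page someone when it returns `True`; `example.com` is a placeholder for your own hostname.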
DDoS Attacks on US Voter Registration Websites
The FBI issued a warning in February that indicated Pseudo Random Subdomain Attacks were possibly launched against state-level voter registration and voter information sites. This type of attack uses DNS queries to randomized (non-existent) subdomains in an effort to obfuscate the source of the attack.
The servers in question had rate-limiting algorithms in place that helped filter traffic and reduce the effects of the attack. The FBI recommends that companies evaluate their DDoS mitigation strategy and strengthen it accordingly.
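The reports don’t say which rate-limiting algorithm those servers used. As an illustration only, here is a sketch of one common approach, the token bucket: each client gets a bucket that refills at a steady rate, and a request is allowed only if a token is available. The class name and parameters are hypothetical.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow short bursts up to 'capacity', then 'rate' requests per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Spend one token if available; return whether the request may pass."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, without exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In practice you would keep one bucket per client address (or per queried domain, for DNS) and drop or delay requests when `allow()` returns `False`. The optional `now` argument exists only to make the behavior easy to test.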
What are ‘denial of service’ (DoS) attacks, and what can you do about them?
A denial of service attack is an attempt to overwhelm a service, such as a website, with so many bogus requests that it can’t fulfil them all. When this happens, the server has no spare resources to handle legitimate requests, so users see the service as unavailable.
A distributed denial of service attack (DDoS) is a more sophisticated version of this. If a huge volume of requests were coming from a single internet address, the server administrator could easily block them by banning that address. In a DDoS, the requests come from many different addresses (perhaps thousands), often ordinary computers compromised by malware and hooked up into a botnet. This makes it very difficult to block the bogus requests without also accidentally blocking legitimate users.
The best way to protect your site from such attacks is to use a DDoS mitigation service, such as that provided by Cloudflare, Akamai, and others. This can help detect DDoS attacks and automatically block the bogus traffic before it reaches your servers, making sure you stay available to real users.
However, no DDoS mitigation service is perfect, so you will need to monitor your servers yourself, and be prepared to respond when an attack happens. If your site becomes slow or fails to respond, and you see a suspiciously high volume of traffic to it, a DDoS may be in progress. Contact your ISP and get their help to verify what’s happening and deal with it.
A particularly nasty variant of the DDoS is to attack, not the website itself, but the DNS servers responsible for its domain. Everything on the internet has an address, and a DNS server is the address book which tells a web browser where to find a given site. Overloading the DNS server can cause an ‘invisible’ outage; instead of seeing excessive traffic to your server, you’ll see traffic drop to zero (monitor for this too). Your site will appear unavailable to users, because they can’t look up your address.
Because most people don’t run their own DNS servers (nor should they; it’s a specialist job), protecting against these attacks is often overlooked. Talk to your DNS provider about what mitigation measures they have or can put in place.
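The two traffic signals described above, a suspicious surge and a sudden drop to nothing, can both be checked against a recent baseline. The sketch below is one simple way to do that; the function name and the threshold factors are assumptions you would tune for your own traffic patterns.

```python
from statistics import mean
from typing import Sequence


def classify_traffic(current_rpm: float, baseline_rpm: Sequence[float],
                     spike_factor: float = 5.0, drop_factor: float = 0.1) -> str:
    """Compare the current requests per minute with a recent baseline.

    Returns "spike" (possible DDoS on the site), "drop" (possible DNS
    outage, since users can't find you), "normal", or "unknown" when
    there is no baseline yet. The factors are illustrative defaults.
    """
    if not baseline_rpm:
        return "unknown"
    typical = mean(baseline_rpm)
    if current_rpm > typical * spike_factor:
        return "spike"
    if current_rpm < typical * drop_factor:
        return "drop"
    return "normal"
```

Alerting on "drop" as well as "spike" is the important part: an attack on your DNS provider shows up as silence, not noise.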
Hacker Uses AWS to Launch DDoS Attack Against Candidate Website
In a related attack, Arthur Jan Dam of Santa Monica, California was charged with one federal count of intentionally damaging and attempting to damage a protected computer. Authorities claim Dam launched four attacks that were traced to the same AWS server. Further evidence suggests Dam logged in via his work or home terminal at various times.
This story has some political intrigue as well, as Dam is reported to be married to an employee in the rival candidate’s office.
Nigerian Banks Say Downtime Has Cost N200 Billion
The Nigeria Computer Society (NCS) has said that banks in the country spent about N200 billion to prevent various forms of cyberattacks on their operations in 2019.
The two major attack sources it has identified are DDoS and social engineering. More effort must be made at the local level to highlight the extent of the threat. This is especially true as threats grow in magnitude and sophistication.
This is a huge loss for the industry as a whole, and reflective of the overall state of downtime in 2019. As IT managers and DevOps personnel, we can do more to educate users on threat detection. Run exercises, not with punishment as your goal but as a learning experience. Your company should institute Red vs Blue tactics now, so you can identify security holes and plug them.
Thieves don’t need to snatch information to do damage. They can use outside attack vectors to decimate productivity and effectively grind the organization to a halt. Enact a DDoS mitigation strategy now so you’re not caught unaware.
GitHub outage (unable to leave comments or push)
Folks in DevOps got an unplanned ~2 hour break as GitHub went down. The company was quick to respond on Twitter, and here’s a big takeaway:
As of the writing of this post, the company has not yet revealed the source of the outage. Why is this important? So often, we debate what to tell the customer and how to communicate throughout an outage. GitHub’s model doesn’t speculate, yet still manages to build trust.
Just look at the responses to GitHub’s acknowledgment of the incident:
Sincere apologies to all GitHub users for the downtime this morning, and the brief outages last week as well. We take reliability very seriously, and will publish a full RCA in the near future.
— Nat Friedman (@natfriedman) February 27, 2020
A happy user base is an informed user base, even when everyone relies on you. They want to know: is this a bathroom break or can they go home for lunch?
When you use your Status Page effectively, you improve communication with your user base. Remember: you don’t need to overpromise. Promise change, investigate the incident and follow through.
IoT Applications Down | Who’s Watching Your Home and Feeding Your Pet?
This month in “Things We Should Probably Have a Backup Plan For”, Nest went down for 17 hours, and PetNet went down so frequently people are asking if the company is still in business.
Thankfully for Nest users, Google issued $5 refunds.
So what happened? Google says a scheduled storage server software update didn’t go as intended. Let’s be fair here. Google’s infrastructure is at a level of complexity few of us can fathom.
PetNet’s internet-connected feeder went down for an entire week beginning on February 13. The company’s feeders are supposed to ensure pets are fed properly each day when humans are not present. The devices are designed to regulate feeding to prevent over- or underfeeding.
System Update: We are investigating a system outage that may affect customers using the SmartFeeder (2nd Gen). Scheduled automatic feeds will still dispense on at the desired time although SmartFeeders will appear offline. Sorry for any inconvenience that this may cause.
— Petnet Support (@petnetiosupport) February 14, 2020
These outages prompted one blogger to say: It’s time for smart home devices to adopt local failover. There must be a balance between the costs of running cloud devices and the control consumers have over their devices.
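To make “local failover” concrete, here is a minimal sketch of the idea: the device tries the cloud first, but keeps working from its last known schedule if the cloud is unreachable. Everything here is hypothetical, including the class name and the schedule format; it does not describe how any real feeder or camera is built.

```python
from typing import Callable, List

Schedule = List[str]  # feed times as "HH:MM" strings (assumed format)


class FeederSchedule:
    """Prefer the cloud schedule, fall back to the last cached copy."""

    def __init__(self, fetch_remote: Callable[[], Schedule], cached: Schedule):
        self._fetch_remote = fetch_remote  # hypothetical cloud call
        self._cached = list(cached)        # last schedule stored on the device
        self.online = True

    def current(self) -> Schedule:
        """Return the schedule to use right now."""
        try:
            # Refresh the local copy whenever the cloud answers.
            self._cached = list(self._fetch_remote())
            self.online = True
        except OSError:
            # Network errors (ConnectionError, timeouts) are OSError
            # subclasses. Keep running from the cached schedule.
            self.online = False
        return list(self._cached)
```

The design choice is that losing the cloud degrades the device to “offline but still doing its job” instead of “not working at all”.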
January’s outages largely struck at access points, bringing down logins or preventing users from accessing credentials. February shows us that there’s a lot DevOps can mess up internally, especially as systems grow in complexity.
January showed us why automated testing was necessary to protect the user experience. February is a lesson in monitoring behind the scenes. How are deployments going? When does your domain or TLS certificate expire? If you don’t know off the top of your head (and who does, really), put monitoring in place so you’re never caught off guard.
What outages kept you awake at night this month?