What We Learned About Uptime from 2019 Website Outages
One thing we’ve always known: there’s no such thing as 100% uptime for any website.
Too many variables are at play to keep a site up all the time. From traffic surges to hardware failures and everything in between, keeping sites up and running is a full-time job for SREs and IT pros.
Here at Uptime.com, a website performance monitoring service, we track major downtime all year long to give websites of all sizes lessons in how to catch downtime and resolve incidents quickly.
This post will provide some general observations from a year of major outages, a round-up of some of the most significant outages of this year, and some tips for you going into 2020.
Let’s get started.
Table of Contents
- What uptime even means anymore
- 2019 Website Outages: Uptime is Valuable
- 2019 Major Outages
- Tips for Maximizing Uptime in 2020
- 1. Change your methods for detecting cyberattacks.
- 2. Review your load balancing strategy.
- 3. Combine RUM and synthetic monitoring for an accurate picture of the user experience.
- 4. Use Chaos Engineering to learn the strengths and weaknesses of your web infrastructure.
- 5. Continually monitor your web infrastructure.
- 6. Have a plan to deal with downtime.
- 7. Review your web hosting service SLAs.
- Key Takeaways
What uptime even means anymore
“What is uptime?” sounds like a silly question.
Isn’t it obvious?
Either your website or service is up, or it’s not.
We’re used to measuring the resilience and availability of our applications in uptime, usually measured as a percentage. For example, an application with 99% uptime was unavailable for no more than 1% of the relevant time period. 99.9% uptime, referred to as ‘three nines’, translates to about nine hours downtime a year. The more nines, the better, you might think. But this misses an important point: Nines don’t matter if users aren’t happy.
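To put those nines in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is ours, and it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) an uptime percentage permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime a year")
```

Each extra nine cuts the allowance by a factor of ten, which is why the number looks so reassuring on a spreadsheet.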
If your site is flat-out unavailable (it never loads, or you get a browser error), that’s clearly a downtime situation, Arundel points out. But there are many ways a service can be making users unhappy, even if it’s nominally ‘up’, including:
- Your site could be slow to load, driving impatient users to your competitors.
- Or the homepage might be speedy, but it takes users 20 seconds to log in (Amazon, we’re looking at you).
- Searches might not be returning any results.
- Items might appear and disappear from users’ shopping carts.
- Payments might not go through properly, or, worse, customers might be charged twice.
You get the point.
‘Up’ means happy users, not numbers on a spreadsheet. But what gets measured gets maximized, so you’d better be careful what you measure.
If you’re focused on traditional ‘site uptime’ numbers, you could be missing serious problems that are making customers angry.
Best case, they’ll call you out on social media. Worst case, they’ll silently leave and never come back.
The answer, according to Arundel, is to up your monitoring game. “More sophisticated ways of instrumenting your site, such as synthetic monitoring and RUM (Real User Monitoring), get you closer to measuring customer happiness. Synthetic monitoring checks simulate what customers really do on your site: logging in, searching for products, reading reviews, adding items to the cart, even making payments. By contrast, RUM metrics show you what real customers are doing on your site right now, and what kind of experience they’re having.”
Distributed systems such as cloud native applications are never one hundred percent ‘up’; they always exist in a state of partially degraded service.
The cloud is dark and full of terrors. To navigate it successfully, Arundel recommends you find out what makes your customers happy, and use modern, user-centric monitoring tools to make sure you keep them that way.
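As a rough illustration of the synthetic monitoring idea described above, here is a minimal sketch of a check that walks through a user journey one step at a time. Everything in it is an assumption for demonstration purposes: the step names, the five-second limit, and the placeholder actions, which a real check would replace with browser or API calls against your own site.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_synthetic_check(steps: List[Tuple[str, Callable[[], None]]],
                        max_seconds_per_step: float = 5.0) -> List[StepResult]:
    """Run each (name, action) step in order and stop at the first failure."""
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            elapsed = time.perf_counter() - start
            ok = elapsed <= max_seconds_per_step
            error = "" if ok else f"took {elapsed:.2f}s (limit {max_seconds_per_step}s)"
        except Exception as exc:  # any exception counts as a failed step
            elapsed = time.perf_counter() - start
            ok, error = False, str(exc)
        results.append(StepResult(name, ok, elapsed, error))
        if not ok:
            break  # later steps depend on earlier ones, so stop here
    return results


if __name__ == "__main__":
    # Hypothetical placeholder actions; swap in real calls to your site.
    journey = [
        ("log in", lambda: None),
        ("search for a product", lambda: None),
        ("add item to cart", lambda: None),
    ]
    for result in run_synthetic_check(journey):
        status = "ok" if result.ok else f"FAILED: {result.error}"
        print(f"{result.name}: {status}")
```

A step that is slow counts as a failure here, not just a step that errors out, which is closer to how an impatient customer would judge it.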
2019 Website Outages: Uptime is Valuable
A common thread we see in website outages is the effect they have on business revenue.
Downtime is expensive, especially during critical shopping days like Amazon Prime Day and Black Friday. After Amazon’s big fail on Prime Day 2018, eBay took advantage of the problem by advertising a “Crash Sale” during this year’s Prime Day.
According to Gremlin, sites like Amazon lose hundreds of thousands of dollars per minute of website downtime. But ecommerce sites aren’t the only ones losing money. Financial services, including banking and cryptocurrency, experienced crippling downtime that infuriated customers.
Financial Services Outages
The Wells Fargo outage was the worst US bank outage ever, with customers unable to access accounts for up to 3 days. It was also the only outage we’ve seen in the US that affected ATMs. Most other bank outages affected online and phone banking, but customers were still able to access their money in person. Banks in the UK seem to have the most outages.
Social Media Outages Wreak Havoc on the Web
Social media sites were a regular addition to our outage reports. These outages became so common that they almost weren’t news anymore. While the Facebook family of apps garnered the most attention, we found that several sites, including Reddit, Flickr and Quora, also went down frequently.
As people become dependent on social media for entertainment, information and communication, these sites become overloaded and go down. Many times the homepages are still accessible, but functionality is compromised once a user logs in. Since ecommerce businesses heavily advertise on social media, precious advertising dollars are wasted.
If you’re thinking of using Facebook’s login functionality for your site, you may want to reconsider. When people can’t log in to Facebook, they can’t access your site either. Given the number of problems with Facebook this year, Google may be a safer bet.
Large Cloud-Based Services Have a Ripple Effect on Website Uptime
Something else we noticed is that one site can affect many. Outages from major service providers like AWS and Google can cause multiple sites to go down. This was the case with Cloudflare, when one small piece of equipment took down “half the internet.” (See major outages below.)
Plan for the Worst, Hope for the Best
When launching a new online product, planning for traffic is crucial. But sometimes, even major brands can’t anticipate the demand.
This was the case when Disney+ launched on November 12. The streaming service was so popular that it was plagued with issues on launch day. No matter how much forecasting Disney did, it’s obvious they weren’t ready for the massive amount of traffic they received.
Disney executive Kevin Mayer told The Verge, “We’ve never had demand like we saw that day and what we’re continuing to see.”
Government Services and Education Are DDoS Magnets
While healthcare is ripe for ransomware attacks, government services and education are continual targets of DDoS attacks. These attacks occur at the national and local levels, affecting everything from email to online services.
One of the largest attacks of the year was on the country of Georgia, where a cyberattack took down over 15,000 websites.
The Good News About Website Uptime
While we’ve focused on the negative up to this point, there is good news, too.
Companies are learning from outages. With the exception of social media, most of the sites we reported on didn’t make our list more than once this year. As customers continue to speak out about outages on Twitter, many companies are responding appropriately by providing status pages with technical details about outages.
Another bright spot: many ecommerce sites were well-prepared for Black Friday 2019. We ran our own tests and all the sites we checked were up and running all day. Aside from Costco and a couple of other major retailers, most sites seemed to be prepared for holiday shopping season.
2019 Major Outages
Though there were a number of major outages throughout the year, the biggest lessons came from just a handful of these incidents.
Here’s a rundown of some of the biggest outages of 2019.
Wells Fargo (February)
Wells Fargo had been rebranding and sprucing up its image because of an unrelated issue when, beginning February 7, furious customers found they couldn’t access ATMs or online banking.
In addition to the above problems, customers stated direct deposits weren’t showing up, retailers declined debit cards, and bill pay wasn’t working.
Twitter was abuzz with rumors about the cause of the problem. Some said it was a fire in a data center, but the local fire department denied it. Wells Fargo later stated that smoke from equipment was the reason for the outage. Though they tried to reroute everything to a backup system, the situation quickly deteriorated.
The three-day outage (some reports stated two days) provides an important lesson about uptime for distributed systems. While things should work in theory, real-life scenarios like the one at Wells Fargo are the ultimate test for how well they actually work. Regular testing of these systems can prevent a major meltdown like the one that happened at Wells Fargo.
Social Media Outages (All Year)
Instead of focusing on one particular outage, we decided to group social media together. As we mentioned before, the Facebook family of apps appeared regularly in our outage reports.
But to be fair, Facebook has more social media sites than any other company. Facebook, Instagram, Messenger and WhatsApp are all part of the Facebook family. Oculus is also owned by Facebook.
One pattern we’ve seen emerge with Facebook is that when one app goes down, at least one other does as well. This is probably due to shared components on the backend. If a component is unavailable, every app that uses it will suffer.
The biggest outage of the year from Facebook came in March (14 hours). The homepage was replaced by a screen stating the system was unavailable due to maintenance issues. On Thanksgiving, the influx of traffic caused the app to crash again.
Other sites that went down regularly included Reddit, Quora and Flickr. Both Flickr and Reddit have detailed status pages that give users the lowdown on site problems. Quora, on the other hand, is mysteriously silent when it comes to informing users of issues. This could be because the issues are localized and it would be impossible to detail every little problem.
Disney+ (November)
Launch day is critical for apps trying to make it in the rapidly growing area of streaming services. Customers have high expectations for a glitch-free experience and robust features.
Disney learned firsthand how difficult it is to create the perfect environment for streaming media. On November 12, Disney+ experienced a variety of problems on its first day of operation.
The service opened to eager subscribers who were met with problems with playback, with picking up where they left off watching a show, and with logging in. With 10 million subscribers during its debut, it’s no surprise that the high demand created technical problems.
Not only were they not ready for the demand, but the app wasn’t built to handle it, either.
Disney exec Kevin Mayer stated, “There were some limits to the architecture that we had in place were made apparent to us that weren’t before.”
The service uses BAMTech technology, which was used by HBO and the MLB in the past. But the company that created the tech stated they hadn’t handled a load that large before. In April, Disney announced it acquired majority ownership in the company and renamed it to Disney Streaming Services.
An important lesson: investigate technologies thoroughly to be sure they can handle your needs and ensure continuous uptime (as much as possible).
Cloudflare (June)
One of the most significant outages of the year happened when service provider Cloudflare went down on June 24.
This outage not only affected the Cloudflare site, but tons of large sites that use their caching to deliver content to users. Everything from hosting providers like WP Engine to chat services like Discord were affected.
What’s so significant about this outage is not just the number of websites affected, but the cause. Cloudflare was quick to publish a blog post which explained the source of the outage: a small piece of equipment called a BGP optimizer at an ISP in Pennsylvania.
How did this one piece of hardware take down “half the internet”? According to Cloudflare, the BGP optimizer created “better” routes to tons of websites, which Verizon then broadcast to the entire internet. If Verizon had had filtering in place, these routes wouldn’t have spread and traffic would have proceeded as normal.
Cloudflare responded quickly by trying to contact Verizon to correct the issue, but never received a response. Instead, they took matters into their own hands, fixed the problem, then published a blog post detailing what happened.
Tips for Maximizing Uptime in 2020
Even if a few minutes of downtime won’t cost your organization thousands of dollars, you should do everything you can to keep your website available.
Here are some tips for you to keep your website safe into the new year:
1. Change your methods for detecting cyberattacks.
DDoS attacks are getting smaller. This means that websites may not experience any downtime and the size of the attack will not trigger any alerts. Examine your methods for detecting attacks, and change thresholds if necessary.
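One way to catch smaller attacks is to compare traffic with its own recent baseline instead of a fixed ceiling. Here is a minimal sketch, assuming you already collect requests-per-minute counts. The function name, the z-score threshold of 3 and the sample numbers are our own illustrative choices, not a recommendation from any particular tool.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float,
                 z_threshold: float = 3.0, min_samples: int = 10) -> bool:
    """Flag `current` if it sits far above the recent baseline in `history`."""
    if len(history) < min_samples:
        return False  # not enough data to judge what normal looks like
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current > baseline  # perfectly flat history: any rise stands out
    return (current - baseline) / spread > z_threshold


if __name__ == "__main__":
    # Made-up requests-per-minute readings for a quiet site.
    recent = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
    print(is_anomalous(recent, 130))  # a 30% bump that a fixed limit of 1,000 would miss
    print(is_anomalous(recent, 102))  # ordinary variation
```

A relative threshold like this will be noisier than an absolute one, so it is worth tuning against your own traffic before you let it page anyone.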
2. Review your load balancing strategy.
High-traffic websites with a load balancing strategy should investigate best practices and new ways to distribute traffic appropriately. Load balancing as a service (LBaaS) is a way to upgrade your load balancing without purchasing and maintaining additional hardware and software. This can help organizations increase traffic capacity without breaking the bank.
3. Combine RUM and synthetic monitoring for an accurate picture of the user experience.
Transaction checks give ecommerce websites an idea of how website functions work under the best possible circumstances. But network latency, user location, device and connection speed are just some of the variables that affect actual website user experience. Combine Real User Monitoring with synthetic monitoring and set thresholds for suboptimal load times.
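To make the threshold idea concrete, here is a small sketch that takes RUM load times grouped by user segment and flags any segment whose 95th percentile is slower than a limit you choose. The segment labels, the three-second limit and the sample timings are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_segments(rum_samples: Dict[str, List[float]],
                  threshold_seconds: float,
                  pct: float = 95) -> Dict[str, float]:
    """Return segments whose pct-th percentile load time exceeds the threshold."""
    flagged = {}
    for segment, load_times in rum_samples.items():
        value = percentile(load_times, pct)
        if value > threshold_seconds:
            flagged[segment] = value
    return flagged


if __name__ == "__main__":
    # Made-up page load times in seconds, grouped by device and connection.
    samples = {
        "desktop/fiber": [1.0, 1.2, 1.1],
        "mobile/3g": [2.5, 4.8, 6.1],
    }
    print(slow_segments(samples, threshold_seconds=3.0))
```

A synthetic check run from a fast data center would likely resemble the first segment, which is exactly why the second one needs real user data to show up.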
4. Use Chaos Engineering to learn the strengths and weaknesses of your web infrastructure.
Chaos engineering was born from a set of tools called Chaos Monkey developed by Netflix in 2010. The company was moving from a physical to a cloud-based infrastructure and wanted to ensure the service would work.
Chaos engineering works on the premise that nothing is perfect, and that breaking things on a regular basis teaches organizations how resilient their infrastructure is. The experts at Gremlin recommend starting with these four simple experiments.
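The core loop of a chaos experiment (break something on purpose, then check whether users would notice) can be sketched with a toy in-memory model. To be clear, this is not Chaos Monkey or Gremlin code. The `Service` class, the replica names and the "still serving" rule are all simplifying assumptions of ours.

```python
import random
from typing import Dict, Iterable


class Service:
    """Toy service that can answer requests while any replica is healthy."""

    def __init__(self, replicas: Iterable[str]):
        self.healthy = set(replicas)

    def terminate(self, replica: str) -> None:
        self.healthy.discard(replica)

    def handle_request(self) -> bool:
        return len(self.healthy) > 0


def run_experiment(service: Service, rng: random.Random, kills: int = 1) -> Dict:
    """Terminate `kills` random replicas, then report whether the service still answers."""
    candidates = sorted(service.healthy)
    victims = rng.sample(candidates, k=min(kills, len(candidates)))
    for victim in victims:
        service.terminate(victim)
    return {"terminated": victims, "still_serving": service.handle_request()}


if __name__ == "__main__":
    rng = random.Random(2020)  # seeded so the experiment is repeatable
    redundant = Service(["web-1", "web-2", "web-3"])
    single = Service(["web-1"])
    print(run_experiment(redundant, rng))
    print(run_experiment(single, rng))
```

The second experiment is the interesting one: it shows a single point of failure before a real outage does.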
5. Continually monitor your web infrastructure.
Testing is important, but the best way to catch and respond to problems quickly is to continually monitor your web infrastructure. In addition to simple HTTP/S checks that alert you when sites are unavailable, more sophisticated checks can look at your DNS, SSL certificates, APIs, and check website elements to make sure everything is working.
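As one example of a check that goes beyond "is the site up", here is a sketch that works out how many days remain before an SSL certificate expires, using only Python’s standard `ssl` and `socket` modules. The function names and the 14-day warning window are our own assumptions, and `fetch_not_after` needs network access to the host you pass in.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect over TLS and return the server certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def certificate_needs_renewal(hostname: str, warn_days: float = 14) -> bool:
    """True if the certificate expires within `warn_days` (assumed threshold)."""
    remaining = days_until_expiry(fetch_not_after(hostname),
                                  datetime.now(timezone.utc))
    return remaining < warn_days
```

Calling `certificate_needs_renewal("example.com")` on a schedule would give you a warning well before browsers start showing errors to visitors.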
6. Have a plan to deal with downtime.
Monitoring your website is important, but what you do once an incident happens is just as important.
New research from Google Cloud showed that even though retailers invested a lot of time into getting their sites ready for Black Friday, 24% didn’t have a plan to deal with downtime if it happened during this critical shopping holiday.
Create a downtime response plan that includes an escalation policy and how to keep customers informed when your website is unavailable.
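An escalation policy can be as simple as a list of who gets contacted and when. The sketch below is one hypothetical way to write that down. The roles and the 15- and 45-minute delays are placeholders, not a recommended schedule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes an alert has gone unacknowledged
    contact: str


# Hypothetical policy; replace the roles and timings with your own.
POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(15, "team lead"),
    EscalationStep(45, "engineering manager"),
]


def contacts_to_notify(policy: List[EscalationStep],
                       minutes_unacknowledged: int) -> List[str]:
    """Everyone who should have been notified by now, in escalation order."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [step.contact for step in ordered
            if minutes_unacknowledged >= step.after_minutes]


if __name__ == "__main__":
    print(contacts_to_notify(POLICY, 20))
```

Writing the policy as data makes it easy to review alongside the rest of the response plan, including whoever is responsible for updating customers.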
7. Review your web hosting service SLAs.
Is your web hosting provider meeting their Service Level Agreements (SLAs)?
The average website is down 3 hours a month because of hosting. Now is the perfect time to review your web hosting provider to make sure they are meeting their obligations. If your host offers an uptime guarantee and falls short, review your contract to see what you can do about it.
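To see how that 3-hour figure stacks up against a typical guarantee, you can turn measured downtime into an uptime percentage and compare it with the number in your contract. This sketch assumes a 30-day month, and the 99.9% and 99.5% guarantees are only examples.

```python
HOURS_PER_MONTH = 30 * 24  # assumes a 30-day billing month


def uptime_percent(downtime_hours: float,
                   period_hours: float = HOURS_PER_MONTH) -> float:
    """Measured uptime as a percentage of the period."""
    return 100 * (1 - downtime_hours / period_hours)


def meets_sla(downtime_hours: float, guaranteed_percent: float,
              period_hours: float = HOURS_PER_MONTH) -> bool:
    """True if measured uptime is at or above the guaranteed percentage."""
    return uptime_percent(downtime_hours, period_hours) >= guaranteed_percent


if __name__ == "__main__":
    measured = uptime_percent(3)
    print(f"3 hours down in a month is {measured:.2f}% uptime")
    print("Meets a 99.9% guarantee:", meets_sla(3, 99.9))
    print("Meets a 99.5% guarantee:", meets_sla(3, 99.5))
```

Check how your host defines the measurement period and what counts as downtime, since scheduled maintenance is often excluded.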
Key Takeaways
As always, technology is evolving and getting more sophisticated. With the rise in user expectations and ways things can go wrong, now’s the time to review and update your website performance monitoring.
Here are the key takeaways from a year of website outages to remember going forward:
- New product launches are hard on infrastructure for brands of all sizes. Prepare well.
- Review your load balancing strategy and upgrade to new services when applicable.
- Continually monitor your website with RUM to ensure page load times stay at acceptable levels.
- Have a detailed incident response plan to correct problems and inform customers.
- As attacks get smaller, you may need to change your monitoring to catch attacks that don’t always take websites down.