What Does 99.9% Uptime Mean?
An old adage about choosing a hosting provider says that everyone promises 99.9% uptime, so you need to test a site’s uptime yourself to get the real picture. Or you can scour the forums for reviews and judge for yourself how reliable a provider is. That works too.
What that saying is really getting at is the need for some kind of indicator that uptime does not fall below expectations, because you can’t just trust the word of the provider when your business is at stake.
From small businesses thinking about adding a web store to enterprises with many components that together run a single service, monitoring grows in importance with each new piece of infrastructure.
DevOps is in the middle of this, balancing necessary upgrades and growth against the need for uptime. Everyone above you demands continuous connectivity while you compete for budget and staff resources. It’s a big job. So, let’s try to help you manage with some ideas on why chasing 100% uptime is a dream best abandoned.
Why You Shouldn’t Chase 100% Uptime
Let’s imagine a bunker where we’ll hold out in case of emergency. We can think about our hierarchy of needs:
- First Aid
Already we’re looking at a significant cost to build our room and stock it with necessary supplies. Now let’s say we want to extend our ability to stay in this room by a year. Then by two or five or twelve years. Our costs skyrocket as the time we need to survive extends.
Those costs also don’t factor in other supplies we might like to have such as light sources, a radio, or some form of entertainment.
The lesson here is twofold. Trying to ensure 100% survivability leads to astronomical costs as you try to cover every conceivable problem you might encounter. Additionally, it’s impossible because you don’t know what you don’t know.
We might be better served by a survival plan that balances our time in the bunker with plans to safely acquire supplies or use them more efficiently. We might consider partnering with others, or maybe taking up hydroponics. There are any number of ways to avoid depending on just one type of solution.
That’s DevOps in a nutshell: balancing the need for a service’s survival with its operating capacity. Being in a bunker and wondering how to increase the time you’ve got left.
What Does 99.9% Uptime Mean?
When we talk about 99.9% uptime, we need to consider how far those nines extend. You may have heard of a “class of nines”, which describes an availability figure by the number of nines it contains: 99.9% is “three nines” and 99.99% is “four nines”, with each additional nine cutting the allowed downtime by a factor of ten. Marketing people understand this concept because they need that big fancy number to attract the customer, with the terms of service protecting the company from liability.
Service Level Agreements (SLAs) and Service Level Objectives (SLOs) are very different from one another, and understanding the difference is critical to DevOps. If you manage a specific piece of infrastructure, you have set elements that need to be in place and functioning.
An SLA is the agreement between your company and the end user. It implies a neutral standard of measurement, called a Service Level Indicator (SLI), which is used to inform both parties of the actual uptime of a service. Indicators are essentially probes that check availability and provide a neutral measurement of uptime.
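To make the idea of an indicator concrete, here is a minimal Python sketch of how an SLI might be computed from probe results. The `ProbeResult` type, its fields, and the one-check-per-minute schedule are assumptions made for illustration; they do not describe any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One availability check (hypothetical structure for this sketch)."""
    timestamp: int  # seconds since the start of the measurement window
    ok: bool        # True if the service answered the probe successfully


def availability_percent(results):
    """Return the share of successful probes as a percentage."""
    if not results:
        raise ValueError("no probe results to measure")
    successes = sum(1 for r in results if r.ok)
    return 100.0 * successes / len(results)


# Assumed scenario: one probe per minute for a day (1,440 checks),
# with two consecutive failed checks.
results = [
    ProbeResult(timestamp=60 * i, ok=(i not in (100, 101)))
    for i in range(1440)
]
sli = availability_percent(results)
print(f"Measured availability: {sli:.3f}%")
```

In this assumed scenario, two failed checks out of 1,440 are enough to put the day’s measurement below 99.9%, which is why both parties need to agree on how the indicator is sampled.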
That specific piece of infrastructure is part of a list that makes up your SLO: the internal target that defines the minimum amount of infrastructure that must be running for a service to be considered working.
Error Budgeting for 99.9% Uptime
DevOps’ job, then, is to determine how many nines are achievable with the available resources.
DevOps works with project managers to determine which components best deliver the number of nines a given service needs. For project managers, downtime tends to be the focus. You probably don’t care how far the nines extend if your real concern is the number of minutes, hours, or days of downtime those nines represent.
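The conversion from nines to minutes and hours is simple arithmetic. The Python sketch below shows one way to do it; the function name and the 30-day month and 365-day year are assumptions chosen for the example, and a real SLA may define its measurement period differently.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent, period_days):
    """Downtime budget, in seconds, for an uptime target over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * period_days * SECONDS_PER_DAY


# Print the budget for a few common classes of nines.
for target in (99.0, 99.9, 99.99, 99.999):
    per_month_min = allowed_downtime_seconds(target, 30) / 60
    per_year_hr = allowed_downtime_seconds(target, 365) / 3600
    print(f"{target}%: {per_month_min:.2f} min per 30 days, "
          f"{per_year_hr:.2f} h per 365 days")
```

Under these assumptions, 99.9% uptime leaves a budget of roughly 43 minutes in a 30-day month, or just under 9 hours in a year.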
You can either optimize your infrastructure, which is costly, or improve your response to the problem. Optimizing response can make a big difference. If tier 1 can restart a service when it goes down, tier 3 doesn’t need to be woken up and the outage doesn’t drag on unnecessarily.
But there’s something else to consider: the cost-benefit of 100%.
In our bunker, what is the difference between 10 years and 11 years? Either way it comes to the same conclusion: we’ll have to leave or find some other means of survival on a long enough timeline. So, an extreme guarantee of survival becomes moot compared with preparing ourselves better for the problem.
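One common way to see the effect of faster response is the steady-state availability formula, MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recovery. The sketch below applies it in Python; the failure rate and both recovery times are hypothetical figures picked only to illustrate the comparison.

```python
def availability_percent(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures
    and mean time to recovery, as a percentage."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service that fails about once every 30 days of operation.
MTBF_HOURS = 30 * 24

# Assumed recovery times: a 2-hour escalation to tier 3
# versus a 10-minute restart by tier 1.
slow = availability_percent(MTBF_HOURS, mttr_hours=2.0)
fast = availability_percent(MTBF_HOURS, mttr_hours=10 / 60)

print(f"Tier 3 escalation: {slow:.3f}%")
print(f"Tier 1 restart:    {fast:.3f}%")
```

With these assumed numbers the infrastructure fails just as often in both cases, yet only the faster response keeps the service above 99.9%.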
The same goes for maintaining infrastructure, where 99.99999996% and 99.99999997% uptime carry a high cost for almost no difference in perception. The service will go down. An unstoppable force meets inevitability.
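To put that difference in numbers, the Python sketch below compares the yearly downtime implied by the two figures. It uses `decimal.Decimal` because ordinary floating point loses precision at this many nines; the function name and the 365-day year are assumptions for the example.

```python
from decimal import Decimal

SECONDS_PER_YEAR = Decimal(365 * 24 * 60 * 60)


def downtime_seconds_per_year(uptime_percent):
    """Yearly downtime in seconds for an uptime percentage given as a string."""
    unavailable_fraction = (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)
    return unavailable_fraction * SECONDS_PER_YEAR


lower = downtime_seconds_per_year("99.99999996")
higher = downtime_seconds_per_year("99.99999997")
difference_ms = (lower - higher) * 1000

print(f"99.99999996%: {lower:.7f} s of downtime per year")
print(f"99.99999997%: {higher:.7f} s of downtime per year")
print(f"Difference:   {difference_ms:.4f} ms per year")
```

The gap between the two targets comes to about 3 milliseconds of downtime per year, far below anything a user would notice.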
Reliability and Recovery as Factors in Uptime %
Understanding that we need to balance downtime and reliability leads us to a secondary goal: responding to the outage. We can think back to our bunker, where we don’t know the disaster we might have to survive and how long we have to react when it happens. So we have to plan, maybe run some drills, and do our best to maintain readiness.
DevOps takes care of this relationship between uptime and time to resolution through a variety of means. You might adopt a ticketing system to manage incidents, an alert system to assign and arrange priority, analytics and performance tracking, synthetic monitoring, and maybe public-facing components like a status page.
DevOps’ job is to manage SLOs to meet the SLA the company has set forth. DevOps uses SLIs to measure a system’s integrity, and to allocate resources to fix it.
That’s where a monitoring system comes into play. Uptime.com is an SLI with SLA reporting, meaning it can be used as an indicator of overall uptime, with public-facing components in support of an SLA. It is an all-in-one package available to every tier, with a variety of secondary providers integrated for more efficient alerting and response.
When you utilize a provider like Uptime.com, your time to respond improves.
Getting to 99.9% Uptime
The overall lesson here is that you have to make choices. Your job is a tough one: balancing the sanity of your workforce against your available resources and budget, and managing expectations you have to agree to rather than control.
Uptime.com has your back with technical details, a history of outages you can use to see how a service has evolved, visualizations and status pages for management, and a powerful alert system to meet any need.
When you have the right tools, reliability is manageable. You might not get it right every time, but you have the means to make it work. And that’s the best place to be.