Dispelling 7 SLA Myths That Keep Your DevOps Awake at Night
DevOps occupies an odd niche between development and operations. Like any “Wild West” type of position, pretty much anything goes. Your job is to think of everything, including the stuff you haven’t thought of yet. You make the rules, and as long as the lights stay on, you’re considered a success.
But alongside that freedom come rumors and SLA myths that inspire enough dread that you’d rather write them off as jokes.
So today we want to help put your mind at ease by dispelling at least some of the rumors and myths that muddle the field you love so dearly. Let’s dive into it.
One Language to Rule them All
Let’s say we want to build a car, but all we have at our disposal is a multi-tool. A multi-tool can do a lot, but it relies on human skill and on the unit in question actually having the proper tool for each job. Maybe we’re lucky and we get the only Swiss Army knife in history with an air-driven torque wrench, but odds are we’ll need more than a few dedicated tools to do the job properly.
Use the right tool for the job.
100% Uptime is Both Achievable and Sustainable
It bears repeating that 100% uptime is probably the biggest and most damaging myth on this list. Why? People lose jobs over unrealistic SLAs. Companies get sued. Users are unhappy. Puppies cry. It’s awful.
So what can you promise instead? We talk about what the 9’s mean, and what is reasonable, elsewhere — but here is the short answer:
- You need a Service Level Indicator (SLI) to measure uptime against your SLA obligations
- You define your SLA obligations yourself, and they should include an error budget: the amount of downtime you are willing to live with
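The error budget math is simple enough to sketch in a few lines. This is an illustration, not a recommendation — the targets and the 30-day period are placeholder assumptions:

```python
# Turn an SLA uptime target into a concrete monthly error budget.
# Targets and the 30-day period here are illustrative assumptions.

def error_budget_minutes(target_pct: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of allowed downtime per 30-day period for a given uptime target."""
    return period_minutes * (1 - target_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime leaves {error_budget_minutes(target):.1f} minutes/month to spend")
```

Notice how fast the budget shrinks with each extra nine — 99% leaves you hours per month, 99.99% barely minutes. That gap is exactly why the magic number in your SLA deserves real thought.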
The magic number in your SLA is really up to you, but 100% is neither attainable nor sustainable. I’ve always liked this quote from John Arundel:
98.12% – you have a few problems
99.65% – you have no major problems
100.00% – your monitoring is broken
— John Arundel (@bitfield) March 4, 2015
System Uptime is the Same as Service Availability
Your Status Page says everything is up and running. Your users beg to differ. This disconnect happens when you fail to pay attention to your service availability. The customer-facing architecture matters most. No one really cares if your SLA says 99.99999999% uptime if they can’t access your site.
The best advice here is to improve your knowledge of your infrastructure, networks, and services. You don’t know what you don’t know, and monitoring alone isn’t going to solve that problem. Sometimes something needs to fail before you notice it was ever an issue in the first place. Again, error budgeting lets you turn these failures and oversights into learning experiences.
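One way to close the gap between “the host is up” and “the service is available” is to define availability from the user’s side: a successful response, delivered fast enough. Here is a minimal sketch — the `CheckResult` shape and the 2-second latency threshold are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    """One synthetic check against the customer-facing endpoint (illustrative shape)."""
    status_code: int
    latency_ms: float

def is_available(result: CheckResult, max_latency_ms: float = 2000) -> bool:
    """A service counts as 'available' only if it succeeds fast enough.
    A 200 that takes 30 seconds is downtime from the user's perspective."""
    return 200 <= result.status_code < 400 and result.latency_ms <= max_latency_ms

# A host that happily answers ping can still fail these checks:
print(is_available(CheckResult(200, 150)))    # healthy response
print(is_available(CheckResult(200, 30000)))  # "up", but unusably slow
print(is_available(CheckResult(503, 80)))     # fast, but an error
```

The aggregate of checks like these over time is your SLI for availability, which is a very different number from raw host uptime.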
Scrambling to reconnect customer-facing architecture is a particular brand of avoidable pain.
It’s in the Cloud!
By now we all know “cloud computing” is a fancy term for “someone else’s resources,” and that knowledge should make us aware of the potential for failure. Cloud computing might be more secure or speedy or whatever the salesperson is telling you, but it isn’t impervious. In fact, bigger cloud providers make bigger targets and have a wider blast radius when something does go wrong.
People remember when big CDNs go down because they bring half the internet with them. And yes, we are all in the same boat when those failures occur… except those who had backups and continued content delivery.
That’s the lesson here. CDNs and cloud delivery are great! They also don’t mitigate the need for monitoring and backup solutions.
I Can Build My Own Website Monitoring (It’s Free!)
Yes! You can! If you want to make it your full-time job — and hey, who knows, Slack started out as a game development company. Go get ’em!
But for the rest of us mere mortals, building your own solution is like building your own house. You might do a passable job, but you will spend a lot of resources on development and maintenance. So-called “free” monitoring is not free. Providers that do offer a free tier may or may not back that service with a track record of accountability. Often, free services test from a random IP, which can skew response time metrics. The fact is: free might be a good start, but it’s rarely where you end up.
The question really becomes: what should you look for in a subscription-based service for uptime monitoring?
Instant Alerting is King
Is it, though? Does it really matter if DevOps is woken the second the system goes down versus two minutes into a major outage? And when you have flapping — a continuous, rapid change in connection state — do you REALLY want to wake up every time it happens? Wouldn’t it be more effective to track performance issues in aggregate, so you can tie them to a specific change you need to make?
You think that’s air you’re breathing right now?
Instant alerting can be useful in the right context, but it’s more important to ensure alerts are delivered properly. When the ones most equipped to take action get the alert first, the time to resolve decreases. This hierarchy of alerts is your best safeguard for a high SLA uptime percentage.
Error budgeting should reduce the need for instant alerting, because you have an established margin for failure.
Networks Don’t Factor Into Optimization
The final myth is easy to overlook because it is the hardest one to spot. Networks make a big difference in how the user connects to you, the speed at which you perform, and overall content delivery.
It’s best to consider networks in terms of multiple providers that can each deliver the full load necessary to your end users. Who you choose to work with for hosting and content delivery matters a lot. Be sure to research your options.
Look for the number of servers the network has under its control (more servers means a higher probability that content delivery is geographically close to the end user). Speed should be comparable to what you experience from your current provider, and delivery of both large and small payloads should be equally efficient.
Bonus: We Don’t Lose Anything from A Few Hours of Downtime Every Year
We have to tackle this one, because there is so much evidence to prove it wrong, yet the myth persists all the same. Every minute of downtime costs you money. The real question is how much, and the answer is hard to quantify. What we like to do is look at some major players with public revenue figures, then run some calculations. What does 10 hours of downtime hypothetically cost? What about 100?
You can compare these figures by industry as well, giving yourself a scorecard to measure against.
Understanding your “destimate”, or downtime estimate, is critical to informing your SLA obligations. You will lose revenue to downtime, and likely already have. What can you do to mitigate these losses, ensure downtime never extends a minute past what is necessary, and provide continuous availability for customer-facing infrastructure?
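The back-of-the-envelope version of that “destimate” calculation looks like this. It naively assumes revenue accrues evenly around the clock, and the $100M figure is a made-up placeholder, not any real company’s number:

```python
def downtime_cost(annual_revenue: float, downtime_hours: float) -> float:
    """Naive downtime estimate: assumes revenue accrues evenly around the clock."""
    revenue_per_hour = annual_revenue / (365 * 24)
    return revenue_per_hour * downtime_hours

# Hypothetical company with $100M in annual revenue:
print(f"${downtime_cost(100_000_000, 10):,.0f} lost over 10 hours of downtime")
print(f"${downtime_cost(100_000_000, 100):,.0f} lost over 100 hours of downtime")
```

Real losses are lumpier than this — an outage during peak traffic costs far more than one at 3 a.m. — but even the naive number is usually sobering enough to justify investing in monitoring and redundancy.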
Don’t Lose Sleep on DevOps Myths
Myths are exactly that: stories told to scare or lull. DevOps has real concerns to deal with. It’s tough maintaining continuous delivery and optimization. It’s difficult managing resources. Don’t add lost sleep over uptime and service availability to the list.
Minute-by-minute Uptime checks.
Start your 21-day free trial with no credit card required at Uptime.com.