How to Gain Observability with Custom Checks and External Monitoring
Slack recently had a no good very bad day in which some broken external monitoring contributed to a perfect storm. But one passage caught our eye:
“After the incident was mitigated, the first question we asked ourselves was why our monitoring didn’t catch this problem. We had alerting in place for this precise situation, but unfortunately, it wasn’t working as intended. The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time, and didn’t require any change.”
At first glance, it might seem the problem lies with monitoring. The monitoring was broken for this particular service, therefore the outage was missed. Pretty open-and-shut case of bunk monitoring (our doors are open to you if you ever need us, btw).
But that’s the catch in monitoring, isn’t it? Who watches the monitors?
Fortunately, like all things computer science, there is a term that describes the exact solution to these sorts of problems: observability. Do you practice the Steve Jobsian philosophy of “it just works”, or do you know it works with certainty?
Knowledge is Power
Let’s start with a simple definition for observability:
Quibbles about mechanical context aside, this definition serves our needs well. What can we see based on what the system is showing us?
When we consider observability we have to accept that systems do not care whether we have visibility on them or not. Much like a new puppy, when you are not looking they are liable to do just about anything.
In many ways, machines and observability are pretty simple: they either work or they don’t. At scale, we tend to find that some configurations lead to blind spots. Status pages are a fun example. What if your status page stopped working? Would you know?
Hold on, the nightmares don’t end there.
How about an external code repository? That’s a nice development pipeline you built there. Be a shame if something happened to it. What if your data center catches fire? What if some guy doing his gardening strikes the cable that happens to control whether users can access your resources? I can go on. We wrote a whole report about it.
The point we’re making here is that there are an incalculable number of ways for the stuff to hit the fan.
So we have to have a standard for observability.
Internal vs External Monitoring
One day I wake up with a cough. My body says: something is wrong, I should go to someone who can fix this problem. I take a brief trip to the hospital where I meet several medical professionals, from clerks and desk staff all the way up to nurses and a doctor, who observe my condition and make judgments based on what they see.
Do I have a fever? What does my cough sound like? Do I have any respiratory conditions? These are just a few of the key indicators these external sources might use to monitor my condition.
Medical science is all about this concept of observability, so DevOps should think like a doctor!
The body of your network has several components, like the databases that hold customer data and the front-end and back-end code that make all the moving parts work. You probably know this makeup pretty well if you have been on the job for a while.
Internal monitoring is focused on resource management, and whatever local or locked down infrastructure is available. But you can learn a lot more than just the state of your disk space or your bandwidth consumption.
If you’re an Uptime.com user, you might create a heartbeat check, which expects to receive a request at a unique URL at a specific interval, or a webhook check to watch the server that manages a critical process like user_registration.
With a heartbeat connected to the job that records new user registrations, you can know with certainty whether it’s working. A webhook check connected to the server running that job can tell you it’s running.
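To make the heartbeat idea concrete, here is a minimal sketch of the receiving side: something that records each ping and decides when the job has gone quiet for too long. The `HeartbeatMonitor` class, its `grace` parameter, and the five-minute interval are hypothetical names and values chosen for illustration; this is not how Uptime.com implements its heartbeat checks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker for pings sent by a job such as user_registration."""

    interval: timedelta                 # how often the job is expected to ping
    grace: timedelta = timedelta(0)     # extra slack before we call it overdue
    last_ping: Optional[datetime] = None

    def ping(self, now: datetime) -> None:
        # Called each time the job requests its unique heartbeat URL.
        self.last_ping = now

    def is_overdue(self, now: datetime) -> bool:
        # A job that has never pinged is overdue: silence is never "healthy".
        if self.last_ping is None:
            return True
        return now - self.last_ping > self.interval + self.grace


monitor = HeartbeatMonitor(interval=timedelta(minutes=5), grace=timedelta(minutes=1))
start = datetime(2020, 6, 26, 12, 0)
monitor.ping(start)
print(monitor.is_overdue(start + timedelta(minutes=4)))   # False
print(monitor.is_overdue(start + timedelta(minutes=10)))  # True
```

The design choice worth noticing is the `None` case: a monitor that has never heard from the job reports a problem instead of assuming everything "just works".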
Private location probe servers can also monitor applications and internal APIs that are inaccessible to the public. Is your backend system working? Private location probe servers will tell you.
Internal monitoring is the system providing proof of life.
Bruce Lee had a wonderful quote in Enter the Dragon that strikes at the heart of observability:
It is like a finger pointing away to the moon. Do not concentrate on the finger or you will miss all of the heavenly glory!
He definitely wasn’t talking about SaaS companies when he said it, but a good point is a good point. If we’re looking only at internal indicators, we’re hyper-focused on just one facet of the problem.
Let’s consider malware, another fun problem to have. If you’re blacklisted by Google, you do get an alert, but what if it goes to someone on the marketing or executive team? How long until that alert makes its way to your desk?
The whole time you’re trying to figure out how to get your team’s Slack group to run Pong, your users can’t find you and those that do are being served a big ugly warning sign when they visit your site.
Service Level Indicators
There is basically no limit to the checks you can deploy for a website. The question is which key indicators get you out of bed.
Are you concerned with mobile versus desktop visitors? Are you a global company worried about your site’s performance from obscure locations? RUM checks are going to offer the reassurance you need.
Do you find yourself sleeping deepest when you are content with the knowledge it does “just work”? Well, roll up the sleeves of your turtleneck and use some HTTP(S) checks to verify the critical URLs that drive your business are serving 200 OK to users.
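At its simplest, an HTTP(S) check is a request plus a status comparison. The sketch below uses only Python's standard library; the function name `serves_ok` and the five-second timeout are assumptions made for illustration, and a real monitoring service would run checks like this from several locations on a schedule.

```python
import urllib.error
import urllib.request


def serves_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True only if a GET request to `url` ends in HTTP status 200.

    Note: urlopen follows redirects, so a 301 that leads to a 200 passes.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP errors (4xx/5xx), DNS failures, refused connections, and
        # timeouts all count as "not OK" for the purposes of this check.
        return False
```

You might call it as `serves_ok("https://example.com/checkout")`, where the URL stands in for whichever page actually drives your business.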
Need to verify assets are loading? Want to see if your shopping cart really works? Transaction checks offer the peace of mind you need.
And if you want to go full superhero: you combine these powers and create the ultimate external monitoring system that checks performance, uptime, and availability of critical goal funnels.
Beyond Service Level Indicators
#observability relies on the metrics most relevant to you.
Engineer eyes on
Incident response time
Size of infrastructure
Time of incident
On call hours spent
— Uptime.com (@uptimemon) June 26, 2020
Observability is difficult to encapsulate because it changes by organization, or even personnel. Two departments with different methodologies might view observability and critical metrics differently, so finding middle ground here can be a challenge. Here are some mutual pain points we find are worth observation:
On-Call Hours Spent
Everyone has on-call hours, but how often did someone actually have to get out of bed or stop dinner to respond to something? If the answer makes you grimace, it’s probably time to look at how you can reduce those incidents or reduce time to respond.
Number of Outages
A five-minute outage is still an outage. It still counts against your SLA whether your team can respond to it or not. Annoying, right? The more of these incidents you catch, the more you diagnose, and ultimately the fewer will happen over time.
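A little arithmetic shows how much a short outage can cost. This is a rough sketch that assumes a 30-day measurement window and two example SLA targets (99.99% and 99.9%); your own SLA terms may define the window and exclusions differently.

```python
def availability_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Percentage of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def meets_sla(downtime_minutes: float, period_minutes: float, target_percent: float) -> bool:
    """True if measured availability is at or above the SLA target."""
    return availability_percent(downtime_minutes, period_minutes) >= target_percent


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes (assumed window)

print(round(availability_percent(5, MINUTES_IN_30_DAYS), 3))  # 99.988
print(meets_sla(5, MINUTES_IN_30_DAYS, 99.99))                # False
print(meets_sla(5, MINUTES_IN_30_DAYS, 99.9))                 # True
```

Under these assumptions, a single five-minute outage is already enough to miss a "four nines" target for the month, since 99.99% of 43,200 minutes leaves only about 4.3 minutes of allowable downtime.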
We have also seen the evidence from our own userbase: the number of downtime-related support tickets drops when external monitoring is employed. This is a major cost saver, and gives DevOps space to focus on the “dev” part of that moniker.
Incident Response Time
Time to respond plays a major role in extended outages. We found that time to respond had the potential to be a more significant factor in major outages than the cause of the outage. When surveying the top companies in several sectors, companies with the budget for IT infrastructure, we observed a low number of outages for the period but a high number of downtime hours.
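One way to see this pattern in your own data is to track time to respond separately from total downtime. The sketch below uses a hypothetical `Incident` record and two made-up incidents purely for illustration; none of the numbers come from the survey mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime       # when the outage began
    acknowledged: datetime  # when someone started responding
    resolved: datetime      # when service was restored


def mean_time_to_respond(incidents: List[Incident]) -> timedelta:
    """Average delay between an outage starting and someone responding."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.acknowledged - i.started for i in incidents), timedelta(0))
    return total / len(incidents)


def total_downtime(incidents: List[Incident]) -> timedelta:
    """Sum of outage durations from start to resolution."""
    return sum((i.resolved - i.started for i in incidents), timedelta(0))


# Hypothetical data: only two outages, but each one sat unnoticed for hours.
day = datetime(2020, 6, 1)
incidents = [
    Incident(day, day + timedelta(hours=2), day + timedelta(hours=3)),
    Incident(day + timedelta(days=10),
             day + timedelta(days=10, hours=4),
             day + timedelta(days=10, hours=5)),
]
print(mean_time_to_respond(incidents))  # 3:00:00
print(total_downtime(incidents))        # 8:00:00
```

In this made-up example, six of the eight downtime hours pass before anyone responds, which is the "few outages, many downtime hours" shape described above.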
If you don’t have infrastructure in place to respond to downtime, it festers.
Act, Don’t Wait
That’s not just a cheesy call to action, it’s a call to improve. The more time an alert spends floating around the office and not directly in front of you where it matters, the lower your margin for error. You can feel the pressure if you’re under these conditions, where an outage sends literal shivers down your spine.
Observability is really another kind of development philosophy. It’s just rooted in reality, where things work or they don’t. Every outage is a chance to evaluate whether your observability failed you.
Don’t be afraid to build off of what you haven’t learned yet.