Alert Fatigue | Are You a DevOps Zombie?
George Romero’s zombie classic, Dawn of the Dead, is the perfect analogy for the DevOps experience. Your cluster of services is like your Monroeville Mall. Your safe haven where you live and work for years on end, toiling away at survival.
But who are the zombies?
The answer isn’t always clear when you’re busy chasing down alerts at 2 AM. Just like the survivors in Dawn, it’s important to periodically check your weak spots for alert fatigue:
- Look at the number of alerts received versus acted upon. To conserve group stamina, you want to be sure you’re responding only when the horde actually threatens you.
- Look at who responds to alerts and how the volume is allocated. The best survivors are well-trained. Are you giving your team equal opportunity to stand watch?
- Review escalations. The core of the group is a hierarchy of command. How often are problems escalated to the next higher up?
- Cost of overtime. How much effort is expended to defend against the horde?
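The four checks above can be tallied from an exported alert log. Here is a minimal sketch, assuming a hypothetical record shape (`responder`, `acted_on`, `escalated`, `overtime_minutes`). These field names are illustrative and are not taken from any particular monitoring tool's export.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRecord:
    # Hypothetical fields; adapt them to whatever your alerting tool exports.
    responder: Optional[str]   # who picked up the alert, None if nobody did
    acted_on: bool             # did someone actually do something about it?
    escalated: bool            # was it passed up the chain of command?
    overtime_minutes: int = 0  # after-hours time spent on it


def fatigue_report(alerts: list[AlertRecord]) -> dict:
    """Summarize action rate, load per responder, escalation rate, and overtime."""
    total = len(alerts)
    if total == 0:
        return {"total": 0, "action_rate": 0.0, "escalation_rate": 0.0,
                "by_responder": {}, "overtime_minutes": 0}
    by_responder = Counter(a.responder for a in alerts if a.responder)
    return {
        "total": total,
        "action_rate": sum(a.acted_on for a in alerts) / total,
        "escalation_rate": sum(a.escalated for a in alerts) / total,
        "by_responder": dict(by_responder),
        "overtime_minutes": sum(a.overtime_minutes for a in alerts),
    }


# Made-up sample data, purely for illustration.
sample = [
    AlertRecord("ana", acted_on=True, escalated=False, overtime_minutes=30),
    AlertRecord("ana", acted_on=False, escalated=False),
    AlertRecord("ben", acted_on=False, escalated=True, overtime_minutes=45),
    AlertRecord(None, acted_on=False, escalated=False),
]
report = fatigue_report(sample)
print(report)
```

A low action rate or a lopsided `by_responder` count is the kind of weak spot worth raising with the group before the horde finds it.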
It can become difficult to distinguish whether the problem exists within your organization, your alerting and automation, or something else entirely. As we know from every zombie movie ever made, there’s always a hidden bite somewhere.
So today we are going to do the gritty work of examining our alerts to see if we can cure the DevOps zombie virus.
Ad Hoc Automation Can Hurt On-Call
When you’re fleeing a zombie horde of problems, you tend to adopt solutions based on what’s handy, like reaching for duct tape. As useful as duct tape is, it only patches over the underlying problem. Automation can sometimes become that “patch”. You see downtime and think: I’ll monitor that service. If it goes down again, I’ll get an alert.
Now extrapolate that over years. How many seemingly random checks can slip past your radar? And if your colleagues are doing the same thing, you create gaps in your website monitoring. Gaps the horde can exploit.
Automated alerting for systems you don’t “own” presents a problem of familiarity. You can correct for this by allocating responsibility. In the 9-5 world, we tend to “own” some specific processes. So, who answers for what? One way we can solve this problem of ownership is in our on-call approach.
When considering on-call rotations, it’s important to have multiple team members who can respond to an issue, since problems can happen to any piece of your infrastructure. The likelihood that a team member with ownership over a system can respond to an outage increases with more hands on deck.
Smaller teams don’t have the luxury of multiple on-call watchpeople, and so tend to fall back on automation. This can be detrimental in the long run.
Automation can directly impact alert fatigue with a volume of alerts that become less meaningful over time. Repetitive messaging, repeat alerts, and unread emails are a chore to sift through. And when something does break and you need that needle in that haystack, it’s vanished into the ether.
Using Escalations to Filter the Noise
One of the most underrated ideas behind zombie movies is the endless drone of sound. Imagine thousands of voices just moaning outside of a building 24/7. It would become deafening over time, or you would tune it out. Unless you knew what to listen for.
Escalating alerts helps sort through the noise of false positives. Things sometimes break for a minute or five. You should still work on repairing these issues, but on-call shouldn’t need to wake up at 3 AM to observe this wildly engaging phenomenon of incidents that solve themselves.
In practice, the initial alert might be sent to a Microsoft Teams or Slack channel so the whole dev team has visibility on the outage. If the outage is still ongoing after 10 minutes, the alert is “real” and is escalated to the on-call engineer. You avoid alert fatigue when on-call intentionally misses the short-lived alerts that need no response.
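The escalation rule described above can be sketched as a small routing function. This is an illustration under stated assumptions, not how any specific alerting product implements escalations: the destination names `team-channel` and `on-call-engineer` are hypothetical placeholders, and the 10-minute delay is taken from the example in the text.

```python
from datetime import datetime, timedelta
from typing import Optional

ESCALATION_DELAY = timedelta(minutes=10)  # the 10-minute threshold from the text


def notification_targets(started_at: datetime, now: datetime,
                         resolved_at: Optional[datetime] = None) -> list[str]:
    """Return who should be hearing about an outage at time `now`."""
    if resolved_at is not None and resolved_at <= now:
        return []  # the incident solved itself, so nobody gets paged
    targets = ["team-channel"]  # hypothetical chat destination, visible to all
    if now - started_at >= ESCALATION_DELAY:
        targets.append("on-call-engineer")  # hypothetical paging destination
    return targets


start = datetime(2024, 1, 1, 3, 0)
early = notification_targets(start, start + timedelta(minutes=4))
late = notification_targets(start, start + timedelta(minutes=12))
healed = notification_targets(start, start + timedelta(minutes=12),
                              resolved_at=start + timedelta(minutes=5))
print(early, late, healed)
```

In this sketch a blip that recovers within five minutes only ever reaches the channel, while an outage still open at the 10-minute mark reaches the on-call engineer as well.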
This strategy also helps small teams mitigate alert fatigue. Escalating when there’s a big downtime event helps utilize the resources of a smaller team more effectively.
Categorizing Alerts to Avoid Alert Fatigue
Categorizing an alert, so DevOps understands the potential scope of the problem intuitively, is essential. Get in the habit of tagging checks and using notes. Schedule monthly (or otherwise routine) audits to ensure monitoring never becomes superfluous. This is important. You don’t want to over-monitor and over-alert. You need to optimize your team’s time if you want to stave off the horde for as long as possible.
This way your alert is doing more of the heavy lifting with detailed information about the outage, such as systems affected or even how to fix the issue.
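One way to picture tags and notes doing the heavy lifting is a check definition that carries them into the alert message. The `Check` class, its fields, and the message layout below are assumptions made for illustration; they do not mirror any particular monitoring tool.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    # Hypothetical check definition: a name plus the tags and notes you maintain.
    name: str
    tags: list[str] = field(default_factory=list)
    notes: str = ""


def format_alert(check: Check, status: str) -> str:
    """Build an alert message that states scope (tags) and a first step (notes)."""
    tag_text = ", ".join(sorted(check.tags)) if check.tags else "untagged"
    lines = [f"[{status}] {check.name}", f"Tags: {tag_text}"]
    if check.notes:
        lines.append(f"Notes: {check.notes}")
    return "\n".join(lines)


checkout = Check(
    name="checkout-api",
    tags=["payments", "customer-facing"],
    notes="Restart the worker pool first.",
)
message = format_alert(checkout, "DOWN")
print(message)
```

A check that prints as "untagged" with no notes is a good candidate for the next routine audit.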
Creating Instant Visibility on What Matters
One underrated feature in website monitoring is a customizable dashboard for each team. Use custom dashboards to provide internal visibility, which saves time.
Whether it’s a check wall that tells you the state of your system or just a dashboard you check throughout the day, centralizing alert information helps reduce the anxieties of being on call. Weight by importance, so poor performers get noticed and acted upon faster.
Use Primary and Secondary sort to fine-tune how check cards are displayed.
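The idea behind a primary and secondary sort can be sketched with a two-part sort key. This is not the dashboard's actual implementation; the `CheckCard` fields and the choice of keys (down checks first, then slowest response first) are assumptions picked to show poor performers rising to the top.

```python
from dataclasses import dataclass


@dataclass
class CheckCard:
    # Hypothetical card shown on a team dashboard.
    name: str
    is_down: bool
    response_ms: int


def sort_cards(cards: list[CheckCard]) -> list[CheckCard]:
    # Primary key: down checks first (False sorts before True, hence `not`).
    # Secondary key: slowest response first (negated for descending order).
    return sorted(cards, key=lambda c: (not c.is_down, -c.response_ms))


cards = [
    CheckCard("blog", is_down=False, response_ms=180),
    CheckCard("checkout-api", is_down=True, response_ms=0),
    CheckCard("search", is_down=False, response_ms=950),
    CheckCard("login", is_down=True, response_ms=0),
]
ordered_names = [card.name for card in sort_cards(cards)]
print(ordered_names)
```

Because Python's sort is stable, cards that tie on both keys keep their original relative order.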
Auditing Alerts for a Better On-Call
Let’s do a reality check and ask ourselves: of the volume of alerts we receive in a given period (say last month), how many did you act on?
I would wager the answer is somewhere close to “not many” or maybe even zero.
We can be honest here; the boss isn’t watching. You ignored those alerts, and you probably did so for a very good reason. Usually they signaled a problem that could wait until the morning, or they were otherwise just not that urgent. Maybe an automated service rebooted and the alert is there just for visibility. Who knows.
But since DevOps has taught you that existence is suffering, you can learn how to deal with the problems these alerts present. Namely, waking up at 3 AM to go “oh yeah, that happened” and then going back to sleep again.
Alert Auditing in Practice
Alert auditing really isn’t that complex. You compare the number of alerts you had to the number you responded to. Then you take that knowledge and look at how to proceed:
- What can you automate?
- What alerts signal common recurrences?
- What alerts signal something really bad happened?
A loose board (slow performance), leaking pipes (400 and 500 errors), windows that rattle a bit too much (frequent and brief outages). All of these signal potential problems the horde can exploit. Understanding when trouble is afoot is one of the most valuable skills experienced DevOps has to offer.
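The audit itself can be sketched as a per-check tally followed by a rough verdict. The thresholds below (acting on at least half the alerts means the check keeps paging, three or more mostly ignored alerts counts as a recurrence) are arbitrary assumptions for illustration, as are the check names in the sample data.

```python
from collections import defaultdict


def audit(alerts: list[tuple[str, bool]], recurring_at: int = 3) -> dict[str, str]:
    """Classify checks from (check_name, acted_on) pairs for the audit period."""
    counts = defaultdict(lambda: [0, 0])  # check -> [received, acted on]
    for check, acted_on in alerts:
        counts[check][0] += 1
        counts[check][1] += int(acted_on)

    verdicts = {}
    for check, (received, acted) in counts.items():
        if acted / received >= 0.5:
            verdicts[check] = "keep paging"            # usually needs a human
        elif received >= recurring_at:
            verdicts[check] = "automate or downgrade"  # common, mostly ignored
        else:
            verdicts[check] = "review"                 # too little data to judge
    return verdicts


# Made-up alert history for one month.
last_month = [
    ("disk-space", False), ("disk-space", False),
    ("disk-space", False), ("disk-space", False),
    ("checkout-api", True), ("checkout-api", True),
    ("cron-reboot", False),
]
verdicts = audit(last_month)
print(verdicts)
```

The verdict labels are only conversation starters; the team still decides whether a recurring alert gets automated, moved to a channel, or removed.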
Avoiding the Horror of Alert Fatigue and On-Call
Molly Struve says that more on-call can actually help your on-call anxieties. Let’s consider that part of the problem lies with the fear that you might not be able to respond adequately to the problem. But if you had more experience on call, handling more issues as they arise, you wouldn’t feel so ill-prepared.
And if you had more engineers on-call with you, you would have that familiar team element you’re used to with the advantage of their very specific knowledge and input.
Our SRE expert, Jon Arundel, suggests running frequent gameday exercises. More importantly, rotating the leadership roles is the only way to give the “new generation” the rigorous testing it needs to hold the fort. You discover vulnerabilities in your system with frequent testing, and everyone gets a chance at the hot seat in a low-stakes problem.
We also suggest runbooks, which can become indispensable compendiums of DevOps knowledge. Built up over time, runbooks can provide the tools any engineer needs for rapid solutions to outages.
Alerting and on-call are linked. The more you understand about how alerts affect the on-call experience, the easier it is to deal with the problem. Thousands of years after you are gone, after the face of the Earth has been forever altered by the zombie apocalypse, your knowledge can live on.