Is Alert Fatigue Hindering Your DevOps Work?
This year, you tell yourself, you are going to be prepared! You will arm yourself with a new status page and create web monitoring for every important service in your arsenal. Like the proverbial Eye of Sauron, nothing will escape your gaze. But too many tools in your stack can contribute to alert fatigue.
Alert fatigue occurs when your team starts to feel like they are always on call. They might already secretly feel this way. There’s a fine line between passion for the role and obligation to oversight, and good DevOps practice is to find a work-life balance. Alert fatigue is your biggest obstacle on this path.
For one, coordinating your team around 20 different toolsets that deliver monitoring piecemeal is a pretty huge ask. Analysis is also a slow process even when you take a proactive approach to incident management. There is always a new user to onboard or a new piece of infrastructure to worry about.
Egads! Your smart and ambitious thinking has put you on the road to disaster.
Avoiding Alert Fatigue in DevOps
A deluge of alerts can hurt monitoring because identifying the actionable ones becomes a full-time task. Dev accountability can also become muddled as your team scrambles to determine when an issue is worthy of action.
A good audit to give yourself: how many issues does each team member have to work on during a given shift?
Team size and specialty also influence on-call fatigue. Small teams might love a pit of alerts where everything is up for grabs because it can be more efficient. But what works for startups rarely scales well.
Specialized teams can share some on-call responsibility, but may feel ill-equipped to handle the entire system on their own. DevOps must make a concerted effort to improve coordination across specialized teams that are most comfortable in their own environments. On-call can even hurt business relationships, as under-equipped team members turn into under-appreciated team members and look for an exit.
Too many alerts can be a signal you are reacting to downtime, instead of adopting a more proactive approach. A series of small outages, or even one prolonged outage, can be a cue to investigate infrastructure and make improvements.
Friends don’t let friends pass off trends as coincidence.
10 Questions to Determine How Incidents Impacted You in 2020
- Did a specific incident lead to a measurable impact on customer retention?
- Did customers post on social media about an incident?
- How many incidents were escalated due to extended downtime?
- In what percentage of incidents did the person who received the alert also resolve it?
- How many incidents occurred outside business hours?
- Did a specific incident prompt an interruption in development?
- Did a specific incident cause your team to roll back a deployment?
- What is your team’s average response time to an incident?
- How many on-call hours were used?
- How many incidents were preventable?
Number 8 can be a tough one to gauge, but internal ticketing systems can be helpful for this kind of data gathering. Averaging first-engage times, the gap between when an alert was issued and when an action was taken, gives you a decent metric.
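To make the metric concrete, here is a minimal sketch of that calculation. The ticket records and field names are hypothetical; substitute whatever your ticketing system exports.

```python
from datetime import datetime
from statistics import mean

# Hypothetical ticket records: each pairs the time an alert was issued
# with the time a responder first took action.
tickets = [
    {"alerted": "2020-03-01T02:14:00", "engaged": "2020-03-01T02:20:00"},
    {"alerted": "2020-03-07T14:02:00", "engaged": "2020-03-07T14:05:00"},
    {"alerted": "2020-03-19T23:48:00", "engaged": "2020-03-20T00:03:00"},
]

def avg_first_engage_minutes(tickets):
    """Average gap, in minutes, between alert issuance and first action."""
    gaps = [
        (datetime.fromisoformat(t["engaged"])
         - datetime.fromisoformat(t["alerted"])).total_seconds() / 60
        for t in tickets
    ]
    return mean(gaps)

print(f"Average first-engage time: {avg_first_engage_minutes(tickets):.1f} minutes")
# prints "Average first-engage time: 8.0 minutes"
```

Tracking this number per shift, rather than in aggregate, also helps answer the audit question above about per-member workload.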
Number 10 is a lesson in self reflection, and it’s a tough one to answer honestly as more often than not it involves holding yourself and your squad accountable. If growth is truly the end goal, it’s one of the most important questions of the group.
Creating a Strategy for Alerting
To eliminate false positives, you need to develop an alert strategy that sends only actionable alerts and automates escalations. Put simply: send the alert to the right person every time.
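"Send the alert to the right person every time" can be sketched as a simple routing table with an escalation chain. The service names, owners, and fallback below are illustrative assumptions, not any real product's API.

```python
# Hypothetical routing table: each service maps to a list of responders,
# ordered primary-first. Unknown services fall through to the on-call lead.
ROUTES = {
    "payments-api": ["dana", "lee"],
    "static-site": ["sam"],
}
DEFAULT_CHAIN = ["oncall-lead"]

def route_alert(service, attempt=0):
    """Return who should receive this alert, escalating on repeated attempts."""
    chain = ROUTES.get(service, DEFAULT_CHAIN)
    # Once the chain is exhausted, everything lands on the on-call lead.
    return chain[attempt] if attempt < len(chain) else DEFAULT_CHAIN[-1]

print(route_alert("payments-api"))             # prints "dana"
print(route_alert("payments-api", attempt=1))  # unacknowledged: prints "lee"
```

The point of keeping routing declarative like this is that escalation happens automatically, with no 3 a.m. judgment calls about who owns what.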
Determining When an Alert Occurs
There is a “tree falling in an empty forest” kind of question to be asked here: when is downtime actually downtime? If downtime lasted one minute and was localized to a specific region, it can be difficult to pinpoint. But an extended outage of five minutes or longer tends to point to something suboptimal in the backend.
Modern incident response often looks like a few scattered user reports, some initial investigation, and, upon confirmation of the outage, an update to the status page. It is in this final step that SLA accountability kicks in, making it critical to answer “is it really down?” as quickly as you can.
Threshold, sensitivity, and retries are most important as you consider your SLA obligations. More often than not, the first response is not as important as confirming the outage as quickly as possible.
- Threshold: how long a check waits before it registers a timeout error
- Retries: how many times a probe server will retry checking your URL
- Sensitivity: how many probe servers must register as down before the check alerts you
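The way retries and sensitivity combine to confirm an outage can be sketched as follows. The parameter names mirror the list above, but the logic itself is an assumption for illustration, not Uptime.com's actual implementation.

```python
# Sketch of down-confirmation: a probe counts as down only if every
# attempt (initial check plus retries) timed out, and an alert fires
# only when `sensitivity` probes agree the target is down.
def confirm_down(probe_results, retries=2, sensitivity=3):
    """
    probe_results: {probe_name: [True/False per attempt]}, True = timed out.
    Returns True when enough probes independently confirm the outage.
    """
    down_probes = sum(
        1 for attempts in probe_results.values()
        if all(attempts[: retries + 1])
    )
    return down_probes >= sensitivity

results = {
    "us-east": [True, True, True],
    "eu-west": [True, True, True],
    "ap-south": [True, False, True],  # recovered on a retry: not down
}
print(confirm_down(results))  # two probes down, sensitivity 3: prints "False"
```

Raising sensitivity trades a slower first alert for fewer false positives, which is exactly the tension the previous paragraph describes.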
Tip: try multiple checks with increasingly higher timeout thresholds for the same URL. You will receive progressive alerts from each check as response time increases, providing a clearer picture of a performance issue in real time.
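The layered-threshold tip can be sketched like this: several checks against the same URL, each with a higher timeout, so worsening response time trips them one by one. The check names and threshold values are examples only.

```python
# Hypothetical layered checks on one URL, ordered by timeout threshold
# in seconds. As response time climbs, more of them fire.
CHECKS = [
    ("warning", 2.0),
    ("degraded", 5.0),
    ("critical", 10.0),
]

def triggered_checks(response_time):
    """Return the names of all checks whose threshold was exceeded."""
    return [name for name, limit in CHECKS if response_time > limit]

print(triggered_checks(1.2))   # prints "[]"
print(triggered_checks(6.3))   # prints "['warning', 'degraded']"
print(triggered_checks(11.0))  # prints "['warning', 'degraded', 'critical']"
```

Which tier fires first tells you at a glance whether you are looking at mild degradation or a full outage, without waiting for a hard timeout.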
Use Notes to Beef Up Response Capability
Adjustments made to alerting over time reflect how your system responds to downtime. Monitoring should never be considered “set and forget”. Your preconceptions about systems will turn into concrete knowledge after a year gathering data.
Notes can be one of the most powerful tools in your arsenal when you use them properly. A simple note informing your lower tiers to reboot a service, or even what the service is related to can help a great deal during an on-call incident in the early morning hours.
Notes and repetitive alerts can also help you build rules for when to automate a reboot.
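One way such a rule might look: if the same alert fires some number of times within a window, flag the service for an automated reboot. The count and window below are illustrative assumptions, not recommended values.

```python
from collections import deque
import time

# Hypothetical rule: if the same alert fires 3 times within 10 minutes,
# the service becomes a candidate for an automated reboot.
class RebootRule:
    def __init__(self, max_alerts=3, window_seconds=600):
        self.max_alerts = max_alerts
        self.window = window_seconds
        self.timestamps = deque()

    def record_alert(self, now=None):
        """Record an alert; return True when the reboot rule triggers."""
        now = now if now is not None else time.time()
        self.timestamps.append(now)
        # Drop alerts that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.max_alerts

rule = RebootRule()
print(rule.record_alert(now=0))    # prints "False"
print(rule.record_alert(now=120))  # prints "False"
print(rule.record_alert(now=300))  # third alert in 10 minutes: prints "True"
```

The notes attached to the alert tell you *what* to automate; a counter like this tells you *when* it is safe to let the machine act alone.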
As you monitor and gather data, you can use notes to build on your operational knowledge with clear instructions and troubleshooting techniques that lead to the desired outcome. Much like “Comment Your Code”, build on what you know because you and your colleagues exist in a state of impermanence.
Uptime.com includes a notes section that is configurable with every check. Use the Advanced tab to find it.
Rotate On-Call and Share the Load
We can think of alert fatigue like the struggles of learning a language or trying to develop some new feature. You don’t know where to start. It’s overwhelming, but you’re expected to act. Adrenaline can only carry you so far. You need a strategy. You can only develop a sense for what you are doing if you get out there and do it!
Rotating on-call shifts helps give everyone a chance to participate. Take it a step further and rotate on-call roles. Who will lead this incident? Post-incident reviews can be effective here, so any learning stumbles are not lost in the shuffle of responding to an outage.
Rotating roles also prepares juniors to handle the work of seniors. If you cannot trust your juniors now, you likely never will and that’s not an environment for growth or longevity.
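A rotation that shifts both the lead and the backup role each week gives juniors lead experience with a senior nearby. This is a minimal sketch; the team names are placeholders.

```python
# Hypothetical on-call rotation: lead and backup both advance each week,
# so everyone cycles through the incident-lead role.
team = ["ana", "bo", "chen", "dev"]

def build_rotation(team, weeks):
    """Pair each week with a lead and a backup, shifting both weekly."""
    schedule = []
    for week in range(weeks):
        lead = team[week % len(team)]
        backup = team[(week + 1) % len(team)]
        schedule.append((week + 1, lead, backup))
    return schedule

for week, lead, backup in build_rotation(team, 4):
    print(f"Week {week}: lead={lead}, backup={backup}")
```

Pairing each lead with next week's lead as backup also means the handoff context travels with the schedule instead of living in one person's head.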
Combat Fatigue with Smart Alerting
Yes, monitoring for uptime is critical. However, our research shows that a few incidents with poor response time can sink SLA obligations. It must be of equal importance that the team members most able to act on an alert are the ones to receive it first.
That kind of knowledge takes time and data, so the incremental approach to monitoring we’ve outlined here allows you to build on what you have learned. As you learn each team member’s strengths, you will also gain insight into how knowledge sharing can improve your organization.
Minute-by-minute Uptime checks.
Start your 21-day free trial with no credit card required at Uptime.com.