How Downtime Can Affect Morale, And What You Can Do About It
Does the worst-case scenario for your company include alert fatigue from false alarms? Maybe it should.
No one likes a false positive in infrastructure monitoring. These false flags are especially irritating because you have to respond to a problem that doesn’t actually exist.
Just how bad are false positives?
Let’s break down what these annoying little mistakes add up to for your team. You might be surprised to learn just how much they are hurting your DevOps pipeline.
False Positives Cause Restlessness
Let’s begin with the obvious: you have to wake up for that 3 AM false positive. Adults need 7-9 hours of sleep a night to feel well rested, and the consequences of losing that sleep affect everything we do.
Before we get into any of the business and economics of outages, we have to acknowledge the personal toll false flags take on all of us. Lost sleep brings poor memory and slow reaction times, and it hurts our critical thinking, our relationships at work and at home, and our overall sense of well-being.
It’s no secret: you feel like garbage when you don’t sleep.
What You Can Do
Find a monitoring system with customizable alert thresholds.
The false positive problem is preventable with Uptime.com monitoring. Using our sensitivity system, you can designate the number of probe servers that must fail before downtime alerts occur. Using our escalations system, you can schedule alerts to arrive only after extended periods of downtime.
No more 3 AM wakeups for 2-minute outages.
We recommend you create a central repository for alerts. That can be a group email address, like alerts@ or uptimecom@, but the idea is a central location to field all incoming alerts. The next step is to create escalations to be sent to the person in charge of the system after downtime exceeds a certain amount of time. You need to know when downtime occurs, but you don’t always need to respond to every 2-minute outage.
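The sensitivity idea above can be sketched in a few lines. This is a hypothetical illustration, not Uptime.com’s actual API: the function name, the probe-location names, and the default threshold are all assumptions, but the logic mirrors the described behavior of requiring multiple probe servers to agree before a downtime alert fires.

```python
# Hypothetical sketch of a sensitivity threshold: only alert when
# enough independent probe locations agree the check is down.
# Names and the default threshold are illustrative.

def should_alert(probe_results, sensitivity=3):
    """probe_results maps probe location -> bool (True = check failed).
    Alert only when at least `sensitivity` locations report failure."""
    failures = sum(1 for failed in probe_results.values() if failed)
    return failures >= sensitivity

# One flaky probe is not enough to wake anyone up:
results = {"us-east": False, "eu-frankfurt": True, "ap-tokyo": False}
assert should_alert(results) is False

# Three independent locations agreeing looks like a real outage:
results = {"us-east": True, "eu-frankfurt": True, "ap-tokyo": True}
assert should_alert(results) is True
```

The design choice is simple quorum logic: a single probe failing is far more likely to be network noise than a genuine outage, while simultaneous failures from multiple regions rarely are.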
The DevOps Pipeline Suffers
False outages still carry a cost to diagnose. Brainpower and human hours are spent opening emails, investigating systems, and, if you’re thorough, double-checking those systems.
There can also be accrued costs in message delivery. SMS and phone credits are included in most monitoring platforms, but they usually aren’t free.
The time you spend investigating and fielding alerts (or otherwise dealing with fallout) is time not spent developing new items. Frequent outages, true or not, tend to shift development priority away from new items and toward optimization. And that can have benefits! Maybe your system does need a little extra love, some updates, and a bit of bug stomping.
But do you really want to do that at the cost of providing your end users the features they are chomping at the bit to get?
What You Can Do
If you suffer from false alerts, you need to diagnose the root cause of the issue or figure out another avenue to monitor from. Monitoring checks are automated, and they are very effective for testing publicly available infrastructure. You might need to supplement with private location monitoring for eyes on systems the public can’t see. You might need additional check types, like API checks, for insights into interconnected systems and processes.
Alerts you initially think are false positives may not be. Observability will help you to better understand what causes these failures, true or not, so you can act effectively against outages affecting your real users. Add more check types to gain more insights so you can definitively show whether a system is up or down, and have multiple fail-safes in place to prove your results.
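One common fail-safe for proving a result is to re-run a check several times before declaring the target down, so a single transient failure can’t raise an alert on its own. The sketch below is a hypothetical illustration: `fetch` stands in for whatever performs the real HTTP or API check, and the function and parameter names are assumptions.

```python
# Hypothetical "confirm before alerting" fail-safe: re-run the check
# a few times and only report DOWN if every attempt fails.
# `fetch` is any callable that returns True when the check passes.

def confirmed_down(fetch, attempts=3):
    """Report DOWN only if all `attempts` consecutive checks fail."""
    for _ in range(attempts):
        if fetch():          # any single success means the target is up
            return False
    return True

# A flaky endpoint that fails once and then recovers is not "down":
responses = iter([False, True, True])
assert confirmed_down(lambda: next(responses)) is False
```

In practice you would add a short delay between attempts, and ideally run the retries from a different network path than the first failure.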
Bad Alerts Breed Complacency
When multiple automated alerts pile up, they tend to blend together and slip past your radar. That’s a form of alert fatigue, and you’ve probably already experienced it. How often do we open automated emails for further investigation?
The same concept applies to important channels, where a flood of false flag alerts can cloud important metrics and make it harder to find the detail you need.
What You Can Do
To overcome alert fatigue, you need to consider how alerts are structured and who needs to see them when. One of our top recommendations is to utilize an escalation structure so you only see an alert when it requires action.
Let’s say you have an HTTP(S) check with one-minute monitoring. If it goes down, you need to know. But do you need to respond the first minute it’s down? What if it’s down for only 5 minutes?
With escalations, you can define a set interval like 10 or 30 minutes, so you respond only when it’s REALLY down and needs your intervention.
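That escalation rule can be sketched as a small timing check. This is a hypothetical illustration, not the product’s real escalation engine: the function name and default window are assumptions, but the behavior matches the idea above of suppressing alerts until downtime exceeds a set interval.

```python
from datetime import datetime, timedelta

# Hypothetical escalation sketch: only page a human once the outage
# has lasted longer than the escalation window. Names are illustrative.

def should_escalate(down_since, now, escalation_window=timedelta(minutes=10)):
    """Return True once downtime has reached the escalation window."""
    return (now - down_since) >= escalation_window

start = datetime(2024, 1, 1, 3, 0)  # check went down at 3:00 AM

# A 5-minute blip stays quiet; a 12-minute outage pages someone:
assert not should_escalate(start, start + timedelta(minutes=5))
assert should_escalate(start, start + timedelta(minutes=12))
```

The short blip is still recorded for the alert history; escalation only decides when a person gets interrupted.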
But you also need a catch-all location: somewhere to field alerts and create a history of events, because those small outages often add up to a larger problem (maybe bandwidth or hard disk issues). We recommend Slack or Microsoft Teams for that, but anywhere you can field alerts and build a meaningful record will do. Check our integration providers to see whether your preferred tool is on the list.
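Forwarding every alert into one channel is usually just a webhook post. The sketch below is a hypothetical example of feeding a Slack catch-all channel: the webhook URL is a placeholder and the helper names are assumptions, though the `"text"` payload field is Slack’s standard incoming-webhook format.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical catch-all sketch: forward every alert to one Slack
# channel via an incoming webhook so a searchable history builds up.

def build_alert_payload(check_name, status, duration_minutes):
    """Build a minimal Slack incoming-webhook payload for one alert."""
    return json.dumps({
        "text": f"[{status}] {check_name} - down for {duration_minutes} min"
    }).encode("utf-8")

def post_to_slack(webhook_url, payload):
    """Send the alert payload to the channel's incoming webhook."""
    req = Request(webhook_url, data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return resp.status

payload = build_alert_payload("HTTP(S) homepage", "DOWN", 12)
# post_to_slack("https://hooks.slack.com/services/...", payload)  # placeholder URL
```

Because every alert lands in one place, the channel doubles as the event history discussed above: scrolling back through it reveals patterns that individual emails never would.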
False Positives Erode End User Trust
Reliability is why you chose to implement monitoring in the first place, which makes it especially frustrating when your monitoring generates false alerts. They can undermine the reliability you have worked so hard to maintain, and the trust you have built with your customers.
How can you establish a reliable service when you cannot separate user-generated reports from actual downtime?
What You Can Do
Sometimes a false alarm isn’t a false alarm, or it’s related to a change outside your control. One of the ways you can help yourself in this situation is to reach out to support. Support representatives for your monitoring provider should be able to tell you the status of probe servers. Of course, you should also check the provider’s status page in addition to, or before, submitting any reports.
The next step is to clearly communicate the state of your infrastructure to your own users. We recommend a status page of your own, but email or customer service will also suffice depending on the size and responsiveness you aim for.
False Downtime Muddles Reporting
If you identify some incorrect downtime in your reports, it becomes difficult to view the data set as credible. Entire weeks’ worth of reporting can be lost because of a simple misconfiguration or a bad alert.
What You Can Do
We allow our customers to ignore any alerts that were generated unintentionally. Maintenance is a good example: your scheduled maintenance window may have ended before the actual work did.
If you forget to adjust that window, you can simply ignore the alert that is generated and it won’t count against your SLA requirements.
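The effect of ignoring an alert on your reported numbers can be shown with a small sketch. This is a hypothetical illustration of the accounting, not the product’s actual reporting code: the function and field names are assumptions.

```python
# Hypothetical SLA sketch: compute uptime percentage while excluding
# incidents marked "ignored" (e.g. alerts fired during a maintenance
# window that ran long). Field names are illustrative.

def uptime_percentage(total_minutes, incidents):
    """incidents: list of dicts with 'duration' (minutes) and 'ignored' (bool)."""
    counted = sum(i["duration"] for i in incidents if not i["ignored"])
    return 100.0 * (total_minutes - counted) / total_minutes

week = 7 * 24 * 60  # 10,080 minutes in a week
incidents = [
    {"duration": 30, "ignored": True},   # fired during overrun maintenance
    {"duration": 14, "ignored": False},  # real outage
]

# Only the real 14-minute outage counts against the week's SLA:
assert uptime_percentage(week, incidents) == 100.0 * (week - 14) / week
```

Without the ignore flag, the 30-minute maintenance overrun would count as downtime and distort the week’s report.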
False Downtime Eats Time
The biggest drain false downtime has on your organization is the time it takes to assess the problem. With the right alert data at your fingertips, and proper escalations in place, you will find your team responding to fewer small outages and false flags.
With proper reporting in place, you’ll be able to track actual downtime as it happens, making it easy to separate the real from the fake.
Monitoring should help measure your system, not create headaches chasing reliability.
Minute-by-minute Uptime checks.
Start your 21-day free trial with no credit card required at Uptime.com.