How Alert Notifications Make Incident Response More Effective
HR people have a saying: right person, right place, right time. The right resources can make all the difference when it counts. The same goes for incident management and response, where the wrong person, place, or time can turn an outage into a mounting catastrophe.
As systems grow, the right person really can make the difference during an outage, simply because of their command or knowledge of the system. Ideally knowledge is shared, but it’s natural for some people to have more ownership or agency than others. Alerting and website monitoring need to take these hierarchies into account to be effective.
Today we’ll dig into an important DevOps metric: MTTR, or Mean Time to Review, Respond, or Repair. We’ll look at what this metric can mean and how you can apply it to your organization, all while laying out a practical alert structure for each “R” you might want to refer to.
Plotting The Outage Timeline
It’s easy when you’re in the moment to forget outages have phases.
During the early phase, detection, the question is: how quickly can we find the problem, ideally before our end users do? With one-minute monitoring intervals, the answer is close to real time. When the site goes down, you get an alert.
Your support team is probably familiar with the idea of accountability and response. Who will claim the ticket and what is required for a solution? Yet somehow there is this disconnect once we stop directly serving the customer.
Diagnosis is rarely simple, but we can always improve our process because DevOps is an extension of customer service. A small number of incidents can add up to major downtime over a quarter, and improving your MTTR (not simply hurling more on-call hours at it) becomes an important component of SLA fulfillment.
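To see how quickly a handful of incidents adds up, here is a minimal Python sketch of the arithmetic. The 99.9% target, the 90-day quarter, and the outage lengths are illustrative assumptions, not figures from any real account.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Return the most downtime a period can absorb while still meeting the SLA."""
    return period * (1 - sla_percent / 100)


def remaining_budget(sla_percent: float, period: timedelta,
                     outages: list[timedelta]) -> timedelta:
    """Return the budget left after the given outages (negative means a breach)."""
    return downtime_budget(sla_percent, period) - sum(outages, timedelta())


# Illustrative numbers only: a 90-day quarter, a 99.9% target, four outages.
quarter = timedelta(days=90)
outages = [timedelta(minutes=m) for m in (25, 40, 50, 30)]

budget = downtime_budget(99.9, quarter)          # about 129.6 minutes
left = remaining_budget(99.9, quarter, outages)  # negative: 145 minutes were used

print(f"Budget: {budget.total_seconds() / 60:.1f} min")
print(f"Remaining: {left.total_seconds() / 60:.1f} min")
```

In this made-up quarter, four outages of under an hour each are enough to overshoot the budget, which is why shaving minutes off each incident matters more than it first appears.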
Mean Time To Review
Let’s begin with Mean Time to Review. A problem occurred, so how quickly can someone take a closer look at it? Take a quick glance at some of your past incidents, maybe in the last quarter. How much of that downtime passed before someone was able to put eyes on the problem?
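If you keep timestamps for your past incidents, that glance can become a number. The sketch below assumes a hypothetical incident record with `detected_at` and `reviewed_at` fields; your own incident log will likely use different names, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical field names, chosen for illustration.
    detected_at: datetime   # when monitoring first flagged the outage
    reviewed_at: datetime   # when a person first looked at it


def mean_time_to_review(incidents: list[Incident]) -> timedelta:
    """Average the gap between detection and first human review."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.reviewed_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


# Invented sample data: review gaps of 4, 12, and 35 minutes.
last_quarter = [
    Incident(datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 4)),
    Incident(datetime(2024, 2, 2, 14, 30), datetime(2024, 2, 2, 14, 42)),
    Incident(datetime(2024, 3, 17, 2, 15), datetime(2024, 3, 17, 2, 50)),
]

print(mean_time_to_review(last_quarter))
```

The same calculation works for the other two “R”s if you swap in the timestamp of the response decision or of the restored service.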
Alerts have the greatest potential to improve time to review when they are timely and delivered to the right person.
How many times have you seen an issue pop up in a public channel, and counted the minutes or even hours before someone claimed it? Delivery to a point person can fix this nebulous problem.
Uptime.com users can easily structure alerts for delivery to various destinations. For example, we can define two contacts to receive an alert the moment it occurs. We might send to our Microsoft Teams devops channel, and simultaneously deliver an email or send a text to our lead sysadmin, who can delegate to the person best suited to respond to the outage.
A Note On Personal Time
It’s important to define on-call hours for your users and contacts. Each contact has a built-in on-call schedule by day, week, or month, so everyone on your team can share responsibility and maintain website uptime without compromising their work/life balance.
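As a rough illustration of the idea, the sketch below models contacts with weekly on-call windows and picks who should receive an alert at a given moment. It is not Uptime.com’s API or data model; the `Contact` class, its fields, and the contact names are all assumptions made for the example, and overnight shifts that cross midnight are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Contact:
    # Hypothetical model, not a real Uptime.com structure.
    name: str
    channel: str        # e.g. "teams", "email", "sms"
    days: frozenset     # weekdays on call, Monday=0 ... Sunday=6
    start: time         # start of the daily on-call window
    end: time           # end of the daily window (same day, inclusive)

    def is_on_call(self, when: datetime) -> bool:
        """True if `when` falls on an on-call day and inside the daily window."""
        return when.weekday() in self.days and self.start <= when.time() <= self.end


def recipients(contacts: list[Contact], when: datetime) -> list[Contact]:
    """Return the contacts who should receive an alert raised at `when`."""
    return [c for c in contacts if c.is_on_call(when)]


ALL_WEEK = frozenset(range(7))
contacts = [
    Contact("devops-channel", "teams", ALL_WEEK, time.min, time.max),
    Contact("lead-sysadmin", "sms", frozenset(range(5)), time(9), time(18)),
    Contact("weekend-oncall", "sms", frozenset({5, 6}), time.min, time.max),
]

# Wednesday 10:00 versus Saturday 03:00.
for when in (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 6, 3, 0)):
    print(when, [c.name for c in recipients(contacts, when)])
```

The shared channel always hears about the outage, while the text message only goes to whoever is actually on call, so nobody is paged on their day off.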
Mean Time To Respond
Somewhere between reviewing the problem and repairing the issue lives the time to respond. This is the critical metric, and the one you’re most likely already tracking: the time it takes to arrive at a decision about what to do.
Uptime.com alerts can help with technical details that give your first responders clues about the incident. Is it a 500 error? Is a string missing but the site is otherwise responsive?
Coupled with real-time analysis, which provides screenshots and a chronological breakdown of the outage, you have a powerful toolset for diagnosis and informed decision making, from tracking performance before, during, and after the outage to pinpointing the geographic locations where connections are faulty. Accounts with traceroute enabled can dive deeper and examine the connection from us to you, with data from every hop we took along the way.
Response is rarely simple enough for one person to fix every problem. Sometimes, it is critical to escalate and involve other team members. How much time is wasted breaking down what your tier 1 team learned about the outage?
With Uptime.com, you can designate higher tiers for response and define a custom schedule, in hours or minutes, for when they are notified. With notes built into your checks, you can inform colleagues of steps that have led to successful resolution, or you can build an internal status page where your team can document actions taken during each phase of the outage.
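One way to picture tiered escalation is as a list of tiers, each with a delay after which it is brought in if the outage is still unresolved. The sketch below is a generic illustration under that assumption; the `Tier` class, the tier names, and the 15 and 45 minute delays are hypothetical and not Uptime.com settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    # Hypothetical escalation tier, for illustration only.
    name: str
    after_minutes: int   # minutes of unresolved downtime before this tier is notified


def tiers_to_notify(tiers: list[Tier], minutes_down: int) -> list[str]:
    """Return the names of every tier whose delay has elapsed, lowest tier first."""
    ordered = sorted(tiers, key=lambda t: t.after_minutes)
    return [t.name for t in ordered if minutes_down >= t.after_minutes]


policy = [
    Tier("tier-1", 0),    # first responders hear about it immediately
    Tier("tier-2", 15),   # senior engineers after 15 minutes
    Tier("tier-3", 45),   # engineering lead after 45 minutes
]

for minutes in (0, 20, 60):
    print(minutes, tiers_to_notify(policy, minutes))
```

Keeping the delays explicit gives tier 1 a defined window to work the problem before anyone else is interrupted.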
Mean Time To Repair
Something must be done to restore service, even if that something is waiting. Barring circumstances outside your control (which you can still plan for), improving mean time to repair is all about understanding the problem as quickly as possible.
As work begins, it’s important to understand the status of the service in real time. Uptime.com’s up alerts tell you precisely when all locations report that everything is A-OK. You can query the real-time status of locations at any time to confirm how well your efforts are working across multiple locations.
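The logic behind an “all clear” can be sketched in a few lines: recovery is only confirmed when every probe location reports the service as up. The function and the location names below are assumptions for illustration, and how you obtain the per-location results is left out.

```python
def recovery_status(location_up: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (all_clear, failing_locations) for a set of per-location results.

    all_clear is True only when at least one location reported and none failed.
    """
    failing = sorted(loc for loc, up in location_up.items() if not up)
    all_clear = bool(location_up) and not failing
    return all_clear, failing


# Hypothetical location names and results, mid-repair and after the fix.
mid_repair = {"us-east": True, "london": False, "singapore": True}
after_fix = {"us-east": True, "london": True, "singapore": True}

print(recovery_status(mid_repair))
print(recovery_status(after_fix))
```

Listing the locations that still fail is useful in its own right, since a fix that works in one region and not another points toward a network or CDN issue instead of the application.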
Real-time status is one of the last steps you will take in putting the outage behind you. Having good exit data about the outage will help in your analysis of it and offer ideas for future prevention.
Alert Structure: Putting Together Alerting For Your Team
We’ll say it again: alerting is really about right place, right time. Getting technical details to someone who can begin analysis is the most important step you can take today toward improving your MTTR tomorrow, no matter which “R” means the most to you. Let’s recap:
The key to everything is timely alerting that tells you when your site is down, and when it’s back up so you can get back to life as it was before the outage struck.
Deliver The Initial Alert To The Right Person
Put the initial downtime alert in the right hands. This is critical to beginning your review process, determining the extent of the problem, and creating a strategy for resolving it.
Use Diagnostic Tools And Gather Data
Tools like traceroute and real-time analysis offer your team valuable data about how your service is performing as you work toward resolution.
Escalation Techniques Lead To Success
Designate a point person, then escalate when the time is right. Structuring escalations gives your lower tiers the opportunity to work through an outage and keeps your higher tiers informed when they eventually may need to respond.
Reach Out to Uptime.com
Downtime is inevitable, but suffering from it doesn’t have to be. Make use of Uptime.com resources, from best practices to documentation to our support team, which is made up of real humans.
Timely alerting that goes to the right person has proven effective in reducing MTTR. Take the time to plot out your alerting hierarchy and Uptime.com will take care of the rest.