Building a Smarter Escalation Matrix with Uptime.com
The idea behind an escalation matrix is simple: some situations require greater authority to resolve. Authority can take many forms, including experience with a particular toolset or simply the proper permissions to flip the right switches. Escalation therefore means putting the proper information into the right person’s hands (well, device).
When your infrastructure is down, you need resolution as quickly as possible. But have you established a protocol for who does what when things go wrong? Delegation isn’t something you want to hash out when your server is on fire. When seconds matter, is your team prepared with the information it needs to resolve the downtime or escalate to someone who can?
Today we’re going to talk about making smarter escalations using the Uptime.com platform. We’ll look at how to create checks that offer multiple points for monitoring your infrastructure, and how to put an escalation protocol in place that delivers.
How Important is Your Escalation Matrix in Website Monitoring?
We define a smart escalation as multiple points of contact designed to receive information at various stages of a downtime incident, and the concept is very important to the work SREs do.
Our job is uptime.
When a site goes down, an SRE begins the process of assessing the problem and restoring service. Often, those steps involve a number of people working together. Perhaps customer service gets a phone call about the outage and alerts Tier I, who works on known issues. The basic protocol is probably well-established, but what happens when problems extend beyond Tier I’s control?
Smart escalations mean that when customer service receives that call, Tier I support has already begun working on the issue. A smart escalation might also include customer service itself, perhaps someone in management or a lead for the shift who can communicate a known issue.
Using smart escalations, you can determine who gets which alert at what stage of downtime. They give you and your team several benefits:
- Outage data collected from the time of the alert for real-time analysis
- Notes that direct first responders on deploying known fixes
- Data your first responders accumulate while trying to restore service
- Automated alerts to secondary data sources that trigger only under certain conditions
Here’s how to create them in the Uptime.com UI.
Set Up Smart Escalations for Your Uptime Monitor
Once we’ve logged into our dashboard, our next step is to click on Notifications>Contacts. We’ll create three Tier I contacts, one for each method of communication, and then escalate from there.
Let’s head to a check to see this in practice.
The three Tier I contacts are designed to make sure someone is alerted to downtime and knows the problem isn’t a temporary outage from some outside cause. When an alert occurs, Uptime.com will issue an alert email to the check’s primary contact (whoever or whichever team that may be).
The first escalation triggers at the five-minute mark, by which point we know the outage is a problem we should look into. Next, we’ll issue an SMS alert at the ten-minute mark. This alert might be programmed for your team lead’s phone, or for an SMS service in the office. The third escalation in Tier I is a phone alert, and it occurs at the thirty-minute mark.
We’ve gradually increased the sense of urgency, and ensured that the alert data (and the alert itself) is pushed to a location where the user is most likely to take action.
Tier II will have a similar structure, with the email alert coming at the two-hour mark to allow Tier I time to work on a resolution.
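To make the timing concrete, here is a minimal Python sketch of the matrix described above. It is only a model for reasoning about the schedule, not Uptime.com configuration or API code. The `Escalation` class and the helper names are hypothetical, and pairing the five-minute step with the email contact is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Escalation:
    tier: str           # "Tier I" or "Tier II"
    method: str         # how the alert is delivered
    after_minutes: int  # minutes of downtime before this step fires


# Timings taken from the walkthrough above.
ESCALATIONS = [
    Escalation("Tier I", "email", 5),   # assumption: the five-minute step uses email
    Escalation("Tier I", "sms", 10),
    Escalation("Tier I", "phone", 30),
    Escalation("Tier II", "email", 120),
]


def fired(elapsed_minutes: int) -> List[Escalation]:
    """Return every escalation whose threshold has been reached."""
    return [e for e in ESCALATIONS if elapsed_minutes >= e.after_minutes]


def next_escalation(elapsed_minutes: int) -> Optional[Escalation]:
    """Return the next escalation still pending, or None if all have fired."""
    pending = [e for e in ESCALATIONS if e.after_minutes > elapsed_minutes]
    return min(pending, key=lambda e: e.after_minutes) if pending else None


if __name__ == "__main__":
    for step in fired(12):
        print(f"{step.tier}: {step.method} at {step.after_minutes} min")
    print("Next:", next_escalation(12))
```

Writing the matrix out this way makes it easy to spot gaps, such as a long quiet stretch between the thirty-minute phone call and the two-hour Tier II email.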
Using the Notes Section
Notes are the unsung hero of our escalation. This field is for instructions on what to do in the event of an outage. They can be broad or specific to a certain deployment (a good idea after maintenance was performed), but notes should be the guidepost Tier I follows to troubleshoot and hopefully restore service. Here’s a helpful post on troubleshooting documentation that offers some ideas, such as documenting possible causes and symptoms.
Their secondary purpose is to gather data, so good notes instruct teams to troubleshoot known issues and record what they find. Ideally, Tier II should get the escalation only after all known fixes have been tried. At that point, Tier II can decide whether to escalate further or act on known data. If their efforts fail, they don’t need to worry about updating IT administration, who is surely keeping tabs on an outage that is hours long.
Internally, you can help your escalations get smart with designated roles that can act with some authority for outages that require fixes to core technology or new releases. Next come the high-priority contacts. These include:
- Project/product managers
- Development leads
- Senior-level developers
Finally, define administrators and project owners/stakeholders with the technical knowledge to apply a fix. This role is your last line of attack/defense, and should be designed to cover situations where all other responses have failed.
Simple Use Case | Investigating Tiers of Response
Your team is launching a new feature today, and you are creating a check to monitor for crashes. This scenario will walk you through a smart approach to your escalation matrix and downtime reporting, with an expected turnaround of two hours.
Let’s assume your junior and senior-level developers have contacts already in Uptime.com. Let’s also assume that your team has prepared for this product launch and is aware of some known issues. Testing might be ongoing, but there are known fixes in case certain elements crash.
Step one is to head to the Advanced tab and add all the known troubleshooting steps your Tier I should follow into Notes, plus instructions on how to collect any unexpected data returned to them.
Next, you’ll create your escalation to your various Tiers of support as we outlined above. A useful tip here is to send your higher level alerts directly to the push notification providers your team trusts for project management. Slack is a common application for collaboration, but we’ve integrated with multiple partners to deliver alert data wherever you work.
On launch, the feature crashes because of course it does. If the known fixes applied by Tier I are unable to restore service after two hours pass, Uptime.com issues Tier II an email with the alert data. Internally, Tier II can query Tier I for its report and act accordingly.
If the team is unable to restore service after an additional thirty minutes, Uptime.com would trigger an alert in the developer Slack channel. The issue would need ownership and resolution at that point with enough technical data to begin the diagnosis from an informed perspective.
In this scenario, senior-level developers deal with fewer interruptions to their work and act only out of necessity. Additionally, the team does not need to waste time explaining the conditions of the outage as the developer would have a record in Slack and via multiple Uptime.com alerts.
Maintenance Time and Escalations
Let’s assume that the senior-level developer recognizes a big problem and immediately hops into action at the 30-minute outage escalation we’ve created. This developer coordinates with the team and determines an additional four hours are needed to create, test, and then apply a patch.
Return to your check and put it Under Maintenance Now. This setting is useful for unplanned maintenance, but remember that someone will need to turn maintenance off when you’ve finished so the check resumes monitoring normally.
Downtime and On-Call Hours
How many of you have teams distributed across the globe? Smart escalations also take into account on-call hours. To create on-call hours, you need to visit your Contact screen.
Create a schedule for each contact that designates when Uptime.com should reach out to that team. Overlapping schedules will provide 24-hour coverage, and ensure employees can enjoy their time off without worrying about outages.
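Before relying on overlapping schedules, it can help to verify that they really do cover all 24 hours. The sketch below is one way to check that on paper. The team names and shift hours are hypothetical, hours are assumed to be in UTC, and none of this reflects how Uptime.com stores schedules internally.

```python
from typing import Dict, List, Set, Tuple


def covered_hours(start: int, end: int) -> Set[int]:
    """Hours (0-23) covered by a shift running from start up to, not including, end.

    A shift may wrap past midnight, e.g. (22, 7). Equal start and end
    is treated as a full 24-hour shift.
    """
    if start < end:
        return set(range(start, end))
    return set(range(start, 24)) | set(range(0, end))


def uncovered_hours(shifts: Dict[str, Tuple[int, int]]) -> List[int]:
    """Return the hours of the day when no contact is on call."""
    covered: Set[int] = set()
    for start, end in shifts.values():
        covered |= covered_hours(start, end)
    return sorted(set(range(24)) - covered)


# Hypothetical on-call shifts, in UTC, each overlapping the next by an hour.
SHIFTS = {
    "emea": (6, 15),
    "americas": (14, 23),
    "apac": (22, 7),
}

if __name__ == "__main__":
    gaps = uncovered_hours(SHIFTS)
    print("Uncovered hours:", gaps if gaps else "none")
```

The one-hour overlaps give each team a handoff window, so an outage that starts at a shift boundary still reaches someone who is on call.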
One Last Tip | Other Website Uptime Monitoring Checks
Smart escalations involve receiving the alert when downtime occurs, but there are times when our best intentions can have unintended consequences.
If our scenario used an API or Transaction check, we might not know if the URL used in one of the later steps is down until the check registers downtime. API and Transaction checks run at minimum five-minute intervals, so it can be several minutes before an alert is issued. If a customer reported an outage in that window, IT would check Uptime.com and see that the Transaction check was OK. This mistaken perception would cost valuable response time.
Plug these gaps by assigning several smaller checks that ping the various portions of infrastructure a transaction or API check monitors. HTTPS and Ping ICMP checks are very useful for pinging your site and its necessary infrastructure. Read more about our check recommendations for API and Transaction checks and see how you can get alerts of downtime faster when you fully utilize your Uptime.com account.
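The arithmetic behind this gap is simple enough to sketch. The Python snippet below assumes a failure that begins just after a check run goes unnoticed until the next run, and that escalation timers start counting from detection. Both are simplifying assumptions, as is the one-minute interval used for the smaller check, and the function names are hypothetical.

```python
def worst_case_detection_minutes(check_interval: int) -> int:
    """Longest time an outage can go unnoticed by a check run every check_interval minutes.

    Assumes the failure starts immediately after a run, so it is only
    seen at the next run. Check run time and retries are ignored.
    """
    return check_interval


def worst_case_alert_minutes(check_interval: int, escalation_delay: int) -> int:
    """Worst-case minutes from real outage start until an escalation step fires.

    Assumes the escalation delay is counted from the moment of detection.
    """
    return worst_case_detection_minutes(check_interval) + escalation_delay


if __name__ == "__main__":
    # Five-minute Transaction check versus an assumed one-minute HTTPS check,
    # both feeding the five-minute escalation step.
    for name, interval in [("transaction", 5), ("https", 1)]:
        total = worst_case_alert_minutes(interval, escalation_delay=5)
        print(f"{name}: up to {total} min before the first escalation")
```

Under these assumptions, pairing the Transaction check with a faster, smaller check shortens the worst-case blind spot from five minutes to one.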
Minute-by-minute Uptime checks.
Start your 14-day free trial with no credit card required at Uptime.com.