So You’ve Troubleshooted the Alert. Now What?
Welcome to the companion post to So You Received an Alert. Now What? Last time, we broke down the process between receiving the Uptime.com check alert and figuring out what broke. Today, we’re going to show you how to communicate your efforts so that everyone – your end users, coworkers, and bosses – knows what’s going on.
Your first step is to update your Status Page, your central hub for incident management and communication. If you don’t have a Status Page yet, consider this your helpful reminder to get one set up immediately. You don’t want to be caught in an outage without one, trust me.
Next, you’ll want to look into setting up maintenance windows and get proper incident communication flowing. Keep your end users in the loop: many will calm down if they see that you’re at least aware of the issue. They might even forgo a support ticket if they can see that you’re actively investigating. Make updating your Status Pages with incidents and notifications a central part of your Alert and Troubleshooting process, and save yourself and your users unnecessary stress.
Incidents and Maintenance Windows
One of the best features of an Uptime.com Status Page is that you can create and manage incidents and maintenance windows. Depending on the issue, you can even put the check itself into Maintenance mode to pause future alerts and performance reporting until you end the Maintenance Window. The main job of a maintenance window is to keep routine and planned downtime from counting against your Uptime % obligations.
If your team doesn’t have an Incident Management runbook or similar practice set in place, here is an in-depth guide provided by Carnegie Mellon University. They go into detail on each step, but the basics are clear. Make the information you provide timely and relevant, clarify the affected area(s), and update the users on what happened and what you’re doing to fix it. Keep these in mind when you create your incident.
Incidents are created and managed from the Status Page Manage page. From there, customize the outage message that is sent to subscribers, the components affected, and what stage the incident is in. Once added, update the same Incident to create a clear chronology of the incident.
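To make the idea of a clear chronology concrete, here is a minimal Python sketch of one incident collecting updates over time. It is only an illustration, not the Uptime.com API: the class names, the field names, and the stage names are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical stage names for illustration; not Uptime.com's actual values.
STAGES = ("investigating", "identified", "monitoring", "resolved")


@dataclass
class IncidentUpdate:
    stage: str
    message: str
    posted_at: datetime


@dataclass
class Incident:
    title: str
    components: list[str]
    updates: list[IncidentUpdate] = field(default_factory=list)

    def add_update(self, stage, message, posted_at=None):
        """Append an update to this incident instead of opening a new one."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        when = posted_at or datetime.now(timezone.utc)
        self.updates.append(IncidentUpdate(stage, message, when))

    def timeline(self):
        """Return the updates as readable lines, oldest first."""
        ordered = sorted(self.updates, key=lambda u: u.posted_at)
        return [f"{u.posted_at:%H:%M} [{u.stage}] {u.message}" for u in ordered]
```

Because every update lands on the same incident, subscribers read one story from "investigating" through "resolved" rather than a scatter of unrelated notices.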
You change the filters on your air conditioner, change the oil in your car, or lube up your bike chain. However you get there, planned and routine maintenance is critical to keeping these tools running reliably. Planned maintenance is probably built into your team’s ops, but communicating it to your customers helps them plan around your service outages.
Setting a Maintenance Window is key to reassuring users that any outages are both normal and expected. This can be done at both the Status Page and the Check level.
For Status Pages, Planned Maintenance is created much like the Incidents described above. Set the details, such as time frame, affected components, and outage severity. Finally, customize the notification message that outlines the situation for all Status Page subscribers.
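If it helps to picture what a maintenance window does, the logic boils down to a time-range check per component. The Python sketch below is a rough illustration under assumed names (`MaintenanceWindow`, `should_alert`); it does not describe how Uptime.com implements maintenance windows internally.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime
    components: frozenset  # names of the affected components (hypothetical)

    def covers(self, component, moment):
        """True if the component is under maintenance at the given moment."""
        return component in self.components and self.start <= moment < self.end


def should_alert(component, moment, windows):
    """Alert on a failure only if no maintenance window covers it."""
    return not any(w.covers(component, moment) for w in windows)


window = MaintenanceWindow(
    start=datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 3, 1, 4, 0, tzinfo=timezone.utc),
    components=frozenset({"API"}),
)
```

A failure on the "API" component at 03:00 UTC falls inside the window and stays quiet, while the same failure at 05:00 UTC, or a failure on a component outside the window, still raises an alert.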
Alerts and Reporting
In your troubleshooting and investigation, you may realize that the alert was a false alarm. Maybe a routine maintenance window didn’t get marked, or some other in-house process indirectly caused the check to go DOWN. This time, it might not be a problem with the service itself. For cases like this, you can Ignore the Alert.
Ignoring an Alert serves two major purposes. The first is to hide the alert from central systems – like your dashboard – so you can differentiate it from real outages. The second is to remove the associated downtime from uptime calculations and reports.
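To see why that second purpose matters, here is a small Python sketch of the arithmetic. It assumes a simple formula (uptime = time not down, divided by the total period) and treats ignored downtime as if the service had been up; the names are hypothetical and Uptime.com's exact calculation may differ.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    downtime_minutes: float
    ignored: bool = False


def uptime_percent(period_minutes, alerts):
    """Uptime over a period, skipping downtime from ignored alerts."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    counted = sum(a.downtime_minutes for a in alerts if not a.ignored)
    return 100.0 * (period_minutes - counted) / period_minutes


MONTH = 30 * 24 * 60  # minutes in a 30-day month: 43,200

real_outage = Alert(downtime_minutes=30)
false_alarm = Alert(downtime_minutes=45)

before = uptime_percent(MONTH, [real_outage, false_alarm])
false_alarm.ignored = True
after = uptime_percent(MONTH, [real_outage, false_alarm])

print(round(before, 3), round(after, 3))  # prints 99.826 99.931
```

In this made-up month, a 45-minute false alarm is the difference between missing and meeting a 99.9% target, which is why an alert that wasn't real should not be left in your reports.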
If you want to ignore an Alert, click Reports on the Navigation Panel, then Alerts. Click the Actions menu on the relevant Alert, and select “Ignore This Alert”. This will fade the alert notification to a lighter shade of red, and remove it from Alert pages and reporting.
SLAs are closely tied to downtime, so I’ll hammer this point home. SLAs are about trust and accuracy, so make sure that your SLAs and Reporting are correct. If it wasn’t a real Alert, then Ignore it.
If it was, then make sure it’s properly documented in Status Page incidents, and get to work on fixing it as soon as possible.
At Uptime.com, we work to make your DevOps life as easy as possible. And in this case, it means providing you the tools to be efficient, communicative, and accurate as you deal with anything that comes your way.
Using those tools, we are confident that you can get to the bottom of the issue, or escalate it to the relevant team with the entire outage timeline. If you’re lucky, you may even get back to the table before your food goes cold.
Either way, you made it. We hope this outage was small enough to avoid a major headache or revenue loss, but big enough to give you the confidence to handle the next one with ease. In either case, you can rely on us to have your back. At any hour, our entirely human Support team is available to walk you through your Uptime.com Alert.
If you think our tools would make your life (and your team’s) easier, try out our monitoring solution today with a no-credit-card-required Free Trial. Happy Monitoring!