So You Received an Alert. Now What?
Your phone buzzes with an incoming text message right when you’re about to start dinner. Inconvenient, but better than a 3 am call. It’s an Uptime.com Alert, and if you want to clear it before your dinner gets cold you need the right tools for investigation…
If that scenario sounds familiar, then you’re in good (if tired) company. Thousands of DevOps and SRE teams around the world rely on Uptime.com alerts delivered by email, SMS, phone call, or push notifications from Slack or PagerDuty. You’ve done the first part: setting up an effective monitoring infrastructure. But your job is far from over.
If the bright red “Check DOWN” message is overwhelming, this blog – and its companion post – were written for you. We’ll take a deep dive through the Uptime.com Alerting system, from start to finish. Our comprehensive troubleshooting tools and efficient Incident management will make it that much easier to get back to your relaxing evening.
The initial alert you receive will have different information depending on the method of delivery, from heavily abbreviated SMS Alerts to an Email alert containing details and hyperlinks to resources in your Uptime.com account.
However you get the message, it’s time to act. Once you log in to your Uptime.com account, you have a few options. The easiest is to follow the links in the message to the alert and Real-Time Analysis.
Or you could go to the account itself and check out the Dashboard for Latest Alerts, click on the one you need, and go to Alert Details. Alternatively, you can go directly to the check in question on the Checks page, right-click the Actions menu, then Alerts. Finally, you could go to Alert History, by clicking Reports on the left-side Navigation Panel, then Alerts. More on Alert History later.
Checking Alert Details
Regardless of the path you chose, all roads lead to Alert Details. This page gives you a brief description of what went wrong with the check for each Location that failed. Different Locations may have failed for different reasons, so don’t assume that the error message for one applies to the others.
Alert History is a handy tool for seeing all the alerts in one place, as opposed to one in isolation. Getting the full context of alerts across your account will make pattern recognition much easier. If you see a bunch of checks going down at roughly the same time, then keep that in mind as you troubleshoot the problem.
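To make the "checks going down at roughly the same time" pattern concrete, here is a minimal Python sketch that groups alerts whose start times fall close together. The alert data, the check names, and the five-minute window are all hypothetical; the sketch assumes you have copied check names and timestamps out of Alert History by hand and does not call any Uptime.com API.

```python
from datetime import datetime, timedelta

def cluster_alerts(alerts, window=timedelta(minutes=5)):
    """Group (check_name, timestamp) pairs whose start times fall close together."""
    clusters = []
    for name, ts in sorted(alerts, key=lambda a: a[1]):
        # Join the current cluster if this alert is within `window` of the previous one.
        if clusters and ts - clusters[-1][-1][1] <= window:
            clusters[-1].append((name, ts))
        else:
            clusters.append([(name, ts)])
    return clusters

# Hypothetical alerts, not real Uptime.com output.
alerts = [
    ("checkout-api", datetime(2024, 1, 10, 18, 2)),
    ("homepage", datetime(2024, 1, 10, 18, 3)),
    ("login", datetime(2024, 1, 10, 18, 4)),
    ("blog", datetime(2024, 1, 10, 21, 30)),
]

clusters = cluster_alerts(alerts)
for cluster in clusters:
    if len(cluster) > 1:
        # Several checks failing together often point to a shared cause.
        print([name for name, _ in cluster])
```

A cluster with several checks in it is a hint to look at whatever those checks share (a load balancer, a database, a DNS record) before digging into each one separately.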
Alert Details will tell you what went wrong, but you need to understand why it went wrong if you want to find a solution and bring the check back UP.
Thankfully, Uptime.com checks come with built-in debugging and troubleshooting tools to help you get to the bottom of the issue. The most common tool used is Run Test, which is found on every Add or Edit Check window. Run Test runs the check as configured from a test server you select to give you the resulting output: success or failure.
However, this only shows the current behavior, on a single test location. For detailed information per Location, you need Real-Time Analysis.
Hint: For more complicated, multi-step checks like API and Transaction checks, Root-Cause Analysis offers step-by-step troubleshooting.
Root-Cause Analysis is a special tool for API and Transaction checks that provides step response times, the contents of the endpoint’s response, and further details related to each request in the Browser Console.
Clicking the link from the Alert Details page will show you additional information, such as Check Results for each step, Request Details and Waterfall readouts for Command and Validator steps, as well as a screenshot of the failed step (Waterfalls and screenshots are only available on Transaction Checks).
These details highlight the exact point where an API or Transaction check failed, and will usually provide either a direct answer – or at least a great hint – towards why it went wrong too. If you’re monitoring with API and Transaction checks, this tool will (if it isn’t already) be your best friend.
Real-Time Analysis is the central hub for all other check troubleshooting. Click the button at the bottom-right of the Alert Details window, or click “Analysis” from a check’s Action menu.
There are three sections of the Real-Time Analysis page: Location Status, Recent Alerts Per Location, and Traceroute (for Premium subscriptions only). Let’s break each one down.
Location Status tells you the last known state of the check at each of the configured locations, as well as the exact server address, timestamps, and error message the probe server received. This information is great for correlating timelines from your internal logs.
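One way to do that correlation is to pull every log line within a couple of minutes of the alert timestamp. The sketch below is only an illustration: the log format (an ISO 8601 timestamp followed by a space), the sample lines, and the two-minute margin are assumptions, and it expects the log and alert timestamps to be in the same time zone.

```python
from datetime import datetime, timedelta

def lines_near(log_lines, alert_time, margin=timedelta(minutes=2)):
    """Return log lines whose leading ISO 8601 timestamp is within `margin` of alert_time."""
    matches = []
    for line in log_lines:
        stamp, _, _ = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # Skip lines that do not start with a parseable timestamp.
        if abs(when - alert_time) <= margin:
            matches.append(line)
    return matches

# Hypothetical log excerpt and alert time.
log_lines = [
    "2024-01-10T17:55:00 GET /health 200",
    "2024-01-10T18:01:30 upstream timed out",
    "2024-01-10T18:02:10 GET /health 503",
    "stack trace continues",
    "2024-01-10T18:20:00 GET /health 200",
]
alert_time = datetime(2024, 1, 10, 18, 2)

nearby = lines_near(log_lines, alert_time)
for line in nearby:
    print(line)
```

Widen the margin if your check interval is long, since the failure may have started well before the probe noticed it.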
Tip: Use Load Real-Time Check Status to query the state of each Location. Use this tool as you make changes and fixes to the downed service, for insights into what each location is seeing and when its next interval will occur.
Recent Alerts Per Location
Recent Alerts Per Location is a chronological breakdown of each location’s state, with a timestamp for when each alert was issued. Patterns in down times, touchy locations, or recurring error messages will be made apparent here, where you can see the entire check’s context at a glance.
Tip: Use this section to see how well sensitivity is working for your check. Do you need alerts when more locations go down? Have you covered the minimum number you consider acceptable to receive an alert at dinner?
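As a rough mental model of that trade-off, sensitivity can be thought of as the number of locations that must report DOWN before an alert goes out. The sketch below is a simplified illustration under that assumption, not Uptime.com’s actual alerting logic; the `should_alert` function and the location names are hypothetical.

```python
def should_alert(location_states, sensitivity):
    """True when at least `sensitivity` locations report the check as DOWN."""
    down = sum(1 for state in location_states.values() if state == "DOWN")
    return down >= sensitivity

# Hypothetical snapshot of one check across four probe locations.
location_states = {
    "US-East": "DOWN",
    "US-West": "UP",
    "GBR": "DOWN",
    "AUS": "UP",
}

print(should_alert(location_states, sensitivity=2))
print(should_alert(location_states, sensitivity=3))
```

With two of four locations down, a threshold of 2 interrupts dinner and a threshold of 3 does not. Recent Alerts Per Location shows which of those outcomes your current setting has been producing.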
Traceroute is an advanced troubleshooting tool available to our customers on the Premium subscription or higher. Similar to the tracert command in Windows, or traceroute on Linux and macOS, our Traceroute tool maps the intermediate hops that the Uptime.com probe takes to get from our check servers to your device, service, or site’s address.
When your internal logs don’t show anything strange or suspicious, or response times are uncharacteristically high, use Traceroute to make sure that the routing is correct.
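For comparison, you can run the same kind of trace from your own machine. The sketch below picks the local command by platform (tracert on Windows, traceroute elsewhere) and runs it with Python’s standard library. It assumes the command is installed and on your PATH, the helper names are hypothetical, and the route it shows starts from your machine rather than from an Uptime.com probe server, so treat it as a second viewpoint and not a replacement.

```python
import platform
import subprocess

def traceroute_command(host, system=None):
    """Build the argument list for the local trace command on this platform."""
    system = system or platform.system()
    if system == "Windows":
        return ["tracert", host]
    return ["traceroute", host]  # Linux and macOS

def run_traceroute(host, timeout=60):
    """Run the trace and return its text output (requires network access)."""
    result = subprocess.run(
        traceroute_command(host),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout

# Example usage, left commented out because it needs network access:
# print(run_traceroute("example.com"))
```

If the local trace and the probe’s trace diverge at a particular hop, that hop is a reasonable place to start asking questions of your hosting or network provider.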
Between these three tools and your internal logs, you should have a working theory of downtime. And if you get stuck, email email@example.com. If we can’t help you directly, we can at least confirm that the problem wasn’t on the Uptime.com side and point you in the right direction.
Incidents and Maintenance
Hopefully, you got through the initial troubleshooting process without too many gray hairs or headaches. In either case, you’ve made progress, and it’s important that you communicate that to your end users.
If something is wrong, you’ll need to track it in an Incident. That’s where Status Pages come in handy.
In the next post of this two-part series on managing alerts, we’ll take a closer look at Incident Management and Reporting, where you put all of our Uptime.com tools to good use as you work towards alert resolution. So stay tuned!
If you think our tools would make your life (and your team’s) easier, try out our monitoring solution today with a no-credit-card-required Free Trial. Happy Monitoring!