How Uptime.com Can Help Troubleshoot a Server Outage

Everyone has heard about the 3 AM wakeup call, but what about those troublesome issues that dig at your team and eat away at your SLA hours? Hard-to-diagnose issues can strike at any time. They drain your team, hurt morale, and impede the customer experience. In short, they make a mess.

These incidents test what “response” really means to your organization, because fixing them is not always simple. Something has gone wrong. At what point do you start taking drastic measures to solve the problem?

Today we will dive into the tools you can use as an Uptime.com customer to pinpoint what’s really going on and reach a diagnosis faster.

Alerts

The first step in the resolution chain is the alert your team receives. As you begin your diagnosis, you need to know the What, When, and Where of your downtime. What went down? When did it go down, and how long has it been down? Where are users seeing this outage?

Alerts help you answer these initial questions so you can dive deeper into your data.

Accessing Alerts

Let’s begin by locating a down check within an Uptime.com account and clicking Actions>Alerts. This takes us to the down alerts for this check only, and it’s a good place to make some notes, such as:

  • When did this check last report downtime?
  • How long did that downtime last?

Viewing Alerts

Recent alerts can help inform us when we look at internal data. Maybe a deployment was made recently and our system has been reporting downtime. Maybe the first incident or two seem like coincidence, but the third and beyond point to a deeper problem.

Duration of downtime can also tell us how our system performs, and how likely the problem is to fix itself. Many Uptime.com processes are automated, so we have failsafes in place to reboot critical systems when they go down. Looking at the duration of downtime can tell us whether those failsafes are working.

Tip: Investigate brief downtime incidents as a whole when they occur in a tight time window. What deeper problems might they point to when combined?
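One way to act on this tip is to group outage start times that fall close together. The sketch below is only an illustration: the timestamps are made up, and the 30-minute window and the three-incident threshold are assumptions you would tune for your own systems.

```python
from datetime import datetime, timedelta

def group_incidents(start_times, window=timedelta(minutes=30)):
    """Group outage start times into clusters in which each incident
    begins within `window` of the previous one."""
    clusters = []
    for start in sorted(start_times):
        if clusters and start - clusters[-1][-1] <= window:
            clusters[-1].append(start)   # close enough: same cluster
        else:
            clusters.append([start])     # too far apart: start a new cluster
    return clusters

# Hypothetical outage start times, as you might note them from an alert list.
starts = [
    datetime(2024, 5, 1, 3, 2),
    datetime(2024, 5, 1, 3, 15),
    datetime(2024, 5, 1, 3, 40),
    datetime(2024, 5, 2, 11, 0),
]

clusters = group_incidents(starts)
for cluster in clusters:
    if len(cluster) >= 3:
        print(f"{len(cluster)} incidents from {cluster[0]:%H:%M} to {cluster[-1]:%H:%M}: investigate together")
```

A cluster of three or more brief incidents is a reasonable prompt to look for a shared cause, such as a recent deployment.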

We can also see which locations are reporting as down and make some informed guesses as to what may be happening. If we see a location that doesn’t normally report as down, we might consider how our load balancing is performing in that region.
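To make "doesn’t normally report as down" concrete, we can compare the locations that are down now against how often each one appeared in past alerts. This is a minimal sketch under stated assumptions: the location names, the history list, and the 10% cutoff are all hypothetical, not an Uptime.com data format.

```python
from collections import Counter

def unusual_locations(current_down, history, max_share=0.1):
    """Return locations that are down now but appeared in at most
    `max_share` of past alerts."""
    counts = Counter(loc for alert in history for loc in set(alert))
    total = len(history)
    return sorted(
        loc for loc in current_down
        if total == 0 or counts[loc] / total <= max_share
    )

# Hypothetical data: each past alert lists the locations that reported down.
history = [["US-East", "US-West"], ["US-East"], ["US-East", "GBR-London"]] * 4
current = ["US-East", "AUS-Sydney"]

print(unusual_locations(current, history))
```

Here only the location with no history of downtime is returned, which is the one worth a closer look at regional routing or load balancing.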

Alerts help us make these logical connections, and looking at the data for a specific check can fill in some important details in our diagnosis.

Reports

A check’s report screen gives us even greater detail for diagnosis, including response time broken down by location. To view a check’s report, click Actions>Reports or just click on the name of the check from the Checks page.

Basic Check Report

With location performance data, we can sometimes track the moments before an outage as response time rises and the server becomes harder to reach. This is especially evident in advanced checks like the Transaction check.
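The pattern of response time climbing before an outage can be checked with a simple comparison of recent samples against an earlier baseline. The sketch below assumes you have response times per location in a plain dictionary; the sample values, the location names, and the "twice the baseline" threshold are illustrative assumptions.

```python
from statistics import mean

def rising_locations(samples, recent=3, factor=2.0):
    """Flag locations whose mean of the last `recent` response times is at
    least `factor` times the mean of the earlier samples."""
    flagged = {}
    for location, times in samples.items():
        if len(times) <= recent:
            continue  # not enough history to form a baseline
        baseline = mean(times[:-recent])
        latest = mean(times[-recent:])
        if baseline > 0 and latest >= factor * baseline:
            flagged[location] = round(latest / baseline, 2)
    return flagged

# Hypothetical response times in seconds, oldest first, per location.
samples = {
    "US-East": [0.40, 0.42, 0.38, 0.41, 0.90, 1.30, 2.10],
    "GBR-London": [0.55, 0.60, 0.57, 0.58, 0.61, 0.59, 0.56],
}

print(rising_locations(samples))
```

With this data only US-East is flagged, since its latest samples are well over twice its earlier average while GBR-London stays flat.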

This check hasn’t gone down, but we can see some spikes in performance that might concern us.

The check report also contains an alert history, which helps us establish an event timeline. We can see the last time the check reported as down, as well as a brief description of the error. Whether we are experiencing downtime or not, the right-hand sidebar will tell us the state of our check and for how long that state has been maintained.

Real-Time Analysis

Debugging is rarely a “fix one thing and we’re done” sort of operation. Logic and syntax errors in applications of any size can have cascading effects across systems. Even small variances in when a job runs can cause systems to crash or become less stable.

Real-Time Analysis is where a lot of the magic happens for diagnosis and understanding. If our workflow is code, build, test, then Real-Time Analysis is our “testing” phase. Here is where we’ll see the chronology of our outage and when the next check interval will occur.

Real-time status of probe servers tells us whether the check is coming back online and if our efforts are paying off. What are the probes experiencing right now, and when will the next interval occur? Server status codes may change in the midst of an outage, so periodically re-running tests from Uptime.com’s probe servers can improve your overall visibility of the systems.

An easy way to spot performance problems in a Transaction check, for example, is to compare results across probe servers. Did the check fail on the same step from every server?
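If you jot down which step failed on each probe server, answering that question takes only a few lines. This sketch assumes a hand-built dictionary of probe names and failed step names (both hypothetical), with None for a probe that passed.

```python
from collections import defaultdict

def failures_by_step(results):
    """Group probe servers by the step they failed on.
    `results` maps probe name -> failed step name, or None if the probe passed."""
    grouped = defaultdict(list)
    for probe, step in results.items():
        if step is not None:
            grouped[step].append(probe)
    return dict(grouped)

# Hypothetical results noted from three probe servers.
results = {
    "US-East": "Submit login form",
    "GBR-London": "Submit login form",
    "AUS-Sydney": "Submit login form",
}

grouped = failures_by_step(results)
# True only when every probe failed, and all on the same step.
same_step_everywhere = (
    len(grouped) == 1 and len(next(iter(grouped.values()))) == len(results)
)
print(grouped, same_step_everywhere)
```

The same step failing everywhere points toward the application itself, while failures spread across different steps or limited to some probes suggest a regional or network cause.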

Tip: You can run a test from any probe server available to your account in your check’s edit screen.

Notes and Past Alerts

Notes allow you to package what you’ve already learned into a runbook of sorts for your next outage, so no one is left alone in the dark. Rebooting system X fixed the problems with check Y? Note that for the next person who encounters downtime with that check.

You can do this internally as well, but Uptime.com includes any notes you leave with the alert so all available information is accessible as soon as your check reports downtime.

Expanding Your Monitoring

Systems are dependent on one another, so our final tip is to apply what you learned during your outages and expand your monitoring. You can use supplemental checks, like custom checks (webhook or heartbeat), to show the status of internal jobs and infrastructure that might be supporting the check that went down.
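As a rough illustration of the heartbeat idea, an internal job can send a request to its heartbeat URL only after it finishes successfully, so a failed or stalled job shows up as a missed heartbeat. Everything specific here is an assumption: the URL is a placeholder, and the request method your heartbeat check expects should be confirmed in its setup screen.

```python
import urllib.request

# Placeholder only: substitute the URL shown for your own heartbeat check.
HEARTBEAT_URL = "https://example.com/your-heartbeat-url"

def http_ping(url, timeout=10):
    """Send a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status

def run_with_heartbeat(job, ping=http_ping, url=HEARTBEAT_URL):
    """Run `job`, then report a heartbeat only if it finished without raising."""
    job()              # an exception here skips the ping below
    return ping(url)

# Demonstration with a stand-in ping, so nothing is sent over the network.
sent = []
status = run_with_heartbeat(lambda: None, ping=lambda url: sent.append(url) or 200)
print(status, sent)
```

In a real job you would call run_with_heartbeat(your_job) and let the default http_ping do the sending. Passing the ping function in as a parameter also makes the wrapper easy to test.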

The name of the game is information. The more you have, and the more checks generating data for you, the faster you can go from problem to resolution.

So let’s sum up what we’ve learned:

  • Alerts help us through those initial moments of downtime. They confirm something is wrong, and tell us from a technical standpoint what that something could be.
  • Reports give us data on performance, and clue us into recent outages for the system we’re monitoring.
  • Real-Time Analysis is where we see if our testing is paying off, and make critical observations for extended downtime events.
  • Use notes and build on past knowledge. Runbooks can help turn the less experienced into confident incident drivers when you document your knowledge.

Combine it all with 360° monitoring and you’re well on your way to an efficient downtime resolution. Remember, downtime hurts, but not responding adequately can be devastating. Let Uptime.com watch over your infrastructure and get peace of mind.

Richard Bashara is Uptime.com's lead content marketer, working on technical documentation, blog management, content editing and writing. His focus is on building engagement and community among Uptime.com users. Richard brings almost a decade of experience in technology, blogging, and project management to help Uptime.com remain the industry leading monitoring solution for both SMBs and enterprise brands. He resides in California, enjoys collecting and restoring arcade machines, and photography.
