Real-Time Analysis Provides Valuable Insights for IT Pros
After stepping out for lunch, you return to find that Uptime.com has issued a downtime alert to your work email address.
You’ve been away from your computer for about 45 minutes, blissfully unaware of your inbox and enjoying a moment of zen.
Now that you’ve walked head-first into a small crisis, what’s the fastest way to confirm downtime, get server response codes, and perform outage analysis?
We’ve got you covered with our real-time analysis tool.
Root Cause Analysis, as a discipline, is familiar to anyone who works in IT. It’s been drilled into us since the tier 1 days where our origin stories began. Identify the problem, establish a timeline, find the distinguishing factor, and establish cause and effect. You may have heard the mantra phrased differently (shoutout to all my Mike Meyers fans), but the basic steps will never change.
According to Ian McClarty of PhoenixNAP Global IT Services, every IT worker should pay attention to their logs. In the case of Uptime.com, this applies to the alerts generated by the system.
“A significant uptick of logs is a quick and straightforward pointer to problem servers or applications,” says Ian.
Using Uptime.com to assist in Root Cause Analysis is as simple as finding the alert you’ve been issued in either your Dashboard, Alert History, or your list of checks, and clicking Actions>Analysis to open the Real-Time Analysis tool.
Want to take Real-Time Analysis for a spin? Try Uptime.com free for 21 days, no credit card required.
Let’s dive in and see what you’ll find there.
Real-Time Analysis Tools from Uptime.com
Occasionally, we field support inquiries about what appear to be false positives. We take these potential false alarms very seriously because our job is your uptime. We typically offer some suggestions based on what we see specific to your site, such as:
- Whether load balancers are used
- Performance of Probe Servers
- Other factors
That last point is where real-time analysis can help.
In determining which other factors might contribute to an outage, it would be useful to know some things about the outage itself. Real-Time Analysis provides the following details when an outage occurs:
- Probe servers affected
- Whether a downtime alert was issued
- What the alert details contained
- Dates a probe server was assigned or changed
- Status by location
- Last alerts
- Recent alerts by location
You can also use the Check Status button to confirm the state of your check at the time your alert was issued.
Using Real-Time Analysis to Solve an Outage
Let’s put on our detective hats, pull out our magnifying glasses (or monocles), and see what we can uncover.
Check the Probe Servers
Our first instinct should be to look at the Location Status of the probe servers we’ve assigned to the check, to see which of them is reporting the outage. We can learn some important details if we Load Real-Time Check Status, such as whether the check is awaiting confirmation of recent changes we’ve made to it, server response codes, and the check state.
We’ll also see the assigned probe server, and the date that this server last received an outage report for your check.
Here we see a check that has encountered a potential error. I’ve configured this check to retry twice before issuing an alert to me, and the check status is confirming that the first retry has failed.
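To make the retry behavior concrete, here is a minimal sketch of "retry twice before alerting." It is an illustration only, not Uptime.com's implementation: the `run_check_with_retries` function, the `probe` callable, and the result fields are all hypothetical names chosen for this example.

```python
from typing import Callable


def run_check_with_retries(probe: Callable[[], bool], retries: int = 2) -> dict:
    """Run a probe once, then retry up to `retries` times before alerting.

    `probe` is a hypothetical callable that returns True when the target
    responds as expected and False otherwise.
    """
    failures = 0
    for _ in range(1 + retries):
        if probe():
            # One success clears the pending failure, so no alert is issued.
            return {"state": "UP", "failed_attempts": failures, "alert": False}
        failures += 1
    # The initial attempt and every retry failed: treat it as real downtime.
    return {"state": "DOWN", "failed_attempts": failures, "alert": True}


# A simulated probe that fails twice and then recovers on the final retry.
results = iter([False, False, True])
print(run_check_with_retries(lambda: next(results), retries=2))
```

With two retries, a brief timeout that clears before the final attempt never turns into an alert, which is the point of retrying in the first place.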
Real-time status provides a glimpse into how your check is performing in the moment. Use it when you think you’ve restored service, when you suspect a false positive, when you want to confirm that recent changes to a check have been applied, or just out of curiosity. It’s good to know everything is running well on our side and yours.
Discover When the Alert was Issued
Next, let’s establish a timeline for the outage. Scroll to Alerts by Location to see some of the same statistics we see with Location Status. The difference when we break down Alerts by Location is the timeline of the outage.
In this example, we can see the timeline of an outage play out. Note the third column (number of locations down), and the sixth column (Check State). Follow from the bottom to the top and you can see the outage timeline.
This outage was related to a misconfiguration, which the technical details helped me establish very quickly.
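If you want to work through a timeline like this outside the UI, the same bottom-to-top reading can be sketched in a few lines. Everything here is an assumption for illustration: the row fields (`time`, `locations_down`, `state`) and the sample values are hypothetical and do not reflect an actual Uptime.com export format.

```python
from datetime import datetime

# Hypothetical rows, newest first, loosely modeled on an alerts-by-location list.
ROWS = [
    {"time": "2024-01-01T12:31:00", "locations_down": 0, "state": "UP"},
    {"time": "2024-01-01T12:12:00", "locations_down": 3, "state": "DOWN"},
    {"time": "2024-01-01T12:10:00", "locations_down": 2, "state": "DOWN"},
    {"time": "2024-01-01T12:08:00", "locations_down": 1, "state": "DOWN"},
]


def outage_timeline(rows):
    """Summarize when an outage started, peaked, and was restored."""
    # Read oldest to newest, the same as following the list bottom to top.
    ordered = sorted(rows, key=lambda r: datetime.fromisoformat(r["time"]))
    first_down = next((r for r in ordered if r["state"] == "DOWN"), None)
    if first_down is None:
        return None  # No downtime recorded in these rows.
    started = datetime.fromisoformat(first_down["time"])
    first_up = next(
        (r for r in ordered
         if r["state"] == "UP" and datetime.fromisoformat(r["time"]) > started),
        None,
    )
    restored = datetime.fromisoformat(first_up["time"]) if first_up else None
    minutes = (restored - started).total_seconds() / 60 if restored else None
    return {
        "started": started,
        "restored": restored,
        "duration_minutes": minutes,
        "peak_locations_down": max(r["locations_down"] for r in ordered),
    }


print(outage_timeline(ROWS))
```

For the sample rows, the outage starts at 12:08, peaks at three locations down, and is restored 23 minutes later.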
Establish when Service was Restored
Timeline is an important measure for an outage. It’s good to know how long something was down because it helps you determine where the outage may have occurred. The nature of our probe servers running from around the globe means the occasional delay or timeout. We encourage a default of two retries on our probe servers to reduce these potential false positives.
When you see a small outage, one of the first steps you can take is to confirm check status as we’ve done above. If the check is actually down, you can start to look at factors near you such as DNS issues or temporary outages caused by caching or hosting issues.
Longer outages tend to signal that something is broken, but not always. It’s possible a provider you use for some piece of infrastructure is down or overloaded, which Uptime.com would detect. But a 404 is a 404. It doesn’t tell you who caused the 404, only that whatever you’re looking for is not found.
A combination of these factors will help in diagnosing the outage.
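As a rough aid when reading response codes, the sketch below maps a status code to its standard phrase and a first place to look. The phrases come from Python's standard `http.HTTPStatus`; the hints are general HTTP rules of thumb and our own assumption for this example, not something Uptime.com reports, and `describe_status` is a hypothetical helper.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the standard phrase for a status code plus a first-guess hint."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # Not a code the standard library recognizes.
    if 200 <= code < 400:
        hint = "server answered normally; check content or assertion rules"
    elif 400 <= code < 500:
        hint = "request was rejected; check the URL, auth, or recent config changes"
    elif 500 <= code < 600:
        hint = "server or an upstream dependency failed; check app and provider status"
    else:
        hint = "unexpected code; inspect the raw response"
    return f"{code} {phrase}: {hint}"


for code in (200, 404, 503):
    print(describe_status(code))
```

The code alone still won't tell you who caused the problem, but it narrows down which side of the request to investigate first.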
Real-Time Analysis and Root Cause Analysis
Those of you running Transaction checks may notice a small change in real-time versus root cause analysis in the Uptime.com UI. This is more than just a difference in naming. Root Cause Analysis provides quite a bit more detail on each step your Transaction check takes.
To access root cause analysis, open the Analysis tab as you do for any other check. Click the alert details you want to analyze, and you will see the option to perform Root Cause Analysis.
Here, you can see each step prior to the failure, alongside performance statistics and technical details on the failure of that step.
You may also notice a screenshot above the Check Results, along with the Browser Console Log that provides the specific response your check received. These elements provide insight into what Uptime.com did, how the server responded, and what Uptime.com saw before the check registered as DOWN.
Transaction checks are indispensable for monitoring the critical goals and flows that drive your business, and they are designed for rapid response. When you receive an alert, it will be full of technical details with downloadable screenshots and a log of everything happening at the browser level.
API checks also use this tool. We encourage you to review the technical data as often as possible when you’re diagnosing downtime.
Key Takeaways for Outage Analysis
First, data is paramount. Uptime.com gives you everything you could potentially need to begin your diagnosis. You can observe other factors yourself as well. Run tests from your workstation: are you getting a response? Can you confirm latency issues, or whatever else the Uptime.com data may suggest? Remember, alerts issued via email typically contain links to Real-Time Analysis and Root Cause Analysis reporting tools.
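For a quick workstation test, something like the following sketch records a response code and elapsed time for a single request using only the Python standard library. The function name `probe_url`, its `opener` parameter (there so the fetch can be swapped out), and the result fields are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def probe_url(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> dict:
    """Fetch a URL once and report the status code, any error, and elapsed time."""
    started = time.monotonic()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses still prove the server answered.
        status, error = exc.code, str(exc.reason)
    except OSError as exc:
        # DNS failures, refused connections, and timeouts end up here.
        error = str(exc)
    elapsed = round(time.monotonic() - started, 3)
    return {"url": url, "status": status, "error": error, "elapsed_s": elapsed}
```

Calling `probe_url("https://example.com/")` from your own machine gives you a response code and timing to compare against what the probe servers reported. A result with no status at all points toward DNS or connectivity on your side of the request.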
Use the real-time status of your check for more insight into what the probe servers are doing. You can learn a lot about their behavior and how it relates to your UP or DOWN status. From awaiting changes you’ve made to coming back online after a maintenance window, the check state in real-time is where you want to go to confirm the status of an outage.
Establish an outage timeline with the recent alerts you’ve received. Every time a probe server reports downtime, we record it on this screen. You’ll be able to tell how long an outage lasted, and which locations went down in which order.
Finally, use everything you have at your disposal when you’re performing root cause analysis. Outage analysis is a complicated beast, and it’s easy to overlook the small details. Don’t forget to review the Audit Log for each check (from Monitoring>Checks, locate the check, click Actions>Audit Log). This method of accessing the audit log will show only changes relevant to the specific check you’re analyzing. From your Uptime report, you can download a PDF or XLS file containing outage data for the time period you’re analyzing.
Don’t waste valuable time gathering data when everything is available at your fingertips.