Root Cause Analysis: Uptime.com Problem Solving Tools

You manage one of the world’s largest messaging platforms. It’s the middle of the afternoon and confidence is setting in. Your company has recently beefed up its capacity, and performance has never been better. You’re about to step out for a late lunch when a drop in metrics starts triggering alarms.

What do you do?

What is Root Cause Analysis, Anyway?

*record scratch* Yep, that’s me. You’re probably wondering how I ended up in this situation…

Table of Contents
  1. Investigating the Root Cause
  2. Real Time Check Status
  3. Deeper Investigation
  4. Automatic Root Cause Analysis
  5. Root Cause Analysis and Review

Root Cause Analysis (RCA) is critical for crisis management and all too familiar to any IT worker. From the overloaded newbie to the battle-hardened veteran, the general approach is the same: become aware of the problem, organize the timeline, and follow the ping-pong match of cause and effect until you’ve isolated the catalyst and developed a solution.

RCA takes time to do well. Analyzing the logged times of past events and documenting a timeline of forward action requires coordination, often under less-than-ideal conditions with teams spread out geographically. Timestamped alerts act as breadcrumbs for process failures, but following them isn’t always simple when multiple factors are involved.

Effective root cause analysis builds on known technical details behind an outage to form an assessment and a plan of action.

“Anything that happens, happens.

Anything that, in happening, causes something else to happen, causes something else to happen.

Anything that, in happening, causes itself to happen again, happens again.

It doesn’t necessarily do it in chronological order, though.”

Douglas Adams

Mostly Harmless (1992)

Investigating the Root Cause

You and your team assemble and put what is known on the table:

14:18 – A spike in API traffic is felt just before the outage.

14:21 – Within three minutes, capacity loss is high.
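Those two entries are already the start of a timeline. As a minimal sketch (not an Uptime.com feature), here is one way to keep incident notes sorted and see the gap between events. The date and the `build_timeline` helper are placeholders assumed for illustration; only the two times and descriptions come from the scenario above.

```python
from datetime import datetime, timedelta

# Placeholder date: the scenario only gives times of day.
INCIDENT_DATE = "2024-01-01"

# Incident notes as (time of day, description), entered in any order.
events = [
    ("14:21", "Capacity loss is high"),
    ("14:18", "Spike in API traffic"),
]

def build_timeline(events, date=INCIDENT_DATE):
    """Return events sorted by time, each with the gap since the previous one."""
    parsed = sorted(
        (datetime.strptime(f"{date} {t}", "%Y-%m-%d %H:%M"), text)
        for t, text in events
    )
    timeline = []
    previous = None
    for when, text in parsed:
        # The first event has no predecessor, so its gap is zero.
        gap = when - previous if previous is not None else timedelta(0)
        timeline.append((when, gap, text))
        previous = when
    return timeline

for when, gap, text in build_timeline(events):
    minutes = int(gap.total_seconds() // 60)
    print(f"{when:%H:%M}  (+{minutes} min)  {text}")
```

New observations can be appended to `events` as the team reports them; the sort keeps the story in order even when the notes arrive out of sequence.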

Quiz time: what’s the first step you take?

Check your logs.

At Uptime.com, alert history is a valuable log. Individual alerts contain key technical data like error codes, response time, and which probe locations are listed as “DOWN”. Deeper information is also available through the Real Time Analysis tool, which provides a chronological breakdown of recent probe servers and links to deeper actions and greater analysis for alerts when outages are reported.

Find Alert History by selecting ‘Reports’ then ‘Alerts’ in your Uptime.com Dashboard
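Once the alerts are in front of you, grouping them by probe location shows quickly whether the problem is global or regional. The sketch below assumes the alerts have been copied into plain Python dictionaries; the field names, locations, and values are invented for illustration and are not Uptime.com’s actual alert format.

```python
from collections import defaultdict

# Hypothetical alert records. The field names and values are made up for
# illustration; they are not Uptime.com's actual alert format.
alerts = [
    {"location": "US-East", "state": "DOWN", "error": "HTTP 503", "response_ms": 4200},
    {"location": "US-East", "state": "DOWN", "error": "Timeout", "response_ms": 10000},
    {"location": "EU-West", "state": "UP", "error": None, "response_ms": 180},
    {"location": "AP-South", "state": "DOWN", "error": "HTTP 503", "response_ms": 3900},
]

def summarize_down_alerts(alerts):
    """Group DOWN alerts by probe location: count, error codes, slowest response."""
    summary = defaultdict(lambda: {"count": 0, "errors": set(), "slowest_ms": 0})
    for alert in alerts:
        if alert["state"] != "DOWN":
            continue  # only failing probes matter for this summary
        entry = summary[alert["location"]]
        entry["count"] += 1
        entry["errors"].add(alert["error"])
        entry["slowest_ms"] = max(entry["slowest_ms"], alert["response_ms"])
    return dict(summary)

for location, info in sorted(summarize_down_alerts(alerts).items()):
    print(location, info["count"], sorted(info["errors"]), info["slowest_ms"])
```

If every location reports the same error, the cause is probably on your side; if only one region is failing, the network path to that region deserves a closer look.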

 

Real Time Check Status

So, you’ve checked your logs and zoomed in on the critical alerts. Now it’s time to tunnel deeper and get details on the status of the probe servers monitoring your service. Investigate each location’s status to find out what its probes have most recently encountered. Status codes can change, different steps can fail, and different locations can see different problems, so verify each location rather than relying on a single status for the check.

Real Time Check Status is also a way to confirm that any changes you’ve made to a check have been received and are active. It reports whether the check is OK or Critical, and it provides timing information for each probe server, such as the details of the last alert and the most recent alerts for a particular location. Real Time Check Status also shows:

  • Next scheduled check
  • Last run check
  • Last status encountered
  • Processing time for the check
  • “OK Since” – time since last alert was issued

Real time status also provides a timestamp of when service was restored. This is helpful because the time between server downtime and restored uptime offers insight into the outage and where it occurred, and it represents your team’s response time.

A speedy resolution builds trust between you and your users, and a thorough RCA can help you shorten that response time window.
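Measuring that window is simple arithmetic on two timestamps. Here is a minimal sketch; the timestamp format and the sample values are assumptions for illustration, not values read from Uptime.com.

```python
from datetime import datetime, timedelta

def outage_duration(down_at, restored_at, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the time between the first DOWN alert and service restoration."""
    start = datetime.strptime(down_at, fmt)
    end = datetime.strptime(restored_at, fmt)
    if end < start:
        raise ValueError("restored_at is earlier than down_at")
    return end - start

# Placeholder timestamps, assumed for this example only.
duration = outage_duration("2024-01-01 14:21:00", "2024-01-01 15:06:00")
print(f"Outage lasted {int(duration.total_seconds() // 60)} minutes")
```

Tracking this number across incidents gives you a baseline to improve on during the review stage.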

Going Further In with API and Transaction Checks

Investigation takes time. Within 30 minutes, you have a theory. Within another 15, you have tested some solutions.

You’re not out of the woods yet, but the field is narrowing.

Automatic Root Cause Analysis

All API and Transaction Checks generate a Root Cause Analysis report detailing:

  • HTTP request headers for all HTTP errors
  • A browser console indicating the various requests made
  • Screenshots of what the transaction or API check encountered at the time of the outage
  • Technical alert details showing the exact error codes and status

This report lets you analyze the situation at the browser level and identify the elements that are bogging things down.

While working to resolve an outage, you can also customize your HTTP checks with additional parameters so they poll your servers at regular intervals and continually check for a response.
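To make the idea of a recurring HTTP check concrete, here is a rough standalone sketch using only Python’s standard library. It is not how Uptime.com implements its checks; the function names, the defaults, and the rule that any 2xx or 3xx status counts as healthy are all assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, and so on

def poll(url, attempts=3, interval_s=60.0, probe=http_status, sleep=time.sleep):
    """Probe url `attempts` times, waiting interval_s seconds between probes.

    Returns a list of (status, healthy) pairs. Treating 2xx and 3xx as
    healthy is an assumption of this sketch.
    """
    results = []
    for attempt in range(attempts):
        status = probe(url)
        healthy = status is not None and 200 <= status < 400
        results.append((status, healthy))
        if attempt < attempts - 1:
            sleep(interval_s)  # no need to wait after the final probe
    return results
```

Calling `poll("https://example.com/health", attempts=5, interval_s=30)` would probe a (hypothetical) health endpoint five times, thirty seconds apart. The `probe` and `sleep` parameters exist so the loop can be exercised without touching the network.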

Resolution: Root Cause Analysis and Review

You discover the issue is related to services stuck in a crash loop. You isolate those systems and begin to restore them. Users can begin connecting again, and the dev team updates the status page.

So, we’re done now? I don’t think so.

Small problems trigger larger issues.

The final step in RCA is retrospective analysis, and it’s very Daft Punk: when things get harder, how can we be better? Faster? Stronger? How can we shorten our response times? And most importantly, how can this evaluation be used to implement better practices that prevent the issue from happening again?

Those who do not learn from history are doomed to repeat it.

In “ancient” times, the basic process of troubleshooting an outage consisted of referencing a circuit schematic while measuring signal levels. You would place your left index finger on the schematic pointing to the last point where your levels were good, and your right index finger at the point where your incorrect readings were measured. Then, you simply stuck your nose in the middle and voila! You found the problem.

Learn from the Past

If you build the monitoring infrastructure and run routine game day exercises, you are prepared to handle outages cool, calm, and collected. However, you can’t develop an effective strategy without first collecting and analyzing your outage data.

Servers have evolved, but the need for RCA goes back to IT fundamentals. Post-incident review, postmortem, learning review: you can give this process any name you like. The takeaway is that without establishing key test points, checks, and predetermined procedures, all attempts to quickly resolve a problem are doomed to fail.

 


Emily Blitstein

Emily Blitstein is a technical content writer for Uptime.com. With a background in writing, editing, and global HR, Emily is committed to delivering informative and relatable content to the Uptime.com user community. Aside from travel, she enjoys making short stop-motion animations, and live music.
