My Website is Down! Ten Steps to Take During a Downtime Event
Oh no. Your website is down. And regardless of what time it is, we guarantee it’s not a convenient time for your website to crash. An outage can cause a panicked fight-or-flight response when teams are unprepared for the consequences. One of the worst ways to deal with downtime is to try to wait it out, thinking it’ll just magically resolve itself.
Unplanned downtime is a major cost to small businesses and Fortune 500 enterprises alike in an always-connected world where 98% of companies say a single hour of downtime costs more than $100,000. Apart from access issues for the end user, productivity can halt entirely as teams lose connection to systems or crucial resources. Consider that the Washington Post reported that work interruptions consume, on average, 238 minutes per day.
Failure can compound itself: who will respond? What should be documented? What is this downtime costing you?
Fortunately, Uptime.com is here to help with ten practical steps you can take after a website outage to get to the heart of the issue:
- Confirm Your Site is Really Down
- Analyze the Extent of the Site Outage
- Review Your Logs
- Start Troubleshooting
- Brainstorm Some Possible Fixes
- Confirm Connectivity
- Confirm the Timeline of Downtime
- Document the Incident
- Don’t Ignore Small Outages
- Audit your Web Monitoring
Each of these steps is aimed at confirming state, creating a learning opportunity, and most importantly getting down to the root cause of the issue.
Confirm Your Website is Really Down
How many times have you found yourself asking a colleague to confirm that a site you’re trying to reach is actually down? We can save you from this anecdotal testing!
External website monitoring offers round-the-clock confirmation from locations positioned across the globe. At Uptime.com, we confirm an outage from multiple external locations so you’re never wondering if it’s only down for you.
Using tools like Monitor Entire Site or Real User Monitoring puts more of your site under our watchful eye. These tools help you better understand the extent of an outage and how it affected your customers.
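To see why multiple vantage points matter, here is a minimal sketch that treats a site as down only when a majority of probe locations agree. The function name, the location labels, and the majority threshold are all assumptions made for illustration; this is not a description of how Uptime.com confirms outages internally.

```python
def confirm_outage(probe_results, min_fraction_down=0.5):
    """Return True when more than min_fraction_down of probes report the site down.

    probe_results maps a (hypothetical) location label to True if the site
    was reachable from that location, or False if it was not.
    """
    if not probe_results:
        raise ValueError("at least one probe result is required")
    down = sum(1 for reachable in probe_results.values() if not reachable)
    return down / len(probe_results) > min_fraction_down


# One failing location out of three hints at a local network problem.
print(confirm_outage({"us-east": True, "eu-west": False, "ap-south": True}))   # False
# Two failing locations out of three is a much stronger signal of a real outage.
print(confirm_outage({"us-east": False, "eu-west": False, "ap-south": True}))  # True
```

A single "it's down for me" report is the one-probe case, which is exactly the anecdotal testing this step is meant to replace.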
Analyze the Extent of the Site Outage
Site-wide? A specific page? Some specific component? Major outages rightly get more attention from your organization. Understanding the extent of the problem helps you decide whether all hands are needed on deck to tackle the incident.
Uptime.com helps organize and report on the uptime of a specific system. SLA Reporting keeps your teams accountable with specific and measurable thresholds that you can use to verify your site and applications are up to standard or meeting your SLAs, while Public Status Pages give you the power to control messaging and communication during outages and incidents.
When you need that extra time to work on restoring connectivity, let your users and subscribers know. Updates on your side go a long way to restoring user trust.
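The arithmetic behind an SLA threshold is simple enough to sketch. The example below assumes a 30-day reporting period and a 99.9% target, both chosen purely for illustration, and the function names are hypothetical rather than part of any Uptime.com report.

```python
def uptime_percent(period_minutes, downtime_minutes):
    """Percentage of the reporting period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def downtime_budget_minutes(period_minutes, target_percent):
    """Minutes of downtime allowed in the period while still meeting the target."""
    return period_minutes * (100.0 - target_percent) / 100.0


def meets_sla(period_minutes, downtime_minutes, target_percent):
    """True when measured uptime is at or above the SLA target."""
    return uptime_percent(period_minutes, downtime_minutes) >= target_percent


MONTH = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day period

print(round(downtime_budget_minutes(MONTH, 99.9), 1))  # about 43.2 minutes allowed
print(round(uptime_percent(MONTH, 60), 2))             # 99.86 after a one-hour outage
print(meets_sla(MONTH, 60, 99.9))                      # False
```

Under these assumptions, a single one-hour outage is enough to miss a 99.9% monthly target, which is a useful number to have on hand when deciding how many people to pull in.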
Review Your Logs
Disk space and bandwidth are common reasons for site outages. Spikes in traffic and new deployments should be closely monitored so they don’t break your system.
Your internal infrastructure provides lots of valuable clues throughout an outage, but may require private location monitoring behind a firewall or over a proxy connection. Adding internal and external monitoring gives a holistic view of your systems, usually referred to as improving observability.
Systems with high observability tend to be easier to repair, as you can correlate one outage on the frontend to something breaking on the backend. You should log what you’ve already done at this point so you know what hasn’t worked.
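Since disk space is one of the usual suspects, a quick check on the affected host can rule it in or out. This is a minimal sketch using Python's standard library; the 90% threshold and the function names are assumptions for illustration, and the path you check will depend on where your application writes data.

```python
import shutil


def disk_usage_percent(path="."):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def disk_warning(path=".", threshold_percent=90.0):
    """Return a warning string when usage is at or above the threshold, else None."""
    percent = disk_usage_percent(path)
    if percent >= threshold_percent:
        return f"WARNING: {path} is {percent:.1f}% full"
    return None


# Prints a warning only when the assumed 90% threshold is crossed.
message = disk_warning(".", threshold_percent=90.0)
print(message or "Disk usage is below the threshold")
```

Whatever the result, note it in your running log so the next responder knows disk space has already been checked.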
Start Troubleshooting
Getting started can be a surprising time sink. The right alerting system can help. Uptime.com has thousands of potential integrations, including direct partnerships, for sending and receiving alerts or response time metrics. A few clicks and you have an alerting system ready to report on the precise issue your website is facing to the platforms you already use and trust.
The information an alert contains can be just as important as who receives it. The contents of the alert should clearly state the problem: any error codes encountered or specific strings or assets that failed to load or be detected. From basic checks like HTTP(S), which can check for uptime and validate TLS, to more advanced checks like Transaction or API checks that mimic user actions, Uptime.com offers powerful alerting to address downtime.
These clues help provide the starting point for your investigative efforts.
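As a rough picture of what a useful alert carries, the sketch below bundles the check name, the URL, any error code, and the assertion that failed into one readable line. The `Alert` class and its field names are hypothetical and do not reflect the Uptime.com alert format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert payload; field names are illustrative only."""
    check_name: str
    url: str
    status_code: Optional[int] = None
    failed_assertion: Optional[str] = None

    def summary(self) -> str:
        """One-line description a responder can act on immediately."""
        parts = [f"[DOWN] {self.check_name}: {self.url}"]
        if self.status_code is not None:
            parts.append(f"HTTP status {self.status_code}")
        if self.failed_assertion:
            parts.append(f"failed: {self.failed_assertion}")
        return " | ".join(parts)


alert = Alert(
    check_name="Checkout page",
    url="https://example.com/cart",
    status_code=503,
    failed_assertion='expected string "Thank you" not found',
)
print(alert.summary())
```

Compare that with an alert that only says "site down": the first tells the responder where to look, the second only tells them to start looking.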
Brainstorm Some Possible Fixes
Have you tried turning it off and on again? By now, you may have a working theory of what caused the problem, and it’s time to start acting. If you are still unsure, your next steps might be to try rebooting services or looking at restoring to an earlier version.
Very often, a specific component will break but the site will operate normally otherwise.
A hotfix can lead to other issues, so be careful how you apply it and how you monitor after the job is done.
Synthetic monitoring can be useful here, as it mimics specific user actions that are critical to your goal flows. When combined with Real User Monitoring (RUM), you can study the impact of a deployment before and after it’s made.
Let’s say for example that your website’s shopping cart breaks, and you manage to fix it. A normal check would show the site as functional, but what if the Thank You page stopped working, or if the item wasn’t showing a picture to the end user? Transaction checks can help catch these easy-to-miss errors.
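The shopping cart scenario can be sketched as an ordered list of steps where the first failure is reported by name. Everything here is a stand-in: the `pages` dictionary fakes rendered HTML, and `run_transaction` is a hypothetical helper, not the way Uptime.com Transaction checks are implemented. A real check would drive a browser or an HTTP client.

```python
def run_transaction(steps):
    """Run (name, check) steps in order; return the first failing step's name, or None."""
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            ok = False  # a step that raises is treated as a failed step
        if not ok:
            return name
    return None


# Hypothetical rendered pages after the cart fix was deployed.
pages = {
    "cart": "<h1>Your cart</h1><img src='item.png'>",
    "thank_you": "",  # simulates a Thank You page that stopped rendering
}

steps = [
    ("cart page loads", lambda: "Your cart" in pages["cart"]),
    ("item image present", lambda: "<img" in pages["cart"]),
    ("thank you page loads", lambda: "Thank you" in pages["thank_you"]),
]

print(run_transaction(steps))  # prints: thank you page loads
```

A basic availability check would stop at the first step and report success, while the step-by-step version points straight at the page that broke.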
Understanding performance with clear visuals also helps communicate the state of your application to stakeholders quickly and efficiently.
Confirm Connectivity
It’s back! …Or is it?
Uptime.com can help here too with automated (and accurate) alerting that tells you when connectivity is verified as restored. Anytime our probe servers signal your site is down, we run tests until each one returns to up status. With multiple confirmations around the globe, you can feel secure in the knowledge that your fixes worked, and connectivity is restored.
Prompt alerting also means you can focus on your next task and put this outage behind you.
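The idea of waiting for every location to return to up status can be sketched with a small tracker. The class name and location labels are assumptions for illustration only, and the logic is a simplification of what a real monitoring service does.

```python
class OutageTracker:
    """Tracks which (hypothetical) probe locations currently see the site as down."""

    def __init__(self):
        self.down_locations = set()

    def record(self, location, is_up):
        """Store the latest result reported by one location."""
        if is_up:
            self.down_locations.discard(location)
        else:
            self.down_locations.add(location)

    @property
    def recovered(self):
        """True only when no location still reports the site as down."""
        return not self.down_locations


tracker = OutageTracker()
tracker.record("us-east", False)
tracker.record("eu-west", False)

tracker.record("us-east", True)
print(tracker.recovered)  # False: eu-west still reports the site as down

tracker.record("eu-west", True)
print(tracker.recovered)  # True: every location that saw the outage now sees recovery
```

Declaring victory after the first successful response is tempting, but a partial recovery looks identical from a single location.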
Confirm the Timeline
Often, outages require some kind of post-incident reporting (or post-mortem reporting for the morbid among us). This type of reporting might examine the timeline of an outage, who responded, and whether this person was on call. It might also correlate this outage with other recent downtime to investigate whether it’s part of a deeper problem.
Alert history and a comprehensive audit log can help here. You can quickly and easily gather information on recent outages and changes you have made to Uptime.com. Put it all together, super sleuth, and you get the whole picture of the outage.
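Putting it all together mostly means sorting events by time and measuring the span. This minimal sketch assumes you have exported events as timestamp and description pairs; the timestamps and descriptions below are made up, and `build_timeline` is a hypothetical helper.

```python
from datetime import datetime


def build_timeline(events):
    """Sort (timestamp, description) pairs and return formatted lines plus the total span."""
    ordered = sorted(events, key=lambda event: event[0])
    duration = ordered[-1][0] - ordered[0][0] if ordered else None
    lines = [f"{ts.isoformat(timespec='minutes')}  {text}" for ts, text in ordered]
    return lines, duration


# Made-up events, deliberately out of order, as they might arrive from different sources.
events = [
    (datetime(2024, 1, 1, 9, 47), "Connectivity confirmed restored"),
    (datetime(2024, 1, 1, 9, 5), "First down alert received"),
    (datetime(2024, 1, 1, 9, 12), "On-call engineer acknowledged"),
]

lines, duration = build_timeline(events)
for line in lines:
    print(line)
print("Total duration:", duration)  # Total duration: 0:42:00
```

Gaps in the timeline are informative too: a long stretch between the first alert and the acknowledgment is a finding for the post-incident report.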
Document the Incident
The greatest lie the DevOps devil ever told is “we don’t need to document this.”
Each time you restore connectivity, at a minimum, you should document what you did to help the situation. Call it an exercise in decompression.
From something as simple as rebooting the service to more complicated efforts stretching across teams, the more you document today’s outage the better prepared you are tomorrow. We have a few thoughts on items of importance:
- Did the outage affect other systems?
- Can we automate the actions our responders took to mitigate downtime?
- Was the right person alerted, and did the alert contain useful details?
Don’t Ignore Small Outages
Maybe the outage was resolved before you started your analysis. Is it any less important? Are there other smaller outages within the past 24-72 hours? What about the past week or month?
Smaller outages do very often signal larger issues. If your shopping cart is failing off and on, or you’re seeing other smaller outages, know that they do affect your end users and your bottom line. You won’t know whether a provider or a service is reliable if you’re not actively and continuously monitoring it for uptime (and downtime).
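One way to keep small outages from slipping by is to count how many fell inside a recent window. The sketch below uses a 72-hour window and a threshold of three outages; both numbers, along with the function names and sample timestamps, are assumptions for illustration.

```python
from datetime import datetime, timedelta


def recent_outages(outage_starts, now, window_hours=72):
    """Return the outage start times that fall within the last window_hours."""
    cutoff = now - timedelta(hours=window_hours)
    return [start for start in outage_starts if cutoff <= start <= now]


def looks_recurring(outage_starts, now, window_hours=72, threshold=3):
    """True when the number of recent outages reaches the assumed threshold."""
    return len(recent_outages(outage_starts, now, window_hours)) >= threshold


now = datetime(2024, 1, 10, 12, 0)
outage_starts = [
    datetime(2024, 1, 2, 8, 0),    # outside the 72-hour window
    datetime(2024, 1, 8, 14, 30),
    datetime(2024, 1, 9, 3, 15),
    datetime(2024, 1, 10, 11, 45),
]

print(len(recent_outages(outage_starts, now)))  # 3
print(looks_recurring(outage_starts, now))      # True
```

Three short blips in three days might each look harmless alone, and that is exactly why it helps to count them together.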
Audit your Web Monitoring
Here is a great tip for observability: add more monitoring after the outage.
If your downtime affected a system you were not previously monitoring, you have just discovered a sacred treasure. Perhaps it was hidden by the sands of time, or by a lack of documentation. Whatever the root cause, very often a good solution to an outage is to monitor more of your system so you better understand how interlinked services affect one another and can even better predict future downtime.
Wrapping it Up
So now you’re ready to face any outage with confidence, right? Well, maybe you’re off to a better start but it’s time to think about the future as you experience new downtime. Here’s some helpful reading on runbooks, which give your team a living “instruction manual” for when the application breaks.
It’s also helpful to simulate some downtime and thoroughly test your systems for a few reasons:
- Your team learns what it’s like to “drive” an incident
- You will learn application vulnerabilities you didn’t know existed
- Everyone gets more experience operating under pressure and deadlines
Learning safe testing methods is a great step smaller organizations can build into their development flows that will pay dividends as the business grows. As your infrastructure evolves, Uptime.com is here to provide monitoring peace of mind.