Runbooks: What They Are and Why You Need One Yesterday

Let’s talk about The Legend of Zelda: A Link to the Past, and how it relates to DevOps. The game tasks our hero with finding three pendants, which unlock a Master Sword he can use to travel to an alternate realm and ultimately take down the bad guy.

The US version of this SNES masterpiece came packaged with a fairly detailed instruction manual that contained an optional guide at the end to help locate the three pendants. Step by step, almost room by room, the developers guided the player through each of the first three temples to make navigating the game as simple as possible.

You knew what to do, and just had to use the tools you were given to do it.

That’s the essence of runbooks.

What Is a Runbook?

A runbook is an instruction manual for the hard times you will inevitably face in DevOps. Runbooks help remove the “thinking” aspect of an incident. When you can focus on execution, you worry less about what to do and more about whether it is working.

For example:

You receive an alert that a script you are running has timed out. Errors like this one can be challenging to diagnose without a runbook to reference, so one of the first steps you can take during a downtime incident is to document your actions.

Once you have this foundation, you can add documentation to build your runbook as you encounter errors over time.

What a Runbook Is Not

Remember that your instructions are not a substitute for thinking on the fly. Going back to our Zelda guide, the developers don’t play the game for you; they just tell you what you need to know.

A runbook isn’t going to replace your engineering team, nor does it render automation meaningless. In fact, as you will see, it is possible to uncover more opportunities to automate when you consistently document your process.

Runbooks are not law. It’s possible to write documentation whilst actively observing a process and still miss a meaningful detail. It’s important to teach your team that runbooks are made more reliable when they are actively improved over time.

Building a Meaningful Runbook

Describe action by action what is happening on screen and what the user must enter or look for. Name filenames, cite code where appropriate. Take screenshots! With runbooks, details are key.

Ask for feedback. Reading a lot of documentation can be tedious, especially for engineers, but even reluctant team members are helpful when you are quantifying their needs.

To quote the Site Reliability Engineering: How Google Runs Production Systems book, “clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page.”

Keeping Runbooks Current

A runbook is not something you write once before going about your business. It’s a living document that receives changes as your infrastructure evolves. If you’re making releases daily, you had best update your runbook daily as well.

That’s a pretty tall order once we start thinking about how many (and how different) systems need to work together to make your application work.

The question becomes: how do you maintain a living runbook without incurring excessive costs developing it?

Cost Savings with Runbooks

Fortunately, we can help here too.

One of the ways you can improve your runbook is by keeping its entries general when possible. If we look at our above example, where the script times out, we likely know some investigative steps that will fit every use case.

Even something as simple as “Track the timeline of the outage on Uptime.com using Real Time Analysis” can direct engineers to a resource that provides data they can use to better understand the problem.

Keep It Searchable

Another great way to improve your documentation is to carefully consider the keywords in your headings. Think with Ctrl/Command+F. What is your user most likely to search for when website downtime strikes?

Taking the time to craft important headers will pay off in the long term.
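To make the Ctrl/Command+F idea concrete, here is a minimal sketch of a heading search. The function name `search_headings` and the sample headings are our own illustrations, not part of any particular tool. It assumes headings are worded the way an alert would read, so the words an engineer types are the words the heading contains.

```python
def search_headings(headings, query):
    """Return headings that contain every word of the query, ignoring case."""
    words = query.lower().split()
    return [h for h in headings if all(word in h.lower() for word in words)]


# Hypothetical headings, phrased the way the alert itself would be worded.
HEADINGS = [
    "Alert: script timed out",
    "Alert: text string not found",
    "Alert: incorrect header information returned",
]

if __name__ == "__main__":
    # An engineer searching for the words from the alert they just received.
    print(search_headings(HEADINGS, "timed out"))
```

If a search for the exact words of an alert returns nothing, that is usually a sign the heading needs rewording rather than the engineer needing a better query.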

You may have a single entry that deals with all alerts of a specific code, or all alerts that conform to a specific problem (such as text strings not found, or incorrect header information returned). Hierarchy of knowledge is important. Here is a rule we live by here at Uptime.com:

  1. Provide the most essential high-level information first
  2. Dig into the specifics (usually instructions)
  3. Dig into explanations (usually use cases)
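One way to hold yourself to the three-level rule above is to give every entry the same shape. The sketch below is only an illustration: the `RunbookEntry` class, its field names, and the sample entry are hypothetical, and your own runbook may live in a wiki or a document rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    """One runbook entry, ordered from most to least urgent information."""

    heading: str                                            # searchable title
    summary: str                                            # 1. high-level information
    steps: list[str] = field(default_factory=list)          # 2. specifics (instructions)
    explanations: list[str] = field(default_factory=list)   # 3. explanations (use cases)

    def render(self) -> str:
        """Return the entry as plain text in the order a responder reads it."""
        lines = [self.heading, "", self.summary, ""]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        if self.explanations:
            lines.append("")
            lines += [f"Note: {text}" for text in self.explanations]
        return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical entry for the script timeout example used earlier.
    entry = RunbookEntry(
        heading="Alert: script timed out",
        summary="A monitored script did not finish within its time limit.",
        steps=["Track the timeline of the outage.", "Document each action you take."],
        explanations=["Timeouts often follow a recent release."],
    )
    print(entry.render())
```

Because the summary always comes before the steps, and the steps before the explanations, a responder can stop reading as soon as they have what they need.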

Be Adaptable

Our final tip is to start a dialogue with your team both before you create your runbook and during your documentation process. Talk with your engineers, listen to their concerns, and decide together on a structure that best suits the needs of your team. After all, they are the ones getting up at 3 in the morning.

Adaptability does not imply flexibility. Settle on the kind of runbook you want and stick to that structure. You may find it’s best to have a gatekeeper for the runbook and some process for pushing changes to it so it stays organized. Maybe that’s GitHub, maybe Zendesk, maybe just a good old-fashioned DOCX.

Automation or Runbook?

As you document, you may start to notice certain procedures repeat themselves. This is a good opportunity to make decisions about what is worth automation and what requires human intervention.

True/false conditions are easy to program for, and they are just as easy to document. If you notice lots of simple steps repeating, raise it with your team and automate the process.

On the other hand, you may want humans at the helm when the choice is whether to migrate something or simply perform a reboot. That’s not a trivial decision.
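If your incident notes are consistent, spotting those repeats can itself be a small script. The sketch below is a rough illustration under our own assumptions: each incident is recorded as a list of step descriptions, the same step is always worded the same way, and the names `automation_candidates`, `min_repeats`, and `needs_human` are hypothetical.

```python
from collections import Counter


def automation_candidates(incidents, min_repeats=3, needs_human=frozenset()):
    """Return steps seen in at least `min_repeats` incidents, minus judgment calls."""
    # set(steps) counts a step once per incident, even if it was repeated within it.
    counts = Counter(step for steps in incidents for step in set(steps))
    return sorted(
        step
        for step, seen in counts.items()
        if seen >= min_repeats and step not in needs_human
    )


# Hypothetical incident notes, one list of documented steps per incident.
INCIDENTS = [
    ["check disk space", "restart worker"],
    ["check disk space", "decide whether to migrate or reboot"],
    ["check disk space", "restart worker", "decide whether to migrate or reboot"],
]

if __name__ == "__main__":
    print(
        automation_candidates(
            INCIDENTS,
            min_repeats=2,
            needs_human={"decide whether to migrate or reboot"},
        )
    )
```

The `needs_human` set is where the migrate-or-reboot kind of decision goes: the step may repeat often, but it stays with a person.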

Runbook documentation lets you make more informed decisions about how to handle your results.

Your Runbook for Incident Management

Here are the important takeaways as you consider what your runbook will look like.

Start the conversation now. Whatever your team does or does not have to work from, it does no one any harm to ask “How can we do this better?”

Write as soon as possible. Once you have settled on some of the bare minimum requirements for a runbook, work on it.

Revise your runbook frequently. Every time your system changes, your runbook should change.

Develop a structure. As you grow your runbook, it’s likely others will need to make suggestions to it. Creating a system early on that allows for this kind of collaboration is critical to maintaining a document your team will utilize into the future. You’ll also want to consider how engineers utilize the runbook as you edit and revise it, taking time to make the document searchable and easy to read.

Automate based on your findings. The act of keeping a runbook is self-reflective. You will find many processes worth automating that can simplify everyone’s life.

The ultimate goal of a runbook is simple: to reduce the number of on-call hours spent on incident resolution. Any improvements made toward that goal are quantifiable wins.
