Runbook Concepts & Best Practices

A runbook is a combination of documentation and automation enabling teams to codify processes.

In Transposit, automation can be incrementally added via hundreds of pre-built actions using the no-code builder, or built using code-level customization.

Runbooks can be triggered manually or automatically.

Understanding Runbooks

There are different types of runbooks, but in general, they are a set of directions for someone unfamiliar with a system. An engineer responding to an incident might be unfamiliar because they are new to the system, or because it's three in the morning and they got paged, or because they built the system but haven't touched it in a few months or years.

If the runbook is focused on incident management, it typically explains how to fix a system well enough for a few hours until an owner of a system can provide deeper diagnostics, root cause analysis (RCA), and resolution. Other runbooks might describe a business process like how to run a monthly report or a development process like how to set up a test environment.

But shouldn't everything be automated? While automating major tasks is ideal, humans often need to be involved. The task might not be worth automating due to edge cases, complexity or frequency of change. Human judgement or approval could be required due to compliance or other factors outside the system. Or the process could be manual because there just aren't enough developers to fully automate it.

Best Practices for Runbooks

There are five attributes of any good runbook: the five As. A good runbook must be:

  • Actionable. It's nice to know the big picture and architecture of a system, but when you are looking for a runbook, you're looking to take action based on a particular situation.
  • Accessible. If you can't find the runbook, it doesn't matter how well it is written.
  • Accurate. If it doesn't contain truthful information, it's worse than nothing at all.
  • Authoritative. It is confusing to have more than one runbook for any given process.
  • Adaptable. Systems evolve over time, and if you can't change your runbook, the drift will make it unusable.

Let us examine each of these in turn.

Actionable

It should be clear what each runbook is trying to accomplish. All the tasks should work toward that goal.

A runbook is, at its core, a list of tasks.

Each task should be discrete and completable. Long explanations of how systems work should be offered as an aside (perhaps a link), not mentioned in the task list.

Tasks don't need to affect the system to be actionable. "ssh into the system and run a tail command" is a useful task, but won't typically affect the server. However, tasks can affect the system, for example running cp /dev/null /path/to/logfile.

The actions should build toward the end goal. Tasks that don't help that goal should be removed. It is important that you write the runbook with the end user and their understanding in mind. Of course, you want to write for a wide audience, but at the end of the day a runbook for someone who is unfamiliar with your systems will be different than a runbook for a senior SRE intimately familiar with its nooks and crannies.

Each task should be one bullet point or line. Here are some good task definitions:

  • "log in to the web server"
  • "view this graph in AWS CloudWatch"

In contrast, a poor task definition might be something like "log in to the web server, navigate to the nginx config directory, edit it to increase the number of workers (the worker_processes value) and restart the server".

There are too many things to do in this task. Make sure the team is on the same page regarding the level of detail in runbooks.
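One way to keep a team aligned on task granularity is a lightweight check over the task list. The sketch below is a heuristic only: the function name `looks_overloaded` and both thresholds are assumptions chosen for illustration, not rules from this guide or features of any tool.

```python
def looks_overloaded(task: str, max_words: int = 12, max_commas: int = 1) -> bool:
    """Heuristic: flag a task that is long or chains several comma-separated steps.

    Both thresholds are illustrative assumptions; tune them to your team's conventions.
    """
    too_long = len(task.split()) > max_words
    too_many_clauses = task.count(",") > max_commas
    return too_long or too_many_clauses


tasks = [
    "log in to the web server",
    "view this graph in AWS CloudWatch",
    "log in to the web server, navigate to the nginx config directory, "
    "edit it to increase the number of workers and restart the server",
]

# Collect the tasks that probably need to be split into smaller ones.
flagged = [task for task in tasks if looks_overloaded(task)]
```

Here only the third task is flagged. A check like this cannot judge whether a task is actually discrete and completable, so treat a flag as a prompt for a human to take a second look.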

Sometimes a bit of context can be helpful. This is especially true if the task is scary or permanent; who among us hasn't checked twice when running a delete statement on a production table or restarting a database? However, much beyond a sentence is clutter. Adding a link to a URL or reference doc is a great way to provide more context that can be accessed by the human working through the runbook as needed.

Some runbooks are for typical tasks: pulling a production database down, scrubbing PII, and installing it to a development environment. Others are for incidents or outages: how to go about troubleshooting the failure of a service. In the latter case, an RCA should take place. Closing the loop and ensuring that this learning process takes place is important. You should make sure items to be remediated are added into both the software development roadmap and operational procedures, including runbooks, as applicable.

Accessible

If the end user can't find the runbook when it is needed, it isn't of much use. The best way to find a runbook is to have it delivered to you right when you need it. This is especially useful when the runbook is helping you manage or investigate an outage. A good practice is to associate a runbook with every alert. Even if that runbook is a stub with only general guidance, it's a start, and it can be updated when it is used.
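The alert-to-runbook association can be sketched as a simple lookup with a stub fallback. Everything named here (the alert names, the runbook paths, `STUB_RUNBOOK`, and both functions) is hypothetical and only illustrates the idea; a real alerting system will have its own way of linking alerts to documents.

```python
STUB_RUNBOOK = "runbooks/general-triage"  # hypothetical stub with only general guidance


def runbook_for(alert_name: str, alert_runbooks: dict[str, str]) -> str:
    """Return the runbook linked to an alert, falling back to the stub."""
    return alert_runbooks.get(alert_name, STUB_RUNBOOK)


def alerts_without_runbook(alert_names: list[str], alert_runbooks: dict[str, str]) -> list[str]:
    """List alerts that still rely on the stub, so someone can write a real runbook."""
    return sorted(name for name in alert_names if name not in alert_runbooks)


# Hypothetical data: one alert has a dedicated runbook, one does not.
alert_runbooks = {"db-replication-lag": "runbooks/db-replication-lag"}
alert_names = ["db-replication-lag", "api-5xx-rate"]

missing = alerts_without_runbook(alert_names, alert_runbooks)
```

Running the second function periodically gives a list of alerts whose stubs are worth filling in the next time they fire.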

To ease discovery, you can also attach metadata to your runbooks such as:

  • the type of runbook: is this for an incident, a problem, scheduled maintenance, development?
  • creation date and time
  • update date and time
  • author(s)
  • major systems referenced
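The metadata fields listed above could be modeled as a small record. This is a minimal sketch: the class and field names are assumptions made for illustration, not a schema used by Transposit or any other tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class RunbookType(Enum):
    INCIDENT = "incident"
    PROBLEM = "problem"
    SCHEDULED_MAINTENANCE = "scheduled maintenance"
    DEVELOPMENT = "development"


@dataclass
class RunbookMetadata:
    title: str
    runbook_type: RunbookType
    created_at: datetime
    updated_at: datetime
    authors: list[str] = field(default_factory=list)
    systems: list[str] = field(default_factory=list)  # major systems referenced


# Hypothetical example values.
meta = RunbookMetadata(
    title="Restart the web tier",
    runbook_type=RunbookType.INCIDENT,
    created_at=datetime(2023, 1, 10, tzinfo=timezone.utc),
    updated_at=datetime(2024, 3, 2, tzinfo=timezone.utc),
    authors=["sam"],
    systems=["nginx"],
)
```

Using an enumeration for the type keeps the set of categories small and consistent, which helps when filtering later.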

Another way you can make your runbooks accessible is to make them searchable. Wherever they are stored, you'll want to be able to search both the content and the metadata. A normal search interface may suffice, but you may also want to consider having them be searchable from the command line or your chat system (Slack, MS Teams, etc.) depending on who the end user is.
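Searching both content and metadata can be as simple as a case-insensitive substring match across every field. The sketch below assumes runbooks are stored as dictionaries with the hypothetical keys shown; a real store would more likely use a search index, but the principle is the same.

```python
def search_runbooks(runbooks: list[dict], query: str) -> list[str]:
    """Return titles of runbooks whose content or metadata mention the query."""
    needle = query.lower()
    matches = []
    for runbook in runbooks:
        # Combine the body text and the metadata into one searchable string.
        searchable = " ".join(
            [runbook["title"], runbook["content"], runbook["type"]]
            + runbook["authors"]
            + runbook["systems"]
        ).lower()
        if needle in searchable:
            matches.append(runbook["title"])
    return matches


# Hypothetical runbooks used only to exercise the function.
runbooks = [
    {"title": "Restart the web tier", "content": "log in to the web server",
     "type": "incident", "authors": ["sam"], "systems": ["nginx"]},
    {"title": "Refresh dev database", "content": "scrub PII before loading",
     "type": "development", "authors": ["alex"], "systems": ["postgres"]},
]

by_system = search_runbooks(runbooks, "nginx")
by_content = search_runbooks(runbooks, "pii")
```

The same function could sit behind a command-line tool or a chat command, depending on where the end user is likely to look.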

Depending on the size and maturity of your organization, you'll want to consider who has permission to access which runbooks. While you should never store credentials in a runbook, runbooks can still contain information about your systems which could be useful to an attacker. So you may want to have some way to tie the user looking for a runbook to an organizational directory. That will ensure users have access to what they need, but not more.

Accurate

An inaccurate runbook will cause multiple problems:

  • users will be led down the wrong path, keeping them from achieving the goal for which they found the runbook
  • users will avoid using runbooks in the future, due to distrust

So how can you make sure your runbook is accurate?

The first and best thing you can do is make sure that everyone who runs a runbook has the ability to give feedback on its accuracy. This may include leaving a note, filing a PR, or just changing it. Depending on the size and maturity of your organization, and exactly which system the runbook is associated with, you may want to have a review policy.

There's a tension between making it easy to change runbooks, which leads to group ownership and flexibility as the systems the runbook references change, and making sure that any changes to the runbook are accurate. What you want to do is lower the friction of changing a runbook as much as possible, while maintaining accuracy. This of course depends on the system in question: runbook changes concerning a production database should be vetted more than changes to a runbook documenting how to get a development environment set up.

A tactical suggestion to increase accuracy is to copy and paste commands into the runbook, rather than re-typing them. This avoids transcription errors. When writing a runbook for the first time, perform the tasks at least a couple of times. Though it is tedious, it is surprising what a human brain may misremember.

Keep track of when a runbook was last updated (this is typically pretty easy) but also, if possible, when it was last run. Links to incident alerts or Slack conversations can be helpful in tracking this. You can also have a final task to update the 'last run' field as part of running a runbook. The last updated and last run dates are a valuable signal to the end user about the efficacy of the runbook.
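The "final task" that records a run could look like the sketch below. The dictionary keys `last_run` and `run_references`, the function name, and the example link are assumptions for illustration; the point is only that each run leaves a timestamp and, optionally, a link back to the incident or chat thread.

```python
from datetime import datetime, timezone
from typing import Optional


def record_run(runbook: dict, reference: Optional[str] = None,
               when: Optional[datetime] = None) -> dict:
    """Final task of a runbook: stamp 'last_run' and keep a link to the incident or thread."""
    runbook["last_run"] = when or datetime.now(timezone.utc)
    if reference:
        # Keep every reference so the run history stays traceable.
        runbook.setdefault("run_references", []).append(reference)
    return runbook


# Hypothetical runbook record and incident link.
runbook = {"title": "Restart the web tier",
           "last_updated": datetime(2024, 3, 2, tzinfo=timezone.utc)}
record_run(runbook, reference="https://example.com/incidents/123",
           when=datetime(2024, 5, 20, tzinfo=timezone.utc))
```

Passing `when` explicitly is mainly useful for backfilling or testing; in normal use the current time is recorded.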

Runbooks can overflow as they are edited. Remember to keep them focused on one goal. Don't be afraid to split runbooks and then insert references, if required.

Authoritative

Make sure there is one runbook, and only one, for each process. If need be, reference other runbooks via links or any other supported mechanism. If there are multiple runbooks for a given scenario, you'll want to combine them into one and make sure the others are archived.

Not much is more frustrating than finding a great set of instructions, following them faithfully and then realizing they are out of date or incorrect, and that the right runbook was the second search result. To help with this, depending on the scale of your organization, you can implement a feedback system so that people who find an out-of-date or duplicate runbook can let the owners know about the issue.

Adaptable

The unfortunate truth is that as soon as you write down a set of tasks in a runbook, you've accumulated technical debt. You can fight it by keeping runbooks up to date as a system changes. Almost all software systems evolve over time. Runbooks therefore need to evolve as well. If they don't, they'll be neither accurate nor actionable.

Ways to encourage adaptability include:

  • Have a clear owner for each runbook, ideally the owner of the system it covers. If there are cross-system runbooks, make sure that each team owns a section of the runbook.
  • Encourage change. Within the constraints of accuracy, security and compliance concerns, let anyone modify a runbook.
  • Treat a broken runbook the same way as a broken test: fix it.
  • Don't be afraid to prune runbooks and remove unused information.
  • Have a feedback loop. Make updating runbooks part of your software release process and your incident management post mortems.
  • Celebrate when a runbook works! Post in Slack about a particularly good checklist or task description. The time a good runbook saves is often hidden, so make sure you surface the win.
  • Automate when and where it makes sense. This can include tooling to increase visibility into systems, to make changes, or to take entire runbooks and turn them into single commands.

Encouraging adaptability will make sure your runbooks remain relevant.

Stale runbooks

What should you do when a runbook is stale? First, how do you know when a runbook is stale? Unfortunately, there's no easy way. If a runbook was last updated four years ago, that could be an indication of staleness, or it could be that the system the runbook applies to is stable. If the runbook has been updated in the last week, it could be because it's well maintained or because the corresponding system is changing relatively quickly.

The best way to see if a runbook is stale is to perform the actions in the runbook. You can also ask if anyone has used it lately, or see if it has been recently referenced in your incident management system.
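Because dates alone cannot prove staleness, they are best used to pick candidates for a manual walk-through. The sketch below flags a runbook only when it has been neither updated nor run within a window; the function name and the 180-day default are assumptions, not recommendations from this guide.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_review(last_updated: datetime, last_run: Optional[datetime],
                 now: datetime, max_idle_days: int = 180) -> bool:
    """Flag a runbook for a manual walk-through if it was neither updated nor run recently."""
    # Use whichever activity is more recent; a runbook that was never run has no last_run.
    most_recent = max(d for d in (last_updated, last_run) if d is not None)
    return now - most_recent > timedelta(days=max_idle_days)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old_update = datetime(2020, 1, 1, tzinfo=timezone.utc)
recent_run = datetime(2024, 5, 20, tzinfo=timezone.utc)

# Updated years ago but run last month: probably a stable system, not a stale runbook.
stable = needs_review(old_update, recent_run, now)
# Updated years ago and never run: worth performing the tasks to check.
candidate = needs_review(old_update, None, now)
```

A flagged runbook is not necessarily stale; the flag only says that performing its actions, or asking whether anyone has used it, is overdue.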

At the point where you've determined a runbook is stale, you have two choices. You can correct the inaccurate or missing tasks and bring the runbook up to speed. If the system is still in use, this is the best choice.

Or you can archive the runbook. If a runbook no longer applies, mark that fact clearly. You may be afraid of throwing away knowledge, in which case plan to have an archive folder and tag old runbooks with "ARCHIVED" in their title and in the first header area. But if you're more confident, just delete it. Most modern documentation systems have a way to resurrect deleted content.

Conclusion

Runbooks are great ways to systematize knowledge. Having all important tasks automated is a worthy goal, but humans are often in the loop. This could be because the process isn't worth automating due to complexity, usage patterns, or change frequency. It could be because human judgement is needed. Or it could be because the resources to fully automate simply are not available.

Even with automation, runbooks document processes and procedures in an accessible, durable manner. You just have to make sure they are always actionable, accessible, accurate, authoritative and adaptable.
