An oncall meta-runbook

“An oncall rotation is where my freedom becomes your responsibility”.

No individual engineer can sustainably and reliably provide a service[1] to an engineering organization. Teams, on the other hand, can be architected to ensure that services meet the needs of their customers. Oncall is a process by which a team ensures that all of its functions are suitably staffed to meet the team’s SLOs. It is, necessarily, a transfer of responsibility from the one, two, or three individuals who have firsthand knowledge of a component to a (hopefully) larger group of people who will be responsible for its ongoing maintenance.

An effective oncall rotation is built upon diffusion of relevant knowledge. This is a tradeoff: if the oncall rotation is too large, the n-squared communication overhead of keeping everyone on the same page becomes prohibitive. If the rotation is too small, then availability of the entire rotation cannot be much greater than the availability of its individual members.
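This tradeoff can be made concrete with a back-of-the-envelope model. The sketch below is my illustration, not the author’s: it assumes each member is independently available with probability p, and counts pairwise communication channels as a proxy for the n-squared overhead.

```python
def rotation_availability(p: float, n: int) -> float:
    """Probability that at least one of n members can respond,
    assuming each is independently available with probability p."""
    return 1 - (1 - p) ** n

def communication_pairs(n: int) -> int:
    """Pairwise channels to keep in sync: n*(n-1)/2, growing ~n^2."""
    return n * (n - 1) // 2

# Growing the rotation from 2 to 8 barely moves availability
# (0.99 vs ~1.0 at p=0.9), but the number of channels to keep
# in sync grows from 1 to 28.
for n in (2, 4, 8):
    print(n, round(rotation_availability(0.9, n), 4), communication_pairs(n))
```

The point of the toy model is only that availability saturates quickly while coordination cost does not, which is why rotations have a sweet spot in size.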

With that in mind, here are some guidelines I like to follow for building effective oncall rotations.

Equality

Not everyone will be equally skillful in handling every page, but everyone should be equally allowed to handle every page. For instance, if some alert requires a cryptographic ceremony to resolve, then only people with access to the relevant keys should be included in the oncall rotation. This also means that if there are third party services that are involved in the diagnosis or solution of some issue, then every member of the team should have access to those services. Similarly, if some binary is required to solve a problem, it should be accessible in a standard place (e.g. git), rather than sitting in a “workbooks” directory on someone’s laptop.

Implicit oncall rotations are bad. Telling the current oncaller “if X breaks over the weekend, just page me” makes you into a one-person implicit oncall rotation. This is a smell to be aware of and, typically, to avoid. Even in less extreme cases, we should be alert to situations where alerts are regularly being “bounced” from the oncaller to one individual or a small group of people, who become a de facto oncall rotation. An oncaller asking their teammate for help is great; an oncaller being continually forced to seek help is not.

People should have time to learn about the services that they are oncall for. If I’m a member of a six person oncall rotation with two three-person subteams, then part of my job should be set aside for staying up to speed on the services that the other half of my team is working on.

Paging

There should be a clear distinction between in-hours and out-of-hours tasks. Oncall can have many duties, from the emergent to the quotidian. Some teams decide to assign those duties to separate individuals. But in all cases, it should be clear to the oncaller which type of task they are working on.

Out-of-hours tasks should be reserved for things that are both urgent and broken. Paging someone should be reserved for situations where there is a high probability of the service being broken now or in the near future, and the oncaller can potentially take some action to remedy the situation. What does “high probability” mean? It’s up to the team. Some teams might be willing to accept a higher proportion of false positives than others.

“Urgent” can only be urgent for a short period of time. If a task has been “urgent” for more than a day, then one of two things is true: 1) it isn’t urgent anymore or 2) the team needs to surge more resources into solving the problem.

All automated alerts directed at your oncall rotation should be modifiable by members of that rotation. The worst possible situation is one where an oncaller gets alert fatigue and simply tunes out an alert. Changing the parameters of the alert, or reducing it so that it fires in only a subset of cases, should always be on the table. This doesn’t mean that the oncaller should silence the alert that says that all of their systems are on fire, but that they can do so if the need arises.

Services should be architected to place minimal burden on their oncallers. This is, of course, a topic all by itself. But just generally, services that are simple, scalable, easily debugged, tolerant to change, and which resemble other services in your team’s stack should be favored. In economics, the cash value of a capital asset (the service, in this case) is its revenue generating potential, times its expected longevity, minus the cost of its ongoing maintenance[2]. We’d like to maximize the value of a service to an organization, so we’d prefer to make services with high longevity[3] and low maintenance cost.
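The economic framing can be sketched with toy numbers. Below is my illustration of the revenue-minus-maintenance logic, using a simple discounted sum to stand in for net present value; the figures and the discount rate are invented.

```python
def service_value(annual_revenue: float,
                  annual_maintenance: float,
                  years: int,
                  discount_rate: float = 0.05) -> float:
    """Value of a service as the discounted stream of revenue it
    generates, minus the discounted cost of keeping it running."""
    return sum(
        (annual_revenue - annual_maintenance) / (1 + discount_rate) ** t
        for t in range(1, years + 1)
    )

# Same gross revenue, but high longevity and low maintenance
# dominate a short-lived, high-maintenance design.
durable = service_value(annual_revenue=100, annual_maintenance=20, years=10)
fragile = service_value(annual_revenue=100, annual_maintenance=60, years=3)
assert durable > fragile
```

This also shows why maintenance cost matters so much: it is subtracted every single year of the service’s life, so a small reduction in oncall burden compounds over a long-lived service.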

Consensus

Members of the oncall rotation should be involved at the early stages of the design for every service that they’ll be oncall for. At the very least, the rotation needs a gut check when bringing on new functionality. For instance, if someone says “I want to build an FPGA-based switch to do full cut-through TLS inspection”, this is a great opportunity to have a discussion about whether the team’s composition matches the skill set that will be needed to maintain such a service.

In a consensus-driven process, the consensus must be inclusive of every member of the rotation. It’s a reality that not every decision will be made by consensus. But in the majority of situations where consensus is possible, everyone on the rotation should have an opportunity to have a say. Let’s imagine Subteam A that builds Service Foo day-to-day is perfectly happy doing administrative tasks via an IPython interface. Members of Subteam B feel less comfortable with this interface, and would prefer a web-based admin portal for common oncall tasks. If a consensus is to be maintained, Subteam A should either build the web interface or accept a PR from Subteam B that does so.

It is the responsibility of each contributor to maintain a continuous flow of relevant context to their teammates. In other words, if you build something, you need to get your rotation-mates up to speed on it. This implies documentation, but also more active measures, like doing a dry run of typical administrative activities, and maybe even assigning some basic tasks to the “wrong” person in order to get them up to speed in an area they otherwise wouldn’t touch.

The moment that an oncall rotation “takes the pager” from a service author is a significant milestone. In fact, I would call it the significant milestone for a service. A service author should be preparing for this milestone well in advance of its actual arrival, by seeding context about their service—goals, architecture, and dependencies—to their teammates throughout the software development lifecycle. A well-run process for this makes the actual handoff into an anticlimax. Of course there are plenty of checklists that people consult for determining when a service is ready for handoff (e.g. “Does it emit metrics?”, “Does it autoscale?”); these are great, but somewhat out of scope for the purposes of this document. Keep in mind that—for most services that are iteratively built upon—the handoff process is a continuous one.

Every team has a risk level that they’re willing to accept. Calibrating oncallers’ risk tolerances with one another—to say nothing of the rest of the organization!—is an important task. Without this calibration, services that look sustainable to one member of the rotation will look impossibly dangerous to another.

Conclusion

Without oncall, a team is just a group of individuals who are each pursuing their own peculiar goals. Efforts are atomized, because why bother learning in depth about that thing that your teammate does all of the work on? With oncall, everyone has a reason to cohere behind the shared goals of a team.


  1. I’m thinking about services in the broad sense, as in “Our company would be worse off without all of the people providing janitorial services to our HQ”. I’ll use the term interchangeably with “functions”. 

  2. In formal terminology, we consider the “net present value” of its ongoing maintenance. This is why fancy private jets sell very cheaply on the secondary market: keeping the plane airworthy might cost more in the first year than you actually paid for the thing. This mode of thinking also brings “technical debt” into focus: the “interest” on the debt is ongoing impairment of your asset’s value. 

  3. Keep in mind we’re talking about “services” in the broad sense. In this sense, a team might provide “load balancing and IPC” to an organization, and the actual code artifacts which enable the team to offer that service will change with business needs over time.