Queueing theory for the working software engineer

As a software engineer, I find that I can use queueing theory to enrich my understanding of two related problems: how to schedule work for humans and how to schedule work for computers. For today, let’s discuss the first problem: how queueing theory applies to scheduling work for human software engineers. We’ll cover many concepts that make some intuitive sense, and attempt to justify them using loose mathematical models.

Disclaimers

Disclaimer 1: Unfortunately I’m not a queueing theory expert. I’ve never taken an academic course on the topic, and most of my knowledge comes from reading the work of a single scholar in the field. Since I’m familiar with only one scholar’s work, there may be a schism within the academic discipline that I’m completely blind to. There’s also a discipline called operations research that is closely related to these topics, but that I can’t speak to directly.

Disclaimer 2: All of the models we’ll discuss here are quite simple when compared to the complexity that is human behavior. This post has no hope of explaining all—or even most—of what makes people effective in their work. At the end of the day, we’re left with a choice between an imperfect model and no model at all. Let’s build the imperfect one, but be humble about its applications, in the spirit of “all models are wrong, but some are useful”. If you want to learn more about the topics I can’t cover, I’m told that industrial/organizational psychology and organizational development are the disciplines for you.

An initial model

As we build towards an appropriate model of software engineering work, let’s begin with a highly simplified one1. Tasks arrive in a queue and get assigned to a single worker. The worker can work on multiple tasks simultaneously, but their progress on each task is inversely proportional to the number of tasks in progress. We’ll quickly learn that this model’s assumptions are so attenuated from reality that it isn’t useful to us, but it will serve as a good jumping-off point for further exploration.

A queueing theorist would note that we’ve already made numerous unstated assumptions. First, we’ve assumed the worker is multitasking perfectly evenly across their concurrent work, which the queueing theorist would call the processor sharing (PS) service discipline. This builds upon two precursor assumptions:

Our hypothetical queueing theorist now asks a litany of questions.

Q: How long does each task take?
A: Each task takes exactly 60 minutes

Q: What is the payoff for completing a task?
A: Every task completed is worth $1

Q: What is our goal?
A: We want to complete as many tasks as possible in a fixed period (8 hours, let’s say), after which all of our work-in-progress is thrown away.

Q: Does the worker ever become blocked on external input?
A: No

Q: How fast do new tasks arrive?
A: New tasks are arriving infinitely fast, and our choice for each arrival is either to start working on it or refuse it.

Q: <raises eyebrows>

Based on our assumptions so far, we can conclude that it is optimal to complete exactly 8 tasks over the course of our 8 hour evaluation period. So long as we work at a concurrency of 1, 2, 4, or 8, the results will be identical. Furthermore, the limit on our concurrency scales linearly with our evaluation period: if we were given 2000 hours to work, then we could work on up to 2000 tasks concurrently and still finish all of them. The only thing that makes us unproductive is reaching the end of the evaluation period with partially performed tasks: when this happens, we lose the work that we have performed thus far.
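
To make the arithmetic concrete, here is a minimal sketch of the frictionless model (the numbers are the ones from the Q&A above; nothing else is assumed):

    # Frictionless processor-sharing model: 60-minute tasks, an 8-hour day, and
    # progress divided evenly across everything in flight. Only total minutes of
    # attention matter, so any concurrency that evenly divides the day's work
    # finishes the same 8 tasks.
    DAY_MINUTES, TASK_MINUTES = 480, 60

    for concurrency in (1, 2, 4, 8):
        batch_minutes = TASK_MINUTES * concurrency   # one batch finishes together
        batches = DAY_MINUTES // batch_minutes       # full batches that fit
        print(concurrency, batches * concurrency)    # 8 completed tasks every time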

Practically speaking, there are no useful conclusions that a working software engineer can draw from this model. It might be useful to a factory worker on a fixed-duration piece-rate contract, but even then the assumption of perfect processor sharing is difficult to square with reality.

Context-switching

To make this model more realistic, let’s start by assuming that there’s some delay inherent in task switching. If every switch costs (for example) 1 minute, then the optimal throughput is achieved by never task-switching. This new assumption very strongly favors picking up one task, working on it to completion, and then picking up the next task.
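
Here is a hedged sketch of that claim. Two details are my own assumptions rather than part of the model above: the 1-minute penalty is charged only when the worker puts down a task that still has work remaining, and the multitasking schedule slices work in 10-minute quanta.

    def tasks_finished(schedule, day_minutes=480, switch_cost=1):
        """schedule is a list of (task_id, minutes) slices in execution order.
        The switch penalty is charged whenever the worker puts down a task that
        still has work left and picks up a different one; finishing a task and
        moving on is assumed to be free."""
        remaining = {}
        for task, minutes in schedule:
            remaining[task] = remaining.get(task, 0) + minutes

        clock, prev, finished = 0, None, 0
        for task, minutes in schedule:
            if prev is not None and task != prev and remaining[prev] > 0:
                clock += switch_cost                  # mid-task context switch
            work = min(minutes, max(0, day_minutes - clock))
            remaining[task] -= work
            clock += work
            if work > 0 and remaining[task] == 0:
                finished += 1
            prev = task
            if clock >= day_minutes:
                break
        return finished

    # Serial: finish each 60-minute task before starting the next one.
    serial = [(t, 60) for t in range(8)]
    # Round-robin: juggle all 8 tasks in 10-minute slices, paying for each switch.
    round_robin = [(t, 10) for _ in range(6) for t in range(8)]

    print(tasks_finished(serial))       # 8: never switches away mid-task
    print(tasks_finished(round_robin))  # 4: ~40 switches eat the day, and every
                                        # task comes due at the very end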

I personally find that context-switching has a nonzero cost in my own software engineering work; others report the same. I believe this is also the primary reason why most Agile practices institute “work-in-progress (WIP) limits”, which the queueing theorist would call concurrency limits. As far as I can tell, the formal queueing-theoretic problem of “should I context-switch in this situation” is unsolved.

Payoff distribution

Let’s say that I’m working on a compiler. One day I’m working within the codebase and notice that it’s scanning a list repeatedly to extract the lowest element. I realize that it would be significantly more efficient if it used a heap instead. This code is not in the hot path, but it’s some technical debt in my team’s codebase, and we take performance seriously around here. I start working on refactoring the data structure.

Meanwhile, a ticket comes in saying that there’s a potential correctness bug with the compiler. I’ve already started work on the performance issue, and I know from my model that I should limit in-progress work to avoid costs arising from context-switching. The model we’ve built so far clearly indicates that I should finish what I’m working on before starting something else.

The model is wrong. It is treating all tasks as equally valuable, but in this case a correctness bug that affects my users is profoundly more important than a performance issue that isn’t hurting anyone. We should find some way to update our model to account for differing payoffs between different tasks. This is difficult because it adds yet another semi-scientific discipline into our thought process: economics.

Modeling the economic payoffs of software engineering tasks is a blog post all to itself. At the very least, you need to consider:

  1. Time. A workflow improvement for your teammates can provide a continuous payoff starting right now and extending indefinitely. A demo to be presented at your next developer conference has no payoff until the conference, a large payoff during, and some lingering payoff after the conference ends.
  2. Degree of completion. A workflow improvement might be atomic: do you or don’t you need to continue performing some menial task? Meanwhile, a demo given at a conference can aim for anywhere from “we proved the concept” to “we wowed everyone.”
  3. Uncertainty. Nobody knows when and how projects will prove their value. Maybe your work to harden the service against DDoS will be tested tomorrow, or maybe it’ll never even be needed. Who knows?
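
None of this fits neatly into a queueing model, but even a crude expected-value sketch can separate the correctness bug from the heap refactor in the earlier example. Every number below is invented for illustration, and the sketch only captures the time and uncertainty axes; degree of completion would need a payoff curve rather than a single number.

    # Toy expected-value comparison. value_per_week, weeks_of_payoff, and
    # probability are invented numbers; the weekly discount factor stands in
    # for "payoffs that start sooner are worth more".
    def expected_value(value_per_week, weeks_of_payoff, probability,
                       weekly_discount=0.99):
        stream = sum(value_per_week * weekly_discount ** w
                     for w in range(weeks_of_payoff))
        return probability * stream

    correctness_bug = expected_value(value_per_week=500, weeks_of_payoff=52,
                                     probability=0.9)
    heap_refactor = expected_value(value_per_week=5, weeks_of_payoff=52,
                                   probability=0.3)
    print(round(correctness_bug), round(heap_refactor))  # roughly 18300 vs 61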

Throughput versus latency

So far we’ve been optimizing for a rather weird objective: a single fixed evaluation period over which identical tasks are done. This might match the work of a software engineer who plans to ship exactly one piece of software as a boxed and shrink-wrapped unit, and rarely if ever do maintenance, integration, or bugfixes, or add new features after shipping the golden master copy. This model does kind-of approximate some academic and personal-time projects that I’ve worked on, but most of my professional experience has been the opposite: work focused on delivering new value steadily over an indefinite period of time.

In general, I think that software engineering organizations should aim to maximize throughput: they should aim to get a train chugging along and producing value continuously. But the nature of organizations is that they comprise teams of people, and those people depend on each other. Those dependencies mean that a given worker’s productivity will be strongly influenced by the support available from their colleagues. For instance, the worker may need:

The dependencies mentioned above are largely the result of the specialization caused by organizations having differing teams and job functions. But even on small engineering teams with homogeneous roles, dependencies creep in. At a minimum, software engineers can expect to block on their CI system to provide testing on their changes, and their teammates for design feedback and code reviews.

'Are you stealing those LCDs?' 'Yeah, but I'm doing it while my code compiles.'
(xkcd #303, CC-BY-NC 2.5)

When an engineer gets blocked, they have two choices: they can either take a break or find another task to work on. In the case of taking a break, their productivity drops to zero (more or less: stay tuned for the exceptions later). In the case of picking up another task, their productivity is probably only slightly lower than normal: the engineer is now working on a lower-priority task and faces the additional cost of context-switching back when the prior task becomes unblocked. We can’t say for sure, though. If the engineer takes on too much concurrency, their productivity could even fall beneath the “taking a break” level of throughput; this would occur when juggling many tasks results in a mistake or an outage.

Despite all of the complexity, I think it’s a fair assumption that throughput drops when people become blocked. In that case, the faster people become unblocked, the better. This implies that even though the organization is attempting to optimize its throughput, its constituent members will often need to operate in a latency-critical mode. This is true down to the individual level, where it is usually valuable to make yourself available to answer questions from blocked teammates; at the team level, where it is valuable to turn around inter-team requests quickly; and even at the organization level, where launching a small trial feature is a prerequisite for gathering the customer data which will guide the organization’s strategy around major-feature launches.
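
A deliberately crude calculation (my own framing, not a standard queueing-theory result) shows why a colleague’s turnaround time matters so much: a change that needs a few review round-trips inflates from days to weeks of calendar time as reviewer latency grows.

    # Wall-clock time of one change = heads-down work plus review round-trips
    # spent blocked. All of the specific numbers are illustrative assumptions.
    def wall_clock_days(work_days, round_trips, reviewer_turnaround_days):
        return work_days + round_trips * reviewer_turnaround_days

    for turnaround in (0.1, 1, 3):   # same-hour, next-day, and multi-day reviews
        print(turnaround, wall_clock_days(work_days=3, round_trips=4,
                                          reviewer_turnaround_days=turnaround))
    # 3 days of actual work takes 3.4, 7, or 15 calendar days respectively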

Need for responsiveness

As we just established, some members of an organization will need to operate in a latency-critical mode so that they can support the throughput of their colleagues. It turns out this need isn’t evenly distributed; instead it concentrates in certain teams and functions.

To give one example, a Developer Productivity team responsible for maintaining a CI environment needs to be highly responsive. The spice must flow, and if the CI jobs aren’t completing in a timely manner, that’s an interrupt that must be serviced immediately (aka an incident). To solve this problem, a Developer Productivity team typically has two tricks up their sleeve. The first trick is automation. Most such teams aim to produce a fully-automated, autoscaling, fault-isolating, and self-healing CI environment; the net result is that when exceptional conditions arise, the interrupt is serviced by a computer system rather than a human system. The second trick is standardization. Instead of solving N different problems, one for each team within the organization, a high-functioning team aims to solve the same problem N times. The high degree of homogeneity afforded by standardization means that their work to automate the processes involved (e.g. setting up new CI jobs, managing fleets of CI workers) pays off with a great deal of synergy and leverage.

On the other end of the stack, we can imagine a hypothetical team which builds customer-facing features at the “front end” of an engineering organization. Such teams definitely still handle inbound interrupts, but two differences are worth considering. First, there aren’t twenty teams clamoring for support at once; instead, customer-facing teams usually handle a queue of features through a single PM, who has already identified their top areas of focus. Second, the time constants involved are more conducive to an orderly planning process: a product team might be aiming to build a feature over the course of two weeks. Since the task is two weeks long, it makes sense for that task to queue among others for allocation into a two-week sprint. When the product team sends a PR to the infrastructure team for them to review and apply, that’s a 15-minute task. It makes no sense for that PR to queue for entry into the infrastructure team’s next two-week sprint. The product team can use a two-week sprint as a planning tool because their tasks are all around two weeks long. But the infrastructure team needs (at least some of) their employees to plan in “sprints” of ~30 minutes in length.

Paradoxically, this advice means that—for the good of the organization as a whole—people lower in the stack must context-switch in ways that actively reduce their throughput as individuals. In this case, the cost of context-switching is felt by teams lower in the stack who have their schedules randomized by interrupts from their colleagues. But if the cost of the interrupt is offset by an equal or larger benefit accruing to the interrupting team, then the overall organization has benefited.

Much of this analysis also applies to the different engineering job families. People in managerial, oncall, and tech lead roles usually aim to act as a pincushion for interrupts, thereby guarding the productivity of the rest of the team. This approach frees up the rest of the team to focus on longer-term issues, even when a maelstrom of inbound work surrounds the team as a whole.

Ramification of uncertainty

One complaint that I (and others) have about waterfall-style development is that it pays no heed to the cost of certainty. If someone asks me “how long will it take to build a distributed key-value store from scratch”, I can certainly try to sit down for a few hours and build a Gantt chart of all of the dependencies. But the more honest and productive approach is “I haven’t a clue. Why don’t we spend 2 weeks building the barest of walking skeletons, paying careful attention to each subcomponent of the system. Then we’ll repeat the exercise recursively for each subcomponent we identify, and maybe by the end of this quarter we’ll have a rough sketch of the expected cost.” I think this is the approach that best embodies the idea that the essence of programming is building a theory of the problem-space, and that the act of building up the necessary mental model underlying the estimate is both 1) highly costly and 2) immensely valuable.

The cost of certainty reaches its apotheosis when an engineer needs to debug an unfamiliar problem. A bug in a software system constitutes a defect in that system, but more importantly it represents a missing chunk in the programmer’s theory of the problem. The programmer is operating with a hazy understanding of the problem, and nobody knows how much investment of time or thought will be required to clear the haze. Eventually the software engineer figures out the problem and implements a solution. Oftentimes that solution is essentially costless relative to the debugging time, with the most famous example being the Steinmetz invoice (hammer blow, $1; figuring out where to place hammer, $9999).

Even in more typical situations where new software is being developed, I think a useful approach is to consider the product development process as a ping-ponging of uncertainty between teams in an organization, hopefully resolving into improvements/additions in each team’s mental model of the problem space. The problem starts high in the “stack”, perhaps with a PM saying “let’s automatically display contextual results from the search engine on this page”. This moves down the stack into a Search team that asks “do we have an index for this already? Will reusing an existing index for this purpose cause trouble? Oh, we haven’t even ingested the datasource we’ll need. Let me go talk to the data infrastructure team… It turns out that there’s actually a compliance obligation around this data, so I’ll need to figure out how to scrub the data before it can be materialized into a search index. And once I have the index, I need to figure out where I’m going to find the quota needed to ultimately serve the index at runtime.”

The net effect of this process is that a single top-level uncertain problem ramifies into a tree of uncertain subproblems, often arriving serially and unpredictably to downstream teams. Even if the top-level problems arrive according to a perfectly regular schedule (e.g. 1 problem arrives, like clockwork, at the beginning of each sprint), the subproblems will flow downstream more unpredictably, like popcorn popping in a microwave (formally: approximating a Poisson process). The lower the questions reach in the stack, the more important it becomes for interrupts to be serviced quickly: a single two-week sprint for the product team may involve numerous round-trips to infrastructure teams, so if the infrastructure teams themselves have a two-week turnaround, it’s essentially guaranteed that nothing can get done.
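
A small simulation makes the popcorn effect visible. The fan-out and delay distributions below are my own assumptions; the only point is that perfectly regular top-level arrivals still produce bursty, irregular arrivals downstream.

    import random
    random.seed(0)

    SPRINT_DAYS = 10

    # One top-level problem arrives like clockwork at the start of each sprint...
    top_level_arrivals = [sprint * SPRINT_DAYS for sprint in range(6)]

    # ...but each one ramifies into a random number of subproblems that reach the
    # downstream team after random delays (both distributions are assumptions).
    downstream_arrivals = []
    for t in top_level_arrivals:
        for _ in range(random.randint(2, 8)):
            downstream_arrivals.append(t + random.uniform(0, 2 * SPRINT_DAYS))
    downstream_arrivals.sort()

    # Subproblems landing on the downstream team, counted per day: irregular
    # bursts, even though the top-level schedule was metronomic.
    per_day = [0] * (8 * SPRINT_DAYS)
    for t in downstream_arrivals:
        per_day[int(t)] += 1
    print(per_day)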

The opposite effect is also true: if a team is building similar services or features over and over again, using familiar tools, and reusing the same dependencies that they’ve always used, we can expect the ramification of uncertainty to be minimal, akin to a factory assembly line. In that case it might genuinely be possible to predict what each individual contributor will work on for each development-day, since the value-production pipeline will be so rote and well-understood that there is no uncertainty left in the system. But it would be difficult for me to conceive of this as a win condition for a software engineering team. Having a highly repeatable and minimally uncertain process is a violation of the “don’t repeat yourself” rule, and means that the team should be operating at a higher level of abstraction.

Single queues

    Person A: That line looks so long. Do we have to wait in it?
    Person B: Don’t worry. It moves quickly

Let’s take a pause from software engineering for a moment, and instead consider the grocery store checkout line. This is a classic example in queueing theory, and it asks “which is better: a grocery store with a single checkout line funneling into many cashiers, or individual queues for each cashier?” The popular and intuitive answer is the single checkout line, because it eliminates the possibility of a customer picking the “wrong” line. This means that the single checkout line achieves a notion of fairness between customers, because each customer sees an identical latency distribution, and any “job” ahead of a customer in line affects the entire line equally. For instance, the latency added by a customer who pays by check is spread across everyone in small amounts, rather than falling on a few unlucky customers in large amounts.

The next problem to consider is throughput. Assuming cashiers are equally efficient, any work-conserving system will maximize the overall throughput of the system. To be work-conserving, the system must: 1) never leave a cashier idle when there is work to do, and 2) not allow half-performed work to go to waste. (We’re also assuming that each customer is an atomic unit of work, so it isn’t possible to split a customer between cashiers). Each of the systems fails to be work-conserving in different ways:

This class of problem arises frequently in the context of load-balancing requests to servers. A central queue is unpopular in that case because, for many services, the dispatch time to reach a worker (e.g. hundreds of microseconds) can be significant relative to the overall time of the request (e.g. a few milliseconds). Systems that operate on this timescale typically route requests to a queue early on, and accept the possibility of routing the request poorly. When jobs operate over longer timescales (e.g. seconds or minutes), systems that queue jobs centrally for workers to pull from regain favor.
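
If you want to see the shared queue’s advantage in numbers, here is a rough simulation sketch. It assumes Poisson arrivals, exponentially distributed service times, four cashiers, and 90% load; all of those parameters are illustrative choices of mine, not claims about real grocery stores or real servers.

    import random
    random.seed(1)

    def checkout_simulation(num_servers=4, num_jobs=200_000, load=0.9):
        """Compare mean time-in-system for (a) one shared FCFS line feeding every
        cashier against (b) dispatching each arrival to a random cashier's private
        line. Poisson arrivals, exponential service times with mean 1."""
        arrival_rate = load * num_servers
        t, arrivals = 0.0, []
        for _ in range(num_jobs):
            t += random.expovariate(arrival_rate)
            arrivals.append(t)
        services = [random.expovariate(1.0) for _ in range(num_jobs)]

        # (a) single shared line: the next customer goes to whoever frees up first
        free = [0.0] * num_servers
        shared = 0.0
        for a, s in zip(arrivals, services):
            i = min(range(num_servers), key=lambda k: free[k])
            start = max(a, free[i])
            free[i] = start + s
            shared += free[i] - a

        # (b) a private line per cashier, chosen uniformly at random on arrival
        free = [0.0] * num_servers
        private = 0.0
        for a, s in zip(arrivals, services):
            i = random.randrange(num_servers)
            start = max(a, free[i])
            free[i] = start + s
            private += free[i] - a

        return shared / num_jobs, private / num_jobs

    # At these settings the shared line averages roughly 3 time units in the
    # system versus roughly 10 for the private lines.
    print(checkout_simulation())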

Slowdown

I mentioned above that a single-queue grocery checkout line is fair, which is true in only one sense: upon arrival, each job in the system sees the same latency distribution. But it turns out that there are many facets of fairness beyond simply equalizing latency. Is it fair for a person checking out with a single candy bar to wait in line behind a person who is stocking up on provisions for an expedition of 140 person-days? If your boss first asks you to research and write a 20-page report and then asks you for a customer’s contact information, is it “fair” for the contact-info request to queue behind the research project? Certainly not: any reasonable employee is going to task-switch away from the report, spend 60 seconds finding the contact info, and then switch back.

This points to a notion that most people understand intuitively, which is that jobs should only compete against jobs of a comparable size. For the purposes of measuring latency, we can define a new metric called slowdown, which is calculated as a job’s total time in the system (queueing time plus service time) divided by the time spent on the job itself (the service time).
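
As a quick worked example of the metric, using made-up durations: a one-minute request stuck behind a day’s worth of queue has a huge slowdown, while the same queueing delay barely registers against a multi-week report.

    # Slowdown = total time in the system divided by the service time.
    def slowdown(queueing_time, service_time):
        return (queueing_time + service_time) / service_time

    # The contact-info request: 1 minute of work stuck behind a day of queue.
    print(slowdown(queueing_time=480, service_time=1))     # 481x
    # The multi-week report: the same day of queueing barely registers.
    print(slowdown(queueing_time=480, service_time=4800))  # 1.1x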

In formal queueing theory terms, bucketing jobs by size is called size interval task assignment (SITA). Many software engineering teams practice a crude form of SITA by designating a single member of the team as oncall to take inbound interrupts. The hope is that the oncall can maintain a queue of small tasks and respond to them quickly, while other members of the team plug away at larger tasks.

Informally-implemented SITA is also common in server-side software. Plenty of web services shunt “reporting workloads” onto separate hardware, even when the reporting and online workloads comprise identical business logic, and emit identical queries to the data layer.

Dedicated queues

At a company I once worked for, the manager of the team responsible for our search stack wanted to create a Search-specific devops team. I couldn’t understand why. I was on the companywide devops team at the time, and I naïvely thought “why can’t they use the companywide team like everyone else?” To me, it seemed clear that this was the best way to prioritize the limited available resources (i.e. devops engineers) against the needs of the entire company. Naïvely, I was right. Creating two separate queues for the same kind of work would cause non-optimal outcomes in both directions: either a) the Search devops team would work on less-pressing needs than the companywide team, or b) the Search devops team would become a bottleneck on their customers, and they wouldn’t be able to surge resources from the companywide team. The first situation would over-prioritize Search and the second situation would under-prioritize them.

A dedicated queue turned out to be the right solution for that need, and I think it comes down to two reasons: 1) predictability, and 2) specialization, a.k.a. human capital. For the time being let’s focus on predictability, and come back to human capital later.

The first-order effect of having a dedicated team to serve the Search function is that the Search manager could reorder the whole backlog for his team, without ever having to justify his actions to any of his peers. On its own this could (maybe) make up for the decreased throughput caused by not being able to optimally allocate IC resources: if it meant that managers didn’t have to spend time arguing with each other, that planning/arguing time could be redeployed for better purposes.

The second-order effect is that the Search manager could make promises around project timelines to his upstream customers with significantly greater confidence, since his ability to reorder the backlog would make his team more capable of immediately responding to unexpected needs.

Human capital formation

When John Dobson taught people to make telescope mirrors, his advice for making an 18 inch mirror was “Take a 12 inch glass blank. Grind it into a mirror. Reflect on all of your mistakes. Start over with an 18 inch blank.” The 12 inch mirror blank appears nowhere in the final bill of materials, but mistakes with a 12 inch blank are so much less costly that any mistake made on the first mirror is an instant savings because it did not affect the final 18 inch mirror. An economist2 would say that by grinding the 12 inch mirror and learning the associated lessons, the organization has created a new capital asset in the form of improved knowledge and business processes, just as if a metalworking company had made a new tool or die. The company then deploys that new capital to make the finished product: an 18 inch mirror.

In addition to the brute effort of making new tools or conducting focused inquiries, I think it’s useful to consider the ways that human capital accumulates in software engineering teams. As mentioned earlier, taking breaks can be reflective and recuperative, with the effect of accumulating capital for the organization. Alternatively phrased, we could say that time off can restore depleted capital.

In addition to time off, a periodic change of pace—either upward or downward—can work wonders. On one occasion my team achieved most of a 6 month roadmap over the course of 2 weeks. We realized upon reflection that this had only worked because we had spent so much time building up our understanding of the problem space in advance; when it came time to execute, we converted all of that accumulated understanding into quick action. For our team in that particular situation, the right next move was a downward change of pace: a team offsite so that folks could start building a new roadmap based on all of the progress they had made.

Finally, on infrequent occasions blue-sky thinking produces a wholly new idea. While rare, these ideas are so valuable that they often constitute a windfall gain. Cross-functional chitchat and cross-pollination operate similarly.

Specialization of knowledge

Much to the chagrin of some executives, it turns out that software engineers are not anonymous replaceable cogs in a machine. A typical software engineer is knowledgeable in many systems, which we can—crudely—divide into external and internal. External systems are the ones where you can hire knowledge from outside your company: when you hire a full-stack “generalist” engineer, what you’re actually hiring is a triple-specialist in React, Python, and Postgres. Internal systems are the ones that are peculiar to your organization; since you can’t hire these skills, each person has to build them on the job. Both external and internal knowledge are forms of human capital; internal knowledge is generally the scarier type because it cannot be procured from the open market.

In general, any given “function” within a software engineering organization will need to be staffed by a group of people who are up to speed on the underlying software to some minimum extent. This entails the capability to reason about the software sufficiently well to debug issues, develop changes, and answer questions from other parts of the organization. Getting people up to speed and keeping them there is an O(n²) issue, since there are usually n people on a team making changes to software, and the same n people need to receive an inbound feed of changes made by their teammates. This is why oncall rotations larger than 10 or so don’t really exist, and anyone who claims to be in such a rotation is either 1) working on a product that isn’t receiving new changes from all 10 members, 2) serving as a first-tier dispatch table to the second-tier actual experts in any given area, or 3) completely ineffective at oncall.

The natural resolution to the context-sharing problem is to hyper-specialize each “function” down to a single specialist who can answer each issue, thereby eliminating any overhead from context sharing. Of course this produces instant tension with numerous other concerns; this tension is colloquially known as “bus factor”. Bus factor decomposes into 1) unavailability (temporary or permanent) of the solitary expert causes an instant outage of their function, and 2) a burst of inbound work can easily saturate the expert, producing an effect similar to unavailability. As an asymptotic rule of thumb, preventing congestion when staffing a queue requires enough people to do the work (n), plus an overhead factor proportional to √n; this principle is called square-root staffing.
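
To make the rule concrete, here is a small sketch. The √n form of the overhead is the queueing-theory result; the quality-of-service knob (beta) and the specific loads are my own illustrative choices.

    import math

    def square_root_staffing(offered_load, beta=1.0):
        """offered_load is the average amount of concurrent work, measured in
        'people worth' of demand; beta trades cost against responsiveness
        (the value 1.0 is an arbitrary, middling quality-of-service target)."""
        return math.ceil(offered_load + beta * math.sqrt(offered_load))

    for load in (1, 4, 16, 64):
        print(load, square_root_staffing(load))
    # 1 -> 2, 4 -> 6, 16 -> 20, 64 -> 72: the burst-absorbing overhead grows in
    # absolute terms but shrinks as a fraction of the team (100%, 50%, 25%, 12.5%)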

So a person allocating resources within an organization is beset by two concerns pulling in opposite directions. The coordination overhead per team member is O(n); this implies that teams should be as small as possible to minimize coordination costs. The square-root staffing overhead per team member is O(1/√n); this implies that teams should be as large as possible to efficiently handle bursts of work.

Context switching

I’ll close by mentioning a queueing-theoretic result that I have found instructive. The paper is called “Task Assignment with Unknown Duration”, and it considers jobs that are of unknown size, possibly drawn from a distribution that spans multiple orders of magnitude. The jobs are abortable but not preemptible: the job runner doesn’t have enough memory available to checkpoint the job state, so if it wants to context-switch, it has to throw away all of its progress so far. (Sidenote: This is actually a pretty good representation of most web requests and web-service batch workloads). The question is: given a server farm, how should we schedule the jobs?

The proposed algorithm is called task assignment by guessing size (TAGS), and it is a variant of size interval task assignment (SITA). Recall that SITA assigns jobs to discrete server pools by size (e.g. CPU time), so that jobs of one size compete only against jobs of a similar size. Given that we don’t know the size of the inbound jobs in this paper’s setting, TAGS proposes that we simply assign the job to the server pool that handles the smallest jobs. Once we reach the high end of that pool’s range, the job times out and is aborted. Then we start the job over from scratch on the next server pool, which has a larger timeout. The process continues until the job finds a server pool where it can run to completion. By choosing pool timeouts from a geometric progression, the CPU time spent on aborted attempts along the way is bounded by a constant multiple of the job’s own CPU requirement.
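
To see why geometric timeouts keep the waste bounded, here is a small sketch; the doubling factor and the specific timeout values are my own illustrative choices, not numbers from the paper.

    # Geometric timeouts: each pool gives up after twice as long as the previous.
    TIMEOUTS = [1, 2, 4, 8, 16, 32, 64]   # CPU-seconds; illustrative values

    def tags_waste(job_size):
        """CPU-seconds burned on aborted attempts before the job lands in a pool
        whose timeout is large enough to let it finish."""
        wasted = 0.0
        for timeout in TIMEOUTS:
            if job_size <= timeout:
                return wasted
            wasted += timeout            # ran for the full timeout, then aborted
        raise ValueError("job larger than the largest pool's timeout")

    for size in (0.5, 3, 10, 50):
        print(size, tags_waste(size), tags_waste(size) / size)
    # The waste never exceeds twice the job's own requirement, because the sum of
    # the smaller timeouts is always less than twice the size of any job that
    # outgrew them.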

The paper found that “given any initial system load, TAGS requires far fewer additional hosts to perform well”; the paper calls this the “server expansion metric”. I find this to be a remarkable result, especially when looked at from an informational perspective: when TAGS runs and aborts a job, the server farm has spent valuable CPU time and gained only the information that “the job requires longer than X CPU-seconds to complete.” It turns out that this one piece of information is sufficient to make the server farm run more efficiently than a work-conserving server farm.

Mapping back to a software engineering team, this means that—given an overloaded team—TAGS outperforms other algorithms tested at regaining reasonable slowdown (i.e. total latency divided by job size) while requiring the least additional headcount.

I find that TAGS provides a rather natural way to look at uncertain or poorly scoped projects. When seeing a problem, I start by looking for a five-minute solution. After five minutes I stop and write down what I’ve found in a ticket (the paper assumed that all state gets thrown away, but I do still try to preserve as much as I can). The next time that the ticket comes up, I spend an hour trying to work it out. After that, I try a workday. If that fails, the next step is 1-2 weeks.

Conclusion

We have now discussed numerous ideas that would be considered heretical by certain doctrinaire practitioners of software engineering “methodologies”. We considered that an engineer’s responsiveness to interrupts is going to depend directly on their position and role within the value-creation pipeline. We justified context-switching as a natural outgrowth of uncertainty. We discussed the factors motivating larger teams of homogeneous contributors, and the countervailing forces that demand small islands of heterogeneous experts. We closed with a paradoxical result where throwing away partially completed work makes an organization more responsive with less headcount.


  1. I believe the simplified model is identical to the one proposed in this post 

  2. An academic economist; I don’t know what a GAAP accountant would say