Distributed hash ring maintenance
In Cassandra Availability with Virtual Nodes, we discussed the availability tradeoffs inherent in the use of virtual nodes. In summary: virtual nodes increase the number of dependency relationships between neighbor replicas in a distributed hash ring, and this increased interdependency increases the modeled frequency of partial ring outages.
In this article we’ll discuss something slightly different: how does the query load on a given replica change during an outage of its neighbor nodes? Whether we’re talking about natural outages or planned maintenance, we’d like the load to stay as low as possible. This is doubly true if maintenance is a critical part of your scaling story, because you’re taking nodes out of service in order to scale them up.
Naively we might imagine that (for RF=3, with 3 “racks”), maintenance will result in 150% of normal load on the in-service nodes, like so:
B3 is out of service, so A3 must perform 150% of its normal workload and C3 must as well.
But the situation could be significantly improved, provided we can evenly distribute tokens across the ring:
In this “shingled” pattern, C3 and A3 are taking the brunt of the load, at 133.33% (they are doing 50% excess load, but only over two-thirds of their respective token ranges). C2 and A4 are doing only 116.66% of their normal work (50% excess load over one-third of their token ranges).
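If you want to check this arithmetic, here is a rough Python sketch of the same model (nothing Cassandra-specific: one token per node laid out round-robin by rack, each range replicated by the next node of each rack, queries for a range split evenly across its live replicas, and a down replica's share picked up by the survivors). The 6-hosts-per-rack cluster size is just for illustration.

```python
from collections import defaultdict

RACKS = "ABC"
NODES_PER_RACK = 6            # illustrative cluster size, 18 nodes total
RING = 3 * NODES_PER_RACK     # one token per node, laid out A0 B0 C0 A1 B1 C1 ...


def replicas_naive(rng):
    """Aligned tokens: A_k, B_k and C_k all replicate exactly the same range."""
    return [(rack, rng) for rack in RACKS]


def replicas_shingled(rng):
    """Evenly spread tokens: range rng is replicated by the next node of each
    rack, i.e. the nodes at ring positions rng, rng+1, rng+2."""
    return [(RACKS[p % 3], p // 3) for p in
            ((rng + i) % RING for i in range(3))]


def loads(replicas_of, n_ranges, down=None):
    """Each live replica serves an equal share of each range it replicates;
    a down replica's share is picked up by the survivors."""
    load = defaultdict(float)
    for rng in range(n_ranges):
        live = [r for r in replicas_of(rng) if r != down]
        for r in live:
            load[r] += 1 / len(live)
    return load


def report(label, replicas_of, n_ranges, down):
    normal = loads(replicas_of, n_ranges)
    during = loads(replicas_of, n_ranges, down)
    print(label)
    for node in sorted(during):
        ratio = during[node] / normal[node]
        if abs(ratio - 1) > 1e-9:                    # only show affected nodes
            print(f"  {node[0]}{node[1]}: {ratio:.2%}")


down = ("B", 3)                                      # B3 out for maintenance
report("naive (aligned tokens):", replicas_naive, NODES_PER_RACK, down)
report("shingled (evenly spread tokens):", replicas_shingled, RING, down)
```

Running it prints 150% for A3 and C3 in the naive layout, and 133.33% / 116.67% for the four affected neighbors in the shingled layout.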
Of course, if there is a hot key somewhere on the ring, the numbers will be more skewed. In the degenerate case of a single hot key on the range, we’re back to 150% of normal load on the in-service nodes.
As a second observation, we can see that none of the other nodes in the cluster will even realize that maintenance is occurring[1]. The other nodes in rack B see no change, nor do C1, C4, C5 or A1, A2, A5. This implies that maintenance is a parallelizable activity, since each out-of-service node only affects the load (and availability) of its direct neighbors in other racks.
A maintenance strategy based on alternating nodes in a single rack would look like this:
The odd-numbered B nodes are out of service. The odd-numbered A and C nodes are taking 133.33% of their normal load. The even-numbered A and C nodes are taking 116.66%.
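The same toy model reproduces this wave as well; here is a standalone variant (same illustrative assumptions as the sketch above) with all odd-numbered B nodes out of service at once:

```python
# Same toy model as the earlier sketch, but with every odd-numbered B node out
# at once. The output shows odd-numbered A and C nodes at ~133%, even-numbered
# ones at ~117%, and the remaining B nodes unchanged at 100%.
from collections import defaultdict

RACKS = "ABC"
N = 6                    # illustrative hosts per rack
RING = 3 * N             # tokens laid out round-robin: A0 B0 C0 A1 B1 C1 ...

def replicas(rng):
    """Range rng is replicated by the nodes at ring positions rng, rng+1, rng+2."""
    return [(RACKS[p % 3], p // 3) for p in
            ((rng + i) % RING for i in range(3))]

down = {("B", i) for i in range(N) if i % 2 == 1}    # odd-numbered B nodes

normal, during = defaultdict(float), defaultdict(float)
for rng in range(RING):
    reps = replicas(rng)
    live = [r for r in reps if r not in down]
    for r in reps:
        normal[r] += 1 / len(reps)       # every replica's share when all are up
    for r in live:
        during[r] += 1 / len(live)       # survivors split the range evenly

for node in sorted(during):
    print(f"{node[0]}{node[1]}: {during[node] / normal[node]:.2%}")
```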
What about just not doing maintenance?
One approach to maintenance is to bring a spare host online with the same dataset and have it take the existing host’s place in the cluster. Once maintenance is complete on the original host, it is free to download the dataset corresponding to a different position on the ring and then replace that node. In this way you could use a single extra node to completely hide the time required for the actual maintenance itself, from an availability/load perspective. The time to upgrade the cluster as a whole would be dictated by (single node maintenance duration) times (number of nodes) divided by (number of spares).
But there’ll still be a critical section where the spare synchronizes the most recent writes from the old node[2], and then finishes joining the cluster. The time-at-risk is (synchronization time) times (number of nodes). All of the same load calculations from above apply, but over a shorter time window.
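As a back-of-the-envelope helper, those two formulas look like this in Python; the inputs (48 nodes, a single spare, 15 minutes of maintenance, 2 minutes to synchronize recent writes) are made-up example numbers, not measurements:

```python
# Back-of-the-envelope timing for the spare-host rotation. The inputs are
# made-up example numbers, not measurements.

def spare_rotation(nodes: int, spares: int,
                   maintenance_min: float, sync_min: float):
    wall_clock_min = maintenance_min * nodes / spares  # total rollout duration
    time_at_risk_min = sync_min * nodes                # total critical-section time
    return wall_clock_min, time_at_risk_min

wall, risk = spare_rotation(nodes=48, spares=1, maintenance_min=15, sync_min=2)
print(f"rollout: {wall / 60:.0f} h, time spent one replica short: {risk:.0f} min")
```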
Doesn’t parallelizing maintenance increase risk?
Mathematically, no. Practically, it depends on your software.
Mathematically: taking a node down imposes increased load on neighboring nodes, and risks an outage if a neighboring node goes down unexpectedly during the maintenance. If maintenance takes 15 minutes and we have 48 nodes, then we have 12 node-hours of maintenance risk. It turns out that there’s no difference between doing 12 hours of maintenance serially and doing 8 nodes at a time, taking the same risk over 90 minutes.
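The same arithmetic in code, using the example numbers above (48 nodes, 15 minutes each):

```python
# Risk-budget arithmetic: node-hours at risk are identical regardless of how
# the maintenance is parallelized; only the wall-clock time changes.
import math

def maintenance_risk(nodes: int, minutes_per_node: float, parallelism: int):
    waves = math.ceil(nodes / parallelism)
    wall_clock_h = waves * minutes_per_node / 60
    node_hours_at_risk = nodes * minutes_per_node / 60
    return wall_clock_h, node_hours_at_risk

for parallelism in (1, 8):
    wall, risk = maintenance_risk(nodes=48, minutes_per_node=15,
                                  parallelism=parallelism)
    print(f"{parallelism} node(s) at a time: {wall:.1f} h wall clock, "
          f"{risk:.0f} node-hours at risk")
```

Both cases print 12 node-hours at risk; only the wall-clock time (12 hours vs 1.5 hours) changes.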
Practically: more parts moving in unison means that the machine for orchestrating these maintenance windows needs to be well-oiled. If the maintenance state machine fails in a way that leaves nodes out of service, then you might inadvertently blow through your risk budget just because the orchestrator got stuck and left your cluster in a degraded state.
Meanwhile—from a capacity planning perspective—there is a significant difference between a service that can scale up in 90 minutes and a service that takes 12 hours. For any database that might experience fluctuations in load, I think the greater risk is being unable to respond quickly to those fluctuations.
Should I worry about metrics?
Let’s say I have a 48 node cluster and one of the nodes is serving queries really slowly. Will I see it on my metrics? If I’m tracking 99th percentiles, yes, because the problem node will sit at the 98th percentile. If my cluster ever grows larger than 100 nodes, then it will sit above the 99th percentile and become invisible.
Performing maintenance in parallel will make your percentile-based metrics look worse, because it will affect a greater number of nodes at a time. But[3] it will affect the same number of queries no matter how it is sequenced. If you’re working on ⅙ of the cluster at a time, you can expect the “bad” metrics from hosts undergoing maintenance to reach down into the 90th and 85th percentiles. This is fine: each individual token range is going to experience the same maintenance-caused latency for the same window of time, no matter what you do[4].
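A rough way to see where affected traffic lands in the latency distribution, assuming queries are spread evenly across nodes so that affected nodes account for a proportional slice at the top of the distribution:

```python
# A rough model of percentile visibility: if queries are spread evenly across
# nodes, the affected nodes account for a proportional slice at the top of the
# query-latency distribution.

def slow_traffic_percentile(total_nodes: int, affected_nodes: int) -> float:
    """Percentile of query latency above which the affected nodes' latencies sit."""
    return 100 * (1 - affected_nodes / total_nodes)

for total, affected, note in [
    (48, 1, "one bad node out of 48"),                 # visible to a p99 metric
    (120, 1, "one bad node out of 120"),               # hidden from a p99 metric
    (48, 8, "1/6 of a 48-node cluster in maintenance"),
]:
    print(f"{note}: slow queries sit above ~p{slow_traffic_percentile(total, affected):.1f}")
```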
There is one area worth worrying about: the application layer. Let’s say that an application is thread-constrained talking to its DB: it has 1000 threads available to handle requests and is using 900 of them. If the average query time jumps by 2x for ⅙ of the cluster, then it is now out of threads (1050 threads needed vs 1000 available). But the solution there, IMHO, is to allocate more threads to the application’s thread pool, or perform any other form of c10k optimization.
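The thread arithmetic, as a quick Little's-law approximation (occupied threads scale with average latency), using the same made-up numbers:

```python
# Little's-law approximation: occupied threads scale with average latency, so a
# 2x slowdown on 1/6 of queries multiplies thread demand by 7/6.
threads_in_use = 900       # threads busy before the maintenance wave
pool_size = 1000           # threads available
affected_fraction = 1 / 6  # share of the cluster (and thus of queries) affected
slowdown = 2.0             # latency multiplier for affected queries

needed = threads_in_use * (1 + affected_fraction * (slowdown - 1))
print(f"threads needed: {needed:.0f} of {pool_size}")  # 1050 of 1000 -> pool exhausted
```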
How does this work for virtual nodes?
Differently. I’m not sure it’s possible to allocate vnodes perfectly across the cluster in a shingled manner, but it is possible to do so in a non-shingled manner, such that each vnode covers exactly the same territory as one vnode on a neighboring host in each of the other racks.
In the non-shingled case, perfect allocation is possible by considering the adjacency of hosts as an N-dimensional matrix, where N is 3 in the typical case of a 3-rack, RF=3 cluster. In the example below I have laid out a cluster with 48 physical hosts (3 racks of 16 hosts each). Each host has 4 virtual nodes. If you hover over a given host, you’ll find that it has exactly 4 neighboring hosts in each rack. This ensures that, when a given node goes down, the neighbors shoulder only 125% of their normal load.
This math does scale up, although visualizing it in JavaScript isn’t quite as easy. For a cluster with 8 vnodes per host and at least 8 hosts per rack, we could limit the load on any one host to 112.5% of normal. At this point I have two conjectures:
- If your goal is to achieve perfect load balancing during maintenance, it is not possible to shingle your vnodes the way you can shingle regular nodes. This is a graph-coloring problem, and it is overconstrained, at least in the case of RF=3 or higher.
- Only one rack will be able to achieve perfect parallelism of maintenance[5], taking nodes out of service in #vnodes waves. The other racks all require at least one more wave than #vnodes.
Should I use vnodes?
We have found that vnodes can be used to limit the load on in-service replicas during maintenance of their peers, either to 125% in the case of 4 vnodes or 112.5% in the case of 8 vnodes. Alternatively, you could use random allocation with 16 vnodes and hope that they are evenly distributed enough for your needs.
This benefit comes at the cost of increased complexity and decreased parallelism during maintenance. A cluster that could previously perform maintenance in six parallel waves is now believed to need at least 14 (with 4 vnodes: 4 waves for one rack, plus at least 5 for each of the other two).
I personally would prefer to over-provision my nodes by a factor of 133% to handle the maintenance case, and keep the agility of six-wave maintenance as an option.
[1] This assumes that coordination and other cluster-shared tasks are externalized. If coordination is internal, there will be an increase in load proportional to the number of nodes out of service.

[2] Some of my more cavalier readers will point out that it’s possible to just drop the most recent writes and have the anti-entropy repair process fix any discrepancies. I personally object to this on consistency and durability grounds, since it converts a write that has been accepted by 2 of 3 replicas (a quorum) into a write held by only 1 of 3 replicas.

[3] This assumes that the cluster is under constant query load. If it is not, you can affect fewer queries by fitting maintenance windows into low-load periods (e.g. overnight).

[4] For a given length of maintenance. There are techniques you can use to shorten the maintenance window, like cache-warming.

[5] In the visualization above, that is the middle rack. It can parallelize its maintenance in waves of: (0, 5, 10, 15), (1, 4, 11, 14), (2, 7, 8, 13), (3, 6, 9, 12). You can verify this yourself by clicking on the visualization to simulate nodes out of service.