iostat's %util is wrong

The %util column in iostat has always been something of a weird bird. It was created back in the day when hard disks could only process a single request at a time. Back then, if %util was 100%, the disk was completely saturated and could go no faster. Modern block devices can perform multiple operations concurrently, and so %util’s usefulness has ebbed. The manpage gives a clear warning about the meaning (or lack thereof) of %util, describing it as:

Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

What the manpage won’t tell you is that %util is completely wrong when running on Linux 5.0+. The value can systematically undercount, overcount, and fluctuate. To understand why, we need to look at how %util is calculated. Under the hood, the Linux kernel keeps an ever-ascending counter called io_ticks, whose purpose is to quantify the amount of time a block device had any work to do. To compute %util, all iostat has to do is read the value of io_ticks from the kernel, wait one second (or so), and read the value again. It computes the delta between old and new values, and then divides the delta by the elapsed time to arrive at %util.
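
To make that concrete, here’s a hand-rolled sketch of the same computation (not iostat’s actual source). It reads io_ticks, the tenth field of /sys/block/<dev>/stat, which counts the milliseconds the device was marked busy; iostat pulls the same counter out of /proc/diskstats. The device name sda is a placeholder.

    /* util_sketch.c -- the %util computation by hand; a sketch,
     * not iostat's source. io_ticks is the tenth field of
     * /sys/block/<dev>/stat, in milliseconds. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static unsigned long long read_io_ticks(const char *path)
    {
        unsigned long long v[10] = {0};
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); exit(1); }
        for (int i = 0; i < 10; i++)
            if (fscanf(f, "%llu", &v[i]) != 1) exit(1);
        fclose(f);
        return v[9]; /* field 10: io_ticks */
    }

    int main(void)
    {
        const char *path = "/sys/block/sda/stat"; /* placeholder device */
        unsigned long long before = read_io_ticks(path);
        sleep(1); /* the "wait one second (or so)" step */
        unsigned long long after = read_io_ticks(path);
        /* delta in ms over ~1000ms of wall-clock time */
        printf("%%util = %.1f%%\n", (after - before) / 1000.0 * 100.0);
        return 0;
    }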

Starting with Linux 5.0, io_ticks is sampled rather than calculated. The sampling algorithm works like this: for each “jiffy” (typically one hundredth of a second), the kernel tallies a one if the drive was busy and a zero if it was idle.

Specifically:

  1. Whenever the kernel issues an I/O, it tallies a one for that jiffy.
  2. Whenever an I/O completes, the kernel counts the jiffies that elapsed while the request was in flight, and adds all of those jiffies to the tally.
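
In the kernel, this logic lives in update_io_ticks() (block/blk-core.c in recent kernels). Here is a simplified paraphrase of that function, with field names and version-specific details smoothed over, showing how both rules fall out of a single compare-and-swap:

    /* Simplified paraphrase of the kernel's update_io_ticks();
     * illustrative kernel-style code, not a verbatim copy. Called
     * with end == false when an I/O is issued, and end == true
     * when one completes. */
    void update_io_ticks(struct disk_stats *stat, unsigned long now, bool end)
    {
        unsigned long stamp = READ_ONCE(stat->stamp);

        if (stamp != now) {
            /* Only the first event in a given jiffy wins the cmpxchg
             * and updates io_ticks; later events in the same jiffy
             * leave it untouched. */
            if (cmpxchg(&stat->stamp, stamp, now) == stamp) {
                if (end)
                    stat->io_ticks += now - stamp; /* rule 2: jiffies since
                                                      the last recorded event */
                else
                    stat->io_ticks += 1;           /* rule 1: the issuing jiffy */
            }
        }
    }

Note that the completion path credits only the jiffies since the last recorded event, not the request’s full in-flight time. That detail is the source of the under-counting described next.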

The problem with this approach is what happens when I/Os overlap. When two overlapping I/Os complete, the kernel can’t simply add both of their elapsed times to io_ticks, because doing so would double-count the overlapping portion. To prevent that, the kernel only marks the drive as busy for the jiffies between the completion of the most recent I/O and the completion of the just-finished one. This avoids double-counting, but can easily under-count if the most recent I/O was short and the just-completed I/O was long:

io1    -----------
io2         --

If each - is a jiffy, the completion of io1 adds four jiffies to io_ticks. But the five jiffies of io1 that elapsed before io2 began are ignored entirely. The sampling algorithm has various other race conditions as well, but the net effect is that it undercounts.

Except when it overcounts. Remember part 1 of the sampling algorithm: “whenever the kernel issues an I/O, it tallies a one for that jiffy”. Most block devices nowadays can return results significantly faster than one jiffy (typically 10ms), but the kernel marks the drive as busy for that entire jiffy. I wrote a small program that spreads tiny I/Os over numerous jiffies and reports the kernel’s %util value side-by-side with an estimate of the correct value. I pretty commonly see >80% kernel-reported I/O utilization even though the program spends only about 8% of its time actually reading from disk.
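
That program isn’t reproduced here, but the effect can be demonstrated with a sketch along these lines: issue one tiny O_DIRECT read per jiffy, so that rule 1 tallies every jiffy as busy, and measure how little time is actually spent reading. The device path, read size, and timings below are illustrative; run it as root against a quiet disk and compare with iostat -x 1.

    /* overcount.c -- a sketch (not the program from the text) that
     * spreads tiny O_DIRECT reads across many jiffies. Each read
     * completes in well under a jiffy, but the kernel marks the
     * whole jiffy busy. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        /* O_DIRECT bypasses the page cache so every read hits the device */
        int fd = open("/dev/sda", O_RDONLY | O_DIRECT); /* placeholder device */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096)) return 1; /* O_DIRECT alignment */

        double busy = 0.0, start = now_sec();
        for (int i = 0; i < 1000; i++) {
            double t0 = now_sec();
            if (pread(fd, buf, 4096, (off_t)i * 4096) != 4096) { perror("pread"); return 1; }
            busy += now_sec() - t0;   /* time actually spent reading */
            usleep(10000);            /* roughly one read per 10ms jiffy */
        }
        printf("actual busy: %.1f%%\n", busy / (now_sec() - start) * 100.0);
        return 0;
    }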

There’s also the issue of jiffy boundaries. There are only 100 jiffy boundaries in a second, and the same workload can appear very different depending on whether its I/Os happen to cross jiffy boundaries or stay contained within a single jiffy. To leave less to chance, we would need to sample more frequently than 100 times per second.

Why?

Readers might be left wondering why the kernel maintainers would choose such an inaccurate and imprecise method to compute %util. The answer (so far as I can tell) is that calculating %util is very hard in a multicore context, because the “correct” solution would involve an atomic integer tracking the number of in-flight I/Os at any given moment. Managing that atomic integer would require cores to synchronize with their siblings before and after issuing an I/O to the device. In the past this wasn’t much of a problem, because the device itself usually had a single I/O submission data structure that required synchronization to access anyway. But modern devices have multiple submission queues, making it possible for many CPUs to each have their own dedicated “channel” for communicating with a block device.
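
Concretely, the “correct” approach would look something like the sketch below (not kernel code): a single shared atomic counter that every CPU must touch twice per I/O, which is exactly the cross-core cache-line traffic that multi-queue devices were designed to avoid.

    /* A sketch of the "correct" but costly approach: one shared
     * atomic counter of in-flight I/Os. Every CPU bounces the same
     * cache line on every issue and every completion. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int inflight;

    void io_issue(void)    { atomic_fetch_add(&inflight, 1); }
    void io_complete(void) { atomic_fetch_sub(&inflight, 1); }

    /* With this counter, io_ticks accounting could be exact:
     * the device is busy precisely when inflight > 0. */
    bool device_busy(void) { return atomic_load(&inflight) > 0; }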

In the face of this change in hardware, the kernel maintainers chose performance over correctness. The Linux 5.0 sampling algorithm still uses an atomic integer, but it performs an atomic operation only once per jiffy, to collect the sample. After that initial synchronized access, the statistics code uses significantly faster non-atomic instructions.

What to do about it?

I think %util is pretty dead at this point, and I doubt it’s worth reviving. The better metric for modern devices is avgqu-sz, which more accurately describes the load placed on a block device. Happily, avgqu-sz can be maintained without synchronization overhead, because the calculation is a simple sum over individual I/Os, and the correct value does not depend on whether or how those I/Os overlap.
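
For comparison, avgqu-sz is derived from the time_in_queue counter (the eleventh field of /sys/block/<dev>/stat): every completed I/O adds its own in-flight milliseconds to the sum, so overlapping I/Os deliberately count once each. The average queue size over an interval is then just a delta divided by elapsed time; a sketch:

    /* avgqu-sz over an interval, from time_in_queue (field 11 of
     * /sys/block/<dev>/stat). Each I/O contributes its own
     * in-flight milliseconds independently, so the result doesn't
     * depend on how requests overlap. */
    double avg_queue_size(unsigned long long tiq_before_ms,
                          unsigned long long tiq_after_ms,
                          unsigned long long interval_ms)
    {
        return (double)(tiq_after_ms - tiq_before_ms) / interval_ms;
    }

Because each I/O contributes independently, completions can bump per-CPU counters that are only summed when the statistics are read, with no cross-CPU coordination on the I/O path.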
