Deterministic computing on x86

This is a sequel to Running a Linux process deterministically. Read that one first.

tl;dr: if you’re willing to write a little virtual machine monitor, you can have deterministic computing on x86

If I spawn two threads to increment local counters in a tight loop for 1 second, what do you think the result will be?

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define DURATION_SEC 1
#define REPORT_INTERVAL (1ULL << 30)

struct thread_arg {
    int id;
    uint64_t result;
};

static int thread_fn(void *arg)
{
    struct thread_arg *t = arg;
    struct timespec start, now;
    uint64_t counter = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        counter++;
        if ((counter & (REPORT_INTERVAL - 1)) == 0) {
            clock_gettime(CLOCK_MONOTONIC, &now);
            long elapsed = (now.tv_sec - start.tv_sec);
            printf("thread %d: %" PRIu64 " (%lds)\n", t->id, counter, elapsed);
            if (elapsed >= DURATION_SEC)
                break;
        }
    }

    t->result = counter;
    return 0;
}

We have no way to know. It depends heavily on how the scheduler decides to allocate these threads to CPUs. Even if there were only one thread and it had a CPU all to itself, the number of instructions it executes would vary based on how the CPU clock scales up and down, among many other factors.

And as we learned in the previous post, forcing deterministic execution on x86 is impossible. But what if we allowed ourselves to emulate an x86 environment rather than running on plain hardware? Given that it’s an emulator, we can do whatever we want, right?

I’ll add two additional constraints:

  1. the emulator must be reasonably fast
  2. the emulator must have very high fidelity to an actual x86 Linux environment

Given these two constraints, we’re basically limited to running as a guest on an actual x86 machine with virtualization extensions (VT-x or AMD-V)[1]. In practice this means that you’re going to be using Linux KVM, which functions as a frontend to x86 virtualization. (Or you could use Xen, maybe? I haven’t looked into it.)

What does the emulator need?

If we’re using KVM, we need an emulator to wrap it. For a workload like the above, the emulator just needs to provide:

  1. an output layer. A serial console will do.
  2. a Linux syscall interface

Serial consoles are pretty easy for an emulator to provide. A Linux syscall interface is harder. You have two options:

  1. emulate syscalls directly. When the guest calls mmap, you map some memory for it. When it calls getpid(), you make up a believable number. gVisor does this, and has a KVM mode.
  2. just ship an OS kernel into the guest
  3. weird hybrid approaches, like unikernels.

Like I said, two options. gVisor is emphatically not designed for deterministic execution, and the modifications required to make it so would undermine the value it was providing in the first place. You’d basically be reimplementing every Sentry syscall from scratch, but deterministically.

This leaves our remaining option as “ship an OS kernel in the guest”. If you want to emulate Linux, why not just run Linux? In my build it takes 619 million instructions to boot up in a VM, but we can checkpoint the fully booted VM if we want.

With CPU virtualization extensions, we get to control the values of rdrand, cpuid, and rdtsc, which means that we’re now almost in control of all I/O performed by the workload. The missing link, unfortunately, is timers.

Timers: do we need them?

The role of a timer is to deliver an interrupt to the CPU after a certain amount of wall clock time has passed. The whole process is naturally entropic: you basically have an entirely separate piece of hardware[2] asynchronously asserting the interrupt pin on the CPU. There’s no guarantee of how many instructions your CPU will retire between timer interrupts, and that’s going to depend heavily on the workload, memory access patterns, and microarchitectural state.

So can we get away without timers? No, imho.

If you implement the whole emulator and forget about timers, it will actually work surprisingly well. I tried it, and the result was basically a cooperative multitasking OS. There were no timer interrupts to wake up the Linux scheduler and give it a chance to de-schedule the running process. My threads ran from start-to-finish without a care in the world. But if one thread does happen to do a syscall (e.g. read(), futex(), poll(), etc.), it can still be paused. So waiting on a lock, for instance, creates a preemption opportunity for the scheduler even in the absence of a working timer. But if you expect to launch two threads and have them observe each other’s activity in the absence of syscalls, we need actual timers. Whether you care about this is up to you: Claude at one point lobbied for “just don’t bother with timers; cooperative multitasking is fine”.

Realistically, I’m pretty sure that Linux itself has various timer-bound janitorial routines that need to run to keep its house in order. The kernel certainly acts like it: all of the “tickless kernel” improvements that eliminate periodic timer interrupts (e.g. CONFIG_NO_HZ_FULL) still require one CPU to remain tick-full.

Timers: how do we build one?

In an ideal world, the virtualization subsystem itself would help us implement our timers. We’d tell it “perform exactly 100 million instructions in the guest, and then stop and return to the host”. We don’t have this. What we do have is CPU performance counters, which can be configured to raise an interrupt every 100 million instructions. If the interrupt always halted our program after exactly the requested quantity of instructions, our goal would be achieved and we’d be on the patio drinking mimosas. But in actuality, there’s a certain amount of “skid” between when the PMU recognizes “hey, we’ve retired the right number of instructions now!” and the actual delivery of the interrupt to the CPU. On the Intel hardware I’m playing with, the skid is about 250 instructions, and I’m told that it can be much higher on AMD.

So in practice the approach you have to take is to run the CPU for a number of instructions equal to DESIRED - SKID, and then single-step the CPU until you reach the instruction count you were aiming for. This is ridiculous, because you run the CPU with a full pipeline, then VMExit, and then you have to loop over VMEnter -> single-step -> VMExit for at least 50 instructions, and probably a lot more.

If you want actually good performance here, talk to your friendly local CPU designer. Tell them you want a multistep execution mode that you can actually count on, so that you can say “run 100M instructions”, and the CPU runs 100M instructions. I can’t see why this would be very hard with modern CPUs: the CPU would run its normal loop until the instruction fetch unit realizes “oh wait, this next instruction I fetch might be the last one”, at which point it would internally single-step. That would be millions of times cheaper than what we’re doing right now by emulating the single-stepping loop in software.

Then again, I’m not a CPU designer.

Clocks

rdtsc follows naturally from the timer implementation: if the guest exits to ask you for the current time, you can just reference the running count you’re already using to implement the timer. Fortunately the instruction count is accurate (no skid) in this context, because the guest has already interrupted itself by performing a VMExit.

Halting

If the vCPU ever halts execution waiting for an interrupt, you can do the standard trick of incrementing the clock to the CPU’s requested wakeup time, and waking it up immediately. It won’t know the difference.

Randomness

I left rdrand unimplemented, and the emulated cpuid doesn’t advertise its presence. This means that /dev/urandom returns the same series of bytes every time the kernel boots up. Linux correctly realized that it had no usable sources of entropy, and refused to vend PRNG bytes. This resolved after I injected a random seed into the guest via boot parameters.

Does it work?

Yes, it works, though not out of the box with Linux’s KVM hypervisor. KVM never implemented the control bits that would let an emulator respond to rdtsc and rdrand on its own, so I had to patch it to add those features.

The code is here. It has a test workload designed to do all sorts of awful nondeterministic things (e.g. reading /dev/urandom, hashing data produced by a sibling thread without synchronization). You can run make run-native to see it be non-deterministic when run “natively” (without virtualization), or do make run and it’ll virtualize and produce the same output every time.

Is it slow?

The I/O paths are all super slow, so our goal is to do as little I/O as possible.

The bundled test workload in guest/init.c is 12,885,930,001 machine instructions after Linux reaches userspace + 619,234,471 for it to boot. It takes about 1.83 seconds to execute on my dev machine (excluding the Linux bootup), versus 1.65 seconds to just run the test workload within the same machine’s existing Linux OS. So the overhead is about 11% for this mostly-CPU workload.

There are 161 timer interrupts from the Linux 100Hz scheduler tick. If we sabotage (i.e. don’t deliver) them, the wall clock time for the emulated workload drops to 1.67 seconds.

Is it useful?

Maybe. I’d like it to be possible to have a cluster running builds and tests where the result is guaranteed the same every time. It seems conceptually possible with this approach, although it would need to be optimized, and it might be harder with AMD and its larger interrupt skid.


  [1] I suppose the adventurous among us can try setting up an emulator based around Rosetta.

  [2] Nowadays the timer happens to be within micrometers of the CPU core.