Running a Linux process deterministically

At one point I found myself asking “why are my Bazel-resident tests passing on weekdays and failing on weekends?”. The answer was that they were looking at the system clock. So I wrote stillness to expose a fake system clock. This solved that one particular problem, but I was left wondering just how hard it would be to run a Linux process deterministically. Would it be possible to build a sandbox that enforces determinism?

In the flawless success case, this would mean that I could just run a completely non-deterministic build or test in the sandbox, and the result would always be the same: pseudo-randomness would be supplied with the same seeds, scheduler quanta would be determined exactly, and any input from the outside world would be recorded and replayed. Similar efforts include Hermit (Facebook OSS, abandoned) and Antithesis. Both efforts are significantly more advanced than anything I could do here, but I was still curious: how much can I incrementally add to an existing codebase to make its testing deterministic? If I can just wrap my existing test runner in basic Linux primitives, what can I achieve?

The obvious stuff

Let’s assume we’re starting from scratch with a minimal implementation: just run some mostly-well-behaved tests deterministically. The first round of controls is straightforward: cut off the sources of change that come through the operating system.

What does stillness do?

During execve(), Linux injects a pointer to the vdso page into the auxiliary vector for userspace to consume. stillness zeroes that pointer, which breaks the vdso fast path and forces libc down the real-syscall route. It then installs a seccomp-bpf and ptrace() jail that intercepts clock_gettime() and serves emulated results.

In other words, stillness takes the easy way out: the non-determinism is still mapped into the program’s address space, but a compliant libc will never read it.

So now we just need to methodically block the syscalls, right?

Unfortunately no, as I have learned. Building a jail/emulator with seccomp and ptrace() isn’t too hard, and it lets you trap and emulate (or simply block) syscalls. But even after you’ve done that, other sources of non-determinism sneak in. To figure out which ones, I used a simple approach: if I run the same program twice and force both runs to dump core, I should get identical core files out of them.

As a pedantic matter, core files have headers that will always vary, because that’s where the kernel puts its own metadata about the process that died. But we’ll ignore that bit: we’re trying to ensure the programs are identical when viewed from within, not from outside.

This leaves:

rseq: this is a syscall, but the non-determinism occurs in the scheduler long after the syscall has ended, so it has some “spooky action over time” effects. Basically userspace can register to have the kernel update a cpu_id variable that identifies the CPU that is currently running a given thread. The CPU is chosen by the Linux scheduler, and is therefore non-deterministic. The solution is to block the syscall or pin the thread to a specific CPU. Easy peasy.

AT_RANDOM: also in the auxiliary vector is 16 bytes of randomness that the kernel helpfully injects into every process it launches. Fix: zero those bytes using ptrace before the process starts.

rdtsc reads the CPU’s timestamp counter. It’s used by glibc’s dynamic linker for profiling. Since it’s not a syscall, seccomp can’t see it. The fix is prctl(PR_SET_TSC, PR_TSC_SIGSEGV), which causes rdtsc (and rdtscp) to fault with SIGSEGV. Our tracer catches the signal, checks whether the faulting instruction is 0F 31 (rdtsc) or 0F 01 F9 (rdtscp), and emulates it by writing a deterministic counter into the registers.

cpuid is pretty annoying. The dynamic linker calls it during init_cpu_features to detect CPU capabilities. Most CPUID leaves return identical data regardless of which core you’re on, but leaf 1 EBX bits 24–31 contain the Initial APIC ID, which identifies the physical core. If the process runs on core 0, the APIC ID is 0; on core 5, it’s 5. That single differing byte lives in ld-linux’s cpu_features struct.

The fix is arch_prctl(ARCH_SET_CPUID, 0), which programs the CPU to fault on CPUID. The tracer then emulates CPUID. How to emulate is an interesting question: in general I imagine that anyone who has a build farm will want to emulate the lowest common denominator set of CPU features that exist within the farm.

The kernel re-enables CPUID after every execve, so the tracer has to re-arm the fault each time. This is different from the PR_SET_TSC behavior, which does survive exec.

rdrand ends us on a low note. This instruction’s whole purpose is to inject entropy into userspace. Unlike rdtsc and cpuid, there is no control register bit or prctl flag that makes it fault. The kernel simply has no mechanism to trap it. Your options are virtualization (a hypervisor can configure the VM-execution controls that cause rdrand to VM-exit) or binary rewriting (scan executable mappings for the rdrand opcode and patch it to a trapping instruction like ud2, then emulate in the ptrace handler).

Perfect is impossible, but pretty good is achievable

So the result is that deterministic computing is impossible on x86 due to fundamental properties of the ISA. Thanks, Intel. And they may even add new instructions later, which produce new forms of non-determinism. I imagine Intel and AMD’s answer to this is “use the virtualization extensions we gave you.” So now we understand why tools like Antithesis are hypervisor-based.

Having now gone down this rabbithole and come back out, my conclusions are:

  1. yes, you can get “pretty good” determinism with some simple techniques that you implement on your own. But true determinism is much harder.

  2. if your language runtime is mostly single threaded (e.g. Python, Ruby, Lua, WASM), you have a massive leg up here. If your language runtime is multi-threaded but can be single threaded for tests, that’s also fine. If you truly need multi-threading, then you need to make a deterministic scheduler, which is a hard problem.

  3. I wish simulation-based techniques were better known in the industry. I’ve seen everyone point to FoundationDB as an example, but mostly as a curiosity, and never as a “we should do this”.