Booting Linux faster with nullfs and pivot_root

tl;dr: the new nullfs feature in Linux 7.0 allows shaving a few seconds off your boot time.

I wanted my OS to boot faster. So I started squinting at systemd-analyze plot to figure out what was taking so long. One remarkable gap was the systemd initrd-switch-root service, at a bit more than a second. IMHO the entire boot process (up to SSH being available, or some other application server running) should take <1000ms, so initrd-switch-root was looking like a good optimization target.

initrd-switch-root’s job is to run as the very last step of the initramfs, usually immediately after the system’s “fully booted” root filesystem is mounted. Its role is just to do some metadata operations, boiling down to something like this:

# Move kernel filesystems
mount --move /proc "$NEWROOT/proc"
mount --move /sys  "$NEWROOT/sys"
mount --move /dev  "$NEWROOT/dev"
mount --move /run  "$NEWROOT/run" 2>/dev/null || true

# $NEWROOT must be a mountpoint, and not the same fs as /;
# switch_root requires this
exec switch_root "$NEWROOT" /sbin/init

The Busybox switch_root utility has some very nice commentary about how it all works (lines 17-26 are especially illuminating), but the simplest version is that this is just a slightly fancy chroot(2) syscall. This should not be slow at all.
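To make the "fancy chroot" claim concrete, here is a rough shell sketch of the steps. It is a simplification under assumptions, not the real tool: the function names are made up, stat -c assumes GNU coreutils or Busybox, and the real utility performs more safety checks and deletes the initramfs contents before moving the mount.

```sh
# Sketch of the core steps of switch_root. Function names are hypothetical.

# Succeeds when the two paths live on different filesystems (different st_dev).
# Returns 2 if either path cannot be stat'ed.
different_fs() {
    dev_a=$(stat -c %d "$1" 2>/dev/null) || return 2
    dev_b=$(stat -c %d "$2" 2>/dev/null) || return 2
    [ "$dev_a" != "$dev_b" ]
}

switch_root_sketch() {
    newroot=$1
    init=${2:-/sbin/init}

    # Refuse to continue unless newroot is a separate filesystem from /.
    different_fs / "$newroot" || return 1

    cd "$newroot" || return 1
    # (The real tool frees the initramfs's memory here by deleting its files.)
    mount --move . / || return 1   # stack the new root on top of the old /
    exec chroot . "$init"          # the chroot, then hand over to the new init
}
```

On a systemd-style initramfs the real root is conventionally mounted at /sysroot, so the call would look like switch_root_sketch /sysroot, run as PID 1.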

But it is slow, for two reasons. The obvious one is that when we execve() the /sbin/init binary, it has to do all of the normal things that binaries do when they start, and the page cache is completely cold (necessarily, because we just mounted the disk). In my case that binary was systemd, which meant it had to load a bunch of text files from disk and parse them into its internal representation. Adding injury to injury, I already had a perfectly good systemd process running as PID 1 in my initramfs. So I’m paying for 2 systemd startups during the system’s boot.

The less obvious slowdown is that all of this has to be done by PID 1, and none of the other processes on the system can stay alive during the switchover. (Technically speaking they can, but they’ll continue to see the initramfs’s view of files rather than the “real” root filesystem.) So initrd-switch-root is both slow on its own, and a bottleneck in the overall boot process.

I considered a few possibilities here:

- just don’t have an initramfs: boot into the main filesystem
- do a handoff of the fully loaded state from the old systemd to the new systemd
- don’t use systemd; find something lighter-weight
- don’t exec: have the original systemd perform the entire boot

Just don’t have an initramfs

There are two versions of this one. One is the zero-byte initramfs, and the other is the all-the-bytes initramfs. The zero-byte initramfs isn’t helpful because it turns out that initramfses are useful. Finding the right drivers, initializing them, waiting for devices to become available, mounting drives in the right sequence, etc., are all things that userspace processes are good at. In particular, udev is pretty useful for event-driven onlining of devices.

The all-the-bytes initramfs is quite alluring: just make your root filesystem small, and shove all of its bytes into the initramfs. The problem with this approach is that it’s kinda slow: firmware and bootloaders know how to read a file off of a disk, but they’re built to be reliable, not fast. If you want fast access to your filesystem, use Linux, not UEFI LoadImage(). And after loading, the kernel has to decompress and deserialize the initramfs CPIO archive into an in-memory filesystem. If you’re planning to need every single byte of your initramfs during the bootup process, then getting all of the reads done early is sorta justifiable. But the mega-initramfs approach basically guarantees that there’ll be some stuff in there that you don’t need until significantly later, if at all.

Do a handoff

This is a good idea! Plenty of daemons serialize their state and pass it to their successor. But it would be a somewhat invasive patch to systemd.

Don’t use systemd

Yeah, no. It turns out I kinda like systemd. I’ve been especially liking the configuration format and the batteries-included security features (e.g. automatic namespacing and seccomp policy), and journald is nice too.

And meanwhile, any other init system would have the same problems to solve: quiescence across the boundary, and loading a whole bunch of configuration after switching into the real root filesystem.

Don’t exec

The problem here is that switch_root is at its core a chroot(), and chroot() only affects the current process. Every other process on the system still sees the initramfs as root. To fix that, you need to quiesce all existing processes and start new ones from the real root — which is exactly the stop-the-world cost we’re trying to avoid. But as the title of this post indicates, we can use pivot_root instead.

pivot_root is a weird-ass syscall: it changes the view of the root filesystem for every running process in a given mount namespace, which in this particular case means more-or-less every process on the system. This violates a principle that modern Linux generally follows: when a process asks for something drastic, we do it on the next exec() or fork(), not to processes that are already running. If I were BDFL of an OS project in 2026 and someone proposed pivot_root, I’d welcome them to maintain it out of tree until the end of time. But 2026 is a long way from 2000, and I can imagine it looked a lot more reasonable back then.

Weird as it is, I think pivot_root is the only viable way to allow an initramfs process to continue onward into the rootfs. The problem is that pivot_root doesn’t work on initramfses — until Linux 7.0, which hasn’t been released quite yet.

The specific restriction is that pivot_root refuses to pivot away from the kernel’s very first mount — and the initramfs IS that first mount. But if you’re willing to run a prerelease kernel, Christian Brauner’s new nullfs filesystem removes this limitation. The kernel now creates a nullfs as the true first mount, and then layers the initramfs rootfs on top of it (completely shadowing the nullfs). The restriction still applies, but only to the nullfs underneath — so we can now pivot_root the initramfs and leave the nullfs where it was.
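A rough shell sketch of what such a pivot could look like, under some assumptions: the function name pivot_into and the oldroot directory are made up for illustration, it relies on the pivot_root(8) and mountpoint(1) utilities from util-linux, and it only works on a kernel where the initramfs is no longer the first mount. The /proc, /sys, /dev and /run moves from the earlier snippet would presumably still need to happen first.

```sh
# Sketch only: pivot from the initramfs into an already-mounted real root.
# "pivot_into" and "oldroot" are hypothetical names.
pivot_into() {
    newroot=$1
    [ -d "$newroot" ] || return 1
    mountpoint -q "$newroot" || return 1   # pivot_root needs a mount point

    mkdir -p "$newroot/oldroot" || return 1
    cd "$newroot" || return 1

    # Swap the root for every process in this mount namespace.
    # The initramfs ends up mounted at /oldroot.
    pivot_root . oldroot || return 1
    cd / || return 1

    # Once nothing needs the initramfs any more, detach it lazily.
    umount -l /oldroot
}
```

Unlike the switch_root flow there is no exec at the end: whoever called the function keeps running, just with a different root.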

With the removal of the restriction on pivot_root, “Don’t exec” is now the superior option.

Implementation

Pretty simple. Create an initramfs archive (this is just an extremely basic archive format called CPIO, with an optional compression wrapper: I recommend zstd). Copy portions of your filesystem into it. There are tools out there that can do this for you (e.g. dracut, initramfs-tools); I’m not sure how much legwork would be required to make them work for this purpose, but it might “just work.”
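If you want to do it by hand, packing a staging directory looks roughly like this. It is a sketch, assuming GNU cpio and zstd are installed and that your kernel was built with zstd initramfs support (CONFIG_RD_ZSTD); build_initramfs and both of its arguments are hypothetical names.

```sh
# Sketch: pack a staging directory into a zstd-compressed CPIO archive.
# Usage: build_initramfs STAGING_DIR OUTPUT_FILE   (both names are hypothetical)
build_initramfs() (
    staging=$1
    out=$2
    [ -d "$staging" ] || exit 1
    [ -n "$out" ] || exit 1

    # Make the output path absolute so it survives the cd below.
    case $out in
        /*) ;;
        *) out=$PWD/$out ;;
    esac

    cd "$staging" || exit 1
    # "newc" is the CPIO variant the kernel unpacks at boot.
    # -print0 / --null keeps unusual filenames intact.
    find . -print0 \
        | cpio --null --create --format=newc --quiet \
        | zstd -19 -T0 -q -f -o "$out"
)
```

The function body runs in a subshell, so the cd does not leak into the caller. File ownership is recorded as it is on disk, so building the staging tree as root (or fixing ownership some other way) is left as an exercise.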

The only tricky bits I encountered were:

- processes running at pivot_root time must be able to deal with their root filesystem being swapped out from under them.
- if any processes are in their own mount namespaces (e.g. systemd’s PrivateMounts feature), they won’t be affected by pivot_root: they’ll see the initramfs as their root for the rest of their lifetimes.
- if you copy a systemd unit into the initramfs, in most cases you need to ensure that it doesn’t start pre-pivot.
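One way to spot the mount namespace problem is to compare each process's mount namespace with PID 1's before pivoting. A minimal sketch, with made-up function names, assuming /proc is mounted and that you run it as root (otherwise other users' processes can't be inspected and are silently skipped):

```sh
# Sketch: find processes that live in a different mount namespace than PID 1.
# Those processes will not follow a pivot_root performed by PID 1.

# Succeeds when both PIDs share a mount namespace.
# Returns 2 if either namespace link cannot be read.
same_mountns() {
    ns_a=$(readlink "/proc/$1/ns/mnt" 2>/dev/null) || return 2
    ns_b=$(readlink "/proc/$2/ns/mnt" 2>/dev/null) || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Prints one PID per line for every process outside PID 1's mount namespace.
foreign_mountns_pids() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        same_mountns 1 "$pid"
        # Status 1 means "readable but different"; status 2 (unreadable) is skipped.
        [ $? -eq 1 ] && echo "$pid"
    done
    return 0
}
```

Running foreign_mountns_pids right before the pivot prints the PIDs that would be left behind on the initramfs.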

In my case I just copied all of my systemd units into the initramfs along with all of my udev rules. This allows me to avoid the need for a systemctl daemon-reload immediately after pivot_root, which would cause a whole bunch of cache misses and use a whole bunch of CPU at the literal worst moment of the boot. The systemd-udevd unit has PrivateMounts=yes, which means that unfortunately I do need to force it to restart after pivot_root.

The result is that my system boots a few seconds faster. Probably 1 second is attributable to “we elided the systemd restart”, and another 1 second is attributable to “we no longer have to quiesce everything before switch_root can happen”.