The matryoshka rule

If Kubernetes were ever going to scale to meet everyone’s needs, it would need to satisfy the “always a bigger fish” principle: objects group into namespaces, namespaces group into virtual clusters, and virtual clusters group into bigger virtual clusters. If you follow the matryoshka dolls far enough, eventually you’ll find the outermost one. Or will you?

My observation was that there should be a recursive operation so that virtual clusters can create nested virtual clusters within them. Since then, I’ve been seeing the same “russian nesting doll” principle at work in other systems, and found those systems to be more useful: they can be subdivided at will, and do not hit arbitrary walls.

A line of matryoshka (Russian nesting) dolls, proceeding infinitely away from
the viewer toward a vanishing point

Matryoshka systems are pretty well-known to computer scientists, since anything that looks like a tree is basically guaranteed to satisfy the matryoshka property. Other than simple data structures like the linked list or B-tree, the Unix filesystem abstraction is the best and most accessible example I have seen.

Unix filesystems

In a filesystem, there’s an operation (mkdirat()) that instantiates new recursive children. There are operations that walk the tree step by step (e.g. openat(), chdir()). It is possible to walk the tree upward (e.g. openat(fd, "..")), but there’s also an operation that limits your upward visibility and walkability (chroot()). Operations remain valid and fast at any depth in the tree. You can, for instance, create as many nested directories as you like, and the last mkdirat() costs no more than the first.

Benchmark: creating 65536 nested directories

import os
import time

DEPTH = 65536
DIR_NAME = "d"
INTERVAL = 256

def main():
    parent_fd = os.open(".", os.O_RDONLY | os.O_DIRECTORY)

    # Descend
    t0 = time.monotonic()
    for i in range(DEPTH):
        try:
            os.mkdir(DIR_NAME, 0o755, dir_fd=parent_fd)
        except FileExistsError:
            pass

        child_fd = os.open(DIR_NAME, os.O_RDONLY | os.O_DIRECTORY, dir_fd=parent_fd)
        os.close(parent_fd)
        parent_fd = child_fd

        if (i + 1) % INTERVAL == 0:
            t1 = time.monotonic()
            rate = INTERVAL / (t1 - t0)
            print(f"descend {i + 1:6d}  {(t1 - t0) * 1000:7.1f} ms / {INTERVAL}  ({rate:.0f} ops/s)")
            t0 = t1

    print(f"\ndone — fd {parent_fd} is {DEPTH} levels deep")

    try:
        print(f"path: {os.readlink(f'/proc/self/fd/{parent_fd}')}")
    except OSError as e:
        print(f"readlink: {e}")

    print()

    # Ascend and rmdir
    t0 = time.monotonic()
    for i in range(DEPTH):
        upper_fd = os.open("..", os.O_RDONLY | os.O_DIRECTORY, dir_fd=parent_fd)
        os.close(parent_fd)
        os.rmdir(DIR_NAME, dir_fd=upper_fd)
        parent_fd = upper_fd

        if (i + 1) % INTERVAL == 0:
            t1 = time.monotonic()
            rate = INTERVAL / (t1 - t0)
            print(f"cleanup {i + 1:6d}  {(t1 - t0) * 1000:7.1f} ms / {INTERVAL}  ({rate:.0f} ops/s)")
            t0 = t1

    os.close(parent_fd)
    print("\ncleaned up")

if __name__ == "__main__":
    main()

Unfortunately the readlink() operation in the middle of the benchmark fails: Linux allowed us to create the directory, but since the path is longer than PATH_MAX, it cannot express the entire path.

In actual filesystem implementation, the common (almost mandatory) implementation is that files are stored on disk by their (parent inode, basename) tuple: this is a dentry, in Unix-speak, and this storage format can perform a rename operation in O(1) time even if the directory being renamed has an unbounded number of descendants. Systems that encode full paths in their storage format (e.g. tar, cpio, S3) lack the matryoshka property.

To evaluate systems, we can define matryoshka levels:

Non-matryoshka (level 0): self-similarity is broken, or bounded
Weak matryoshka (level 1): self-similarity exists
Strong matryoshka (level 2): self-similarity exists, and the cost of operating at a deeper level is either completely nil or sufficiently hidden in fastpath operation.

The things that make filesystems matryoshka are the same things that one expects for any tree structure:

ancestry: the .. directory entry returns exactly one parent, which may be the root
extension: mkdir()
prefix-trimming: chroot()

That still leaves plenty of room for filesystems to be weak matryoshka. The concrete realization of a filesystem as strong matryoshka depends on the actual implementation details; most actual filesystems are strong matryoshka because of how they encode data on disk, and they encode data that way because the abstraction layer of dentries and inodes strongly guides a filesystem author to use certain data structures.

A system can still be matryoshka if the extension operation fails due to resource limits: mkdir() can return ENOSPC ("No space left on device") if the disk is full. But what is not allowed are resource limits that arise solely from your current depth in the tree. The “test” goes: if I change this tree node from a child to a sibling, does my resource limit remain the same? If so, the matryoshka property is satisfied.

Some examples

IP addressing: non-matryoshka

IP addressing does have some recursive properties. You can have a network like a /0 and subdivide it 4096 ways to become a /12; whoever receives the /12 can subdivide it 4096 ways to become a /24. But the person who receives the /24 cannot subdivide it 4096 ways: they’re left with only 8 bits to work with, so subdivision tops out at 256. In IP, the extension operation functions differently based on how many layers deep you are in the hierarchy, making it non-matryoshka. In IPv6 the extensions can go on for longer, but there’s still a depth budget of 128 bits.

IP addressing with NAT

This is still non-matryoshka. When packets are routed via the NAT to the outer network, they lack the full routing information necessary to route back to the original sender. Furthermore, the system lacks self-similarity: in IPv4 subnetting with NAT, there are two different extension techniques: 1) subnetting, and 2) NAT. When delegating prefixes (i.e. extending), the technique that a given network node must choose is a function of what has happened in the super-network above it. NAT also fundamentally works at the L4 layer (TCP, UDP), by statefully tracking the port number used by the end device, and translating into another port number that is allocated from a number-space of static size by the NAT router.

FlexIP?

There's an expired Internet Draft—not much more than a sketch, really—that proposes hierarchical segment-based addresses of variable length. In practice I think the limiting factor would be the resources of Internet routers. They have TCAM tables that can match a limited number of bits and grow to only a limited size. If IP addresses were truly infinitely extensible, then we'd have to accept that routers would only be able to "bite off" up to X bits of an address at a time, follow a summary route table entry based on those X bits, and then have the packet be routed a second time by whichever router next receives the packet. So I imagine any implementation here would be weakly matryoshka, although you could maybe find some optimizations to strengthen it?

Linux UID/GID namespaces

Linux user namespaces aren’t namespaces at all, but rather number-spaces carved out of a 32-bit range. When you create a user namespace, the system needs to decide which numbers belong to the child namespace. The net effect is that while the parent had 32 bits of space to delegate, the child only has 16. The situation is just like IP addresses, and the solution would be for users to be natively strings, rather than numbers. One Firefox process would be user josh/firefox/slashdot.org, and another one would be user josh/firefox/google.com.

Linux processes

Linux processes satisfy the matryoshka rule under some configurations. The operations are:

ancestry: getppid(). Notably getppid() can change over time, which doesn’t matter much because the identity of a process’s parent doesn’t affect it.
extension: clone()
prefix trimming: unshare(CLONE_NEWPID)

Two caveats:

the prefix-trimming operation is bounded at MAX_PID_NS_LEVEL (32) for an unfortunate reason: when you run clone() within a PID namespace, the kernel needs to allocate a PID for the new task in your namespace and up to 31 of its parents. This is super sad, and I believe the only solution that satisfies the Matryoshka property would be hierarchical PIDs. So PIDs would look like 55/4/20, encoding “the 20th task launched into the PID namespace owned by the 4th task launched into the PID namespace owned by the 55th task (launched into the PID namespace owned by init)”.
naive fork() is quadratic in behavior for deeply nested processes. Either vfork() or execve() fix the problem. The issue is that Linux needs to track which physical pages are owned by which processes (the rmap facility). Oversimplifying: during fork(), Linux walks the lineage of all ancestors that may share a page with the newly-forked process, and inserts a pointer to the new process into the list of processes that may contain a given page. This is linear in the number of fork-parents you have, and quadratic when done in a loop.¹

The below benchmark illustrates Linux’s behavior here. Try running it with --vfork for maximum speed, but watch out for any cgroups, ulimits, or kernel.pid_max values that will prevent you from successfully forking 65536 processes.

Benchmark: forking 65536 nested processes

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define INTERVAL 256

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static uint64_t report(int remaining, uint64_t t0)
{
    uint64_t t1 = now_ns();
    uint64_t elapsed_ns = t1 - t0;
    uint64_t elapsed_ms = elapsed_ns / 1000000;
    uint64_t elapsed_ms_frac = (elapsed_ns % 1000000) / 100000;
    uint64_t avg_us = elapsed_ns / INTERVAL / 1000;
    printf("remaining %6d  pid=%d  %d forks=%llu.%llu ms  avg=%llu µs\n",
           remaining, getpid(), INTERVAL,
           (unsigned long long)elapsed_ms,
           (unsigned long long)elapsed_ms_frac,
           (unsigned long long)avg_us);
    return t1;
}

int main(int argc, char **argv)
{
    int use_pidns = 0;
    int use_vfork = 0;
    int use_execve = 0;
    int remaining = 65536;
    int depth_argi = -1;
    uint64_t t0_arg = 0;
    int t0_argi = -1;

    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--pidns") == 0)
            use_pidns = 1;
        else if (strcmp(argv[i], "--vfork") == 0)
            use_vfork = 1;
        else if (strcmp(argv[i], "--execve") == 0)
            use_execve = 1;
        else if (strncmp(argv[i], "--t0=", 5) == 0) {
            t0_arg = strtoull(argv[i] + 5, NULL, 10);
            t0_argi = i;
        } else {
            remaining = atoi(argv[i]);
            depth_argi = i;
        }
    }

    char **exec_argv = NULL;
    char depth_buf[32];
    char t0_buf[64];
    if (use_execve) {
        int nargs = argc + (depth_argi < 0 ? 1 : 0) + (t0_argi < 0 ? 1 : 0);
        exec_argv = malloc(sizeof(char *) * (nargs + 1));
        int n = 0;
        for (int i = 0; i < argc; i++) {
            if (i == depth_argi)
                exec_argv[n++] = depth_buf;
            else if (i == t0_argi)
                exec_argv[n++] = t0_buf;
            else
                exec_argv[n++] = argv[i];
        }
        if (depth_argi < 0)
            exec_argv[n++] = depth_buf;
        if (t0_argi < 0)
            exec_argv[n++] = t0_buf;
        exec_argv[n] = NULL;
    }

    if (remaining <= 0) {
        printf("bottom: pid=%d\n", getpid());
        return 0;
    }

    uint64_t t0 = t0_arg ? t0_arg : now_ns();

    while (remaining > 0) {
        if (use_pidns) {
            if (unshare(CLONE_NEWPID) < 0) {
                perror("unshare");
                fprintf(stderr, "failed with %d remaining\n", remaining);
                return 1;
            }
        }

        remaining--;

        pid_t pid = use_vfork ? vfork() : fork();
        if (pid < 0) {
            perror(use_vfork ? "vfork" : "fork");
            fprintf(stderr, "failed with %d remaining\n", remaining);
            return 1;
        }

        if (pid == 0) {
            if (remaining % INTERVAL == 0)
                t0 = report(remaining, t0);

            if (use_execve) {
                snprintf(depth_buf, sizeof(depth_buf), "%d", remaining);
                snprintf(t0_buf, sizeof(t0_buf), "--t0=%llu",
                         (unsigned long long)t0);
                execv("/proc/self/exe", exec_argv);
                _exit(1);
            }
            continue;
        }

        int status;
        waitpid(pid, &status, 0);
        _exit(WIFEXITED(status) ? WEXITSTATUS(status) : 1);
    }

    printf("bottom: pid=%d\n", getpid());
    return 0;
}

So the overall report card is:

processes with vfork or execve: strong matryoshka
processes generally: weak matryoshka (quadratic cost)
PID namespaces: not matryoshka

Monotonic attenuation

One property that we’ve glanced at is increasing restrictiveness as nesting continues. This is true for both matryoshka and non-matryoshka systems. IP prefixing, for example, is very about restricting which bits can be used as a source address when sending packets. With that said, monotonic attenuation and matryoshka systems go together like peanut butter and jelly.

Processes provide an illustrative example, because a parent process can apply various restrictions to a child process at launch time. seccomp and Landlock, for instance, have the effect of attenuating the powers of the child process, and the overall trend is enforced to be monotonic: a parent can lessen the child’s powers, but not heighten them.

My understanding is that seccomp policy is weak matryoshka, because evaluating n layered seccomp policies costs O(n) time.

Lightning round

DNS: comes close, but ultimately is non-matryoshka. It lacks a prefix-trimming operation: there isn’t a naturally usable way to treat a subdomain as a root.
telephone numbers: non-matryoshka. The global space of telephone numbers is fixed width and cannot be recursively extended. Private telephone systems (PBXs) introduce a second layer via post-dial extensions, but this is not self-similar: the owner of a telephone line is adding a second dialing step, not extending the space of choices for the caller’s initial dial.
Kubernetes: non-matryoshka. Clusters and namespaces are not self-similar at all.
Docker: non-matryoshka. Famously, Docker gives you the choice of either “mount the Docker socket into the container”—which has all of the same isolation properties as “no container at all”—or “don’t mount the Docker socket”, in which case you cannot produce new Docker containers.
TCP sockets (AF_INET): non-matryoshka. They are allocated from a single flat port-space that exhausts itself quite easily.
Unix sockets (AF_UNIX): these are sockets that are filesystem-resident, so they inherit the filesystem’s matryoshka property.
x86 virtualization: weak matryoshka. You can nest VMs, but you’ll pay in performance.

Lorenzo Stoakes has recently published what seems to be the definitive work on the subject. ↩