But actually, how do flamegraphs work?

Q: So when I make a flamegraph, how does that work?

The specific type of flamegraph you’re seeing is an on-cpu flamegraph. It is synthesized by periodically stopping the CPU, asking “what are you doing right now”, and storing the answer. The flamegraph itself is a convenient representation of the data produced by doing that very frequently (see the -F argument to perf record) over some period of time.

Q: How does perf stop the CPU?

perf calls perf_event_open(2). On amd64, the syscall configures the processor’s performance monitoring unit (PMU) to raise an NMI each time a counter has counted some number of events, typically clock cycles. On Intel CPUs, there are three fixed-function counters that can drive this (see “Table 20-2. Association of Fixed-Function Performance Counters with Architectural Performance Events” in Volume 3B of the Intel SDM):

instructions retired (INST_RETIRED.ANY)
actual clock cycles (CPU_CLK_UNHALTED.CORE)
“clock cycles” as measured against a fixed reference clock (CPU_CLK_UNHALTED.REF_TSC)

The fixed reference clock runs at the rate of the CPU timestamp counter. Figuring out that rate involves reading a lot of model-specific registers, but I just had the turbostat command do it for me (sudo turbostat --num_iterations 1 --interval 1 2>&1 | sed -ne '/^TSC:/p;/^TSC:/q').
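To get a feel for the numbers: if the counter ticks at the reference rate and the profiler asks for a fixed sampling frequency, the counter has to overflow roughly every rate/frequency ticks. Here is a minimal sketch of that arithmetic. The 2.4 GHz TSC rate and the 99 Hz frequency are made-up illustration values, the function names are hypothetical, and the real kernel adjusts the period dynamically to approximate the requested frequency rather than computing it once.

```python
def ticks_per_sample(counter_hz: float, sample_hz: float) -> int:
    """Approximate number of counter ticks between two samples."""
    if sample_hz <= 0:
        raise ValueError("sample_hz must be positive")
    return round(counter_hz / sample_hz)


def expected_samples(sample_hz: float, seconds: float, busy_cpus: int = 1) -> int:
    """Rough sample count; an unhalted counter only ticks on CPUs doing work."""
    return round(sample_hz * seconds * busy_cpus)


TSC_HZ = 2_400_000_000  # hypothetical value, as reported by e.g. turbostat
SAMPLE_HZ = 99          # hypothetical value for perf record -F

print(ticks_per_sample(TSC_HZ, SAMPLE_HZ))   # 24242424
print(expected_samples(SAMPLE_HZ, 10, 4))    # 3960
```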

The actual configuration is just a wrmsr instruction on the appropriate register. It becomes quite obvious that perf is being used, because the NMI and PMI (“Performance monitoring interrupt”) rows of /proc/interrupts start incrementing like crazy.
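One way to watch this happen is to sum the per-CPU counts for those two entries before and after starting the profiler. The sketch below parses text in the shape of /proc/interrupts. The sample text and its numbers are invented for illustration, and interrupt_totals is a hypothetical helper name.

```python
def interrupt_totals(text: str, labels=("NMI", "PMI")) -> dict[str, int]:
    """Sum the per-CPU counters for the given /proc/interrupts labels."""
    totals = {}
    for line in text.splitlines():
        name, sep, rest = line.strip().partition(":")
        if not sep or name not in labels:
            continue
        count = 0
        for field in rest.split():
            if not field.isdigit():
                break  # per-CPU counts end where the description begins
            count += int(field)
        totals[name] = count
    return totals


# Invented sample in the format of /proc/interrupts on a two-CPU machine.
SAMPLE = """\
           CPU0       CPU1
  0:         36          0   IO-APIC   2-edge      timer
NMI:        120        135   Non-maskable interrupts
LOC:     500000     480000   Local timer interrupts
PMI:        120        135   Performance monitoring interrupts
"""

print(interrupt_totals(SAMPLE))  # {'NMI': 255, 'PMI': 255}
```

On a Linux machine, calling interrupt_totals(open("/proc/interrupts").read()) twice, a second apart, and subtracting the results should show the rate.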

The kernel installs a handler to process the NMI produced by the PMU. Each NMI it receives becomes a stack sample.

Q: What happens while the CPU is stopped?

The kernel copies various data into a ring buffer of memory pages that are readable by the profiler (often perf). Since we’re using flamegraphs, our profiler must have asked for stack data, e.g. by setting PERF_SAMPLE_STACK_USER in the sample_type it passed to perf_event_open. This makes the kernel copy the raw contents of the user stack (up to a configured size) into the ring buffer with every sample. In this way, the kernel can collect many stack+register samples from the profiled process without having to context switch into the profiler itself (as one would with a ptrace-based profiler).

Q: What does the profiler do?

The profiler usually sleeps on the file descriptor created by perf_event_open with one of the poll/select/epoll syscalls. When it wakes up, it copies data out of the memory pages populated by the kernel. If it fails to process data fast enough, the ring buffer fills up and samples are lost: depending on how the buffer was mapped, either new events are dropped or unread ones are overwritten.
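The producer/consumer relationship can be illustrated with a toy model. This is only a sketch: the real buffer is a byte-oriented mmap region with head and tail pointers, while the SampleRing class here is hypothetical and works on whole records. It models the case where a full buffer causes new records to be dropped and counted.

```python
class SampleRing:
    """Toy fixed-size ring: the 'kernel' writes, the 'profiler' reads."""

    def __init__(self, slots: int):
        self.buf = [None] * slots
        self.head = 0  # total records written by the producer
        self.tail = 0  # total records consumed by the reader
        self.lost = 0  # records dropped because the ring was full

    def write(self, record) -> bool:
        """Producer side: refuse to clobber records the reader has not seen."""
        if self.head - self.tail == len(self.buf):
            self.lost += 1
            return False
        self.buf[self.head % len(self.buf)] = record
        self.head += 1
        return True

    def drain(self) -> list:
        """Reader side: consume everything between tail and head."""
        out = []
        while self.tail < self.head:
            out.append(self.buf[self.tail % len(self.buf)])
            self.tail += 1
        return out


ring = SampleRing(slots=4)
for i in range(6):  # the producer outruns the sleeping reader
    ring.write(f"sample-{i}")
first = ring.drain()   # the reader finally wakes up
ring.write("sample-6")
second = ring.drain()
print(first, second, ring.lost)
```

With four slots and six writes before the first drain, the reader sees samples 0 through 3, and two records are counted as lost.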

To make a flamegraph, the profiler must unwind each stack and determine the name of each function in it. This is the same job that any debugger (e.g. gdb) does, except perf is restricted to reading only data present in the stack: it cannot refer to other data structures in the program’s memory. This makes it less meaningful to profile the state of coroutine or event-loop based programs, because much of their state resides on-heap.
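The name-lookup half of that job boils down to finding which function's address range contains each return address. Below is a minimal sketch of that lookup using a sorted table and binary search. The symbol table, addresses, and function names are all invented, and the sketch skips the hard part (actually unwinding the raw stack bytes into return addresses).

```python
import bisect

# Hypothetical symbol table: (start address, size in bytes, function name),
# sorted by start address.
SYMBOLS = [
    (0x1000, 0x40, "main"),
    (0x1040, 0x80, "parse_request"),
    (0x10C0, 0x20, "checksum"),
]
STARTS = [start for start, _, _ in SYMBOLS]


def symbolize(addr: int) -> str:
    """Return the name of the function whose range contains addr."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0:
        start, size, name = SYMBOLS[i]
        if addr < start + size:
            return name
    return f"[unknown {addr:#x}]"


def symbolize_stack(addrs):
    """Map addresses (innermost frame first) to names (outermost first)."""
    return [symbolize(a) for a in reversed(addrs)]


print(symbolize_stack([0x10C8, 0x1077, 0x1010]))
# ['main', 'parse_request', 'checksum']
```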

The profiler (or some combination of helper programs) then passes a list of annotated stacks to a flamegraph generator program, which outputs an SVG.
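A common interchange format for that list is “folded” stacks: one line per distinct stack, with frames joined by semicolons (outermost first) and followed by the number of samples that had exactly that stack. The sketch below produces such lines from already-symbolized samples. The sample data and the fold helper are invented for illustration.

```python
from collections import Counter


def fold(stacks):
    """Collapse identical stacks into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(frames) for frames in stacks)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Invented samples: each is one stack, outermost frame first.
samples = [
    ["main", "parse_request", "checksum"],
    ["main", "parse_request"],
    ["main", "parse_request", "checksum"],
    ["main", "send_reply"],
]

for line in fold(samples):
    print(line)
# main;parse_request 1
# main;parse_request;checksum 2
# main;send_reply 1
```

The width of each box in the final picture is proportional to how many samples contain that frame at that position, which is why identical stacks can be merged without losing information.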

Q: Is it fast?

It seems like the fastest form of profiling available. The profiled process continues running as normal; it just experiences a higher-than-normal quantity of CPU interrupts, as if the system were receiving network traffic. The actual data collection doesn’t require:

sending/receiving any signals (e.g. SIGPROF)
waking up a separate process (as with ptrace)
performing any syscalls

The profiler never blocks the profiled process, can run on an entirely different CPU, and only needs to wake up occasionally to process or persist the collected data.