Yandex’s High-Performance Profiler Is Now Open Source | by Sergey Skvortsov | Yandex | Jan, 2025
So, how does perf work? When you run `perf record -e instructions -c 1000000 -p 1234`, perf uses the Linux kernel to configure the CPU to count the number of executed instructions via the PMU (Performance Monitoring Unit).
The PMU is a set of dedicated processor registers that, in this configuration, increment with each executed instruction. When the PMU event counter overflows, the processor triggers a special interrupt. This interrupt is handled by the Linux kernel, which captures a snapshot of the thread’s state at the moment the millionth instruction is executed.
This architecture enables highly precise thread state analysis but requires a significant portion of perf to run within the Linux kernel (!) since the interrupt must be handled in kernel space. For instance, Linux can unwind the thread’s stack using its knowledge of the stack frame organization for the specific architecture.
If you want `perf record` to collect stack traces, add the `--call-graph` flag. However, experimenting with this flag often reveals that the generated profiles can be challenging to interpret. A typical broken profile might look something like this.
This problem arises because modern compilers don’t generate frame pointers by default. While this saves a few instructions per function and frees up a register, it also makes profiling much more difficult. Brendan Gregg provides an excellent breakdown of the problem. A popular solution is to reintroduce frame pointers into the build process. On average, the performance overhead is small, around 1–2%. This approach is commonly used by large companies and Linux distributions.
However, recompiling all programs and libraries with `-fno-omit-frame-pointer` is a complex task. Even if the main binary is compiled this way, system libraries often remain compiled with `-fomit-frame-pointer`. As a result, stack traces passing through, for example, glibc end up corrupted. Furthermore, the exact performance loss varies greatly depending on the workload; in some cases, the overhead can be much higher, even reaching double-digit percentages.
DWARF provides an alternative solution for stack unwinding, the same mechanism that supports debuggers and exception handling. Compilers generate an `.eh_frame` section that encodes the steps to reconstruct the parent stack frame from any instruction in the program, even in programs without exceptions or where exceptions are disabled (for example, in C). You can disable `.eh_frame` generation with the `-fno-asynchronous-unwind-tables` flag, but in practice this offers only a slight reduction in executable size while making debugging and profiling much harder.
Perf can utilize DWARF for stack unwinding; you just need to specify the `--call-graph=dwarf` flag. However, there’s a crucial detail: DWARF is Turing-complete. Stack unwinding occurs during an interrupt in the Linux kernel, making it virtually impossible to support DWARF in that context. Not only would this necessitate reading DWARF data from disk during an interrupt, but the unwinding logic itself becomes excessively complex, introducing numerous potential bugs.
Linus Torvalds famously conveyed his strong opposition to incorporating DWARF-based unwinding into the kernel:
I never ever want to see this code ever again.
…
Dwarf unwinder had bugs itself, or our dwarf information had bugs,
and in either case it actually turned several “trivial” bugs into a total undebuggable hell.… dwarf is a complex mess …
An unwinder that is several hundred lines long is simply not even remotely interesting to me.
…
just follow the damn chain on the stack without the “smarts” of an inevitably buggy piece of crap.
Therefore, perf employs a different strategy. Instead of unwinding the stack in the kernel, it copies only the top portion of the thread’s stack to user space. This enables stack unwinding to occur in user space, where complex code can run safely.
Of course, copying the entire stack (typically several megabytes per sample) would be too costly. To mitigate this, perf copies only a part of the stack, inevitably losing data for threads whose stacks are deeper than the captured window. The maximum stack captured by perf is 65,528 bytes, while the default thread stack size on Linux is 8 MB.
Despite these limitations, this approach generally works well and yields decent profiles. For instance, the profile for the same ClickHouse instance appears as follows:
ZSTD appears to use the stack heavily, and perf’s 65,528-byte limit wasn’t enough. Moreover, even setting aside stack size limitations, profiles generated with `--call-graph=dwarf` are an order of magnitude larger than those generated with `--call-graph=fp`. Instead of saving a compact list of return addresses, perf has to store the entire captured stack. As a result, large-scale use of `--call-graph=dwarf` requires a lot of resources: a profile collected over just a few dozen seconds can take up gigabytes of space. If we attempt to reduce the stack size limit, the profile quality begins to degrade. Let’s test this scenario: