The ideal profiling “end product” (at least for the use cases I’m familiar with, which is understanding and optimizing the performance of Python code running under CPython) is a stack that looks something like this:
pyfunc1
pyfunc2
C_func_called_in_implementation_of_some_opcode_in_pyfunc2
pyfunc3_called_due_to_python_code_execution_triggered_by_that_opcode_in_pyfunc2
python_extension_func
other_non_python_C_library_func_called_by_extension_func
expensive_implementation_detail_C_func_in_C_library
These kinds of “mixed stacks” are very useful in debugging and fixing performance problems in both pure-Python code and mixed Python / C-extension code.
The current “perf trampoline” feature provides this type of mixed stack by a) disabling call inlining and call specialization, to ensure that every Python frame has a corresponding C frame, and then b) using a mini assembly trampoline and a perfmap file to give perf a more informative name for the C frames that correspond to Python frames.
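For reference, the perfmap file is just a plain-text file (by convention /tmp/perf-PID.map) with one “start-address size symbol-name” entry per trampoline; the entries CPython writes look roughly like this (addresses, sizes, and paths invented for illustration):

7f3b5881a000 b0 py::pyfunc1:/srv/app/example.py
7f3b5881a0b0 b0 py::pyfunc2:/srv/app/example.py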
The nice thing about this approach is that it is turnkey and very easy to use: you just enable it, and then perf natively gives you these useful mixed stacks.
The unfortunate thing is that it gives up one of the major advantages of sampled profiling, which is that the profiling has very little effect on the performance of the profiled code, and what little effect it does have is evenly distributed and introduces zero data skew. Disabling call inlining and call specialization has a significant performance impact, and unfortunately can also introduce skew, since it specifically makes calling Python functions relatively more expensive than it would otherwise be in unprofiled code.
In https://github.com/python/cpython/issues/100987, @markshannon is proposing to include some of the C frames from the above example (the ones that are CPython built-ins) in the Python stack, as an alternative way to provide a similar stack with no performance overhead. The difference with this proposal is that the stack wouldn’t include non-CPython C functions from C libraries called by extensions, so you wouldn’t get the same visibility when debugging mixed Python/C-extension code; you’d need to use the Python stack to track the performance hotspot as far as python_extension_func, and then switch to perf to understand what behavior inside the C library is making python_extension_func slow. This can make things somewhat trickier in cases where you need to correlate across the two stacks.
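Concretely, for the example stack at the top, you would be looking at roughly these two separate views (same illustrative frame names as above), with python_extension_func as the only point of correlation between them:

Python stack (per the gh-100987 proposal):
pyfunc1
pyfunc2
pyfunc3_called_due_to_python_code_execution_triggered_by_that_opcode_in_pyfunc2
python_extension_func

perf stack for the same hotspot:
PyEval_EvalFrameDefault
PyEval_EvalFrameDefault
python_extension_func
other_non_python_C_library_func_called_by_extension_func
expensive_implementation_detail_C_func_in_C_library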
If I understand correctly, @gpshead is suggesting that future versions of CPython “perf support” could potentially avoid the performance impact on profiled code by no longer requiring every Python frame to have a corresponding C frame (so we wouldn’t need to disable all inlining optimizations), but still have the option to emit perfmap files, and instead generate a system stack that might have frames named something like “pyfunc1 - pyfunc2 - pyfunc3”, where pyfunc2 and pyfunc3 were call-inlined into pyfunc1. This seems potentially useful, but it makes it difficult to e.g. isolate the exclusive cost of pyfunc1. (Let me know @gpshead if I misunderstood your suggestion.) (edit: on third thought, I’m not actually sure how this proposal would work, since perfmaps / trampolines are static but call inlining is a dynamic decision. But I may have misunderstood.)
One alternative way to generate the ideal kind of “mixed stack”, without much performance overhead (and specifically without overhead that can introduce skew), is to use bpf probes to have perf collect the Python stack as well as the system stack, and then, as a post-processing step, “merge” the two stacks by matching up PyEval_EvalFrameDefault system frames with Python frames. (This is what we currently do at Meta.) This approach also has trouble with call inlining, but that could easily be fixed if CPython’s _PyInterpreterFrame had one additional bit of metadata signifying “I was call-inlined into my parent and don’t have a corresponding C frame.” Then the merge process again knows everything it needs.
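To make the merge step concrete, here is a minimal sketch of the post-processing, in Python, with invented data shapes (a real implementation would operate on whatever stack representation the profiler actually collects): each PyEval_EvalFrameDefault frame in the system stack is replaced by the next Python frame, and Python frames carrying the proposed “I was call-inlined” bit are spliced in without consuming an evaluator frame.

EVAL_FRAME = "PyEval_EvalFrameDefault"  # C evaluator symbol to match on

def merge_stacks(system_stack, python_stack):
    """Splice Python frames into a system stack sampled at the same instant.

    system_stack: list of C symbol names, outermost call first.
    python_stack: list of (function_name, was_inlined) pairs, outermost first;
        was_inlined means the frame was call-inlined into its parent and has
        no evaluator frame of its own.
    """
    merged = []
    i = 0  # index of the next unconsumed Python frame
    for c_frame in system_stack:
        if c_frame == EVAL_FRAME and i < len(python_stack):
            # One evaluator invocation covers the next Python frame plus any
            # frames that were call-inlined into it.
            merged.append(python_stack[i][0])
            i += 1
            while i < len(python_stack) and python_stack[i][1]:
                merged.append(python_stack[i][0])
                i += 1
        else:
            # Non-evaluator C frames (extension code, C libraries, ...) pass
            # through unchanged.
            merged.append(c_frame)
    return merged

# Mirrors the example stack at the top, with pyfunc2 call-inlined into pyfunc1:
system = ["main", EVAL_FRAME, "C_opcode_impl_func", EVAL_FRAME,
          "python_extension_func", "C_library_func"]
python = [("pyfunc1", False), ("pyfunc2", True), ("pyfunc3", False)]
print(merge_stacks(system, python))
# ['main', 'pyfunc1', 'pyfunc2', 'C_opcode_impl_func', 'pyfunc3',
#  'python_extension_func', 'C_library_func']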
The downside of this approach, relative to the current 3.12 perf trampolines, is that it is less turnkey: you need to do more work on the profiler side (collect both stacks, then merge them). But that work (and the bpf probes necessary to make it work) could be bundled into a profiler library/tool.