The performance of Python with `perf` support is not great, and is going to get a lot worse

3.12 offers support for perf on those platforms that support it.

The downside of perf support is the performance impact. On a normal release build, we see an 8% slowdown.
With a -fno-omit-frame-pointer build (needed to better support perf), the slowdown would be in the 10% to 15% range. (I haven’t measured this, but the slowdown for -fno-omit-frame-pointer alone is more than 2%, and nearer 7%.)

This may be an acceptable slowdown for some, but it is going to get worse.

The design for the “tier 2” optimizer, including a JIT compiler, which we plan for 3.13/3.14, performs inter-procedural optimization.
The current implementation of perf support relies on making a C call for each Python call, which prevents a whole slew of optimizations.

This is not to say that the presence of perf support will prevent all further optimizations, but the difference in performance will become large. I wouldn’t be surprised if some programs saw a 50% slowdown due to perf support.

Is there a use of perf support where performance doesn’t matter?

If not, what’s the point of supporting perf? It isn’t much use for performance profiling if it kills performance, nor is it much use for monitoring systems in production, for the same reason.

Several (because you are talking about pure Python performance). For instance, imagine an application where 95% of the runtime is spent in extension modules that are called from Python. This is not hard to imagine, because this is the case for most data science, scientific, or machine learning applications. Here perf would basically be used to profile the native code underneath, BUT informed by which Python calls are being made to trigger it, showing the full stack. This is invaluable, because otherwise you end up with a super unreadable flamegraph full of PyEval_EvalFrameDefault.

You seem to be thinking about profiling in production as well, but perf can be, and normally is, used as a profiler in development to better understand your application. Production profiling is just one way to use it.


The performance cost is only paid when perf profiling is enabled, right?

A 10-15% slowdown is not “killing performance”. In lots of places it’s a very acceptable cost to pay for that much insight into what the process is doing, especially since it can be enabled and disabled at runtime. Not having to rebuild or distribute a special binary is a big deal, too. All in all, perf profiling gives the kind of insight into performance bottlenecks (latency, throughput, etc.) that would otherwise require specialised builds replaying production workloads in customized environments.
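The “enabled and disabled at runtime” part is worth spelling out: 3.12 exposes this through `sys.activate_stack_trampoline()` / `sys.deactivate_stack_trampoline()`, which only exist on supporting platforms (Linux). A minimal sketch of toggling it around a workload, with a fallback for builds that lack the API:

```python
import sys

def run_with_perf_trampoline(fn, *args):
    """Run fn with the perf trampoline enabled, where available.

    sys.activate_stack_trampoline() is new in CPython 3.12 and only
    exists on platforms with trampoline support (Linux); elsewhere
    this simply calls fn directly.
    """
    supported = hasattr(sys, "activate_stack_trampoline")
    if supported:
        sys.activate_stack_trampoline("perf")
    try:
        return fn(*args)
    finally:
        if supported and sys.is_stack_trampoline_active():
            sys.deactivate_stack_trampoline()

print(run_with_perf_trampoline(sum, range(100)))  # 4950
```

So the slowdown is only paid while a profiling window is actually open, not for the whole life of the process.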

(This has been mentioned before, but Google has run with frame pointers (shrink-wrapped or regular) since pretty much forever, precisely because the useful stack info on crashes and when profiling or debugging is so valuable. And this is for highly tuned binaries with massive workloads on huge numbers of machines, where even a fraction of a % is a valuable savings. I’m pretty sure the C++ teams would love to only have to pay the cost when enabling the feature at runtime :).)


That is correct.

10-15% may not be killing performance, but 50% is.

If you think that’s acceptable, fine. But I suspect that for many people it would not be.

The release notes and docs on perf support make no mention of any performance impact. Perhaps they should, so that people can make informed decisions.

Could you clarify exactly how you’re measuring this (ideally in a reproducible manner)? It appears to conflict with:

I’m going to assume this is a Unix tool based on who’s participating in this conversation?

Yep, perf is a well-known Linux program-analysis tool. Perf Wiki


Here are the performance results from our standard benchmark suite with perf support turned on all the time.

If perf support is to be useful, then CPython needs to compiled with -fno-omit-frame-pointer, the cost of which “can be anywhere from 1-10% depending on the specific benchmark”.
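For reference, building with frame pointers is a configure-time choice. Roughly (a sketch based on the suggested flags in the perf-profiling HOWTO; check the docs for your exact toolchain):

```shell
# Build CPython with frame pointers so perf can unwind C stacks reliably.
# -mno-omit-leaf-frame-pointer additionally keeps them in leaf functions.
./configure CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer"
make -j"$(nproc)"
```

That cost is paid by every run of the resulting binary, whether or not perf is attached, which is why it matters who chooses the flags.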

If compiled normally, and with perf support turned off, the cost is zero, because it piggybacks on PEP 523 support.

So if it is turned off, and doesn’t work properly when turned on, it’s free :face_with_raised_eyebrow:

Specifically, we’re talking about Python support for the Linux perf profiler — Python 3.12.0a6 documentation here.

Just FYI, I think there’s a way to also make the same concept work well for perf tracing on Windows, but I’m still chatting with our relevant OS experts. So it’s Linux only for now.

Okay, that is the state I expected things to be in: perf support existing in our codebase has no negative impact by default. We don’t ship binary releases on platforms that support perf, so the choice of compiler flags is entirely up to those who do build and distribute Linux runtimes.

The decision to build with frame pointers or not is beyond our control for 3.12.

So what does this mean for 3.13 and beyond?

For 3.13+, development focus is on the tier 2 JIT work, and if that means perf support stops functioning, from my point of view that isn’t a big deal. We can see how things work and come up with a solution if people remain interested in keeping it alive, and at what cost. I wouldn’t block performance work in 3.13+ on supporting perf.

Stepping back: perf and similar things “just” want to map samples of execution stack addresses to symbols/source-lines. A JIT can record which generated code addresses map to which symbols at JIT code generation time. That data can be collected and exported for later correlation with the samples. This is what I believe Java VMs have been doing for many years, as well as some others such as V8 and Node. [*]

[*] references: perf: add support for profiling jitted code, via Inspecting OpenJ9 performance with perf on Linux – JIT Compiled Methods – Eclipse OpenJ9 Blog and Using perf to profile Java applications | BellSoft Java, for example… Notice that the JVM needs to be asked to generate JIT code with -XX:+PreserveFramePointer to make for better perf sample data; the JVM defaults that to false, and I see various docs around work suggesting we’ve kept that default. People explicitly profiling Java applications with perf are told to turn it on. That’s really no different than 3.12’s PYTHONPERFSUPPORT= or -X perf flags, which could be made to have that effect.
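The mechanism these runtimes use is simple: a `/tmp/perf-<pid>.map` file where each line is `START SIZE NAME`, with the address and size in hex, emitted as code is generated. A sketch of writing such entries and resolving a sampled address against them (symbol names and addresses here are made up):

```python
import io

def write_perf_map_entry(out, start, size, name):
    # A perf map line is "START SIZE NAME", START and SIZE in hex.
    out.write(f"{start:x} {size:x} {name}\n")

def resolve(map_lines, addr):
    """Map a sampled instruction address back to a JIT symbol name."""
    for line in map_lines:
        start_s, size_s, name = line.split(maxsplit=2)
        start, size = int(start_s, 16), int(size_s, 16)
        if start <= addr < start + size:
            return name
    return None

# Illustrative entries, as a JIT might emit them at code-generation time.
buf = io.StringIO()
write_perf_map_entry(buf, 0x7F0000001000, 0x80, "py::my_func:/app/mod.py")
write_perf_map_entry(buf, 0x7F0000001080, 0x40, "py::helper:/app/mod.py")
lines = buf.getvalue().splitlines()
print(resolve(lines, 0x7F0000001010))  # py::my_func:/app/mod.py
```

perf itself does this correlation when it sees the map file, so the runtime only has to keep the file up to date.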

With a JIT you might even get the best of both… build CPython without frame pointers for interpreter performance, but have it generate JIT code with them in place, so that the hot code paths in your program wind up with more complete data than the cold, non-jitted, no-FP ones?

You are making all sorts of assumptions about the way we will optimize Python code.

Reasonable sounding requests to “just” generate some extra data during JIT compilation are not reasonable. They severely limit the sorts of optimizations we can do, and will involve a lot of extra work.

This is why I wanted to discuss this now. I don’t want support for perf to hurt performance and that is exactly what will happen, if people expect us to support perf in future optimizers, including any possible JIT.

perf is designed for C code. It is easy to make languages that are fundamentally similar (C++, Rust, Swift, Go, etc) support it. Java can be made to support it, but at some cost, and even V8 can, but only because it uses a method-at-a-time compilation model, which creaks and groans when implementing coroutines.

Python has a very rich calling convention, generators, coroutines and async coroutines. These do not map well to the standard C call model.

For example, we saw a 150% speedup on a generator-heavy benchmark by avoiding a stack of C calls when iterating over a generator, instead simply jumping into the generator.
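To make concrete what “generator heavy” means here, this is the shape of code affected (a made-up illustration, not the actual benchmark): every `next()` in the loop used to re-enter the interpreter through several C calls, which the optimization replaces with a jump into the suspended generator frame.

```python
def squares(n):
    # Each resumption of this generator is one of the calls in question.
    for i in range(n):
        yield i * i

def consume(n):
    # The for loop performs one generator resumption per iteration,
    # so call overhead dominates for cheap loop bodies like this one.
    total = 0
    for v in squares(n):
        total += v
    return total

print(consume(100))  # 328350
```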

I don’t know how much of the 8% slowdown for turning on perf support is the extra work of adding in shim C frames, but some of it is the extra work of making several calls on top of creating the Python frame, and that is pure overhead.
Not only is it extra work, it breaks up the regions for optimization, reducing the effectiveness of any future optimizers.

I think we’re just miscommunicating. I’m well aware of what valuable optimizations can entail. (Background: eons ago I worked for Transmeta, where a world-class JIT was key to our product.)

My point is that I wouldn’t worry about maintaining perf support while working on future CPython performance improvements. It’s a non-requirement in my mind. That way we can measure the performance without it.

That leaves those who want perf able to better understand the performance and maintenance cost of any proposed implementations that keep it. I was wary of this when it went into 3.12 but thought it’d be valuable to have anyway, until we couldn’t.


For future perf support implementors: From my perspective, interprocedurally optimized JIT code with no calls or frames involved can likely still supply a meaningful “symbol” and “stack” to perf. It doesn’t have to match what a pure interpreted old-CPython C-like stack might otherwise look like.


The ideal profiling “end product” (at least for the use cases I’m familiar with, which is understanding and optimizing the performance of Python code running under CPython) is a stack that looks something like this:
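(The sample stack didn’t survive quoting; the following is a hypothetical reconstruction of the kind of mixed stack meant, with purely illustrative function names, interleaving Python frames, CPython builtin C frames, and non-CPython C frames:)

```text
python_entry_func              (Python)
  python_helper_func           (Python)
    builtin_map                (C, CPython builtin)
      python_callback          (Python)
        python_extension_func  (C, extension module)
          c_library_func       (C, library called by the extension)
```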


These kinds of “mixed stacks” are very useful in debugging and fixing performance problems in both pure-Python code and mixed Python / C-extension code.

The current “perf trampoline” feature provides this type of mixed stack by a) disabling call inlining and call specialization, to ensure that every Python frame has a corresponding C frame, and then b) using a mini assembly trampoline and a perfmap file to give perf a more informative name for the C frames that correspond to Python frames.

The nice thing about this approach is that it is turnkey and very easy to use: you just enable it, and then perf natively gives you these useful mixed stacks.

The unfortunate thing is that it gives up one of the major advantages of sampled profiling, which is that the profiling has very little effect on the performance of the profiled code, and what little effect it does have is perfectly evenly distributed and introduces zero data skew. Disabling call inlining and call optimizations has a significant performance impact, and unfortunately can also introduce skew, since it makes specifically calling Python functions relatively more expensive than it would otherwise be in unprofiled code.

@markshannon is proposing to include some of the C frames from the above example (the ones that are CPython built-ins) in the Python stack, as an alternative way to provide a similar stack with no performance overhead. The difference with this proposal is that the stack wouldn’t include non-CPython C functions from C libraries called by extensions, so you wouldn’t get the same visibility when debugging mixed Python/C-extension code; you’d need to use the Python stack to track the performance hotspot as far as python_extension_func, and then switch to perf to understand what behavior inside the C library is making python_extension_func slow. This can make things somewhat trickier in cases where you need to correlate across the two stacks.

If I understand correctly, @gpshead is suggesting that future versions of CPython “perf support” could potentially avoid the impact on perf of profiled code by no longer requiring every Python frame to have a corresponding C frame (so we wouldn’t need to disable all inlining optimizations), but still have the option to emit perfmap files, and instead generate a system stack that might have frames named something like “pyfunc1 - pyfunc2 - pyfunc3”, where pyfunc2 and pyfunc3 were call-inlined into pyfunc1. This seems potentially useful, but makes it difficult to e.g. isolate the exclusive cost of pyfunc1. (Let me know @gpshead if I misunderstood your suggestion.) (edit: on third thought, I’m not actually sure how this proposal would work, since perfmaps / trampolines are static, but call inlining is a dynamic decision. But I may have misunderstood.)

One alternate way to generate the ideal kind of “mixed stack”, without much performance overhead (and specifically without overhead that can introduce skew), is to use bpf probes to have perf collect the Python stack as well as the system stack, and then, as a post-processing step, “merge” the two stacks by matching up PyEval_EvalFrameDefault system frames with Python frames. (This is what we currently do at Meta.) This approach also has trouble with call inlining, but that could be easily fixed if Python’s _PyInterpreterFrame had one additional bit of metadata signifying “I was call-inlined into my parent and don’t have a corresponding C frame.” Then the merge process knows everything it needs again.

The downside of this approach, relative to the current 3.12 perf trampolines, is that it is less turn-key: you need to do more work on the profiler side (collect both stacks, then merge them.) But that work (and the bpf probes necessary to make it work) could be collected into a profiler library/tool.


I just realized that the existing “shim frames” inserted into the frame stack on entry into PyEval_EvalFrameDefault can also serve this purpose. Each occurrence of PyEval_EvalFrameDefault on the system stack should match up to a “shim frame,” and all normal Python frames from that shim frame until the next one belong at that spot in the system stack.

So I don’t think we need this extra bit of metadata on frames for “stack merging” to continue to work.

If we do keep the perf trampoline support, it would be good to extract the perf map file writing part of it to a distinct lower level thread-safe API. Perf maps are “one file per process” so it is difficult to safely coordinate multiple libraries separately writing to a perf map for the same process without a central API to manage those writes. The Cinder JIT currently writes perf map files, and so does Pyston’s JIT, so along with the perf trampoline support in CPython, we already now have three potential clients for such an API.
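As a sketch of what such a central API might look like (names and shape are hypothetical; a real version would live at the C level inside CPython, this just illustrates the coordination problem):

```python
import os
import threading

class PerfMapWriter:
    """Hypothetical process-wide, thread-safe perf map writer.

    Serializes appends from multiple clients (e.g. a JIT and the perf
    trampoline) to the single per-process map file, which by convention
    is /tmp/perf-<pid>.map.
    """

    def __init__(self, path=None):
        self._path = path or f"/tmp/perf-{os.getpid()}.map"
        self._lock = threading.Lock()

    def add_entry(self, start, size, name):
        line = f"{start:x} {size:x} {name}\n"
        with self._lock:
            # Append one whole line per entry under the lock, so
            # concurrent writers never interleave partial lines.
            with open(self._path, "a") as f:
                f.write(line)

# Usage with an explicit path (illustrative):
import tempfile
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".map")
tmp.close()
writer = PerfMapWriter(tmp.name)
writer.add_entry(0x1000, 0x20, "py::foo:/app/a.py")
writer.add_entry(0x1020, 0x10, "py::bar:/app/a.py")
print(open(tmp.name).read(), end="")
# 1000 20 py::foo:/app/a.py
# 1020 10 py::bar:/app/a.py
```

With one such object owned by the runtime, Cinder, Pyston, and the trampoline code could all register entries without stepping on each other’s writes.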