Faster Python: _PyEval_EvalFrameDefault is hard for GCC to optimise

As part of the Fedora discussions on adding frame pointers to all code, it was found that most code slowed down by 1%-2%, but Python slowed down by about 10%.

This LWN article covers the heated discussion on the frame-pointer addition: https://lwn.net/Articles/919940/
In the comments there is a pointer to this issue comment with an analysis of why Python is 10% slower: https://pagure.io/fesco/issue/2817#comment-826636

I’ve copied the analysis of the slowdown below, so that you do not need to follow the links.

Summary: the loss of one register causes more spilling of variables to the stack, and the use of frame pointers causes GCC to emit 7-byte instead of 5-byte stack accesses.

Can this function be refactored to allow compilers to do a better job of optimising the code?
(I think this is the second time I have read about this function being a performance bottleneck. There was a blog post about 10-15 years ago that reached the same conclusions: refactor the code, improve GCC code generation.)

Quote from comment:

First, perf data suggests that a big chunk of CPU time is spent in _PyEval_EvalFrameDefault, so I looked specifically into it (we also had to use DWARF mode for perf for an apples-to-apples comparison, and a bunch of stack traces weren’t symbolized properly, which again reminds us why having frame pointers is important).

perf annotation of _PyEval_EvalFrameDefault didn’t show any obvious hot spots; the work seemed to be distributed pretty similarly with or without frame pointers. Scrolling through the _PyEval_EvalFrameDefault disassembly also showed that the instruction patterns between the fp and no-fp versions are very similar.

But just a few interesting observations.

The size of the _PyEval_EvalFrameDefault function specifically (all the other functions didn’t change much in that regard) increased very significantly, from 46104 to 53592 bytes, a considerable ~16% increase. Looking deeper, I believe it’s all due to more stack spills and reloads, caused by one less register being available to keep local variables in registers instead of on the stack.

Looking at the _PyEval_EvalFrameDefault C code, it is one humongous function with a gigantic switch statement that implements Python’s instruction-handling logic. So the function itself is big and has a lot of local state in different branches, which to me explains why there is so much stack spill/reload.

Grepping for instructions of the form mov -0xf0(%rbp),%rcx or mov 0x50(%rsp),%r10 (and their reverse variants), I see that there is already a substantial amount of stack spill/reload in the _PyEval_EvalFrameDefault disassembly in the default no-frame-pointer variant (1870 out of 11181 total instructions in that function, 16.7%), and it increases further in the frame-pointer version (2341 out of 11733 instructions, 20%).

One more interesting observation. With no frame pointers, GCC generates stack accesses using %rsp with small positive offsets, which results in a pretty compact instruction encoding, e.g.:

0x00000000001cce40 <+44160>: 4c 8b 54 24 50 mov 0x50(%rsp),%r10

This uses 5 bytes. But if frame pointers are enabled, GCC switches to using %rbp-relative offsets, which are all negative, and that results in much bigger instructions, now taking 7 bytes instead of 5 (an offset like -0xf0 no longer fits in a signed 8-bit displacement, so the encoding needs a 4-byte displacement field):

0x00000000001d3969 <+53065>: 48 8b 8d 10 ff ff ff mov -0xf0(%rbp),%rcx

I found this pretty interesting. I’d imagine GCC should be capable of keeping %rsp-based addressing just fine regardless of %rbp and saving on instruction size, but apparently it doesn’t; I’m not sure why. But this instruction-size increase, coupled with the increase in the number of spills/reloads, actually explains the huge growth in the byte size of _PyEval_EvalFrameDefault: (2341 - 1870) * 7 + 1870 * 2 = 7037 (2 extra bytes for each of the existing 1870 instructions that switched from %rsp + positive offset to %rbp + negative offset, plus 7 bytes for each of the 471 new instructions). I’m no compiler expert, but it would be nice for someone from the GCC community to check this as well (please CC relevant folks, if you know them).

In summary, to put it bluntly: there is just more work for the CPU in saving/restoring state to/from the stack. But I don’t think the _PyEval_EvalFrameDefault example is typical of how application code is written, nor is it, generally speaking, a good idea to do so much within a single gigantic function. So I believe it’s more of an outlier than a typical case.

Looking also at @pviktori’s response in “Re: Building Python 3.12 with no-omit-frame-pointer” on the python-devel Fedora mailing list, I agree that microbenchmarking matrix multiplication in pure Python, instead of offloading it to native C/C++ libraries as is typically done in practice for all CPU-heavy work in Python apps, is about as far from a real-world benchmark as we can get. So I hope these microbenchmarks won’t be the primary driver of the decision on enabling frame pointers.

cc @pviktori just in case he and the Python community would like to look deeper into this and maybe consider splitting _PyEval_EvalFrameDefault into subfunctions somehow, to give compilers a better shot at register allocation.

I hope this was helpful.



Redirecting to “Python 3.11 performance with frame pointers”, which was the original post on this topic.