News faster CPython, JIT and 3.14

paugier · March 20, 2025, 3:37pm

I worked a bit on studying the effect of the new CPython JIT and more generally of performance for simple CPU bounded pure Python code.

I focused on a very simple benchmark involving only very simple pure Python:

def short_calcul(n):
    result = 0
    for i in range(1, n + 1):
        result += i
    return result

def long_calcul(num):
    result = 0
    for i in range(num):
        result += short_calcul(i) - short_calcul(i)
    return result

I’m not claiming that this code is representative of any real world Python CPU bounded code. One interest of this benchmark is that it involves only very simple tasks (loops, addition/subtractions of integers and function calls) so it should be relatively simple for compilers to accelerate it.

For example, for this benchmark PyPy 3.11 is 25 times faster than the system CPython 3.11 on Debian (compiled with GCC).

Unfortunately, the results are a bit depressing. The full code and full results are available here.

I used only solutions available without compiling the interpreters (installing Python with UV and Miniforge). I run these benchmarks only on Linux x86_64 (Debian). All Pythons used for this experiment have the GIL and everything is sequential.

Here are few important points:

The compiler used to compile the interpreter has a non negligible effect (see cpython 3.13 installed with UV slow and not compiled with `--enable-experimental-jit=yes-off`` · Issue #535 · astral-sh/python-build-standalone · GitHub).
The CPython JIT still has a relatively small effect on this benchmark. Even for the best case (3.13 from conda-forge), the speedup related to the JIT is only of x1.2, compared to x25 with PyPy!
If we compare only Python provided by UV, 3.14a6 is a bit faster than 3.13 (x1.2) but actually approximately as fast as 3.11 (x1.05 to be more accurate).

For such simple code, this is in my humble opinion quite disappointing. PyPy is still more than 22 times faster than the best CPython result (Python 3.13 from conda-forge with the JIT). And let’s not talk about other languages…

It seems to indicate that the CPython JIT does not manage to avoid boxing/unboxing of integers.

I have few questions about these results:

Is it possible to get information on what happens internally with the CPython JIT with this code? What is wrong in this case?
Is there a chance that a future version of the CPython JIT will really accelerate such kind of pure Python CPU bounded code?
For 3.14 (compiled with Clang), the JIT has in practice no measurable effect. Is it related to the new interpreter using tail calls?
Are there examples for which the CPython JIT leads to a non negligible speedup?

davidism · March 20, 2025, 3:43pm

The JIT is not finished yet and is classified as experimental. In its current stage, it does not yet attempt to improve performance. The first phase was getting it designed, developed, and released at all. Now that the framework is in place, performance can improve as work continues.

JamesParrott · March 20, 2025, 4:25pm

Try it with an example that can’t have result += short_calcul(i) - short_calcul(i) optimised away by a compiler to essentially be:

def long_calcul(num):
    return 0

barry-scott · March 20, 2025, 5:43pm

You might want to read this blog that found tail-call is relying on a compiler regression. Performance of the Python 3.14 tail-call interpreter - Made of Bugs

zware · March 20, 2025, 7:50pm

I think you’re going a step too far in your description that it “relies” on a compiler bug. The initial measurement of expected benefit was inflated by the compiler bug, but there is still measurable improvement and possibly other benefits to the tail-call interpreter. Also note that this is all called out in an Attention box in the whatsnew note about the tail-call interpreter that was linked in the OP.

As far as the JIT, the team working on it has been upfront about the fact that it is still very early days and so far the focus has been more on retaining correctness and making sure it doesn’t hurt performance rather than making everything faster. There is still a long way to go, but my personal expectation is that when there is confidence that the JIT is generally a significant performance win, it will become an opt-out feature included in --enable-optimizations by default. For now, it’s still labeled “experimental” and strictly opt-in for a reason

See PEP 744 and associated discussion thread for further details.

paugier · March 20, 2025, 8:49pm

PyPy does not do that and the speedup would be considerably larger if it was the case.

The results are exactly the same when I change this - to a +.

Moreover, a x20 speedup is consistent with what is seen as soon as you compare for CPU bounded pure Python computations CPython to things quite efficient.

brandtbucher · March 20, 2025, 9:05pm

Thanks for taking the time to try out the new JIT! Hopefully I can help answer some of your questions.

Well, I’d rephrase this as “it doesn’t even try to unbox integers yet”. It’s something we hope to have working for 3.15. As you note, it’s an attractive optimization, but it will be easier to get it working correctly once we land some other changes to how references are handled in the main interpreter loop.

If you have a debug build of CPython and are curious about what’s happening, setting the PYTHON_LLTRACE=1 env var will print a line to stdout whenever we compile something. Setting the env var to 2 (and successively higher values) will print more info about what the code actually looks like before and after the optimizations that we currently perform.

You’ll find the current optimizations are quite limited; currently, we mostly just remove type checks and promote globals to constants, but the list is growing. Our tracing JIT does some inlining as well.

Depends what you mean by “really accelerate”. I think that the 20% speed improvement that you observe is quite significant, but if you’re talking about PyPy-level speeds, that’s still a very long way off (and personally, I don’t think competing with PyPy’s speed is a very realistic goal given the goals of the CPython project).

Not sure! That would be my first thought, though, since I can’t think of anything that would have gotten worse for this JIT code in 3.14.

Again, we may have different ideas of what “non-negligible” means, but a slightly more realistic benchmark like richards speeds up 20% when you turn on the JIT on 3.14. I think 20% on a program like that is significant, but if you’re looking for 2x, 5x, or 10x faster code when the JIT is enabled, I’m afraid that’s just going to take more time and effort on our part.

barry-scott · March 20, 2025, 11:11pm

Sorry I should have made that clear the results are inflated.

JamesParrott · March 21, 2025, 9:06am

If you’re sure that’s true, then great. But the test still rests on that assumption. PyPy may do something smarter in future, and the test would be both more robust, and compelling, if the calculation being benchmarked is non-trivial and could not mathematically be simplified to 0.

paugier · March 21, 2025, 11:45am

I slightly changed the benchmark code. Again, the results are not affected. However, the interesting subject here is not the robustness of this micro-benchmark but the fact that nearly 6 months after the release of 3.13, the CPython JIT is still very limited with tiny acceleration even on a very simple case extremely favorable for JIT compilers.

We also see that in the last figure of the Longitudinal results for Faster CPython (JIT versus tier 1).

I have to admit that I was hoping some signs of progress for 3.14.

One can argue “not ready”, “just infrastructure”, “retaining correctness and making sure it doesn’t hurt performance rather than making everything faster”, but seeing that the project is still stuck in this state is not great, especially when one compares with the promises (for example “5x in 4 years, 1.5 per year” in a 4 year-old document which can be found here) and with other interpreters of dynamic languages (Python, JavaScript and Julia).

To have a real impact on real codes, the CPython JIT should really accelerate simple pure Python codes. Considering how slow CPython is for such codes, x1.2 is still very bad and yes a reasonable target is x10, which would still be very slow compared to alternative Python interpreters and other dynamical languages.

It is mentioned in ideas/3.14 at main · faster-cpython/ideas · GitHub the possibility and necessity for 3.15 to move some code from C to Python for small built-in functions, but for that, the JIT needs to be really efficient.

Anyway, I wish all the best to the Faster CPython project and we will see what it is going to give for 3.14 and 3.15.

To be honest, I also think other projects, in particular PyPy, GraalPy and HPy, could have very interesting impacts for Python users in the long term. It seems to me that promises and beliefs (“CPython will soon be much faster” or “The CPython C API will soon be much more acceptable for alternative interpreters”) can have a strong negative effect for these projects.

brandtbucher · March 21, 2025, 5:11pm

I don’t see it at the link you provided, but I know the document you’re referring to, and it makes no promises at all. The very first lines read:

The overall aim is to speed up CPython by a factor of (approximately) five. We aim to do this in four distinct stages, each stage increasing the speed of CPython by (approximately) 50%.

This was a goal, not a promise.

Your interest in what exactly the JIT was doing made me hopeful that either I could learn something useful from you, or you could learn something useful from me. But I’m afraid this is just turning into another “Python is slow” rant, which is mostly just a waste of everyone’s time.

The simple reality is that Python has gotten over 50% faster in less than four years, due to a combination of paid and volunteer effort that costs Python’s users nothing. As somebody who has been working full-time on CPython performance during that entire period, please trust me when I say that it is a much harder problem (in both technical and social aspects) than you seem to believe.

When you say “a reasonable target is x10”, do you mean:

…reasonable to expect (for free)?
…reasonable to achieve? In how long, by how many people, costing how much, and breaking how much code?
…or just reasonable to want?

I’m sorry our progress during this time has disappointed you, but it seems that that’s mostly due to you being unaware of how hard of a problem this is, and perceiving optimistic goals as promises made by an entire project.

paugier · March 21, 2025, 10:29pm

I’m really sorry that my words upset you. It was not meant to be mean.

The word “promise” is not from me but from the third page of this file ideas/FasterCPythonDark.pdf at main · faster-cpython/ideas · GitHub

I work quite a lot on HPC with Python. For example, I’m the first author of this article in Nature Astro (pdf available here). So my message is far from being “Python is slow”. However, I know that CPU bounded computations in pure Python executed with CPython are indeed very inefficient. I also know that this has quite strong implications for the community, in particular that one avoids using too much the interpreter for all hot codes. This is also why 50% faster for pyperformance does not imply 50% faster for most real world applications, in particular in the fields of data and science.

I’m not at all an expert in the CPython internals but I have some ideas on why it’s so hard to improve in terms of performance. My point is really not to say that the faster CPython project is bad or people working on it incompetent.

However, I’m skeptical about the collective choice of not investing more in HPy (or something similar), which would also help for CPython performance in the long term.

When I wrote “a reasonable target is x10”, it’s really specifically for this micro-benchmark and of course it’s an order of magnitude. This order of magnitude is based on my knowledge of the performance with other tools (PyPy, GraalPy, Numba, Pythran, …) and other dynamical languages. PyPy gives x25 with only reasonable things (no loop unrolling, no complex code analysis and no SIMD for example). With function inlining, int unboxing and compilations of sufficiently long traces, you should get a decent speedup.

My feeling is that if you don’t get a decent speedup for such simple micro-benchmarks, I don’t see how you could get much better than 50%-100% on pyperformance, and I still hope CPython will go beyond that.

Of course, this is just feeling from a relatively advanced user knowing few stuffs on the subject. Nothing really interesting I admit. Thanks a lot for your interesting answer and more generally for your hard work on CPython perf. I’m going to follow the progresses in the next months and years.

TomFryers · March 26, 2025, 11:04pm

I can see what’s going on here with the word promise, so I’m going to take the liberty of interjecting in the hope of clearing things up.

Promise as a verb has (at least) two meanings in English. Wiktionary phrases the distinction well enough:

To commit to (some action or outcome), or to assure of such commitment; to make an oath or vow.
To give grounds for expectation, especially of something good.

I’m pretty confident Guido’s slides were using promise in this second sense. (For one, he has the hardware not to make errors of this sort.)

The problem comes when you write ‘especially when one compares with the promises’.

‘Promises’, as a plural noun, refers to things promised (1), never things promised (2). (If you promise (2) something, you haven’t made a promise, you’ve shown promise.)

That’s where things are going awry.

I apologise on behalf of my country: English is confusing. (At least Python isn’t like JS, with Promises everywhere adding to the mess!)

paugier · March 27, 2025, 9:52pm

Thanks @TomFryers for your explanations. Happy to know that the problem comes from me then .

However, I read in different places that some people really think that CPython will become much faster without deep changes in the ecosystem. It seems to me that people don’t realize that the CPython C API is a big issue for any major Python performance improvement.

I continued to experiment a bit more on measuring what can be obtained with alternative Python implementations (PyPy and GraalPy) and with two other dynamic languages (Javascript and Julia), so that I can give numbers.

The code and few results are there.

Comparing with CPython 3.14a6, on this simple benchmark

PyPy is ~20 times faster
GraalPy is ~40 times faster
Nodejs is ~28 times faster
Julia is ~800 times faster

On this very simple benchmark, Julia (LLVM) does an advanced optimization (related to the simplicity of the function short_calcul).

On a slightly less simple code (bench_brutal in the same directory), it is a bit less impressive and Julia is “only” 5 times faster than PyPy.

My point is not to claim that Python interpreters have to become as fast as Julia or that it is a big deal that they are slower.

I’d like to stress that my target for CPython “x10” for this simple benchmark is not that crazy.

And that the Python community should invest more on making alternative Python interpreters more usable and used in practice, which implies going towards a nicer C API for Python, less dependent on CPython implementation details.

johannesnoordanus · April 13, 2025, 1:29pm

Wouldn’t it be a good idea to work the given example (code) out for the python JIT interpreter, to see whats needed to get this to work? This will also give an indication what the upperbound of the JIT framework is for Python.

kj0 · May 26, 2025, 2:54pm

@paugier could you please share system specs? On my system (a relatively new i7-12700H on Ubuntu x86-64), I get a significant slowdown when enabling the JIT. I’m investigating to see if there’s a difference between new and old systems.

# JIT OFF
Number of long_calcul per second: 90.58 ± 2.90
# JIT ON
Number of long_calcul per second: 76.45 ± 1.19

I narrowed down about 8% to _TIER2_RESUME_CHECK, which forces a deopt out of jitted code pretty often.

paugier · May 28, 2025, 10:00pm

AMD Ryzen 7 7730U, Ubuntu x86_64.

3.14.0a6 (main, Mar 17 2025, 21:03:03) [Clang 20.1.0 ]
JIT Compiler: disabled
Number of long_calcul per second: 47.44 ± 0.26
JIT Compiler: enabled ✨
Number of long_calcul per second: 46.30 ± 0.49

kj0 · June 21, 2025, 5:05pm

Thank you. That’s what I expected. Interestingly, the JIT is slower than the interpreter when the interpreter is built with clang 20, but faster than the interpreter when the interpreter is built with GCC 11.4 on my system. Might be down to compiler differences.

@paugier I have some good news though. Once Mark’s PR on register allocation and my work on reference counting elimination in the JIT land, the JIT should get a roughly 10% speed boost over 3.14 in your benchmark.

This is with pyperf system tune, hence the lower numbers than the previous post. With JIT on:

3.14.0b3+ (heads/3.14:73e2089ed13, Jun 22 2025, 00:57:20) [GCC 11.4.0]
Number of long_calcul per second: 37.38 ± 0.63

3.15.0a0 (heads/moonshot-dirty:54812dd8fd4, Jun 22 2025, 00:49:24) [GCC 11.4.0]
Number of long_calcul per second: 42.96 ± 0.29

with JIT off:

3.15.0a0 (heads/moonshot-dirty:54812dd8fd4, Jun 22 2025, 00:49:24) [GCC 11.4.0]
Number of long_calcul per second: 40.42 ± 0.1

So somewhat good news I guess. Though we could still be doing a lot better. The JIT is no faster than the interpreter on clang 20 for me, which I would like to improve.