Hello all, we on the Pyston team would like to propose the contribution of our JIT into CPython main. We’re interested in some initial feedback on this idea before putting in the work to rebase the jit to 3.12 for a PEP and more formal discussion.
Our jit is designed to be simple and to generate code quickly, so we believe it’s a good point on the design tradeoff curve for potential inclusion. The runtime behavior is intentionally kept almost completely the same as the interpreter, just lowered to machine code and with optimizations applied.
Our jit currently targets Python 3.7-3.10, and on 3.8 it achieves a 10% speedup on macrobenchmarks (similar to 3.11). It’s hard to estimate the potential speedup of our jit rebased onto 3.12 because there is overlap between what our jit does and the optimizations that have gone into the interpreter since 3.8, but there are several optimizations that would be additive with the current performance work:
Eliminating bytecode dispatch overhead
Mostly-eliminating stack management overhead
Reducing the number of reference count operations in the interpreter
Faster function calls, particularly of C functions
More specialization opportunities, both because a jit is not constrained by bytecode limits and because it can do dynamic specializations that are not possible in an interpreter context
There is also room for more optimizations: in Pyston we’ve co-optimized the interpreter+jit combination, for example by doing more extensive profiling in the interpreter. Our plan would be to submit an initial version that does not contain these optimizations in order to minimize the diff, and add them later.
Our jit uses the DynASM assembler library (part of LuaJIT) to generate machine code. Our jit currently supports Mac and Linux, 64-bit ARM and x86_64. Now that we have two architectures supported, adding additional ones is not too much work.
We think that our jit fits nicely in the technical roadmap of the Faster CPython project, but conflicts with their plan to build a new custom tracing jit.
As mentioned, we’d love to get feedback about the overall appetite for including a jit in CPython!
I think that this is all very exciting and I look forward to seeing progress in the future. Thank you for your efforts!
What sort of overhead does the JIT have? I know that for short scripts, PyPy’s JIT doesn’t get a chance to kick in, so if the overhead is not negligible, there should be a switch to control whether it runs or not.
How practical would it be for the JIT to be selectable at interpreter start up?
Assuming that the two JITs have different strengths and weaknesses, and are better or worse for different code patterns, people could choose whether to run the Pyston JIT or the tracing JIT depending on which one works better for their application. Duelling JITs
Our JIT (I’m one of the Pyston devs) has very low startup overhead. It compiles one Python function at a time, and simply emits the machine code for each opcode in the function one after another, with no IR and no optimization passes.
I recommend checking out our pyston-lite extension module (available for CPython 3.7-3.10; see the Pyston-lite announcement) to get a feel for how low the JIT’s overhead is (in startup performance and memory usage).
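To make the "one opcode after another, no IR" idea concrete, here is a minimal sketch in Python rather than machine code. It is only an analogy and not how Pyston is implemented: the opcode names, the `TEMPLATES` table, and both functions are hypothetical. The interpreter decodes every opcode on every run, while the "compiler" pastes a fixed snippet per opcode into one straight-line function, so the dispatch happens once at compile time.

```python
# Toy stack-machine opcodes. These names are illustrative assumptions,
# not CPython bytecode and not Pyston's code generator.
TEMPLATES = {
    "LOAD_CONST": "stack.append({arg!r})",
    "LOAD_ARG": "stack.append(args[{arg}])",
    "ADD": "b = stack.pop(); a = stack.pop(); stack.append(a + b)",
    "MUL": "b = stack.pop(); a = stack.pop(); stack.append(a * b)",
    "RETURN": "return stack.pop()",
}


def interpret(code, args):
    """Baseline: a dispatch loop that decodes each opcode on every run."""
    stack = []
    for op, arg in code:
        if op == "LOAD_CONST":
            stack.append(arg)
        elif op == "LOAD_ARG":
            stack.append(args[arg])
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        elif op == "RETURN":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
    raise ValueError("program ended without RETURN")


def compile_function(code):
    """Emit one template per opcode, in order, with no IR and no passes."""
    lines = ["def compiled(args):", "    stack = []"]
    for op, arg in code:
        lines.append("    " + TEMPLATES[op].format(arg=arg))
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["compiled"]


# Computes args[0] * 3 + args[1].
PROGRAM = [
    ("LOAD_ARG", 0),
    ("LOAD_CONST", 3),
    ("MUL", None),
    ("LOAD_ARG", 1),
    ("ADD", None),
    ("RETURN", None),
]
```

Calling `compile_function(PROGRAM)((2, 5))` and `interpret(PROGRAM, (2, 5))` gives the same result; the compiled version simply never looks at an opcode name at run time.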
Note: If you set the environment variable SHOW_JIT_STATS=1 it will print some compilation stats on exit.
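As a rough sketch of how one might set that variable for a single run, the helper below launches a child Python process with `SHOW_JIT_STATS=1` in its environment. The function name `run_with_jit_stats` is made up for illustration, and enabling the JIT inside the child process (for example by having pyston-lite installed) is assumed to be handled separately.

```python
import os
import subprocess
import sys


def run_with_jit_stats(python_args):
    """Run `python <python_args>` with SHOW_JIT_STATS=1 set for the child only."""
    # Copy the current environment so the parent process is left untouched.
    env = dict(os.environ, SHOW_JIT_STATS="1")
    return subprocess.run(
        [sys.executable, *python_args],
        env=env,
        capture_output=True,
        text=True,
    )
```

For example, `run_with_jit_stats(["my_script.py"])` (with `my_script.py` standing in for your own program) returns a `CompletedProcess`. Whether the stats land on stdout or stderr is not stated above, so check both `result.stdout` and `result.stderr`.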
Also, now that CPython is using the adaptive interpreter, I assume that some of the profiling/warmup counters it uses could be reused by the JIT compiler, which would likely lower the additional runtime overhead the JIT introduces by a small amount.
It should be possible to support switching between different JITs at startup - the main interpreter will just have to decide which JIT to branch into. But I think it only makes sense to implement that once there is a second JIT, because otherwise it will likely just complicate things with a lot of additional abstraction layers. I don’t think it’s desirable to have two general JITs integrated (because of the maintenance complexity).
In general, my experience with pyston(-lite) has shown that implementing a JIT as an extension requires copying in a lot of code from the interpreter, and that the JIT compiler really needs access to a lot of low-level implementation details of CPython in order to be fast (and for some things it just can’t be as fast as having it directly in the main executable)…
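A warmup counter of this kind can be sketched in a few lines. This is a hypothetical illustration, not CPython's or Pyston's actual mechanism: the class name `TieredFunction`, the `compile_fn` callback, and the threshold value are all assumptions. The point is that a single per-function counter can decide when to hand the function to a compiler, so a counter the interpreter already maintains would not need to be duplicated.

```python
class TieredFunction:
    """Run `func` in the slow tier until it is hot, then switch to a compiled tier."""

    def __init__(self, func, compile_fn, threshold=3):
        self.func = func              # slow-tier implementation
        self.compile_fn = compile_fn  # hypothetical compiler: func -> faster callable
        self.threshold = threshold    # number of calls before compiling
        self.calls = 0                # the warmup counter
        self.compiled = None

    def __call__(self, *args):
        # Fast path: once compiled, no counting is needed any more.
        if self.compiled is not None:
            return self.compiled(*args)
        self.calls += 1
        if self.calls >= self.threshold:
            # The function is hot: compile it for subsequent calls.
            self.compiled = self.compile_fn(self.func)
        return self.func(*args)
```

Short-lived scripts whose functions never reach the threshold pay only for the increment and comparison, which is one way to keep startup overhead low.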
Thanks for working on Pyston. I have a question regarding the speedups. In a blog post last year, the speedup for the full Pyston (CPython 3.8 fork) is more than double that of JIT-only. Why is the disparity so large? Are there optimisations to the interpreter, orthogonal to the JIT, that are driving these speedups?
We made quite a few improvements to the non-interpreter parts of CPython. One big class of optimizations was pre-compiling a bunch of the format strings, such as those passed to PyArg_ParseTuple and PyErr_Format. These are essentially additional little interpreters inside of CPython, and they are quite slow. For example, Django templates frequently test whether a string is an integer by calling int() on it, and the vast majority of that time is spent formatting the message string for the exception it throws. I believe PyArg_ParseTuple has mostly been eliminated from inside CPython; PyErr_Format is still pretty common. There are other optimizations we made but this is the first that comes to mind.
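The int() pattern described above looks roughly like the sketch below. The helper names are hypothetical and this is not Django's actual code; it only shows the shape of the check and a way to time the success and failure paths yourself. No timings are claimed here, since they depend on the Python version and machine.

```python
import timeit


def is_integer_string(value):
    """Test whether `value` parses as an integer by calling int() on it."""
    try:
        int(value)
    except ValueError:
        # On failure CPython has already built the full error message,
        # even though this caller immediately discards it.
        return False
    return True


def time_check(value, number=100_000):
    """Return the total seconds taken by `number` calls of the check on `value`."""
    return timeit.timeit(lambda: is_integer_string(value), number=number)
```

Comparing `time_check("42")` with `time_check("abc")` gives a feel for how much the failing path costs relative to the successful one.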
I looked through the changes we made and tried to see if any were upstreamable, but generally they all increased complexity and may or may not be wanted by the core devs. I tried submitting a few of them, but the reception was discouraging so I stopped.
Yes definitely, there would have to be a compile-time switch because I think any JIT wouldn’t support every single platform that CPython currently supports (or works on but doesn’t officially support).
I can see a path where we use some of the technology in Pyston, but not the JIT as is.
We will want a JIT compiler, but I expect that JIT to be dumber than Pyston currently is.
This is because we want to do more optimizations at a higher-level, where they are fully portable.
What I want from a JIT is the ability to take a sequence of optimized bytecode/IR, and convert that to machine code doing only low-level optimizations (simple constant propagation, etc), register allocation and code gen.
Would Pyston be a useful tool for this? If so, then I’m interested.
If it replaces, rather than complements, the high-level optimizations we already implement and plan to implement, then I see less value in it.
If you still want to port Pyston to 3.12, but lack the VM support needed, please suggest hooks and plugin APIs that would help.
The Cinder team are actively doing that: we now have support for dict, class and function watchers. Your input could help improve and extend those APIs.
I think the Pyston JIT is what you are asking for, but I suppose it depends on what you are imagining for an IR. If you’re planning on going all the way down to an LLVM-ish level, then no, Pyston is not really about that, but then you might be interested in some of the optimizations that Pyston does in our version of that process. If you have an IR that is suitable for interpretation, then I imagine it would be somewhat high level and Pyston would be a good match.
But I think the main point is that if you are comparing where Pyston is now to a potential thing you might build in the future, then yes it will necessarily seem a bit lacking.
I guess my question is: what, if anything, would it take for you to accept a JIT contribution?
For PyArg_ParseTuple our approach is just to convert to argument clinic. Argument clinic code is far far faster at parsing arguments in every case we saw – I don’t remember the numbers but I think it’s many times faster. It’s been a little while now but I believe that when I checked, all of the performance sensitive functions had already been converted to argument clinic in main.
PyErr_Format still remains; we didn’t build any automation for it, so we just manually wrote out the handling for the two hottest sites: int() failure, and attribute errors. Yes it’s ugly but it’s a meaningful speedup. I looked into speeding up PyErr_Format but a good chunk of the time is spent dispatching on the format string characters, so I didn’t see an easy win.