To clarify, the items in that list are just the things that we identified a little while back as worth pursuing for the 3.13 release (beta freeze is in 3 weeks). Realistically, the only remaining task I think we’ll have time for is top-of-stack caching, which I’m working on now. This is one where we’re not exactly sure how much it will win us, so we’re just going to do it and measure. It’s essentially cutting down tons of memory traffic and avoiding trips to and from the heap for things that could just be kept in machine registers, so I’m hopeful that the results will be significant… and there’s a chance that it would pay off even more with Clang 19’s new preserve_none calling convention, which it looks like we’ll need to wait for.
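To give a feel for what top-of-stack caching buys (this is a toy Python model, not CPython's actual C implementation, and the op names are made up): a naive stack machine reads and writes the in-memory stack on every micro-op, while a caching one keeps the top value in a local variable — the stand-in for a machine register — and only touches the stack when it has to spill.

```python
# Toy stack machine illustrating top-of-stack (TOS) caching.
# "traffic" counts stack reads/writes, standing in for memory traffic.

def run_naive(ops):
    """Every micro-op goes through the in-memory stack."""
    stack, traffic = [], 0
    for op, arg in ops:
        if op == "PUSH":
            stack.append(arg); traffic += 1
        elif op == "ADD":
            b = stack.pop(); a = stack.pop(); traffic += 2
            stack.append(a + b); traffic += 1
    return stack[-1], traffic

def run_tos_cached(ops):
    """Keep TOS in a local; spill to the stack only when necessary."""
    stack, tos, traffic = [], None, 0
    for op, arg in ops:
        if op == "PUSH":
            if tos is not None:
                stack.append(tos); traffic += 1  # spill the old TOS
            tos = arg                            # new TOS stays in the "register"
        elif op == "ADD":
            a = stack.pop(); traffic += 1
            tos = a + tos                        # result never leaves the "register"
    return tos, traffic

ops = [("PUSH", 1), ("PUSH", 2), ("ADD", None), ("PUSH", 3), ("ADD", None)]
```

Both versions compute the same result, but the cached one does fewer than half the stack accesses on this little trace — the hope is that the same effect shows up as real memory traffic saved in the compiled micro-ops.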
JIT-compiling the micro-ops also allows us greater freedom to tweak the micro-op format itself, since we no longer care about efficiently interpreting them. This isn’t something we’ve seriously explored yet. Which brings me to your next question…
We have a couple different ways of resolving this, yes. Consider the two different “kinds” of optimization opportunities across micro-ops:
The first kind is high-level optimizations at the “Python” level, like removing type checks, propagating constants, inlining functions, and so on, where we can prove that these things are safe to do. In this “middle-end” stage (after hot traces are identified and translated into micro-ops, but before compilation), we absolutely have the opportunity to reason “across” micro-ops, and we have many opportunities to replace them with more efficient ones, or remove them entirely. This can range from small things like skipping refcounting for known immortal values, to larger things like turning global loads into constants if we can prove they’re unmodified.
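As a rough sketch of what such a middle-end pass might look like (the op names here are hypothetical, not CPython's real micro-op set), here's a toy pass that drops a type guard once the same guard has already run, and turns a global load into a constant load given a snapshot of globals we can prove (or guard) to be unmodified:

```python
# Toy "middle-end" pass over a micro-op trace. Hypothetical op names.

def optimize(trace, globals_snapshot):
    known_guards = set()  # local indices whose int guard has already run
    out = []
    for op, arg in trace:
        if op == "_GUARD_LOCAL_INT":
            if arg in known_guards:
                continue              # provably redundant: remove entirely
            known_guards.add(arg)
        elif op == "_LOAD_GLOBAL" and arg in globals_snapshot:
            # safe only if we can prove (or guard) that globals are unmodified
            op, arg = "_LOAD_CONST", globals_snapshot[arg]
        out.append((op, arg))
    return out

trace = [
    ("_GUARD_LOCAL_INT", 0),
    ("_LOAD_FAST", 0),
    ("_GUARD_LOCAL_INT", 0),   # redundant: same guard already ran
    ("_LOAD_FAST", 0),
    ("_LOAD_GLOBAL", "len"),   # becomes a constant load
    ("_BINARY_ADD", None),
]
```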
The second kind is low-level optimizations at the machine code level. Here, we actually can allow the LLVM build step to optimize across micro-ops, by creating “superinstructions” that are made up of common pairs or triples of micro-op instructions. So for a common triple like loading a local variable of a known type, converting it to a boolean, and popping the result and branching on its truthiness, we can “smush” those ops together, and give LLVM more opportunities to optimize those three operations together. When compiling, we can replace runs of those three instructions with the “super-micro-op” that combines them.
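The replacement step itself is just a peephole over the trace. Something like this toy version (again, the fused-op name is invented for illustration; in the real pipeline the fused body would be a single C case combining the three ops, which LLVM can then optimize as one unit):

```python
# Toy peephole that replaces a common run of three micro-ops with one
# fused "superinstruction". Op names are hypothetical.

FUSIONS = {
    ("_LOAD_FAST", "_TO_BOOL", "_POP_JUMP_IF_FALSE"):
        "_LOAD_FAST_TO_BOOL_POP_JUMP_IF_FALSE",
}

def fuse(trace):
    out, i = [], 0
    while i < len(trace):
        triple = tuple(op for op, _ in trace[i:i + 3])
        if triple in FUSIONS:
            # carry the opargs of all three components along with the fused op
            out.append((FUSIONS[triple], tuple(arg for _, arg in trace[i:i + 3])))
            i += 3
        else:
            out.append(trace[i])
            i += 1
    return out

trace = [
    ("_LOAD_FAST", 0),
    ("_TO_BOOL", None),
    ("_POP_JUMP_IF_FALSE", 42),
    ("_LOAD_CONST", 1),
]
```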
There are more opportunities too (like creating variants of common micro-ops with hard-coded opargs, or raising/lowering the level of abstraction for individual micro-ops), but I see that this reply is starting to turn into a wall of text, so I’ll stop here.