To clarify, the items in that list are just the things that we identified a little while back as worth pursuing for the 3.13 release (beta freeze is in 3 weeks). Realistically, the only remaining task I think we’ll have time for is top-of-stack caching, which I’m working on now. This is one where we’re not exactly sure how much it will win us, so we’re just going to do it and measure. It’s essentially cutting down tons of memory traffic and avoiding trips to and from the heap for things that could just be kept in machine registers, so I’m hopeful that the results will be significant… and there’s a chance that it would pay off even more with Clang 19’s new preserve_none calling convention, which it looks like we’ll need to wait for.
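To give a feel for what top-of-stack caching buys (this is a toy Python model, not CPython's actual C implementation, and the op names are made up): a naive stack machine reads and writes the in-memory stack on every micro-op, while a caching one keeps the top value in a local variable — the stand-in for a machine register — and only touches the stack when it has to spill.

```python
# Toy stack machine illustrating top-of-stack (TOS) caching.
# "traffic" counts stack reads/writes, standing in for memory traffic.

def run_naive(ops):
    """Every micro-op goes through the in-memory stack."""
    stack, traffic = [], 0
    for op, arg in ops:
        if op == "PUSH":
            stack.append(arg); traffic += 1
        elif op == "ADD":
            b = stack.pop(); a = stack.pop(); traffic += 2
            stack.append(a + b); traffic += 1
    return stack[-1], traffic

def run_tos_cached(ops):
    """Keep TOS in a local; spill to the stack only when necessary."""
    stack, tos, traffic = [], None, 0
    for op, arg in ops:
        if op == "PUSH":
            if tos is not None:
                stack.append(tos); traffic += 1  # spill the old TOS
            tos = arg                            # new TOS stays in the "register"
        elif op == "ADD":
            a = stack.pop(); traffic += 1
            tos = a + tos                        # result never leaves the "register"
    return tos, traffic

ops = [("PUSH", 1), ("PUSH", 2), ("ADD", None), ("PUSH", 3), ("ADD", None)]
```

Both versions compute the same result, but the cached one does fewer than half the stack accesses on this little trace — the hope is that the same effect shows up as real memory traffic saved in the compiled micro-ops.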
JIT-compiling the micro-ops also allows us greater freedom to tweak the micro-op format itself, since we no longer care about efficiently interpreting them. This isn’t something we’ve seriously explored yet. Which brings me to your next question…
We have a couple different ways of resolving this, yes. Consider the two different “kinds” of optimization opportunities across micro-ops:
The first kind is high-level optimizations at the “Python” level, like removing type checks, propagating constants, inlining functions, and so on, where we can prove that these things are safe to do. In this “middle-end” stage (after hot traces are identified and translated into micro-ops, but before compilation), we absolutely have the opportunity to reason “across” micro-ops, and we have many opportunities to replace them with more efficient ones, or remove them entirely. This can range from small things like skipping refcounting for known immortal values, to larger things like turning global loads into constants if we can prove they’re unmodified.
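As a rough sketch of what such a middle-end pass might look like (the op names here are hypothetical, not CPython's real micro-op set), here's a toy pass that drops a type guard once the same guard has already run, and turns a global load into a constant load given a snapshot of globals we can prove (or guard) to be unmodified:

```python
# Toy "middle-end" pass over a micro-op trace. Hypothetical op names.

def optimize(trace, globals_snapshot):
    known_guards = set()  # local indices whose int guard has already run
    out = []
    for op, arg in trace:
        if op == "_GUARD_LOCAL_INT":
            if arg in known_guards:
                continue              # provably redundant: remove entirely
            known_guards.add(arg)
        elif op == "_LOAD_GLOBAL" and arg in globals_snapshot:
            # safe only if we can prove (or guard) that globals are unmodified
            op, arg = "_LOAD_CONST", globals_snapshot[arg]
        out.append((op, arg))
    return out

trace = [
    ("_GUARD_LOCAL_INT", 0),
    ("_LOAD_FAST", 0),
    ("_GUARD_LOCAL_INT", 0),   # redundant: same guard already ran
    ("_LOAD_FAST", 0),
    ("_LOAD_GLOBAL", "len"),   # becomes a constant load
    ("_BINARY_ADD", None),
]
```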
The second kind is low-level optimizations at the machine code level. Here, we actually can allow the LLVM build step to optimize across micro-ops, by creating “superinstructions” that are made up of common pairs or triples of micro-op instructions. So for a common triple like loading a local variable of a known type, converting it to a boolean, and popping the result and branching on its truthiness, we can “smush” those ops together, and give LLVM more opportunities to optimize those three operations together. When compiling, we can replace runs of those three instructions with the “super-micro-op” that combines them.
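The replacement step itself is just a peephole over the trace. Something like this toy version (again, the fused-op name is invented for illustration; in the real pipeline the fused body would be a single C case combining the three ops, which LLVM can then optimize as one unit):

```python
# Toy peephole that replaces a common run of three micro-ops with one
# fused "superinstruction". Op names are hypothetical.

FUSIONS = {
    ("_LOAD_FAST", "_TO_BOOL", "_POP_JUMP_IF_FALSE"):
        "_LOAD_FAST_TO_BOOL_POP_JUMP_IF_FALSE",
}

def fuse(trace):
    out, i = [], 0
    while i < len(trace):
        triple = tuple(op for op, _ in trace[i:i + 3])
        if triple in FUSIONS:
            # carry the opargs of all three components along with the fused op
            out.append((FUSIONS[triple], tuple(arg for _, arg in trace[i:i + 3])))
            i += 3
        else:
            out.append(trace[i])
            i += 1
    return out

trace = [
    ("_LOAD_FAST", 0),
    ("_TO_BOOL", None),
    ("_POP_JUMP_IF_FALSE", 42),
    ("_LOAD_CONST", 1),
]
```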
There are more opportunities too (like creating variants of common micro-ops with hard-coded opargs, or raising/lowering the level of abstraction for individual micro-ops), but I see that this reply is starting to turn into a wall of text, so I’ll stop here.