Use register-based VM rather than stack-based

The idea of using a register-based VM for CPython is not a new idea. I remember discussing it with Tim Peters probably 20 years ago. I think he commented that a register-based VM should give a few percent speedup just because it will do less “shuffling around” of data vs with the stack. My profiling with perf showed the cost of this shuffling (e.g. INCREF/DECREF of local variables put on stack), I think. As with most optimizations, it won’t help for all programs and maybe won’t help much for many. However, I think it actually not that hard to implement. So, perhaps worth consideration.

The last time I was thinking about this, I was working on the “compiler” package (bytecode compiler for Python written in Python). It would be quite a bit easier to prototype the register VM compiler if the compiler is written in Python. One possible design is to unify the registers and the local variables. So, they would all be treated as registers for the bytecode. The code objects would get a property that specifies the max register used by it. The compiler can just allocate registers as it needs them. You don’t have to deal with the problem of spilling registers into memory.

I believe the key problem I ran into years ago was that we could not statically analyze how many registers would be required (i.e. bytecode stack size cannot be determined by compiler). That was due to exception handling bytecodes having runtime stack effects. That problem has been solved (see bpo-17611: Move unwinding of stack for “pseudo exceptions” from interpreter to compiler).

2 Likes

Victor has pointed out that he tried implemented something like this and it was faster:
https://faster-cpython.readthedocs.io/registervm.html

I also remember that there was a project “Rattlesnake” done years ago, I think by Skip Montanaro. In the case of Rattlesnake, there was not a new compiler but a translator that took stack VM bytecode and translated it to register VM bytecode. A web search can’t find anything about the Rattlesnake VM so I uploaded a tarfile I had on my PC:
http://python.ca/nas/python/rattlesnake20010813/

1 Like

IIRC PyPy’s bytecode compiler does an exact analysis of how many stack slots it needs. Might be useful for reference.

One side effect of potentially switching to a stack-based VM is that it will make bytecode translation to other VMs harder. E.g. Pyjion was able to translate CPython bytecode to .NET IL thanks to both being stack-based. I also know WebAssembly is stack-based.

Not saying this a reason to not look into this and not switch if there’s a perf bump, but FYI.

Conversely, Numba computes a function’s dataflow by running an abstract execution pass over Python bytecode, trying to resolve stack offsets into fixed “variables”, and then generates a SSA-like representation of the code. A register-based interpreter would simplify that work a bit. So it definitely goes both ways.

General remark: I only this thread on discuss.python.org after I replied to python-dev mailing… Oh :frowning: There are now two ongoing discussions in parallel (at 2 different places) on the same topic. Now I’m not sure if I should continue the discussion here or on python-dev…

<digression> I would suggest to move a discussion to Discourse, when they’re happening on both the mailing list and Discourse – there’s obviously personal bias here. </digression>