Using a register-based VM for CPython is not a new idea. I remember discussing it with Tim Peters probably 20 years ago. I think he commented that a register-based VM should give a few percent speedup just because it would do less “shuffling around” of data than the stack-based VM does. My profiling with perf showed the cost of this shuffling (e.g. INCREF/DECREF of local variables pushed onto the stack), I think. As with most optimizations, it won’t help all programs, and for many it may not help much. However, I think it is actually not that hard to implement, so it is perhaps worth considering.
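
To make the “shuffling” concrete, here is a rough sketch that uses the standard `dis` module to count how many instructions in a small function only move locals on or off the value stack. The function `add_mul` and the helper `count_local_shuffles` are made up for illustration, and the opcode names differ between CPython versions (newer releases fuse some loads and stores into superinstructions), so treat the count as indicative only.

```python
import dis


def add_mul(a, b, c):
    # Hypothetical example function: two arithmetic ops on locals.
    x = a + b
    return x * c


def count_local_shuffles(func):
    """Count instructions that copy locals to or from the value stack.

    Matches by substring so that fused variants of LOAD_FAST/STORE_FAST
    in newer CPython versions are counted too (each fused op counts once).
    """
    return sum(
        1
        for instr in dis.get_instructions(func)
        if "LOAD_FAST" in instr.opname or "STORE_FAST" in instr.opname
    )


total = sum(1 for _ in dis.get_instructions(add_mul))
shuffles = count_local_shuffles(add_mul)
print(f"{shuffles} of {total} instructions move locals on or off the stack")
```

In a register design, the arithmetic instructions would name their operands directly, so most of these moves would not be emitted at all.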
The last time I was thinking about this, I was working on the “compiler” package (a bytecode compiler for Python, written in Python). It would be quite a bit easier to prototype the register VM compiler if the compiler were written in Python. One possible design is to unify the registers and the local variables, so the bytecode treats them all as registers. Each code object would get a property that specifies the highest-numbered register it uses. The compiler can just allocate registers as it needs them, so you don’t have to deal with the problem of spilling registers into memory.
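
A minimal sketch of that allocation scheme, assuming a made-up three-address instruction format (`("ADD", dest, left, right)`) that is not any real CPython bytecode. Locals take the first register numbers, temporaries are handed out on demand, and the compiler reports how many registers the code needs. `RegisterCompiler` and `compile_expr` are hypothetical names, and only `+`, `-` and `*` on local names are handled.

```python
import ast


class RegisterCompiler(ast.NodeVisitor):
    """Toy compiler from an expression to register code (illustrative only)."""

    OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}

    def __init__(self, local_names):
        # Locals and registers are unified: local i simply *is* register i.
        self.reg_of = {name: i for i, name in enumerate(local_names)}
        self.next_reg = len(local_names)
        self.code = []

    def new_reg(self):
        # Allocate a fresh temporary; registers are never reused or spilled.
        reg = self.next_reg
        self.next_reg += 1
        return reg

    def visit_Name(self, node):
        # Reading a local costs no instruction, just its register number.
        return self.reg_of[node.id]

    def visit_BinOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        dest = self.new_reg()
        self.code.append((self.OPS[type(node.op)], dest, left, right))
        return dest


def compile_expr(source, local_names):
    compiler = RegisterCompiler(local_names)
    tree = ast.parse(source, mode="eval")
    result = compiler.visit(tree.body)
    compiler.code.append(("RETURN", result))
    # next_reg is the register count the code object would need to record.
    return compiler.code, compiler.next_reg


code, nregs = compile_expr("(a + b) * c", ["a", "b", "c"])
for instr in code:
    print(instr)
print("registers needed:", nregs)
```

This never reuses a temporary, which wastes registers but keeps the compiler trivial. A real prototype would presumably free temporaries once they are dead.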
I believe the key problem I ran into years ago was that we could not statically analyze how many registers would be required (i.e. the compiler could not determine the bytecode stack size). That was because the exception-handling bytecodes had runtime stack effects. That problem has since been solved (see bpo-17611: Move unwinding of stack for “pseudo exceptions” from interpreter to compiler).
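
As a small illustration, current CPython already stores a compile-time stack bound on every code object as `co_stacksize`, next to the local count `co_nlocals`. A register count would presumably be recorded the same way. The function `handler` below is just a hypothetical example with an exception handler in it, and the exact `co_stacksize` value varies between versions.

```python
def handler(x):
    # Hypothetical function with a try/except block.
    try:
        return 1 / x
    except ZeroDivisionError:
        return 0


code_obj = handler.__code__
# Both values are fixed when the function is compiled, not at run time.
print("stack slots:", code_obj.co_stacksize)
print("local variables:", code_obj.co_nlocals)
```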