A fast, free threading Python

PEP 703 is touted as offering free-threading in CPython at almost no performance cost. But building a fast, free threading CPython isn’t as simple as just accepting a PEP.

So, what would it take to have a fast, free-threading Python?

The short answer:

A lot of developer time, which means a substantial amount of money from somewhere.
This can happen, but there needs to be sufficient commercial interest in free-threading Python.

The long answer:

There are three important aspects of Python to consider if we want fast, free-threading in CPython:

  • Single-threaded performance
  • Parallelism
  • Mutability

The last of these, mutability, seems to have been largely ignored in the previous discussion, but it is key.

Almost all of the work done on speeding up dynamic languages stems from the original work done on the Self programming language,
but with many refinements. Most of this work has been on JavaScript, which is single-threaded and less mutable than Python, but PyPy has also made some significant contributions. PyPy has a GIL.

There is also work on Ruby. Ruby is even more mutable than Python, but also has a GIL. Work on Ruby mirrors that on JavaScript.

The JVM supports free threading, but Java objects are almost immutable compared to Python objects, and the performance of Python, JavaScript and Ruby implementations on the JVM has been disappointing, historically.

Performing the optimizations necessary to make Python fast in a free-threading environment will need some original research. That makes it more costly and a lot more risky.

Specialization

The key to speeding up dynamic languages is specialization.
Instead of performing general operations all the time, the VM adapts the code to perform more specialized operations.

The specialized form performs one or more checks and an action which, in combination, are equivalent to the general operation.
This guard + action sequence is faster and gathers information that enables further optimizations.
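To make the guard + action idea concrete, here is a toy model in pure Python. CPython’s real specializations are interpreter internals (e.g. the BINARY_OP instruction adapting to BINARY_OP_ADD_INT), but the shape is the same:

```python
def binary_add_generic(a, b):
    # The general operation: full dynamic dispatch via __add__/__radd__.
    return a + b

def guard_both_int(a, b):
    # Guard: a cheap check that the specialization still applies.
    return type(a) is int and type(b) is int

def int_add(a, b):
    # Action: the fast path, valid only while the guard holds.
    return int.__add__(a, b)

def binary_add_specialized(a, b):
    # Guard + action is, in combination, equivalent to the general op.
    if guard_both_int(a, b):
        return int_add(a, b)
    # Deoptimize: fall back to the general operation when the guard fails.
    return binary_add_generic(a, b)
```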

We can compose these specialized forms into larger regions, as we are doing for 3.13, such that many of the guards become redundant:

    guardA actionB guardA actionC guardA actionD

can be simplified to

    guardA actionB actionC actionD

further improving performance.
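In code form, with placeholder guards and actions (guard_a, action_b and friends are illustrative stand-ins, not real interpreter functions):

```python
def guard_a(x): return type(x) is int   # e.g. a type check
def action_b(x): pass                   # specialized actions; placeholders here
def action_c(x): pass
def action_d(x): pass

def region_before(x):
    if guard_a(x): action_b(x)
    if guard_a(x): action_c(x)
    if guard_a(x): action_d(x)

# If nothing between the checks can invalidate guard_a, the region
# simplifies to a single check:
def region_after(x):
    if guard_a(x):
        action_b(x)
        action_c(x)
        action_d(x)
```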

Specialization (without which all other optimizations are near worthless) works fine with just mutability (we know what the guards and actions do, so we know what they mutate) and it works fine with just parallelism (guardA remains true regardless).
The problem is having both mutability and parallelism. It is possible that guardA is invalid before actionB happens because the object of interest can change, and it can happen on another thread at any time.
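A Python-level sketch of that hazard (in CPython the real guard and action are interpreter internals, executed today under the GIL’s protection, but the check-then-act race has the same shape; the names are illustrative):

```python
import threading

class Point:
    def __init__(self):
        self.x = 1

p = Point()

def specialized_load(obj):
    # guardA: obj still looks like what we specialized for.
    if type(obj) is Point and hasattr(obj, "x"):
        # <-- with free threading, another thread can run here,
        #     between the guard and the action...
        return obj.x                      # actionB: the fast path
    return getattr(obj, "x", None)        # generic fallback

def mutator():
    # ...and this is perfectly legal Python, runnable at any moment:
    del p.x          # or: p.__class__ = Other, Point.x = property(...)

threading.Thread(target=mutator).start()
# specialized_load(p) can now pass guardA yet raise at actionB.
```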

Integrated design

We already practice some integrated design; for example, the layout of plain Python objects in 3.11 is co-designed with the specialization of the LOAD_ATTR instruction.
We will need to broaden this to consider free-threading. We need to design with specialization (and other optimizations) in mind,
as well as resilience to race conditions. Considering either in isolation will simply not work.
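A rough model of that 3.11 co-design (version_tag, shared_keys and values are illustrative names, not CPython’s actual fields):

```python
# Instances of one class share a keys table; each instance stores only
# a compact values array. LOAD_ATTR can then specialize to "check the
# class's version tag, then index the array".

class ShapedClass:
    version_tag = 1                  # bumped whenever the class is mutated
    shared_keys = {"x": 0, "y": 1}   # attribute name -> slot index

class Instance:
    def __init__(self, x, y):
        self.values = [x, y]         # per-instance storage only

def load_attr_specialized(obj, expected_version, index):
    if ShapedClass.version_tag == expected_version:   # guard
        return obj.values[index]                      # action: array indexing
    raise LookupError("deoptimize: fall back to generic lookup")

# The specializer recorded version 1 and slot 0 for "x":
print(load_attr_specialized(Instance(10, 20), expected_version=1, index=0))  # 10
```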

It is possible to do this, but it will be expensive and may need some compromises on one of the three axes; performance,
parallelism, and mutability.

It isn’t quite:

Performance, parallelism, mutability: pick two.

but more like:

Performance, parallelism, mutability: pick one to restrict.

It is possible that the VM can change which compromise is chosen depending on the circumstances at runtime,
but the compromise will always be there, and making the implementation more complex makes it buggier and more expensive.

Like many systems, the cost and complexity are not additive but multiplicative in terms of the parts.
Designing data structures and algorithms to support both specialization and free-threading is much harder than supporting either one alone.

Cycle GC

This is a problem for single-threaded performance, but free threading makes the problem worse and complicates the solution.

As a program goes faster, it will produce garbage faster. We need to either speed up the cycle collector, or produce less garbage. Ideally both.
Partial evaluation, an optimization that builds on specialization and doesn’t like parallel mutation, can reduce the amount of garbage produced.
We hope that, with improved GC and partial evaluation, the cycle GC can scale with the single-threaded performance improvements we expect for 3.13.

Free-threading produces more garbage. 10 threads running in parallel will produce 10 times as much garbage.
Without radical changes to the cycle GC, this will kill performance.
Spending 10% of the execution time in the cycle GC is fairly typical for single-threaded Python programs; that means the GC needs one unit of time for every nine units of mutator time.
With 9 threads running, and a single-threaded cycle GC, 50% of execution time would be spent in GC, with the threads idling half the time.
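The arithmetic behind that figure, as a quick sanity check (an illustration, not a measurement):

```python
# Single-threaded baseline: 10% of time in GC means the GC needs
# one unit of time for every nine units of mutator time.
gc_per_mutator = 0.10 / 0.90           # = 1/9

threads = 9
mutator_wall = 1.0                     # wall time with all 9 threads running
gc_wall = threads * mutator_wall * gc_per_mutator   # = 1.0, single-threaded GC

gc_fraction = gc_wall / (mutator_wall + gc_wall)
print(f"{gc_fraction:.0%} of wall time in GC")      # -> 50%
```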

This is a solvable problem, but it is a hard one.

The C API and ABI

Free threading will need an ABI break.
It will also need API changes, although it is not clear what yet.

Probably, all C structs will need to be opaque, as parallel modification of those structs
is a hazard that it is unreasonable to expect third-party developers to handle.

Returning borrowed references is already unsafe. It becomes extra unsafe with free threading,
so all C API functions that return borrowed references will need to be replaced.

Summary

If there is the will to have fast, free threading, we can do it.

Long term financial backing and the right people will be needed.
I believe we already have the right people; we just need the backing.

There are three options available to the Steering Council:

  1. Choose single-threaded performance: we ultimately expect a 5x speedup over 3.10.
  2. Choose free-threading: NoGIL appears to scale well, but expect worse single-threaded performance.
  3. Choose both.

The pros and cons of the three options

Option 1:

  • Pros: We know how to do this. We have a plan. It will happen. Sub-interpreters allow some limited form of parallelism.
  • Cons: It keeps the GIL, preventing free-threading.

Option 2:

  • Pros: Free threading; much better performance for those who can use multiple threads effectively.
  • Cons: Some unknowns. Worse performance for most people, as single-threaded performance will be worse. We don’t know how much worse.

Option 3:

  • Pros: Gives us the world class runtime that the world’s number one programming language should have.
  • Cons: More unknowns and a large cost.

Personally, I would be delighted if we choose option 3 because it is clearly the best option (and because it gives me great job security :wink: )

Please don’t choose option 2 hoping that we will get option 3, because “someone will sort out the performance issue”. They won’t, unless the resources are there.

If we must choose option 1 or 2, then I think it has to be option 1.
It gives us a speedup at much lower cost in CPUs and energy,
by doing things more efficiently rather than just throwing lots of cores at the problem.

51 Likes

Thanks for the great write-up, @markshannon

Would there be a chance to get option 3 in two phases?

Phase 1: Optionally switch on nogil mode and turn off specializations using a command line switch, so that the two don’t clash. You’d only be able to run one or the other, not both at the same time. People who’d need fast single thread performance would go with the specialization mode. Others with more need to scale up Python apps to multiple cores could use the nogil version.

Phase 2: Slowly work on merging the two approaches.

Or would this not be feasible because the “slow merge” would actually require redesigning the whole specialization approach?

3 Likes

Something to consider, having mapped all of these discussions onto my own use-case.

Sub-interpreters allow some limited form of parallelism.

I see from The updated PEP 703 discussion that many people have shared their experience and metrics with their library/application running with no-gil. I didn’t see as many (or any?) people sharing similar experience/metrics having done so with sub-interpreters. I suspect that’s because they don’t yet support rich use-cases, like sharing data (I remember from the PyCon talk that plans are in the works for that to be the case).

I wonder, if those rich use-cases were supported (ideally soon), whether people would be happy with the experience/performance of using sub-interpreters. Then the discussion becomes slightly richer, as we might see that Option 1 feels less like a tradeoff, because sub-interpreters provide real benefit for parallel use-cases.

2 Likes

This is a very well detailed explanation. Thanks a lot @markshannon !

One thing that is not clear to me (and maybe others) is what it means in practice to restrict mutability.
Could you give a simple toy example that works today in Python and would not work if we restrict mutability? I believe it would help the community understand what is at stake here.

4 Likes

Another idea: would it be helpful to have something similar to slots that adds restrictions to a class? A way to mark a class/module as not allowing certain kinds of mutations, to enable optimization assumptions. I’m thinking partly of strict modules in Cinder.

This would have to be marked directly on the module/code, but for common usages where heavy dynamism isn’t needed, maybe that could reduce the assumptions that free threading puts at risk.
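For comparison, __slots__ already restricts instance mutation today; a class-level analogue can be sketched with a metaclass (an illustration only, not an existing or proposed CPython feature):

```python
# Today: __slots__ fixes an instance's layout.
class Point:
    __slots__ = ("x", "y")    # no per-instance __dict__, no new attributes

p = Point()
p.x = 1
# p.z = 3  -> AttributeError

# Hypothetical: a "sealed" class whose attributes cannot be rebound,
# so the VM could specialize without fearing cross-thread class edits.
class SealedMeta(type):
    def __setattr__(cls, name, value):
        raise TypeError(f"cannot mutate sealed class {cls.__name__!r}")

class Frozen(metaclass=SealedMeta):
    def get(self):
        return 1

# Frozen.get = lambda self: 2  -> TypeError
```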

4 Likes

“Restrict” could also be read as “make inefficient”.

One example could be “pause all running threads before doing things like mutating a class, setting certain function attributes, or assigning to __class__”. It sort of depends what tradeoffs we’re willing to make (what’s useful for speed vs what’s common in user code).
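To make that concrete, all of the following are legal Python today; they are the kind of mutation a restricted-mutability mode might forbid, pause threads for, or simply make slow (a sketch, not an exhaustive list):

```python
class A:
    def ping(self):
        return "A"

class B:
    def ping(self):
        return "B"

a = A()

A.ping = lambda self: "patched"     # mutating a class after creation
assert a.ping() == "patched"

a.__class__ = B                     # reassigning an instance's class
assert a.ping() == "B"

def f():
    return 1

f.__code__ = (lambda: 2).__code__   # swapping a function's code object
assert f() == 2
```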

6 Likes

Free threading will need an ABI break.

Is there a way to provide a backward compatibility layer which would allow a nogil-enabled CPython to still use old-abi native extensions (perhaps somewhat inefficiently)?

1 Like

Out of curiosity, does this take into account more recent efforts like graal-python (claimed 3.4x speedup)?

Sidenote: The post is great, though it would be quite a bit easier to read if you let discourse do the (intra-paragraph) linebreaks, rather than adding them manually.

1 Like

What is the current state of funding for the Faster CPython team?

2 Likes

PEP 703’s current suggestion is to re-enable the GIL (globally) in this scenario, with a message so the user (hopefully) notices that this has happened.

[edit: those extensions would still need to be compiled for the same ABI, though. But I thought that a major python release (almost?) always required this]

1 Like

No; see PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) - #19 by guido

3 Likes

I’ve seen that and actually it’s what I’m trying to address. If practically all Python 3 code would run on Python 4 without modifications (albeit maybe with reduced performance if it uses native extensions not yet using abi4), it would be more viable.

In other words that would alter PEP 387 – Backwards Compatibility Policy, providing a custom loader or vm or a translation layer of some sort in order to bridge the rift between “3” and “4” worlds.

I am not proficient enough with C++ to judge if it’s hard, if it’s a lot of work, or if the only way is to disassemble the .so, map and rewrite structures and pointers, and reassemble it (whether on PyPI or on import)… Though maybe that would work?

On the other hand, if this is impossible, perhaps the optimization that requires abi modification could be postponed (reducing performance, but not creating the 3-to-4 rift in the first place).

1 Like

It is strictly impossible, period. (Rest assured that it would already have been discussed otherwise.) abi3 exposes refcounting-related macros like Py_INCREF that mutate the object’s refcount field directly, and because they’re macros this is inlined into the generated code. Even if you could edit assembly, which is not possible portably, a .so doesn’t have information on the compile-time types of parameters, so you cannot even know where Py_INCREF is used, let alone change it in the optimized assembler code.

4 Likes

Ruby is even more mutable than Python, but also has a GIL

Ruby as a language does not have a GIL. CRuby/MRI, the default implementation, uses a GIL. TruffleRuby, an alternative implementation, does not use a GIL and provides a significant speedup for multi-threaded code (on top of the speedup for single-threaded code).

the performance of Python, JavaScript and Ruby implementations on the JVM has been disappointing, historically.

All three are implemented on GraalVM (which is a JVM). GraalPy mostly matches PyPy performance (sometimes a bit slower, sometimes a bit faster). JavaScript manages to be in the same ballpark as V8 (a highly optimized VM tailored for JS), and TruffleRuby outperforms CRuby (and does not have a GIL, unlike CRuby).

Performing the optimizations necessary to make Python fast in a free-threading environment will need some original research. That makes it more costly and a lot more risky.

There is research in this area, such as Daloze, Benoit, et al., “Efficient and thread-safe objects for dynamically-typed languages.” I am not challenging that it would still be costly and risky anyway.

Specialization (without which all other optimizations are near worthless) works fine with just mutability (we know what the guards and actions do, so we know what they mutate) and it works fine with just parallelism (guardA remains true regardless).

Truffle is doing just that. I think that V8 shares code across JavaScript “execution contexts”, so they probably have to deal with something similar too. GraalPy, being based on Truffle, does not need a GIL for specializing the interpreter, and it sometimes specializes outside of the GIL-protected sections too.

9 Likes

Saying “Ruby as a language does not have a GIL” is a bit like saying “Python as a language does not have a GIL”. It might be true in some abstract sense, but you can’t use a language without an implementation.

Your claim that TruffleRuby outperforms CRuby is somewhat undermined by ignoring “warmup”. That is basically rigging the results in your favor.

Why are you citing TruffleRuby performance rather than GraalPython? Do you have GraalPython numbers for the full pyperformance benchmark suite?

“Use Truffle” isn’t something we can do, but we might be able to learn from GraalPython.
How does GraalPython support the atomic operations that Python expects? Does it support sequential consistency, or the weaker JVM memory model?

8 Likes


Saying “Ruby as a language does not have a GIL” is a bit like saying “Python as a language does not have a GIL”. It might be true in some abstract sense, but you can’t use a language without an implementation.

I think that what makes a language a “GIL language” or not is whether its thread-safety guarantees are tied to the existence of a GIL or are more abstract. Ruby does not have an official specification of this AFAIK, but I would argue that Ruby is less of a “GIL language” than Python, because there are GIL-free implementations of Ruby (also JRuby IIRC) and there are libraries such as concurrent-ruby. I do not insist on calling Ruby a “GIL-free” language, but I thought it would be useful for the discussion to point out that it is less tied to the existence of a GIL than Python is at this point.

Your claim that TruffleRuby outperforms CRuby is somewhat undermined by ignoring “warmup”. That is basically rigging the results in your favor.

Valid point. I believe that the warmup issue is not due to not having a GIL, and the same goes for memory footprint. I thought it would be a useful data point for this discussion that a GIL-free implementation can outperform a GIL-based one on single-threaded peak performance.

I was not trying to rig the results in favor of anything. The comparison of TruffleRuby vs MRI is a complex thing, like any comparison, and I don’t think it belongs in this discussion beyond the observation that TruffleRuby is GIL-free and appears to do quite fine compared to the GIL-based implementation on some metrics. There will always be tradeoffs and no free lunch. I don’t think that not having a GIL is a big factor in these tradeoffs, but I do not have any rigorous research to support that claim.

Why are you citing TruffleRuby performance rather than GraalPython? Do you have GraalPython numbers for the full pyperformance benchmark suite?

Because with TruffleRuby we can see a GIL-based vs a GIL-free implementation of the same language. GraalPy has a GIL at this point, to be compatible with CPython, therefore I don’t think it’s that relevant for this discussion. The geomean speedup over CPython on pyperformance is 3.4 (measured against 3.10.8), but it’s still under development.

How does GraalPython support the atomic operations that Python expects? Does it support sequential consistency, or the weaker JVM memory model?

GraalPy has a GIL to be compatible. Once the specification of what Python expects to be atomic and how objects are locked during internal operations is finalized and accepted, we can do a non-GIL version.

1 Like

If GraalPy needs a GIL then I’m confused about what you are trying to say.

You seemed to be saying that using Truffle means you don’t need a GIL, but then you say GraalPy has a GIL.

GraalPy has a GIL for other reasons, mainly compatibility with CPython, but it would not be necessary for specializing the interpreter and JIT compilation. TruffleRuby, also built on Truffle and also a dynamic and mutable language, does not have a GIL (it’s more complicated: they do lock around some extensions for compatibility reasons too).

1 Like