A fast, free threading Python

Rosuav · June 16, 2023, 2:03pm

I’ve yet to see any language whose thread safety guarantees are tied to a GIL. Can you name one for me?

steve-s · June 16, 2023, 2:43pm

I believe that Python does not specify a lot of things and gets away with just “execution of one bytecode is atomic” and this is what I meant.

My intention was not to trigger a discussion on whether, and how much, language or implementation X is tied to the GIL or is GIL-free. We do not have precise definitions for these terms, so this language is all informal. I’ve explained how it is with the GIL in Ruby (at a high level). I believe it is different enough to be worth mentioning here.

Rosuav · June 16, 2023, 2:51pm

Is Python bytecode part of the language or the implementation? I thought it was an implementation detail.

pf_moore · June 16, 2023, 3:16pm

Python bytecode is an implementation detail, as is the idea that “execution of one bytecode is atomic”. As is the GIL. And if the language spec doesn’t say something is atomic, implementations don’t have to make it atomic. Of course there’s a difference between conforming to the language spec and being compatible with CPython, and for practical reasons implementations tend to aim for compatibility with CPython.

Can you point to a thread safety guarantee made by the Python language spec that requires the GIL? I think the actual problem is likely to be that the language spec makes very few thread safety guarantees. The nogil work has a much stricter target, because it aims to be compatible with the CPython implementation, and in particular with the C API. But the C API is entirely an implementation detail, so not relevant if you’re specifically talking about “Python the language”. Which, to be fair, I didn’t think we were until this sub-thread appeared…

steve-s · June 16, 2023, 3:47pm

Can you point to a thread safety guarantee made by the Python language spec

Like you write in your post: the language spec is one thing, and the contract that existing real Python code expects is another. I could not find anything on this in the spec, so it’s unspecified, but I believe the mental model most developers use is that one bytecode (one “language instruction”) is atomic, and that it is atomic because of the GIL. If there were no GIL, it would not be such an obvious choice to have this guarantee.

The spec does not say what happens when you try to modify a list concurrently from two threads, for example (or at least I haven’t found it), but I bet that every Python developer expects that the list will end up in a consistent state (both items added and nothing more or less). In Java/C#/C++, by contrast, you have thread-safe and non-thread-safe collections, and this expectation does not hold. I assume it never made sense to distinguish a thread-safe list from a non-thread-safe list in Python, because there would be no observable difference with the GIL and the “one bytecode is atomic” guarantee.
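To make the expectation above concrete, here is a minimal sketch, assuming CPython, where `list.append` behaves as a thread-safe operation. That is an implementation behaviour rather than something the language reference promises. The helper name `append_many` and the item counts are made up for illustration.

```python
import threading


def append_many(target, label, count):
    # Each append is a single list operation; on CPython the list stays
    # consistent even when another thread appends at the same time.
    for i in range(count):
        target.append((label, i))


COUNT = 10_000
shared = []
threads = [
    threading.Thread(target=append_many, args=(shared, name, COUNT))
    for name in ("a", "b")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every item from both threads is present exactly once; only the
# interleaving between the two threads is unspecified.
print(len(shared), len(set(shared)))
```

The order in which items from the two threads interleave can differ from run to run; the sketch only illustrates that nothing is lost or duplicated.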

So in one sense the GIL is not part of the spec. There is actually no thread-safety spec, it seems. However, in reality there is an unwritten spec that any alternative Python needs to follow to be able to run real-world Python code. It appears to me that this “spec” is tied to the GIL as explained above. Ruby also does not have an official spec for this, but given that there are alternative GIL-less implementations that can run real-world Ruby code, I think that the commonly used contract of Ruby is less tied to the existence of a GIL. One could say that alternative GIL-less Pythons (Jython and IronPython are GIL-less, I think) prove me wrong. I don’t know if they are as compatible.

if you’re specifically talking about “Python the language”

Alright. I implicitly meant the “Python the language” that people code against, not (only) the documentation of the language that does not specify some of these things. In the same sense I meant “Ruby the language”.

lunixbochs · June 16, 2023, 4:05pm

I think the answer is (parallelism ^ mutability) for specialization. If an object is both mutable and has parallel references, it would ideally fail a guard check.

I like how pony’s pointer capabilities allow you to share immutable values but not mutable ones.
Similar to that, if you can safely tell at the guard position whether a reference is shared between threads or not, you can roll that into the specialization decision.
I don’t know how that could interact with a C reference, or gc.get_objects() though.

Is the specialization decision being made on nested value types, so the concern is changing a type within an object or container in the middle of a specialized region?

Rosuav · June 16, 2023, 4:23pm

I would weaken that down to expecting that the interpreter is in a consistent state. For example, in no way should pure Python code ever^[1] be able to cause reference counts to be incorrect, or attempt to reference a non-object. This will almost certainly mean that SOME operations require locks (eg resizing a list to add room to it), but beyond that observation, everything is implementation details.

[1] ctypes aside - you can do anything with ctypes
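A consistent interpreter state does not by itself make compound operations atomic at the Python level. As a hedged illustration (the `Counter` class and the thread counts are hypothetical, not taken from the thread), a read-modify-write such as `value += 1` spans several steps, so code that needs it to behave as one unit takes an explicit lock:

```python
import threading


class Counter:
    """Hypothetical shared counter protected by an explicit lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "self.value += 1" is a load, an add and a store; the lock makes
        # the three steps behave as one unit with respect to other threads.
        with self._lock:
            self.value += 1


def work(counter, n):
    for _ in range(n):
        counter.increment()


counter = Counter()
threads = [threading.Thread(target=work, args=(counter, 50_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)
```

With the lock in place the final count does not depend on how the threads are scheduled, with or without a GIL.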

danijar · June 16, 2023, 9:12pm

@Mark and others: If this is the right place to ask, could you be more specific about what funding will be needed for the different options, please? I guess details will take more time to figure out but even knowing a rough range could be helpful for starting to think about how to secure that.

gpshead · June 17, 2023, 3:26am

This feels feasible to me and is effectively close to what we currently see proposed via PEP 703. The difference being that in the early stages of letting the community experiment with a free threaded CPython, the PEP 703 version is an entirely different build of our runtime. The switch actually happens at CPython compilation time. Though there are ways that detail could be hidden behind a command line flag if we desire that interface.

I’d more or less expect work on specialization to proceed in parallel, without worrying if those benefits cannot yet be available in a free-threaded build for a few releases. That turns it mostly into an additional code maintenance and test matrix burden on the CPython core dev side: keeping both our still-primary single-threaded GIL-based interpreter and the experimental free-threaded build working.

I figure this is basically exactly what Mark claims not to want, presumably due to the interim added build and maintenance complexity. But it also seems like the most likely way to get to his “both” option 3 that I suspect we all magically wish would just happen.

I personally don’t think Mark’s option list 1 / 2 / 3 reflects the actual slate of decisions to be made. But how to word that eludes me for now, so I’ll save expressing that for later if it remains important.

We’ve got measurements for today’s impact: A vague ballpark number of 10% is accurate enough.

In the above scenario of doing both but ignoring most specialization in the free-threaded build, the divergence between the performance of the two builds for a single thread of execution will naturally increase over time. For development purposes it is reasonable to declare this a known tradeoff.

gpshead · June 17, 2023, 3:39am

There is one long-term thing to think about when it comes to “options”… Your option 1 is already in progress. Let’s say y’all succeed as planned and a hypothetical Python 3.15 is 5x faster than 3.10. Yay! … Then what?

Nobody’s CPU core will meaningfully get faster going forward (just as it hasn’t in the past decade), but everyone can easily have 100 cores.

When we believe we’ve hit the wall of large single-threaded performance improvements, another way forward is multi-core. We’ve known this for over a decade: we’ll have to deal with cores eventually, or wind up declaring ourselves at a performance dead end. I don’t expect anyone who wants maximum performance to be satisfied at “just 5x” in the end. (Counterpoint: people who want maximum performance are never satisfied.)

What does that mean? We’ll have to deal with free-threading or some viable alternative in the end anyways.

So the question could be seen as one of ordering, if work on both at once is deemed infeasible, and of which factors are chosen to decide that.

guido · June 17, 2023, 3:40am

But what do we then do when no-gil becomes the default (and only) mode? Just throw away most of the work done on single-threaded specialization? Or do you believe that keeping both GIL mode and no-gil mode forever is the right trade-off?

gpshead · June 17, 2023, 4:05am

I’d never throw a major performance improvement away. Iff nobody exists to bring the single-threaded performance wins to the free-threaded side and people like using both, that suggests we’d keep both forever. I doubt that to be the final outcome.

This sounds like a self resolving problem to me: I expect there’d be demand for performance in the free-threading side if it proves desirable and useful. That alone should attract resources to make it happen. (am I being too hopeful?)

guido · June 17, 2023, 4:34am

But while single-threaded performance improvements automatically benefit every Python program that needs more performance, multi-core only benefits those folks who are able to rewrite the performance-critical part of their application to benefit.

Or do you see us changing the language to benefit from multi-core? Even if this took the form of new primitives (e.g. a parallel for or map), that would require application developers to modify their program (not every for can be parallelized without changing the program’s meaning, and a compiler that can reason about this in a Python context seems like a research project).
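As a rough sketch of why not every loop can be handed to a parallel map, compare a loop whose iterations are independent with one that carries state from one iteration to the next. The example uses the existing `concurrent.futures.ThreadPoolExecutor` as a stand-in; it is not a proposed language primitive, and the function and variable names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


data = list(range(10))

# Independent iterations: each result depends only on its own input,
# so running them in any order (or in parallel) preserves the meaning.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, data))  # map() keeps input order

# Loop-carried dependence: each iteration reads the total left behind by
# the previous one, so the iterations cannot simply be run independently.
running_totals = []
total = 0
for x in data:
    total += x
    running_totals.append(total)

print(squares)
print(running_totals)
```

The second loop can still be parallelized with a different algorithm (a parallel prefix sum), but that is a rewrite of the program by its author, not a mechanical transformation of the `for` statement.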

Of course, until now, two alternatives (viable or not) have been writing some libraries (e.g. numpy) in C or C++ and using multi-core at that level, and (for certain types of applications) multi-processing. In 3.12 we’re already adding subinterpreters with their own GIL to the palette.

It’s not so much that the work cannot be done in parallel. The problem is more that in a GIL-free world the work on single-threaded performance requires a different approach (see e.g. Brandt’s post in the other thread). The experts (not just Mark, but also several academic folks whom I asked for advice) seem to agree that this different approach is not just different, it takes more effort, and there is less previous work we can borrow.

This means that it would be helpful to know sooner rather than later what the SC is going to decide: If the SC decides to keep the GIL, the best road to the best single-threaded performance is to continue the work that Mark and the rest of the Faster CPython team have already planned – if we keep the GIL, we don’t need to worry about other threads invalidating our caches, versions and what have you. OTOH, if the SC decides to accept free-threading (whether in the form of PEP 703 or some variant or alternative), we should stop the current work and start redesigning the optimization architecture to be truly thread-safe. And we should seek additional funding (or accept that we won’t get even close to the 5x in 5 years goal for single-threaded performance).
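The worry about other threads invalidating caches and versions can be pictured with a toy model. This is only a sketch in plain Python under loud assumptions: the classes `VersionedNamespace` and `CachedLookup` are hypothetical, and CPython’s real specialization machinery lives in C and is considerably more involved.

```python
class VersionedNamespace:
    """Toy mapping that bumps a version counter on every write."""

    def __init__(self):
        self._data = {}
        self.version = 0

    def set(self, key, value):
        self._data[key] = value
        self.version += 1

    def get(self, key):
        return self._data[key]


class CachedLookup:
    """Caches one lookup and reuses it while the version guard holds."""

    def __init__(self, namespace, key):
        self.namespace = namespace
        self.key = key
        self._cached_version = None
        self._cached_value = None
        self.misses = 0

    def load(self):
        # Guard: the cached value is only valid for the version it was read at.
        if self._cached_version != self.namespace.version:
            self.misses += 1
            self._cached_value = self.namespace.get(self.key)
            self._cached_version = self.namespace.version
        # In this toy model, the danger is a writer calling set() between
        # the guard above and the use below. A single global lock rules that
        # out for free; without one, the guard and the use need their own
        # synchronization.
        return self._cached_value


ns = VersionedNamespace()
ns.set("x", 1)
lookup = CachedLookup(ns, "x")

first = lookup.load()   # miss: fills the cache
second = lookup.load()  # hit: guard passes, cached value reused
ns.set("x", 2)          # write bumps the version and invalidates the cache
third = lookup.load()   # miss: guard fails, value is re-read

print(first, second, third, lookup.misses)
```

Run single-threaded, the guard is always sufficient. The redesign discussed above is about keeping this kind of guard-then-use pattern cheap when another thread may write at any point.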

I understand that this just increases the pressure on the SC, which I know you don’t need (if y’all resign under the pressure like I did in 2018, where would we be?). But I worry that you might be betting on hope as a strategy: choosing Mark’s option (2) and hoping that the demand will lead to (3) – exactly what Mark says is a mistake.

Like Mark, I hope that you’re choosing (3) – like Mark says, it’s clearly the best option. But we will need to be honest about it, and accept that we need more resources to improve single-threaded performance. (And, as I believe someone already pointed out, it will also be harder to do future maintenance on CPython’s C code, since so much of it is now exposed to potential race conditions. This is a problem for a language that’s for a large part maintained by volunteers.)

ananis25 · June 17, 2023, 8:08am

The post is very well-stated though I don’t think the second part above is correct.

When working with data/machine-learning, a lot of python usage relies on established libraries like pandas, duckdb, etc. with primitives implemented in languages like c++ and rust. The typical shape of these computations is a dataflow, and these libraries can leverage multiple cores by concurrently running multiple operations, or the same operation on multiple data, as long as those operations come with the library. However, if you want to add a python UDF (user-defined-function) to the dataflow, the GIL limits parallel execution since this step can’t be run in parallel. This is something multi-core python likely solves by letting library authors call into the python interpreter from multiple threads.

I guess my point is, non-system programmers like me already rely on established libraries to orchestrate computation, and will still benefit from multicore python when I want to customize it a tiny bit.
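The shape of that use case can be sketched as follows. This is a hypothetical stand-in for what a dataflow library might do internally, not the API of pandas or duckdb: `udf`, `apply_udf_to_chunk` and `run_dataflow` are invented names. The `sys._is_gil_enabled()` check is only available on CPython 3.13 and later, so the sketch assumes a GIL when it is missing.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def udf(value):
    # Hypothetical pure-Python user-defined function.
    return value * 2 + 1


def apply_udf_to_chunk(chunk):
    return [udf(v) for v in chunk]


def run_dataflow(values, chunk_size=1000, workers=4):
    # Split the input into chunks and run the UDF on each chunk in a
    # worker thread, as a library orchestrating the dataflow might.
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(apply_udf_to_chunk, chunks))
    return [v for chunk in results for v in chunk]


# Assumption: treat the GIL as enabled when the interpreter cannot say.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

output = run_dataflow(list(range(5000)))
print(gil_enabled, output[:3])
```

The result is the same either way. What changes is whether the chunks of pure-Python work can actually execute at the same time: with a GIL the threads take turns, while a free-threaded build could run them on separate cores.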

pf_moore · June 17, 2023, 9:33am

That is a use case I have myself. But in that scenario, multiple interpreters should be sufficient (each thread gets its own interpreter to run the UDF in). So I don’t think it’s a compelling use case for nogil.

smontanaro · June 17, 2023, 9:50am

This was what killed Greg Stein’s 1.4 no-gil implementation (and, I presume, any other gilectomy attempts since). The assumption that since single-threaded performance will be worse right now, it must be rejected. As long as that keeps being the stance, CPython will never lose its GIL, despite how much that might improve it in other ways. Personally, I find it disappointing.

Does anybody know if the 10% hit is mostly due to one-time-only specialization or to other factors?

smontanaro · June 17, 2023, 11:07am

There is a downside to the aphorism that “a bird in the hand is worth two in the bush.” That bird never has babies. As Greg pointed out:

There are already documented cases (in this thread or related ones) where people/teams/projects who need better multi-core performance needed to jump through hoops to get it or switch to other languages. I have no doubt that the smart folks involved will figure out how to make things work. You just need to give them time.

Why not think in terms of specialized bytecode performance gain as restoring the performance hit of a nogil implementation built on a pre-specialized interpreter instead of the other way around? I performed a gedanken experiment on my M1 Mac using the world’s worst Python benchmark, pystone 1.1. I built from GitHub source using the default configure for each of several branches. I could only go back to 3.9 (3.8’s configure script apparently doesn’t like M1s), but that’s enough (it’s where @colesbury started anyway). Here are the numbers I got (“nogil” is Sam’s nogil-3.12-bufix branch):

3.9     349215
3.10    367112
3.11    612340
3.12    633701
main    619266
nogil   579258

So, it appears that the single core hit is around 9% (3.12 v nogil). Now, suppose nogil had actually landed in 3.9 with a 9% hit. We’d have something like this:

3.9     319212
3.10    335572
3.11    559732
3.12    579258
main    566063
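The arithmetic behind the two tables can be reproduced with a few lines. This is a sketch that uses only the pystone figures quoted above; the variable names are made up, and integer floor division is an assumption about the rounding that happens to reproduce the hypothetical figures.

```python
# pystone scores quoted in the post (higher is better).
measured = {
    "3.9": 349215,
    "3.10": 367112,
    "3.11": 612340,
    "3.12": 633701,
    "main": 619266,
    "nogil": 579258,
}

# Single-core hit of nogil relative to 3.12, in percent.
hit_percent = (1 - measured["nogil"] / measured["3.12"]) * 100

# Scale every GIL-based score by the nogil/3.12 ratio, using integer
# arithmetic so the 3.12 row maps exactly onto the nogil score.
hypothetical = {
    version: score * measured["nogil"] // measured["3.12"]
    for version, score in measured.items()
    if version != "nogil"
}

print(f"single-core hit: {hit_percent:.1f}%")
for version, score in hypothetical.items():
    print(f"{version:<6}{score}")
```

The hit comes out at roughly 8.6%, which the post rounds to 9%.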

People would surely have grumbled about the performance hit when 3.9 was released (might well have been worse than 3.8). Nevertheless, we would still have wet our pants with excitement looking at the performance gains between 3.10 and 3.11, and would probably have forgotten about the loss in performance in that hypothetical 3.9.

I have no vote, but I will vote anyway. Ignore single core performance loss and work to make nogil the default in a near-future release. I predict Python will get that back in relatively short order, and the free-threaded interpreter will open up new application domains for Python.

pf_moore · June 17, 2023, 11:53am

I sort of agree with this. However, there is one rather big catch here. The 3.10-3.11 gains came largely from the “faster CPython” project, which used techniques that aren’t easily transferrable to a free-threading implementation. So there’s a question over whether those same gains could have been achieved in a nogil world. And more importantly, those gains happened because the “faster CPython” project was funded - to get that funding, presumably the project needed to provide plans and expectations. Having to ask for funding on the basis that we’d be breaking new ground in terms of techniques would have been harder.

None of these points are showstoppers. But I think we do need to consider whether the faster CPython team will be able to continue delivering performance improvements like the 3.10-3.11 increment if we add nogil, and just as importantly, whether they will be able to secure funding for whatever extra work this incurs.

Even having said all that, and even if the faster CPython work cannot continue under nogil, it’s still not obvious what choice to make between “3.12 is 75% faster than 3.9” and “3.12 is 65% faster than 3.9 and free threaded”. People wanted more performance when 3.9 was the current version, but it’s not like 3.9 was unusable. And 65% is still a heck of a speedup. Focusing on “free threaded 3.12 is slower than gil-based 3.11” isn’t really looking at the big picture, even if “this version of Python is slower overall (because of new features)” is a hard message to present.

My impression is that the SC is indeed looking at the big picture. But Mark’s point is valid too - whether we can get support for the faster CPython project to continue, and what they can achieve under free threading, is also part of the big picture, and simply hoping it will be OK isn’t the answer^[1].

[1] even if it’s the best we can manage

elis.byberi · June 17, 2023, 3:10pm

A quick fix:

Version	Speed	%
3.9	1.0	–
3.12	1.75	75.0%
3.12 nogil	1.575	57.5%

If 3.12 nogil is 10% slower than 3.12, the performance improvement of 3.12 nogil compared to 3.9 is 57.5% (not 65%), a decrease of 17.5 percentage points.
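The correction follows from the fact that slowdowns multiply rather than subtract. A small check, taking the round figures from the table as given (1.75x for 3.12 and a 10% nogil slowdown are the post’s assumptions, not measurements):

```python
speedup_312 = 1.75      # 3.12 relative to 3.9, as assumed in the table
nogil_slowdown = 0.10   # nogil assumed 10% slower than 3.12

# The slowdown applies to the already-sped-up interpreter.
speedup_nogil = speedup_312 * (1 - nogil_slowdown)

gain_312 = (speedup_312 - 1) * 100
gain_nogil = (speedup_nogil - 1) * 100

print(round(speedup_nogil, 3))          # relative speed of 3.12 nogil
print(round(gain_nogil, 1))             # improvement over 3.9, in percent
print(round(gain_312 - gain_nogil, 1))  # difference in percentage points
```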

itamaro · June 17, 2023, 7:47pm

It’s not clear to me to what extent the SC is in a position to tie PEP acceptance or rejection to allocation of funding. I think it would be helpful if @markshannon or @guido can shed more light on what you mean by “additional funding”.

If we end up taking the free-threading-multi-core route, what kind of additional funding will be required, in what form, and for what goal?
If “additional funding” can manifest in the form of “allocating PSF-employed and SC-directed CPython developer(s)” then that’s probably something the SC can decide to do (but maybe not a consideration in PEP acceptance or rejection?).
If “additional funding” towards “option 3” must take the form of “the Faster CPython team hires N more world-class cpython devs” (on top of the existing world-class core devs already on that team), then I don’t know how the SC can do anything about it, nor how they can take that into consideration when making their decision.

It is also important to have more clarity on the implications of “lack of additional funding”.
Let’s do a thought experiment where the SC decided to accept PEP-703, and there is no additional funding.
My assumption is that the 5x single-threaded speedup will not be achieved in the timeframe of the original plan (was a specific timeline set? is it “by 3.15”?).
What happens next? If the existing funding for the Faster CPython team dries up as a result (e.g. because the original goal can no longer be achieved “on time”), that’s a pretty bad outcome, but also a very different one from “the existing funding stays as is and it takes N more releases to achieve the 5x goal” (or even “it takes N more releases and we end up with “only” 3x”).

Python has existed for a long time now, and I expect it will continue to exist for many more decades.
I share @gpshead’s sentiment:

In my opinion, “declaring a performance dead end” can be a defensible position. It is a valid strategy to avoid multi-core-free-threading complexity and continue investing in single core performance up to that dead end, and plenty of use-cases out there are very well served with “the best single-core language in the world”.
But it must be a decision made explicitly and intentionally!
If this is made explicit, then it will be clear to the community that there’s no point in trying to “solve” the multi-core-free-threading problem (at any expense to single-core performance), and future @colesbury es will not need to invest years in trying to tackle it.
But if, on the other hand, there’s an expectation (consensus?) that “we would have to deal with multi-core eventually”, then it becomes a question of when, how, and who.
PEP-703 is one option (arguably, the best option proposed so far). PEP-684 (per-interpreter GIL) is another option (not mutually exclusive with PEP-703, according to @eric.snow). There may be more options in the future.

I think what’s missing is a “meta-decision” about the long-term strategy:

We want to have a good multi-core story; vs
We declare Python as the best single-core language

If the meta-decision is “we want a good multi-core story”, then I think something that @colesbury has been asking for is “what does an acceptable good multi-core story look like”, because without a framework to operate in, all we have to guide us is an invisible moving goalpost.