Free threading will need an ABI break.
Is there a way to provide a backward compatibility layer which would allow a nogil-enabled CPython to still use old-abi native extensions (perhaps somewhat inefficiently)?
Out of curiosity, does this take into account more recent efforts like graal-python (claimed 3.4x speedup)?
Sidenote: the post is great, though it would be quite a bit easier to read if you let Discourse do the (intra-paragraph) line breaks, rather than adding them manually.
What is the current state of funding for the Faster CPython team?
PEP 703’s current suggestion is to re-enable the GIL (globally) in this scenario, with a message so the user (hopefully) notices that this has happened.
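For illustration, a hedged sketch of how a program might check for this at runtime. It assumes the `sys._is_gil_enabled()` introspection hook that free-threaded builds provide; on ordinary builds the attribute is absent, so the sketch falls back to assuming the GIL is on:

```python
import sys

def gil_enabled() -> bool:
    """Best-effort check for whether the GIL is currently active."""
    # sys._is_gil_enabled() only exists on free-threaded builds;
    # on a regular build the GIL is always on.
    checker = getattr(sys, "_is_gil_enabled", None)
    return True if checker is None else checker()

if gil_enabled():
    # Under PEP 703's proposal, on a free-threaded build this can mean an
    # incompatible extension caused the GIL to be re-enabled at import time.
    print("running with the GIL")
```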
[edit: those extensions would still need to be compiled for the same ABI, though. But I thought that a major Python release (almost?) always required this.]
I’ve seen that and actually it’s what I’m trying to address. If practically all Python 3 code would run on Python 4 without modifications (albeit maybe with reduced performance if it uses native extensions not yet using abi4), it would be more viable.
In other words, that would alter PEP 387 – Backwards Compatibility Policy, providing a custom loader, a VM, or a translation layer of some sort in order to bridge the rift between the “3” and “4” worlds.
I am not proficient enough with C/C++ to judge whether it’s hard, whether it’s a lot of work, or whether the only way is to disassemble the .so, map and rewrite structures and pointers, and reassemble it (whether on PyPI or on import)… though maybe that would work?
On the other hand, if this is impossible, perhaps the optimization that requires abi modification could be postponed (reducing performance, but not creating the 3-to-4 rift in the first place).
It is strictly impossible, period. (Rest assured that it would already have been discussed otherwise.) abi3 exposes refcounting-related macros like Py_INCREF that mutate the object’s refcount field directly, and because they’re macros the mutation is inlined into the generated code. Even if you could edit the assembly (which is not possible portably), a .so doesn’t carry information about the compile-time types of parameters, so you cannot even tell where Py_INCREF was used, let alone change it in the optimized assembly.
Ruby is even more mutable than Python, but also has a GIL
Ruby as a language does not have a GIL. CRuby/MRI, the default implementation, uses a GIL. TruffleRuby, an alternative implementation, does not use a GIL and provides a significant speedup for multi-threaded code (on top of its speedup for single-threaded code).
Python, JavaScript and Ruby implementations on the JVM have been disappointing, historically.
All three are implemented on GraalVM (which is a JVM). GraalPy mostly matches PyPy performance (sometimes a bit slower, sometimes a bit faster), the JavaScript implementation manages to be in the same ballpark as V8 (a highly optimized VM tailored for JS), and TruffleRuby outperforms CRuby (and, unlike CRuby, has no GIL).
Performing the optimizations necessary to make Python fast in a free-threading environment will need some original research. That makes it more costly and a lot more risky.
There is research around that, such as Daloze, Benoit, et al., “Efficient and thread-safe objects for dynamically-typed languages.” I am not challenging that it would still be costly and risky anyway.
Specialization (without which all other optimizations are near worthless) works fine with just mutability (we know what the guards and actions do, so we know what they mutate), and it works fine with just parallelism (guardA remains true regardless).
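A toy sketch of the guard-based specialization being discussed, in plain Python (all names are made up for illustration): a call site caches a fast path keyed on the type it last observed, a guard re-checks that assumption on every execution, and a failed guard deoptimizes and re-specializes.

```python
def make_specializing_add():
    """Build a call site with a one-entry inline cache (a toy model)."""
    cache = {"guard_type": None, "fast": None}

    def add(a, b):
        if type(a) is cache["guard_type"]:       # guard: same type as last time?
            return cache["fast"](a, b)           # specialized fast path
        # Guard failed: deoptimize and re-specialize for the observed type.
        if type(a) is int:
            cache["guard_type"] = int
            cache["fast"] = lambda x, y: x + y   # stands in for an unboxed int add
        elif type(a) is str:
            cache["guard_type"] = str
            cache["fast"] = lambda x, y: x + y   # stands in for a string concat
        else:
            return a + b                         # generic fallback, no caching
        return cache["fast"](a, b)

    return add
```

The GIL question is whether the `cache` mutation in the deopt branch stays safe when several threads hit the same call site at once.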
Truffle is doing just that. I think that V8 shares code across JavaScript “execution contexts”, so they probably have to deal with something similar too. GraalPy, being based on Truffle, does not need a GIL for specializing the interpreter, and it sometimes specializes outside of the GIL-protected sections as well.
Saying “Ruby as a language does not have a GIL” is a bit like saying “Python as a language does not have a GIL”. It might be true in some abstract sense, but you can’t use a language without an implementation.
Your claim that TruffleRuby outperforms CRuby is somewhat undermined by ignoring “warmup”. That is basically rigging the results in your favor.
Why are you citing TruffleRuby performance rather than GraalPython? Do you have GraalPython numbers for the full pyperformance benchmark suite?
“Use Truffle” isn’t something we can do, but we might be able to learn from GraalPython, though.
How does GraalPython support the atomic operations that Python expects? Does it support sequential consistency, or the weaker JVM memory model?
Saying “Ruby as a language does not have a GIL” is a bit like saying “Python as a language does not have a GIL”. It might be true in some abstract sense, but you can’t use a language without an implementation.
I think that what makes a language a “GIL language” or not is whether its thread safety guarantees are tied to the existence of a GIL or are more abstract. Ruby has no official specification of this AFAIK, but I would argue that Ruby is less of a “GIL language” than Python, because there are GIL-free implementations of Ruby (also JRuby, IIRC) and there are libraries such as concurrent-ruby. I do not insist on calling Ruby a “GIL-free” language, but I thought it would be useful for the discussion to point out that it’s less tied to the existence of a GIL than Python is at this point.
Your claim that TruffleRuby outperforms CRuby is somewhat undermined by ignoring “warmup”. That is basically rigging the results in your favor.
Valid point. I believe the warmup issue is not due to not having a GIL, and the same goes for memory footprint. I thought it would be a useful data point for this discussion that a GIL-free implementation can outperform a GIL-based one on single-threaded peak performance.
I was not trying to rig the results in favor of anything. Comparing TruffleRuby and MRI is a complex exercise, like any comparison, and I don’t think it belongs in this discussion beyond the fact that TruffleRuby is GIL-free and appears to do quite well against the GIL-based implementation on some metrics. There will always be tradeoffs and no free lunch. I don’t think that not having a GIL is a big factor in these tradeoffs, but I do not have any rigorous research to support that claim.
Why are you citing TruffleRuby performance rather than GraalPython? Do you have GraalPython numbers for the full pyperformance benchmark suite?
Because with TruffleRuby we can compare a GIL-based and a GIL-free implementation of the same language. GraalPy has a GIL at this point to be compatible with CPython, so I don’t think it’s that relevant to this discussion. The geomean speedup over CPython on pyperformance is 3.4 (measured against 3.10.8), but it’s still under development.
How does GraalPython support the atomic operations that Python expects? Does it support sequential consistency, or the weaker JVM memory model?
GraalPy has a GIL to stay compatible. Once the specification of what Python expects to be atomic, and of how objects are locked during internal operations, is finalized and accepted, we can do a non-GIL version.
If GraalPy needs a GIL then I’m confused about what you are trying to say.
You seemed to be saying that using Truffle means you don’t need a GIL, but then you say GraalPy has a GIL.
GraalPy has a GIL for other reasons, mainly compatibility with CPython, but it would not be necessary for specializing the interpreter or for JIT compilation. TruffleRuby, also built on Truffle and also a dynamic, mutable language, does not have a GIL (it’s a bit more complicated: they do lock around some extensions for compatibility reasons too).
I’ve yet to see any language whose thread safety guarantees are tied to a GIL. Can you name one for me?
I believe that Python does not specify a lot of things and gets away with just “execution of one bytecode is atomic”, and this is what I meant.
My intention was not to trigger a discussion on whether, and how much, a given language or implementation is tied to a GIL or is GIL-free. We do not have precise definitions for these terms, so this language is all informal. I’ve explained how the GIL situation is in Ruby (at a high level), and I believe it is different enough to be worth mentioning here.
Is Python bytecode part of the language or the implementation? I thought it was an implementation detail.
Python bytecode is an implementation detail, as is the idea that “execution of one bytecode is atomic”. As is the GIL. And if the language spec doesn’t say something is atomic, implementations don’t have to make it atomic. Of course there’s a difference between conforming to the language spec and being compatible with CPython, and for practical reasons implementations tend to aim for compatibility with CPython.
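For what it’s worth, even the informal “one bytecode is atomic” rule guarantees less than people tend to assume: a plain augmented assignment compiles to several bytecodes, so a concurrent read-modify-write can interleave even under that rule. A quick way to see this (the exact listing varies by CPython version):

```python
import dis

def bump(counter):
    # A read-modify-write: load, add, store -- several separate bytecodes,
    # so "one bytecode is atomic" does not make this statement atomic.
    counter["n"] += 1

dis.dis(bump)  # prints the multi-instruction bytecode listing
```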
Can you point to a thread safety guarantee made by the Python language spec that requires the GIL? I think the actual problem is likely to be that the language spec makes very few thread safety guarantees. The nogil work has a much stricter target, because it aims to be compatible with the CPython implementation, and in particular with the C API. But the C API is entirely an implementation detail, so not relevant if you’re specifically talking about “Python the language”. Which, to be fair, I didn’t think we were until this sub-thread appeared…
Can you point to a thread safety guarantee made by the Python language spec
As you write in your post: the language spec is one thing, and the contract that existing real Python code expects is another. I could not find anything on this in the spec, so it’s unspecified, but I believe the mental model most developers use is that one bytecode/one “language instruction” is atomic. They are atomic because of the GIL, and without a GIL it would not be such an obvious choice to keep this guarantee.
The spec does not say what happens when you try to modify a list concurrently from two threads, for example (or at least I haven’t found it), but I bet that every Python developer expects that the list will end up in a consistent state (both items added, nothing more and nothing less). In Java/C#/C++, by contrast, you have thread-safe and non-thread-safe collections, and this expectation does not hold for the latter. I assume it never made sense to distinguish a thread-safe list from a non-thread-safe list in Python, because there would be no observable difference given the GIL and the “one bytecode is atomic” guarantee.
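To make that expectation concrete, here is the kind of code Python developers write on that assumption: concurrent unsynchronized appends, which in CPython never lose items or corrupt the list because list.append is effectively atomic.

```python
import threading

items = []

def worker(n):
    # No lock: relies on list.append being atomic in CPython.
    for i in range(n):
        items.append(i)

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(items))  # 40000: no appends were lost and the list is consistent
```

The equivalent unsynchronized code on a Java ArrayList or C++ vector could lose elements or corrupt internal state, which is exactly the distinction being drawn above.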
So in one sense the GIL is not part of the spec; there is actually no thread-safety spec at all, it seems. However, in reality there is an unwritten spec that any alternative Python needs to follow to be able to run real-world Python code, and it appears to me that this “spec” is tied to the GIL, as explained above. Ruby also has no official spec for this, but given that there are alternative GIL-less implementations that can run real-world Ruby code, I think the commonly used contract of Ruby is less tied to the existence of a GIL. One could say that alternative GIL-less Pythons (Jython and IronPython, I think) prove me wrong; I don’t know whether they are as compatible.
if you’re specifically talking about “Python the language”
Alright. I implicitly meant the “Python the language” that people code against, not (only) the documentation of the language, which does not specify some of these things. I meant “Ruby the language” in the same sense.
I think the answer for specialization is (parallelism ^ mutability): if an object is both mutable and has parallel references, it would ideally fail a guard check.
I like how Pony’s pointer capabilities allow you to share immutable values but not mutable ones.
Similar to that, if you can safely tell at the guard position whether a reference is shared between threads or not, you can roll that into the specialization decision.
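A hypothetical sketch of that idea (all names invented for illustration): track whether an object has ever been touched by a thread other than its creator, and let the guard consult that flag to choose between a synchronization-free fast path and a locked slow path, similar in spirit to the thread-safe-objects work cited earlier.

```python
import threading

class TrackedBox:
    """Toy object that records whether it ever became thread-shared."""

    def __init__(self, value):
        self.value = value
        self._owner = threading.get_ident()
        self.shared = False  # flips permanently on first cross-thread access

    def get(self):
        if threading.get_ident() != self._owner:
            self.shared = True
        return self.value

_lock = threading.Lock()

def specialized_read(box):
    # Guard: take the fast, lock-free path only while the object is
    # still thread-local; deoptimize to a locked path once it is shared.
    if not box.shared:
        return box.get()
    with _lock:
        return box.get()
```

A real implementation would need the flag transition itself to be race-free, which is where the hard part (and the cited research) lives.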
I don’t know how that could interact with a C reference, or gc.get_objects() though.
Is the specialization decision being made on nested value types, so the concern is changing a type within an object or container in the middle of a specialized region?