A fast, free threading Python

There are two possibilities: either there is a bug in the no-GIL CPython code that caused the problem, and that bug will need fixing, which is no different from bugs in a GIL CPython.

Or the Python code is not thread-safe, and that is just as true when running it in the GIL version.

3 Likes

Please provide an example. Many people (myself included) have expressed this concern, and have been convinced that it is not an actual problem. If you have code that is correct (and threadsafe) under the GIL, but will not be threadsafe without the GIL, please show it.

Note that if your code is not threadsafe with the GIL, that’s not a problem with nogil - it’s just something for you to decide (do you want to support threads in the first place).

11 Likes

I think it’s a misunderstanding that we are faced with a choice between a path forward where we need extra care with C extension thread safety, and a path forward where we don’t.

Rather, the choice is between improving multi-core performance with PEP 703 free-threading, and improving multi-core performance with subinterpreters. Both make the same demands on C extension thread safety. For example, the Linux putenv/getenv issue that was mentioned as a problem for regular threading, applies just as much to subinterpreters.

There are differences: One is that subinterpreters have an additional requirement that PyObjects from one subinterpreter must not be used in another. Another is that we can ask more thread-safety diligence of authors of subinterpreter code, as it will be new code, whereas regular threaded code has a large installed base.

But unless we want subinterpreters to be second class use-at-your-own-risk citizens of the Python ecosystem, we need to address C extension thread safety.

6 Likes

Thanks for the clarification. I co-develop a few Python C extensions. One that I am the core developer for was recently marked as “essential” to the Python ecosystem. I’d like to carry it forward into the future with Python, so this information is relevant to me. I hope my posts did not come across as throwing shade at free-threading.

I think that is a leading question. I don’t think anyone can really provide the kind of example that you want, because 1) there’s no full nogil implementation yet, so it’s not really fair to point out specific implementation flaws, and 2) the existence of the GIL already trivially proves that in theory you can always make everything as safe as the GIL by just adding a lock. If anything breaks, then you just add a lock, or make the lock bigger, until it works.

But that’s not really a satisfactory situation, is it?

The core issue with free threading, especially in Python, is that you’re left with two unsatisfactory choices:

  1. either dumb down free threading so everything is safe enough by having enough locks (and you still have the issue of native libraries, which won’t always be written by people experienced in writing multithreaded native code, much less multithreaded nogil CPython extensions, so in practice it will still be unsafe anyway), or
  2. prioritize the ability to maximize multicore performance, but allow interpreter state anomalies when things are not used correctly, making it the programmer’s responsibility to do threading correctly. This is a non-starter; it’s just buggy.

There is no way to avoid one or the other of those two outcomes when you allow free threading. Completely safe free threading necessarily means leaving some multicore performance on the table.

On the other hand, if free threading is disallowed, I think there’s an opportunity to define a disciplined way of using multithreading that maximizes multicore performance, without allowing any internal interpreter state anomalies, and without impacting single-thread performance. The only cost is some enforced discipline, and multithreaded Python code being a little more verbose and explicit when it comes to synchronisation.

Basically, if Python’s threading model is to make subinterpreter-based threading shared-nothing by default, but with a way to define a set of shareable objects, then that would eliminate the vast majority of the reasons that people want free threading. This is the core idea behind arena-based allocation.

Python code should gain a mechanism to explicitly define which memory arena objects are allocated into (e.g. by entering a context manager), with the active arena determining where objects are allocated (arenas never share memory pages, to avoid cache contention and to optimise for locality on NUMA systems). Sharing objects between multiple threads would then be as simple as taking turns holding one or more arena locks. This is mostly what people who write sane multithreading code are already supposed to be doing anyway, so it’s really just enforcing good multithreading practice.
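None of this exists in CPython today, but the locking discipline being described can be modelled as a toy in pure Python (all names here are made up for illustration; there is no real per-arena allocation going on):

import threading
from contextlib import contextmanager

# Toy model only: an "arena" is just a lock plus the objects it owns,
# and threads take turns holding the lock before touching anything inside it.
class Arena:
    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}

    @contextmanager
    def use(self):
        with self._lock:
            yield self._objects

shared = Arena()

def worker(name):
    # Only one thread at a time is inside the arena.
    with shared.use() as objs:
        objs.setdefault("log", []).append(name)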

There are a couple of details to work out with arena-based multithreading. You don’t want objects to be able to refer directly to objects in another arena, because allowing direct references means that every object access would require arena-correctness checks, which is slow and may cause issues with garbage collection. Instead, if objects are limited so that they can only hold weak references to objects in a different arena, it should be possible to implement arenas without a performance impact on single-arena code (i.e. single-threaded code). And by only allowing weak references, garbage collection can be done on each arena independently, which avoids the need for a stop-the-world garbage collector.
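Continuing the toy model above (again purely illustrative, with made-up roles for the objects), the cross-arena restriction would amount to holding a weakref.ref instead of a direct reference:

import weakref

class Node:
    def __init__(self, value):
        self.value = value

# An object conceptually owned by "arena A".
a_obj = Node("owned by arena A")

# What an object in "arena B" would be allowed to hold: a weak reference only.
b_ref = weakref.ref(a_obj)

target = b_ref()
if target is not None:  # the dereference can fail once arena A has collected the object
    print(target.value)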

If the implementation is generic enough, it might also be possible to extend the mechanism to more specialised arenas, such as single-writer/multiple-reader arenas, or even arenas that are fully free-threaded for the objects they own.

1 Like

Not necessarily every object. The key point is that at the C level the GIL has been an implicit lock to avoid data corruption. With that gone folks will have to watch out for that a bit more in their extension modules. This typically comes up where the C code itself isn’t thread-safe, not specifically Python objects at the C level (i.e. it isn’t inherently unsafe, but the chances do go up).

It’s more likely the code was already unsafe to use with threads; it’s just that no one cared/noticed, since threads are not that widely used in Python code as they only get you so much right now. Think of code that has global state in a module: it’s not going to get corrupted by a partial write, but it could have a logic bug where you aren’t locking appropriately and thus don’t increment some counter or dict appropriately. And I will fully admit I have never worried about this in any of my Python code, although I will have to start if we end up with no GIL (and why async is still handy IMO :grin:).
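A minimal sketch of that kind of logic bug (hypothetical module code, not from any real project): the dict itself won’t be corrupted, but the read-modify-write update can silently lose increments unless you lock around it.

import threading

hits = {}                      # module-level global state
hits_lock = threading.Lock()

def record(key):
    # Racy: two threads can both read the old count and both write old + 1.
    hits[key] = hits.get(key, 0) + 1

def record_safely(key):
    # The explicit lock is what the GIL was never actually giving you here.
    with hits_lock:
        hits[key] = hits.get(key, 0) + 1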

6 Likes

I’ll just add that something like this would cause a mess in multiprocessing code as well. So a lot of modules have been battle-tested for this, to some extent.

Hardly. The post I was responding to was basically saying “nogil will require code changes in pure Python”. I was simply asking for proof of that assertion, because the same statement has been made and refuted a number of times already. If the OP has something new, then great, let’s see it. Otherwise, this is just unsubstantiated scaremongering at this point.

The problem when people say this is that they forget (or ignore) that it’s perfectly possible to write code today that isn’t thread-safe. Removing the GIL won’t make any difference to that code. Hence, concrete examples are important.

6 Likes

For those who have not yet seen, the poll of core developers ended yesterday with general favorability (see Poll: Feedback to the SC on making CPython free-threaded and PEP 703).

12 Likes

I think this is important for:

  • libraries that you don’t have the code for (because they are provided to you “as is”)
  • libraries that are old and practically abandoned, where the only thing that happened to them in the last few years was adding a new build target after a stable CPython release
  • closed-source code that corporations keep behind closed doors, written a long time ago, whose authors are no longer around to adjust everything to be compatible with free threading

If you have the source code, skills and time, maybe you’ll prepare your code for free threads… But those who don’t have such a luxury can be afraid of what’s going to happen when it doesn’t work on free threads.

I think this could be fixed by a compatibility lock. It would work in cases where the library you are importing has merely been recompiled for a Python version with free threads, with no modifications made to stop implicitly relying on the GIL (which is no longer there), and now maybe nobody even knows whether it’s safe under free threads or not. It is assumed it doesn’t change the allocator and uses Py_INCREF (and not a bare += 1) to manage refcounts.

With a compatibility lock, if you, as a user, don’t trust a package (even if it’s an indirect dependency) to be able to handle free threading, you’ll be able to disable free threads for that package from your code by wrapping it in a proxy that uses a global lock. The question now is how easy we could make this, and the worst solution I can imagine is everyone who has doubts about a dependency handcrafting a custom implementation of a compatibility lock of their own.

I think there is a way to make it easily reusable where an import hook wraps the target module in a transparent object with __getattr__ etc (sort of like functools.wraps, but for entire modules). Then the user that wants to apply a lock over something only needs to specify the package names (before they are imported) like this:

import compatibility_lock
compatibility_lock.disable_free_threading_for(["graphviz", "PIL"])
import graphviz
import PIL

The funny thing is that the hook doesn’t know whether it’s a native or pure Python module that it’s wrapping, so those who believe their pure Python code shouldn’t be run on free threads can lock that down too. The compatibility lock could, however, be configured to use a package-scoped lock instead of a global lock, so two different CPU-hungry dependencies that don’t share state could still run in parallel, letting even a program with 2+ non-nogil-aware dependencies use multiple cores.
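A rough sketch of what the wrapping itself could look like, leaving out the import-hook machinery (wrap_with_lock is a made-up helper name; this is just one way the proxy described above might be written):

import threading

def wrap_with_lock(module, lock=None):
    # Return a proxy that serialises every call into `module` under one lock.
    lock = lock or threading.RLock()

    class _LockedProxy:
        def __getattr__(self, name):
            attr = getattr(module, name)
            if callable(attr):
                def locked(*args, **kwargs):
                    with lock:
                        return attr(*args, **kwargs)
                return locked
            return attr  # plain data attributes pass through unwrapped

    return _LockedProxy()

# Example: serialise all direct calls into the json module behind one lock.
import json
json = wrap_with_lock(json)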

Passing a custom lock object to the compatibility lock wrapper could then allow the operator to measure the contention - in cases where it’s rare, it might not be worth investigating and providing fine-grained locks; just keep the compatibility lock if it works well (if you don’t mind the overhead, that is - integrating the lock into the native extension would remove the function-call overhead). Perhaps more optimizations are possible, a cCompatibilityLock etc.

I’m not sure if it helps anyone, though if it is helpful and can ease the Python ecosystem into free threading, I would provide the implementation (+ maintenance for at least a few years).

I’m also not sure about how/where to distribute - maybe a 3rd party package isn’t appropriate after all (for those who can only use whatever RHEL ships on the dvd + own code)

2 Likes

How would that compat locking work?

Would it build on top of the C API Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS?

How so? In multiprocessing, global objects are either copy-on-write (fork) or initialize-on-start (spawn), and there is never any contention on their access.

1 Like

This is a key observation and it should get more attention.

An implementation of “Pythonic threading” combining subinterpreters with arena-based allocation would bring a great advantage to Python and its ecosystem: safe and shared-nothing multithreading by design, with an opt-in to get all the benefits of free threading at an explicit cost for just those parts of the program that need it.

Also it would be conceptually in line with existing ways to have explicitly declared nogil sections of code, e.g. in Cython.

Is there a PEP (or multiple) on arena based allocation and this threading model?

2 Likes

The key part here is “no one cared/noticed”: code that works with the GIL and doesn’t work with nogil means the blame will be on Python, not on said code. This sounds like 2-to-3 territory, risking Python being labeled backward incompatible.

If a module is relying on global state that actually gets modified by running code, then that state will be in sync when the subprocesses start, but changes in each process will be independent. So subprocesses won’t see the “true” global state that would be seen in a single-threaded application.

If the global state is basically read-only after initialization then it wouldn’t cause a problem. But it wouldn’t be a problem for multi-threaded code either.

This is all pretty theoretical–I can’t think of a module that behaves like this. Having a global state that actually needs to be written to (and isn’t just a cache or something) is pretty rare and usually not a great idea.

I believe the context manager to ignore warnings modifies global state, off the top of my head
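For example (to the best of my knowledge), warnings.catch_warnings() works by saving and restoring the module-level warnings.filters list, so any filter change made inside it is visible to every thread for the duration, with or without the GIL:

import threading
import warnings

def worker():
    with warnings.catch_warnings():       # snapshots the *global* filter list
        warnings.simplefilter("ignore")   # mutates warnings.filters for all threads
        ...                               # warnings raised by other threads are silenced here too

threading.Thread(target=worker).start()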

2 Likes

Actually it does: multithreading is safe in Python for any objects that provide updates in one bytecode instruction. For example, most operations on lists and dicts are atomic. The GIL provides this atomicity. The free threading model as in nogil/PEP 703 removes this implicit atomicity, requiring explicit locking. Not only does this make code more complex to read and write, it also means reasoning about multithreaded programs in Python requires a lot more know-how and experience than most developers have.

The free threading model as in nogil/PEP 703 removes this implicit atomicity, requiring explicit locking

This part is false: operations that happen atomically under the GIL have had fine-grained locks added by the PEP to preserve that atomicity. So existing atomic list operations under the GIL will stay atomic under nogil, and any difference there is a valid bug report to be fixed.

One of the main goals of the nogil PEP is that all race-condition/atomicity guarantees for pure Python code stay the same.
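To illustrate the distinction with a toy example (counts are indicative, not guaranteed): list.append stays atomic in both builds, while a read-modify-write like count += 1 was never atomic even under the GIL and needs an explicit lock either way.

import threading

items = []
count = 0

def worker():
    global count
    for _ in range(100_000):
        items.append(1)   # one C-level list operation: atomic under the GIL,
                          # kept atomic by PEP 703's per-object locking
        count += 1        # load/add/store across several bytecodes:
                          # not atomic with or without the GIL

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(items))  # 400000
print(count)       # can be less than 400000 in either build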

12 Likes

This is assuming that there will be no unintended side-effects from removing the GIL. That I doubt, and my intention is not scaremongering - I want better threading as much as anyone.

However, I don’t think it would be wise to “push through” a model that is known to be error-prone and hard to get right for (most) users, i.e. Python application and extension developers, most of whom do not specialise in multithreading.

Perhaps even more importantly, free threading is known to be hard to get right for VM-based languages, not least because GC requires some “stop the world” event. E.g. in some JVM applications this was a troubling and challenging concern for a long time.

Besides, my 35 years of development experience across multiple languages and platforms tell me that asserting “no effects will happen” almost always turns out to be a major headache, especially when it concerns a foundational property that has effectively become a relied-upon guarantee throughout the ecosystem.

Last but not least, I think Python should avoid any further opportunity to be labeled backward incompatible. Having to specify an additional CLI option to enable some specific mode is a strong indication of exactly that.

1 Like

Meeting this goal without fail would mean keeping the GIL, because nobody can possibly know what (implicit) guarantees the GIL has been the foundation of. Actually, that’s why there is ‘--nogil’.