PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

A related discussion (which zooms out quite a bit) here: A fast, free threading Python - Ideas - Discussions on Python.org

7 Likes

Under the Mimalloc Page Reuse section:

It is beneficial to keep the restrictions on mimalloc page reuse to a short period of time to avoid increasing overall memory usage. Precisely limiting the restrictions to list and dictionary accesses would minimize memory usage, but would require expensive synchronizations. At the other extreme, keeping the restrictions until the next GC cycle would avoid introducing any extra synchronizations, but would potentially increase memory usage.

It seems to me that waiting for GC is fine, since that’s generally where users expect memory to be reclaimed. Is waiting for GC actually an issue in practice, or just “potentially” problematic? If it is an issue in practice, is that also the case with lower GC thresholds like the ones CPython currently uses?

I ask because waiting for GC avoids the complexity of the strategy described in this section (which reads to me as more of “a maybe-nice-to-have optimization” than “necessary for GIL removal to be viable”).

Yes, it’s an optimization to avoid potential problems rather than something necessary for GIL removal to be viable. If you got rid of it, you could remove Python/qsbr.c (145 LOC) and a few lines of code elsewhere, but I don’t think you’d save much more. I think the actual code complexity is relatively small and probably worthwhile, but it’s certainly possible to take a wait-and-see approach and just wait for GC at first.

3 Likes

I’m still curious about the perf impact of making container GetItem return a reference that is autoreleased at the next thread state release, as removing possible GetItem unsafety would make me feel much better about C extension safety during a nogil transition. PEP 703: Making the Global Interpreter Lock Optional - #87 by lunixbochs

There’s another possible transition method: enable the GIL per thread by default, but allow opting out of the GIL for specific threads (rather than all-or-nothing for the whole program) via either a C API function or a with context manager (see the sketch below). This could initially confine any safety impact to nogil opt-in users rather than all CPython users, and buys time to work out edge cases. It also makes a --disable-gil compile flag or a separate PyPI interpreter name less necessary.
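To make the shape of that proposal concrete, here’s a minimal sketch of what the context-manager form might look like. threading.no_gil() is an invented name for the hypothetical opt-out; no such API exists today:

import threading

def crunch(chunk):
    return sum(x * x for x in chunk)   # stand-in for CPU-bound work

def worker(chunk):
    # Hypothetical API: opt this thread alone out of the GIL.
    # threading.no_gil() is an invented name, not a real function.
    with threading.no_gil():
        crunch(chunk)

chunks = [range(1_000_000)] * 4
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()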

Can the ABI changes necessary for nogil happen independently of the SC deciding on nogil itself being merged to the main interpreter? That’s a much smaller change and would allow C extensions to be compatible with both interpreters long before nogil is included.

(FWIW the realtime improvements from GIL removal are so compelling for me that I’m investigating switching to it even before a cpython merge)

3 Likes

With respect to the technical concerns people brought up, I think it’s broadly agreed that the opportunity cost of rejecting nogil is too large to even quantify.

An energised community with lots of problems to solve is (IMHO) better than a disappointed community with no problems to solve.

I do agree with Mark that picking a long-term strategy is very important.

8 Likes

FWIW I have an example that I believe fits the kind of operation you mean.

A while ago, I had to iterate over a mutable, global dict (let’s call it MGD) in a multi-threaded Python app. Inevitably, at some point I started getting “dictionary changed size during iteration” errors. The first thing that came to mind was: ok, whenever I want to iterate over MGD, I have to do keys = tuple(MGD); for k in keys: ... instead of for k in MGD: .... But then I started scratching my head: is what I’m doing now actually correct, or have I just made the race window smaller? I asked around in #python on Libera.Chat, and everybody told me the same thing:

tuple(MGD) still formally iterates over the mutable global dict, so you’re still subject to races, but you can trust that it will be atomic under the GIL and you’ll be fine.

The problem is that AFAIK there is no standard way of atomically retrieving a dict’s keys, so there might be a lot of code like tuple(MGD) in the wild that hopes to be atomic.
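For concreteness, a minimal sketch of the two variants (process() is a stand-in for whatever the loop body does). As noted above, the snapshot form narrows the race to the tuple() call but doesn’t remove it:

def process(key):
    print("seen", key)

MGD = {"a": 1, "b": 2}   # stands in for a dict shared with writer threads

# Racy form: iterates the live dict. A concurrent insert or delete in
# another thread raises "dictionary changed size during iteration".
for k in MGD:
    process(k)

# Snapshot form: relies on tuple(MGD) itself being atomic under the
# GIL. The loop then walks a private copy of the keys, though another
# thread can still mutate MGD between the snapshot and the end of the
# loop.
for k in tuple(MGD):
    process(k)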


Sidenote: if somebody is thinking that I just shouldn’t be accessing global, mutable data without proper manual locking/synchronization, that global mutable dict was actually sys.modules. Even if I were ok with placing locks around every single import in my code, that’s not something I can expect from third-party pure-Python libraries that I want to use. Also, I know that I’m still subject to races when I do keys = tuple(sys.modules); for k in keys: ..., since another thread could import stuff between the construction of the tuple and the end of my loop, but I know the implications and I’m ok with it.

5 Likes

That error isn’t threading-related. If the dict’s keys change during iteration, the order of keys produced might well change. That’s just a property of dictionaries (and sets, I believe). You should be able to produce it without involving threads at all.

2 Likes

This example isn’t currently atomic in nogil-3.12, but it should be. I’ll make that change.

There are a bunch of similar patterns that are already atomic. For example, the equivalent keys = list(MGD.keys()); for k in keys: is already atomic. FWIW, there’s CPython library code that relies on this sort of behavior – for example https://github.com/python/cpython/blob/34e93d3998bab8acd651c50724eb1977f4860a08/Lib/concurrent/futures/process.py#L413.
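The shape of that pattern, paraphrased rather than copied from the linked stdlib source: a dict shared across threads is snapshotted in a single C-level call, and the snapshot is then iterated safely:

pending = {i: f"item-{i}" for i in range(5)}   # shared work table

def drain():
    # list(pending.keys()) executes as one C-level call, so under the
    # GIL no other thread can resize `pending` mid-copy. Entries may
    # still vanish before we pop them, hence the default of None.
    for work_id in list(pending.keys()):
        item = pending.pop(work_id, None)
        if item is not None:
            print("handling", item)

drain()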

8 Likes

When writing multithreaded Python code, I’ve always wondered: is there a definitive list of atomic operations documented somewhere? I understand they probably only involve the basic builtin types (dict, list, set), and I seem to remember reading that setting a key in a Python dict could be considered atomic:

my_dict["foo"] = "bar"  # this is atomic

I’ve struggled with this atomicity question, wondering to a paranoid degree: should I lock here or not? And I thought that even if the GIL is mostly there to protect the validity of C-side state, it also helps create those very few atomic Python operations.
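As far as I know there’s no normative list in the language reference; the old library FAQ gives examples that are effectively atomic in CPython because they execute as a single bytecode or C call on a builtin type. A few commonly cited ones (CPython-under-the-GIL behaviour, not a language guarantee):

d = {}
L = []

d["foo"] = "bar"        # D[x] = y: effectively atomic
L.append(1)             # L.append(x): effectively atomic
L.extend([2, 3])        # L1.extend(L2): effectively atomic
item = L.pop()          # x = L.pop(): effectively atomic
L.sort()                # L.sort(): effectively atomic

# NOT atomic: read-modify-write, even on builtin containers.
counts = {"n": 0}
counts["n"] += 1        # getitem, add, setitem: three steps that can race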

Now, if Sam’s nogil doesn’t even change the list of atomic operations, I don’t really see what nogil would break on the Python side. Already today, multi-threaded Python code (in my experience) has all the free-threading pitfalls (concurrency bugs from shared mutable types accessed in different threads) with very few of the advantages: you can only scale with IO-bound threads, and not by much due to GIL acquisition contention (I think?). I have used threads with good success in a PyQt app. My understanding so far, on the Python-code side at least, is that the pitfalls don’t change, and we get a performance increase.

6 Likes

Following up on Gregory bringing up __future__: Would it make sense to use from __future__ import nogil together with an is_gil_enabled() runtime check that C libraries can use, so that we don’t need a --disable-gil flag and don’t require library maintainers to package and test two separate releases/builds?
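A sketch of how that might look for a library; both the future-import and sys.is_gil_enabled() are invented spellings of this proposal, not existing features:

from __future__ import nogil   # hypothetical: opt this module out of the GIL

import sys

def pick_locking_strategy():
    # Hypothetical runtime check a C-extension wrapper could use at
    # import time; sys.is_gil_enabled() is an invented name.
    if sys.is_gil_enabled():
        return "rely on the GIL for coarse-grained safety"
    return "use the extension's own fine-grained locks"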

1 Like

In my use-case, the problem was exactly Thread2 executing import statements while Thread1 was iterating over sys.modules. My point was just to mention tuple(sys.modules) as an example of pure-Python code that, today, doesn’t need locking because of the GIL, but, in nogil, would need either:

  • manual locking (which is not feasible in the case of sys.modules); or
  • a guarantee from the language that this operation (and other selected operations) is atomic (which is what Sam and Matthieu mentioned).

That may well be, but the error isn’t specific to another thread mutating the dict. Here’s a trivial example that demonstrates it without a second thread in sight:

>>> d = dict(zip(range(10), range(10, 20)))
>>> for k in d:
...   if k % 2 == 0:
...     d[2 * k] = "str"
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

The use of threading might simply make the error more sporadic and harder to track down.

Though without threading, the “for k in list(d):” strategy is 100% guaranteed to eliminate the error.
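Continuing the REPL session above (single-threaded, the snapshot fully eliminates the error; with threads it only shrinks the window to the list(d) call):

>>> d = dict(zip(range(10), range(10, 20)))
>>> for k in list(d):
...   if k % 2 == 0:
...     d[2 * k] = "str"
... 
>>> len(d)  # keys 12 and 16 were added, with no RuntimeError
12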

2 Likes

I think this was the whole point of the original example. Even with threading, CPython wouldn’t let anyone else modify d during the call to list(d). Without the GIL, one could imagine another thread adding a new element to d even during that operation, so list would throw an exception as it tried to iterate over the dictionary. Python doesn’t do that now, and existing threading code might rely on that, even if it’s not explicitly defined as an atomic operation [1].

I don’t know how reasonable it is to diagnose these situations from the CPython implementation itself…it seems like any block of code where the interpreter could never switch threads (a lot of builtins that are written in C, for instance) might be an example.


  1. it might be, somewhere? Not something I’ve had the misfortune of needing to learn

1 Like

For the version-specific ABI, probably. But it wouldn’t work for the stable ABI.

I would disagree with that assessment. I have not seen a lot of core devs outside of the Faster CPython team commenting here, and it’s us/those folks who will have to maintain this. As has been pointed out, there’s a long-term cost to this sort of performance gain/approach.

I’m not sure you could swap a running interpreter into a different mode after it has already started.

1 Like

That’s basically a “definition” of what happens to be atomic, in as much as there is one: a builtin / C-defined type running the iteration internally, consuming an iterator of another C-defined type. Those are magically atomic today, so we’d seek to preserve that semantic, as so much relies upon it.

This is of course assuming they’re the simple, unforgiving implementations that do not periodically call the eval-breaker-check equivalent (there’s an API name; it’s not at the top of my head) to check for pending signal handlers or whether it’s time to drop the GIL to allow a thread switch.

Sam mentioned that list(d) is already handled? Most of these cases can be handled for the builtins, as they all work in concert, know about each other, and can, for example, grab the lock of the container being iterated over.

4 Likes

I don’t think it is reasonable to assume that the people commenting in this thread represent the entire Python community accurately. Anecdotally, I know many people in industry and academia who are in favor of this and are watching for updates, but have nothing much to contribute beyond expressing support, which would come across as spammy.

edit: actually, I personally do not know anybody who thinks the benefits of this do not outweigh the costs

21 Likes

@colesbury I haven’t studied the nogil changes extensively yet, but taking a specific example: list_fetch_item first checks list_needs_read_lock(obj), which seems to test (1) whether the object is owned by the current thread and (2) whether it has the _PyGC_MASK_SHARED flag. If no read lock is needed, no locking is performed on the list fetch.
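A rough Python paraphrase of that check as I read it; the helper names are invented, and the real code is C in the nogil fork:

def list_needs_read_lock(obj):
    # owned_by_current_thread(): invented stand-in for the biased
    # refcounting ownership test; has_shared_flag(): invented stand-in
    # for testing the _PyGC_MASK_SHARED bit.
    return (not owned_by_current_thread(obj)) or has_shared_flag(obj)

def list_fetch_item(obj, i):
    if not list_needs_read_lock(obj):
        return items(obj)[i]          # fast path: no locking at all
    with read_lock(obj):              # slow path: take the read lock
        return items(obj)[i]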

Am I missing some nuance, or does this mean at any point in time there’s a fast path to tell whether an object is thread local or whether it’s shared between threads? Because if this check is enough to elide locks on an object, that seems to mean we’re confident the method won’t race, which should also be enough to satisfy a specialization guard.

Even if passing this check means we know the method won’t race, that won’t help performance if this “fast path” can’t be taken often. Surely, in the sorts of realistic programs users want multi-threading for, there’s hot code operating frequently on objects that weren’t created as thread-locals?

At OpenAI, we have extension modules that would not have been written if we had no GIL. It would also make maintaining the bindings to the native libraries we would still have easier. In just the past two weeks, we’ve run into two problems in our C extensions that wouldn’t have been a problem in a no-GIL world (a performance issue caused by not dropping the GIL somewhere, and segfaults caused by a GIL drop that should have been more narrowly scoped). With the kinds of distributed ML systems that OpenAI runs, it’s pretty common for us to spend time fighting the GIL. In addition to the performance / latency / availability issues that others have mentioned, the GIL has also contributed to multiple deadlocks we’ve had.

28 Likes