PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

nogil-3.12 identifies one possible thread-safety issue with specialization: reading from inline bytecode caches while another thread is writing them. Since the caches are only written once (during specialization), nogil-3.12 gets around this particular issue by specializing each instruction only one time in the presence of threads and locking around all specialization attempts. In practice, this means that a failing specialization (such as an attribute lookup on an instance of a new or modified class at the same site) cannot “give up” and re-specialize as it can when run single-threaded. This is problematic, since a failing specialization is slower than no specialization at all.
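
To picture the kind of site being described, here is a pure-Python sketch (my illustration, not from the post) of an attribute load that keeps failing its specialization guard because it alternates between classes:

class A:
    x = 1

class B:
    x = 2

def read_x(obj):
    return obj.x  # one LOAD_ATTR site; it specializes for whichever class it sees first

for obj in [A(), B()] * 1000:
    read_x(obj)  # the guard fails on every other call; single-threaded CPython can
                 # eventually re-specialize, but nogil-3.12 with threads cannot, so
                 # the site is stuck paying the failed-specialization cost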

We’ve spitballed a few possible fixes for this, including locking (or tolerating races on) the deopt counters of the specializations that don’t use inline caches, or loading the opcode and all caches in one atomic operation. We don’t yet have a good idea of how viable these approaches really are.

However, this is not the only type of race that can occur. It’s often unsafe to run code between a specialized instruction’s “guards” and “actions”, since that can defeat the purpose of guards. In general, it’s not enough to just protect code objects, since guards and their actions may depend on the state of any arbitrary object for thread-safety.

While nogil-3.12 modifies 8 of the ~60 total specializations to improve their thread-safety, it is still possible to crash at least 7 of the modified instructions (as well as several unmodified ones) from (admittedly contrived) pure-Python code.

I’ve identified ~20 specializations that appear to be unsafe in their current form, so there is probably still quite a bit of work to do before specialization is truly thread-safe. It’s not yet clear to us how best to fix them, what the performance hit will be now, or how making them thread-safe will complicate our attempts to remove and reorder guards later.

12 Likes

It makes sense as a mechanism for initial experimentation and stability. Consider it a transitional period. If existential issues crop up as the ecosystem readies itself and attempts to use it in practice, those would be signs of things to fix before we would be comfortable declaring it ready for prime time (i.e.: default/only behavior).

(We’ve done this in the past: from __future__ import annotations is a nice recent example. We planned that future and its default behavior change release and found via the community that the original plan was wrong… so we paused and altered our future.)

The underspecified future in the PEP is actually something I think we here and the steering council can help craft a better “Plan A” set of goals for.

It should be clear to everyone that if we ship a --disable-gil option and its ABI (which I’ll call t because I don’t think the negative n is a good letter for it): the ultimate goal would be for it to become the default and only behavior in the long run, iff no existential blockers are found during the exploratory release period. It is probably too early to declare a fixed time frame for that up front. I’d just state minimum times (N releases) and goals for when and how we expect to actually decide whether we’re good to move forward.

Questions exist in the interim that we should really offer advice for in the PEP: What is a CPython distributor supposed to do? Which build should they ship? Does python.org also ship t release builds? Do we need an official new name for a python3 entry-point binary built that way? Is this an opportunity to require that newfangled py launcher for this selection purpose, or would that become a curse?

3 Likes

When talking about pure Python code, I do not think this is true. Your pure-Python PyPI packages are already in use by people writing multithreaded applications in which any given Python operation can be interrupted by a thread switch at any point. Our existing GIL-threading and signal-handling implementation has always allowed interrupts at almost any Python bytecode boundary.

It isn’t really relevant whether the threads run concurrently in time or in the cooperative execution style that the GIL has so far enforced. At the high level of Python, the risks are the same, just as multithreading’s requirements on application code don’t change between a one-core system and a multi-core system.

I believe there are some constructs that people might “rely” on (intentionally or not) as being “atomic” from a Python point of view (though I don’t have any off the top of my head); doing so has always been dangerous, because we’ve changed some of that behavior over the years as we tweak when GIL-thread-swap/signal checks happen within our evolving eval-loop dispatch system. Our atomicity guarantees are mostly that modifying individual basic data types won’t cause crashes: dict or list modification from multiple threads, for example. You could never tell what the order of operations would be, but nothing is going to corrupt the internal state of either (that’d be unpythonic). Thus the per-data-structure locks being added and the lock-acquisition-ordering trick laid out in the PEP, IIUC.
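
A minimal illustration of that guarantee (my example, not from the post): several threads mutate a shared list concurrently, the interleaving is unpredictable, but the list itself is never corrupted.

import threading

shared = []

def worker(n):
    for i in range(10_000):
        shared.append((n, i))  # each append is atomic; no internal corruption

threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(shared) == 40_000  # every append lands; only the order is unspecified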

13 Likes

My intuition is telling me there are existing reusable mechanisms for synchronizing specialization guard checks leading into actions: thread-local structures, potentially versioned global ones, and a less-frequently acquired specialization/JIT lock (SJL, pronounced “Sigil”, spelled “SiJiL”?) for the designed-to-be-less-frequent times at which un-ignorable[^] conflicts could arise.

But if I try to spell out how I see that working without spending some hours mulling things over and sketching it out, I’m likely to get it wrong and add confusion. So I’ll just drop that as a food-for-thought hint for those whose heads are already in that space.

[^] - “ignorable” because execution could just bail to the safe/slow code path; you’ll be back here executing the same action in the future and get another chance at the dynamic code improvement lottery, if it hasn’t already been done for you.


Not-fleshed-out :blinking-under-construction-banner-from-the-90s: brainstorm: something like a per-thread guard pointer carrying an action version, always checking that the thread-local guard pointer’s version matches the action version, and otherwise either triggering a resync from the global pointers or setting a “resync” bit and bailing to the slow safe path so as not to pause execution during an unlikely version conflict? A resync might require a lock, or at least an atomic read and maybe a write barrier or two; it’d be responsible for updating thread-local pointers from the global state. The viability of that may depend on the complexity of guards. I’m thinking at a high level right now, for lack of personal knowledge of the specialization internals/plans. (I’m used to thinking of guards as intentionally trivial boolean operations.)
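
To make the brainstorm slightly more concrete, here is a pure-Python sketch of the version-check-and-resync idea (entirely my own invention; every name is made up, and the real thing would live in C with atomics rather than a thread-local dict):

import threading

# Global specialization state; bumping the version invalidates every
# thread's cached snapshot of it.
global_version = 0
global_guard_state = {"expected_type": int}

_local = threading.local()

def run_site(obj, fast_path, slow_path):
    if getattr(_local, "version", -1) != global_version:
        # Version conflict: resync the thread-local view, then bail to the
        # slow safe path rather than pausing execution.
        _local.version = global_version
        _local.guard_state = dict(global_guard_state)
        return slow_path(obj)
    if isinstance(obj, _local.guard_state["expected_type"]):  # trivial guard
        return fast_path(obj)  # guard passed: take the specialized action
    return slow_path(obj)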

I like to assume this kind of thing has already been done in JVMs or way back into Smalltalk and whatnot land and covered in related papers. I’m the wrong person to ask, y’all likely already know of any such references. =)

3 Likes

Thanks for the list of issues and test script. I will look into them.

6 Likes

A related discussion (which zooms out quite a bit) here: A fast, free threading Python - Ideas - Discussions on Python.org

7 Likes

Under the Mimalloc Page Reuse section:

It is beneficial to keep the restrictions on mimalloc page reuse to a short period of time to avoid increasing overall memory usage. Precisely limiting the restrictions to list and dictionary accesses would minimize memory usage, but would require expensive synchronizations. At the other extreme, keeping the restrictions until the next GC cycle would avoid introducing any extra synchronizations, but would potentially increase memory usage.

It seems to me that waiting for GC is fine, since that’s generally the place where users expect memory to be reclaimed. Is waiting for GC for this actually an issue in practice, or just “potentially” problematic? If it actually is an issue in practice, is that also the case when using lower GC thresholds like CPython currently has?

I ask because waiting for GC avoids the complexity of the strategy described in this section (which reads to me as more of “a maybe-nice-to-have optimization” than “necessary for GIL removal to be viable”).

Yes, it’s an optimization to avoid potential problems rather than something necessary for GIL removal to be viable. If you got rid of it, you could remove Python/qsbr.c (145 LOC) and a few lines of code elsewhere, but I don’t think you’d save much more. I think the actual code complexity is relatively small and probably worthwhile, but it’s certainly possible to take a wait-and-see approach and just wait for GC at first.

3 Likes

I’m still curious about the perf impact of making container GetItem return a reference that is autoreleased at the next thread state release, as removing possible GetItem unsafety would make me feel much better about C extension safety during a nogil transition. PEP 703: Making the Global Interpreter Lock Optional - #87 by lunixbochs

There’s another possible transition method: enable the GIL per thread by default, but allow opting out of the GIL for specific threads (rather than all-or-nothing for the whole program) via either a C API function or a with context manager. This could reduce the initial safety impact to nogil opt-in users rather than all CPython users, and would buy time to work out possible edge cases. It would also make a --disable-gil compile flag or a separate PyPI interpreter name less necessary.
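
For illustration only, the with-statement variant might look something like this (nogil_enabled and do_parallel_work are invented names; no such API exists in CPython or nogil-3.12):

import threading

def worker():
    # Hypothetical context manager: releases the GIL for this thread only.
    with nogil_enabled():
        do_parallel_work()  # runs in parallel with other opted-out threads
    # Normal GIL semantics resume here for the rest of the thread.

threading.Thread(target=worker).start()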

Can the ABI changes necessary for nogil happen independently of the SC deciding on nogil itself being merged to the main interpreter? That’s a much smaller change and would allow C extensions to be compatible with both interpreters long before nogil is included.

(FWIW the realtime improvements from GIL removal are so compelling for me that I’m investigating switching to it even before a cpython merge)

3 Likes

With respect to the technical concerns people brought up, I think it’s generally agreed that the opportunity cost of rejecting nogil is too large to even quantify.

An energised community with lots of problems to solve is (IMHO) better than a disappointed community with no problems to solve.

I do agree with Mark that picking a long-term strategy is very important.

8 Likes

FWIW I have an example that I believe fits the kind of operation you mean.

A while ago, I had to iterate over a mutable, global dict (let’s call it MGD) in a multi-threaded Python app. Inevitably, at some point I started getting “dictionary changed size during iteration” errors. The first thing that came to my head was: ok, whenever I want to iterate over MGD, I have to do keys = tuple(MGD); for k in keys: ... instead of for k in MGD: .... But then I started scratching my head: is what I’m doing now actually correct, or have I just made the race window smaller? I asked around in #python on Libera.Chat, and everybody told me the same thing:

tuple(MGD) still formally iterates over the mutable global dict, so you’re still subject to races, but you can trust that it will be atomic under the GIL and you’ll be fine.

The problem is that AFAIK there is no standard way of atomically retrieving a dict’s keys, so there might be a lot of code like tuple(MGD) in the wild that hopes to be atomic.
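
To spell out the resulting pattern (my sketch, with a generic MGD standing in): take the snapshot in one hoped-atomic call, then iterate the snapshot while other threads keep writing.

import threading

MGD = {n: n for n in range(1000)}  # stand-in for the mutable global dict

def mutator():
    for n in range(1000, 2000):
        MGD[n] = n          # concurrent writes while the other thread iterates

t = threading.Thread(target=mutator)
t.start()
keys = tuple(MGD)           # the snapshot itself is the step that must be atomic
for k in keys:
    MGD.get(k)              # safe: we iterate the snapshot, not the live dict
t.join()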


Sidenote: if somebody is thinking that I just shouldn’t be accessing global, mutable data without proper manual locking/synchronization: that global mutable dict was actually sys.modules. Even if I were ok with placing locks around every single import in my code, that’s not something I can expect from third-party pure-Python libraries that I want to use. Also, I know that I’m still subject to races when I do keys = tuple(sys.modules); for k in keys: ..., since another thread could import stuff between the construction of the tuple and the end of my loop, but I know the implications and I’m ok with it.

5 Likes

That error isn’t threading-related. If the dict’s keys change during iteration, the order of keys produced might well change. That’s just a property of dictionaries (and sets, I believe). You should be able to produce it without involving threads at all.

2 Likes

This example isn’t currently atomic in nogil-3.12, but it should be. I’ll make that change.

There are a bunch of similar patterns that are already atomic. For example, the equivalent keys = list(MGD.keys()); for k in keys: is already atomic. FWIW, there’s CPython library code that relies on this sort of behavior – for example https://github.com/python/cpython/blob/34e93d3998bab8acd651c50724eb1977f4860a08/Lib/concurrent/futures/process.py#L413.

8 Likes

When writing multithreaded Python code, I’ve always wondered whether there is a definitive list of atomic operations documented somewhere. I understand they probably only involve basic builtin types (dict, list, set), and I seem to remember reading that setting a key in a Python dict could be considered atomic:

my_dict["foo"] = "bar"  # this is atomic

I struggled with this atomicity question, wondering to a paranoid degree: should I lock here or not? And I figured that even if the GIL is mostly there to protect the validity of C-side state, it also helped create those very few atomic Python operations.

Now, if Sam’s nogil doesn’t even change the list of atomic operations, I don’t really see what nogil would break on the Python side. Already today, in the multithreaded Python code I’ve written, you get all the free-threading pitfalls (concurrency bugs from shared mutable types accessed in different threads) with very few of the advantages: you can only scale with IO-waiting threads, and not by much due to GIL acquisition contention (I think?). I have used threads with good success in a PyQt app. My understanding so far is that, at least on the Python-code side, the pitfalls don’t change and we get an increase in performance.
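
To make the “should I lock here?” question concrete (my example, not from the post): a single dict store is atomic, but a read-modify-write on a shared value is not, so it still needs a lock even under the GIL:

import threading

counter = {"n": 0}
lock = threading.Lock()

def incr():
    for _ in range(100_000):
        with lock:             # needed: += compiles to separate load and store
            counter["n"] += 1  # steps that another thread can interleave with

threads = [threading.Thread(target=incr) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert counter["n"] == 400_000  # without the lock, updates could be lost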

6 Likes

Following up on Gregory bringing up __future__: would it make sense to use from __future__ import nogil together with an is_gil_enabled() runtime check that C libraries can use? Then we wouldn’t need a --disable-gil flag, and library maintainers wouldn’t be required to package and test two separate releases/builds.
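
For illustration, the suggested spelling might look like this (entirely hypothetical: neither a nogil future import nor is_gil_enabled() exists, and run_fallback/fan_out are invented names):

from __future__ import nogil   # hypothetical per-module opt-in to free threading

import sys

if sys.is_gil_enabled():       # hypothetical runtime check for C libraries
    run_fallback()             # invented: stay on the single-threaded path
else:
    fan_out()                  # invented: exploit true parallelism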

1 Like

In my use-case, the problem was exactly Thread2 executing import statements while Thread1 was iterating over sys.modules. My point was just to mention tuple(sys.modules) as an example of pure-Python code that, today, doesn’t need locking because of the GIL, but, in nogil, would need either:

  • manual locking (which is not feasible in the case of sys.modules); or
  • a guarantee from the language that this operation (and also other selected operations) is atomic (which is what Sam and Matthieu mentioned).

That may well be, but the error wasn’t because the dictionary’s keys were changed during iteration. Here’s a trivial example to demonstrate without a second thread in sight:

>>> d = dict(zip(range(10), range(10, 20)))
>>> for k in d:
...   if k % 2 == 0:
...     d[2 * k] = "str"
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

The use of threading might simply make the error more sporadic and harder to track down.

Though without threading, the “for k in list(d):” strategy is 100% guaranteed to eliminate the error.
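
Continuing the REPL session above, the snapshot version completes without the error (the even keys 6 and 8 add new keys 12 and 16, but the loop only visits the snapshot):

>>> d = dict(zip(range(10), range(10, 20)))
>>> for k in list(d):
...   if k % 2 == 0:
...     d[2 * k] = "str"
... 
>>> sorted(d)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 16]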

2 Likes

I think this was the whole point of the original example. Even with threading, CPython today won’t let another thread modify d during the call to list(d). One could imagine another thread adding a new element to d during that operation, in which case list would raise an exception as it tried to iterate over the dictionary. Python doesn’t do that now, and existing threading code might rely on that, even if it’s not explicitly defined as an atomic operation [1].

I don’t know how reasonable it is to diagnose these situations from the CPython implementation itself… it seems like any block of code during which the interpreter can never switch threads (a lot of builtins written in C, for instance) might be an example.


  1. it might be, somewhere? Not something I’ve had the misfortune of needing to learn ↩︎

1 Like

For the version-specific ABI, probably. But it wouldn’t work for the stable ABI.

I would disagree with that assessment. I have not seen a lot of core devs outside of the Faster CPython team commenting here, and it’s us/those folks who will have to maintain this. As has been pointed out, there’s a long-term cost to this sort of performance gain/approach.

I’m not sure you could swap a running interpreter into a different mode after it has already started.

1 Like