PEP 703: Making the Global Interpreter Lock Optional

Okay, so, they’re strictly worse than threads.

Mark Shannon:
Personally, I don’t like the shared memory model of concurrency, so any slowdown is too much. But that’s just my opinion.

Isn’t shared memory a crucial aspect of the use cases mentioned in the PEP?

I have a use case of an application that values complex financial instruments, referring to a plethora of objects (the “world”). I can easily partition the valuation tasks and execute them in threads, but I do not know beforehand which objects each task will need, so I cannot easily partition the “world” along the tasks (I could do it, basically having a non-parallel pass first, but that would add considerable calculation expense). Also, I can benefit from intermediate results from other tasks.

In such a scenario, non-shared memory threads would tremendously bloat up memory. I’m sure there are other scenarios that work well with non-shared memory.

<disclaimer>
I’m author of “competing” (not really) PEPs: 683 (immortal objects) and 684 (per-interpreter GIL).

To be clear, I do not consider this proposal to be mutually exclusive with a per-interpreter GIL (nor with multiple isolated interpreters generally). I consider them to be completely compatible, both conceptually and in the implementation. Fundamentally, they focus on different concurrency models. Furthermore, code changes made for the one will almost always benefit the other.
</disclaimer>

tl;dr I’m really excited by this proposal but have significant concerns, which I genuinely hope the PEP can address.

Overall, I’m quite happy to see this PEP and would love to see CPython without the GIL. Misconceptions about the GIL have been a consistent obstacle to Python adoption in the industry for a long time. (FWIW, that was the main motivator, in 2014, to start my own project to improve Python’s multi-core story.) So any successful approach to doing so is more than welcome!

That said, removing the GIL isn’t trivial and will necessarily come with tradeoffs. I’m seriously concerned by the likely negative impacts (which the PEP needs to address clearly, if not already). Fundamentally, removing the GIL would increase the maintenance burden on maintainers of both CPython and extension modules, as well as likely lead to a (relatively large?) regression to single-threaded performance. To be successful, the PEP needs to be clear about how these are either mitigated or compensated by the benefits, sufficient to show that it’s clearly worth doing.

Specifically, the proposed solution negatively impacts:

  • CPython maintainers (mostly volunteers)
    • increased complexity in runtime code
    • must now think about thread-safety everywhere (see the “Thread-Safety” section below)
    • some code will be more fragile (incl. in hard-to-debug ways, because threads)
    • extra CI resources needed (e.g. buildbots)
    • reflexive flinching due to the experience of Py3k (e.g. 2-to-3 migration, maintaining 2 parallel release branches)
  • extension module maintainers (mostly community volunteers)
    • may effectively double a project’s support/CI matrix
    • must now think about thread-safety (see the “Thread-Safety” section below)
    • may have to maintain dual implementations of some code (w/ GIL and w/o GIL)
    • projects may not have time to do the work (see the “Extension Maintainers” section below)
  • users
    • worse performance of single-threaded execution (see the “Performance” section below)
      • many more users actually paying this than will benefit from no GIL
      • now users will complain about performance instead of about the GIL
    • if any, changes in the semantics of Python code (e.g. how locks behave)
    • many will suddenly realize that programming with threads isn’t easy/fun, doesn’t actually solve any of their problems, and doesn’t make their code much faster (if at all)

Additionally, the PEP should be very clear (and specific) about:

  • how many extensions will be affected?
  • how much work to fix each?
  • how many unforeseen gaps in CPython’s test coverage?
  • how do the benefits of removing the GIL outweigh the new costs?

Other questions:

  • how much work to update the no-gil branch to the main (i.e. 3.12) branch? (a lot has changed since 3.9 in the eval loop and continues to change)
  • how does the proposal affect other Python implementations (will they be expected to support multi-core threads)?
  • will there be a new modulespec slot (see PEP 489) to indicate support for no-gil?
  • how will pip be affected?

In summary, there are some serious obstacles to the PEP and a number of unclear points and unanswered questions, but I’m still cheering you on!

==========

Feedback on specific topics:


Thread-Safety

Programming with threads has never been great. Beyond basic usage (and even then), it’s easy to get threads wrong and hard to feel confident that you’ve gotten it right. When things go wrong, they are a genuine pain to diagnose and debug (and even reproduce). I argue that threads are not a human-friendly concurrency model. That said, they are certainly ubiquitous in software and sometimes even the right tool for the job. This PEP needs to be clear about this penalty and any mitigations.

The biggest concerns (which impact maintenance costs for a project):

  • few extensions give a second thought to thread-safety, because of the GIL
  • now thread-safety would be a cross-cutting concern in all code at all times
  • most maintainers won’t have much experience with thread-safety, where even experienced users get tripped up occasionally
  • it’s easy to miss something when thread-safety is involved, and not even catch it right away (e.g. before a release or a deployment to production)
  • thread-safe code is often not the same as regular code, so there may be a surge (and cascade) of #ifdefs

All the above also applies to CPython’s code. Furthermore, extra complexity has a real cost on CPython maintainability:

  • in a number of places (e.g. dict), existing code must become more complex to achieve thread-safety (without the GIL) in an efficient way
  • this adds cost to working on that code
  • that either requires extra volunteer time or deters contributors from participating
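
As a minimal illustration of how easy the trap is (pure Python here, but the same read-modify-write pattern bites C extensions, where the GIL currently hides many of these windows):

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def incr_unsafe(self):
        # read-modify-write: two threads can read the same old value,
        # so an increment may be silently lost (even with the GIL)
        self.value = self.value + 1

    def incr_safe(self):
        # the lock makes the whole read-modify-write atomic
        with self._lock:
            self.value = self.value + 1

def hammer(method, n_threads=4, n_iters=50_000):
    c = Counter()
    def worker():
        bound = getattr(c, method)
        for _ in range(n_iters):
            bound()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c.value

# hammer("incr_safe") is always n_threads * n_iters; "incr_unsafe" may not be
```

The unsafe variant can pass tests for months and then lose updates under load, which is exactly the "miss something and not catch it before a release" failure mode described above.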

Performance

  • how big a regression without mitigations?
  • what are the specific mitigations?
  • how much do they reduce the regression?
  • how many are worth doing anyway? (they may end up being done independently and not really be considered mitigations)
  • how much has the “faster-cpython” work impacted the size of the regression?

FWIW, Fedora recently had a long discussion (several, actually) about making -fno-omit-frame-pointer the default for C compilation: https://pagure.io/fesco/issue/2817. A key point was that everyone would pay for the performance regression while only a small subset would benefit. Overall, that discussion has a number of parallels with the concerns about this PEP.

Extension Maintainers

Even if extension module maintainers have the option to not support a no-gil mode, users will ask for it. This can put a lot of pressure on already over-extended volunteers to work on a feature in which they don’t necessarily have much interest, nor the time/resources to do anything about it. I’ve faced this problem with my own project, as discussed in PEP 554 and PEP 684.

So the PEP needs to be extra clear on how the burden (and pressure) on these maintainers will be mitigated.

15 Likes

FWIW, I don’t want to distract from the discussion of PEP 703 and the topic of other concurrency models can quickly diverge enough to be not particularly relevant to PEP 703. So I’m posting with that in mind.


FWIW, I’ve been working toward this sort of concurrency model for several years, using the existing (for over 20 years) ability to run multiple independent interpreters in a single process. PEP 554 is about exposing the existing functionality (ignore any parts about “channels”, which I’m going to punt to the future). PEP 684 is about no longer sharing the GIL between interpreters. Both rely on improving isolation between interpreters, which I’ve been working on for a long time now.

The end result will be support for a CSP-style concurrency model (similar to Go’s) that also supports multi-core parallelism. It will use a shared-nothing approach, with opt-in “sharing” as opposed to full shared memory (like threads do). It’s a much more human-friendly concurrency model. Those two PEPs are a starting point. Both proposals provide long lists of possible improvements that can build on that foundation, including genuine opt-in sharing of objects.
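
For a rough feel of the CSP style, here is a sketch using today's threads and queues as a stand-in for interpreters and channels (the actual PEP 554 API is different; this only illustrates the message-passing shape):

```python
import queue
import threading

def worker(inbox, outbox):
    # shared-nothing style: the worker touches only what arrives on its
    # "channel"; None is the close sentinel
    for item in iter(inbox.get, None):
        outbox.put(item * item)

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()
for n in range(5):
    inbox.put(n)
inbox.put(None)
t.join()
results = [outbox.get() for _ in range(5)]
# results == [0, 1, 4, 9, 16]
```

The point is that data crosses an explicit boundary rather than being mutated in place by many threads, which is what makes the model easier to reason about.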

FYI, I’m aiming for the 3.12 release this year (assuming my PEPs are accepted–I’m cautiously optimistic).

We should be able to make a number of improvements on top of PEP 684 (and 554) that facilitate a lot more scalability and efficiency than where CPython’s runtime is at currently.

Also, keep in mind that there’s no intention to copy Go’s concurrency model (nor implementation), nor add similar syntactic support to Python. I also don’t anticipate that Python users will approach concurrency in the same way Go users do, so scalability and efficiency requirements will be different.

4 Likes

Since it will be our call, I guess we’re “in on it”, but this PEP has not even been sent to the SC yet so we don’t have a view on what this PEP would entail.

Technically it doesn’t matter, since it falls on the SC to make the call as to whether to accept or reject this PEP. He’s obviously going to have an opinion like every core dev, but don’t get too tied up in bugging Guido about what he thinks, since ultimately it isn’t up to him what this PEP’s fate is.

It’s actually more complicated than that. Not only are the potential performance characteristics for all preexisting Python code a concern, but there’s also the question of what a transition would look like and how costly/disruptive it would be. I would be curious to know what reactions people would get if they went to every one of their extension module dependencies and asked whether the project supports this PEP, given the work they may have to put in to support it.

3 Likes

I suspect the biggest annoyance in the ecosystem will be C extension API-level compatibility. I think two of the backwards-compat issues in PEP 703 can be worked around:

  • The GIL may lock access to non-Python data structures.
  • PyDict_GetItem (and other container) refcounting is unsafe for concurrent access.

We may be able to do the following to make the transition easier:

  • Give legacy C extension modules some kind of pseudo-GIL so their internal locking assumptions don’t break until they have a chance to update their code. I’m unsure how the deadlock cases also mentioned in the PEP would interact with such a pseudo-GIL.
  • Make PyDict_GetItem return a temporary ref that is automatically dropped when you release the GIL (similar to objc’s autorelease).

I don’t have as good of an intuition about the implications of the memory allocator API incompatibility, or whether that could be worked around - but many extensions likely won’t be affected (if there’s no clear workaround, someone could crawl PyPI to estimate how many extensions may be broken by the allocator changes).

1 Like

I’d also like to mention my use case, as it’s not AI/ML and would significantly benefit from GIL removal.
I’m developing a desktop app with dozens of soft-realtime systems in a single process, and a GIL stall (when some native code holds the GIL for too long) can cause a stutter on multiple of the systems at once.

The app is Talon. It’s extremely user-scriptable with Python, with features like eye tracking, head tracking, speech recognition, noise recognition, audio processing, 60fps screen overlay, keyboard/mouse input taps, which all might make blocking calls into the Python layer.

If the GIL is held for, say, >100ms by a C extension, the user’s eye-tracking mouse cursor may freeze in place, an input tap could block physical keyboard/mouse usage until a Python callback can next be scheduled, and any overlay UI will drop frames. Stalls and jitter over 16ms may be noticeable (e.g. I upstreamed a fix for CPython lock jitter on Windows). We also can’t run soft-realtime audio code directly in Python at all, even though Python is nominally fast enough for it, because taking a long-held lock (the GIL) can block the OS audio stack and cause it to drop audio frames.

I’ve been manually working around issues by playing whack-a-mole with the most egregious GIL stalls, but it’s still a problem in edge cases, and on more heavily-loaded user systems or weaker CPUs.

Solutions like multiprocessing are a bad fit here, as a given user will have hundreds of custom scripts running. We can’t give each user script their own process, because that would use far too many resources. There’s no easy place to implicitly split the workload. The closest solution I have in mind (besides nogil) would be to provide a WebWorker-like API where you can ask for specific functions to run in another process or in a subinterpreter with a separate GIL, but the ergonomics of that would be weird and I’d rather avoid it if possible.

11 Likes

Interesting. Tell us more about those GIL stalls in extensions and what you did about them. Is this CPU bound code that just computes too long without releasing the GIL? Or is it doing I/O without releasing the GIL? I wonder if those very extensions would also have trouble with a GIL-free Python because they might have threading bugs masked by the GIL?

1 Like

It’s primarily been CPU- and memory-bound workloads – one case was my own native extensions that need to access quite large Python data structures (e.g. preprocess a big list of around a million words sent to native code from Python for speech recognition purposes). It’s been a while since I last profiled for this, but I think another case that starved the GIL too much was a Python thread that did some computation (gzip?) in a tight loop; I mitigated that one with a small sleep in the loop. My intuition is that this is not an issue of potential race conditions, but more that enough stuff does ~milliseconds of computation with the lock held that it’s hard to hit <10ms latency targets with heavily threaded Python.
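
That sleep-in-the-loop mitigation is essentially the following (a sketch; the 1ms figure and the gzip-style workload are illustrative):

```python
import threading
import time
import zlib

def compress_all(chunks, out):
    # CPU-bound work in a tight loop holds the GIL in long bursts;
    # a tiny sleep between chunks yields it so latency-sensitive threads
    # (input taps, audio callbacks) get a chance to run
    for chunk in chunks:
        out.append(zlib.compress(chunk))
        time.sleep(0.001)  # arbitrary; trades throughput for latency

chunks = [b"sample data " * 1000 for _ in range(10)]
out = []
t = threading.Thread(target=compress_all, args=(chunks, out))
t.start()
t.join()
```

It works, but it is a blunt instrument: it slows down the background work and only helps if every compute-heavy loop in every extension cooperates.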

The cases where I really noticed it are:

  • A low-level input tap on Windows (which adds the current GIL acquire time to every keystroke, mouse movement, etc)
  • Real-time microphone processing (if we stall too long, we buffer underrun and silently drop audio)
  • Canvas API, which calls a Python function at 60fps to draw a frame (if it stalls we drop a frame and the animation stutters)

Unfortunately, for the first two cases, the solution for now was to not do either of those things. The Windows event tap is disabled because it’s really unsatisfying if we ever fail to hit interactive latency, and for mic processing I manually buffer with a separate native thread before handing off to Python, because even with all of the hot spots mitigated we were still losing some audio regularly.

People are using the app interactively to control their computer for 8+ hour sessions, so even if GIL stalls require a perfect storm of coalesced events, they may still happen constantly.

Almost all of the cases where I care about GIL acquisition times are when I call from a native thread that’s already running (e.g. USB stack, audio stack, UI/rendering stack) into Python. Having to take a global lock at this point is really unfortunate, as we’re already on a thread that’s executing, ready to go and probably wouldn’t take very long to finish the callback. Without the GIL, we’d always be able to execute the callback immediately, which is why I think it would be much easier to hit the soft-realtime target without stuttering.

Basically, any Python thread in the app that has reason to queue up some computation can stall any of the interactive Python callbacks. If there’s interest, I could again profile for GIL acquisition times and provide some more specific information about the current state of things in my app.
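
If it helps, a rough sketch of such a probe: a “callback” thread times how much longer than requested it takes to get back onto the interpreter while another thread burns CPU (all numbers illustrative):

```python
import threading
import time

stop = threading.Event()
lat = []

def busy():
    # stand-in for a compute-heavy Python thread hogging the GIL
    x = 0
    while not stop.is_set():
        x += 1

def probe(latencies, n=50):
    # sleep briefly, then measure the extra delay before this thread
    # actually resumes running Python code
    for _ in range(n):
        t0 = time.perf_counter()
        time.sleep(0.001)
        latencies.append(time.perf_counter() - t0 - 0.001)

t1 = threading.Thread(target=busy)
t2 = threading.Thread(target=probe, args=(lat,))
t1.start(); t2.start()
t2.join(); stop.set(); t1.join()
# max(lat) approximates the worst stall an external callback would see
```

A real native callback (USB/audio/UI thread) pays roughly this re-acquisition cost on every call into Python.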

6 Likes

Hi @brettcannon,

Thank you for the answer. I know there are quite a few topics going on here, but what I meant was: where do the SC (and GvR) stand on potentially calling the “new” no-GIL Python version “Python 4”, as proposed by someone in this discussion? So Python 3 would be the old GIL version, and Python 4 would be the one with the new GC and no GIL.

I was listening to GvR on the Lex Fridman show and it sounded like there will be a Python 3.99 and 3.100 before a version 4 :)

I agree, but it would be super nice if there was a cross-thread async/await (maybe like Go’s channels).

With PEP 703 it would make much more sense to have that than it is with the GIL.

I defer to the SC in this.

3 Likes

Has it been considered packaging both versions into one executable? Something like the toolbox in Android, one program linked to be ls, mv, cp, cat, etc.

Not sure how to do this in practice, but something like this in Python.c (pseudo):

int main(int argc, char **argv) {
    if ((argc > 1 && strcmp(argv[1], "--no-gil") == 0)
            || strcmp(argv[0], "python4") == 0) {  // :-)
        return main_nogil(argc, argv);
    } else {
        return main_org(argc, argv);
    }
}

Then users and distros would not have to consider which version to build and distribute. I think the executable size won’t be a huge problem. On my Mac the Python 3.10 executable is 3.6M; even if it grows by 50% it’s only 5.4M (for comparison, vim is 5.1M) :)

1 Like

That works because there are no conflicts between ls and mv; they share the C runtime and save space.

GIL vs no-GIL replaces the implementation of lots of functions, so it’s a huge ask.

You could use a launcher to exec either gil or no-gil python.

Because extension modules rely on exported functions from the executable/DLL, and those functions would have different ABIs or semantics, we’d need to duplicate every single function with a new name. Alternatively, we’d need two separate DLLs (on Windows - this won’t work on other platforms where it’s statically linked) and would have to dynamically load the right one, and extension modules would have to be compiled against the right one, which is essentially two executables except with no obvious way to tell which one you’re running. Chances are, if you miss the argument, things will crash. Potentially in an exploitable manner.

So it hadn’t been considered, but now it has, and I don’t think we will do it.

5 Likes

Thank you for the detailed comments Eric. I’ll try to address some of the questions and comments that I can answer now.

how much work to update the no-gil branch to the main (i.e. 3.12) branch?

It’s hard to give a precise estimate. The last rebase (3.9.0a3 to 3.9 final) took months of work, but that involved substantial rewrites as well. Maybe 2-5 months of work?

how does the proposal affect other Python implementations (will they be expected to support multi-core threads)?

Generally, no – the PEP is specific to CPython. Implementations that closely track CPython may wish to do so. For example, my colleagues on Cinder have expressed interest in integrating the changes after they’re adopted upstream. I would not expect PyPy to do so, but that’s up to the PyPy developers. IronPython, Jython, and GraalVM Python already do not have a GIL.

EDIT: correction from Tim Felgentreff regarding GraalPy.

will there be a new modulespec slot (see PEP 489) to indicate support for no-gil?

No, I think the ABI flag is sufficient. C API extension authors need to build specifically for the --disable-gil variant, so a modulespec slot seems redundant to me.

how will pip be affected?

pip is not affected. I’m using pip in the nogil fork and it works great. (There is a small difference, but I don’t think it matters: the fork modifies the Python tag as well as the ABI tag, while the PEP only proposes modifying the ABI tag.)
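
For illustration, the wheel-tag mechanics pip already relies on look like this (the `n` ABI flag below is purely hypothetical, standing in for whatever flag a --disable-gil build would actually use):

```python
def wheel_tags(filename):
    # wheel names: {dist}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    stem = filename[: -len(".whl")]
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag

# a regular wheel and a hypothetical no-GIL wheel differ only in the ABI tag:
wheel_tags("numpy-1.24.0-cp312-cp312-manylinux_2_17_x86_64.whl")
# -> ('cp312', 'cp312', 'manylinux_2_17_x86_64')
wheel_tags("numpy-1.24.0-cp312-cp312n-manylinux_2_17_x86_64.whl")
# -> ('cp312', 'cp312n', 'manylinux_2_17_x86_64')
```

Since pip already matches wheels by ABI tag, a no-GIL build advertising a distinct tag slots into the existing machinery without changes to pip itself.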

the proposed solution negatively impacts … extension module maintainers

There are also substantial benefits to extension module maintainers. The PEP includes quotes from a number of maintainers of widely used C API extensions who suffer the complexity of working around the GIL. For example, Zachary DeVito, PyTorch core developer, wrote “On three separate occasions in the past couple of months… I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem.”

extension module maintainers … may have to maintain dual implementations of some code

I think if you examine the patches to extensions from the nogil fork, you’ll see that they are quite small and there aren’t too many dual code paths.

6 Likes

Is it possible to categorise the groups who will see benefits and those who will see complexity/risk? At the most basic level, I assume there are 3 classes of extension:

  1. Those that currently spend time fighting the GIL, presumably trying to parallelise complex CPU-bound workloads to package them in a Python-friendly way.
  2. Those that have an awareness of threading but know that their code is safe because of the GIL.
  3. Those that don’t consider the GIL at all, assuming (rightly or wrongly) that threading isn’t relevant to them.

Class (1) clearly gain from nogil. Class (2) will have to implement their own protections to replace what the GIL currently gives them, but we can hope that they know enough to do so - so they are negatively impacted but mainly in a one-off manner which is not that different to any other incompatible change in a new release.

Class (3) is where, for me, the real issues lie. We can hope that most people in class (3) are right, and threading is irrelevant to them. But re-entrancy issues are subtle and not always easy to spot – an extension using strtok, for example, has a threading issue. My concern is the “long tail” of extensions in this class, and how we help them deal with the new and unexpected need to have an opinion on their own thread-safety guarantees.

3 Likes

I would not expect PyPy to do so, but that’s up to the PyPy developers. IronPython, Jython, and GraalVM python already do not have a GIL.

Speaking for GraalPy, this is no longer correct, we felt that we had to add a GIL to improve our compatibility with existing Python code. There were two main issues we had, both of which the PEP mentions:

a) What are the atomicity expectations around builtin types and their operations? (IronPython and Jython answer the question of what to lock in builtinListA.extend(builtinListB) differently, for example.) In my opinion, the clarifications around these alone would be very valuable from this PEP.

b) To get the C extensions in the ecosystem around NumPy to work, we needed either locking or patching. We observed issues around memory corruptions and plain wrong behaviour. And as was previously mentioned on this thread, without a thorough audit of the code of all extensions we run, we also felt great trepidation about what threading bugs may lurk even if we patched the issues we found. We initially looked into a GIL around calls to C extensions, but so many of our targeted workloads then needed to lock anyway that we saw little benefit. In this regard, too, we would benefit from this PEP if it pushed extensions to declare how thread-safe they are.
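
The question in (a) can be made concrete with a small sketch (the behavior noted in the comment is CPython's under the GIL; the point is that other implementations have to decide what, if anything, to lock):

```python
import threading

a = []
b = list(range(1000))

def writer():
    a.extend(b)   # does this observe b atomically?

def mutator():
    b.append(-1)  # structural change racing with the extend

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=mutator)
t1.start(); t2.start()
t1.join(); t2.join()
# on CPython today, len(a) is 1000 or 1001 - never a torn or corrupt result
```

Under CPython's GIL the list-to-list extend fast path runs without releasing the lock, so programs have come to depend on this kind of atomicity even though it was never formally specified.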

7 Likes

Does the existing implementation allow for only releasing the GIL explicitly within a with nogil context, similar to Cython?

1 Like

Wouldn’t this lead to a similar situation as in Nim, where you decide whether to use the GC or not? You would have some packages that would have to be used with the GIL and some without, but at the end of the day it breaks inter-compatibility.

1 Like