PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

I’d like to share updates to PEP 703. The changes are substantial enough that I figured a new thread was warranted. The main results come from rebasing onto Python 3.12, and they include new performance numbers. The implementation is available on GitHub at colesbury/nogil-3.12 (“Multithreaded Python without the GIL”, an experimental rebase on 3.12).

  • The PEP is updated to target Python 3.13.
  • Tweaks to biased reference counting to avoid an atomic operation in most object deallocations. Essentially, this optimizes further for the case of objects accessed only by a single thread at the cost of some extra overhead when an object is accessed by a non-creating thread.
  • Changes to the PyObject header and GC header (see the sketch just after this list). With PEP 683 accepted and implemented, objects with 2^32-1 references become immortal. This PEP keeps that behavior and shrinks the ob_ref_local field to 32 bits (on both 32-bit and 64-bit platforms). The “extra” space is used for a one-byte per-object mutex and GC flags (tracked/finalized/immortal), with 16 unused bits. The GC header (PyGC_Head) is removed. This makes the total header size for GC-enabled objects the same as in CPython, but non-GC objects are still larger due to the two reference count fields and the thread id field.
  • The specification of the Py_mod_gil slot from the previous PEP discussion (a short usage example follows below).
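Roughly, the new object header looks like the sketch below (simplified and illustrative only; see the PEP for the authoritative definition, and note that field names and ordering may still change):

```c
/* Simplified sketch of the proposed object header; not compilable
 * against Python.h as-is. */
struct _object {
    uintptr_t ob_tid;         /* id of the creating ("owning") thread */
    uint16_t __padding;       /* the 16 unused bits mentioned above */
    uint8_t ob_mutex;         /* one-byte per-object mutex */
    uint8_t ob_gc_bits;       /* GC flags: tracked / finalized / immortal */
    uint32_t ob_ref_local;    /* thread-local refcount, now 32 bits */
    Py_ssize_t ob_ref_shared; /* shared (cross-thread) refcount */
    PyTypeObject *ob_type;
};
```

And a minimal example of how an extension would use the Py_mod_gil slot (the constant name follows the earlier discussion and may still change before the PEP is final):

```c
static PyModuleDef_Slot spam_slots[] = {
    /* Declare that this extension is safe to run without the GIL. */
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
    {0, NULL},
};
```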

I’d appreciate your feedback on the PEP and the updates.

38 Likes

Thanks @colesbury for working on this PEP! It’s clearly had a lot of thought put into it and is tackling a really tough problem head-on: the GIL.

Feedback:

The global interpreter lock will remain the default for CPython builds and python.org downloads. A new build configuration flag, --disable-gil will be added to the configure script that will build CPython with support for running without the global interpreter lock.

  • Does the PEP (want to) recommend that C extension authors add support for --disable-gil?
    • The Python Build Modes section seems to lay out a suggested roadmap for extension authors that implies that support for --disable-gil will be encouraged.
  • Adding support for --disable-gil in an extension would seem to imply that APIs like PyList_GetItem, PyDict_GetItem, and PyWeakref_GetObject should be considered deprecated for regular usage in favor of their “new reference” versions, which play nicely with --disable-gil. However, the Why Not Deprecate PyDict_GetItem in Favor of PyDict_FetchItem? section explicitly says APIs like PyDict_GetItem are not deprecated, even while acknowledging that using them can be unsafe. I’m a bit confused.

Additionally, the implementation proposed PEP supports […]

  • Nit: Remove either the word “PEP” or the phrase “implementation proposed”

the interpreter pauses all threads and enables the GIL before continuing.

  • Wait. So it’s possible for a --disable-gil build of CPython to start using a GIL at runtime? I feel like it might be worth mentioning this possibility earlier, in the Build Configuration Changes section. I had been assuming that --disable-gil would compile away GIL-related code using #ifdefs or similar, but it sounds like code will have to regularly check whether a GIL is enabled, since it can be enabled at runtime and not just at compile time.
    • Indeed,
    • The author believes a worthwhile goal is to […] have the global interpreter lock controlled at runtime, possibly disabled by default.

At least for some time, there will be two versions of Python requiring separately compiled C-API extensions. It may take some time for C-API extension authors to build --disable-gil compatible packages and upload them to PyPI.

  • Does (or will) PyPI support uploading multiple binary wheels, one for each build version of Python?
  • How is it proposed that CPython automated builds (“buildbots”?) be updated? Will there now be double the number of builds, to add --disable-gil builds?
2 Likes

Hi @davidfstr, thanks for your detailed feedback and suggestions. I’ve made some edits to the PEP, and my responses are below.

Working C-API extensions are important, but I don’t think it’s necessary to make an “official” recommendation in the PEP.

I’ll expand on the rationale given in the PEP:

  • PyDict_GetItem and other functions that return borrowed references can already be “unsafe” when there is code between the call and the use of the “borrowed” value, if that code releases the GIL, modifies the dict, triggers a GC, etc. That happens more often than it sounds, because macros like Py_DECREF can call arbitrary code via destructors and weak refs (see the sketch after this list). So the PEP expands the hazards from “overlapping operations” to “concurrent operations”, but doesn’t categorically change a “safe” API into an “unsafe” one. (The safety is context-dependent.)
  • Most calls to these APIs are still safe. The reasons are varied and context-dependent, but they include things like the dict being private to the calling thread (e.g., kwargs).
  • I expect that wholesale replacing these calls would introduce more bugs than it would fix. This was my experience when I tried this approach.
  • Given that the “correct” decision is uncertain, I think not deprecating those APIs initially is the less risky option, even if that turns out to be the wrong decision in the long run. We can still decide to deprecate those APIs later. (The opposite scenario, in which we initially deprecate the APIs and then decide that was a mistake, would cause much more churn.)
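To make the first bullet concrete, here is a contrived sketch (hypothetical helper functions, not code from the PEP) of how a borrowed reference can go stale today, and the strong-reference pattern that avoids it:

```c
#include <Python.h>

/* UNSAFE (even with the GIL): `value` is borrowed, and Py_DECREF may
 * run an arbitrary destructor or weakref callback that mutates `dict`
 * and drops the last strong reference to `value`. (Error handling is
 * omitted for brevity.) */
static PyObject *
lookup_unsafe(PyObject *dict, PyObject *key, PyObject *old)
{
    PyObject *value = PyDict_GetItem(dict, key);  /* borrowed reference */
    Py_DECREF(old);               /* may invalidate `value` */
    return PyObject_Str(value);   /* potential use-after-free */
}

/* SAFER: take a strong reference immediately. The PEP's proposed
 * PyDict_FetchItem would do this for you by returning a new reference. */
static PyObject *
lookup_safe(PyObject *dict, PyObject *key, PyObject *old)
{
    PyObject *value = PyDict_GetItem(dict, key);
    Py_XINCREF(value);            /* own it before anything else runs */
    Py_DECREF(old);
    PyObject *result = value ? PyObject_Str(value) : NULL;
    Py_XDECREF(value);
    return result;
}
```

Without the GIL the same hazard exists, except that the “code in between” can simply be another thread.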

Tools like thread sanitizer (TSAN) help. We already have a GitHub Actions runner for address sanitizer (ASAN). I’d expect to add one for TSAN as well.

This is already mentioned in the Build Configuration Changes section. The code path that acquires the GIL (e.g., PyEval_SaveThread) checks whether the GIL is enabled. Runtime vs. compile-time support doesn’t affect performance or code complexity, but runtime control does improve some debugging and compatibility scenarios.
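Schematically, the check is trivial; something like the sketch below (made-up names and simplified types, for illustration only; the real logic lives in the interpreter’s GIL acquisition path):

```c
#include <stdbool.h>

/* Illustration only: hypothetical, simplified state. */
struct gil_state {
    bool enabled;   /* may be flipped at runtime, e.g. when an
                     * incompatible extension module is loaded */
    /* ... mutex, condition variable, etc. ... */
};

/* Stand-in for the usual blocking acquisition path. */
static void take_gil(struct gil_state *gil) { (void)gil; /* ... */ }

static void
maybe_take_gil(struct gil_state *gil)
{
    if (!gil->enabled) {
        return;  /* free-threaded mode: per-object locking is used instead */
    }
    take_gil(gil);
}
```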

It already works with PyPI because the wheels have different ABI tags.

There will be additional stable build bots for --disable-gil builds, but I don’t expect the total number of bots to double.

4 Likes

Hello, one thing about the PEP that I find confusing from my non-expert perspective is the performance section. In particular, the nogil builds’ performance is worse even for the multithreaded programs (and in fact the overhead there is bigger). I more or less understand the explanation about specialization limits, but my question is: how should readers reconcile this with the rationale for this PEP?

In other words, the whole idea is that, as the section notes, multithreaded code will be faster by effectively using multiple CPUs. Why isn’t that happening in the pyperformance benchmarks? Is it just a quirk of these benchmarks being poor examples, due to too little CPU work, too few threads, or something else? Or perhaps they are getting a slight speedup that is outweighed by the specialization impact, but with enough threads the net result would eventually be positive?

At any rate, since the whole point of this change is that for various applications, the nogil mode will run faster than the (current) GIL mode, IMO it would be good to have examples demonstrating that.

Thanks!

I think the values given for multiple threads are the per-thread overhead, not an application-wide hard speed reduction. Each thread is a bit slower, but you can run them in parallel to more than make up for the overhead, instead of the parallelism being limited by the GIL. (Roughly: with a 6% per-thread overhead, four CPU-bound threads on four cores could still approach a 4/1.06 ≈ 3.8× speedup over a GIL-bound run.)

Yes, as Edwin wrote, that section reports per-thread execution overhead. It doesn’t take into account any multi-threaded scaling. The pyperformance benchmark suite is almost entirely single-threaded. There is one benchmark (out of ~60) that uses threads; that benchmark is run as if it had the GIL enabled (but with the other “nogil” limitations) to avoid muddling the results.

I don’t report multithreaded results because there aren’t really any multithreaded Python benchmark suites. There are a few tiny programs that show nice linear scaling, but I don’t think those are interesting enough to include. There are some real-world programs that use nogil Python, like the Dose-3D project mentioned in the PEP’s introduction, but in many of those cases the GIL vs. nogil difference is so large that it’s basically “infeasible” vs. “feasible”. For the Dose-3D project, the important aspect was not that it ran faster without the GIL, but that it was able to meet real-time requirements. That would have been impossible with a multi-threaded Python application running with the GIL.

My EuroPython keynote talks a bit about some speed-ups with my colleague’s Hanabi project. There are some actual numbers, but I think they’re more reflective of details of the project than nogil in general.

1 Like

Indeed, we struggled a lot to make the Dose-3D DAQ run on traditional CPython. The communication overhead between the processes representing functional hardware entities greatly exceeded our timing budget. Now, with multiple threads running in parallel, not only were we able to use Python, but the overall system architecture was also simplified.

@colesbury Moreover, there are plans for using the low-level subsystem of this DAQ for other projects. So we are going to be strongly reliant on the nogil CPython implementation :slight_smile:

2 Likes

Thanks for the clarification - I didn’t catch that the table measures per-thread results, nor that the threaded benchmark is run as if the GIL were enabled. And thanks for the link to your talk - I definitely found the examples useful!

1 Like

We (the Faster CPython team) have taken a careful look at the PEP and I have written up a summary. It is mainly about performance, but also compares PEP 703 to multiple interpreters.
I have attempted to be objective; please say so if you think I am being unfair.

Performance assessment

Performance comparison of NoGIL with the status quo

PEP 703 claims that NoGIL has a 6% execution overhead on Intel Skylake machines, and 5% on AMD Zen 3.
Our benchmarking shows an 11% overhead on our (Cascade Lake) machine.

The NoGIL branch includes some major changes to the cycle GC and the memory allocator. Comparing NoGIL with our (relatively low-effort) attempt to add those changes to the base commit shows a 14% overhead.
Our attempt to isolate the changes to the cycle collector consists of reducing the number of generations to one and setting the threshold of that generation to 7000.
A perfect comparison would be a lot more work, as the changes to the cycle GC in the NoGIL branch are hard to isolate.

Earlier experiments with mimalloc showed a small (~1%) speedup, which gives a best-guess overhead of 15%.

A couple of things to note:

  1. The overhead is not evenly spread. Some programs will show a much larger overhead, exceeding 50% in some cases (e.g., the MyPy benchmark), and some will have negligible overhead.
  2. There are no benchmarks that have the workload spread across multiple threads, as Sam mentions in the PEP. Such benchmarks should show speedups with NoGIL.

Future and ongoing impact on performance

It is not clear how the overhead of NoGIL will change as CPython gets faster.
A large part of the overhead of NoGIL is in the more complex reference counting mechanism.
Reducing the number of reference counting operations is part of our optimization strategy,
which would reduce the absolute overhead of reference counting.
However, we expect large gains elsewhere, so the proportional overhead would probably increase.
There is also the overhead of synchronizing and locking. This is unlikely to change much in absolute terms, so would get considerably larger as a ratio.

Assuming a 10% overhead from reference counting and a 5% overhead from locking:
If we double the speed of the rest of the VM and halve the reference counting overhead, the overhead of NoGIL rises to 20%.
If we double the speed of the VM without changing reference counting, the overhead of NoGIL rises to 30%.
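Spelling out the arithmetic behind those two guesses (GIL build normalized to 1.0, using the assumed 10% + 5% split):

```text
GIL build:                   rest 1.00                        = 1.00
NoGIL today:                 rest 1.00 + rc 0.10 + lock 0.05  = 1.15           (+15%)
VM 2x, refcounting halved:   rest 0.50 + rc 0.05 + lock 0.05  = 0.60  vs 0.50  (+20%)
VM 2x, refcounting as-is:    rest 0.50 + rc 0.10 + lock 0.05  = 0.65  vs 0.50  (+30%)
```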

Summary of overhead estimates

All overheads as a percentage of the runtime without the NoGIL changes

| Measurement | Overhead |
| --- | --- |
| PEP 703 claimed | 6% |
| Our unadjusted measurement | 11% |
| Adjusted for cycle GC changes | 14% |
| Overall impact of NoGIL on 3.12* | 15% |
| Guess 1 for 3.13/3.14 | 20% |
| Guess 2 for 3.13/3.14 | 30% |

* Adjusted for cycle GC plus 1% estimated speedup from mimalloc integration.

Opportunity costs

The adaptive specializing interpreter relies on the GIL; it is not thread-friendly.
If NoGIL is accepted, then some redesign of our optimization strategy will be necessary.
While this is perfectly possible, it does have a cost.
The effort spent on this redesign and resulting implementation is not being spent on actual optimizations.

The interactions between instrumentation (PEP 669, sys.settrace, etc.), the cycle GC, and optimization are already subtle.
Ensuring that all the parts work correctly together takes considerable effort, and slows down work on speeding up CPython.
Adding free-threading to that mix will increase the complexity considerably, resulting in a lot of effort being spent on handling corner cases that simply do not occur with the GIL.

Comparison to multiple interpreters

Python 3.12 offers support for executing multiple Python interpreters in parallel in the same process.

For the purposes of this discussion, let’s categorize parallel applications into three groups:

  1. Data-store-backed processing. All the shared data is stored in a data store, such as a Postgres database; the processes share little or no data directly, communicating via the data store.
  2. Numerical processing (including machine learning) where the shared data is matrices of numbers.
  3. General data processing where the shared data is in the form of an in-memory object graph.

Python has always supported category 1, with multiprocessing or through some sort of external load balancer.
Category 2 is supported by multiple interpreters.
It is category 3 that benefits from NoGIL.

How common is category 3?
All the motivating examples in PEP 703 are in category 2.

3 Likes

In addition to the above, I also have some more subjective comments to add.

Shared-memory parallelism is risky, and very hard to get right. Non-deterministic bugs in the VM are no fun at all for Python developers.

It took the developers of the HotSpot JVM many years, and many, many person-years, to iron out all the concurrency bugs. I don’t see any reason to believe that we will do any better.

If you want robust parallel, scalable applications, you write them in Erlang, not Java.
If you need the very last drop of performance you might use Java, although you would probably use Rust these days.
It is safe to say that if you are choosing to develop in Python, you aren’t worried about getting the last drop of performance. You might be concerned about robustness and reliability, though.

The share-nothing model of multiple interpreters has its usability limitations (you have to pass messages) but it is much safer IMO.

2 Likes

Beyond what Mark said for our team, I’ll summarize my own perspective on no-gil.

Generally I’m in favor. It represents a substantial effort on Sam’s part, along with some effective strategies. It certainly addresses a major criticism of Python in the broader tech world. However, there are a few things that make me uncertain about it.

First, some of my potential biases:

  • I’ve been working toward a per-interpreter GIL for over 8 years, which can reasonably be thought of as a competitor to no-gil (FWIW, I don’t see it that way)
  • I almost never reach for threads to solve any of my problems, and when I do the GIL isn’t an issue for me, so I don’t have any personal vested interest for or against threading
  • I work on the faster-cpython team, which would be negatively impacted by no-gil
  • I’ve interacted with Sam on several occasions and found him to be thoughtful, smart, and respectful, so I’m inclined to trust his intentions and the work he’s done
  • I’ve struggled to quickly understand a number of the technical aspects of no-gil, which can impact my disposition toward the solution and how I’ve personally assessed the costs of no-gil

Here are my main concerns:

  1. it’s unclear who benefits and how much
    • which Python workloads benefit (outside AI/ML)
    • what new Python usage would emerge among folks who currently don’t even bother
  2. there is a real cost to the change, due to both size and complexity, that Python maintainers will have to pay indefinitely (this is significant since most contributors are volunteers), though it’s unclear how big that cost is
  3. free-threading is a world of pain that CPython contributors and extension maintainers would now be exposed to (but how bad would it be really?)
  4. there hasn’t been any serious analysis (that I’m aware of) of those costs and benefits, by which we could make an informed decision on whether no-gil is worth doing

Regarding that last point, it’s a tricky one. Providing even a rough analysis would be quite a challenge for core devs or the PSF, much more so for Sam, and a detailed analysis more so still. So it would be somewhat disingenuous to expect Sam to provide enough analysis to be useful. (Perhaps I’m overestimating the difficulty or underestimating Sam’s resolve :smile:.) Then again, it’s hard to imagine us accepting no-gil without a clear picture of the real costs and benefits. Doing so would feel irresponsible.

Regarding working on CPython outside the GIL, I’ve had some experience with that over the last few years, and almost exclusively for the last six months. I can honestly say that it is quite frustrating when compared to working under the GIL. Things that normally take days routinely stretch into weeks instead.

Of course, that would be likely to ease up over time under no-gil as we adjust and document the pain points and sort out effective tooling. Also, my experience is probably partly due to what I’ve been working on, though under no-gil many parts of the runtime would now be exposed to the same sort of concerns (and pain). Perhaps most contributors have more experience in dealing with free-threading or they’re smarter than me, but I expect that my experience would be typical going forward, at least for key parts of the runtime.

The fact that extension module maintainers would be exposed to the same peril definitely worries me too.

To be clear, I do hope there’s a way forward for no-gil and would like to see it succeed. Perhaps my concerns are unfounded. Perhaps the benefits will sufficiently outweigh the costs. Ultimately my biggest concern is that it’s hard to know one way or the other at the moment. At best, we’re making educated guesses. For something as far-reaching as no-gil I’d expect to have a clearer picture before a decision could be made.

Just to be clear, applying multiple interpreters to satisfy category 3 is a real possibility. However, how well it can do so is currently mostly hypothetical (no matter what I expect to happen). Folks are already exploring various approaches, but it may take a year or two before a firm conclusion can be made.

That makes it hard to reasonably suggest we hold off on no-gil until we know whether per-interpreter GIL solves the multi-core story, regardless of any optimism I have about that.

FWIW, all the work I’ve done for per-interpreter GIL has been driven by the specific goal of fixing Python’s multi-core story once-and-for-all. If I’ve succeeded (or at least given us a path to success) then no-gil, and its attendant costs, would be unnecessary. That said, I’ve tried to leave out my own aspirations when honestly considering no-gil, and do sincerely hope the project finds a way forward.

4 Likes

As a Rust framework maintainer for PyO3, I’ve had multiple feature requests from users to have multithreaded integration with Python. I’m keen to see Python’s multi-core story progress and have spent a fair deal of time thinking about how PyO3 can support per-interpreter GIL and nogil.

PyO3 supports neither multi-core solution yet, both due to their prerelease nature and a need to design appropriate safe abstractions for framework users to build multi-core extension modules correctly. I hope that PyO3 will offer an ergonomic way for extension authors to work with a multi-core Python.

My current opinion is that nogil is an easier match to Rust’s concurrency story. Rust is also able to help mitigate some categories of free-threading concerns that Eric alludes to above.

With nogil, the Rust integration seems straightforward. Multithreaded Rust programs can have multiple threads attached to the Python interpreter, and Rust’s concurrency primitives can allow these threads to interact safely to share Python and Rust state.

An example of this is asynchronous Rust code. Frequently this uses multithreaded event loops where tasks can move between worker threads. With a nogil Python, all worker threads could interact with shared Python objects without the latency of acquiring and releasing the GIL around Python touch points. I am aware of at least the Pants build system and the Robyn web framework as Rust programs which I understand are built using asynchronous event loops and could benefit from nogil.[1]

For per-interpreter GIL, I think the same integration is more challenging. My understanding is that each thread would run its own interpreter and Python objects cannot be shared between interpreters. This creates difficulties with multithreaded Rust event loops which pass work between threads. To achieve object isolation I think tasks would not be able to store Python state across task yield points and instead would need to store state purely in Rust data structures.

Rust programs written with per-interpreter GIL can still make use of Rust’s concurrency primitives to coordinate data sharing between threads in Rust data structures and push work onto isolated Python interpreters. This model can definitely be successful. To make it possible for PyO3’s users to do this soundly, I need to rework PyO3’s APIs to enforce PEP 630’s module isolation. This still needs significant design work. (Help is very much welcomed from all interested in contributing!)


  1. The Pants build system has actively voiced enthusiasm in discussions on PyO3’s GitHub.