PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

I’d like to share updates to PEP 703. The changes are substantial enough that I figured a new thread was warranted. The main results come from rebasing onto Python 3.12, and they include new performance numbers. The implementation is available on GitHub at colesbury/nogil-3.12 (“Multithreaded Python without the GIL”, an experimental rebase on 3.12).

  • The PEP is updated to target Python 3.13.
  • Tweaks to biased reference counting to avoid an atomic operation in most object deallocations. Essentially, this optimizes further for the case of objects accessed only by a single thread at the cost of some extra overhead when an object is accessed by a non-creating thread.
  • Changes to the PyObject header and GC header (see the sketch just after this list). With PEP 683 accepted and implemented, objects with 2^32-1 references become immortal. This PEP keeps that behavior and shrinks the ob_ref_local field to 32 bits (on both 32-bit and 64-bit platforms). The “extra” space is used for a one-byte per-object mutex and GC flags (tracked/finalized/immortal), with 16 unused bits. The GC header (PyGC_Head) is removed. This makes the total header size for GC-enabled objects the same as in CPython, but non-GC objects are still larger due to the two reference count fields and the thread id field.
  • The specification of the Py_mod_gil slot from the previous PEP discussion (a short usage example follows below).
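Roughly, the new object header looks like the sketch below (simplified and illustrative only; see the PEP for the authoritative definition, and note that field names and ordering may still change):

```c
/* Simplified sketch of the proposed object header; not compilable
 * against Python.h as-is. */
struct _object {
    uintptr_t ob_tid;         /* id of the creating ("owning") thread */
    uint16_t __padding;       /* the 16 unused bits mentioned above */
    uint8_t ob_mutex;         /* one-byte per-object mutex */
    uint8_t ob_gc_bits;       /* GC flags: tracked / finalized / immortal */
    uint32_t ob_ref_local;    /* thread-local refcount, now 32 bits */
    Py_ssize_t ob_ref_shared; /* shared (cross-thread) refcount */
    PyTypeObject *ob_type;
};
```

And a minimal example of how an extension would use the Py_mod_gil slot (the constant name follows the earlier discussion and may still change before the PEP is final):

```c
static PyModuleDef_Slot spam_slots[] = {
    /* Declare that this extension is safe to run without the GIL. */
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
    {0, NULL},
};
```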

I’d appreciate your feedback on the PEP and the updates.

38 Likes

Thanks @colesbury for working on this PEP! It’s clearly had a lot of thought put into it and is tackling a really tough problem head-on: the GIL.

Feedback:

The global interpreter lock will remain the default for CPython builds and python.org downloads. A new build configuration flag, --disable-gil will be added to the configure script that will build CPython with support for running without the global interpreter lock.

  • Does the PEP (want to) recommend that C extension authors add support for --disable-gil?
    • The Python Build Modes section seems to lay out a suggested roadmap for extension authors that implies that support for --disable-gil will be encouraged.
  • Adding support for --disable-gil in an extension would seem to imply that APIs like PyList_GetItem, PyDict_GetItem, and PyWeakref_GetObject should be considered deprecated for regular usage in favor of their “new reference” versions, which play nicely with --disable-gil. However, the Why Not Deprecate PyDict_GetItem in Favor of PyDict_FetchItem? section explicitly says APIs like PyDict_GetItem are not deprecated, even while acknowledging that using them can be unsafe. I’m a bit confused.

Additionally, the implementation proposed PEP supports […]

  • Nit: Remove either the word “PEP” or the phrase “implementation proposed”

the interpreter pauses all threads and enables the GIL before continuing.

  • Wait. So it’s possible for a --disable-gil build of CPython to start using a GIL at runtime? I feel like it might be worth mentioning this possibility earlier, in the Build Configuration Changes section. I had been assuming that --disable-gil would compile away GIL-related code using #ifdefs or similar, but it sounds like code will have to regularly check whether a GIL is enabled, since it can be enabled at runtime and not just at compile time.
    • Indeed,
    • The author believes a worthwhile goal is to […] have the global interpreter lock controlled at runtime, possibly disabled by default.

At least for some time, there will be two versions of Python requiring separately compiled C-API extensions. It may take some time for C-API extension authors to build --disable-gil compatible packages and upload them to PyPI.

  • Does (or will) PyPI support uploading multiple binary wheels, one for each build version of Python?
  • How is it proposed that CPython automated builds (“buildbots”?) be updated? Will there now be double the number of builds, to add --disable-gil builds?
2 Likes

Hi @davidfstr, thanks for your detailed feedback and suggestions. I’ve made some edits to the PEP, and my responses are below.

Working C-API extensions are important, but I don’t think it’s necessary to make an “official” recommendation in the PEP.

I’ll expand on the rationale given in the PEP:

  • PyDict_GetItem and other functions that return borrowed references can already be “unsafe” when there is code between the call and the use of the “borrowed” value, if that code releases the GIL, modifies the dict, triggers a GC, etc. That happens more often than it sounds, because macros like Py_DECREF can call arbitrary code via destructors and weak refs (see the sketch after this list). So the PEP expands the hazards from “overlapping operations” to “concurrent operations”, but doesn’t categorically change a “safe” API into an “unsafe” one. (The safety is context-dependent.)
  • Most calls to these APIs are still safe. The reasons are varied and context-dependent, but they include things like the dict being private to the calling thread (e.g., kwargs).
  • I expect that wholesale replacing these calls would introduce more bugs than it would fix. This was my experience when I tried this approach.
  • Given that the “correct” decision is uncertain, I think not deprecating those APIs initially is the less risky option, even if that turns out to be the wrong decision in the long run. We can still decide to deprecate those APIs later. (The opposite scenario, in which we initially deprecate the APIs and then decide that was a mistake, would cause much more churn.)
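To make the first bullet concrete, here is a contrived sketch (hypothetical helper functions, not code from the PEP) of how a borrowed reference can go stale today, and the strong-reference pattern that avoids it:

```c
#include <Python.h>

/* UNSAFE (even with the GIL): `value` is borrowed, and Py_DECREF may
 * run an arbitrary destructor or weakref callback that mutates `dict`
 * and drops the last strong reference to `value`. (Error handling is
 * omitted for brevity.) */
static PyObject *
lookup_unsafe(PyObject *dict, PyObject *key, PyObject *old)
{
    PyObject *value = PyDict_GetItem(dict, key);  /* borrowed reference */
    Py_DECREF(old);               /* may invalidate `value` */
    return PyObject_Str(value);   /* potential use-after-free */
}

/* SAFER: take a strong reference immediately. The PEP's proposed
 * PyDict_FetchItem would do this for you by returning a new reference. */
static PyObject *
lookup_safe(PyObject *dict, PyObject *key, PyObject *old)
{
    PyObject *value = PyDict_GetItem(dict, key);
    Py_XINCREF(value);            /* own it before anything else runs */
    Py_DECREF(old);
    PyObject *result = value ? PyObject_Str(value) : NULL;
    Py_XDECREF(value);
    return result;
}
```

Without the GIL the same hazard exists, except that the “code in between” can simply be another thread.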

Tools like thread sanitizer (TSAN) help. We already have a GitHub Actions runner for address sanitizer (ASAN). I’d expect to add one for TSAN as well.

This is already mentioned in the Build Configuration Changes section. The code path that acquires the GIL (e.g., PyEval_SaveThread) checks whether the GIL is enabled. Runtime vs. compile-time support doesn’t affect performance or code complexity, but runtime control does improve some debugging and compatibility scenarios.
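Schematically, the check is trivial; something like the sketch below (made-up names and simplified types, for illustration only; the real logic lives in the interpreter’s GIL acquisition path):

```c
#include <stdbool.h>

/* Illustration only: hypothetical, simplified state. */
struct gil_state {
    bool enabled;   /* may be flipped at runtime, e.g. when an
                     * incompatible extension module is loaded */
    /* ... mutex, condition variable, etc. ... */
};

/* Stand-in for the usual blocking acquisition path. */
static void take_gil(struct gil_state *gil) { (void)gil; /* ... */ }

static void
maybe_take_gil(struct gil_state *gil)
{
    if (!gil->enabled) {
        return;  /* free-threaded mode: per-object locking is used instead */
    }
    take_gil(gil);
}
```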

It already works with PyPI because the wheels have different ABI tags.

There will be additional stable build bots for --disable-gil builds, but I don’t expect the total number of bots to double.

4 Likes

Hello, one thing about the PEP that I find confusing from my non-expert perspective is the performance section. In particular, the nogil builds’ performance is worse even for the multithreaded programs (and in fact the overhead there is bigger). I more or less understand the explanation about specialization limits, but my question is: how should readers reconcile this with the rationale for this PEP?

In other words, the whole idea is that, as the section notes, multithreaded code will be faster by effectively using multiple CPUs. Why isn’t that happening in the pyperformance benchmarks? Is it just a quirk of these benchmarks being poor examples, due to too little CPU work, too few threads, or something else? Or perhaps they are getting a slight speedup that is outweighed by the specialization impact, but with enough threads the net result would eventually be positive?

At any rate, since the whole point of this change is that for various applications, the nogil mode will run faster than the (current) GIL mode, IMO it would be good to have examples demonstrating that.

Thanks!

I think the values given for multiple threads are the per-thread overhead, not an application-wide hard speed reduction. Each thread is a bit slower, but you can run them in parallel to more than make up for the overhead, instead of the parallelism being limited by the GIL. (Roughly: with a 6% per-thread overhead, four CPU-bound threads on four cores could still approach a 4/1.06 ≈ 3.8× speedup over a GIL-bound run.)

Yes, as Edwin wrote, that section reports per-thread execution overhead. It doesn’t take into account any multi-threaded scaling. The pyperformance benchmark suite is almost entirely single-threaded. There is one benchmark (out of ~60) that uses threads; that benchmark is run as if it had the GIL enabled (but with the other “nogil” limitations) to avoid muddling the results.

I don’t report multithreaded results because there aren’t really any multithreaded Python benchmark suites. There are a few tiny programs that show nice linear scaling, but I don’t think those are interesting enough to include. There are some real-world programs that use nogil Python, like the Dose-3D project mentioned in the PEP’s introduction, but in many of those cases the GIL vs. nogil difference is so large that it’s basically “infeasible” vs. “feasible”. For the Dose-3D project, the important aspect was not that it ran faster without the GIL, but that it was able to meet real-time requirements. That would have been impossible with a multi-threaded Python application running with the GIL.

My EuroPython keynote talks a bit about some speed-ups with my colleague’s Hanabi project. There are some actual numbers, but I think they’re more reflective of details of the project than nogil in general.

1 Like

Indeed, we struggled a lot to make the Dose-3D DAQ run on traditional CPython. The communication overhead between the processes representing functional hardware entities greatly exceeded our timing budget. Now, with multiple threads running in parallel, not only were we able to use Python, but the overall system architecture was also simplified.

@colesbury Moreover, there are plans for using the low-level subsystem of this DAQ for other projects. So we are going to be strongly reliant on the nogil CPython implementation :slight_smile:

2 Likes

Thanks for the clarification - I didn’t catch that the table measures per-thread results, nor that the threaded benchmark is run as if the GIL were enabled. And thanks for the link to your talk - I definitely found the examples useful!

1 Like

We (the Faster CPython team) have taken a careful look at the PEP and I have written up a summary. It is mainly about performance, but also compares PEP 703 to multiple interpreters.
I have attempted to be objective; please say so if you think I am being unfair.

Performance assessment

Performance comparison of NoGIL with the status quo

PEP 703 claims that NoGIL has a 6% execution overhead on Intel Skylake machines, and 5% on AMD Zen 3.
Our benchmarking shows an 11% overhead on our (Cascade Lake) machine.

The NoGIL branch includes some major changes to the cycle GC and the memory allocator. Comparing NoGIL with our (relatively low-effort) attempt to add those changes to the base commit shows a 14% overhead.
Our attempt to isolate the changes to the cycle collector consists of reducing the number of generations to one and setting the threshold of that generation to 7000.
A perfect comparison would be a lot more work, as the changes to the cycle GC in the NoGIL branch are hard to isolate.

Earlier experiments with mimalloc showed a small (~1%) speedup, which gives a best-guess overhead of 15%.

A couple of things to note:

  1. The overhead is not evenly spread. Some programs will show a much larger overhead, exceeding 50% in some cases (e.g., the MyPy benchmark), and some will have negligible overhead.
  2. There are no benchmarks that have the workload spread across multiple threads, as Sam mentions in the PEP. Such benchmarks should show speedups with NoGIL.

Future and ongoing impact on performance

It is not clear how the overhead of NoGIL will change as CPython gets faster.
A large part of the overhead of NoGIL is in the more complex reference counting mechanism.
Reducing the number of reference counting operations is part of our optimization strategy,
which would reduce the absolute overhead of reference counting.
However, we expect large gains elsewhere, so the proportional overhead would probably increase.
There is also the overhead of synchronizing and locking. This is unlikely to change much in absolute terms, so would get considerably larger as a ratio.

Assuming a 10% overhead from reference counting and a 5% overhead from locking:
If we double the speed of the rest of the VM and halve the reference counting overhead, the overhead of NoGIL rises to 20%.
If we double the speed of the VM without changing reference counting, the overhead of NoGIL rises to 30%.
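Spelling out the arithmetic behind those two guesses (GIL build normalized to 1.0, using the assumed 10% + 5% split):

```text
GIL build:                   rest 1.00                        = 1.00
NoGIL today:                 rest 1.00 + rc 0.10 + lock 0.05  = 1.15           (+15%)
VM 2x, refcounting halved:   rest 0.50 + rc 0.05 + lock 0.05  = 0.60  vs 0.50  (+20%)
VM 2x, refcounting as-is:    rest 0.50 + rc 0.10 + lock 0.05  = 0.65  vs 0.50  (+30%)
```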

Summary of overhead estimates

All overheads as a percentage of the runtime without the NoGIL changes

| Measurement | Overhead |
| --- | --- |
| PEP 703 claimed | 6% |
| Our unadjusted measurement | 11% |
| Adjusted for cycle GC changes | 14% |
| Overall impact of NoGIL on 3.12* | 15% |
| Guess 1 for 3.13/3.14 | 20% |
| Guess 2 for 3.13/3.14 | 30% |

* Adjusted for cycle GC plus 1% estimated speedup from mimalloc integration.

Opportunity costs

The adaptive specializing interpreter relies on the GIL; it is not thread-friendly.
If NoGIL is accepted, then some redesign of our optimization strategy will be necessary.
While this is perfectly possible, it does have a cost.
The effort spent on this redesign and resulting implementation is not being spent on actual optimizations.

The interactions between instrumentation (PEP 669, sys.settrace, etc.), the cycle GC, and optimization are already subtle.
Ensuring that all the parts work correctly together takes considerable effort, and slows down work on speeding up CPython.
Adding free-threading to that mix will increase the complexity considerably, resulting in a lot of effort being spent on handling corner cases that simply do not occur with the GIL.

Comparison to multiple interpreters

Python 3.12 offers support for executing multiple Python interpreters in parallel in the same process.

For the purposes of this discussion, let’s categorize parallel applications into three groups:

  1. Data-store-backed processing. All the shared data is stored in a data store, such as a Postgres database; the processes share little or no data directly, communicating via the data store.
  2. Numerical processing (including machine learning) where the shared data is matrices of numbers.
  3. General data processing where the shared data is in the form of an in-memory object graph.

Python has always supported category 1, with multiprocessing or through some sort of external load balancer.
Category 2 is supported by multiple interpreters.
It is category 3 that benefits from NoGIL.

How common is category 3?
All the motivating examples in PEP 703 are in category 2.

3 Likes

In addition to the above, I also have some more subjective comments to add.

Shared-memory parallelism is risky, and very hard to get right. Non-deterministic bugs in the VM are no fun at all for Python developers.

It took the developers of the HotSpot JVM many years, and many, many person-years, to iron out all the concurrency bugs. I don’t see any reason to believe that we will do any better.

If you want robust parallel, scalable applications, you write them in Erlang, not Java.
If you need the very last drop of performance you might use Java, although you would probably use Rust these days.
It is safe to say that if you are choosing to develop in Python, you aren’t worried about getting the last drop of performance. You might be concerned about robustness and reliability, though.

The share-nothing model of multiple interpreters has its usability limitations (you have to pass messages) but it is much safer IMO.

2 Likes

Beyond what Mark said for our team, I’ll summarize my own perspective on no-gil.

Generally I’m in favor. It represents a substantial effort on Sam’s part, along with some effective strategies. It certainly addresses a major criticism of Python in the broader tech world. However, there are a few things that make me uncertain about it.

First, some of my potential biases:

  • I’ve been working toward a per-interpreter GIL for over 8 years, which can reasonably be thought of as a competitor to no-gil (FWIW, I don’t see it that way)
  • I almost never reach for threads to solve any of my problems, and when I do the GIL isn’t an issue for me, so I don’t have any personal vested interest for or against threading
  • I work on the faster-cpython team, which would be negatively impacted by no-gil
  • I’ve interacted with Sam on several occasions and found him to be thoughtful, smart, and respectful, so I’m inclined to trust his intentions and the work he’s done
  • I’ve struggled to quickly understand a number of the technical aspects of no-gil, which can impact my disposition toward the solution and how I’ve personally assessed the costs of no-gil

Here are my main concerns:

  1. it’s unclear who benefits and how much
    • which Python workloads benefit (outside AI/ML)
    • what new Python usage would emerge among folks who currently don’t even bother
  2. there is a real cost to the change, due to both size and complexity, that Python maintainers will have to pay indefinitely (this is significant since most contributors are volunteers), though it’s unclear how big that cost is
  3. free-threading is a world of pain that CPython contributors and extension maintainers would now be exposed to (but how bad would it be really?)
  4. there hasn’t been any serious analysis (that I’m aware of) of those costs and benefits, by which we could make an informed decision on whether no-gil is worth doing

Regarding that last point, it’s a tricky one. Providing even a rough analysis would be quite a challenge for core devs or the PSF, much more so for Sam, and a detailed analysis more so still. So it would be somewhat disingenuous to expect Sam to provide enough analysis to be useful. (Perhaps I’m overestimating the difficulty or underestimating Sam’s resolve :smile:.) Then again, it’s hard to imagine us accepting no-gil without a clear picture of the real costs and benefits. Doing so would feel irresponsible.

Regarding working on CPython outside the GIL, I’ve had some experience with that over the last few years, and almost exclusively for the last six months. I can honestly say that it is quite frustrating when compared to working under the GIL. Things that normally take days routinely stretch into weeks instead.

Of course, that would be likely to ease up over time under no-gil as we adjust and document the pain points and sort out effective tooling. Also, my experience is probably partly due to what I’ve been working on, though under no-gil many parts of the runtime would now be exposed to the same sort of concerns (and pain). Perhaps most contributors have more experience in dealing with free-threading or they’re smarter than me, but I expect that my experience would be typical going forward, at least for key parts of the runtime.

The fact that extension module maintainers would be exposed to the same peril definitely worries me too.

To be clear, I do hope there’s a way forward for no-gil and would like to see it succeed. Perhaps my concerns are unfounded. Perhaps the benefits will sufficiently outweigh the costs. Ultimately my biggest concern is that it’s hard to know one way or the other at the moment. At best, we’re making educated guesses. For something as far-reaching as no-gil I’d expect to have a clearer picture before a decision could be made.

Just to be clear, applying multiple interpreters to satisfy category 3 is a real possibility. However, how well it can do so is currently mostly hypothetical (no matter what I expect to happen). Folks are already exploring various approaches, but it may take a year or two before a firm conclusion can be made.

That makes it hard to reasonably suggest we hold off on no-gil until we know whether per-interpreter GIL solves the multi-core story, regardless of any optimism I have about that.

FWIW, all the work I’ve done for per-interpreter GIL has been driven by the specific goal of fixing Python’s multi-core story once-and-for-all. If I’ve succeeded (or at least given us a path to success) then no-gil, and its attendant costs, would be unnecessary. That said, I’ve tried to leave out my own aspirations when honestly considering no-gil, and do sincerely hope the project finds a way forward.

4 Likes

As a Rust framework maintainer for PyO3, I’ve had multiple feature requests from users to have multithreaded integration with Python. I’m keen to see Python’s multi-core story progress and have spent a fair deal of time thinking about how PyO3 can support per-interpreter GIL and nogil.

PyO3 supports neither multi-core solution yet, both due to their prerelease nature and a need to design appropriate safe abstractions for framework users to build multi-core extension modules correctly. I hope that PyO3 will offer an ergonomic way for extension authors to work with a multi-core Python.

My current opinion is that nogil is an easier match to Rust’s concurrency story. Rust is also able to help mitigate some categories of free-threading concerns that Eric alludes to above.

With nogil, the Rust integration seems straightforward. Multithreaded Rust programs can have multiple threads attached to the Python interpreter, and Rust’s concurrency primitives can allow these threads to interact safely to share Python and Rust state.

An example of this is asynchronous Rust code. Frequently this uses multithreaded event loops where tasks can move between worker threads. With a nogil Python, all worker threads could interact with shared Python objects without the latency of acquiring and releasing the GIL around Python touch points. I am aware of at least the Pants build system and the Robyn web framework as Rust programs which I understand are built using asynchronous event loops and could benefit from nogil.[1]

For per-interpreter GIL, I think the same integration is more challenging. My understanding is that each thread would run its own interpreter and Python objects cannot be shared between interpreters. This creates difficulties with multithreaded Rust event loops which pass work between threads. To achieve object isolation I think tasks would not be able to store Python state across task yield points and instead would need to store state purely in Rust data structures.

Rust programs written with per-interpreter GIL can still make use of Rust’s concurrency primitives to coordinate data sharing between threads in Rust data structures and push work onto isolated Python interpreters. This model can definitely be successful. To make it possible for PyO3’s users to do this soundly, I need to rework PyO3’s APIs to enforce PEP 630’s module isolation. This still needs significant design work. (Help is very much welcomed from all interested in contributing!)


  1. The Pants build system has actively voiced enthusiasm in discussions on PyO3’s GitHub.