I think the values given for multiple threads is the per thread overhead, and not an application-wide hard speed reduction. Each thread is a bit slower, but you can run then in parallel to more than make up for the overhead, instead being the parallelization being limited by the GIL.
Yes, as Edwin wrote, that section reports per-thread execution overhead. It doesn’t take into account any multi-threaded scaling. The pyperformance benchmark suite is almost entirely single-threaded. There is one benchmark (out of ~60) that uses threads; that benchmark is run as if it had the GIL enabled (but with the other “nogil” limitations) to avoid muddling the results.
I don’t report multithreaded results because there aren’t really any multithreaded Python benchmark suites. There are a few tiny programs that show nice linear scaling, but I don’t think those are interesting enough to include. There are some real world programs that use nogil Python, like the Dose-3D project mentioned in the PEP’s introduction, but in many of those cases GIL vs. nogil is so large that it’s basically “infeasible” vs. “feasible”. For the Dose-3D project, the important aspect was not that it ran faster without the GIL, but that it was able to meet real-time requirements. That would have been impossible with a multi-threaded Python application running with the GIL.
My EuroPython keynote talks a bit about some speed-ups with my colleague’s Hanabi project. There are some actual numbers, but I think they’re more reflective of details of the project than nogil in general.
Indeed we struggled a lot to make Dose-3D DAQ run on traditional CPython. The communication overhead between processes representing functional hardware entities greatly exceeded our timing budget. Now with multiple threads running in parallel not only were we enabled to use Python but also the overall system architecture simplified.
@colesbury Moreover, there are plans for using the low-level subsystem of this DAQ for other projects. So we are going to be strongly reliant on the nogil CPython implementation
Thanks for the clarification - I didn’t catch that the table is measuring per-thread results, nor that they are run with GIL enabled. And thanks for the link to your talk - I definitely found the examples useful!
We (the Faster CPython team) have taken a careful look at the PEP and I have written up a summary. It is mainly about performance, but also compares PEP 703 to multiple interpreters.
I have attempted to be objective, please say if you think I am being unfair.
PEP 703 claims that NoGIL has a 6% execution overhead on Intel Skylake machines, and 5% on AMD Zen 3.
Our benchmarking shows an 11% overhead on our (Cascade Lake) machine.
The NoGIL branch includes some major changes to the cycle GC and the memory allocator. Comparing NoGIL with our (relatively low effort) attempt to add those changes to the base commit shows 14% overhead.
Our attempt to isolate the changes to the cycle collector consist of reducing the number of generations to one, and setting the threshold of that generation to 7000.
A perfect comparison would be a lot more work, as the changes to the cycle GC in the NoGIL are hard to isolate.
Earlier experiments with mimalloc showed a small ~1% speedup, which gives a best guess number for the overhead of 15%.
A couple of things to note:
- The overhead is not evenly spread. Some programs will show a much larger overhead, exceeding 50% is some cases (e.g. the MyPy benchmark), and some will have negligible overhead.
- There are no benchmarks that have the workload spread across multiple threads, as Sam mentions in the PEP. Such benchmarks should show speedups with NoGIL.
It is not clear how the overhead of NoGIL will change as CPython gets faster.
A large part of the overhead of NoGIL is in the more complex reference counting mechanism.
Reducing the number of reference counting operations is part of our optimization strategy,
which would reduce the absolute overhead of reference counting.
However, we expect large gains elsewhere, so the proportional overhead would probably increase.
There is also the overhead of synchronizing and locking. This is unlikely to change much in absolute terms, so would get considerably larger as a ratio.
Assuming a 10% overhead from reference counting and a 5% overhead from locking:
If we double the speed of the rest of the VM and halve the reference counting overhead, the overhead of NoGIL rises to 20%.
If we double the speed of the VM without changing reference counting, the overhead of NoGIL rises to 30%.
All overheads as a percentage of the runtime without the NoGIL changes
|PEP 703 claimed
|Our unadjusted measurement
|Adjusted for cycle GC changes
|Overall impact of NoGIL on 3.12*
|Guess 1 for 3.13/3.14
|Guess 2 for 3.13/3.14
* Adjusted for cycle GC plus 1% estimated speedup from mimalloc integration.
The adaptive specializing interpreter relies on the GIL; it is not thread-friendly.
If NoGIL is accepted, then some redesign of our optimization strategy will be necessary.
While this is perfectly possible, it does have a cost.
The effort spent on this redesign and resulting implementation is not being spent on actual optimizations.
The interactions between instrumentation (PEP 669, sys.settrace, etc), the cycle GC, and optimization, are already subtle.
Ensuring that all the parts work correctly together takes considerable effort, and slows down work on speeding up CPython.
Adding free-threading to that mix, will increase the complexity considerably, resulting in a lot of effort being spent on handling corner cases that simply do not occur with the GIL.
Python 3.12 offers support for executing multiple Python interpreters in parallel in the same process.
For the purposes of this discussion, let’s categorize parallel application into three groups:
- Data-store backed processing. All the shared data is stored in a data store, such a Postgres database, the processes share little or no data, communicating via the data store.
- Numerical processing (including machine learning) where the shared data is matrices of numbers.
- General data processing where the shared data is in the form of an in-memory object graph.
Python has always supported category 1, with multiprocessing or through some sort of external load balancer.
Category 2 is supported by multiple interpreters.
It is category 3 that benefits from NoGIL.
How common is category 3?
All the motivating examples in PEP 703 are in category 2.
In addition to the above, I also have some more subjective comments to add.
Shared memory parallelism is risky, and very hard to get right. Non deterministic bugs in the VM are no fun at all for Python developers.
It took the developers of the HotSpot JVM many years and many many person years to iron out all the concurrency bugs. I don’t see any reason to believe that we will do any better.
If you want robust parallel, scalable applications, you write them in Erlang, not Java.
If you need the very last drop of performance you might use Java, although you would probably use Rust these days.
It is safe to say that if you are choosing to develop in Python, you aren’t worried about getting the last drop of performance. You might be concerned about robustness and reliability, though.
The share-nothing model of multiple interpreters has its usability limitations (you have to pass messages) but it is much safer IMO.
Outside the things Mark said for our team, I’ll summarize my perspective on no-gil.
Generally I’m in favor. It represents a substantial effort on Sam’s part and some effective strategies. It certainly addresses a major criticism of Python in the broader tech world. However, there are a few things that make me uncertain about it.
First, some of my potential biases:
- I’ve been working toward a per-interpreter GIL for over 8 years, which can reasonably be thought of as a competitor to no-gil (FWIW, I don’t see it that way)
- I almost never reach for threads to solve any of my problems, and when I do the GIL isn’t an issue for me, so I don’t have any personal vested interest for or against threading
- I work on the faster-cpython team, which would be negatively impacted by no-gil
- I’ve interacted with Sam on several occasions and found him to be thoughtful, smart, and respectful, so I’m inclined to trust his intentions and the work he’s done
- I’ve struggled to quickly understand a number of the technical aspects of no-gil, which can impact my disposition toward the solution and how I’ve personally assessed the costs of no-gil
Here are my main concerns:
- it’s unclear who benefits and how much
- which Python workloads benefit (outside AI/ML)
- what new Python usage would come about that folks don’t even bother currently
- there is a real cost to the change, due to both size and complexity, that Python maintainers will have to pay indefinitely (this is significant since most contributors are volunteers), though it’s unclear how big that cost is
- free-threading is a world of pain that CPython contributors and extension maintainers would now be exposed to (but how bad would it be really?)
- there hasn’t been any serious analysis (that I’m aware of) of those costs and benefits, by which we could make an informed decision on no-gil is worth doing
Regarding that last point, it’s a tricky one. Providing just a rough analysis would be quite a challenge, even for core devs or the PSF, much more so for Sam and much more so a detailed analysis. So it would be somewhat disingenuous to expect Sam to provide enough analysis to be useful. (Perhaps I’m overestimating the difficulty or underestimating Sam’s resolve ). Then again, it’s hard to imagine us accepting no-gil without a clear picture of the real costs and benefits. Doing so would feel irresponsible.
Regarding working on CPython outside the GIL, I’ve had some experience with that over the last few years, and almost exclusively for the last six months. I can honestly say that it is quite frustrating when compared to working under the GIL. Things that normally take days routinely stretch into weeks instead.
Of course, that would be likely to ease up over time under no-gil as we adjust and document the pain points and sort out effective tooling. Also, my experience is probably partly due to what I’ve been working on, though under no-gil many parts of the runtime would now be exposed to the same sort of concerns (and pain). Perhaps most contributors have more experience in dealing with free-threading or they’re smarter than me, but I expect that my experience would be typical going forward, at least for key parts of the runtime.
The fact that extension module maintainers would be exposed to the same peril definitely worries me too.
To be clear, I do hope there’s a way forward for no-gil and would like to see it succeed. Perhaps my concerns are unfounded. Perhaps the benefits will sufficiently outweigh the costs. Ultimately my biggest concern is that it’s hard to know one way or the other at the moment. At best, we’re making educated guesses. For something as far-reaching as no-gil I’d expect to have a clearer picture before a decision could be made.
Just to be clear, applying multiple interpreters to satisfy category 3 is a real possibility. However, how well it can do so is currently mostly hypothetical (no matter what I expect to happen). Folks are already exploring various approaches, but it may take a year or two before a firm conclusion can be made.
That makes it hard to reasonably suggest we hold off no-gil until we know if per-interpreter GIL solves the multi-core story, regardless of any optimism I have about that.
FWIW, all the work I’ve done for per-interpreter GIL has been driven by the specific goal of fixing Python’s multi-core story once-and-for-all. If I’ve succeeded (or at least given us a path to success) then no-gil, and its attendant costs, would be unnecessary. That said, I’ve tried to leave out my own aspirations when honestly considering no-gil, and do sincerely hope the project finds a way forward.
As a Rust framework maintainer for PyO3, I’ve had multiple feature requests from users to have multithreaded integration with Python. I’m keen to see Python’s multi-core story progress and have spent a fair deal of time thinking about how PyO3 can support per-interpreter GIL and nogil.
PyO3 supports neither multi-core solution yet, both due to their prerelease nature and a need to design appropriate safe abstractions for framework users to build multi-core extension modules correctly. I hope that PyO3 will offer an ergonomic way for extension authors to work with a multi-core Python.
My current opinion is that nogil is an easier match to Rust’s concurrency story. Rust is also able to help mitigate some categories of free-threading concerns that Eric alludes to above.
With nogil, the Rust integration seems straightforward. Multithreaded Rust programs can have multiple threads attached to the Python interpreter, and Rust’s concurrency primitives can allow these threads to interact safely to share Python and Rust state.
An example of this is asynchronous Rust code. Frequently this uses multithreaded event loops where tasks can move between worker threads. With a nogil Python, all worker threads could interact with shared Python objects without the latency of acquiring and releasing the GIL around Python touch points. I am aware of at least the Pants build system and the Robyn web framework as Rust programs which I understand are built using asynchronous event loops and could benefit from nogil.
For per-interpreter GIL, I think the same integration is more challenging. My understanding is that each thread would run its own interpreter and Python objects cannot be shared between interpreters. This creates difficulties with multithreaded Rust event loops which pass work between threads. To achieve object isolation I think tasks would not be able to store Python state across task yield points and instead would need to store state purely in Rust data structures.
Rust programs written with per-interpreter GIL can still make use of Rust’s concurrency primitives to coordinate data sharing between threads in Rust data structures and push work onto isolated Python interpreters. This model can definitely be successful. To make it possible for PyO3’s users to do this soundly, I need to rework PyO3’s APIs to enforce PEP 630’s module isolation. This still needs significant design work. (Help is very much welcomed from all interested in contributing!)
While understandable, I don’t think this is a wholly accurate categorization.
TL;DR - We do see demand from the numerical processing / ml side for free threaded 703-style Python:
We hear from our data science & ML teams at work that the CPython bottleneck of any form is always what must be worked around in their software stacks. It is the computation between calls into specialized-kernels and in between offloading mass computations to TPUs and GPUs where they want a throughput boost aka latency reduction. As that in-Python wall time often means the dedicated special xPU resource sits allocated to them but idle.
Thread parallelism for numerical computation, what you call (2), would be very much appreciated. They’ll write code that bridges the (2)-(3) categorization that could benefit latency wise by being parallelized across the N-NNN available cores on a machine.
Multiple interpreters (be it processes or be it our upcoming per-interpreter-gil subinterpreters that nobody can actually use yet) lack a good shared data from Python story so they don’t support that case as well as threads should. (regardless, I’m interested to see what use they can make of per-interpreter-gil subinterpreters or not - work done there will likely be useful in a potential future free-threaded pep703 python as well)
Branding is important. I suggest using the term “Multicore Python” for this effort instead of “nogil”. It sounds much more impressive, while “no GIL” is more obscure and negative. It worked for OCaml.
I’d like to offer some non-AI/ML use cases that I can think of which I believe, if no global lock was present, would use shared memory threading instead of multiple processes or multiple interpreters:
- Linters like flake8 (granularity: file)
- Test frameworks like pytest-xdist (granularity: test)
- WSGI/HTTP servers (granularity: request)
Note: I have to admit I haven’t read the PEP in detail, so this might be discussed there, apologies if so.
In particular I’m interested what actually happens if a data race happens. For example what happens in the classic “2 threads increment the same counter” scenario. Is it nasal demons undefined behavior as in C (AFAIK Python never had this), or something safer as in Java and Go (see last paragraph here)?
Yes, there are many use cases where people use multiple processes today where they might not have taken on that complexity otherwise. Even entire huge services that provide everyone cat videos and food photos that have relyed on
os.fork semantics and attempt to share memory pages by disabling refcount modifications to save massive compute resources. Because they couldn’t just use threads to expand their work on their machines as things written in C, C++, Java, Golang, and more recently Rust have been able to.
These would gain the option to reduce their complexity of managing separate processes just adopt threads for a local speed boost or overall resource reduction. It doesn’t mean multiprocessing or multi-k8s-equiv orchestrated jobs go away - threads aren’t a replacement - they’d just another shape in part of the stack to better utilize total available resources with less overhead.
Contrary to popular belief, these types of services don’t actually all sit around blocked on I/O - aka use case (1) - and even when doing so the I/O is generally on a network to another pile of machines, not something else locally consuming the local machine cores that Python cannot. ↩︎
Please read the PEP – especially the parts about thread safety for collections, and backwards compatibility.
Thanks for the very detailed review, Mark. This is important feedback.
6% vs 11% in unadjusted measurement is a large difference. We need to figure out if it was versus the same base 3.12 commit as what @colesbury measured. It’s also pretty interesting to learn why the GC changes are so significant. I’d just like to point out that if those changes aren’t counted towards nogil’s performance, their complexity shouldn’t be counted against nogil either.
In any case, I did some new personal benchmarks in response to some back and forth on Mastodon. I was benchmarking
4526c07 which builds on
v3.12.0a4) with the GIL disabled against the latest
8de607a to be exact) using pyperformance 1.0.8 with the default settings.
The average from the benchmarks shows nogil to be 3.18% slower than 3.11.3 compiled with the same settings. Interestingly, a number of benchmarks in the suite, namely the async_, asyncio_, and multiprocessing ones, are between 1.57x to 2.51x faster under nogil. While you’re saying that the benchmarks don’t spread the workload across threads, it seems that the stdlib and third-party libraries in some cases already make use of multiple cores. We could likely make the standard library faster in more places if free threading was a feature we could depend on.
Benchmark process details and full resultsFull results
I compiled both 3.11 and nogil-3.12 in production settings (
-O3 --enable-optimizations --with-lto) on Ubuntu 18.04 (
gcc 7.5.0) on my 2018 MBP (Intel i7-8850H) on gigabit Ethernet, wifi disabled. When I re-ran the benchmarks, I noticed a few results differing in ways
pyperformance compare considered “significant”. So I re-ran them a few more times and then consolidated them choosing the fastest results.
My thinking is that the slower results are a result of virtualization jitter, CPU affinity fluctuations, and background processes on Ubuntu itself. All raw files and the consolidation script are included in the results gist.
I woke up in cold sweat thinking “oh no, this is probably just 3.12 being faster than 3.11!”. But no, it is nogil actually. I made the same benchmarks against the v3.12.0a4 tag (since that’s what the nogil-3.12 is based on) and I’m still seeing between 1.59x and 2.50x gain on nogil. The only major difference between vanilla 3.11 and vanilla 3.12 is asyncio’s TCP that got 1.6x faster (kudos @kumaraditya303 and @itamaro!). So those two benchmarks are no longer competitive using nogil. But they are also barely slower.
More importantly, averaging everything out I’m getting a 6.65% unadjusted slowdown of nogil-3.12 compared to vanilla 3.12, which is very close to what Sam wrote in the PEP.
nogil* JSON files and the consolidation script are the same as above so I didn’t re-upload them.
I don’t have anything to add to the benchmarking discussion, but I have a thought about compatibility.
It seems unquestionable that nogil requires some work from extension module maintainers before an extension is considered safe to use in the nogil world. It also breaks the ABI (if only by changing the meaning of the bits in
Regardless of whether that’s a reason to proclaim that nogil is Python 4, this is likely to create a rift between modules that are nogil-compatible and those that aren’t.
If there’s one lesson we’ve learned from the Python 2 to 3 transition, it’s that it would have been very beneficial if Python 2 and 3 code could coexist in the same Python interpreter. We blew it that time, and it set us back by about a decade.
Let’s not blow it this time. If we’re going forward with nogil (and I’m not saying we are, but I can’t exclude it), let’s make sure there is a way to be able to import extensions requiring the GIL in a nogil interpreter without any additional shenanigans – neither the application code nor the extension module should have to be modified in any way (and that includes being able to run extensions build with ABI 3.x for some x).
Reposting the reply I put on the steering council pep 703 decision issue here per Guido’s suggestion just so it’s all in one place:
“”"The steering council is going to take its time on this. A huge thank you for working to keep it up to date! We’re not ready to simply pronounce on 703 as it has a HUGE blast radius.
Software isn’t ready for the decades of Python assumptions such a change turns on its head, even when it appears to work fine it’s a statistical qualm of “but does it really? how do we actually know? when might it not and how often?” asked for every transitive dep of code. For some things that Q&A could be easy, but for others it becomes a can of worms.
From a steering council perspective we effectively view a 703 threading enabled interpreter at a high level as a fork of the CPython VM. In the sense that extension modules are unlikely to work without noteworthy modifications and even some pure Python libraries may even need to start considering locking where it had not in the past.
That does not mean “no” to this. There is demand for it. (personally, I’ve wanted this since forever!) It’s just that it won’t be easy and we’ll need to consider the entire ecosystem and how to smoothly allow such a change to happen without breaking the world.
I’m glad to see the continued discuss thread with faster-cpython folks in particular piping up. The intersection between this work and ongoing single threaded performance improvements will always be high and we don’t want to hamper that in the near term.“”"
– me with a steering council hat on
(deleted the above post as it reiterated a point already made by @gpshead )
I won’t speculate on where the differences are coming from (it’s totally different hardware after all), but it actually seems like our benchmarks are directionally agreeing on the whole.
FWIW, if we want to confirm we’re comparing apples to apples, The Faster CPython team’s benchmark of nogil-3.12 vs. 3.12 was commit 1d39009 (faster-cpython/cpython/nogil-latest branch) against 3d5d3f7 upstream.
I think @pablogsal can probably speak to this better, but it’s pretty easy to make the gc perform way better time-wise and have a significant impact on the pyperformance benchmarks simply by increasing the threshold (and nogil increases the threshold of the first generation from 700 to 7000), but it’s really hard to reason about whether that’s acceptable vs. the tradeoff of more memory usage (it’s very workload dependent). We (completely independently of nogil) would need to validate whether that’s ok (and if they are, maybe we can merge that independently of nogil). Our reason to benchmark against an upstream with the same gc threshold changes was to remove the variable of that change which otherwise hides some of the drop in single-threaded performance. However, there’s so much going on and complex interactions possible that I’m not going to claim that it’s a completely accurate thing to do. I’d especially like to point out that nogil makes significant deep changes to gc, and we only modified upstream to change the generations and thresholds.
There are several subsystems where we have struggled with performance and thread scaling. I will talk about the three main ones:
We have a NoSQL graph database that stores all the data for the system. Originally it was pure Python code. Now it is mostly C++ for performance, but some critical parts of it are still Python. The C++ is very heavily multi-threaded to be able to scale across CPUs, so calls into Python code are a significant crunch point. Both the C++ and Python parts have a great deal of state that must be shared between threads. If the Python parts could run truly concurrently, that would remove one of the most significant bottlenecks in the database. It is not plausible to use multiple processes here for the Python parts, and I do not think it would be possible to use multiple interpreters due to the shared state. The long-term thinking is that we will end up replacing all the Python parts with C++, but we would be able to reevaluate that if the Python could scale across CPUs.
The second part of the system of interest is the part that connects to all the computers in the environment and interrogates them. One example thing that it does is run commands on remote computers and parses the results that come back. It spends a lot of its time waiting for results, so to get good throughput it uses hundreds of threads to talk to many targets simultaneously. Of course then what happens is that multiple results arrive simultaneously and require parsing with Python code. The GIL becomes a significant bottleneck here. This is quite a difficult situation because usage flips between being blocked for long periods of time to requiring sudden spikes of processing, where at any given moment, hundreds of threads are blocked but a handful have CPU-intensive processing to do. We cannot predict which threads will complete at which times. Memory usage means we can’t run hundreds of processes. Clearly we could run a number of multi-threaded processes, but that would still suffer from the possibility that a particular process is unlucky to have a sudden spike of CPU load to handle. If the parsing could scale across CPUs, that would definitely be a significant benefit.
The third part to mention is what we call the “engine”, which is responsible for the main meat of the data processing, including running all the Python code generated from our in-house language. For this, we do run multiple processes to scale across CPUs, but that leads to quite a lot of complexity for coordinating the actions of the processes, and we still see situations where one engine process happens to get unlucky and has too much work to do while others are idle. A single truly multi-threaded engine would be more efficient and easier to manage.
In summary, BMC Discovery is a large real-world Python application that has several areas in which a NoGIL Python would make a substantial difference. Obviously faster processing of Python code is extremely valuable too, but given a choice between single threaded performance and scaling across CPUs, the CPU scaling is more valuable to us. Customers run this product on machines with 32 or 64 CPUs, so we will happily take a 10% hit in single threaded performance if it means we can get 30 or 60 times more performance by CPU scaling.
I’m very much just a bystander here, but one question that immediately came to mind for me is whether it would be a problem for you if nogil builds of Python are binary incompatible with gil-based builds. Specifically, will you be able to gain the benefits of the nogil build if 3rd party libraries are slow/unable to publish nogil-compatible binaries? And relatedly, how many such libraries do you depend on?