PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

Some comments:

Branding

Branding is important. I suggest using the term “Multicore Python” for this effort instead of “nogil”. It sounds much more impressive, while “no GIL” is more obscure and negative. It worked for OCaml.

Use cases

I’d like to offer some non-AI/ML use cases which I believe would use shared-memory threading, instead of multiple processes or multiple interpreters, if no global lock were present:

  1. Linters like flake8 (granularity: file)
  2. Test frameworks like pytest-xdist (granularity: test)
  3. WSGI/HTTP servers (granularity: request)
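
To make the file-granularity case concrete, here is a minimal sketch that checks several files concurrently with a thread pool. The `check_source` function, the line-length rule, and the sample inputs are hypothetical stand-ins for illustration and are not flake8's API. Under the GIL these threads take turns running Python code; the hope is that a free-threaded build could run them on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_LINE_LENGTH = 79  # illustrative limit, chosen for this sketch


def check_source(name, source):
    """Hypothetical per-file check: report lines longer than the limit."""
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append((name, lineno, "line too long"))
    return problems


def check_all(sources, workers=4):
    """Run check_source on every (name, source) pair, one task per file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(lambda item: check_source(*item), sources.items())
        # Flatten the per-file lists into one report, in input order.
        return [problem for problems in per_file for problem in problems]


# In-memory stand-ins for files on disk.
SAMPLE_SOURCES = {
    "ok.py": "x = 1\n",
    "long.py": "y = 2\n" + "z = " + "1" * 100 + "\n",
}
```

Calling `check_all(SAMPLE_SOURCES)` returns one entry for the over-long second line of `long.py`. The same shape would apply per test or per request for the other two use cases.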

Memory model

Note: I have to admit I haven’t read the PEP in detail, so this might be discussed there, apologies if so.

When shared-memory parallelism is available, a Memory Model is needed in order to program it safely. Examples: C/C++, Java, Go.

In particular I’m interested in what actually happens when a data race occurs, for example in the classic “2 threads increment the same counter” scenario. Is it nasal-demons undefined behavior as in C (AFAIK Python never had this), or something safer as in Java and Go (see last paragraph here)?
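
For readers who want the counter scenario spelled out, here is a minimal sketch. It does not answer what PEP 703 guarantees; it only shows the racy pattern next to a lock-protected version. The `Counter` class and `hammer` helper are names made up for this example.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read-modify-write without synchronization: another thread may run
        # between the read and the write, so an update can be lost.
        self.value += 1

    def safe_increment(self):
        # Holding the lock makes the read-modify-write atomic with respect
        # to every other thread that also takes the lock.
        with self._lock:
            self.value += 1


def hammer(increment, n_threads=2, n_increments=100_000):
    """Call increment() n_increments times from each of n_threads threads."""
    def worker():
        for _ in range(n_increments):
            increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


racy = Counter()
hammer(racy.unsafe_increment)
# racy.value can end up below 200_000 if updates were lost; no total is guaranteed.

locked = Counter()
hammer(locked.safe_increment)
# locked.value is 200_000 because every increment ran under the lock.
```

The question in this post is about the first half: whether the unsynchronized version can only lose updates, as in Java or Go, or whether anything worse can happen.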

13 Likes

Yes, there are many use cases where people use multiple processes today and might not have taken on that complexity otherwise. There are even entire huge services, providing everyone with cat videos and food photos, that have relied on os.fork semantics and attempted to share memory pages by disabling refcount modifications to save massive compute resources.[1] They did so because they couldn’t just use threads to spread their work across their machines, as things written in C, C++, Java, Golang, and more recently Rust have been able to.

These would gain the option to reduce the complexity of managing separate processes and just adopt threads for a local speed boost or overall resource reduction. It doesn’t mean multiprocessing or multi-k8s-equiv orchestrated jobs go away - threads aren’t a replacement - they’d just be another shape in part of the stack to better utilize total available resources with less overhead.


  1. Contrary to popular belief, these types of services don’t actually all sit around blocked on I/O - aka use case (1) - and even when doing so the I/O is generally on a network to another pile of machines, not something else locally consuming the local machine cores that Python cannot.

7 Likes

Please read the PEP – especially the parts about thread safety for collections, and backwards compatibility.

Thanks for the very detailed review, Mark. This is important feedback.

6% vs 11% in unadjusted measurement is a large difference. We need to figure out if it was versus the same base 3.12 commit as what @colesbury measured. It’s also pretty interesting to learn why the GC changes are so significant. I’d just like to point out that if those changes aren’t counted towards nogil’s performance, their complexity shouldn’t be counted against nogil either.

In any case, I did some new personal benchmarks in response to some back and forth on Mastodon. I was benchmarking nogil-3.12 (4526c07 which builds on v3.12.0a4) with the GIL disabled against the latest 3.11 (8de607a to be exact) using pyperformance 1.0.8 with the default settings.

The average from the benchmarks shows nogil to be 3.18% slower than 3.11.3 compiled with the same settings. Interestingly, a number of benchmarks in the suite, namely the async_*, asyncio_*, and multiprocessing ones, are between 1.57x and 2.51x faster under nogil. While you’re saying that the benchmarks don’t spread the workload across threads, it seems that the stdlib and third-party libraries in some cases already make use of multiple cores. We could likely make the standard library faster in more places if free threading was a feature we could depend on.

Benchmark process details and full results

I compiled both 3.11 and nogil-3.12 in production settings (-O3 --enable-optimizations --with-lto) on Ubuntu 18.04 (gcc 7.5.0) on my 2018 MBP (Intel i7-8850H) on gigabit Ethernet, wifi disabled. When I re-ran the benchmarks, I noticed a few results differing in ways pyperformance compare considered “significant”. So I re-ran them a few more times and then consolidated them, choosing the fastest results.

My thinking is that the slower results are a result of virtualization jitter, CPU affinity fluctuations, and background processes on Ubuntu itself. All raw files and the consolidation script are included in the results gist.

17 Likes

I woke up in a cold sweat thinking “oh no, this is probably just 3.12 being faster than 3.11!”. But no, it actually is nogil. I ran the same benchmarks against the v3.12.0a4 tag (since that’s what nogil-3.12 is based on) and I’m still seeing between 1.59x and 2.50x gain on nogil. The only major difference between vanilla 3.11 and vanilla 3.12 is asyncio’s TCP, which got 1.6x faster (kudos @kumaraditya303 and @itamaro!). So those two benchmarks are no longer competitive using nogil. But they are also barely slower.

More importantly, averaging everything out I’m getting a 6.65% unadjusted slowdown of nogil-3.12 compared to vanilla 3.12, which is very close to what Sam wrote in the PEP.

Full results: Python v3.12.0a4 vs nogil-3.12, both compiled with the same options: -O3 --enable-optimizations --with-lto compare · GitHub

The nogil* JSON files and the consolidation script are the same as above so I didn’t re-upload them.

8 Likes

I don’t have anything to add to the benchmarking discussion, but I have a thought about compatibility.

It seems unquestionable that nogil requires some work from extension module maintainers before an extension is considered safe to use in the nogil world. It also breaks the ABI (if only by changing the meaning of the bits in ob_refcnt).

Regardless of whether that’s a reason to proclaim that nogil is Python 4, this is likely to create a rift between modules that are nogil-compatible and those that aren’t.

If there’s one lesson we’ve learned from the Python 2 to 3 transition, it’s that it would have been very beneficial if Python 2 and 3 code could coexist in the same Python interpreter. We blew it that time, and it set us back by about a decade.

Let’s not blow it this time. If we’re going forward with nogil (and I’m not saying we are, but I can’t exclude it), let’s make sure there is a way to be able to import extensions requiring the GIL in a nogil interpreter without any additional shenanigans – neither the application code nor the extension module should have to be modified in any way (and that includes being able to run extensions built with ABI 3.x for some x).

12 Likes

Reposting the reply I put on the steering council pep 703 decision issue here per Guido’s suggestion just so it’s all in one place:

"""The steering council is going to take its time on this. A huge thank you for working to keep it up to date! We’re not ready to simply pronounce on 703 as it has a HUGE blast radius.

Software isn’t ready for the decades of Python assumptions such a change turns on its head. Even when it appears to work fine, it’s a statistical qualm of “but does it really? how do we actually know? when might it not and how often?” asked for every transitive dep of code. For some things that Q&A could be easy, but for others it becomes a can of worms.

From a steering council perspective we effectively view a 703 threading enabled interpreter at a high level as a fork of the CPython VM, in the sense that extension modules are unlikely to work without noteworthy modifications and some pure Python libraries may even need to start considering locking where they had not in the past.

That does not mean “no” to this. There is demand for it. (personally, I’ve wanted this since forever!) It’s just that it won’t be easy and we’ll need to consider the entire ecosystem and how to smoothly allow such a change to happen without breaking the world.

I’m glad to see the continued discuss thread with faster-cpython folks in particular piping up. The intersection between this work and ongoing single threaded performance improvements will always be high and we don’t want to hamper that in the near term."""

– me with a steering council hat on

6 Likes

(deleted the above post as it reiterated a point already made by @gpshead )

I won’t speculate on where the differences are coming from (it’s totally different hardware after all), but it actually seems like our benchmarks are directionally agreeing on the whole.

FWIW, if we want to confirm we’re comparing apples to apples, the Faster CPython team’s benchmark of nogil-3.12 vs. 3.12 was commit 1d39009 (faster-cpython/cpython/nogil-latest branch) against 3d5d3f7 upstream.

I think @pablogsal can probably speak to this better, but it’s pretty easy to make the gc perform way better time-wise, with a significant impact on the pyperformance benchmarks, simply by increasing the threshold (nogil increases the threshold of the first generation from 700 to 7000). It’s really hard, though, to reason about whether that’s acceptable given the tradeoff of more memory usage (it’s very workload dependent). We (completely independently of nogil) would need to validate whether those changes are ok (and if they are, maybe we can merge them independently of nogil). Our reason to benchmark against an upstream with the same gc threshold changes was to remove the variable of that change, which otherwise hides some of the drop in single-threaded performance. However, there’s so much going on, and so many complex interactions are possible, that I’m not going to claim it’s a completely accurate thing to do. I’d especially like to point out that nogil makes significant deep changes to gc, and we only modified upstream to change the generations and thresholds.
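
For anyone wanting to try the threshold change on their own workload, the first-generation threshold can be adjusted at runtime with the standard `gc` module. This is only a sketch of the knob being discussed, not the patch used in either benchmark run; the value 7000 is taken from the post above and the variable names are arbitrary.

```python
import gc

# Remember the current thresholds so the change can be undone.
original = gc.get_threshold()

# Raise only the first-generation threshold, mirroring the 700 -> 7000 change
# described above; the other two values are passed through unchanged.
gc.set_threshold(7000, original[1], original[2])
raised = gc.get_threshold()

# Restore the interpreter's previous settings.
gc.set_threshold(*original)
restored = gc.get_threshold()
```

Running a benchmark between the two `set_threshold` calls gives a rough idea of how much of a measured difference comes from collection frequency alone, at the cost of more memory held between collections.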

1 Like

Perhaps I can give one useful data point for a large scale application implemented in Python, and how we would benefit from NoGIL. BMC Discovery is an enterprise application that allows large customers to discover and model the hardware and software in their data centres and cloud environments. It is implemented with 1.3 million lines of Python code, plus 750K lines of code in an in-house language that compiles into another 2 million lines of Python. (To complete the picture, there are also 120K lines of C++, plus a whole load of JavaScript and TypeScript for the front-end that is not relevant here.)

There are several subsystems where we have struggled with performance and thread scaling. I will talk about the three main ones:

We have a NoSQL graph database that stores all the data for the system. Originally it was pure Python code. Now it is mostly C++ for performance, but some critical parts of it are still Python. The C++ is very heavily multi-threaded to be able to scale across CPUs, so calls into Python code are a significant crunch point. Both the C++ and Python parts have a great deal of state that must be shared between threads. If the Python parts could run truly concurrently, that would remove one of the most significant bottlenecks in the database. It is not plausible to use multiple processes here for the Python parts, and I do not think it would be possible to use multiple interpreters due to the shared state. The long-term thinking is that we will end up replacing all the Python parts with C++, but we would be able to reevaluate that if the Python could scale across CPUs.

The second part of the system of interest is the part that connects to all the computers in the environment and interrogates them. One example thing that it does is run commands on remote computers and parses the results that come back. It spends a lot of its time waiting for results, so to get good throughput it uses hundreds of threads to talk to many targets simultaneously. Of course then what happens is that multiple results arrive simultaneously and require parsing with Python code. The GIL becomes a significant bottleneck here. This is quite a difficult situation because usage flips between being blocked for long periods of time to requiring sudden spikes of processing, where at any given moment, hundreds of threads are blocked but a handful have CPU-intensive processing to do. We cannot predict which threads will complete at which times. Memory usage means we can’t run hundreds of processes. Clearly we could run a number of multi-threaded processes, but that would still suffer from the possibility that a particular process is unlucky to have a sudden spike of CPU load to handle. If the parsing could scale across CPUs, that would definitely be a significant benefit.
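
The shape of that workload can be sketched as follows. This is a toy model, not BMC Discovery code: `fetch`, `parse`, the fake `key=value` output format, and the worker count are all assumptions made for illustration. The blocking step releases the GIL, while the parsing step is ordinary Python code that today can only run in one thread at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(target):
    """Stand-in for running a remote command: blocks, then returns raw text."""
    time.sleep(0.01)  # simulated network wait; other threads can run meanwhile
    return f"host={target};cpus=4;mem=16"


def parse(raw):
    """CPU-bound step: pure Python, so it is serialized by the GIL today."""
    return dict(field.split("=", 1) for field in raw.split(";"))


def interrogate(target):
    """One unit of work per target: wait for the result, then parse it."""
    return parse(fetch(target))


def interrogate_all(targets, workers=100):
    """Talk to many targets at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(interrogate, targets))
```

For example, `interrogate_all(["t0", "t1"])` returns one parsed dict per target. When many `fetch` calls finish at the same moment, all the resulting `parse` calls queue up behind one another, which is the spike described above.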

The third part to mention is what we call the “engine”, which is responsible for the main meat of the data processing, including running all the Python code generated from our in-house language. For this, we do run multiple processes to scale across CPUs, but that leads to quite a lot of complexity for coordinating the actions of the processes, and we still see situations where one engine process happens to get unlucky and has too much work to do while others are idle. A single truly multi-threaded engine would be more efficient and easier to manage.

In summary, BMC Discovery is a large real-world Python application that has several areas in which a NoGIL Python would make a substantial difference. Obviously faster processing of Python code is extremely valuable too, but given a choice between single threaded performance and scaling across CPUs, the CPU scaling is more valuable to us. Customers run this product on machines with 32 or 64 CPUs, so we will happily take a 10% hit in single threaded performance if it means we can get 30 or 60 times more performance by CPU scaling.
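
As a back-of-the-envelope check on that tradeoff, here is a small sketch based on Amdahl's law. The function name, the way the 10% penalty is modelled (as lost throughput), and the parallel fractions are assumptions chosen for illustration, not measurements from any real system.

```python
def expected_speedup(cores, single_thread_penalty=0.10, parallel_fraction=1.0):
    """Estimated speedup over a single-threaded run (Amdahl's law).

    single_thread_penalty: fraction of single-threaded throughput lost (assumed).
    parallel_fraction: share of the work that can run in parallel (assumed).
    """
    serial = 1.0 - parallel_fraction
    amdahl = 1.0 / (serial + parallel_fraction / cores)
    return (1.0 - single_thread_penalty) * amdahl


best_case_32 = expected_speedup(32)  # about 28.8
best_case_64 = expected_speedup(64)  # about 57.6
mostly_parallel_32 = expected_speedup(32, parallel_fraction=0.95)  # about 11.3
```

With perfectly parallel work the estimate is close to the 30x and 60x figures mentioned above. Even a 5% serial portion lowers the 32-core figure considerably, yet it remains far larger than the 10% single-threaded cost.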

12 Likes

I’m very much just a bystander here, but one question that immediately came to mind for me is whether it would be a problem for you if nogil builds of Python are binary incompatible with gil-based builds. Specifically, will you be able to gain the benefits of the nogil build if 3rd party libraries are slow/unable to publish nogil-compatible binaries? And relatedly, how many such libraries do you depend on?

2 Likes

Well obviously that would be a consideration, but my feeling is that it would not be too problematic. We do our own builds of Python and all the third party libraries we use so we have reproducible builds and to insulate ourselves somewhat from supply chain attacks. If we went with a nogil Python, we would build everything against that. We could even have two builds for some or all packages if we felt the need to retain a gil-based build as well as the nogil one.

Packages we use that don’t (yet) support nogil at all would clearly be more of a problem. We use about 100 third party packages, but only a relatively small number are truly essential, and most are pure Python and so are less likely to be problematic. It would end up being a cost-benefit analysis of how much investment we could make in replacing or fixing nogil incompatible packages, compared to the benefit we would get from nogil.

1 Like

Another performance data point: we (Backblaze) have been testing nogil for real-world I/O-heavy multithreaded workloads, specifically with our CLI concurrently downloading chunks of large files from cloud object storage, and have seen a very worthwhile performance improvement. asyncio exists as an alternative approach, but that would require a complete rewrite of a large amount of code.

We wrote up our experience in this blog post.

TL;DR: we’ve seen incremental improvements from successive CPython versions; indeed, for this workload, CPython 3.11 is almost twice as fast as 3.6. But 3.9-nogil was 2.5 times faster (single file) or 10 times faster (multiple files) on the test than unmodified CPython 3.9.

Thanks to @ivoflipse for suggesting we post our experiences here.

14 Likes

Hi Mark,

  • You are comparing to a different base than the PEP. As stated in the PEP and at the language summit, the comparisons are to the implementation of immortal objects from PR 19474, specifically commit 018be4c. I think that’s an appropriate comparison because supporting immortal objects is a performance cost shared between this PEP and 3.12. That commit was chosen because it was recent and the base of that PR has the same performance as 3.12.0a4.
  • I believe you are using an old commit for nogil-3.12, from before the language summit and PEP update when nogil-3.12 was still in active development. I am comparing d595911 with 018be4c on Ubuntu 22.04 (AWS c5n.metal) with GCC 11 and PGO+LTO. I am using d595911 because it’s the commit right before disabling the GIL; that avoids any complications from bm_concurrent_imap using threads.
  • You have linked to a document that reports 10% overhead, but write 11% in the post.
  • The GC numbers are a useful data point. Thanks for providing them; they were larger than I expected.
  • Mimalloc does not provide a 1% performance improvement. To the extent that it provided a performance benefit in nogil-3.9, that was because it offset the cost of removing freelists, but nogil-3.12 keeps freelists as per-thread state. You can compare a595fc1593 with d13c63dee9 (the commits that integrate mimalloc). I measured a tiny <1% regression from adding mimalloc.
  • Locking is not 5% overhead; it’s about 1%. You can benchmark it with the critical sections API as no-ops.

Regarding parallel applications, I disagree with both the general categorization and the specific claims about those categories. Many Python applications do not fall neatly into a single category. Outside of simple examples, machine learning applications are rarely “just” machine learning. They frequently include things like data processing, logging, model serving, I/O, and network communication. From what I’ve seen of Meta’s Python web servers, you might think from a high-level description that they fall neatly into category 1, but if you look more deeply, there is more shared state than you might expect because the underlying C++ libraries have their own thread pools and caches. (A lot of infrastructure is designed to be used from both C++ and Python services.)

I also disagree with the notion that machine learning is well served by multiple interpreters. PyTorch has had a multiple interpreters solution (multipy/torch::deploy) for about two years. Based on that experience, I don’t think many PyTorch developers would consider multiple interpreters as solving the multi-core story once and for all. The challenges include both usability issues due to the lack of sharing in Python (even for read-only objects), and additional technical complexity due to the potential 1:N mapping from C++ objects to Python wrappers.

Regarding multiple interpreters, there are some implicit assumptions that I don’t think are correct:

  • That free-threading is a world-of-pain and that multiple interpreters somehow avoids this. The isolation from multiple interpreters is only on the Python side, but to the extent that there are hard to debug thread-safety bugs, they are pretty much exclusively on the C/C++ side of things. Multiple interpreters do not isolate C/C++ code.
  • That nogil is going to break the world, when there is an implementation that works with a large number of extensions (NumPy, SciPy, scikit-learn, PyTorch, pybind11, Cython to name a few) and extension developers have repeatedly expressed interest in supporting it. This is in contrast to multiple interpreters, which requires more work for extension developers to support. For example, it took only a few lines of code and less than a day to get PyTorch working with nogil-3.9. It would be a very large undertaking (weeks or months) to get it working with multiple interpreters. This is also in contrast to the 3.11 release; it’s taken over a month of engineering work for PyTorch to support 3.11.
35 Likes

I actually do have a question about benchmarking. @ambv, I don’t understand what’s going on with the various async* benchmarks, which you measured to be much faster. Maybe as a side effect of not having a GIL, the event loop got faster at task switching, possibly because I/O selectors no longer have to release and re-acquire the GIL?

Thanks, I’ve read these sections now. As I understand it, there is a per-object mutex (ob_mutex), that is taken by the Py_BEGIN_CRITICAL_SECTION/Py_END_CRITICAL_SECTION API.

I see in the nogil-py312 branch from grepping Py_BEGIN_CRITICAL_SECTION that it is used in various collections (lru_cache, stringio, dictobject, listobject, setobject), and in two bytecode instructions, STORE_ATTR_WITH_HINT and CALL_NO_KW_LIST_APPEND. I further see that some instructions are using _PyMutex_lock(&_PyRuntime.mutex).

But other (specialized) instructions, like STORE_ATTR_SLOT and LOAD_ATTR_SLOT, seem to do a direct unprotected memory read/write, as far as I can tell. Is there something protecting them?

Experiment

I’ve tried this experiment on nogil-py312, and it didn’t crash. I only spent a few minutes on it though, and my laptop only has 2 cores (4 with hyperthreading).

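import threading

class Obj:
    __slots__ = ('attr',)
    def __init__(self):
        self.attr = None

o = Obj()  # one object shared by every thread

X = object()  # sentinel value that the writers store
def writer():
    # Store the sentinel into the slot over and over (never returns).
    x = X
    while True:
        o.attr = x

def reader():
    # Load the slot over and over and check it holds the sentinel (never returns).
    x = X
    while True:
        assert o.attr is x

# Two writer threads and one reader thread, plus a reader on the main thread.
for _ in range(2):
    threading.Thread(target=writer).start()
for _ in range(1):
    threading.Thread(target=reader).start()
reader()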

“Multicore Python” sounds much better than nogil, a big +1 from me

2 Likes

As the benchmarking guy, I’ll respond to just the first part and leave the rest for others.

We determined the base commit automatically using git merge-base upstream/main 1d39009, which does indeed pre-date immortal objects, but is the cleanest way to isolate just the nogil changes. I’ll kick off a run of 018be4c on our hardware, and then we should be able to generate comparisons against it.

EDIT: 018be4c isn’t on any branch in upstream CPython. Did you mean: gh-84436: Implement Immortal Objects (gh-19474) · python/cpython@ea2c001 (github.com)? If so, we already have benchmarking results for that here: benchmarking-public/results/bm-20230422-3.12.0a7+-ea2c001 at main · faster-cpython/benchmarking-public · GitHub.

I agree benchmarking a later commit would be helpful here. Is there one that’s rebased on top of the upstream immortal objects changes? That would be ideal in terms of removing variables in the data, but no worries if that’s a major task as I assume it might be.

Unfortunately, I’m unable to find d595911, though. It’s a 404 on your fork on Github. Maybe it’s pushed somewhere else?

colesbury/nogil is the original 3.9 implementation. The 3.12 rebase is in colesbury/nogil-3.12. There, you don’t get a 404 :)

1 Like

Unfortunately, d595911 segfaults when running pyperf system tune, so we’ll need to resolve that before any results would be comparable against our other results. I filed a bug: psutil segfaults on import (when running pyperf system tune) · Issue #3 · colesbury/nogil-3.12 (github.com).