PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

(deleted the above post as it reiterated a point already made by @gpshead )

I won’t speculate on where the differences are coming from (it’s totally different hardware after all), but it actually seems like our benchmarks are directionally agreeing on the whole.

FWIW, to confirm we’re comparing apples to apples: the Faster CPython team’s benchmark of nogil-3.12 vs. 3.12 compared commit 1d39009 (faster-cpython/cpython, nogil-latest branch) against 3d5d3f7 upstream.

I think @pablogsal can probably speak to this better, but it’s fairly easy to make the gc perform much better time-wise, with a significant impact on the pyperformance benchmarks, simply by increasing the thresholds (nogil raises the first-generation threshold from 700 to 7000). It’s much harder to reason about whether that’s acceptable against the tradeoff of higher memory usage, which is very workload-dependent. We would need to validate that change on its own merits, completely independently of nogil (and if it holds up, maybe we can merge it independently of nogil). Our reason for benchmarking against an upstream with the same gc threshold changes was to remove that variable, which otherwise hides some of the drop in single-threaded performance. That said, there is so much going on, and so many possible complex interactions, that I won’t claim it’s a completely accurate thing to do. I’d especially like to point out that nogil makes significant, deep changes to the gc, while we only modified upstream to change the generations and thresholds.
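
To make the threshold piece concrete, here is a rough illustration using the stdlib gc module (an illustration only; nogil’s actual changes to the collector go much deeper than tweaking thresholds):

import gc

# Default CPython thresholds; nogil raises the first-generation threshold
# from 700 to 7000, which means fewer (but larger) young-generation
# collections at the cost of holding more garbage between them.
print(gc.get_threshold())             # (700, 10, 10) on a default build
gen0, gen1, gen2 = gc.get_threshold()
gc.set_threshold(7000, gen1, gen2)
print(gc.get_threshold())             # now (7000, 10, 10)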

1 Like

Perhaps I can give one useful data point for a large-scale application implemented in Python, and how we would benefit from NoGIL. BMC Discovery is an enterprise application that allows large customers to discover and model the hardware and software in their data centres and cloud environments. It is implemented with 1.3 million lines of Python code, plus 750K lines of code in an in-house language that compiles into another 2 million lines of Python. (To complete the picture, there are also 120K lines of C++, plus a whole load of JavaScript and TypeScript for the front-end that is not relevant here.)

There are several subsystems where we have struggled with performance and thread scaling. I will talk about the three main ones:

We have a NoSQL graph database that stores all the data for the system. Originally it was pure Python code. Now it is mostly C++ for performance, but some critical parts of it are still Python. The C++ is very heavily multi-threaded to be able to scale across CPUs, so calls into Python code are a significant crunch point. Both the C++ and Python parts have a great deal of state that must be shared between threads. If the Python parts could run truly concurrently, that would remove one of the most significant bottlenecks in the database. It is not plausible to use multiple processes here for the Python parts, and I do not think it would be possible to use multiple interpreters due to the shared state. The long-term thinking is that we will end up replacing all the Python parts with C++, but we would be able to reevaluate that if the Python could scale across CPUs.

The second part of the system of interest is the part that connects to all the computers in the environment and interrogates them. One example thing that it does is run commands on remote computers and parse the results that come back. It spends a lot of its time waiting for results, so to get good throughput it uses hundreds of threads to talk to many targets simultaneously. Of course, what then happens is that multiple results arrive simultaneously and require parsing with Python code. The GIL becomes a significant bottleneck here. This is quite a difficult situation because usage flips between being blocked for long periods of time and requiring sudden spikes of processing, where at any given moment hundreds of threads are blocked but a handful have CPU-intensive processing to do. We cannot predict which threads will complete at which times. Memory usage means we can’t run hundreds of processes. Clearly we could run a number of multi-threaded processes, but that would still suffer from the possibility that a particular process is unlucky enough to have a sudden spike of CPU load to handle. If the parsing could scale across CPUs, that would definitely be a significant benefit.
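
As a rough sketch of the shape of that workload (illustrative only, with made-up numbers rather than our actual code):

import time
from concurrent.futures import ThreadPoolExecutor

def interrogate(target):
    # Stand-in for running a command on a remote machine: mostly blocked...
    time.sleep(5)
    output = "a line of command output\n" * 50_000
    # ...followed by a burst of CPU-bound parsing, which the GIL serializes
    # no matter how many cores the box has.
    return sum(len(line.split()) for line in output.splitlines())

# Hundreds of mostly-idle threads; only a handful need the CPU at any moment.
with ThreadPoolExecutor(max_workers=400) as pool:
    results = list(pool.map(interrogate, range(2_000)))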

The third part to mention is what we call the “engine”, which is responsible for the main meat of the data processing, including running all the Python code generated from our in-house language. For this, we do run multiple processes to scale across CPUs, but that leads to quite a lot of complexity for coordinating the actions of the processes, and we still see situations where one engine process happens to get unlucky and has too much work to do while others are idle. A single truly multi-threaded engine would be more efficient and easier to manage.

In summary, BMC Discovery is a large real-world Python application that has several areas in which a NoGIL Python would make a substantial difference. Obviously faster processing of Python code is extremely valuable too, but given a choice between single threaded performance and scaling across CPUs, the CPU scaling is more valuable to us. Customers run this product on machines with 32 or 64 CPUs, so we will happily take a 10% hit in single threaded performance if it means we can get 30 or 60 times more performance by CPU scaling.

12 Likes

I’m very much just a bystander here, but one question that immediately came to mind for me is whether it would be a problem for you if nogil builds of Python are binary incompatible with gil-based builds. Specifically, will you be able to gain the benefits of the nogil build if 3rd party libraries are slow/unable to publish nogil-compatible binaries? And relatedly, how many such libraries do you depend on?

2 Likes

Well, obviously that would be a consideration, but my feeling is that it would not be too problematic. We do our own builds of Python and all the third-party libraries we use, so that we have reproducible builds and are insulated somewhat from supply chain attacks. If we went with a nogil Python, we would build everything against that. We could even have two builds for some or all packages if we felt the need to retain a GIL-based build as well as the nogil one.

Packages we use that don’t (yet) support nogil at all would clearly be more of a problem. We use about 100 third-party packages, but only a relatively small number are truly essential, and most are pure Python and so are less likely to be problematic. It would end up being a cost-benefit analysis of how much investment we could make in replacing or fixing nogil-incompatible packages, compared to the benefit we would get from nogil.

1 Like

Another performance data point: we (Backblaze) have been testing nogil for real-world, I/O-heavy multithreaded workloads, specifically with our CLI concurrently downloading chunks of large files from cloud object storage, and have seen a very worthwhile performance improvement. asyncio exists as an alternative approach, but it would require a complete rewrite of a large amount of code.

We wrote up our experience in this blog post.

TL;DR: while we’ve seen incremental improvements from successive CPython versions (for this workload, CPython 3.11 is almost twice as fast as 3.6), 3.9-nogil was 2.5 or 10 times faster than unmodified CPython 3.9 on the same test (for single and multiple files, respectively).
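
For context, the shape of the workload is roughly the sketch below (illustrative only; the URL, chunk size, and worker count are placeholders rather than our CLI’s actual code):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

URL = "https://example.com/large-object"    # placeholder endpoint
CHUNK = 8 * 1024 * 1024                     # 8 MiB per ranged request

def fetch_range(start):
    # Each thread fetches one byte range; reading and copying the response
    # body is where the threads contend for the GIL on a stock build.
    req = urllib.request.Request(
        URL, headers={"Range": f"bytes={start}-{start + CHUNK - 1}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

offsets = range(0, 20 * CHUNK, CHUNK)
with ThreadPoolExecutor(max_workers=10) as pool:
    parts = dict(pool.map(fetch_range, offsets))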

Thanks to @ivoflipse for suggesting we post our experiences here.

14 Likes

Hi Mark,

  • You are comparing to a different base than the PEP. As stated in the PEP and at the language summit, the comparisons are to the implementation of immortal objects from PR 19474, specifically commit 018be4c. I think that’s an appropriate comparison because supporting immortal objects is a performance cost shared between this PEP and 3.12. That commit was chosen because it was recent and the base of that PR has the same performance as 3.12.0a4.
  • I believe you are using an old commit for nogil-3.12, from before the language summit and PEP update when nogil-3.12 was still in active development. I am comparing d595911 with 018be4c on Ubuntu 22.04 (AWS c5n.metal) with GCC 11 and PGO+LTO. I am using d595911 because it’s the commit right before disabling the GIL; that avoids any complications from bm_concurrent_imap using threads.
  • You have linked to a document that reports 10% overhead, but you wrote 11% in the post.
  • The GC numbers are a useful data point. Thanks for providing them; they were larger than I expected.
  • Mimalloc does not provide a 1% performance improvement. To the extent that it provided a performance benefit in nogil-3.9, that was because it offset the cost of removing freelists, but nogil-3.12 keeps freelists as per-thread state. You can compare a595fc1593 with d13c63dee9 (the commits that integrate mimalloc). I measured a tiny (<1%) regression from adding mimalloc.
  • Locking is not 5% overhead; it’s about 1%. You can benchmark it by building with the critical-section API as no-ops.

Regarding parallel applications, I disagree with both the general categorization and the specific claims about those categories. Many Python applications do not fall neatly into a single category. Outside of simple examples, machine learning applications are rarely “just” machine learning. They frequently include things like data processing, logging, model serving, I/O, and network communication. From what I’ve seen of Meta’s Python web servers, you might think from a high-level description that they fall neatly into category 1, but if you look more deeply, there is more shared state than you might expect because the underlying C++ libraries have their own thread pools and caches. (A lot of infrastructure is designed to be used from both C++ and Python services.)

I also disagree with the notion that machine learning is well served by multiple interpreters. PyTorch has had a multiple-interpreters solution (multipy/torch::deploy) for about two years. Based on that experience, I don’t think many PyTorch developers would consider multiple interpreters to have solved the multi-core story once and for all. The challenges include both usability issues due to the lack of sharing in Python (even for read-only objects), and additional technical complexity due to the potential 1:N mapping from C++ objects to Python wrappers.

Regarding multiple interpreters, there are some implicit assumptions that I don’t think are correct:

  • That free-threading is a world of pain and that multiple interpreters somehow avoid this. The isolation from multiple interpreters is only on the Python side, but to the extent that there are hard-to-debug thread-safety bugs, they are pretty much exclusively on the C/C++ side of things. Multiple interpreters do not isolate C/C++ code.
  • That nogil is going to break the world, when there is an implementation that works with a large number of extensions (NumPy, SciPy, scikit-learn, PyTorch, pybind11, Cython to name a few) and extension developers have repeatedly expressed interest in supporting it. This is in contrast to multiple interpreters, which requires more work for extension developers to support. For example, it took only a few lines of code and less than a day to get PyTorch working with nogil-3.9. It would be a very large undertaking (weeks or months) to get it working with multiple interpreters. This is also in contrast to the 3.11 release; it’s taken over a month of engineering work for PyTorch to support 3.11.

35 Likes

I actually do have a question about benchmarking. @ambv, I don’t understand what’s going on with the various async* benchmarks, which you measured to be much faster. Maybe as a side effect of not having a GIL, the event loop got faster at task switching, possibly because I/O selectors no longer have to release and re-acquire the GIL?

Thanks, I’ve read these sections now. As I understand it, there is a per-object mutex (ob_mutex) that is taken by the Py_BEGIN_CRITICAL_SECTION/Py_END_CRITICAL_SECTION API.

Grepping for Py_BEGIN_CRITICAL_SECTION in the nogil-py312 branch, I see that it is used in various collections (lru_cache, stringio, dictobject, listobject, setobject) and in two bytecode instructions, STORE_ATTR_WITH_HINT and CALL_NO_KW_LIST_APPEND. I also see that some instructions use _PyMutex_lock(&_PyRuntime.mutex).

But other (specialized) instructions, like STORE_ATTR_SLOT and LOAD_ATTR_SLOT, seem to do a direct unprotected memory read/write, as far as I can tell. Is there something protecting them?

Experiment

I’ve tried this experiment on nogil-py312, and it didn’t crash. I only spent a few minutes on it though, and my laptop only has 2 cores (4 with hyperthreading).

import threading

class Obj:
    # __slots__ so attribute access goes through the *_ATTR_SLOT specializations
    __slots__ = ('attr',)
    def __init__(self):
        self.attr = None

o = Obj()
X = object()

def writer():
    # Hammer the slot with stores from multiple threads.
    x = X
    while True:
        o.attr = x

def reader():
    # Concurrently read the slot; a torn or garbage read would trip the assert
    # (or crash the interpreter).
    x = X
    while True:
        assert o.attr is x

for i in range(2):
    threading.Thread(target=writer).start()
for i in range(1):
    threading.Thread(target=reader).start()
reader()

“Multicore Python” sounds much better than nogil, a big +1 from me

2 Likes

As the benchmarking guy, I’ll respond to just the first part and leave the rest for others.

We determined the base commit automatically using git merge-base upstream/main 1d39009, which does indeed pre-date immortal objects, but is the cleanest way to isolate just the nogil changes. I’ll kick off a run of 018be4c on our hardware, and then we should be able to generate comparisons against it.

EDIT: 018be4c isn’t on any branch in upstream CPython. Did you mean: gh-84436: Implement Immortal Objects (gh-19474) · python/cpython@ea2c001 (github.com)? If so, we already have benchmarking results for that here: benchmarking-public/results/bm-20230422-3.12.0a7±ea2c001 at main · faster-cpython/benchmarking-public · GitHub.

I agree benchmarking a later commit would be helpful here. Is there one that’s rebased on top of the upstream immortal objects changes? That would be ideal in terms of removing variables in the data, but no worries if that’s a major task as I assume it might be.

Unfortunately, I’m unable to find d595911, though. It’s a 404 on your fork on Github. Maybe it’s pushed somewhere else?

colesbury/nogil is the original 3.9 implementation. The 3.12 rebase is in colesbury/nogil-3.12. There, you don’t get a 404 :slight_smile:

1 Like

Unfortunately, d595911 segfaults when running pyperf system tune, so we’ll need to resolve that before any results would be comparable against our other results. I filed a bug: psutil segfaults on import (when running pyperf system tune) · Issue #3 · colesbury/nogil-3.12 (github.com).

Playing a bit of devil’s advocate here…

True in a direct sense, but if your state is per-interpreter then you still get to rely on the GIL as an implicit lock to prevent stomping on yourself. So they do, in a way, isolate code if by “isolate” you mean “isolate from some threading issues” (if you mean “isolate their global state”, then you’re right that sub-interpreters don’t cause that automatically).
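
A Python-level analogue of relying on the GIL as an implicit lock (the C-extension situation is the same idea, with per-interpreter state standing in for the shared list):

import threading

items = []

def worker():
    # Under the GIL, list.append is effectively atomic, so no explicit lock is
    # needed; nogil preserves this particular guarantee via per-object locks,
    # which is part of the overhead being discussed.
    for _ in range(100_000):
        items.append(1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(items))    # 400000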

Some “extension developers have repeatedly expressed interest”. We always have to be careful about not overstating the level of support for any of this. There are plenty of folks who have never spoken up about either proposal. Plus, getting buy-in from the big, heavily used extension modules doesn’t mean smaller projects managed by a single person in their spare time feel the same way.

But you work on PyTorch, correct? So would you say you know the code base and how it operates? If that’s true, then you are in a unique position to make PyTorch work with the thing you wrote the PEP on. :wink:

I will admit that right now my mind is on what the best transition plan we can come up with is, and whether the cost of that transition is worth what we get for it. I don’t have an answer, but if I had to predict what I will be thinking about the most around this PEP, that’s going to be it.

2 Likes

First, a quick thanks to folks testing out the no-GIL work and letting us know about their experience with it! Real-world examples like this are really useful.

Second, Cheuk Ting Ho’s talk on testing out the no-GIL work on the scientific stack from PyCon US is now up:

2 Likes

This talk reported benchmark results on programs that don’t use threads. (Removing the GIL doesn’t benefit programs that don’t use threads.)

1 Like

Sam’s point was contextualising the amount of work for nogil (less than a day) relative to getting PyTorch to work with multiple interpreters (unknown, estimated at least weeks) and to work with 3.11 (over a month), presumably all done by people who know how the PyTorch codebase operates. This seems like a useful datapoint to me! Is your question whether if someone other than Sam had worked on nogil support for PyTorch, it would have taken weeks instead of less than a day?

8 Likes

I’ve posted updated benchmarks with the newer commit Sam suggested against a few other meaningful upstream points here: Compare a matrix of nogil to other upstreams · Issue #597 · faster-cpython/ideas (github.com)

vs.            pyperf compare_to    distribution plots
immortal       5% slower            2% slower
gc             14% slower           12% slower
immortal-gc    10% slower           8% slower
merge-base     9% slower            7% slower

EDIT 2023-06-07T21:00: Given feedback below, I added a comparison against immortal objects + gc changes, and am showing both pyperf compare_to and distribution plots. (pyperf compare_to takes a mean of the means of each benchmark; the plots take the distribution of each benchmark into account in the overall mean.)

1 Like

I’m confused. Why wouldn’t the benchmark we are looking for be: (immortal + gc threshold change) vs nogil?

3.12 will have a 4% immortalization overhead (per its PEP), so without running another benchmark, you could say that the nogil overhead is 9-10% (5% versus immortal, and another 5% if you take away the gc optimization). Am I understanding this correctly?

What is an acceptable per-thread overhead in order to effectively support multicore?

I’m also confused. Maybe I am reading these figures wrong, but from the results for “ALL” figures in the link, shouldn’t the results be 2.2%, 6.7%, and 12% respectively?