PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

Nope, most of my work is single-threaded, and single-process. A lot of Python code runs just fine as a single-process application with no threads or other concurrency. Writing command line utilities in Python is a clear example of a case where performance drops are often problematic (people want their command line tools to be fast).

I don’t know how to build an application using free threading. What would that mean? Would I need to do my own locking (something I can currently mostly ignore because of the GIL)? Would I have to think about race conditions myself? I’ll absolutely agree that because Python has the GIL, no-one has really been thinking about effective, user-friendly support for parallelism in Python until now. I’d love it if nogil gave us that. But right now, everyone is focusing on existing threading models - and if all we get from nogil is the ability to use current thread models with CPU-bound Python code, then people (like myself) who don’t already have problems because of the GIL will continue to struggle to see the attraction.

It would be really nice if someone could give the case for nogil in terms of the sorts of opportunities it would give for new programming techniques, or improvements that would benefit code that nobody currently thinks of as “needing threads”. For example, if map() could transparently parallelise the calculation, that would be a genuinely interesting benefit. But I don’t know if that’s reasonable to expect, or if the conditions needed to ensure thread-safety for such an operation would prevent it being transparent.

Agreed. It would be a shame if funding (and hence Sam’s involvement) dried up because we (or rather the SC) are still trying to work out how to tackle the transition and logistics issues. But conversely, it’s rather sad if the funding/commitment is only for doing the technical work, and doesn’t cover working on ecosystem issues like transition (or at least factor in the time it will take for others to handle those issues).

But we only have limited information here, and I don’t want to second-guess the constraints. I just hope the project doesn’t fail because technical and ecosystem work can’t be co-ordinated somehow.

Are you saying that if a package is pure Python, then it won’t be impacted by nogil? That’s great if it’s true, and I don’t have the technical understanding of the nogil work to know for myself, but isn’t it necessary for any code that might get called in a free-threaded application to at least be aware of the implications? Suppose I had a library that contained this function:

def sum_list(l):
    total = 0
    for n in l:
        total = total + n
    return total

who ensures (whether by auditing or explicit locking) that the input list isn’t being mutated while the function code is running? If sum_list can’t make that assumption, doesn’t it need a lock? Or am I misunderstanding nogil totally?

But that will result in a split ecosystem. People using python-with-nogil won’t be able to use certain packages, which haven’t been upgraded to be nogil-aware. That sounds very like the Python 2-3 transition, where you could find that some packages you rely on only supported Python 3, while others only supported Python 2. That’s precisely the sort of ecosystem impact that I imagine the SC is concerned about.

5 Likes

I personally don’t think that’s a realistic ask of the Python development team. Maybe for a set amount of time, but not forever.

You could get that with concurrent.futures.Executor.map() using a ThreadPoolExecutor, but I don’t know what level of guarantees there are about race conditions (e.g., reading from a dict and then writing to it again for the same key across two threads; can one write blindly overwrite the other?).
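A minimal sketch of that approach, assuming the mapped function is pure (it touches no shared state). The names `square` and `parallel_map` are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    """A pure function: it reads and writes no shared state."""
    return n * n


def parallel_map(func, items, max_workers=4):
    """Apply func to every item on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


results = parallel_map(square, range(5))
print(results)  # [0, 1, 4, 9, 16]
```

With the GIL, CPU-bound work like this still runs one thread at a time; the code itself would not need to change to run on a free-threaded build.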

The GIL does not prevent modification to the list l. If your Python code did not need a lock with the GIL, it won’t need a lock for nogil. Of course, if your Python code needed a lock even with the GIL, it will still need a lock with nogil.
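To make that concrete, here is a sketch (not taken from the nogil branch) of how an application could guard the list if it does mutate it from another thread. The names `shared`, `shared_lock`, `append_values` and `safe_sum` are hypothetical; `sum_list` is the function quoted above, unchanged:

```python
import threading


def sum_list(l):
    total = 0
    for n in l:
        total = total + n
    return total


shared = list(range(100))
shared_lock = threading.Lock()  # hypothetical: owned by the application, not the library


def append_values(values):
    """Writer: every mutation of the shared list happens under the lock."""
    for v in values:
        with shared_lock:
            shared.append(v)


def safe_sum():
    """Reader: hold the same lock so the list cannot change mid-iteration."""
    with shared_lock:
        return sum_list(shared)


writer = threading.Thread(target=append_values, args=([1, 2, 3],))
writer.start()
partial = safe_sum()  # a consistent snapshot, taken before, during or after the writes
writer.join()
final = safe_sum()    # all writes are visible once the writer has finished
print(final)
```

The library function stays lock-free here; the caller that shares the list across threads owns the locking, which is the same division of responsibility as with the GIL.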

If multiple threads write to the same key of the same dict, then one thread will overwrite the value written by the other thread. This is true both with the GIL and in the nogil implementation.
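A small sketch of the read-then-write case asked about earlier, assuming a shared dict used as a counter (`counts`, `counts_lock` and `record_hits` are illustrative names). The lock makes the read-modify-write a single step; it is needed with or without the GIL:

```python
import threading

counts = {"hits": 0}
counts_lock = threading.Lock()


def record_hits(n):
    for _ in range(n):
        # Read-modify-write on a shared key. Without the lock, two threads
        # can read the same old value and one of the increments is lost.
        with counts_lock:
            counts["hits"] = counts["hits"] + 1


threads = [threading.Thread(target=record_hits, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counts["hits"])  # 40000
```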

10 Likes

Reading the PEP in detail but not having looked at the existing implementation branches at all: PEP-703 is very well written. ie: You pre-emptively answered a bunch of questions later on within it that I was going to have as I read it through, nice!

Questions I have that remain to be answered (or whose answers I missed):

  1. How would we plan to get this large of a change in? [assumption: it’s not a single PR]
    You’ve broken PEP-703 down into many nice sections in the “Overview of CPython changes”. It’d be nice to have an explicit estimate of which of these you predict to be one PR or multiple PRs and any necessary ordering of those and their likely sizes (individual reviewability). I think a lot of that can be guessed at from the doc, I’m just asking for a suggested PR plan.

  2. Have you proposed the desired mimalloc changes to the upstream project and gotten feedback from them on their viability of landing there if CPython were to need them? Link to that issue on github if so. And if not, shouldn’t we at least pre-emptively ask?

  3. It’d be nice to cover scaling limitations with some data to back it up in the PEP.
    We don’t have real world Python code expecting to scale up by threading today, so we’ll need to adapt or manufacture some benchmarks here. It is important to understand which kinds of application designs can expect what magnitude of ability to meaningfully utilize additional cores. Ie: run parallel benchmarks with 1, 2, 3, 4, 8, 16, 24, 32, +onwards threads (on hardware with that many actual cores).
    Examples could include: Sets of entirely independent computation, parallel computation consuming the same data structure, parallel computation feeding back into the same data structure. These may be considered microbenchmarks - but are still informative as they draw a picture of the limits of how far the parallelism can go in special circumstances. (Assumption: real applications likely tend to fit in the middle)
    Existing code that might be adaptable into such a benchmark would be things using concurrent.futures, both existing ones using threads and ones using processes. What happens to throughput or latency on those if they’re all switched to using --disable-gil threads instead of whatever they use today?

  4. (forward looking:) What do Faster CPython folks think about bytecode specialization limitations or ways to undo that limitation?
    This seems to currently be an implementation detail that I expect will be fleshed out later and isn’t critical to answer up front (ie: not having an answer to this doesn’t block anything in my mind). I just want people working in these areas to actually ponder the thread safety needs of specialization/jit/whatnot as they work on the 3.13+ improvements.
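For the scaling question in point 3, a starting point could look like the sketch below. It covers only the "entirely independent computation" case; `cpu_task`, `time_with_threads` and the workload sizes are arbitrary choices for illustration, not a proposed benchmark suite:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_task(n):
    """Independent, CPU-bound work with no shared data."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_with_threads(num_threads, tasks=8, work=50_000):
    """Run `tasks` copies of cpu_task on `num_threads` threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(cpu_task, [work] * tasks))
    elapsed = time.perf_counter() - start
    return elapsed, results


for num_threads in (1, 2, 4, 8):
    elapsed, _ = time_with_threads(num_threads)
    print(f"{num_threads:>2} threads: {elapsed:.3f}s")
```

On a build with the GIL the timings should stay roughly flat as threads are added, since only one thread runs Python bytecode at a time; how far they drop on a --disable-gil build is what such a benchmark would measure.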

4 Likes

See I knew I’d regret wading in over my head :joy:

I guess I also don’t understand the internals of CPython or the GIL well enough to understand why this would be a universal concern. If your package is pure Python and doesn’t already use multithreading, what is the failure mode? nogil isn’t going to introduce multithreading to a single-threaded program.

It is weird to me that this discussion hasn’t come up in this thread yet (H Vetinari did post about it in the previous one). If the only realistic path forward for the PEP is “the GIL is removed in 3.X” it seems like that should be written in the PEP that way? If nothing else, it feels like that would start to crystallize a real transition plan.

I know I come from a particular corner [1] of a Python ecosystem that contains other people doing totally different things. But in my corner the GIL is probably Python’s biggest weakness, and there is huge potential to improve performance and simplify a lot of code. It feels weird to see the discussion centered on performance deltas when the bigger discussion is how to make the transition as easy as possible for as many people as possible.


  1. scientific python

2 Likes

I’m happy to be proven wrong, but my point is: removing such a long-standing assumption will very likely encounter breakage in unexpected places, and a-priori, not even pure python projects are safe from that.

From uncaught threading bugs in the dependencies of a pure python project, to being prepared to deal with the effects of some calls (e.g. io/network) suddenly running on several threads, to implicit assumptions in the code being invalidated (especially the assumption “I can be called at most once at the same time”, which is easy to break in a multi-core python world[1]; but there’s many many more).

To be clear, I think it’ll be both possible and ultimately worth it to fix these things, but it will demand a lot of work from the maintainers of essentially all packages, especially in the face of highly impatient users who can’t wait for the performance boost that nogil promises. And that’s just the baseline “we want nogil in any form” case. Add on top of that testing / publishing / supporting / debugging two variants for packages involving compilation (and the multitude of possible interactions that come from that[2]), and that is why I think caution is not misplaced, and that a solid plan is needed for the transition.

Without taking away from Sam’s achievements and time investment, accepting PEP703 means externalizing orders of magnitude more effort on (mostly volunteer!) maintainers – and more effort still if the transition is not thought out well –, to achieve not just nogil-CPython but a nogil-ecosystem.


  1. whose bug is it then: the caller’s or the callee’s? Fun discussions ensue…

  2. If we do get to the point that gil & nogil extensions can run in the same interpreter, then I foresee lots of bugs that will boil down to: “if you have packages A, B, C nogil and packages D, E, F with gil, then the following thing goes wrong”, with the added difficulty that failures may occur only on some platforms or be otherwise difficult to reproduce for maintainers, making the whole thing extra-painful to debug.

2 Likes

After reading the entire thread, I felt compelled to share my thoughts.

I am the author of what I believe to be the most used financial backtesting package (universities, banks, quantum trading companies, energy trading companies, you name it). Even if I no longer have the time to maintain it.

It does also support using it for live-trading.

Choosing Python was a no-brainer because nobody would have used it had I written it in C++.

It supports optimization by using multiprocessing, which was (and is) an obvious bottleneck, mostly because of how data has to be passed back and forth. That puts such a strain on the machine that it becomes unusable after a certain point, and it leads to implementing many workarounds to reduce the amount of data that moves across processes.

And what were many users complaining about? Performance during the optimization

With no-gil and threading … child’s play to implement.

On the other hand:

  • Is it going to add complexity to CPython for years to come? Yes. I guess any new feature does add complexity.

  • Is the performance going to suffer (a modest amount) in some cases? Yes. And at the same time many other use cases are going to greatly profit. Furthermore, there are efforts on all fronts to make things faster, so the current “modest amount” which will be lost may be recovered.

  • Will it take years to make it bug-free (as in the HotSpot JVM example)? Maybe. Is it a good reason not to do it? Not really. You would stop any new initiative in such a case.

  • Can users break many multithreading use cases with race conditions and many other scenarios? For sure. Users tend to break things and they already do in many other scenarios which don’t involve multithreading.

Over the years I have supported Python in discussions, because of the productivity, even if the “everything is a dict and therefore a lookup” (oversimplification) meant that it would never match the performance of badly programmed C++, but I have also acknowledged that the greatest problem was the lack of real multi-threading.

Best regards

10 Likes

For the purposes of this discussion, let’s categorize parallel applications into three groups:

  1. Data-store backed processing. All the shared data is stored in a data store, such as a Postgres database, and the processes share little or no data, communicating via the data store.
  2. Numerical processing (including machine learning) where the shared data is matrices of numbers.
  3. General data processing where the shared data is in the form of an in-memory object graph.

I maintain an application in the domain of financial market risk management. It is medium-sized, both in the code base (~150 KLoC) and team (~2-5 during active development).

It falls into category (3), the object graph mainly consists of financial transactions and financial instruments and prices. While it’s easy to partition the transactions per core, the instruments and prices are hard to partition and cause either a duplication of memory or considerable overhead, or both.
For a given budget, the number of cores obtainable grew much quicker than the memory per core, so this application cannot scale unless global data can be shared amongst cores – at least read-only.

So I believe this type of application would profit from nogil, and would not profit from subinterpreters as they stand now. In a nutshell, it requires sharing memory between cores.
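A toy sketch of that pattern: partition the transactions per worker and let every thread read one shared, read-only price table instead of a per-process copy. `PRICES`, `TRANSACTIONS` and the helper names are hypothetical stand-ins, far smaller than a real object graph:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the hard-to-partition instruments/prices data.
# It is only read after start-up, so threads can share it without a lock.
PRICES = {"AAA": 10.0, "BBB": 20.0, "CCC": 5.0}

# Transactions are easy to partition: (instrument, quantity) pairs.
TRANSACTIONS = [("AAA", 3), ("BBB", 1), ("CCC", 10), ("AAA", 2)]


def value_partition(partition):
    """Value one slice of transactions against the shared price table."""
    return sum(PRICES[instrument] * qty for instrument, qty in partition)


def partition_list(items, n):
    """Split items into n interleaved slices."""
    return [items[i::n] for i in range(n)]


def total_value(transactions, workers=2):
    parts = partition_list(transactions, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(value_partition, parts))


print(total_value(TRANSACTIONS))  # 120.0
```

With multiprocessing, each worker would need its own copy of `PRICES` (or a way to ship it across); with threads there is exactly one copy in memory.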

I would like to thank the people behind both efforts – much appreciated! I just wanted to give the perspective of a category 3 medium-sized application.

Best regards, Martin

5 Likes

This is very much my position. As a package maintainer, if I state “this package hasn’t been checked for safety in a threaded context, use at your own risk” is that sufficient? If I’m not willing to accept patches to add thread safety because they add complexity I don’t want to take on, is that ok?

Yes, that’s the default position today. But my impression is that with nogil, people will be able to use threading more and the expectation is that they will do so. So I think it’s fair to ask if those sorts of assumptions will remain valid.

To be clear, I’d love it if nogil did make it easier to benefit from multiple cores (where by “easier” I mean without having to manage executor pools explicitly). But there would be trade-offs, and we need to understand what they are.

Incidentally the fact that I don’t know if nogil gives the sorts of improvements I mention in the last paragraph suggests we still haven’t fully understood the advantages of nogil. This thread has been rather negative, focusing on downsides - maybe there’s scope for some more positive news beyond the basic “if the GIL is currently a problem for you, nogil will fix that”?

4 Likes

If I understand the GC state transition correctly, can the same mechanism not be used to effectively reinstate the GIL on every new frame, unless that frame is explicitly marked as nogil-safe? There could be a nogil def keyword or @nogil decorator or any number of things.

PS: As pointed out below, the GIL is a CPython concept, so the name would have to be something other than nogil to indicate thread-safety.

For what it’s worth I have an “application” that would definitely be interested in the nogil work. It’s a bit of an odd case, though, so I’m not sure how relevant it is. But I’ll add it here just in case it’s of interest.

My use case is an ongoing project of mine to try to demonstrate that having a high-level language can often be a better choice than a lower-level one, simply because higher-level languages make it easier to use more powerful abstractions, which might be impractical in lower-level languages. As a result, runtime can be faster even though in an absolute sense, the lower level language has better performance.

In this particular context, I’m looking at Monte Carlo simulation of games of chance. So the basic workload is to generate a few millions of random game states, calculate a “score”, and aggregate the results. Performance is crucial here, because I’m trying to compete with a (more or less) hand-coded C++ program. At the moment Python sucks in that comparison, because there’s no way Python can compete with C++ on raw calculation speed. But my big advantage would be if I could run the “calculate a score” step across multiple CPUs, which gives me a significant speed boost over the single-threaded C++ code. The C++ developer has already said that rewriting his code to use multiple threads would be impractical, and he doesn’t need to because the single-threaded code is fast enough. For me, though, a factor-of-n improvement through parallel execution could allow me to be competitive in this (highly biased against Python) comparison.

The key here is to be able to spread the workload across multiple cores, without additional complexity. Threads would be great for this if the GIL didn’t prevent Python code from running concurrently. However, the “calculate a score” code is essentially an arbitrary, user-supplied calculation, and as such requiring that code to be thread-safe or re-entrant isn’t acceptable (it violates the requirement to demonstrate that high-level languages give easy access to advanced constructs, as well as impacting the performance of code that’s right in the hot loop of the program).
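As a rough sketch of the workload shape (not the actual project code): each batch gets its own private random generator, so the user-supplied scoring function never touches shared state. `score`, `simulate_batch` and `estimate` are made-up names, and the dice game is a placeholder:

```python
import random
from concurrent.futures import ThreadPoolExecutor


def score(state):
    """Placeholder for the user-supplied calculation: 1 if two dice sum to 7."""
    return 1 if sum(state) == 7 else 0


def simulate_batch(seed, n_games):
    """Play n_games with a private generator; nothing is shared between batches."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_games):
        state = (rng.randint(1, 6), rng.randint(1, 6))
        hits += score(state)
    return hits


def estimate(n_batches=4, games_per_batch=10_000):
    """Run the batches on a thread pool and aggregate the results."""
    with ThreadPoolExecutor(max_workers=n_batches) as pool:
        hits = sum(pool.map(simulate_batch,
                            range(n_batches),
                            [games_per_batch] * n_batches))
    return hits / (n_batches * games_per_batch)


print(estimate())  # close to 1/6
```

This only stays safe as long as the scoring code really is free of shared mutable state, which is exactly the guarantee that cannot be assumed for arbitrary user code.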

I believe that this type of application would benefit from subinterpreters because the isolation properties of subinterpreters handle the need for safely running arbitrary code. I think there’s still some work to add appropriate co-ordination strategies for subinterpreters, but that’s a relatively straightforward matter of API design.

I believe that nogil won’t help directly with this problem, because in its raw form, it exposes the co-ordination issues with free-threading to the user. But I believe it offers the opportunity to build additional, safe and high-level abstractions for concurrency which would help - at least as much as the subinterpreter model, and quite possibly even more.

If there’s a follow-up plan with nogil to develop high-level concurrency abstractions, then IMO it will be a huge step forward for Python (and likely worth some level of performance cost in the short term). But given that we’re currently struggling to even work out how the transition to a “multi core by default” Python ecosystem will work, I fear that it’s premature to base my expectations on something like that happening any time soon. So for the short to medium term, I see subinterpreters as an important practical solution for me, with nogil being something that would be a much longer term benefit (for my use case - the fact that it’s an immediate benefit to other people is a relevant but separate point).

4 Likes

I don’t think that would work, the GIL is an implementation detail of CPython and should definitely not be codified in the language spec (PyPy doesn’t have a GIL right?)

Fair point, I edited my comment accordingly. The name could be anything that marks a frame as safe to execute in parallel.

1 Like

Indeed that could potentially be useful, and is in line with @pf_moore’s comment that this PEP (or a companion PEP) would be even stronger if it also included high-level, simple-to-use language support for multi-core usage.

1 Like

It doesn’t feel reasonable to include that in this PEP [1] but I would be shocked if libraries like this don’t appear as soon as they are useful. I suppose this is already baked-in to my impression of the proposal–it’s primarily about unlocking the potential for new packages, rather than an immediate speed boost to the core.

My preference would be to let third-party developers build and share useful tools as needed. If the community consolidates around a specific tool, there could be a PEP to add it to stdlib. This is my impression of how other packages have been added but I don’t know all the history and details.

Just as an aside: people definitely want CLI tools to be fast, and they really love when they are multicore :wink: (pigz, ripgrep, and gsutil -m spring to mind)


  1. several people have suggested splitting it up already

2 Likes

Agreed

Maybe. It’s a chicken and egg question, though - it’s very difficult to say that nogil will be useful to me without knowing what new approaches it will enable.

It’s like asyncio. That was all about highly parallel network services when it was first proposed. Who would have expected that it would result in something like textual? But if asyncio needed the support of console UI developers to get accepted, would it have happened? The main difference here is that nogil has a much more perceptible cost on non-multicore users than asyncio did on people not writing network services.

I’m not trying to make a case for or against nogil. All I want to do is expose the bigger questions, so that we don’t end up basing its fate on issues like “Is a 5% performance hit to enable something that you don’t use but which might have long-term benefits for you acceptable or not? What if it were 4%?”

3 Likes

Out of curiosity, is there an estimate for the performance change of 3.12 → 3.13 ? I understand this is quite a ways off, but it does seem relevant. If 3.13 would be X% faster and is instead only Y% faster with nogil [1], it might not be perceived as a cost by most users.


  1. And I know there is a lot of hairy complexity here in terms of whether the planned 3.13 speedups work in a nogil version

3 Likes

Just gonna throw this out there…

Way back in (I think) the 1.4 timeframe Greg Stein (an eShop cofounder) floated a GIL-free Python. The discussion then largely centered on the performance hit for single-threaded code. Had the bullet been bitten then, I wonder how much further along we’d be today.

Point being that maybe the discussion shouldn’t center so much on (single-digit percentage?) performance hits and more on the perceived/expected benefits. Get past micro-benchmarks. The smart folks working on Python’s internals will more than make up for it in fairly short order.

32 Likes

I was actually thinking about ways to make nogil something that code needs to explicitly opt into at the function level. That way, any code that is not explicitly marked as being thread-safe will continue to enjoy the carefree life of running under the GIL, while functions where someone has explicitly thought about thread-safety can run without the GIL. If I understand the GC mechanism correctly, any thread-unsafe function encountered anywhere could halt free-threading immediately.
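Purely to illustrate the opt-in idea at the library level: the sketch below emulates it with an ordinary lock. Nothing here exists in CPython or in PEP 703; `thread_safe`, `guarded` and `_fallback_lock` are invented names, and a real mechanism would live in the interpreter rather than in a decorator. It also only serializes unmarked functions against each other; it does not halt threads that are running marked code.

```python
import functools
import threading

# Hypothetical: one process-wide lock standing in for "the GIL" for un-audited code.
_fallback_lock = threading.RLock()


def thread_safe(func):
    """Hypothetical marker: the author asserts func is safe to run in parallel."""
    func.__thread_safe__ = True
    return func


def guarded(func):
    """Return func unchanged if it is marked; otherwise serialize calls to it."""
    if getattr(func, "__thread_safe__", False):
        return func

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _fallback_lock:
            return func(*args, **kwargs)

    return wrapper


@thread_safe
def pure_square(n):
    return n * n


def legacy_append(log, item):
    """Un-audited code: mutates its argument."""
    log.append(item)
    return len(log)


fast = guarded(pure_square)    # marked: runs without the fallback lock
slow = guarded(legacy_append)  # unmarked: runs one call at a time

print(fast(3), slow([], "x"))  # 9 1
```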

(Just to add to the above voices, I would be very excited about running in nogil mode. I do scientific computing as well, and often have to use Cython for nothing else but to nogil-parallelise some low-level code that I know is safe.)

I believe that such high level libraries will be very quickly developed by third-parties once it becomes possible to do so.

There are many non-dangerous and non-complicated approaches to thread-based parallelism, according to my experience and understanding.

Here’s my view of the kinds of abstractions we could expect very quickly, FWIW.

Maybe the simplest example is parallelizing for loops. C and C++ have a high-level library, “OpenMP”, that does that. In a language like C, with block scope, all of the variables declared inside of the block are local to the thread. I believe one of the numpy or scipy libraries contains such a parallel loop already, but there may be restrictions on what can be inside of the loop (or on what gets parallelized inside of such a loop). A parallel-for in pure Python would enable parallelizing many things simply and with little danger.
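A pure-Python parallel-for could be as small as the sketch below. `parallel_for` is a hypothetical helper built on the standard thread pool, not an existing library function; it assumes each iteration only writes to its own slot:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(body, start, stop, workers=4):
    """Run body(i) for every i in range(start, stop), split into contiguous chunks."""

    def run_chunk(lo, hi):
        for i in range(lo, hi):
            body(i)

    n = stop - start
    step = max(1, -(-n // workers))  # ceiling division, at least 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_chunk, lo, min(lo + step, stop))
                   for lo in range(start, stop, step)]
        for f in futures:
            f.result()  # re-raise any exception from the loop body


squares = [0] * 10


def body(i):
    squares[i] = i * i  # each iteration writes only its own index


parallel_for(body, 0, len(squares))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Variables assigned inside `body` are local to each call, which gives roughly the "block-local" behaviour described for C.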

Another example is using atomics and lock-free data structures, like thread-safe queues, to prevent data races. Erlang has been mentioned in this thread. Its message-passing approach is an example, and Python has support for that model in the standard threading library, with event objects: call set() when adding (atomically) an item to the queue and clear() after consuming (atomically) all or a quota of queue items.
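A minimal message-passing sketch. Note that it uses the standard library's `queue.Queue` rather than the event-object scheme described above, because the queue does its own locking and blocking; `producer`, `consumer` and the `None` sentinel are illustrative choices:

```python
import queue
import threading


def producer(q, items):
    """Send each item as a message, then a sentinel to signal the end."""
    for item in items:
        q.put(item)
    q.put(None)


def consumer(q, results):
    """Receive messages until the sentinel arrives; only this thread touches results."""
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 2)


q = queue.Queue()
results = []
t_prod = threading.Thread(target=producer, args=(q, [1, 2, 3]))
t_cons = threading.Thread(target=consumer, args=(q, results))
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()

print(results)  # [2, 4, 6]
```

The threads share nothing except the queue, so neither function needs a lock of its own.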

Then there is the software transactional memory approach being used by an alternative PyPy implementation, that rolls-back commits of transactions when it detects a conflict, similar to ACID databases.

When locks are absolutely necessary, the way to prevent deadlock (i.e., the solution to the dining philosophers problem) is to always acquire and release the locks in the same order. An easy abstraction is to have a (e.g., singleton) class that is instantiated with a list of all of the locks that need to be managed. Then proceed in something like the following way: tasks are registered together with the resources they need to acquire; then call a member function with the name of the task as an argument to acquire the locks and perform the task, where the library writer ensures that the task-running code always acquires and releases the locks in the same order.
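One way that abstraction might look, as a sketch under assumptions: `LockManager` and its methods are invented for illustration, resources are identified by name, and the fixed acquisition order is simply the sorted order of those names:

```python
import threading
from contextlib import ExitStack


class LockManager:
    """Owns all locks and always acquires them in one fixed (sorted) order."""

    def __init__(self, resource_names):
        self._locks = {name: threading.Lock() for name in resource_names}
        self._tasks = {}

    def register(self, task_name, func, resources):
        # Sorting here is what guarantees a single global acquisition order.
        self._tasks[task_name] = (func, sorted(resources))

    def run(self, task_name):
        func, resources = self._tasks[task_name]
        with ExitStack() as stack:
            for name in resources:
                stack.enter_context(self._locks[name])
            return func()  # locks are released in reverse order on exit


balances = {"a": 100, "b": 100}


def a_to_b():
    balances["a"] -= 10
    balances["b"] += 10


def b_to_a():
    balances["b"] -= 5
    balances["a"] += 5


manager = LockManager(["a", "b"])
manager.register("a_to_b", a_to_b, ["a", "b"])
manager.register("b_to_a", b_to_a, ["b", "a"])  # requested in the opposite order


def repeat(task_name, times):
    for _ in range(times):
        manager.run(task_name)


threads = [threading.Thread(target=repeat, args=("a_to_b", 10)),
           threading.Thread(target=repeat, args=("b_to_a", 10))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balances)  # {'a': 50, 'b': 150}
```

The second task asks for its locks as `["b", "a"]`, the classic deadlock setup, but the manager reorders them, so both tasks contend on `"a"` first.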

Simple approaches like these probably cover the vast majority of problems that people need to solve with thread-based parallelism, and it shouldn’t be too difficult for library writers to provide these, and I believe we could expect that they would come up with more and more user-friendly and abstract approaches over time.