A fast, free threading Python

‘free threading’: Searching the web, I only found articles talking about ‘free threads’ or ‘COM free threading’ in relation to something like ‘Multi-Apartment Complexes’. Can someone give me a short explanation of the meaning here, or a reference?

Would it make any sense to consider an interpreter and a subinterpreter, one with the GIL and the other free-threaded (either way round)? Or does needing different C structures make this nonsensical?

Or, if there were separate python.exe and python-nogil.exe binaries, one could use separate but linked processes, instead of needing 2 × N processes to keep N cores busy. subprocess could do this today; multiprocessing would need a new ‘nogil’ start option to launch the second process with the alternate binary.
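A minimal sketch of that idea, assuming a hypothetical `python-nogil` binary installed next to the regular interpreter (the binary name and `worker.py` are illustrative only, not anything that exists today):

```python
import subprocess

# Hypothetical: a separate free-threaded binary installed alongside the
# regular interpreter. The name "python-nogil" is illustrative only.
NOGIL_PYTHON = "python-nogil"

def run_cpu_bound_worker(script, *args):
    """Run a CPU-bound worker script under the free-threaded interpreter,
    while this (GIL) process keeps handling everything else."""
    return subprocess.Popen(
        [NOGIL_PYTHON, script, *args],
        stdout=subprocess.PIPE,
    )

# The GIL process delegates the parallel part to one linked process,
# instead of spawning N worker processes to use N cores.
proc = run_cpu_bound_worker("worker.py", "--threads", "8")
print(proc.stdout.read().decode())
```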

3 Likes

We are hearing many opinions that would sacrifice single-threading performance for multi-threading support (i.e. push for nogil/PEP 703).

On the other hand, what are the applications that would sacrifice multi-threading support for an increase in single-threading performance (i.e. prioritise the ongoing effort over nogil/PEP 703)?

1 Like

I’d start reading here:

In response to @itamaro’s question about funding: Microsoft’s support for the Faster CPython team is unwavering and our charter is not limited to single-threaded performance work. (Although in the current corporate climate I don’t expect our funding to increase.)

A possible scenario would be that no-GIL is merged (in the way PEP 703 proposes, i.e., as a build option that is off by default for a few releases), but certain optimizations are suppressed in the nogil build (or possibly only once multiple threads are active).

This would allow the CPython core dev team (not just the Microsoft folks) to initially focus on two things:

  • Improving single-threaded performance in GIL builds
  • Ensuring correct operation of no-GIL builds, even if it means losing some optimizations

The no-GIL builds will be useful to apps that want to use multiple cores efficiently without switching to an architecture based on multiple processes or subinterpreters. They will also be needed to test 3rd party extension modules in a no-GIL environment.
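To illustrate the difference, here is a small CPU-bound experiment using only the stdlib; on a GIL build the threads serialize on the interpreter lock, while a free-threaded build can run them on separate cores (a sketch, and actual timings will of course vary):

```python
import threading
import time

def count(n):
    # Pure-Python CPU-bound loop; never blocks or releases control.
    while n:
        n -= 1

N = 10_000_000
start = time.perf_counter()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# On a GIL build this takes roughly 4x the single-thread time;
# on a free-threaded build it approaches 1x on a 4-core machine.
print(f"4 threads: {elapsed:.2f}s")
```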

Meanwhile, we can start adapting the specialization and optimization work to a no-GIL world, with the goal of obtaining Mark’s Option 3 (free threading and faster per-thread performance). Ideally we would reach a state where we can make no-GIL the one and only build mode without a drop in single-threaded performance (important for apps that haven’t been re-architected, e.g. apps that currently use multi-processing, or algorithms that are hard to parallelize).

It is this latter step (getting to Option 3) that requires extra resources – for example, it would be great if Meta or another tech company could spare some engineers with established CPython internals experience to help the core dev team with this work.

Finally, I want to re-emphasize that while Microsoft has a team using the Faster CPython moniker, we don’t intend to own CPython performance – we believe in good citizenship and want to contribute in a way that puts our skills and experience to the best possible use for the Python community.

47 Likes

Similar to Faster CPython team, the Cinder team’s charter is also not limited to single-threaded performance. I expect that if PEP 703 lands, threading will become a much more attractive scaling model for many Python applications at Meta, and the Cinder team’s focus will accordingly encompass improving multi-threaded performance. And (as of the last year or so) the approach of the Cinder team has shifted away from maintaining an internal fork and towards working upstream-first as much as possible (with Cinder JIT destined for a third-party extension), so I expect that investment to take the form of collaboration with the core dev team on CPython.

29 Likes

A counterpoint to that: while a program might finish in less wall-clock time using multiple cores, the overhead of locks and thread scheduling makes multithreaded programs less efficient, so they consume more energy per unit of work done.

Going forward I see a world where energy is going to be a more limited resource. Investing in efficiency and thus in single-threaded performance can be a valid approach.

Most bioinformatics applications parallelize trivially. Usually I have several tens or hundreds of files, so it can be advantageous simply to start one Python interpreter per file; the multiprocessing module also works fine for most workloads. Work on multithreaded Python is of no benefit there, and if it breaks C API/ABI compatibility it would actually be a huge step back. Single-threaded performance increases, by contrast, benefit my workloads directly.
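For reference, the embarrassingly parallel pattern described above is roughly the following sketch, where `process_file` stands in for whatever per-file analysis is actually being run:

```python
import multiprocessing
import sys

def process_file(path):
    # Placeholder for the real per-file analysis.
    with open(path, "rb") as f:
        return path, len(f.read())

if __name__ == "__main__":
    files = sys.argv[1:]
    # One worker per core; each file is handled by an independent
    # interpreter process, so the GIL never becomes a bottleneck.
    with multiprocessing.Pool() as pool:
        for path, result in pool.imap_unordered(process_file, files):
            print(path, result)
```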

4 Likes

As a fellow bioinformatician, I would quibble with that [1]. multiprocessing works okay but there are lots of hard edges that need to be learned via experience to make it work well. There is considerable overhead from passing data around, when that’s necessary.

Multithreaded python would be a strict upgrade for me, as it would improve performance and reduce complexity to not need to worry about inter-process communication.

It would also make exploratory analysis faster, if we had higher-level tools like par_map and such. I’m often in a grey area where I have enough data that things take a minute or so to run, but it’s not worth the complexity of figuring out multiprocessing for a one-off analysis (and the process overhead might be too high to make it worthwhile anyway).
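Something like that par_map can already be sketched on top of the stdlib; the point is that on a free-threaded build the same few lines would give a real multi-core speedup for CPU-bound functions, with no pickling of data between processes (`par_map` here is my own sketch, not an existing library API):

```python
from concurrent.futures import ThreadPoolExecutor

def par_map(fn, items, max_workers=None):
    """Apply fn to each item in parallel threads, preserving order.

    On a GIL build this only helps when fn is I/O-bound; on a
    free-threaded build it would also parallelize CPU-bound fn,
    sharing data directly instead of copying between processes."""
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(fn, items))

# Usage: squares = par_map(lambda x: x * x, range(10_000))
```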


  1. I suspect that statements about “most bioinformatics applications” are rarely accurate :wink: ↩︎

8 Likes

Strongly agree. Language-level support for parallelism – along the lines of something like OpenMP in the C/Fortran world – would be ideal, but, I know, complicated, although I see that Mojo has enabled that in a somewhat Python-like (but not pythonic?) environment. So perhaps it’s possible?..

2 Likes

Allow me to post a Python user’s opinion from the field.

When I started using Python, my CPU had only one core and “Python is slow” was the rant of the day. I was happy to write single-threaded code, occasionally venturing into ad-hoc coro-like generators.

CPUs became dual-core and quad-core over time, and, having exposure to (and hopefully some skills in) multithreading from university, I did write multithreaded code. Looking back, however, it was never for performance; rather, it was to structure programs better or to overcome blocking behaviour. At that time, while there were multiple cores, they were also very often occupied by some other program.

Today we finally have so many cores that there’s almost always a free one, and a desktop/laptop CPU is limited by its power envelope, not by the cores × single-core-performance product. And the code I write is mostly async/await.

Frankly, I feel that free-threading will not bring me direct benefits, and the indirect benefits (e.g. pandas parallel operation that calls a Python user function) are nice but rarely crucial.

What I wish for instead is speculative execution of runnable coroutines or tasks, given some preconditions that a line developer can understand (e.g. don’t change module globals, don’t change an object’s class after creation; I/O will be serialised for a single file descriptor).

If the extra, “free” cores could be used for that, I’d be very happy :smiley:

P.S. I suspect that, with some caveats, cooperative multitasking, given its clear annotation of yield points, may be more amenable to optimisation than multithreaded code, where every bytecode instruction may need a guard.

1 Like

Not quite sure I follow - are you asking for coroutines to become threads?

I’m asking for coroutines/tasks to remain semantically as they are, and for Python to execute several of them in parallel, on several cores, if the runtime deems it safe; that is, as long as the output of the program remains indistinguishable from running them sequentially on a single core.

You should be able to do that with a custom async executor, but it feels quite a bit easier with the GIL removed.
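One rough approximation that exists today is to fan CPU-bound pieces out to worker threads from async code; on a GIL build this only helps with blocking I/O, but on a free-threaded build the two calls below could genuinely run on separate cores (a sketch, with `crunch` as a stand-in workload):

```python
import asyncio

def crunch(n):
    # CPU-bound stand-in; on a free-threaded build the two calls
    # below can actually run on separate cores.
    total = 0
    for i in range(n):
        total += i * i
    return total

async def main():
    # Semantically still ordinary awaitables; the runtime farms the
    # CPU-bound parts out to worker threads.
    a, b = await asyncio.gather(
        asyncio.to_thread(crunch, 5_000_000),
        asyncio.to_thread(crunch, 5_000_000),
    )
    print(a, b)

asyncio.run(main())
```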

2 Likes

Ah. That seems tricky to prove; semantically, coroutines switch between themselves ONLY at explicit yield/await points, and proving that there would be no difference if they run in parallel sounds like a job for a programmer, not a runtime. Though that IS what I mean by turning them into threads, since threads definitely can run in true parallel like that (especially if the GIL ends up removed).

I won’t claim to be an expert here, but my two cents is that I have zero interest in free threading.

IMHO, free threading basically only enables bad multithreading practices anyway. I think the current work on multiple subinterpreters is great, and multithreading with subinterpreters is the direction that suits Python better than free threading. The object ownership model naturally enforced by multiple interpreters makes it much easier to write correct and fast multithreaded code.

On the other hand, free threading muddies object ownership, and since free threads have no explicit synchronisation points, it is just asking for multithreaded trouble. Sure, a well-implemented free-threaded Python might squeeze out a little more performance in theory, but people use Python for ease of use, not so much for speed anyway.

In my ideal world of Python concurrency, Python’s multiple interpreters would gain an arena allocator model. By default, each interpreter owns a single default arena, and all objects belong to an arena. When an interpreter is running, it holds its interpreter/arena lock in the same way Python currently holds the GIL, but if an interpreter wants to borrow an object from another interpreter, it must first acquire the other interpreter’s interpreter/arena lock, using a with-block/acquire-release pair. An interpreter can hold references to objects in another arena only as weak references/proxy objects; you can dereference them only while holding the appropriate arena lock, and accessing objects from another arena after the arena lock is released is not permitted. IIUC, subinterpreters already go about halfway in this direction, and biased reference counting has a lot in common with this, minus the explicit weakref mechanism.

An arena locking model would retain the simplicity and speed of the GIL for single-threaded code and native libraries, while providing a relatively easy-to-use multithreading model. The only drawback is that using the arena model wouldn’t be transparent to the code: borrowing objects from another interpreter requires a lot of very explicit code, which, IMO, sounds like a good thing anyway. If Python also acquired syntax sugar for working with foreign/weak references, I think it could feel almost transparent and extremely easy to use.

A further enhancement to this model would be allowing the creation of standalone shared arenas that don’t belong to any single interpreter, which is a bit like shared memory on steroids. A hypothetical sketch follows below.
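To make the proposal concrete, here is roughly what the borrowing side might look like. Every name below (the `arenas` module, `acquire`, the proxy semantics) is hypothetical, invented purely to illustrate the model; nothing like it exists in CPython today:

```python
# Hypothetical API sketch -- nothing below exists in CPython today.
import arenas  # invented module name

shared = arenas.create()  # a standalone arena owned by no interpreter

def producer(data):
    # Acquiring an arena's lock works like a GIL scoped to that arena.
    with shared.acquire():
        shared["results"] = data

def consumer():
    with shared.acquire():
        result = shared["results"]  # direct access only while holding the lock
    # Outside the with-block, references into the arena would be weak
    # proxies; dereferencing them without the lock would raise.
    return result
```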

TL;DR option 1 looks great

7 Likes

It will also enable best-practice multithreading.

That some people will get into trouble writing threaded code is not a reason to give up on this.

It is a great attraction that Python is easy to start with.

But there are many users of Python who want performance and will use free threading when it is available.

8 Likes

Absolutely agreed. However, the fact that it will allow some people to get into trouble, because they weren’t sufficiently familiar with the (notoriously complex) risks involved in threading, should be factored in as a potential downside.

There’s been a lot of talk about free threading enabling new, powerful and effective libraries and frameworks for parallel processing. That’s great - I’m genuinely looking forward to using them[1]. But if a non-expert should be using such higher-level libraries, maybe they need to be developed alongside the low-level functionality? Otherwise, we risk a situation where we are saying “here’s a bunch of really powerful but rather dangerous bits - if you’re not an expert, don’t use them, wait for the experts to package them up in a usable form”. That might be OK, but it’s not immediately clear to me that it will be.

And it’s quite possible that we’ll be doing those people a disservice by giving them just the raw parts and not the higher level libraries.


  1. although I’ll admit I don’t know how such libraries would differ from existing ones like concurrent.futures and multiprocessing.dummy in practice… ↩︎

1 Like

Isn’t this the status quo of providing multiprocessing and concurrent.futures, though?

For what it’s worth, there are some powerful libraries out there for multiprocessing (e.g. joblib, mentioned in the PEP). I think a significant barrier to non-expert usage is understanding when and why they are appropriate, and a lot of that requires understanding why they work the way they do (i.e. understanding the GIL).
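For example, joblib keeps the parallelism decision down to a line or two, but picking the right backend still requires knowing about the GIL. A sketch (`score` is a stand-in function; `prefer="threads"` is a real joblib option that free threading would make viable for CPU-bound work):

```python
from joblib import Parallel, delayed

def score(x):
    # Placeholder CPU-bound function.
    return sum(i * i for i in range(x))

# Today: the process backend is what scales CPU-bound work, at the
# cost of pickling arguments and results between processes.
results = Parallel(n_jobs=4)(delayed(score)(n) for n in range(100))

# With free threading, the thread backend would avoid that copying
# for CPU-bound work as well.
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(score)(n) for n in range(100)
)
```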

3 Likes

Presumably not, otherwise why aren’t they being discussed as the nogil model? Lots of people seem to be saying they would greatly benefit from nogil, but none of them to my knowledge are talking about just being able to use the existing stdlib tools where they can’t at the moment. And the people saying that nogil will enable new approaches presumably don’t mean “stdlib for CPU-bound workloads”…

But yes, maybe that’s just my lack of understanding here. I will say that if free threading is just “the existing stdlib, but works for CPU-bound code as well”, then it’s a bit disappointing. Very few of the problems I have with the existing stdlib modules are “doesn’t work with CPU-bound code”.

If PEP 703 is accepted, subinterpreters will still be a possible choice for those who find them easier to manage. Unfortunately, as with async, multiprocessing, and GIL threads, there are limits on what subinterpreters can be used for.

For a (pure) Python developer free threading will be no more dangerous than GIL threading was. It’s only the native extension authors that will need to care (but Cython counts as native).

2 Likes

Maybe I misunderstood what you were saying–I thought you were suggesting that the complexity of writing multithreaded code is a potential downside because people will get themselves in trouble. But the stdlib already gives them tools to get into such trouble. They just don’t always get a multi-core speedup after they go through the whole ordeal.

It seems like the supposed downside “people might get in trouble” is already true, and the issue is that they might actually try once there’s a reason to do so?

1 Like