Is Free-Threading Our Only Option?

For what it’s worth, I’m trying to use my experience adopting this at $work to publish publicly some useful async and concurrency utilities that work well with free-threading, and I intend to mark that library as stable, with a documented stability policy, coinciding with the timing of 3.14’s release. I think the resources and experience available in my situation have enabled finding the missing pieces people will want in the standard library.

What I’d call the “biggest missing pieces” in the standard library are there, the biggest one being dual-color queues that don’t bind to a specific event loop[1], allowing safer communication between threads in code that mixes the concurrency models provided by the standard library with free-threading. I may look to upstream some of this to CPython for either 3.15 or 3.16, but anything beyond upstreaming that queue will probably require a PEP, so I’m taking my time iterating outside the standard library first.
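To make “dual-color” concrete, here is a minimal sketch of the idea; it is not the library’s actual API, and the names (DualQueue, put_async, get_async) are purely illustrative:

```python
import asyncio
import queue


class DualQueue:
    """Illustrative queue usable from threads and coroutines alike."""

    def __init__(self, maxsize: int = 0) -> None:
        self._q: queue.Queue = queue.Queue(maxsize)

    # Synchronous ("thread-colored") side: callable from any thread.
    def put(self, item) -> None:
        self._q.put(item)

    def get(self, timeout: float | None = None):
        return self._q.get(timeout=timeout)

    # Asynchronous ("async-colored") side: the blocking call is delegated
    # to a worker thread, so the calling event loop is never blocked.
    async def put_async(self, item) -> None:
        await asyncio.to_thread(self._q.put, item)

    async def get_async(self):
        return await asyncio.to_thread(self._q.get)
```

Because no event loop reference is ever stored, the same instance can be handed to plain threads and to coroutines running on any number of loops in different threads.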


  1. Providing these may also help with people feeling that function coloring is too pervasive, rather than useful and necessary for structure, since the primitives can bridge the gap better. ↩︎

9 Likes

Just FYI, here is Ruby’s approach to a similar issue:

2 Likes

There are quite a few web frameworks using fork() to spawn workers for parallel and concurrent request processing, and native extension modules that use multi-threading tend not to work nicely with those, especially if a native extension module has shared state protected by mutexes.
I don’t think those web frameworks would have used multiprocessing if multithreading had been an option from the beginning.

1 Like

Yes. Most frameworks/servers fork() before loading the whole application, to avoid the many pitfalls of fork.
But Meta uses it a lot. Python added gc.freeze() and immortal objects so that Meta could reduce RAM usage via fork’s CoW behavior.
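For reference, the preload-then-fork pattern looks roughly like this sketch; the application state and serve() entry point are stand-ins for a real app:

```python
import gc
import os

# Stand-in for importing the whole application up front (in real use,
# `import myapp` or similar happens here, before any forking).
APP_STATE = {"routes": list(range(1_000_000))}

def serve() -> None:
    """Hypothetical worker entry point."""

# Move everything allocated so far into a permanent generation, so the GC
# never touches (and therefore never dirties) those pages in the children.
gc.freeze()

for _ in range(4):
    if os.fork() == 0:        # child shares the preloaded state copy-on-write
        serve()
        os._exit(0)

for _ in range(4):
    os.wait()
```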

I’ll also add that fork has lots of things that can bite you later on (without realizing it).

We had a long-standing bug in a server that used flask/gunicorn/nginx. On startup it made a DB connection, then gunicorn forked. We got weird DB errors every once in a while. Then an update to SQLAlchemy made it much worse.

Lots of debugging later: the fork was the issue, since all children were sharing the same TCP socket talking to the DB.
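The failure mode boils down to something like this self-contained sketch (a local listener stands in for the database):

```python
import os
import socket

# Stand-in "database": a local listener so the sketch runs anywhere (POSIX).
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen()

conn = socket.create_connection(server.getsockname())  # opened BEFORE forking

for _ in range(2):
    if os.fork() == 0:
        # Every child inherits the SAME underlying socket, so their writes
        # interleave on one TCP stream and the server-side protocol state
        # gets corrupted: the intermittent "weird DB errors" above.
        conn.sendall(b"SELECT 1;\n")
        os._exit(0)

peer, _ = server.accept()
for _ in range(2):
    os.wait()
print(peer.recv(1024))  # one stream containing both children's writes
```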

I like the simplicity of fork in theory, but in practice there’s a lot of weirdness and generally unspecified behavior.

Edit: also, IIRC, threading is never guaranteed to work with forking.

3 Likes

That’s not really fork’s fault though, is it? It’s the failure of the programmer to close unneeded open file descriptors in the child process (in most cases, everything but 0, 1, and 2). OTOH, if the programmer wanted a DB connection, it was their responsibility either to verify it could be shared or to open a new connection, not simply hope they could share the existing TCP socket. (I think I learned that best practice from “Unix Network Programming” by W. Richard Stevens more than 20 years ago.)

Edit: Or maybe even long before that, in some long-forgotten non-network programming source?

7 Likes

Sure, it’s the programmer’s fault, though it had a footgun that at least we weren’t aware of. (I didn’t realize the workers were forking after the DB session was made.) This stuff tends to seem obvious only in hindsight.

I hadn’t used fork directly in at least 10 years. Of course, I use stuff that uses it, just not directly. We become less aware of these deep footguns the more abstracted away we get. That’s both a good thing and a bad thing.

2 Likes

I’m reminded of something a friend made a point of (and if you’re reading this, I’m sorry I’m going to butcher the more elegant phrasing you had): abstractions that pass footguns onto their users without documentation are worse than the original footgun.

fork has caveats, but people know that. When things use fork and don’t do all of the isolation for the user, they really should be responsible and document that all of the same caveats that apply to fork still apply.

5 Likes

This is a problem with gunicorn, not with fork; gunicorn does not do the correct thing here. fork should almost always be followed immediately by execve; otherwise, instead of fork, use something like clone(2) or clone3()[1], or posix_spawn().
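In Python terms, the safe patterns look roughly like this sketch (the spawned command is a placeholder):

```python
import os
import sys

# Pattern 1: fork followed immediately by exec. The child replaces itself
# before it can touch any inherited locks, sockets, or interpreter state.
pid = os.fork()
if pid == 0:
    os.execv(sys.executable, [sys.executable, "-c", "print('child 1')"])
os.waitpid(pid, 0)

# Pattern 2: let the platform do fork+exec in one step via posix_spawn.
pid = os.posix_spawn(
    sys.executable,
    [sys.executable, "-c", "print('child 2')"],
    os.environ,
)
os.waitpid(pid, 0)
```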

Doing any sort of socket binding prior to fork is arguably the user’s problem for doing it, the fault of the tools you are using for allowing it, and the fault of those same tools for not documenting the limitation.

Threading is safe after fork if you use fork correctly; the idea that it isn’t probably comes from the number of frameworks that misuse fork. Threading is also safe before fork, with only a few notable limitations. These can be found in the documentation of fork.

I’m sympathetic to the fact that many people writing python don’t expect to have to think about these things, but frameworks do often pass these issues on to their users without adequate warning.


  1. both documented here ↩︎

1 Like

I seem to be getting sidetracked.

Multiprocessing has significant limitations on resource sharing. You can use CoW to reduce some memory usage, and indeed Meta does, but it is very difficult to use correctly and is not recommended for the average programmer. I don’t think there is any disagreement on this point.
We don’t need to dive into the details of the difficulty and danger in this thread unless it is a key difference from the per-interpreter GIL.

Multithreading can achieve resource sharing more easily and efficiently than multiprocessing, despite the difficulties of thread-safe programming:

  • Even if an application has thousands of classes and uses hundreds of MB of memory, multiple threads can share the application by loading it only once.
  • External connections can easily be shared among threads using the connection pools provided by SQLAlchemy, redis-py, requests, etc. (see the sketch after this list).
  • Caching the results of external requests is also possible with thread-safe cache libraries.
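A minimal sketch of that second point, assuming SQLAlchemy is installed and the URL below (a placeholder) points at a real database:

```python
from concurrent.futures import ThreadPoolExecutor

import sqlalchemy

# One engine, and therefore one connection pool, for the whole process;
# SQLAlchemy's pool is thread-safe, so every worker thread can share it.
engine = sqlalchemy.create_engine(
    "postgresql://db.example.internal/app",  # placeholder URL
    pool_size=10,
)

def handle_request(i: int) -> int:
    # Checks a connection out of the shared pool and returns it on exit,
    # instead of opening a fresh connection per worker.
    with engine.connect() as conn:
        return conn.execute(sqlalchemy.text("SELECT :i"), {"i": i}).scalar_one()

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, range(32)))
```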

So users who run web apps with multiple processes plus threads will benefit from free-threading.
They won’t need multiple processes anymore: they can save a lot of RAM, and their connection pools become more efficient.

The per-interpreter GIL is very similar to multiprocessing, except that it runs within a single process. Even if you simply want to use multiple cores, you have to load the entire application for each interpreter. Resource sharing is difficult too.

13 Likes

Which leads back to bholley’s “Must Be This Tall to Write Multi-Threaded Code”, and why I am so uneasy with the whole no-GIL movement.

1 Like

We have a wonderful solution to this: programming languages which protect against data races :slight_smile: (Data races are not all race conditions, merely those that produce undefined behavior within the abstract machine; deadlocks notably remain possible.)

I say this not merely to shill, but also because it has an important interplay with Python’s concurrency semantics:

Free-threaded Python maps quite well to the semantics Rust offers (and thanks to @ngoldbaum, we have excellent support for it in pyo3), and allows extension authors to fairly easily write safe, concurrent code.

In contrast, sub-interpreters do not. They are not yet supported by pyo3, and it’s unclear whether there’s ever a path to supporting sub-interpreters that doesn’t require users to use unsafe (meaning they’re responsible for upholding invariants that the compiler can’t enforce).

8 Likes

I don’t understand why the compiler would need to be involved in enforcing invariants that the runtime already enforces.

Subinterpreters naturally disallow data races, and the developer has to opt into allowing them on a case-by-case basis. Free-threading naturally allows all data races, and the developer has to manually protect against all of them, apart from those that are intentional.

(Unless you’re specifically talking about Rust developers in the context of a Python design, in which case I still don’t understand, but at least I understand why I don’t understand.)
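To illustrate the kind of race free-threading “naturally allows”, here is a small sketch (thread and iteration counts are arbitrary):

```python
import threading

counter = 0

def bump(n: int) -> None:
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write in three steps: not atomic, so racy

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Without a lock around the increment, this can print less than 400000
# on a free-threaded build; the updates silently overwrite each other.
print(counter)
```

Under subinterpreters, counter simply wouldn’t be shared between workers, so this particular mistake can’t even be written.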

6 Likes

If you design your app this way (that is, inefficiently), then yes.

Alternatively, you can load a much smaller portion of your app into each interpreter to just do the job assigned to it. And there’s plenty of scope to optimise reloading the same parts again.[1]

Resource sharing is merely under-implemented, in that not enough people have invested in the libraries to make it easy. Similarly, concurrent data structures (in Python) are also under-implemented right now,[2] which means when people start trying to solve their data races they’re going to discover the same problem.

Again, this was Eric’s original point. We’re stumbling[3] towards free-threading on the basis that work has been done there that hasn’t been done in other areas - that is, sunk costs - and we want to be deliberate about making that decision.


  1. Provided they don’t do silly amounts of load-time execution, which we already acknowledge is a bad idea and prevents all sorts of optimisations. ↩︎

  2. Beyond the built-in ones, which I acknowledge are fine, and in better shape than for subinterpreters. Though subinterpreters are more intended for message passing and so have different needs that don’t require concurrent access to basic data structures. ↩︎

  3. As an overall community. I’m not suggesting the experts who know their way around it are in any way incompetent. Just that the rest of us aren’t really being intentionally involved in choosing the direction. ↩︎

3 Likes

I think there’s also a problem in that the free-threaded model we’re aiming for seems to be the same model that the existing threading capabilities use - largely based on locks and their management. That model is well known to be hard to use, as demonstrated by the fact that people keep pointing out that you can’t “just” use free-threading blindly; you still need to understand all the issues that you had to understand with threading under the GIL.

That’s leading us to a world where free-threading exists, but is still very much an “experts only” feature. I don’t deny that for the experts, free-threading will be a benefit. But IMO we could (and should) be doing so much more, aiming for a more accessible threading model. Whether that’s subinterpreters (which are basically threading with a “no sharing, message passing” model), or free-threading with something like Rust’s machinery for immutability and controlled sharing of mutable data structures, or something else based on free-threading, I don’t know.

So I guess for me, subinterpreters and free-threading aren’t alternatives. Free-threading gets us better concurrency for the existing experts, but is only an initial step towards concurrency for the average user. Subinterpreters offer an immediate “better concurrency model” that isn’t dependent on free-threading for many of its benefits[1]. What concerns me is that we’ll stop once free-threading is available, seeing it as sufficient in itself rather than as simply the step that unblocks work on better concurrency models for the language. And that subinterpreters will be sidelined because of that view, and because they don’t need free-threading (and so are seen as no longer relevant).


  1. per-interpreter GIL is sufficient ↩︎

10 Likes

Well, no, it isn’t. Using a concurrent.futures.ThreadPoolExecutor isn’t harder than using a hypothetical concurrent.futures.InterpreterPoolExecutor. It might actually be easier, because you don’t have to worry about your data being efficiently picklable, for example.
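For instance, this sketch (the file name is arbitrary) works with a thread pool precisely because nothing crosses a serialization boundary:

```python
from concurrent.futures import ThreadPoolExecutor

handle = open("scratch.txt", "w")  # file objects are not picklable

def describe(h) -> str:
    return h.name

with ThreadPoolExecutor() as ex:
    # Fine with threads: the very same object is passed by reference.
    print(ex.submit(describe, handle).result())

# An interpreter-based pool would have to serialize `handle` to cross the
# interpreter boundary, and pickling a file object raises TypeError.
```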

I honestly don’t understand how you can think that it’s “better”. It’s more restricted, which can be seen as reassuring if you don’t really know what you’re doing, but it probably won’t give you better performance than multi-threading (it may very well perform worse).

And, by the way, if you’re looking for a more reassuring model, multiprocessing has been there for years (and so has concurrent.futures.ProcessPoolExecutor). Have you been using it? And if not, why do you feel so concerned about multiple interpreters?

4 Likes

Just to note that this currently exists: Eric added it for 3.14.


5 Likes

Oh, cool, I stand corrected :slight_smile:

2 Likes

That was the exact example I was thinking of. In a ThreadPoolExecutor you can use a shared mutable data structure to collect results. That’s wrong, and will fail, but you need to understand threading and the need to lock appropriately to see that. With an InterpreterPoolExecutor, you can’t share the data structure in the first place, protecting you from an entire class of problems.
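To make that concrete, here is a sketch of the thread version; the lock is exactly the piece of expert knowledge in question, since the compound read-modify-write on the shared dict is not atomic:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Shared mutable state used to collect results across worker threads.
totals: dict[str, int] = {}
lock = Lock()

def work(i: int) -> None:
    key = "even" if i % 2 == 0 else "odd"
    with lock:  # omit this and concurrent updates can silently lose values
        totals[key] = totals.get(key, 0) + i

with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(work, range(100)))

print(totals)  # {'even': 2450, 'odd': 2500} (key order may vary)
```

With an InterpreterPoolExecutor, the shared totals dict is impossible to express in the first place; each interpreter would see its own copy, so results have to be returned or sent explicitly.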

It’s the fact that people don’t even see that as an important difference which makes me think that free threading is going to end up being an “experts only” tool.

Preventing certain types of error is more restricted, but that’s a good thing. I explicitly said that my goal was to make threading accessible to people who don’t have the level of expert knowledge that’s currently needed, so I really don’t understand why it’s surprising to you that I believe that.

As for performance, I don’t know how to answer that. Are you saying that it’s impossible to have high performance and easy, safe usage? Because Rust’s rayon crate directly disproves that. We may not be able to do anything that good in Python yet, but again, my point is that seeing free threading as an end goal prevents us from even looking at high-performance, safe threading models. (And my interest in subinterpreters is precisely because it’s the only work that is currently going on into safe, performant, threading models - and it’s hugely under-resourced compared to free threading).

On Windows, process startup is a huge overhead. For smaller workloads (for example, many of pip’s internal processes), concurrency would be beneficial but multiprocessing would not, because of that single fact. Maybe the startup cost of multiple interpreters would be high, too, but even if it is, it’s something that we control, and can improve.

3 Likes

The invariant that’s hard to enforce is that you never leak a PyObject* across sub-interpreters (e.g., by storing one in any global/static state). Doing so is undefined behavior, and the compiler can’t enforce that you never do this.

1 Like