A fast, free threading Python

Isn’t this already the case for subinterpreters, because subinterpreters share C extensions?

Honestly, for my use case I wouldn’t mind completely removing the GIL but adding another (transitional?) feature where threads take a global lock by default anyway, because then I could leave “old style” threads alone and only avoid acquiring the global lock in the specific cases where I need better latency.

However, I think there are good opportunities around things like the actor model in a free-threaded world, where ownership is simplified by keeping mutable data inside an actor. (It seems hard to do actor scheduling on subinterpreters, because you can’t easily migrate actors across interpreters.)
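To be concrete about what I mean by “ownership stays inside an actor,” here’s a minimal toy sketch (my own illustration, not an existing library): all mutable state is touched only by the actor’s own worker thread, so no locks are needed even without a GIL.

```python
# Toy actor: state is owned by one thread; other threads only enqueue messages.
import queue
import threading


class CounterActor:
    def __init__(self):
        self._inbox = queue.Queue()
        self._count = 0  # mutable state, owned exclusively by this actor
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Callable from any thread; only enqueues, never touches state."""
        self._inbox.put(message)

    def _run(self):
        while True:
            message = self._inbox.get()
            if message == "stop":
                break
            # State is mutated only here, on the actor's own thread.
            self._count += 1

    def stop(self):
        self.send("stop")
        self._thread.join()
        return self._count


if __name__ == "__main__":
    actor = CounterActor()
    for _ in range(1000):
        actor.send("increment")
    print(actor.stop())  # 1000
```

The scheduling question is exactly the part subinterpreters make hard: this kind of actor can be migrated between OS threads freely, but it can’t easily be moved between interpreters.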

The big draw for me is removing the GIL acquisition time in a realtime environment. I talked a bit about it here. Python is mostly fast enough for me, but it struggles with tasks like realtime audio, input processing, and rendering in a single process. Subinterpreters feel more like web workers to me, where you need to manually split up your workload; they’re a completely different programming model and not quite as flexible. I think nogil today would remove my GIL-related stalls (which currently cause me to drop frames for both input and UI) without me making any other changes.
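To make the stall concrete, here’s a rough way to provoke it on a stock (with-GIL) build. This is illustrative only, not my actual code: raising the switch interval stands in for native code that holds the GIL for a long stretch, and a 60 Hz loop then blows through its 16 ms budget while waiting to reacquire the GIL.

```python
# Demonstration of GIL acquisition latency (with-GIL builds only).
import sys
import threading
import time

sys.setswitchinterval(0.05)  # let a running thread keep the GIL for ~50 ms


def cpu_hog():
    # Pure-Python busy work; with the larger switch interval it holds the
    # GIL far longer than one frame at a time.
    x = 0
    while True:
        x += 1


def frame_loop():
    deadline = 0.016  # 16 ms per frame
    for _ in range(120):
        start = time.perf_counter()
        time.sleep(deadline)  # stand-in for "wait for next frame"
        elapsed = time.perf_counter() - start
        if elapsed > deadline * 2:
            print(f"dropped frame: waited {elapsed * 1000:.1f} ms")


threading.Thread(target=cpu_hog, daemon=True).start()
frame_loop()
```

On a free-threaded build the frame thread wouldn’t have to wait for the hog at all, which is the whole point for my use case.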

Subinterpreters feel to me like “multiprocessing API, in a single process”. You need to initialize a new interpreter, import all of your libraries again, etc. It seems slightly more efficient than multiprocessing, especially with some shared memory, but neither multiprocessing nor subinterpreters really help my main use case at all. I’m not CPU bound, I’m latency bound, and it’s difficult enough today to prevent GIL contention even in extension code I completely control. I’ve had to do a lot of engineering just to work around various things that hold the GIL for more than a frame.

(I think “I’ve had to do a lot of engineering to work around the GIL” is a common theme among authors of some high-performance C extensions, e.g. an earlier comment from openai.)

I want to schedule semi-realtime work on 4–8 cores of a consumer machine (<16 ms deadline for any given GIL acquisition). With subinterpreters, I’d need to pay the import cost for my whole tree 4–8 times, schedule all of my work manually on specific interpreters, and there would be significant abstraction leakage to the users scripting my app. With nogil I can use my existing locking and let the OS scheduler do what it does best.
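The shape I’m after looks roughly like this sketch (subsystem names, timings, and data are made up for illustration): one thread per subsystem, shared state behind a small lock, and the OS scheduler spreading the threads across cores.

```python
# One thread per subsystem; only the shared scene data needs a lock.
import threading
import time

scene_lock = threading.Lock()   # protects only the shared scene data
scene = {"objects": []}


def audio_worker(stop):
    while not stop.is_set():
        time.sleep(0.002)       # ~2 ms audio block (stand-in for real DSP)


def input_worker(stop):
    while not stop.is_set():
        with scene_lock:        # short critical section, not a global lock
            scene["objects"].append("event")
        time.sleep(0.008)


def render_worker(stop):
    while not stop.is_set():
        with scene_lock:
            snapshot = list(scene["objects"])  # copy, then render outside the lock
        time.sleep(0.016)       # 60 Hz frame budget


stop = threading.Event()
threads = [threading.Thread(target=w, args=(stop,), daemon=True)
           for w in (audio_worker, input_worker, render_worker)]
for t in threads:
    t.start()
time.sleep(1.0)
stop.set()
for t in threads:
    t.join()
```

This already runs today, of course; the problem is that under the GIL the critical sections aren’t the only thing the threads contend on.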

With free threading you can also implement new paradigms for Python, like a multithreaded async scheduler. I don’t see that happening with subinterpreters, due to the cost of moving objects and code between interpreters.
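As a toy sketch of what I mean (not a real design): one asyncio event loop per thread, with coroutines handed out round-robin. Under the GIL this only interleaves; with free threading the loops could actually run in parallel.

```python
# Toy "multithreaded async scheduler": one event loop per worker thread.
import asyncio
import itertools
import threading


class MultiLoopScheduler:
    def __init__(self, num_threads=4):
        self._loops = []
        for _ in range(num_threads):
            loop = asyncio.new_event_loop()
            threading.Thread(target=loop.run_forever, daemon=True).start()
            self._loops.append(loop)
        self._pick = itertools.cycle(self._loops)

    def submit(self, coro):
        """Schedule a coroutine on one of the worker loops (round-robin)."""
        return asyncio.run_coroutine_threadsafe(coro, next(self._pick))

    def shutdown(self):
        for loop in self._loops:
            loop.call_soon_threadsafe(loop.stop)


async def work(i):
    await asyncio.sleep(0.01)
    return i * i


if __name__ == "__main__":
    sched = MultiLoopScheduler()
    futures = [sched.submit(work(i)) for i in range(8)]
    print([f.result() for f in futures])
    sched.shutdown()
```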
