PEP 684: A Per-Interpreter GIL

encukou · October 31, 2022, 1:00pm

My vote goes to no: make 3.12 safe, then remove the limitations.
For example, PyMem_SetAllocator with PYMEM_DOMAIN_MEM or PYMEM_DOMAIN_OBJ could block creating independent GILs, and new PyMem_SetGlobalAllocator could be added.

And, I guess setting memory allocators should be blocked if multiple GILs exist? Apparently, after Python is initialized, PyMem_SetAllocator should be only used only for hooks that wrap the current allocator (is that right @vstinner?), but creating such a hook using PyMem_GetAllocator gets you a race condition. IMO the best thing the initial implementation can do is to fall, and leave a better solution for later.

A wrinkle is that PyMem_SetAllocator has no way to signal failure – it silently ignores errors. Guess it predates PyStatus?

IMO, the solution is to not opt in for now. If synchronization/introspection API is missing, let’s add it after the PEP is in place. (IMO there are many issues in this area – that’s why I’m trying to convince Eric to make the initial implementation safe but limited.)

eric.snow · October 31, 2022, 8:23pm

Agreed. The PEP shouldn’t need more than that.

That said, a thread-safety restriction on the allocators is the simplest way forward for a safe 3.12 (under a per-interpreter GIL). Or were you talking only about the constraint on extension modules?

Do you mean if someone sets a custom mem/object allocator then subinterpreters with their own GIL should not be allowed? That is reasonable, if we don’t have enough information to conclude that existing custom allocators (used with PyMem_SetAllocator()) are thread-safe.

What would this do?

Yeah, that’s a race we’d have to resolve. However, rather than disallowing it, I’d expect a solution with a granular global lock, like we have for the interpreters list.

Right. We’d have to do something like leave the current allocator in place and return. Then you’d have to call PyMem_GetAllocator() afterward to see if your allocator is set. A function that returned a result could be helpful.

Regardless, it would make more sense to me if we had a separate API for wrapping the existing allocator after init (e.g. PyMem_WrapAllocator()). Then PyMem_SetAllocator() would apply only to the actual allocator and only be allowed before runtime init. However, that is definitely not part of this PEP (nor necessary for it).

Agreed.

encukou · November 1, 2022, 9:06am

I was talking about both :‍)

Yes, that seems like the easiest safe way forward.

Same as PyMem_SetAllocator, but allow subinterpreters with their own GILs – i.e. that allocator would be assumed to be thread-safe.
(Yes, it needs a better name.)

Yes. It’s out of scope for this PEP, but :

We probably should expose API for user-defined granular global locks. AFAIK we don’t have a good way to “allocate lock if not already allocated” that would work with multiple GILs.
Such a lock would be useful one-per-process modules (the isolation opt-out), as well as for Marc-André’s use case. IMO, this should be addressed relatively quickly, so people don’t start writing extensions that are only usable in the main interpreter. (I see relying on a single main interpreter as technical debt. Eventually I’d like to allow a library to call PyInitialize without caring whether there’s already an interpreter around. The concept of a main interpreter complicates that, but if it’s contained in the core, it should be manageable.)

eric.snow · November 2, 2022, 6:27pm

Thanks for clarifying. I agree that we should look into a new allocator set/get API that relates to interpreters. However, I don’t think this PEP needs that.

That’s a good idea. I’ll make a separate post just about this.

Regardless, I was hoping to leave specific APIs that help extension modules out of this PEP. From PEP 684:

We will work with popular extensions to help them support use in multiple interpreters. This may involve adding to CPython’s public C-API, which we will address on a case-by-case basis.

I’m sure we will add a fair number of utility APIs that might help extension maintainers reach multi-interpreter and per-interpreter GIL compatibility. It seems like the PEP would be out-of-phase with that effort, so it would be better to not include specific additions in the proposal.

+1

Yeah, that’s certainly something to look into (but not for this PEP). I known @steve.dower has some thoughts in this area, and certainly @vstinner does and I do. That said, I’d rather any further discussion on this get its own DPO thread, to avoid side-tracking the PEP discussion.

eric.snow · November 2, 2022, 7:49pm

I started a thread at https://discuss.python.org/t/a-new-c-api-for-extensions-that-need-runtime-global-locks/20668.

gpshead · November 9, 2022, 8:18am

faulthandler the crash reporting feature would remain per process. Just as it can do with dumping the current traceback of each thread in the VM, it should presumably be extended to do that for each subinterpreter so that it is clear which tracebacks belong to what.

faulthandler.dump_traceback* APIs could just dump thread stacks related to the calling interpreter? Or easier: simply restrict all faulthandler APIs to being called from the main interpreter rather than allowing them from subinterpreters. Given they deal with process wide state, just don’t let subinterpreters call them at all.

stonebig · January 14, 2023, 2:49pm

Will per-interpreter GIL work in a WASM context ? to bring parallelism also in this web context.
(Pyodide and Jupyterlite comes to mind)

brettcannon · January 15, 2023, 11:45pm

It’s not a clear-cut answer as it all depends on how you want to utilize per-interpreter GILs. WebAssembly does not natively have threads, so it would be no different than the situation today. If those Emscripten-based WebAssembly runtimes support some version of threads and that can be used from a pthread API, then it should be transparent. But all of that is up to Pyodide and Emscripten.

eric.snow · January 17, 2023, 7:27pm

CPython’s runtime relies on some global state that is shared between all interpreters. That will remain true with a per-interpreter GIL, though there will be less shared state.

From what I understand, WASM does not support any mechanism for sharing state between web workers (the only equivalent to threads of which I’m aware). So using multiple interpreters isn’t currently an option, regardless of a per-interpreter GIL. IIUC, at best you could run one runtime per web worker, which is essentially multiprocessing.

hannes · February 21, 2023, 7:48am

I just want to add that per-interpreter GIL would greatly increase Python’s usefulness for User-Defined Functions (UDFs) in DuckDB. DuckDB automatically parallelises SQL queries, including those with UDFs. However, thus far, we have been severely blocked to do this with Python as a UDF implementation language because of the GIL. The only way around this currently is to fork additional processes and to ship inputs and outputs around between processes with all the associated headaches. So yes, please add this!

eric.snow · February 21, 2023, 5:56pm

Thanks for the insight!

EwoutH · March 4, 2023, 9:24am

Thanks for all the hard work and insights on this! Is PEP 684 still targeted for the 3.12 release?

eric.snow · March 4, 2023, 4:31pm

Yeah, we’re still aiming for 3.12, assuming the PEP is accepted by the Steering Council.

emily · April 7, 2023, 7:41pm

On behalf of the Steering Council, I’m happy to report that we have accepted PEP 684.

@eric.snow, thanks for all of your efforts on this PEP and all of the supporting work it took to get us here over the years!

csm10495 · April 9, 2023, 6:05am

With just this PEP, is there a performance gain from using subinterpreters in threading.Threads as opposed to just raw threading.Threads.

I’m trying to understand any additional level of concurrency we get via just this PEP. It sort of sounds like it’s the same as threading.Threads for now until we get per-interpretor GIL.

methane · April 10, 2023, 2:52pm

Compared to regular threads

Subinterpreter threads can run works holding GIL in parallel.

Compared to multiprocessing (fork, forkserver)

Fork has many limitations and pitfalls, and Windows doesn’t support fork.
On the other hand, fork can share some RAM between interpreters.
Forkserver can be used to avoid some pitfalls.

Compared to multiprocessing (spawn)

Subinterpreter threads is similar to spawning multiprocessing in some way.
Both can run on Windows. Both can not share RAM between interpreters.
But subinterpreter threads is much faster to start compared to process.

In the future, we may be able to faster inter-subinterpreter communication and some memory sharing between subinterpreters. After that, subinterpreter threads can be more fast and efficient than spawn.

steve.dower · April 10, 2023, 3:29pm

Not quite true, but there are a few hoops you have to jump through to share RAM (such as converting the memory address to an int, passing it across as bytes, reconstituting it and wrapping it in some kind of accessor object).

What you can’t share is Python objects.

methane · April 11, 2023, 6:02am

Of course. All processes and sub-interpreters can share some RAM.
My point was the rough performance/memory efficiency characteristics compared to forking. (If concurrent.futures package adds SubinterpreterPoolExecutor.)

miraculixx · April 11, 2023, 8:03pm

Have been discussing this PEP with people at work and we have not been able to answer this question:

In Python up to 3.11, user created objects are shared among threads and can be accessed and updated by any thread, although the GIL effectively limits concurrency. It seems that with PEP 684, this is no longer the case, i.e. the GIL does not limit concurrency but also no longer allows cross-thread access to objects, which would imply fork-style copy-on-write(?).

Thus raises questions to how things will actually work.

E.g.

a dict or list obj allocated on the main thread is inserted with objects from code run by several threads (think map-style operation). Will the main thread see/be able to read/update these inserted objects? What happens to objects created by a thread and inserted into a global obj, once a thread terminates?
a thread accesses and updates an obj created on the main thread (say an element in a list). Will the main thread see the update, and when?
can threads be selectively created with either a new interpreter (thus its own GIL, i.e. unconstrained concurrency vv main & other threads, possibly losing shared objects semantics), or by using the main process GIL (thus limiting concurrency but keeping shared object semantics)?

We assume the answers to these questions boil down to “works like multiprocessing fork”, but would appreciate clarity.

If our assumption is correct, it seems this PEP would introduce a new “fully concurrent/shared nothing thread model” that is not always backwards compatible. With this in mind, will the default threading model in 3.12 be to continue working with a global GIL to maintain backward compatibility?

davidism · April 11, 2023, 8:08pm

This makes the GIL per-interpreter. Multiple interpreters are a different concept than threads. Within an interpreter, all threads still obey that interpreter’s GIL. Multiple interpreters no longer share a single GIL though, so the interpreters themselves are more independent of each other.