A New (C) API For Extensions that Need Runtime-Global Locks

malemburg · November 10, 2022, 4:20pm

If you’re hinting at a world where multiple interpreters are loaded into a single process using
different allocators, I’m pretty sure the whole idea is doomed to fail.

This whole multi-interpreter thing is already complex enough.

encukou · November 10, 2022, 5:19pm

count the number of times the extension is loaded (to figure out when to finalize)

As I understand Erlend’s proposal, Py_mod_global_init will only be called for the first module object to be initialized, and Py_mod_global_exit will be called after the last one is freed.
(Py_mod_exec and m_free are still called for every module object; the global state would already be set up whenever they run.)
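To make the first-in/last-out behaviour concrete, here is a minimal standalone sketch of the bookkeeping a runtime could do. It is a model under stated assumptions, not CPython code: the slots are only proposed in this thread, and runtime_module_created, runtime_module_freed, demo_global_init and demo_global_exit are hypothetical names. A pthread mutex stands in for whatever lock the runtime would use.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for an extension's Py_mod_global_init and
 * Py_mod_global_exit slot functions; the counters only record calls. */
static int global_init_calls = 0;
static int global_exit_calls = 0;

static void demo_global_init(void) { global_init_calls++; }
static void demo_global_exit(void) { global_exit_calls++; }

/* Hypothetical runtime-side bookkeeping: one counter per extension,
 * guarded by a process-wide lock. */
static pthread_mutex_t runtime_lock = PTHREAD_MUTEX_INITIALIZER;
static int live_module_objects = 0;

/* Runs when any interpreter creates a module object for the extension. */
void runtime_module_created(void)
{
    pthread_mutex_lock(&runtime_lock);
    if (live_module_objects++ == 0) {
        demo_global_init();   /* first module object in the process */
    }
    pthread_mutex_unlock(&runtime_lock);
}

/* Runs after a module object of the extension has been freed. */
void runtime_module_freed(void)
{
    pthread_mutex_lock(&runtime_lock);
    assert(live_module_objects > 0);
    if (--live_module_objects == 0) {
        demo_global_exit();   /* last module object is gone */
    }
    pthread_mutex_unlock(&runtime_lock);
}
```

With two interpreters importing the module, the init hook runs once on the first import and the exit hook runs once after the second module object is freed.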

refcount all shared resources (to figure out when to finalize those)

Yes.

find its own way to communicate with instances running in other interpreters (to share allocated internal data structures for more efficient use, e.g. loaded models for ML [1])

Yes, but it can use static variables protected by a lock (see below).
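The two points above (refcounting a shared resource, and keeping it in static variables behind a lock) can be sketched together. This is a standalone illustration only: shared_model, model_acquire, model_release and model_is_loaded are hypothetical names, the "model" is just a zeroed array, and a statically initialized pthread mutex stands in for the module's lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared resource, e.g. a loaded ML model. */
typedef struct {
    int refcount;      /* number of interpreters currently using it */
    double *weights;   /* the expensive data being shared */
} shared_model;

/* Static state shared by every interpreter in the process. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;
static shared_model *model = NULL;

/* Returns the shared model, loading it on first use; NULL on failure. */
shared_model *model_acquire(void)
{
    pthread_mutex_lock(&model_lock);
    if (model == NULL) {
        shared_model *m = calloc(1, sizeof *m);
        if (m != NULL) {
            m->weights = calloc(1024, sizeof *m->weights);
            if (m->weights == NULL) {
                free(m);
                m = NULL;
            }
        }
        model = m;
    }
    if (model != NULL) {
        model->refcount++;
    }
    shared_model *result = model;
    pthread_mutex_unlock(&model_lock);
    return result;
}

/* Drops one reference; the last release frees the resource. */
void model_release(shared_model *m)
{
    pthread_mutex_lock(&model_lock);
    assert(m != NULL && m == model && m->refcount > 0);
    if (--m->refcount == 0) {
        free(m->weights);
        free(m);
        model = NULL;
    }
    pthread_mutex_unlock(&model_lock);
}

/* Reports whether the shared model is currently loaded. */
int model_is_loaded(void)
{
    pthread_mutex_lock(&model_lock);
    int loaded = (model != NULL);
    pthread_mutex_unlock(&model_lock);
    return loaded;
}
```

Note that the shared data here is plain C memory, not Python objects, so none of the object-ownership caveats apply to it.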

figure out a way to do thread locking in a portable way to protect shared resources (since Python’s thread locking API likely won’t help with this, if I understand correctly – unless you want to halt all loaded interpreters using runtime-global locks)

Python’s thread locking should definitely help here.
IMO, best practice would be to allocate a module-specific static lock in Py_mod_global_init, and free it in Py_mod_global_exit. You can even use several locks for more granularity.
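A rough sketch of that practice follows, with one lock per shared resource. To keep it compilable on its own it uses pthread mutexes; in a real extension the equivalent calls would be PyThread_allocate_lock() and PyThread_free_lock(). The function names (example_global_init, example_global_exit and the accessors) are hypothetical, and so is the idea that the proposed slots would take no arguments.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Two process-wide locks, one per shared resource, for finer granularity. */
static pthread_mutex_t *cache_lock = NULL;
static pthread_mutex_t *stats_lock = NULL;
static int cached_value = 0;   /* protected by cache_lock */
static long cache_hits = 0;    /* protected by stats_lock */

static pthread_mutex_t *new_lock(void)
{
    pthread_mutex_t *lock = malloc(sizeof *lock);
    if (lock != NULL && pthread_mutex_init(lock, NULL) != 0) {
        free(lock);
        lock = NULL;
    }
    return lock;
}

static void free_lock(pthread_mutex_t **lock)
{
    if (*lock != NULL) {
        pthread_mutex_destroy(*lock);
        free(*lock);
        *lock = NULL;
    }
}

/* Hypothetical body of the global-exit slot: runs once per process. */
void example_global_exit(void)
{
    free_lock(&cache_lock);
    free_lock(&stats_lock);
}

/* Hypothetical body of the global-init slot: returns 0 on success, -1 on error. */
int example_global_init(void)
{
    cache_lock = new_lock();
    stats_lock = new_lock();
    if (cache_lock == NULL || stats_lock == NULL) {
        example_global_exit();   /* release whichever lock was created */
        return -1;
    }
    return 0;
}

void example_cache_store(int value)
{
    pthread_mutex_lock(cache_lock);
    cached_value = value;
    pthread_mutex_unlock(cache_lock);
}

int example_cache_load(void)
{
    pthread_mutex_lock(cache_lock);
    int value = cached_value;
    pthread_mutex_unlock(cache_lock);
    return value;
}

/* Records a hit and returns the new total. */
long example_record_hit(void)
{
    pthread_mutex_lock(stats_lock);
    long hits = ++cache_hits;
    pthread_mutex_unlock(stats_lock);
    return hits;
}
```

Because the locks are created before any module object exists and destroyed after the last one is freed, per-module code never has to check whether they are there.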

Why?
If we don’t do that, we might need a global allocator lock – which sounds worse than the GIL, performance-wise.

erlendaasland · November 10, 2022, 5:24pm

That’s a correct interpretation. With such a mechanism, there should be no need for the extension module to keep track of refs for the global state in order to find out when to set it up and when to tear it down.

eric.snow · November 10, 2022, 6:06pm

Sharing arbitrary objects between interpreters isn’t currently feasible (assuming a per-interpreter GIL) without a number of caveats and workarounds:

the object’s dealloc must run in the interpreter under which the object was created (the “owner”)
that means the object must not outlive its owning interpreter (so the main interpreter is the safest)
you’d have to use a global lock to protect any use of the object (like the GIL does now) or restrict all operations to the owning interpreter
there will probably still be some C-API that breaks under these conditions, which must be avoided (perhaps indefinitely)
…

A partial solution, when not in the owning interpreter, would be to use Py_AddPendingCall() to perform any operations on a shared object relative to its owning interpreter, which would be safest with the main interpreter. However, that approach is a bit clunky and suffers from the limitations of Py_AddPendingCall() (e.g. blocked while all of that interpreter’s threads are blocked or running outside the eval loop).
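The shape of that workaround can be modelled without Python at all: other threads enqueue a callback, and only the owner ever runs it. The sketch below is a standalone model, not CPython's implementation; pending_queue, add_pending_call, run_pending_calls and bump_counter are hypothetical names, and the 0 / -1 return convention simply mirrors Py_AddPendingCall().

```c
#include <assert.h>
#include <pthread.h>

#define MAX_PENDING 32

typedef int (*pending_func)(void *);

/* A small ring buffer of callbacks waiting for the owner to run them. */
typedef struct {
    pthread_mutex_t lock;
    struct { pending_func func; void *arg; } calls[MAX_PENDING];
    int head;    /* index of the oldest queued call */
    int count;   /* number of queued calls */
} pending_queue;

static pending_queue owner_queue = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* May be called from any thread. Returns 0 on success, -1 if the queue is full. */
int add_pending_call(pending_queue *q, pending_func func, void *arg)
{
    int result = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < MAX_PENDING) {
        int slot = (q->head + q->count) % MAX_PENDING;
        q->calls[slot].func = func;
        q->calls[slot].arg = arg;
        q->count++;
        result = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return result;
}

/* Called only by the owner. Runs queued calls in FIFO order and
 * returns how many were run. */
int run_pending_calls(pending_queue *q)
{
    int ran = 0;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        if (q->count == 0) {
            pthread_mutex_unlock(&q->lock);
            return ran;
        }
        pending_func func = q->calls[q->head].func;
        void *arg = q->calls[q->head].arg;
        q->head = (q->head + 1) % MAX_PENDING;
        q->count--;
        pthread_mutex_unlock(&q->lock);

        func(arg);   /* runs outside the lock, in the owner's context */
        ran++;
    }
}

/* Example operation on a shared object: increment a counter. */
static int bump_counter(void *arg)
{
    ++*(int *)arg;
    return 0;
}
```

The clunkiness is visible in the model: nothing happens to the shared object until the owner gets around to draining the queue, and the caller gets no result back.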

So, in the end, I agree with Petr. Currently it really isn’t practical to store shared global state in Python objects. It might not be as tricky as I think, but I’m not sure it’s worth it either way.

eric.snow · November 10, 2022, 6:38pm

There are two meanings for “different allocators”:

each interpreter can be set to use a different allocator (e.g. one uses glibc malloc and another mimalloc)
the allocators are still process-global but the state of the allocator (e.g. CPython’s “small block” allocator, AKA pymalloc) is per-interpreter

The first one is certainly an interesting idea but currently not feasible and likely not worth trying (due to complexity that doesn’t pay for itself). However, the second one is what PEP 684 proposes. I implemented it in a branch to verify it works.
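As a toy illustration of the second meaning, the sketch below uses one set of allocator functions for the whole process while keeping the bookkeeping in a per-interpreter struct. Everything here is a simplified assumption: alloc_state and interp are hypothetical types, and a block counter stands in for pymalloc's real state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-interpreter allocator state. */
typedef struct {
    size_t live_blocks;   /* blocks allocated and not yet freed */
} alloc_state;

/* Hypothetical interpreter: each one carries its own allocator state. */
typedef struct {
    alloc_state alloc;
} interp;

/* The same function serves every interpreter, but it only touches the
 * state of the interpreter passed in, so no process-wide lock is needed
 * as long as each interpreter uses only its own state. */
void *interp_malloc(interp *it, size_t size)
{
    void *p = malloc(size != 0 ? size : 1);
    if (p != NULL) {
        it->alloc.live_blocks++;
    }
    return p;
}

/* A block must be returned through the interpreter that allocated it,
 * otherwise the per-interpreter bookkeeping goes wrong. */
void interp_free(interp *it, void *p)
{
    if (p == NULL) {
        return;
    }
    assert(it->alloc.live_blocks > 0);
    it->alloc.live_blocks--;
    free(p);
}
```

In this model, freeing a block through the wrong interpreter corrupts the counts.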

Of course, things do break down under multiple interpreters for extensions that do not preserve isolation between them, but multi-phase init extensions promise to preserve that isolation. [1] If isolated extensions have a need for which the current solution would break isolation (or it’s easy to do so accidentally), then we should definitely provide an API to avoid that (and, ideally, simplify the use case).

Clearly, this discussion is about such a case. The questions we’re still answering are:

what are the needs?
how might they break interpreter isolation?
what solutions are good enough?

[1] Single-phase init modules will only import in the main interpreter or in one created via the legacy Py_NewInterpreter().

eric.snow · November 11, 2022, 4:54pm

Regardless of where we go with a module-specific global state API, it sounds like an API that specifically facilitates creating module-specific global locks would be worth it. That would either be something direct (like what Petr posted earlier or what I posted) or the existing lock API combined with the new moduledef slots Erlend outlined. I suppose I’m leaning toward the latter.

FWIW, we’ve bumped into the issue of sharing/managing module-specific global resources while porting the stdlib “syslog” module to multi-phase init, so a near-term solution is on my mind.

malemburg · November 14, 2022, 5:09pm

I understood your comment to mean that subinterpreters can each have their own memory allocator system. You probably meant: the subinterpreter will have its own instance of the allocator used for all interpreters.

This would only work for a known fixed number of such shared resources.

Let me play devil’s advocate:

Given the above constraints, the benefit of using multiple interpreters in a single process doesn’t appear to pay off. Data-oriented extensions, which as I understand are the main drivers behind the idea to have subinterpreters, will end up having to implement their own way of sharing data between these interpreter instances.

As it stands, using separate processes with shared memory and e.g. PyArrow data structures to manage the sharing and avoid serialization overhead seems like the much easier way to use all cores on a machine. It also avoids the added complexity of subinterpreters, having to port extensions over to the new logic and making them thread safe.

If subinterpreters really want to shine and provide scalability benefits over the multi-process architecture, Python will need to provide an easy to use, standard and (thread-)safe way to share data between subinterpreters.

erlendaasland · April 10, 2023, 3:28pm

We should try to hash out a PEP for this for 3.13.

eric.snow · August 3, 2023, 3:15pm

FWIW, I’m pretty sure an API for runtime-global locks will also help extensions that support no-gil.

erlendaasland · August 3, 2023, 3:40pm

I can try to adapt my earlier post in this thread to a PEP draft.

jpe · August 3, 2023, 6:39pm

Is there any safe place currently to allocate per-process locks with PyThread_allocate_lock()? I thought this would be the <module-name>_init() function, but I think it currently can be called simultaneously by 2+ interpreters.
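One pattern that tolerates simultaneous callers is to publish the lock with an atomic compare-and-swap, so that whichever caller loses the race discards its own copy. The sketch below is only a standalone model of that idea, not an answer from the CPython API: a pthread mutex stands in for the lock returned by PyThread_allocate_lock(), the pointer swap uses C11 <stdatomic.h>, and get_process_lock is a hypothetical name.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* The single per-process lock, published atomically on first use. */
static _Atomic(pthread_mutex_t *) process_lock = NULL;

/* Safe to call from several threads (or interpreters) at once.
 * Returns the one shared lock, or NULL if it could not be created. */
pthread_mutex_t *get_process_lock(void)
{
    pthread_mutex_t *lock = atomic_load(&process_lock);
    if (lock != NULL) {
        return lock;   /* already published by an earlier caller */
    }

    pthread_mutex_t *candidate = malloc(sizeof *candidate);
    if (candidate == NULL) {
        return NULL;
    }
    if (pthread_mutex_init(candidate, NULL) != 0) {
        free(candidate);
        return NULL;
    }

    pthread_mutex_t *expected = NULL;
    if (atomic_compare_exchange_strong(&process_lock, &expected, candidate)) {
        return candidate;   /* this caller won the race */
    }

    /* Another caller published first: drop our copy and use theirs,
     * which the failed compare-exchange stored in 'expected'. */
    pthread_mutex_destroy(candidate);
    free(candidate);
    return expected;
}
```

The lock created this way is never freed in the sketch; tearing it down safely would need something like the global-exit hook discussed earlier in the thread.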