A New (C) API For Extensions that Need Runtime-Global Locks

That is a good question.

Let’s take a step back. Perhaps I have a wrong understanding of how loading an extension in multiple interpreters would work when interfacing with a C library that has per-process global state. So let’s see how that would work:

  • interpreter A is the first to load the extension
  • the dyn linker loads the shared lib
  • A calls the module init function
  • the init function sets up the global state – in the single interpreter mode, this would normally happen using static C vars to hold that state, e.g. for ODBC, the SQLHENV henv.

So far, so good. Now another interpreter loads the extension:

  • interpreter B loads the extension
  • the dyn linker sees that the shared lib is already loaded, so it points interpreter B to it
  • B calls the module init function
  • the init function now has to check whether henv is set, to avoid a second ODBC env init
  • the extension module continues initializing the module with objects, using henv where necessary

This would work without locks. ODBC is thread safe, so no additional locks are needed for managing ODBC calls.

Now, interpreter A wants to terminate.

  • A calls the module cleanup function
  • this cleanup function would need to free the ODBC SQLHENV henv, if it were the last interpreter to use it, but B is still using it

At this point, we have a problem: how can A know that henv is still needed by B?

In this scenario, locks could be used, but would not necessarily be the best solution (see below).

A better solution would be to have some sort of communication between the interpreters to check whether the ODBC henv is still needed. This could be done by having lock-protected variables shared between interpreters (the key-value storage we discussed in the previous topic), or by providing a reference counting feature, where each extension running in a different interpreter can register its on-going use of shared resources.

Alternatively, the extension could use a static C variable for this and protect it with a lock (using a thread lock as discussed above). This doesn’t seem like a good solution, though, since every single extension would have to go through the same hoops to make this happen.

PS: After writing the above I researched the ODBC SQLAllocHandle API and found that it is possible, in theory, to have multiple environments per process. I’ve never used or tried such a setup, and given my experience with ODBC drivers, this kind of setup would likely introduce compatibility issues, so I would not recommend it.

2 Likes

Same here.

As Petr noted, without a GIL between them, there may be a race on creating that shared lock. So we need a runtime-global lock that can guard creating new locks.

I suppose there may be other reasons (but I stopped at per-interpreter GIL :smile:).

How about a pair of new module slots, Py_mod_global_init and Py_mod_global_exit, combined with some ref. counting mechanism in, for example, the import machinery?

1 Like

That sounds great!
Scratch my idea :‍)

Could you elaborate on that a bit more? I’m not sure I follow.

Note that extensions may need to manage multiple external resources, not just one as in the case of the ODBC example, so having just a single ref count per loaded extension copy would not be enough.

E.g. let’s say an extension loads a C library and enables a number of extensions in that library. Interpreter A may have just used extension 1, while interpreter B uses extension 2. A would then want to free (just) the resources for extension 1 when terminating, while keeping the main C library globals and extension 2 untouched.

Having a global key-value storage for extension module copies to manage their resource state would help a lot and avoid much of the locking logic which would otherwise be needed in the extension.

It is similar to Petr’s idea (so we should not scratch that), but it simplifies the API for the extension modules by providing convenient PEP 489 module slots.

Pseudo-code without error checking, partly borrowed from PEP 489:

def PyModule_ExecDef(module, def):
    # ...

    exec = None
    g_init = None
    for slot, value in def.m_slots:
        if slot == Py_mod_exec:
            exec = value
        if slot == Py_mod_global_init:
            g_init = value  # In Petr's example, this would be setup_my_global_state()

    if g_init:
        acquire_global_lock()
        if global_state_refcnt(def) == 0:
            g_init(module)
        global_state_incref(def)  # incref on every load, under the lock
        release_global_lock()

    if exec:
        exec(module)

# Called when module is unloaded (or dealloc'd), for example at interpreter shutdown
def unload_module(module, def):
    # ...

    for slot, value in def.m_slots:
        if slot == Py_mod_global_exit:
            g_exit = value  # In Petr's example, this would be teardown_my_global_state()
            acquire_global_lock()
            global_state_decref(def)  # decref on every unload, under the lock
            if global_state_refcnt(def) == 0:
                g_exit(module)
            release_global_lock()

The nice thing about such an API is that extension modules can get rid of possibly-hard-to-get-right boilerplate locking code; they can simply provide functions for setup and tear-down of global state. The runtime will make sure to call these functions when needed; that responsibility is not on the extension module author.

Unless I’m misreading you, I believe you should be able to solve that using a global lock API and the ordinary Py_mod_exec and m_free/m_clear coupled with the proposed Py_mod_global_ slots.

1 Like

Extension modules already have storage for global state – plain static variables. We just need locking.

You’d need your own refcounting for those extensions (an array of refcounts, or a C-level map if the list of possible extensions isn’t known at build time).
In Py_mod_global_init you’d initialize that structure and allocate a lock to protect it.
When loading an extension, you’d add it to that structure (or incref), with the lock held.
When unloading an extension (in m_clear at the latest), remove/decref with the lock held.
In Py_mod_global_exit, the structure must be empty. Destroy it and the lock.

I don’t think Python should provide the C-level map (key-value storage).

2 Likes

Ok, fair enough. Python takes care of making sure that global init and teardown are protected with the global lock, and the modules have to manage their resources in some custom way.

Is it possible to use Python objects for such management? They’d have to be allocated in the main interpreter, but shared across all interpreters via static C vars in the extensions.

No.
(Technically some objects can be used across interpreters, but that’s an implementation detail – it would tie you to an exact build of CPython, and you’d need to re-verify the assumptions with each update, which would be pretty tricky for anything non-trivial. And depending on how per-interpreter allocators end up being implemented, you might not be able to allocate anything in a non-main interpreter, not even a string or a bigger int. I don’t think Python objects would help much given that constraint.)

1 Like

Hmm, so each extension will have to tackle the same problem on its own.

Please consider that each extension that deals with more than just a bit of data will have to:

  • count the number of times the extension is loaded (to figure out when to finalize)
  • refcount all shared resources (to figure out when to finalize those)
  • find its own way to communicate with instances running in other interpreters (to share allocated internal data structures for more efficient use, e.g. loaded models for ML [1])
  • figure out a way to do thread locking in a portable way to protect shared resources (since Python’s thread locking API likely won’t help with this, if I understand correctly – unless you want to halt all loaded interpreters using runtime-global locks)
  • (possibly more, which I’m not seeing now)

For existing external thread-safe C libraries, the above will mostly have been figured out in some way or another (e.g. ODBC comes with a complete handle infrastructure for these things), but think about extensions which currently rely on the GIL to protect them against threading issues and implement their logic mostly by themselves.

Those will now each need a complete new stack of APIs to handle the extra locking, sharing and refcounting.

IMO, it would be better, and would attract a wider following, to have support for these things right in the Python C API. This avoids many subtle bugs you can introduce in such APIs and is also more inviting for extension writers to consider adding support for multiple interpreters.


  1. One of the use cases for having multiple interpreters in one process was that of being able to share already loaded ML models. I don’t remember which company this was; it could have been Facebook. ↩︎

1 Like

If you’re hinting at a world where multiple interpreters are loaded into a single process using different allocators, I’m pretty sure the whole idea is doomed to fail :frowning:

This whole multi-interpreter thing is already complex enough.

  • count the number of times the extension is loaded (to figure out when to finalize)

As I understand Erlend’s proposal, Py_mod_global_init will only be called for the first module object to be initialized, and Py_mod_global_exit will be called after the last one is freed.
(There’s Py_mod_exec and m_free that are called for all modules, those would have the global state set up.)

  • refcount all shared resources (to figure out when to finalize those)

Yes.

  • find its own way to communicate with instances running in other interpreters (to share allocated internal data structures for more efficient use, e.g. loaded models for ML [1]

Yes, but it can use static variables protected by a lock (see below).

  • figure out a way to do thread locking in a portable way to protect shared resources (since Python’s thread locking API likely won’t help with this, if I understand correctly – unless you want to halt all loaded interpreters using runtime-global locks)

Python’s thread locking should definitely help here.
IMO, best practice would be to allocate a module-specific static lock in Py_mod_global_init, and free it in Py_mod_global_exit. You can even use several locks for more granularity.

Why?
If we don’t do that, we might need a global allocator lock – which sounds worse than the GIL, performance-wise.

2 Likes

That’s a correct interpretation. With such a mechanism, there should be no need for the extension module to keep track of refs for the global state in order to find out when to set it up and when to tear it down.

Sharing arbitrary objects between interpreters isn’t currently feasible (assuming a per-interpreter GIL) without a number of caveats and workarounds:

  • the object’s dealloc must run in the interpreter under which the object was created (the “owner”)
  • that means the object must not outlive its owning interpreter (so the main interpreter is the safest)
  • you’d have to use a global lock to protect any use of the object (like the GIL does now) or restrict all operations to the owning interpreter
  • there will probably still be some C-API that breaks under these conditions, which must be avoided (perhaps indefinitely)

A partial solution, when not in the owning interpreter, would be to use Py_AddPendingCall() to perform any operations on a shared object relative to its owning interpreter, which would be safest with the main interpreter. However, that approach is a bit clunky and suffers from the limitations of Py_AddPendingCall() (e.g. it is blocked while all that interpreter’s threads are blocked or running outside the eval loop).

So, in the end, I agree with Petr. Currently it really isn’t practical to store shared global state in Python objects. It might not be as tricky as I think, but I’m not sure it’s worth it either way.

There are two meanings for “different allocators”:

  • each interpreter can be set to use a different allocator (e.g. one uses glibc malloc and another mimalloc)
  • the allocators are still process-global but the state of the allocator (e.g. CPython’s “small block” allocator, AKA pymalloc) is per-interpreter

The first one is certainly an interesting idea but currently not feasible and likely not worth trying (due to complexity that doesn’t pay for itself). However, the second one is what PEP 684 proposes. I implemented it in a branch to verify it works.

Of course, things do break down under multiple interpreters for extensions that do not preserve isolation between them, but multi-phase init extensions promise to preserve that isolation. [1] If isolated extensions have a need for which the current solution would break isolation (or it’s easy to do so accidentally) then we should definitely provide API to avoid that (and, ideally, simplify the use case).

Clearly, this discussion is about such a case. The questions we’re still answering are:

  • what are the needs?
  • how might they break interpreter isolation?
  • what solutions are good enough?

[1] Single-phase init modules will only import in the main interpreter or in one created via the legacy Py_NewInterpreter().

1 Like

Regardless of where we go with a module-specific global state API, it sounds like an API that specifically facilitates creating module-specific global locks would be worth it. That would either be something direct (like what Petr posted earlier or what I posted) or with the existing lock API combined with the new moduledef slots Erlend outlined. I suppose I’m leaning toward the latter.

FWIW, we’ve bumped into the issue of sharing/managing module-specific global resources while porting the stdlib “syslog” module to multi-phase init, so a near-term solution is on my mind.

2 Likes

I understood your comment to mean that subinterpreters can each have their own memory allocator system. You probably meant: each subinterpreter will have its own instance of the allocator used for all interpreters.

This would only work for a known fixed number of such shared resources.

Let me play devil’s advocate:

Given the above constraints, the benefit of using multiple interpreters in a single process doesn’t appear to pay off. Data-oriented extensions, which, as I understand it, are the main drivers behind the idea of having subinterpreters, will end up having to implement their own way of sharing data between these interpreter instances.

As it stands, using separate processes with shared memory and e.g. PyArrow data structures to manage the sharing and avoid serialization overhead seems like the much easier way to use all cores on a machine. It also avoids the added complexity of subinterpreters, having to port extensions over to the new logic, and making them thread safe.

If subinterpreters really want to shine and provide scalability benefits over the multi-process architecture, Python will need to provide an easy-to-use, standard, and (thread-)safe way to share data between subinterpreters.

3 Likes

We should try to hash out a PEP for this for 3.13.

1 Like

FWIW, I’m pretty sure an API for runtime-global locks will also help extensions that support no-gil.

2 Likes

I can try to adapt my earlier post in this thread to a PEP draft.

1 Like