C-API for initializing statically linked extension modules

An update on this: we’re running experiments to gather more detailed data on the distribution of the overhead from using inittab vs. our unordered_map solution, and on whether this overhead is significant enough to justify non-trivial optimizations.
The interesting bit I’m trying to separate out in the data is how much of the overhead comes from the initial copy (when using PyImport_ExtendInittab) vs. the runtime lookup on every import.

Since PyImport_ExtendInittab is public API / stable ABI, we will not be able to change its signature, so changing the internal data structure backing inittab (say, from an array to a _Py_hashtable) would regress the performance of that API: it would need to make N inserts into a hash table instead of a single memcpy.
I don’t know whether such a regression would be acceptable (I guess it depends on how large it is).

Such a regression seems relevant only to users who add lots and lots of inittab entries. I can’t imagine a use case other than a giant monorepo.

I bet that if you give such users a faster alternative API, they’ll be happy to switch.

I updated PEP 741 to add a PyInitConfig_AddModule() function; see: PEP 741: Python Configuration C API (second version) discussion.

Some findings from experimenting with 10,000 (synthetic) extension modules.

Experiment Setup

We generated 10,000 trivial C extensions (all look like this), and imported them in a loop:

import time

N = 10_000
import_ts = {n: 0.0 for n in range(N)}  # module index -> elapsed time at import
start = time.time()
for i in range(N):
    __import__(f"hello_ext_{i}")
    import_ts[i] = time.time() - start

Inittab

The overhead from extending inittab is quite small (about 200µs on my machine). The bulk of the runtime overhead comes from looking up entries in the inittab at import time. Since the lookup is a linear scan of the inittab array, its cost is proportional to the extension’s position in the array. On my machine, imports near the head of the list took approximately 100µs, while imports near the tail took over 1ms. Overall, the loop took about 4 seconds.
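
For reference, the lookup being measured amounts to a linear scan like this (a simplified sketch, not CPython’s actual import.c code; struct _inittab is the public type from Include/import.h):

#include <Python.h>
#include <string.h>

typedef PyObject *(*InitFunc)(void);

// Walk the NULL-terminated inittab array until the name matches; the cost
// is proportional to the entry's position, which explains ~100µs near the
// head vs. >1ms near the tail of a 10,000-entry table.
static InitFunc find_initfunc(const struct _inittab *tab, const char *name)
{
    for (const struct _inittab *p = tab; p->name != NULL; p++) {
        if (strcmp(p->name, name) == 0) {
            return p->initfunc;
        }
    }
    return NULL;  // not statically linked; other importers get a chance
}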

Unordered Map

Using the unordered map approach described earlier in the thread, the lookup overhead was approximately constant, with each import taking between 20µs and 30µs. Overall, the loop took about 250ms.
Since in our implementation the unordered map is baked in at build time, there’s no runtime initialization overhead. I did measure separately that populating an unordered map at runtime took 3ms on my machine (not using _Py_hashtable, but it gives a ballpark estimate).
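
For concreteness, the runtime-population variant I measured was along these lines (a sketch, not our actual implementation):

#include <Python.h>
#include <string_view>
#include <unordered_map>

typedef PyObject *(*InitFunc)(void);

// Build a name -> initfunc index once; every import is then an O(1) hash
// lookup instead of an O(position) scan. The string_view keys point into
// the inittab entries, which live for the lifetime of the process.
static std::unordered_map<std::string_view, InitFunc>
build_index(const struct _inittab *tab)
{
    std::unordered_map<std::string_view, InitFunc> index;
    for (const struct _inittab *p = tab; p->name != nullptr; ++p) {
        index.emplace(p->name, p->initfunc);  // ~3ms total for 10,000 entries
    }
    return index;
}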

Discussion

I see 3 possible directions to proceed:

  1. Don’t change anything. Arguably, ~1ms max overhead per import is acceptable and likely negligible compared to the “real work” (which doesn’t exist in this synthetic benchmark).
  2. Change the internal representation of inittab to use _Py_hashtable, and update the append/extend APIs to maintain the hashtable. No existing public APIs change and none are added, but the performance of the append/extend APIs regresses slightly.
  3. Design a new API (possibly “unstable” initially) that can coexist with the append/extend inittab APIs (and a new PyInitConfig_AddModule). The API I have in mind registers an “inittab callback function” (takes a module name string, returns an initfunc). If such a callback is registered, it is consulted before scanning the inittab array; if it doesn’t return an initfunc, we fall back to the scan. In our scenario, this would let us use the pre-baked unordered map directly from the callback and elide the init-time overhead. (A rough sketch follows this list.)
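
Here’s a rough sketch of the callback shape I have in mind for option #3 (every name here is hypothetical, not existing CPython API):

#include <Python.h>
#include <string_view>
#include <unordered_map>

typedef PyObject *(*InitFunc)(void);
typedef InitFunc (*InittabLookupFunc)(const char *name);

// Hypothetical registration API: if a callback is set, it is consulted
// before the inittab scan; returning NULL falls back to the array.
extern "C" int PyUnstable_Import_SetInittabLookup(InittabLookupFunc lookup);

extern "C" PyObject *PyInit_hello_ext_0(void);  // one of the generated extensions

// Stand-in for the unordered map we bake at build time.
static const std::unordered_map<std::string_view, InitFunc> baked_index = {
    {"hello_ext_0", PyInit_hello_ext_0},
    // ...
};

static InitFunc lookup_baked(const char *name)
{
    auto it = baked_index.find(name);
    return it != baked_index.end() ? it->second : nullptr;
}

// Registered once, before Py_Initialize():
//   PyUnstable_Import_SetInittabLookup(lookup_baked);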

My personal preference is for option #3.
Thoughts?

#3 sounds rather like a custom import hook. I’d rather expose PyUnstable_Import_SwapPackageContext and have you use that.

Since initialization copies the inittab to the private _PyRuntime.imports.inittab, and we don’t allow further calls to PyImport_{Append|Extend}Inittab, there’s:

#4. In _PyImport_Init, convert the inittab to a hashtable instead of copying the array. Then use the hash table for imports.
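
Roughly (a sketch assuming the private _Py_hashtable API from pycore_hashtable.h, so Py_BUILD_CORE territory):

#include <Python.h>
#include <string.h>
#include "pycore_hashtable.h"  // internal _Py_hashtable API

// Hash/compare module names (C strings) for use as table keys.
static Py_uhash_t hash_name(const void *key)
{
    const char *name = (const char *)key;
    return (Py_uhash_t)_Py_HashBytes(name, (Py_ssize_t)strlen(name));
}

static int names_equal(const void *key1, const void *key2)
{
    return strcmp((const char *)key1, (const char *)key2) == 0;
}

// In _PyImport_Init: build the table once instead of memcpy'ing the array.
static _Py_hashtable_t *inittab_to_hashtable(const struct _inittab *tab)
{
    _Py_hashtable_t *ht = _Py_hashtable_new(hash_name, names_equal);
    if (ht == NULL) {
        return NULL;
    }
    for (const struct _inittab *p = tab; p->name != NULL; p++) {
        // The init function pointer is stored as the value (cast via void *).
        if (_Py_hashtable_set(ht, p->name, (void *)p->initfunc) < 0) {
            _Py_hashtable_destroy(ht);
            return NULL;
        }
    }
    return ht;
}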


I don’t think adding modules one by one would be fast enough for this use case.

I concur with this. Also, note that 1ms is when you have 10_000 extension modules loaded, which is a rather pathological case (does it occur in the real world?).

True, it is quite a pathological case, but also one based on real-world analysis of large-ish Python applications in the Meta monorepo (if we can call this “real world”). I can’t share exact numbers, but this is the order of magnitude of C++ extensions in such applications.

I’m fine with any option, including exposing PyUnstable_Import_SwapPackageContext, or converting to a hashtable in _PyImport_Init. :)

Who would be the best expert to make a decision on which option to pursue? I’ll be happy to go ahead and file an issue and work on a PR, given a blessed option.

Maybe I’m that expert – I designed this API 34 years ago, before I knew about dynamic linking, expecting maybe a dozen extension modules at most…

It sounds like you’re unhappy that it takes 4 seconds to import 10,000 extensions this way, and I can understand that.

The solution would be along the lines of something that scans the array once and inserts everything into a hash map, right?

How much of that could be done in a 3rd party extension, and what’s the minimal API to add to CPython itself? I imagine the key thing is that we need to rebuild the hash map if the array has been modified. But aren’t you also in control of changes to that array? That would lead to the solution suggested by Petr, an import hook.

What am I missing?

This is one of the possible solutions, yes.

I think there’s no issue with rebuilding anything, if we build a hashtable during initialization. The inittab append/extend APIs cannot be called after initialization, so I think this means the array is guaranteed to not change anymore post-init.

An import hook is what we’ve been using in our implementation for Python 3.8 and 3.10. In that import hook we found we needed a private CPython API to support pybind11 submodules. When adding support for Python 3.12, I wasn’t able to do something similar without patching CPython, which is what prompted this discussion.
I think the minimal API we’re missing to keep doing what we’ve done previously, without patching CPython, is to expose the existing _PyImport_SwapPackageContext so we can use it from the import hook, which is one of the solutions Petr suggested. If we do that, we don’t need to change anything in the inittab system.
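
The core of what the import hook needs is roughly this (a sketch; _PyImport_SwapPackageContext is currently internal, and I’m assuming its private signature here):

#include <Python.h>

// Assumed private signature (from import.c): swap the package context,
// returning the previous one.
extern "C" const char *_PyImport_SwapPackageContext(const char *newcontext);

// Create a statically linked submodule via single-phase init; the package
// context is how the full dotted name reaches module creation.
static PyObject *init_with_fullname(const char *fullname,
                                    PyObject *(*initfunc)(void))
{
    const char *oldcontext = _PyImport_SwapPackageContext(fullname);
    PyObject *mod = initfunc();
    _PyImport_SwapPackageContext(oldcontext);
    return mod;
}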

Ah, it turns out I may not be the right expert – I haven’t the faintest idea of what that “context” contains or what it’s for. I can tell it’s a char * and it’s used for single-phase init. I can also see that this context, which used to be a C global, is now incorporated in the private _PyRuntime struct, as part of the struct _import_runtime_state, which also holds the inittab.

A comment there suggests to me that the “package context” is the full module name, which is apparently needed when an extension module is initialized. A comment around L700 in import.c explains this some more (it appears to be due to an API design flaw – IIRC these APIs were set in stone before we made the module namespace hierarchical).

All of this is not unique to inittab – it’s used for dynamically loaded modules too, the common factor is single-phase init.

(Surely I’m telling you nothing you don’t already know. I had forgotten all about this myself though. :-)

It does look like the best solution is to add some “unstable” API – unstable (in the PEP 689 sense) because this is all related to the legacy of single-phase init, which we would want to rid ourselves of in some distant future. This is where the C API WG comes in, which Petr and I can probably channel.

The two main candidates are:

  1. a low-level API exposing the package-context swap (PyUnstable_Import_SwapPackageContext);
  2. a higher-level API that creates the module object given its full name and init function.

I have a feeling that the high-level API has a higher chance of surviving future refactorings of the internals here than the low-level API (e.g. there was some Emscripten-specific code here in 3.12 but it’s gone in 3.13). I don’t feel the need to push back further on “why is this important” (though the full C API WG might – in particular I’d like to hear from @steve.dower in this regard).

This would leave the question of naming. It took me some time to figure out what was meant by “context”. Assuming we all agree that it’s simply the full package name, maybe we should rename it PyUnstable_CallInitFuncWithFullName? Although I can also see the merit in keeping the Context term, which is used consistently in the implementation.

The expert for this appears to be @eric.snow – Eric, do you have any further insights here?

The internals here feel very unstable to me, in the sense that we’d just replace the whole thing with a more direct approach if we had infinite resources. Plus we’ve already essentially deprecated single-phase init (both subinterpreters and nogil will block it or warn severely about it, IIUC).

I think Context is intended to mean it’s opaque, so this renaming probably sacrifices that. But I suspect it’s fully internal right now, so making it public at all necessitates adding more functions to create/set/free it.

Is there a higher-level function we can offer, perhaps - PyImport_ImportByInitFunc(const char *fullname, initfunc func)? That way a custom importer can find/know the function itself and turn it into a module object, which seems to be the main aim here. (Presumably such a function already exists for ourselves to import from inittab.)
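
Usage from a custom importer would then be something like this (a sketch; PyImport_ImportByInitFunc is the proposed name, not an existing function, and lookup_initfunc is a hypothetical helper for however the importer finds the function):

#include <Python.h>

typedef PyObject *(*InitFunc)(void);

// The proposed function (does not exist yet).
extern "C" PyObject *PyImport_ImportByInitFunc(const char *fullname,
                                               InitFunc initfunc);

// Hypothetical helper: however the custom importer maps a name to an init
// function (pre-baked map, dlsym, ...).
InitFunc lookup_initfunc(const char *fullname);

static PyObject *import_static_module(const char *fullname)
{
    InitFunc init = lookup_initfunc(fullname);
    if (init == NULL) {
        return NULL;  // not one of ours; the next importer gets a chance
    }
    // Would wrap the package-context handling and single-phase init.
    return PyImport_ImportByInitFunc(fullname, init);
}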

(Also agree on getting @eric.snow’s thoughts, particularly around how likely we are to refactor import and/or totally deprecate single-phase init.)

I think this is exactly the function we added here with a different name (and the “high level” candidate Guido described above).

I filed an issue (Expose a C-API function to allow custom importers to create a module using an init function · Issue #116146 · python/cpython · GitHub) to finalize the details (assuming the PyImport_ImportByInitFunc approach is the preferred direction).

FWIW, Google has the same need as Meta here. We have a slightly different solution, which is to rename the module init function and use dlsym() to load it. The unique name is generated from the package as well as the module, with underscores doubled and dots replaced by single underscores (pkg.foo_bar becomes PackagePyInit_pkg_foo__bar), to avoid the most obvious name clashes (like foo.bar vs foo_bar). This relies on symbol resolution being optimised, and I haven’t been able to benchmark it against the other solutions, but given that Google tends to build massive binaries with giant symbol tables and this hasn’t shown up as a performance problem, I think it performs well enough.
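
In code, the renaming scheme is roughly this (a sketch of the mangling as described above, not our actual implementation):

#include <string>
#include <string_view>

// Double '_' and turn '.' into '_', with a fixed prefix, so that e.g.
// "pkg.foo_bar" becomes "PackagePyInit_pkg_foo__bar" while "foo.bar" and
// "foo_bar" still mangle to distinct names.
static std::string mangle_init_name(std::string_view fullname)
{
    std::string out = "PackagePyInit_";
    for (char c : fullname) {
        if (c == '_') {
            out += "__";
        }
        else if (c == '.') {
            out += '_';
        }
        else {
            out += c;
        }
    }
    return out;
}

// Loading is then a single symbol lookup, e.g.:
//   dlsym(RTLD_DEFAULT, mangle_init_name(fullname).c_str())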

We also, as mentioned before, load .so files from ZIP archives, where they are stored uncompressed and page-aligned. That’s closer to regular .so files, but in both cases we have to poke at quite a few internals to make it work. The proposed ByInitFunc function would work for both of those.

(We currently build both of those importers with Py_BUILD_CORE enabled, so access to internals is fairly easy; we just have to do complex updates every time we upgrade Python. :-P)
