C-API for initializing statically linked extension modules

An update on this: we’re running experiments to gather more detailed data on the distribution of the overhead from using inittab vs. our unordered_map solution, and on whether this overhead is significant enough to justify non-trivial optimizations.
The interesting bit I’m trying to separate out in the data is how much of the overhead comes from the initial copy (when using PyImport_ExtendInittab) vs. the runtime lookup on every import.

Since PyImport_ExtendInittab is public API / stable ABI, we will not be able to change its signature, so changing the internal data structure backing inittab (say, from an array to a _Py_hashtable) would regress the performance of that API (it would need to make N inserts into a hash table instead of a single memcpy).
I don’t know if such a regression would be acceptable (I guess it depends on how much of a regression it is).

Such a regression seems relevant only to users who add lots and lots of inittab entries. I can’t imagine a use case other than a giant monorepo.

I bet that if you give such users a faster alternative API, they’ll be happy to switch.


I updated PEP 741 to add a PyInitConfig_AddModule() function; see: PEP 741: Python Configuration C API (second version) discussion.


Some findings from experimenting with 10,000 (synthetic) extension modules.

Experiment Setup

We generated 10,000 trivial C extensions (all look like this), and imported them in a loop:

import time

N = 10_000
import_ts = {n: 0.0 for n in range(N)}  # timestamp recorded after each import

start = time.time()
for i in range(N):
    __import__(f"hello_ext_{i}")
    import_ts[i] = time.time()
print(f"total: {time.time() - start:.2f}s")

Inittab

The overhead from extending inittab is quite small (took about 200µs on my machine). The bulk of the runtime overhead is from looking up entries in the inittab at import time. Since it’s a linear scan of inittab array, the lookup time is proportional to the position of the extension in the array. On my machine, imports close to the head of the list took approx 100µs, while imports close to the tail of the list took over 1ms. Overall that loop took about 4 seconds.

Unordered Map

Using the unordered map approach described earlier in the thread, the lookup overhead was approximately constant, with each import taking between 20µs and 30µs. Overall the loop took about 250ms.
Since in our implementation the unordered map is baked during the build, there’s no runtime initialization overhead. I did measure separately that populating an unordered map at runtime took 3ms on my machine (not using _Py_hashtable, but gives us a ballpark estimate).

Discussion

I see 3 possible directions to proceed:

  1. Don’t change anything. Arguably, ~1ms max overhead per import is acceptable and likely negligible compared to the “real work” (that doesn’t exist in this synthetic benchmark).
  2. Change the internal representation of inittab to use _Py_hashtable and the implementation of the append/extend APIs to update the hashtable. No need to change existing public APIs or add new ones, but the performance of the append/extend API regresses slightly.
  3. Design a new API (could be “unstable” initially) that can exist side-by-side with the append/extend inittab APIs (and a new PyInitConfig_AddModule). The API I have in mind registers an “inittab callback function” (takes a module name string, returns an initfunc). If such a callback is registered, it is consulted before scanning the inittab array; if it doesn’t return an initfunc, we fall back to scanning the inittab array. In our scenario, we could use the callback to consult the pre-baked unordered map directly, eliding the init-time overhead.

My personal preference is for option #3.
Thoughts?


#3 sounds rather like a custom import hook. I’d rather expose PyUnstable_Import_SwapPackageContext and have you use that.

Since initialization copies the inittab to the private _PyRuntime.imports.inittab, and we don’t allow further calls to PyImport_{Append|Extend}Inittab, there’s:

#4. In _PyImport_Init, convert the inittab to a hashtable instead of copying the array. Then use the hash table for imports.


I don’t think adding modules one by one would be fast enough for this use case.


I concur with this. Also, note that 1ms is when you have 10_000 extension modules loaded, which is a rather pathological case (does it occur in the real world?).


True, it is quite a pathological case, but also one based on real world analysis of large-ish python applications in the Meta monorepo (if we can call this “real world”). I can’t share exact numbers, but this is the order of magnitude of C++ extensions in such applications.


I’m fine with any option, including exposing PyUnstable_Import_SwapPackageContext, or converting to a hashtable in _PyImport_Init :slight_smile:

Who would be the best expert to make a decision on which option to pursue? I’ll be happy to go ahead and file an issue and work on a PR, given a blessed option.

Maybe I’m that expert – I designed this API 34 years ago, before I knew about dynamic linking, expecting maybe a dozen extension modules at most…

It sounds like you’re unhappy that it takes 4 seconds to import 10,000 extensions this way, and I can understand that.

The solution would be along the lines of something that scans the array once and inserts everything into a hash map, right?

How much of that could be done in a 3rd party extension, and what’s the minimal API to add to CPython itself? I imagine the key thing is that we need to rebuild the hash map if the array has been modified. But aren’t you also in control of changes to that array? That would lead to the solution suggested by Petr, an import hook.

What am I missing?


This is one of the possible solutions, yes.

I think there’s no issue with rebuilding anything, if we build a hashtable during initialization. The inittab append/extend APIs cannot be called after initialization, so I think this means the array is guaranteed to not change anymore post-init.

Import hook is what we’ve been using in our implementation for Python 3.8 and 3.10. In that import hook we found we needed to use a private CPython API to support pybind11 submodules. When adding support for Python 3.12, I wasn’t able to do something similar without patching CPython, which is what prompted this discussion.
I think the minimal API we’re missing to continue doing what we’ve done previously without patching CPython is exposing the existing _PyImport_SwapPackageContext so we can use it from the import hook, which is one of the solutions Petr suggested. If we do that, we don’t need to change anything in the inittab system.


Ah, it turns out I may not be the right expert – I haven’t the faintest idea of what that “context” contains or what it’s for. I can tell it’s a char * and it’s used for single-phase init. I can also see that this context, which used to be a C global, is now incorporated in the private _PyRuntime struct, as part of the struct _import_runtime_state, which also holds the inittab.

A comment there suggests to me that the “package context” is the full module name, which is apparently needed when an extension module is initialized. A comment around L700 in import.c explains this some more (it appears due to an API design flaw – IIRC these APIs were set in stone before we made the module namespace hierarchical).

All of this is not unique to inittab – it’s used for dynamically loaded modules too, the common factor is single-phase init.

(Surely I’m telling you nothing you don’t already know. I had forgotten all about this myself though. :slight_smile:)

It does look like the best solution is to add some “unstable” API – unstable (in the PEP 689 sense) because this is all related to the legacy of single-phase init, which we would want to rid ourselves of in some distant future. This is where the C API WG comes in, which Petr and I can probably channel.

The two main candidates are:

I have a feeling that the high-level API has a higher chance of surviving future refactorings of the internals here than the low-level API (e.g. there was some Emscripten-specific code here in 3.12 but it’s gone in 3.13). I don’t feel the need to push back further on “why is this important” (though the full C API WG might – in particular I’d like to hear from @steve.dower in this regard).

This would leave the question of naming. It took me some time to figure out what was meant by “context”. Assuming we all agree that it’s simply the full package name, maybe we should rename it PyUnstable_CallInitFuncWithFullName? Although I can also see the merit in keeping the Context term, which is used consistently in the implementation.

The expert for this appears to be @eric.snow – Eric, do you have any further insights here?


The internals here feel very unstable to me, in the sense that we’d just replace the whole thing with a more direct approach if we had infinite resources. Plus we’ve already essentially deprecated single-phase init (both subinterpreters and nogil will block/severely warn about it, IIUC).

I think Context is intended to mean it’s opaque, so this renaming probably sacrifices that. But I suspect it’s fully internal right now, so making it public at all necessitates adding more functions to create/set/free it.

Is there a higher level function we can offer, perhaps - PyImport_ImportByInitFunc(const char *fullname, initfunc func)? That way a custom importer can find/know the function itself and turn it into a module object, which seems to be the main aim here. (Presumably such a function already exists for ourselves to import from inittab.)

(Also agree on getting @eric.snow’s thoughts, particularly around how likely we are to refactor import and/or totally deprecate single-phase init.)


I think this is exactly the function we added here with a different name (and the “high level” candidate Guido described above).


I filed an issue (Expose a C-API function to allow custom importers to create a module using an init function · Issue #116146 · python/cpython · GitHub) to finalize the details (assuming the PyImport_ImportByInitFunc approach is the preferred direction).


FWIW, Google has the same need as Meta here. We have a slightly different solution, which is to rename the module init function and use dlsym() to load it. The unique name is generated from the package as well as the module, with underscores doubled and dots replaced by single underscores (pkg.foo_bar becomes PackagePyInit_pkg_foo__bar) to avoid the most obvious name clashes (like foo.bar and foo_bar). This relies on symbol resolution being optimised, and I haven’t been able to benchmark it compared to other solutions, but given that Google has a tendency to build massive binaries with giant symbol tables and it hasn’t shown up as a performance problem, I think it’s performing well enough.

We also, as mentioned before, load .so files from ZIP archives, where they are stored uncompressed and page-aligned. That’s closer to regular .so files, but in both cases we have to poke at quite a few internals to make it work. The proposed ByInitFunc function would work for both of those.

(We currently build both of those importers with Py_BUILD_CORE enabled so access to internals is fairly easy, we just have to do complex updates every time we upgrade Python :-P)


OK, I’ve played around and I think I know my way around the design space here.
There are 3 things I’d like to improve around the inittab. My draft solutions below build on each other, but that’s just for narrative effect; we can solve them individually too.
This topic is addressed in #2.

Let me know if it looks worth pushing through.

1. Allowing slots (PEP 793 followup)

PEP 793 added a new way to specify modules: slots rather than inittab’s initialization function.
While it’s not impossible to wrap slots in a PyModule_Def and an init function, there’s an interpreter-switching dance that CPython can avoid if we can feed it slots directly.

We can use a struct with a tagged union, roughly like this:

typedef struct PyImport_BuiltinInfo {
    const char *name;
    uint16_t type; // chooses the variant of the union below
    union {
        PyModuleDef_Slot *slots;
        PyObject* (*initfunc)(void);
    };
} PyImport_BuiltinInfo;

Then, we add PyConfig->extra_builtin_modules, which users can set to an array of these.
This covers use cases for PyImport_AppendInittab and PyImport_ExtendInittab, except:

  • If the user needs to combine several tables, they need to use their own growable-array implementation and pass the final result to CPython.
  • This does not contain CPython’s builtin modules. Unlike with PyImport_Inittab, the defaults can’t be shadowed or removed.

AFAICS, this would be a surprisingly small change. The existing PyImport_Inittab is already copied to a private immutable space at runtime startup; at that point it can be converted to the new format and combined with extra_builtin_modules.

There’ll be a bit of a challenge designing a PyInitConfig_Set* function for this, but I’d like to hold off on that until things get more concrete.

2. Custom lookup

Python uses a linear search to scan the array, which is fine if there are only a few entries. Should we optimize that for Meta/Google scale?
I don’t think so. Users are in a better place to tailor this to their needs.
Meta uses a C++ std::unordered_map. If you’re generating things statically (the intended use case), you might generate a switch-statement trie or something and let the compiler chew on it.
We don’t want to generalize that in CPython.

My proposal is to add two function pointers (and one data pointer) to PyConfig, which embedders can set to their own implementation:

// Look up an entry & copy it to caller-allocated *result.
// Return 1 on success, 0 on missing, -1 with exception set on other error.
int lookup_builtin_info(const char *name, struct PyImport_BuiltinInfo *result, void *arg);

// Implement `sys.builtin_module_names`
// (as an arbitrary iterable: deduping, sorting & converting to tuple
//   is left to caller)
PyObject *get_builtin_names(void *arg);

// arbitrary data passed to the callbacks
void *builtin_callback_arg;

We also add two functions with the same signatures, which provide default/existing behaviour (i.e. handle stdlib/core builtin modules, PyImport_Inittab, and extra_builtin_modules proposed above). The embedder is expected to call those for fallback/base behaviour.

This would cover all use cases of extra_builtin_modules, making extra_builtin_modules redundant, but it’s harder to use. I’d be fine with leaving extra_builtin_modules out if it doesn’t add enough value.

Note that there is no API to add a module to the underlying collection. The embedder can provide that if they want, answering questions like:

  • Should new entries override old ones, or should they act as “setdefault”?
  • Should new entries be static (outlive the PyConfig), or does the implementation copy them?
  • Can entries be added after the first lookup?

If we do add extra_builtin_modules, you can use it to add one-off modules that become part of the suggested fallback behaviour.

3. Do frozen modules too

Looking around, I noticed that frozen module lookup is quite similar to these.
The current code uses 4 arrays of modules (some optional, some internal), with ad-hoc logic to enable them. It already uses “look_up_frozen” & “list_frozen_module_names” lookup functions as its abstraction, since Eric Snow’s cleanup back in 2021; the change would be to move the user-configurable part from editing an array to replacing a function.

And AFAIK, projects like Cython (and their users) would benefit from us making builtin modules more similar to frozen (or Python) ones.

So, the third idea is:

  • allow struct _frozen as another variant in PyImport_BuiltinInfo union
  • add a type argument to lookup_builtin_info/get_builtin_names, to allow selecting either builtin or frozen modules, and rename them appropriately. (BuiltinImporter and FrozenImporter are separate, but at the import.c level, the difference can be similar to the difference between slots-based and initfunc-based modules.)

I’m not proposing to (initially) allow specifying builtin modules using struct _frozen, or frozen ones using slots/initfunc. But it would make sense to me to combine the lookup configuration (i.e. only add two PyConfig functions, not four).