PEP 741: Python Configuration C API

There are already programs embedding Python which target the stable ABI, such as the vim text editor. The problem is that vim uses functions which are deprecated and scheduled for removal, such as Py_SetPythonHome() (removed in Python 3.13). The goal of PEP 741 is to design a long-term solution for such use cases.

Examples of programs embedding Python, some using the stable ABI and some not:

  • LibreOffice uses “pyuno” for Python scripting: see pyuno_loader.cxx. On Fedora, /usr/lib64/libreoffice/program/libpythonloaderlo.so is loaded dynamically and is linked to libpython3.12.so.1.0.
  • vim python3/dyn loads /usr/lib64/libpython3.12.so.1.0 dynamically. It can use the Python stable ABI (if built for that). See Python3_Init() in src/if_python3.c which calls functions such as Py_SetPythonHome(), PyImport_AppendInittab() and Py_Initialize().
  • Other examples of applications embedding Python: Blender, fontforge, py2app, pyinstaller.
  • Study on applications embedding Python (2019)
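Concretely, the embedding pattern used by hosts like vim boils down to the legacy sequence below. This is a sketch, not vim’s actual code; the hypothetical `host_init()` and `vim_module_init()` names are mine, and `Py_SetPythonHome()` is exactly the kind of deprecated function the PEP wants to replace:

```c
#include <Python.h>

/* Hypothetical inittab entry, standing in for vim's "vim" module. */
static PyObject *vim_module_init(void)
{
    static PyModuleDef def = {PyModuleDef_HEAD_INIT, "vim", NULL, -1, NULL};
    return PyModule_Create(&def);
}

/* Sketch of the legacy embedding sequence (see Python3_Init() in
 * vim's src/if_python3.c for the real thing). */
int host_init(const wchar_t *home)
{
    Py_SetPythonHome(home);                 /* deprecated since 3.11 */
    PyImport_AppendInittab("vim", vim_module_init);
    Py_Initialize();                        /* cannot report errors */
    return Py_IsInitialized() ? 0 : -1;
}
```

Note that `Py_Initialize()` has no way to report failure to the host, which is part of why the legacy path is fragile.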

The “legacy” API, which predates PyConfig, can lead to inconsistent configuration, and its implementation is expensive to maintain since it’s a separate code path that requires special care. It was deprecated gradually in Python 3.11, 3.12 and 3.13, as explained in PEP 741. PEP 741 is not about deprecating new APIs, but about providing a long-term replacement for APIs which are already deprecated.
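For comparison, the non-legacy path since PEP 587 (Python 3.8) goes through PyConfig. A minimal sketch of replacing `Py_SetPythonHome()` this way (the `host_init()` wrapper is mine; note this API is *not* part of the limited C API, which is the gap PEP 741 targets):

```c
#include <Python.h>

/* Sketch: PEP 587 (Python 3.8+) replacement for Py_SetPythonHome().
 * Not available via the limited C API / stable ABI. */
int host_init(const wchar_t *home)
{
    PyConfig config;
    PyConfig_InitPythonConfig(&config);

    PyStatus status = PyConfig_SetString(&config, &config.home, home);
    if (PyStatus_Exception(status)) {
        PyConfig_Clear(&config);
        return -1;
    }

    /* Unlike Py_Initialize(), this reports errors to the caller. */
    status = Py_InitializeFromConfig(&config);
    PyConfig_Clear(&config);
    return PyStatus_Exception(status) ? -1 : 0;
}
```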

2 Likes

On Linux, it’s done differently: Python is always loaded as a dynamic library, as shown in my previous message. For example, the Fedora Packaging Guidelines are against embedding a copy of libpython in your application. This simplifies maintenance when bugs and security vulnerabilities are fixed in libpython (well, that’s how things are done in Fedora, at least).

I’m not sure what you’re talking about when you refer to “loading arbitrary code”. Are you referring to libpython? The operating system takes care of what’s installed in the /lib and /lib64 directories: only the administrator is allowed to write there, and it’s managed by the package manager, which takes care of security (validating file signatures and checksums). libpython is not “arbitrary code”.

1 Like

That’s brave of them. We don’t have to go out of our way to make it a supported scenario.

What you mean here is that the system install of Python is always loaded.

Right, because Fedora wants to unbundle everything when they rebuild and redistribute. That’s fine, but then it’s their responsibility to make sure it all works. They’re also in a position to make sure your application depends on the matching version of Python and so the file will be there.

For those of us who are not Linux distros, I argue you should never assume that a suitable copy of Python is installed. Instead, it should be vendored.

Provided your application is guaranteed to only load from those directories, then it’s fine. It’s the equivalent of Windows apps loading from Program Files or Windows\System32. And yet somehow, all platforms have search path hijacking vulnerabilities, despite there being a “safe” option :man_shrugging: Maybe it’s possible to get this wrong?

The “arbitrary” is by loading a version that you don’t expect (e.g. user points you at Python 3.12, but you built/tested with 3.11), which inherently means you are referencing a path that does not exist as part of your own app. It is the responsibility of users to make sure the file at the end of that path is valid for your app, but the publisher still has to declare a security vulnerability if it’s not. All of which is avoided by having your own copy of Python that is inside your app, so you don’t have to reference a path outside of your app.

That doesn’t sound particularly “stable” to me… :slightly_frowning_face:

2 Likes

Functions such as Py_SetPythonHome() are still part of Python 3.13 stable ABI, but were removed from the limited C API version 3.13. It’s an “ABI only” function. The limited C API and the stable ABI are two different things, and I’m always confused between the two.
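The distinction shows up at compile time versus link/load time. A sketch (assuming Python 3.13 headers; the `#if 0` block marks code that would no longer compile):

```c
/* Limited C API vs stable ABI, illustrated:
 * - The limited C API is a set of *declarations*, selected by
 *   defining Py_LIMITED_API before including Python.h.
 * - The stable ABI is the set of *exported symbols* that libpython
 *   must keep so that previously built binaries keep loading. */
#define Py_LIMITED_API 0x030d0000   /* target the 3.13 limited API */
#include <Python.h>

int main(void)
{
#if 0
    /* Removed from the 3.13 limited API: this would fail to compile.
     * Yet the symbol remains exported (stable ABI), so old binaries
     * that call it still load. */
    Py_SetPythonHome(L"/opt/python");
#endif
    Py_Initialize();   /* still part of the limited API */
    Py_Finalize();
    return 0;
}
```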

2 Likes

Why not, though? Let’s take mod_wsgi for example. I’m not sure what it uses internally, but it currently has the limitation that its daemon processes have to use the same Python version mod_wsgi was compiled for. This is quite inflexible if users want to have several independent WSGI applications running on different Python versions.

I suppose that, if mod_wsgi could use the stable ABI, this restriction could be lifted.

1 Like

That one seems more likely to be due to running multiple different versions of Python simultaneously in the same process. And if you’re launching completely new child processes, how do you tell each one which Python library to use? Again, we’re back at loading arbitrary code and hoping that it works, so they’d still recommend against doing it, even if it “worked”.

I also recommend against doing this, wherever possible[1] :slight_smile: So I’ll at least claim that I’m consistent, even if people look at my advice and choose to ignore it.


  1. Assuming you mean independent apps on the same server without any sort of isolation or containerisation. Which seems a fair assumption given the context. ↩︎

You don’t have to use past tense about 3.13. Right now, apparently, we are planning to remove it, but haven’t yet.

It seems we don’t want to remove it yet, based on earlier posts.

1 Like

This is talking about several daemon process instances.

Using configuration values perhaps? Configuring Python initialization using the stable ABI is precisely the problem the OP is trying to solve here.

So, do you think the stable ABI should be deprecated? After all, “loading arbitrary code” is exactly what happens when an arbitrary Python version is resolved against an extension module compiled against the stable ABI.

Can we come back to plausible usages of Python, instead of personal preferences?

We can’t even configure initialization reliably with a non-stable ABI right now. Why do we think we can stabilize it? Especially with nogil and JIT in the pipeline, which will impact initialization in the next couple of releases.

The stable ABI is great for something that is loaded into an active Python runtime, like an extension module. In that situation, you can build and validate a module in a version-agnostic manner.

I don’t believe that applies to hosting/embedding, where you are taking responsibility for more than just implementing functions called from Python. Of course, there’s nothing wrong with hosting Python and then importing an extension module that uses the stable ABI. It’s just the initialization, management and finalization responsibilities that I don’t think can be done through a very limited API or without being aware of their exact, version-specific semantics (or if they are, you may as well run Python out of proc, which generally makes things simpler from an implementation point of view).

Are you suggesting that running a single process containing apps that are so unrelated they can’t even use the same version of Python is plausible? Because that doesn’t sound plausible to me.

Again: this is talking about several daemon process instances.

What do JIT and nogil have to do with these configuration variables?

Is that not precisely what this PEP is about? After all, the abstract in the initial post here says:

It’s frustratingly difficult to understand your point here - you seem to be arguing that the PEP isn’t needed, because people shouldn’t be doing a bunch of things that they clearly are doing at the moment. And the PEP is proposing to make it safer to do those things by adding an API that can be used from the limited C API and the stable ABI. But then, when examples of where people do need such an API have been presented, you seem to have changed your argument to “we won’t be able to make it work anyway”.

I don’t understand why you’re so insistent that this is about loading multiple Python versions into a single process. The vim model has also been mentioned here, and that would benefit from the stable ABI while only having one Python version loaded. The point is that vim ships as a binary that can either be run without a Python installation, or the end user can point it at an existing Python installation which will then be available to vim.

You seem to think that by making the Python API actively hostile to such usage, you’ll somehow encourage developers to migrate to your preferred model. My experience is the opposite - people will do what they want to do, and will invent whatever mechanism they need, stressing the Python API more if it doesn’t make what they want easy.

It’s not like Python is a commercial product where there’s some sort of “your warranty is void if you don’t stick to our recommended approach” contract. We’re simply trying to ensure that the things our users want to do with Python can be done easily and effectively - we’re not trying to dictate how Python gets used.

So you have a binary that does dlopen() on either a path passed to it (the “arbitrary library” situation) or a table that will look up a version number it already knows and dlopen() a path it already knows. (Or you have one binary per Python version and launch the right one, which doesn’t require a stable ABI, just like a version-specific extension module doesn’t need it.)
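The “table” variant can be sketched in plain C. The library paths below are hypothetical examples; the point is that only paths the application itself ships in the table are ever handed to dlopen(), never a user-supplied one:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the "known table" approach: map a Python version the
 * host already knows about to a libpython path it already knows. */
struct py_runtime {
    const char *version;
    const char *libpath;   /* hypothetical example paths */
};

static const struct py_runtime known_runtimes[] = {
    {"3.11", "/usr/lib64/libpython3.11.so.1.0"},
    {"3.12", "/usr/lib64/libpython3.12.so.1.0"},
};

const char *resolve_libpython(const char *version)
{
    size_t n = sizeof(known_runtimes) / sizeof(known_runtimes[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(known_runtimes[i].version, version) == 0)
            return known_runtimes[i].libpath;  /* this is what dlopen() gets */
    }
    /* Unknown version: refuse, rather than load arbitrary code. */
    return NULL;
}
```

The maintenance burden is visible in the table itself: every supported Python version is a row the developer has to build, test and update.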

The first approach is both a security incident waiting to happen and a bug magnet. The second one is at least as big a maintenance burden as the third (one binary per version). Why would a developer want to do either of them to themselves? And if they do, they still can without a limited ABI - write the code that handles it for each one. (I have in fact written and maintained this code in the past. It wasn’t the greatest idea.)

Nothing right now, which is the problem. They will need some kind of initialization at some point, which means the variables will change, and to correctly embed a copy of Python you need to know whether they do anything or not (not just to avoid errors from the API, but to understand what your app is doing).

This is the fundamental problem with hosting CPython I’m trying to highlight. The fact that we know we’ll have new initialization options coming is just convenient - history shows that we make changes to runtime behaviour all the time, and this PEP is not promising to stop doing that so that embedders don’t have to know how a major component of their app works.

My point is that the limited API makes it more attractive, but it only becomes safer when we change our approach to developing CPython. Specifically, that the entire behaviour of initialization/runtime/finalization becomes stable enough that developers can use a stable API to use it. The API itself isn’t my concern, it’s that we haven’t agreed to actually keep the runtime behaving in a stable manner over the next 5-10 major releases, and in fact we have promised to change the behaviour within that time, which means it can’t be a stable API.

I’m not proposing to make it any more hostile than it already is, and I do want to make it less hostile. I prefer my models because those are the ones that have worked reliably across all the products I’ve worked on, reviewed, debugged, troubleshot, etc. over the years. If you need the user’s Python install, figuring out how to run out-of-process has been worth it every time. If you need to run in-process, locking to a single CPython version has been worth it every time.

What I don’t want to do is promise stability where there isn’t enough for people to trust that it’ll be stable.

1 Like

Why would choosing a Python interpreter to dlopen at startup be any different than choosing a Python environment to activate at startup (which software like mod_wsgi already allows you to do)? Both are “bug magnets” and “security incidents” waiting to happen.

That doesn’t seem to follow. Why would initializing the JIT need to alter the semantics of Py_IgnoreEnvironmentFlag, or even threaten its existence?

That’s vague, hyperbolic and unhelpful.

I am getting frustrated with this discussion. You seem to oppose this out of principles without ever delving into the details of why this proposal is supposedly counter-productive. Perhaps we should agree to disagree and move on.

Because environment activation affects launching a regular python3 binary, which we stabilise via the command-line interface. We don’t currently support launching into a venv through a host application, because it’s basically the host’s job to figure out the environment.

It’s also different because the dlopen’d code now lives inside your process with all the rest of your code. Launching an arbitrary executable is similarly insecure, as it can reach into your original process to extract memory. Launching a Python script with a known Python interpreter is less risky. It’s about reducing attack surface, in this case quite significantly.

It may require adding more variables, not necessarily changing existing ones. And once we add new options, code that needs to know whether they’ve been set or not can’t be version-independent, and so the benefit of limited API degrades to source-compatible API, which we promise anyway.

It’s counter-productive because if we make initialization part of the stable ABI, users will expect their code using only stable ABI interfaces to be stable across versions. I believe the PEP should demonstrate (a) that embedders will benefit from this, and (b) that we know how to provide it and are prepared to provide it.[1] Currently, it does neither.

Don’t worry, me too. Happy to agree to disagree, but I’m still going to state (and try to clarify, as needed) my position on this. Embedding is a scenario that I care about; I believe it’s a huge gap that Python is perfect to fill, and I would love to see CPython be in a position to fill it.


  1. And subsequently, that embedders will benefit from the level of stability we are able and willing to provide. ↩︎

For reference, here’s the issue/discussion for removing the older API: C API: Remove legacy API functions to configure Python initialization · Issue #105145 · python/cpython · GitHub

1 Like

Deprecated global configuration variables such as Py_VerboseFlag were never part of the stable ABI. There have been requests to expose the initialization API, including these variables, in the stable ABI.

Currently, it’s possible to work around the lack of an API by setting environment variables before calling Py_Initialize(), but not all configuration options have an environment variable, and environment variables are inherited by child processes, which may not be the expected behavior.
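That workaround looks roughly like this (a sketch; the `configure_python_via_env()` wrapper is mine, and the actual `Py_Initialize()` call is left as a comment since it needs libpython):

```c
#include <stdlib.h>

/* Sketch of the environment-variable workaround available to
 * stable-ABI embedders: set PYTHON* variables, then call
 * Py_Initialize(), which reads them. */
void configure_python_via_env(void)
{
    setenv("PYTHONDONTWRITEBYTECODE", "1", 1);  /* ~ write_bytecode = 0 */
    setenv("PYTHONUNBUFFERED", "1", 1);         /* ~ buffered_stdio = 0 */

    /* Downsides: many options (e.g. the inittab) have no environment
     * variable at all, and everything set here is inherited by any
     * child process the embedded interpreter spawns. */

    /* Py_Initialize();  -- would pick these up */
}
```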

I think making the configuration structure opaque and using an API to set/get configuration by name is a welcome simplification:

  • it’s a smaller API for language bindings like PyO3 to wrap and re-expose, and
  • it’s easier for people to support embedding multiple Python versions into their application: there’s no need to conditionally compile structure field accesses, and normal error handling can be used when a configuration value is not available for a specific version at runtime.
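Under the PEP’s draft API, version differences become runtime errors rather than compile-time conditionals. A sketch using the names quoted in this thread (the draft API is not final, and `PyInitConfig_Free()` is my assumption for the cleanup function):

```c
#include <Python.h>

/* Sketch against the PEP 741 *draft* API: set options by name, and
 * handle unknown options as runtime errors instead of #ifdef-ing on
 * PyConfig struct fields. */
int init_embedded_python(void)
{
    PyInitConfig *config = PyInitConfig_Python_New();
    if (config == NULL) {
        return -1;
    }

    if (PyInitConfig_SetInt(config, "dev_mode", 1) < 0) {
        /* Option unknown on this Python version: decide at runtime
         * whether to ignore it or fail. */
    }

    if (Py_InitializeFromInitConfig(config) < 0) {
        const char *err_msg;
        PyInitConfig_GetError(config, &err_msg);  /* assumed signature */
        /* report err_msg to the user ... */
        PyInitConfig_Free(config);                /* assumed name */
        return -1;
    }
    PyInitConfig_Free(config);
    return 0;
}
```

This is exactly what a binding like PyO3 would wrap once and re-expose, instead of mirroring every PyConfig field per version.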

I’m going to stay away from the question of whether users should be vendoring their own copy of Python or distributing apps which are expected to work with a system Python etc. On the PyO3 issue tracker I’ve had users wanting to do both of these things, for their own reasons. I think the bullets above apply regardless of distribution option.


Effect on the Stable ABI

It seems to me that there are two main new concepts which this PEP would add to the stable ABI:

  • The two steps “preinitialize” and “initialize”, and I wish it didn’t (I’ll cover that more below).

  • Programmatic control of the configuration variables. This would be helpful: the stable ABI’s Py_InitializeEx already allows a lot of configuration to be read from environment variables, but not set programmatically. Being able to use the “isolated config” from PEP 587, for example, would be great; at the moment, setting a bad PYTHONHOME value can probably break most Rust applications built with a Python embedded via PyO3.
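For reference, the PEP 587 “isolated config” mentioned above currently requires the non-limited C API. A minimal sketch (the `init_isolated()` wrapper is mine):

```c
#include <Python.h>

/* Sketch: PEP 587's isolated configuration ignores PYTHONHOME and
 * other environment variables entirely, so a bad user environment
 * cannot break the embedded interpreter. Not reachable through the
 * stable ABI today. */
int init_isolated(void)
{
    PyConfig config;
    PyConfig_InitIsolatedConfig(&config);   /* use_environment = 0, etc. */

    PyStatus status = Py_InitializeFromConfig(&config);
    PyConfig_Clear(&config);
    return PyStatus_Exception(status) ? -1 : 0;
}
```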


Preinitialization :frowning:

I like how this API merges PyConfig and PyPreConfig of PEP 587 into an opaque structure.

Py_PreInitializeFromInitConfig is missing from the “Specification” bullet list even though it is described further down in the full documentation. That briefly got me excited: I think that having multiple steps to initialize from the one configuration object is more complex than it needs to be.

  • Is it an error to modify the PyInitConfig after preinitialization?

  • Could we have APIs like PyInitConfig_AddInittab or PyInitConfig_AddInittabHook, which are used to control the few operations legal between preinitialization and full initialization?

  • Is there anything besides editing the inittab which is expected to be done between the two phases of initialization?

I would love it if it were possible to adjust the PEP to remove the preinitialization step and add some extra configuration for the inittab (and any other similar bits which would be needed).

I think then this would be a straightforward replacement for Py_InitializeEx where the embedding user gets more control to set the configuration they care about.


Alternative Runtimes

This API might also be very interesting to alternative Python runtimes like PyPy and GraalPy. I suspect that supporting PEP 587 would be difficult due to those runtimes likely needing very different configuration variables, but this API might be possible.


(and a bikeshed, sorry…)

As a final bikeshed, it was a bit surprising to me that a structure called PyInitConfig contains error state. I don’t readily have a better name though, PyInitializer maybe?

4 Likes

What is your use case to explicitly pre-initialize Python?

The preinitialization does two important things for this API:

  • Set the LC_CTYPE locale.
  • Set the memory allocator used by Py_DecodeLocale() – it’s set once and cannot be changed later.

Current implementation:

  • PyInitConfig_SetStr() calls Py_DecodeLocale() to decode the bytes string from the locale encoding.
  • PyInitConfig_SetWStr() calls PyMem_RawMalloc().

Antoine proposed to add an API taking UTF-8: it would avoid the dependency on Py_DecodeLocale().

PyInitConfig_SetWStr() can also be modified to use malloc()/free() so that it doesn’t rely on the Python memory allocator, avoiding the need for preconfiguration.

Py_DecodeLocale() must be used to decode bytes strings coming from the operating system: main() argv, filenames, env vars, etc.

Once Python is preinitialized, modifications to a preconfiguration option such as "allocator" (int) are ignored. Would you prefer an error such as “cannot modify the preconfiguration after Python is preinitialized” in this case?

I can add a PyInitConfig_AddInittab(config, inittab) function. It can use malloc()/free() internally so that it doesn’t depend on the Python memory allocator. PyImport_AppendInittab() and PyImport_ExtendInittab() do something similar.
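A sketch of how that could look from the embedder’s side (the signature is assumed from this post, not from any released API; the "host" module is a hypothetical example):

```c
#include <Python.h>

/* Hypothetical module for illustration. */
static PyObject *create_host_module(void)
{
    static PyModuleDef def = {PyModuleDef_HEAD_INIT, "host", NULL, -1, NULL};
    return PyModule_Create(&def);
}

/* Sketch against the proposed PyInitConfig_AddInittab(): because it
 * could copy the table with malloc()/free(), no Python memory
 * allocator -- and hence no preinitialization -- is needed here. */
int setup(PyInitConfig *config)
{
    struct _inittab table[] = {
        {"host", create_host_module},
        {NULL, NULL},   /* sentinel, as with PyImport_ExtendInittab() */
    };
    if (PyInitConfig_AddInittab(config, table) < 0) {
        return -1;      /* e.g. out of memory */
    }
    return Py_InitializeFromInitConfig(config);
}
```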

I can rename it to: PyInitState. Does it sound better?

  • state = PyInitState_Python_New()
  • PyInitState_SetInt(state, ...)
  • PyInitState_GetError(state)
  • Py_InitializeFromInitState(state)