Is there any reason the stable ABI on Windows has to be the same as on Unix?
If not, I think the new `wchar_t*` APIs should be added on Windows only.
This is a minor point in the grand scheme of things, but the reason there are two coercion points (one reached from Py_Main, one from the various flavours of Py_Initialize) is that the interpreter genuinely doesn’t work properly when 7-bit ASCII is set as the system encoding, and locale coercion (if a suitable target locale is available) gives slightly nicer behaviour overall than only activating UTF-8 mode in the Python runtime (e.g. the readline module works properly in the former case, but misbehaves in the latter).
If the embedding app is doing the right thing (i.e. setting a more reasonable locale), the second coercion point will never trigger.
I don’t know if anyone is actually turning off locale coercion, though - the inclusion of the runtime feature switches in PEP 538 was more a matter of avoiding potential objections to the idea rather than because I thought there were valid reasons to disable the coercion attempt.
I had a suspicion that my explanation here wasn’t actually right, so I went digging into the code to double-check.
The history is a bit messy (see Py_Initialize() and Py_Main() should not enable C locale coercion · Issue #78770 · python/cpython · GitHub for example), but the gist of the problem is:
- there are a pair of environment variables that affect whether or not locale coercion and UTF-8 mode should trigger automatically
- the `-E` and `-I` CLI options affect whether environment variables should be checked or not
- while `-E` and `-I` can be checked using ASCII-only code, actually doing so requires duplicating a decent chunk of the argument parsing infrastructure to allow it to operate in ASCII-only mode just for that check
- the workaround adopted instead (since 3.8) is that `Py_Main()` enables locale coercion, and `Py_Initialize()` does not (by way of the default preconfig settings used in those two cases)
(Note: this isn’t the way things worked in the accepted version of PEP 538. That worked the way Steve is requesting, with only the python3 embedding application actually implementing coercion, and the shared library just complaining if it detected that coercion was needed but hadn’t happened. However, that version of the code had the problem that even if -E and -I were passed on the command line, the locale coercion environment variable would still get checked, despite the CLI options specifically saying not to do that)
It was implied in the particular post being quoted, but I’ve stated explicitly in other places that I’d also want to move CLI option parsing into python.c as well, which means we could still do the -E/-I handling. We’ve got the original argv at that point too, so it’s not really going to have to do that much work.
Yeah, improving the factoring of this code is presumably possible. It’s just impressively tedious, hence our settling for the status quo so far.
The heart of the problem is the intersection between:
- Windows APIs using `wchar_t *`
- POSIX APIs using `char *`
- Needing to process at least `-E` and `-I` on the command line and then read the env vars for locale coercion and UTF-8 mode before trusting the results of decoding any strings with the locale encoding
- Choosing between doing the above with native memory management, or else delaying it long enough to be able to use the Python memory allocators set up during pre-initialisation
I made an attempt at restoring the separation of the locale coercion from Py_Main in an alternative PR for the issue I linked, but eventually abandoned it when it became clear the resulting code was going to be horrible to maintain. Someone else may be able to figure out a different approach that wouldn’t impose the same kind of ongoing maintenance burden.
The tradeoff is that now embedders are forced to choose one of our approaches, and are practically guided towards copying python.c behaviour rather than using their own process-wide setup.
For the sake of embedders, libpython ought to be as neutral as possible about this (probably, imho, requiring UTF-8 strings and only taking settings from the init struct). It really doesn’t help e.g. Blender to make them figure out how/whether initialising Python is going to affect the locale for the rest of their app.
PEP 741 was designed to be convenient to use on different platforms with different string types. That said, the API can be redesigned to reduce the number of functions.
Currently, there are 6 PyInitConfig functions to get strings and 6 PyInitConfig functions to set strings: 12 string functions in total.
I propose to add encode and decode functions to reduce the API from 12 to 7 PyInitConfig string functions (remove 7 functions, add 2).
Keep 5 functions (no change) using UTF-8 strings (`char*`):

- `PyInitConfig_GetStr()`
- `PyInitConfig_GetStrList()`
- `PyInitConfig_FreeStrList()`
- `PyInitConfig_SetStr()`
- `PyInitConfig_SetStrList()`
Add 2 functions:

- `PyInitConfig_DecodeLocale()`: decode a locale-encoded `char*` to UTF-8
- `PyInitConfig_EncodeUTF8()`: encode a `wchar_t*` to UTF-8
Remove 7 functions:

- `PyInitConfig_GetWStr()`
- `PyInitConfig_GetWStrList()`
- `PyInitConfig_FreeWStrList()`
- `PyInitConfig_SetStrLocale()`
- `PyInitConfig_SetStrLocaleList()`
- `PyInitConfig_SetWStr()`
- `PyInitConfig_SetWStrList()`
The complexity of handling the locale-encoded `char*` argv array on Unix and the `wchar_t*` argv array on Windows is moved from Python to the API consumer. Such code is not complicated to write; it’s mostly a matter of error handling.
On Windows, `wchar_t*` is used for `main()` argv, but also for filenames, the environment (`_wgetenv()`), etc. On Unix, `wchar_t*` is used by `Py_DecodeLocale()` and for literal Unicode strings. PEP 587 (PyConfig) recommends using `PyConfig_SetBytesString()` instead of `Py_DecodeLocale()`; it takes care of the Python pre-initialization if needed.
The Python preconfiguration is quite complicated. It has many inputs for the choice of the locale encoding:
- the current `LC_CTYPE` locale: especially the “C” and “POSIX” values
- the `-X utf8` cmdline option
- the `-E` and `-I` cmdline options (to ignore env vars)
- the `PYTHONUTF8` env var (PEP 540)
- the `PYTHONCOERCECLOCALE` env var (PEP 538)
- the `PyPreConfig.configure_locale` option
- the `PyPreConfig.utf8_mode` option
- the `PyPreConfig.coerce_c_locale` option
- the `PyPreConfig.legacy_windows_fs_encoding` option (disables UTF-8 Mode)
By the way, additional inputs for selecting the memory allocator:
- the `-X dev` cmdline option
- the `PYTHONDEVMODE` env var
- the `PYTHONMALLOC` env var
Hum, if I recall correctly, properly handling cmdline options in the Python preinitialization also requires a function to set argv. PEP 587 has two functions:
- `PyConfig_SetBytesArgv()` for Unix bytes (`char*`)
- `PyConfig_SetArgv()` for Windows Unicode (`wchar_t*`)
So we would need a similar API for PyInitConfig:

- `PyInitConfig_SetBytesArgv()`
- `PyInitConfig_SetArgv()`
I’m fine with removing the stable ABI target from PEP 741 and re-opening the discussion a few releases later to reconsider adding this API to the stable ABI.
There are two reasons the problem remains theoretical despite libpython technically containing dubious behaviour:
- only `Py_Main` does anything dubious (`Py_Initialize` and friends skip coercion unless it is specifically requested in the config)
- embedding apps also don’t work right if the locale encoding is left as 7-bit ASCII, so they either fix it themselves before initialising Python, or else they actually want libpython to deal with it
This means the only case that doesn’t work is:
- wanting to use `Py_Main`
- wanting to keep the locale as 7-bit ASCII (a process configuration we explicitly exclude as unsupported)
Even if the PyPreConfig.configure_locale option is 0 (ask Python to leave LC_CTYPE unchanged), the choice of the “Python filesystem encoding” cannot be guessed easily: see the complex PyConfig.filesystem_encoding rules. It depends on:
- The operating system.
- The Python UTF-8 Mode (`PyPreConfig.utf8_mode` option).
- The `nl_langinfo(CODESET)` string.
- The `mbstowcs()` function, which is actively tested at startup to detect a “lying” `nl_langinfo(CODESET)` on FreeBSD and Solaris.
- The `PyPreConfig.legacy_windows_fs_encoding` option.
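On Unix, the starting point for those rules can be queried with standard C; a minimal sketch (`locale_codeset` is a hypothetical helper):

```c
#include <langinfo.h>
#include <locale.h>

/* Sketch: query the codeset the current locale claims to use, which
 * is one of the inputs listed above. As noted, on some platforms the
 * answer can be wrong, which is why CPython also probes mbstowcs(). */
static const char *locale_codeset(void)
{
    setlocale(LC_CTYPE, "");      /* adopt the process environment */
    return nl_langinfo(CODESET);  /* e.g. "UTF-8" or "ANSI_X3.4-1968" */
}
```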
This doesn’t matter for embedders though. Embedders should always pass in UTF-8 or UTF-16-LE (or just UTF-8 if we simplify further), and they are responsible for their own decoding.[1]
Embedders/extenders who need to use the same encoding as Python does should be able to use a public API to encode/decode. They don’t need to know the encoding to be able to convert to/from it. (Though on that note, making something like path_t from argument clinic available to extenders would be nice. It can be done in a few steps easily enough, but having a PyArg_ParseTuple item specific to paths would really take away any need for extenders to think they need to figure out the FS encoding themselves.)
Again, my preference here is for all environment variables to have been read by `python.c`, which is an “embedder” in this context, and so we still keep these rules but they live in our `main()` rather than as part of universal initialization. ↩︎
Environment variables are the other main place where the OS encoding can come up post-initialisation.
On the configuration front, Py_Main and its Windows counterpart are slightly odd beasts, in that they’re shipped as part of the shared library API, but semantically they’re providing standalone CLI applications.
The PEP 587 APIs represent the best effort to date to disentangle the “standalone application” bits of the config process from the “shared library” bits, including the preinit step to set up memory allocators and a reasonable locale configuration.
Totally agree. I’m arguing as strongly here as I am because I don’t want PEP 741 to prevent us from ever making a better effort. (I’d love to make that effort myself, but honestly would have to drop nearly everything else I’m trying to look after right now, including $dayjob, just to have the time and the mental capacity for it, so I’m settling for not getting us locked into a corner.)
The Steering Council discussed PEP 741 again today. As we mentioned previously we’re still not able to fully evaluate the PEP for pronouncement either way. It is clear however that the PEP isn’t ready for Python 3.13. We aren’t officially deferring the PEP but we’d like for the PEP authors to target it to Python 3.14 at the earliest. Keep discussing!
I think I have an idea to reassure you (and @ericsnowcurrently!) on that front.
What if PEP 741 omitted any changes to the init process itself and instead just offered new UTF-8 encoded string field name based APIs to populate specific fields given opaque pointers to the config structs we already have defined?
The preconfig step exists for good reasons, and trying to mask it from embedders without taking over the entire initialisation process the way Py_Initialize does really isn’t going to do them (or us) any favours.
Making it possible to populate those structs without relying on their exact C layouts, on the other hand, is a genuine improvement that lays the foundation for some day allowing even embedders to use the stable ABI.
If we do that, then the API for working with the preconfig fields can be dramatically simplified (no strings to deal with), and the memory for the more complex full config struct can be entirely managed by the shared library (as PEP 587 already does).
To distinguish the new “accepts the field name” functions from the existing field pointer based PEP 587 APIs, I would suggest appending a “Field” suffix to avoid name conflicts, and then retaining that convention for all the new APIs that replace a simple C struct field assignment with an API call that accepts a field name as a string.
I’d love that!
My problem with the PEP is the limited API part - once that is dropped, we can look more closely at how the proposed APIs will be used, but I suspect the PEP will get my full support when it’s only defining a per-release API.
I don’t think that it’s that simple. On all operating systems but Windows, there are mojibake issues if the application decodes “OS data” (filenames, env vars, etc.) with an encoding A and Python encodes it back to bytes with a different encoding B.
Python initializes its filesystem encoding once, and then it’s no longer changed. In practice, it sets the LC_CTYPE locale using the inputs that I described before. It’s then important that all OS data is decoded from this filesystem encoding with the “surrogateescape” error handler. Well, Windows is different: it uses the “surrogatepass” error handler (or “replace” if the legacy Windows FS encoding is used).
An embedder has already had to figure this out. If they’re decoding the OS data incorrectly, it’s going to affect their entire application, right? And even if they happen to round-trip, as soon as they let a user enter a path anywhere, they don’t have an original encoding to work with.
I don’t like modifying the global locale settings at initialization - when embedded, Python’s settings shouldn’t escape the Python context, and the embedding app gets to control the global ones. By default, we should respect their choices (and assume that they’ve chosen it), including letting them override our choices (in this case, let them specify the encoding we’ll use for fsencode).
Mojibake issues are more and more of a relic nowadays, since most modern systems default to something like UTF-8 (except Windows, which prefers UTF-16, but the result is the same). I agree that configuration APIs shouldn’t be responsible for handling encoding issues, and only accepting UTF-8 (and perhaps UTF-16) is a reasonable API design choice.
Let’s say that your application is started in an LC_CTYPE locale using the Latin-1 encoding. How do you convert Latin-1 strings to UTF-8 on Linux? Is there a portable API that works on most platforms?
For example, how do you pass the argc and argv of the main() function to PyInitConfig, given that they are byte strings encoded in the locale encoding, and not in UTF-8?
If you use the “Python configuration”, Python can change the LC_CTYPE locale after the embedding application has already decoded its strings. For example, the application can use Latin-1, and then Python decides to switch to UTF-8 (e.g. if the locale is “C” or “POSIX”) or to ASCII (“C” or “POSIX” locale, but on FreeBSD).
When Python re-encodes data to the new encoding, the data’s encoding changes: a Latin-1 encoded string becomes a UTF-8 encoded string, for example. If Python then attempts to open a filename in the wrong encoding (mojibake), it will just fail.
In the past, Py_DecodeLocale() was the reference portable API suggested by Python to decode byte strings. PEP 587 changed the situation by accepting byte strings directly (and so decoding them itself, taking care of the Python preinitialization, etc.).
I don’t think that this problem is theoretical; it’s trivial to reproduce.