FR: Allow private runtime config to enable extending without breaking the `PyConfig` ABI

gpshead · August 5, 2022, 10:32pm

Problem statement

We need the ability to extend the configuration of the Python runtime within patch releases, where we cannot change public structures and thus break a release’s ABI. We don’t do this often, but security fixes can require adding configuration settings. A past example of this is the hash randomization feature. (A new, still embargoed, need for this is in the works.)

Python 3.8 added our suite of PyConfig based APIs via PEP 587 – Python Initialization Configuration | peps.python.org. This cleaned up a lot of things, good! But it has a downside: It resulted in a public C struct full of configuration options (including a few fields awkwardly called “private”). This is struct PyConfig currently seen in cpython/initconfig.h at main · python/cpython · GitHub.

We’re free to alter struct PyConfig between minor releases so long as we don’t remove fields; it is not a cross-version stable ABI as far as I can tell. But when we need to add more configuration in a security patch release, we’re back to resorting to ad-hoc out-of-band configuration mechanisms, because struct PyConfig must not be changed within a release.

Proposal

We could add an additional “extended config” concept. This should explicitly NOT be in the form of a public struct. I suggest it take the form of a string containing newline-separated key-value pairs in a trivial format, likely simply "key=value\n". A pointer to this extended text-based config would be added to struct PyConfig and parsed during Py_InitializeFromConfig to fill values in wherever they belong, along with a pointer to an opaque private struct defined in Include/internal/ that we’d be free to change even in patch releases.

Questions

Do we allow existing struct PyConfig field names to be set via this?
- using their struct field name or using their -X flag name for those that have one?
- I propose text based settings always override fields during Py_InitializeFromConfig. The ultimate goal of this is that people could use an entirely text based config instead of C struct fields. (Maybe we’d even want to encourage that very long term if deprecating a lot of the struct ever becomes desirable?)

For our users’ sake, we should probably flag unknown field names as a config error. BUT we need the concept of intentionally non-error-causing value settings, so that code can be written that works across Python versions without a huge pile of minor or patch release version check ifdef hell.

Should we allow a special "unknownok:" string prefix on the key name to allow setting things that may not be an existing known key / feature in the currently running Python release?

As a user, this could look something like this:

{
    PyConfig config;
    PyConfig_InitPythonConfig(&config);
    PyStatus status = PyConfig_SetString(
        &config, &config.text_based,
        L"check_hash_pycs_mode=always\n"  // could've been set in struct
        L"unknownok:avoid_medusas_gaze=yes\n"  // new security patch feature
    );
    if (PyStatus_Exception(status)) {
        goto fail;
    }
    status = Py_InitializeFromConfig(&config);  // text_based would be parsed and applied here.
    ...
}

The unprefixed "avoid_medusas_gaze=no\n" could also have been used if the author could guarantee having a recent enough CPython available.

A key= where key isn’t known would be an error. An unknownok:key= where key isn’t known would be ignored. (A note could be emitted to stderr in verbose mode.)
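To make the unknown-key rule concrete, here is a minimal sketch of how such text could be parsed. It is only an illustration under stated assumptions: it uses narrow char strings rather than wchar_t, and the struct, the function names, and the "yes"/"no" value spelling are hypothetical, not part of any CPython API.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical private settings; only the key names come from the post above. */
struct demo_private_config {
    bool avoid_medusas_gaze;
    char check_hash_pycs_mode[16];
};

#define DEMO_UNKNOWNOK "unknownok:"

/* Apply one key/value pair. Returns 0 on success, -1 on error. */
static int demo_apply(struct demo_private_config *cfg,
                      const char *key, const char *value)
{
    bool tolerate_unknown = false;
    if (strncmp(key, DEMO_UNKNOWNOK, strlen(DEMO_UNKNOWNOK)) == 0) {
        tolerate_unknown = true;
        key += strlen(DEMO_UNKNOWNOK);
    }
    if (strcmp(key, "avoid_medusas_gaze") == 0) {
        if (strcmp(value, "yes") == 0) { cfg->avoid_medusas_gaze = true; return 0; }
        if (strcmp(value, "no") == 0) { cfg->avoid_medusas_gaze = false; return 0; }
        return -1;  /* known key, invalid value: always an error */
    }
    if (strcmp(key, "check_hash_pycs_mode") == 0) {
        if (strlen(value) >= sizeof(cfg->check_hash_pycs_mode)) return -1;
        strcpy(cfg->check_hash_pycs_mode, value);
        return 0;
    }
    /* Unknown key: ignored only when the caller opted in with the prefix. */
    return tolerate_unknown ? 0 : -1;
}

/* Parse newline-separated "key=value" text. Returns 0, or -1 on the first error. */
int demo_parse_text_config(struct demo_private_config *cfg, const char *text)
{
    char line[256];
    while (*text != '\0') {
        size_t len = strcspn(text, "\n");
        if (len >= sizeof(line)) return -1;
        memcpy(line, text, len);
        line[len] = '\0';
        text += len;
        if (*text == '\n') text++;
        if (len == 0) continue;          /* skip blank lines */
        char *eq = strchr(line, '=');
        if (eq == NULL) return -1;       /* not a key=value line */
        *eq = '\0';
        if (demo_apply(cfg, line, eq + 1) < 0) return -1;
    }
    return 0;
}
```

In this sketch the prefix only suppresses the "unknown key" error; a known key with a bad value still fails, which seems closest to the intent described above.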

Internal changes to support this

// include/cpython/initconfig.h

struct _Py_private_config;  // forward decl

typedef struct PyConfig {
    ...
    wchar_t *text_based;  // See https-link-to-docs.
    ...
    struct _Py_private_config *_private_config;
} PyConfig;

// include/internal/initconfig.h

struct _Py_private_config {
    bool avoid_medusas_gaze;
    ... // Existing PyConfig "_private" fields could move into here.
};

and obviously support for parsing, populating, and error checking called from Py_InitializeFromConfig.

thoughts?

If done, this would ultimately become a PEP.

eric.snow · August 5, 2022, 11:16pm

If I understood correctly, this new mechanism would be for users that:

- embed Python in their app
- are pinned to a specific minor version of CPython
- need to update the runtime config in response to some [security] fix in a micro release

The next “minor” release would have the configuration added to PyConfig, right? If so, then this would be a mechanism for one-off fixes for a very small set of users. (The number of affected users isn’t necessarily important.) Something feels off about this situation and using PyConfig to solve it.

Regardless, the temporary and specific nature of this need suggests other options that might be more straightforward:

- old-school global variables (a la pre-587)
- env vars

gpshead · August 6, 2022, 12:19am

What about the private fields and data in our public struct? We shouldn’t do that.

You may be over-constraining your use scenario. PyConfig is indeed primarily, if not exclusively, for use by embedding Python (I didn’t look to see if subinterpreters use it; assuming not). But that doesn’t imply that an embedder is “pinned to a specific minor version of CPython”. An application embedding CPython can target building and working on numerous platforms embedding whatever CPython they provide. A desire for an application maintainer is to limit #if PY_BLAH_VERSION... soup while doing so.

Alternatives when a PyConfig field doesn’t exist to be set

What did you mean by “env vars”? I’m guessing:

The PYTHON* environment variables and command line flag argv are all parsed in the same step by the PyConfig_* initialization APIs (which fill in the PyConfig structure, though those can do anything and could fill in private globals as well).

I assume you’re suggesting that embedding code needing to enable features call C setenv() before configuring the Python interpreter? Or perhaps PyConfig_SetArgv() to add a “command line” flag as the way to set configuration that isn’t in older releases’ PyConfig structures?

That’d be doable. Though setenv() has global & child process state consequences.

Core dev toil - backporting security fix pain

The need to add things in patch releases has come up multiple times over the years. Go back a decade and we have hash randomization. It added an environment variable and an associated flag (our norm). When a fix needing to do this kind of thing in our main branch looks different than the change to the most recent versions needing a backport, that adds maintenance burden and a chance of getting something wrong while backporting. Specifically, I’m talking about needing to avoid modifying the PyConfig struct when backporting.

If we create a private config struct, the backport of this kind of thing becomes more routine once 3.12.x is in security fix mode.

Alternative to ease backporting

Instead of having the fix in main even use PyConfig, just have it do the private global variable dance, as that is what the backports will require. A follow-on feature PR can move that global into PyConfig.

Conflating two concepts?

I’m perhaps lumping two concepts together here.

- Identifying a need for a non-public / private config struct.
- A way to supply configuration that isn’t setting fields in a giant ever-growing public C struct.

With a private struct we can remove obsolete fields instead of leaving them in place unused for a decade lest we inflict #if PY_BLAH_VERSION... pain on our user code.

If we moved the public API entirely to a text based config, we could silently ignore old obsolete fields in a supplied config instead of requiring them to be conditionally compiled out based on version in user code, hopefully easing maintainers’ lives a bit.

Use of a text based config could add a tiny startup time overhead. I don’t expect CPython’s own python launchers would use it internally if so, as we have access to all public and private fields directly and are by definition tied to our exact version.

–
This thread is a mix of me thinking out loud and trying to elucidate what feel like deficiencies I see in our API design while working on a bug.

tiran · August 6, 2022, 7:38am

We really should stop allocating public structs on the stack. If the PyConfig API used heap allocation with helper functions, then we would be able to extend the PyConfig struct and append new settings.

Instead of

PyConfig config;
PyConfig_InitPythonConfig(&config);

the API should be

PyConfig *config = PyConfig_New();
PyConfig_InitPythonConfig(config);
...
PyConfig_Free(config);
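
As an illustration of why heap allocation behind helper functions helps, here is a minimal sketch of the opaque-pointer pattern. All names (DemoHeapConfig and its functions) are hypothetical stand-ins, not the proposed PyConfig_New() API; the point is only that callers never see sizeof() or field offsets, so the library can append fields without breaking them.

```c
#include <stdlib.h>

/* "Public header" part: callers only ever see an incomplete type. */
typedef struct DemoHeapConfig DemoHeapConfig;

DemoHeapConfig *DemoHeapConfig_New(void);
void DemoHeapConfig_Free(DemoHeapConfig *config);
int DemoHeapConfig_SetIsolated(DemoHeapConfig *config, int isolated);
int DemoHeapConfig_GetIsolated(const DemoHeapConfig *config);

/* "Private implementation" part: the layout may change in any release. */
struct DemoHeapConfig {
    int isolated;
    int added_in_patch_release;  /* appended later; existing callers are unaffected */
};

DemoHeapConfig *DemoHeapConfig_New(void)
{
    /* calloc zero-initializes, so newly appended fields default to 0. */
    return calloc(1, sizeof(DemoHeapConfig));
}

void DemoHeapConfig_Free(DemoHeapConfig *config)
{
    free(config);
}

int DemoHeapConfig_SetIsolated(DemoHeapConfig *config, int isolated)
{
    if (config == NULL || (isolated != 0 && isolated != 1)) {
        return -1;
    }
    config->isolated = isolated;
    return 0;
}

int DemoHeapConfig_GetIsolated(const DemoHeapConfig *config)
{
    return config->isolated;
}
```

The trade-off is one accessor (or one generic setter) per setting, which is the cost debated later in the thread.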

steve.dower · August 8, 2022, 12:10pm

I agree with the need and I like the proposal. I also very much agree with Christian’s suggestion that it ought to be a heap allocated type.

Since we have a private _config_init field, we could add additional flags to that which would let us identify a heap-allocated (by us) structure vs a statically allocated one. That at least preserves compatibility for existing users, even if they don’t get the benefit of additional options.

There are also some values taken from environment variables at a late stage (i.e. in getpath.c) which ideally would have been part of the config struct (like PYTHONPATH already is). However, they can mostly be ignored if you can figure out which variables in the structure to set…

I would say feel free to factor getpath.py into any changes here. It’s intended to be overridable more easily than the old implementation - perhaps even to the point where we allow embedders to pass their own script/bytecode - and it should be doing all the processing of config values (apart from those needed to get to a working bytecode interpreter). Now that we’re confident that it replicates the old getpath.c behaviour, and has tests, it can be modified to support whatever future we want for embedders.

vstinner · August 8, 2022, 4:41pm

In 2019, when the PyConfig API was designed, I proposed storing the structure size in the structure to make it future-proof in terms of ABI. But this idea got rejected: Mailman 3 PEP 587 (Python Initialization Configuration) updated to be future proof again - Python-Dev - python.org. In short, the only use case is about embedding Python, and for this use case, it’s ok to require rebuilding Python.
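
For readers unfamiliar with the "structure size in the structure" idea, a minimal sketch follows. It is not what CPython does (the idea was rejected, as noted above), and every name in it is hypothetical. The caller records the size of the struct it was compiled against, and the library uses that to decide which trailing fields to default.

```c
#include <string.h>

/* Layout an older caller was compiled against. */
typedef struct {
    unsigned int struct_size;  /* caller sets this to sizeof(its own view) */
    int isolated;
} DemoConfigV1;

/* Layout the newer library was built with: one field appended. */
typedef struct {
    unsigned int struct_size;
    int isolated;
    int new_option;            /* unknown to V1 callers */
} DemoConfigV2;

/* Copy the caller's struct into the library's layout.
   Fields the caller does not know about keep their default of 0.
   Returns 0 on success, -1 if the size matches no known layout. */
int demo_import_config(DemoConfigV2 *dst, const void *src)
{
    unsigned int size;
    memcpy(&size, src, sizeof(size));
    if (size != sizeof(DemoConfigV1) && size != sizeof(DemoConfigV2)) {
        return -1;
    }
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, size);
    dst->struct_size = (unsigned int)sizeof(*dst);
    return 0;
}
```

The sketch uses same-sized int fields on purpose: with mixed field sizes, trailing padding can make two layouts report the same sizeof(), which is one of the pitfalls of this scheme.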

vstinner · August 8, 2022, 5:05pm

It’s an interesting API.

It might be useful to support newline characters (ex: first line \n second) for some strings.

Multiple PyConfig members are lists (ex: xoptions, warnoptions, argv, etc.), so it would be nice to have a way to write a list without excluding a character. For example, the PYTHONPATH env var doesn’t allow specifying a path which contains : since it’s the character used as a path separator, whereas : is a legit character in a Unix path. I don’t recall how formats like YAML support that. Writing a parser is non-trivial. Maybe keep it simple and forget corner cases, I don’t know.

Configuration options have many possible origins:

- (Now deprecated) Global Configuration Variables, such as Py_VerboseFlag
- Command line arguments
- Environment variables
- Configuration files (pybuilddir.txt, pyvenv.cfg)
- Values overridden by another member (ex: isolated=1 implies use_environment=0)
- And a few other origins

What’s the priority of such a new “text based” configuration? Does it have the highest priority?

vstinner · August 8, 2022, 5:12pm

For the specific case of fixing a security issue in Python stable versions, IMO it’s simpler and safer to add a private variable rather than modifying the public PyConfig structure.

ronaldoussoren · August 9, 2022, 9:03am

I don’t agree. For tools like py2app/py2exe/pyinstaller it is pretty inconvenient to have to rebuild the launcher executable that’s used to start the packaged application when there’s a bug fix release of Python.

gpshead · August 9, 2022, 8:51pm

Yeah it was just a mistake we made in the API design in 2019. You can’t extend a struct and assume embedding people all rebuild. They don’t. Real world embedding uses exist that use an installed Python minor version as a shared library. Update that to use a different sized struct in a public API and someone is going to have a bad time. That’s why I consider the struct frozen at rc1 time, even when only for use in the embedding / writing their own launcher case.

LostTime76 · August 9, 2023, 1:59pm

I think? I am running into this issue now as I am trying to embed the Python interpreter using a non-C language. I have to stick with the limited API, and private structures for configuration in header files are a no-no. Basically, I need to be able to allocate and configure everything using only exportable functions and the heap… no private structure details.

I run into stuff like this:

Hey! Here is a function that is part of the stable ABI and fits your use case! DON’T USE IT! Use this private structure that is not part of the stable ABI and which structure fields and size can be directly tied to a specific python version instead! No offense, but really guys?

I am strictly limited to what’s in the shared library (DLL). I don’t have headers, I can’t statically “recompile” every time a new version of python comes out. That’s unmaintainable for me.

So TLDR… Please for this and in the future, provide opaque heap types with supporting functions that can be part of the Stable ABI within the shared library. Not everybody is trying to embed using C headers and static compilation.

vstinner · August 12, 2023, 9:42am

IMO the best option to expose PyConfig would be to not expose the structure but add a configuration file which would allow overriding all PyConfig members. In terms of ABI, exposing structure members with their types is too complicated.

LostTime76 · August 14, 2023, 11:34am

Well, at this point I basically cannot use any of the new config API because it’s not part of the limited API / Stable ABI.

I don’t think a configuration file is great, because I would want to use it in an embedded scenario and build the config from within the host program.

What about exposing a few functions in the stable ABI to be able to use the new API?

- A function to allocate the PyConfig / PyPreConfig from the heap without having to know any of the structure members or the size of the structure
- Slot functions to set each member of the structures.

vstinner · August 14, 2023, 10:57pm

I created [C API] No limited C API to customize Python initialization (PyConfig, PEP 587) · Issue #107954 · python/cpython · GitHub to not forget this issue.

vstinner · September 30, 2023, 3:18pm

Hi,

I created an early draft WIP PR to parse a configuration from a “configuration file” (a string in practice): PR #107954.

Example:

# int
bytes_warning = 2

# string
filesystem_encoding = "utf8"   # comment

# list
argv = ['python', '-c', 'code']

# you can put comments for the fun
verbose = 1  # comment here as well
# after, anywhere!

The format is similar to TOML. But right now, I didn’t implement \' or \" in strings, and values must stick to a single line.
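
To give a feel for how small a single-line parser for this format can be, here is a minimal sketch that handles only the integer case ("key = 2  # comment"). It is not the code from the PR; the function name and the restriction of keys to letters, digits, and underscores are assumptions made for illustration.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "key = <int>  # optional comment" line.
   On success, copies the key into key_buf, stores the number in *value,
   and returns 0. Returns -1 for anything else, including comment-only lines. */
int demo_parse_int_line(const char *line, char *key_buf, size_t key_size,
                        long *value)
{
    while (isspace((unsigned char)*line)) line++;

    const char *eq = strchr(line, '=');
    if (eq == NULL) return -1;

    /* Trim trailing whitespace from the key. */
    const char *key_end = eq;
    while (key_end > line && isspace((unsigned char)key_end[-1])) key_end--;
    size_t key_len = (size_t)(key_end - line);
    if (key_len == 0 || key_len >= key_size) return -1;

    /* Assumption: keys are identifiers made of letters, digits, underscores. */
    for (size_t i = 0; i < key_len; i++) {
        unsigned char c = (unsigned char)line[i];
        if (!isalnum(c) && c != '_') return -1;
    }
    memcpy(key_buf, line, key_len);
    key_buf[key_len] = '\0';

    /* strtol skips leading whitespace and reports where the number ended. */
    char *end;
    errno = 0;
    long parsed = strtol(eq + 1, &end, 10);
    if (end == eq + 1 || errno == ERANGE) return -1;

    /* Only whitespace or a comment may follow the number. */
    while (isspace((unsigned char)*end)) end++;
    if (*end != '\0' && *end != '#') return -1;

    *value = parsed;
    return 0;
}
```

Strings and lists need quote-aware scanning (a '#' inside quotes is not a comment), which is where most of the real complexity would sit.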

All PyConfig members can be set with this configuration file. The configuration file is a UTF-8 encoded string.

Such an API can be added to the Stable ABI since it doesn’t expose PyConfig member offsets at the ABI level. The ABI is not affected if members are added to or removed from PyConfig: in the worst case, you just get an error.

I propose adding a function to parse configuration as text that you can call multiple times. So you don’t have to create a single long string with all parameters. Also, it makes it easy if you have conditional code. Pseudo-code example:

void stable_abi_init_demo(void)
{
    PyInit_SetConfig("isolated = 1");
    PyInit_SetConfig("argv = ['python', '-c', 'code']");
    if (condition) {
        PyInit_SetConfig("pythonpath = '/my/path'");
    }
    PyInit_SetConfig("filesystem_encoding = 'utf-8'");

}

Compact:

void stable_abi_init_demo(void)
{
    PyInit_SetConfig(
        "isolated = 1\n"
        "argv = ['python', '-c', 'code']\n"
        "filesystem_encoding = 'utf-8'\n"
    );
    if (condition) {
        PyInit_SetConfig("pythonpath = '/my/path'");
    }
}

A stable ABI becomes more important in Python 3.13 since the legacy API, like Py_SetPath() and PySys_AddXOption() functions, has been removed! Moreover, global configuration variables will be removed in Python 3.14, such as Py_IgnoreEnvironmentFlag or Py_OptimizeFlag.

jeanas · September 30, 2023, 10:21pm

(Never mind. I posted something which is already in a post above that I misread. Sorry.)

vstinner · September 30, 2023, 10:24pm

PyConfig currently has 66 members. I would prefer to not have to provide 66 functions.

Another problem is that such an API is not “future-proof”: if you build your code with (limited C API) Python 3.11, you cannot set new PyConfig members added to Python 3.12 like int_max_str_digits or perf_profiling.

Supporting a configuration file can be interesting to customize Python. Currently, Python supports 3 configuration files:

- pyvenv.cfg
- ._pth file (ex: python._pth)
- pybuilddir.txt (Unix only)

These files have a limited scope and are not easily accessible. For example, pybuilddir.txt is only used in the source code directory of Python, no longer when Python is installed.

jeanas · October 1, 2023, 7:51am

Is this a problem with code size (the code would be trivial), documentation size, or something else?

You could always dlopen libpython, right? Ok, that’s cumbersome, acknowledged. However, if this is an important use case, you could also provide an API like

Py_DynamicConfig *cfg = Py_DynamicConfigNew();
Py_DynamicConfig_SetValue(cfg, "int_max_str_digits", Py_DynamicConfig_MakeInt(10000));
Py_DynamicConfig_SetValue(cfg, "perf_profiling", Py_DynamicConfig_MakeBool(true));

(An existing example of this pattern is the Fontconfig API.)

I can see the use for configuration files, but my concern is that when given a string-based API that they need to feed with dynamic values, programmers often don’t bother to do the necessary string escaping, or do it incorrectly, causing a class of bugs.

(Also, with such an API, you preserve the ability to add struct members that have certain types like function pointers.)
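
The escaping concern can be illustrated with a minimal sketch. It assumes the draft format described earlier in the thread (single-quoted, single-line strings with no escape sequences) and uses a hypothetical helper name. Since such a format cannot represent a quote or newline inside a value, the only safe option for a caller building the text dynamically is to refuse those values rather than splice them in.

```c
#include <stdio.h>
#include <string.h>

/* Format "key = 'value'" into buf.
   Returns 0 on success, -1 if the value cannot be represented safely
   or the buffer is too small. */
int demo_format_setting(char *buf, size_t size,
                        const char *key, const char *value)
{
    /* Assumption: no escape sequences exist, so quotes, backslashes and
       line breaks in the value would corrupt or inject configuration. */
    if (strpbrk(value, "'\\\n\r") != NULL) {
        return -1;
    }
    int written = snprintf(buf, size, "%s = '%s'", key, value);
    if (written < 0 || (size_t)written >= size) {
        return -1;  /* output was truncated */
    }
    return 0;
}
```

A plain snprintf() without the check would happily turn a value such as "a\nisolated = 0" into two settings, which is the class of bug described above.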

vstinner · October 1, 2023, 1:09pm

Ok, and now something completely different: PyInitConfig API.

I propose a new API made of only 11 functions to configure the Python initialization. It is based on previously discussed ideas:

- Opaque structure
- Allocate memory on the heap
- NEW! Use a string to identify a PyConfig member.

IMO using a string to identify a PyConfig member is more convenient than a PyTypeObject integer slot such as Py_tp_del (53). But the most important thing is that the list of members is not part of the API, so members can be added and removed without changing the API. Moreover, the type of a member can change in a future Python version.

Add PyInitConfig functions:

- PyInitConfig_Python_New() – caller must call PyInitConfig_Free() once done
- PyInitConfig_Isolated_New() – caller must call PyInitConfig_Free() once done
- PyInitConfig_Free(config)
- PyInitConfig_SetInt(config, key, value) – value is an int64_t
- PyInitConfig_SetStr(config, key, const char* value) – bytes string
- PyInitConfig_SetStrList(config, key, length, items) – bytes strings (ex: argv)
- PyInitConfig_SetWStr(config, key, value) – wide string
- PyInitConfig_SetWStrList(config, key, length, items) – wide strings (ex: xoptions)
- char* PyInitConfig_GetErrorMsg(config) – caller must call free() once done

Also add functions using it:

- Py_InitializeFromInitConfig(config)
- Py_ExitWithInitConfig(config)

See the PR for the exact API.

Long example showing usage of most APIs:

static int test_initconfig_api(void)
{
    PyInitConfig *config = PyInitConfig_Python_New();
    if (config == NULL) {
        printf("Init allocation error\n");
        return 1;
    }

    if (PyInitConfig_SetInt(config, "dev_mode", 1) < 0) {
        goto error;
    }

    // Set a list of wide strings (argv)
    wchar_t *argv[] = {PROGRAM_NAME, L"-c", L"pass"};
    if (PyInitConfig_SetWStrList(config, "argv",
                                 Py_ARRAY_LENGTH(argv), argv) < 0) {
        goto error;
    }

    if (PyInitConfig_SetInt(config, "hash_seed", 10) < 0) {
        goto error;
    }

    // Set a wide string (program_name)
    if (PyInitConfig_SetWStr(config, "program_name", PROGRAM_NAME) < 0) {
        goto error;
    }

    // Set a bytes string (pycache_prefix)
    if (PyInitConfig_SetStr(config, "pycache_prefix",
                            "conf_pycache_prefix") < 0) {
        goto error;
    }

    // Set a list of bytes strings (xoptions)
    char* xoptions[] = {"faulthandler"};
    if (PyInitConfig_SetStrList(config, "xoptions",
                                Py_ARRAY_LENGTH(xoptions), xoptions) < 0) {
        goto error;
    }


    if (Py_InitializeFromInitConfig(config) < 0) {
        Py_ExitWithInitConfig(config);
    }
    PyInitConfig_Free(config);

    dump_config();
    Py_Finalize();
    return 0;

error:
    printf("Init failed:\n");
    Py_ExitWithInitConfig(config);
}

vstinner · October 1, 2023, 1:11pm

The implementation is just a convenient wrapper on top of the existing API. The current structure:

struct PyInitConfig {
    PyConfig config;
    PyStatus status;
    const char *err_msg;
};
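
One way a wrapper like this could map a string key to a member is a name/type/offset table kept inside the library. The following is a minimal sketch of that technique only; the struct, the table, and demo_set_int() are hypothetical and do not reflect the real PyConfig layout or the code in the PR.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the wrapped config; not the real PyConfig. */
typedef struct {
    int dev_mode;
    int isolated;
    unsigned long hash_seed;
} DemoNamedConfig;

typedef enum { DEMO_INT, DEMO_ULONG } DemoMemberType;

typedef struct {
    const char *name;
    DemoMemberType type;
    size_t offset;
} DemoMemberSpec;

/* The table lives inside the library, so it can grow without touching callers. */
static const DemoMemberSpec demo_members[] = {
    {"dev_mode", DEMO_INT, offsetof(DemoNamedConfig, dev_mode)},
    {"isolated", DEMO_INT, offsetof(DemoNamedConfig, isolated)},
    {"hash_seed", DEMO_ULONG, offsetof(DemoNamedConfig, hash_seed)},
};

/* Set an integer member by name. Returns 0 on success, -1 on error. */
int demo_set_int(DemoNamedConfig *config, const char *key, int64_t value)
{
    size_t count = sizeof(demo_members) / sizeof(demo_members[0]);
    for (size_t i = 0; i < count; i++) {
        const DemoMemberSpec *spec = &demo_members[i];
        if (strcmp(spec->name, key) != 0) {
            continue;
        }
        char *member = (char *)config + spec->offset;
        switch (spec->type) {
        case DEMO_INT:
            if (value < INT_MIN || value > INT_MAX) return -1;
            *(int *)member = (int)value;
            return 0;
        case DEMO_ULONG:
            if (value < 0 || (uint64_t)value > ULONG_MAX) return -1;
            *(unsigned long *)member = (unsigned long)value;
            return 0;
        }
        return -1;
    }
    return -1;  /* unknown member name */
}
```

Because the caller passes a fixed-width int64_t and the library range-checks against the member's actual type, the member's C type can change between releases without changing the exported function signature.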