Adding char* based APIs for Unix

Please see ongoing discussion on a new API for Python initialization.

One question is whether we need to split Python initialization in two parts: pre-initialization using only bytes, which chooses the encodings, and initialization using Unicode. Right now, _PyCoreConfig allows choosing the encodings (e.g. filesystem_encoding + filesystem_errors), and accepts both Unicode strings (wchar_t*, like program_name) and bytes strings (char*, like allocator).

New Python features like C locale coercion (PEP 538) and UTF-8 Mode (PEP 540) made Python initialization even more complicated than it was previously…

While passing bytes is fine for Unix, Unicode is the native type for Windows.

Hi Inada-san,

What question is being asked here? I looked at the issue, but it’s mostly around the interpreter initialisation, which I know little about and I’m happy to defer to the experts on. But the subject here is “Adding char* based APIs for Unix”, and that’s a very different question.

I don’t understand how you’d add an API “for Unix” - what would it do on Windows, raise an error? If it works on Windows, then we need to make sure that it behaves correctly when used by a developer who knows Unix but is unfamiliar with Windows, or vice versa - otherwise, we’ll end up with subtle and confusing portability issues.

At a minimum, though, we should be very clear on what encoding the APIs would use - even if Unix doesn’t care, dumping potentially unknown bytes into an API on Windows is not going to work (it’ll result in decoding errors).

I’m fine with having low-level APIs that reflect the differing underlying data on Unix and Windows, but at some level someone is going to have to make hard decisions about encodings and how to support platform differences - and I’d rather our APIs made it easy to know when to make those choices, and hard to accidentally (or otherwise) ignore them.

Edit: Wow, that was weird - Discourse presented Victor’s post as a new topic with no context, and it was only after I’d hit “send” that it showed me the post in the context of a longer thread :frowning: I’m not going to edit what I wrote above, it probably still reflects my thoughts, but please read it with that glitch in mind. Sorry…

[Trying that again… looks like Discourse doesn’t like inline quotes in emails]

Hi Inada-san,

Python has several high-level C APIs which accept or return wchar_t*.
That is OK on Windows, but I don’t want to use wchar_t* on Unix.

Could you explain the motivation for the latter?

I know that a main() taking wchar_t on Unix looks a bit odd, but
it’s not really something that’s stopping you from using those
APIs. wchar_t has been around long enough for people to feel
comfortable with it.

Thank you for pointing out the issue.

There will be a char* based way to configure Python once _PyCoreConfig becomes public.
Then there is no need to add char* based variants of the current APIs.

I’ll follow the issue and PEP 432.

I don’t think so. wchar_t was designed before the UTF-8 era.
Using wchar_t correctly on Unix is hard: it is based on the locale, and mbstowcs may be buggy on some systems.
I don’t want to learn about wchar_t, and I don’t want to force young people to learn about it either.

For example, vim uses Py_SetPythonHome. See this code.

Strictly speaking, vim should use Py_DecodeLocale instead of mbstowcs here, if I understand the API correctly. But I’m not sure.
This is one example of how embedding Python correctly on Unix is difficult.

Paths are bytes on Unix, so char* based APIs make it simple and easy to configure paths.


But that code has to work on Windows as well as Unix, so how is this relevant? Just using a char* API would break on Windows if you didn’t consider encodings.

Technically, char is based on locale while wchar is either UTF-16 or UTF-32. You’re right that the conversion is the hard part, but that’s why it has to be done by the embedder and not automatically in Python. Otherwise we end up with incredibly complicated logic and multiple PEPs to work around it :wink:

This is why I say we should aim for Python’s APIs to be Unicode everywhere and let encoding for a particular platform be done only at the boundaries. (in one of my PEPs I think I listed all the char* APIs that assumed UTF-8 without stating it… PEP 528 or 529 probably.)

In fact, most people on this thread have contributed at least one PEP related to character encoding in Python :slight_smile:


They already use char* and mbstowcs, so nothing gets worse.
Additionally, I’m talking about adding char* based APIs, not removing the wchar_t based APIs.
If vim wants to handle Unicode paths correctly on Windows, it can use wchar_t there via #ifdef.

Technically, char* is just a byte array. Even when we’re using a UTF-8 locale, there may be non-UTF-8 paths (e.g. a mounted FAT32 drive, or a zip file created on Windows). char* can handle them without any trick, but we need the surrogateescape trick when using wchar_t. That’s the major reason why people should use Py_DecodeLocale instead of mbstowcs.

Strictly speaking, there is no guarantee that wchar_t is UTF-16 or UTF-32. See this

A wchar_t may or may not be encoded in Unicode; this is platform and sometimes also locale dependent.

Please see the functions I listed. They can be used before Python is initialized.
I think command line arguments and paths should be passed to Python as bytes on Unix. And I think that is what Victor did in the python command while implementing PEP 540.
These bytes paths and command line arguments can be converted into Unicode while initializing Python.
Otherwise, an application embedding Python can’t change the locale or use UTF-8 Mode after having used Py_DecodeLocale or mbstowcs.

Strictly speaking, vim should use Py_DecodeLocale instead of mbstowcs here

Both are wrong when PEP 538 or PEP 540 are enabled :frowning:

True, except everywhere they are used in Python it is Unicode.

But the sort of requirements you list are why we need to collect all points of view before designing something new. Otherwise these cases are missed (or the opposite, these assumptions get baked in, and we end up ignoring other points of view).

I’d quite like to see a clear separation between the core runtime, which is platform and encoding independent, and the platform adaptation layer (PAL), which is the parts of the runtime that might talk to the OS (via the C runtime, for example).

The core runtime would require a lot more callbacks for everything from memory alloc/free (which we already have) to disk I/O (which we nearly already have), and the PAL would be the platform-specific implementations of those. Then we could treat a lot of initialisation data as plain old bytes, since the only thing they’d ever be used for is to pass back to the PAL that provided them in the first place. And the PAL would have to figure out source file encoding and give the core runtime UTF-8 to compile.

(Note that this is aimed at embedders. Normal use of python.exe comes with a PAL that does what it currently does. But by refactoring it like this, it’s easier for embedders to customize or omit it.)

I created a change which adds a new internal _PyArgv structure to be able to pass command line arguments as bytes (char*) or Unicode (wchar_t*). It’s a helper to factor out the code. My change also reworks Python initialization to be able to initialize the _PyPreConfig and _PyCoreConfig structures (lists of parameters) from command line arguments, whereas previously command line arguments were only handled in main.c (a somewhat private API).

This PR is a small step towards new Python initialization APIs which would accept command line arguments. (Py_Main() cannot be used to embed Python into an application.)

Shouldn’t Py_Main be the part that handles command line arguments? Since arguments are only really useful for our main entry point anyway, whereas embedders should do their own configuration (perhaps from their own command line)?

Honestly, I’m confused (as usual when we talk about the Python initialization) :slight_smile: We have a PySys_SetArgvEx() API, but when this function is called, the arguments are not parsed. I’m not sure if we should provide an API to parse them or not.

As a Python user, I’m used to passing options on the Python command line. Maybe it would be convenient to be able to do the same with the C API?

Well, what I wrote is that my change makes the code easier to maintain (it makes functions more “stateless”) and may later allow adding new APIs. I don’t plan to add any new public API at this point :wink:


If you don’t think you’re adding new API, and you’re making it easier to add API in the future, then I’m happy with whatever you’re doing :slight_smile:

@admins This discussion belongs on python-dev, non-committers cannot reply :frowning:


I have moved this discussion to the Users category. :slight_smile:


There are two major kinds of embedding scenario:

  • Applications embedding Python as a scripting engine (think Blender, Maya, etc)
  • Applications that are providing alternatives to the standard CPython CLI (think the isolated-by-default system-python idea in PEP 432, alternative REPLs, the bootstrapping no-import-system binary used to freeze importlib, etc)

Apps in the second category currently have a hard time correctly emulating CPython’s argument and environment handling, since a large chunk of it is hidden inside Py_Main. Even if you call Py_Initialize and then Py_Main, it isn’t quite the same thing, since not everything gets reconfigured after Py_Main has had a chance to look at the command line arguments.

So if we expose our command line processing machinery directly, then embedding applications can use that part of the system like a support library, rather than having to try to emulate it themselves.


Yeah, right now, almost all the new “configuration” read by Py_Main() is ignored and will not be applied. Python keeps the “old” configuration read by Py_Initialize().

The problem is that many Python objects are kept alive between Py_Initialize() and Py_Main(). For this reason, it’s not possible to change the memory allocator in such case.

I would suggest deprecating calling Py_Initialize() before Py_Main().
