I don’t think so. wchar_t was designed before the UTF-8 era.
Using wchar_t correctly on Unix is hard: it is locale-based, and mbstowcs may be buggy on some systems.
I don’t want to learn about wchar_t, and I don’t want to force young people to learn about it either.
For example, vim uses Py_SetPythonHome. See this code.
Strictly speaking, vim should use Py_DecodeLocale instead of mbstowcs here, if I understand the API correctly. But I’m not sure.
This is one example of how embedding Python correctly on Unix is difficult.
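For reference, a minimal sketch of the two approaches, assuming a hypothetical locale-encoded byte string `home` (the function names here are mine, not vim’s):

```c
#include <Python.h>
#include <stdlib.h>

/* What the vim code does today: mbstowcs fails (or garbles the result)
   on bytes that are invalid in the current locale. */
static void set_home_mbstowcs(const char *home)
{
    static wchar_t buf[1024];
    if (mbstowcs(buf, home, 1024) != (size_t)-1) {
        Py_SetPythonHome(buf);
    }
}

/* What the docs recommend: Py_DecodeLocale preserves undecodable bytes
   via surrogateescape instead of failing. */
static void set_home_decodelocale(const char *home)
{
    wchar_t *whome = Py_DecodeLocale(home, NULL);
    if (whome != NULL) {
        /* Py_SetPythonHome keeps a reference, so the buffer must stay
           alive for the rest of the program; don't free it here. */
        Py_SetPythonHome(whome);
    }
}
```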
Paths are bytes on Unix, so char*-based APIs make configuring paths simple and easy.
But that code has to work on Windows as well as Unix, so how is this relevant? Just using a char* API would break on Windows if you didn’t consider encodings.
Technically, char is locale-based while wchar_t is either UTF-16 or UTF-32. You’re right that the conversion is the hard part, but that’s why it has to be done by the embedder and not automatically in Python. Otherwise we end up with incredibly complicated logic and multiple PEPs to work around it.
This is why I say we should aim for Python’s APIs to be Unicode everywhere and let encoding for a particular platform be done only at the boundaries. (in one of my PEPs I think I listed all the char* APIs that assumed UTF-8 without stating it… PEP 528 or 529 probably.)
In fact, most people on this thread have contributed at least one PEP related to character encoding in Python.
They already use char* and mbstowcs, so nothing gets worse.
Additionally, I’m talking about adding char*-based APIs, not removing wchar_t-based APIs.
If vim wants to handle Unicode paths correctly on Windows, it can use wchar_t on Windows via #ifdef.
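A sketch of what that #ifdef could look like (the paths are placeholders):

```c
#include <Python.h>

/* Sketch: keep char* on Unix and use native UTF-16 wchar_t only on Windows. */
static void set_home(void)
{
#ifdef _WIN32
    /* On Windows, wchar_t is UTF-16, so a wide literal is unambiguous. */
    Py_SetPythonHome(L"C:\\PythonHome");
#else
    /* On Unix, decode the byte path using the current locale. */
    wchar_t *home = Py_DecodeLocale("/opt/pythonhome", NULL);
    if (home != NULL) {
        Py_SetPythonHome(home);  /* buffer must outlive the interpreter */
    }
#endif
}
```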
Technically, char* is just a byte array. Even with a UTF-8 locale, there may be non-UTF-8 paths (e.g. a mounted FAT32 drive, or a zip file created on Windows). char* can handle them without any trick, but wchar_t needs the surrogateescape trick. That’s the major reason people should use Py_DecodeLocale instead of mbstowcs.
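A short sketch of that round-trip property (the path is made up; whether the bytes decode cleanly or go through surrogateescape depends on the current locale, but they come back unchanged either way):

```c
#include <Python.h>
#include <assert.h>
#include <string.h>

static void roundtrip_demo(void)
{
    const char *raw = "/mnt/fat32/caf\xe9";   /* Latin-1 0xE9: invalid UTF-8 */
    wchar_t *w = Py_DecodeLocale(raw, NULL);  /* undecodable bytes -> surrogates */
    assert(w != NULL);
    char *back = Py_EncodeLocale(w, NULL);    /* surrogates -> original bytes */
    assert(back != NULL && strcmp(raw, back) == 0);
    PyMem_Free(back);     /* Py_EncodeLocale result is freed with PyMem_Free */
    PyMem_RawFree(w);     /* Py_DecodeLocale result with PyMem_RawFree */
}
```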
Strictly speaking, there is no guarantee that wchar_t is UTF-16 or UTF-32. See this:
“A wchar_t may or may not be encoded in Unicode; this is platform and sometimes also locale dependent.”
Please see the functions I listed; they can be used before Python is initialized.
I think command line arguments and paths should be passed to Python as bytes on Unix. And I think that is what Victor did for the python command while implementing PEP 540 (the UTF-8 Mode).
These byte paths and command line arguments can be converted to Unicode while Python is initializing.
Otherwise, an application embedding Python can’t change the locale or use UTF-8 mode after calling Py_DecodeLocale or mbstowcs.
True, except that everywhere they are used in Python, they are Unicode.
But the sort of requirements you list is exactly why we need to collect all points of view before designing something new. Otherwise these cases get missed (or the opposite: these assumptions get baked in, and we end up ignoring other points of view).
I’d quite like to see a clear separation between the core runtime, which is platform- and encoding-independent, and the platform adaptation layer (PAL), which is the part of the runtime that might talk to the OS (via the C runtime, for example).
The core runtime would require a lot more callbacks for everything from memory alloc/free (which we already have) to disk I/O (which we nearly already have), and the PAL would be the platform-specific implementations of those. Then we could treat a lot of initialisation data as plain old bytes, since the only thing they’d ever be used for is to pass back to the PAL that provided them in the first place. And the PAL would have to figure out source file encoding and give the core runtime UTF-8 to compile.
(Note that this is aimed at embedders. Normal use of python.exe comes with a PAL that does what it currently does. But by refactoring it like this, it’s easier for embedders to customize or omit it.)
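To make that concrete, here is a purely hypothetical sketch of what such a callback table might look like; none of these names exist in CPython, and it only illustrates the proposed split:

```c
#include <stddef.h>

/* Hypothetical PAL callback table: the core runtime calls these and never
   touches the OS directly. Paths stay plain bytes, since their only use is
   to be handed back to the PAL that produced them. */
typedef struct {
    /* memory (CPython already has PyMemAllocatorEx for this part) */
    void *(*alloc)(size_t size);
    void  (*free)(void *ptr);
    /* disk I/O */
    int   (*open_file)(const char *path_bytes, void **handle);
    long  (*read_file)(void *handle, char *buf, size_t len);
    void  (*close_file)(void *handle);
    /* source loading: the PAL resolves the file encoding and always
       hands the core UTF-8 to compile */
    char *(*read_source_as_utf8)(const char *path_bytes, size_t *len);
} PyPlatformAdapter;
```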
I created https://bugs.python.org/issue36142, which adds a new internal _PyArgv structure to be able to pass command line arguments as bytes (char*) or Unicode (wchar_t*). It’s a helper to factor out common code. My change also reworks Python initialization to be able to initialize the _PyPreConfig and _PyCoreConfig structures (lists of parameters) from command line arguments, whereas previously command line arguments were only handled in main.c (a somewhat private API).
This PR is a small step towards new Python initialization APIs which would accept command line arguments. (Py_Main() cannot be used to embed Python into an application.)
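For reference, the internal helper has roughly this shape (a private API that may change at any time; paraphrased, not a stable interface):

```c
/* Rough shape of the internal _PyArgv helper. */
typedef struct {
    Py_ssize_t argc;
    int use_bytes_argv;           /* selects which array below is used */
    char * const *bytes_argv;     /* argv as bytes (char*) */
    wchar_t * const *wchar_argv;  /* argv as Unicode (wchar_t*) */
} _PyArgv;
```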
Shouldn’t Py_Main be the part that handles command line arguments? Since arguments are only really useful for our main entry point anyway, whereas embedders should do their own configuration (perhaps from their own command line)?
Honestly, I’m confused (as usual when we talk about Python initialization). We have a PySys_SetArgvEx() API, but when this function is called, the arguments are not parsed. I’m not sure whether we should provide an API to parse them or not.
As a Python user, I’m used to passing options on the Python command line. Maybe it would be convenient to be able to do the same with the C API?
Well, what I wrote is that my change makes the code easier to maintain (it makes functions more “stateless”) and makes it possible to add new APIs later. I don’t plan to add any new public API at this point.
There are two categories of embedding applications to consider here:

1. Applications embedding Python as a scripting engine (think Blender, Maya, etc.)
2. Applications providing alternatives to the standard CPython CLI (think the isolated-by-default system-python idea in PEP 432, alternative REPLs, the bootstrapping no-import-system binary used to freeze importlib, etc.)
Apps in the second category currently have a hard time correctly emulating CPython’s argument and environment handling, since a large chunk of it is hidden inside Py_Main. Even if you call Py_Initialize and then Py_Main, it isn’t quite the same thing, since not everything gets reconfigured after Py_Main has had a chance to look at the command line arguments.
So if we expose our command line processing machinery directly, then embedding applications can use that part of the system like a support library, rather than having to try to emulate it themselves.
Yeah, right now, almost all the new “configuration” read by Py_Main() is ignored and will not be applied. Python keeps the “old” configuration read by Py_Initialize().
The problem is that many Python objects are kept alive between Py_Initialize() and Py_Main(). For this reason, it’s not possible to change the memory allocator in that case.
I would suggest deprecating calling Py_Initialize() before Py_Main().
FYI, I finished (merged) the implementation and closed the issue. We can now discuss in https://bugs.python.org/issue36204 whether we need to add a new Py_InitializeFromArgv() API.
Oh wow, Discourse disallowed me from posting a “4th reply in a row”, even though I replied to 4 different messages. It forced me to edit a previous message… So here is my 4th message:
Most or even all of the issues described in this discussion are solved by my PEP 587: https://www.python.org/dev/peps/pep-0587/

PyConfig_SetBytesString() and PyConfig_SetBytesArgv() decode byte strings for you, and the PEP adds a new “preinitialization” phase with a dedicated PyPreConfig configuration to configure the LC_CTYPE locale. The PEP also allows parsing argv as command line arguments. It implements the two use cases described by Nick Coghlan with two separate default configurations: the “Python Configuration” and the “Isolated Configuration”.
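A minimal sketch of that flow, using the PEP 587 API as it eventually landed in Python 3.8:

```c
#include <Python.h>

int main(int argc, char *argv[])
{
    PyStatus status;

    /* Preinitialization: decide the LC_CTYPE locale / UTF-8 mode before
       anything gets decoded. */
    PyPreConfig preconfig;
    PyPreConfig_InitPythonConfig(&preconfig);
    preconfig.utf8_mode = 1;
    status = Py_PreInitialize(&preconfig);
    if (PyStatus_Exception(status)) {
        Py_ExitStatusException(status);
    }

    /* "Python Configuration": behave like the regular python binary
       (parse_argv defaults to 1, so argv is parsed as a command line). */
    PyConfig config;
    PyConfig_InitPythonConfig(&config);

    /* Pass argv as bytes; Python decodes it with the right encoding. */
    status = PyConfig_SetBytesArgv(&config, argc, argv);
    if (PyStatus_Exception(status)) {
        goto fail;
    }

    status = Py_InitializeFromConfig(&config);
    if (PyStatus_Exception(status)) {
        goto fail;
    }
    PyConfig_Clear(&config);

    return Py_RunMain();

fail:
    PyConfig_Clear(&config);
    Py_ExitStatusException(status);
}
```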