Adding char* based APIs for Unix

methane · February 20, 2019, 3:53am

Python has several high-level C APIs that accept or return wchar_t* strings.
That is OK on Windows, but I don’t want to use wchar_t* on Unix.

Victor added _Py_UnixMain(int argc, char **argv), which is the char* version of Py_Main(int argc, wchar_t **argv).
Can we make it a public API? Does the name look good?

There are some other wchar_t* APIs, too. Can we add char* versions of them?

Doc/c-api/sys.rst
218:.. c:function:: void PySys_AddWarnOption(const wchar_t *s)
233:.. c:function:: void PySys_SetPath(const wchar_t *path)
275:.. c:function:: void PySys_AddXOption(const wchar_t *s)

Doc/c-api/init.rst
344:.. c:function:: void Py_SetProgramName(const wchar_t *name)
375:.. c:function:: wchar_t* Py_GetPrefix()
388:.. c:function:: wchar_t* Py_GetExecPrefix()
423:.. c:function:: wchar_t* Py_GetProgramFullPath()
436:.. c:function:: wchar_t* Py_GetPath()
456:.. c:function:: void Py_SetPath(const wchar_t *)
551:.. c:function:: void PySys_SetArgvEx(int argc, wchar_t **argv, int updatepath)
599:.. c:function:: void PySys_SetArgv(int argc, wchar_t **argv)
611:.. c:function:: void Py_SetPythonHome(const wchar_t *home)

steve.dower · February 20, 2019, 5:09am

Provided we’re very good about them all being UTF-8 and not “whatever the current system maybe thinks it is”, I’m okay with going to char* for Windows as well.

None of us want to be in a place where char* means some random encoding, which is historically what it has meant and why they’ve been being deprecated without direct replacement.

(Put it on the list for when we do the big C API reset.)

vstinner · February 20, 2019, 11:30am

Please see the ongoing discussion at https://bugs.python.org/issue22213 on a new API for Python initialization.

One question is whether we need to split Python initialization into two parts: a pre-initialization that uses only bytes and chooses the encoding, and an initialization that uses Unicode. Right now, _PyCoreConfig lets you choose the encodings (e.g. filesystem_encoding + filesystem_errors) and accepts both Unicode strings (wchar_t*, like program_name) and byte strings (char*, like allocator).

New Python features like C locale coercion (PEP 538) and UTF-8 Mode (PEP 540) made the Python initialization even more complicated than it was previously…

While passing bytes is fine for Unix, Unicode is the native type for Windows.

pf_moore · February 20, 2019, 12:06pm

What question is being asked here? I looked at the issue, but it’s mostly around the interpreter initialisation, which I know little about and I’m happy to defer to the experts on. But the subject here is “Adding char* based APIs for Unix”, and that’s a very different question.

I don’t understand how you’d add an API “for Unix” - what would it do on Windows, raise an error? If it works on Windows, then we need to make sure that it behaves correctly when used by a developer who knows Unix but is unfamiliar with Windows, or vice versa - otherwise, we’ll end up with subtle and confusing portability issues.

At a minimum, we should be very clear on what encoding the APIs would use - even if Unix doesn’t care, dumping potentially unknown bytes into an API on Windows is not going to work (it’ll result in decoding errors).

I’m fine with having low-level APIs that reflect the differing underlying data on Unix and Windows, but at some level someone is going to have to make hard decisions about encodings and how to support platform differences - and I’d rather our APIs made it easy to know when to make those choices, and hard to accidentally (or otherwise) ignore them.

Edit: Wow, that was weird - Discourse presented Victor’s post as a new topic with no context, and it was only after I’d hit “send” that it showed me the post in the context of a longer thread. I’m not going to edit what I wrote above, as it probably still reflects my thoughts, but please read it with that glitch in mind. Sorry…

malemburg · February 20, 2019, 12:11pm

[Trying that again… looks like Discourse doesn’t like inline quotes in emails]

Hi Inada-san,

Python has several high-level C API which accept or return wchar_t* string.
It is OK on Windows, but I don’t want to use wchar_t* on Unix.

Could you explain the motivation for the latter?

I know that a main() taking wchar_t on Unix looks a bit odd, but
it’s not really something that’s stopping you from using those
APIs. wchar_t has been around long enough for people to feel
comfortable with it.

methane · February 20, 2019, 12:29pm

Thank you for pointing out the issue.

There will be a char*-based way to configure Python once _PyCoreConfig becomes public.
Then there is no need to add char* versions of the current APIs.

I’ll follow the issue and PEP 432.

methane · February 20, 2019, 12:53pm

I don’t think so. wchar_t was designed before the UTF-8 era.
Using wchar_t the right way on Unix is hard: it depends on the locale, and mbstowcs may be buggy on some systems.
I don’t want to learn about wchar_t, and I don’t want to force young people to learn about it either.

For example, vim uses Py_SetPythonHome. See this code.

https://github.com/vim/vim/blob/14816ad6e58336773443f5ee2e4aa9e384af65d2/src/if_python3.c#L874-L887

    if (*p_py3home != NUL)
    {
        size_t len = mbstowcs(NULL, (char *)p_py3home, 0) + 1;

        /* The string must not change later, make a copy in static memory. */
        py_home_buf = (wchar_t *)alloc(len * sizeof(wchar_t));
        if (py_home_buf != NULL && mbstowcs(
                py_home_buf, (char *)p_py3home, len) != (size_t)-1)
            Py_SetPythonHome(py_home_buf);
    }
    #ifdef PYTHON3_HOME
    else if (mch_getenv((char_u *)"PYTHONHOME") == NULL)
        Py_SetPythonHome(PYTHON3_HOME);
    #endif

Strictly speaking, vim should use Py_DecodeLocale instead of mbstowcs here, if I understand the API correctly. But I’m not sure.
This is one example of how embedding Python correctly on Unix is difficult.

Paths are bytes on Unix, so char*-based APIs are a simple and easy way to configure paths.
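
To illustrate the kind of care a locale-based conversion needs, here is a minimal sketch in plain C. The helper name locale_to_wide is hypothetical and is not part of vim or Python. It checks the result of the sizing call to mbstowcs before allocating, so a failed conversion cannot lead to an undersized buffer. It still depends on the current locale and rejects bytes the locale cannot decode, which is the gap Py_DecodeLocale is meant to cover.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not a vim or Python API): convert a byte string
   in the current locale encoding to a newly allocated wide string.
   Returns NULL on a decoding error or allocation failure.
   The caller releases the result with free(). */
wchar_t *locale_to_wide(const char *s)
{
    /* First call only measures; (size_t)-1 signals an invalid sequence. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *buf = malloc((n + 1) * sizeof(wchar_t));
    if (buf == NULL)
        return NULL;

    /* Second call converts, including the terminating null. */
    if (mbstowcs(buf, s, n + 1) == (size_t)-1) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

An embedder would pass the returned buffer to a wchar_t* API and keep it alive for as long as that API requires.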

pf_moore · February 20, 2019, 1:46pm

But that code has to work on Windows as well as Unix, so how is this relevant? Just using a char* API would break on Windows if you didn’t consider encodings.

steve.dower · February 20, 2019, 2:19pm

Technically, char is based on locale while wchar is either UTF-16 or UTF-32. You’re right that the conversion is the hard part, but that’s why it has to be done by the embedder and not automatically in Python. Otherwise we end up with incredibly complicated logic and multiple PEPs to work around it.

This is why I say we should aim for Python’s APIs to be Unicode everywhere and let encoding for a particular platform be done only at the boundaries. (in one of my PEPs I think I listed all the char* APIs that assumed UTF-8 without stating it… PEP 528 or 529 probably.)

In fact, most people on this thread have contributed at least one PEP related to character encoding in Python.

methane · February 20, 2019, 3:20pm

They use char* and mbstowcs already, so nothing gets worse.
Additionally, I’m talking about adding char*-based APIs, not removing the wchar_t-based APIs.
If vim wants to handle Unicode paths correctly on Windows, they can use wchar_t on Windows via #ifdef.
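
A minimal sketch of the #ifdef approach, assuming the application picks one "native" character type per platform. The names native_char, NATIVE_TEXT and native_len are hypothetical and only show the shape of the idea; they are not vim or Python identifiers.

```c
#include <stddef.h>

#ifdef _WIN32
#include <wchar.h>
typedef wchar_t native_char;          /* wide strings on Windows */
#define NATIVE_TEXT(s) L##s
#else
typedef char native_char;             /* raw bytes on Unix */
#define NATIVE_TEXT(s) s
#endif

/* Length in native units; the same source compiles on both platforms. */
size_t native_len(const native_char *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

With such a typedef, only the call site that hands the string to Python needs to choose between a char* entry point and a wchar_t* one.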

methane · February 20, 2019, 3:48pm

Technically, char* is just a byte array. Even with a UTF-8 locale, there may be non-UTF-8 paths (e.g. after mounting a FAT32 drive or unzipping a zip file created on Windows). char* can handle them without any tricks, but wchar_t needs the surrogateescape trick. That’s the major reason why people should use Py_DecodeLocale instead of mbstowcs.
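
The surrogateescape idea can be sketched in plain C. This is a simplified illustration that assumes an ASCII base encoding rather than UTF-8, and the function names escape_decode and escape_encode are hypothetical; it is not Python's implementation. Each undecodable byte b is mapped to the lone surrogate U+DC00 + b (so U+DC80 to U+DCFF), which lets the original bytes be recovered exactly.

```c
#include <stddef.h>
#include <wchar.h>

/* Decode n bytes into out (room for n + 1 elements required).
   ASCII bytes map to themselves; any other byte b becomes the lone
   surrogate 0xDC00 + b so that it can be restored later. */
void escape_decode(const unsigned char *in, size_t n, wchar_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] < 0x80 ? (wchar_t)in[i] : (wchar_t)(0xDC00 + in[i]);
    out[n] = L'\0';
}

/* Encode n wide characters back to bytes (room for n bytes required).
   Returns 0 on success, -1 if a character is neither ASCII nor an
   escaped byte in the range U+DC80..U+DCFF. */
int escape_encode(const wchar_t *in, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        unsigned long c = (unsigned long)in[i];
        if (c < 0x80)
            out[i] = (unsigned char)c;
        else if (c >= 0xDC80 && c <= 0xDCFF)
            out[i] = (unsigned char)(c - 0xDC00);
        else
            return -1;
    }
    return 0;
}
```

A path such as the bytes "/tmp/" followed by 0xE9 survives the round trip unchanged, which a strict decoder would refuse.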

Strictly speaking, there is no guarantee that wchar_t is UTF-16 or UTF-32. See this:

A wchar_t may or may not be encoded in Unicode; this is platform and sometimes also locale dependent.

Please see the functions I listed. They can be used before Python is initialized.
I think command-line arguments and paths should be passed to Python as bytes on Unix. And I think that is what Victor did in the python command while implementing PEP 540.
These byte paths and command-line arguments can be converted to Unicode while initializing Python.
Otherwise, an application embedding Python can’t change the locale or use UTF-8 mode after using Py_DecodeLocale or mbstowcs.

vstinner · February 20, 2019, 6:06pm

Strictly speaking, vim should use Py_DecodeLocale instead of mbstowcs here

Both are wrong when PEP 538 or PEP 540 are enabled.

steve.dower · February 20, 2019, 7:23pm

True, except everywhere they are used in Python it is Unicode.

But the sort of requirements you list are why we need to collect all points of view before designing something new. Otherwise these cases are missed (or the opposite, these assumptions get baked in, and we end up ignoring other points of view).

I’d quite like to see a clear separation between the core runtime, which is platform and encoding independent, and the platform adaptation layer (PAL), which is the part of the runtime that might talk to the OS (via the C runtime, for example).

The core runtime would require a lot more callbacks for everything from memory alloc/free (which we already have) to disk I/O (which we nearly already have), and the PAL would be the platform-specific implementations of those. Then we could treat a lot of initialisation data as plain old bytes, since the only thing they’d ever be used for is to pass back to the PAL that provided them in the first place. And the PAL would have to figure out source file encoding and give the core runtime UTF-8 to compile.

(Note that this is aimed at embedders. Normal use of python.exe comes with a PAL that does what it currently does. But by refactoring it like this it’s easier for embedders to customize or omit it.)

vstinner · February 28, 2019, 1:56am

I created https://bugs.python.org/issue36142 which adds a new internal _PyArgv structure to be able to pass command line arguments as bytes (char*) or Unicode (wchar_t*). It’s a helper to factor out common code. My change also reworks Python initialization to be able to initialize the _PyPreConfig and _PyCoreConfig structures (lists of parameters) from command line arguments, whereas previously command line arguments were only handled in main.c (a somewhat private API).
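
As a rough sketch of what such a structure could look like, the code below holds either a bytes argv or a wide argv and hands out wide strings on demand. The type example_argv, its field names and example_argv_get are hypothetical and are not the actual _PyArgv layout. The sketch also uses mbstowcs as a stand-in decoder, whereas the point of deferring the conversion is that Python can pick the encoding during initialization.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical structure: exactly one of the two argv pointers is used. */
typedef struct {
    int argc;
    int use_bytes;                 /* nonzero: bytes_argv is valid */
    char *const *bytes_argv;
    wchar_t *const *wide_argv;
} example_argv;

/* Return argument i as a newly allocated wide string, or NULL on a bad
   index, decoding error or allocation failure. Caller calls free(). */
wchar_t *example_argv_get(const example_argv *args, int i)
{
    if (i < 0 || i >= args->argc)
        return NULL;

    if (!args->use_bytes) {
        /* Already wide: just copy, including the terminating null. */
        size_t n = wcslen(args->wide_argv[i]);
        wchar_t *copy = malloc((n + 1) * sizeof(wchar_t));
        if (copy != NULL)
            wmemcpy(copy, args->wide_argv[i], n + 1);
        return copy;
    }

    /* Bytes: decode with the current locale (stand-in for a real decoder). */
    size_t n = mbstowcs(NULL, args->bytes_argv[i], 0);
    if (n == (size_t)-1)
        return NULL;
    wchar_t *buf = malloc((n + 1) * sizeof(wchar_t));
    if (buf == NULL)
        return NULL;
    if (mbstowcs(buf, args->bytes_argv[i], n + 1) == (size_t)-1) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

The caller stores the arguments unchanged and the conversion happens in one place, so a Unix main(int, char **) and a Windows wmain(int, wchar_t **) could share the rest of the code.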

This PR is a small step towards new Python initialization APIs which would accept command line arguments. (Py_Main() cannot be used to embed Python into an application.)

steve.dower · February 28, 2019, 4:54am

Shouldn’t Py_Main be the part that handles command line arguments? Since arguments are only really useful for our main entry point anyway, whereas embedders should do their own configuration (perhaps from their own command line)?

vstinner · February 28, 2019, 3:27pm

Honestly, I’m confused (as usual when we talk about the Python initialization). We have a PySys_SetArgvEx() API, but when this function is called, the arguments are not parsed. I’m not sure if we should provide an API to parse them or not.

As a Python user, I’m used to passing options on the Python command line. Maybe it would be convenient to be able to do the same with the C API?

Well, what I wrote is that my change makes the code easier to maintain (it makes functions more “stateless”) and may allow new APIs to be added later. I don’t plan to add any new public API at this point.

steve.dower · February 28, 2019, 7:25pm

If you don’t think you’re adding new API, and you’re making it easier to add API in the future, then I’m happy with whatever you’re doing.

vstinner · March 1, 2019, 5:15pm

@admins This discussion belongs to python-dev, non-committers cannot reply

pablogsal · March 1, 2019, 5:18pm

I have moved this discussion to the Users category.