PEP 756 – [C API] Add PyUnicode_Export() and PyUnicode_Import() C functions

Right, but the section from Victor’s post that I quoted is the flag that enables conversion and allocation to get the format from the new API - in other words, apparently identical functionality. So if we’re going to have two APIs that do the same thing, I want it to be very clear to users which one they ought to be using (though I prefer to have the APIs not do the same thing).

You mean PyUnicode_EXPORT_ALLOW_COPY specifically? Personally, I do not see a strong need for it (because what are you going to do if true zero-copy is not possible? Surely you still want to access the string anyway), but it seems important to other users. Some people prefer an explicit error when their code is not as performant as they’d like (of course, “performant” is usually more complex than knowing whether a string access is zero-copy, but…).

If the PyUnicode_EXPORT_ALLOW_COPY flag is removed, I would suggest modifying the implementation to not support UTF-8 on CPython, because of surrogate characters. It sounds bad to me that, depending on the string content, the export may or may not work :frowning:

If you use PyUnicode_AsUTF8AndSize(), the contract is clear: you request a valid UTF-8 string, so surrogate characters are disallowed:

>>> "\udc80".encode("utf8")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

But for a function called “Export”, I would expect it to just “export” what we have, unmodified. The problem is that the implementation in CPython 3.14 requires encoding the string to UTF-8 with surrogateescape. If the string contains a surrogate character, the operation’s complexity becomes O(n), which is not what we want.
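To illustrate the surrogateescape behaviour the current implementation relies on (plain Python, not the C API):

```python
# surrogateescape maps a lone surrogate back to the raw byte it
# stands for, so the export can succeed where strict UTF-8 fails --
# but the encoder must scan the whole string, hence O(n).
s = "\udc80"

try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 refuses lone surrogates")

print(s.encode("utf-8", "surrogateescape"))  # b'\x80'
```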

In short, PyUnicode_FORMAT_UTF8 would only be provided for other Python implementations which can provide O(1) export.

1 Like

It had never occurred to me that this was the case. The documentation does not mention it; it just vaguely talks about possible errors, and I was assuming the only concrete error was MemoryError. Perhaps this documentation can be improved to mention the lone-surrogates issue?

Sure: I wrote PR gh-124605 to mention surrogate characters explicitly.

Thanks for the proposed changes in PEP 756: Give up on copying memory by vstinner · Pull Request #3999 · python/peps · GitHub. With those, I am +1 on this proposal.

1 Like

I updated PEP 756 again to make it much simpler:

  • PyUnicode_Export() never copies memory or converts between formats; it always exposes exactly what we have, unmodified. On CPython, it always has O(1) complexity.
  • Remove the PyUnicode_EXPORT_ALLOW_COPY flag.
  • On CPython (3.14), PyUnicode_Export() no longer supports PyUnicode_FORMAT_UTF8.
  • PEP 756 describes why no conversion is done and why UTF-8 is not supported in the Rejected Ideas section.
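Under the simplified design, a caller might look roughly like this (a sketch only: PyUnicode_Export() and the PyUnicode_FORMAT_* flags are the draft PEP 756 API and do not exist in released CPython; the exact signature may change):

```c
/* Sketch of the draft PEP 756 API: pass a mask of acceptable
 * formats, get back a Py_buffer plus the format actually used.
 * No copy, no conversion -- the buffer views the existing data. */
Py_buffer view;
int32_t fmt = PyUnicode_Export(str,
                               PyUnicode_FORMAT_UCS1
                               | PyUnicode_FORMAT_UCS2
                               | PyUnicode_FORMAT_UCS4,
                               &view);
if (fmt < 0) {
    /* error: none of the requested formats could be exported */
}
else {
    /* ... read view.buf / view.len ... */
    PyBuffer_Release(&view);
}
```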

More precisely, CPython 3.14 will never return PyUnicode_FORMAT_UTF8. Saying it “no longer supports” might imply that you get an error if you request it, but that’s not the intent of the API at all.

Requesting (and handling) UTF-8 might be the only way to get O(1) behaviour from other implementations that offer the limited API.

1 Like

I still think we should provide only a stable ABI version of PyUnicode_AsUTF8AndSize(). It might accept an option to allow WTF-8 or not.

If we change the internal encoding of unicode in the near future, a lot of technical debt will remain in the PyUnicode_Export() implementation and in its users’ code.

I think the O(1) guarantee this API provides is not worth its complexity and technical debt. We should make an effort to keep the stable ABI as simple and clean as possible.

Almost all string processing code is O(n). O(n) + O(1) = O(n) + O(n) = O(n).

2 Likes

I consider that PEP 756 is now ready for pronouncement, so I submitted PEP 756 to the C API Working Group.

1 Like

Re-reading the conversation, it seems that when it comes to interoperability, UTF-8 ticks all the boxes except for:

  • not being able to encode all strings
  • not being CPython’s current internal format

I wonder if we should add PyUnicode_{As,From}WTF8AndSize to export UTF-8-with-surrogates-or-other-illegal-characters, cache that representation as we do for UTF-8, and expect all C-API implementations to provide it (with expected O(1) complexity after a one-time O(n) conversion).
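For a feel of what such an export would contain: Python’s surrogatepass error handler already produces the WTF-8-style byte sequence for a lone surrogate (plain Python illustration; the PyUnicode_{As,From}WTF8AndSize names above are a proposal, not an existing API):

```python
# WTF-8 encodes a lone surrogate as the 3-byte UTF-8-style sequence
# for its code point; surrogatepass does the same at the codec level.
data = "\udc80".encode("utf-8", "surrogatepass")
print(data)                                   # b'\xed\xb2\x80'
print(data.decode("utf-8", "surrogatepass"))  # '\udc80'
```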

If we expect CPython to switch to that for the “main” internal storage in a few years, adding API for other formats now does seem premature. Nobody would use it after that switch.

That’s a pretty big “if” that is for now only corroborated by the existence of an open issue on GH.

I’m not sure how that would really solve the issue for the libraries that currently peek into the internal UCS-<n> representation. If those libraries do this, it’s probably because they want to access unicode contents at a minimal cost, not because they are concerned with surrogates.

1 Like

I checked how duckdb uses PyUnicode_4BYTE_DATA(). They use it to create Unicode instances, not to read from them.

They could also use PyUnicode_FromStringAndSize(), but they don’t because it is slow.

Maybe we need to check whether PyUnicode_FromStringAndSize() is really slower than their code, and why. (UnicodeWriter? Checking for lone surrogates?)

1 Like

In the case of Levenshtein, it seems they just iterate over the code points of a Unicode string.
I didn’t read the code carefully, so I’m sorry if I am wrong.

Adding an API for iterating codepoints would help projects like Levenshtein write code that works fine with PyPy and CPython.

2 Likes

orjson also uses PyUnicode_DATA to create Unicode objects from UTF-8.

They use a two-pass approach: in the first pass, they detect latin1/UCS-2/UCS-4 (with AVX).

If we provided a fast API to create Unicode from UTF-8, orjson and duckdb could remove their PEP 393-dependent code.
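That first pass effectively computes the maximum code point to choose a PEP 393 storage kind. A rough Python rendering of the idea (the real code is vectorized C; `pep393_kind` is a made-up name for illustration):

```python
def pep393_kind(s: str) -> int:
    """Bytes per code point a PEP 393 string would use for s."""
    m = max(map(ord, s), default=0)
    if m < 0x100:
        return 1  # latin-1 (UCS-1)
    if m < 0x10000:
        return 2  # UCS-2
    return 4      # UCS-4

print(pep393_kind("ascii"), pep393_kind("héllo"),
      pep393_kind("日本"), pep393_kind("😀"))  # 1 1 2 4
```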

1 Like

PyICU (and PyICU-binary) uses PyUnicode_4BYTE_DATA() to create Unicode objects from UTF-16 data.

I think they could just use PyUnicode_DecodeUTF16().
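One thing PyUnicode_DecodeUTF16() handles for free is surrogate pairs, which a hand-rolled copy into PyUnicode_4BYTE_DATA() must reimplement. In Python terms:

```python
# A UTF-16 surrogate pair (here D83D DE00, for U+1F600) decodes to
# a single code point -- the same joining PyUnicode_DecodeUTF16()
# performs at the C level.
data = b"\x3d\xd8\x00\xde"      # little-endian UTF-16
text = data.decode("utf-16-le")
print(text, len(text))          # 😀 1
```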

2 Likes

After looking at some projects’ code, I think PyUnicode_Import is not worth it.

They use the PyUnicode_*_DATA() APIs for on-the-fly conversion from UTF-8 or UTF-16, but PyUnicode_Import() requires a temporary buffer.

We already have PyUnicode_FromStringAndSize() and PyUnicode_DecodeUTF8/16/32() in the stable API.
We could optimize and promote them instead of PyUnicode_New() + PyUnicode_*_DATA().

By the way, we have the no-op PyUnicode_READY(). How about recommending (or requiring) calling it after PyUnicode_New() in the Python/C API reference? It would help PyPy, and future CPython.

(previous discussion about un-deprecate PyUnicode_READY()) Un-deprecate PyUnicode_READY() for future Unicode improvement

1 Like

This is the case I think is most interesting to enable. There are plenty of applications that would benefit from iterating/searching the raw codepoints without copying (e.g. XML/JSON parsing or regex search on large strings), and are likely willing to handle a range of encodings.

But I don’t think we can provide a worthwhile iteration API other than exporting a raw data pointer. Any additional function calls during iteration would spoil it worse than doing the conversion.

Exporting a raw pointer is not very friendly to non-CPython implementations, is it?

A batched copy-export API could perhaps work. By batching into a caller-provided buffer, one can ensure that the copy part is very fast (as long as the destination buffer stays in L1 cache).

It would require benchmarking on an actual use case.
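The batching idea, sketched in Python (a hypothetical C API would instead fill a caller-provided buffer of UCS-4 code points on each call; `iter_batches` is an illustration, not a proposed name):

```python
def iter_batches(s: str, batch_size: int = 4096):
    """Yield fixed-size slices of s; each slice models one fill of
    a caller-provided buffer small enough to stay in L1 cache."""
    for start in range(0, len(s), batch_size):
        yield s[start:start + batch_size]

chunks = list(iter_batches("x" * 10_000, 4096))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```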

It’s no worse than any of the other requirements already imposed on them. This API returns a Py_buffer, which at least means the caller is obligated to release it, so if the only way to fulfill the API is to allocate something new and do a copy then it’s possible without having to deal with loose references.