Un-deprecate PyUnicode_READY() for future Unicode improvement

I have implemented PEP 623 (Remove wstr from Unicode).

PyUnicode_READY() used to convert the wstr representation to the PEP 393 representation. It is now a no-op, so it has been deprecated.

On the other hand, I don’t think PEP 393 is our final goal. I expect we will move to a UTF-8 based approach at some point.

For example:

  1. PyUnicode_FromString(b) and b.decode() may create a non-compact UnicodeObject that has a utf8 buffer but no PEP 393 data, when b is long and contains at least one non-latin1 character.
  2. PyUnicode_DATA() and PyUnicode_nBYTE_DATA() will create the PEP 393 representation on the fly.

But PyUnicode_DATA() and PyUnicode_nBYTE_DATA() are no-check APIs.
Making them return NULL and set an exception would be a breaking change.
How should we solve this?
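
To make the problem concrete, here is a minimal sketch (not code from CPython) of the kind of caller that exists everywhere today: it treats the PEP 393 accessors as infallible, so a lazy utf8 → PEP 393 conversion triggered inside PyUnicode_DATA() would have no way to report a MemoryError.

    #include <Python.h>

    /* Typical existing caller: PyUnicode_DATA() and PyUnicode_READ() are
     * assumed to never fail, so there is no place to handle an error from
     * a lazy conversion. */
    static Py_ssize_t
    count_char(PyObject *str, Py_UCS4 ch)
    {
        int kind = PyUnicode_KIND(str);
        const void *data = PyUnicode_DATA(str);   /* no error check possible */
        Py_ssize_t len = PyUnicode_GET_LENGTH(str);
        Py_ssize_t n = 0;
        for (Py_ssize_t i = 0; i < len; i++) {
            if (PyUnicode_READ(kind, data, i) == ch) {
                n++;
            }
        }
        return n;
    }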

A: Keep PyUnicode_READY()

Do the utf8 → PEP 393 data conversion in PyUnicode_READY().

Pros and cons:

  • Ugly
  • Backward compatible.

B: Make PyUnicode_DATA() return an error

PyUnicode_DATA() does the conversion itself and raises MemoryError if it fails.

Pros and cons:

  • Simple
  • Backward incompatible.

C: Both

Do the utf8 → PEP 393 conversion in both PyUnicode_READY() and PyUnicode_DATA().

Old code can keep calling PyUnicode_READY() before using the PEP 393 APIs; after a successful call, PyUnicode_DATA() must not return an error.

New code can just use PyUnicode_DATA() and check its return value.
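
A rough sketch of what option C would mean for callers (illustrative only; it assumes PyUnicode_DATA() gains the hypothetical ability to fail and set MemoryError):

    #include <Python.h>

    /* Old-style caller: keep the READY gate and never check PyUnicode_DATA().
     * Under option C, READY performs the lazy conversion and is the only
     * fallible call. */
    static const void *
    old_style_data(PyObject *obj)
    {
        if (PyUnicode_READY(obj) < 0) {
            return NULL;                 /* exception already set */
        }
        return PyUnicode_DATA(obj);      /* cannot fail after READY */
    }

    /* New-style caller: skip READY and check PyUnicode_DATA() directly.
     * (Hypothetical behaviour: NULL + MemoryError on a failed conversion.) */
    static const void *
    new_style_data(PyObject *obj)
    {
        const void *data = PyUnicode_DATA(obj);
        if (data == NULL) {
            return NULL;                 /* exception already set */
        }
        return data;
    }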

Pros and cons:

  • Still ugly
    • But simple for new code
  • Backward compatible

If we add support for strings stored as UTF-8 in the Python str type, I propose two options:

  • Remove ASCII, UCS1, UCS2 and UCS4 kinds: only use UTF-8
  • or: Add a PyUnicode_UTF8_KIND kind and modify all functions relying on kind to support this new kind
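
A purely hypothetical sketch of the second option: PyUnicode_UTF8_KIND and utf8_read_index() do not exist in CPython; they only illustrate how every kind-based switch would have to grow a UTF-8 branch.

    #include <Python.h>

    /* Hypothetical: a kind-switch extended with a new PyUnicode_UTF8_KIND.
     * utf8_read_index() stands for some (possibly index-assisted) decoder
     * and is not a real function. */
    static Py_UCS4
    read_code_point(PyObject *s, Py_ssize_t i)
    {
        const void *data = PyUnicode_DATA(s);
        switch (PyUnicode_KIND(s)) {
        case PyUnicode_1BYTE_KIND:
            return ((const Py_UCS1 *)data)[i];
        case PyUnicode_2BYTE_KIND:
            return ((const Py_UCS2 *)data)[i];
        case PyUnicode_4BYTE_KIND:
            return ((const Py_UCS4 *)data)[i];
        case PyUnicode_UTF8_KIND:                 /* hypothetical new kind */
            return utf8_read_index((const char *)data, i);
        }
        Py_UNREACHABLE();
    }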

I don’t see the point of converting UTF-8 to UCS1, UCS2 or UCS4: it removes the benefits of UTF-8 compact storage.

In short, I don’t think that PyUnicode_READY() is needed to support UTF-8 in Python.

2 Likes

Essentially tell people to keep calling PyUnicode_READY for now, even if it’s a no-op, because we might give it a purpose again in the future? (I’m ignoring the specific example you gave, because I can think of others, and none of it really matters at this stage)

I can see the attraction, but I don’t think it’ll necessarily be the API we want anyway. Inherently, the READY API does an in-place mutation of the object, which I would think we would avoid in future anyway.

So I don’t really think PyUnicode_READY() is needed to support any enhancement of PyUnicodeObject. We’re best off cleaning it up with the rest now (though I’m fine with leaving it as a no-op if that helps projects migrate).

4 Likes

Both are breaking changes; many third-party libraries would be broken.
I think we need to provide backward compatibility by creating the PEP 393 representation on the fly when the PEP 393 APIs are called, just as we kept backward compatibility when the PEP 393 representation was introduced.

When considering only CPython – Yes.

When considering Python implementations other than CPython, PyUnicode_READY() is the only chance to create the PEP 393 representation on the fly and return an error when that fails. So it would not be a no-op.

If we deprecate PyUnicode_READY(), I think we need to change PyUnicode_DATA() and PyUnicode_nBYTE_DATA() so that they can return an error.

This applies both to Python implementations using UTF-8 as their internal representation and to a future CPython.

You need at least to check that the input is valid UTF-8. Such a check has almost the same cost as decoding from UTF-8, so I do not expect a large benefit from this. On the other hand, it will complicate the code, and all work with non-compact UnicodeObjects will in general be slower.

Before PEP 393 it was common to create an uninitialized UnicodeObject and then fill its content in place. This is more difficult with PEP 393 (you need to specify not only the length, but also the kind of the future UnicodeObject), and I think we should completely forbid modification of a UnicodeObject after creation in user code. The only official ways of creating a UnicodeObject should be PyUnicode_FromString(), PyUnicode_Decode*(), and the like. We can also add an official API for an efficient dynamic string builder (like _PyAccu/_PyUnicodeWriter).
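
For reference, the allocate-then-fill pattern that PEP 393 already makes awkward looks roughly like this (a sketch; with PEP 393 you must pick the length and the maximum character, and therefore the kind, up front):

    #include <Python.h>

    /* Allocate an ASCII-only string of length 3 and fill it in place.
     * PyUnicode_New() needs the maximum character (here 127) in advance,
     * which fixes the kind; writing a wider character later is an error. */
    static PyObject *
    make_abc(void)
    {
        PyObject *s = PyUnicode_New(3, 127);
        if (s == NULL) {
            return NULL;
        }
        int kind = PyUnicode_KIND(s);
        void *data = PyUnicode_DATA(s);
        PyUnicode_WRITE(kind, data, 0, 'a');
        PyUnicode_WRITE(kind, data, 1, 'b');
        PyUnicode_WRITE(kind, data, 2, 'c');
        return s;
    }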

For reading, we should provide an alternative to PyUnicode_DATA() which does not depend on the internal representation, but is efficient as well. PyUnicode_As*String(), PyUnicode_AsUCS4() and PyUnicode_AsWideCharString() are slow, because they always allocate memory and copy data. We need an “opener”, which returns a pointer to the internal representation and its width, allocating a new array only if needed, and a “closer”, which deallocates the memory if it was allocated. And maybe some helper macros for iterating and searching in variable-width representations like UTF-8 and UTF-16.
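
A hypothetical shape for such an opener/closer pair (none of these names exist in CPython; this only illustrates the idea of returning a pointer plus width, with a flag saying whether the closer must free anything):

    #include <Python.h>

    /* Hypothetical API sketch -- not part of CPython. */
    typedef struct {
        const void *data;      /* character data, 'kind' bytes per code point */
        int kind;              /* 1, 2 or 4 */
        Py_ssize_t length;     /* number of code points */
        void *to_free;         /* non-NULL if CloseData() must free a copy */
    } PyUnicode_DataView;

    /* Returns 0 on success, -1 with an exception set on failure.
     * Points into the internal representation when possible, otherwise
     * allocates and fills a temporary buffer. */
    int PyUnicode_OpenData(PyObject *str, PyUnicode_DataView *view);

    /* Frees the temporary buffer, if one was allocated by OpenData(). */
    void PyUnicode_CloseData(PyUnicode_DataView *view);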

This is just one of the ideas relying on a lazy PEP 393 representation.
For the longer term, I would like to change the main internal encoding to UTF-8 and create the PEP 393 representation only when a PEP 393 API is called or indexing is used.
With this change the decoding speed benefit would be very small, as you said, but memory usage and encoding speed would be improved. For example, a string of 1,000 ASCII characters plus one emoji needs roughly 4 KB of UCS4 character data under PEP 393, but only about 1 KB as UTF-8.

Additionally, some Python implementations using UTF-8 as their internal encoding provide the Python/C API too.
So I still think the Python/C API design should support creating the PEP 393 representation on the fly, regardless of whether CPython needs it or not.

I totally agree with you. But creating the PEP 393 representation on the fly is not modification after creation.
We already create the utf8 buffer and the hash on the fly after creation; that is not modification either.

How is your “opener” idea independent of the internal representation yet still efficient?

I think we should just promote PyUnicode_AsUTF8AndSize().
There are so many algorithms and libraries for UTF-8 written in C, C++, or Rust.
Writing string algorithms by hand for the three kinds, without using such libraries, is very painful.
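
For example, a C extension that wants to hand the string to a UTF-8 based library can use PyUnicode_AsUTF8AndSize(), which returns a pointer to the cached UTF-8 representation (creating it on first use) and NULL on failure. A minimal sketch:

    #include <Python.h>

    /* Pass a str to a UTF-8 based C/C++/Rust library without any manual
     * kind handling.  PyUnicode_AsUTF8AndSize() caches the UTF-8 buffer on
     * the object, so repeated calls are cheap. */
    static PyObject *
    utf8_length(PyObject *module, PyObject *arg)
    {
        Py_ssize_t size;
        const char *utf8 = PyUnicode_AsUTF8AndSize(arg, &size);
        if (utf8 == NULL) {
            return NULL;      /* TypeError or MemoryError already set */
        }
        /* ... call into the UTF-8 based library with (utf8, size) ... */
        return PyLong_FromSsize_t(size);
    }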

1 Like

It seems no one supports keeping PyUnicode_READY. I am abandoning ideas (A) and (C).

Notes:

  • There is too much code using it, so I won’t add Py_DEPRECATED until Python 3.11 reaches “security” status.
  • I have removed PyUnicode_READY from unicodeobject.c, but many PyUnicode_READY calls remain in other files. Feel free to remove them when you find them in a function you are working on.

I’m fine with leaving PyUnicode_READY() as a no-op (always returning 0) until we have a good reason to deprecate and remove it :slight_smile:

1 Like

See also this discussion: gh-89653: PEP 670: Convert PyUnicode_KIND() macro to function by vstinner · Pull Request #92705 · python/cpython · GitHub

I proposed adding a new public PyUnicodeBuilder C API on python-dev. It’s somewhat related to this discussion.

1 Like