Essentially tell people to keep calling PyUnicode_READY for now, even if it’s a no-op, because we might give it a purpose again in the future? (I’m ignoring the specific example you gave, because I can think of others, and none of it really matters at this stage)
I can see the attraction, but I don’t think it’ll necessarily be the API we want anyway. Inherently, the READY API does an in-place mutation of the object, which I would think we would avoid in future anyway.
So I don’t really think PyUnicode_READY() is needed to support any enhancement of PyUnicodeObject. We’re best off cleaning it up with the rest now (though I’m fine with leaving it as a no-op if that helps projects migrate).
Both are breaking changes; many third-party libraries would be broken.
I think we need to provide backward compatibility by creating the PEP 393 representation on the fly when a PEP 393 API is called. This is like how we kept backward compatibility when introducing the PEP 393 representation in the first place.
You need at least to check that the input is valid UTF-8. Such a check costs almost the same as decoding from UTF-8, so I do not expect a large benefit from this. On the other hand, it would complicate the code, and all work with non-compact UnicodeObjects would be slower in general.
Before PEP 393 it was common to create an uninitialized UnicodeObject and then fill its content in place. That is more difficult with PEP 393 (you need to specify not only the length but also the kind of the future UnicodeObject), and I think we should completely forbid modification of a UnicodeObject after creation in user code. The only official ways of creating a UnicodeObject should be PyUnicode_FromString(), PyUnicode_Decode*(), and the like. We could also add an official API for an efficient dynamic string builder (like _PyAccu/_PyUnicodeWriter).
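For reference, a sketch of how such a builder is used today via the private _PyUnicodeWriter (signatures as found in CPython headers around 3.9/3.10; private API, subject to change — an official version would presumably look similar):

```c
/* Build "abcabc...abc" without any in-place mutation of a finished
   str object: the writer owns the buffer until Finish() seals it. */
static PyObject *
repeat_abc(Py_ssize_t n)
{
    _PyUnicodeWriter writer;
    _PyUnicodeWriter_Init(&writer);
    writer.min_length = 3 * n;   /* size hint to avoid reallocations */
    for (Py_ssize_t i = 0; i < n; i++) {
        if (_PyUnicodeWriter_WriteASCIIString(&writer, "abc", 3) < 0) {
            _PyUnicodeWriter_Dealloc(&writer);
            return NULL;
        }
    }
    return _PyUnicodeWriter_Finish(&writer);
}
```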
For reading, we should provide an alternative to PyUnicode_DATA() which does not depend on the internal representation but is just as efficient. PyUnicode_As*String(), PyUnicode_AsUCS4() and PyUnicode_AsWideCharString() are slow because they always allocate memory and copy data. We need an “opener”, which returns a pointer to the internal representation and its width, allocating a new array only if needed, and a “closer”, which deallocates that memory if it was allocated. And maybe some helper macros for iterating and searching in variable-width representations like UTF-8 and UTF-16.
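For concreteness, the opener/closer pair could look something like the following. Every name here is hypothetical — nothing like this exists in CPython today; it is only a sketch of the proposed shape:

```c
/* Hypothetical API sketch -- not an existing CPython interface. */
typedef struct {
    const void *data;     /* pointer to code units */
    unsigned int kind;    /* bytes per code unit: 1, 2 or 4 */
    Py_ssize_t length;    /* number of code units */
    void *copy;           /* non-NULL iff the opener had to allocate */
} PyUnicode_View;

/* "Opener": points view->data at the internal buffer when possible,
   otherwise allocates a converted copy. Returns 0 on success. */
int PyUnicode_OpenView(PyObject *unicode, PyUnicode_View *view);

/* "Closer": frees view->copy if the opener allocated one; a no-op
   when the view borrowed the internal buffer. */
void PyUnicode_CloseView(PyUnicode_View *view);
```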
This is just one of the ideas that rely on a lazy PEP 393 representation.
For the longer term, I would like to change the main internal encoding to UTF-8 and create the PEP 393 representation only when a PEP 393 API is called or an index is used.
With this change the decoding speed benefit would become very small, as you said, but memory usage and encoding speed would be improved.
Additionally, some Python implementations that use UTF-8 as their internal encoding provide a Python/C API too.
So I still think the Python/C API design should support creating the PEP 393 representation on the fly, regardless of whether CPython needs it or not.
I totally agree with you. But creating the PEP 393 representation on the fly is not modification after creation.
We already create the utf8 buffer and the hash on the fly after creation. That is not modification.
How would your “opener” idea be independent of the internal representation yet still efficient?
I think we should just promote PyUnicode_AsUTF8AndSize().
There are so many algorithms and libraries for UTF-8 written in C, C++, or Rust.
Writing string algorithms by hand for the 3 kinds, without being able to use such libraries, is very painful.