The Python C API allows modifying immutable strings; it’s a common pattern used to create new strings (a minimal sketch of the classic pattern follows the list). Examples of such functions:
PyUnicode_New()
PyUnicode_FromStringAndSize(NULL, size)
PyUnicode_Resize()
PyUnicode_WriteChar()
PyUnicode_WRITE()
PyUnicode_CopyCharacters()
PyUnicode_Fill()
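For concreteness, here is a hedged sketch of that classic “allocate, fill, do a final resize” pattern; the helper name and its hard-coded content are invented for the example:

#include <Python.h>

/* Hypothetical helper: build a string of at most max_len code points by
   writing into the freshly allocated object, then shrinking it to the
   number of characters actually written. */
static PyObject *
build_string(Py_ssize_t max_len)
{
    /* PEP 393: the maximum character must be known (or over-estimated)
       up front; 127 means "ASCII only" here. */
    PyObject *str = PyUnicode_New(max_len, 127);
    if (str == NULL) {
        return NULL;
    }

    int kind = PyUnicode_KIND(str);
    void *data = PyUnicode_DATA(str);
    Py_ssize_t written = 0;

    /* Write directly into the "immutable" object while it has no other
       owner (example content only). */
    for (Py_UCS4 ch = 'a'; ch <= 'e' && written < max_len; ch++) {
        PyUnicode_WRITE(kind, data, written, ch);
        written++;
    }

    /* Shrink to the final size. */
    if (PyUnicode_Resize(&str, written) < 0) {
        Py_DECREF(str);
        return NULL;
    }
    return str;
}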
The problem is that PyUnicode_New() is designed for PEP 393: it requires a “maximum character”. If tomorrow Python switches to UTF-8 internally, computing the maximum character becomes pure overhead, since it would no longer be needed. By the way, PyPy is already facing this problem today, since it uses UTF-8 internally: PyPy developers asked me a long time ago to get rid of the PEP 393 C APIs. We should try to hide these implementation details.
What do you think of deprecating the C APIs which modify immutable strings? I don’t think that the PyUnicodeWriter API is complete enough; we might need to add other APIs to create strings. These APIs have yet to be designed.
I don’t know the performance cost. PyUnicodeWriter was designed with performance in mind: for example, it can overallocate its internal buffer if needed.
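As a point of comparison, here is a hedged sketch of how the same kind of string building could look with PyUnicodeWriter; the function and its input array are made up for the example, and note that no maximum character has to be computed by the caller:

#include <Python.h>

/* Hypothetical sketch: join a C array of UTF-8 strings with spaces using
   the PyUnicodeWriter API instead of the allocate/fill/resize pattern.
   The writer owns (and may overallocate) its internal buffer. */
static PyObject *
join_utf8(const char *const *items, Py_ssize_t n)
{
    /* 0 means "no size hint"; a better hint can reduce reallocations. */
    PyUnicodeWriter *writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        if (i > 0 && PyUnicodeWriter_WriteChar(writer, ' ') < 0) {
            goto error;
        }
        /* -1: let the writer compute strlen(items[i]) itself. */
        if (PyUnicodeWriter_WriteUTF8(writer, items[i], -1) < 0) {
            goto error;
        }
    }
    /* Finish() builds the final str object and frees the writer. */
    return PyUnicodeWriter_Finish(writer);

error:
    PyUnicodeWriter_Discard(writer);
    return NULL;
}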
I feel like there is growing tension between code relying on these functions and the willingness to change the Unicode internals. So we should think about replacement APIs that hide the implementation details.
I don’t know.
I’m not sure that the stdlib is a good candidate, since we like to abuse internals to get the best performance. Using PyUnicodeWriter in the stdlib extensions would only be acceptable if there is no performance overhead.
Doesn’t it precisely make the stdlib a good testing ground to check that the PyUnicodeWriter can be a complete replacement for the legacy APIs?
Intuitively, I see two possible problems with the PyUnicodeWriter API:
It seems that PyUnicodeWriter_Create / PyUnicodeWriter_Finish add a malloc/free pair in addition to the actual PyUnicodeObject allocation. This might be eliminated using clever tricks, though.
The presizing/overallocation behavior is not documented. Even the length parameter to PyUnicodeWriter_Create isn’t documented (is it a number of code points? a number of UTF-8 bytes?).
An additional concern, perhaps a temporary one, is that PyUnicodeWriter is not part of the limited API (yet?).
The only place where Cython uses these functions that wouldn’t be easy with the PyUnicodeWriter API is when it tries to optimize in-place addition of unicode strings:
s = ""
for other_string in list_of_strings:
s += other_string
It does basically the same thing that Python does internally: if the reference count is 1, it tries to resize the string in place rather than creating a new object.
Obviously that isn’t good code, but it’s nice to be able to optimize it.
In principle, it should be possible (maybe easier?) to keep doing the same thing in a UTF-8 world. But I don’t think it’s easily expressed with PyUnicodeWriter in anything but the simplest cases.
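For illustration, here is a rough sketch of that kind of refcount-1 trick with the current C API (not Cython’s actual code; it glosses over details such as interned strings and widening the kind of the left operand):

#include <Python.h>

/* Hypothetical sketch of in-place concatenation: if we hold the only
   reference to the left operand, grow it and append, instead of
   allocating a brand new string. Assumes the characters of `right`
   fit the kind of `*p_left`. */
static int
concat_inplace(PyObject **p_left, PyObject *right)
{
    Py_ssize_t left_len = PyUnicode_GET_LENGTH(*p_left);
    Py_ssize_t right_len = PyUnicode_GET_LENGTH(right);

    if (Py_REFCNT(*p_left) == 1) {
        /* Sole owner: resize in place, then copy the right operand
           after the existing characters. */
        if (PyUnicode_Resize(p_left, left_len + right_len) < 0) {
            return -1;
        }
        if (PyUnicode_CopyCharacters(*p_left, left_len,
                                     right, 0, right_len) < 0) {
            return -1;
        }
        return 0;
    }

    /* Shared object: fall back to creating a new string. */
    PyObject *result = PyUnicode_Concat(*p_left, right);
    if (result == NULL) {
        return -1;
    }
    Py_SETREF(*p_left, result);
    return 0;
}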
Moving to the new PyUnicodeWriter API internally is a good idea, provided the performance stays the same. But I don’t think we’ll be able to deprecate the mentioned C APIs for quite a while: the basic idea of “allocate, fill in the data, then do a final resize” has been the common approach for building strings in Python since the very beginning, so people using the Python C API have it internalized.