The Python C API allows modifying immutable strings; it’s a common pattern used to create new strings (a minimal sketch of the classic pattern follows the list). Examples of such functions:
PyUnicode_New()
PyUnicode_FromStringAndSize(NULL, size)
PyUnicode_Resize()
PyUnicode_WriteChar()
PyUnicode_WRITE()
PyUnicode_CopyCharacters()
PyUnicode_Fill()
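For concreteness, here is a hedged sketch of that classic “allocate, fill, do a final resize” pattern; the helper name and its hard-coded content are invented for the example:

#include <Python.h>

/* Hypothetical helper: build a string of at most max_len code points by
   writing into the freshly allocated object, then shrinking it to the
   number of characters actually written. */
static PyObject *
build_string(Py_ssize_t max_len)
{
    /* PEP 393: the maximum character must be known (or over-estimated)
       up front; 127 means "ASCII only" here. */
    PyObject *str = PyUnicode_New(max_len, 127);
    if (str == NULL) {
        return NULL;
    }

    int kind = PyUnicode_KIND(str);
    void *data = PyUnicode_DATA(str);
    Py_ssize_t written = 0;

    /* Write directly into the "immutable" object while it has no other
       owner (example content only). */
    for (Py_UCS4 ch = 'a'; ch <= 'e' && written < max_len; ch++) {
        PyUnicode_WRITE(kind, data, written, ch);
        written++;
    }

    /* Shrink to the final size. */
    if (PyUnicode_Resize(&str, written) < 0) {
        Py_DECREF(str);
        return NULL;
    }
    return str;
}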
The problem is that PyUnicode_New() is designed for PEP 393: it requires a “maximum character”. If tomorrow Python switches to UTF-8 internally, computing the maximum character becomes pure overhead, since it would no longer be needed. By the way, PyPy is already facing this problem today, since it uses UTF-8 internally: PyPy developers asked me a long time ago to get rid of the PEP 393 C APIs. We should try to hide these implementation details.
What do you think of deprecating the C APIs which modify immutable strings? I don’t think that the PyUnicodeWriter API is complete enough; we might need to add other APIs to create strings. These APIs have yet to be designed.
I don’t know the performance cost. PyUnicodeWriter was designed with performance in mind: for example, it can overallocate its internal buffer if needed.
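As a point of comparison, here is a hedged sketch of how the same kind of string building could look with PyUnicodeWriter; the function and its input array are made up for the example, and note that no maximum character has to be computed by the caller:

#include <Python.h>

/* Hypothetical sketch: join a C array of UTF-8 strings with spaces using
   the PyUnicodeWriter API instead of the allocate/fill/resize pattern.
   The writer owns (and may overallocate) its internal buffer. */
static PyObject *
join_utf8(const char *const *items, Py_ssize_t n)
{
    /* 0 means "no size hint"; a better hint can reduce reallocations. */
    PyUnicodeWriter *writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        if (i > 0 && PyUnicodeWriter_WriteChar(writer, ' ') < 0) {
            goto error;
        }
        /* -1: let the writer compute strlen(items[i]) itself. */
        if (PyUnicodeWriter_WriteUTF8(writer, items[i], -1) < 0) {
            goto error;
        }
    }
    /* Finish() builds the final str object and frees the writer. */
    return PyUnicodeWriter_Finish(writer);

error:
    PyUnicodeWriter_Discard(writer);
    return NULL;
}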
I feel like there is growing tension between code relying on these functions and the willingness to change the Unicode internals. So we should think about replacement APIs that hide the implementation details.
I don’t know.
I’m not sure that the stdlib is a good candidate, since we like to abuse internals to get the best performance. Using PyUnicodeWriter in the stdlib extensions would only be acceptable if there is no performance overhead.
Doesn’t it precisely make the stdlib a good testing ground to check that the PyUnicodeWriter can be a complete replacement for the legacy APIs?
Intuitively, I see two possible problems with the PyUnicodeWriter API:
It seems that PyUnicodeWriter_Create / PyUnicodeWriter_Finish add a malloc/free pair in addition to the actual PyUnicodeObject allocation. This might be eliminated using clever tricks, though.
The presizing/overallocation behavior is not documented. Even the length parameter to PyUnicodeWriter_Create isn’t documented (is it a number of code points? a number of UTF-8 bytes?).
An additional concern, perhaps a temporary one, is that PyUnicodeWriter is not part of the limited API (yet?).
The only place where Cython uses these functions that wouldn’t be easy with the PyUnicodeWriter API is when it tries to optimize in-place addition of unicode strings:
s = ""
for other_string in list_of_strings:
s += other_string
It does basically the same thing that Python does internally: if the reference count is 1, it tries to resize the string in place rather than creating a new object.
Obviously that isn’t good code, but it’s nice to be able to optimize it.
In principle, it should be possible (maybe easier?) to keep doing the same thing in a UTF-8 world. But I don’t think it’s easily expressed with PyUnicodeWriter in anything but the simplest cases.
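For illustration, here is a rough sketch of that kind of refcount-1 trick with the current C API (not Cython’s actual code; it glosses over details such as interned strings and widening the kind of the left operand):

#include <Python.h>

/* Hypothetical sketch of in-place concatenation: if we hold the only
   reference to the left operand, grow it and append, instead of
   allocating a brand new string. Assumes the characters of `right`
   fit the kind of `*p_left`. */
static int
concat_inplace(PyObject **p_left, PyObject *right)
{
    Py_ssize_t left_len = PyUnicode_GET_LENGTH(*p_left);
    Py_ssize_t right_len = PyUnicode_GET_LENGTH(right);

    if (Py_REFCNT(*p_left) == 1) {
        /* Sole owner: resize in place, then copy the right operand
           after the existing characters. */
        if (PyUnicode_Resize(p_left, left_len + right_len) < 0) {
            return -1;
        }
        if (PyUnicode_CopyCharacters(*p_left, left_len,
                                     right, 0, right_len) < 0) {
            return -1;
        }
        return 0;
    }

    /* Shared object: fall back to creating a new string. */
    PyObject *result = PyUnicode_Concat(*p_left, right);
    if (result == NULL) {
        return -1;
    }
    Py_SETREF(*p_left, result);
    return 0;
}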
Moving to the new PyUnicodeWriter API internally is a good idea, provided the performance stays the same. But I don’t think we’ll be able to deprecate the mentioned C APIs for quite a while: the basic idea of “allocate, fill in the data, then do a final resize” has been the common approach for building strings in Python since the very beginning, so people using the Python C API have it internalized.