PEP 756 – [C API] Add PyUnicode_Export() and PyUnicode_Import() C functions

Does it mean that CPython must not support PyUnicode_FORMAT_UTF8?

In CPython, PyUnicode_FORMAT_UTF8 may be O(1) or O(n): it depends on whether the string was already encoded and whether it contains surrogate characters.


If the string is already encoded in UTF-8, then we can return a pointer to the internal buffer and say that it’s UTF-8. If the user didn’t request UTF-8 and we only have UTF-8, then we can’t, so we shouldn’t.

Whatever we say about surrogate characters here, we ought to be enforcing at string creation time. So if we’re saying “there will never be surrogates in PyUnicode_Export results”, what that means is “we never use surrogates in our internal representation.”

So I’d rather define PyUnicode_FORMAT_UTF8 as “not quite perfect UTF-8: if you’re going to use it, watch out for these things, and if you want perfect UTF-8, then use the proper function for it instead of Export”.

It was decided to allow surrogates in PEP 756 export. For UTF-8, it means that the “surrogatepass” error handler must be used. A Python string can contain a lone surrogate character, such as "\uDC80".
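The Python-level behaviour can be checked directly: a str may hold a lone surrogate, the strict UTF-8 encoder rejects it, and the surrogatepass error handler round-trips it.

```python
# A Python str may contain a lone surrogate code point.
s = "\udc80"

# The default (strict) UTF-8 encoder rejects lone surrogates.
try:
    s.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# With the "surrogatepass" error handler, U+DC80 is encoded as the
# three bytes ED B2 80 (the raw UTF-8-style encoding of that code point).
data = s.encode("utf-8", "surrogatepass")
print(data)  # b'\xed\xb2\x80'

# Decoding with "surrogatepass" round-trips back to the original string.
assert data.decode("utf-8", "surrogatepass") == s
```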

How is an extension supposed to figure out in which formats the string would be available in zero-copy form?

At the moment, we have these cases:

  1. The string is available as UCS-1, UCS-2, or UCS-4
  2. The string is additionally available as UTF-8

It looks like we’re missing an API which tells the extension: these zero copy formats are available. Requiring a trial and error approach for this would be poor API design.


One of the inputs is the bitmask of the formats you’ll accept, and the return value is the format you received (guaranteed to be one of the formats in your bitmask). So it’s the same API as the one that also gives you the pointer - another good reason for it to be O(1).

Like I’ve said, I do think this is a good API, provided we leave out the bit where we do any conversions behind the scenes.

(And the non-trial and error approach is to use PyUnicode_AsString or _AsWideChar. This API is specifically for optimising cases where you can handle the internal data directly, but the caller will always need a fallback where we convert to a standard interchange format for you.)

And what are people going to do if the O(1) access fails, exactly? Surely they will fall back on O(n) conversion anyway, so why not give them exactly that?

I know next to nothing about PyPy’s internals, but a couple of snippets seem to suggest otherwise:

This one even seems to export a bf_getbuffer callback on PyPy types for CPython to call into:

I’m not saying this is necessarily cheap: perhaps it’s an expensive operation for the GC?
I don’t know if @cfbolz or @mattip is available to comment on this.

Because then we have to write and maintain all those conversions, or at least the codepaths to call existing code to perform those conversions. If we don’t give them that, we don’t have to add any more code. That’s my entire calculation here - writing fewer things that might break :slight_smile:


But this still doesn’t tell you which formats are available: e.g. UTF-8 may be available, but the API is going to return, say, UCS2 instead and won’t tell you about the availability of the cached UTF-8 buffer.

In the future, there may be more (or fewer) formats available and the extension doesn’t have any influence on what the API selects.

If a “bad” C extension exports only to UCS-1/UCS-2/UCS-4 (without UTF-8) and CPython switches to UTF-8 internally, I would prefer CPython to provide UCS-1/UCS-2/UCS-4 via conversion rather than failing and so breaking the C extension.

You can replace “CPython switches to UTF-8 internally” with PyPy. I’m sure that PyPy will do whatever is needed (convert to UCS-1/2/4) to support such a “bad” C extension, rather than failing.

What’s the purpose of such a query? Just export with all supported formats and you’re good.


If you have code that prefers to process UTF-8, then only request UTF-8 and you’ll get it if it’s there. You won’t be given UCS2 unless you ask for it. If you ask for both and they’re both there, then you’ll get UCS2, because as the PEP specifies:

On CPython, the UTF-8 format has the lowest priority: ASCII and UCS formats are preferred.
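As an illustration, here is a toy Python model of that selection rule. This is not the real C API, just a sketch: the format bit values follow those proposed in PEP 756, and the real function would raise an exception instead of returning 0.

```python
# Format bit values as proposed in PEP 756 (assumed, not a released API).
FORMAT_ASCII = 0x01
FORMAT_UCS1 = 0x02
FORMAT_UCS2 = 0x04
FORMAT_UCS4 = 0x08
FORMAT_UTF8 = 0x10

# On CPython, UTF-8 has the lowest priority: ASCII/UCS formats are preferred.
PRIORITY = (FORMAT_ASCII, FORMAT_UCS1, FORMAT_UCS2, FORMAT_UCS4, FORMAT_UTF8)

def select_format(available: int, requested: int) -> int:
    """Return the first available format the caller asked for, or 0 on failure."""
    for fmt in PRIORITY:
        if fmt & available & requested:
            return fmt
    return 0  # the real function would raise an exception here

# A string stored as UCS2 that also has a cached UTF-8 buffer:
available = FORMAT_UCS2 | FORMAT_UTF8

# Ask for both: UCS2 wins, because UTF-8 has the lowest priority.
assert select_format(available, FORMAT_UCS2 | FORMAT_UTF8) == FORMAT_UCS2
# Ask only for UTF-8: you get UTF-8.
assert select_format(available, FORMAT_UTF8) == FORMAT_UTF8
# Ask only for a format that is not available: failure.
assert select_format(available, FORMAT_UCS4) == 0
```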

I don’t understand what you mean by this. The extension always has exactly the same amount of influence in defining the limits of what the API may return. The API will never return a format that the extension cannot handle, because it will never return a format that the extension hasn’t specifically requested. Perhaps if you propose an alternative behaviour then it’ll become clearer what you think is going on here?

Yes, it’s possible that requests that succeed in 3.14 may start failing in 3.15. That’s by design. The alternative is that we make our internal format part of the limited API (or that we tell the extensions which claim to require this O(1) functionality that they can’t use the limited API).

The problem with this API is that it’s opinionated. That’s why I proposed adding flags a few times, to give the caller more control over what’s being done.

For example, we can say that PyUnicode_Export() doesn’t convert to other formats by default, but you can pass a PyUnicode_EXPORT_COPY flag to allow the function to copy memory.

I updated my implementation for that and prepared a PR updating PEP 756 to add a PyUnicode_EXPORT_COPY flag.

  • By default, the complexity is O(1): no memory is copied and no conversion is done.
  • If PyUnicode_EXPORT_COPY flag is set, the complexity can be O(1) or O(n).

It should satisfy @da-woods, @steve.dower and @malemburg who asked to always have O(1).

No. It only satisfies me if we don’t have to implement, test, or support any new conversions. I want a less complex function, not a more complex one.


This raises the question: if the caller asks for UTF8 and the PyUnicode object doesn’t have a cached UTF8 version, does the call to PyUnicode_Export succeed?

It seems that morally, it should (the copy is amortized across all further exports), but legally, it shouldn’t :slight_smile:

Yes, it does succeed. The cache is filled at the first call.


I updated PEP 756 to avoid memory copies and avoid conversions by default. It now has a complexity of O(1) by default. There is one exception: a UTF-8 export can require encoding the string to UTF-8 at the first call if the cache is not already filled.

For example, if a string is stored as UCS-1, a UCS-4 export now fails. Only strings stored as UCS-4 can be exported as UCS-4.

I also added a PyUnicode_EXPORT_ALLOW_COPY flag to allow memory copies and conversions. Sadly, on CPython, this flag is needed to export a string containing surrogate characters to UTF-8, since the implementation encodes the string at each call using the surrogatepass error handler (O(n) complexity).
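A rough Python sketch of the resulting decision table. This is a simplified model, not the C API: it ignores the surrogate special case just mentioned, and the constant names are taken from this discussion, not from a released API.

```python
# Internal/requested formats, modelled as simple tags.
UCS1, UCS2, UCS4, UTF8 = "ucs1", "ucs2", "ucs4", "utf8"
ALLOW_COPY = 0x01  # flag name taken from the PEP 756 discussion (hypothetical)

def can_export(internal: str, requested: str, flags: int = 0) -> bool:
    """Model of which exports succeed under the updated PEP 756 rules."""
    if requested == internal:
        return True  # O(1): a pointer to the internal buffer
    if requested == UTF8:
        return True  # allowed: encoded once, then served from the cache
    # Any other mismatch requires an O(n) copy/conversion, which the
    # caller must opt in to with ALLOW_COPY.
    return bool(flags & ALLOW_COPY)

# A UCS-1 string cannot be exported as UCS-4 by default...
assert not can_export(UCS1, UCS4)
# ...but it can if the caller opts in to copying.
assert can_export(UCS1, UCS4, ALLOW_COPY)
# UTF-8 export is allowed by default (cache filled on first call).
assert can_export(UCS1, UTF8)
```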


So now the question is, under what circumstances should I use this function rather than PyUnicode_AsString when I want UTF-8? Should I be replacing all my calls to AsString with this function? Or only in some specific circumstances?

Do I, as an extension author who is not trying to microoptimise string processing, need to consider this function at all or not? And if so, how easy is it for me to get it right?

(My preferred answer is that regular extension authors should totally ignore this function and stick to ones that are designed for interoperability, rather than performance. But if it can be shown that having this makes life easier for all extension authors rather than more complicated, I may well come around in favour.)

Do you mean PyUnicode_AsUTF8? It’s not part of the stable ABI, while PyUnicode_Export would be a candidate for it.

Sorry, PyUnicode_AsUTF8AndSize is the one I meant, and it’s been limited API since 3.10.


Ah, I see. Well, if what you need is UTF8 anyway, PyUnicode_AsUTF8AndSize should certainly be fine, and another API can’t really get more performant.

What I take from this discussion is that some extensions might want more performance (by avoiding the potential UTF-8 conversion) and are willing to pay the price of getting UCS-&lt;n&gt; data in return. I do not maintain such an extension, so I cannot say anything more.