PEP 756 – [C API] Add PyUnicode_Export() and PyUnicode_Import() C functions

vstinner · September 14, 2024, 9:10am

Read the PEP: https://peps.python.org/pep-0756/

The C API Working Group decision issue "Add PyUnicode_Export() and PyUnicode_Import() to the limited C API" now has more than 50 comments, which makes it difficult to read and navigate. I wrote PEP 756 to summarize the discussion. It might be easier to make a decision on a PEP. Obviously, it’s an opinionated PEP.

For this first C API Working Group PEP, I used: PEP-Delegate: C API Working Group.

Abstract

Add functions to the limited C API version 3.14:

PyUnicode_Export(): export a Python str object as a Py_buffer view.
PyUnicode_Import(): import a Python str object.

In general, PyUnicode_Export() has an O(1) complexity: no memory copy is needed. See the specification for cases when a copy is needed.

Open Questions

1. Should we guarantee that the exported buffer always ends with a NUL character? Is it possible to implement it in O(1) complexity in all Python implementations?
2. Is it ok to allow surrogate characters?
3. Should we add a flag to disallow embedded NUL characters? It would have an O(n) complexity.
4. Should we add a flag to disallow surrogate characters? It would have an O(n) complexity.

da-woods · September 15, 2024, 7:46am

Thinking about where Cython would use these interfaces, it’s mostly just micro-optimizations.

For export, we’d probably prefer something that fails rather than something that makes a copy in most cases. But I can see that other users might feel differently. I guess that’s probably possible by requesting everything, then rejecting results in the wrong format, but long-term more formats might be added.
I’m not convinced it needs to go into the limited API immediately - I’m personally of the opinion that the Limited API should be as small as possible while still being useful, and this feels like an optimization rather than a necessity.

The flags argument exists but isn’t discussed in the PEP. I assume it’s mostly for future use and is ignored for now?

pitrou · September 15, 2024, 10:32am

This PEP looks very well thought out. I have a minor comment: instead of saying the buffer format for PyUnicode_FORMAT_UCS4 is either "I" or "L", why not always "I"?

Now to your questions:

1. No. Any decent text-handling library, in any language, should accept explicitly-sized strings.

2. I think so.

3. No. It’s trivial for callers to implement this if they want.

4. No. It’s almost trivial for callers to implement this if they want.

vstinner · September 15, 2024, 12:25pm

It should be "I" in the common case, but this format is tied to the unsigned int type, which can be 16-bit on some platforms. In that case, "L" is used instead. It’s more reliable to rely on the export format (e.g. PyUnicode_FORMAT_UCS4) than on Py_buffer.format (e.g. "I").

vstinner · September 15, 2024, 12:39pm

I can add a PyUnicode_EXPORT_NO_COPY flag for that. But can’t you just pass enough formats as recommended by the PEP to get O(1) complexity (avoid copy)?

Oops, I added it to the PEP by mistake; it’s not part of the API. I just removed it.

(I was working on a patch to add flags for surrogates and embedded null characters. See Open Questions.)

da-woods · September 15, 2024, 1:08pm

Yes, but if you add more supported formats in future then our O(1) code suddenly becomes O(n) with no change on our part.

pitrou · September 15, 2024, 5:19pm

Is CPython even supported on such platforms? I doubt it.

encukou · September 16, 2024, 9:44am

Generally, I think that

feature-wise, we should design API to not break in the future, but
performance-wise, we should design API for what we (and e.g. PyPy) have now.

This means that if we change things in the future, extensions should continue working, but perhaps they’ll be (much) slower than before – until they update to use some new features or APIs.
IMO, this is better than trying to predict the future right now.

IMO, we need to do this, but discourage relying on it.
In practice, most of the returned buffers are NUL-terminated. No matter what we say or document, some users will expect the trailing NUL, and they won’t be bothered by embedded NULs truncating their strings.
Unfortunately, using C string functions on a buffer that is not NUL-terminated can easily become a security issue. And we’re still building a C API.
So, to make the world safe, we need to always add the terminating NUL.

We can add a “no NUL please” flag in the future. Or alternate implementations where this is a bottleneck can add a XPyUnicode_Export_NoNUL function (which CPython can adopt later). But right now, let’s export the NUL and

document that we do it, for the benefit of alternate implementations
document that you shouldn’t rely on it, and use size whenever possible, since strings can have embedded NULs

We don’t need to add the flag now – if/when we do, all already-released CPython versions will simply ignore the flag and add an extra NUL.
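A minimal sketch of why relying on the trailing NUL is fragile, using ctypes only to mimic C-string semantics (it does not call PyUnicode_Export(), and the sample text is made up). `create_string_buffer` appends a trailing NUL, much like the export would, yet a reader that stops at the first NUL silently loses everything after an embedded one:

```python
import ctypes

text = "safe\x00evil"  # a str with an embedded NUL (made-up sample)
buf = ctypes.create_string_buffer(text.encode("utf-8"))

# .raw is the whole buffer, including the trailing NUL that ctypes appended.
whole = buf.raw
# .value reads the buffer the way strlen()-style C code would: up to the first NUL.
c_view = buf.value

print(len(whole), whole)
print(len(c_view), c_view)
```

Code that uses the explicit size sees the complete contents; code that treats the buffer as a C string sees only the prefix.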

No. If a user needs to reject surrogate characters, they can scan the string themselves.

However, we do need to document that our exports can contain these.

Again, if some day there appears an (alternate) implementation where this check is cheap, we/they can add a flag/function for it.
(Already-released CPython versions will ignore the flag if it’s requested, and they won’t set it in the output.)

No. Embedded NULs are a normal feature of Python strings.
Again, users that need to reject these can scan the result. In many cases, they might want to reject some other control characters as well, not just NUL – for example terminal escapes, lone surrogates, or BIDI overrides.
Let’s not build the tool now if CPython doesn’t need it. It would be an ill-fitting tool.
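As a rough illustration of the "scan it yourself" approach, here is a sketch of a caller-side check written in Python for brevity. The function name, the reason strings, and the exact set of rejected characters are assumptions for the example; a real consumer would pick its own policy.

```python
# U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE
BIDI_OVERRIDES = {0x202D, 0x202E}


def find_unwanted(text: str) -> list[tuple[int, str]]:
    """Return (index, reason) for each character a strict consumer might reject."""
    problems = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp == 0:
            problems.append((i, "embedded NUL"))
        elif 0xD800 <= cp <= 0xDFFF:
            problems.append((i, "lone surrogate"))
        elif cp in BIDI_OVERRIDES:
            problems.append((i, "BIDI override"))
        elif cp == 0x1B:
            problems.append((i, "terminal escape"))
    return problems
```

The scan is O(n), but it runs only for callers that need it and covers more than a single built-in flag could.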

I guess the proper format would be "=I".
Some consumers might not expect the full struct syntax here, but since this is new API, perhaps that’s OK?
If so we should also use "=H". But I’d keep "B" alone; that’s unambiguous, and more likely to be special-cased.
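The difference between native and standard-size format codes can be checked with Python's struct module, which uses the same syntax. In this sketch the "=" prefix selects native byte order with standard sizes, so "=I" is always 4 bytes, while the plain "I" and "L" codes follow the platform's C types. The sample string is arbitrary.

```python
import struct
import sys

# Standard sizes: fixed by the format syntax, independent of the C compiler.
standard_sizes = {fmt: struct.calcsize(fmt) for fmt in ("B", "=H", "=I")}

# Native sizes: whatever the platform's unsigned int / unsigned long are.
native_sizes = {fmt: struct.calcsize(fmt) for fmt in ("I", "L")}

# Decode three UCS-4 code units stored in native byte order.
codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
units = struct.unpack("=3I", "a€𝄞".encode(codec))

print(standard_sizes, native_sizes, [hex(u) for u in units])
```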

vstinner · September 16, 2024, 12:10pm

I asked PyPy devs about PEP 756: a memory copy will be needed anyway, because Python str objects can be moved in memory (it’s not possible to pin a str object in memory, or at least, it’s not recommended for best performance). Also, obviously, PyPy doesn’t use UCS-1, UCS-2 or UCS-4 internally and so requesting this format would either fail or have to copy memory. PyPy uses UTF-8 internally.

IMO it’s ok that PyPy has to copy memory. It’s still better than the current PyUnicode_AsUTF8() API which has an undefined lifetime.

vstinner · September 16, 2024, 1:13pm

I wrote the PR: PEP 756: Remove Open Questions, add flags to update the PEP:

Add uint32 flags to PyUnicode_Export() and PyUnicode_Import() for future use. This will allow extending the API without having to break the ABI or add a new API. For now, flags must be set to 0.
Add a soft requirement on ending the buffer with a trailing NUL character: “The buffer should end with a trailing NUL character” (and not must).
Remove Open Questions: they have been answered:
- Soft requirement on ending the buffer with a trailing NUL character.
- It’s ok to allow surrogate characters.
- Don’t add a flag to reject embedded NUL characters.
- Don’t add a flag to reject surrogate characters.
Update information about PyPy: mention the moving GC and the need to copy the string anyway.

The lack of this API prevents some C extensions from using the limited C API. I would like to promote limited C API usage and make it possible to write efficient code.

I updated the PEP: it now always uses the "=I" buffer format for the PyUnicode_FORMAT_UCS4 export format.

steve.dower · September 16, 2024, 4:22pm

Just to transfer my most current concerns (as Victor said, we’ve been working on this API for a while already):

The reasons that the existing PyUnicode_AsAnythingElse functions aren’t sufficient right now are purely due to performance: cases where having O(1) access to the entire string [1] is worth more than the cost of duplicating your processing code to handle different internal representations.

Certainly for some cases I can see the value - a regex library is likely more efficient handling a million characters stored as UCS-4 than converting to UTF-8 and then processing that.

But the value is entirely in that zero-copy, zero-process “export”. So as soon as we add any algorithm over the string contents, even a copy, the value is gone.

My opinion is that if we’re going to add high-performance APIs where we already have interoperable APIs, they should keep that guarantee or else fail and the caller can fall back. One day, our internal representation may change in a way to make this API no longer fast, at which point it should fail dynamically, but because all callers have a fallback already in place they’ll continue working (and if they’re running tests that ensure the fallback is not used, they’ll find out that we changed it early).

We shouldn’t even be considering starting with “high performance” APIs that break that guarantee. O(1) or nothing.

Given that constraint, the only thing I’d change in the PEP as it stands is that we should only ever return the buffer if we already have the requested format. Enabling any kind of conversion or filtering here inevitably breaks the O(1) guarantee, which is the entire point of the function. If that makes it uninteresting to others, then we just shouldn’t add it.
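The "O(1) or fail, with a caller-side fallback" pattern can be modelled in a few lines of Python. Everything here is hypothetical: the flag values, `native_format()`, `strict_export()` and `process()` are names invented for the sketch, and the native format is approximated from the largest code point, as PEP 393 does.

```python
# Illustrative flag values, not taken from any header.
UCS1, UCS2, UCS4, UTF8 = 0x1, 0x2, 0x4, 0x8


def native_format(text: str) -> int:
    """Approximate the PEP 393 storage kind from the largest code point."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return UCS1
    if top < 0x10000:
        return UCS2
    return UCS4


def strict_export(text: str, requested: int):
    """Return the native format if the caller asked for it, else None (no conversion)."""
    fmt = native_format(text)
    return fmt if fmt & requested else None


def process(text: str, requested: int = UCS1):
    fmt = strict_export(text, requested)
    if fmt is None:
        # The fallback the caller already has: an existing O(n) conversion.
        return ("fallback", text.encode("utf-8", "surrogatepass"))
    return ("fast", fmt)
```

If the internal representation changed, `strict_export()` would start returning None and callers would keep working through the fallback branch, only slower.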

(PyUnicode_Import in my mind is for round-tripping data. Because we have to guarantee internal consistency of our own data structures, it’s very unlikely that we can make it O(1). Export is the important operation as far as perf is concerned, but if we add one then we should add both. I wouldn’t take PyUnicode_Import on its own.)

As a side note, I would love if we could find some consistency between this API and the PyLong export and import functions also being proposed. A general pattern for “give me a raw view/untranslated access to this PyObject” would help save us from having to keep designing these from scratch each time.

[1] Or more likely, O(1N) rather than O(2N) with a copy.

methane · September 17, 2024, 8:05am

I don’t like adding this API to the limited API. Its design is optimized for the current CPython implementation, so it should be a CPython public API, not a limited API. For the limited API and stable ABI, we can add PyUnicode_ExportUTF8().

FWIW, there is a discussion about deprecating PEP 393 based APIs in Python 3.14 and using a PyPy-like representation in the future. So I am not sure that even a public (not limited) API is needed.

github.com/faster-cpython/ideas

Use UTF-8 internally for strings.

opened 03:58PM - 12 Jun 24 UTC

markshannon

I'm sure this has been discussed elsewhere and I don't know when, or if, we'll have time to implement it, but I think it's worth adding here.

Currently Python `str`s (`PyUnicodeObject` in C) are implemented as arrays of 1, 2 or 4 byte unicode code points, plus a header.

I propose that we implement them as a [utf8](https://en.wikipedia.org/wiki/UTF-8) encoded array of bytes, plus a header.

The advantages are numerous:

* Non-ASCII strings will generally be more compact, without making ASCII strings any bigger.
* Strings can be joined easily by simple concatenation of the data
* The internet is utf8, so the vast majority of encoding and decoding operations should be fast
* There need only be one implementation of each `str` method and each `PyUnicode_` C function, saving considerable code size and simplifying the code
* Algorithms for fast operations on utf8 are well known, so many operations will be fast, despite the variable length encoding.
* The C struct is clearer, as we don't need an awkward union of `uint8_t`, `uint16_t` and `uint32_t` for character data.

However there are two problems:

* Indexing into strings is supposed to be O(1).
* Some of the C API exposes the internal encoding

### Keeping indexing O(1)

To maintain this properly we will have to lazily create an offset table for larger, non-ASCII strings. This won't be a problem in practice because:

* Creating the index is no more expensive than creating the current UCS1/2/4 strings.
* We only allocate indexes if we need them, which should be relatively rare.

### The C API

We will have to deprecate, and then remove, the C API that exposes the implementation details. We should probably deprecate for 3.14, so that we can implement utf8 strings in 3.16, allowing a proper deprecation period.

## Implementation

We will probably want to embed the string data directly into the object, so the struct will look something like this:

```C
typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;
    uintptr_t ascii: 1;
    uintptr_t valid_utf8: 1;
    uintptr_t length: (WORD_SIZE-4);  /* Number of code points in the string */
    Py_hash_t hash;                   /* Hash value; -1 if not set */
    PyUnicodeIndex *index;            /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;
```

The `valid_utf8` bit helps fast encoding. It is false if the string contains half surrogate pairs or any other code point not allowed in legal utf-8.

### Indexing operations

`getitem(self, index)` would be implemented something like this:

```Py
def getitem(self, index):
    if self.ascii:
        return self.data[index]
    if self.index is NULL:
        self.index = make_index(self)
    offset = offset_from_index(self.index, index)
    return read_one_char(self.data, offset)
```

The index table would be composed of len(s)/64 entries, each entry being:

```C
struct index_entry {
    uintptr_t base_offset;
    uint8_t additional_offset[64];
};
```

With the offset being computed as `base_offset[index/64] + additional_offset[index%64]`.
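The offset table from the quoted issue can be prototyped in pure Python. This is only a sketch under assumptions: it reuses the helper names from the pseudocode (`make_index`, `offset_from_index`, `read_one_char`) but their bodies are invented here, the input must be valid UTF-8, and each entry is modelled as a `(base_offset, additional_offset)` tuple rather than a C struct.

```python
BLOCK = 64  # code points per index entry, as in the quoted proposal


def make_index(data: bytes) -> list[tuple[int, bytes]]:
    """Build one (base_offset, additional_offset) entry per 64 code points."""
    # Byte positions where a code point starts (skip 0b10xxxxxx continuation bytes).
    starts = [pos for pos, byte in enumerate(data) if (byte & 0xC0) != 0x80]
    index = []
    for first in range(0, len(starts), BLOCK):
        chunk = starts[first:first + BLOCK]
        base = chunk[0]
        # Deltas are at most 63 * 4 = 252, so each one fits in a single byte.
        index.append((base, bytes(pos - base for pos in chunk)))
    return index


def offset_from_index(index: list[tuple[int, bytes]], i: int) -> int:
    """Map a code point index to a byte offset in O(1)."""
    base, extra = index[i // BLOCK]
    return base + extra[i % BLOCK]


def read_one_char(data: bytes, offset: int) -> str:
    """Decode the single code point that starts at `offset`."""
    lead = data[offset]
    width = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return data[offset:offset + width].decode("utf-8")
```

Building the table is a single O(n) pass; after that each lookup costs two array reads, which is how indexing stays O(1) on variable-width data.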

7 years ago, I ported the MarkupSafe speedup module from Python 2 to PEP 393. My motivation was to make Python 3 as fast as Python 2, to encourage Flask users to move to Python 3.

Now I hate this code. I needed to use a long C macro as a UCS-1/2/4 template, which is a real maintenance burden.

Many HTML snippets are ASCII. Using the UTF-8 API doesn’t slow them down. And if we change the str internal representation to UTF-8, escaping it will be much faster than escaping the current UCS-4 too. So I hope a future CPython has a PyPy-like internal representation.
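A sketch of why escaping works directly on UTF-8 without per-width templates: the characters to escape are all ASCII, and ASCII byte values never occur inside a multi-byte UTF-8 sequence, so a plain byte-level replace is safe. The function name is hypothetical and the entity table is an assumption chosen for the example, not MarkupSafe's actual code.

```python
_REPLACEMENTS = (
    (b"&", b"&amp;"),  # must come first so later entities are not re-escaped
    (b"<", b"&lt;"),
    (b">", b"&gt;"),
    (b"'", b"&#39;"),
    (b'"', b"&#34;"),
)


def escape_utf8(data: bytes) -> bytes:
    """Escape HTML-special ASCII bytes in UTF-8 data; other bytes pass through."""
    for old, new in _REPLACEMENTS:
        data = data.replace(old, new)
    return data
```

One code path handles ASCII, Latin-1 and astral characters alike, which is the maintenance benefit described above.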

steve.dower · September 17, 2024, 3:08pm

It’s actually been designed specifically for the limited API, and to be stable even if we change internal layout. We’ve spent quite a while working on that on the earlier GitHub issues.

A non-limited API would be much closer to the macros we currently have.

Of course, you’re welcome to dislike the API still. I don’t particularly like it either, but with the “must be O(1)” and “must be limited API” constraints, it’s probably the best we can do. (You need to drop the O(1) constraint to always use UTF-8.)

methane · September 17, 2024, 5:09pm

Why is an O(1) limited API a must-have?

The PEP uses MarkupSafe as an example of why performance is needed. But MarkupSafe cannot be O(1) anyway. And I, the author of the MarkupSafe speedup, think UTF-8 has acceptable performance and is much easier to maintain.

Additionally, O(1) is impossible with PyPy.

Why O(1) is required is explained neither in the PEP nor in this thread.

steve.dower · September 17, 2024, 5:52pm

It’s because we already have O(N) functions (which convert to UTF-8 or wchar_t). Therefore, we don’t need another function that has the same characteristic. Ensuring it is O(1) is the only value we can add with a new function, otherwise, we shouldn’t add it at all.

O(1) is possible with PyPy because it can refuse to return any kind but UTF-8. The PEP should change to say that the function returns failure if a suitable format is not requested, and should never do conversions. I already suggested this earlier - without it, the function provides no additional value.

methane · September 18, 2024, 1:52am

Please read this comment from @vstinner.

O(1) is not possible with PyPy, and the new API is better than the current API because of its limited lifetime.

If a limited lifetime is not required, I agree that we don’t need a new limited API.

pitrou · September 18, 2024, 8:09am

Is it not? AFAIU, PyPy supports memoryview zero-copy views, so it would at least be theoretically possible to also support zero-copy views of str objects? @cfbolz

steve.dower · September 18, 2024, 1:18pm

Ah right, they can’t just pin the memory when the view is requested. (I never learned the intricacies of PyPy’s GC, so I typically just assume they’re similar to .NET’s GC.)

In any case, it means O(1) is never possible with PyPy, and that’s a tradeoff of the entire implementation. At least with this API it can be a straight memcpy, whereas if we implement the scanning and conversions being proposed up the thread then every implementation has to become more complicated.

vstinner · September 23, 2024, 11:59am

When you ask only for UCS-4 and data is stored as UCS-1 or UCS-2, I prefer to convert to UCS-4. IMO it makes the API more convenient to use.
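To make the cost of that convenience concrete, here is a sketch of the widening step in Python. `widen_to_ucs4()` is a hypothetical helper that goes through the UTF-32 codec in native byte order; it is not how CPython would implement the conversion, but it shows that every character has to be rewritten into a new, larger buffer.

```python
import sys


def widen_to_ucs4(text: str) -> bytes:
    """Return `text` as native-endian UCS-4 code units (4 bytes per code point)."""
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    # surrogatepass keeps lone surrogates, which str objects may contain.
    return text.encode(codec, "surrogatepass")


latin1 = "caf\u00e9"              # stored as 1 byte per character in CPython
wide = widen_to_ucs4(latin1)      # 4 bytes per character after conversion
first_unit = int.from_bytes(wide[:4], sys.byteorder)
```

The output is four times the size of the UCS-1 storage and takes one pass over the whole string to produce, so this path is O(n) by construction.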

Anyway, we cannot guarantee O(1) in all code paths on all Python implementations. As said before, in the future, CPython might change to store strings as UTF-8.

Moreover, on PyPy, it seems like O(1) is not possible, only O(n).

steve.dower · September 23, 2024, 12:45pm

Yes I know, and this is what I’m opposing. Clearly it will be up to others to decide one way or the other.

I don’t particularly want to maintain a function that can do every permutation of conversion. And I don’t want to add a function where the only benefit over the existing ones is that it uses arguments to decide the format rather than separate functions.

The only thing people have asked for here is O(1) access to the contents of a string. If we can’t provide that, we should just apologise and move on.

I’ll also note that PyPy’s inability to pin memory is completely unrelated to Unicode. They (apparently) can’t provide O(1) access to any internal structure through a pointer, because it’s how they work (as you’ve described it). Just because they don’t use a fixed location memory buffer for every string object doesn’t give us permission to convert formats and still say it’s fast enough.