Generally, I think that
- feature-wise, we should design API to not break in the future, but
- performance-wise, we should design API for what we (and e.g. PyPy) have now.
This means that if we change things in the future, extensions should continue working, but perhaps they’ll be (much) slower than before – until they update to use some new features or APIs.
IMO, this is better than trying to predict the future right now.
IMO, we need to do this, but discourage relying on it.
In practice, most of the returned buffers are NUL-terminated. No matter what we say or document, some users will expect the trailing NUL, and they won’t be bothered by embedded NULs truncating their strings.
Unfortunately, using C string functions on buffers that aren’t NUL-terminated can easily become a security issue. And we’re still building a C API.
So, to make the world safe, we need to always add the terminating NUL.
We can add a “no NUL please” flag in the future. Or alternate implementations where this is a bottleneck can add a XPyUnicode_Export_NoNUL function (which CPython can adopt later). But right now, let’s export the NUL and
- document that we do it, for the benefit of alternate implementations
- document that you shouldn’t rely on it, and should use the size whenever possible, since strings can have embedded NULs
We don’t need to add the flag now – if/when we do, all already-released CPython versions will simply ignore the flag and add an extra NUL.
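To illustrate why the size should be preferred over the trailing NUL, here is a minimal Python sketch using ctypes. It is not the proposed API; the helper name as_c_string is made up for this example. It just shows what a NUL-terminated read does to a buffer with an embedded NUL.

```python
import ctypes


def as_c_string(data: bytes) -> bytes:
    """Read `data` the way a C string function would: stop at the first NUL.

    Hypothetical helper, for illustration only.
    """
    # create_string_buffer copies the bytes and appends a terminating NUL.
    buf = ctypes.create_string_buffer(data)
    # .value reads the buffer as a NUL-terminated string.
    return buf.value


exported = "abc\0def".encode("utf-8")
print(len(exported))          # the size-based view covers the whole buffer
print(as_c_string(exported))  # the NUL-terminated view stops at the embedded NUL
```

The terminating NUL keeps the C-string read inside the buffer, but only the size tells the consumer where the string really ends.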
No. If a user needs to reject surrogate characters, they can scan the string themselves.
However, we do need to document that our exports can contain these.
Again, if some day there appears an (alternate) implementation where this check is cheap, we/they can add a flag/function for it.
(Already-released CPython versions will ignore the flag if it’s requested, and they won’t set it in the output.)
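As a rough idea of what “scan the string themselves” means, here is a Python sketch. A C consumer would do the equivalent loop over the exported buffer; the function name has_surrogates is an assumption for this example, not part of any proposal.

```python
def has_surrogates(text: str) -> bool:
    """Return True if `text` contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


print(has_surrogates("plain text"))
print(has_surrogates("lone \ud800 surrogate"))
```

The scan is a single linear pass, so a consumer that needs it pays for it, and everyone else doesn’t.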
No. Embedded NULs are a normal feature of Python strings.
Again, users that need to reject these can scan the result. In many cases, they might want to reject some other control characters as well, not just NUL – for example terminal escapes, lone surrogates, or BIDI overrides.
Let’s not build the tool now if CPython doesn’t need it. It would be an ill-fitting tool.
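For example, a consumer with stricter needs might use something like the sketch below. The policy (rejecting the Unicode general categories Cc, Cs and Cf) and the name first_rejected are assumptions chosen for illustration; a real consumer would pick its own rules, which is why a NUL-only check in the export API would fit poorly.

```python
import unicodedata

# Hypothetical policy, deliberately broad:
#   Cc = control characters (NUL, ESC, but also newline and tab)
#   Cs = surrogates
#   Cf = format characters (includes the BIDI overrides, e.g. U+202E)
REJECTED_CATEGORIES = {"Cc", "Cs", "Cf"}


def first_rejected(text: str):
    """Return the index of the first rejected character, or None if clean."""
    for index, ch in enumerate(text):
        if unicodedata.category(ch) in REJECTED_CATEGORIES:
            return index
    return None


print(first_rejected("hello"))
print(first_rejected("a\0b"))
```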
I guess the proper format would be "=I".
Some consumers might not expect the full struct syntax here, but since this is new API, perhaps that’s OK?
If so we should also use "=H". But I’d keep "B" alone; that’s unambiguous, and more likely to be special-cased.
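A quick check with Python’s struct module shows what the "=" prefix buys: standard sizes with no dependence on the platform’s C types. The FORMAT_BY_WIDTH mapping below is only an illustration of the idea, not the actual export formats.

```python
import struct

# Hypothetical mapping from code unit width (in bytes) to a format string.
FORMAT_BY_WIDTH = {1: "B", 2: "=H", 4: "=I"}

for width, fmt in FORMAT_BY_WIDTH.items():
    # With "=", struct uses standard sizes: H is 2 bytes and I is 4 bytes.
    # "B" is a single byte in every mode, so it needs no prefix.
    print(width, fmt, struct.calcsize(fmt))
```

Without the prefix, "I" and "H" mean the native unsigned int and unsigned short, whose sizes the C standard does not fix.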