Chiming in here as a sometime contributor to unicodedata2 and a community member interested in Unicode more generally: I like the simplicity of the proposed API compared to juggling the underlying primitives directly.
unicodedata2 itself just copies code from upstream, so it would add some maintenance burden to that project if the CPython implementation of unicodedata switched to this API[1]. We could shim around it if we needed to, though.
I don’t have a lot to say about the performance concerns, but I agree with Steve’s remarks along the lines of either meeting the performance guarantee or letting the user know they need a fallback. My one qualm there is that the name PyUnicode_Export() isn’t particularly obvious about being a fast export of an internal representation that might fail. If I hadn’t read the PEP/thread, I would probably have expected implicit conversion rather than a failure when the requested format(s) don’t align with the internal format.
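To make that concrete, here is a rough sketch of the fallback pattern I would expect callers to write, assuming the draft signature from the PEP (`int32_t PyUnicode_Export(PyObject *, int32_t requested_formats, Py_buffer *view)`, returning the chosen format or -1 with an exception set) and the fail-rather-than-convert semantics discussed here; the exact names and behavior may differ from the final text:

```c
#include <Python.h>

/* Hypothetical caller: fast path over a direct UCS1/UCS2 export,
   with an explicit UTF-8 conversion as the fallback. */
static int
process_string(PyObject *s)
{
    Py_buffer view;
    int32_t fmt = PyUnicode_Export(
        s, PyUnicode_FORMAT_UCS1 | PyUnicode_FORMAT_UCS2, &view);
    if (fmt < 0) {
        /* The internal representation matches none of the requested
           formats: clear the error and convert explicitly, instead of
           the API converting implicitly behind our back. */
        PyErr_Clear();
        Py_ssize_t size;
        const char *utf8 = PyUnicode_AsUTF8AndSize(s, &size);
        if (utf8 == NULL) {
            return -1;  /* e.g. lone surrogates */
        }
        /* ... slow path over the UTF-8 bytes ... */
        return 0;
    }
    /* ... fast path over view.buf; the element width follows fmt ... */
    PyBuffer_Release(&view);
    return 0;
}
```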
I’m not clear on whether or not that’s a possibility based on the discussion, but it’s not really important to this PEP. ↩︎
I no longer feel strong support for these APIs.
They sit on an awkward borderline between the stable ABI and exposing “implementation details” (the UCS1/UCS2/UCS4 string formats). There are also several subtle questions about embedded null characters (NUL) and surrogate characters.
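For readers who haven’t followed the whole thread, a minimal sketch of those two subtleties, using only long-standing public calls (error handling trimmed):

```c
#include <Python.h>
#include <assert.h>
#include <string.h>

static void
demo_subtleties(void)
{
    /* A perfectly valid str of length 5 with an embedded NUL. */
    PyObject *s = PyUnicode_FromStringAndSize("ab\0cd", 5);
    Py_ssize_t size;
    const char *buf = PyUnicode_AsUTF8AndSize(s, &size);
    /* size == 5, but strlen(buf) == 2: any consumer treating the
       exported buffer as NUL-terminated silently drops "cd". */
    assert(size == 5 && strlen(buf) == 2);

    /* A lone surrogate is a valid code point in a str, but it has no
       UTF-8 encoding, so any UTF-8-based export has to fail on it. */
    PyObject *lone = PyUnicode_DecodeUnicodeEscape("\\ud800", 6, NULL);
    if (PyUnicode_AsUTF8AndSize(lone, &size) == NULL) {
        PyErr_Clear();  /* UnicodeEncodeError */
    }
    Py_DECREF(s);
    Py_DECREF(lone);
}
```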
I’m no longer sure that there is a strong use case for these APIs. MarkupSafe could use UTF-8 instead of this API, or simply not use the limited C API.
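As a sketch of what the UTF-8 route could look like for MarkupSafe-style code (the helper name is hypothetical, not MarkupSafe’s actual code; PyUnicode_AsUTF8AndSize has been in the stable ABI since 3.10):

```c
#include <Python.h>

/* Hypothetical escaping helper: scan the UTF-8 bytes directly. This
   is safe because in UTF-8 every byte of a multi-byte sequence is
   >= 0x80, so it can never be mistaken for an ASCII metacharacter
   such as '<' or '&'. */
static Py_ssize_t
count_escapable(PyObject *s)
{
    Py_ssize_t size;
    const char *buf = PyUnicode_AsUTF8AndSize(s, &size);
    if (buf == NULL) {
        return -1;  /* e.g. lone surrogate; exception is set */
    }
    Py_ssize_t n = 0;
    for (Py_ssize_t i = 0; i < size; i++) {
        switch (buf[i]) {
        case '<': case '>': case '&': case '"': case '\'':
            n++;
            break;
        }
    }
    return n;
}
```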
Thanks @vstinner for all the work that went into this. As you say, it’s still a useful resource, and I do think we ended up at a reasonable design for such a complex feature (and should try to reuse it in the future for similar problems).
I made a pull request to optimize UTF-8 decoding. It reduces the temptation to use the PEP 393 API instead of PyUnicode_FromStringAndSize(). Would someone review it?
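For context, here is a sketch of the two routes being compared, assuming the input is known to be pure ASCII; both calls are existing CPython APIs, but only the first is available in the limited C API:

```c
#include <Python.h>
#include <string.h>

/* Limited C API route: decode the bytes as UTF-8. This is the call
   whose decoder the PR optimizes. */
static PyObject *
make_str_portable(const char *data, Py_ssize_t len)
{
    return PyUnicode_FromStringAndSize(data, len);
}

/* PEP 393 route: allocate the exact internal representation and copy
   into it directly. Fast, but tied to implementation details. */
static PyObject *
make_str_pep393(const char *data, Py_ssize_t len)
{
    PyObject *s = PyUnicode_New(len, 127);  /* maxchar 127 == ASCII */
    if (s == NULL) {
        return NULL;
    }
    memcpy(PyUnicode_1BYTE_DATA(s), data, (size_t)len);
    return s;
}
```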