Chiming in here as a sometime contributor to unicodedata2 and a community member interested in Unicode more generally: I like the simplicity of the proposed API compared to juggling the underlying primitives directly.
unicodedata2 itself just copies code from upstream, so it would add some maintenance burden to that project if the CPython implementation of unicodedata switched to this API[1]. We could shim around it if we needed to, though.
I don’t have a lot to say about the performance concerns, but I agree with Steve’s remarks along the lines of either meeting the performance guarantee or letting the user know they need a fallback. My one qualm there is that the name PyUnicode_Export() isn’t particularly obvious about being a fast export of an internal representation that might fail. If I hadn’t read the PEP/thread, I would probably have expected implicit conversion rather than a failure when the requested format(s) don’t align with the internal format.
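To make that concrete, here is a rough sketch of the fallback pattern I would expect callers to write, assuming the draft signature from the PEP (`int32_t PyUnicode_Export(PyObject *, int32_t requested_formats, Py_buffer *view)`, returning the chosen format or -1 with an exception set) and the fail-rather-than-convert semantics discussed here; the exact names and behavior may differ from the final text:

```c
#include <Python.h>

/* Hypothetical caller: fast path over a direct UCS1/UCS2 export,
   with an explicit UTF-8 conversion as the fallback. */
static int
process_string(PyObject *s)
{
    Py_buffer view;
    int32_t fmt = PyUnicode_Export(
        s, PyUnicode_FORMAT_UCS1 | PyUnicode_FORMAT_UCS2, &view);
    if (fmt < 0) {
        /* The internal representation matches none of the requested
           formats: clear the error and convert explicitly, instead of
           the API converting implicitly behind our back. */
        PyErr_Clear();
        Py_ssize_t size;
        const char *utf8 = PyUnicode_AsUTF8AndSize(s, &size);
        if (utf8 == NULL) {
            return -1;  /* e.g. lone surrogates */
        }
        /* ... slow path over the UTF-8 bytes ... */
        return 0;
    }
    /* ... fast path over view.buf; the element width follows fmt ... */
    PyBuffer_Release(&view);
    return 0;
}
```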
I’m not clear on whether or not that’s a possibility based on the discussion, but it’s not really important to this PEP. ↩︎
I no longer feel strong support for these APIs.
They sit on an awkward borderline between the stable ABI and exposing “implementation details” (the UCS1/UCS2/UCS4 string formats). There are also several subtle questions about embedded null characters (NUL) and surrogate characters.
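For readers who haven’t followed the whole thread, a minimal sketch of those two subtleties, using only long-standing public calls (error handling trimmed):

```c
#include <Python.h>
#include <assert.h>
#include <string.h>

static void
demo_subtleties(void)
{
    /* A perfectly valid str of length 5 with an embedded NUL. */
    PyObject *s = PyUnicode_FromStringAndSize("ab\0cd", 5);
    Py_ssize_t size;
    const char *buf = PyUnicode_AsUTF8AndSize(s, &size);
    /* size == 5, but strlen(buf) == 2: any consumer treating the
       exported buffer as NUL-terminated silently drops "cd". */
    assert(size == 5 && strlen(buf) == 2);

    /* A lone surrogate is a valid code point in a str, but it has no
       UTF-8 encoding, so any UTF-8-based export has to fail on it. */
    PyObject *lone = PyUnicode_DecodeUnicodeEscape("\\ud800", 6, NULL);
    if (PyUnicode_AsUTF8AndSize(lone, &size) == NULL) {
        PyErr_Clear();  /* UnicodeEncodeError */
    }
    Py_DECREF(s);
    Py_DECREF(lone);
}
```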
I’m no longer sure that there is a strong use case for these APIs. MarkupSafe could use UTF-8 instead of this API, or simply not use the limited C API.
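As a sketch of what the UTF-8 route could look like for MarkupSafe-style code (the helper name is hypothetical, not MarkupSafe’s actual code; PyUnicode_AsUTF8AndSize has been in the stable ABI since 3.10):

```c
#include <Python.h>

/* Hypothetical escaping helper: scan the UTF-8 bytes directly. This
   is safe because in UTF-8 every byte of a multi-byte sequence is
   >= 0x80, so it can never be mistaken for an ASCII metacharacter
   such as '<' or '&'. */
static Py_ssize_t
count_escapable(PyObject *s)
{
    Py_ssize_t size;
    const char *buf = PyUnicode_AsUTF8AndSize(s, &size);
    if (buf == NULL) {
        return -1;  /* e.g. lone surrogate; exception is set */
    }
    Py_ssize_t n = 0;
    for (Py_ssize_t i = 0; i < size; i++) {
        switch (buf[i]) {
        case '<': case '>': case '&': case '"': case '\'':
            n++;
            break;
        }
    }
    return n;
}
```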
Thanks @vstinner for all the work that went into this. As you say, it’s still a useful resource, and I do think we ended up at a reasonable design for such a complex feature (and should try to reuse it in the future for similar problems).
I made a pull request to optimize UTF-8 decoding. It reduces the temptation to use the PEP 393 API instead of PyUnicode_FromStringAndSize(). Would someone review it?
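For context, here is a sketch of the two routes being compared, assuming the input is known to be pure ASCII; both calls are existing CPython APIs, but only the first is available in the limited C API:

```c
#include <Python.h>
#include <string.h>

/* Limited C API route: decode the bytes as UTF-8. This is the call
   whose decoder the PR optimizes. */
static PyObject *
make_str_portable(const char *data, Py_ssize_t len)
{
    return PyUnicode_FromStringAndSize(data, len);
}

/* PEP 393 route: allocate the exact internal representation and copy
   into it directly. Fast, but tied to implementation details. */
static PyObject *
make_str_pep393(const char *data, Py_ssize_t len)
{
    PyObject *s = PyUnicode_New(len, 127);  /* maxchar 127 == ASCII */
    if (s == NULL) {
        return NULL;
    }
    memcpy(PyUnicode_1BYTE_DATA(s), data, (size_t)len);
    return s;
}
```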