New PyUnicode_EqualToUTF8() function

storchaka · October 4, 2023, 4:37pm

It has already passed a harsh review by Victor and is ready to be merged, but since adding a new C API is a big deal, I want to talk about it here.

There is a public C API function PyUnicode_CompareWithASCIIString(). It compares a Unicode object with a C string which is interpreted as Latin1 encoded (despite ASCII in the name). It never raises an exception and returns -1, 0 or 1 if the first argument is less than, equal to or larger than the second argument. The flaw of this function is that it interprets the C string as Latin1 encoded. Everywhere else in the C API, C strings are interpreted as UTF-8 encoded. So, while it can be used to compare with an ASCII string literal, it cannot be used to compare with PyTypeObject.tp_name, PyMethodDef.ml_name, PyDescrObject.d_name, etc. in the general case. It is also not so convenient, because virtually all usages of it in CPython are equality tests.
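To illustrate the encoding mismatch described above, here is a minimal sketch in plain C. It does not use the CPython API: the helper names `equal_as_latin1`, `equal_as_utf8` and `next_utf8` are hypothetical, the Python string is modelled as an array of code points, and the UTF-8 decoder is simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a CPython function): every byte of s is taken
   as one Latin-1 code point and compared with cp[0..n-1]. */
static int equal_as_latin1(const uint32_t *cp, size_t n, const char *s)
{
    assert(cp != NULL && s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    for (size_t i = 0; i < n; i++) {
        if (p[i] == 0 || (uint32_t)p[i] != cp[i]) return 0;
    }
    return p[n] == 0;   /* the C string must end here as well */
}

/* Hypothetical helper: decode one UTF-8 sequence at *pp and advance *pp.
   Returns UINT32_MAX for a malformed sequence. Simplified on purpose. */
static uint32_t next_utf8(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    uint32_t c = p[0];
    int extra;
    if (c < 0x80) { extra = 0; }
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }
    else return UINT32_MAX;
    for (int i = 1; i <= extra; i++) {
        /* A NUL terminator is not a continuation byte, so we stop in time. */
        if ((p[i] & 0xC0) != 0x80) return UINT32_MAX;
        c = (c << 6) | (p[i] & 0x3F);
    }
    *pp = p + 1 + extra;
    return c;
}

/* Hypothetical helper: s is decoded as UTF-8 and compared with cp[0..n-1]. */
static int equal_as_utf8(const uint32_t *cp, size_t n, const char *s)
{
    assert(cp != NULL && s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    for (size_t i = 0; i < n; i++) {
        if (*p == 0) return 0;
        if (next_utf8(&p) != cp[i]) return 0;
    }
    return *p == 0;
}
```

With the code points of "café" (`'c', 'a', 'f', 0xE9`), the UTF-8 bytes `"caf\xC3\xA9"` match only under the UTF-8 interpretation; the Latin-1 interpretation sees U+00C3 followed by U+00A9 where the string has U+00E9.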

There is a private C API function _PyUnicode_EqualToASCIIString(). It only supports equality tests and is more convenient for them. It requires the C string to be ASCII-only and crashes in a debug build if it is not.

The new PyUnicode_EqualToUTF8() function is a generalization of _PyUnicode_EqualToASCIIString() which supports non-ASCII C strings (interpreting them as UTF-8 encoded). It is completely compatible with _PyUnicode_EqualToASCIIString() and will replace it. It can replace almost 100% (or all 100%) of usages of PyUnicode_CompareWithASCIIString(). It can replace a pair of PyUnicode_FromString() followed by one of the comparison functions (there are several options). It does not raise an exception, does not use the heap, and preserves the currently raised exception if there is one, so it can be used in critical parts of code.

That was the advertising.

Any suggestions or objections?

storchaka · October 4, 2023, 4:38pm

Issue:

github.com/python/cpython

C API: Add PyUnicode_EqualToUTF8() function

opened 02:35PM - 03 Oct 23 UTC

serhiy-storchaka

type-feature topic-unicode topic-C-API

# Feature or enhancement

There is a public `PyUnicode_CompareWithASCIIString()` function. Despite its name, it compares a Python string object with an ISO-8859-1 encoded C string. It returns -1, 0 or 1 and never sets an error.

There is a private `_PyUnicode_EqualToASCIIString()` function. It only works with an ASCII encoded C string and crashes in a debug build if it is not ASCII. It returns 0 or 1 and never sets an error.

`_PyUnicode_EqualToASCIIString()` is more efficient than `PyUnicode_CompareWithASCIIString()`, because if the arguments are not equal it can simply return false instead of determining which is larger. That was the main reason for introducing it. It is also more convenient, because you do not need to add `== 0` or `!= 0` after the call (and if it is not added, the code is difficult to read).

I propose to add the latter function to the public C API, but also to extend it to support UTF-8 encoded C strings. While most use cases are ASCII-only, formally almost all C strings in the C API are UTF-8 encoded. `PyUnicode_FromString()` and `PyUnicode_AsUTF8AndSize()`, used to convert between Python and C strings, use UTF-8 encoding. `PyTypeObject.tp_name`, `PyMethodDef.ml_name` and `PyDescrObject.d_name` are all UTF-8 encoded. `PyUnicode_CompareWithASCIIString()` cannot be used to compare a Python string with such names.

For PyASCIIObject objects the new function will be as fast as `_PyUnicode_EqualToASCIIString()`.

### Linked PRs

* gh-110297

PR:

pitrou · October 4, 2023, 5:23pm

As I said on the PR, I don’t think taking a null-terminated C string is a very good API choice these days. The Python C API is used not only by C developers but also from many other languages such as C++ and Rust where strings have an explicit length.

storchaka · October 4, 2023, 5:29pm

But all strings used in the C API are null-terminated. How would you use a tp_name or ml_name that is not null-terminated? Or keyword names that are not null-terminated?

We can add PyUnicode_EqualToUTF8AndSize() if there is a need.

pitrou · October 4, 2023, 5:33pm

Is this a weird joke? Are you saying that I’m not supposed to call PyUnicode_EqualToUTF8 with something other than a tp_name?

A design rule should be that C APIs are useful for a wide range of use cases. Especially if you’re making them part of the stable ABI as you seem to.

Well… why not, but why have two functions?

daniele · October 4, 2023, 5:44pm

If this gets added, I think that a generic comparison function, PyUnicode_CmpUTF8 or something, is much more useful than one function that exclusively checks for equality. The function checking for equality can be added as a convenience inline helper. The only reason (other than ergonomics, solved with the inline helper) for providing the equality check but not the comparison is performance, but the fast path already uses memcmp() and I’m not convinced that the slow path would become measurably slower if it tested ordering instead of equality.

storchaka · October 4, 2023, 5:59pm

You are not supposed to call it with anything that is not a null-terminated C string, as with the rest of the C API.

Currently its predecessors are mostly called with literal C strings like "sys", "<stdin>" or "__class__". It is inconvenient to count and pass the length of these literals. Hmm, PyArg_ParseTupleAndKeywords() also uses it for keyword names, which means that only ASCII keyword names are actually supported.

I think that the new function can also be used with null-terminated C strings which are attributes of C structures like PyMethodDef or PyDescrObject. And non-ASCII keyword names can finally be supported in PyArg_ParseTupleAndKeywords(). All these C strings only have a pointer, not a size. They are null-terminated.

I do not know of any other use case, but if you know of one, and it is common enough, a new function can be introduced.

storchaka · October 4, 2023, 6:01pm

Because the common case is for null-terminated strings. I am not even sure that the other case exists and that it is not marginal.

MRAB · October 4, 2023, 6:09pm

How about a function that accepts a length, but if that length is -1 (or just negative in general), then the string is assumed to be null-terminated?

davidhewitt · October 4, 2023, 6:38pm

Rust strings are not null-terminated; I would love to have PyUnicode_EqualToUTF8AndSize() to be able to cheaply compare Rust strings against Unicode objects!

pitrou · October 4, 2023, 7:33pm

Yet, a bunch of C API functions do take an explicit string length argument, such as PyUnicode_Decode, PyUnicode_DecodeFSDefaultAndSize, PyUnicode_DecodeLocaleAndSize, PyBytes_FromStringAndSize, PyUnicode_FromWideChar, PyUnicode_DecodeUTF8, etc.

That said, I agree that passing the string length can be annoying when dealing with C literals, so having two functions (or one function where the size argument can be -1 to indicate an unknown length) sounds reasonable to me?

malemburg · October 4, 2023, 8:28pm

I’m not against adding such a function, but why only have it work for equality and not also for less than and greater than?

IMO, it’s better to add a PyUnicode_CompareWithUTF8String() API, which returns -1, 0, 1 respectively. And perhaps another PyUnicode_CompareWithUTF8StringAndSize() API for strings that are not NUL-terminated, where you know the size.

Antoine does have a point in that such functions are not just mere helpers for CPython, but do serve a purpose outside CPython as well and it’s not uncommon to have to deal with strings that can embed NULs. I’m not saying that it’s common to have such strings, but often, this special case is not invalid on input. Stopping the comparison at the first NUL code point could then easily lead to security issues later on.
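The embedded-NUL concern can be shown with a small sketch in plain C. The two helpers below are hypothetical stand-ins for the two API shapes under discussion, not CPython functions; both sides are modelled as raw byte buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: comparison that trusts the NUL terminator on both sides. */
static int equal_nul_terminated(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;   /* stops at the first NUL byte */
}

/* Hypothetical: comparison that uses explicit lengths, so embedded NUL
   bytes are compared like any other byte. */
static int equal_with_size(const char *a, size_t alen,
                           const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

For a 10-byte input such as `"admin\0evil"`, the NUL-terminated comparison against `"admin"` reports equality because it never looks past the embedded NUL, while the sized comparison reports a mismatch.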

barry-scott · October 4, 2023, 8:57pm

Passing the length as -1 I do not like as an API.
Implement two functions please.
One that takes a NUL-terminated string, the other that takes a pointer and length.

You then can refactor the implementation as you see fit.

pitrou · October 5, 2023, 9:55am

I would say because an ordered relationship between Unicode codepoints doesn’t mean much and is usually not what users are expecting, while equality testing is extremely common and reasonably intuitive (except for occasional normalization issues).

malemburg · October 5, 2023, 11:56am

I’m not sure I understand. Sorting is done in exactly this way (using code point ordinals as basis) and comparisons are also useful for searching and indexing (in an ordered set of values).

Performance of a full comparison vs. just an equality check is also the same, since in both cases, the comparison can stop at the first mismatch.

storchaka · October 5, 2023, 12:22pm

It looks reasonable if there is a case for a function with a size argument. It seems that one use case has already been found: comparison with Rust strings. I am not sure how useful a new function would be for Rust, compared with the alternatives:

PyUnicode_FromStringAndSize() + PyUnicode_Compare().
PyUnicode_AsUTF8AndSize() + memcmp().

These are more cumbersome (but I think that in Rust they will use wrappers in any case) and can fail. PyUnicode_FromStringAndSize() always uses the heap, while PyUnicode_AsUTF8AndSize() can use the heap and “leaks” memory into the cache (but that makes subsequent comparisons faster).

In any case, adding a function with the size argument has a small cost and probably won’t affect performance of the main case.

But a function which returns -1, 0 or 1 is a different thing.

It will double the size of the original, not so small, function, because for every failed equality test it needs to check which of the bytes was larger. This affects not only readability; it may also affect performance.
Some fast checks (like comparing the size of the ASCII or cached UTF-8 representation) can no longer be used.
And we need to decide what to do with non-decodable bytes on one side and non-encodable code points (in the surrogates range) on the other. Currently they mean “not equal”, but with full ordering, how should they be ordered?

So, implementing ordering instead of an equality test has some cost. And adding a parallel implementation also has some cost.
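The difference in fast paths can be sketched in plain C. This is only an illustration under the assumption that both sides are already available as UTF-8 bytes; `bytes_equal` and `bytes_compare` are hypothetical names, and the sketch ignores the non-decodable and non-encodable cases raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: equality can reject on a length mismatch without
   touching a single byte. */
static int bytes_equal(const char *a, size_t alen,
                       const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    if (alen != blen) return 0;          /* O(1) fast path */
    return memcmp(a, b, alen) == 0;
}

/* Hypothetical: ordering has no such shortcut. It must scan the common
   prefix, then fall back to the lengths to break a tie. */
static int bytes_compare(const char *a, size_t alen,
                         const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    size_t n = alen < blen ? alen : blen;
    int c = memcmp(a, b, n);             /* compares as unsigned bytes */
    if (c != 0) return c < 0 ? -1 : 1;
    if (alen == blen) return 0;
    return alen < blen ? -1 : 1;
}
```

Here `bytes_equal("abc", 3, "abcd", 4)` returns 0 from the length check alone, whereas `bytes_compare` with the same arguments has to compare three bytes before it can return -1.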

pitrou · October 5, 2023, 12:34pm

It’s not only Rust but also C++, and generally any other language, runtime or data format where strings can contain embedded zeros.

vstinner · October 5, 2023, 2:04pm

I don’t think that it’s worth it to discuss which programming language (C, C++, Rust, etc.) is more popular these days, we should just cover all cases by having two APIs: one with length in bytes, one without length (use strlen() internally).
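A rough sketch of that two-function layout, in plain C rather than against the CPython API. The type `text_t` and both function names are hypothetical; `text_t` merely stands in for a string object whose UTF-8 form and byte length are known.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string object; not a CPython type. */
typedef struct {
    const char *utf8;   /* UTF-8 bytes, not necessarily NUL-terminated */
    size_t size;        /* length in bytes */
} text_t;

/* Sized variant: the caller passes the length in bytes. */
static int text_equal_to_utf8_and_size(const text_t *t,
                                       const char *s, size_t size)
{
    assert(t != NULL && s != NULL);
    return t->size == size && memcmp(t->utf8, s, size) == 0;
}

/* NUL-terminated variant: a thin wrapper that calls strlen(). */
static int text_equal_to_utf8(const text_t *t, const char *s)
{
    assert(s != NULL);
    return text_equal_to_utf8_and_size(t, s, strlen(s));
}
```

Call sites with C literals stay short (`text_equal_to_utf8(&t, "sys")`), while callers holding a pointer and length, for example a slice of a larger buffer, use the sized variant directly.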

davidhewitt · October 5, 2023, 3:45pm

Yes I’d hope that most Rust users are using PyO3 (but not all e.g. orjson).

~~I suppose already that if PyUnicode_AsUTF8AndSize() fails then I can just call PyErr_Clear() and infer that the Unicode object was not equal to the UTF8 Rust string.~~ So it seems to me the main advantage of this new API is performance: PyUnicode_EqualToUTF8AndSize() does not need to have any error-handling branches beyond returning 0 in the not-equal case.

EDIT: maybe the existing APIs are not so straightforward in out-of-memory conditions, so I need to do something more complex than PyErr_Clear, but the same insight may still apply (with the new API now looking even better for simplicity & performance).

storchaka · October 6, 2023, 9:44am

I have also added PyUnicode_EqualToUTF8AndSize().

Thank you all for the discussion. I did not even think about the size parameter, because I hadn’t seen any use for it. But the discussion convinced me that such cases exist.

As for the *Compare* functions, let’s wait and see if there are any cases left after using the *EqualTo* functions.