Conversion between Python and C integers

storchaka · January 25, 2024, 11:39am

Python supports integers of unlimited range (memory permitting), whereas C has several integer types with limited ranges. There are several ways to convert a Python integer to a C integer and back:

- Dedicated C API functions like PyLong_AsLong() and PyLong_FromLong().
- PyArg_Parse() with a corresponding format unit like 'l', and Py_BuildValue() with a similar format unit.
- PyMemberDef with a corresponding type like Py_T_LONG.

These sets are not equivalent, especially for unsigned integers.

Most of these C API functions, except PyNumber_AsSsize_t(), have the PyLong_ prefix. There are usually three variants for conversion to a C integer:
- PyLong_AsLong() converts integers in the range LONG_MIN to LONG_MAX to signed long.
- PyLong_AsUnsignedLong() converts integers in the range 0 to ULONG_MAX to unsigned long.
- PyLong_AsUnsignedLongMask() accepts arbitrary integers and converts them to unsigned long modulo ULONG_MAX+1.
PyArg_Parse() has variants of format units for signed and unsigned types. For example, 'l' works like PyLong_AsLong() and 'k' works like PyLong_AsUnsignedLongMask(). There is no variant for PyLong_AsUnsignedLong(), so the only way to convert to unsigned long with a range check is to use a custom converter.
The PyMemberDef API also has variants for signed and unsigned types. Py_T_LONG is equivalent to PyLong_AsLong(), but Py_T_ULONG, which converts to unsigned long, is trickier. It accepts Python integers in the range LONG_MIN to ULONG_MAX. This is larger than the range of unsigned long, so negative integers in the range LONG_MIN to -1 are converted modulo ULONG_MAX+1.
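
The four behaviours described above can be modelled in plain C. This is only a sketch and not CPython code: it assumes the Python integer is represented by an int64_t and the target type is 32 bits wide (int32_t and uint32_t standing in for long and unsigned long), so that out-of-range inputs can be represented. All function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of PyLong_AsLong(): accept only INT32_MIN..INT32_MAX. */
bool model_as_signed(int64_t value, int32_t *out)
{
    if (value < INT32_MIN || value > INT32_MAX) {
        return false;   /* the real function reports an overflow error */
    }
    *out = (int32_t)value;
    return true;
}

/* Model of PyLong_AsUnsignedLong(): accept only 0..UINT32_MAX. */
bool model_as_unsigned(int64_t value, uint32_t *out)
{
    if (value < 0 || value > (int64_t)UINT32_MAX) {
        return false;   /* negative or too large */
    }
    *out = (uint32_t)value;
    return true;
}

/* Model of PyLong_AsUnsignedLongMask(): never fails, reduces the value
   modulo UINT32_MAX+1 (signed-to-unsigned conversion is modular in C). */
uint32_t model_as_unsigned_mask(int64_t value)
{
    return (uint32_t)value;
}

/* Model of the Py_T_ULONG range described above: accept
   INT32_MIN..UINT32_MAX and wrap the negative part. */
bool model_member_unsigned(int64_t value, uint32_t *out)
{
    if (value < INT32_MIN || value > (int64_t)UINT32_MAX) {
        return false;
    }
    *out = (uint32_t)value;   /* -1 becomes UINT32_MAX, and so on */
    return true;
}
```

In this model, -1 is rejected by model_as_unsigned() but both model_as_unsigned_mask() and model_member_unsigned() turn it into UINT32_MAX, which is the inconsistency the post is about.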

Why is there such a strange API for unsigned types? I think there are several reasons:

- It is not clear whether some types like uid_t or dev_t are implemented as signed or unsigned types (it varies between OSes).
- Even if some types are unsigned and support values larger than the maximal limit of the corresponding signed type (like uid_t or dev_t on some OSes), some negative values can still be used as special markers for an unknown or unavailable value, so you can see (uid_t)-1 or (size_t)-1 in C code. It is better to accept the Python integer -1 as a special value than to require 4294967295 or 18446744073709551615.
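
As a small illustration of where those two large numbers come from, the sketch below shows what -1 becomes when converted to a 32-bit or 64-bit unsigned type. The fixed-width types stand in for uid_t or size_t, whose real width varies by platform, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* -1 converted to an unsigned type is that type's maximum value. */
uint32_t sentinel_u32(void) { return (uint32_t)-1; }
uint64_t sentinel_u64(void) { return (uint64_t)-1; }

/* Typical C-side check for "unknown or unavailable", assuming a
   32-bit unsigned ID type. */
bool is_unknown_id(uint32_t id)
{
    return id == (uint32_t)-1;
}
```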

There are also differences in the support for int-like objects with an __index__() method, but this is a different painful issue.

Because of the differences between these three sets, it is difficult to write code that supports the same range for an argument as for a value passed to an attribute setter. It is also difficult to change code from using PyArg_Parse() to manual parsing with the C API and vice versa. How can we unify these APIs? An API like PyLong_AsUnsignedLongMask() is the most lenient, but it lets integer overflow errors pass silently. Should we limit its range as in Py_T_ULONG? Or maybe limit it even more, allowing only -1 as a negative value? There is a specialized private C API like _Py_Uid_Converter() which accepts only -1 as a negative value. In some cases any negative value is invalid (for example, when we specify a length) and all non-negative values that fit the target type are valid, so the stricter PyLong_AsUnsignedLong() has its value too. Should we add corresponding strict codes to PyArg_Parse() and PyMemberDef?
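
The "only -1 as a negative value" policy mentioned above could look roughly like this. It is a plain-C model of the policy, not the actual _Py_Uid_Converter() implementation; the int64_t input, the 32-bit target and the function name are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Accept -1 as a special marker and 0..UINT32_MAX as ordinary values;
   reject every other negative number and anything too large. */
bool model_convert_id(int64_t value, uint32_t *out)
{
    if (value == -1) {
        *out = (uint32_t)-1;          /* the special "no value" marker */
        return true;
    }
    if (value < 0 || value > (int64_t)UINT32_MAX) {
        return false;                 /* out of range for the target */
    }
    *out = (uint32_t)value;
    return true;
}
```

Note that under this policy -1 and 4294967295 still map to the same C value; the difference from the masking behaviour is that -2, -3 and so on are rejected instead of silently wrapping.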

I am going to add wrappers for some C structs, and I need support for types like uint32_t and off_t for this, so I need to resolve these questions for the older types before adding support for new types.

pitrou · January 25, 2024, 1:50pm

Also the fact that unsigned types (not just values) are less common and less useful than signed ones. That said, I agree that all those APIs should be made consistent in scope, and if possible cover the useful range of integer types, including explicitly-sized types such as int32_t and int64_t.

vstinner · January 26, 2024, 4:38pm

I have had an item on my TODO list for a few months. On Windows, the subprocess module seems to hang forever if a negative timeout is passed, whereas it should behave like sleep(0): a non-blocking call. For example, if you pass -1e9 second, it’s rounded to -1 ms, which becomes 0xFFFF_FFFF, and WaitForSingleObject() treats this value as INFINITE (wait forever).

The WaitForSingleObject() function is wrapped in Modules/_winapi.c which uses Argument Clinic with milliseconds: DWORD. And DWORD is defined in Argument Clinic as: create_converter('DWORD', 'k') # F_DWORD is always "k" (which is much shorter).

In short, PyArg_Parse() with the k format (C unsigned long) silently converts negative numbers to positive numbers (it wraps), which can lead to such surprising behavior. Is it a bug? A feature? I'll let you decide.
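
The wrap-around can be reproduced without Windows or CPython. The sketch below uses uint32_t in place of DWORD and defines its own MODEL_INFINITE constant with the same bit pattern as the Windows INFINITE value. The clamping helper is a hypothetical alternative, and its handling of overly large values is an assumption, not something proposed in this thread.

```c
#include <stdint.h>

/* Same bit pattern as the Windows INFINITE constant (0xFFFFFFFF). */
#define MODEL_INFINITE UINT32_C(0xFFFFFFFF)

/* What a wrapping, 'k'-style conversion does to a timeout in ms. */
uint32_t timeout_wrapping(int64_t ms)
{
    return (uint32_t)ms;              /* -1 becomes 0xFFFFFFFF */
}

/* Hypothetical alternative: negative timeouts mean "do not block",
   and a finite timeout can never turn into MODEL_INFINITE. */
uint32_t timeout_clamped(int64_t ms)
{
    if (ms < 0) {
        return 0;
    }
    if (ms >= (int64_t)MODEL_INFINITE) {
        return MODEL_INFINITE - 1;    /* assumption: clamp, do not fail */
    }
    return (uint32_t)ms;
}
```

With the wrapping version, a timeout of -1 ms is indistinguishable from "wait forever", which matches the hang described above.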

See also the discussion "Rationale for non-overflow checking format codes in PyArg_Parse*".

pitrou · January 27, 2024, 7:42pm

I suppose you’re being snarky, but just in case: I think it’s a bug.

storchaka · February 6, 2024, 8:54am

My intermediate plan is to raise an exception for values greater than UNSIGNED_MAX or less than SIGNED_MIN. I believe everything outside this range is a bug. Member setters already emit a RuntimeWarning in these cases, so they can be changed to raise an exception immediately. PyArg_Parse() and PyLong_AsUnsigned*Mask() need a warning first.

We should also consider either introducing new, stricter non-wrapping variants of the unsigned integer converters for PyArg_Parse(), or making the existing converters non-wrapping by default and adding new converters that allow wrapping of some negative numbers. The question is which negative numbers can be accepted: just -1, or everything from SIGNED_MIN to -1? Or do we need both types of converters?

h-vetinari · February 6, 2024, 10:59am

It’s going to take a very long time to get away from wobbly-sized integers, but the C99 sized integers are slowly but surely spreading, thankfully. C23 also adds first-class[1] support for larger-sized integers (e.g. _BitInt(128)). Not that Python will require C23 any time soon, but it would be good to keep the explicitly sized ints in mind for the API design (and ideally, the possibility that types larger than 64 bits become universally usable).

[1] For stupid ABI reasons related to intmax_t, the last step towards actually having int128_t etc. is insanely hard.

vstinner · February 6, 2024, 9:21pm

It would be nice to extend Argument Clinic (AC) for these use cases and to make it usable outside CPython.

If I recall correctly, AC already raises a ValueError if the argument type is unsigned but the passed value is negative.