Add portable way to extract all digits of a `PyLongObject` from C?

Python mandates integers need to be “big ints”, i.e., integers can be of arbitrary length. CPython currently implements this with the following type:

typedef struct {
    /* ... */
    ssize_t ob_size;
    uint32_t ob_digit[1];  /* variable-length array */
} PyLongObject;

and provides the following method in its stable C API/ABI for converting this to a C number:

long PyLong_AsLongAndOverflow(PyObject *obj, int *overflow);

which returns the Python integer if it fits inside a C long, and sets overflow to a non-zero value otherwise. However, suppose I want to extract all digits of the python in order to, e.g., feed it to a BigInt library. How do I do that in a portable and stable way?

I could of course parse the object above, but as far as I understand the docs correctly, for ABI stability sake the Python devs consider PyLongObject * an opaque pointer, its structure an implementation detail, and subject to change. Alternatively, but this is painfully slow, I could round-trip through the python divide and modulo functions.

So how about adding a pair of new methods:

size_t PyLong_AsChunks(PyObject *obj, uint32_t *chunks, size_t chunk_size);
int PyLong_FromChunks(PyObject *obj, const uint32_t *chunks, size_t chunk_size);

which convert from and to a set of chunks, which then can be assembled and fed into an external bigint library?

You may have overlooked int.to_bytes and int.bit_length.

But that incurs a roundtrip through Python, right? The question is how to do that from the C API.

Related ideas discussion:
https://www.mail-archive.com/python-ideas@python.org/msg28177.html

@mwallerb is your intent for the to-bytes function to return the internal structure (a list of 4 * ceil(log(n, 2**30)) bytes to represent the base-2**30 ‘digits’ of int n; see Internal structure, likely also needs the header)?
That differs from the less implementation-specific ‘logical’ representation (a list of ceil(log(n, 256)) bytes to represent the base-256 ‘digits’ of int n)

The former would expose internal detail and not be obviously useful for consumers because of the 30 mantissa bits per uint_32.
The latter would still require processing (bit-shifting to repackage) to construct, though less than a roundtrip via decimal (which apparently is quadratical, see recent restriction for security). But that basically is the existing to-bytes function, I guess.

PyObject_CallMethod is part of the C API, anything wrong with that? It may be slightly slower than a dedicated C API function, but not painfully so.

3 Likes

I concur with @Dutcho that the internal structure is probably not terribly useful, and to_bytes is better.

@AndersMunch I am not sure about the overhead, I suspect it is not huge. But just from a conceptual perspective I find it strange that the C API for an arbitrary-length integer does not contain a method for
converting it to anything but longs and long longs.

Especially since there’s already a _PyLong_FromByteArray function in CPython that has a somewhat nicer interface than packing and unpacking tuples:

PyObject *_PyLong_FromByteArray(const unsigned char *bytes, size_t n,
                                int little_endian, int is_signed);

but alas, it is not public. Is there any harm in exposing this to the user?

Or, at the very least, if the to_bytes method is the “blessed” way of getting the full integer, shouldn’t we add something to this effect to the PyLong docs?

By the time someone wants this feature is there really significant benefit to maintaining direct C API functions to do the same?

They’re already making a call saying they expect to handle arbitrary length bigints more than long long (64-128) bits. At which point added overhead from CallMethod may be dwarfed when faced with actually high bit count numbers. One fewer memory copy out of the PyBytes object and excess Mallorca/free. Do you have a practical application where that is a measurable performance bottleneck?

1 Like