PyLongObject: just use GMP?

I’m surprised! While endianness was mildly annoying at times, I thought CPython covered all the surprises there decades ago. Have more assumptions crept in? It’s really quite rare that C code needs to peek into the storage order of a multibyte scalar at a smaller granularity. Forcing endianness (as in constructing packets for network protocols) is easily done with less than a handful of platform-dependent functions (like the htonX family).
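In Python terms, that handful of conversions is covered by struct’s byte-order prefixes, which play the role the htonX family plays in C. A minimal sketch (the constant is just an example value):

```python
import struct

# Pack a 32-bit value three ways: host order, forced big-endian
# (network byte order, what htonl would produce), and forced
# little-endian. Only the '=' result depends on the machine.
value = 0x12345678

native = struct.pack("=I", value)   # whatever the host uses
big = struct.pack(">I", value)      # always big-endian
little = struct.pack("<I", value)   # always little-endian

assert big == b"\x12\x34\x56\x78"
assert little == b"\x78\x56\x34\x12"
assert native in (big, little)      # the host is one of the two
```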

Or is something beyond just endianness in play?

2 Likes

To answer @tim.one’s questions about PyPy: basically we still use Python 2’s approach to representing ints. The int/long distinction is still there internally. Ints that fit into a machine word are stored (boxed) in one representation; for larger ints we use a big-integer representation. We use the full 64 bits for small integers (1<<63 does not fit into an int64_t; it’s one too big).
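That boundary can be sketched in a few lines of Python (`fits_in_word` is an illustrative name, not PyPy’s): the full signed 64-bit range gets the machine-word representation, and anything outside it takes the bignum path.

```python
# Signed 64-bit range: the machine-word representation covers exactly this.
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1

def fits_in_word(n: int) -> bool:
    return INT64_MIN <= n <= INT64_MAX

assert fits_in_word((1 << 63) - 1)   # largest int64_t value
assert not fits_in_word(1 << 63)     # "one too big": needs the bigint path
assert fits_in_word(INT64_MIN)       # the negative extreme still fits
```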

We tried tagging very early on, but the cost of tag checks everywhere turned out not to be worth it. Tagging requires every operation on all objects to have an extra check. Instead, we leaned heavily into having the JIT remove allocations for objects with predictable lifetimes. The JIT can reason about integers quite effectively: we have a range analysis, a known-bits analysis, and plenty of integer optimizations (all checked for correctness by an SMT solver at VM build time).
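To make the cost concrete, here is a toy simulation of pointer tagging (purely illustrative, not PyPy code): the low bit of a word marks “this is a small int”, so every operation has to branch on the tag before it can touch the value, even when no small int is involved.

```python
TAG_INT = 1  # low bit set means: the value lives in the upper bits

def tag_int(n: int) -> int:
    # Store n shifted up by one, with the tag bit in the low position.
    return (n << 1) | TAG_INT

def add(a: int, b: int) -> int:
    # This is the extra check PyPy found too costly: it runs on *every*
    # operation, including ones that never see a tagged small int.
    if (a & TAG_INT) and (b & TAG_INT):
        return tag_int((a >> 1) + (b >> 1))
    raise TypeError("untagged operand: fall back to the generic path")

assert add(tag_int(2), tag_int(3)) == tag_int(5)
```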

As you also described, we have runtime-switchable list storage strategies that store the contents of type-homogeneous lists of ints/floats in an unboxed way. This saves allocations when creating the list. Additionally, when iterating over the list we do a single check of the strategy before the loop, gaining knowledge about the types of all the list elements at once.
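The idea can be sketched in Python using the stdlib’s array module for the unboxed case (the class and attribute names here are mine, not PyPy’s):

```python
from array import array

# Illustrative sketch of a storage strategy: a homogeneous list of
# machine-word ints is kept unboxed in an array('q'); anything else
# falls back to a generic list of boxed objects.
class StrategyList:
    def __init__(self, items):
        if all(type(x) is int and -(2**63) <= x < 2**63 for x in items):
            self.strategy = "int64"
            self.storage = array("q", items)   # raw machine words, no boxes
        else:
            self.strategy = "object"
            self.storage = list(items)

assert StrategyList([1, 2, 3]).strategy == "int64"
assert StrategyList([1, "a"]).strategy == "object"
```

A loop over such a list needs only one strategy check up front; inside the loop every element is known to be a raw int.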

We also store instance fields in an unboxed way, as long as all the instances of a class store the same type in these fields.

It’s not entirely true that we don’t have a C API. We emulate the cpython one, but it’s not super efficient. The reason is, as was pointed out various times, that the internal representation of an int and its external one differ.

Anyway, I’m not sure what cpython can learn from this. I guess that getting unboxing into its jit would be good? That you could think about storage strategies for lists? But I suppose both would be quite a way off.

Happy to answer more questions.

5 Likes

Thanks, @cfbolz! PyPy’s space & speed optimizations for ints and floats are very potent in real life. I hope CPython can learn from them :smiley: .

I understood the space advantage of “tagged pointers” for small-enough ints, but was skeptical of their value for speed. I’m not surprised that PyPy found them “not to be worth it” - slowing every operation on all not-small-int objects.

I believe the int JIT analysis you added to PyPy goes far beyond what CPython has tried so far, and would be major work to add. The first step would be to read your paper on list storage strategies, and try for that.

I’ll note that CPython’s list.sort() already pre-scans its input for type homogeneity, and specializes to its own tiny comparison functions when possible, to avoid the major overhead of going through the all-purpose PyObject_RichCompareBool().
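The pre-scan itself is simple to sketch in Python (illustrative, not CPython’s C code): one pass decides whether a cheap type-specific compare can replace the generic machinery.

```python
# One pass over the keys; exact type check, so int subclasses like bool
# don't qualify for the specialized path.
def all_ints(keys):
    return all(type(k) is int for k in keys)

assert all_ints([3, 1, 2])        # homogeneous: specialized compare path
assert not all_ints([3, 1.5, 2])  # mixed types: generic rich comparison
```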

math.sumprod() also adapts to small-int inputs, using native C int addition when possible (whether obtained from lists or any other kind of iterable).

But there’s no “grand plan” behind these - just pragmatic ad hoc special code in specific contexts.

3 Likes

Oh, the conversion is easily done, but the pesky difference between native and little-endian order remains, even for Python code passing an array or memoryview from a computation-focused tool to a storage/interchange-focused one. And it’s invisible on common machines, similar to how issues in using bytes for text are invisible to an English speaker. Sometimes we need the s390x buildbot reminding people to do the conversion.
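A small sketch of the hazard (`to_little_endian` is a hypothetical helper): on x86/ARM the byteswap is a no-op and forgetting it is invisible, but on a big-endian box like s390x it’s the difference between correct and scrambled output.

```python
import sys
from array import array

def to_little_endian(arr: array) -> bytes:
    # A little-endian interchange format wants swapped bytes on a
    # big-endian host; on a little-endian host this is a no-op.
    if sys.byteorder == "big":
        arr = array(arr.typecode, arr)  # copy, then swap in place
        arr.byteswap()
    return arr.tobytes()

data = array("I", [1, 2, 3])
payload = to_little_endian(data)
assert payload[:4] == (1).to_bytes(4, "little")
```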

1 Like

Slightly tangential to this issue, but have you seen that IBM have s390x runners available at IBM/actionspz on GitHub? So it might be possible to add it as a regular runner and not just a separate buildbot, if this would be a useful thing to track.

2 Likes

The first step would be to read your paper on list storage strategies, and try for that.

FWIW, I did read it some time ago. It’s not very feasible for CPython right now as we can’t break the C API for PyListObject/PyTupleObject (the struct is public and so are all of its fields). PyList_GET_ITEM is a macro, so we can’t break that either. Converting back and forth across the C API boundary is also tricky and expensive. Unboxed Internal list/tuple/dict/containers representation for use within CPython · Issue #729 · faster-cpython/ideas · GitHub

I believe the int JIT analysis you added to PyPy goes far beyond what CPython has tried so far, and would be major work to add.

The JIT is missing a “shape representation” or “partial evaluation” pass (there are a few names). We are roughly only a year away from that, actually (though I say this every year, so I may be wrong once again). The 3.15 JIT does do int optimizations, such as simple lifetime tracking of freshly allocated ints. Where possible, we try to reuse small ints and avoid an allocation. We added that feature barely two weeks ago: gh-146640: Optimize int operations by mutating uniquely-referenced operands in place (JIT only) by eendebakpt · Pull Request #146641 · python/cpython · GitHub. So you need 3.15a8 to try it out, not an earlier alpha. For example, the benchmark calculating the spectral norm of a matrix is still much slower than PyPy, but with the JIT on Python main it’s about 30-40% faster than with no JIT, depending on platform.

While I’m here I might as well point out: CPython has a stranger set of cards to play than PyPy. The biggest blocker to a lot of our optimizations is the requirement to respect reference count semantics. E.g. with x = y, people expect the object x previously referred to to be discarded almost immediately if that was its last reference. PyPy doesn’t use refcounting here. The reason this blocks optimizations is that destructors/finalizers in Python are allowed to run any code they want. This means x = y is side-effecting in CPython! On PyPy, it can be just a register-to-register move.
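The side effect is easy to observe (a small demonstration, with an illustrative class): in CPython the finalizer runs right at the assignment, because rebinding drops the last reference.

```python
log = []

class Noisy:
    def __del__(self):
        # Finalizers may run arbitrary code; here we just record the event.
        log.append("finalized")

x = Noisy()
y = Noisy()
x = y                         # the last reference to the first Noisy dies here
assert log == ["finalized"]   # CPython ran __del__ during the assignment
```

On a non-refcounting VM the finalizer would run at some later, unspecified point, which is exactly why the assignment can compile down to a plain move there.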

3 Likes

Thanks, @kj0, for pulling back the curtain! I realize this stuff “is hard” - hang in there. It’s paying off! :smiling_face_with_sunglasses:

If even that much. Back in my long-gone days working on traditional optimizing compilers, registers were assigned by a graph-coloring algorithm. Different variables with distinct lifetimes were routinely assigned to the same register. So x = y in the source code could, at times, be optimized out of existence.

Made life harder for symbolic debuggers, but so it goes.

The “almost instant” reclamation of non-cyclic trash is something I miss at times in PyPy, because its absence complicates my mental model. I would, however, often value a 10x speedup more :wink:

2 Likes

Would it be possible to extend JIT support to external types? Maybe by offering some API that registers additional knowledge about a type (i.e., beyond immutability and so on).

I’m asking because this looks to me like a motivation for using a bigint library like GMP in the CPython core. More optimizations are coming in CPython that aren’t possible for types in C extensions. Thus, the suggestion “just use gmpy2 if you have big integers” will come with an increasing speed penalty for operands of small size. (Earlier in this thread there were some JIT-enabled benchmarks.) But it’s another story if this is a temporary situation…

1 Like