Python 3.13.0 speed

This is a hearty “congratulations!” message. I often tackle a problem inspired by The On-Line Encyclopedia of Integer Sequences (OEIS). Here, sequence A378048: integers k such that k and k**2 together use at most 4 distinct decimal digits.

  • 816 is in the sequence, since 816**2 = 665856 and together they use only the 4 distinct digits {1, 5, 6, 8}.
  • 149 is not in the sequence, since 149**2 = 22201 and together they use the 5 distinct digits {0, 1, 2, 4, 9}.
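A direct (and far too slow for large ranges) membership test can be sketched as follows; `in_A378048` is a name introduced here for illustration, not from the program discussed below:

```python
def in_A378048(k, most=4):
    # k is a term iff the decimal digits of k and k*k together
    # use at most `most` distinct digits
    return len(set(str(k) + str(k * k))) <= most

# 816**2 = 665856: digits {8, 1, 6, 5, 6, 5, 8, 5, 6} -> 4 distinct, a term
# 149**2 = 22201:  digits {1, 4, 9, 2, 0} -> 5 distinct, not a term
```

Brute force over all 18-digit candidates with this test is hopeless, which is the point of the pruning strategy described later in the thread.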

I usually use PyPy for these, because it’s generally much faster than CPython, and because “small enough” ints use much less memory when stored in lists (such ints are apparently unboxed under PyPy).

For this puzzle, it found all solutions with no more than 18 decimal digits in about 165 seconds. (No, the first 5 programs you try for this will take very much longer :wink: - on the OEIS mailing list, one person coding in C said it took an hour to find the first 10,000 terms, and that was after they gave up waiting for a “pure brute force” program. This takes thought to make fast.) There are 114,667 of them.

Just for fun, I also ran it under the released 64-bit Windows Python 3.13.0. Big surprise! It not only wasn’t much slower than under PyPy, it was faster, done in about 145 seconds. No real idea why. It’s overwhelmingly simple int arithmetic and int->str conversions, but rather involved logic. One factor is that for bigints, CPython has special code to compute i*i (multiplying two ints that are the very same object), which builds only about half of the full “multiplication tree”.

In any case, I don’t believe that’s happened before for an OEIS program using pure Python. There are many such sequences I do use CPython for, but using “really really big ints”, where the gmpy2 extension module is a superpower pure PyPy speed can’t touch.

Would this be faster still in C? I doubt it, but don’t care enough to try. The ints here routinely overflow 64 bits, and when moving from one decade to the next it needs to sort ever-growing auxiliary helper lists (recording the suffixes that could still possibly be part of a solution), but those lists have structure of a kind CPython’s list.sort() can exploit.

So, bravo! CPython’s speed greatly impressed me here. Keep it up :smile:.

Ah, I need to clarify a fine point: CPython was actually slower until I wrapped the code in a function. Accessing locals is very much faster than accessing globals under CPython, but PyPy doesn’t much care about that.
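The locals-vs-globals effect can be seen with a tiny microbenchmark along these lines (the names and loop bounds here are illustrative, not from the program):

```python
from timeit import timeit

N = 100_000
g = 0

def global_add():
    # reads and writes the module-level `g`: each access is a dict lookup
    global g
    for _ in range(N):
        g += 1

def local_add():
    # `x` is a local: accessed by array index in the frame, no dict lookup
    x = 0
    for _ in range(N):
        x += 1
    return x

t_global = timeit(global_add, number=5)
t_local = timeit(local_add, number=5)
# On CPython, t_local is typically noticeably smaller than t_global.
```

Wrapping a whole script in a function turns all its module-level names into locals, which is why it can pay off so much under CPython.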

48 Likes

cool! in the spirit of friendly competition, would you be up for sharing your source code so I can check whether there’s anything I can do to optimize it in PyPy some more? :sweat_smile:

18 Likes

I’m also interested in seeing the source code. :slight_smile:

2 Likes

Sure! I’ve since shaved off another 10%, and it now runs about equally fast under CPython and PyPy:

[PyPy 7.3.12 with MSC v.1929 64 bit (AMD64)] on win32

About 130 seconds now under each, with a slight edge to PyPy, but the difference is less than the normal variation across distinct runs (this is not an otherwise-quiet machine - this is casual benchmarking).

The last few lines of output should be:

114667 999_999_999_999_999_999
18 6_825_558 - 1_585_308 = 5_240_250 13_124_724 36_725
129.80258320000002
129.8026536

where the last two lines will vary (total elapsed seconds).

Click for code
def drive(len=len, print=print, set=set, str=str):
    from itertools import islice
    from time import perf_counter as now
    import bisect

    def fint(n):
        return format(n, '_')

    def out(n, *args):
        nonlocal count
        count += 1
        print(count, fint(n), *args)
    count = 0
    out(0)

    start = now()
    good = [0]
    base = 1
    next_base = 10
    ndigs = 0
    zc = [10]
    MOST = 4

    while True:
        base_inc = base
        ongood = len(good)
        i = bisect.bisect_left(good, ndigs - 1,
                               key=lambda g: len(str(g*g)))
        j = bisect.bisect_left(good, ndigs, i,
                               key=lambda g: len(str(g)))
        assert i <= j
##        for k in chain(range(i), range(j, ongood)):
##            g = good[k]
##            sg = str(g)
##            sg2 = str(g*g)
##            assert len(set(sg + sg2.rjust(ndigs, "0")[-ndigs:])) <= MOST
        #assert len(set(good)) == ongood
        #assert good == sorted(good)
        for k in range(i, j):
            g = good[k]
            sg = str(g)
            sg2 = str(g*g)
            if len(set(sg + sg2.rjust(ndigs, "0")[-ndigs:])) <= MOST:
                good[i] = g
                i += 1
        del good[i : j]
        npruned = j - i
        ngood = len(good)
        ndigs += 1

        #assert len(set(zc)) == len(zc)
        zc.sort()
        for i, g in enumerate(zc):
            zc[i] = g * 10
        zcit = iter(zc).__next__
        zcnext = zcit()

        for base in range(base, next_base, base_inc):
            for g in islice(good, ngood):
                i = base + g
                last = None
                while zcnext <= i:
                    out(zcnext, "zzzzzzz")
                    last = zcnext
                    zcnext = zcit()
                if last == i:
                    continue
                si = str(i)
                #assert len(si) == ndigs
                if len(fseti := set(si)) > MOST:
                    continue
                i2 = i * i
                si2 = str(i2)
                if len(fset := fseti | set(si2[-ndigs:])) <= MOST:
                    good.append(i)
                    if len(fset | set(si2[:-ndigs])) <= MOST:
                        out(i)
                        if "0" in fseti:
                            zc.append(i)
        print(ndigs, fint(ongood), "-", fint(npruned), "=", fint(ngood),
              fint(len(good)), fint(len(zc)))
        print(now() - start)
        #input("MORE")
        base = next_base
        next_base *= 10
        if ndigs == 18:
            break
    finish = now()
    print(finish - start)
drive()

2 Likes

Possible clue: if the code is changed to break out of the loop after generating all solutions with no more than 20 digits (instead of 18), CPython again has a major speed advantage (PyPy about 560 seconds, CPython about 450). The last lines of output should be:

243610 99_999_999_999_999_999_999
20 24_766_662 - 5_729_218 = 19_037_444 48_275_516 75_993

That’s interesting because lots of the base integers (not just their squares) no longer fit in 64 bits then.

So this may be largely because arithmetic on (not very big!) bigints runs faster under CPython. It’s always something :wink:

What exactly has changed in CPython 3.13 that is relevant here?

I saw somewhere that small ints were going to be inlined in the interpreter stack and possibly also in lists. Did that happen?

2 Likes

To be clearer, I don’t know that it’s specific to 3.13. Ordinarily I would have held off on posting this, but it was so pleasantly surprising and it was so over my bedtime already that I pulled the trigger.

It’s a casual comparison I do multiple times per year, year after year (although the specific code being timed varies). This result was such an outlier that I couldn’t resist sharing the :joy:.

For all I know, it may simply be because CPython has special code for i*i, which it’s had essentially forever.

But “why” wasn’t/isn’t my real interest. My interest was in advertising that, for people like me who have been conditioned to assume “don’t bother - PyPy is always much faster”, it may pay to take another look at that. YMMV.

1 Like

In my experience it really depends. Much code is micro-optimised for CPython. It’s hard to know how things would be if everyone had rallied around PyPy and focused on performance there instead, because it is so much faster for some things (maybe 100x faster for scalar floating point arithmetic).

I just checked and the basic SymPy test suite in CI takes 27 minutes for PyPy and 18 minutes for CPython. Neither of those particular timings involves GMP or other libraries besides mpmath which is also pure Python. There is a lot of code there that people have clearly timed and tweaked specifically for CPython though. Perhaps if it was better optimised for PyPy instead it could take 5 minutes or something. Maybe it wouldn’t even be any slower on CPython.

There is a lot of effort going into making CPython faster right now and I get the impression that PyPy is winding down. The changes in CPython likely mean that the old fastest way of doing things doesn’t apply though. I think you possibly need to know what the changes are to take advantage of them, especially if you already have the old knowledge of how things used to be optimised.

4 Likes

Of course! In context, my conditioning was based on comparing programs written for OEIS tasks under both. The case at hand was massively unprecedented. The best prior case of this I saw was CPython running about 3x slower than PyPy. More typically 6x slower. These all sling integers heavily, but, depending on the specific problem, ranging from a handful of bits to multiple millions of bits. This specific problem was unusual in mostly sticking to the range of 64 to 128 bits. “Most” OEIS sequences are so sparse they quickly exceed that range. Then it’s “use gmpy2, or forget it”.

At the other end, I’m told I was infamous in PyPy-land for writing the once-popular SpamBayes spam classification software. Infamous because it was one of the few programs at the time (early in PyPy’s life) they couldn’t speed up at all to speak of. That was heavily string manipulation, with a smattering of float arithmetic to compute statistics.

More generally, what used to be true is that “optimizing” random code for CPython relied heavily on two general “tricks”:

  • Use locals rather than globals. That’s still true.
  • Minimize trips around the eval loop. Pretty much “period”. Decoding PVM opcodes and dispatching are pure overheads in the sense of accomplishing useful work. So calling something coded in C was almost always a win if it saved even one trip around the eval loop.

But that last one increasingly backfired under PyPy. Heavy use of itertools is the clearest example. Those coded in C save a great many trips around the eval loop. But they also hide from PyPy what the code is “really doing”, crushing attempts at more global analysis.

CPython did almost no cross-opcode analysis or optimization. The dubious (mixed blessing) exception was dynamically peeking ahead in the opcodes to see whether:

x += y

involved strings and x had a refcount of 1. Then it was possible to try to extend x in-place instead of building a brand new object. That alone could change a loop from quadratic-time to (amortized) linear-time.
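The classic pattern that optimization rescues looks like the first function below (a sketch; `build_with_concat` and `build_with_join` are illustrative names):

```python
def build_with_concat(chunks):
    # With CPython's in-place trick (s kept at refcount 1), this loop is
    # amortized linear; without it, each += would copy all of s, making
    # the whole loop quadratic in the total output length.
    s = ""
    for c in chunks:
        s += c
    return s

def build_with_join(chunks):
    # "".join() is always linear and doesn't depend on refcount details,
    # which is why it's the portable idiom (and safe under PyPy too).
    return "".join(chunks)
```

Both produce the same string; only the asymptotics under implementations without the refcount peephole differ.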

That part is changing now, as CPython gets smarter about the meaning of the code it’s trying to optimize.

Then again, PyPy could also get smarter about what the various itertools functions do, to recover the semantic relationships “hidden” by calling them. CPython would also eventually benefit from that.

For the duration, @cfbolz’s “friendly competition” will benefit everyone.

1 Like

nice, thanks! Found a small islice inefficiency with it already. Apart from that, on PyPy it’s completely dominated by the cost of doing set(str(int)), basically. Not sure it would be useful in general to optimize that :woman_shrugging:. If you want to do PyPy specific optimizations you can do tricks like implementing your own bitfield based set (which is efficient because integers are mostly unboxed, even if stored in instance fields). I get a 2-3x speedup on PyPy writing this kind of code: Tim Peters' OEIS A378048 program · GitHub
Unfortunately it really destroys the performance on CPython.

6 Likes

Probably not.

This is the kind of CPython-specific micro-optimisation I was thinking of. If working in C it would seem completely absurd to convert an int into a heap allocated string and then convert each character of that string into a new heap allocated string just to collect the unique occurrences into a hash table of length no more than 10 and then discard everything after checking the length.

Tim mentioned itertools as an example of something that makes sense in CPython but not so much in PyPy, but I think hash tables are often another. If working in C, the overhead of a hash table compared to an array would seem very significant, but in CPython, if it saves having a for loop anywhere, it seems reasonable to use set or dict.

I think that the better path here is just to use integers and make sure that PyPy is well optimised for those. I would have thought that the way to do this with PyPy was something along the lines of:

def count_digits(num):
    digits_mask = 0
    while num:
        num, d = divmod(num, 10)
        digits_mask |= 1 << d
    return digits_mask.bit_count()

Hopefully CPython can get faster for small integer operations as well.

2 Likes

yep, that’s basically what I did in my gist.

Thank you - good to know! Not useful in general, no, although … str(int) on its own is worth making fast.

I independently tried something like that, but more “ambitious”:

  1. Cached the bitset in the good array too, so it’s only computed once across iterations (this is delicate, though, because base + g can create 0s “out of thin air”, like 1000 + 12 = 1012, and “0” is not in the bitset for “12”, or in the loop-invariant bitset for the leading “1”).

  2. Precomputed a list such that pc[i] gives the popcount of i (number of bits set in i), for i in range(1024) (there are only 10 bits).

Both are similar in my code. After these changes (this is for generating all solutions with no more than 20 digits - up from the 18 in the original post):

CPython 919 seconds
PyPy    236 seconds

That’s much more like I’m used to seeing. Note: I’m building the bitset with functools.reduce(), but it may go faster under PyPy if I used your explicit loop method.

Caution for the peanut gallery: in CPython too, it’s usually faster to use a small bitset than an instance of set(), when possible, and if you’re saving the sets away somewhere, bitsets can save an enormous amount of memory. The tradeoffs get muddier in general if you also need to iterate over set elements (that’s strained for a bitset).

It’s not quite so bad as that in CPython:

>>> "abc"[1] is "b"
<python-input-1>:1: SyntaxWarning: "is" with 'str' literal. Did you mean "=="?
  "abc"[1] is "b"
True

That is, single characters are generally singletons, and so are the ints in range(11). Only the memory for the set structure is freshly allocated.

Have to say the most baffling thing to me now is why there’s such dramatic slowdown in CPython when switching from sets of characters to 10-bit integer bitsets. I expected that to make it faster, not much slower. Here’s the code I have now to convert a digit string to a bitset of the contained digits. bit[] maps the ord() of a digit character to its bitmask; e.g., bit[ord('0')] == 0b1; bit[ord('1')] == 0b10 and so on.

def s2set(s, map=map, ord=ord, bit=bit,
          reduce=functools.reduce, or_=operator.or_):
    return reduce(or_, map(bit.__getitem__, map(ord, s)), 0)

Essentially all at “C speed”, as was also set(digitstring). Ya, it slings a lot of tiny ints, but they’re all cached as singletons in CPython - there’s no actual memory alloc/dealloc in play, and no “computation” left at all to speak of.

CPython 643 seconds
PyPy    202 seconds
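The `bit[]` table itself isn’t shown in the snippet above; one plausible construction, matching the description that `bit[ord('0')] == 0b1`, `bit[ord('1')] == 0b10` and so on (the original code may build it differently), is:

```python
import functools
import operator

# bit[ord(c)] maps a digit character's ord() to its one-bit mask,
# so indexing by ord() replaces a dict lookup on the character itself
bit = [0] * 128
for ch in "0123456789":
    bit[ord(ch)] = 1 << int(ch)

def s2set(s, map=map, ord=ord, bit=bit,
          reduce=functools.reduce, or_=operator.or_):
    # bitwise-or of the per-character masks; 0 is the identity for |
    return reduce(or_, map(bit.__getitem__, map(ord, s)), 0)
```

Everything inside the reduce() runs in C on CPython, with no trips around the eval loop per character.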

To be clearer, given that it’s “overwhelmingly just small ints” now, it’s not surprising that PyPy’s unboxed ints run faster. The mystery is why CPython on its own is so much slower slinging small ints than small sets of characters. Sets are a much heavier kind of object, and a union of small sets (the most common operation) is certainly much slower than bitwise-or of two small ints.

Even checking for “0” is obviously faster, just

    if gset & 1:

instead of

    if "0" in gset:

It’s always something :wink:

1 Like

Here’s a clue: if I add this as the first line of s2set(), the time to generate all the <= 20-digit solutions under CPython drops from about 643 seconds to about 506. That’s huge.

    s = set(s)

“Even at C speed” operation counts matter. Most digits are duplicates. Each duplicate weeded out saves computing an ord(), a lookup into bit[], and an | into the running union. Weeding out a duplicate in a set just requires finding a match, which hash tables are very fast at when a match exists.

For the 20-digit run, about 4.4 billion digits come into that function, but only about 1.3 billion come out, so about 70% of the input digits are discarded by the set() operation. That ratio gets higher the longer the solutions we’re looking for.

Doesn’t account for all of it, but does account for a lot of it. The “| into the running union” part goes much faster in PyPy because its small ints are unboxed (and, I assume, may even mutate in-place), and building a set first hurts PyPy performance.

Yup, I’ll say it again: it’s always something :wink:

1 Like

I think you could easily improve this snippet by processing 3 digits at a time using a 1k-entry lookup table? Something like:

def compute_digits_mask(num):
    digits_mask = 0
    while num:
        num, d = divmod(num, 10)
        digits_mask |= 1 << d
    return digits_mask

digits_masks_1k = [compute_digits_mask(i) for i in range(1000)]

def count_digits(num):
    digits_mask = 0
    while num:
        num, d = divmod(num, 1000)
        digits_mask |= digits_masks_1k[d]
    return digits_mask.bit_count()

You can perhaps even vectorize using NumPy.

2 Likes

@pitrou, converting multiple digits at a time is common in int->str conversion implementations, but it needs care to handle zeroes correctly. For example, your count_digits() is oblivious to the 0s in 1001: num and d are both 1. In the loop you need to add something like:

        if num and d < 100:
            digits_mask |= 1

Even then count_digits(0) will return 0 instead of (the correct) 1.

It’s not promising in this context anyway: a string of digits can be pulled “out of the middle” of a longer string, and leading zeroes are always significant then. Pulling “out of the middle” can, of course, also be done via arithmetic with powers of 10, but they’re no longer tiny powers (like 1000) then.

In code and concept, it’s much easier to work with decimal strings here.

Speaking of which, yup, I tried using the decimal module instead, where int->str is linear-time. It was slower overall. For this range of inputs, it’s not the int->str time that dominates (and PyPy doesn’t have the C version of decimal anyway).

2 Likes

Well, I didn’t claim to understand the problem you’re trying to solve! I was merely suggesting how to speed up @oscarbenjamin 's code. :smiley:

2 Likes