Single-thread integer performance of CPython on different CPUs

Hello,

I’m very surprised by some benchmarks I did recently. Hopefully someone here has an idea of what is going on. To get my main point across quickly, here are the results of running a quick informal benchmark on two different systems: a laptop from 2011 and a modern, beefy desktop.

% lscpu | grep 'Model name' 
Model name:                           Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
% python3 -m timeit -s 'def fib(n): return fib(n-2) + fib(n-1) if n > 1 else 1' 'fib(23)'
50 loops, best of 5: 6.96 msec per loop
% lscpu | grep 'Model name'                                                              
Model name:                           12th Gen Intel(R) Core(TM) i7-12700K
% python3 -m timeit -s 'def fib(n): return fib(n-2) + fib(n-1) if n > 1 else 1' 'fib(23)'
50 loops, best of 5: 5.13 msec per loop

Both machines are running Debian stable, and the system Python (Python 3.11) is used. I have also compiled 3.12.4 from source (using ./configure --enable-optimizations), and it gives the same results.

Now the single-thread speed difference between the two CPUs (i5-2520M, 12700K) should amount to a factor of about 3. This is confirmed by a non-Python benchmark: the command sysbench cpu run produces “CPU speed” results of, respectively, 883 and 2452.

Perhaps the Fibonacci “benchmark” is flawed? Executing the script listed below (which is the implementation of a real-world tensor network contraction algorithm that is mostly juggling dicts and sets of tuples of ints) outputs (respectively):

cwg@drac:/tmp% python3 contract.py 
...
time: 8.352014541625977
cwg@smok:/tmp% python3 contract.py 
...
time: 7.0084943771362305

Again, the results are the same with Debian’s system Python and with CPython 3.12.4 compiled from source.

But perhaps CPython integer code simply does not profit much from modern CPUs (because of its memory access patterns, say)? I asked a colleague to execute the listed script on his laptop with i7-1265U CPU and he gets

time: 2.9437015056610107

both with Ubuntu’s system Python and with the one from micromamba. So much better integer performance is clearly possible. (I also measured a running time of 6.40 s on a server with an EPYC 7452 processor and Python 3.9.)

Note that the CPU of my colleague’s laptop belongs to the same generation from the same manufacturer (Alder Lake) as the desktop. So one could expect both to perform similarly, but the desktop should be slightly faster.

Any ideas?

# Implementations of algorithms from
# "Faster identification of optimal contraction sequences for tensor networks"
# by Robert N. C. Pfeifer, Jutho Haegeman, Frank Verstraete
# https://arxiv.org/abs/1304.6112
#
# Written in 2024 by Christoph Groth <christoph.groth@cea.fr>
#
# This is work in progress.

from math import prod, inf
from itertools import combinations
import collections
import collections.abc
import time

# The tensor network is made up by interconnected *simple* tensors.
# Any number of simple tensors can be contracted into a *composite* tensor.
# We call that number the *level* of the composite tensor.


class Network(collections.abc.Sequence):
    def __init__(self, tensors, dims):
        self.tensors = tuple(tuple(t) for t in tensors)
        self.dims = tuple(dims)

    def __len__(self):
        return len(self.tensors)

    def __getitem__(self, index):
        return self.tensors[index]

    def __iter__(self):
        return iter(self.tensors)

    def cost(self, inds):
        return prod(self.dims[ind] for ind in inds)

    @classmethod
    def make_rect(cls, width, height, val):
        """Create the network of a width x height rectangular grid graph.

        Each grid node becomes a simple tensor and each edge becomes an
        index of dimension val shared by the two adjacent tensors.
        """
        assert width >= 1
        assert height >= 1

        tensors = collections.defaultdict(list)
        dims = []

        # Horizontal edges
        for y in range(height):
            for x in range(width - 1):
                eid = len(dims)
                dims.append(val)
                tensors[x, y].append(eid)
                tensors[x + 1, y].append(eid)

        # Vertical edges
        for y in range(height - 1):
            for x in range(width):
                eid = len(dims)
                dims.append(val)
                tensors[x, y].append(eid)
                tensors[x, y + 1].append(eid)

        return cls(tensors.values(), dims)


# The cache is used to remember the best (so far) way to contract a given
# composite tensor.  The key is a tuple of sorted node ids that uniquely
# identifies the composite tensor.  The value is a 3-tuple of the following
# values:
#
# - construction cost
# - set of indices
# - construction sequence: If the sequence has length one it is represented
#   by the index of the corresponding tensor.  Otherwise it's a 2-tuple of
#   tensor ids.
def _init_cache(network):
    """Create cache entries for the simple tensors."""
    return {(t,): (0, set(inds), t) for t, inds in enumerate(network)}


def _extract_seq(cache, ten):
    def seq(ten):
        _, _, s = cache[ten]
        if not isinstance(s, tuple):
            return s
        a, b = s
        # TODO: b does not need to be stored, it can be extracted:
        assert b == tuple(sorted(set(ten) - set(a)))
        return seq(a), seq(b)

    return seq(ten)


def _any_common(seq0, seq1):
    i0, i1 = iter(seq0), iter(seq1)
    e0, e1 = next(i0), next(i1)

    try:
        while True:
            if e0 == e1:
                return True
            elif e0 < e1:
                e0 = next(i0)
            else:
                e1 = next(i1)
    except StopIteration:
        return False


def seq_bf_cap(network):
    """Exhaustive breadth-first search with cost capping
    """
    n = len(network)
    simple = tuple(range(n))

    caches = [None]
    caches.append(_init_cache(network))
    for i in range(n - 1):
        caches.append(dict())

    min_dims = min(network.dims)
    cost_cap = 1                # Current cost cap
    cost_old = 0                # Previous cost cap

    while not caches[n]:
        # Tensors that have been updated in this round.
        new = set()

        # Lowest encountered cost that lies above the cost cap
        cost_next = inf

        # Loop over level of tensor to be constructed.
        for lev in range(2, n + 1):
            # Loop over level of source tensor a.  The level of source tensor
            # b follows.
            for lev_a in range(1, lev // 2 + 1):
                lev_b = lev - lev_a
                for ten_a in caches[lev_a]:
                    for ten_b in caches[lev_b]:
                        # Avoid generating (c, d), (a, b) together with
                        # (a, b), (c, d).
                        if lev_a == lev_b and ten_a[0] > ten_b[0]:
                            continue

                        if _any_common(ten_a, ten_b):
                            continue

                        cost_a, inds_a, _ = caches[lev_a][ten_a]
                        cost_b, inds_b, _ = caches[lev_b][ten_b]

                        # Ignore outer products (not part of the original algorithm).
                        if not (inds_a & inds_b):
                            continue

                        cost = cost_a + cost_b + network.cost(inds_a | inds_b)

                        if cost > cost_cap:
                            if cost < cost_next:
                                cost_next = cost
                        elif cost_old < cost or ten_a in new or ten_b in new:
                            ten = list(ten_a)
                            ten.extend(ten_b)
                            ten.sort()
                            ten = tuple(ten)

                            # Query best known way to contract current tensor.
                            known = caches[lev].get(ten)

                            if known is None:
                                inds = inds_a ^ inds_b
                            else:
                                known_cost, inds, _ = known
                                assert inds == inds_a ^ inds_b
                                if cost >= known_cost:
                                    continue
                            caches[lev][ten] = cost, inds, (ten_a, ten_b)
                            new.add(ten)

        cost_old = cost_cap
        cost_cap = max(cost_next, min_dims * cost_cap)

    cost = caches[n][simple][0]
    cache = {}
    for c in caches[1:]:
        cache.update(c)
    return cost, _extract_seq(cache, simple)


def main():
    nw = Network.make_rect(5, 5, 2)
    print(nw.tensors)
    print(nw.dims)
    print()

    t = time.time()
    print(seq_bf_cap(nw))
    print("time:", time.time() - t)


if __name__ == "__main__":
    main()

I think the benchmarks are telling you that the code was never CPU bound in the first place.

My two pence is that recursive Fibonacci is a pox on computer science. It’s an awful way of calculating the Fibonacci numbers. Just store the numbers in an array: there’s zero need to hold a whole frame and sub-scope just to calculate an integer and remember the two previous ones. The iteration can even be vectorised!

Although it’s a great demonstration of the benefit of automatic tail-recursion optimisation, overall it gets way more use than it deserves.

Could you try with another implementation?

python3 -m timeit -s '
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
' 'fib(100000)'
Script
import timeit

nth = 100000

setup_code = """
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""

test_code = f"fib({nth})"
number = 5
execution_time = timeit.timeit(setup=setup_code, stmt=test_code, number=number)
print(execution_time / number)


def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Extracting the last 6 digits using modulus operation
last_four_digits = fib(nth) % 1000000

print(
    f"The last 6 digits of the {nth}th Fibonacci number are: {last_six_digits}")

"""
python3 -m timeit -s '
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
' 'fib(100000)'
"""

IMO it’s not even a great demonstration of that, since it’s SO easy to transform into non-recursive. It’s almost a good demo of the value of memoization, but not really, since a single memoized value (the one most recent) is sufficient.

What it is a good demonstration of, though, is the way that the right algorithm makes a huge difference. Or if you prefer: It’s a demonstration of a naive algorithm, which you can then drastically improve.
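For completeness, memoizing the naive version is a one-decorator change; a quick illustration, not from the thread, using functools.lru_cache with the same definition as the benchmark:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Same one-liner as in the timeit benchmark, but each value is now
    # computed only once, so the exponential blow-up disappears.
    return fib(n - 2) + fib(n - 1) if n > 1 else 1

print(fib(23))
```

Of course, as noted above, a single remembered pair already suffices for Fibonacci, which is exactly what the iterative version does.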


Please don’t focus on the Fibonacci example. I also posted a real-world, reasonably implemented algorithm that shows the same performance characteristics.

It’s true that recursive Fibonacci is ridiculous as a general benchmark, but in the case of CPython, with its high function-call overhead, it’s not meaningless: it’s a benchmark of simple Python function calls in fast repetition. (Python function calls, as you know, are quite involved operations.) Also, since it doesn’t use much memory, it should actually be CPU bound. In any case, how do you explain the 2.5-times speed difference between essentially identical CPUs (the two Alder Lakes), because:

Meanwhile I also have the result of running the Fibonacci toy example on the colleague’s laptop (i7-1265U):

python3 -m timeit -s 'def fib(n): return fib(n-2) + fib(n-1) if n > 1 else 1' 'fib(23)'
100 loops, best of 5: 2.18 msec per loop
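A quick way to see how much of this is raw call overhead is to time a loop with and without a trivial function call per iteration; a rough sketch, not from the thread (the noop helper is made up for illustration):

```python
import timeit

def noop():
    pass

n = 1_000_000
# Time the bare loop, then the same loop with one trivial call per
# iteration; the difference approximates CPython's per-call overhead.
loop = timeit.timeit("for _ in range(n): pass", globals=globals(), number=1)
calls = timeit.timeit("for _ in range(n): noop()", globals=globals(), number=1)
print(f"approx. {(calls - loop) / n * 1e9:.0f} ns per call")
```

Comparing that per-call figure across the three machines would isolate the call machinery from everything else the Fibonacci benchmark does.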

I attach a simple but reasonably well implemented integer-only algorithm in C. Compiled with the same gcc that I used to compile CPython it runs in 5.48 s on the laptop and in 1.96 s on the desktop.

/* Compile with: gcc -Wall -Wextra -O3 prime.c -o prime */
#include <stdio.h>

/* Compute nth (0-based counting) prime. */
size_t prime(size_t n) {
    switch (n) {
    case 0: return 2;
    case 1: return 3;
    }
    size_t primes[n];
    primes[0] = 3;
    size_t *top = &primes[1];
    size_t *end = &primes[n];
    size_t *divs_end = &primes[1];
    size_t greatest_div_sq = 9;
    size_t cand = 5;
    while (top < end) {
        size_t *div = primes;
        while (div < divs_end) {
            if (cand % *div++ == 0) goto not_a_prime;
        }
        *top++ = cand;
    not_a_prime:
        cand += 2;
        if (cand >= greatest_div_sq) {
            greatest_div_sq = *divs_end * *divs_end;
            ++divs_end;
        }
    }
    return primes[n-1];
}

int main()
{
    size_t n = 0;
    for (int i = 0; i < 20000; ++i) n = prime(1000);
    printf("%zu\n", n);
}

(The script you posted had a small typo: had to replace last_four_digits by last_six_digits.)

On my old and slow laptop (i5-2520M) your program prints 0.12551639579978655. On the recent desktop (i7-12700K) it prints 0.20465546080376953, so the code actually runs significantly slower on the faster machine.

Meanwhile, as checked above, the prime number computation executes (as expected) 2.8 times faster on the desktop. So it’s not that the new machine is badly configured.

Bear in mind that this is NOT equivalent. C’s integers are fixed size, Python’s are bigints. It’s hard to compare them usefully.
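A quick illustration (mine, not from the thread) of why fib(100000) is really a bigint workload:

```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# The result has roughly 20,900 decimal digits, so the additions near the
# end of the loop operate on numbers far larger than any fixed-size C
# integer; the benchmark time is dominated by bigint arithmetic.
print(len(str(fib(100_000))))
```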

It could be utilizing an efficient core.

The script posted by @elis.byberi is indeed a benchmark of CPython’s bigint implementation. But I believe that the naive recursive Fibonacci oneliner (as bad as it is for actually calculating Fibonacci numbers), is actually a decent benchmark of the basic operation of the CPython interpreter, and that’s what I wanted to show.

I posted the C prime number example as another proof that my desktop is indeed as fast as one would expect for some CPU bound tasks. Note that CPython has been compiled with the same gcc.


The naive Fibonacci function simply benchmarks stack frame creation in CPython.

Good point, but the i7z tool confirms that one of the performance cores is used. (It would be quite strange if the kernel scheduled a 100% CPU task onto an efficiency core on an otherwise idle machine.)
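To rule out E-core scheduling entirely, the benchmark process can be pinned to a specific core before timing; a minimal sketch, assuming Linux (on Alder Lake, the P-core id range can be read from /sys/devices/cpu_core/cpus, and a known P-core id would be substituted below):

```python
import os
import timeit

def fib(n):
    return fib(n - 2) + fib(n - 1) if n > 1 else 1

# Pin this process (pid 0 = self) to a single CPU.  Here we just take the
# lowest-numbered CPU we are allowed to run on; on Alder Lake you would
# substitute a known P-core id instead.
target = min(os.sched_getaffinity(0))
os.sched_setaffinity(0, {target})

best = min(timeit.repeat("fib(23)", globals=globals(), number=10, repeat=3))
print(f"{best / 10 * 1e3:.2f} msec per loop on CPU {target}")
```

If the pinned-to-P-core timing matches the unpinned one, scheduling onto an efficiency core can be excluded as the cause.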

Right. So let’s focus on these. I would have expected that an Alder Lake CPU will be much faster at this task compared to a CPU from 2011. And this is indeed the case if one compares my colleague’s laptop to my old laptop.

But what could be the problem with my Alder Lake desktop then? It runs stock Debian and is otherwise very snappy and (as shown and as expected) executes at least some programs 3 times faster.

What could be the reason for the mediocre performance? I can try resetting the BIOS to defaults and booting an Ubuntu live system, but I find it hard to imagine that would change anything, because I didn’t do anything crazy to my system.

This is a benchmark that’s measuring something fairly specific. It doesn’t mean that the newer machine is generally 3x faster, just on certain kinds of benchmarks.

I wouldn’t say it’s flawed, but it’s likely not measuring the same things as your sysbench benchmark above. Perhaps the Python Fibonacci benchmark is limited by CPU cache or system RAM speed (due to all the stack frames being created); that has probably not gotten much quicker on the new machine compared to the old. It could also be that the Fibonacci benchmark creates execution paths that challenge the branch prediction units of the CPUs. If branch prediction isn’t working well, you can’t utilize the full integer processing ability of the CPU.

Benchmarking is really hard to do, especially when dealing with modern CPUs.