Trick: Faster functions (by using generators)

This is the normal way to write a function that takes one argument (and in this example returns its negated value):

def normal(x):
    return -x

print(normal(3))  # prints -3

And here’s a way using a generator:

def send():
    x = yield
    while True:
        x = yield -x
send = send()
next(send)
send = send.send

print(send(3))  # prints -3
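The priming steps (create the generator, advance it to the first bare `yield`, grab its bound `.send`) can be wrapped in a small helper. This is just a sketch; `as_send` and `negate` are made-up names, not part of the trick above:

```python
def as_send(genfunc):
    """Prime a generator written in this style and return its bound .send."""
    gen = genfunc()
    next(gen)  # run up to the first bare `yield`
    return gen.send

def negate():
    x = yield
    while True:
        x = yield -x

neg = as_send(negate)
print(neg(3))  # prints -3
```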

And the generator way can be faster:

 48.1 ± 0.2 ns  send
 67.7 ± 0.7 ns  normal

Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]
benchmark script

Attempt This Online!

def normal(x):
    return -x

print(normal(3))  # prints -3

def send():
    x = yield
    while True:
        x = yield -x
send = send()
next(send)
send = send.send

print(send(3))  # prints -3

funcs = [normal, send]

from timeit import timeit
from statistics import mean, stdev
from itertools import repeat
from collections import deque
import sys
import random

n = 10**5
times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e9 for t in sorted(times[f])[:5]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} ns '
for _ in range(100):
    random.shuffle(funcs)
    for f in funcs:
        t = timeit(lambda: deque(map(f, repeat(3, n)), 0), number=1) / n
        times[f].append(t)
for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__)

print('\nPython:', sys.version)
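(For reference: `deque(iterable, 0)` in the benchmark is the usual trick for draining an iterator at C speed without storing any results, so the measured time is dominated by the per-call overhead. A tiny sketch of the idiom:)

```python
from collections import deque

it = map(abs, [-1, -2, -3])
deque(it, maxlen=0)           # consume everything at C speed, store nothing
print(next(it, 'exhausted'))  # prints exhausted
```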

I’m guessing this should be send(3)?


The code calls normal(3) twice. It doesn’t print anything from send(). I expect the second instance of

print(normal(3))  # prints -3

should be

print(send(3))  # prints -3

instead.

Ignoring that, yes, it’s excruciating :wink:. I expect you’re seeing the overhead of creating & tearing down “an execution frame” for each invocation of normal(). send() incurs that expense only once at the start, and resumptions all execute in the same execution frame.
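One way to see that frame reuse directly is to check the generator's `gi_frame` across resumptions (a small sketch):

```python
def send():
    x = yield
    while True:
        x = yield -x

g = send()
next(g)                    # advance to the first yield
frame_before = g.gi_frame  # frame object of the suspended generator
g.send(3)
frame_after = g.gi_frame
print(frame_before is frame_after)  # prints True -- resumptions reuse one frame
```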


Oops. Fixed, thanks.

Yes, I think so, too. I once saw Guido say that Python function calls are expensive because of stack frames, so I tried avoiding that this way. Note that my benchmark has map call the function. If you have Python code calling it, then I think that, since the 3.11 optimizations, the normal way is faster. When I first tried this a while back, I think the generator way was faster in both cases.
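A minimal sketch of comparing the two call styles: map driving the calls from C versus a Python-level comprehension driving them. The absolute numbers vary a lot by version and machine, so no expected ordering is shown here:

```python
from timeit import timeit

def normal(x):
    return -x

def _gen():
    x = yield
    while True:
        x = yield -x

g = _gen()
next(g)
send = g.send

n = 10**5
# map drives the calls from C code:
t_map_normal = timeit(lambda: list(map(normal, range(n))), number=1)
t_map_send   = timeit(lambda: list(map(send, range(n))), number=1)
# a Python-level comprehension drives the calls:
t_loop_normal = timeit(lambda: [normal(i) for i in range(n)], number=1)
t_loop_send   = timeit(lambda: [send(i) for i in range(n)], number=1)
print(t_map_normal, t_map_send, t_loop_normal, t_loop_send)
```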


Alas, my mental model hasn’t caught up to the current implementation so I really don’t know anymore exactly what happens when, where, or why. At this point, your guess is at least as good as mine :smile:.

I get:

$ python t.py 
-3
-3
 57.7 ± 0.0 ns  normal
 65.4 ± 3.2 ns  send

Python: 3.13.1 (main, Jan  5 2025, 05:33:47) [Clang 19.1.6 ]

Tried a system with Clang as well now:

 27.3 ± 0.1 ns  send
 29.7 ± 0.0 ns  normal

Python: 3.13.4 (tags/v3.13.4:8a526ec7cbe, Jun 15 2025, 10:56:11) [Clang 18.1.8 (Fedora 18.1.8-2.fc40)]

Anyway, yes, this is something I can easily see vary a lot between different Python versions, compilers, compiler options, and machines. Hence the “can be faster” instead of “is faster”.

Mostly I see it as an interesting curiosity. I’ve known it for a few years but almost never used it. The only cases I remember are this, where I also found it neat and scalable to have my predicate function’s state represented by the “instruction pointer” (simply writing the different conditions one after the other), and this, where I rather did it to point out that there’s precedent for iterators having non-dunder methods (since I suggested adding one for itertools.chain, and it struck even me as unusual until I remembered that generator iterators have some).
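A toy version of that “state in the instruction pointer” idea (`warmup_pred` is a made-up example, not the linked code): the predicate rejects the first three values and accepts the rest, with no explicit state variables — the state is simply where the generator is suspended:

```python
def warmup_pred():
    x = yield
    for _ in range(3):  # phase 1: suspended here for the first three calls
        x = yield False
    while True:         # phase 2: suspended here for every later call
        x = yield True

g = warmup_pred()
next(g)
pred = g.send
print([pred(v) for v in 'abcde'])  # prints [False, False, False, True, True]
```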


@oscarbenjamin Hmm, I just remembered I had tried three other systems before. On all five systems, the generator way was faster. What’s your OS/CPU? Did you compile Python yourself? If so, how?

That was using CPython 3.13 installed by uv on x86-64 Linux. I didn’t test on anything else.

Ok, thanks. I don’t know much about uv other than that it’s about speed. Your times seemed relatively slow to me for what I assume is a personal computer, so I thought maybe your Python wasn’t compiled with optimizations and that could affect this. But I assume that’s not the case with uv.

It feels like you are insulting my poor little computer :). That machine is maybe 10 years old and was picked for its 50W power consumption at the time. It isn’t fast but most things I do don’t need a powerful computer and I have access to compute clusters if I do want them. That little computer did me just fine until we started using MS Teams at work (which needs at least 20 cores and 1GB RAM just to provide text messaging even if you aren’t in a video meeting).

I believe that the uv-installed CPythons are built with optimisations enabled. They come prebuilt from python-build-standalone.

Here are timings instead on MacOS with M3 Pro armv8a CPU and also uv installed CPython 3.13:

$ python t.py
-3
-3
 13.9 ± 0.0 ns  send
 14.6 ± 0.0 ns  normal

Python: 3.13.0 (main, Oct 16 2024, 08:05:40) [Clang 18.1.8 ]

I get a little variation from run to run but send measures as maybe 5% faster on average.


Not trying to insult :slight_smile: And I even only have a phone.

The uv docs linked there say:

The Python distributions are built in a manner to minimize run-time dependencies. This includes limiting the CPU instructions

So it looks like Python execution speed is actually not a top priority. No idea what to make of that, though.


If you distribute binaries then you have to limit the CPU instructions so that your binaries are compatible with all the machines where they get installed. This is especially true of x86, where the instruction set has been expanded continuously over decades. A level of specificity that just says “x86-64” means supporting every 64-bit CPU since the first AMD64 ones 20+ years ago, so all the newer instructions (SIMD etc.) are skipped.

It looks like NumPy is going to raise the bar to x86-64-v2 but currently wheel metadata has no way to distinguish that so those wheels will just crash on older hardware. My older Linux box will cope but not when they go to v3 (although I would still be able to build from source).

The situation is simpler when distributing ARM binaries for MacOS where CPUs more than 5 years old just don’t exist.

Before using uv I used to use pyenv to install Python which does build from source but does so in a way that attempts to minimise build time. I think that means no LTO, PGO etc but since you build on your machine I assume it uses the whole native instruction set. With python-build-standalone the binaries need to be portable but since it is build-once-and-deploy-many-times it is well worth running the extra build steps to get an optimised build within the instruction set constraints.


Before using uv I used to use pyenv to install Python which does build from source but does so in a way that attempts to minimise build time. I think that means no LTO, PGO etc

Correct, you need to set an environment variable in order to build with maximum performance; pyenv goes over that here

As for uv: a few months ago I was concerned about using uv-managed Python precisely because of performance. From my research:

python-build-standalone compiles Python with many different options and provides all of them: Running Distributions — python-build-standalone documentation

uv then parses that and picks the optimal candidate for the version you’re requesting, preferring the smallest distribution with the most optimizations available for the given system


An extra nice use case: Packed arguments.

Let’s say you have this:

xs = [2, 3, 4, 6, 8, 9]

And you want to group into runs of consecutive numbers:

[[2, 3, 4], [6], [8, 9]]

A common solution is to group by difference to index:

groupby(enumerate(xs), lambda ix: ix[1] - ix[0])
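Spelled out as a runnable snippet (just expanding the one-liner above):

```python
from itertools import groupby

xs = [2, 3, 4, 6, 8, 9]
# consecutive numbers share the same value of (x - index)
runs = [[x for _, x in group]
        for _, group in groupby(enumerate(xs), lambda ix: ix[1] - ix[0])]
print(runs)  # prints [[2, 3, 4], [6], [8, 9]]
```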

That was so much nicer in Python 2, where we could do this:

groupby(enumerate(xs), lambda (i, x): x - i)

And with the generator way, we can bring that back:

def send2():
    i, x = yield
    while True:
        i, x = yield x - i
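Primed and used the same way as before — a sketch of how send2 plugs into groupby:

```python
from itertools import groupby

def send2():
    i, x = yield
    while True:
        i, x = yield x - i

g = send2()
next(g)
key = g.send  # each (index, value) tuple sent in gets unpacked by the generator

xs = [2, 3, 4, 6, 8, 9]
runs = [[x for _, x in grp] for _, grp in groupby(enumerate(xs), key)]
print(runs)  # prints [[2, 3, 4], [6], [8, 9]]
```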

And it saves me some more nanoseconds per call :smiley:

116.5 ± 0.4 ns  send2
122.8 ± 0.6 ns  send1
146.7 ± 0.6 ns  normal

Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]

The code:

normal = lambda ix: ix[1] - ix[0]

def send1():
    ix = yield
    while True:
        ix = yield ix[1] - ix[0]
send1 = send1()
next(send1)
send1 = send1.send

def send2():
    i, x = yield
    while True:
        i, x = yield x - i
send2 = send2()
next(send2)
send2 = send2.send

funcs = [normal, send1, send2]

from timeit import timeit
from statistics import mean, stdev
from itertools import repeat, groupby
from collections import deque
import sys
import random

for f in funcs:
    print([[x for _, x in g] for k, g in groupby(enumerate([2, 3, 4, 6, 8, 9]), f)])

n = 10**5
xs = range(n)
times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e9 for t in sorted(times[f])[:5]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} ns '
for _ in range(100):
    random.shuffle(funcs)
    for f in funcs:
        t = timeit(lambda: deque(groupby(enumerate(xs), f), 0), number=1) / n
        times[f].append(t)
for f in sorted(funcs, key=stats):
    print(stats(f), {normal: 'normal', send1: 'send1', send2: 'send2'}[f])

print('\nPython:', sys.version)

Attempt This Online!