Microoptimizing flattening a linked list with a walrus (in a way that keeps mypy happy)

I’m doing some micro-optimization for Textual (mainly so it plays nicer with mypy, but it needs to microbenchmark as well as or better than before)

and I noticed using:

while True:
    if x is None:
        break
    ...

is better than

while x is not None:
    ...
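For concreteness, here is a minimal sketch of the kind of flattening helper I mean (the `Node` class and `_parent` attribute here are stand-ins of my own, not Textual’s real types); the walrus variant from the title is the second function:

```python
class Node:
    """Stand-in for a linked-list node with a parent pointer."""
    def __init__(self, parent=None):
        self._parent = parent

def ancestors_loop(node):
    # The plain while-condition form.
    nodes = []
    while node is not None:
        nodes.append(node)
        node = node._parent
    return nodes

def ancestors_walrus(node):
    # Same walk, but the reassignment is folded into the test
    # with the walrus operator, which mypy narrows correctly.
    if node is None:
        return []
    nodes = [node]
    while (node := node._parent) is not None:
        nodes.append(node)
    return nodes
```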

am I doing the benchmark correctly?

I posted this on the Python Discord, and @Jelle recommended reporting it as an issue on the CPython tracker. I’m not sure that’s necessary yet, since I’m getting a high standard deviation in my benchmarks, so I’d like some advice on fixing that first.

Another approach I tried is inverting the trick, so the test is x is not None just as it would be in the while loop, and that’s slower than the idiomatic option:

    while True:
        if node is not None:
            nodes.append(node)
            node = node._parent
        else:
            return nodes

I think I noticed that difference a while ago as well, but concluded that there’s no meaningful reason the uglier code is faster: it’s just a quirk of the current implementation of the interpreter and might change any day. And the difference was tiny. So I went with the prettier code.


I tried to dis it and, formally speaking, the “better” version loops over 3 instructions and produces more reasonable bytecode overall (3.13):

  7           RESUME                   0

  8           NOP

  9   L1:     LOAD_GLOBAL              0 (x)
              POP_JUMP_IF_NOT_NONE     1 (to L2)

 10           RETURN_CONST             0 (None)

  8   L2:     JUMP_BACKWARD           10 (to L1)

The “while” variant looks a bit awkward:

  3           RESUME                   0

  4           LOAD_GLOBAL              0 (x)
              POP_JUMP_IF_NONE        11 (to L3)

  5   L1:     NOP

  4           LOAD_GLOBAL              0 (x)
              POP_JUMP_IF_NONE         2 (to L2)
              JUMP_BACKWARD           10 (to L1)
      L2:     RETURN_CONST             0 (None)
      L3:     RETURN_CONST             0 (None)
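For anyone who wants to reproduce the listings above, this is roughly how they can be generated (dis.Bytecode(...).dis() returns the disassembly as a string; plain dis.dis(f) prints it; exact opcode names vary between CPython versions):

```python
import dis

x = None  # module-level, so the loops compile to LOAD_GLOBAL

def while_true_loop():
    while True:
        if x is None:
            break

def while_condition_loop():
    while x is not None:
        pass

# Disassemble both variants side by side.
for f in (while_true_loop, while_condition_loop):
    print(f.__name__)
    print(dis.Bytecode(f).dis())
```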

Honestly, those timings are all within a standard deviation of each other. I don’t know if you’re consistently able to reproduce it (or what versions you’re able to reproduce it on), but I suspect you’re really just measuring the cost of resizing a list a ton of times, not necessarily the cost of branching-vs-not-branching.

Alright, I tried purer versions. On one system, while_True was about 0.88 nanoseconds faster per iteration:

18.87 ± 0.02 ns  while_True
19.75 ± 0.02 ns  while_condition

18.90 ± 0.02 ns  while_True
19.76 ± 0.02 ns  while_condition

18.87 ± 0.03 ns  while_True
19.76 ± 0.02 ns  while_condition

Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]

On another, it was about 0.52 ns slower:

14.77 ± 0.03 ns  while_condition
15.29 ± 0.05 ns  while_True

14.82 ± 0.05 ns  while_condition
15.32 ± 0.06 ns  while_True

14.82 ± 0.03 ns  while_condition
15.36 ± 0.05 ns  while_True


Python: 3.13.3 (tags/v3.13.3:6280bb54784, May  3 2025, 13:46:12) [Clang 18.1.8 (Fedora 18.1.8-2.fc40)]
Benchmark script
n = 10**3
root = None
for _ in range(n):
    root = root,
roots = [root] * 100

def while_True():
    for x in roots:
        while True:
            if x is None:
                break
            x = x[0]

def while_condition():
    for x in roots:
        while x is not None:
            x = x[0]

funcs = [while_True, while_condition]

from timeit import timeit
from statistics import mean, stdev
import sys
import random

for i in range(6):
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e9 for t in sorted(times[f])[:10]]
        return f'{mean(ts):5.2f} ± {stdev(ts):4.2f} ns '
    for _ in range(1000):
        random.shuffle(funcs)
        for f in funcs:
            t = timeit(f, number=1) / 1 / len(roots) / n
            times[f].append(t)
    if i < 3:
        continue
    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)
    print()

print('\nPython:', sys.version)
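On the high standard deviation: one common approach is to take the minimum over many short repeats rather than an average, since the fastest run is the one least disturbed by noise. A stdlib-only sketch (pyperf automates this plus warmup and CPU pinning, and would be the more rigorous option):

```python
from timeit import repeat

# Same 1000-deep chain of 1-tuples as in the script above.
depth = 1000
chain = None
for _ in range(depth):
    chain = (chain,)

def walk():
    x = chain
    while x is not None:
        x = x[0]

# Minimum over many repeats filters out scheduler noise,
# GC pauses, and frequency-scaling hiccups.
best = min(repeat(walk, number=100, repeat=20)) / 100
print(f'{best / depth * 1e9:.2f} ns per unlink')
```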