Why does unreached code take linear time?!?

Compare these two functions:

def short():
    try:
        if 0 == 1:
            unreached
        raise RuntimeError
    except RuntimeError:
        pass

def long():
    try:
        if 0 == 1:
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
        raise RuntimeError
    except RuntimeError:
        pass

The only difference is that long() has 100 unreached statements instead of just one. But it takes much longer:

 357.0 ±  0.3 ns  short
1355.8 ±  1.5 ns  long

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]

And when I increase it to 200 unreached statements, it takes even longer:

 358.3 ±  1.3 ns  short
2357.8 ±  2.9 ns  long

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]

So it looks like it takes an extra 10 ns for each unreached statement. Why? Shouldn’t it just “skip” them all in constant time and be just as fast as short()?

Benchmark code (Attempt This Online!):

from timeit import timeit
from time import perf_counter as time
from statistics import mean, stdev
import sys


def short():
    try:
        if 0 == 1:
            unreached
        raise RuntimeError
    except RuntimeError:
        pass


def long():
    try:
        if 0 == 1:
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
        raise RuntimeError
    except RuntimeError:
        pass


funcs = short, long

for _ in range(3):
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e9 for t in sorted(times[f])[:5]]
        return f'{mean(ts):6.1f} ± {stdev(ts):4.1f} ns '
    for _ in range(100):
        for f in funcs:
            t = timeit(f, number=10**4) / 1e4
            times[f].append(t)
    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)
    print()

print('Python:', sys.version)

Just a guess, but could the difference be bytecode generation? If you use py_compile on each of the functions and then time them, is the result the same?
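For what it’s worth, here’s a sketch of one way to test that hypothesis, using compile() directly rather than py_compile (which should amount to the same thing here): the source is compiled to a code object once, up front, so no bytecode generation can leak into the timed loop.

```python
import timeit

# Same `short` as in the benchmark; `unreached` stays an (unresolved) global.
SRC = """
def short():
    try:
        if 0 == 1:
            unreached
        raise RuntimeError
    except RuntimeError:
        pass
"""

# Compile once, up front, so no bytecode generation happens inside the loop.
code = compile(SRC, "<bench>", "exec")
ns = {}
exec(code, ns)
short = ns["short"]

# Only the already-compiled function is called while timing.
t = timeit.timeit(short, number=10_000) / 10_000
print(f"{t * 1e9:.1f} ns per call")
```

If the timing difference survives this, compilation cost is off the table as an explanation.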

Might be mostly a Python 3.11 issue.

On mycompiler.io (Python 3.11) it’s far worse (I had to reduce range(100) to range(25) due to their 15-second limit):

 317.9 ±  1.7 ns  short
5771.0 ±  3.0 ns  long
Python: 3.11.4 (main, Jun  9 2023, 07:59:55) [GCC 12.3.0]

On trinket.io (Python 3.10) it consistently takes only a little longer:

 225.6 ±  3.9 ns  short
 242.1 ±  2.2 ns  long
Python: 3.10.9 (main, Jan 23 2023, 22:32:48) [GCC 10.2.1 20210110]

On replit.com (Python 3.10), it also consistently takes only a little longer:

 309.8 ±  1.5 ns  short
 332.6 ±  1.9 ns  long
Python: 3.10.11 (main, Apr  4 2023, 22:10:32) [GCC 12.2.0]

In the disassembly for the two functions, I did find one thing that MIGHT be relevant:

>>> dis.dis(short)
  1           0 RESUME                   0

  2           2 NOP

  3           4 LOAD_CONST               1 (0)
              6 LOAD_CONST               2 (1)
              8 COMPARE_AND_BRANCH      39 (==)
             12 POP_JUMP_IF_FALSE        6 (to 26)
...

>>> dis.dis(long)
  1           0 RESUME                   0

  2           2 NOP

  3           4 LOAD_CONST               1 (0)
              6 LOAD_CONST               2 (1)
              8 COMPARE_OP              32 (==)
             12 EXTENDED_ARG             2
             14 POP_JUMP_IF_FALSE      600 (to 1216)

...

The larger bytecode means that the conditional jump’s target no longer fits in a one-byte opcode argument, so the compiler has to emit an EXTENDED_ARG prefix. It’s entirely possible that this has an impact.
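To illustrate: EXTENDED_ARG appears whenever a jump’s argument no longer fits in one byte, not just for dead code. A minimal sketch (using a live branch, so it doesn’t depend on how dead code is compiled; make() is a hypothetical helper):

```python
import dis

def make(n):
    # n copies of "a += 1" in the if body force a long forward jump past them
    src = "def f(a):\n    if a:\n" + "        a += 1\n" * n + "    return a"
    ns = {}
    exec(src, ns)
    return ns["f"]

small, big = make(1), make(200)

# Jumping over 200 statements needs an argument larger than one byte,
# so the compiler prefixes the jump with EXTENDED_ARG; the one-statement
# jump fits in a single byte.
print("EXTENDED_ARG" in dis.Bytecode(small).dis())
print("EXTENDED_ARG" in dis.Bytecode(big).dis())
```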

I’m not familiar with that and am not completely sure what you mean. Does it differ from the normal compilation when just running the script? I’m testing on the linked “Attempt This Online” site, so you could try your test there.

I get that as well, but with 200 unreached statements it looks equivalent, so it wouldn’t explain the additional slowdown there. (Plus I’d be surprised if this small thing had that much impact.)

With 100:

              8 COMPARE_OP               2 (==)
             14 EXTENDED_ARG             2
             16 POP_JUMP_FORWARD_IF_FALSE   700 (to 1418)

 19          18 LOAD_GLOBAL              0 (unreached)

With 200:

              8 COMPARE_OP               2 (==)
             14 EXTENDED_ARG             5
             16 POP_JUMP_FORWARD_IF_FALSE  1400 (to 2818)

 19          18 LOAD_GLOBAL              0 (unreached)

That’s definitely interesting.

Yeah, but when we’re dealing with unexplained performance regressions, ANY potential focal point is worth exploring. Are you able to find the exact point at which it flips to EXTENDED_ARG, and get a performance measurement just below and just above that point? If those have nearly the same slowdown, that would solidly exclude that as a possibility.
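(A sketch of one way to automate that search: build n-statement variants of the same dead-branch shape and scan the disassembly for EXTENDED_ARG; variant() is a hypothetical helper.)

```python
import dis

def variant(n):
    # A function with n dead `unreached` statements, shaped like long().
    body = "\n".join("            unreached" for _ in range(n))
    src = (
        "def f():\n"
        "    try:\n"
        "        if 0 == 1:\n"
        + body +
        "\n        raise RuntimeError\n"
        "    except RuntimeError:\n"
        "        pass\n"
    )
    ns = {}
    exec(src, ns)
    return ns["f"]

# Scan upward until the conditional jump first needs an EXTENDED_ARG prefix.
for n in range(1, 101):
    if "EXTENDED_ARG" in dis.Bytecode(variant(n)).dis():
        print("first EXTENDED_ARG at n =", n)
        break
```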

For me on ATO, that’s between 36 and 37. Runtimes:

 726.9 ±  4.5 ns  long36
 736.5 ±  4.1 ns  long37

Attempt This Online!

from timeit import timeit
from time import perf_counter as time
from statistics import mean, stdev
import sys


def long36():
    try:
        if 0 == 1:
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached
        raise RuntimeError
    except RuntimeError:
        pass


def long37():
    try:
        if 0 == 1:
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached; unreached
            unreached; unreached; unreached; unreached; unreached; unreached; unreached
        raise RuntimeError
    except RuntimeError:
        pass


import dis
# dis.dis(long36)
# dis.dis(long37)

funcs = long36, long37

for _ in range(3):
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e9 for t in sorted(times[f])[:5]]
        return f'{mean(ts):6.1f} ± {stdev(ts):4.1f} ns '
    for _ in range(100):
        for f in funcs:
            t = timeit(f, number=10**4) / 1e4
            times[f].append(t)
    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)
    print()

print('Python:', sys.version)

For you I guess it’s around 256/6 ≈ 42.7, since your jump grows by 6 per statement (600 over the 100 statements) instead of my 7, and EXTENDED_ARG is needed once the jump argument exceeds 255.

Does this happen if the function doesn’t contain a try/except block? I am wondering if it needs to build and store a larger exception object in the slow case.

Btw, if I replace raise RuntimeError with return, the difference shrinks a lot (to just ~0.05 ns per unreached statement):

  49.4 ±  0.1 ns  short
  54.4 ±  0.0 ns  long

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]

Edit: Actually that seems to be constant 5 ns, not some time “per element”. It doesn’t get bigger when I increase to 200 or even 300 unreached statements.

You mean with the try/except not inside the function but in the calling code? Then yes. That’s what I had at first; the results were similar to what they are now (you can see it in the history of my first post).

I shouldn’t have said try/except, I was thinking about the exception itself (as you just posted about). When you raise a RuntimeError it needs to construct the traceback for it, and perhaps one of them has more stuff in it.

How should I check how much stuff is in the traceback? With except ValueError (so it doesn’t catch) and calling long() right after defining it, I’m shown just this little bit:

Traceback (most recent call last):
  File "/ATO/code", line 32, in <module>
    long()
  File "/ATO/code", line 29, in long
    raise RuntimeError
RuntimeError

I don’t really know, just brainstorming what might be happening. I was messing around with this in a notebook and when I do this:

edit: sorry, I should say that I removed the try/except in the functions and just have them raise

try:
    long()
except RuntimeError as exc:
    print(exc.__traceback__.tb_frame.f_locals)

I can see the entire definition of long being stored in there, as a string. But when I do that for short in the same notebook it also has the code for long, because they’re in the same scope. Is timeit doing something to isolate them?

edit: this may have been a red herring…my f_locals also has stuff like ipython history in it :man_facepalming:
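(As a quick sanity check that timeit doesn’t isolate the callable’s globals, a sketch: it just calls the function in a loop in its own scope.)

```python
import timeit

calls = []

def probe():
    calls.append(1)

# timeit with a callable simply calls it `number` times; the function
# keeps its original __globals__, nothing is copied or sandboxed.
timeit.timeit(probe, number=3)
print(len(calls))
```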

What Python version do you use? With ATO’s 3.11.4, that only shows me this:

{'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x7f02658a8410>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__file__': '/ATO/code', '__cached__': None, 'timeit': <function timeit at 0x7f02658a56c0>, 'time': <built-in function perf_counter>, 'mean': <function mean at 0x7f026571d4e0>, 'stdev': <function stdev at 0x7f026571dd00>, 'sys': <module 'sys' (built-in)>, 'short': <function short at 0x7f0265ac1f80>, 'long': <function long at 0x7f02658a5bc0>, 'exc': RuntimeError()}

Manually broken into multiple lines for readability:

{'__name__': '__main__',
 '__doc__': None,
 '__package__': None,
 '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x7f02658a8410>,
 '__spec__': None,
 '__annotations__': {},
 '__builtins__': <module 'builtins' (built-in)>,
 '__file__': '/ATO/code',
 '__cached__': None,
 'timeit': <function timeit at 0x7f02658a56c0>,
 'time': <built-in function perf_counter>,
 'mean': <function mean at 0x7f026571d4e0>,
 'stdev': <function stdev at 0x7f026571dd00>,
 'sys': <module 'sys' (built-in)>,
 'short': <function short at 0x7f0265ac1f80>,
 'long': <function long at 0x7f02658a5bc0>,
 'exc': RuntimeError()}

I don’t think so (but I’m not sure what you mean by isolate). And if I instead measure with

            t0 = time()
            for _ in range(10**4):
                f()
            t = (time() - t0) / 1e4

then it still looks the same (just ~100 ns slower due to timeit doing that loop faster):

 418.2 ±  3.6 ns  short
1428.7 ± 13.9 ns  long

I was using a jupyter notebook with 3.9, so quite different (I was mistakenly looking at the jupyter history which they store as local vars). But I was looking in the wrong place when I mentioned the code string. The code is still in the traceback (as a code object), in exc.__traceback__.tb_frame.f_code.co_code

Modifying your ATO example (and just doing one rep instead of 10000):

...
    except RuntimeError as exc:
        print(len(exc.__traceback__.tb_frame.f_code.co_code))
...

yields 188 and 1576, which is roughly the same order of magnitude as the timing difference. So perhaps constructing the context of the exception object accounts for this timing.

I confirm. Although on trinket.io, where there wasn’t such a large speed difference, I also get very different co_code lengths: 76 and 472. Also, I think that value (co_code) is prepared beforehand, not each time an exception is created.


Could someone run the first post’s benchmark in equivalently installed regular Python 3.10 and 3.11 on the same machine? (I’m only on a phone, where that’s quite difficult.)

the things I do[1]

On a debian11 VM on GCP:

mamba create -n py310 python=3.10
mamba create -n py311 python=3.11
mamba activate py310
python unreached_test.py
mamba deactivate
mamba activate py311
python unreached_test.py

outputs:

(py310) jupyter@fluent-analysis ~$ python unreached_test.py 
 142.3 ±  0.1 ns  short
 161.2 ±  0.1 ns  long

 150.4 ±  0.3 ns  short
 166.4 ±  0.1 ns  long

 150.7 ±  0.1 ns  short
 167.0 ±  0.2 ns  long

Python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]

and

(py311) jupyter@fluent-analysis ~$ python unreached_test.py 
 176.8 ±  0.3 ns  short
 643.3 ±  0.4 ns  long

 176.0 ±  0.4 ns  short
 644.0 ±  0.3 ns  long

 176.5 ±  0.4 ns  short
 644.7 ±  0.6 ns  long

Python: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:34:09) [GCC 12.3.0]

So this definitely is new to 3.11, although something is different in 3.10 as well…


  1. to avoid work ↩︎


Thanks! I think with that, creating a GitHub issue would be appropriate. I might tomorrow, depending on further comments here.

I just noticed it also matters whether I write the 100 unreached statements on the same line or on separate lines. Benchmark with those two extremes (instead of 10 lines of 10 statements each):

 356.4 ±  2.9 ns  short
1382.3 ±  4.9 ns  long_one_per_line
2266.7 ±  5.5 ns  long_all_on_same_line

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]
from timeit import timeit
from time import perf_counter as time
from statistics import mean, stdev
import sys


def short():
    try:
        if 0 == 1:
            unreached
        raise RuntimeError
    except RuntimeError:
        pass


u = '            unreached\n' * 100
exec(f'''
def long_one_per_line():
    try:
        if 0 == 1:
{u}
        raise RuntimeError
    except RuntimeError:
        pass
''')


exec(f'''
def long_all_on_same_line():
    try:
        if 0 == 1:
            {'unreached;' * 100}
        raise RuntimeError
    except RuntimeError:
        pass
''')


funcs = short, long_one_per_line, long_all_on_same_line

for _ in range(3):
    times = {f: [] for f in funcs}
    def stats(f):
        ts = [t * 1e9 for t in sorted(times[f])[:5]]
        return f'{mean(ts):6.1f} ± {stdev(ts):4.1f} ns '
    for _ in range(100):
        for f in funcs:
            t = timeit(f, number=10**4) / 1e4
            times[f].append(t)
    for f in sorted(funcs, key=stats):
        print(stats(f), f.__name__)
    print()

print('Python:', sys.version)

Attempt This Online!

Btw long_one_per_line.__code__.co_code == long_all_on_same_line.__code__.co_code is true.