About free-threading performance

I was very excited to test free-threading performance when the 3.13 beta came out (thank you to everyone who works on it). Here’s what I found.

fib.py

from itertools import repeat

def fib(n):
    if n < 3:
        return n
    return fib(n-1) + fib(n-2)

# calculate the 30th fibonacci number 16 times
# sequentially
for i in repeat(30, 16):
    print(fib(i))

fib-thread.py

from itertools import repeat
from multiprocessing.pool import ThreadPool

def fib(n): ... # same as above

# calculate the 30th fibonacci number 16 times
# in parallel
with ThreadPool() as pool:
    for result in pool.imap_unordered(fib, repeat(30, 16)):
        print(result)

Benchmarks:

  • Regular build + fib.py: 940 ms
  • Free-threading + fib.py: 1545 ms
  • Regular build + fib-thread.py: 1377 ms
  • Free-threading + fib-thread.py: 279 ms
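For anyone who wants to reproduce numbers like these, a minimal wall-clock harness could look like this (a sketch; `time_command` is an illustrative helper, not a stdlib API, and not necessarily how the timings above were taken):

```python
import subprocess
import sys
import time

def time_command(args: list[str], runs: int = 3) -> float:
    """Best wall-clock time in milliseconds over `runs` executions."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the script's stdout so printing doesn't dominate the timing.
        subprocess.run(args, check=True, stdout=subprocess.DEVNULL)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# e.g. time_command([sys.executable, "fib.py"])
```

Taking the best of several runs reduces noise from a cold interpreter start or background load.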

This was very surprising to me. On the free-threaded build, single-threaded code is much slower, but multi-threaded code is much, much faster. I expected some difference, but not this much.

Is this expected, or is it an anomaly on my machine (Linux, Intel i7-1360P)? Will free-threading performance for single-threaded code improve during the beta period, or will I have to wait for 3.14?

The free-threaded build currently has the specialising adaptive interpreter introduced in Python 3.11 switched off. This was necessary because the specialising adaptive interpreter is not yet thread-safe. Unfortunately, this leads to a significant slowdown for single-threaded code.
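If you want to confirm at runtime which kind of build you're benchmarking, a small sketch (assuming Python 3.13+ for `sys._is_gil_enabled()`; the `getattr` fallback keeps it working on older versions, and `build_info` is just an illustrative name):

```python
import sys
import sysconfig

def build_info() -> str:
    """Describe whether this interpreter is a free-threaded build."""
    # Py_GIL_DISABLED is 1 on free-threaded ("t") builds, 0 or None otherwise.
    ft_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in 3.13; on free-threaded builds it also
    # reflects whether the GIL was re-enabled at runtime (e.g. PYTHON_GIL=1).
    gil_check = getattr(sys, "_is_gil_enabled", None)
    gil_on = gil_check() if gil_check is not None else True
    return f"free-threaded build: {ft_build}, GIL enabled: {gil_on}"

print(build_info())
```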

There is active, ongoing work by @kj0 and others to make the specialising adaptive interpreter thread-safe, but I think you may have to wait for Python 3.14 to see the fruits of this effort.

4 Likes

And the free-threaded build is much faster in your multi-threaded test because, hey, it’s actually multi-threaded! :tada: It can run the jobs concurrently. Your CPU has 4 “performance” cores and 8 “efficient” cores. I suspect that the OS is only using the four performance cores for this script, and that’s why you see roughly a 4x speedup.
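Which cores the OS actually schedules threads on is hard to see from Python, but you can at least check how many CPUs the process is allowed to use. A sketch (`os.sched_getaffinity` is Linux-only, hence the fallback; `usable_cpus` is an illustrative name):

```python
import os

def usable_cpus() -> int:
    """Number of CPUs this process may run on (affinity-aware on Linux)."""
    try:
        # Honours taskset/cgroup restrictions, unlike os.cpu_count().
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Portable fallback for platforms without sched_getaffinity.
        return os.cpu_count() or 1

print(usable_cpus())
```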

1 Like

Just curious, what would be the timing using Pool()?

fib-pool.py

from itertools import repeat
from multiprocessing import Pool

def fib(n):
    if n < 3:
        return n
    return fib(n - 1) + fib(n - 2)

# calculate the 30th fibonacci number 16 times
# in parallel
with Pool() as pool:
    for result in pool.imap_unordered(fib, repeat(30, 16)):
        print(result)

Regular build + fib-pool.py: 229 ms
Free-threading + fib-pool.py: 267 ms

Multiprocessing on the regular build is by far the fastest. I can’t reliably tell whether multiprocessing or threading is faster on the free-threaded build; they’re about the same, and the numbers fluctuate from run to run.

2 Likes

That’s a consequence of your workload, which is completely isolated and heavily dominated by processing time rather than by startup or message-passing overhead. Fortunately, multiprocessing isn’t going away, so for workloads like this it will always be an option :slight_smile:

A couple other workloads to consider:

  • For a given user input and a vast database, which entries are the closest matches? The fuzzywuzzy library can tell you, for any input and potential result, how similar they are; each of those comparisons is independent. This workload also requires getting a result back from each tiny part of the job, so there’s a bit more communication back to the main thread.
  • Search a single ZIP file for any file that contains a search term, given by a regular expression. Open the file only once, but parallelize the decompression and searching.

You may well find that the different models behave quite differently there. It’ll also be interesting to see how threads + subinterpreters do, although I’m not sure that’s stable enough for meaningful performance testing yet.

1 Like