About free-threading performance

I was very excited to test free-threading performance when the 3.13 beta came out (thank you to everyone who works on it). Here’s what I found.

fib.py

from itertools import repeat

def fib(n):
    if n < 3:
        return n
    return fib(n-1) + fib(n-2)

# calculate the 30th fibonacci number 16 times
# sequentially
for i in repeat(30, 16):
    print(fib(i))

fib-thread.py

from itertools import repeat
from multiprocessing.pool import ThreadPool

def fib(n): ... # same as above

# calculate the 30th fibonacci number 16 times
# in parallel
with ThreadPool() as pool:
    for result in pool.imap_unordered(fib, repeat(30, 16)):
        print(result)

Benchmarks:

  • Regular build + fib.py: 940 ms
  • Free-threading + fib.py: 1545 ms
  • Regular build + fib-thread.py: 1377 ms
  • Free-threading + fib-thread.py: 279 ms
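(The post doesn’t say how these timings were taken; a minimal way to reproduce this kind of measurement is a `perf_counter` harness like the sketch below, with a smaller `n` so it finishes quickly. The repetition count and `n=25` are my own choices, not the poster’s.)

```python
import time

def fib(n):
    # same variant as in the posts above: fib(1) == 1, fib(2) == 2
    if n < 3:
        return n
    return fib(n - 1) + fib(n - 2)

# time a few repetitions of a smaller workload
start = time.perf_counter()
for _ in range(4):
    fib(25)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms:.0f} ms")
```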

This is very surprising to me. On the free-threading build, single-threaded code is much slower, but multithreaded code is much, much faster. I expected some differences, but not this much.

Is this expected, or is it an anomaly of my machine (Linux, Intel i7-1360P)? Will free-threading performance for single-threaded code improve during the beta period, or will I have to wait for 3.14?

2 Likes

The free-threaded build currently has the specialising adaptive interpreter introduced in Python 3.11 switched off. This was necessary because the specialising adaptive interpreter is not yet thread-safe. Unfortunately, this leads to a significant slowdown for single-threaded code.

There is active ongoing work by @kj0 and others to make the specialising adaptive interpreter threadsafe, but I think you may have to wait for Python 3.14 to see the fruits of this effort.
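If you want to confirm which build you’re running, here’s a small defensive sketch. `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()` are real in 3.13+; the `getattr` fallback for older versions is my own addition.

```python
import sys
import sysconfig

def gil_status():
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0/None otherwise
    ft_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; on older versions the GIL is always on
    gil_on = getattr(sys, "_is_gil_enabled", lambda: True)()
    return ft_build, gil_on

print(gil_status())
```

On a free-threaded build run without `PYTHON_GIL=1`, this should report `(True, False)`.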

7 Likes

And the free-threaded version is much faster in your multithreaded test because hey, it’s actually multithreaded! :tada: It can run the jobs concurrently. Your CPU has 4 “performance cores” and 8 “efficient cores”. I suspect that the OS is only using the four performance cores for this script, and that’s why you see about a 4x speedup.
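One way to check that guess is to compare how many logical CPUs the machine has with how many this process is actually allowed to use. `os.process_cpu_count()` is real but new in Python 3.13; the fallback here is my own sketch for older versions.

```python
import os

logical = os.cpu_count()  # all logical CPUs on the machine
# CPUs this process may actually use (respects affinity masks, 3.13+)
usable = getattr(os, "process_cpu_count", os.cpu_count)()
print(logical, usable)
```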

2 Likes

Just curious, what would be the timing using Pool()?

fib-pool.py

from itertools import repeat
from multiprocessing import Pool

def fib(n):
    if n < 3:
        return n
    return fib(n - 1) + fib(n - 2)

# calculate the 30th fibonacci number 16 times
# in parallel
if __name__ == "__main__":  # needed when workers are started via "spawn"/"forkserver"
    with Pool() as pool:
        for result in pool.imap_unordered(fib, repeat(30, 16)):
            print(result)

  • Regular build + fib-pool.py: 229 ms
  • Free-threading + fib-pool.py: 267 ms

Multiprocessing with the regular build is by far the fastest. I can’t reliably test whether multiprocessing or threading is faster with free-threading; they’re about the same, and the numbers fluctuate from run to run.

2 Likes

That’s a consequence of your workload, which is completely isolated, and heavily dominated by processing time rather than startup or message-passing overhead. Fortunately, multiprocessing isn’t going away, so for those workloads, it’s always going to be an option :slight_smile:

A couple other workloads to consider:

  • For a given user input and a vast database, which entries are the closest matches? The fuzzywuzzy library can, for any input and potential result, tell you how similar they are; but each of those comparisons is independent. This also requires getting a result back from each tiny part of the job, so there’s a bit more communication back to the main thread.
  • Search a single ZIP file for any file that contains a search term, given by a regular expression. Open the file only once, but parallelize the decompression and searching.
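The second workload might be sketched roughly like this (the toy archive, file names, and helper names are mine, not from the thread). Each worker opens its own `ZipFile` view over the shared bytes, since sharing one `ZipFile` object across threads needs care; the per-member DEFLATE decompression then runs in the pool.

```python
import io
import re
import zipfile
from multiprocessing.pool import ThreadPool

def make_archive() -> bytes:
    # Build a small in-memory ZIP so the example is self-contained.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", "hello world\n")
        zf.writestr("b.txt", "nothing to see here\n")
        zf.writestr("c.txt", "world peace\n")
    return buf.getvalue()

def grep_zip(data: bytes, pattern: re.Pattern) -> list[str]:
    # Read the member list once from the shared bytes.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()

    def search(name: str):
        # Each worker gets its own ZipFile view, so decompression
        # and searching of members can proceed in parallel.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            text = zf.read(name).decode("utf-8", errors="replace")
        return name if pattern.search(text) else None

    with ThreadPool() as pool:
        return sorted(n for n in pool.map(search, names) if n)

print(grep_zip(make_archive(), re.compile(r"world")))  # ['a.txt', 'c.txt']
```

On a GIL build the `pattern.search` calls still serialize; on a free-threaded build both the inflation and the regex matching can overlap.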

You may well find that different models behave quite differently there. It’ll be interesting to see how threads+subinterpreters go, although I’m not sure that’s stable enough for reasonable performance testing yet.

1 Like

Yeah, free-threading is pretty good for my project, which streams a large amount of 4K RTSP video. But the single-threaded performance drags the main process down, so I’ll have to keep waiting for a better version.
The future looks beautiful with this new generation of Python.
I had almost given up on the language and was about to start over with a new one.

On Python 3.14a6 on Windows, the free-threaded build now seems to run this test at the same single-threaded speed as standard Python.

The next problem for me to understand is why I only get a 2.5× speed-up on 4 threads.

3 Likes

Here are the publicly available benchmarks from the faster-python project:

It depends on the algorithm. Are the threads sharing memory or copying it? Also, how much locking or synchronization is involved?
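To make the locking point concrete, here’s a toy contrast (all names are mine): the same count done through one shared, locked counter versus per-thread accumulation merged once at the end. On a free-threaded build the first version serializes on the lock no matter how many cores you have; the second can actually scale.

```python
import threading
from multiprocessing.pool import ThreadPool

N = 100_000
lock = threading.Lock()
total = 0
subtotals = []

def count_shared(_):
    # Every increment takes the shared lock: threads serialize on it.
    global total
    for _ in range(N):
        with lock:
            total += 1

def count_local(_):
    # Accumulate privately and merge once: far less synchronization.
    local = 0
    for _ in range(N):
        local += 1
    with lock:
        subtotals.append(local)

with ThreadPool(4) as pool:
    pool.map(count_shared, range(4))
    pool.map(count_local, range(4))

print(total, sum(subtotals))  # 400000 400000
```

Both produce the same answer; timing them separately is what reveals the contention cost.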

I agree; Sam is not a Windows guy, and he said it’s hard to know how a program will truly scale: https://lukasz.langa.pl/5d044f91-49c1-4170-aed1-62b6763e6ad0/

But at the same time, I was teased/mind-blown by his linear Fibonacci scaling promise in https://youtu.be/9OOJcTp8dqE?t=1220

“Something” makes scaling on my Windows laptop PC look bad with a7.

It could be:

  • “software details”: Windows isn’t the “focus target” OS; “disable that antivirus”
  • “hardware details”: free-threading may be unfriendly to CPUs with small L2/L3 caches (like an old i7-8550U), and a “U” CPU immediately reduces its frequency to stay within its power budget
  • or really, when integrated into CPython without shortcuts, free-threading will require a very cautious style of programming to scale