Free-threaded Python collection performance on macOS

santiagobasulto · December 18, 2024, 9:00am

Hello everybody. I was testing free-threaded Python on my Mac (through uv’s cpython-3.13.0+freethreaded-macos-x86_64-none) and noticed that multi-threaded access to collections was VERY slow. I put together a quick set of benchmarks covering simple operations like appending/adding elements, removing elements, etc., and ran them on both Linux and macOS. The difference is astonishing.

For reference, I own a 2019 MacBook with an 8-core i9 and 32GB RAM. For the Linux benchmarks I used a GCP instance (c4-standard-4, 4 vCPUs, 15GB RAM).

I ran these benchmarks using 100 threads, and here are the results:

(slowdown is computed as time in seconds Mac / time in seconds Linux)

As you can see the performance on my Mac is considerably worse.

For something less extreme, only 10 threads, there’s also a considerable penalty:

(image removed because I can post only one image as new user)

Is there any implementation detail in the locking mechanisms that make collections thread-safe on free-threaded Python that might be causing this?

For reference, I created the benchmarks for each different type of collection. For example:

Create elements

# For lists
a_list = []
def target(n_iters):
    for _ in range(n_iters):
        a_list.append(1)

# for sets
a_set = set()
def target(n_iters):
    for i in range(n_iters):
        a_set.add((threading.get_ident(), i))

# for dicts
a_dict = {}
def target(n_iters):
    for i in range(n_iters):
        a_dict[(threading.get_ident(), i)] = 1

And always ran the benchmarks in this way:

import threading
import time

# 100 threads, 1000 iterations each
threads = [threading.Thread(target=target, args=(1_000,)) for _ in range(100)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Total time: {time.monotonic() - start}")

pitrou · December 18, 2024, 2:45pm

Well, can you post the image in a comment below? Then someone can reintegrate it in the original post.

pitrou · December 18, 2024, 2:48pm

This comparison doesn’t make a lot of sense on its own, but the numbers are so unexpectedly large that they are still informative IMHO.

Santiago Basulto:

# For lists
a_list = []
def target(n_iters):
    for _ in range(n_iters):
        a_list.append(1)

# for sets
a_set = set()
def target(n_iters):
    for i in range(n_iters):
        a_set.add((threading.get_ident(), i))

# for dicts
a_dict = {}
def target(n_iters):
    for i in range(n_iters):
        a_dict[(threading.get_ident(), i)] = 1

threading.get_ident invokes a system function, you should not call it in a loop in your benchmark. Please invoke it outside of the loop and store the result in a local variable.

If you have an 8-core machine, then doing such accesses from 100 threads does not make sense IMHO. Also, you’re comparing with a 4-CPU VM instance, which skews the comparison even more.

Could you please post a 4-threads result?
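A minimal sketch of what the suggested changes could look like for the dict case: the thread id is looked up once per thread, and the thread count is a parameter so a 4-thread run is easy to produce. The function name `run_dict_benchmark` and the use of `threading.Barrier` are assumptions for illustration, not part of the original benchmarks. The barrier keeps all threads alive at the same time, so their identifiers cannot be recycled, and it also keeps thread start-up out of the timed region.

```python
import threading
import time


def run_dict_benchmark(n_threads, n_iters):
    """Time n_threads threads, each inserting n_iters keys into one shared dict."""
    shared = {}
    # +1 so the main thread can wait until every worker is ready.
    barrier = threading.Barrier(n_threads + 1)

    def target():
        tid = threading.get_ident()  # looked up once, outside the loop
        barrier.wait()               # all workers begin inserting together
        for i in range(n_iters):
            shared[(tid, i)] = 1

    threads = [threading.Thread(target=target) for _ in range(n_threads)]
    for t in threads:
        t.start()
    barrier.wait()                   # released once all workers have started
    start = time.perf_counter()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return elapsed, len(shared)


if __name__ == "__main__":
    for n_threads in (1, 4, 8):
        elapsed, size = run_dict_benchmark(n_threads, 1_000)
        print(f"{n_threads} threads: {elapsed:.4f}s, {size} keys")
```

Running the same script on both machines with thread counts at or below the core count would make the numbers easier to compare.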

barry-scott · December 18, 2024, 3:20pm

What version of macOS do you have installed?

I wonder if you are measuring macOS kernel locking and threading performance against Linux kernel locking and threading?

macOS is known to have issues with performance in these areas compared to other OS, at least in the past.

It would be interesting to reproduce the results on Apple Silicon, running Linux in a VM under macOS so that both use the same CPU.
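When results from different machines are compared like this, it can help to record the environment next to each timing. The sketch below uses only standard-library calls; the helper name `describe_environment` is hypothetical. `sys._is_gil_enabled()` exists only on Python 3.13 and later, so it is looked up defensively here.

```python
import os
import platform
import sys
import sysconfig


def describe_environment():
    """Collect the details needed to compare benchmark runs across machines."""
    # sys._is_gil_enabled() was added in Python 3.13; older versions always have a GIL.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "gil_enabled": gil_check() if gil_check is not None else True,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A free-threaded build can still run with the GIL re-enabled at runtime, which is why the build flag and the runtime check are reported separately.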

santiagobasulto · December 18, 2024, 3:53pm

Why wouldn’t it be relevant? If the same benchmark, with the same number of threads, takes twice as long, you’ll see 2X. Anyways, here’s the raw data:

Yes, thanks, I’m setting tid outside the loop. My previous code was just illustrative. I’ll publish all the benchmarks soon.

Yes, this is the most interesting result. A 4-core CPU performs BETTER than an 8-core one. Anyways, I attached the image above using 10 threads.

I don’t have benchmarks with 4 threads. I did 1, 4 and 100. I ran these benchmarks approximately 800 times to filter out statistically insignificant results and noise (using just the coefficient of variation).
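For readers unfamiliar with the measure, the coefficient of variation is the standard deviation of the timings divided by their mean. A minimal sketch of how repeated timings could be summarised this way follows; the helper names are hypothetical, and how the original benchmarks actually used the value to filter runs is not stated in this thread.

```python
import statistics
import time


def time_repeated(func, repeats):
    """Call func() `repeats` times and return the list of wall-clock timings."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (needs at least 2 samples)."""
    return statistics.stdev(samples) / statistics.mean(samples)


if __name__ == "__main__":
    timings = time_repeated(lambda: sum(range(10_000)), repeats=50)
    print(f"mean: {statistics.mean(timings):.6f}s")
    print(f"CV:   {coefficient_of_variation(timings):.3f}")
```

A low value suggests the repeated runs agree with each other; a high one suggests the mean alone is not a reliable summary.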

santiagobasulto · December 18, 2024, 3:54pm

Thanks, this is a good point. I’ll publish the source code of the benchmarks so anybody can run them.

I’m on macOS 12.5 Monterey.

barry-scott · December 18, 2024, 3:59pm

I wonder how useful benchmarking on a very old x86 Mac with an old macOS is.
Edit: Also, I expect that the VM you used for Linux is far faster than your laptop anyway.

I’m not a uv user. If you can give me step-by-step instructions, I’ll try to run the benchmarks on Apple Silicon with the latest macOS and Linux kernels.

santiagobasulto · December 18, 2024, 4:01pm

Thinking about it, and I’m saying this without any type of source to back it up, I doubt it’s only an issue of the OS. Because for some benchmarks, Mac takes the same time as Linux, and sometimes even less (see the image above). So it’s not consistently slower. But again, this is just a hunch. You might be right.

pitrou · December 18, 2024, 5:57pm

Because those are different operating systems running on different CPUs, so the performance differences cannot be attributed to Python with certainty. Some CPUs are better than others at inter-thread synchronization, and the same can be said of operating systems.

Those CPU cores are fighting to access the same data structure all the time. The more cores are fighting together, the more contention it creates. So the results are not that surprising. Real-world programs should certainly try to avoid creating such a situation.

I’m not saying those performance numbers are useless, but they are hard to interpret due to 1) running a kind of workload that’s not recommended at all for performance 2) comparing two different CPU/OS pairs.
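One common way to avoid every thread fighting over the same data structure is to give each thread its own container and merge after the threads finish. The sketch below shows this for the list case; the function name `append_sharded` is hypothetical, and whether it actually helps on a given CPU/OS pair would need to be measured.

```python
import threading


def append_sharded(n_threads, n_iters):
    """Each thread fills its own list; the lists are merged after join()."""
    shards = [[] for _ in range(n_threads)]

    def target(shard):
        for _ in range(n_iters):
            shard.append(1)  # no other thread touches this list

    threads = [threading.Thread(target=target, args=(shard,)) for shard in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merging happens on a single thread, after all workers are done.
    merged = []
    for shard in shards:
        merged.extend(shard)
    return merged


if __name__ == "__main__":
    print(len(append_sharded(4, 1_000)))
```

The end result is the same list contents as the shared-list benchmark, but no two threads ever write to the same object while the workers are running.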