Slower performance with --disable-gil on threaded nearest-neighbor benchmark (10M points)

Summary

I am currently benchmarking the performance of the experimental --disable-gil mode in Python 3.14b2.
In most CPU-bound cases, I observed excellent improvements (3–4× speedups). However, in one more realistic test case, performance drops significantly when using --disable-gil.

This is not really a bug report, but rather performance feedback and an invitation to discuss whether this behavior is expected.


Reproducible test case

Nearest neighbor search using a few threads over a list of 10 million 3D points (list[list[float]]).

The algorithm (sketched below):

  • splits the search range evenly across threads
  • uses a shared dict to store the result with a threading.Lock
  • each thread compares distance2(p, query_point) and updates shared data if needed
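
A minimal sketch of that structure, reconstructed from the description above (function and variable names are assumptions on my part; the actual code is in the repository linked below):

import threading

def distance2(a, b):
    # squared Euclidean distance between two 3D points
    return (a[0] - b[0])**2 + (a[1] - b[1])**2 + (a[2] - b[2])**2

def find_closest_worker(points, query_point, start, end, shared_result, lock, tid):
    # scan this thread's slice [start, end), keeping a thread-local minimum
    local_min_dist = float("inf")
    local_min_idx = -1
    for i in range(start, end):
        d = distance2(points[i], query_point)
        if d < local_min_dist:
            local_min_dist, local_min_idx = d, i
    # merge into the shared dict once, under the lock
    with lock:
        if local_min_dist < shared_result['min_dist']:
            shared_result['min_dist'] = local_min_dist
            shared_result['closest_idx'] = local_min_idx
            shared_result['owner'] = tid

def threaded_closest_point(points, query_point, num_threads=4):
    n = len(points)
    chunk_size = n // num_threads
    shared_result = {'min_dist': float("inf"), 'closest_idx': -1, 'owner': -1}
    lock = threading.Lock()
    threads = []
    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        t = threading.Thread(target=find_closest_worker,
                             args=(points, query_point, start, end,
                                   shared_result, lock, tid))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return shared_result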

Repository with code and Dockerfile:

Benchmark article:

🚀 Benchmarks of Python 3.14b2 with --disable-gil - DEV Community


:1234: Results

Mode             Elapsed Time
Python GIL ON    2.38 s
Python GIL OFF   3.61 s  :red_exclamation_mark:
C++ + threads    0.88 s  :white_check_mark:

Question

This test is the only case in my benchmarks that runs slower with --disable-gil, even though it is purely CPU-bound.

Is this performance drop expected with the current implementation of nogil?

Any hints on what could explain it? (lock contention, list access overhead, memory locality, etc.)


Environment

  • Python: 3.14.0b2, compiled with --disable-gil
  • Platform: Docker / Linux (see repo for details)
  • Memory: 10M Python objects (list of lists)
2 Likes

The first question I would ask is how it fares when varying the number of threads (from, say, 1 to 8). Both with and without the GIL.

I would also suggest, if it’s possible, to collect results in per-thread structures and only merge them at the very end (typical map-reduce division of labor).
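
A rough sketch of that map-reduce shape (a minimal sketch with assumed names): each worker writes only to its own result slot, and the main thread does a single reduction at the end.

import threading

def distance2(a, b):
    return (a[0] - b[0])**2 + (a[1] - b[1])**2 + (a[2] - b[2])**2

def find_closest_worker(points, query_point, start, end, results, tid):
    # no lock and no shared dict in the hot path; only a per-thread slot
    local_min_dist = float("inf")
    local_min_idx = -1
    for i in range(start, end):
        d = distance2(points[i], query_point)
        if d < local_min_dist:
            local_min_dist, local_min_idx = d, i
    results[tid] = (local_min_dist, local_min_idx)

def threaded_closest_point(points, query_point, num_threads=4):
    n = len(points)
    chunk_size = n // num_threads
    results = [None] * num_threads            # one slot per thread
    threads = []
    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        t = threading.Thread(target=find_closest_worker,
                             args=(points, query_point, start, end, results, tid))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    # reduce: merge the per-thread minima only once, in the main thread
    min_dist, min_idx = min(results)
    return min_idx, min_dist

Note that points and query_point are still shared between the threads here, so any reference-count contention on those (discussed further down) remains.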

1 Like

Without diving into it too much, my first inclination would be contention on refcounts. You’re passing around several shared objects to other functions, which means their refcounts will have to be updated. The interpreter avoids refcount operations when it can within functions, so you could even see a dramatic shift in performance if you didn’t use functions… Obviously that’s not a good way to optimize :slight_smile: (Especially because other changes could prevent the interpreter from omitting unnecessary refcounts, and then you’re back to square one.)

If contention is the issue you’d see very bad scaling (or even negative scaling) as you ramp up threads. I don’t think we have ready-to-use tools to detect refcount contention at this point, but it’s something that’s being looked at.

One way to avoid contention on the refcount is to not share objects (as Antoine already suggested), the other is to not pass around the data directly. The ft_utils project has ft_utils.local.LocalWrapper for that purpose. It provides a (mostly) transparent thread-local handle to an object, so it can be passed around and used without requiring additional refcount operations on the shared object. (You can also avoid a lot of the contention on rarely-used objects that you pass around, like lock and shared_result, by making them globals. I wouldn’t recommend that for real code, of course, but it might help to identify the sources of slowdowns.)
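
To illustrate, here is a rough sketch of the worker using LocalWrapper (this assumes the wrapper transparently forwards indexing to the wrapped object, as the ft_utils docs describe; treat it as a sketch rather than tested code):

from ft_utils.local import LocalWrapper

def distance2(a, b):
    return (a[0] - b[0])**2 + (a[1] - b[1])**2 + (a[2] - b[2])**2

def find_closest_worker(points, query_point, start, end, shared_result, lock, tid):
    # thread-local handles: the hot loop goes through the wrappers instead of
    # repeatedly bumping the shared objects' reference counts
    points_local = LocalWrapper(points)
    query_local = LocalWrapper(query_point)

    local_min_dist = float("inf")
    local_min_idx = -1
    for i in range(start, end):
        d = distance2(points_local[i], query_local)
        if d < local_min_dist:
            local_min_dist, local_min_idx = d, i

    # single locked update at the end, as before
    with lock:
        if local_min_dist < shared_result['min_dist']:
            shared_result['min_dist'] = local_min_dist
            shared_result['closest_idx'] = local_min_idx
            shared_result['owner'] = tid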

3 Likes

The code runs 4 threads operating on 4 slices of the list and each thread runs this function:

def find_closest_worker(points, query_point, start, end, shared_result, lock, tid):
    local_min_dist = float("inf")
    local_min_idx = -1

    for i in range(start, end):
        d = distance2(points[i], query_point)
        if d < local_min_dist:
            local_min_dist = d
            local_min_idx = i

    with lock:
        if local_min_dist < shared_result['min_dist']:
            shared_result['min_dist'] = local_min_dist
            shared_result['closest_idx'] = local_min_idx
            shared_result['owner'] = tid

Since the lock part is only used once at the end of each thread’s workload, I wouldn’t worry about contention there. The things that are shared between the threads are points and query_point and there could be reference count contention on each. My suggestion would be to not share those.

You can divide up the points list with points[start:end] before passing it off to each thread so each has a separate list object. You can copy query_point with [float(str(f)) for f in query_point] (see the sketch after the example below). Note that converting to str is necessary because:

>>> f = 2.0
>>> g = float(f)
>>> f is g
True
>>> g = float(str(f))
>>> f is g
False
>>> f == g
True
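
Putting both suggestions together, a sketch of how the launching code might look (reusing the find_closest_worker shown above; the surrounding names are assumed):

import threading

def launch_workers(points, query_point, num_threads, shared_result, lock):
    n = len(points)
    chunk_size = n // num_threads
    threads = []
    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        chunk = points[start:end]                          # a separate list object per thread
        query_copy = [float(str(f)) for f in query_point]  # distinct float objects per thread
        # note: indices found by the worker are now relative to the chunk,
        # so keep `start` around to map them back to the full list
        t = threading.Thread(target=find_closest_worker,
                             args=(chunk, query_copy, 0, len(chunk),
                                   shared_result, lock, tid))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()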

I assume this is a toy problem, but to be clear for anyone reading this: much better speedups would be available by using NumPy for this problem rather than using threads. Always optimise single-threaded performance first before considering threading.

1 Like

I expect that if you monitor the python threads with debug tools you will see a lot of locking operations. Maybe try perf and see what is taking the time. If you see futex high up, it’s locking.

Using a profiler will help you find where the time is being spent. I find “samply” convenient but you could also use “perf”. Here is a samply profile I produced for your script:

Firefox Profiler

It seems a lot of time is spent in the main thread, calling random functions and running the cyclic GC; this appears to be the setup code inside your __main__ block. You could disable the GC, either temporarily or completely. To generate an array of random numbers, using the numpy random function would be enormously faster since you can generate them all in one call. Since that logic is outside the section you are timing, it won’t affect the reported “Elapsed time”.
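
For the setup phase, a rough sketch of those two suggestions combined (assuming the script keeps its list[list[float]] layout):

import gc
import numpy as np

N = 10_000_000

gc.disable()   # skip cyclic-GC passes while building 10M+ objects; re-enable afterwards if needed
# one vectorised call instead of millions of calls into the random module;
# .tolist() converts back to the list-of-lists layout the benchmark expects
points = np.random.rand(N, 3).tolist()
query_point = np.random.rand(3).tolist()
gc.enable()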

For the processing done by the worker threads, they are spending quite a bit of time accessing the shared list. Other people have suggested ways to mitigate that. Generally you want the worker threads to be manipulating local data and only send the final results back to the main thread.

1 Like

Hi @basileMarchand :waving_hand:

As @thomas has already pointed out, the problem is implicit contention on the shared objects’ reference counts.

After applying this patch, on my macbook this prints 1.66 seconds in the default build, and 0.83 seconds in the free-threading build:

--- 03_loop_with_big_shared_list_original.py    2025-06-13 18:41:02
+++ 03_loop_with_big_shared_list.py     2025-06-13 19:10:26
@@ -2,7 +2,9 @@
 import random
 import time
 from math import sqrt
+from cereggii import ThreadHandle
 
+
 def distance2(a, b):
     return (a[0]-b[0])**2 + (a[1]-b[1])**2 + (a[2]-b[2])**2
 
@@ -11,8 +13,11 @@
     local_min_dist = float("inf")
     local_min_idx = -1
 
+    points_h = ThreadHandle(points)
+    query_point_h = ThreadHandle(query_point)
+
     for i in range(start, end):
-        d = distance2(points[i], query_point)
+        d = distance2(points_h[i], query_point_h)
         if d < local_min_dist:
             local_min_dist = d
             local_min_idx = i

The ThreadHandles you see here are essentially equivalent to the LocalWrapper that Thomas mentioned. Here are some docs if you’re interested in how it works: ThreadHandle - cereggii

1 Like

I should finally write a page to close “Document how to use samply on Linux to generate multithreaded profiles” (Issue #161 · Quansight-Labs/free-threaded-compatibility · GitHub) and properly document this workflow!

1 Like

Thanks for running the profiler.

I see many suggestions on how to improve the code to make it run faster, which is all helpful and fine, but isn’t the point of this posting that there is a performance bug somewhere in the code which should ideally be fixed?

What I find surprising is that the GC is causing such a major slowdown. It is literally taking up most of the threading time (unless I’m misinterpreting the profiler output).

If the GC is causing a roadblock for threading, we need to do something about it… simply suggesting to turn off the GC cannot be a good solution, since this may well cause other problems such as exploding RAM usage and all the issues associated with it.

I assume you mean a performance bug in the CPython code. If that’s the case, we should definitely open an issue and try to fix it. I don’t see one of those yet.

It’s the case for the default (non-free-threaded) version of Python as well. In fact, it looks to me like GC takes an even larger fraction of the time for that build. Here is a samply profile for the default build. It takes 4 seconds rather than 2 seconds to do that initial setup work. So, at least on my machine, the free-threaded build is faster at that part.

Comparing the performance of the cyclic GC in the default build vs the free-threaded build is not that straightforward. The free-threaded build doesn’t do incremental collections, only full collections. However, the free-threaded full collection is generally quite a bit faster than the default build’s full collection (partly due to some optimizations I did for 3.14).

In any case, the suggestion to turn off the GC was mostly to make the profile data nicer to look at. That GC collection time is outside of the timed section of code anyhow.

1 Like

Yes, that’s what I meant.

True, but the GC does appear to have a negative effect on how the threads perform.

The flame graph for the free-threaded run shows rather long times for PyCriticalSection_Begin. Could that be related to GC running at the same time?

The flame graph for the GIL run shows that the threads are very often just waiting to run again - which is expected, but it’s still surprising that you are getting different results than @basileMarchand.

Could be due to using time.time() for measurements… that’s not exactly how you should run benchmarks these days :slight_smile: (time.perf_counter() is a better choice).
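
For example (a minimal sketch using the same variable names as the benchmark script):

import time

t0 = time.perf_counter()   # monotonic, high-resolution clock meant for measuring intervals
idx, dist, owner = threaded_closest_point(points, query_point, num_threads)
t1 = time.perf_counter()
print(f"    Elapsed time: {t1 - t0:.5f} seconds")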

In case you are not familiar with the samply UI, you can select a region of time by dragging in the top time ribbon (it shows seconds or milliseconds for me). Click on “Full Range” to see the whole trace. Click on the left hand side where it says “Thread-1”, “Thread-2” to look at a particular thread. The “Call Tree”, “Flame Graph”, etc. show data for the time section and thread you have selected.

I don’t see any GC happening while the workers are running. The sections with PyCriticalSection_Begin are largely due to the lock on the big list being used. Whenever a thread is putting something into or taking something out of that list, it blocks the other threads. The PyCriticalSection_BeginSlow is kind of a sign of bad news, since it means you are waiting a significant time on a lock (rather than just acquiring it immediately or spinning a bit and getting it).

The Py_DecRefShared is sort of bad news too. It happens when reference counts are being manipulated by multiple threads. This is expected behavior due to how biased reference counting works, but it is not optimal if you want the best performance. Ideally, the hot loops of your code are dealing with thread-local data and not data shared with other threads.

The free-threaded GC does a “stop-the-world” pause when it runs. So, no other thread can run when it’s running.

Actually, it’s a bit more complex than that. There is some lock-free access that the list object implements and so it doesn’t always need locking with a mutex. In any case, there is a cost to sharing it and concurrently accessing it.

Thanks everyone for the very insightful (and fast!) feedback :grinning_face_with_smiling_eyes:

As @oscarbenjamin mentioned, this was definitely a toy benchmark — the goal was mainly to explore how --disable-gil behaves in a simple but common use case: multi-threaded, CPU-bound code with shared access to a Python dict for reduction.

I’ll rework it using a map/reduce pattern with thread-local storage as suggested, to better isolate the impact of contention. I really appreciate all your suggestions — thanks again!

That said, one thing I keep thinking about: although we’re still in a shared memory model (in the system sense), it feels like we’re losing some of the benefits of shared memory when using --disable-gil. As soon as we try to share Python objects like list, dict, or even float between threads, the overhead becomes significant — refcounting, fine-grained locks, etc.

So I’m wondering: are there any efforts planned (in CPython) to provide thread-friendly shared data structures in Python? Something a bit like what you’d get in C++ with raw arrays, std::vector, or even OpenMP-style parallelism — where threads can access shared memory directly, but it’s up to the developer to manage synchronization?

Either way, I’ll keep experimenting and will post any interesting follow-up results here!

3 Likes

I know this is a toy problem but really it is too much of a toy problem to be meaningful. I already said that if you really want to do this then you should use NumPy. Your question about “raw arrays” would seem strange to anyone who is familiar with NumPy arrays.

Here is how your benchmark might look if using NumPy:

import threading
import sys
import time
from math import sqrt
import numpy as np


def find_closest_worker(points, query_point, start, end, shared_result, lock, tid):
    local_min_dist = float("inf")
    local_min_idx = -1

    dist = ((points[start:end] - query_point) ** 2).sum(axis=1)
    local_min_idx = np.argmin(dist)
    local_min_dist = dist[local_min_idx]

    with lock:
        if local_min_dist < shared_result['min_dist']:
            shared_result['min_dist'] = local_min_dist
            shared_result['closest_idx'] = local_min_idx + start
            shared_result['owner'] = tid


def threaded_closest_point(points, query_point, num_threads=4):
    n = len(points)
    chunk_size = n // num_threads

    shared_result = {
        'min_dist': float("inf"),
        'closest_idx': -1,
        'owner': -1
    }
    lock = threading.Lock()
    threads = []

    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        t = threading.Thread(target=find_closest_worker,
                             args=(points, query_point, start, end, shared_result, lock, tid))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    return shared_result['closest_idx'], sqrt(shared_result['min_dist']), shared_result['owner']

if __name__ == "__main__":
    N = 10_000_000
    num_threads = int(sys.argv[1])
    points = np.random.rand(N, 3)
    query_point = np.random.rand(3)

    print(f"Launching threaded NN search with {num_threads} threads")
    t0 = time.time()
    idx, dist, owner = threaded_closest_point(points, query_point, num_threads)
    t1 = time.time()

    print(f"    Elapsed time: {t1 - t0:.5f} seconds")

Here is how that scales with threads:

$ python t.py 1
Launching threaded NN search with 1 threads
    Elapsed time: 0.12663 seconds
$ python t.py 2
Launching threaded NN search with 2 threads
    Elapsed time: 0.06742 seconds
$ python t.py 3
Launching threaded NN search with 3 threads
    Elapsed time: 0.05072 seconds
$ python t.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.04004 seconds
$ python t.py 13
Launching threaded NN search with 13 threads
    Elapsed time: 0.02588 seconds

Notice firstly how much faster it is already with only 1 thread, and I have just written the most obvious code here without timing alternatives or attempting any further optimisation.

Then it also scales nicely, at least for a few threads, giving a 3x speed up at 4 threads on this machine. At 13 threads (the number of cores) it is only 5x faster so there are diminishing returns.

I’m not even using the free-threading build for these timings because NumPy already releases the GIL.

4 Likes

At 13 threads (the number of cores) it is only 5x faster so there are diminishing returns.

There’s some randomness in this benchmark, so I wouldn’t put any strong interpretation on any individual result:

goldbaum at Nathans-MBP in ~/Documents/test
○  python test.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.08437 seconds

goldbaum at Nathans-MBP in ~/Documents/test
○  python test.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.05730 seconds

goldbaum at Nathans-MBP in ~/Documents/test
○  python test.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.05192 seconds

goldbaum at Nathans-MBP in ~/Documents/test
○  python test.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.04854 seconds

goldbaum at Nathans-MBP in ~/Documents/test
○  python test.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.04026 seconds

Also I think disk latency from the imports is impacting the first result. After that the disk cache is hot.

Agreed. I did run multiple timings but didn’t show them because the variation was low (much lower than what you showed):

$ python t.py 1
Launching threaded NN search with 1 threads
    Elapsed time: 0.12756 seconds
$ python t.py 1
Launching threaded NN search with 1 threads
    Elapsed time: 0.12756 seconds
$ python t.py 1
Launching threaded NN search with 1 threads
    Elapsed time: 0.12740 seconds
$ python t.py 1
Launching threaded NN search with 1 threads
    Elapsed time: 0.12756 seconds
$ python t.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.03984 seconds
$ python t.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.04057 seconds
$ python t.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.04087 seconds
$ python t.py 4
Launching threaded NN search with 4 threads
    Elapsed time: 0.03994 seconds
$ python t.py 13
Launching threaded NN search with 13 threads
    Elapsed time: 0.02445 seconds
$ python t.py 13
Launching threaded NN search with 13 threads
    Elapsed time: 0.02749 seconds
$ python t.py 13
Launching threaded NN search with 13 threads
    Elapsed time: 0.02630 seconds
$ python t.py 13
Launching threaded NN search with 13 threads
    Elapsed time: 0.02681 seconds
1 Like

Interesting, I’m using the free-threaded build so that might be related :slight_smile:

Or perhaps your system is noisy; or its power saving features are enabled and CPU clock ramp-up is too slow for such a short-running benchmark; or your CPU has asymmetric cores and the benchmark threads are not always scheduled the same way from one run to another; etc.

There are some efforts, yes, but outside CPython.

There are two libraries working on this goal that I know of:

The end goal is to make these data structures part of the standard library.

(the latter one is developed by me, so if you have any questions/feedback feel free to dm me)

1 Like