Faster ThreadPoolExecutor submission

With free-threading becoming ready, there is going to be more and more interest in ThreadPoolExecutor.

However, it seems the default (Python-based) implementation's submission speed could be improved, likely with custom C code.

Here is a simple Python script that highlights the issue:


from concurrent.futures import ThreadPoolExecutor
from cymade.threadpool import ThreadPool
import gc
import time

gc.disable()  # keep garbage collection out of the measurements

pool = ThreadPoolExecutor(max_workers=1)

def f():
    def g():
        e = 0
        for i in range(1000):
            e += 1
        return e
    a = time.monotonic()
    for _ in range(1000000):
        fut = pool.submit(g)
    b = time.monotonic()
    fut.result()  # wait on the last submitted future only
    c = time.monotonic()
    print(b-a, c-b)

f()

pool.shutdown()
pool = ThreadPoolExecutor(max_workers=4)

f()

pool.shutdown()
pool = ThreadPool(max_workers=1)

f()

pool.shutdown()
pool = ThreadPool(max_workers=4)

f()

Here, cymade’s ThreadPool is a Cython-made ThreadPoolExecutor replacement that can be found at cymade/cymade/threadpool.pyx at main · axeldavy/cymade · GitHub.

The code, run on a CPython 3.13 free-threaded build (with PYTHON_GIL=0), prints on my computer:

6.369051308000053 46.233525454999835
15.782200638999711 16.54086810300032
0.3838555649999762 44.35501626699988
0.9758839780006383 25.223620621998634

The first column is the time to submit one million functions to the thread pool. The second column is approximately the remaining time after submission for the work to finish (note that the total time spent executing the work is longer, since the workers start running tasks before the second timestamp is taken).
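As a side note on the methodology, the second timestamp only waits on the last submitted future. A variant of the benchmark (not the original code) could keep every future and use concurrent.futures.wait to measure the full drain time:

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time

def g():
    e = 0
    for _ in range(1000):
        e += 1
    return e

pool = ThreadPoolExecutor(max_workers=4)
a = time.monotonic()
futures = [pool.submit(g) for _ in range(100_000)]
b = time.monotonic()
wait(futures)  # block until every future finishes, not just the last one
c = time.monotonic()
print(f"submit: {b - a:.3f}s  drain: {c - b:.3f}s")
pool.shutdown()
```

Keeping a list of one million futures adds memory overhead of its own, which is presumably why the original benchmark avoids it.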

The first row is with the default thread pool and max_workers=1, the second with max_workers=4. The third and fourth rows show the same measurements with the custom Cython-made thread pool (which is there mainly to demonstrate the issue and should not be considered a solution).

The results seem to indicate significant per-submission overhead in the default ThreadPoolExecutor. The cost of running the function g is rather small, but not zero either; yet merely submitting the function can cost on the same order as running it. In my own usage of ThreadPoolExecutor I sometimes submit even smaller functions. I believe it would be worthwhile for CPython to improve the speed of this path.
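For very small functions like g, one common workaround today (not part of the benchmark above, and no substitute for fixing the overhead itself) is to batch many calls into a single submitted task, amortizing the per-submit cost. A minimal sketch, where run_batch is a hypothetical helper introduced here for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def g():
    e = 0
    for _ in range(1000):
        e += 1
    return e

def run_batch(n):
    # Run g() n times inside one task, amortizing the per-submit overhead.
    return [g() for _ in range(n)]

pool = ThreadPoolExecutor(max_workers=4)
# 100 submits of 10_000 calls each, instead of 1_000_000 individual submits.
futures = [pool.submit(run_batch, 10_000) for _ in range(100)]
results = [r for fut in futures for r in fut.result()]
pool.shutdown()
```

This trades submission overhead for coarser scheduling granularity, so it only helps when the tasks are uniform and independent.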

I assume that if work were to be done on this topic, it would be a custom C implementation in CPython. Would contributions on this topic be accepted? Or, if you think it is a hard contribution, is there interest from someone more experienced?


I think most of the overhead you’re seeing here is just from 3.13t being slow. Several workarounds had to be implemented to prepare free-threading in time for the 3.13 release, which negatively impacted performance. It’s a lot faster on 3.14, as these workarounds have been fixed:

2.856964466000136 21.78334188300005
3.5243622540001525 3.2805891289999636

I do agree that a C version would still be faster, but it would also probably be too much of a maintenance burden to be feasible. ThreadPoolExecutor (and its base class, Executor) are quite complicated and non-trivial to do in C.
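Before reaching for C, it may be worth seeing where the pure-Python submission time actually goes. A rough profiling sketch over submit() (the hot spots will vary by version and build, so treat the output as indicative only):

```python
import cProfile
import io
import pstats
from concurrent.futures import ThreadPoolExecutor

def g():
    return sum(range(100))

pool = ThreadPoolExecutor(max_workers=1)

pr = cProfile.Profile()
pr.enable()
for _ in range(50_000):
    fut = pool.submit(g)
pr.disable()

fut.result()
pool.shutdown()

# Print the 15 most expensive calls by cumulative time; submit() and
# its callees (Future creation, lock handling, queue operations) show up here.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(15)
print(s.getvalue())
```

cProfile only instruments the submitting thread, so this isolates the submission-side bookkeeping from the worker-side execution.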

I understand this viewpoint, but I do not fully agree. In my view, Python is great as a conductor, orchestrating many calls. The topic of reducing the overhead of work submission, whether for thread pools or asyncio, will likely come up again. Maybe it is too early.

Thanks for the reply.