Picking the number of workers in asyncio


I’m trying to compare sync and async approaches for a blog post. I’ve written a synthetic benchmark function:

from asyncio import sleep
from starlette.responses import JSONResponse

def fibb(n):
    # naive recursive Fibonacci, used as a CPU-bound stand-in
    return n if n < 2 else fibb(n - 1) + fibb(n - 2)

async def realistic_workload(_):
    await sleep(0.01)
    fibb(23)  # Around 5.86 ms
    await sleep(0.02)
    await sleep(0.01)
    fibb(20)  # Around 1.36 ms
    await sleep(0.01)
    # All together: 10 + 5.8 + 30 + 1.3 + 10 ~= 57 ms
    return JSONResponse({"work": "done"})

(and a synchronous version) and I’m trying to compare some runtime characteristics of sync (Gunicorn+Flask) and async (Uvicorn/Starlette) approaches.
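For completeness, the synchronous counterpart can be sketched like this (a plain function, with `fibb` assumed to be a naive recursive Fibonacci; in the Flask app it would be wrapped in a view returning `jsonify`):

```python
import time

def fibb(n):
    # naive recursive Fibonacci, assumed to match the async benchmark's helper
    return n if n < 2 else fibb(n - 1) + fibb(n - 2)

def realistic_workload_sync(_):
    # Same mix of work as the async version, but the sleeps
    # block the whole worker instead of yielding to an event loop.
    time.sleep(0.01)
    fibb(23)  # Around 5.86 ms
    time.sleep(0.02)
    time.sleep(0.01)
    fibb(20)  # Around 1.36 ms
    time.sleep(0.01)
    return {"work": "done"}
```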

I’m dedicating 4 logical cores of my 12-core machine to the server and benchmarking with the ab utility. For the sync approach I was well aware I’d need to run a lot of workers, but I was surprised that I got the best results running 8 asyncio workers instead of the expected 4. Both the stdlib asyncio loop and uvloop give the same results.

Does anyone have any idea why I need 8 processes to saturate 4 cores? Feels like there’s something blocking in the asyncio workers.

How are you setting the number of workers? Using Gunicorn?

Yep. Well, uvicorn ... --workers x, but it boils down to the same thing I suppose.

During my research into this I thought maybe there was some sort of DNS lookup being done when connections were accepted (the exact system calls involved in handling connections are a little too low-level for me), so I went digging into how uvloop (and libuv) does DNS. It seems to do lookups synchronously, on a thread pool, and it turns out you can set an environment variable (UV_THREADPOOL_SIZE) to control the size of that pool. I also had to hack uvicorn, since it uses fork under the hood, and fork doesn’t preserve threads, so the forked workers had no thread pools. But in the end this turned out to be a red herring: even running 4 worker processes with 128 DNS threads each didn’t saturate my cores.
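The fork issue is easy to demonstrate in isolation. A POSIX-only sketch (unrelated to uvicorn itself) showing that a child process keeps only the thread that called fork():

```python
import os
import threading
import time

# Start a background thread in the parent.
threading.Thread(target=lambda: time.sleep(5), daemon=True).start()
parent_threads = threading.active_count()   # 2: main + helper

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # The child keeps only the thread that called fork();
    # the helper thread simply doesn't exist here, so the count is 1.
    os.write(w, str(threading.active_count()).encode())
    os._exit(0)

os.close(w)
child_threads = int(os.read(r, 16))
os.waitpid(pid, 0)
print(parent_threads, child_threads)
```

Any thread pool created before the fork suffers the same fate, which is why the forked uvicorn workers started with empty pools.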

Starlette recently changed its approach to thread pools: starlette · PyPI

Mind elaborating on this?

It uses the Trio-style thread pool approach, where threads keep themselves alive for 10 seconds after finishing a task, and user code can create as many threads as needed. User code must limit its own threads using an anyio.Semaphore.

Ah, I think you misunderstood. This has nothing to do with threads an asyncio process might use to run blocking operations, but rather the number of asyncio processes needed to saturate a number of cores. The example I gave doesn’t use threadpools at all.

Ah, I see. I’ve always used 2x+1 processes, where x is the number of hyperthreads.


Yeah, that documentation seems to pertain to running sync IO. Also, it’s not very good (it claims there’s no such thing as too many workers?). If you port the example I gave to e.g. Flask, you’ll need around 60 to 70 (IIRC) worker processes to saturate 4 cores, which is definitely more than (2x4)+1.
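A back-of-envelope calculation (using the timings from the benchmark function above, and ignoring all framework overhead) suggests why the sync count has to be so large:

```python
# Per-request timings taken from the benchmark function
cpu_ms = 5.86 + 1.36                   # fibb(23) + fibb(20)
total_ms = cpu_ms + 10 + 20 + 10 + 10  # CPU + the four sleeps, ~57 ms

# A sync worker holds one request at a time, so it only burns CPU
# for a fraction of its wall-clock time:
cpu_fraction = cpu_ms / total_ms       # roughly 0.13 of a core

# Workers needed to keep 4 cores busy, in theory:
workers = 4 / cpu_fraction             # roughly 32
```

That’s only a lower bound: per-request overhead (parsing, serialization, scheduling) pushes the observed number toward the 60-70 range.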

Even if we take the 2x+1 number for asyncio, the main question is: why? Why not just 1x?
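For what it’s worth, the same back-of-envelope arithmetic says 1x should be enough for asyncio: under load an event loop overlaps all the sleeps across concurrent requests, leaving each worker limited only by its CPU work. A sketch of that reasoning, using the timings from the benchmark function:

```python
cpu_ms = 5.86 + 1.36              # CPU burned per request (the fibb calls)

# With the sleeps overlapped across requests, one worker can in theory
# run CPU work back-to-back and saturate a whole core at:
rps_per_worker = 1000 / cpu_ms    # ~138 requests/s
rps_for_4_cores = 4 * rps_per_worker

# If 8 workers are needed in practice, each one is only ~50% CPU-bound,
# which hints at something blocking somewhere in the stack.
```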