I am very curious to learn what the community thinks about asyncio’s place in the upcoming nogil world.
More precisely, I mean the ‘asyncio’ style of network programming involving non-blocking I/O, cooperative green threads (tasks) and colored functions (two flavors of functions [sync and async] and explicit suspension points marked by await); so frameworks like Trio and anyio are also included.
Alternatives would include:
- just using large threadpools with sync I/O. I haven't written Java for a long time, but IIRC they do this by default?
- async programming but with non-colored functions; Go and gevent-style.
Interestingly, there are easy examples of languages supporting async programming with colored functions while having free threading: C# and Rust. So this gives me faith colored functions still make sense in a free-threaded world.
The reason it occurred to me to reconsider colored functions in a nogil world is that this style has two benefits: it's clear where a suspension point is (that will still hold), and between suspension points the world (i.e. state) isn't expected to change (this might not hold anymore). These constraints meant that many operations which would need locks in a different style don't need them here; the critical section can simply be performed between suspension points. This is very natural and efficient.
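To make that second benefit concrete, here is a toy example of my own (not from any real codebase): under a single-threaded event loop, a read-modify-write with no await in the middle is atomic with respect to other tasks, while putting a suspension point inside it invites lost updates.

```python
import asyncio

counter = 0

async def bump_unsafe(n: int) -> None:
    """Read-modify-write with a suspension point in the middle:
    other tasks can run at the await, so updates get lost."""
    global counter
    for _ in range(n):
        current = counter
        await asyncio.sleep(0)   # suspension point: state may change here
        counter = current + 1

async def bump_safe(n: int) -> None:
    """No await between read and write: under a single-threaded
    event loop no other task can interleave, so no lock is needed."""
    global counter
    for _ in range(n):
        counter += 1
        await asyncio.sleep(0)   # suspend only outside the critical section

async def run(worker) -> int:
    global counter
    counter = 0
    await asyncio.gather(*(worker(100) for _ in range(10)))
    return counter

print(asyncio.run(run(bump_safe)))    # 1000: no updates lost
print(asyncio.run(run(bump_unsafe)))  # fewer than 1000: lost updates
```

Under free threading with tasks multiplexed across threads, even the "safe" version would need a lock, which is exactly the property being discussed.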
Ok, and if we as a community decide to stick with asyncio and try adapting it to take advantage of nogil, I see the following strategies available to us:
- continue running asyncio services essentially single-threaded; the main difference is that running CPU-bound things in a thread pool becomes more straightforward (before, you had to use a ProcessPoolExecutor if your workload didn't release the GIL). So just like now, except a little easier.
- run multiple event loops in multiple threads. It's a little more efficient than running multiple processes, since the memory cost of the interpreter and stdlib can be shared. Resources scoped to the process (ports, signals?) become more complex to manage.
- write an event loop that can multiplex active tasks across a number of threads, also known as M:N threading. I think this approach is the most efficient in theory, but using it loses the fundamental asyncio assumption of no state changes between suspension points. I would be surprised if existing asyncio libraries could run without issues under this model, so I'd call it something else (nogil-asyncio?) instead.
- more complex models higher up in the application layer. For example, I could imagine an actor-like framework being able to bridge between the existing asyncio ecosystem and coroutines multiplexed on many threads in parallel. The community could probably come up with innovative stuff.
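As a minimal sketch of the second strategy (the workload and names are made up for illustration): each thread runs a fully independent event loop via asyncio.run, sharing nothing except what you pass in explicitly.

```python
import asyncio
import threading

async def handle_requests(name: str) -> str:
    # Stand-in for real work: each loop runs many cheap concurrent tasks.
    await asyncio.gather(*(asyncio.sleep(0.01) for _ in range(100)))
    return f"{name}: 100 tasks done"

def loop_thread(name: str, results: dict) -> None:
    # Each thread owns a completely independent event loop.
    results[name] = asyncio.run(handle_requests(name))

results: dict = {}
threads = [
    threading.Thread(target=loop_thread, args=(f"loop-{i}", results))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # ['loop-0', 'loop-1', 'loop-2', 'loop-3']
```

Note that nothing here coordinates between the loops; the per-process resource management mentioned above (and cross-loop communication, discussed later in the thread) is exactly what this simple version leaves unsolved.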
Since there are so many options, I'm really curious what other folks interested in this space think makes sense. A part of me is also sure that, our community being so diverse and creative, all of these will probably see the light of day in some form or another.
Threads have overhead. Let's say you want to run a web server where you spawn a task for each incoming request (read the request, process it, send the response); done with a thread pool, the pool size limits your number of concurrent requests. And since it's entirely possible for an attacker to exploit this (open a ton of connections, start requests on all of them, and never finish them), you quickly end up with either a huge thread pool with most threads idle, or requests getting dropped. So asyncio will still have a place there, since it scales to huge numbers of connections far better than threads do.
I would be VERY curious to see whether a nogil Python would allow a hybrid whereby you have a ThreadEventLoopPool with some number of threads, each running an asyncio event loop, and thus able to scale up to vast numbers of tasks (since idle tasks aren't consuming much) while also being able to run multiple actual jobs concurrently (because threads), with minimal overhead for moving data between threads (unlike a process pool). Basically this, but much, much simpler and better abstracted. I personally think that the "event loop that can multiplex active tasks" approach is more clunky than simply having independent event loops on the separate threads, although I'm open to examples showing otherwise.
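A rough sketch of what such a ThreadEventLoopPool could look like (the round-robin dispatch and the whole API are my own guesses, not an existing interface; this assumes submit is called from a single thread):

```python
import asyncio
import itertools
import threading
from concurrent.futures import Future

class ThreadEventLoopPool:
    """Hypothetical pool: N threads, each running its own asyncio
    event loop; coroutines are dispatched round-robin across them."""

    def __init__(self, num_threads: int = 4) -> None:
        self._loops = []
        ready = threading.Barrier(num_threads + 1)
        for _ in range(num_threads):
            threading.Thread(target=self._run, args=(ready,), daemon=True).start()
        ready.wait()  # wait until every thread has created its loop
        self._next_loop = itertools.cycle(list(self._loops))

    def _run(self, ready: threading.Barrier) -> None:
        loop = asyncio.new_event_loop()
        self._loops.append(loop)  # list.append is effectively atomic
        ready.wait()
        loop.run_forever()

    def submit(self, coro) -> Future:
        # The returned concurrent.futures.Future is usable from any thread.
        return asyncio.run_coroutine_threadsafe(coro, next(self._next_loop))

    def shutdown(self) -> None:
        for loop in self._loops:
            loop.call_soon_threadsafe(loop.stop)

async def job(i: int) -> int:
    await asyncio.sleep(0.01)
    return i * 2

pool = ThreadEventLoopPool(4)
futures = [pool.submit(job(i)) for i in range(10)]
print(sorted(f.result() for f in futures))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
pool.shutdown()
```

Each task stays pinned to one loop for its whole life, so the usual asyncio assumptions hold within a task; only data passed between tasks on different loops needs care.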
Yeah, threads + sync I/O are not the best. You'd also lose asyncio's cancellation semantics, which are great. I'm personally not enthused by this approach either, but folks are using it, I guess.
Your tasks will presumably be using a bunch of libraries (like aiohttp, sqlalchemy, httpx, aioredis…) to do their work. All of these libraries maintain connection pools internally, and these pools are bound to an event loop. So if you have N independent event loops, you'll need a connection pool per loop. While it's not the end of the world, it's still pretty bad for a bunch of reasons. For example, you risk having a connection pool starved in one thread while an identical pool has available connections in another thread. Likewise (depending on your library), you risk hogging database resources with unnecessary idle connections. All of this would be alleviated by a pool that's a little bigger but shared between threads.
That said, I mentioned I think none of these libraries are “nogil-asyncio” safe today. But they could be adapted, if that’s what we decide on.
Hmm, that’s fair. I’m not sure whether that could be solved, but it’s probably a good reason to go with the single event loop, yeah. I’m just not sure what the mental model should be here - it’s a bit of a weird hybrid between threads and tasks. Will it end up feeling like “asyncio, but with more concurrency”, which would be great? Or will it be “oh <bleep> there’s a bug that only happens when this half of the task runs on a different thread”?
In any case, I’m excited for the future, and hopeful of being able to put this to some real-world use soon!
The basic trade-off between async and threads is the cost of the threads. Each thread needs its own stack; when there are thousands of threads, that memory adds up to a lot. There is also the cost of context switching between threads, which can use more CPU than async.
And this is why I plan to continue doing async programming even with free threading, as long as the work is I/O-bound and threading doesn't buy me some massive performance win. I find it way easier to reason about async programming than to worry about locks and race conditions.
We never figured out why though, right? I remember Guido had a hunch, but that’s about it. Also presumably @ambv was benchmarking it using the stdlib event loop, which no one who cares about performance uses in prod anyway.
The key problem to be solved is the interaction between tasks. Even if we parallelize asynchronous tasks on different event loops in different threads, they will still depend on each other. But asyncio primitives are not thread-safe and, moreover, do not work with different event loops.
This problem is much more serious than it may seem at first glance. I found at least 21 questions on StackOverflow related to this topic: both communication between event loop and threads (queues) and synchronization (events, locks, semaphores, etc.), which affects not only asyncio but also libraries like trio and gevent. And the solutions usually have a number of drawbacks: they are either not thread-safe in principle, or are not able to work with more than one event loop, or do not work with cancellation and timeouts (e.g., can lead to thread leaks), or simply have very poor performance.
Partly for this reason, I created my own library called aiologic, which supports all the features of asyncio and other libraries, shows good (and sometimes even incredible) performance results, and works successfully in a free-threaded mode. The approach is simple: waiting is delegated to a lightweight, one-time library-specific event, and all primitives are built on top of a queue of such events using effectively atomic operations (such as list’s append() and pop()). This makes my library work well in a model with different event loops in different threads, not all of which necessarily have to be asyncio event loops.
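To make the described approach concrete, here is a heavily simplified sketch of the idea as stated (this is my own illustration, not aiologic's actual code): each waiter enqueues a lightweight one-shot event, and set() drains the queue using effectively-atomic deque operations, so it works regardless of which thread or event loop the waiters live in.

```python
import asyncio
import collections
import threading

class OneShotEvent:
    """Lightweight one-time event, awaitable from async code and
    settable from any thread (a sketch; real implementations need
    care with memory ordering under free threading)."""

    def __init__(self) -> None:
        self._flag = threading.Event()
        self._loop = None
        self._future = None

    def set(self) -> None:
        self._flag.set()
        loop, fut = self._loop, self._future
        if loop is not None:
            # Wake the async waiter on its own loop.
            loop.call_soon_threadsafe(lambda: fut.done() or fut.set_result(None))

    async def wait(self) -> None:
        self._loop = asyncio.get_running_loop()
        self._future = self._loop.create_future()
        if self._flag.is_set():   # set() may already have run
            return
        await self._future

class CrossWorldEvent:
    """Multi-waiter event built on a queue of one-shot events, using
    effectively atomic deque.append()/popleft() instead of a lock."""

    def __init__(self) -> None:
        self._waiters = collections.deque()
        self._is_set = False

    def set(self) -> None:
        self._is_set = True
        while self._waiters:
            try:
                self._waiters.popleft().set()  # effectively atomic
            except IndexError:
                break                          # lost a race; queue is drained

    async def wait(self) -> None:
        if self._is_set:
            return
        waiter = OneShotEvent()
        self._waiters.append(waiter)           # effectively atomic
        if self._is_set:                       # re-check: set() may have drained already
            waiter.set()
        await waiter.wait()

# Two async waiters in one loop, woken by a plain thread.
async def main() -> str:
    ev = CrossWorldEvent()
    threading.Timer(0.05, ev.set).start()
    await asyncio.gather(ev.wait(), ev.wait())
    return "both waiters woken"

print(asyncio.run(main()))  # both waiters woken
```

The key property is that set() never touches any particular event loop directly; it only drains a shared queue, and each dequeued waiter knows how to wake itself on whatever loop (or thread) it belongs to.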
I believe that this approach will allow us to maximize performance, especially if the pool employs work-stealing. As one future scenario, such primitives could be implemented at the C level, moving further to truly atomic operations such as compare-and-swap. The center of the new architecture would be an atomic queue (for distributing tasks and events between threads); in recent years, there have been a number of scientific advances in fast wait-free queues. With the move to io_uring (Linux) and I/O Rings (Windows), we can reduce the number of system calls to a minimum, which will reduce the number of context switches and improve performance even more. We could even reimplement normal blocking calls on top of this architecture and gain thread cancelability. But the question of how relevant this is for Python remains open.
A small bit of clarity: it's not safe to use their async methods from multiple event loops, but some of them (specifically queues, futures, and events), along with coroutine objects that don't hold references to objects that can't be passed, are safe to pass across threads and interact with via the thread-safe methods (e.g. asyncio.run_coroutine_threadsafe). If you pass an asyncio future, you should also not set a result on it from a loop other than the one it was bound to; use loop.call_soon_threadsafe to arrange that, or use asyncio.wrap_future to wrap a concurrent.futures.Future with an asyncio future in each event loop that needs to consume it, passing the concurrent.futures.Future to whatever will be setting the result.
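A toy example of both bridges (the helper start_background_loop is my own, not a stdlib API):

```python
import asyncio
import concurrent.futures
import threading

def start_background_loop() -> asyncio.AbstractEventLoop:
    """Start an event loop in a daemon thread and return it."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop

async def compute(x: int) -> int:
    await asyncio.sleep(0.01)
    return x * x

loop = start_background_loop()

# 1. Submit a coroutine to another thread's loop and get back a
#    concurrent.futures.Future, usable from any thread.
cf = asyncio.run_coroutine_threadsafe(compute(6), loop)
print(cf.result())  # 36

# 2. Bridge the other direction: a plain concurrent.futures.Future,
#    set by a worker thread, awaited inside a loop via wrap_future.
async def consumer(shared: concurrent.futures.Future) -> int:
    return await asyncio.wrap_future(shared)

shared = concurrent.futures.Future()
threading.Thread(target=lambda: shared.set_result(42)).start()
print(asyncio.run(consumer(shared)))  # 42

loop.call_soon_threadsafe(loop.stop)
```

The concurrent.futures.Future is the one object that crosses thread boundaries; each loop only ever touches its own asyncio wrapper around it.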
You can get very good performance out of lock-free algorithms by leveraging these thread-safe methods, and splitting work into segments that don't require synchronization improves both the concurrency and parallelization strategies available and the ergonomics of the code written for them.
Work-stealing is sometimes harmful to performance, and I've found that some performant applications should be aware of how they mix threading and async, choosing the right structure for the right strategy explicitly rather than relying on a general-purpose scheduler to get it right for them.[1] There are well-documented prior cases where general-purpose schedulers like Rust's tokio only work-steal some of the time, in an attempt to avoid the pathological cases.
I've also been working on generalizing some things I've repeatedly written variants of. But in Python, I've so far mostly limited mixing async and threading, at a conceptual level, to having multiple threads (in some cases, such as those heavy on filesystem access, explicit thread pools for the life of the application, which is something io_uring and similar may help improve in the future), with event loops running in the threads that need one and a means to pass jobs and messages between them.
It's more developer work to do this, though, so it's cool to see other people working on a more generalized system, and I'm hopeful that what you're working on with aiologic may become something that significantly improves performance for users.
With that said, a good general-purpose scheduler will beat a bad implementation using multiple good per-thread schedulers, so this isn't a knock against general-purpose async+threading schedulers. ↩︎
So it’s been almost a year and a half since my original post.
I have to admit I’ve actually changed my mind on the topic. I now think function coloring doesn’t make a lot of sense in a free-threaded environment. Function coloring always imposed a pretty heavy burden on the community - all networking libraries had to be written twice, essentially - but in a single-threaded world that cost was worth it, in my view.
This doesn’t change my view on async programming - I still think it’s imperative to have for many real-world production workloads. I just don’t think async/await is worth it, given the new circumstances. Apart from the burden of function coloring, one of the most obvious benefits of async/await - explicit suspension points and the fact that the world doesn’t change in between - doesn’t hold in a free-threaded environment. (It doesn’t even really hold with the GIL if you reach for asyncio.to_thread, but we all squint and organize our code in special ways so it sort-of works in practice.)
I now agree with Armin that it'd be best if we changed course and looked to virtual threads (using M:N scheduling, in a free-threaded context) for async programming. (Goroutines in Go and Java's Project Loom are good examples of this.) This might end up being more complex than it appears at first glance, since, to avoid hidden function coloring (having to write different code for sync and async contexts), we will first need to figure out how to support cancellation on normal OS threads.
I don’t agree that function coloring is a bad thing. I do agree with much of the rest, and think we should be looking to get the other benefits.
I've always written asyncio code as if mutable data structures should be avoided, or held only internally (without handing out references) when the code flow is designed around the potential for threading; and I've been mixing asyncio and explicit threading in production, including with multiple event loops running.
I don't think the explicit yield points were ever quite what we were sold, but I don't think they're unhelpful as they exist either; they're useful for designing various levels of cooperation within a unit of work (though not, on their own, for determining what can hold object references; that needs a lot more thought).
I'd like to see M:N concurrency work out of the box, rather than people needing to build it themselves, but I don't think getting rid of async is the way to do it. Again referencing Rust's tokio: coroutines can be run on any available worker thread (when tokio is configured this way). I think that in Python, making more of asyncio thread-safe and not bound to event loops is what would be necessary for this to work out of the box.
That would likely be a successor to asyncio or otherwise require enabling through configuration though, it’s too involved to change under people’s existing not-quite-correct assumptions. As it is, it requires explicitly having a scheduler (Event loop) per thread participating in this manner, and a way to communicate between them. The building blocks for it are there in asyncio, but the overall way existing asyncio code is written doesn’t always play nicely with that.
I do not think using event loop methods allows us to say anything about the safety of asyncio primitives. We cannot say that, for example, the get_nowait() and put_nowait() methods of the asyncio queue are thread-safe; they actually are not, and can result in hangs, IndexError exceptions, and internal state corruption due to the non-atomicity of the increment operation. Meanwhile, delegating calls to the event loop effectively reduces execution to a single thread and has side effects, such as the inability to track the actual state of the queue and a dependence on the responsiveness of the event loop (which hurts performance).
Furthermore, using event loop methods alone does not tell you how to handle multiple event loops, since the usual case is a single event loop. I understand why everyone likes these methods so much, but they are not as good as they seem. I would say that using run_coroutine_threadsafe() can be even worse than using a lock that blocks the event loop, because in that case you have to wait for a scheduler that may already have a lot of calls queued up due to a very large number of tasks.
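For reference, a small toy example of the delegation pattern being debated here: a worker thread never calls put_nowait() on the queue directly, it asks the owning loop to do the put via call_soon_threadsafe (which is exactly the single-threading and loop-responsiveness dependence described above).

```python
import asyncio
import threading

async def main() -> list:
    loop = asyncio.get_running_loop()
    q = asyncio.Queue()

    def producer() -> None:
        # Calling q.put_nowait directly from this thread would be
        # unsafe; instead, delegate each call to the owning loop.
        for i in range(5):
            loop.call_soon_threadsafe(q.put_nowait, i)
        loop.call_soon_threadsafe(q.put_nowait, -1)  # sentinel

    threading.Thread(target=producer).start()
    items = []
    while (item := await q.get()) != -1:
        items.append(item)
    return items

print(asyncio.run(main()))  # [0, 1, 2, 3, 4]
```

Every item takes a trip through the loop's callback queue, which is where the scheduling latency under heavy task load comes from.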
There are cases where we cannot redesign an application so that it delegates all of its synchronous operations to a thread pool. Usually, these are cases of legacy code and use of third-party sync-only libraries with inversion of control. So we will have to coexist with two independent worlds for a while.
I should have been more specific: yes, those specific methods require using loop.call_soon_threadsafe. The queue itself is safe to pass between threads, but use of it is not, and the parts that are "partially threadsafe" are not guaranteed currently, and as a result require exhaustive re-checking on each Python update, even in patch releases.
This does create suboptimal performance for some uses, and it would be better if there were asyncio-compatible threadsafe queues in the standard library, or at least available in the open source ecosystem.
The status quo can work quite well when you navigate this, but it is filled with more sharp edges than I’m happy about personally. With that said, I’ve been using a mix of asyncio and threading in production since python 3.6 (prior to this, a mix of threading and generators), and can say the gains of doing so are sometimes worth the extra effort. Other times, it’s enough extra legwork that it becomes more pragmatic to scale horizontally with more processes and add a message queue to the architecture.
Well, there is another library of mine, derived from aiologic: culsans, which provides asyncio-compatible thread-safe queues and can handle multiple event loops. Its queue interfaces are fully compatible with the standard queues, so now we can say that such queues are available in the open-source ecosystem. Almost nobody knows these libraries exist yet, so I am actively working on getting the word out.
As for performance, benchmarks are distributed together with the source code. On my hardware, for the case of two-way communication between synchronous and asynchronous code, culsans is 1.5 times faster than the naive solution (via event loop methods) on Python 3.13.