Add Virtual Threads to Python

malemburg · May 14, 2025, 1:40pm

OS threads normally each get their own stack, so swapping in and out a stack could only happen on the same thread.

I don’t think that’s a problem, though. Each OS thread could have their own event loop managing the async processing on that thread and you could use semaphores/shared state/queues to communicate between those OS threads.

This may sound complicated, but that’s only because threading is hard to get right (with or without GIL)

jamadden · May 14, 2025, 1:59pm

Indeed, that’s how gevent works. Although getting locks to work correctly both within the greenlets of a thread and across threads has been tricky.

In practice, this hasn’t really been an issue; I can only recall it even being discussed once, and that was only recently in the context of freethreading (I suppose its possible moving forward in a freethreading world this may become more important). What has been identified as a practical limitation is the inability to pass IO objects between OS threads. In gevent, each OS thread has its own event loop, and IO objects such as sockets are bound to the event loop of the thread in which they were created and can only be used (by any greenlet) in that thread. But sometimes you might want one thread with one or more greenlets creating sockets (e.g., accept) while a different OS thread with one or more greenlets handles processing them (e.g., handling file uploads/downloads). This is a solvable problem with a little bit of effort.

da-woods · May 14, 2025, 8:10pm

If I’ve got an extension module using non-Python aware locks (e.g. C++ std::mutex) does this introduce exciting new opportunities for deadlocks?

e.g.:

“Thread 1”:
- has GIL
- acquires Lock 1 and GIL simultaneously (which it’s possible to do safely, with a bit of care)
- calls some Python function (thus allowing virtual thread switching to “Thread 2”)
“Thread 2”:
- has GIL
- releases GIL
- Tries to acquire Lock 1 (will block indefinitely because there’s no way to switching back to “Thread 1”).

jamestwebber · May 14, 2025, 8:17pm

I’m confused what’s happening there. Both threads start with the GIL?

da-woods · May 14, 2025, 8:26pm

There was one small typo that I’ve corrected since you commented.

But both “Thread 1” and “Thread 2” are virtual threads here, running on the same real thread. So the “Thread 1” gets itself into a situation where it holds both the GIL and the external lock. It then does something that allows the switch to “Thread 2” (which could be as innocuous as calling a Python function). If “Thread 2” attempts to acquire the external lock then I think it’s now stuck.

The general question of “does this make existing code deadlock?” is more important that the specific contrived example that I’ve tried to show though.

jamadden · May 14, 2025, 8:55pm

This specific example is currently possible with regular OS threads too, isn’t it? When Thread 1 begins executing Python bytecode again (while holding its external lock), the interpreter can choose to drop the GIL whenever it wants (the switch interval) allowing any other thread to pick it up. If Thread 2 picks up the GIL and attempts to acquire the external lock, you deadlock. There’s not a way to tell CPython “no matter what, never release the GIL”.

Perhaps virtual threads make this race condition slightly more likely (I’m not sure that it does), but buggy code is buggy code. Don’t execute arbitrary code while holding locks.

jsbueno · May 14, 2025, 11:30pm

I guess that if the limitation can be avoided by “implement(ing) a (complex) C API to support suspension,” that should be factored in from the start -

Other than that, I like what I am seeing here.

da-woods · May 15, 2025, 11:38am

I don’t think it’d currently be an issue - thread 2 deliberately acquires the external lock without the GIL which avoids what you describe.

I agree with the general point - anything involving multiple locks should make one nervous, especially for the GIL/thread state (which is a lock that you don’t really control).

But I think it’s possible to reason about what I’ve written (especially if you know exactly who has access to the external lock) and the reasoning changes with virtual threads.

pitrou · May 15, 2025, 12:23pm

That would be a rather severe and annoying limitation IMHO. But you also haven’t outlined how the explicit task switches are denoted, so it’s difficult to understand the proposal precisely.

MegaIng · May 15, 2025, 12:41pm

Just to make sure I understand what is meant here: If we have for i, w in enumerate(py_gen_func), then it would not be possible for py_gen_func to yield to a different thread because enumerate is implemented in C? And it would work if enumerate was implemented in python instead?

If yes, I would consider this unusable without C stack switching.

RonnyPfannschmidt · May 17, 2025, 3:32pm

Enumerate wouldn’t create a problem

MegaIng · May 17, 2025, 3:35pm

Why not?

What about everything in itertools? What about functools?

What about third-party similar wrappers?

RonnyPfannschmidt · May 17, 2025, 4:35pm

Anything that adds to the c stack in a permanent fashion is an issue

Iterators usually don’t do that

The problem cases would be more surrounding callbacks from speedup

MegaIng · May 17, 2025, 4:41pm

I don’t understand how enumerate could work otherwise? Are you sure you have a clear understanding of the situation, the proposal and how python is implemented in this area?

I am not talking about enumerate.__init__, which yes, isn’t an issue since it’s not permanent. I am talking about enumerate.__next__ which needs to be called every iteration and sits in the C stack between the for loop and py_gen_func.__next__ (i.e. the function body of the generator). So if py_gen_func wants to yield to a different thread (e.g. because it has to wait for a result anyway and doesn’t want to blockage other works that might exists), that has to go through the C-level implementation of enumerate.__next__, no?

RonnyPfannschmidt · May 17, 2025, 4:45pm

Its called and exited

Its not like enumerate drives the loop, the iteration protocols do

So enumerate doesn’t add something to the c stack while control flow is active only while the call to next happens once every iteration

RonnyPfannschmidt · May 17, 2025, 5:12pm

Ot seems like i misunderstood

If the python function generator is a greenlet that switches then the c stack contribution of enumerate is a problem

If the greenlet is part of the surrounding loop it isn’t

A5rocks · May 18, 2025, 1:11am

I just want to point out that you cannot have the following two properties:

well defined schedule points (for at least my definition of well defined)
no coloring

This is because either your schedule points are “colored” (i.e. marked and need to be called from marked functions) or you can schedule from an unmarked function which is… not well defined.

Static tooling can’t help, for instance:

def f(func: Callable[[], None]) -> None:
  func()  # is this a schedule point?

Async can have poor ergonomics but I think it’s a feasible problem to improve them (at least outside of stdlib, the asyncio stuff below is mostly a pipe dream). For instance:

asyncio could change its API to require that coroutines need to be immediately awaited (asyncio.create_task(blah(x=5)) into asyncio.create_task(functools.partial(blah, x=5)). It would be nice to have a short API for this ) so type checkers can catch any forgotten await
type hints could allow people to propagate changes to a function’s “color” (whether it’s async) to callers for instance through an LSP code action. This is hard to do, but I think totally viable.
offering a sync API to an async library can be improved by adding an analog to anyio’s blocking portal. There’s probably a bunch more Python can do here, like maybe some standard decorator that type checkers can implement?
type checkers could encourage correct code by dropping type narrowing for shared objects after await points

I think ultimately it’s incredibly valuable to know whether a function will do I/O or context switch. The fact that this is dividing an interface into “colors” is an incredibly minor cost to pay in exchange (in fact I think this complaint is silly in general). And the Python ecosystem can develop better tooling for any specific ergonomics issues.

sirosen · May 18, 2025, 5:32am

It sounds like you haven’t been exposed to the difficulty it creates for libraries which want to support sync and async usage with the same APIs.

If you want a good open example, check out the core parser in webargs. Most of the important functions are fully duplicated to satisfy the function color problem. For larger libraries, the complexity cost goes up from here nonlinearly.

Just as we should avoid calling the downsides of async “baggage”^[1], we shouldn’t call other people’s opinions silly. Maybe they have those opinions because they have knowledge or experience which they can share.

Anyway, I don’t think async is irrelevant to this thread, but surely if this proposal is to succeed, it needs to justify itself not as an async replacement, but rather as a good alternative mechanism to add. Focusing too much on details of async may be detrimental.

After all, it’s not like async is going away. Function color is here to stay.

I think the initial “async baggage” comment was meant to be harmless shorthand for “the cost of function color”, but clearly it didn’t read that way to everyone. ↩︎

Tinche · May 18, 2025, 9:33am

I think there are two points where our opinions are misaligned, and those points are the selling points of virtual threads (for me).

I think the cost of coloring on the ecosystem itself is enormous. Every library needs to be written at least twice: any database driver, any ORM, any service interface (AWS S3, Azure Storage, Kubernetes…), any web framework.

But that’s not the #1 issue for me - if function coloring was the best we could do for async, let’s just pay the price and move on.

So this brings me to the second point. Function coloring is great for single-threaded cooperative multitasking. I don’t think it works for non-single-threaded scenarios.

Imagine an event loop driving two different coroutines on two different OS threads in parallel. Both of these coroutines attempts to use a shared HTTP client. How does the client protect its critical section, where it picks connections from a connection pool?

In a single-threaded scenario, it doesn’t even need to - just do the work between suspension points. If the work needs do perform some IO, use an asyncio.Lock.

In a multi-threaded scenario - the suspension points do not help at all since there’s actual parallelism going on. The lock that would need to be used hasn’t even been written yet (it would need to be bock an async and sync lock at the same time).

The idea is that virtual threads would solve this.

Tinche · May 18, 2025, 5:54pm

Somewhat relatedly, I came by this article about Go scheduling recently. I’m not a Go expert so I can’t vouch for whether it’s 100% correct but it did seem legit to me. I also found it very interesting. Obviously Go and Python are very different, especially in performance. What might be an important performance optimization for them might be completely lost in the noise for us.

Since a lot of this discussion was motivated by virtual threads in Java, it’d be very useful if someone could do a write up on how exactly they work with an eye to what the implementation might look like in Python.