Add Virtual Threads to Python

Why would that matter to a user of the threading implementation? This seems to me like an implementation detail. Maybe I’m missing something?

Enough to matter? Especially when using higher level abstractions like thread pools?

This is the major downside, agreed. I’m trying to establish whether there are enough benefits to outweigh it. So far I don’t see them :slightly_frowning_face:

And this is another major downside. But when you say “not preemptive”, how does that align with virtual threads not needing function colouring? I feel like you’ve said that the downsides of both threading and asyncio apply to virtual threads, which really isn’t helping the case for them :slightly_frowning_face:

So all downsides, no (realistic) benefits?

Micropython isn’t really a consideration here - they are not a full Python implementation (for a start, they don’t implement threads!) Maybe micropython could implement the threading module via virtual threads, and document the limitations. I don’t see that as relevant to core Python, though.

I don’t see the trade-off here. You lose a lot of key functionality, and gain a bit lower resource usage. Maybe the functionality loss isn’t important to you (although that seems highly unlikely, especially for library code), but I’m still struggling to see “saves a bit of memory” as a game-changer…

I agree, but as I noted above, Micropython is a separate case (and they can adopt virtual threads if it suits their needs). Maybe Python on Raspberry Pi is in a similar situation. But I still think that “use concurrent.futures to share work via a thread pool” is the right answer when you have limited resources and need threads, and I haven’t seen anything that convinces me otherwise.
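To make the "share work via a thread pool" suggestion concrete, here is a minimal sketch. The `handle` function and the job list are made up for illustration; only `ThreadPoolExecutor` and its `map` method come from the stdlib.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(job):
    """Hypothetical unit of work; stands in for an IO-bound request."""
    return job * job


def run_jobs(jobs, max_workers=4):
    # A small, fixed number of OS threads is shared by all jobs, so the
    # memory cost is bounded no matter how many jobs are submitted.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(handle, jobs))


results = run_jobs(range(10))
```

The pool size caps resource usage, which is the property being compared against "one lightweight virtual thread per task".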


It feels like this thread is doomed to go around in circles forever. One must imagine posters in here happy.

I think a great starting point would be reading the JEP that introduced virtual threads into Java; I found it quite approachable. The case for virtual threads is made there quite emphatically.

Now, Java is not Python, and Python already has a framework for asynchronous programming.

As a proponent of virtual threads, one of my starting assumptions is that function coloring sucks. I would love it if it didn’t exist. I consider its existence a large burden on the community and ecosystem. Virtual threads are, then, a way of doing async programming without function coloring. Yes, one of the consequences is that you don’t see where context switches can occur. So what? You don’t see them in Go or Java (with or without virtual threads) and folks seem to program using these languages fine.

You may disagree and love function coloring and explicit context switches. That’s a perfectly reasonable stance to have. But, if you want to understand the appeal of virtual threads over asyncio, you need to keep in mind that a major selling point is the idea of writing code that can run either async or sync. If you don’t see the appeal, then yes, I imagine the case for virtual threads would appear weak.

As for how virtual threads would work in practice: probably just like asyncio except without function coloring, and some additional freedom for the runtime to use multiple platform threads if configured to do so.


So tell me, how do Go and Java handle native code and the possibility of context switches? You keep pointing to Java as if “it works in Java therefore it is a no-brainer in Python” is a clinching argument. You have NEVER explained how this would work in Python.

It is trivially easy to eliminate function colouring. All you have to do is pick one and make all functions that colour. It’s just that this has consequences:

  1. If all functions are blue, you have threaded code, and you need preemption everywhere in order to have concurrency. This means, in turn, that you need locks everywhere. Python is much maligned for having a Global Interpreter Lock, but one of the consequences is that you don’t have to have coloured functions - and thanks to the GIL, single-threaded code isn’t paying an insane price for this. One of the reasons it’s taken this long to remove the GIL is that this has some extremely far-reaching consequences.
  2. Alternatively, if all functions are red, you have a concurrency model like JavaScript’s - there are simply no blocking functions whatsoever. You cannot time.sleep() in JS, instead you request that your function be called after some amount of time. This avoids function colouring, to be sure, but at the price of making everything harder to use.

You can hate on the distinction all you like, but that’s not really the problem. The problem is that, fundamentally, you can’t have all the benefits of both simultaneously.
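For anyone who hasn't run into the distinction, this is roughly what the two colours look like in today's Python. It's only a sketch, the function names are invented, and I'm using "blue" for ordinary functions and "red" for `async def` functions as in the list above.

```python
import asyncio
import time


def blocking_fetch(delay):
    """A 'blue' function: callable from anywhere, but blocks its thread."""
    time.sleep(delay)
    return "blue result"


async def async_fetch(delay):
    """A 'red' function: suspends at the await so other tasks can run."""
    await asyncio.sleep(delay)
    return "red result"


async def red_caller():
    # Red code can await red code, and can also call blue code directly
    # (at the cost of stalling the event loop while the blue call blocks).
    return await async_fetch(0.01), blocking_fetch(0.01)


def blue_caller():
    # Blue code cannot await. Calling a red function only builds a
    # coroutine object; nothing runs yet.
    coro = async_fetch(0.01)
    is_coroutine = asyncio.iscoroutine(coro)
    coro.close()  # discard it so Python doesn't warn "never awaited"
    # The only way in from blue code is to start an event loop.
    return is_coroutine, asyncio.run(async_fetch(0.01))
```

The asymmetry between `red_caller` and `blue_caller` is the whole complaint: one direction is free, the other needs an event loop.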

THAT IS NOT THAT SIMPLE. Sorry. We’ve been saying this a few times, but it really doesn’t work that way.

Go implement it exactly like you’re saying, then see how many different things start breaking. Start by trying to get some of the most popular GIL-releasing libraries to work (eg numpy). See how they behave with virtual threads, then see whether you can just pretend that it’s exactly asyncio.

You keep handwaving away the problems. I think Paul Moore said it best:

Exactly that.

Thanks. I have only briefly skimmed that document (I’ll find time to read it more fully later) but I get the basic idea that the key benefit of virtual threads over OS threads is that they are by design so lightweight that pooling them is unnecessary. And I can see the benefit of that.

But I didn’t see anything in the JEP that addresses the problem Python has with C extensions. In Python, code that calls C which calls back into Python is (relatively) normal, and that’s the big sticking point for virtual threads in Python. The JEP never seemed to mention that scenario at all (unless I missed it). Maybe that’s because it’s rare enough in Java that it can be discounted, but I don’t think that holds true for Python.

So I’m still left with the question - is there a way to make virtual threads work in Python without limiting them to never calling native extensions (or never passing Python callbacks to native extensions)? Even if Java has managed to handle that, I imagine that Java’s API for writing C extensions is sufficiently different from Python’s that it’s not a foregone conclusion that their approach will work for us.

One of the “essential” features of virtual threads that @vitaly.krug stated was handling extensions:

If someone can clarify precisely how this will be done, without relying on costly stack copying[1] or difficult to maintain platform-specific assembler, then maybe we can move forward.

Or do we have to accept virtual threads as just another approach on the spectrum of tradeoffs, where we get lightweight tasks without colouring, but only for pure Python code? I don’t know how I feel about that tradeoff, but I do think we have to establish whether that is the tradeoff we’re talking about, otherwise this thread really will never go anywhere.


  1. which destroys the “lightweight” aspect of virtual threads ↩︎


Maybe the key to a feasible approach in Python is exactly in having a sort of “third function color”: limiting the functions that can be called back into Python from native code when running in a virtual thread.

I’m not saying that would be easy to code: I could use a 3rd party lib which doesn’t use C today but is optimized with C tomorrow, and my callback would break. There would be hard things in this approach.

But if the callbacks could somehow be limited (for example, by having the virtual thread they are running in “locked” to an OS thread), maybe it is feasible.

Got it. I don’t know if this can be solved but I’m also not sure if I’ve ever encountered this. Can you give me an example of a library that does this and works with asyncio?

This is how I would like virtual threads to work in practice:

import virtual_threading  # proposed module; it does not exist today
import time

def task():
    time.sleep(1)  # would suspend only this virtual thread, not the OS thread
    print("Done")

virtual_threading.Thread(target=task).start()

(While suggesting Java is helpful, showing an example in code can make the idea much clearer and easier to understand.)

They would be a drop-in replacement for Python threads. I don’t see a need for yet another concurrency framework.
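For comparison, here is the same shape of program with the `threading` module that exists today. This is just a sketch to show what "drop-in" would mean; the `finished` list replaces the `print` so the result can be checked, and the sleep is shortened.

```python
import threading
import time

finished = []


def task():
    time.sleep(0.01)  # today this blocks one whole OS thread
    finished.append("Done")


worker = threading.Thread(target=task)
worker.start()
worker.join()  # wait for the thread so the result is visible here
```

Under the proposal only the module name in the `Thread(...)` line would change; whether `time.sleep` can really be made to suspend just the virtual thread is the open question in this discussion.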

The issue is that the task function must be thread-safe, unless you plan to use manual synchronization everywhere. Given the lack of thread-safe library code in the wild (largely due to the GIL) the introduction of free-threading in Python would create a third requirement for library authors: providing a thread-safe API.
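As a reminder of what "manual synchronization" looks like, here is a minimal sketch with a shared counter. The `Counter` class and `run` helper are invented for the example; the point is only that every shared mutation needs the lock.

```python
import threading


class Counter:
    """Shared state that is only safe to update while holding the lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "self.value += 1" is a read-modify-write sequence, so two
        # threads could interleave inside it without the lock.
        with self._lock:
            self.value += 1


def run(n_threads=4, per_thread=1000):
    counter = Counter()

    def task():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=task) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run()
```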

If virtual threads were introduced for only a single OS thread, it could create another GIL-like era for libraries and code, leading to implementations that only appear to be thread-safe.


The truth is that we can’t do that; we would have to support all three: blocking code, asyncio, and threads (OS or virtual). I would expect virtual threads to have performance similar to asyncio, but that won’t make asyncio obsolete, much like how asyncio didn’t make generators obsolete.

Function coloring isn’t too bad in practice. The two main problems with function coloring are:

  1. Most of the stdlib and many 3rd-party libraries use blocking IO, and they don’t provide async variants for some common APIs like open().
  2. 3rd-party libraries have to implement these 2 variants, or choose to implement sans-io style state machines.

Virtual threads have the same first problem: we have to implement APIs that don’t block the OS thread in the stdlib (similar to what Go does), since we don’t want an IO operation to block its OS thread.
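For reference, the usual asyncio workaround for the first problem today is to push the blocking call onto a worker thread. A minimal sketch, assuming Python 3.9+ for `asyncio.to_thread`; the helper names and the temporary file are made up for the example.

```python
import asyncio
import os
import tempfile


def read_file(path):
    """Ordinary blocking read; open() has no async variant in the stdlib."""
    with open(path, "rb") as f:
        return f.read()


async def read_file_async(path):
    # Run the blocking function in a worker thread so the event loop
    # stays free to schedule other tasks while the read is in progress.
    return await asyncio.to_thread(read_file, path)


# Small demonstration on a throwaway file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")
data = asyncio.run(read_file_async(demo_path))
os.remove(demo_path)
```

This costs an OS thread per in-flight call, which is exactly the cost a non-blocking stdlib implementation would avoid.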

The second problem isn’t too scary. Libraries can just provide async APIs, and users can simply call them with asyncio.run() from sync functions. Even if asyncio’s event loop is too heavy for this use case, I think it can be solved by a minimal library that implements a subset of asyncio’s API, wraps blocking IO in async functions, and doesn’t support cancellation or concurrency. Then libraries that need IO but don’t need cancellation or concurrency can be written once and called from both sync and async functions.
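A rough sketch of what such a minimal library could look like. Everything here (`run_sync`, `read_bytes`, `count_lines`) is hypothetical and invented for illustration; it is not an existing package. The idea is that a coroutine which never really suspends can be driven to completion with a single `send(None)`.

```python
import asyncio
import os
import tempfile


def run_sync(coro):
    """Drive a coroutine that never suspends; no event loop involved."""
    try:
        coro.send(None)
    except StopIteration as stop:
        return stop.value  # the coroutine's return value
    # Reaching this point means the coroutine awaited something that
    # needs a real scheduler, which this toy runner does not provide.
    coro.close()
    raise RuntimeError("coroutine suspended; use a real event loop")


async def read_bytes(path):
    """'Async' wrapper that simply performs the blocking read inline."""
    with open(path, "rb") as f:
        return f.read()


async def count_lines(path):
    """Library code written once, in async style."""
    data = await read_bytes(path)
    return data.count(b"\n")


fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\ntwo\n")
from_sync = run_sync(count_lines(demo_path))      # called from sync code
from_async = asyncio.run(count_lines(demo_path))  # same code under asyncio
os.remove(demo_path)
```

The trade-off is visible in `run_sync`: anything that genuinely suspends is rejected, which is why cancellation and concurrency are out of scope.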

Quick answer: treat the C stack like a malloc/free heap, using setjmp() to place an entry point between each heap block.

More detail: initially, have 2 functions, say i_am_a_block() and use_some_stack(). i_am_a_block() has a Coroutine local variable and a setjmp(). The Coroutine is registered with the system to record the block. The setjmp() return value says what to do: 0 (the 1st call to setjmp()) = return to the system (using longjmp()); 1 = allocate some more stack; 2 = split your stack chunk; 3 = run your coroutine. ‘Allocate more stack’ calls use_some_stack(), which uses alloca() to consume some stack space and then calls i_am_a_block(). The 3rd notable function is the yield function. This uses setjmp() to note the return location and longjmp() to jump to the next runnable Coroutine. Its 3 setjmp() actions are: 0 = do the yield; 1 = finish this chunk and create a new chunk using use_some_stack(); 2 = return from yield.

No assembler, no stack copying, standard C functions only. See https://svnplace.com/artoflibs/ccoroutines for the coroutine implementation on its own, and https://github.com/JonathanRoach/cpython-await-anywhere-record for a use case that allows python→C→python→C→python→await future.

I don’t know what “works with asyncio” means, and even if I did, wouldn’t “works with asyncio” just be another function colour?

But ignoring that, the usual example is tkinter callbacks.

So let me reiterate my earlier question. If this “just works”, why not simply replace the existing threading implementation with it?

Assuming that’s not a reasonable thing to do, then why isn’t it? Because the reasons it’s not are precisely the downsides of virtual threads that I’ve been trying to get people to explain to me all this time. Once we have an honest explanation of the downsides, we’ll be a lot closer to the point where we can judge the trade-offs and decide whether virtual threads are worth pursuing.


This is not guaranteed to work on all platforms. What platforms have you tested this on?


My reasoning is as follows: I’ve been using asyncio for almost a decade now; and I mean serious work, not toy systems. If this hasn’t been an issue for me up until now, I’m left wondering:

  • does asyncio have the same problem? If so, the problem belongs to the class “doesn’t work with user-mode stacks”, not “doesn’t work with virtual threads”
  • for my use cases, is the problem essentially a red herring? (since it’s not a point of friction or I would have felt the friction)

If this reasoning is sound then this problem wouldn’t be a blocker for virtual threads.

I’ve never used tkinter, but from skimming the docs I see it already has a complicated threading story. We could do a couple things here:

  • detect when tkinter is started from a virtual thread and run its own event loop in a separate thread, bridging it with the virtual thread loop. Not sure how difficult this would be, but it doesn’t seem impossible.
  • document that it doesn’t work with virtual threads. I had Claude put together a “hello world” tkinter app and it looks to me that it doesn’t work with async callbacks either, but this is also not documented.

To conclude, it’s very likely we won’t ever get to 100% of Python code running correctly in both platform thread and virtual thread contexts. (This cannot even be a goal, since then platform threads and virtual threads would be identical, defeating the purpose of having both.) Maybe some modules and libraries will need to be documented to be incompatible or require adapters or whatever. However, at least there’s a path towards getting this percentage very high as time goes on. The function coloring of asyncio doesn’t have this path.

My view is that it’s easy to describe the rules for asyncio (because of function colouring) - tkinter functions that take callbacks aren’t async, so you can’t await them, and everything blocks. On the other hand, for virtual threads there’s no colour, so what’s the rule? Functions will not block unless they do (probably because there’s some sort of C function behind the scenes)?

I’m not a heavy user of asyncio, so please take my opinions with that in mind. But I do want virtual threads (which I might well use, as an alternative to OS threads) to be easy to understand and reason about - and this “sometimes they don’t work as expected” behaviour bothers me.

This issue isn’t limited to tkinter - and in fact it was one of the first questions raised about this idea, 9 months ago, and it’s still not been properly answered. Other examples are (using just the stdlib) map(my_python_func, iterable), enumerate(python_iterable) and an example using 3rd party code posted last May: scipy.optimize.minimize(python_func, x0=0.5).

Ugh, map not working would be terrible. I guess I need to learn more about the issue to have an informed opinion.


To be 100% clear here, I don’t actually know myself whether map would work. But that’s sort of the point - people are talking about not being able to handle cases where “the C and Python stacks are intermixed”[1] and I don’t know precisely what that means - but in my model, map mixes the stacks, because you call map from Python code, and map is implemented in C, but then calls back to the user’s Python code (which could itself call C, etc).
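Here is a small sketch of what I mean by map mixing the stacks, as I understand CPython's behaviour. The function names are invented. The callback records which Python frames are live when it runs.

```python
import inspect

seen = []


def callback(x):
    # Record the names of the Python frames that are live right now,
    # innermost first. context=0 skips reading source lines.
    seen.append([frame.function for frame in inspect.stack(0)])
    return x


def caller():
    # map is implemented in C; consuming it with list() makes the C code
    # call back into the Python function for each item.
    return list(map(callback, [1]))


caller()
python_frames = seen[0][:2]
```

The Python-level view shows `callback` sitting directly on top of `caller`, but on the C stack the frames for `list` and `map` sit between them. If a virtual thread wanted to suspend inside `callback`, those C frames would have to be preserved somehow, and that is the part nobody has explained.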

@JonathanRoach has suggested an implementation which he claims doesn’t have this limitation. I don’t have enough expertise to judge whether that’s true or not, but it does rely on certain semantics of setjmp and longjmp, which as far as I know are not mandated by the C standard, and so might be non-portable, or difficult to maintain.

Basically, there are a lot of claims being made here, but no-one seems to have a good enough understanding of the whole problem space to say definitively that “this is going to work”. As a result, we’re waiting for someone to have the time and confidence to produce a working prototype, so that we can try things out directly, rather than speculating.


  1. Don’t cross the streams!!! ↩︎


The C Coroutine library is not the whole story, but does open a way through. It was written to make assumptions cpython already makes (a descending, unbroken, stack where the stack limits are known, and alloca() (handy, but could be avoided)), plus widely available standard C functions (setjmp() and longjmp()) used only in a way the C standard allows (eg switch (setjmp()) is allowed, but int a = setjmp() is not). The goal was to minimise cross-platform gotchas. (caveat: I really need to do some testing on, at least, Windows and Linux).

In my cpython variant I’ve used C Coroutines to manage the stack, plus reengineered asyncio to switch C stack as necessary, knowing it’s in a C Coroutine context. This (and a whole load more work to make sure everything works as it did before) allows the python user to use await myfunc() with async def myfunc() or myfunc() with def myfunc() interchangeably, and to go via C (eg by using any() with a generator which happens to await at some point). This on its own is a big win (I tried, I really tried to use asyncio in a Django web server - the two-colour-function-ness is practically insurmountable in that context).
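To show the any() case in stock Python (not my variant): as far as I know, an await inside a generator expression turns it into an asynchronous generator, which the C-implemented any() can't iterate. A small sketch, with invented function names:

```python
import asyncio


async def is_ready(n):
    await asyncio.sleep(0)  # stands in for real async work
    return n > 2


async def any_ready_via_builtin(items):
    # The await makes this an *asynchronous* generator expression,
    # and any() only knows how to consume ordinary iterables.
    return any(await is_ready(n) for n in items)


async def any_ready_explicit(items):
    # The workaround in stock Python: spell the loop out by hand.
    for n in items:
        if await is_ready(n):
            return True
    return False


def builtin_any_fails():
    try:
        asyncio.run(any_ready_via_builtin([1, 2, 3]))
    except TypeError:
        return True
    return False
```

So in stock Python you end up rewriting the builtin as an explicit loop; the variant is meant to let the first form work as written.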

Function colourless asyncio.Tasks could be extended to be VirtualThread equivalents. I’m not totally on-board with the idea, but I can see a path to achieving it. My biggest wariness is that the direct connection between Threads and OS threads would be lost.

Working prototype in the await-anywhere branch. Please download, build, try, poke holes, send feedback. It only took 9 months to make :wink:. This isn’t a ready-to-pull-request version (too much time has gone by, and other reasons), but does pass all cpython’s regression tests.


What platforms have you tested this on?

I can almost guarantee you have encountered this because it is extremely common. If you have an extension module that calls PyArg_ParseTuple() to parse arguments into C types (ints, for instance), that can call arbitrary Python code if the object passed in is an instance of a class defined in Python. If you have an extension module that calls PyDict_GetItem(), that can execute arbitrary Python code if the key you pass in or any keys already in the dict are defined in Python and have __hash__ or __eq__ methods. If you DECREF an object (for example, after replacing a stored reference with something else), that can call arbitrary Python code. You can even end up calling Python code when allocating one of your own, tightly controlled types that definitely take care never to do anything to call Python code, when you happen to trigger a GC run or the eval breaker. And any of those arbitrary Python calls can lead to calls into extension modules, which can lead to more Python calls. Re-entrancy is very common and pretty fundamental in CPython.
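You can see the same re-entrancy from pure Python, using builtins in place of an extension module. This is only an illustration with made-up classes; the `__del__` step relies on CPython's reference counting.

```python
events = []


class Key:
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        events.append("__hash__")
        return hash(self.name)

    def __eq__(self, other):
        events.append("__eq__")
        return isinstance(other, Key) and self.name == other.name


class Index:
    def __index__(self):
        events.append("__index__")
        return 2


class Noisy:
    def __del__(self):
        events.append("__del__")


table = {Key("a"): 1}
events.clear()  # ignore the hashing done while building the dict

# C dict lookup: hashes the new key, then compares it with the stored
# key, so both Python methods run in the middle of the C code.
value = table[Key("a")]

# C string indexing converts the argument to a C integer via __index__.
char = "abcd"[Index()]

# Replacing the only reference drops the refcount to zero, so the C
# list code ends up running __del__ (CPython-specific timing).
holder = [Noisy()]
holder[0] = None
```

Each of those three statements looks like a single C-level operation, yet each runs user-defined Python code partway through.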
