Add Virtual Threads to Python

Builds OK on Windows, and passes most of the tests (test_cmd_line, test_repl, test_ssl failed, I doubt they are related to your changes). I did a quick search for the text string “Coroutine” in the sources, and didn’t find it, so I’m confused as to what I should be looking at next.

Edit: I noticed there’s an await-anywhere branch. I tried that but it doesn’t compile: there are a bunch of error C2039: 'ob_flags': is not a member of '_object' errors, and one very suspect error C1083: Cannot open include file: 'alloca.h': No such file or directory. The alloca header isn’t standard C, and doesn’t exist on Windows (MSVC declares _alloca in malloc.h instead, I believe). So have you tested this at all on Windows?

So how do I create a virtual thread? Can you give a simple example of code that creates a bunch of threads and waits on them, so that I can try out various scenarios (threads that sleep, threads with reentrant code, etc)?

If all you have so far is the Coroutine class as a building block, that’s a long way from virtual threads. I’m not trying to dismiss the work you’ve put in, just trying to understand how it relates to the discussion here, which is “can virtual threads work effectively as low-overhead threads with no function colouring or similar usage limitations?”

Thank you, that’s really helpful feedback. The await-anywhere branch is the one to try. It builds on my MacBook Pro (macOS 26.1, debug build), but hasn’t been tried on Windows or Linux. Please don’t spend more time on this until I’ve sorted out the problems on Windows and Linux (tomorrow’s job).

When it does work, this build only goes so far as to remove the function colouring limitation in asyncio. I misunderstood what you were asking for, my apologies.

Could someone show an example of what code looks like without function coloring? I’d find that really helpful. I’m sure everyone in this thread already has their own Python interpreter in mind; just showing the code is enough.

Is there any chance you could put this on GitHub? That link isn’t working for me, I keep getting 504 Gateway Time-out.

Like this:

import asyncio

def a():
    for i in range(5):
        await asyncio.sleep(0.2)
        yield 15

def b():
    return sum(a())

async def main():
    return b()

print(asyncio.run(main()))

Things to note:

  • the outermost function (main()) still needs to be async def (for now)
  • you can await in any function, not just async def ones
  • you can do this through any C-implemented functions (sum() is an example)

Which also means you can await inside Python’s ‘power’ moves, such as:

  • properties (@property)
  • overrides, like __len__(), __bool__(), and __getattr__()
  • operator overloads like __lshift__()

which makes ORMs with async possible.

Also, for example, callbacks can await. However, make sure your framework tolerates being re-entered from a callback - another coroutine could call into the framework while the callback is awaiting.

You can still use await myfunc() and async def myfunc():, if you prefer. It achieves the same result, but slower, with more typing. Similarly async for and async with still work, but if you don’t have to, why would you?

I don’t think this is what folks talking about removing function coloring have in mind, at least not me.

The idea behind virtual threads is that the code that runs on platform threads and the code that runs on virtual threads looks the same - is the same. So it would look like ordinary threaded code.
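To make that concrete, here is a minimal sketch of ordinary platform-thread code using the standard threading module. The worker function and its sleep are placeholders I made up for illustration; the point, as I understand the idea, is that this same code would be expected to run unchanged on virtual threads.

```python
import threading
import time

results = []
lock = threading.Lock()

def worker(n):
    # Placeholder for blocking work; on a virtual thread this call would
    # be expected to suspend only the current virtual thread.
    time.sleep(0.01)
    with lock:
        results.append(n * n)

# Create a bunch of threads, start them, and wait for all of them.
threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

Nothing here is marked async and nothing is awaited; the blocking call looks like any other call.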

Precisely. Until we’re all talking about the same thing when we use the term “virtual threads” (which is what this discussion is supposed to be about!), I don’t think there’s much hope for progress.

And what makes me confused is how something like time.sleep() would work in virtual threads. Would it block the whole interpreter? If so, then the code doesn’t act like ordinary threaded code, while still looking like ordinary threaded code. In effect “blocking” becomes just another form of function colour (although one that’s not explicitly marked, so you’re supposed to “just know” what functions are blocking).

Or am I missing something fundamental here about what people mean when they say “virtual thread”? I don’t mind if the proposed rule is that “virtual threads are like threads, but blocking calls will halt the whole interpreter, not just the current (virtual) thread”[1], but if that is the case, then we should be having discussions about what constitutes a “blocking call”, not about removing await and async keywords…


  1. I don’t think such a rule will be practical, but I don’t mind having that discussion…

It would detect whether the call is being done in a virtual thread or a platform thread. If we’re not in a virtual thread, do the current version of sleep. Otherwise, do the virtual version of sleep. That’s how it works in Java (and Thread.sleep is showcased early in the JEP). Whether this is accomplished with an if statement in time.sleep() or a more clever way, I don’t know.

Cool. So anyone writing potentially-blocking C code would need to call a new API to detect if they are in a virtual thread and choose an asynchronous implementation. Or maybe there’s some sort of interpreter managed OS thread pool so that the code could just call an “If I’m on a virtual thread, call the following code in the thread pool” API. That seems plausible as a design.
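Something like the following is what I have in mind for the thread pool variant, written in Python for brevity even though the real API would be in C. The names on_virtual_thread and call_blocking are hypothetical, and the ThreadPoolExecutor only stands in for an interpreter-managed pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an interpreter-managed pool of OS threads (assumption).
_pool = ThreadPoolExecutor(max_workers=4)

def on_virtual_thread():
    # Hypothetical detection API; hard-coded to True so that the
    # offload path below is exercised.
    return True

def call_blocking(func, *args):
    # "If I'm on a virtual thread, call this in the thread pool."
    if on_virtual_thread():
        future = _pool.submit(func, *args)
        # A real runtime would suspend the virtual thread here instead
        # of blocking the caller on the result.
        return future.result()
    return func(*args)

print(call_blocking(pow, 2, 5))
```

The open question from above remains: an extension that never calls such an API would still block whatever OS thread it happens to be running on.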

I’m not sure how much I like the overhead that this type of approach would impose on C extension authors, and I’m not sure I like the way it would fail if I call an extension that didn’t do this, but it’s far easier to discuss the proposal in concrete terms like this, than it is when everything’s abstract.

Yeah, you’d have to be careful to not call potentially-blocking C code, but that’s a footgun with asyncio today too.

Thinking ahead now, maybe we could borrow a trick from Go’s runtime. Imagine we have a pool of platform threads to run virtual threads on. We could detect that a virtual thread step (I’m using the term step here since asyncio tasks use similar nomenclature) is taking a long time, presumably comparing it to a threshold value. This could be due to a blocking syscall, doing CPU-bound work, or like you mentioned, a C library doing blocking IO. Once we detect this, we take the thread out of the pool of platform threads, and replace it with a fresh platform thread. We put it aside and let it finish what it’s doing, and shut it down afterwards. The virtual thread that it was running goes back to the existing pool. All other virtual threads still get to run on the existing pool, especially since we’ve filled it back up.
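Here is a toy model of just the detect-and-replace bookkeeping, with no real threads involved. The class names, the threshold value and the injected clock are all assumptions for illustration, and this is not how Go's runtime is actually implemented.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass
class PlatformThread:
    ident: int
    step_started: Optional[float] = None  # None means the thread is idle

class CarrierPool:
    def __init__(self, size, threshold):
        self._ids = count()
        self.threshold = threshold
        self.active = [PlatformThread(next(self._ids)) for _ in range(size)]
        self.detached = []  # threads set aside to finish a long step

    def check(self, now):
        # Replace every thread whose current step exceeded the threshold,
        # so the active pool keeps its original size.
        for i, thread in enumerate(self.active):
            stuck = (thread.step_started is not None
                     and now - thread.step_started > self.threshold)
            if stuck:
                self.detached.append(thread)
                self.active[i] = PlatformThread(next(self._ids))

pool = CarrierPool(size=2, threshold=0.05)
pool.active[0].step_started = 0.0   # has been running for a long time
pool.active[1].step_started = 0.98  # started recently
pool.check(now=1.0)
print([t.ident for t in pool.active], [t.ident for t in pool.detached])
```

Thread 0 is moved aside and a fresh thread takes its slot, which is why the number of OS threads in use can drift above the configured pool size.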

This is why when you want to set the number of threads a Go process should be allowed to use you set GOMAXPROCS and not something like GOMAXTHREADS. There’s this additional level of abstraction (they call it processors, hence PROCS), and the actual number of threads in use can go up and down slightly as this algorithm is applied. I think this is quite cool.

Disclaimer: I’m not a Go expert and this is only my understanding, but it makes sense to me.

100% agree, this is exactly what I worry about. A blog post calls it “purple color”, which is neither red nor callable from red.[1] This post is about Rust, but the argument also applies to Python. IMHO the red/blue distinction misses the key point: you can call an async function from a sync function with asyncio.run(), but you can’t call a blocking function from an async function without an OS thread. The biggest problem is blocking.
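A small sketch of that asymmetry with today’s asyncio. The function names are made up for illustration; asyncio.run() and asyncio.to_thread() are the standard library calls.

```python
import asyncio
import time

async def fetch():
    await asyncio.sleep(0.01)
    return "async result"

def sync_caller():
    # Sync -> async: start an event loop and run the coroutine to completion.
    return asyncio.run(fetch())

def blocking_call():
    time.sleep(0.01)  # would stall the event loop if called directly
    return "blocking result"

async def async_caller():
    # Async -> blocking: needs an OS thread so the event loop keeps running.
    return await asyncio.to_thread(blocking_call)

print(sync_caller())
print(asyncio.run(async_caller()))
```

The first direction costs an event loop; the second costs an OS thread, which is the resource the async code was trying to avoid in the first place.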

I think it doesn’t eliminate function coloring, it just pushes function coloring from Python level to C level.


  1. Rust async is colored, and that’s not a big deal | More Stina Blog!

Yes, that’s what I had in mind. This addresses the original issue that led to the two-color function problem. It also requires a non-blocking standard library, which asyncio already provides.

I don’t think the main issue is the async/await syntax itself. Requiring a call to asyncio.run() is not a problem either. A threaded environment is still necessary, whether using OS threads or virtual threads. The two-color function problem does not directly apply in this case.

That said, removing the async/await syntax may be possible, but it is not ideal, as it reduces readability. This also appears to conflict with the idea of “well-defined points.” I think it makes sense to keep the syntax in the code, since a single implementation cannot reliably support both serial and parallel execution models. For example, you will eventually need to use asyncio.to_thread().

The issue with external blocking C code also exists in Java. Calling external C code can pin the entire thread. There is no real solution to this unless the C code itself is compiled or adapted to run within the interpreter.

AFAIK what Java does is run the virtual threads on top of a smaller pool of OS threads, and if one of the VTs blocks, the OS thread blocks - there is no magic. If the number of blocked virtual threads reaches the number of OS threads in the pool, the app blocks as a whole. This is by design, and the lower overhead of the VTs still allows for most of the benefits.
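The starvation scenario can be imitated in Python with a small thread pool standing in for the pool of OS threads. This only models the situation described above under my own assumptions (pool size of 2, made-up task names); it says nothing about how Java’s scheduler is really built.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()

def blocked_task():
    # Stands in for a virtual thread that blocks its OS thread.
    release.wait()
    return "unblocked"

def quick_task():
    return "ran"

# Two OS threads carry all the tasks.
carriers = ThreadPoolExecutor(max_workers=2)
blocked = [carriers.submit(blocked_task) for _ in range(2)]
quick = carriers.submit(quick_task)

# Both carriers are now blocked, so the quick task cannot start.
time.sleep(0.1)
starved = not quick.done()

# Unblock the carriers; the queued task finally gets to run.
release.set()
result = quick.result(timeout=5)
carriers.shutdown()
print(starved, result)
```

As soon as the number of blocked tasks equals the pool size, everything queued behind them waits, however cheap it is.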

Although, I only found out about this design when an article showed up somewhere about a hard-to-solve deadlock bug in a Java app, one that caused enough havoc to even “make the news”.

Hi @markshannon, did Python Virtual Threads get any traction among Python developers? I haven’t found any PEP for it, so I am wondering where it stands.

In my corporate experience, I am beginning to see Python losing mindshare because it lacks Virtual Thread support for applications that need async IO for scalability. For example,

  1. The mainstream, mature Python acme package has no asyncio support.
  2. Monkeypatching blocking Python APIs to make acme work with asyncio via greenlet is not an option, because patching could break other functionality in the app.
  3. So, it becomes enticing to switch to Java’s Virtual Threads and use acme4j, due to the impracticality of doing the same in Python.

Have you considered raising this as an issue with the package? There isn’t really much we can do from here.

Interesting piece of work. Are you familiar with Stackless Python? It does something similar, although it tries to avoid having to save the C stack in many cases (soft-switch vs hard-switch). It’s been many years since I looked at it but it was a very sophisticated piece of work. In your case, it seems you use setjmp/longjmp to save the stack. As I suppose you know, the problem with that is it is non-portable.

Have you considered allowing full co-routines and not just “simple generators”? I would guess your implementation technique should allow it.

I did encounter it part way through development. My first approach was to de-C-stack as much as possible; then I spotted the Stackless references, and thought ‘aha, someone else had a similar idea, nice’. Soon after, I realised it was possible to do coroutines in C using setjmp(), a lightbulb moment, and went down that route to a complete result, rather than a 90% result. The proof-of-concept still has the de-stacking as far as I’d got with it; it’s unnecessary, but helpful (it keeps the C stack usage down).

You say…

… it is part of the C standard, and has been provided by most C implementations AFAIK forever (Mac, Linux & Windows for sure, others TBD). The only assumptions made in the proof-of-concept are: setjmp() & longjmp() behave as the standard allows (i.e. setjmp() is only used in switch & if tests on its return value); the stack is contiguous; the stack grows down (there is one target where the stack grows up - it’s something I need to address). The only new-to-python assumption is setjmp() & longjmp(), and I didn’t think assuming a well-supported, standard C feature available on the big-3 was much of a stretch. The growing down shortcoming will be fixed at some point.

Asyncio without function colouring is narrowed down to just the proof-of-concept - please have a look there too. Highlights: the requests library can be async-ready as-is (as can any other socket- and ssl-based library); abstract interfaces are abstract - the interface user never needs to know async is happening inside. There’s much more there too.

The intended purpose of setjmp/longjmp is for early exit from a nested sequence of function calls. I’m fairly sure this is the only usage that the C standard promises will work.

Wikipedia (https://en.wikipedia.org/wiki/Setjmp.h) says:

“Similarly, C99 does not require that longjmp preserve the current stack frame. This means that jumping into a function which was exited via a call to longjmp is undefined.”

Which would suggest that longjumping to another coroutine and then longjumping back to the first one again is not guaranteed to work, as its stack frame may no longer be valid.

Also, I have some personal experience with this. Some years ago I wrote a coroutine library that does exactly this using setjmp/longjmp. It works on Linux, but it crashes on MacOSX. I never got around to finding out exactly why; possibly it could have been made to work with some tweaking. But it serves as proof that this use of setjmp/longjmp is not portable.

If you’ve managed to get yours to work on MacOSX (x86 version) I’d be interested to see the code.

See https://svnplace.com/artoflibs/ccoroutines for the C coroutines library on its own, or JonathanRoach/cpython-await-anywhere on GitHub (await-anywhere branch) for the same library installed in cpython (Python/coroutine.c). It was developed and debugged on MacOSX, and survives all the make test tests in cpython on Mac and Linux, in debug, non-debug and optimised builds. Windows is in progress: I have done a first pass and most of the make test tests pass, but there are tens of failures whose root causes I’ve yet to work out. While debugging make test in the different builds, the coroutine implementation has not been a source of bugs. I’d be interested in your experience with svnplace.com, good or bad - developing this is what prompted me to adapt Python.

...which is why coroutine.c is very careful not to dump on stack frames which need preserving :wink: . If you setjmp(), then longjmp() away, so long as you avoid corrupting the stack frame you’ve longjmp()ed from, the stack frame will still be there to longjmp() back to.

Also, reading the C99 fragment you highlighted, if longjmp() doesn’t preserve its stack frame (weird, but I guess, as it’s in the standard, it’s possible), there is a simple workaround:

void mylongjmp(jmp_buf buf, int status) { longjmp(buf, status); }

It would be a truly horrid implementation of longjmp() which corrupted the stack frame of whatever called mylongjmp().

BTW, thank you for the heads-up - I may add this to coroutine.c to tighten up its standard-following-ness.

Have you tested with free-threaded Python?