Transferring a Python computation across OS threads

Hi everyone! First of all, I am sorry if this is the wrong category to post my question to, but I figured that, since my question is quite technical and concerns the internals of the interpreter, core-dev would be a better fit. Please move it to another category if my assessment was wrong.

I am trying to do something that seems to go beyond the anticipated usage of the embedded interpreter: running a Python computation within “fibers” (not unlike the python-fibers library, except that I am doing this within an embedded interpreter, not as a Python package). This post serves two purposes: first, I would like to check that my understanding of how to do this safely is correct; second, I would like to see whether this (admittedly rare) use case could be better supported in the future.

To avoid ambiguity, by “fibers” I mean that an arbitrary computation can be suspended (yield), have its state (CPU registers, etc.) stored in a data structure, and later be restored and resumed, potentially on a different OS thread. I think the details of the implementation do not matter here, but, if it is important, you can assume that I am using call/cc from Boost.Context.
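For concreteness, here is a minimal sketch of this model using Boost.Context’s callcc: the continuation object is the stored computation state, and resuming it from another std::thread is exactly the kind of move I mean.

```cpp
// Minimal sketch of the fiber model: a computation suspends itself and is
// later resumed on a different OS thread. Requires linking Boost.Context.
#include <boost/context/continuation.hpp>
#include <iostream>
#include <thread>

namespace ctx = boost::context;

int main() {
    // Start a computation; it runs until it yields back to us.
    ctx::continuation fiber = ctx::callcc([](ctx::continuation&& caller) {
        std::cout << "started on thread " << std::this_thread::get_id() << '\n';
        caller = caller.resume();  // suspend: yield control back to the caller
        std::cout << "resumed on thread " << std::this_thread::get_id() << '\n';
        return std::move(caller);
    });

    // The suspended computation now lives in `fiber`; resume it elsewhere.
    std::thread worker([&fiber] { fiber = fiber.resume(); });
    worker.join();
}
```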

This all revolves around PyThreadState. Initially, I tried doing what the documentation suggests: keeping a PyThreadState per OS thread; in that case, if a computation is moved to a different thread, it resumes with a different PyThreadState. This actually seems to work fine, although I am very worried about some fields of the structure that appear to be part of the computation context (e.g. recursion_remaining or context, to give just a couple of examples).

Next, I tried the opposite approach: keeping a PyThreadState per fiber and carrying it along when moving a fiber to a different thread. This looks like the safer alternative on paper, even though some fields in PyThreadState are clearly thread-specific (e.g. thread_id :). In practice, however, it does not work at all. I can manage (create, move, delete) the state myself, but I cannot update the state stored in thread-specific storage (TSS), which is actively used by PyGILState_Ensure. If there is a mismatch between the currently active PyThreadState and the PyThreadState stored in the current thread’s TSS, things break, and there does not seem to be any way for me to update the TSS.

Based on the above, my current understanding is that PyThreadState really contains two kinds of information: thread-specific and computation-specific. So the approach I am considering is to manually slice PyThreadState into these two pieces: let PyGILState_Ensure give each thread its own PyThreadState, which holds the thread-specific state, while keeping the computation-specific part next to each fiber and transplanting it into the current PyThreadState whenever the fiber moves to a different thread. Incidentally, this is also the approach taken by python-fibers, although their selection of fields to move seems somewhat arbitrary to me. Hence, I would be glad to receive advice on which fields I should treat as part of the computation state and transplant when moving fibers. Obviously, this depends on the specific version of CPython, so what I am looking for is general guidance.
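To make the intended flow concrete, here is a rough sketch of the scheme; the Fiber type and all fiber_* helpers are hypothetical placeholders for the host’s fiber machinery, not working code:

```cpp
// Sketch of the per-thread-identity + transplant scheme. Each OS thread keeps
// the PyThreadState that PyGILState_Ensure created for it; only the
// computation-specific slice travels with the fiber.
#include <Python.h>

struct Fiber;  // host-side fiber object holding the saved slice (hypothetical)

void fiber_implant_pystate(Fiber *f, PyThreadState *ts);  // load slice into ts
void fiber_extract_pystate(Fiber *f, PyThreadState *ts);  // save slice from ts
void fiber_switch_into(Fiber *f);                         // Boost.Context resume

void run_fiber_until_yield(Fiber *f) {
    // This thread's own PyThreadState, created and registered in TSS by
    // CPython itself, so PyGILState_Ensure stays consistent.
    PyGILState_STATE gil = PyGILState_Ensure();
    PyThreadState *ts = PyThreadState_Get();

    fiber_implant_pystate(f, ts);  // transplant computation state in
    fiber_switch_into(f);          // run the fiber until it yields
    fiber_extract_pystate(f, ts);  // transplant computation state back out

    PyGILState_Release(gil);
}
```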

Lastly, I am wondering if this kind of separation is something that could be maintained upstream? I feel like this would be useful to have and it is aligned with the general direction of async, coroutines, and PEP 567, while only requiring a fairly conservative change to the internals of the interpreter (and a bunch of new public APIs for managing it). What do you think?

1 Like


I agree that separating these would be great, and it would help solve one of my biggest issues with embedding CPython.

My worry is that there are probably all sorts of assumptions throughout the runtime, stdlib, and ecosystem that “Python thread == OS thread”. I’ve certainly written code like that (e.g. code that wraps OS primitives with thread affinity), and while we ought to unwind it all, it’s likely very difficult to detect.

There’s also the problem of Python code being reentrant via C code, which ties it tightly to the C stack if you end up trying to switch in there (e.g. during a __getitem__ implementation). This one is probably manageable if you’re doing fully cooperative switching, but it would need to be made clear that you shouldn’t just swap anytime the GIL is released.

But the best way to find out how bad these are is for someone to try it and see, which is what you’re asking about :slight_smile: I think we’d want it to be tried before we start changing anything upstream.

As for what needs to be moved, I’m pretty sure the only thing stored in native TSS is the current thread state (i.e. the one holding the GIL). You shouldn’t touch this one. (I think we also keep the one we expect for the current thread in there, too? But that shouldn’t matter if you explicitly activate a different thread state.)

If you’re only doing cooperative moving (i.e. the Python code yields all the way back to the host before you switch), everything should be fine to move around. If you’re trying to move them in the middle of a computation, that’s where the above concerns matter more than anything else. Alternatively, if you’re moving your computation context between threads without moving PyThreadState, then you’ll definitely want to move context with it, and probably detect/forbid any threading.local use. This is the case where all the above concerns come into play, so while I doubt you’ll see synchronization issues, you may still deadlock/break if you switch while any locks are held by Python code.

So in terms of general guidance, I think the best that can really be offered is “don’t”. If you control all the code that’s going to be run, it might be doable, but if you want it to generalise then it’s safest for it to be a Python-level API rather than something done in C. (And if you really want to do it in C, then figuring out which bits are necessary is the project, because we didn’t design it with this in mind, so probably nobody has the answer already.)

2 Likes

The CPython internal thread state is very much tied to the idea of OS threads. A more modern design would be to use execution contexts, which I guess is your “computation-specific” state; it lives in that structure because it was a convenient place that happened to work so far. Splitting the existing monolith into a two-part execution state does seem useful. I wouldn’t be surprised if the Faster CPython performance work also winds up wanting to see this done.

The idea of a single serial flow of execution hopping between OS threads during its lifetime is an unusual one for most authors to expect. I expect there is code that will never be amenable to it, since code has access to APIs that let it make assumptions that inadvertently tie it to a specific OS thread, beyond our own internals. But that seems true of any existing language bolting the fibers concept on top.

3 Likes

This one is actually simple and is exactly what Boost.Context takes care of. Fibers are fully cooperative, pretty much by definition, so they know exactly when and what to switch. The only question is the consistency of the resulting state after the switch, if we end up on a different OS thread. Anything that relies on thread-local storage certainly has the potential to break, but it may also turn out to be just fine. This is really solved by PEP 567 (at least as far as I understand it) and reduces to the question of code authors being mindful of the difference between thread-local and context-local variables.
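To illustrate from the embedding side: a PEP 567 context is an ordinary Python object that a fiber can hold on to and re-enter after a move, unlike anything stashed in TLS. A minimal sketch using the public PyContext C API (error handling omitted):

```cpp
// Sketch: a context-local value travels with the PyContext object, not with
// the OS thread that happened to set it. Error handling omitted for brevity.
#include <Python.h>

int main() {
    Py_Initialize();

    PyObject *var = PyContextVar_New("request_id", NULL);
    PyObject *ctx = PyContext_CopyCurrent();

    // Set the variable while `ctx` is the current context.
    PyContext_Enter(ctx);
    PyObject *value = PyLong_FromLong(42);
    PyObject *token = PyContextVar_Set(var, value);
    PyContext_Exit(ctx);

    // A fiber can carry `ctx` to another OS thread and re-enter it there;
    // the value of `var` comes along, because it lives in the context.
    PyContext_Enter(ctx);
    PyObject *seen = NULL;
    PyContextVar_Get(var, NULL, &seen);  // seen is the 42 set above
    PyContext_Exit(ctx);

    Py_XDECREF(seen);
    Py_DECREF(token);
    Py_DECREF(value);
    Py_DECREF(ctx);
    Py_DECREF(var);
    Py_Finalize();
}
```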

However, at this point I am not really worrying about third-party code (yet); I am primarily focused on the consistency of the interpreter’s own state.

That is my understanding as well, and hence it is PyThreadState that my question is primarily about.

Ah, OK, that is where our definitions of “cooperative” differ. In my model, it is cooperative in the sense that the C code (that embeds the interpreter) is in control of when to yield; however, on the Python side of things it is just a C function exposed as _fiber.yield(), so from the Python point of view we can indeed yield (in the C sense, not in the Python generator sense) right in the middle of a computation. The idea is that all of this should be completely transparent to the Python code.

Yes, that is the one I was almost certain about, together with context_ver. But there are other interesting fields as well:

  • recursion_remaining, recursion_limit, recursion_headroom – these sound like they are counting the current function call depth, so if I am switching to a different thread in the middle of a computation, I probably want to take these three with me, right?
  • tracing, tracing_what, c_profilefunc, c_tracefunc, c_profileobj, c_traceobj – not sure about these at all, but it sounds like these should be moved too?
  • cframe, curexc_type, curexc_value, curexc_traceback, exc_info, exc_state, root_cframe – these are related to the Python stack, and the stack moves with the computation, so these should move too, right?
  • There are also a bunch of finalisers/deleters, and I am not sure about them.

On the other hand, thread_id, native_thread_id, async_exc, dict – these sound like they should never be moved. (A sketch of this slicing, based on the fields above, is below.)
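For the record, here is roughly what that slice could look like against the CPython 3.11 layout; field names change between versions, the FiberPyState type and helper name are my own invention, and this needs the full (non-limited) C API:

```cpp
// One possible "computation-specific" slice of PyThreadState, following the
// discussion above. Based on the CPython 3.11 struct layout; other versions
// differ. FiberPyState and fiber_extract_pystate are hypothetical names.
#include <Python.h>

typedef struct {
    // Call-depth accounting: follows the computation.
    int recursion_remaining;
    int recursion_limit;
    int recursion_headroom;
    // Tracing/profiling hooks, if they are considered per-computation.
    int tracing;
    int tracing_what;
    Py_tracefunc c_profilefunc;
    Py_tracefunc c_tracefunc;
    PyObject *c_profileobj;
    PyObject *c_traceobj;
    // Frame and exception machinery: tied to the (moving) Python stack.
    _PyCFrame *cframe;
    PyObject *curexc_type;
    PyObject *curexc_value;
    PyObject *curexc_traceback;
    _PyErr_StackItem exc_state;   // copied by value
    _PyErr_StackItem *exc_info;   // beware: may alias &ts->exc_state
    // PEP 567 execution context.
    PyObject *context;
    uint64_t context_ver;
} FiberPyState;

static void fiber_extract_pystate(FiberPyState *fs, PyThreadState *ts)
{
    fs->recursion_remaining = ts->recursion_remaining;
    fs->recursion_limit     = ts->recursion_limit;
    fs->recursion_headroom  = ts->recursion_headroom;
    fs->tracing             = ts->tracing;
    fs->tracing_what        = ts->tracing_what;
    fs->c_profilefunc       = ts->c_profilefunc;
    fs->c_tracefunc         = ts->c_tracefunc;
    fs->c_profileobj        = ts->c_profileobj;
    fs->c_traceobj          = ts->c_traceobj;
    fs->cframe              = ts->cframe;
    fs->curexc_type         = ts->curexc_type;
    fs->curexc_value        = ts->curexc_value;
    fs->curexc_traceback    = ts->curexc_traceback;
    fs->exc_state           = ts->exc_state;
    // exc_info often points at ts->exc_state itself; re-point it so the
    // saved slice stays self-consistent once the original ts is reused.
    fs->exc_info = (ts->exc_info == &ts->exc_state) ? &fs->exc_state
                                                    : ts->exc_info;
    fs->context             = ts->context;
    fs->context_ver         = ts->context_ver;
    // thread_id, native_thread_id, async_exc, dict stay with the thread.
    // root_cframe is embedded in PyThreadState by value and would need the
    // same aliasing care as exc_state, if it is moved at all.
}
```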

Yes, of course, you are 100% right. It is the responsibility of individual code authors to be mindful (or not) of the possibility of such a move. My understanding is that with PEP 567 authors now have the right vocabulary to distinguish thread-local storage from context-local storage, so it is now all a matter of education.

However, for now I am focusing not on ensuring that third-party code will behave correctly, but rather on whether this whole concept is possible at all to begin with.

Oh, yes, and about this. If I understand correctly what you are talking about, then this is what I described as the “opposite approach” in my original post: keeping a PyThreadState for each fiber and restoring it on a different thread when the fiber is moved. By “restoring” here I mean using PyEval_RestoreThread, which makes the state current and takes the GIL for us. The problem is that, as you pointed out, there is also this thread’s “original” state stored in its TSS, which I cannot touch in any way (I don’t have access to the TSS key), and the current thread state (the one holding the GIL) must always match the one in the current thread’s TSS – otherwise things break, e.g. PyGILState_Ensure will deadlock. So carrying PyThreadState together with fibers is not an option right now. That is why I am focusing on letting each thread keep its own “identity” in the form of its PyThreadState (the memory address of its PyThreadState, to be precise), while transplanting the internals as needed.
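For anyone following along, this is the failure mode reduced to a self-contained snippet (assuming CPython 3.11-era behaviour); running it is expected to hang at the marked line, which is exactly the problem:

```cpp
// Demonstration of the TSS mismatch: the GIL-holding thread state and the
// state recorded in this thread's TSS disagree, so PyGILState_Ensure hangs.
#include <Python.h>
#include <thread>

int main() {
    Py_Initialize();
    PyInterpreterState *interp = PyInterpreterState_Get();

    // A thread state managed by us, standing in for a fiber's PyThreadState.
    PyThreadState *fiber_ts = PyThreadState_New(interp);
    PyThreadState *main_ts = PyEval_SaveThread();  // release the GIL

    std::thread worker([&] {
        // Take the GIL and make our state current on this worker thread.
        // Crucially, this does NOT update the TSS slot consulted by
        // PyGILState_Ensure, and there is no public API to do so.
        PyEval_RestoreThread(fiber_ts);

        // Any code (ours or a C extension's) that now calls
        // PyGILState_Ensure finds no matching state in TSS, creates another
        // one, and tries to re-take the GIL this thread already holds:
        PyGILState_STATE g = PyGILState_Ensure();  // <-- deadlocks here
        PyGILState_Release(g);                     // never reached
        PyEval_SaveThread();
    });
    worker.join();

    PyEval_RestoreThread(main_ts);
    Py_Finalize();
}
```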

2 Likes

We should probably get @eric.snow and @vstinner to comment on this thread as well. (Personally, I saw “call/cc” and my brain exploded, but that’s an old problem, and it seems plenty of others here don’t have it. :slight_smile:)

2 Likes

Are you aware of Stackless tasklets? AFAIK you can even pickle those and resume them on a different machine.
I’ve never used them myself, though.

2 Likes

No, I had no idea, thanks for the pointer!

I was actually a little worried to see that their wiki page was last updated in 2016 and that their tasklet example is literally:

print "aCallable:", value

but it looks like the code, unlike the wiki page, is maintained.

They have this PyTaskletObject structure, which contains some of the interesting fields from PyThreadState, and their context switch moves the values of those fields into PyThreadState for the new tasklet and back out for the previous one.

I especially like that there are plenty of comments in the code, so this looks like a good starting point.

Others have mostly covered everything I would have said. There is certainly quite a lot of conceptual (and practical) overlap between the OP’s description and my Multi-Core Python project.

The only thing I haven’t seen brought up is the constant concern of dealing with state in external libraries (e.g. openssl) that might not be very portable between OS threads or other host execution contexts. That would be a tricky thing to solve for the Python runtime.


Overall, I’m in favor of encapsulating the runtime state as much as possible. Further, I’d even favor explicitly passing the execution context (including runtime state) around as an argument through (internal/public) C-API calls, instead of storing it in a thread-local. However, there’s a big gap between that and where we are right now, and the disruption probably wouldn’t pay for itself. (That’s a long-running discussion in this group.) Regardless, we continue to work on consolidating global runtime state, as there are several extra benefits to that.

4 Likes

Python is far from being a pure, stateless programming language, and its implementation is full of internal state (e.g. globals, per-thread variables, variables on the stack, etc.). Did you consider using a functional programming language such as Erlang, which is trivial to distribute across multiple processes and even multiple machines?

greenlet with eventlet or gevent is close to what you are describing; it’s already implemented and it works. It’s designed around the assumption that network operations are likely to stall until the network interface is ready to send data and until the other computer has sent data. There are some use cases with threads and filesystem operations, but networking is the most common use case.