Supporting Per-Interpreter GIL from Rust (PyO3)

I was talking with @mitsuhiko at EuroRust this morning about what per-interpreter GIL support in Rust can look like (and specifically in PyO3). These are just a couple of thoughts we explored which might be interesting to share more widely.

In my eyes, the challenge of subinterpreter support for a framework like PyO3 is the need to have object isolation between subinterpreters. User-facing APIs need to be constrained to account for this.

  1. Verifying object provenance - at the fundamental level, it seems we need a way at runtime to verify that objects belong to the current subinterpreter. We can already identify subinterpreters by interpreter ID, so maybe a solution is to add the interpreter ID to PyHeapTypeObject, so that instances of that type can have their subinterpreter known. There may need to be some exploration of how this interacts with immortal / static types.

  2. Thread affinity - the subinterpreter API as currently implemented creates a Python thread state which can then be activated with PyEval_RestoreThread. If I understand this correctly, it means C code can change the subinterpreter on the current thread by swapping the thread state to one from a different subinterpreter. Is there a way we could prevent that, so that there is a guarantee that per host thread there is only ever one subinterpreter which can run on it? Otherwise I think the implication is that after any call into unknown C code you might have been swapped onto a different subinterpreter.

  3. Message passing - I see Py_AddPendingCall to send calls back to the main interpreter. Are there other APIs to use to pass messages between subinterpreters? I didn’t see any; I guess we can always build our own thread-safe data structures, but it might be nice to have something in the C API for this.

  4. Shared objects - To avoid the need to serialize and message pass, maybe there are subsets of objects we can share safely? If I recall correctly, nogil is introducing per-object locks. Maybe these can be explored as a way to lock objects to enable sharing them in a synchronized way across subinterpreters?

Overall we are optimistic Rust can help bring some cool use cases to the table in this space!

Thanks for taking the time to write up such a thoughtful post. This sort of feedback is incredibly helpful.

Agreed. We have discussed something along these lines for heap types before. We would also need to accommodate builtin types, which have remained static types. Either way, we’d provide something like PyType_GetInterpreter() (and PyObject_GetInterpreter()), rather than exposing struct members.

Hmm, I’ll have to give this some thought. I had not considered that this would be a problem, and the design of subinterpreters has worked this way for several decades (though without extensive use).

What in particular is problematic here? Note that PyThreadState_Swap() has been around for a long, long time, so this isn’t a new situation. Callers are responsible for swapping the previous thread state back in as appropriate.
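The contract described here (whoever swaps a thread state in is responsible for swapping the previous one back) maps quite naturally onto a Rust drop guard. What follows is only a minimal sketch in plain Rust, with no Python involved: `ACTIVE`, `SwapGuard`, `swap_in` and `run_in` are hypothetical names, and the thread-local integer merely stands in for the thread state that PyThreadState_Swap() would manipulate.

```rust
use std::cell::Cell;

thread_local! {
    /// Hypothetical model of "the thread state active on this OS thread".
    /// A real binding would hold a thread state pointer, not an integer.
    static ACTIVE: Cell<u32> = Cell::new(0);
}

/// Returns the identifier of the state currently active on this thread.
pub fn active() -> u32 {
    ACTIVE.with(|a| a.get())
}

/// Remembers which state was active before the swap.
pub struct SwapGuard {
    previous: u32,
}

/// Swaps `state` in and returns a guard that swaps the old state back.
pub fn swap_in(state: u32) -> SwapGuard {
    let previous = ACTIVE.with(|a| a.replace(state));
    SwapGuard { previous }
}

impl Drop for SwapGuard {
    /// Restores the previous state, including on early return or unwinding.
    fn drop(&mut self) {
        ACTIVE.with(|a| a.set(self.previous));
    }
}

/// Runs `f` with `state` active, then restores whatever was active before.
pub fn run_in<R>(state: u32, f: impl FnOnce() -> R) -> R {
    let _guard = swap_in(state);
    f()
}
```

With a guard like this, code that calls `run_in(7, || ...)` sees state 7 inside the closure and its original state afterwards, which is the assumption a framework would be relying on after a call into foreign code.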

There is an internal (Py_BUILD_CORE) _PyEval_AddPendingCall() that is interpreter-specific and will make those pending calls on any thread where that interpreter is currently active. However, I’m reluctant to simply expose this as public API. I worry that it would end up being an “attractive nuisance”. Whether or not we make it public, we would probably expose some much more focused functions with related behavior (e.g. Py_DecrefInInterpreter()).

Regarding message passing, I’ve been working on this somewhat with PEP 554. (Note that the latest version of the PEP doesn’t have “channels” in it. I plan on adding them back in the next few weeks.)

I have some ideas at the C-API level, and there are several people in the community that have been exploring possibilities. However, nothing is settled yet and there are a lot of possibilities to be explored.

Yep, that’s kind of the point we’re at currently. It’s a bit of a new frontier. I’d be glad to have a chat to discuss what might make a good API.

There are definitely a number of possibilities to explore here. PEP 554 is focused on setting a basic foundation and discusses several things we all might look into afterward.

FWIW, I like the idea of allowing arbitrary objects to be shared, and making use of the no-gil per-object locks makes sense. There are certainly some additional subtleties to sort out, but I think it’s worth exploring. (That’s independent of PEP 554 though.)
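As a rough analogy for sharing an object behind a per-object lock, here is a small sketch using only the Rust standard library. The two threads stand in for two subinterpreters and `Arc<Mutex<...>>` stands in for an object carrying its own lock; the function name is hypothetical, and nothing here reflects how no-gil locks or CPython objects actually work.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Two threads (standing in for two subinterpreters) append to one shared
/// list. Every access goes through the list's own lock.
pub fn append_from_two_interpreters(items_per_side: usize) -> Vec<usize> {
    let shared = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = (0..2)
        .map(|side| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for i in 0..items_per_side {
                    // The lock is held only for the duration of one push.
                    shared.lock().unwrap().push(side * items_per_side + i);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Interleaving order is not deterministic, so sort before returning.
    let mut items = shared.lock().unwrap().clone();
    items.sort_unstable();
    items
}
```

The subtleties mentioned above (type objects, reference counts owned by different interpreters) are exactly what this analogy leaves out.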

Methinks building on noGIL here is putting the cart before the horse. NoGIL will take years to stabilize, and I expect you won’t be able to count on it existing for the next 4-5 releases.

after any call into unknown C code you might have been swapped onto a different subinterpreter

After a call into C from where? Python code can definitely be swapped to a different OS thread whenever it blocks for the GIL. But the mapping between Python threads and Python interpreter state is, was, and will always be fixed. Interpreters are not tied to OS threads.

The only time you could be swapped to a different (sub)interpreter, I think, would be when you’re executing C code without holding the GIL, and you make a call into something that does something with threads and/or the GIL.

I agree it’s surprising that the default API for running code in a different subinterpreter runs it in the current OS thread. But that’s how it’s always been for subinterpreters (pre GIL-per-interpreter).

I would like to see a new API to run some code in a different interpreter using a different OS thread. I am hoping that someone comes up with a SubinterpreterPoolExecutor that works like ProcessPoolExecutor but maintains a pool of subinterpreters to which it can dispatch functions. (Certain kinds of named functions can be pickled – the unpickling just re-imports it by name.) It could use a more efficient protocol than pickle in some cases. The main advantage IMO is that you don’t have to learn a new API.
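To make the shape of such a pool concrete, here is a minimal sketch in plain Rust. It is a model only: each worker thread stands in for one subinterpreter, `WorkerPool`, `Job` and `registry` are hypothetical names, and the per-worker function table imitates the "re-import by name" step rather than doing any real pickling or importing.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// A job names the function to run instead of shipping code, mirroring how
/// a named function is sent by reference and looked up again on arrival.
struct Job {
    func: String,
    arg: i64,
    reply: Sender<Result<i64, String>>,
}

fn double(x: i64) -> i64 { x * 2 }
fn square(x: i64) -> i64 { x * x }

/// Each worker builds its own private table, like a fresh import.
fn registry() -> HashMap<&'static str, fn(i64) -> i64> {
    let mut table: HashMap<&'static str, fn(i64) -> i64> = HashMap::new();
    table.insert("double", double);
    table.insert("square", square);
    table
}

/// Hypothetical stand-in for a pool of subinterpreters, one OS thread each.
pub struct WorkerPool {
    senders: Vec<Sender<Job>>,
    handles: Vec<JoinHandle<()>>,
    next: usize,
}

impl WorkerPool {
    pub fn new(workers: usize) -> Self {
        assert!(workers > 0, "the pool needs at least one worker");
        let mut senders = Vec::new();
        let mut handles = Vec::new();
        for _ in 0..workers {
            let (tx, rx) = channel::<Job>();
            handles.push(thread::spawn(move || {
                let table = registry();
                // The loop ends once the pool drops its sender.
                for job in rx {
                    let result = table
                        .get(job.func.as_str())
                        .map(|f| f(job.arg))
                        .ok_or_else(|| format!("unknown function: {}", job.func));
                    let _ = job.reply.send(result);
                }
            }));
            senders.push(tx);
        }
        WorkerPool { senders, handles, next: 0 }
    }

    /// Dispatches round-robin and blocks until the worker replies.
    pub fn call(&mut self, func: &str, arg: i64) -> Result<i64, String> {
        let (reply, result) = channel();
        let worker = self.next % self.senders.len();
        self.next += 1;
        self.senders[worker]
            .send(Job { func: func.to_string(), arg, reply })
            .map_err(|_| "worker has shut down".to_string())?;
        result.recv().map_err(|_| "worker dropped the job".to_string())?
    }

    /// Closes the job channels and waits for every worker to exit.
    pub fn shutdown(self) {
        drop(self.senders);
        for handle in self.handles {
            let _ = handle.join();
        }
    }
}
```

The caller only ever sees `call("double", 21)` and a result, which is the "no new API to learn" property; the channels underneath are one possible answer to the message-passing question and could be swapped for something more efficient.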

I think 2 is not a real problem?

An arbitrary C call can already release the GIL; and rust-cpython / PyO3 both assume that by the time control flow returns from the C call to safe Rust code (with a Python<'_> token in scope), the GIL was somehow re-acquired.
For multiple interpreters on the same thread (which seem like a useful feature to me), it seems reasonable that PyO3 could assume in a similar fashion that the C code somehow switched back to the old interpreter.

3: This would also remove the need for point 3 – there’s no need for message passing between subinterpreters if you can make synchronous cross-interpreter calls within a single thread.

4: Shared objects: object sharing won’t be possible without nogil; but it’s plausible that there could be an operation that moves an object with refcount 1 to a different interpreter. It’d be difficult to call such an operation from Python (you couldn’t pass the object-to-be-passed as a parameter); but C and Rust wouldn’t have such issues. Such an operation would need to ensure that all reachable objects are either immortal or only reachable from the starting object; and would need to update the object types to the equivalent types in the other interpreter.
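Rust's standard library happens to have a close analogue of "move an object only if its refcount is 1", which may help picture the operation. This is a sketch by analogy only: `Rc` plays the role of an interpreter-bound reference (it cannot cross threads), the spawned thread plays the role of the other interpreter, and both function names are hypothetical. It checks only the top-level reference count and says nothing about the reachability or type-remapping requirements described above.

```rust
use std::rc::Rc;
use std::sync::mpsc::channel;
use std::thread;

/// Gives up the value only when the caller holds the sole reference;
/// otherwise the untouched reference is handed back as the error.
pub fn take_if_unique<T>(obj: Rc<T>) -> Result<T, Rc<T>> {
    Rc::try_unwrap(obj)
}

/// Moves a uniquely owned list to another thread (standing in for another
/// interpreter) and returns the sum computed there.
pub fn send_to_other_interpreter(obj: Rc<Vec<i64>>) -> Result<i64, Rc<Vec<i64>>> {
    // `Rc<Vec<i64>>` is not `Send`, but the unwrapped `Vec<i64>` is.
    let owned = take_if_unique(obj)?;

    let (tx, rx) = channel();
    let worker = thread::spawn(move || {
        rx.recv()
            .map(|items: Vec<i64>| items.iter().sum::<i64>())
            .unwrap_or(0)
    });

    tx.send(owned).expect("worker is waiting for the value");
    Ok(worker.join().expect("worker did not panic"))
}
```

As noted, this is easy to call from Rust or C because the reference is passed by value; the caller has nothing left to alias once the move succeeds.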

Thanks all for your ear! Some replies:

Yes definitely, I wonder if it makes sense to go further and have PyObject_EnsureInterpreter(), both for performance of a single FFI call and so that any exception raised by this invariant being broken is always of the exact same form, rather than leaving every extension / framework to implement it. We can chew on that when the time comes.
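For illustration, this is roughly the behaviour I have in mind, modelled in plain Rust rather than against any real API. `InterpreterId`, `Owned`, `WrongInterpreter` and `ensure_interpreter` are all hypothetical names; the point is only that there is one check and one error shape.

```rust
use std::fmt;

/// Hypothetical stand-in for an interpreter ID; not a PyO3 or CPython type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InterpreterId(pub i64);

/// The single error shape produced by every failed check.
#[derive(Debug, PartialEq, Eq)]
pub struct WrongInterpreter {
    pub owner: InterpreterId,
    pub current: InterpreterId,
}

impl fmt::Display for WrongInterpreter {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "object belongs to interpreter {} but interpreter {} is active",
            self.owner.0, self.current.0
        )
    }
}

impl std::error::Error for WrongInterpreter {}

/// A value tagged with the interpreter that created it.
pub struct Owned<T> {
    owner: InterpreterId,
    value: T,
}

impl<T> Owned<T> {
    pub fn new(owner: InterpreterId, value: T) -> Self {
        Owned { owner, value }
    }

    /// Hypothetical analogue of the proposed PyObject_EnsureInterpreter():
    /// grants access only when `current` matches the owning interpreter.
    pub fn ensure_interpreter(&self, current: InterpreterId) -> Result<&T, WrongInterpreter> {
        if self.owner == current {
            Ok(&self.value)
        } else {
            Err(WrongInterpreter { owner: self.owner, current })
        }
    }
}
```

A framework could then route every object access through the one method and surface the same error everywhere, instead of each extension inventing its own check.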

I think perhaps you are both right here, and thread affinity is not an issue. In analysis of the C APIs I had noticed this case of multiple-interpreters-per-thread, though if the expectation of how they are used is that callers must always swap back to the original interpreter then Rust frameworks can make assumptions based on this.

I definitely agree with this and wasn’t implying that shared objects are needed tomorrow; this was just an interesting observation (full credit @mitsuhiko) of what synergy might exist in the long term.

I’m not sure if you’ve seen projects such as cloudpickle, which do dark magic to reconstruct whole type objects as best they can when unpickling. They’re currently used in the distributed processing space, but if you squint, subinterpreters are not so different!

It’s an interesting point. I would need to think further about how you get two interpreters, which might be running on different threads, to converge on a single thread to do a synchronous data exchange. I believe that there is a lot of experience with message passing in the JavaScript ecosystem. That doesn’t necessarily mean that Python has to do it that way, but I think users would be grateful if they can model systems in similar ways, if it makes sense from a language design angle to do so.

Hmm, I might misunderstand what you’re trying to say, but that doesn’t sound true. An OS thread can be swapped between different hardware threads or cores, but a Python thread state is AFAIR always bound to the same OS thread. That’s what the PyGILState APIs ensure, and that’s what threading.Thread expects.

In other words, calling PyThread_get_thread_ident with a given Python thread state activated should always give the same answer.

Oops, you’re right. Sorry. I don’t know what I was thinking of.

IMO this makes it extra problematic that the subinterpreter API defaults to using the current OS thread though.

Actually pybind11 documents a way to move a thread state to a different thread: https://github.com/pybind/pybind11/blob/0e2c3e5db41b6b2af4038734c84ab855ccaaa5f0/include/pybind11/gil.h#L40

Correct. Also, each interpreter will have a distinct thread state for each thread in which it is active, and that thread state will only be used for that thread.
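One way to picture that rule is a table keyed by the pair (interpreter, OS thread). The sketch below is a plain-Rust model, not CPython's actual bookkeeping: `ThreadState`, `ThreadStates`, `for_current_thread` and the `serial` field are hypothetical names introduced only for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread::{self, ThreadId};

/// Hypothetical model of a thread state: it records which interpreter and
/// which OS thread it belongs to, plus a creation counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ThreadState {
    pub interpreter: u32,
    pub thread: ThreadId,
    pub serial: usize,
}

/// One entry per (interpreter, OS thread) pair, created on first use.
#[derive(Default)]
pub struct ThreadStates {
    states: Mutex<HashMap<(u32, ThreadId), ThreadState>>,
}

impl ThreadStates {
    /// Returns the state for `interpreter` on the calling thread, creating
    /// it if this is the first time the interpreter is active here.
    pub fn for_current_thread(&self, interpreter: u32) -> ThreadState {
        let thread = thread::current().id();
        let mut states = self.states.lock().unwrap();
        let serial = states.len();
        *states
            .entry((interpreter, thread))
            .or_insert(ThreadState { interpreter, thread, serial })
    }
}
```

Asking twice from the same thread yields the same state, while the same interpreter asked from another thread gets a different one, which is the invariant being described.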