I doubt we can get rid of the thread-local storage field, at least without changing all public APIs to require passing in the thread state explicitly (like most other embeddable language runtimes do). But hopefully we can update internal APIs to pass the thread state around explicitly, so we only pay the cost at the boundaries. Unfortunately, tying Python’s state to the OS thread like this causes some real pain when embedding.
Presumably in a per-interpreter GIL world, gilstate has to be looked up from PyInterpreterState, which has to be found from PyThreadState, which in turn needs the TSS API (or an explicit parameter). So I’d say it depends on which “context” object ends up in TSS: PyInterpreterState or PyThreadState?
My gut feel is that PyThreadState will continue to be in TSS, and so we’ll need the TSS API to get it if we don’t have it, and gilstate.tstate_current can go. (If we have a PyObject* then I expect we can trace back to the interpreter state that owns it, but probably not safely without knowing that we hold that interpreter’s GIL, and if we knew that then we wouldn’t need to find it.) We’ll want to be careful not to use TSS any more often than we need to, and I really do hope that we one day make it easy for embedders to explicitly control the threadstate when they’re calling back into Python code.
Though to be fair, I haven’t had to work on this since Python 3.7, so maybe we’ve actually got some APIs to “just set” the current state now. We didn’t at the time (or they did extra validation and would fail if you tried to move between threads), and so it was impossible to use native thread pools to execute Python code, for example. ↩︎
I should say it’s impossible (or very nearly impossible) to resume Python code execution in a thread pool. It’s fine if you start it running and wait for it to finish, but that’s not how embedding typically works.
Take Blender for example. Most of its physics simulation is going to happen in multithreaded C++, but you can write custom expressions in Python as part of it. If the processing is being run in a thread pool, you have to create a new Python thread state every time you want to call back into Python, because you can’t guarantee that you’re on a “known” native thread, even if you know (or have decided) that it doesn’t matter.
It gets worse in something like Minecraft, where you call Python code which calls back into the game and has to wait for something to complete (async/await style, though with a custom native implementation). The “completion” signal arrives on whatever thread the pool has available, but the Python code has to stay attached to its original thread, so you can’t do anything except native message passing from the completion signal. It also means you can’t run the Python code on a thread pool thread in the first place (at least in this system), because you’d have to block that thread indefinitely while waiting for the signal.
Both cases would be fine taking a global interpreter lock to execute their code, because they’re controlled enough not to rely on the native thread they happen to be running on. But because Python internally requires so much consistency between the GIL, the thread state and the OS thread, you can’t just do that. You always end up creating a dedicated thread that only runs Python code, forbidding Python threading (which would interfere with your dedicated thread), and setting up message-passing primitives to interact with the host application.
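That dedicated-thread workaround can be sketched in plain C with pthreads. The run_job function below is a stand-in for the actual call into Python; the point is the message-passing plumbing the host has to build around it:

```c
#include <pthread.h>
#include <stdbool.h>

/* Single-slot mailbox: host threads post work, the dedicated
 * "Python" thread executes it and posts the result back. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;   /* signalled when a job is posted */
    pthread_cond_t  done;    /* signalled when a job completes */
    int  input, output;
    bool has_job, has_result, shutdown;
} Mailbox;

/* Stand-in for "call into Python" on the dedicated thread. */
static int run_job(int input)
{
    return input * 2;
}

/* The one thread allowed to run Python code. */
static void *python_thread(void *arg)
{
    Mailbox *mb = arg;
    pthread_mutex_lock(&mb->lock);
    for (;;) {
        while (!mb->has_job && !mb->shutdown)
            pthread_cond_wait(&mb->ready, &mb->lock);
        if (mb->shutdown)
            break;
        mb->output = run_job(mb->input);  /* only here, ever */
        mb->has_job = false;
        mb->has_result = true;
        pthread_cond_signal(&mb->done);
    }
    pthread_mutex_unlock(&mb->lock);
    return NULL;
}

/* Called from any host thread: post a job, block for the result. */
static int call_python(Mailbox *mb, int input)
{
    pthread_mutex_lock(&mb->lock);
    mb->input = input;
    mb->has_job = true;
    pthread_cond_signal(&mb->ready);
    while (!mb->has_result)
        pthread_cond_wait(&mb->done, &mb->lock);
    int out = mb->output;
    mb->has_result = false;
    pthread_mutex_unlock(&mb->lock);
    return out;
}
```

All of this boilerplate exists purely because the callee can’t be invoked from whichever thread happens to be free; if the runtime only needed the lock, call_python would collapse to lock/call/unlock.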
(Incidentally, neither case is a “general purpose Python environment” situation. Nobody is installing arbitrary third-party modules or adding new native code. If your app is going to allow that, you don’t really have much choice but to run Python as a separate process. I’m more concerned about apps that want expression evaluation or short snippets run in the context of the main process, rather than running an entire app/script’s worth of Python code.)
At least up until 3.7 (as I said, I haven’t had to do this since then), the API changed in virtually every version, and sometimes in micro-versions. Sometimes “ensure” would do it, sometimes it would crash. Sometimes “create” would do it. Sometimes that would crash. Sometimes you had to do one before the other, sometimes after, sometimes not at all. We had the most hideous code to handle this in our old debugger, and it was certainly not something embedders could be expected to get right.
It’s also a big, heavyweight operation for what might be a single attribute lookup (if that’s all the user wants to do), which again doesn’t help embedders. Lua keeps winning here for a range of reasons, but this is definitely one of them.
Perhaps, but it’s not uncommon, at least on the user-facing side of applications. Actually, it’s not that uncommon on the server side either: everyone tries to parallelise operations using thread pools (or equivalents), and there’s a lot of completion-triggered event handling.