PEP 788: Reimagining native threads

Hi everyone,

I’m excited to present PEP 788, a redesign of how we approach native threads in the C API. The goal here is to overcome the issues and limitations that come with PyGILState_Ensure by providing new APIs that make native threads safer to use during finalization.

In short, interpreters are given a reference count by active native threads, and the interpreter may not finalize until all these threads are finished. The PEP outlines a number of ways to acquire and release these references in a predictable and thread-safe manner.

You can read the full text here:



Hi,

Some questions:

For backwards compatibility, all thread states created by existing APIs will remain daemon by default.

  1. Which “existing APIs”? C ones? Python ones?
  2. Why are we concerned about backwards compatibility? Do we really view hanging the thread as a feature?

int PyThreadState_SetDaemon(int is_daemon)
Set the attached thread state as non-daemon or daemon.

Shouldn’t there be a PyThreadState_GetDaemon counterpart?

PyInterpreterState *PyInterpreterState_Hold(void)
The caller must have an attached thread state, and cannot return NULL.

What cannot return NULL here? The caller? PyInterpreterState_Hold?

PyInterpreterState *PyInterpreterState_Lookup(int64_t interp_id)
Similar to PyInterpreterState_Hold(), but looks up an interpreter based on an ID (see PyInterpreterState_GetID()). This has the benefit of allowing the interpreter to finalize in cases where the thread might not start, such as inside of a signal handler.

Is this API function really signal-safe? That sounds like a very constraining requirement for a function that will probably have to access a mutable global structure.

int PyThreadState_Ensure(PyInterpreterState *interp)
The interpreter’s interp reference count is decremented by one.

Are you sure that’s safe to do? I would expect PyThreadState_Release to decref the interpreter, not PyThreadState_Ensure. Otherwise, what happens if there is a Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS pair inside the
PyThreadState_Ensure / PyThreadState_Release pair? Could Py_END_ALLOW_THREADS fail reacquiring the interpreter?
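To make the concern concrete, here is a sketch of the nesting described above. PyThreadState_Ensure and PyThreadState_Release are the proposed PEP 788 names; the macros are the existing CPython ones. This is illustrative only and does not compile against any released CPython:

```c
/* Sketch only -- PyThreadState_Ensure/Release are proposed, not real. */
PyThreadState_Ensure(interp);   /* interp's refcount decremented here? */
Py_BEGIN_ALLOW_THREADS
/* ... blocking call while the thread state is detached ... */
Py_END_ALLOW_THREADS            /* must reattach: can this fail if the
                                   interpreter finalized in the meantime? */
PyThreadState_Release();
```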

void PyThreadState_Release()
Detach and destroy the attached thread state set by PyThreadState_Ensure().

It doesn’t always destroy the thread state, does it? It should only do so if the thread state was created by the matching PyThreadState_Ensure() call.

I would also expect the PEP to answer a couple more questions:

  1. Are nested pairs of PyThreadState_Ensure() and PyThreadState_Release() calls supported?
  2. What is the use case for PyInterpreterState_Lookup()? Is it when you don’t want to keep a strong reference to an interpreter? Is the interpreter id guaranteed to be unique for the entire process call (i.e. it cannot be recycled after the interpreter was destroyed)? The PyInterpreterState_GetID doc doesn’t say so.
  3. How does this change the shutdown sequence? Does Py_FinalizeEx wait for all subinterpreters to be released (this could certainly introduce new deadlocks)? Or does Py_FinalizeEx only finalize the main interpreter, letting subinterpreters die when their refcount drops to zero?
  • All of them. Python and C ones. (But, it’s not really relevant or accessible information from Python.)
  • Yes, it has to be. Again, PyGILState_Ensure cannot fail right now, and we can’t make it do so in the future. It must hang or exit the thread when it can’t provide a thread state.

Possibly, but I don’t think it’s very useful. For one, it wouldn’t be consistent with threading-created threads that call into C. A threading thread will be considered “daemon” from C, but it will still fully finish because the interpreter will get rid of it anyway before finalization. We could specify that threading threads be finalized the same way native ones are, but that seems like unnecessary work without a clear benefit.

PyInterpreterState_Hold. Sorry if the “attached thread state” part is confusing, it’s new to the docs in 3.14.

It should be. Why would it be constraining?

Correct, PyThreadState_Release is the one that actually decrefs the interpreter. The description here is trying to convey the idea that the reference is no longer held by the caller; it doesn’t necessarily need to depict the actual magic behind the call, because it’s an implementation detail. This isn’t uncommon for the C API docs, but I guess I could adjust the wording here (maybe “pass off the reference” is better than “decrement it”).

Yeah, but again, the point is that we don’t need to expose those details. I’m not sure how to word it in a way that wouldn’t be unnecessarily confusing (“detach and destroy the thread state if it’s the only thread state in this thread, and if it’s not the initial thread for the interpreter, but in the latter case it’s PyThreadState_Clear’ed and put in a freelist”).

Yes. See PEP 788 – Reimagining native threads | peps.python.org

It is OK to call this function if the thread already has an attached thread state, as long as there is a subsequent call to PyThreadState_Release() that matches this one.

Yeah, it’s to prevent strong references. That’s noted in the rationale:

In the case where it is useful to let the interpreter finalize, such as in a signal handler where there’s no guarantee that the thread will start, strong references to an interpreter can be acquired through PyInterpreterState_Lookup().

Uniqueness isn’t a problem I considered, thanks for bringing it up! I think they’re unique for the lifetime of the process at the moment, but if they aren’t, that’s something we need to change.

Py_FinalizeEx finalizes all interpreters. Native threads are finalized in a similar way to Python-created threads; how could that introduce deadlocks?

(Note that subinterpreters don’t support being finalized with any threads active at the moment, but that will ideally be fixed.)


A lot of these questions were indeed answered in the PEP :frowning:. Do you have any suggestions to help make the text clearer?

Ok, but the PEP doesn’t say “PyGILState_Ensure will continue hanging”, it says “all thread states created by existing APIs will remain daemon by default”. I would expect the PEP to discuss why it is desirable to make “daemonness” a sticky property of a thread.

If a setter is useful, my experience is that a situation always comes up where a getter is desirable as well, if only for debugging.

My point is more that the sentence is not correct: in “the caller must have an attached thread state, and cannot return NULL”, the subject of “cannot return NULL” is “the caller”.

What kind of data structure are you planning to use that would allow safe updates during reentrant calls?

Right. The common terminology in CPython API docs is “steal a reference”.

The PEP should certainly expose those details to avoid any misunderstanding.

Ok, but if a subinterpreter has a non-zero refcount, does Py_FinalizeEx wait for the refcount to drop to zero? Or does it simply ignore the refcount (but then, the proposed API isn’t safer than the one it replaces)?


Because it’s not backwards-compatible. There’s a section in the motivation about why we can’t change how PyGILState_Ensure hangs the thread; I would hope that implied the PEP wouldn’t try to change it either.

Both of these acknowledged.

FWIW, the term “steal” was explicitly rejected during the editing process because we didn’t want to confuse it with object reference counting. Apparently, it did more harm than good :frowning:

It waits for the threads to finish in the same way the main interpreter does. Let’s focus on this issue: how could this cause deadlocks?

Sorry for missing this point. We can just use the existing linked list of interpreter states. Other than a HEAD_LOCK that we need to deal with, it should be async-signal safe.

A couple of nice features of PyGILState_Ensure() were:

  • you could call it when you didn’t know whether you held the GIL (and after calling it, you definitely would)
  • you could call it nested in a Py_BEGIN_ALLOW_THREADS section.

I think the reference counting on the interp argument makes both of those uses harder, because it makes any interp reference you hold single-use.

I think there’s probably ways to write code to work around that but it isn’t completely obvious to me how easy that would be to get right.


That’s a good point. I designed PyThreadState_Ensure to steal the reference because I didn’t think there would be many cases where interp would be needed to ensure the thread more than once, and the cases that did would just use multiple PyInterpreterState_Hold calls.

I see two ways to deal with this:

  • Don’t steal the reference in PyThreadState_Ensure. This adds more boilerplate with PyInterpreterState_Release in the “common” case.
  • Add an incref API for interpreter states. This adds more boilerplate for the multi-use interp case, but it’s what I’m leaning towards.

That’s a good question. I can’t think of a concrete scenario for now, not sure I’ll be able to come up with something later :slight_smile:

Well, I guess good luck dealing with a lock in a signal handler?
(also, searching a linked list is O(n), this might be annoying in some cases)


It’s not totally unmanageable, but it might also depend on what we want to consider “safe” for a signal handler. PyInterpreterState_Lookup could just fail if called re-entrantly; I don’t think supporting re-entrancy is that important.

O(n) will probably be fine; I can’t think of a case where there would be more than 1000 or so subinterpreters in a single process.

One idea would be to add an “incref” function to “duplicate” a reference to an interpreter.


By the way, I’m a little bit confused by borrowed (classic) references to an interpreter versus new PEP 788 strong references (incref). Another idea would be to replace the “reference count” concept with “handles”:

  • PyInterpreterHandle_Create() (PyInterpreterState_Hold) and PyInterpreterHandle_Lookup() (PyInterpreterState_Lookup) create an interpreter handle.
  • PyInterpreterHandle_Dup() (new) duplicates a handle.
  • PyInterpreterHandle_Close() (PyInterpreterState_Release) closes a handle.

Having a separate concept (handles) and different object type (PyInterpreterHandle) would make it easier to understand that it prevents an interpreter from finalizing.


The implementation of a handle (PyInterpreterHandle) would be a structure containing an uintptr_t which would be the PyInterpreterState* pointer. Creating or duplicating a handle would still increment the internal interpreter reference counter, and closing a handle would decrement this counter.

PyThreadState_Ensure() would take a handle instead of PyInterpreterState*.

From my understanding, Py_Finalize() must:

  • wait until all non-daemon threads (those that called PyThreadState_Ensure()) have exited via PyThreadState_Release(), and
  • wait until all strong references to sub-interpreters have been released by PyThreadState_Ensure() or PyInterpreterState_Release().

Otherwise, Py_END_ALLOW_THREADS can hang a non-daemon thread, which would break the API, no?

I can easily imagine hundreds of subinterpreters if they are used for concurrency.

They’re generally pretty short-lived, and there’s, in general, about one interpreter per thread. I’ve never seen anyone try to create more than 500 in the wild.

I am not enthusiastic about the PEP in its current state.

  • I think this needs a reference implementation. I’m not convinced that this solves the problems listed in motivations. EDIT: I missed the link in the PEP, sorry!

  • Some things listed in the motivation, like PyGILState_Ensure() crashing on shutdown, should be solved regardless of new APIs. Others, like the use of the term “GIL”, are really minor compared to this sort of broad change.

  • “Daemon threads can cause finalization deadlocks”: I think this misunderstands the problem. Switching daemon to non-daemon threads will not fix any deadlocks! It will likely introduce more deadlocks! The underlying problem, in my opinion, is that we do too much work during finalization (i.e., calling finalizers, which can call arbitrary user code).

  • The proposed APIs do not serve as a direct replacement for PyGILState_Ensure(): PyGILState_Ensure() is usually called when you don’t have an active thread state, or at least don’t know if you have an active thread state. To use PyThreadState_Ensure(), you must have a valid interp from PyInterpreterState_Hold(), but that function requires that you already have a valid thread state! Ahhh!

  • What happens when PyInterpreterState_Hold() is called during finalization? What happens if PyThreadState_Ensure() is called during finalization? It’d be helpful to be specific about what precisely you mean by “interpreter finalization” as well. There are a few “phases” in Py_FinalizeEx().

  • “If the calling thread already has an attached thread state that matches interp , then this function will mark the existing thread state as non-daemon and return”: Ahhh!! This doesn’t sound like good API design to me: PyThreadState_Ensure() and PyThreadState_Release() now implicitly modify an existing thread state. How do you use these functions properly without turning daemon threads into non-daemon threads (or vice versa, if you pair them with PyThreadState_SetDaemon())?

  • The PEP title is grandiose for adding a few functions to replace PyGILState_Ensure(). This doesn’t reimagine native threads, nor do I think we should be “reimagining” them to fix the PyGILState_Ensure issues. Also, all threads in Python are native threads! We don’t have green threads or virtual threads (yet).


Thanks for the feedback!

There is a reference implementation. I’m sure there’s some minor thread-safety issues in there somewhere, but nothing that can’t be fixed.

Yes, I mentioned that those should be fixed regardless. The idea was to summarize all the issues with PyGILState_Ensure, including the existing bugs. I’m trying to make it clear that we need something new.

I’m a little confused here. Switching to a non-daemon thread would indeed fix the problem, because Py_END_ALLOW_THREADS won’t hang. Could you elaborate on how this would cause deadlocks?

Right, you need to figure out which interpreter to get a thread state for in the calling thread, which will always require a thread state! We can’t really get around this problem. If users really want, they can call PyInterpreterState_Main, but the point here is that we want to be explicit with the interpreter.

The interpreter waits until the thread is finished, and thus the interpreter reference is released. If the interpreter is at a point where native threads have already finished, then PyThreadState_Ensure will fail. PyInterpreterState_Hold will work, but it won’t do anything, and ideally the reference will just get thrown away by a subsequent PyThreadState_Ensure call anyway.

Sure, will clarify this.

PyThreadState_Release resets the daemon-ness of the thread state to what it was prior to the PyThreadState_Ensure call. I think that’s noted somewhere.

I do think it would be nice to not modify existing thread states, but then we kill use-cases where calls are nested (and PyGILState_Ensure would have previously been OK), because I don’t think we support arbitrarily switching the thread state (for the same interpreter), do we?

I mean “native threads” in contrast to threading threads. Using the term “non-Python created threads” was too verbose.

I don’t think we should worry about bikeshedding too much :wink:, but if we go with the idea that PyGILState_Ensure is the only way to create a native thread (again, non-Python created), then I would say this falls under the category of “reimagining.”


500 will already make lookups quite slow if you have to do a search in a linked list.

I guess we’d use a hash table then, but that can be done regardless of this PEP. A linked list iteration is just what exists for the _interpreters module at the moment.

Anyways, I’m starting to see your point on signal safety being limiting. @vstinner also showed some opposition towards signal safety for PyInterpreterState_Lookup, because basically nothing else in the C API is signal safe.


Thanks for taking the time to document these issues! A couple of additional thoughts/questions in addition to what others have said:

  • I’d like to understand a bit better why the current finalization behavior of “native threads” (hanging them after finalization) is problematic. I believe this is the approach taken by the JVM and they seem to have made it work.
  • I prefer it if we could make the “daemonicity” of threads immutable.
  • This is probably an unpopular opinion, but I don’t understand why we have the distinction between daemon and non-daemon threads. Having only daemon threads would lead to a simpler runtime. I know that we probably can’t change this for pure Python code, but I’d prefer we don’t introduce the distinction for extension code too.

Ok, one example. In Apache Arrow we have C++ IO abstractions that can be implemented for various backends. One of those implementations delegates the IO to Python. This allows PyArrow users to use Python file-like objects [1] with the runtime facilities that the Arrow C++ runtime provides (such as multi-threaded reads of CSV files).

If the Python callback decides to hang the current thread when called by Arrow C++, this will have ripple effects on the unsuspecting C++ runtime.

(yes, it’s better not to do anything complex at shutdown anyway, but you don’t always control this when you provide such cross-language integration in a library)

I suspect that integration of Java code with non-Java code is generally on a much more trivial level than what happens in the Python world, and therefore the problem doesn’t pop up so much there.


  1. It’s not great for performance but it works and there are some use cases. ↩︎
