PEP 788: Reimagining native threads

The two main cases I quoted in the motivation were:

  • There are cases (for example, with C++ destructors) where Python code should be run in addition to some other work. If we hang the thread, nothing else can run, which isn’t very helpful. I’m not sure how users of JNI deal with this.
  • If a daemon thread grabs a lock and then gets hung, the main thread can possibly deadlock during finalization.

Could be worth it, but probably should be done in a future PEP. I think threading can change the “daemonicity” of a thread whenever it wants, and we’re probably going to end up unifying threading’s shutdown and native threads in the implementation.

I’m not too sure how daemon threads make things simpler. Any lock acquired during finalization is cloudy with a chance of deadlocks right now.

That, and aren’t non-daemon threads typically more useful/common? There was recently a proposal to deprecate daemon threads. I can’t think of many cases where it’s desirable to let your thread disappear at the nearest Python call. Isn’t it more common to want your computation to finish?
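The difference in shutdown behavior is easy to demonstrate from pure Python: a non-daemon thread is joined at interpreter exit, while a daemon thread is simply abandoned mid-work. A minimal sketch, using a subprocess so each case gets a clean interpreter exit:

```python
import subprocess
import sys
import textwrap

def run_with(daemon: bool) -> str:
    # Run a fresh interpreter that starts one worker thread and then
    # exits immediately; only a non-daemon thread is waited for at
    # shutdown, so only it gets to print.
    script = textwrap.dedent(f"""
        import threading, time

        def work():
            time.sleep(0.2)
            print("finished")

        threading.Thread(target=work, daemon={daemon}).start()
    """)
    result = subprocess.run([sys.executable, "-c", script],
                            capture_output=True, text=True)
    return result.stdout

print(repr(run_with(daemon=False)))  # 'finished\n' -- the thread was joined
print(repr(run_with(daemon=True)))   # '' -- the thread disappeared at exit
```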


PyThreadState_Ensure steals a reference to interp. This implies that code has to call PyInterpreterState_Hold once for every time that PyThreadState_Ensure will be used. This seems problematic when calling back into Python from a callback that might be invoked an arbitrary number of times.

PyThreadState_Release doesn’t seem to fix the stated issue with PyGILState_Release in that it can hang the process. Is it possible to return an error instead of blocking?

What’s the performance impact of this API compared to the older GILState APIs? I have code that can receive a relatively large number of callbacks without holding the GIL. The description of PyThreadState_Ensure implies that a new thread state will be created and destroyed for every callback and that could be expensive.


Yes, that’s what PyInterpreterState_Lookup is for.

Well, it could return an error, but it shouldn’t. Thread states randomly disappearing during finalization will definitely not be fun. PyThreadState_Release calls can only hang the thread if it was originally daemon (e.g., via PyGILState_Ensure), which is intended. If you don’t use the old APIs, nothing will hang.

The PEP mentions that PyThreadState_Ensure will only create a new thread state if the active thread state doesn’t match the interpreter.

In addition, Sam made a good point on GitHub that we should also store the thread state in the gilstate pointer. That will also help eliminate creation of unnecessary thread states.

I still don’t like stealing a reference though, this code feels unbalanced:

interp = PyInterpreterState_Lookup(some_id);
if (!PyThreadState_Ensure(interp)) {
    /* do stuff */
    PyThreadState_Release();
}

I’m not sure if I understand this. After the call to PyThreadState_Release I’m done with Python, and it shouldn’t matter that some other thread is finalizing. Or can the hang only happen when nesting the PyGILState and PyThreadState APIs?

I’ve considered not stealing a reference, but I’d like to hear what others have to say about it. I’m worried we’ll just add unnecessary boilerplate to the common case (that is, creating a fresh thread natively).

Yeah, it can only hang when reattaching a prior thread state.

Ok, rereading the thread a bit, I think we want to make the following changes:

  • Add PyThreadState_GetDaemon for debugging. This involves unifying how threading and the C API shut down threads, otherwise PyThreadState_GetDaemon will return weird values in cases where C code is run by a thread created by threading.
  • Add PyInterpreterState_Incref to handle multi-use cases of an interpreter.
    • Should we remove PyInterpreterState_Hold in that case? PyInterpreterState_Get + PyInterpreterState_Incref would be sufficient, but consequently more cumbersome to use.
  • Remove any guarantee of signal safety from PyInterpreterState_Lookup.
  • Use the gilstate pointer in PyThreadState_Ensure. This will add compatibility with thread states created by PyGILState_Ensure, and vice versa.
    • I am a little worried about when a thread switches between interpreters, though. If a thread enters interpreter A, and then enters interpreter B, should a subsequent PyThreadState_Ensure call to re-enter interpreter A use the original thread state?
  • Add some clarification about what needs to change in finalization.
  • Don’t steal a reference in PyThreadState_Ensure? (This will be an open issue.)
  • Should we add a new “handle” type for interpreter references to help make reference counting semantically clearer? (This will also be an open issue.)

I’ll make the updates once some clarity issues get fixed. (Heads up @pitrou, there’s a PR based on your original comments awaiting your review :smile:.)


Only supporting daemon threads avoids adding more complexity into the runtime.

Many other languages only have “daemon” threads (e.g. C/C++, Go, Ruby).

If you want your computations to finish, you should architect your application in a way that you can join the threads. This is probably just a matter of taste, but I think non-daemon threads move complexity into the runtime that should instead be handled by application developers.


C++20 has the awkwardly-named jthread which, AFAIU, is a non-daemon thread (it joins when the object is destroyed).

C++ std::jthread is only enforced by the destructor rather than the language runtime though, so it’s not a lot different to putting thread.join() in a finally: block in Python.
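In Python terms, the pattern being compared to jthread looks something like this minimal sketch: the join is guaranteed by a finally: block the programmer wrote, not by the language runtime itself.

```python
import threading

results = []
worker = threading.Thread(target=lambda: results.append(42))
worker.start()
try:
    pass  # main-thread work that runs alongside the worker
finally:
    # The Python analogue of jthread's joining destructor: completion
    # is enforced by this block, not by the interpreter.
    worker.join()

print(results)  # [42]
```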

If I created a jthread with new std::jthread(...) and then forgot to delete it, there’s nothing magical about it that causes C++ to wait for it. Just the strong convention that resources should be owned by an object with a defined lifetime.


How does that make things any different? There is no “language runtime” to speak of in C++, and a C++ destructor called at shutdown is morally equivalent to the Python interpreter finalizing all resources at shutdown.

That’s very bad C++ code to begin with. That’s not what C++20 developers would typically write.

Yes, you can create resource leaks and eschew object finalization in C++ more easily than in Python. That’s entirely beside the point.

I agree in general, but the issue is that in our C API, we can’t join the thread. I should make this clearer in the motivation, thanks for bringing this up.

If you created a thread in the C API, the only way to get it to completion is to call join before finalization starts. Speaking of which, I should clarify two meanings of finalization (and I need to do this in the PEP too):

  • _Py_Finalize: I try to refer to this as “beginning finalization.” This is where threading threads are joined, and where native threads are joined with this PEP.
  • “Finalization”: the runtime has the finalizing field set to 1. This causes non-main thread states to hang during attachment. A Python finalizer is only run when this has happened, so join needs to be called explicitly to prevent hanging.

The idea here is to add a way to prevent that finalizing field from being set until native threads are done, and to provide an alternative to PyGILState_Ensure that doesn’t immediately brick the thread if the interpreter isn’t available. Does that make more sense?

It’s more just a comment that C++’s jthread is definitely related and relevant, but not necessarily completely analogous. If I understand correctly, it’s the special place that daemon threads have in the Python runtime that makes them difficult, and C++ avoids that problem.

Agreed. Although plenty of people manage to write bad code, probably disguised with a bit more complexity than the simple example I gave.

I think the reference stealing makes this proposed API drastically harder to use in the common case - and I think that you’re wrong about what the common case is.

I think this is the overwhelmingly common case. Callbacks that are only ever called once are much less common than callbacks that may be called any number of times. Take @pitrou’s example from upthread:

Every time it wants to do IO, it needs to attach a thread state, do the IO, and then release that thread state and return to the C++ layer. In my experience this is the overwhelmingly common case for native threads in Python: they don’t call back into Python exactly once, they call back into Python an arbitrary number of times. And potentially from an arbitrary number of threads; it’s quite common for a C++ library to have a thread pool and for tasks in that pool to sometimes need to call back into Python.

As another example of this sort of pattern, I maintain a library for work that lets C++ code log to the Python logging module. Every time the C++ code wants to log something, it calls PyGILState_Ensure, calls logging.log(), and calls PyGILState_Release. I also maintain a Python wrapper around a C++ service framework: every time a request arrives, the threadpool thread that it’s assigned to needs to call PyGILState_Ensure, call the Python request handler function, and then call PyGILState_Release.

I struggle to think of a time when the reference stealing version would be useful to me. I can think of dozens of cases in code I maintain where it would be the wrong default.


Yeah, I was wondering about that. To me, it seemed somewhat natural that the “common” case would be similar to what you would do with threading, but that has seemed less and less true. It would be nice if there were some clear examples that I could cite in the rationale.

For use cases where you’re starting from a thread that has an attached thread state (the precondition for calling PyInterpreterState_Hold()) and you want to start a thread that always has an associated thread state, you wouldn’t use native threads at all; you would just use threading. In most of my extension modules where I use PyGILState_Ensure (maybe even all of them?), I’m not even the one that starts the thread, the native code that I’m binding to is.


Currently the best way to do that is with an atexit handler registered by the library that gracefully stops any background threads started by that library. (At least, assuming context managers can’t be used, or can’t be relied upon because the user might just directly call __enter__ and never call __exit__. Users do the darnedest things…)

If you do that, the current best way to do cleanup stops working. atexit handlers run after the interpreter waits for all of the non-daemon threads to finish.

Well, it’s not a great solution, and I don’t think it’s something commonly used in practice. Py_AtExit doesn’t take a closure, and the variation that does (PyUnstable_AtExit) was only added in 3.13, and undocumented up until a few months ago.

That also doesn’t work for any asynchronous callbacks, and any additional threads created after the exit function is called (e.g., in a finalizer) will indeed hang.

How so? The place where non-daemon threads are finalized won’t change; there will just be user-created C threads in addition to the threading threads.

You do the cleanup I’m proposing with atexit.register(), not with Py_AtExit. Your Python library at some point creates some object that owns native threads. When it creates that thing, it registers an atexit callback to destroy that object, or at least to tell that object to gracefully shut down its threads.
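As a sketch of that pattern (the names here are illustrative, with a threading thread standing in for the native threads a real extension would own):

```python
import atexit
import threading

class ThreadOwner:
    """Illustrative object that owns a background thread and knows
    how to stop it gracefully."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()
        # Ask the interpreter to tear us down at exit, before the
        # runtime starts hanging or killing stray threads.
        atexit.register(self.shutdown)

    def _run(self):
        while not self._stop.wait(timeout=0.05):
            pass  # background work would go here

    def shutdown(self):
        self._stop.set()
        self._thread.join()

owner = ThreadOwner()
owner.shutdown()  # idempotent, so the atexit call is harmless later
print(owner._thread.is_alive())  # False
```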

I think it is pretty commonly used in practice. It’s the only way I know of to avoid the crashes and deadlocks that the PEP talks about.

It’s not exactly easy to search for since I’d expect the atexit.register() to usually be called from a .py file and the PyGILState_Ensure to mostly be called from a C file, but here’s some examples I found of projects calling both in the same file for exactly the sort of cleanup I’m talking about.
Example 1
Example 2
Example 3

Well, sure, if you don’t use non-daemon native threads, nothing changes, but the problems the PEP wants to solve aren’t solved either. If you do use non-daemon native threads, it becomes harder to stop background threads gracefully than it is today.

To frame this concern a little differently: the PEP’s proposal for native non-daemon threads is that the interpreter will wait for them to stop, but unless something tells them to stop, the interpreter will be stuck waiting forever for something that will never happen.

So this proposal is incomplete without some sort of mechanism that libraries can use to tell their threads to stop. Ordinarily I’d suggest atexit.register(), but Py_Finalize currently waits for non-daemon threads before calling atexit handlers.
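That ordering is observable from pure Python: in this sketch, the atexit handler only fires after the non-daemon thread has already been joined, which is why such a handler can’t be the thing that tells the thread to stop.

```python
import subprocess
import sys
import textwrap

# Run a fresh interpreter that registers an atexit handler and starts
# a non-daemon thread, then observe which prints first at shutdown.
script = textwrap.dedent("""
    import atexit, threading, time

    atexit.register(lambda: print("atexit handler"))

    def work():
        time.sleep(0.2)
        print("thread finished")

    threading.Thread(target=work).start()
""")
out = subprocess.run([sys.executable, "-c", script],
                     capture_output=True, text=True).stdout
print(out)
# thread finished
# atexit handler
```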

How is that any different from threads created with threading? A thread that runs forever can never be joined. That’s not specific to this PEP. I don’t think we need to reinvent how to create threads here; we’re just focusing on interacting with the interpreter asynchronously without nasty surprises.