Safely using the C API when Python might shut down

Dear all,

Tricky corner cases sometimes arise in huge and complicated (think: TensorFlow, PyTorch, etc.) mixed C/C++/Python codebases.

There are too many situations to all cover, but for simplicity let’s say that:

  1. Some code runs in C/C++, and it needs to use the Python C API to update some state.
  2. To do so safely, it first calls PyGILState_Ensure().

Often, the triggers are asynchronous and diverse:

  • A kernel has finished running on the GPU
  • A network packet was received.
  • A thread has quit, and the C++ library is executing static finalizers of thread local storage.
  • etc…

Now suppose that the Python interpreter has already shut down by the time this happens, so it can no longer service something as basic as PyGILState_Ensure() or Py_DECREF(). What happens then? The docs say that PyGILState_Ensure() will then terminate the calling thread. Sadly, there is no reliable way to terminate a thread like that in C++, and it usually segfaults the application :-(. Wherever such events can occur, there is a long tail of spurious crashes that are difficult to reproduce and fix.

Python has an API that is supposedly an answer to this problem: one can call Py_IsFinalizing(). If that returns true, the interpreter is in the process of shutting down. Unfortunately, this API doesn’t solve the problem, for two reasons.

Consider a pattern like this:

if (!Py_IsFinalizing()) { /* #1 */
    PyGILState_STATE state = PyGILState_Ensure(); /* #2 */
    /* ... use the C API ... */
    PyGILState_Release(state);
}

The first reason: just because the check at #1 succeeded doesn’t mean that #2 is still safe to execute. The main thread might have made further progress in the meantime, causing the interpreter to fully shut down. It’s a classic check-then-act race condition.

The second reason is that we often still want to use the Python C API even when Py_IsFinalizing() is true. That’s because shutdown logic can itself trigger asynchronous events that cause some resource to finally be deleted. As long as it is still possible, we should deliver Py_DECREF() calls etc., so that the garbage collector can clean things up.

What I am really missing is an API that looks like the following:

PyGILState_STATE state = PyGILState_EnsureOrSafelyFail(); /* will never crash */
if (state != PyGILState_FAILURE) {
    /* Python API safe to use until the release statement below */
    /* ... do stuff ... */
    PyGILState_Release(state);
} else {
    // Oh well. Do the best that we can do here without talking to Python
}

To my knowledge, Python doesn’t have something like this at the moment. Is it possible to provide such an API?

Thanks!

2 Likes

Register an atexit function to set a bool?

I don’t think the API can work as you posted it: it’s possible to reinitialize the runtime, after which Python will be initialized again but all your old PyObject*s will be invalid. (They’ll also be invalid if PyGILState_Ensure puts you in the wrong interpreter, which might be hard to avoid for a library. I’d rather not add more API like PyGILState_Ensure.)

Perhaps we need a “reference counter” that lets things declare that they need the interpreter to stay around, with Python refusing to finalize while it’s still needed. On top of that, one could implement weak references with callbacks that set a bool.

1 Like

Setting a flag from atexit has the same issue as Py_IsFinalizing(): it’s just a flag, with no guarantee beyond the instant at which it was read.

Perhaps we need a “reference counter” to allow things to say that they need the interpreter to stay around, with Python refusing to finalize if it’s still needed.

Yes, exactly – I was thinking of something like a reference counter.

(PS: The situation where Python dies and comes back from the dead, while the C++ extension keeps on running with its undead PyObject*s, seems much more fringe than the (already fringe) situation I was describing ;))

3 Likes

The “Python comes back from the dead” case is easy, just refuse to continue.
A failure when trying to restart the interpreter because some old thread still holds a reference is a lot more benign than the kind of crash that happens when code in a C++ destructor raises an exception that can neither be swallowed (the runtime refuses to allow that) nor propagated (destructors may not throw exceptions). That kind of situation makes the C++ runtime implode.

The question is, though, who or what should do the reference counting – AFAIK the fact that you saved your thread state is no guarantee that it’ll be restored …?
Wenzel’s initial “oh well, then we’ll leak some resources instead” API has the benefit that it would work now: just refactor the code that restores the thread state so it can optionally return a sorry-we’re-in-shutdown flag instead of cancelling the thread.

A new function like PyGILState_EnsureOrSafelyFail() sounds like a good idea. It was discussed previously: issue #124622.

Python was modified recently to no longer call pthread_exit(), but to hang the thread instead: see issue #87135.

2 Likes

I don’t know where that modification was, and I’m not going to check right now, because wherever it was doesn’t affect the problem we’re talking about here.

3.12.8, Python/ceval_gil.c, line 346 ff, in take_gil, right at the top:

    if (_PyThreadState_MustExit(tstate)) {
        /* ... */
        PyThread_exit_thread();
    }

(There’s another call to PyThread_exit_thread further down.)
This definitely is a problem when the take_gil in question was called from a destructor. Oops, no exceptions allowed, hard crash: PyThread_exit_thread calls pthread_exit which, you guessed it, exits by throwing a C-level cancellation exception.

It removes the line in question and replaces the crash with a hang.

That is only in 3.14 right now. We haven’t backported it yet, though we are considering doing so. It feels rather invasive as a bugfix for a patch release… but given that the alternative in such a scenario was processes dying a hard-to-debug, horrible death, maybe not.

2 Likes

Ah. Sorry for the misunderstanding, I got the distinct but apparently wildly incorrect impression that this had happened some time ago.

1 Like

PyGILState_EnsureOrSafelyFail() would be useful for PyO3 too. We’ve had similar cases where, e.g., a Rust thread would like to attach and emit some logging calls, but this can crash during shutdown.

2 Likes

In general, I don’t trust PyGILState. If anything, it should be deprecated. You’ll have better luck using the thread state APIs directly. I suspect something like this should work more reliably:

PyThreadState *tstate = PyThreadState_New(PyInterpreterState_Main());
if (tstate != NULL) { /* PyThreadState_New can fail */
    PyThreadState_Swap(tstate);
}

PyGILState will try to access the automatic interpreter state, which might be NULL if the runtime is shutting down. PyThreadState_New should, in theory, be OK to call without a runtime.

Explicit +1 from me as well. It’s important to know that Python can’t be used anymore at this point, but still be able to go on with non-Python tasks.

3 Likes

Strong +1 for this API.

I’ve been using pybind11 to wrap external C++ libraries for over 5 years now, and have spent so much time creating various workarounds to avoid crashing the interpreter on shutdown.

Typically, many of these crashes involve some C++ singleton (or other static global that is torn down at shutdown) that holds a C++ object with a reference to a Python object (think std::function where the captured context has a py::object that needs to be destroyed), or some C++ thread that is shutting down and releasing Python objects. Certainly these things can be written correctly, but when you’re wrapping an external library it’s harder to get even a cooperative upstream to make those changes, and there are lots of weird corner cases.

It would be fantastic if this API could be cheap too: then C++ destructors that release PyObjects could just try to take the GIL and skip the decref if that fails, and most of my workaround code could be removed.

1 Like

C++ globals with destructors are in my experience a recipe for disaster in even pure C++ code (arbitrary order of destruction); I think there’s a parallel there.

How about this: would the following be implementable from the Python interpreter side in a way that doesn’t break existing code that works now, and would it actually solve the problem for extensions without creating a minefield of deadlocks?

Allow native extensions to register a shutdown callback. The contract would be this:

  1. The callback is allowed to block.
  2. Before the callback returns, the interpreter and all extensions with callbacks registered before this callback are guaranteed to be available, both to the code in the callback and to any threads.
  3. After the callback returns, it is undefined behavior to call any Py* function unless the documentation declares otherwise.
  4. (implied by 2) The callbacks are called in the reverse order of registration. (What to do with existing code without callbacks?)

Let’s not get hung up on C++ globals, that’s another story. I intentionally did not mention them in the first post of this thread, there are plenty of other event sources (usually unavoidable ones) that trigger crashes.

@sliedes What you describe already exists: for example, atexit.register() registers a shutdown callback. However, those callbacks are invoked very early in Python’s shutdown sequence – the GC hasn’t kicked in yet, and it’s even still possible to import new modules at that point. There is also a lower-level one, Py_AtExit(), which runs too late: Python has already fully shut down by the time its callbacks are invoked. That API is also odd in that only a limited number of callbacks can be registered (max 32). The proposed PyGILState_EnsureOrSafelyFail() would succeed for as long as possible – until the interpreter state really is no longer usable.

Overall, I think that the problem is important enough that it would be good for Python to provide a safe API, instead of a dangerous one that forces people to develop half-baked workarounds.

2 Likes

I created PR gh-129688 to add PyGILState_EnsureOrFail() function.