Safely using the C API when Python might shut down

Dear all,

Tricky corner cases sometimes arise in huge and complicated (think: TensorFlow, PyTorch, etc.) mixed C/C++/Python codebases.

There are too many situations to all cover, but for simplicity let’s say that:

  1. Some code runs in C/C++, and it needs to use the Python C API to update some state.
  2. To do so safely, it first calls PyGILState_Ensure().

Often, the triggers are asynchronous and diverse:

  • A kernel has finished running on the GPU
  • A network packet was received.
  • A thread has quit, and the C++ library is executing static finalizers of thread local storage.
  • etc…

Now suppose that the Python interpreter has already shut down by the time this happens, so it can no longer service something as basic as PyGILState_Ensure() or Py_DECREF(). What happens then? The docs say that PyGILState_Ensure() will then terminate the calling thread. Sadly, there is no reliable way to terminate a thread like that in C++, and it usually segfaults the application :-(. Wherever such events can occur, there is a long tail of spurious crashes that are difficult to reproduce and fix.

Python has an API that is supposedly an answer to this problem: one can call Py_IsFinalizing(). If that returns true, the interpreter is in the process of shutting down. Unfortunately, this API doesn’t solve the problem, for two reasons.

Consider a pattern like this:

if (!Py_IsFinalizing()) { /* #1 */
    PyGILState_STATE state = PyGILState_Ensure(); /* #2 */
    /* ... use the C API ... */
    PyGILState_Release(state);
}

The first reason: just because the check at #1 succeeded doesn’t mean that #2 is still safe to execute. The main thread might have made further progress in the meantime, causing the interpreter to fully shut down. It’s a classic check-then-act race condition.

The second reason is that we often still want to use the Python C API even when Py_IsFinalizing() is true. That’s because shutdown logic can itself trigger asynchronous events that cause some resource to finally be deleted. As long as it is still possible, we should deliver Py_DECREF() calls etc., so that the garbage collector can clean things up.

What I am really missing is an API that looks like the following:

PyGILState_STATE state = PyGILState_EnsureOrSafelyFail(); /* will never crash */
if (state != PyGILState_FAILURE) {
    /* Python API safe to use until the release statement below */
    /* ... do stuff ... */
    PyGILState_Release(state);
} else {
    // Oh well. Do the best that we can do here without talking to Python
}

To my knowledge, Python doesn’t have something like this at the moment. Is it possible to provide such an API?

Thanks!

2 Likes

Register an atexit function to set a bool?

I don’t think the API can work as you posted it: it’s possible to reinitialize the runtime, after which Python will be initialized again but all your old PyObject*s will be invalid. (They’ll also be invalid if PyGILState_Ensure puts you in the wrong interpreter, which might be hard to avoid for a library. I’d rather not add more API like PyGILState_Ensure.)

Perhaps we need a “reference counter” that lets things declare that they need the interpreter to stay around, with Python refusing to finalize while it’s still needed. On top of that, one could implement weak references with callbacks that set a bool.

1 Like

Setting a flag from atexit has the same issue as Py_IsFinalizing(): it’s just a flag, with no guarantee beyond the instant at which it was read.

Perhaps we need a “reference counter” to allow things to say that they need the interpreter to stay around, with Python refusing to finalize if it’s still needed.

Yes, exactly – I was thinking of something like a reference counter.

(PS: The situation where Python dies and comes back from the dead, while the C++ extension keeps on running with its undead PyObject*s, seems much more fringe than the (already fringe) situation I was describing ;))

3 Likes

The “Python comes back from the dead” case is easy, just refuse to continue.
A failure when trying to restart the interpreter because some old thread still holds a reference is a lot more benign than the kind of crash that happens when code in a C++ destructor raises an exception that can neither be swallowed (the runtime refuses to allow that) nor propagated (destructors may not throw exceptions). That kind of situation makes the C++ runtime implode.

The question is, though, who or what should do the reference counting – AFAIK the fact that you saved your thread state is no guarantee that it’ll be restored …?
Wenzel’s initial “oh well, then we’ll leak some resources instead” API has the benefit that it would work now: just refactor the code that restores the thread state so it can optionally return a sorry-we’re-in-shutdown flag instead of cancelling the thread.

A new function like PyGILState_EnsureOrSafelyFail() sounds like a good idea. It was discussed previously: issue #124622.

Python was modified recently to no longer call pthread_exit(), but to hang the thread instead: see issue #87135.

2 Likes

I don’t know where that modification was, and I’m not going to check right now, because wherever it was doesn’t affect the problem we’re talking about here.

3.12.8, Python/ceval_gil.c, line 346 ff, in take_gil, right at the top:

    if (_PyThreadState_MustExit(tstate)) {
        /* ... */
        PyThread_exit_thread();
    }

(There’s another call to PyThread_exit_thread further down.)
This definitely is a problem when the take_gil in question was called from a destructor. Oops, no exceptions allowed, hard crash: PyThread_exit_thread calls pthread_exit which, you guessed it, exits by throwing a C-level cancellation exception.

It removes the line in question and replaces the crash with a hang.

That is only in 3.14 right now. We haven’t backported it yet, though we are considering doing so. It feels rather invasive as a bugfix for a patch release… but given that the alternative in such a scenario was processes dying a hard-to-debug, horrible death, maybe not.

2 Likes

Ah. Sorry for the misunderstanding, I got the distinct but apparently wildly incorrect impression that this had happened some time ago.

1 Like

PyGILState_EnsureOrSafelyFail() would be useful for PyO3 too. We’ve had similar cases where, e.g., a Rust thread would like to attach and emit some logging calls, but this can crash during shutdown.

2 Likes

In general, I don’t trust PyGILState. If anything, it should be deprecated. You’ll have better luck using the thread state APIs directly. I suspect something like this should work more reliably:

PyThreadState *tstate = PyThreadState_New(PyInterpreterState_Main());
if (tstate != NULL) { /* PyThreadState_New can fail */
    PyThreadState_Swap(tstate);
}

PyGILState will try to access the automatic interpreter state, which might be NULL if the runtime is shutting down. PyThreadState_New should, in theory, be OK to call without a runtime.

Explicit +1 from me as well. It’s important to know that Python can’t be used anymore at this point, but still be able to go on with non-Python tasks.

3 Likes

Strong +1 for this API.

I’ve been using pybind11 to wrap external C++ libraries for over 5 years now, and have spent so much time creating various workarounds to avoid crashing the interpreter on shutdown.

Typically, many of these crashes involve some C++ singleton (or other static global that is torn down at shutdown) that holds a C++ object with a reference to a Python object (think std::function where the captured context has a py::object that needs to be destroyed), or some C++ thread that is shutting down and releasing Python objects. Certainly these things can be written correctly, but when you’re wrapping an external library it’s harder to get even a cooperative upstream to make those changes, and there are lots of weird corner cases.

It would be fantastic if this API could be cheap too: then C++ destructors that release PyObjects could just try to take the GIL and skip the decref if that fails, and most of my workaround code could be removed.

1 Like

C++ globals with destructors are in my experience a recipe for disaster in even pure C++ code (arbitrary order of destruction); I think there’s a parallel there.

How about this: would the following be implementable from the Python interpreter side in a way that doesn’t break existing code that works now, and would it actually solve the problem for extensions without creating a minefield of deadlocks?

Allow native extensions to register a shutdown callback. The contract would be this:

  1. The callback is allowed to block.
  2. Before the callback returns, the interpreter and all extensions with callbacks registered before this callback are guaranteed to be available, both to the code in the callback and to any threads.
  3. After the callback returns, it is undefined behavior to call any Py* function unless the documentation declares otherwise.
  4. (implied by 2) The callbacks are called in the reverse order of registration. (What to do with existing code without callbacks?)

Let’s not get hung up on C++ globals, that’s another story. I intentionally did not mention them in the first post of this thread, there are plenty of other event sources (usually unavoidable ones) that trigger crashes.

@sliedes What you describe already exists: for example, atexit.register() registers a shutdown callback. However, those callbacks are invoked very early in Python’s shutdown sequence – the GC hasn’t kicked in yet, and it’s even still possible to import new modules at that point. There is also a lower-level one, Py_AtExit(), which runs too late: Python has already fully shut down by the time its callbacks are invoked. That API is also odd in that only a limited number of callbacks can be registered (max 32). The proposed PyGILState_EnsureOrSafelyFail() would succeed for as long as possible – until the interpreter state really is no longer usable.

Overall, I think that the problem is important enough that it would be good for Python to provide a safe API, instead of a dangerous one that forces people to develop half-baked workarounds.

2 Likes

I created PR gh-129688 to add PyGILState_EnsureOrFail() function.