Disallowing GC inside interpreter/JIT _Py_Dealloc calls

TL;DR: It would be convenient from a performance perspective if Py_DECREFs in the interpreter loop were unable to trigger GC. In practice, this means that if large amounts of cyclic garbage are created inside an object’s __del__, tp_dealloc, or tp_finalize function (or anything called by these functions), then that garbage may not be collected until some time after the object has been destroyed.

The free-threaded build already uses deferred reference counting for some types of objects (in order to allow efficient scaling by reducing contention on the refcounts of commonly-shared objects). This works by “tagging” the pointers themselves on the frame’s evaluation stack in lieu of constantly dereferencing the pointers and manipulating the reference count itself. This is only done for certain types, since it means that objects that use deferred reference counting are not actually reclaimed until the GC runs, which would be problematic to do in general for all objects. Deferred reference counting is something that we’d like to start doing in the non-free-threaded build as well; in addition to reducing the diff between the two builds, refcounting operations are exactly the sort of overhead that we’d like to remove in the interpreter and JIT.

However, this poses some problems. One key part of deferred reference counting is that the GC must be able to see all values on the stacks of all Python frames when it runs, in order to account for these tagged references. This means that all potentially-tagged values the interpreter is currently using must be spilled to the stack whenever GC may run. In other words, any time that we may “escape” the interpreter and run arbitrary Python or C code… which, crucially, involves every Py_DECREF call (spelled as PyStackRef_CLOSE in the interpreter).

As you can imagine, decrefs are everywhere in the interpreter. Storing a local decrefs an object. Binary math decrefs two objects. Getting an item in or out of a list decrefs 2-3 objects. This is a problem for two reasons: we need extra runtime logic to fix up the stack before all of these operations (code that is already on main in preparation for deferred reference counting), and we hope to start keeping stack items in machine registers in the JIT in 3.15. If we must spill before every decref, then in practice those items aren’t “kept” in registers at all. And having the GC traverse the C stack to find these references is an additional layer of complexity that I really don’t think we want to take on.

So, my proposed solution: increment a counter on the thread state before we decref items in the interpreter/JIT, and decrement it when the decrefs are complete (like the current stack-spilling code, these operations are inserted automatically by our interpreter generator). GC won’t run if this counter is a non-zero value. Note that we still “escape” the interpreter in these calls, so the JIT will still need to run (cheap) validity checks after these decrefs, but spilling won’t be necessary since it’s only needed for GC.

Current code for BINARY_OP_SUBSCR_LIST_INT (mostly generated, I’ve cleaned it up a bit):

```c
// ...some checks come before this...
STAT_INC(BINARY_OP, hit);
PyObject *res_o = PyList_GET_ITEM(list, index);
assert(res_o != NULL);
_PyFrame_SetStackPointer(frame, stack_pointer);         // Removed.
stack_pointer[-2] = PyStackRef_FromPyObjectNew(res_o);
PyStackRef_CLOSE(list_st);
stack_pointer[-1] = PyStackRef_NULL;                    // Removed.
PyStackRef_CLOSE(sub_st);
stack_pointer = _PyFrame_GetStackPointer(frame);        // Removed.
stack_pointer += -1;
assert(WITHIN_STACK_BOUNDS());
```

After (roughly):

```c
// ...some checks come before this...
STAT_INC(BINARY_OP, hit);
PyObject *res_o = PyList_GET_ITEM(list, index);
assert(res_o != NULL);
tstate->deallocs++;                                     // Added.
stack_pointer[-2] = PyStackRef_FromPyObjectNew(res_o);
PyStackRef_CLOSE(list_st);
PyStackRef_CLOSE(sub_st);
tstate->deallocs--;                                     // Added.
stack_pointer += -1;
assert(WITHIN_STACK_BOUNDS());
```

This seems like an okay trade-off to me: the precise timing of GC is an implementation detail, and I think it’s reasonable to disallow running it in these situations. Short of long-lived __del__ methods that create tons of cyclic garbage and trigger collections today, this shouldn’t affect users in any meaningful way.

But, I’d like to check with people before moving forward with this. Am I missing something?

CC @markshannon, @kj0.


Update: @colesbury pointed out to me that this could get hairy in certain threaded contexts. For example, GC could be delayed indefinitely if one or more threads are in the process of deallocating objects whenever GC tries to run in another thread.

It’s probably a problem that can be solved, but for now the best path forward is probably just to remove as many refcounts as possible in the JIT, which we already want to do. Perhaps some version of this approach may be helpful in the future, though.