Reference leaks in free-threaded 3.13 version

I’m currently investigating the changes needed to get an extension library framework (nanobind) to work with Python 3.13 (and eventually, the free-threaded version). In doing so, I noticed that garbage collection seems to behave differently in 3.13 when compiled with --disable-gil (versus a regular build without --disable-gil). For now, my extension doesn’t actually request the free-threaded version of Python, which triggers the fallback to GIL-based locking.

The nanobind test suite includes strict checks that trigger an error when any nanobind objects haven’t been freed by the time Python has fully shut down (via a Py_AtExit() callback).

One of my tests creates an intentional reference cycle that should be collected by the tp_traverse mechanism of the cyclic GC. This works in the default build but not in the free-threaded one.

Before I investigate more, I was wondering if garbage collection is fully worked out in the new free-threaded version (which I understand is still very experimental). If it’s an expected issue, I’ll hold off.

Thanks,
Wenzel


One more detail: one of the objects in the uncollected reference cycle is a Python function which references an external object via a reference cell/closure. Potentially this is related to deferred reference counting, which I read is used for functions, code objects, and such.

Top-level Python functions and code objects are immortalized in free-threaded CPython 3.13 for performance and scalability reasons. This means they are never freed or tracked by GC.

We have already removed this immortalization in 3.14 and are still working on it.

Dear @kj0,

thank you for the suggestion. In my case, the leak has a structure that does not use a top-level function/code object. Instead, a type + local method is constructed dynamically within a testcase.

def test01_leak():
    class Test(Base):
        def __init__(self):
            super().__init__()

            def f():
                print(self.f)

            self.f = f

    t = Test()
    del t

The reference cycle involves the local function f, whose capture cell/closure object stores self. The function f is in turn stored in self.f. The type Base is defined in a C extension library, and implements the tp_traverse callback that is there to detect the reference. I can see this callback being invoked, but the cycle is not collected in free-threaded builds.
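The cycle described above can be reproduced without any C extension. The following is a minimal sketch, assuming a plain Python class named Base as a hypothetical stand-in for the extension type (so it does not exercise a custom tp_traverse); it only illustrates the shape of the cycle and how a weak reference can be used to check that the cyclic GC reclaims it.

```python
import gc
import weakref


class Base:
    """Hypothetical pure-Python stand-in for the C extension base type."""


def make_cycle():
    class Test(Base):
        def __init__(self):
            super().__init__()

            def f():
                # 'self' is captured in a closure cell owned by f
                return self.f

            # self.__dict__ -> f -> closure cell -> self closes the cycle
            self.f = f

    t = Test()
    # Report whether f's only closure cell really holds the instance
    cell_holds_self = t.f.__closure__[0].cell_contents is t
    # Only a weak reference escapes, so the cycle alone keeps 't' alive
    return weakref.ref(t), cell_holds_self


def cycle_is_collected():
    ref, cell_holds_self = make_cycle()
    assert cell_holds_self
    gc.collect()  # the instance is reachable only through its own cycle
    return ref() is None
```

If cycle_is_collected() returns False for a type backed by an extension, the traversal callback of that type is a reasonable first place to look.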

Is it the case that immortalization of reference cells occurs even in dynamically created methods? This seems problematic since operations that create such leaks could be called arbitrarily many times from a loop.


@colesbury

What might be leaking is not the closure or reference cycle, but rather creating the class itself. IIRC, class definitions are immortalized in 3.13 as well (again, removed for 3.14).

Can you check if this minimal test leaks?

def test01_leak():
    class Test(Base): pass

Just to confirm what @kj0 wrote:

  • Type objects (classes) are immortalized in 3.13 free-threaded build (3.13t).
  • This is no longer the case in 3.14 (main) with deferred reference counting.
  • If you create lots of classes dynamically, this will leak memory in 3.13t.
  • Top-level functions and (all?) methods are also immortalized in 3.13t. Closures are not immortalized.

In CPython, we disable the immortalization behavior using the test.support.suppress_immortalization context manager for refleak tests. This helps catch other leaks unrelated to immortalization. You can consider using that API, but it’s not public and may get removed or changed in future Python releases.
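Because that helper is private and may disappear, a test suite that wants to use it probably needs a fallback. A minimal sketch, assuming the helper keeps accepting a single boolean argument; run_leak_check and its check argument are hypothetical names introduced only for illustration:

```python
import contextlib

try:
    # Private CPython test helper; it may be missing or changed in other versions.
    from test.support import suppress_immortalization
except ImportError:
    def suppress_immortalization(suppress=True):
        # Fallback: a no-op context manager with the same call shape
        return contextlib.nullcontext()


def run_leak_check(check):
    """Run a hypothetical leak-check callable with immortalization suppressed."""
    with suppress_immortalization(True):
        return check()
```

With the fallback in place, the same test code runs on interpreters that do not ship the helper, where it simply has no effect.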


Hi @colesbury and @kj0,

after some further testing, I believe that there may be a more severe problem.

The leak is caused by the closure object of a local function created within the __init__(self) constructor of a dynamically created class. @kj0: just declaring the class isn’t enough. I need to actually call this constructor for the leak to manifest. Wrapping this code in

from test.support import suppress_immortalization
with suppress_immortalization(True):
    ...

does not fix the issue.

Furthermore, if I run the test 1000 times, I get 1000 leaks. Note that what is leaked here aren’t just code/type objects. They are instances of a C extension, which could in principle encapsulate a large memory region (e.g. a GPU tensor object).
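The "N runs, N leaks" check can be sketched in pure Python. This is only an illustration under stated assumptions: Tracked is a hypothetical stand-in for an extension type that counts live instances, and create_cycle_once and count_leaks are made-up helper names. Since no custom tp_traverse is involved, a build that collects the cycles should report no leaks.

```python
import gc


class Tracked:
    """Hypothetical pure-Python stand-in for an instance-counting extension type."""

    alive = 0

    def __init__(self):
        Tracked.alive += 1

    def __del__(self):
        Tracked.alive -= 1


def create_cycle_once():
    class Test(Tracked):
        def __init__(self):
            super().__init__()

            def f():
                return self.f

            # Cycle: instance -> f -> closure cell -> instance
            self.f = f

    Test()  # the instance is kept alive only by its own cycle


def count_leaks(repetitions):
    before = Tracked.alive
    for _ in range(repetitions):
        create_cycle_once()
    gc.collect()  # force a full collection before counting survivors
    return Tracked.alive - before
```

Calling count_leaks(1000) mirrors running the test 1000 times: the return value is the number of instances that survived collection.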

I simplified the test further and added explanations. See here for the implementation: nanobind/tests/test_stl.py at free-threaded · wjakob/nanobind · GitHub

Here is the failing test case:

def test32_std_function_gc():
    # Temporarily turn off immortalization
    from test.support import suppress_immortalization

    with suppress_immortalization(True):
        # Test class -> function -> class cyclic reference

        # t.FuncWrapper is a C extension type with a custom property 'f', which
        # can store Python function objects. It implements the tp_traverse
        # callback so that reference cycles arising from this function object
        # can be detected.

        class Test(t.FuncWrapper):
            def __init__(self):
                super().__init__()

                # The constructor creates a closure, which references 'self'
                # and assigns it to the 'self.f' member.
                # This creates a cycle self -> self.f -> self
                def f():
                    print(self.f)

                self.f = f


        # The Test class declared above inherits from 'FuncWrapper'.
        # This class keeps track of how many references are alive at
        # any point to help track down leak issues.
        assert t.FuncWrapper.alive == 0
        b = Test()
        assert t.FuncWrapper.alive == 1
        del b
        collect() # Forcefully invoke the garbage collector just to be sure.
        assert t.FuncWrapper.alive == 0 # <-- this fails on free-threaded builds

To me, it seems like the GC behavior in free-threaded builds is different, in a way that isn’t just tied to immortalization.

I’ve checked Python master and can confirm that this issue is still present in the in-development version of v3.14.

Here is a link to a failure showing the full error message: WIP: free-threaded python · wjakob/nanobind@93a39a7 · GitHub

I can reproduce the issue in nanobind and I’m looking into it.

The leak is due to reference counting operations within the tp_traverse callback:

In this case, the calculation of resurrected objects, which re-uses the ob_ref_local field, breaks. The following avoids the leak by avoiding the reference count modifications across the Py_VISIT, but it would be better to avoid reference count modifications in tp_traverse callbacks entirely:

nb::handle f = nb::cast(w->f, nb::rv_policy::none).release().dec_ref();
Py_VISIT(f.ptr());

I think there’s a similar issue in wrapper_tp_traverse in test_classes.cpp.


I think we may be able to make the GC less susceptible to this, but probably not in 3.13 at this point.

I filed the following issue to track this:


Thank you very much for tracking this down. I am relieved that it’s not a bug but mainly a tightening of conventions. These are easy to follow, and I added a note to the nanobind documentation to make users aware of this.
