[free-threaded] `Py_REFCNT` remains unexpectedly high after `Py_XDECREF` in C++ embedded scenario with multiple threads

Environment

- CPython version: 3.14 (free-threaded build, `Py_GIL_DISABLED`)

- Platform: (your OS)

- Use case: C++ application embedding the Python interpreter

Scenario

1. **Thread A** creates a Python object `obj`, held by a C++ pointer (`ob_tid = A_tid`, `ob_ref_local = 1`, `ob_ref_shared = 0`).

2. **Thread B** (attached via `PyGILState_Ensure`, with its own `PyThreadState` bound to the correct OS thread) calls `PyObject_GetAttrString(obj, "methodA")`, invokes the method, then calls `Py_XDECREF(methodA)` in thread B.

3. **Thread C** (same setup as B) calls `PyObject_GetAttrString(obj, "methodB")`, invokes the method, then calls `Py_XDECREF(methodB)` in thread C.

4. A few milliseconds later, **Thread A** reads `Py_REFCNT(obj)` and consistently observes **3** instead of the expected **1**.

Each thread uses its own `PyThreadState`, correctly created and bound to the corresponding OS thread, so ownership should be unambiguous.
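For comparison, here is a minimal pure-Python analogue of the scenario. It is only a sketch: the `Widget` class and helper names are hypothetical, `getattr` and `del` stand in for `PyObject_GetAttrString` and `Py_XDECREF`, and the worker threads are ordinary `threading.Thread` objects that run bytecode and then exit, which the C++ threads in our application do not. It therefore does not reproduce the embedding case; it only gives a baseline for what the count looks like when the workers are Python threads.

```python
import sys
import threading


class Widget:
    """Hypothetical stand-in for the embedded object `obj`."""

    def method_a(self):
        return "a"

    def method_b(self):
        return "b"


def call_and_release(target, name):
    """Mimic threads B and C: fetch a bound method, call it, drop it."""
    bound = getattr(target, name)  # analogue of PyObject_GetAttrString
    bound()
    del bound  # analogue of Py_XDECREF(method)


def refcount_before_and_after(target):
    """Return sys.getrefcount(target) before and after two workers run."""
    before = sys.getrefcount(target)
    workers = [
        threading.Thread(target=call_and_release, args=(target, name))
        for name in ("method_a", "method_b")
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    after = sys.getrefcount(target)
    return before, after


obj = Widget()
before, after = refcount_before_and_after(obj)
print(f"before={before}, after={after}")
```

If `before` and `after` match here but the embedded C++ case still reports 3, that would point at something specific to threads that never return to the eval loop.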

Expected behavior

After `Py_XDECREF(methodA)` in thread B and `Py_XDECREF(methodB)` in thread C:

- `methodA` and `methodB` are owned by B and C respectively, so `Py_XDECREF` takes the **owner path** in both cases.

- Both method objects reach `ob_ref_local = 0` → `_Py_MergeZeroLocalRefcount` → `_Py_Dealloc` → `method_dealloc` → `Py_DECREF(im_self = obj)`.

- Those two `Py_DECREF(obj)` calls come from non-owners (B and C are not the owner of `obj`), so they take `_Py_DecRefShared`, and `ob_ref_shared(obj)` goes 8 → 4 → 0.

- Final state: `ob_ref_local = 1`, `ob_ref_shared = 0`, `Py_REFCNT(obj) = 1`.
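The arithmetic behind the list above can be sketched as a small model. This is a simplification under stated assumptions: the two low bits of `ob_ref_shared` are treated as state flags and the rest as the shared count, the reported refcount is modelled as `ob_ref_local` plus that count, and immortal objects are ignored. The helper names are ours, not CPython's.

```python
REF_SHARED_SHIFT = 2  # assumption: two low bits of ob_ref_shared are flags


def encode_shared(count, flags=0):
    """Pack a shared count and flag bits into one ob_ref_shared value."""
    return (count << REF_SHARED_SHIFT) | flags


def decode_shared(ob_ref_shared):
    """Split an ob_ref_shared value into (count, flags)."""
    return ob_ref_shared >> REF_SHARED_SHIFT, ob_ref_shared & 0x3


def modelled_refcnt(ob_ref_local, ob_ref_shared):
    """Model of the reported refcount: local count plus shared count."""
    count, _flags = decode_shared(ob_ref_shared)
    return ob_ref_local + count


# Two non-owner decrements, starting from a shared count of 2.
shared = encode_shared(2)
trace = [shared]
for _ in range(2):
    shared -= 1 << REF_SHARED_SHIFT
    trace.append(shared)

print(trace)                          # [8, 4, 0]
print(modelled_refcnt(1, trace[-1]))  # 1, the expected final value
print(modelled_refcnt(1, trace[0]))   # 3, the value we actually observe
```

In this model the observed value of 3 is exactly what one gets if neither shared decrement has been applied.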

Actual behavior

`Py_REFCNT(obj)` returns **3** consistently, several milliseconds after both `Py_XDECREF` calls have returned. This corresponds to:

- `ob_ref_local = 1` (the C++ pointer held by thread A)

- `ob_ref_shared = 8` (count = 2, as if both `im_self` references still exist)

The delay rules out CPU cache-coherence lag, which is at most nanoseconds on modern hardware.

Attempted workaround

After each `Py_XDECREF`, and again before reading `Py_REFCNT` in thread A, we tried:

    PyThreadState* ts = PyEval_SaveThread();
    PyEval_RestoreThread(ts);

This had no effect. Through source analysis we understand why: in free-threaded mode, `PyEval_SaveThread`/`PyEval_RestoreThread` only trigger QSBR detach/attach (`_Py_qsbr_detach` / `_Py_qsbr_attach`) for memory-page reclamation. They do not run the Python eval loop and therefore do not trigger BRC queue merging (`_Py_brc_merge_refcounts`), which only executes inside the bytecode dispatch loop at ceval_gil.c:1386–1389.

Diagnostic results

We performed the following checks at the moment `Py_REFCNT(obj) = 3`:

1. `gc.get_referrers(obj)`: no Python-level holders found

    import gc

    holders = gc.get_referrers(obj)
    # Result: only the C++ reference holder is present
    # No bound method objects, no frames, no dicts, no tracebacks

Neither `methodA` nor `methodB` appears as a live object holding `obj`. This implies their `tp_dealloc` has already run (or they no longer exist), yet `ob_ref_shared(obj)` still shows count = 2.
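To show what this check can and cannot see, here is a small sketch of how a live bound method shows up in `gc.get_referrers`. The `Widget` class and the helper name are hypothetical; the point is only that a bound method that is still alive is reported as a referrer of its `__self__`, and disappears from the result once it is released.

```python
import gc
import types


class Widget:
    """Hypothetical stand-in for the embedded object `obj`."""

    def method_a(self):
        return "a"


def count_bound_method_referrers(target):
    """Count bound-method objects that gc.get_referrers reports for target."""
    return sum(
        1
        for referrer in gc.get_referrers(target)
        if isinstance(referrer, types.MethodType)
    )


obj = Widget()
held = obj.method_a  # keep one bound method alive
while_held = count_bound_method_referrers(obj)
del held             # drop the only reference to the bound method
after_release = count_bound_method_referrers(obj)
print(while_held, after_release)
```

A check of this shape is what we ran against the real `obj`; it found no bound methods, which is why we believe both method objects have already been deallocated.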

2. C++ layer: no unaccounted `Py_INCREF` calls

We audited all C++ code paths. Thread A holds exactly one reference (the original C++ pointer). No additional `Py_INCREF(obj)` calls were found in threads B or C beyond those implied by `PyObject_GetAttrString`.

The contradiction: `Py_REFCNT(obj) = 3` implies two live shared references, yet `gc.get_referrers` finds no holders and the C++ layer adds no extra increments. This suggests either:

- `Py_REFCNT` is reporting a stale or incorrect value in free-threaded mode (e.g. due to an un-merged BRC entry that inflates `ob_ref_shared` without a corresponding live holder), or

- The holders are non-GC-tracked C-level structures not visible to `gc.get_referrers`.

Next step pending: read raw `ob_ref_local`, `ob_ref_shared`, and the flag bits (`ob_ref_shared & 0x3`) via ctypes to determine whether `ob_ref_shared` is in the `REF_QUEUED` (0x2) state (which would not be visible in the `Py_REFCNT` count bits but would indicate a BRC stolen reference) or whether the count bits genuinely reflect two live references.

    import ctypes

    # Offsets assume the 64-bit free-threaded PyObject layout.
    addr = id(obj)
    ob_ref_local = ctypes.c_uint32.from_address(addr + 12).value
    ob_ref_shared = ctypes.c_ssize_t.from_address(addr + 16).value
    flags = ob_ref_shared & 0x3
    count = ob_ref_shared >> 2
    # flags: 0=INIT, 1=MAYBE_WEAKREF, 2=QUEUED, 3=MERGED
    print(f"ob_ref_local={ob_ref_local}, count={count}, flags={flags:#x}")
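Because the raw read above depends on a specific object layout, a guarded variant may be safer to share. The sketch below assumes the offsets 12 and 16 from the snippet above hold only on 64-bit free-threaded builds, and it returns `None` on anything else rather than reading arbitrary memory. The function names and the `probe` object are ours.

```python
import ctypes
import sysconfig

# Offsets copied from the snippet above; assumed valid only for
# 64-bit free-threaded builds.
OB_REF_LOCAL_OFFSET = 12
OB_REF_SHARED_OFFSET = 16


def is_free_threaded_64bit():
    """True if this interpreter was built with Py_GIL_DISABLED on 64-bit."""
    gil_disabled = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    return gil_disabled and ctypes.sizeof(ctypes.c_void_p) == 8


def read_raw_refcount_fields(target):
    """Return the raw refcount fields of target, or None if unsupported."""
    if not is_free_threaded_64bit():
        return None
    addr = id(target)
    local = ctypes.c_uint32.from_address(addr + OB_REF_LOCAL_OFFSET).value
    shared = ctypes.c_ssize_t.from_address(addr + OB_REF_SHARED_OFFSET).value
    return {"ob_ref_local": local, "count": shared >> 2, "flags": shared & 0x3}


probe = object()  # hypothetical stand-in for the real `obj`
fields = read_raw_refcount_fields(probe)
if fields is None:
    print("not a 64-bit free-threaded build; skipping raw read")
else:
    print(fields)
```

Reading object headers through ctypes is inherently fragile, so this is meant as a one-off diagnostic, not something to keep in production code.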

Analysis

Why `PyEval_SaveThread`/`PyEval_RestoreThread` cannot help

BRC queue merging is gated entirely on the eval-breaker check at ceval_gil.c:1386–1389:

    if ((breaker & _PY_EVAL_EXPLICIT_MERGE_BIT) != 0) {
        _Py_unset_eval_breaker_bit(tstate, _PY_EVAL_EXPLICIT_MERGE_BIT);
        _Py_brc_merge_refcounts(tstate);
    }

This is only reachable by executing Python bytecode. A C++ thread that calls `PyEval_SaveThread`/`PyEval_RestoreThread` and then returns to C++ never passes through this path, so BRC queues accumulate indefinitely for that thread.

No public API to drain BRC queues from C++

The only mechanism to force merging is the internal `_Py_brc_merge_refcounts(tstate)`, which is not part of the stable ABI. For C++ embedding scenarios, there is currently no supported way to ensure BRC queues are drained without executing Python bytecode.

Questions for the community

1. Given that `gc.get_referrers` finds no live holders yet `Py_REFCNT` = 3, is `Py_REFCNT` reliable for non-owner threads in free-threaded mode? Specifically: can a BRC `REF_QUEUED` entry or an un-merged `ob_ref_shared` value cause `Py_REFCNT` to over-report even after the last real holder has gone away?

2. Is there a guaranteed, supported path that eventually calls `_Py_Dealloc` (and thus runs `tp_dealloc` → `Py_DECREF(im_self)`) without the owning thread re-entering the eval loop, in a C++ embedding scenario where threads may not execute Python bytecode after their API calls complete?

3. Should `PyEval_SaveThread`/`PyEval_RestoreThread`, or a new dedicated function, trigger BRC queue merging in free-threaded mode, given that it is a natural synchronization boundary commonly used in C extension and embedding code?

No, `Py_REFCNT` isn’t guaranteed to be accurate. The Python docs explicitly say that `Py_REFCNT` “may not actually reflect how many references to the object are actually held”. Look at `PyUnstable_Object_IsUniquelyReferenced` (added in 3.14) instead.

I’m not well versed enough in the other parts to give advice.