This is for the free-threaded GC but could potentially be adopted for the generational and incremental GC as well. The free-threaded GC currently checks the process resident size and compares it to the size when the GC was last run. If the size has not increased by more than 10%, we consider skipping the collection. This helps decrease GC cost when you allocate many small container objects but produce little cyclic trash, which would otherwise cause GC to run too often.
This process-size check has an unfortunate interaction with how mimalloc manages memory: mimalloc defers returning freed memory to the OS. So if a GC pass frees a lot of trash, the process resident size reported by the OS doesn't change. I created an issue for this. I have a potential fix: compute memory usage by looking at mimalloc heaps and counting full "pages". However, even with that fix there are still problems: it misses memory allocated outside of mimalloc (e.g. mmap, other allocators), and it still collects too often if you create a lot of objects that produce little or no cyclic trash.
While looking for ways to fix this, I recalled Pablo's idea of dynamic GC thresholds. The idea is simpler to implement for the free-threaded GC since it always does full collections and there is only one threshold to adjust. The design that follows was assisted by Claude. Based on some limited benchmarking, it seems to work pretty well. Claude's explanation of the design we ended up with follows. A draft PR is available.
Cost model
Every allocation increments young.count. We want to answer: given the
current young.count, should we start a GC pass now?
A GC pass in the free-threaded build is dominated by the mark-alive walk
over the mimalloc GC heap. Its cost is roughly O(N) where N is the count
of live objects in that heap (long_lived_total in the code). So the
amortized cost per allocation is
\frac{\text{cost per pass}}{\text{allocations between passes}} \;\propto\; \frac{N}{T}
where T is the trigger threshold. To keep total GC cost roughly linear
in the number of allocations (that is, constant amortized cost per
allocation) as the program grows, T should scale with N. That gives us
our upper bound:
T_{\max} = \frac{N}{D}, \qquad D = 2
At D = 2, GC fires in the no-trash limit only after the heap has grown by
another 50%: with N = 1,000,000 live objects, the next pass can be deferred
for up to 500,000 allocations.
But T_{\max} alone is wrong: a program churning short-lived cycles wants
GC often, not only after large heap growth. We also have a user-configured
lower bound T_{\text{base}} (default 2000), which is the minimum number
of allocations between passes regardless of heap size. The adaptive trigger
T_a lives in [T_{\text{base}}, T_{\max}] and slides between them based on
recent trash productivity.
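To make the bounds concrete, here is a minimal standalone C sketch (hypothetical names, not the PR's actual code) of the upper bound with its floor clamp:

```c
#include <stdint.h>

/* Hypothetical sketch, not the PR's actual code: the upper bound
 * T_max = N / D, clamped so it never falls below the user floor. */
static int64_t
trigger_upper_bound(int64_t live_objects, int64_t t_base)
{
    const int64_t D = 2;                 /* heap-growth divisor */
    int64_t t_max = live_objects / D;
    return (t_max < t_base) ? t_base : t_max;
}
```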
Productivity ratio
After every threshold-triggered collection, we know two numbers: how many
objects the pass collected, C, and the heap size N. The trash ratio
r = \frac{C}{N}
measures the fraction of GC work that actually freed memory. A high ratio
means the pass paid for itself; a low ratio means we mostly walked live
objects for nothing. Note this is not trash-as-fraction-of-young — it’s
trash-as-fraction-of-the-walk, which is the right thing because the walk is
where the cost lives.
We map r to a target threshold via a hyperbolic decay:
T_{\text{target}}(r) \;=\; T_{\text{base}} + (T_{\max} - T_{\text{base}}) \cdot \frac{1}{1 + K r}
with K = 8. Sanity points:
| r     | fraction of range used |
|-------|------------------------|
| 0     | 100% (use T_{\max})    |
| 0.05  | ≈ 71%                  |
| 0.125 | 50%                    |
| 0.25  | ≈ 33%                  |
| 0.5   | ≈ 20%                  |
| 1.0   | ≈ 11%                  |
Why hyperbolic? Two reasons. First, it’s monotone and smoothly squashes
[0, \infty) into (0, 1] without piecewise cases. Second, the
1/(1+x) shape mirrors how productivity-vs-work tradeoffs feel — each
unit of trash buys less threshold reduction as you accumulate more, which
is what we want (we don’t need to slam the threshold to its floor the moment
we see a single cycle). K controls how fast the curve decays; the asymmetric
step (next section) does most of the noise filtering, so the exact shape
matters less than its monotonicity.
In code this is rearranged to keep everything integer:
T_{\text{target}} = T_{\text{base}} + \frac{(T_{\max} - T_{\text{base}}) \cdot N}{N + K C}
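A hypothetical standalone rendering of that integer form (the real code uses long long and Py_ssize_t; int64_t keeps this sketch self-contained, and shifting both counts rather than only N is one way to do the scaling):

```c
/* Hypothetical sketch of the integer form above. Shifting n and c
 * down together preserves the ratio n / (n + K*c) while keeping the
 * intermediate product inside 64 bits for realistic heap sizes. */
static int64_t
target_threshold(int64_t t_base, int64_t t_max,
                 int64_t live, int64_t collected)
{
    const int64_t K = 8;                 /* decay constant */
    int64_t n = live, c = collected;
    while (n > ((int64_t)1 << 20)) {     /* only the ratio matters */
        n >>= 1;
        c >>= 1;
    }
    if (n <= 0) {
        return t_max;                    /* empty heap: no trash evidence */
    }
    return t_base + (t_max - t_base) * n / (n + K * c);
}
```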
Asymmetric adaptation
We don’t move directly to T_{\text{target}} each pass — that would
“hunt” around its true value due to pass-to-pass variance. Instead we take
a weighted step from the current T_a toward T_{\text{target}}. And
the step size is asymmetric:
- Raising T_a (less trash than recently): cautious 1/4 step
- Lowering T_a (more trash than recently): aggressive 3/4 step
The asymmetry reflects asymmetric failure modes:
- Lowering too slowly costs a few extra GC passes — annoying, recoverable.
- Raising too fast can blow up the heap, and in a memory-constrained
environment (a container, an embedded device) that’s an OOM. Unrecoverable.
So we err on the side of more frequent collection: when trash appears we
snap toward the lower target in a single big step, but when trash dries up
we creep upward one quarter-step at a time. One lucky pass shouldn’t push us
into a long deferral.
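As a sketch (reusing the hypothetical helpers above), the step might look like:

```c
/* Hypothetical sketch: asymmetric weighted step from the current
 * adaptive trigger t_a toward the new target. */
static int64_t
adapt_trigger(int64_t t_a, int64_t target)
{
    if (target > t_a) {
        return t_a + (target - t_a) / 4;     /* cautious 1/4-step raise */
    }
    return t_a - (t_a - target) * 3 / 4;     /* aggressive 3/4-step lower */
}
```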
What doesn’t count
The adaptation only updates after collections that fired because the
threshold was hit (reason == _Py_GC_REASON_HEAP). Manual gc.collect()
calls and shutdown collections are skipped — they aren’t representative of
the steady-state trash ratio and would skew the average.
A user who calls gc.set_threshold() to raise T_{\text{base}} above the
current T_a gets an immediate resync: T_a jumps up to the new floor.
Lowering T_{\text{base}} relies on the natural T_{\max} clamp on the
next pass.
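Putting the pieces together, a hypothetical update hook could look like this; the reason enum is invented for the example, standing in for CPython's actual reason codes such as _Py_GC_REASON_HEAP:

```c
/* Hypothetical glue for the sketches above, not the PR's code. */
typedef enum {
    GC_REASON_HEAP,       /* threshold hit: the only reason that adapts */
    GC_REASON_MANUAL,     /* explicit gc.collect() */
    GC_REASON_SHUTDOWN,   /* interpreter teardown */
} gc_reason;

static void
maybe_adapt_trigger(gc_reason reason, int64_t *t_a, int64_t t_base,
                    int64_t live, int64_t collected)
{
    if (reason != GC_REASON_HEAP) {
        return;           /* manual/shutdown passes would skew the ratio */
    }
    int64_t t_max = trigger_upper_bound(live, t_base);
    int64_t target = target_threshold(t_base, t_max, live, collected);
    *t_a = adapt_trigger(*t_a, target);
    if (*t_a < t_base) {
        *t_a = t_base;    /* enforce the user floor */
    }
}
```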
Boundaries and degenerate cases
A few edges worth noting:
- Empty / tiny heap. If N/D < T_{\text{base}}, we set
T_{\max} = T_{\text{base}} and the adaptation collapses to “always fire
at T_{\text{base}}” — exactly what you want when the heap is too small
for amortized scaling to mean anything.
- Quadratic-behavior guard. Even if T_a is exceeded, GC won't fire
  unless count >= long_lived_total / 4. This pre-existing guard prevents
  pathological behavior on heaps that are growing mostly with live,
  non-cyclic data (see the sketch after this list).
- All math is integer. The ratio computation is done in long long with a
  shift-down step on N first, so it's overflow-safe on both 32-bit and
  64-bit Py_ssize_t builds. Only the ratio matters, not the scale.
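A minimal sketch of the resulting fire decision, combining the adaptive trigger with the quadratic-behavior guard described above (same hypothetical naming as the earlier sketches):

```c
/* Hypothetical sketch of the fire decision: the adaptive trigger
 * plus the pre-existing quadratic-behavior guard. */
static int
should_collect(int64_t count, int64_t t_a, int64_t long_lived_total)
{
    return count >= t_a && count >= long_lived_total / 4;
}
```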
Why not a simpler heuristic?
The obvious alternatives — fixed multiplier of heap, fixed allocation count,
EMA of trash count — each fail one of the criteria the hybrid hits:
- Amortized linear GC cost on growing heaps. (T_{\max} \propto N.)
- Responsive to actual trash rate, not just allocation rate. (r feedback.)
- Bounded below by a user knob. (T_{\text{base}} floor.)
- Stable under noise. (Asymmetric weighted step.)
- Can’t be fooled by manual gc.collect().
The constants (D = 2, K = 8, 1/4-up / 3/4-down) were chosen so the
benchmark suite’s GC-pass count looks reasonable across cycle-heavy and
cycle-light workloads. They aren’t sacred — the structure is what matters,
and the structure is: let the heap set the upper bound, let the trash
ratio choose where in that range to sit, and move toward that choice
faster when caution demands it.