Improving incremental gc

Indeed, I expect that’s why CPython’s “free threading” build uses mimalloc. pymalloc relies on the GIL, and there’s no straightforward way to break that dependence short of slamming a process-level mutex acquire/release pair around each entry point.

But a bump allocator’s design remains dirt simple regardless of threads - give each thread its own nursery, and there’s zero contention (at least so long as the nursery isn’t exhausted).
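A minimal sketch of that shape, with hypothetical names (a real allocator would grab a fresh nursery on the slow path, the only point of cross-thread coordination):

```c
/* Minimal per-thread bump ("nursery") allocator sketch; names are
 * hypothetical. The fast path is one compare and one pointer bump,
 * with no synchronization, because each thread owns its nursery. */
#include <stddef.h>

typedef struct {
    char *next;    /* next free byte in this thread's nursery */
    char *limit;   /* one past the end of the nursery */
} nursery_t;

static _Thread_local nursery_t nursery;   /* C11 thread-local */

void *nursery_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;     /* keep 16-byte alignment */
    if ((size_t)(nursery.limit - nursery.next) < size)
        return NULL;   /* exhausted: fall back to the slow path,
                          which hands this thread a new nursery */
    void *p = nursery.next;
    nursery.next += size;
    return p;
}
```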

mimalloc no more caters to moving objects in memory than pymalloc does.

Worse, all the pointers that point to moved objects also need to be rewritten.

But with enough effort, it can go very far indeed in real-world apps. PyPy, and all major Java and JavaScript implementations, are examples of real-world success. They’re not simple, though, and various implementations make various tradeoffs. For example, some compact objects only among the surviving members of the young generation, and skip compaction during major collections unless an internal measure of fragmentation gets “too large”.

All of this is complex. “Truly modern” designs have incremental compacting collectors that run concurrently with apps in one or more background threads, and achieve all of low fragmentation, very short pauses (milliseconds at worst) regardless of how large the object graph is, timely reclamation of trash, and high throughput. But developer-years of expert work go into them, beyond CPython’s volunteer resources.

The focus in this topic, though, is “just” tweaking CPython’s incremental gc to achieve more timely reclamation of trash cycles, without increasing its pause times. pymalloc’s and mimalloc’s “round up to discrete size classes” goes a long way toward reducing external fragmentation, at the cost of some usually minor internal fragmentation, so long as memory requests usually fit in their internally-managed size classes.


I’m quite sure that mimalloc and other modern allocators have per-thread structures as well, so I doubt contention is an issue in the happy cases (mimalloc’s own description hints at a “lock-free design”).

Admittedly, bumping a pointer will still require fewer CPU instructions than whatever mimalloc is doing, but if you start from the assumption that CPU execution is relatively cheap these days, that’s a minor concern.

Indeed mimalloc doesn’t, but it also doesn’t need to (and neither does pymalloc). It’s able to reclaim and reuse deallocated memory perfectly fine without having to compact the heap.

A bump allocator, OTOH, requires periodic heap compaction [1]. Otherwise, it’s just a huge memory leak in any non-trivial allocation workload.

So IMHO it’s fair to factor the cost of periodic heap compaction into the total cost of a bump allocator, at least as far as general-purpose usage is concerned.

Well, yes, “with enough development effort” into the GC itself but also with a sophisticated enough JIT that should be able to eliminate many allocations in the critical paths (for example by “virtualizing” objects in PyPy-speak). Both are things that CPython hasn’t been able to afford, yet.

That was my implicit point: that the alluring simplicity of a bump allocator is a bit of an illusion. You’re just trading one complex engineering problem for another. :wink:


  1. Except for very specialized purposes ↩︎


There will be workloads that fragment memory in a pathological way.
But that seems rare. I first recall seeing a moving memory allocator on the original Mac OS. In its case, compaction was a must.

It solved the pointer problem by using double pointers that the memory allocator updated as it compacted memory.
The array of pointers was fixed in memory.
You had to lock a block of memory to stop it moving.
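A sketch of that scheme, with illustrative names (classic Mac OS spelled these Handle, HLock, and HUnlock):

```c
/* Sketch of the classic Mac OS "handle" scheme; names here are
 * illustrative. Client code holds a pointer to a slot in a fixed
 * master-pointer table, so when compaction moves a block, only the
 * slot is rewritten, never the clients' pointers. */
typedef struct {
    char *ptr;      /* current block address; compaction rewrites this */
    int   locked;   /* nonzero pins the block in place (cf. HLock) */
} MasterEntry;

typedef MasterEntry *Handle;   /* what clients hold; the table never moves */

/* Every access re-dereferences the handle, because the block may
 * have moved since the last access: */
static char *deref(Handle h)       { return h->ptr; }

static void lock_block(Handle h)   { h->locked = 1; }   /* cf. HLock */
static void unlock_block(Handle h) { h->locked = 0; }   /* cf. HUnlock */
```

The compactor walks the heap, slides unlocked blocks together, and updates each moved block’s master entry to its new address.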


Sure! My point is just that adding contention-free thread-safety to a bump allocator is a no-brainer. It wasn’t designed that way; it just falls out for free. As the mimalloc docs you pointed at say, they still require CAS instructions: a design that doesn’t eliminate contention, but rather makes it rare. That’s inherently more brittle.

But wholly agreed with your larger point:

Yes, the system “as a whole” is what’s important in the end.

Pretty much so. It makes allocation trivial but freeing a puzzle.

Curiously enough, CPython’s pymalloc does pass out blocks in pools via a bump allocator unique to each pool. Freeing isn’t a problem in that context, because a freed block is just added to the front of the pool’s list of free blocks (which are reused ASAP, resorting to the bump allocator only if the free list is empty). But the blocks within a pool are all the same size, so no new fragmentation wrt blocks of other sizes.
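A simplified sketch of that arrangement (not the actual pymalloc code, which packs this state differently):

```c
/* Simplified sketch of a pymalloc-style pool; not the actual CPython
 * code. All blocks in a pool share one size. Freed blocks go on a
 * singly linked free list and are reused first; the bump pointer is
 * consulted only when the free list is empty. */
#include <stddef.h>

typedef struct {
    void   *freeblock;   /* head of free list (links live in the blocks) */
    char   *nextblock;   /* bump pointer into never-used space */
    char   *poolend;     /* end of this pool's arena */
    size_t  blocksize;   /* fixed size for every block in this pool */
} pool_t;

void *pool_alloc(pool_t *p)
{
    if (p->freeblock != NULL) {            /* reuse freed blocks ASAP */
        void *block = p->freeblock;
        p->freeblock = *(void **)block;    /* pop: the next link is
                                              stored in the free block */
        return block;
    }
    if (p->nextblock + p->blocksize <= p->poolend) {
        void *block = p->nextblock;        /* bump-allocate a fresh block */
        p->nextblock += p->blocksize;
        return block;
    }
    return NULL;                           /* pool is full */
}

void pool_free(pool_t *p, void *block)
{
    *(void **)block = p->freeblock;        /* push onto the free list */
    p->freeblock = block;
}
```

Because every block in the pool is the same size, a freed block is always an exact fit for the next request, which is why freeing stays cheap here.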



This is for the free-threaded GC but could potentially be adopted for the generational and incremental GC as well. The free-threaded GC currently checks the process resident size and compares it to the size when the GC was last run. If the size has not increased by more than 10%, then we consider skipping the collection. This helps decrease GC cost in the case where you allocate many small container objects but produce little cyclic trash, which would otherwise cause GC to run too often.

This process-size-increase check has an unfortunate interaction with how mimalloc manages memory: mimalloc defers returning memory to the OS. So, even if a GC pass frees a lot of trash, the process resident size computed by the OS doesn’t change. I created an issue for this. I have a potential fix: computing memory usage by looking at mimalloc heaps and counting full “pages”. However, with that fix there are still issues: it misses memory allocated outside of mimalloc (e.g. mmap, other allocators), and it still collects too often if you are creating a lot of objects with little or no cyclic trash.
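Roughly the shape of that heuristic, with illustrative names (the 10% constant is from the description above; this is not the actual CPython code):

```c
/* Illustrative shape of the process-size check described above. */
#include <stddef.h>

static int
should_consider_skipping(size_t rss_now, size_t rss_at_last_gc)
{
    /* Skip if resident size grew no more than 10% since the last pass.
     * The catch: mimalloc defers returning freed memory to the OS, so
     * a very productive collection may not change rss at all, which
     * distorts this comparison on later passes. */
    return rss_now <= rss_at_last_gc + rss_at_last_gc / 10;
}
```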

In looking for ways to fix this, I recalled Pablo’s idea of dynamic GC thresholds. That idea is simpler to implement for the free-threaded GC since it always does full collections, so there is only one threshold to adjust. The design that follows was assisted by Claude and, based on some limited benchmarking, it seems to work pretty well. Claude’s explanation of the design we ended up with is below. A draft PR is available.

Cost model

Every allocation increments young.count. We want to answer: given the
current young.count, should we start a GC pass now?

A GC pass in the free-threaded build is dominated by the mark-alive walk
over the mimalloc GC heap. Its cost is roughly O(N) where N is the count
of live objects in that heap (long_lived_total in the code). So the
amortized cost per allocation is

\frac{\text{cost per pass}}{\text{allocations between passes}} \;\propto\; \frac{N}{T}

where T is the trigger threshold. To keep amortized cost roughly linear
in total allocations as the program grows, T should scale with N. That
gives us our upper bound:

T_{\max} = \frac{N}{D}, \qquad D = 2

At D = 2, GC fires at most once per heap doubling in the no-trash limit.
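For concreteness, on a heap of a million live objects the cap works out to

T_{\max} = \frac{10^6}{2} = 500{,}000

allocations between threshold-triggered passes, and it scales up as the heap grows.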

But T_{\max} alone is wrong: a program churning short-lived cycles wants
GC often, not once per doubling. We also have a user-configured
lower bound T_{\text{base}} (default 2000), which is the minimum number
of allocations between passes regardless of heap size. The adaptive trigger
T_a lives in [T_{\text{base}}, T_{\max}] and slides between them based on
recent trash productivity.

Productivity ratio

After every threshold-triggered collection, we know two numbers: how many
objects the pass collected, C, and the heap size N. The trash ratio

r = \frac{C}{N}

measures the fraction of GC work that actually freed memory. A high ratio
means the pass paid for itself; a low ratio means we mostly walked live
objects for nothing. Note this is not trash-as-fraction-of-young — it’s
trash-as-fraction-of-the-walk, which is the right thing because the walk is
where the cost lives.

We map r to a target threshold via a hyperbolic decay:

T_{\text{target}}(r) \;=\; T_{\text{base}} + (T_{\max} - T_{\text{base}}) \cdot \frac{1}{1 + K r}

with K = 8. Sanity points:

| r | fraction of range used |
|-------|------------------------|
| 0 | 100% (use T_{\max}) |
| 0.05 | ≈ 71% |
| 0.125 | 50% |
| 0.25 | ≈ 33% |
| 0.5 | ≈ 20% |
| 1.0 | ≈ 11% |
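As a check on one row, at r = 0.125 with K = 8:

\frac{1}{1 + K r} = \frac{1}{1 + 8 \cdot 0.125} = \frac{1}{2}

so T_{\text{target}} lands exactly halfway between T_{\text{base}} and T_{\max}, matching the table.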

Why hyperbolic? Two reasons. First, it’s monotone and smoothly squashes
[0, \infty) into (0, 1] without piecewise cases. Second, the
1/(1+x) shape mirrors how productivity-vs-work tradeoffs feel — each
unit of trash buys less threshold reduction as you accumulate more, which
is what we want (we don’t need to slam the threshold to its floor the moment
we see a single cycle). K controls how fast the curve decays; the asymmetric
step (next section) does most of the noise filtering, so the exact shape
matters less than its monotonicity.

In code this is rearranged to keep everything integer:

T_{\text{target}} = T_{\text{base}} + \frac{(T_{\max} - T_{\text{base}}) \cdot N}{N + K C}
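As a sketch of that integer form (names are illustrative; the draft PR is the authoritative version), including the shift-down step described under the boundary cases below:

```c
/* Sketch of the integer computation above. Only the ratio
 * N / (N + K*C) matters, not the absolute scale, so N and C can be
 * shifted down together to keep the 64-bit products in range. */
#define GC_K 8   /* decay constant K from the formula */

static long long
gc_target_threshold(long long t_base, long long t_max,
                    long long n_live, long long n_collected)
{
    long long n = n_live, c = n_collected;
    while (n >= (1LL << 20)) {   /* shift-down preserves the ratio */
        n >>= 8;
        c >>= 8;
    }
    long long denom = n + GC_K * c;
    if (denom <= 0)
        return t_max;            /* empty heap: nothing to adapt to */
    return t_base + (t_max - t_base) * n / denom;
}
```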

Asymmetric adaptation

We don’t move directly to T_{\text{target}} each pass — that would
“hunt” around its true value due to pass-to-pass variance. Instead we take
a weighted step from the current T_a toward T_{\text{target}}. And
the step size is asymmetric:

  • Raising T_a (less trash than recently): cautious 1/4 step
  • Lowering T_a (more trash than recently): aggressive 3/4 step

The asymmetry reflects asymmetric failure modes:

  • Lowering too slowly costs a few extra GC passes — annoying, recoverable.
  • Raising too fast can blow up the heap, and in a memory-constrained
    environment (a container, an embedded device) that’s an OOM. Unrecoverable.

So we err on the side of more frequent collection: when trash appears we
snap toward the lower target in a single big step, but when trash dries up
we creep upward one quarter-step at a time. One lucky pass shouldn’t push us
into a long deferral.
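In code, the step is just a pair of weighted moves (constants from the text: 3/4 of the way down, 1/4 of the way up; names are illustrative):

```c
/* Sketch of the asymmetric step toward the target threshold. */
static long long
gc_adapt_threshold(long long t_a, long long t_target)
{
    if (t_target < t_a)                          /* more trash lately:    */
        return t_a - (t_a - t_target) * 3 / 4;   /* snap down aggressively */
    return t_a + (t_target - t_a) / 4;           /* less trash: creep up  */
}
```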

What doesn’t count

The adaptation only updates after collections that fired because the
threshold was hit (reason == _Py_GC_REASON_HEAP). Manual gc.collect()
calls and shutdown collections are skipped — they aren’t representative of
the steady-state trash ratio and would skew the average.

A user who calls gc.set_threshold() to raise T_{\text{base}} above the
current T_a gets an immediate resync: T_a jumps up to the new floor.
Lowering T_{\text{base}} relies on the natural T_{\max} clamp on the
next pass.
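In code terms the resync is tiny (illustrative names, not the PR’s):

```c
/* Raising the floor via gc.set_threshold() takes effect immediately;
 * lowering it needs no special case, since the normal clamp against
 * T_max handles it on the next pass. */
static long long
gc_resync_floor(long long t_a, long long t_base)
{
    return (t_base > t_a) ? t_base : t_a;
}
```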

Boundaries and degenerate cases

A few edges worth noting:

  • Empty / tiny heap. If N/D < T_{\text{base}}, we set
    T_{\max} = T_{\text{base}} and the adaptation collapses to “always fire
    at T_{\text{base}}” — exactly what you want when the heap is too small
    for amortized scaling to mean anything.
  • Quadratic-behavior guard. Even if T_a is exceeded, GC won’t fire
    unless count >= long_lived_total / 4. This pre-existing guard prevents
    pathological behavior on heaps that are growing in pure-non-trash regions.
  • All math is integer. The ratio computation is done in long long
    with a shift-down step on N first, so it’s overflow-safe on both 32-bit
    and 64-bit Py_ssize_t builds. Only the ratio matters, not the scale.

Why not a simpler heuristic?

The obvious alternatives — fixed multiplier of heap, fixed allocation count,
EMA of trash count — each fail one of the criteria the hybrid hits:

  1. Amortized linear GC cost on growing heaps. (T_{\max} \propto N.)
  2. Responsive to actual trash rate, not just allocation rate. (r feedback.)
  3. Bounded below by a user knob. (T_{\text{base}} floor.)
  4. Stable under noise. (Asymmetric weighted step.)
  5. Can’t be fooled by manual gc.collect().

The constants (D = 2, K = 8, 1/4-up / 3/4-down) were chosen so the
benchmark suite’s GC-pass count looks reasonable across cycle-heavy and
cycle-light workloads. They aren’t sacred — the structure is what matters,
and the structure is: let the heap set the upper bound, let the trash
ratio choose where in that range to sit, and move toward that choice
faster when caution demands it.


I’m all in favor of trying adaptive thresholds, but worry that Claude may be “out-thinking” it, adding more complexity than is actually helpful.

Training data has lots of stuff about schemes that adapt to “real-life” behaviors in many areas. But in physical systems, and even human ones, “momentum” and “inertia” are very real drivers. That’s why weighted steps are generally advantageous: they’re effectively doing a moving average, and follow (although lag) the emerging trends.

But software isn’t like that. In one cycle my app can go from creating no cyclic trash to creating a million trash cycles per second. And go back to none again abruptly 10 seconds later. The best predictor of what’s next is often the most recent behavior, not, e.g., an exponential moving average of everything that’s happened since the beginning of time :wink:.

It may work just fine anyway, but in spite of - not because of - the hypothesized “reasons”. Bots are just as good as people at making their made-up guesses sound inevitable :wink:
