Reverting the incremental GC in Python 3.14 and 3.15

“Bottomless pit” :wink:

I just played with seeing how bad things could get, on 64-bit Windows, under the released 3.14.4.

Very simple program: a deque with maxlen 1000. Each iteration pushes a new self-referential class instance, with a payload of 1MB. So deque throws away the oldest to make room for the newest, and after 1000 iterations reaches a steady state of 1GB of reachable payloads.
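For concreteness, the toy looks something like this minimal sketch (my reconstruction from the description, not the actual script; the payload is scaled down here so it runs quickly):

```python
import gc
from collections import deque

PAYLOAD = 1_000   # scaled down for illustration; the experiment used 1 MB per instance
MAXLEN = 1000     # steady state: MAXLEN reachable payloads

class Node:
    def __init__(self):
        self.me = self                 # self-referential: only cyclic gc can reclaim it
        self.payload = bytes(PAYLOAD)  # the bulk of the memory

d = deque(maxlen=MAXLEN)
for _ in range(5 * MAXLEN):
    d.append(Node())  # once full, each append turns the oldest Node into cyclic trash

print(len(d))         # MAXLEN reachable Nodes remain
```

With the 1 MB payload of the real run, the reachable set stabilizes at about 1GB; everything beyond that is cyclic trash waiting on the collector.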

I’m not sure what happens accepting all defaults. After 150_000 iterations, psutil’s idea of “rss” was still reaching new highs, about 15.6GB when I stopped. But it doesn’t just climb, it falls too. I’m guessing the OS is flushing RAM to the swap file (it’s a 16GB box). But I’m not looking for details here: just “how bad could things get?”. Pretty bad :wink:

BTW, at iteration 1000, rss was about 1GB, almost wholly accounted for by reachable payloads. No mystery there!

And if I do gc.collect() at the end of each iteration, it stays at about 1GB “forever”.

Setting threshold0 to 100 (the default is 2000) instead made a difference, but not much: still reached new highs at 150 thousand iterations, but the peak rss was “only” 13.5GB, at least through 50 thousand iterations.

Setting threshold0 to 1 made it “almost sane, kinda”: it reached a peak rss of 2.4GB on iteration 56_273, and it stayed very close to that across the next million iterations.

At threshold0=2, through half a million iterations peak rss didn’t break 3GB, but was still reaching new highs (both falling and rising).
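For reference, the threshold experiments above boil down to this (the specific default for threshold0 varies by Python version):

```python
import gc

t0, t1, t2 = gc.get_threshold()  # inspect the current settings
gc.set_threshold(1, t1, t2)      # collect as eagerly as the knob allows
# ... run the workload here ...
gc.set_threshold(t0, t1, t2)     # restore the defaults afterward
```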

pymalloc isn’t relevant here: the class instances are small (it’s their payload attributes that are large), and it’s neither allocating nor releasing arenas.

So it’s a combination of how the incremental gc works and the pragmatics of how Microsoft’s malloc() family behaves under this kind of load.

How can all this be “fixed”? Don’t know - I don’t even have the start of a mental model that fits all the observed behaviors. So it’s good that this contrived program is utterly unrealistic :wink:

1 Like

Setting threshold0 to 100 (the default is 2000) instead made a difference, but not much: still reached new highs at 150 thousand iterations, but the peak rss was “only” 13.5GB, at least through 50 thousand iterations.

The incremental GC has had an early return since 3.14.1. Even if threshold0 is reached, it can exit early and skip all processing. I tested this with a script like yours - GH-142002: Revert GC changes to "work_to_do" logic. by nascheme · Pull Request #142001 · python/cpython · GitHub

So it seems that the only 100% reliable and predictable workaround remaining is to force gc.collect() by hand at times. Until that too decides it knows better :wink:

My apologies for the partially duplicated effort, and thank you for your efforts here!

In all, I still favor keeping inc gc the default (it’s the future, after all), but adding a startup option (for now) to fall back to the 3-gen collector, which, despite its “long pauses” (which don’t matter to many apps), was far more straightforward for users to control: smooth changes in set_threshold() inputs resulted in smooth changes in behavior.

I have no usable mental model for how changes to those actually work under inc gc. Massive changes to threshold0 can have little visible effect. While I didn’t mention it, changes to threshold1 were even less predictable (well, apart from “very little effect regardless” - even setting threshold1=0 left my program with peak rss at 13GB after 150 thousand iterations).

1 Like

I added more outputs to my toy, and of course I’m seeing that too.

Since our gc doesn’t know the size of objects, the payload size shouldn’t matter. So I cut mine by a factor of 10 (down to 100K bytes), so that OS swapping to page file could be eliminated as a confounder.

Ha! Under the default settings, peak rss still reached over 11GB. But not large enough to swap out. And it stabilized after about 1.5M iterations.

Now at 10M iterations, gen1 collections usually occur about every 2000 iterations (each of which tosses an old cycle and creates a new one), but are punctuated by brief stretches of one per 18_000, then 10_000, then 8_000, then 4_000 iterations - and then back to a long stretch with 2_000 between. Lather, rinse, repeat.

There’s always a backlog of at least 90K trash objects waiting to be reclaimed, but I haven’t seen it exceed 100K (10GB of unreclaimed trash). By construction, the actual number of reachable cycles is always 1K. Unreachable ones outnumber them by a factor of 90-100.

Since payload size doesn’t matter, I expect that’s the shape of the eventual “steadyish state”, under the defaults, in any long-running program producing lots of cyclic trash - provided it doesn’t run out of RAM first.

If so, that makes a stronger case for “revert it!” than I first guessed.

EDIT: indeed, cutting the payload size by another factor of 10 (down to 10KB) didn’t change the eventual “steadyish state” shape: varying between 90 and 100 unreclaimed cycles for each still-reachable one, with peak rss about 10x smaller.

1 Like

One more observation, and I’ll stop bothering everyone with this :smile:.

In my toy program, the gc callback function never sees an event for a gen0 collection. Only for gen1. But a gen0 collection would find about 2000 (threshold0) dead cycles to reclaim, every time. Instead they’re apparently moved, unexamined, to gen1.

So we effectively have a single-generation scheme in the end, which typically (not always) reclaims little per invocation (and sometimes nothing).

I see that Neil seemingly saw this before (about 100x more dead cycles sitting unclaimed), but for whatever reasons his significant mitigation attempt stalled:

I haven’t been staring at the code, and that’s deliberate: I’m trying to “black-box out-think this”, the way a user faced with mysterious behavior would have to. I wish them luck :wink:

1 Like

In the incremental GC, if we reach threshold0 we always run gc.collect(1), which means running a collection of the young generation plus one increment of the old generation. If we do not return early via the work_to_do < 0 condition, we then get the young generation and a piece of the pending space of the old generation. So we don’t need to run gc.collect(0) - gc.collect(1) does it. This is expected behavior for the incremental GC.
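That is, per this description, a single call covers both pieces of work (how generation 1 maps onto “young plus an increment of old” is specific to the incremental GC):

```python
import gc

# Per the description above: under the incremental GC, this one call collects
# the young generation plus one increment of the old generation.
n = gc.collect(1)  # returns the number of unreachable objects found
print(n)
```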

2 Likes

Then I have no explanation for anything I’m actually seeing: there are about 2000 freshly unreachable cycles every time threshold0 is reached, but they pile up uncollected. As Neil indirectly noted in the referenced PR, under the 3-gen collector they were reclaimed in its gen0 collection.

Perhaps they’re merged into the gen1 collection in a position it simply won’t look at until a whole bunch of later gen1 collections trigger? They eventually do get reclaimed, but a long time after they first become unreachable (which a pure gen0-collection would discover at once).

To call any of this “expected” is a stretch from a user’s POV :wink:

1 Like

In February, I encountered a real-world issue with incremental garbage collection, Django’s migration system, and servers with limited resources. I just published a blog post explaining this interaction and the workaround I added (calling gc.collect() often): Django: fixing a memory “leak” from Python 3.14’s incremental garbage collection - Adam Johnson

8 Likes

If I understand you correctly, yes, they do get reclaimed, but it can take a long time since it’s controlled by the work_to_do logic and the size of the old generation. I’d like to examine your memory allocation pattern using my gc_stats module. Would you mind sharing your test script somewhere?

To call any of this “expected” is a stretch from a user’s POV :wink:

I agree, though this behavior is documented as intended :slight_smile:

2 Likes

Yes, they do. What we’re out of synch on is the meaning of “gen0 collection”. Under the older 3-gen collector (which I did substantial work on and understood inside out), a gen0 collection reclaimed all cycles created, and which became trash, since the last time it ran. The current scheme doesn’t even approach that. So it appears I have no real idea what “gen0” means anymore.

Sure! The code has bloated to produce more and more output, but the core remains dead simple: create one new cycle per iteration, and also convert one older cycle to trash. Here’s a public gist:

Toy cyclic gc code · GitHub

3 Likes

May I suggest opening a new topic to discuss plans and experiments for 3.16 and beyond? Let’s keep this for 3.14 and 3.15. Thanks!

2 Likes

Sure!

Not entirely sure what “keep this” means to you. That nothing changes from the first post in the topic?

  • 3.14.5 will replace inc gc with the older 3-gen gc.
  • 3.15 will have the older 3-gen gc from the start.
  • And 3.15.n will stay that way.
  • There’s no intent to, e.g., consider a start-up option to allow users to pick which gc to use.
  • Any reintroduction of inc gc would require a PEP, and won’t land before 3.16.

Just looking for explicit clarity there, not an argument.

1 Like

Not entirely sure what “keep this” means to you. That nothing changes from the first post in the topic?

No, that this thread stays focused on what to do for 3.14 and 3.15, not future plans.

Then the intent remains unclear to me. If, e.g., a startup option to let the user pick which gc to use is a possibility on the table for 3.14.5, then that’s not really “a future plan”, but a very near-term outcome. Nobody has been discussing ways to change inc gc here, just trying to get a clearer picture of what goes wrong with it. At least Neil has found an actual bug:

If we have explicit clarity that no such feedback is of any interest until 3.16 is on the horizon, then I’d agree that this topic is unsuited. But so long as the possibility of keeping inc gc as an option before 3.16 is on the table, then it’s in everyone’s interest to mitigate inc gc’s failure modes as best we can ASAP - and 3.14 and/or 3.15 are very much in play too.

Fine by me whatever the release management folks have settled on, just looking for explicit clarity about what that is.

1 Like

Yep! The plan for 3.14 and 3.15 is still as outlined in the first post: have just the old generational GC available, no build-time or run-time switches.

I don’t want the increased risk and extra complexity of having two GCs and extra options at this stage, especially for 3.14. Also, having, for example, two GCs in 3.14, one in 3.13 and 3.15, and maybe something else in 3.16, will make maintenance harder in the future.

Let’s take our time and plan things properly with a PEP for 3.16 and beyond.

9 Likes

I don’t see the problem. Java has:

Epsilon
G1
Parallel
Serial
Shenandoah
Z

Old Java also had CMS, and some JDKs, such as Azul Zing or IBM Semeru, have their own GC. The more, the better.

Side note: it’s a joke :grin:

1 Like

Good! The original plan is fine by me. I just hope the news about it is balanced - there are no pure wins here. In particular:

  • While the older gc can slash peak memory use, it can also bloat pause times.
  • Workarounds apps may have adopted to mitigate increased memory use under the newer gc may be counterproductive under the older gc. For example, setting threshold0 to a small value to coax gc into running more often, or calling gc.collect() (forcing full collections) “a lot”.
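Those two workarounds look roughly like this; under the old gc, both trade throughput for memory:

```python
import gc

# Workaround 1: make automatic collections run far more often.
gc.set_threshold(100)  # only threshold0 needs to be passed

# Workaround 2: force a full collection at convenient points.
freed = gc.collect()   # scans all generations; returns the number of unreachable objects found
```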

For the future, I don’t know that there’s a killer technical problem here, more a failure to reach consensus on goals. For example, Neil has a long history of raising concerns about increased memory use, but Mark seemed focused on cutting pause time. The PEP process is one way to encourage consensus.

3 Likes

I’ve been looking into how to add back generational GC and perform incremental GC on the old space only.

With that model, the GC would:

  1. Collect the young and intermediate generations as it does in the generational GC
  2. With each collection, also collect an increment of the old generation (as the incremental GC does now)

Any cycles that were promptly collected by the generational GC would still be promptly collected.
Objects that are collected by the full collection would still be collected - in some cases more promptly, in some cases less promptly. But these long-lived objects that eventually become cyclic garbage aren’t the cycles causing problems.

In summary:

| When | Gen GC | Gen + Incremental GC |
|---|---|---|
| After net N allocations | Collect gen 0 | Collect gen 0 + increment of gen 2 |
| After net N*M allocations | Collect gen 0 + 1 | Collect gen 0 + 1 + increment of gen 2 |
| After heap_size/K allocations | Collect gen 0 + 1 + 2 | Not applicable |

Adding the gen0 collection to the increment is trivial; it’s a three-line change. However, by itself it doesn’t fix the problem: too many garbage cycles span multiple young collections.
To fix the issue we need to add back the intermediate generation.
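A toy Python model of that schedule (names and threshold values are illustrative, not CPython’s actual code):

```python
# Hypothetical scheduling model of "generational + incremental old space".
THRESHOLD0 = 2000  # net allocations per young (gen0) collection
THRESHOLD1 = 10    # young collections per middle-generation (gen1) collection

def schedule(net_allocations):
    """Return the sequence of collections the hybrid scheme would run."""
    events = []
    for young_run in range(1, net_allocations // THRESHOLD0 + 1):
        if young_run % THRESHOLD1 == 0:
            events.append("gen0+gen1 + increment of gen2")
        else:
            events.append("gen0 + increment of gen2")
    return events

ev = schedule(40_000)
print(len(ev), ev.count("gen0+gen1 + increment of gen2"))
```

The point of the model: every young collection also chips away at the old space, so there is never a stop-the-world full collection, yet freshly dead cycles are still found promptly.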

If anyone wants to work on adding back the middle generation, or cooperate with me to do so, let me know.

3 Likes

If anyone wants to work on adding back the middle generation, or cooperate with me to do so, let me know.

As always, I’d love to work on the GC! :slight_smile:

Would you mind if I implemented your approach and shared a branch/PR with you for review?

3 Likes

Yes, please do. I’ll be happy to review it.

4 Likes