Incremental GC and pushing back the 3.13.0 release

I’m a little [1] concerned with the impact of the incremental GC change in 3.13, which recently showed up (Incremental GC means that Sphinx is significantly slower in 3.13 than in 3.12 · Issue #124567 · python/cpython · GitHub). It’s not clear that the incremental GC provides significant improvements (although the smaller pauses are probably desirable), it clearly has slightly more overhead in common cases, and we’re still discovering pathological cases. I don’t think we should release 3.13.0 with the incremental GC.

Obviously taking it out at this stage is a drastic change, and not something I want to do without a new release candidate. There’s also a bunch of other fixes that went in since rc2, so I really want rc3 anyway. Which means pushing back the release to make room for the new release candidate.

What I would like to do:

  1. Roll back the incremental GC change.
  2. Release 3.13.0rc3 this Monday, September 30th.
  3. Release 3.13.0 final a week later, on Monday, October 7th (probably starting the release process on Sunday).

Since I am the release manager for 3.13, I think I will just do what I want to do :stuck_out_tongue: .

Rolling back the incremental GC sounds drastic, but after talking to a few people about this we think it’s not that big a deal. The GC code hasn’t changed much since the incremental GC landed, and we’re rolling back to known old code, which is less risky than attempting to work around the pathological cases in the new code. A week is not a long time for a new release candidate, but it should be enough to find any obvious issues with the rollback (and the other changes that have gone into 3.13 since rc2).

(I don’t think the incremental GC needs to be rolled back in main (3.14) at this point, but it does feel like we need a bit more of a value proposition for the change. I’m glad I can leave that up to @hugovk :joy:)

Obviously a shift in schedule like this is annoying for redistributors, so we’re already reaching out to some of the usual folks for this.


  1. somewhat of an understatement ↩︎

28 Likes

Does this involve an API/ABI change?

I would point out that the smaller pauses are definitely more desirable in a number of broad use cases, though overall GC cost can also be a concern in those same use cases (the example I’ve been personally acquainted with is the scheduler and worker components of Dask.distributed, which have grown countermeasures over time).

But I agree that if a widespread and foundational tool like Sphinx becomes 30% slower, that deserves a response. Hopefully the performance issues can be fixed for 3.14.

PS: ideally, and from a user’s perspective, there would be a runtime switch between the incremental and non-incremental GC, but I suppose that might get quite hairy, especially at this point of the release process :slight_smile:

2 Likes

It should not require either, except to the extent the incremental GC had slightly different semantics in the thresholds.

1 Like

Repeating what I said about this elsewhere: any change to the GC can slow down some programs and speed up others. Likewise, it could cause them to use more or less memory, or to free resources like file descriptors and sockets more or less quickly. That doesn’t mean we should never touch it, but we need to be careful about it. Unfortunately, I feel we don’t have good benchmark suites that give us insight into how well the GC is working.

When speaking to Thomas about the release, I didn’t object to the incremental GC going into 3.13. However, I was somewhat concerned that there could be programs that are quite negatively impacted by its different behavior. We don’t have a good sense of whether those cases are rare or common. Giving it time to mature more in the 3.14 branch is the cautious approach, and I think that’s a good decision.

The non-incremental GC is quite aggressively tuned, with the youngest generation threshold at 700. That makes it safe in terms of quickly freeing resources caught up in cyclic garbage, but it also means that it often does more work than required. People can tune it quite easily to be less aggressive by setting the threshold higher. Calling gc.freeze() could be a big help as well; I see mypy does that with good results.
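
A minimal sketch of that kind of tuning, for illustration only (the threshold value here is not a recommendation):

```python
import gc

# Make young-generation collections less frequent; the defaults are (700, 10, 10).
gc.set_threshold(7_000, 10, 10)

# ... application start-up: import heavy modules, build long-lived caches ...

# Then move everything currently tracked into the permanent generation so
# those objects are never scanned again (the trick mypy uses).
gc.freeze()
```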

The vast majority of people don’t tune it, and I think it’s time we do something like Pablo suggested with his dynamic threshold PR. Unfortunately, he and I stalled out on that work. As a minimal tweak for 3.13, I think we could increase the threshold0 value. The current value is 700, and that was set many years ago when computer hardware was much slower and had less RAM. My gut feeling is that 7000 would be better, given some of the performance metrics Pablo and I gathered during the dynamic threshold work. The incremental GC is using 5000 as the threshold for the young space.

Measuring the total time it takes to run the “Sphinx doctest” test case, with different thresholds:

| threshold  | time [sec] |
|-----------:|-----------:|
|        700 |       2.59 |
|      2,800 |       1.90 |
|      5,600 |       1.78 |
|     11,200 |       1.74 |
|     22,400 |       1.70 |
| 70,000,000 |       1.69 |
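
For context, this is roughly the shape of such a comparison, though not the actual harness used above; workload() is just a stand-in that creates lots of small cyclic objects:

```python
import gc
import time

def workload():
    # Stand-in for the real workload (e.g. a Sphinx doctest run): allocate
    # many small objects in reference cycles, which is what drives
    # young-generation collections.
    data = []
    for i in range(200_000):
        node = {"i": i}
        node["self"] = node
        data.append(node)

for threshold0 in (700, 2_800, 5_600, 11_200, 22_400):
    gc.set_threshold(threshold0, 10, 10)
    gc.collect()  # start each run from a clean slate
    start = time.perf_counter()
    workload()
    print(f"threshold0={threshold0:>7}: {time.perf_counter() - start:.2f}s")
```
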
8 Likes

One unfortunate limitation is that the GC thresholds are expressed in numbers of objects, which is a poor proxy for memory consumption.

If some code somewhere is relying on the cyclic GC to release critical system resources such as file descriptors or sockets, then I would usually call it a bug to fix. It can be non-trivial (can involve some weakref hackery), but it usually results in a much better user experience - and can also help with other issues such as GC costs and memory consumption.
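
A sketch of that general pattern, using weakref.finalize rather than __del__ (Connection is a made-up class for illustration): callers that close explicitly get deterministic cleanup, and the finalizer acts as a safety net that doesn’t keep the object alive and still runs at interpreter exit.

```python
import socket
import weakref

class Connection:
    def __init__(self, host, port):
        self._sock = socket.create_connection((host, port))
        # The finalizer must not hold a reference to self, or it would keep
        # the object alive; referencing only the socket's bound method is fine.
        self._finalizer = weakref.finalize(self, self._sock.close)

    def close(self):
        # Deterministic cleanup for callers that close explicitly; calling
        # the finalizer marks it dead so it won't run a second time later.
        self._finalizer()
```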

I’m in favor of increasing the default thresholds, but I’m not sure that would be a good idea for 3.13 at this stage of the release cycle (or it would at least require a very prominent changelog banner).

Last year, @Dino and others did some GC tuning for several internal Meta workloads, and found that (14_000, 100, 100) worked very well for them. Thinking that should generalize well across all workloads and be low risk, I applied these thresholds to ALL Python workloads, only to revert it shortly after, once it had caused at least a dozen services to start crashing with OOMs.

It would be much better to have the increased thresholds in a new Python version, as those issues would surface as part of the upgrade, but I’d be hesitant about doing that in a late-stage RC.

9 Likes

Thanks for that info, Itamar. That kind of testing is really useful.

I’m fine with leaving it at 700. I still think it would be fairly safe to bump it to 5000 in 3.13.0; we can drop it back to 700 in 3.13.1 if we notice any issues. It’s not really an API change, and we were previously okay with releasing the incremental GC with the young collection threshold at 5000. The thresholds don’t work exactly the same way, but they’re quite similar.

Thomas asked about peak memory use so I re-did the Sphinx benchmarks to measure peak RSS (the Linux /usr/bin/time command can do this, which was news to me).

| threshold  | time [s] | max RSS [MB] |
|-----------:|---------:|-------------:|
|        700 |     2.59 |           87 |
|      2,800 |     1.90 |           88 |
|      5,600 |     1.78 |           87 |
|     11,200 |     1.74 |           89 |
|     22,400 |     1.70 |           90 |
|     70,000 |     1.70 |           93 |
|    700,000 |     1.71 |          122 |
| 70,000,000 |     1.69 |          156 |

For this particular program, a threshold of 5000 seems totally fine for not increasing peak RSS. I suspect it creates many fairly small objects in cycles. If you have a program that has a smaller number of objects and those hang on to big chunks of memory, the bump to a threshold of 5000 could grow peak RSS a lot more.
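
For anyone who wants to check their own program rather than use /usr/bin/time, peak RSS can also be read from inside the process with the standard resource module (Unix only; units differ between platforms):

```python
import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
print(f"peak RSS: {usage.ru_maxrss / 1024:.0f} MB")
```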

This is related to Antoine’s point about basing cyclic GC on object counts rather than actual memory use. We can quite cheaply measure what obmalloc is using (treating arenas as full), and I’d imagine mimalloc would be similar. However, that doesn’t count the memory for “large allocations”, and those are the ones that could hurt. The object count works reasonably well for the things obmalloc allocates. I’m not sure we want to pay the accounting cost to track what large allocations use, and that also doesn’t cover memory used by extensions using other allocators. So, it’s not so simple to solve.

3 Likes

Release management perspective: We’re not going to change a magic number for 3.13.0 at this stage.

Large applications whose operators have taken the time to measure, and who care, already set different thresholds. They’ll continue to do so and will see long-familiar behavior in 3.13.

We’ll likely have something different before the 3.14 betas, and we’ll seek real application feedback and more meaningful benchmarks.

7 Likes

I’m worried that we won’t get much feedback from real-world applications. Betas and RCs just don’t get tested much, so when the 3.14 RC comes around, we will be stuck in the same situation, not knowing how safe it might be to adjust the GC thresholds. We will likely implement dynamic thresholds and/or an incremental collector for 3.14, and both could result in less frequent young-generation collections.

I thought more about Itamar’s experience with setting the thresholds higher. Having things crash on production systems with an OOM error is not nice. It would be better if the new behavior was opt-in for at least one release. Along those lines, I’ve implemented a PR for a PYTHON_GC_THRESHOLD environment variable. I kept it really simple with the hope we can include it in 3.13. Then, we can suggest in the release notes that people try PYTHON_GC_THRESHOLD=14000 and report back if they notice issues.

Sample docs: 1. Command line and environment — Python 3.14.0a0 documentation
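
Until something like that lands, a rough user-level approximation is easy to add to your own entry point; the variable name and single-threshold behavior here just mirror the proposal and may not match what the PR actually implements:

```python
import gc
import os

# Hypothetical opt-in: honor PYTHON_GC_THRESHOLD if set, adjusting only
# the first-generation threshold and leaving the others at their defaults.
value = os.environ.get("PYTHON_GC_THRESHOLD")
if value:
    gc.set_threshold(int(value))
```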

7 Likes

Here’s what C# exposes, FTR:

Libraries such as NumPy etc. could most certainly use such APIs to inform the Python GC of external memory allocations.

PyPy has a function __pypy__.add_memory_pressure(n) for exactly that purpose. When cffi calls libc malloc, it will also tell the GC the size of the non-GC memory being allocated, to get the GC to collect earlier. See the docs here: Using the ffi/lib objects — CFFI 1.17.1 documentation
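
For reference, a sketch of how the cffi side of this is used today; it assumes libc can be loaded with ffi.dlopen(None), which works on Linux and macOS but not Windows:

```python
import cffi

ffi = cffi.FFI()
ffi.cdef("void *malloc(size_t n); void free(void *p);")
libc = ffi.dlopen(None)   # the C library itself

n = 16 * 1024 * 1024
# The size argument tells the GC roughly how much non-GC memory hangs off
# `buf`, so it can decide to collect earlier; PyPy's
# __pypy__.add_memory_pressure(n) expresses the same idea directly.
buf = ffi.gc(libc.malloc(n), libc.free, size=n)
```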

If CPython adds some API like this, maybe we can coordinate.

2 Likes

cc @markshannon (in case he’s not already following this discussion) for @cfbolz 's suggestion above

1 Like

I’m happy to go into more detail about this (just been brushing up my memory of the mechanism recently), but maybe this thread is not the right place for it. Is there an issue yet?

Probably not, I would say.

(also, moderators have the ability to split threads, though I don’t know how to summon them :slight_smile: )

What isn’t clear from Neil’s post is that that magic number was already changed in 3.13, in alpha 5 or 6. The first generation threshold was raised from 700 to 5000, and that generation’s collection is still mostly the same. So yeah, it changes relative to 3.12, but it’s the same as the betas and earlier release candidates.

4 Likes

It’s actually 2000 in the current 3.13 branch. I think I was looking at an older version of 3.13 or something and thought it was 5000.

My recommendation to Thomas is to revert to the non-incremental GC and keep the original threshold as well. That’s the safe thing to do, and we really don’t want another RC.

If he feels inclined, he can merge my PYTHON_GC_THRESHOLD env var change but that’s optional. People still have gc.set_threshold() if they want to test and we could just suggest that. I think the env var makes it easier, especially when you are dealing with things running in containers.

7 Likes

I think the safe thing to do is to keep the threshold we’ve been using through the betas and release candidates, rather than revert that too.

3 Likes

[3.13] GH-124567: Revert the Incremental GC in 3.13 by Yhg1s · Pull Request #124770 · python/cpython · GitHub is the revert of the incremental GC, in case people want to see what it does to the performance of specific cases.

5 Likes

Yes, but not in a way that will affect wheels (it affects the size and offsets of opaque structures). It may affect debuggers/profilers that vendor headers, though (but those are not supported cases and we already have a better mechanism for them).