Improving incremental gc

Incremental gc is being reverted for 3.14, to be replaced with the older 3-gen collector, and the same will be true of 3.15:

Inc gc seemed to be primarily aimed at reducing “stop the world” long pauses, and has had some success at that. But it came at the cost of sometimes greatly increasing memory use, as trash cycles pile up waiting for the incremental collector to get to them.

I wrote a toy app to explore the limits (link to a public gist):

Toy cyclic gc code · GitHub .
Don’t be put off by “toy app” :wink:. The difference between a toy and a real app in this kind of context is that the former can display pathological behaviors much faster. and without dozens of confounding factors muddying the picture.

The app depends on psutil, but not on Python 3 - it sticks to Python 2 features.

The high-order bit here is on the behavior of “gen0” collections. Under the old 3-gen collector, a gen0 collection collects the cycles created, and which became trash, since the last time gen0 ran. It’s aimed at reclaiming short-lived cyclic trash, which is common in many apps (if they create cycles at all).

The app is an infinite loop that creates a new cycle on each iteration, and after iteration 1000 also converts one to trash on each. So neither truly short- nor long-lived. There are never more than 1000 reachable cycles.

Under Py 3,13 (older gc), it works as I expect. The gen0 threshold is 2000 (same as under 3.14), and around iteration 2000 a gen0 collection occurs. It reclaims 1000 trash cycles, and moves 1000 to gen1. Same story at iterations 4000, 6000, 8000, … it cleans up half the trash created along the way, quickly and smoothly. Other trash is awaiting a gen1 collection.

The current (dev) docs for gc.set_threshold() seem to say that should still be the case (plus maybe more):

For each collection, all the objects in the young generation and some fraction of the old generation is collected

But that’s not what I’m seeing. Nothing is collected at 2000 iterations. And still not by 4000 iterations. Or 6000, 8000, … Nothing at all gets collected until about the 20 thousandth iteration. gc is invoked along the way (you can confirm by adding more output to the toy’s gc callback function), but it returns without collecting anything until iteration 20_000. Then it collects about 18_000 trash cycles.

So it’s off to a bad start, which gets worse over time. But not without bound. It eventually (after about 750K iterations) reaches a “steadyish state”, always with over 90K trash cycles awaiting collection, but not more than 100K.

So that’s the first thing I’d think about: is there ever a “good reason” to skip gen0 performing its traditional function? I don’t understand the current “work to do” logic, and especially not how “the math” can end up making it negative(!) at times. But intuition says “work to do” should always include gen0.

6 Likes

It should not take major surgery to make the incremental GC work better in the cases it is currently having problems. Then, I think we should offer it as an optional, opt-in GC. Once we have the kinks worked out, we can consider switching it to be the default. As you say Tim, some people will really value the lower GC pauses that it provides.

I worked on a patch back in December but did not get time to finish polishing it. The key ideas are that we trigger GC every 2000 net new objects, like the generational GC. We size the increments (how many old objects to look at) such that we effectively do a full collection often enough.

Testing this patch with Tim’s test program and my “cyclotron” tester, it seems to resolve the issues (max RSS and # of trash objects stay low, GC pause is still small).

6 Likes

Here’s a more polished patch that seems to work pretty well. It keeps the GC pause low and also seems to handle programs that create mass cyclic trash fairly well too. It seems at least as good as 3.13 in keeping memory usage low. Possibly there are other situations when it “falls down” and does a bad job.

4 Likes

Let’s try some English. In the 3-gen collector (using all default settings), about 1 in 100 collections was gen2, and about 1 in 10 gen1. The rest were gen0.

You appear to be sticking to the newer 2-gen scheme for now, gen0 and “everything that outlived gen0”. On each collection, all of gen0 is scanned, but only a fraction of everything else. The fraction is about 1%, so the old “about 1 in 10 is gen1 and about 1 in 100 is gen2” is spread out across about 100 collections now.

If that’s so,

  • Some trash cycles that waited for gen 1 collection before may be reclaimed earlier now.
  • And some trash cycles that waited for gen 2 collection may be reclaimed much earlier now.
  • But some trash cycles that used to be reclaimed in gen1 will instead get pushed into “everything else”, and will be reclaimed later.

It’s unclear to me how to weigh the tradeoffs. Seems like a “bad case” would be an app that creates a lot of long-lived non-cyclic tracked containers, and cycles that would have been collected in gen1 under the 3-gen collector. Then the latter will end up in a potentially very large “everything else” (hiding among the non-cyclic containers), and potentially take much longer to get collected.

Are there such apps? Certainly. Can I name one? Certainly not. :wink:.

If they’re important, though, I expect that Mark’s sketch of reintroducing 3 generations may be the biggest bang-for-the-buck way to deal with them.

3 Likes

@tim.one There’s a profiling trace for your cyclops.py script (30-second run) saved as gc_trace_cyclops.json (gc_trace_cyclops.json). You can load it into https://ui.perfetto.dev/ to interactively explore the GC profile.

1 Like

Thanks! I haven’t seen one of these before, so don’t know what to make of it. Does anything stick out to you?

The output of the toy already displays everything gc does (well, the current gist, which also keeps track of exactly which cycles have and haven’t yet been collected), except for the non-exposed “work_to_do”.

But I’m already all but certain that should never be allowed to fall below the number of objects in the young generation, and @nas has had major success in a variant that ensures that much.

Not the whole story, but highly suggestive.

2 Likes

I asked an AI what research was done on the payoff of multiple generations and on the typical lifetime of objects. Obviously (essentially) nothing Python specific but there was a fair bit of research done. I won’t post since you can just ask an AI yourself.

Two interesting points: it seems that two generations is the most common design, rather than three. It might be in most systems that more generations have a cost (e.g. bits somewhere, perhaps) whereas with Python’s double-linked lists they are basically free. So maybe three is kind of a sweet spot.

Another thing is that there is a theory that very young objects haven’t had time to die yet and so you should wait a bit. I believe Pablo has some actual data that showed this effect. For the incremental GC, perhaps each increment should take a fraction of the new objects and a fraction of the old. If you make “new” a dequeue and put the new at the end and take the increment from the front, that would give them more time to die. I think with that tweak, the old/new split might be enough to work well.

Finally, there was mention. of a “Merlin object lifetime study”. Building some analog of this for Python would probably give some nice data for us. Python is a unique beast. No unboxed types, no stack allocation, reference counting to free most of the data. Trying to adapt object lifetime trends from other languages could be misleading and maybe we need to do our own measurements.

3 Likes

Or even from other implementations of Python. PyPy has tried several gc approaches, which are worth reading about too:

Ways it differs from CPython include:

  • no reference counting in PyPy; it’s on their gc to recycle all unused space
  • they can also move objects in memory during collection, to reduce fragmentation

The current default scheme appears to be a 2-generation (where the young gen is called the “nursery”), incremental, compacting collector. My experience:

  • despite that some docs claim pauses are under 1ms, I’ve seen pauses of over 10 seconds in large-data, long-running programs.

  • I’ve also often seen a PyPy process bloat to consume almost all of RAM, only to fall waaaaaaaaay back, over and over

Now none of that is due to cycles, though, unless they’re cycles created by RPython for its own purposes. As a Python programmer from back in the days when cycles just leaked, I’ve long avoided cycles whenever possible.

One way CPython skews this: most short-lived containers get removed from gen0 all along as a matter of course due to refcounting. It’s only short-lived cycles that don’t clean themselves up by magic that remain in gen0. Which may make it especially valuable for CPython to “give cycles time to die”.

OTOH, an excess of 2000 allocations over deallocations (threshold1) is a lot of cycles (CPU cycles) already :wink:

2 Likes

More info about PyPy, about how it behaves on my toy app: it’s apparently far less aggressive than CPython (any version) about collecting trash. My toy requires some fiddly surgery to produce useful output, because PyPy has no gc.callbacks workalike.

Short course: the toy displays a line on each iteration when at least one trash cycle has been collected since the last iteration[1]. I fiddled the toy to stop when the iteration that produced some output exceeded 3 million.

  • 3.13.3 produced 1_500 lines.
  • 3.14.4 produced 1_334 lines.
  • PyPy produced only 60 lines.
    • It went as long as > 80K iterations without any output.
    • And then reported that over 100K cycles had been reclaimed since the prior iteration.

So it’s really no better on this app than 3.14 in respect of letting trash cycles pile up.

It’s very different technology, though.


  1. each node contains the iteration # on which it was created; a set of all such iteration numbers is created; and a node’s __del__ method removes the node’s iteration number from that set ↩︎

2 Likes

This tool is based on the extension proposed in Add a module for monitoring GC statistics · Issue #146527 · python/cpython. It currently requires a custom CPython build, so I haven’t published it yet.

Hope the generational GC lands soon. Then I can update get_gc_stats in _remote_debugging and publish my external-process GC monitoring tool once the PR merges.

2 Likes

I saw an interesting post from the PyPy about GC sampling: Low Overhead Allocation Sampling with VMProf in PyPy's GC | PyPy

1 Like

That was under PyPy 3.10-7.3.12. Under the latest, 3.11-7.3.21, it’s down to 19(!) lines. No cycles are reclaimed until iteration 164_775, at which point 163_774 trash cycles are collected.

Later, at iteration 1_879_852, 281_376 cycles were collected since the iteration before it.

Given that no more than 1_000 cycles are ever reachable, that seems quite extreme.

OTOH, the payload is small enough that RAM isn’t under any pressure.. If I increase the payload per node by a factor of 100 (up to 1 MB each), it rises to 33 lines of output. Unlike CPython’s, PyPy’s gc is aware of how much RAM is in use, not just counts of objects. I speculate they don’t bother collecting trash until RAM is starting to show signs of pressure.

Another difference between the PyPy versions in use is that, on my box, the nursery size increased from 1MB to 4MB. And, yup, changing the nursery size (via setting envar PYPY_GC_NURSERY) does have major effects on the number of output lines - the larger the nursery, the fewer the lines. At 16MiB, nearly 2.9M trash cycles pile up.

But since this is raising more mysteries than it’s resolving, I’m dropping this offshoot now :rofl:.

1 Like

I’m jealous of their bump allocator :smiling_face:. Short course: none of pymalloc or mimalloc’s elaborate size classes. There’s a contiguous chunk of memory from which all “small enough” objects are allocated. If you want N bytes, and at least that many bytes are still free, return the address of the first free byte, and increment that address by N. Done. Crazy fast, so far as it goes.

In a free-threaded world, each thread could allocate from its own nursery with no synchronization needed.

Of course there are downsides too. It’s unsuitable for CPython because we never move a Python object in memory. PyPy can, and when a nursery is full it copies the objects in it (the ones that survive gc) into the “old space”, leaving the nursery area empty again.

“Old space” is itself a (much larger) contiguous chunk of memory, and copying to it is just appending to its current high water mark. In the lingo, it’s managed as a “semispace”: When a major collection is performed on old space, it’s viewed as “from space”, and objects that survive are copied to an equally large “to space” contiguous region. Then the roles of “from space” and “to space” are swapped.

So fragmentation is squashed out of existence.

How they implement Python’s id() is a mystery to me :wink:

2 Likes

Yeah, the bump allocator seems like a very neat idea! :slight_smile:

Is it applicable to CPython? I’m not deeply familiar enough with GC internals to evaluate how well it would work for our use cases, or how much refactoring it would require. Maybe worth exploring down the line?

It’s unsuitable for CPython because we never move a Python object in memory.

IIUC, if PyObjects made opaque, we could implement a compacting/moving GC that relocates internal data.

1 Like

I added TIm’s deque based stress test to cyclotron. I also added a “btree” workload that creates a more complex data structure. One parameter I’m missing now is the number of live non-garbage objects. That affects the long_lived_pending < long_lived_total / 4 trigger condition, for example. So it would be good to add that.

2 Likes

Not new, though - over 50 years old.

It’s more CPython’s C API that blocks it. Moving collectors don’t work on indirect pointers to objects, the objects themselves move, with no hint remaining of where they used to live. So, e.g., a CPython list contains a contiguous array of pointers to the objects it contains. All of those pointers may need to be changed. tp_traverse exposes those pointers to gc, but nothing exposes the addresses of those pointers. Nor has there been any need to, since object addresses never change.

We could, in theory, do that for containers CPython implements, but there’s no API in place for extension modules that allows to discover (let alone change) the pointers they’re holding on to. For example, picture an extension that saves (a pointer to) a Python list in a file static. That’s fine now, because the list never moves. But if it can move, there’s no way for CPython’s gc to know that the extension has a file static that also needs to change its value (to point to list’s new memory location).

A more obvious downside is that semispace collectors can only use half of available RAM for “from space”. An area at least as large as “from space” has to be reserved for “to space”, which sits unused until gc needs it to copy into. Then “to space” and “from space” switch roles, and the old “from space” sits unused until the next full collection.

Although CPyhon’s gc may be able to get the same effect by linking things together in “from space” and “to space” lists, using the gc header struct pointers.

The PyPy docs say they use “Cheney’s algorithm” for this (which dates back to 1970, and easy to find info about on the web). Like Python’s gc, it’s a breath-first traversal of the object graph that only requires O(1) space of its own. Although that statement wishes away that it also requires O(n) space for the destination “to space”. Like the similar O(1)-space claim for CPython’s gc wishes away that it also requires O(num_objects) space for storing two gc-related pointers in each container object that participates in cyclic gc.

Bump allocators are usually combined with compacting collectors. Non-moving systems are well served by taking rounded-up size classes into account, which goes a long way to reducing external fragmentation at the cost of embracing some usually minor internal fragmentation. Bump allocators enjoy 0 internal fragmentation, but in the absence of moving collections can suffer extreme external fragmentation. Combining bump with moving results in 0 total fragmentation.

Yes, that’s a lot of hand waving :wink: It’s big topic.

The rub being that object addresses are 'internal data" that’s exposed directly to extension modules. We have no way to inform extensions when a previously exposed address needs to change, nor to know where extension modules stored those addresses.

2 Likes

Thanks for the great explanation! I need a bit more time to gather my thoughts.

1 Like

Would you mind if we replace your gc_callback with the get_gc_stats version once it becomes available? It provides more data, including timestamps for when collection starts and ends.

We could also just enrich the existing callback with timestamps and extra metrics, since that data’s already being collected anyway.

Optimist :wink: A moving collector just can’t support passing memory addresses to extensions without major pains. The API has to pass & accept some form of “handle” instead, a value of some type that’s invariant & unique over the lifetime of an object, regardless of its memory address. The implementation of handles has to know how to get at current addresses, but code using a handle API does not (& generally can’t even if it wants to - all of CPython’s internal implementation details remain hidden at a handle API level).

I know little about it, but there is a project to create a handle-based C API for Python, called HPy:

It intends to replace CPython’s native C API, and allow C extensions using the HPy API to run much faster with other implementations (like PyPy, which has long struggled with C extensions, in large part because it does move objects around in memory as part of its compacting gc).

HPy most likely deserves more love than it’s gotten so far :smiling_face_with_sunglasses:

EDIT: alas, I see now that HPy hasn’t seen any new activity in going on 3 years, seemingly ending before its 1.0 release :frowning: .

2 Likes

Modern allocators such as mimalloc take multi-threaded allocation quite seriously. I’m not sure a bump allocator would make that much of a difference on non-toy workloads.

But in exchange you have to pay the cost of regularly copying objects from one space to another. I don’t know how far that can go in real-world applications.

Something like a hash table IIRC.

1 Like