Switching from refcounting to libgc

thomas · May 8, 2019, 9:34pm

I’ve talked to a couple of people about this at PyCon2019, and there was more interest than I expected, so I figure I should just post this here:

I’ve (hackishly, as a proof of concept) replaced refcounting with libgc (aka BDW GC, aka Boehm GC, the canonical generic, conservative mark-sweep (non-copying, non-moving) GC):

I initially just disabled pymalloc and linked against a libgc with --enable-replace-malloc, which interposes malloc() and free(). I then commented out the free() call in _PyMem_RawFree, and added a libgc destructor (called when libgc thinks the object is no longer reachable) that warned when the object had a refcount that wasn’t 0. This showed me several things: libgc didn’t try to deallocate anything that shouldn’t have been deallocated, and (comparing against a regular build without pymalloc) the performance difference wasn’t really noticeable.

(Using libgc with --enable-replace-malloc means libgc tracks all allocations, not just ones made with _PyMem_RawMalloc. For libgc to work it needs to be able to see all memory that might potentially contain pointers to its allocations. Because with refcounting code can just store a PyObject * anywhere and INCREF it to make it not go away, using --enable-replace-malloc is the easy way to do that.)

I then disabled Py_INCREF and Py_DECREF, disabled the gc module (because libgc already takes care of circular references), and fixed weakrefs to work with libgc. This is the current state of the thing. It works well enough to run setup.py and much of the testsuite (and passes a whole bunch of tests), but it has a tendency to segfault randomly while running tests – probably because of not locking correctly in the right places. (I haven’t seen it segfault while running setup.py, so I’m pretty sure it’s to do with threads.) I have not measured the performance of this at all, yet.

Just to be clear: this doesn’t remove the GIL – it doesn’t even remove the ob_refcnt struct field – it just removes refcounting, which is the main (but not only) reason for the GIL. I don’t think this approach is source-level-compatible enough for the real world. It’s just an experiment, and I pushed the branch to GitHub primarily so Larry could look at it for his Gilectomy work.

For a more backward-compatible approach at introducing real GC, I have something in mind like this:

Introduce a new, completely separate PyGCObject API.
Leave PyObject to be refcounted as normal. Don’t change the struct at all.
Provide a proxy PyObject wrapper for PyGCObjects, so that anything taking (or returning) a PyGCObject can be called from PyObject-oriented code.
Provide a proxy PyGCObject wrapper for PyObjects, so that anything taking (or returning) a PyObject can be called from PyGCObject-oriented code.
Migrate CPython itself from PyObject to PyGCObject.
Only require the GIL for PyObject-touching code (and many fine-grained locks for PyGCObjects that need locking).

I’m not yet sure how feasible this approach is (quite apart from the 2x explosion of exposed APIs) – I have the beginnings of an experiment with this in a different branch, but I won’t know how well it works until I make at least dicts, strings, ints and ceval.c use PyGCObjects… I may get to that in the next couple of weeks.

pitrou · May 9, 2019, 8:16am

Interesting. I’ll be curious to learn about your findings along the way.

nas · May 25, 2019, 10:20pm

Hi Thomas,

I did the same kind of experiment way back in 2000. There is still a patch on my web site for it. If I recall, performance was a fair bit worse and there was a few issues to work out.

After I did the Boehm GC experiment, I moved to the idea of cyclic GC. Tim Peters helped a lot of make the gcmodule.c code polished. The idea for the double-linked list design was apparently from Eric Tiedemann. I think I was working on similar GC ideas (based on Tim’s cyclops tool) but I was using hash tables or trees rather than linked links.

Even after these 19 years, I still wish that we could have a higher performance GC for Python. I don’t think it can be done without breaking the C-API. I feel like the overhead of the current reference counting plus the cycle GC is much higher than compared to other GC languages.

tim.one · May 26, 2019, 10:29pm

It may well be, but if the total doesn’t amount to more than (say) 5% of runtime, nobody is going to pay a million dollars to reduce it .

Regardless, I suggest anyone looking at this pick the brains of the PyPy folks! They don’t use refcounting at all, and from this piece had already - 5 years ago - gone through at least 6 GC alternatives different enough to deserve different names. I believe they’re still using the “incminimark” variant of the last one on that page.

From the description, that’s a synchronous (not run in a different thread) generational mark-&-sweep GC, which works incrementally (no “long pauses” for GC). By all accounts I’ve read, it is significantly cheaper than CPython’s collection of GC subsysterms + small-object allocator.

I’ve picked up that partly it uses tricks borrowed from obmalloc.c to segregate “old” smaller objects by size, and new objects are obtained from a “nursery” simply by incrementing the nursery’s highwater mark pointer by the size of the object. Minor collections work only on the nursery, and all surviving objects there are copied to an older generation (so the nursery is entirely empty again). To preserve that id() doesn’t change, if it’s applied to a nursery object they also reserve space for it in the next generation, to which it will be copied if it survives the next minor collection.

I mentioned the last as a reminder that there are always worlds of problems .

In any case, that’s enough of a sketch to see that it’s extremely well suited to programs that generate very short-lived small objects at a ferocious pace. Allocation from the nursery is extremely simple (so also extremely fast), and “collecting them” if they don’t survive the next minor collection is simply a matter of not copying them to an older generation.

At a higher level, I believe things like the Boehm GC are not well-suited to “programs that generate very short-lived small objects at a ferocious pace” - which is a whole bunch of Python programs, given that even scalar numeric types are heap-allocated. Refcounting deals with those very efficiently, apart from the bad effects of bloating the implementation code with mountains of decrement-test-branch decref stanzas. But the approach PyPy evolved appears to be even better for that than refcounting.

njs · May 26, 2019, 11:07pm

PyPy’s GC is sweet, but to use it in CPython you’d need some super clever way to reconcile our tendency to pass around raw PyObject*s everywhere with a compacting GC.

tim.one · May 26, 2019, 11:39pm

Which is why I said “pick their brains” rather than “take their code”

pitrou · May 27, 2019, 2:00pm

Keep in mind the standard way of using PyPy is with the JIT, and this mode many small objects will never actually be stored on the heap. I’m not sure how their GC fares against refcounting in a non-JITted situation, i.e. everything stored as dynamic objects on the heap.

nas · May 27, 2019, 5:40pm

Conventional wisdom in the GC world is that if you want a high performance collector that handles “short-lived small objects [allocated] at a ferocious pace” then you need a compacting GC. I.e. have a nursery for new objects and move them elsewhere if they turn out to be long-lived, compacting as you move. When the GC moves an object, it has to update all if references to that object. The current C-API for Python using PyObject pointers is not at all friendly to that kind of GC. I don’t know how you could make it work.

There was some discussion on the C-API mailing list about introducing a new API that uses opaque handles or makes PyObject itself opaque. In the latter case, PyObject would be an indirect reference to the memory block managed by the GC and so it could be moved during collection.

For a simple and clear description of a generational, compacting, GC, you could read the description of the Caml Light GC. The source code is also quite readable. For a modern design (not actually implemented), see the new collector for LuaJit. I would guess Caml and Lua are fairly similar to Python in terms of allocation patterns.

It is interesting that Go uses a non-moving collector. Here are some details on the Go GC. I think the non-moving collector is viable for Go because it is a “value-orientated” language. So, they don’t create nearly so many small, short lived objects for the GC to deal with. If you want to look at source code, the guts seem to be in the mgc.go file. Nice for them that they can write so much of the system in Go itself.

Switching CPython to a different GC doesn’t seem feasible, at least in any near term timeframe. The idea that PyObject pointers can’t move is so system-wide, it seems impractical to change that. Might was well write a new Python VM. I don’t think a non-moving GC can be performance competitive with reference counting. If I understand Tim correctly, that’s what he told me years ago and Tim is seldom wrong with his intuitions. Maybe there is some new non-moving GC design that would work for us. I think the Boehm GC is unlikely to be it.

I think it is possible we could replace the cyclic GC with something better. If we could find all object roots, we could use a traditional-style mark & sweep collector. Maybe that could be better. As mentioned in a different topic, I’m pursing an idea to get rid of the PyGC_Head for small objects.

I also think it is useful if we could have a C-API that doesn’t lock the Python VM into using reference counting. That’s one of the things getting discussed on the C-API list. Imagine I write a new, high performance Python VM. It can run all .py modules just like CPython. Does it become successful? This is basically what happened with PyPy it seems to struggle a lot to be adopted. The ecosystem of extension modules for CPython is too important and without a solution for that, other VMs struggle to find users. If extension module authors would have a API that could work with both CPython and with new, higher performance VMs, that would be a great advance. How we could get there from here is a challenge but maybe one worth tackling. I think a successful solution will be one that minimizes the efforts for extension module authors.

pitrou · May 27, 2019, 5:49pm

AFAIK, “convention wisdom in the GC world” tends to vary based on the latest research and engineering. I don’t know what the current state-of-the-art is. There has even been research to make reference counting more competitive: see e.g. https://stackoverflow.com/questions/18800165/the-research-papers-say-its-possible-to-have-real-time-reference-counting-and-c for pointers

pitrou · May 27, 2019, 5:51pm

By the way, there are different possible goals:

minimize the total cumulated cost of memory management (including allocation, garbage collection, possibly reference counting)
minimize the duration of GC pauses

thomas · May 27, 2019, 10:02pm

Actually, I have two different goals in mind, as mentioned in my original post:

experimenting with a C-API to evolve CPython to that’s amenable to a different GC mechanism, while allowing some backward compatibility. (I’ve sketched the approach I’m investigating in the original post.)
Allowing Larry to continue with the Gilectomy work, which is currently (as I understand it) somewhat stalled on the various attempts to make refcounting work in some lock-free or lock-spare manor.

thomas · May 27, 2019, 10:04pm

I thought I remembered something like that, but I couldn’t find it when I searched for it. Turns out it didn’t matter, since too much changed to use that patch anyway :)… And the complicated parts were things that didn’t exist back then, like cyclic-GC and weakrefs.

tim.one · May 27, 2019, 10:05pm

Neil, I do believe I agree with everything you wrote My point at the start is that it “would be good” to pick PyPy’s collective brains, not that CPython could do the same things they did. They’ve collectively spent more time thinking about GC in Python than all of us combined. If we had a blank slate (and we don’t), I’d take PyPy’s latest GC as a starting point - it does well, and it appears the fundamentals of it haven’t changed in years. If people don’t want to ask them, fine by me too. But I’m not going argue about what they do.

One meta-point to take, which is really more applicable to your topic about removing PyGC_Head for small objects: “the best” systems uniformly consider gc and object allocation to be two aspects of a larger problem, so design them to work well with each other. For example, PyPy’s nursery is unrealistically simplistic unless it’s coupled with a compacting move later. In CPython they’ve been almost entirely uncoupled. So, ya, I think your intuition is correct in the other topic: to get best results, small gc’able objects need a different allocator than obmalloc. Which may be very similar, but devotes arena and/or pool space to making gc more efficient and/or dirty fewer pages (whatever your goals may turn out to be).

Which is another way of “picking brains”. Alas, I’m not aware of any truly new “breakthrough idea” in gc. Research on making refcounting more competitive is older than Python’s gc, and the Bacon & Rajan paper referenced in the Stackoverlow question is taking its core idea from work Rafael Lins did before obmalloc was written.

The key insight remains the same: cyclic trash can’t be created by incrementing a refcount, but only by decrementing a refcount that nevertheless remains non-zero. That’s it! All the rest is elaboration on the consequences.

In short, whenever a refcount decrement doesn’t hit zero, the object is “a suspect”. Take that object and follow its pointers to find the transitive closure (TC) of all the objects reachable from it. If (and only if) all the references within the TC are from objects in the TC, the TC is cyclic trash.

That last part is basically what Python’s cyclic gc already does. But instead of trying to identify a “suspect” after every decref, it ignores decrefs entirely and considers all objects in the generations through the oldest one being collected to be “suspects”, and computes the transitive closures of all of them “at once”. On the face of it, breaking that work into “no, just do one suspect at a time” doesn’t seem to have any clear advantage.

The other part of the paper is about complications to allow their cycle collector to run concurrently with the program. That part appears to be a real advance over where Lins left it - but not clear CPython would want to pursue it.

nas · May 27, 2019, 10:30pm

Maybe I’m wrong about the requirement of a moving collector for good performance. After re-reading the LuaJIT design linked above, I see it is non-moving. I remember now why I so interested in the design and saved the link to that page: since it was non-moving, it looks possible to implement in CPython. If we control the allocation location of container objects (which we do, PyObject_GC_New/PyObject_GC_Del), we should be able to implement the same design. I think Mike Pall is really talented and so I have some faith that the design would perform well.

Real-time doesn’t mean fast and I’m not surprised a reference counted GC can work in those applications. We could adapt our cyclic collection process to work in smaller steps and also be “real-time”. That doesn’t interest me much. I would like to have lower collection times but I’m more interested in total runtime of the program. Hard to profile but I suspect it is much more than 5%, maybe more like 20%.

The LuaJIT 3.0 GC design is another reason why I’m pursuing my “no GC head for small objects” "idea. If I can make that work, it is one step to implement a similar GC. Next, we would have to ensure every PyObject that is a container is also a GC object. Then, make sure we can find all GC roots. Then, we can remove the reference counting and the cyclic collector and use a mark & sweap algorithm like the LuaJIT one.

The C-API is still a problem in that case. E.g. extension modules can stash away PyObject pointers and the GC can’t find them. So, we would still have to solve the extension API. Using handles would be a solution.

nas · May 27, 2019, 10:37pm

Certainly stealing ideas is better than coming up with our own. I’m concerned how useful their design would be given that I don’t think CPython can realistically switch to moving PyObject pointers. So, we can’t use the young object nursery like they do and that seems a key feature of the design.

What about the trashcan? It now abuses the gc_pref/gc_next links. I’m not sure how I’m going to solve that with my PyGC_Head removal project. Did you deal with that with your Boehm GC patch?

nas · May 27, 2019, 10:57pm

A slightly different idea is to have PyObject be two different structs, depending on if code is CPython internals or not. If you are in “internals” code then PyObject is like the PyGCObject you describe, i.e. it has no refcnt field. In the internals, Py_INCREF/Py_DECREF become no ops. Internal code will have to “play nice” with the GC and not hide away GC roots.

In external code, PyObject becomes like the PyObject in your design, i.e. is reference counted. For a VM that implements non-ref-counted GC, that external PyObject acts as a handle to the internal PyObject struct.

I think this design has advantage of not having to changing so much source code. In the vast majority of places, PyObject can be treated as an opaque struct. When I did my tagged pointer experiment, there was a surprisingly little amount of code to be changed. I have a patch that makes it happen so maybe you would be interested in that. Accessing ob_refcnt is very rare. Easy to fix those. Accessing ob_type is more common but I created a Py_TP() macro and it is pretty easy to fix those too.

pitrou · May 27, 2019, 10:57pm

It should be reasonably easy, at least if you’re under Linux. Run your workload with the Linux perf tool, then at the end inspect the report to find out how much time was spent under gc_collect (or whatever that function is called nowadays).

nas · May 27, 2019, 11:01pm

I want to include time in Py_INCREF/Py_DECREF as well. I’ve based my 20% guess on Linux perf. When I implemented my tagged pointer idea, I had to tweak Py_INCREF/Py_DECREF. I made them C99 inlined functions. Then, Linux perf is able to record time spent in them. It seems like a pretty significant amount of time is spent there and I think the CPU cycles are mostly due to reading and writing to RAM. Some is due to the branch prediction on the Py_DECREF code path.

Edit: I remember what I did now regarding the tagged pointers. My experiment added a few machine instructions to the Py_INCREF/Py_DECREF code paths. Total execution time increased to 108% of original. Using Amdahl’s law, it is possible to estimate total run time spent in these functions based on the overall slowdown due to these new instructions. For Py_INCREF, it went from 6->8 cycles (L1 cache) or 14->16 cycles (L2 cache). I think that gives 30% of execution time in Py_INCREF for the whole program. Maybe my estimates are flawed but 5% seems too low for GC/refcount overhead.

Edit2: some more details on the Py_INCREF code change. Disassembly of tagged pointer Py_INCREF in listmodule.c is below:

 1ff:       48 8b 1e                mov    (%rsi),%rbx
if (_Py_IsTaggedPtr(op)) {
 202:       f6 c3 01                test   $0x1,%bl
 205:       0f 85 00 00 00 00       jne    20b <PyList_AsTuple+0xdb>
     ((PyObject *)(op))->ob_refcnt++);
 20b:       48 83 03 01             addq   $0x1,(%rbx)

The extra instructions that the tagging adds is the “test” and “jne”. Based on cycle counts for the Haswell CPU:

TEST I,R 1 cycles
JNE short 1 cycles
ADD I,M 2 cycles
L1 latency: 4 cycles
L2 latency: 12 cycles

If the Py_INCREF went from 6 cycles to 8, that would be 30% of CPU time spent in Py_INCREF.

tim.one · May 27, 2019, 11:42pm

Which reminds me: over a decade ago it was on my list of vague “someday” ideas to see whether CPython could benefit from this. For reasons already explained, it’s only necessary for cyclic gc to initiate traversals on objects that have been decrefed at least once since the last time.

Skipping all the details, the bottom line point is that it’s easy to think of cases where that would do arbitrarily better than what’s done now. For example, picture a forked process inheriting a giant list of objects it never looks at. Since the list is never referenced, it’s never decrefed, so gc will never crawl over it (although it may crawl over some objects in the list if they happen to be reachable from objects that are decrefed).

That was more the point of the referenced Stackoverflow paper: traditional mark-&-sweep crawls over the entire universe of reachable objects, but this form of cyclic gc only needs to crawl over the (possibly very much smaller) galaxy of objects reachable from objects that survive decrefs. So, overall, combine the latter with refcounting, and it may or may not be cheaper than pure mark-&-sweep (they didn’t give details in that particular paper, but reported they saw wins in both directions).

thomas · May 28, 2019, 8:21am

The trashcan is a non-issue: Py_DECREF never calls tp_dealloc, so there’s no nested destruction going on.

(About changing what ‘PyObject’ means internally and externally.) It’s an option, but I don’t really see the point. Having two different concept be named the same thing is confusing, the CPython code itself isn’t generally speaking used with different versions of CPython (so it doesn’t have to remain compatible), and it doesn’t allow for piece-meal conversion from PyObject to PyGCObject.