Switching from refcounting to libgc

The question is whether a large proportion of those “idle” objects would not be reachable from objects that are decref’ed.

Consider this:

  • any object has a reference to its class
  • any class has references to its methods
  • any pure Python method (or function) has a reference to its globals dict
  • a significant proportion of those globals dicts will have a “sys” entry pointing to the sys module
  • the sys module holds references to all live modules through sys.modules
  • modules hold references to their globals

As soon as pure Python objects are decref’ed (which is extremely likely), it seems that the object subgraph that needs traversing by the GC may be close in size to the entire object graph.

Note that Facebook (I think it was them? @ambv) contributed something more radical:

(discussion here: Issue 31558, “gc.freeze() - an API to mark objects as uncollectable”, on the Python bug tracker)

Yes, it’s quite possible that the scheme wouldn’t pay at all in CPython - or would require other changes to make it possible for motivated users to create conditions under which it would help (just as, e.g., the gc.freeze() hammer was added).

Here’s a cute little module that only defines a trivial function, not even a class. Using gc tools to chase down references, importing this under Python 3.7.3 on Windows finds well over 6,000 objects reachable from f.

    from gc import get_referents as gr
    from collections import Counter

    def f():
        return None

    # Transitively chase references from f, using id() for de-duplication
    # since not all of the reached objects are hashable.  alllist grows
    # while we iterate over it, so the loop visits everything reachable.
    alllist = [f]
    allset = set([id(f)])
    for o in alllist:
        for p in gr(o):
            pid = id(p)
            if pid not in allset:
                alllist.append(p)
                allset.add(pid)

    print(len(allset))
    # Tally how many objects of each type were reached, most frequent first.
    t2n = Counter()
    for o in alllist:
        t2n[type(o)] += 1
    for k, v in sorted(t2n.items(), reverse=True, key=lambda t: t[1]):
        print(k, v)

These are the most frequent types (from the start of the final loop’s output):

<class 'str'> 3010
<class 'function'> 675
<class 'code'> 668
<class 'tuple'> 399
<class 'builtin_function_or_method'> 368
<class 'dict'> 295
<class 'int'> 278
<class 'type'> 220
<class 'getset_descriptor'> 80
<class 'classmethod'> 49
<class 'module'> 44

It’s not materially different if the function is thrown out and replaced with this instead:

    class F:
        def __init__(self, n):
            self.n = n
    f = F(3)

Of course I was wrestling with cycles in Python long before cyclic gc was added, so I’m not qualitatively surprised by that (but quantitatively surprised by how very much is reachable now after just starting an interpreter and importing that module first thing).

That’s why, although I was aware of Lins’s work at the time, it didn’t seem promising enough to ever make it to the top of my todo list. Luckily, Neil thought up a different way that works well.

In any case, CPython’s sprawling object graph creates speed challenges for any cyclic gc scheme, since there’s just no way “even in theory” to reliably detect trash cycles without considering every edge. It’s possible nobody else’s scheme will work well for us :wink:.

Are you aware that the current gcmodule code only visits objects in the generation being collected? I think this is a somewhat recent thing. My memory is fuzzy, but I think in the original code the visit_* functions would crawl through whatever was found. There is now a gc_is_collecting() function that tests whether a PyGC_Head pointer is in the generation being collected.

The idea of only looking at objects with decremented reference counts would seem to require adding instructions to Py_DECREF. So, I would guess that’s not going to be a win for most programs.

Crazy idea: maybe the gc.freeze() idea could be extended to disable reference counting on all objects found when it is called. If we allocate all PyObjects out of arenas we control (aligned the right way), we can add bitmaps at the start of the arena to specify whether the object is frozen. If frozen, Py_INCREF/Py_DECREF and gc.collect() would ignore the object.
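
To make that concrete, here’s a minimal sketch of what the bitmap test might look like, assuming power-of-two-aligned arenas so the arena header can be found by masking the object’s address. Every name here (ARENA_SIZE, arena_head, is_frozen, frozen_aware_incref) is a made-up illustration, not CPython or any existing patch:

    /* Hypothetical sketch: arena-aligned allocation with a "frozen" bitmap.
       None of these names exist in CPython; the point is just the shape of
       the extra test that Py_INCREF/Py_DECREF would need. */
    #include <stdint.h>
    #include "Python.h"

    #define ARENA_SIZE ((uintptr_t)1 << 20)   /* arenas aligned to 1 MiB */

    typedef struct {
        /* one bit per 16-byte allocation slot in the arena */
        uint8_t frozen[ARENA_SIZE / 16 / 8];
        /* ... object storage follows ... */
    } arena_head;

    static inline int
    is_frozen(PyObject *op)
    {
        /* mask the object address down to the start of its arena */
        arena_head *a = (arena_head *)((uintptr_t)op & ~(ARENA_SIZE - 1));
        uintptr_t slot = ((uintptr_t)op & (ARENA_SIZE - 1)) / 16;
        return (a->frozen[slot / 8] >> (slot % 8)) & 1;
    }

    /* Incref/decref then grow an extra data-dependent test-branch: */
    static inline void
    frozen_aware_incref(PyObject *op)
    {
        if (!is_frozen(op))
            Py_INCREF(op);
    }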

I’m not quite sure what you have in mind, and the gc code has changed a lot, but it looks the same in this respect to me as it always has.

What gc_is_collecting() checks, I believe, used to be done by testing whether the gc header’s refcount copy “was negative”. For example, part of visit_decref now:

        if (gc_is_collecting(gc)) {
            gc_decref(gc);

and before:

        if (gc->gc.gc_refs > 0)
            gc->gc.gc_refs--;

Some flags (one flag? more than one? unsure) were added to the gc header, but there was no room for them, so bit-fiddling was added and hidden behind layers of macros and inline functions.

Before, between collections every gc’able object’s refcount copy was forced to one of two negative values (GC_UNTRACKED or GC_REACHABLE), and gc started by calling update_refs(), which set the refcount copy to the actual (> 0!) refcount of the objects in the generation being collected.
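
For reference, the old update_refs() was roughly this - a simplified reconstruction from memory of the older gcmodule.c, not an exact copy:

    /* Simplified sketch of the old update_refs(): for every object in the
       generation being collected, overwrite the (negative) sentinel in the
       gc header with the object's true refcount.  Objects in other
       generations keep their negative values, which is what visit_decref
       used to test for. */
    static void
    update_refs(PyGC_Head *containers)
    {
        PyGC_Head *gc = containers->gc.gc_next;
        for (; gc != containers; gc = gc->gc.gc_next) {
            assert(gc->gc.gc_refs == GC_REACHABLE);
            gc->gc.gc_refs = Py_REFCNT(FROM_GC(gc));
        }
    }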

In any case, that doesn’t have anything to do with the gc.get_referents() I used to get the “over 6,000 objects reachable” result. That function is in the gc module, but doesn’t use any of its collection machinery - it just calls the argument’s tp_traverse implementation directly. For example, here’s a goofy way to reverse a list :wink::

>>> gc.get_referents([1, 2, 12])
[12, 2, 1]

Yes, if the new refcount wasn’t zero, it would have to “do something”. Perhaps as “simple” as setting a persistent flag in the GC header. Except there’s no room for a flag.

As before, the real gain would come if the union of transitive closures of objects that survive decrefs is small compared to all objects. The paper described some success in some Java programs at running refcounting + this kind of cycle detection as opposed to using Java’s mark-&-sweep collector. Which is quite an achievement! Sun paid Guy Steele and a small office of PhDs to work on Java’s gc for a couple years - it’s good.

But Python objects can reach an awful lot of stuff, and Antoine spurred me to write that teensy test case, which revealed just how extreme that’s become. I’ll note again that those thousands of objects were reachable from a module function (or class instance) in a module that only imported two functions, but not the modules (gc and collections) the functions came from - in a program that did nothing at all except import the module.

It’s probably not going to get better than that :wink:.

I’m unclear on why someone might want to. If someone wants all gc’able objects in existence to be ignored forever after by gc, they can call gc.freeze() now. Why would they want refcounting disabled too? That seems downright dangerous; e.g., there are a few places in CPython that trigger “optimizations” based on an object’s refcount, and they’re not safe if the refcount is lying.

If this is about a fork micro-optimization, since I’m on Windows my answer to “what is fork good for?” is “who cares?” :stuck_out_tongue_winking_eye:

My thought was that incref/decref is quite expensive in terms of the CPU cache. Even on an OS without fork(), isn’t it a good idea to avoid spraying writes all over the memory space?


Perhaps you want to experiment with buffered reference counting or other sophisticated schemes. There are pointers in this intriguing presentation: http://www.cs.tau.ac.il/~maon/teaching/2014-2015/seminar/seminar1415a-lec12-conc-rc.pdf


If you do, you should talk to @larry as well, as he’s done a number of experiments as part of the Gilectomy.


If all else were equal, sure. Maybe this would be great for, e.g., a giant data structure that’s conceptually read-only after it’s built? I see plenty of those.

But there are costs too. Like enduring 3 test-branch snippets per incref/decref pair instead of the current 1. The “is it frozen?” branch outcomes depend on the data, not on the branch code locations, so they may not be predictable. The extra code also pressures the instruction cache. If you’re thinking about an arena-style allocator with flags physically near the start, there’s also code to locate the “frozen?” flag and isolate it for testing - and filling cache lines with flag data as well as object data.

And, perhaps, “etc”. So while it may be an overall win for data where it applies, it seems a pure loss on every count for data that isn’t frozen (more code, 3x the test-branches, more cache lines, and it ends up mutating the refcount part anyway).

I wouldn’t discount it offhand just based on that, but I hope there are bigger fish to fry :smile:.


I implemented “buffered reference counting” in my Gilectomy branch.

The good news: it made reference counting scale nicely across cores! I had some strategies in mind that suggest it could have scaled up to an arbitrary number of cores.

The bad news: it adds a lot of overhead. And given how often one makes reference count changes in CPython, this slowed things down by a shocking amount, like 5-10x or something.

True story: I derived the technique on my own, then discovered a nice writeup of it in “The Garbage Collection Handbook” in the (slim!) chapter on multithreaded-friendly reference counting. My final implementation is a bit sleeker than theirs but the principles are the same.
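
For readers who haven’t seen it, here’s a very rough, deliberately non-scalable sketch of the buffered idea: mutator threads log refcount deltas instead of touching ob_refcnt, and a collector thread applies the log later. A real implementation uses per-thread, mostly lock-free buffers plus ordering rules that keep objects alive until all buffered increments land; this toy version (all names hypothetical, overflow handling omitted) just shows the shape, and hints at why the per-operation overhead is so much higher than an in-place increment:

    /* Toy sketch of buffered reference counting (names hypothetical). */
    #include "Python.h"
    #include <pthread.h>

    typedef struct { PyObject *obj; int delta; } rc_entry;

    static rc_entry rc_log[1 << 16];
    static size_t rc_log_len = 0;
    static pthread_mutex_t rc_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called in place of Py_INCREF (delta = +1) / Py_DECREF (delta = -1). */
    static void
    buffered_refcnt(PyObject *op, int delta)
    {
        pthread_mutex_lock(&rc_lock);
        rc_log[rc_log_len++] = (rc_entry){op, delta};
        pthread_mutex_unlock(&rc_lock);
    }

    /* Run periodically by a dedicated collector thread: apply the deltas. */
    static void
    apply_rc_log(void)
    {
        pthread_mutex_lock(&rc_lock);
        for (size_t i = 0; i < rc_log_len; i++) {
            rc_log[i].obj->ob_refcnt += rc_log[i].delta;
            if (rc_log[i].obj->ob_refcnt == 0)
                _Py_Dealloc(rc_log[i].obj);
        }
        rc_log_len = 0;
        pthread_mutex_unlock(&rc_lock);
    }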

The only other technique mentioned in that chapter is something called “The Recycler”, which I can’t claim to understand. It’s a lot more complicated, and I dimly recall it would require compile-time support - like modifying your Java compiler to emit extra code for object lifetime management. I remember thinking it wouldn’t be appropriate for hand-written code like CPython’s innards. But I may be wrong / misremembering.


“The Recycler” was implicitly (not by that name) mentioned earlier here when discussing a Bacon & Rajan paper cited in a StackOverflow question Antoine pointed at.

It’s a refcount-based cycle collector intended to run concurrently with “the real” program. “Concurrent” makes it complicated, and, like any such thing, it needs extra support to efficiently detect concurrent mutations in the object graph that can invalidate tentative conclusions-so-far about what may and may not be cyclic trash.

The “concurrent” part is important, because without running in parallel it’s slower (in elapsed wall-clock time) than a simple stop-the-world mark-and-sweep collector. Its basics have much in common with Python’s current stop-the-world cyclic gc, but it potentially runs “all the time” (technically, instead of starting from all the objects in “a generation” as Python does, it starts from all the objects that have survived at least one decref, kicking root objects out of that set if they happen to be incref’ed again before it gets to them - which is the easy part :wink: ).

After an object has been traversed, the tentative conclusions-so-far based on objects reachable from it have to be thrown out if one of the pointers in the object changes. That’s the hard part, and compiler knowledge of which pointer stores are relevant to cyclic gc can automate an otherwise error-prone dance.
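
Here’s a heavily simplified, synchronous sketch of the trial-deletion core from the Bacon & Rajan paper, ignoring the concurrent machinery entirely; the obj struct and the root buffering are hypothetical, not anything in CPython:

    /* Simplified, non-concurrent sketch of Bacon & Rajan "trial deletion".
       A decref that leaves a nonzero count makes the object a candidate
       cycle root; later, references internal to the roots' transitive
       closure are trially subtracted, and whatever still has a nonzero
       count must be reachable from outside. */
    #include <stddef.h>

    typedef enum { BLACK, GRAY, WHITE, PURPLE } color_t;

    typedef struct obj {
        int rc;                 /* reference count */
        color_t color;
        size_t n_children;
        struct obj **children;  /* outgoing pointers */
    } obj;

    static void mark_gray(obj *s);
    static void scan_black(obj *s);

    static void
    decref(obj *s)
    {
        if (--s->rc == 0) {
            /* truly trash: free s and decref its children (omitted) */
        }
        else {
            s->color = PURPLE;  /* possible root of a trash cycle */
            /* append s to the candidate-roots buffer (omitted) */
        }
    }

    /* Phase 1: trially subtract the references internal to the subgraph. */
    static void
    mark_gray(obj *s)
    {
        if (s->color != GRAY) {
            s->color = GRAY;
            for (size_t i = 0; i < s->n_children; i++) {
                s->children[i]->rc--;
                mark_gray(s->children[i]);
            }
        }
    }

    /* Phase 2: anything still with rc > 0 is externally reachable, so
       restore it (and everything it reaches); the rest is cyclic trash. */
    static void
    scan(obj *s)
    {
        if (s->color == GRAY) {
            if (s->rc > 0) {
                scan_black(s);
            }
            else {
                s->color = WHITE;   /* tentatively garbage */
                for (size_t i = 0; i < s->n_children; i++)
                    scan(s->children[i]);
            }
        }
    }

    static void
    scan_black(obj *s)
    {
        s->color = BLACK;
        for (size_t i = 0; i < s->n_children; i++) {
            s->children[i]->rc++;
            if (s->children[i]->color != BLACK)
                scan_black(s->children[i]);
        }
    }

Python’s current gc performs essentially the same subtract-internal-references pass, just starting from an entire generation rather than from a buffer of decref survivors.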

Short course: there’s nothing here obviously more promising than Thomas’s original plan of trying the Boehm mark-sweep collector again. Indeed, there’s nothing here even plausibly more promising than that :wink:.

I’m curious and interested in this discussion. You may want to look at the reusable GC policies in Eclipse OMR: https://github.com/eclipse/omr

That’s easy: the trashcan doesn’t need to be a linked list, any kind of container (such as a plain malloc’ed array) would work.

Excuse me… I know very little about the topic, but I know that LLVM supports Automatic Reference Counting, and I know that Numba uses LLVM for creating its JIT. How difficult would it be to implement such a GC?

This may be relevant:

This is an old thread but contains a lot of interesting discussion. I had mentioned this “crazy” idea:

maybe the gc.freeze() idea could be extended to disable reference counting on all objects found when it is called.

Something like that idea has now been implemented by some Instagram people:

https://bugs.python.org/issue40255

They don’t use memory bitmaps to disable reference counts. Instead they use the high bits of the refcnt field. I’ve been thinking about how that change could be extended to be more useful to general Python programs. In Instagram’s case, the big win comes from their application’s use of copy-on-write OS memory pages and avoidance of dirtying them. Most Python programs are not written that way and will not benefit from PR 19474 as it is.
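
Roughly, the refcnt-high-bit trick has the shape of the sketch below; the bit position and all the names are illustrative only, not the ones used in PR 19474:

    /* Sketch only: an "immortal" flag carried in a high bit of ob_refcnt.
       Incref/decref become no-ops for immortal objects, so their memory
       pages stay clean (which is what matters for copy-on-write). */
    #include "Python.h"

    #define IMMORTAL_BIT ((Py_ssize_t)1 << 60)   /* assumes 64-bit Py_ssize_t */

    static inline int
    is_immortal(PyObject *op)
    {
        return (op->ob_refcnt & IMMORTAL_BIT) != 0;
    }

    static inline void
    maybe_incref(PyObject *op)
    {
        if (!is_immortal(op))
            op->ob_refcnt++;
    }

    static inline void
    maybe_decref(PyObject *op)
    {
        if (is_immortal(op))
            return;
        if (--op->ob_refcnt == 0)
            _Py_Dealloc(op);
    }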

My thought for today: could we change CPython to use a hybrid of mark-and-sweep (M&S) and automatic reference counting (ARC) for memory management? I found this paper today:

A Unified Theory of Garbage Collection:
https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon04Unified.pdf

From the abstract:

Using this framework, we show that all high-performance collectors (for example, deferred reference counting and generational collection) are in fact hybrids of tracing and reference counting.

The cyclic GC that Python uses is like a kind of M&S collector. We are close to the hybrid approach, but we are paying for both the ARC and the M&S. Can we extend PR 19474 to get better performance for the “average” Python program? My idea is to use ARC on the young generation of objects and M&S for the old. Maybe we can avoid more of the ARC overhead on old objects?

Leaving out a bunch of details, it would be like calling the PR 19474 immortalize_heap() at the end of a full cyclic GC pass. Or, alternatively, having the M&S collector set the bit on the refcnt field to disable ARC as it is working. I.e. everything the M&S collector sees is considered “old”.
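
In sketch form, the first variant is just a hook at the end of a full collection. Here full_collect() is a hypothetical stand-in for the existing oldest-generation collection, and immortalize_heap() is the routine from PR 19474 (its exact signature is assumed):

    /* Sketch only: after a full (oldest-generation) collection, everything
       that survived is treated as "old" and its reference counting is
       disabled via the PR 19474 immortal bit. */
    static Py_ssize_t full_collect(void);   /* hypothetical wrapper for the existing full GC */
    void immortalize_heap(void);            /* from PR 19474; signature assumed */

    static Py_ssize_t
    full_collect_and_immortalize(void)
    {
        Py_ssize_t freed = full_collect();
        immortalize_heap();   /* flag all surviving objects as immortal/"old" */
        return freed;
    }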

What is involved with making that work? If we use libgc, as per the original proof-of-concept by Thomas, not too much. As I recall, libgc already supports a hybrid approach in that you can manually call free(). We need to replace the current cyclic GC with an actual mark-and-sweep collector that can find all “roots”. The libgc library has magic (e.g. C stack crawling, heuristic pointer detection) to do that. Or, we could enhance the current cyclic GC to make it into a M&S collector. That would involve making sure every container object has an accurate tp_traverse method, dealing with the C stack somehow, marking all global PyObject pointers. Tedious but doable.
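
To give a feel for what “enhance the current cyclic GC into an M&S collector” would mean mechanically, the mark phase could reuse tp_traverse as the edge iterator, as in the sketch below. The mark-bit and mark-stack helpers are hypothetical, and root discovery (C stack, interpreter state, extension-module statics) is the hard part that is deliberately not shown:

    /* Sketch: a mark phase driven by tp_traverse. */
    #include "Python.h"

    /* Hypothetical helpers, declarations only. */
    static int is_marked(PyObject *op);
    static void set_marked(PyObject *op);
    static void mark_stack_push(PyObject *op);
    static PyObject *mark_stack_pop(void);
    static int mark_stack_empty(void);

    static int
    visit_mark(PyObject *op, void *arg)
    {
        (void)arg;
        if (op != NULL && !is_marked(op)) {
            set_marked(op);
            mark_stack_push(op);   /* defer recursion to an explicit stack */
        }
        return 0;
    }

    static void
    mark_from_roots(void)
    {
        /* roots (C stack, thread states, module globals, ...) must already
           have been pushed - that is the hard, omitted part */
        while (!mark_stack_empty()) {
            PyObject *op = mark_stack_pop();
            traverseproc traverse = Py_TYPE(op)->tp_traverse;
            if (traverse != NULL)
                traverse(op, visit_mark, NULL);
        }
    }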

It would be easy to support the HPy handle API with this approach. Use another bit in the refcnt field to mark objects that have an “open” handle to them. Then the M&S would always consider them alive as long as a handle is open. Alternatively, we could add them to a handle data structure.

A possible alternative to libgc would be to adapt the Apple “libauto” library:

https://opensource.apple.com/source/libauto/libauto-77.1/README.html
https://opensource.apple.com/source/libauto/libauto-77.1/

I didn’t study the code too closely but it looks quite clean and fairly portable. It has support for write barriers, which would be a performance enhancement for the M&S collection. BTW, it is interesting that Apple moved away from M&S collection and the newer versions of Objective-C use pure ARC. At least, that’s my understanding.

I had some trouble getting Thomas’s libgc change to build. It seems like the GitHub PR is pointing to the wrong branch. Here’s an updated branch, rebased on current cpython “master”.

The most recent commit seems to be broken (doesn’t compile). However, the next-to-newest commit does build for me and seems to at least partially work. I was going to try to combine Thomas’s libgc changes with Eddie’s PR 19474 (immortal instances). I didn’t get that far yet though.

Neil,

I managed to get libgc + immortalization working to leverage both ARC for short-lived objects and M&S for long-lived ones. Overall, this looks very promising. Here are a couple of notes:

Performance:
While there’s still a slight regression on perf, we might be able to keep pushing this further. Right now we are using a plug-and-play approach; we could better specialize bdwgc as well as use a different immortalization strategy that works better with CPython.

Correctness & Crashes
Is bdwgc able to handle the topological destruction that ARC guarantees? This might be causing some memory corruption and hence crashes in tests. I have not looked into this so maybe bdwgc already handles this gracefully.

Migration Plan
Ultimately, the goal should be to use a more effective GC as most of the comments here have mentioned. In theory, we could achieve this by starting with a simple non-moving M&S. We can slowly migrate to a performant moving GC by:

  • Creating a layer between the C-API and the internals through a handle, to allow for a moving GC (i.e., HPy).
  • Removing internal ref counting while maintaining the ref count contract through C-API usage.
  • Having both the moving M&S (internals) and ARC (external C-API) agree before they destroy an object.

PR Fixes

I pushed this onto an immortal-libgc branch in my repo:

This branch fixes the compilation issues in your branch to get libgc working, and applies the immortal reference patches. Then I added an immortalize_heap() call in _PyObject_GC_Alloc, right after collecting the generations. Finally, I did some slight optimizations to immortalize_heap() now that we are calling it more often.

Performance Numbers

Here are some perf numbers on a Linux machine set up for benchmarking (i.e., CPU isolation, etc.):

Baseline (CPython Master):
unpack_sequence: Mean ± std dev: 133 ns ± 2 ns
richards: Mean ± std dev: 205 ms ± 4 ms
fannkuch: Mean ± std dev: 1.31 sec ± 0.00 sec

Lib GC + Immortalization:
unpack_sequence: Mean ± std dev: 137 ns ± 1 ns
richards: Mean ± std dev: 217 ms ± 7 ms
fannkuch: Mean ± std dev: 1.50 sec ± 0.01 sec

Building Details:

Notes on bdwgc configuration:
A more in-depth analysis has to be done into what the proper configuration for bdwgc is. For instance, on my Linux machine I had to manually remove the #define USE_PROC_FOR_LIBRARIES in include/private/gcconfig.h as it was causing a performance degradation. Without removing it, perf was about 30% slower.

    # Setup
    mkdir /tmp/libgc_temp && cd /tmp/libgc_temp && mkdir bdwgc_files

    # Build bdwgc
    git clone https://github.com/ivmai/bdwgc
    cd bdwgc
    ./autogen.sh
    ./configure --enable-sigrt-signals --enable-large-config --enable-mmap --enable-static --disable-shared --prefix=/tmp/libgc_temp/bdwgc_files --enable-redirect-malloc
    make -j
    make install

    # Build CPython
    git clone https://github.com/eduardo-elizondo/cpython
    cd cpython && git checkout immortal-libgc
    CFLAGS="-I/tmp/libgc_temp/bdwgc_files/include" ./configure --with-libgc="/tmp/libgc_temp/bdwgc_files/lib/libgc.a /tmp/libgc_temp/bdwgc_files/lib/libcord.a" --enable-optimizations
    make profile-opt -j

I think in general no. It does have support for ordering finalizers. See:

https://hboehm.info/gc/finalization.html

Many years ago, when we were writing the cyclic GC, there was a lot of discussion of the ordering of finalizers. It is a tricky problem, and I think it’s one of the reasons why Java ultimately deprecated, or at least discouraged, the use of finalizers.