Switching from refcounting to libgc

The question is whether a large proportion of those “idle” objects would not be reachable from objects that are decref’ed.

Consider this:

  • any object has a reference to its class
  • any class has references to its methods
  • any pure Python method (or function) has a reference to its globals dict
  • a significant proportion of those globals dicts will have a “sys” entry pointing to the sys module
  • the sys module holds references to all live modules through sys.modules
  • modules hold references to their globals

As soon as pure Python objects are decref’ed (which is extremely likely), it seems that the object subgraph that needs traversing by the GC may be close in size to the entire object graph.
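
To make that chain concrete, here’s a small sketch that follows one such path with gc.get_referents (just one path among many; the only assumption is that this module happens to import sys and gc itself):

    import gc
    import sys

    def f():
        return None

    # function -> its globals dict
    assert any(o is f.__globals__ for o in gc.get_referents(f))
    # globals dict -> the sys module (imported above)
    assert any(o is sys for o in gc.get_referents(f.__globals__))
    # sys -> sys.__dict__ -> sys.modules
    assert any(o is sys.modules for o in gc.get_referents(vars(sys)))
    # sys.modules -> every live module (gc included)
    assert any(o is gc for o in gc.get_referents(sys.modules))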

Note that Facebook (I think it was them? @ambv) contributed something more radical:
https://docs.python.org/3/library/gc.html#gc.freeze
(discussion here: https://bugs.python.org/issue31558)
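
The interface is tiny; the typical pre-fork recipe from the docs looks roughly like this:

    import gc

    gc.disable()       # avoid collections while long-lived state is built up
    # ... import modules, build caches, etc. ...
    gc.collect()       # reclaim whatever is already garbage
    gc.freeze()        # move every surviving tracked object to the
                       # "permanent generation", ignored by future collections
    print(gc.get_freeze_count())
    # gc.unfreeze() would put those objects back into the oldest generation.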

Yes, it’s quite possible that the scheme wouldn’t pay off at all in CPython - or would require other changes to make it possible for motivated users to create conditions under which it would help (as, e.g., the gc.freeze() hammer was added to do).

Here’s a cute little module that only defines a trivial function, not even a class. Using gc tools to chase down references, importing this under Python 3.7.3 on Windows finds well over 6,000 objects reachable from f.

    from gc import get_referents as gr
    from collections import Counter

    def f():
        return None

    # breadth-first transitive closure over get_referents(), starting at f,
    # deduplicating by object identity
    alllist = [f]
    allset = set([id(f)])
    for o in alllist:        # appends during iteration extend the crawl
        for p in gr(o):
            pid = id(p)
            if pid not in allset:
                alllist.append(p)
                allset.add(pid)

    print(len(allset))
    # count the reachable objects by type
    t2n = Counter()
    for o in alllist:
        t2n[type(o)] += 1
    for k, v in sorted(t2n.items(), reverse=True, key=lambda t: t[1]):
        print(k, v)

These are the most frequent types (from the start of the final loop’s output):

<class 'str'> 3010
<class 'function'> 675
<class 'code'> 668
<class 'tuple'> 399
<class 'builtin_function_or_method'> 368
<class 'dict'> 295
<class 'int'> 278
<class 'type'> 220
<class 'getset_descriptor'> 80
<class 'classmethod'> 49
<class 'module'> 44

It’s not materially different if the function is thrown out and replaced with this instead:

    class F:
        def __init__(self, n):
            self.n = n
    f = F(3)

Of course I was wrestling with cycles in Python long before cyclic gc was added, so I’m not qualitatively surprised by that (but quantitatively surprised by how very much is reachable now after just starting an interpreter and importing that module first thing).

That’s why, although I was aware of Lins’s work at the time, it didn’t seem promising enough to ever make it to the top of my todo list. Luckily, Neil thought up a different way that works well.

In any case, CPython’s sprawling object graph creates speed challenges for any cyclic gc scheme, since there’s just no way “even in theory” to reliably detect trash cycles without considering every edge. It’s possible nobody else’s scheme will work well for us :wink:.

Are you aware that the current gcmodule code only visits objects in the generation being collected? I think this is a somewhat recent change. My memory is fuzzy, but I think in the original code the visit_* functions would crawl through whatever was found. There is now a gc_is_collecting() function that tests whether a PyGC_Head pointer is in the generation being collected.

The idea of only looking at objects with decremented reference counts would seem to require adding instructions to Py_DECREF. So, I would guess that’s not going to be a win for most programs.

Crazy idea: maybe the gc.freeze() idea could be extended to disable reference counting on all objects found when it is called. If we allocate all PyObjects out of arenas we control (aligned the right way), we can add bitmaps at the start of each arena to specify whether an object is frozen. If it is frozen, Py_INCREF/Py_DECREF and gc.collect() would ignore the object.
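
Very roughly, and purely as illustration (every layout number and name below is invented), the fast-path check could amount to masking the object’s address to find its arena and testing one bit:

    # Toy model (in Python, not C) of the bitmap-in-arena idea; addresses
    # are plain ints and the arena layout is hypothetical.
    ARENA_SIZE = 1 << 20        # arenas aligned to 1 MiB (assumption)
    OBJ_ALIGN = 16              # every object slot 16-byte aligned (assumption)

    def arena_base(addr):
        # alignment lets us find the arena header by masking the address
        return addr & ~(ARENA_SIZE - 1)

    def is_frozen(addr, frozen_bitmaps):
        # frozen_bitmaps maps arena base -> bytearray bitmap, one bit per slot
        base = arena_base(addr)
        slot = (addr - base) // OBJ_ALIGN
        bits = frozen_bitmaps[base]
        return (bits[slot // 8] >> (slot % 8)) & 1

    def decref(addr, refcounts, frozen_bitmaps):
        # a frozen object's refcount is never touched (and never reaches 0)
        if is_frozen(addr, frozen_bitmaps):
            return
        refcounts[addr] -= 1
        if refcounts[addr] == 0:
            print("deallocate", hex(addr))

    # tiny demo: freeze slot 0 of a fake arena whose base address is 0
    bitmaps = {0: bytearray(ARENA_SIZE // OBJ_ALIGN // 8)}
    bitmaps[0][0] |= 1
    counts = {0: 1, 16: 1}
    decref(0, counts, bitmaps)     # ignored: frozen
    decref(16, counts, bitmaps)    # deallocated: refcount hits 0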

I’m not quite sure what you have in mind, and the gc code has changed a lot, but in this respect it looks the same to me as it always has.

Which, I believe, used to be done by testing whether the gc header’s refcount copy “was negative”. For example, part of visit_decref now:

        if (gc_is_collecting(gc)) {
            gc_decref(gc);

and before:

        if (gc->gc.gc_refs > 0)
            gc->gc.gc_refs--;

Some flags (one flag? more than one? unsure) were added to the gc header, but there was no room for them, so bit-fiddling was added and hidden behind layers of macros and inline functions.

Before, between collections every gc’able object’s refcount copy was forced to one of two negative values (GC_UNTRACKED or GC_REACHABLE), and gc started by calling update_refs(), which set the refcount copy to the actual (> 0!) refcount of the objects in the generation being collected.
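
If it helps, here’s the shape of those phases in toy Python form (a model only - the real thing is C in Modules/gcmodule.c, and the names and the graph below are invented):

    # objects: name -> true refcount; edges: name -> names it references
    # (what tp_traverse would visit)
    def find_externally_reachable(refcounts, edges, generation):
        # update_refs(): copy each real refcount into the gc header copy
        gc_refs = {o: refcounts[o] for o in generation}
        # subtract_refs() / visit_decref(): subtract one for every reference
        # coming from another object in the generation being collected
        for o in generation:
            for target in edges.get(o, ()):
                if target in gc_refs:        # the gc_is_collecting() test
                    gc_refs[target] -= 1
        # anything whose copy is still > 0 is referenced from outside
        return {o for o, n in gc_refs.items() if n > 0}

    # a, b, c form a cycle; only "a" is also referenced from outside it
    refcounts = {"a": 2, "b": 1, "c": 1}
    edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(find_externally_reachable(refcounts, edges, {"a", "b", "c"}))  # {'a'}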

In any case, that doesn’t have anything to do with the gc.get_referents() I used to get the “over 6,000 objects reachable” result. That function is in the gc module, but doesn’t use any of its collection machinery - it just calls the argument’s tp_traverse implementation directly. For example, here’s a goofy way to reverse a list :wink::

    >>> import gc
    >>> gc.get_referents([1, 2, 12])
    [12, 2, 1]

Yes, if the new refcount wasn’t zero, it would have to “do something”. Perhaps as “simple” as setting a persistent flag in the GC header. Except there’s no room for a flag.

As before, the real gain would come if the union of transitive closures of objects that survive decrefs is small compared to all objects. The paper described some success in some Java programs at running refcounting + this kind of cycle detection as opposed to using Java’s mark-&-sweep collector. Which is quite an achievement! Sun paid Guy Steele and a small office of PhDs to work on Java’s gc for a couple years - it’s good.

But Python objects can reach an awful lot of stuff, and Antoine spurred me to write that teensy test case which revealed just how extreme that’s become. I’ll note again that those thousands of objects were reachable from a module function (or class instance) in a module that only imported two functions, but not the modules (gc and collections) the functions came from - in a program that did nothing at all except import the module.

It’s probably not going to get better than that :wink:.

I’m unclear on why someone might want to. If someone wants all gc’able objects in existence to be ignored forever after by gc, they can call gc.freeze() now. Why would they want refcounting disabled too? That seems downright dangerous; e.g., there are a few places in CPython that trigger “optimizations” based on an object’s refcount, and they’re not safe if the refcount is lying.

If this is about a fork micro-optimization: since I’m on Windows, my answer to “what is fork good for?” is “who cares?” :stuck_out_tongue_winking_eye:

My thought was that incref/decref is quite expensive in terms of the CPU cache. Even on an OS without fork(), isn’t it a good idea to avoid spraying writes all over the memory space?

Perhaps you want to experiment with buffered reference counting or other sophisticated schemes. There are pointers in this intriguing presentation: http://www.cs.tau.ac.il/~maon/teaching/2014-2015/seminar/seminar1415a-lec12-conc-rc.pdf
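
Very roughly, the idea is that mutator threads log refcount changes instead of applying them, and a single collector thread applies the logs. A toy Python rendering (a sketch under simplifying assumptions - one shared log instead of per-thread buffers, and no epoch logic to decide when a zero count is final):

    import threading
    from collections import Counter
    from queue import Empty, Queue

    log = Queue()               # real schemes use one buffer per thread

    def incref(obj_id):
        log.put((obj_id, +1))   # record the change instead of applying it

    def decref(obj_id):
        log.put((obj_id, -1))

    def collector(refcounts, stop):
        # a single thread owns the real counts, so no contended cache lines
        while not (stop.is_set() and log.empty()):
            try:
                obj_id, delta = log.get(timeout=0.01)
            except Empty:
                continue
            refcounts[obj_id] += delta
            if refcounts[obj_id] == 0:
                print("would deallocate", obj_id)

    refcounts = Counter(spam=1)
    stop = threading.Event()
    t = threading.Thread(target=collector, args=(refcounts, stop))
    t.start()
    for _ in range(1000):       # mutator threads would do this concurrently
        incref("spam")
        decref("spam")
    decref("spam")              # drop the last reference
    stop.set()
    t.join()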

If you do, you should talk to @larry as well, as he’s done a number of experiments as part of the Gilectomy.

If all else were equal, sure. Maybe this would be great for, e.g., a giant data structure that’s conceptually read-only after it’s built? I see plenty of those.

But there are costs too. Like enduring 3 test-branch snippets per incref/decref pair instead of the current 1. The “is it frozen?” branch outcomes depend on the data, not on the branch code locations, so they may not be predictable. The extra code also pressures the instruction cache. If you’re thinking about an arena-style allocator with flags physically near the start, there’s also code to locate the “frozen?” flag and isolate it for testing - and cache lines get filled with flag data as well as object data.

And, perhaps, “etc”. So while it may be an overall win for data where it applies, it seems a pure loss on every count for data that isn’t frozen (more code, 3x the test-branches, more cache lines, and it ends up mutating the refcount part anyway).

I wouldn’t discount it offhand just based on that, but I hope there are bigger fish to fry :smile:.

I implemented “buffered reference counting” in my Gilectomy branch.

The good news: it made reference counting scale nicely across cores! I had some strategies in mind that suggest it could have scaled up to an arbitrary number of cores.

The bad news: it adds a lot of overhead. And given how often one makes reference count changes in CPython, this slowed things down by a shocking amount, like 5-10x or something.

True story: I derived the technique on my own, then discovered a nice writeup of it in “The Garbage Collection Handbook” in the (slim!) chapter on multithreaded-friendly reference counting. My final implementation is a bit sleeker than theirs but the principles are the same.

The only other technique mentioned in that chapter is something called “The Recycler”, which I can’t claim to understand. It’s a lot more complicated, and I dimly recall it would require compilation-time support - like, you modify your Java compiler to emit extra code for object lifetime management stuff. I remember thinking it wouldn’t be appropriate for hand-written code like CPython’s innards. But I may be wrong / misremembering.

“The Recycler” was implicitly (not by that name) mentioned earlier here when discussing a Bacon & Rajan paper cited in a StackOverflow question Antoine pointed at.

It’s a refcount-based cycle collector intended to run concurrently with “the real” program. “Concurrent” makes it complicated, and, like any such thing, it needs extra support to efficiently detect concurrent mutations in the object graph that can invalidate tentative conclusions-so-far about what may and may not be cyclic trash.

The “concurrent” part is important, because without running in parallel it’s slower (in elapsed wall-clock time) than a simple stop-the-world mark-and-sweep collector. Its basics have much in common with Python’s current stop-the-world cyclic gc, but it potentially runs “all the time” (technically, instead of starting from all the objects in “a generation” as Python does, it starts from all the objects that have survived at least one decref, but kicks root objects out of that set if they happen to be incref’ed again before it gets to them - which is the easy part :wink: ).

After an object has been traversed, the tentative conclusions-so-far based on objects reachable from it have to be thrown out if one of the pointers in the object changes. That’s the hard part, and compiler knowledge of which pointer stores are relevant to cyclic gc can automate an otherwise error-prone dance.
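
For the curious, here’s a toy, synchronous Python sketch of that trial-deletion style of cycle detection (the colors and phases loosely follow the published description; the concurrency, which is the hard part, is left out entirely):

    # Toy Recycler-style trial deletion: candidate roots are objects that
    # survived at least one decref; trial-delete internal references, and
    # whatever still has a positive count is externally reachable.
    BLACK, GRAY, WHITE, PURPLE = "black", "gray", "white", "purple"

    class Obj:
        def __init__(self, name):
            self.name = name
            self.rc = 0
            self.color = BLACK
            self.children = []

    roots = set()        # candidates: survived at least one decref

    def incref(o):
        o.rc += 1
        o.color = BLACK
        roots.discard(o)          # an incref kicks a candidate back out

    def decref(o):
        o.rc -= 1
        if o.rc == 0:
            free(o)
        else:
            o.color = PURPLE      # might be an isolated cycle: remember it
            roots.add(o)

    def free(o):
        roots.discard(o)
        for c in o.children:
            decref(c)
        print("freed", o.name)

    def collect_cycles():
        for o in list(roots):
            mark_gray(o)
        for o in list(roots):
            scan(o)
        for o in list(roots):
            roots.discard(o)
            collect_white(o)

    def mark_gray(o):
        if o.color != GRAY:
            o.color = GRAY
            for c in o.children:
                c.rc -= 1         # trial deletion of internal references
                mark_gray(c)

    def scan(o):
        if o.color == GRAY:
            if o.rc > 0:
                scan_black(o)     # externally reachable: undo trial deletion
            else:
                o.color = WHITE
                for c in o.children:
                    scan(c)

    def scan_black(o):
        o.color = BLACK
        for c in o.children:
            c.rc += 1
            if c.color != BLACK:
                scan_black(c)

    def collect_white(o):
        if o.color == WHITE:
            o.color = BLACK
            for c in o.children:
                collect_white(c)
            print("cycle member freed:", o.name)

    # a <-> b form a cycle that the outside world then drops
    a, b = Obj("a"), Obj("b")
    a.children.append(b); incref(b)
    b.children.append(a); incref(a)
    incref(a)                      # one external reference to a
    decref(a)                      # ...which goes away: a becomes a candidate
    collect_cycles()               # finds and "frees" the a<->b cycle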

Short course: there’s nothing here obviously more promising than Thomas’s original plan of trying the Boehm mark-sweep collector again. Indeed, there’s nothing here even plausibly more promising than that :wink:.

I’m curious and interested in this discussion. I’m also working on something experimental in line with improving the current PyPy GC, together with IBM/UNB. We are still in the brewery. The baseline is that we want to see if we can make use of one of OMR’s GC policies (https://github.com/eclipse/omr) in PyPy. We may, as a result, make some modifications to attain our goals of bettering garbage collection in PyPy, but the basic question is: can we reuse the OMR GC implementations?