Removing PyGC_Head for small objects

Here is an idea I have been toying with since the Python core sprint at Facebook. I don’t know yet if there would be any real advantages but I think it is an interesting idea.

The current GC (implemented in gcmodule.c) has a number of issues:

  • the PyGC_Head overhead is 16 bytes and applies to every object that participates in GC. That memory overhead is not terrible but it would be nice to reduce it.

  • the process of doing a collection modifies data in the PyGC_Head for objects in the generations being collected. That can have bad effects for the OS because of dirty VM pages. The PyGC_Head data is spread out over memory and so a collection can dirty many OS pages.

  • due to modern CPU/memory/cache behavior, traversing linked lists is quite a slow operation. So, while the current linked-list design appears to be simple and fast, on modern machines a different design (e.g. one that uses vector-like data structures) could be comparable in speed or even faster.

My first step has been to modify the current gcmodule.c to no longer use the doubly-linked lists for keeping sets of objects during collection. Instead, the linked list is only used to traverse all the objects that have a GC head. I’ve added some new PyGC_Head fields (a rough sketch of the modified header follows the list):

  • gc_color: WHITE (trash), GREY (alive, need traverse), BLACK (alive)

  • gc_flags: GC_FLAG_FINIALIZER_REACHABLE, GC_FLAG_LEGACY_FINIALIZER_REACHABLE, GC_FLAG_FINALIZED

  • gc_gen: generation number for object, 0 to 4
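
A minimal sketch of how these fields might sit in the header (the layout and field widths here are my illustration, not the actual code in the branch):

#include <stdint.h>

enum gc_color { GC_WHITE, GC_GREY, GC_BLACK };  // trash / alive, needs traverse / alive

// illustrative only; the experimental branch keeps the existing list links too
typedef struct _gc_head_sketch {
    struct _gc_head_sketch *gc_next;  // linked list now only enumerates all GC objects
    struct _gc_head_sketch *gc_prev;
    uint8_t gc_color;                 // WHITE, GREY or BLACK
    uint8_t gc_flags;                 // finalizer-reachable / legacy-finalizer-reachable / finalized
    uint8_t gc_gen;                   // generation number, 0 to 4
} gc_head_sketch_t;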

The code for this experiment is on github. Nearly all unit tests pass. Performance is obviously terrible since the GC process makes many passes over the entire set of GC tracked objects. Making this work without the performance drop is the work that remains.

The next step I would like to try is to remove PyGC_Head for small objects. E.g. if an object is less than 512 bytes in size, it is allocated from a special GC pool (similar but separate from the obmalloc pools). Information that was previously in PyGC_Head for these objects gets moved elsewhere. E.g. packed at the start of the pool or in an array/tree indexed by pool index.

One potentially killer problem with this idea is finding a fast way to distinguish small and large objects (i.e. whether they have a PyGC_Head or not). I fear the cost of that check for pointers found by tp_traverse and for _PyObject_GC_Del(). My current idea is that we could use a version of obmalloc’s address_in_range(op). That is quite fast.
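
As a rough sketch, the check could look something like this if GC’d small objects come from pools aligned to a power-of-two size (the pool size, is_gc_pool() and the overall layout are assumptions for illustration; obmalloc’s real address_in_range() works differently but has a similar cost):

#include <stdint.h>
// assumes CPython's PyObject (i.e. #include "Python.h")

#define GC_POOL_SIZE 4096   // assumed pool size; must be a power of two
#define GC_POOL_ADDR(p) ((void *)((uintptr_t)(p) & ~(uintptr_t)(GC_POOL_SIZE - 1)))

// hypothetical helper: true if 'pool' is one of the special GC pools
int is_gc_pool(void *pool);

// small objects allocated from GC pools have no PyGC_Head; everything
// else keeps the conventional header in front of the object
static int
has_gc_head(PyObject *op)
{
    return !is_gc_pool(GC_POOL_ADDR(op));
}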

An alternative idea to address_in_range() is to use a radix tree. See the pagemap data structure in tcmalloc for a similar idea. Attached are some sketches I made of a possible GC allocator and related data structures.


The allocation function is something like:

  • lookup free object block, if found, unlink it and return
  • lookup empty/partially empty pool from new_pool list, if exists, return next block from pool
  • create new pool, return first block in pool

The free function would be (a rough sketch of both paths follows):

  • link object block to the list of free blocks for size index. We could link to the start or end of the free list, depending on what works better.
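
A hedged sketch of those two paths (the pool layout and every name here are invented for illustration; new_gc_pool() is a hypothetical helper that returns a freshly initialized, aligned pool):

#include <stddef.h>
#include <stdint.h>

#define GC_POOL_SIZE 4096                  // assumed pool size
#define NUM_SIZE_CLASSES 32                // 16, 32, ..., 512 byte blocks
#define BLOCK_SIZE(sc) (((sc) + 1) * 16)   // block size for size class index 'sc'

typedef struct gc_block { struct gc_block *next; } gc_block_t;

typedef struct gc_pool {
    struct gc_pool *next;       // link in the usable-pools list for its size class
    int size_class;             // all blocks in one pool share a size class
    unsigned int next_offset;   // offset of the next never-used block
} gc_pool_t;

static gc_block_t *free_blocks[NUM_SIZE_CLASSES];   // freed blocks, per size class
static gc_pool_t *usable_pools[NUM_SIZE_CLASSES];   // pools that still have unused blocks

// hypothetical helper: allocate an aligned pool with next_offset set past the header
gc_pool_t *new_gc_pool(int size_class);

static void *
gc_alloc(int size_class)
{
    // 1. reuse a previously freed block if one is available
    gc_block_t *block = free_blocks[size_class];
    if (block != NULL) {
        free_blocks[size_class] = block->next;   // unlink it and return
        return block;
    }
    // 2. otherwise take the next block from a pool that still has room
    gc_pool_t *pool = usable_pools[size_class];
    if (pool == NULL) {
        // 3. no usable pool: create a new one
        pool = new_gc_pool(size_class);
        if (pool == NULL) {
            return NULL;
        }
        pool->next = NULL;
        usable_pools[size_class] = pool;
    }
    void *p = (char *)pool + pool->next_offset;
    pool->next_offset += BLOCK_SIZE(size_class);
    if (pool->next_offset + BLOCK_SIZE(size_class) > GC_POOL_SIZE) {
        usable_pools[size_class] = pool->next;   // pool is full, drop it from the list
    }
    return p;
}

static void
gc_free(void *p, int size_class)
{
    // link the block onto the front of the free list for its size class
    gc_block_t *block = p;
    block->next = free_blocks[size_class];
    free_blocks[size_class] = block;
}

Linking freed blocks at the head keeps both operations O(1); linking at the tail, as mentioned above, would only change which blocks get reused first.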

Random thoughts:

  • This scheme would be optimized for 64-bit desktop/server machines with decent amounts of RAM. We could have a build configure option that builds CPython using the old doubly-linked lists.

  • For simplicity, my allocation design is single-threaded and doesn’t return memory to the OS. Those limitations should be fixable given some “small matter of programming”. E.g. just copy what tcmalloc does.

  • In the sketch above, I have the “generation” as a pool-level property. So, all objects in the same pool are in the same generation. It would be useful to tune the freelist behavior to optimize the behavior of generations (e.g. try to minimize the number of pools in the youngest generation).

  • In sketch, I have gc_collect_info as allocated only during collection. It could be allocated in a 2-phase process per L3 node. First, find the number and size classes of the pools in the L3 leaf. Allocate a contiguous block of memory to hold gc_collect_info for all pools in the L3 leaf. Set ‘c’ pointer of each gc_info to point into this contiguous block. That way, we don’t pay the memory cost of ‘refs’ for generations not being collected. Also, the page dirtying behavior should be more friendly to the OS (locality of writes).

  • My radix tree design requires that pools are aligned to pool size boundaries. This could be done by allocating a large “arena” and then allocating pools from within it (wasting a bit of space at the start to get alignment). Or we could use a system call that allocates aligned pages, e.g. posix_memalign (see the sketch after this list).

  • My changes to gcmodule.c require that allocations happen during the GC collect process. Not sure if that is a major problem.
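
For the posix_memalign() variant mentioned above, a minimal sketch (assuming 4096-byte pools):

#include <stdlib.h>

#define GC_POOL_SIZE 4096   // assumed pool size; must be a power of two

// allocate one pool aligned to its own size so GC state can later be
// found by masking the low bits of any object pointer inside the pool
static void *
alloc_aligned_pool(void)
{
    void *pool = NULL;
    if (posix_memalign(&pool, GC_POOL_SIZE, GC_POOL_SIZE) != 0) {
        return NULL;
    }
    return pool;
}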


I agree that the doubly-linked list is a big performance issue for GC pauses, especially with large working sets. I’m less worried about the memory overhead because 1) there will be an overhead anyway, even if smaller (you will encode the information somewhere else, just not in the object header) 2) many commonly-allocated objects are not GC-enabled (for example strings, ints or floats… though tuples are)

One radical approach is to kill PyGC_Head for all objects. This means Python must take control of all object allocations. We already have a more than decent small object allocator, so we would need a medium-to-huge object allocator.

In general, my impression is that any high-performance GC needs a custom allocator (to be able to enumerate objects efficiently).


What do you see as the advantages of that? Is it to avoid dirtying pages during GC (i.e. improving locality of reference)?

Do you have any idea of how to do that? I was thinking about it this morning. The key reason you can do it efficiently for small objects seems to depend on pools and objects being aligned at specific memory boundaries. Then, you can use cheap pointer masking and shifting to find the additional GC state for the object. With large objects, I don’t see how you could align the pointers.

One idea: have an extra layer in the radix tree for large object pointers. E.g. at the 3rd layer, have a flag saying if it is a large or small object. If small, the data for the object is in that level. If large, you need to traverse another level of the tree (using lower order bits of the object pointer) to find the data object. In my rough design for the tree, you would already have a field specifying how many objects are contained in the leaf and the offset to the first object’s data. For large objects, maybe there would be only one object in the leaf.

Yes, improving locality of reference and therefore cache locality. Traversing the object graph depth-first must not have very good cache characteristics.

No, but larger objects are much less often allocated, so I think a bit more overhead is bearable. I would also design two more allocators: a medium-sized objects allocator that handles objects smaller than e.g. 16 * PAGESIZE (that’s 64kiB on x86), and a large objects allocator that handles objects over that.

The large objects allocator can for the most part delegate allocation to mmap(). The medium-sized objects allocator would require a bit more thought, perhaps using RB trees to find the best-sized match amongst free areas.
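
For the large-object side, a minimal sketch of delegating to mmap() (POSIX-only; rounding to page size and any bookkeeping are omitted):

#include <stddef.h>
#include <sys/mman.h>

// allocate a large object directly from the OS
static void *
large_alloc(size_t nbytes)
{
    void *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

// return the memory to the OS; the caller must remember the size
static void
large_free(void *p, size_t nbytes)
{
    munmap(p, nbytes);
}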

By the way, probably @larry has been thinking about all this.

As I feared, my simple allocator design is much too simple to work well. E.g. pools get locked into holding one size class and never change. Probably the easiest thing to do is to clone obmalloc and adapt it to also store the additional state needed by the GC.

I’m also not sure about how state should be split between getting stored in the radix tree vs getting stored at the front of pools. To improve locality of reference, maybe we should have some data in each place. E.g. the “tracked” bit should probably be close to the PyObject; it gets changed around the time the object is allocated or deallocated. The “finalized” bit should probably be there too. All the state that a GC pass mutates should probably be in the radix tree.

I’ve watched this video on data-oriented design recently and I think CPython internals do a poor job in terms of data layout (optimizing for modern computer cache/memory latency). We have object structures everywhere with pointers to pointers to pointers. I’ve heard this described as a “sparse” data representation. I recall some Facebook engineer told me that the HipHop JIT would have been useless if they didn’t first fix the data representation inside the VM. Once that was fixed, the JIT gave only a small performance increment over interpreted code. Some people seem to think a JIT would be a “magic bullet” to make CPython faster.

There is an option to remove PyGC_Head: make a class without cyclic garbage collection support. There are many cases where reference counting is enough. If instances are not supposed to participate in reference cycles, then cyclic garbage collection is pure overhead.

You can do that in C already. But requiring users to think and be careful about what “no cyclic garbage collection support” implies is not the right way to think about the issue IMO.

Furthermore, most objects are actually built-in Python objects (tuples, lists, dicts, etc.).

I agree that the choice not to support cyclic garbage collection in a particular case should be made very carefully. But consider a simple data class:

class Point:
    __slots__ = 'x', 'y'
    x: int
    y: int

    def __init__(self, x, y):
        self.x = x
        self.y = y
In this particular case this class can be made without Py_TPFLAGS_HAVE_GC.

So what? Unless you have a million of those objects, it’s probably not a big deal.

Exactly. The goal is to create small classes with hundreds of thousands or millions of instances while limiting the amount of memory used.

It should be a GC object:

x = []
p = Point(42, x)
x.append(p)  # circular reference

In general, yes, it should. But a developer can create instances of Point in their own project and take responsibility for only creating them with integer arguments. Another option for controlling such issues is type checkers.

Some small progress on this idea. I should clarify that reducing the size overhead due to PyGC_Head is pretty low on my priority list. The current 16 bytes per object (on 64-bit platforms) is not so bad. My main desire is to reduce the overall execution time of Python programs, especially ones that keep many GC objects in memory. The hope is that by using data structures that are more friendly to modern CPUs (e.g. vectors rather than linked lists, avoiding structures of pointers), execution time within gc.collect() will be reduced.

I created a version of CPython that uses a separate obmalloc allocator for GC objects. To try to get a somewhat “real world” test, I ran a script that loads hundreds of thousands of objects from an object database (Durus). My script then uses gc.get_objects(), builds a radix tree for them and calculates some statistics. From that, I can estimate some things like GC memory overhead and the shape of the radix tree.

RSS of the process was 373 MB. gc.get_objects() returned 468,787 objects. Only 1712 were large objects (> 512 bytes, not from within obmalloc pools).

My radix tree uses three levels, width of tree at each level:

L1 2^18
L2 2^18
L3 2^16

Some rough code for the radix tree (not working yet):

#include <stdint.h>      // uint8_t, uintptr_t
#include <stddef.h>      // size_t
#include <sys/types.h>   // ssize_t

#define ALIGNMENT 16 // must be 2^N
#define ALIGNMENT_SHIFT 4

#define LEAF_BITS 16
#define LEAF_LENGTH (1<<LEAF_BITS)

#define INTERIOR_BITS 18
#define INTERIOR_LENGTH (1<<INTERIOR_BITS)

// structure of radix key (from object pointer)
// radix key:
//   18 -> L1
//   18 -> L2
//   16 -> L3 -> pool
//    8 -> cell/block within pool
//    4 -> zero due to 16-byte alignment
// ----
//   64 bits

// N below is shorthand for the number of blocks in the leaf (see
// leaf_get_size()); in real code these arrays would be sized per leaf
typedef struct {
    uint8_t flags[N]; // color and gc flags, one byte per block
    ssize_t refs[N]; // gc reference count, only needed during collection
} leaf_t;

// for large object leaf nodes, 'objects' is a linear lookup table
// to find the offset into 'flags' and 'refs' from the object pointer.  Each
// large leaf covers 4096 bytes of memory; pointers are aligned to 16 bytes,
// so there are 256 possible pointer locations within the page.  There are
// usually only one or two objects in a large object page
typedef struct {
    uint8_t flags[N]; // color and gc flags, N = leaf_get_size()
    ssize_t refs[N]; // gc refs
    uint8_t objects[N]; // lookup table, value is (((op)>>4) & 255)
} large_leaf_t;

typedef struct _node3 {
    leaf_t *ptrs[LEAF_LENGTH];
    int8_t size_class[LEAF_LENGTH]; // positive: small-object size class; negative: large-object leaf
} node3_t;

typedef struct _node2 {
    struct _node3 *ptrs[INTERIOR_LENGTH];
} node2_t;

typedef struct _node1 {
    struct _node2 *ptrs[INTERIOR_LENGTH];
} node1_t;

#define POOL_SIZE 4096

#define INDEX2SIZE(I) (((unsigned int)(I) + 1) << ALIGNMENT_SHIFT)

// return number of object blocks for a leaf; 'size_class' is the entry
// from the parent node3_t for this leaf (negative for large-object leaves)
static unsigned int
leaf_get_size(int8_t size_class)
{
    if (size_class > 0) {
        // obmalloc pool, small objects: pool size divided by block size
        // (ignoring any pool header overhead)
        return POOL_SIZE / INDEX2SIZE(size_class);
    }
    else {
        // large object leaf, no more than 4096/512 objects
        return (unsigned int)(-size_class);
    }
}
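
For illustration, a hedged sketch of how a lookup against these structures might go, following the key split described in the comment above (radix_root is a hypothetical global; error handling and the large-object table scan are simplified):

// given an object pointer, find the leaf covering it and the block index
// within that leaf; radix_root stands in for the (hypothetical) tree root
static node1_t radix_root;

static leaf_t *
radix_lookup(void *op, unsigned int *block_index)
{
    uintptr_t key = (uintptr_t)op;
    size_t i1 = (key >> 46) & (INTERIOR_LENGTH - 1);   // top 18 bits -> L1
    size_t i2 = (key >> 28) & (INTERIOR_LENGTH - 1);   // next 18 bits -> L2
    size_t i3 = (key >> 12) & (LEAF_LENGTH - 1);       // next 16 bits -> L3 / pool

    node2_t *n2 = radix_root.ptrs[i1];
    if (n2 == NULL) {
        return NULL;
    }
    node3_t *n3 = n2->ptrs[i2];
    if (n3 == NULL) {
        return NULL;
    }
    leaf_t *leaf = n3->ptrs[i3];
    if (leaf == NULL) {
        return NULL;
    }
    int8_t size_class = n3->size_class[i3];
    if (size_class > 0) {
        // small-object pool: index is the offset within the pool / block size
        *block_index = (unsigned int)((key & (POOL_SIZE - 1)) / INDEX2SIZE(size_class));
    }
    else {
        // large-object leaf: a real lookup would scan large_leaf_t.objects
        // for the value ((key >> 4) & 255); simplified here
        *block_index = 0;
    }
    return leaf;
}

The hot path is three dependent pointer loads plus a little mask arithmetic, which is what should make the radix tree competitive with address_in_range().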

L1 and L2 nodes are just arrays of pointers so they use 2048 kB per node. L3 has pointers but also an array of bytes to hold the size class of the leaves. So, an L3 node uses 576 kB.

From my test program, here are the number of child nodes at each level:

L1 1
L2 3
L3 6886 5682 795

The L3 node with 795 children contains the large objects (>512 in size). The other two nodes cover obmalloc arenas.

Statistics on the number of objects per leaf for large objects

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.000   1.000   2.000   2.155   3.000   5.000

Statistics on the number of objects per leaf for small objects (this corresponds to the number of objects in each small-object pool)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   20.00   50.00   37.17   50.00   84.00

My idea is to put both the small and large objects into the same radix tree. At L3, the size class array will specify if the leaf node is large or small. Negative size class means the page contains large objects.

Space estimate for my example program (373 MB resident set size, 469k GC objects):

L1 2048 kB * 1 = 2048 kB
L2 2048 kB * 1 = 2048 kB
L3 576 kB * 3 = 1728 kB
small leaves (N = 31.2)
    (8 + (1+8)*N) * (6886 + 5682) = 3544 kB
large leaves (N = 2.2)
    (8+1+8)*N * (795) = 29 kB
total
    2048 + 2048 + 1728 + 3544 + 29 = 9397 kB
    9397 * 1024 / 468787 = 20.5 bytes/obj
    without refs 13.8 bytes/obj

Based on this, I’m using more memory per GC object (20.5 bytes vs current 16 bytes). I’m okay with that if it can be made fast. A lot of the overhead is due to the ‘refs’ arrays, about 1/3 of total. Those would be allocated only during GC collection pass.

We could allocate only the refs needed for the collected generation and also allocate them contiguously. E.g. at the start of a collection, traverse the radix tree and find all leaves in the collected generation. Allocate one big array holding the refs needed for all of them and set the ‘refs’ pointer for each leaf into this big array. When the collection is done, we deallocate that refs array. Some of the ‘flags’ need to be saved between collections. It would be possible to save only the few bits needed for that, but for my prototype I’m thinking of just making flags a whole byte.
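
A rough sketch of that two-pass approach, assuming a leaf variant where ‘refs’ is a pointer rather than an inline array and some hypothetical iteration helpers (PyMem_RawCalloc is CPython’s real raw allocator):

// hypothetical helpers over the radix tree (a real version would live in gcmodule.c)
leaf_t *leaf_iter_first(void);
leaf_t *leaf_iter_next(void);
int leaf_in_generation(leaf_t *leaf, int generation);
size_t leaf_block_count(leaf_t *leaf);
void leaf_set_refs(leaf_t *leaf, ssize_t *refs);   // point the leaf's refs into a shared array

static ssize_t *collection_refs;   // one contiguous refs array for the whole collection

static int
gc_refs_alloc(int generation)
{
    // pass 1: count the blocks in every leaf belonging to the collected generation
    size_t total = 0;
    for (leaf_t *leaf = leaf_iter_first(); leaf != NULL; leaf = leaf_iter_next()) {
        if (leaf_in_generation(leaf, generation)) {
            total += leaf_block_count(leaf);
        }
    }
    collection_refs = PyMem_RawCalloc(total, sizeof(ssize_t));
    if (collection_refs == NULL) {
        return -1;   // the collection would have to be aborted or retried
    }
    // pass 2: point each leaf's refs into the contiguous block
    size_t offset = 0;
    for (leaf_t *leaf = leaf_iter_first(); leaf != NULL; leaf = leaf_iter_next()) {
        if (leaf_in_generation(leaf, generation)) {
            leaf_set_refs(leaf, collection_refs + offset);
            offset += leaf_block_count(leaf);
        }
    }
    return 0;
}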


I haven’t completely given up on this idea and I have a few updates. First, I was worried about how to deal with the “trash can” mechanism but now I think that’s a non-issue. We can use a separate container object to hold the trash. That avoids blowing up the C stack if you dealloc deeply nested stuff. It does introduce the possibility of an allocation failure in the decref code path (if growing the trash container fails) but that doesn’t seem like a fatal flaw.

I was struggling to decide how to deal with different object sizes (e.g. the size class in the obmalloc code) and with “large” objects (i.e. stored outside the arenas, not by obmalloc). I now think those are solvable problems as well. One idea I had recently is that the high bits of the refcnt field can be (ab)used for various things. On a 64-bit machine, you would never need all those bits for the count. So, we could steal some to hold info: e.g. a flag for whether the object is a container, the size class, and a flag for whether it is “large”. Having that info directly in the object structure would make the GC traverse process a lot faster.
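
A hedged sketch of what stealing those bits might look like (the bit positions and macro names are invented for illustration; nothing like this exists in CPython):

#include <stdint.h>
// assumes CPython's PyObject and Py_ssize_t (i.e. #include "Python.h")

// invented bit layout for a 64-bit ob_refcnt: the top bits are never
// needed for a real reference count, so stash GC info there
#define GC_BIT_IS_CONTAINER   ((uint64_t)1 << 63)
#define GC_BIT_IS_LARGE       ((uint64_t)1 << 62)
#define GC_SIZE_CLASS_SHIFT   56
#define GC_SIZE_CLASS_MASK    ((uint64_t)0x3F << GC_SIZE_CLASS_SHIFT)
#define GC_REFCNT_MASK        (~(GC_BIT_IS_CONTAINER | GC_BIT_IS_LARGE | GC_SIZE_CLASS_MASK))

static inline Py_ssize_t
real_refcnt(PyObject *op)
{
    return (Py_ssize_t)((uint64_t)op->ob_refcnt & GC_REFCNT_MASK);
}

static inline int
gc_is_container(PyObject *op)
{
    return ((uint64_t)op->ob_refcnt & GC_BIT_IS_CONTAINER) != 0;
}

static inline unsigned int
gc_size_class(PyObject *op)
{
    return (unsigned int)(((uint64_t)op->ob_refcnt & GC_SIZE_CLASS_MASK) >> GC_SIZE_CLASS_SHIFT);
}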

I have the radix tree version of obmalloc working:

The performance is good, basically the same as the current obmalloc implementation. The radix tree would have to be extended to hold the GC head info (maybe get another level of nodes). I’m encouraged that it works as well as it does in the above PR.

The main reason I’m still interested in this idea is that it could be useful for implementing a mark-and-sweep collector to replace the current cyclic GC. See the switching from refcounting to libgc discussion. It could also help with fixing the copy-on-write problems caused by reference counting.


Another update on this idea. I’ve made some more progress on my “gc colors” version of the Python cyclic garbage collector. It is based on v3.11.4 and the branch is here:

The code is still a bit messy but nearly all unit tests pass now. This is a proof of concept. A list of issues and future work follows.

I’m still using the gc_next/gc_prev links in the PyGC_Head in order to find all GC objects. However, that could be replaced with some kind of obmalloc arena traversal (e.g. a radix tree or mimalloc heap walk, combined with a way to handle “large” objects).

I’m also allocating some extra memory for the “collection state” data structure. If we don’t want the GC to fail due to failed memory allocations, that needs to be handled differently (e.g. pre-allocate needed space).

Traversal of the object graph uses the C stack (e.g. visit_reachable() is recursive). That could perform badly if there are very deeply nested object references. We could use an explicit stack instead. I don’t think this is a showstopper, but it is a challenge.
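
A small sketch of the explicit-stack alternative (obj_stack, the push/pop helpers and the color accessors are hypothetical; a real version would replace the recursion inside gcmodule.c):

// hypothetical helpers: a growable stack of object pointers and accessors
// for the per-object color described earlier
typedef struct obj_stack obj_stack;
int stack_push(obj_stack *stack, PyObject *op);    // 0 on success, -1 on failure
PyObject *stack_pop(obj_stack *stack);             // NULL when the stack is empty
enum { GC_WHITE, GC_GREY, GC_BLACK };
int gc_get_color(PyObject *op);
void gc_set_color(PyObject *op, int color);

// visitproc callback: queue each unvisited referent instead of recursing into it
static int
visit_push(PyObject *op, void *arg)
{
    if (PyObject_GC_IsTracked(op) && gc_get_color(op) == GC_WHITE) {
        gc_set_color(op, GC_GREY);
        return stack_push((obj_stack *)arg, op);
    }
    return 0;
}

static int
traverse_iterative(PyObject *root, obj_stack *stack)
{
    gc_set_color(root, GC_GREY);
    if (stack_push(stack, root) < 0) {
        return -1;
    }
    PyObject *op;
    while ((op = stack_pop(stack)) != NULL) {
        gc_set_color(op, GC_BLACK);
        if (Py_TYPE(op)->tp_traverse(op, visit_push, stack) != 0) {
            return -1;
        }
    }
    return 0;
}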

It adds two words to the PyGC_Head structure. That’s the opposite direction from the original idea here. 🙂 However, the color GC only needs a couple of extra bits of state, so maybe they can be stashed elsewhere. With immortal objects, we have a couple of bits available in ob_refcnt. Perhaps two bits in PyObject would be enough: is_tracked_by_gc, is_small_allocation.


Do you have benchmark numbers on the memory usage? pyperf and pyperformance can track peak memory usage (RSS or tracemalloc).

It would be premature to benchmark since I know the memory usage will currently be higher, not lower. For now, I’m just trying to show the cyclic GC can work correctly, preserving the current behaviour as much as possible. It’s also likely to be slower since I didn’t make much effort to optimize the new code.
