Here is an idea I have been toying with since the Python core sprint at Facebook. I don’t know yet if there would be any real advantages but I think it is an interesting idea.
The current GC (implemented in gcmodule.c) has a number of issues:
- the PyGC_Head overhead is 16 bytes and applies to every object that participates in GC. That memory overhead is not terrible, but it would be nice to reduce it.
- the process of doing a collection modifies data in the PyGC_Head of objects in the generations being collected. That is unfriendly to the OS because of dirtied VM pages: the PyGC_Head data is spread out over memory, so a collection can dirty many OS pages.
- due to modern CPU/memory/cache behavior, traversing linked lists is quite slow. So, while the current linked-list design appears simple and fast, on modern machines a different design (e.g. one using vector-like data structures) could be comparable or faster.
My first step has been to modify the current gcmodule.c to no longer use the doubly linked lists for keeping sets of objects during collection. Instead, the linked list is only used to traverse all the objects that have a GC head. I’ve added some new PyGC_Head fields:
- gc_color: WHITE (trash), GREY (alive, needs traverse), BLACK (alive)
- gc_flags: GC_FLAG_FINALIZER_REACHABLE, GC_FLAG_LEGACY_FINALIZER_REACHABLE, GC_FLAG_FINALIZED
- gc_gen: generation number for the object, 0 to 4
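To make the layout concrete, the extra state might look something like this. This is only a sketch of the idea; the field widths, flag values, and struct layout are my guesses, not necessarily what the experimental branch uses:

```c
#include <stdint.h>

/* Tri-color marking state, one value per tracked object. */
enum gc_color { GC_WHITE = 0, GC_GREY = 1, GC_BLACK = 2 };

/* Flag bits; the names follow the post, the values are illustrative. */
#define GC_FLAG_FINALIZER_REACHABLE        (1 << 0)
#define GC_FLAG_LEGACY_FINALIZER_REACHABLE (1 << 1)
#define GC_FLAG_FINALIZED                  (1 << 2)

typedef struct {
    uintptr_t gc_next;   /* list link, now only for visiting all objects */
    uintptr_t gc_prev;
    uint8_t   gc_color;  /* WHITE (trash), GREY (needs traverse), BLACK (alive) */
    uint8_t   gc_flags;  /* GC_FLAG_* bits above */
    uint8_t   gc_gen;    /* generation number, 0 to 4 */
} PyGC_HeadSketch;
```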
The code for this experiment is on GitHub. Nearly all unit tests pass. Performance is obviously terrible right now, since the GC process makes many passes over the entire set of GC-tracked objects. Recovering the lost performance is the work that remains.
The next step I would like to try is to remove PyGC_Head for small objects. E.g. if an object is less than 512 bytes in size, it is allocated from a special GC pool (similar to, but separate from, the obmalloc pools). Information that was previously in PyGC_Head for these objects gets moved elsewhere, e.g. packed at the start of the pool or in an array/tree indexed by pool index.
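To illustrate what “packed at the start of the pool” could mean: give each pool a small header whose state array holds one byte of color/flags per object, and recover it from an object address by masking. This assumes pools are aligned to their size (discussed further down); all names and the layout here are my own invention:

```c
#include <stdint.h>
#include <stddef.h>

#define POOL_SIZE 4096

/* Hypothetical pool header: per-object GC state packed at the pool start,
 * replacing the per-object PyGC_Head for small objects. */
typedef struct {
    uint16_t size_class;   /* block size for this pool, in bytes */
    uint16_t nblocks;      /* number of object blocks in the pool */
    uint8_t  gc_gen;       /* generation, if tracked per pool */
    uint8_t  gc_state[1];  /* nblocks bytes: color + flags per object */
} gc_pool_header;

/* Map an object address to its state byte.  Masking works only because
 * the pool is aligned to POOL_SIZE. */
static uint8_t *
gc_state_for(void *op)
{
    uintptr_t addr = (uintptr_t)op;
    gc_pool_header *pool =
        (gc_pool_header *)(addr & ~(uintptr_t)(POOL_SIZE - 1));
    size_t offset = addr - (uintptr_t)pool;
    size_t index = offset / pool->size_class; /* crude: in this toy, block 0
                                                 overlaps the header */
    return &pool->gc_state[index];
}
```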
One potential killer problem with this idea is finding a fast way to distinguish small and large objects (i.e. whether they have a PyGC_Head or not). I fear the cost for pointers found by tp_traverse and for _PyObject_GC_Del(). My current idea is that we could use a version of obmalloc's address_in_range(op). That check is quite fast.
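For reference, the obmalloc trick works roughly like this: every pool header records an index into the arena table, and the check validates that the index is in bounds and that the named arena really contains the pointer. A much-simplified sketch (not the real obmalloc code, which also has to worry about reading uninitialized memory):

```c
#include <stdint.h>
#include <stddef.h>

#define ARENA_SIZE (256 * 1024)
#define POOL_ALIGN_MASK ((uintptr_t)(4096 - 1))

typedef struct { uintptr_t address; } arena_t;

static arena_t arenas[64];
static unsigned int narenas = 0;

typedef struct { unsigned int arenaindex; } pool_header;

/* Is op inside a pool that belongs to one of our arenas?
 * pool->arenaindex may be garbage for foreign memory; the bounds and
 * containment checks make a false positive very unlikely. */
static int
address_in_range(void *op)
{
    pool_header *pool = (pool_header *)((uintptr_t)op & ~POOL_ALIGN_MASK);
    unsigned int i = pool->arenaindex;
    return i < narenas &&
           arenas[i].address != 0 &&
           (uintptr_t)op - arenas[i].address < ARENA_SIZE;
}
```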
An alternative to address_in_range() is to use a radix tree. See the pagemap data structure in tcmalloc for a similar idea. Attached are some sketches I made of a possible GC allocator and related data structures.
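A toy two-level version of such a pagemap looks like this (tcmalloc uses three levels for full 64-bit coverage; the sizes here only cover the low 36 bits of an address and are purely illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12            /* 4 KiB pools/pages */
#define L1_BITS 12               /* root: 4096 entries */
#define L2_BITS 12               /* each leaf: 4096 entries */

typedef struct {
    void **root[1 << L1_BITS];   /* lazily allocated leaf arrays */
} pagemap;

/* Map an address to whatever was stored for its page (e.g. pool info). */
static void *
pagemap_get(pagemap *pm, uintptr_t addr)
{
    uintptr_t page = addr >> PAGE_SHIFT;
    uintptr_t i1 = (page >> L2_BITS) & ((1u << L1_BITS) - 1);
    uintptr_t i2 = page & ((1u << L2_BITS) - 1);
    void **leaf = pm->root[i1];
    return leaf ? leaf[i2] : NULL;   /* NULL: not one of our pools */
}

static int
pagemap_set(pagemap *pm, uintptr_t addr, void *val)
{
    uintptr_t page = addr >> PAGE_SHIFT;
    uintptr_t i1 = (page >> L2_BITS) & ((1u << L1_BITS) - 1);
    uintptr_t i2 = page & ((1u << L2_BITS) - 1);
    if (pm->root[i1] == NULL) {
        pm->root[i1] = calloc(1u << L2_BITS, sizeof(void *));
        if (pm->root[i1] == NULL)
            return -1;
    }
    pm->root[i1][i2] = val;
    return 0;
}
```

Compared with address_in_range(), the lookup never reads memory we don't own, at the cost of an extra indirection or two.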
The allocation function is something like:
- look up a free object block for the size class; if found, unlink it and return it
- look up an empty/partially empty pool on the new_pool list; if one exists, return the next block from that pool
- create a new pool and return its first block
The free function would be:
- link the object block onto the list of free blocks for its size index. We could link to the start or the end of the free list, depending on what works better.
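Putting the two together, here is a single-threaded sketch of those steps. Everything about it is simplified: no pool alignment, no returning memory to the OS, and the size-class math is a placeholder:

```c
#include <stdint.h>
#include <stdlib.h>

#define POOL_SIZE 4096
#define ALIGNMENT 16
#define NUM_SIZE_CLASSES 33   /* sizes 1..512 bytes, 16-byte steps */

/* A free block stores the link to the next free block in its own memory. */
typedef struct free_block { struct free_block *next; } free_block;

typedef struct pool {
    struct pool *next;       /* link on the "pools with room" list */
    uint32_t block_size;
    uint32_t next_offset;    /* bump pointer for never-used blocks */
} pool;

static free_block *free_lists[NUM_SIZE_CLASSES];
static pool *usable_pools[NUM_SIZE_CLASSES];

static void *
gc_alloc(size_t size)
{
    size_t idx = (size + ALIGNMENT - 1) / ALIGNMENT;
    /* 1. Free block available?  Unlink it and return it. */
    if (free_lists[idx]) {
        free_block *b = free_lists[idx];
        free_lists[idx] = b->next;
        return b;
    }
    /* 2. Pool with unused blocks?  Bump-allocate from it. */
    pool *p = usable_pools[idx];
    if (p == NULL) {
        /* 3. Otherwise create a new pool. */
        p = malloc(POOL_SIZE);
        if (p == NULL)
            return NULL;
        p->block_size = (uint32_t)(idx * ALIGNMENT);
        p->next_offset = sizeof(pool);
        p->next = NULL;
        usable_pools[idx] = p;
    }
    void *block = (char *)p + p->next_offset;
    p->next_offset += p->block_size;
    if (p->next_offset + p->block_size > POOL_SIZE)
        usable_pools[idx] = p->next;   /* pool is now full; drop it */
    return block;
}

static void
gc_free(void *op, size_t size)
{
    size_t idx = (size + ALIGNMENT - 1) / ALIGNMENT;
    free_block *b = op;
    b->next = free_lists[idx];         /* push on the front of the list */
    free_lists[idx] = b;
}
```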
This scheme would be optimized for 64-bit desktop/server machines with decent amounts of RAM. We could have a build configure option that builds CPython using the old doubly linked lists.
For simplicity, my allocation design is single-threaded and doesn’t return memory to the OS. Those limitations should be fixable given some “small matter of programming”. E.g. just copy what tcmalloc does.
In the sketch above, I have the “generation” as a pool-level property, so all objects in the same pool are in the same generation. It would be useful to tune the freelist behavior to optimize per-generation behavior (e.g. try to minimize the number of pools in the youngest generation).
In the sketch, I have gc_collect_info allocated only during collection. It could be allocated in a two-phase process per L3 node. First, find the number and size classes of the pools in the L3 leaf. Then allocate a contiguous block of memory to hold the gc_collect_info for all pools in the L3 leaf, and set the ‘c’ pointer of each gc_info to point into this contiguous block. That way, we don’t pay the memory cost of ‘refs’ for generations not being collected. Also, the page-dirtying behavior should be more friendly to the OS (locality of writes).
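The two-phase idea in miniature, with a plain array standing in for the pools under one L3 leaf (gc_collect_info, ‘c’, and ‘refs’ follow the sketch's names, but the layout here is my own reconstruction):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t nblocks;   /* objects in this pool */
    int32_t *refs;      /* points into the shared block during collection */
} gc_collect_info;

typedef struct {
    uint32_t nblocks;
    gc_collect_info c;  /* per-pool collection state, the 'c' pointer's target */
} pool_info;

/* Phase 1: count blocks across all pools in the leaf.
 * Phase 2: one contiguous allocation; point each pool's 'refs' into it. */
static int32_t *
setup_collect_info(pool_info *pools, size_t npools)
{
    size_t total = 0;
    for (size_t i = 0; i < npools; i++)
        total += pools[i].nblocks;
    int32_t *block = calloc(total, sizeof(int32_t));
    if (block == NULL)
        return NULL;
    size_t off = 0;
    for (size_t i = 0; i < npools; i++) {
        pools[i].c.nblocks = pools[i].nblocks;
        pools[i].c.refs = block + off;
        off += pools[i].nblocks;
    }
    return block;   /* freed in one piece when the collection ends */
}
```

One contiguous block means the write traffic from updating ‘refs’ stays within a few pages instead of being scattered across the heap.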
My radix tree design requires that pools are aligned to pool-size boundaries. This could be done by allocating a large “arena” and then allocating pools from within it (wasting a bit of space at the start to get alignment). Or we could use a system call that returns pages that are already aligned.
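The arena version is just round-up arithmetic; everything before the first pool boundary is wasted, which is the cost of not having an aligned allocation primitive:

```c
#include <stdint.h>

#define POOL_SIZE 4096u

/* Round the arena start up to the next pool-size boundary. */
static char *
first_aligned_pool(char *arena)
{
    uintptr_t addr = (uintptr_t)arena;
    uintptr_t aligned =
        (addr + POOL_SIZE - 1) & ~(uintptr_t)(POOL_SIZE - 1);
    return (char *)aligned;
}
```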
My changes to gcmodule.c require that allocations happen during the GC collect process. Not sure if that is a major problem.