Garbage Collector Alternative?

We’ve encountered this issue not just once but multiple times, in several scenarios (all related to neural network training, i.e. a loop where the same code is executed over and over, creating a few large tensor objects per iteration). In all those scenarios we hit a case where a few large objects end up in gen2 of the gc, which is not collected until gen2 grows large enough. Because most objects are already collected in the younger generations, that threshold is rarely reached, so these large objects live for a very long time. The few (but very large) objects sitting in gen2 cause an OOM before a gen2 collection is ever triggered automatically.
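A minimal sketch of the pattern (not our actual code; a bytearray stands in for the large GPU tensor and the class is made up purely to illustrate the reference cycle):

```python
import gc

class StepResult:
    """Made-up container: holds a large buffer and is part of a reference cycle."""
    def __init__(self, size):
        self.data = bytearray(size)   # stands in for a large (GPU) tensor
        self.self_ref = self          # cycle: refcounting alone can never free this

def training_loop(steps):
    recent = []
    for step in range(steps):
        result = StepResult(10 * 1024 * 1024)   # ~10 MB here; multi-GB tensors in our real case
        recent.append(result)
        if len(recent) > 3:
            # The dropped result is unreachable now, but because it survived a
            # few young collections while it sat in `recent`, it may already have
            # been promoted to gen2 -- where it waits for a full collection that
            # the 25% heuristic may not trigger for a very long time.
            recent.pop(0)

training_loop(100)
```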

In terms of user experience, I wish for a better-behaved GC, where the user is not forced to trigger the gen2 collection manually (by explicitly calling gc.collect()).

At first I thought it was an actual Python bug and submitted https://github.com/python/cpython/issues/98801. Apparently it wasn’t a bug but by design. Since we’ve hit this issue again, and it again took several days to figure out that the cause was not PyTorch (GPU OOM) but the gc’s collection strategy, I hope to help avoid this in the future, also for other users.

I don’t have a good suggestion myself, as garbage collectors are not my strength, but I want to raise the topic for discussion.

Seems like you already had two months of feedback and discussion in the original issue, and got some pretty good explanations and suggestions. What are you expecting here that you did not get out of all the attention put into that issue?

Well, sure, there was discussion, but I feel the issue was closed because what I reported was not actually a bug (which I’m sorry about). There was no conceptual discussion about how this could be avoided for other folks in the future. I can’t believe nobody else is hitting the same issue, as we’ve encountered it multiple times now. I don’t like having to call gc.collect() explicitly on every iteration of my loop; this feels like a hacky workaround for something that should happen implicitly.

Try running your program under other interpreters with different garbage collectors, such as PyPy, IronPython, and Jython, and see if they perform better for your needs.

Otherwise, I fear that maybe your expectations are too high. Every garbage collection strategy is a trade-off between convenience for the programmer, how rapidly memory gets collected, and avoiding pauses and slowdowns. No strategy is going to “win” for every imaginable program under all circumstances, so sometimes you just have to give the collector a nudge to get the behaviour you need.

A very simple suggestion: support another way of triggering gen2 collection: an absolute trigger value based on long_lived_pending, instead of only the hard-wired long_lived_pending / long_lived_total > 0.25 heuristic. That would allow a program to set the value once at startup and ensure that a gen2 collection actually gets triggered.
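There is no such knob today, but just to make the idea concrete, a rough user-space approximation (assuming Python 3.8+ for the `generation` argument; the function name and threshold value are made up) could look like this:

```python
import gc

def maybe_collect_gen2(max_oldest_gen_objects=50_000):
    """Force a full collection once the oldest generation holds more than an
    absolute number of objects, rather than waiting for the built-in
    long_lived_pending / long_lived_total > 0.25 heuristic."""
    # gc.get_objects(generation=2) lists everything tracked in the oldest
    # generation (roughly long_lived_total + long_lived_pending). Building
    # that list is O(n), so don't call this in a very tight loop.
    if len(gc.get_objects(generation=2)) > max_oldest_gen_objects:
        gc.collect()
```

Something like this could be called once per iteration of the outer loop; having it built in (and cheap) is of course the actual suggestion.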

Actually, I just realized that calling gc.freeze() at the beginning could also help, because it would greatly reduce the number of new long-lived objects needed to trigger a gen2 collection. Maybe that alone is a sufficient mechanism.

I see there are ways to fix this properly, so I’ll report that to pytorch. Maybe it’d be good to have this scenario documented somewhere so future developers can read about it?

The compact scenario: a loop that creates both large and small objects, fixed with a combination of regular manual calls to gc.collect() and a single call to gc.freeze() after initialization to avoid the memory overflow.
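In code, the workaround looks roughly like this (the two helper functions are placeholders for the real initialization and training code):

```python
import gc

def build_model_and_data():
    # placeholder for the real initialization (model, optimizer, datasets, ...)
    return object(), range(100)

def train_step(model, batch):
    # placeholder for a real forward/backward pass that creates large
    # temporary tensors, a few of which end up in reference cycles
    pass

model, batches = build_model_and_data()

# Move everything that is still alive after initialization into the permanent
# generation, so these objects no longer inflate long_lived_total and delay
# future gen2 collections.
gc.collect()
gc.freeze()

for batch in batches:
    train_step(model, batch)
    # Explicit full collection each iteration, so unreachable cycles holding
    # large tensors are freed promptly instead of waiting for an automatic
    # gen2 collection that may never come.
    gc.collect()
```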

Having such an issue suggests that you have too many or too complex reference cycles in your data structures. Have you investigated why you need, or are seeing, those cycles?

E.g. you could use weak references, or restructure the data to simplify or remove the reference cycles – especially in those large objects you appear to be using.
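For example, a child object that needs a back-reference to its parent can hold it as a weak reference, so no cycle is created in the first place (hypothetical names, just to illustrate the idea):

```python
import weakref

class Parent:
    def __init__(self):
        self.children = []

    def add_child(self, child):
        child.parent = weakref.ref(self)   # weak back-reference: no cycle
        self.children.append(child)

class Child:
    def __init__(self):
        self.parent = None

p = Parent()
c = Child()
p.add_child(c)

assert c.parent() is p     # dereference the weak ref while the parent is alive
del p                      # the parent is freed immediately by reference
                           # counting; no cyclic gc pass is needed
assert c.parent() is None  # the weak reference is now dead
```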

In my experience, the most common situation where you run into cycles is storing tracebacks, e.g. for logging purposes. Rather than storing those, it’s often better to serialize them in some form as soon as you get hold of them and then store the serialized form.
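I.e. roughly this pattern (just a sketch, not tied to any particular logging setup):

```python
import traceback

log = []

try:
    1 / 0
except ZeroDivisionError:
    # Storing the exception object (or its __traceback__) keeps every frame
    # and all of its local variables alive, often via reference cycles.
    # Serializing to a string keeps only the text.
    log.append(traceback.format_exc())

print(log[0])
```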

It’s often cheaper to add swap space or even physical RAM than to optimize.

Last I heard, IronPython and Jython were still Python 2.x only.

But PyPy3 and MicroPython are Python 3.x.

Oh, also, Cython can do Python 3 and can give you limited access to manual memory management.

I don’t believe we have a special case here. Our hardware is already substantial (~1 TB RAM, ~40 GB GPU RAM). In our case, one iteration of the outer loop already takes ~1 s (mostly GPU computation), so a little memory allocation/deallocation overhead hardly matters in comparison. As Python/PyTorch is nowadays the state of the art for neural network training, this should not be considered exotic. Still, after ~100 iterations with logging, the GPU memory overflows because tensors are not freed. I don’t see “adding” more (GPU) RAM as an option to compensate for what is effectively a memory leak. Also, those frameworks and their dependencies run best on CPython, especially with all the compiled extension modules.

Nevertheless, I consider calling gc.freeze() after initialization a good solution. Combined with gc.collect() every iteration, I expect it to perform well.

Sounds exotic to me.

(IronPython 3.4 is Python 3.4 plus some newer features like f-string support.)