When returning Python sets very often, I get large, unaccounted-for memory usage. With lists, the memory stays low

I have a function that returns a set, and I call it very often, leading to large memory usage. Deleting the list containing the returned sets does not free the memory. When the function returns a list instead, much less memory is consumed, and it is freed again when I delete the list of returned lists.

import sys
large_list = list(range(2_000_000))

def some_function(arg1, arg2=None):
    # return list(range(23))  # --> memory usage: ~650 MB
    return set(list(range(23)))  # --> memory usage: ~4 GB

results = []
for x in large_list:
    results.append(some_function(x))
print(f"size results: {sys.getsizeof(results)/1000/1000} MB")
# >>> size results: 17.12828 MB

del results # does not reduce process memory when using sets

When running the for-loop again after deleting results, the memory usage does not go up to ~8 GB, so whatever memory the sets needed seems to be getting reused.

Can I free the memory if I need to use sets? I have already gotten Python to crash by running out of memory.

I’m on Linux (Rocky 9.2) and using btop to inspect the memory usage of the process.

When executing the del line by itself, I can immediately see the memory dropping (without explicitly calling the garbage collector), but not the large ~4 GB block.

I have asked this on Stack Overflow, but have gotten no real answers.

I was able to reproduce the higher memory usage, but del results still causes the memory to be freed when using sets for me. Tested on Ubuntu-based Mint 20.3 Cinnamon with Python 3.8.10, monitoring memory usage in the System Monitor GUI program.

The higher memory usage is not a surprise, as each individual set object uses a larger amount of memory. Keep in mind that sys.getsizeof will only account for the object itself, and not for any contained elements or attributes. Of course, both sets and lists of integers taken from range(23) will reuse interned int objects created at startup, so there is no additional memory usage there, beyond the pointers within the set/list implementations. But those are pretty significant:

>>> sys.getsizeof(set(list(range(23))))
2264
>>> sys.getsizeof(list(range(23)))
240
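
Multiplying that per-set size by the two million calls in your loop roughly accounts for the memory you saw; a back-of-the-envelope check only (the real total will also include allocator overhead and the outer list):

per_set = 2264        # bytes per 23-element set, from the measurement above
count = 2_000_000     # number of sets kept alive by the results list
print(f"~{per_set * count / 1e9:.1f} GB")   # ~4.5 GB, in the ballpark of the observed ~4 GB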

Note that a set can be created directly from the range object, but this does not make the resulting object any more space-efficient:

>>> sys.getsizeof(set(range(23)))
2264

Even an empty set is quite sizable, at least on a 64-bit architecture:

>>> sys.getsizeof(set())
216
>>> sys.getsizeof([])
56

Experimentally, the amount of space taken for the set object jumps from 4->5 elements and then again from 18->19 (and 76->77).
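
A quick way to see those jumps for yourself; the exact thresholds and byte counts will depend on the CPython version and build:

import sys

prev = None
for n in range(30):
    size = sys.getsizeof(set(range(n)))
    if size != prev:
        print(f"{n:2d} elements: {size} bytes")
        prev = size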

If your actual use case involves keeping track of many sets that repeatedly reuse the same common elements, consider instead representing them as subsets of a global set (representing the universe of possible values). Each subset can be represented using a single integer, each of the bits of which indicate whether a corresponding element is present in the subset. (If the global set needs to be computed dynamically, it would be better to use an ordered-set implementation - or just a list - so that existing subsets don’t need to be updated when more elements are added.) Otherwise, dictionaries in 3.7 and up (e.g. dict.fromkeys(range(23)), ignoring the common None values) are more space-efficient - although they will not perform as well for repeated insertions and removals, and set operations like union/intersection/etc. are trickier.
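
A minimal sketch of that bitset representation, assuming the universe of values is known up front (the names universe, to_bitset and contains are just for illustration):

# Each subset of universe is stored as one int: bit i is set when universe[i]
# is present in the subset.
universe = list(range(23))                       # the global set of possible values
index = {value: i for i, value in enumerate(universe)}

def to_bitset(values):
    bits = 0
    for v in values:
        bits |= 1 << index[v]
    return bits

def contains(bits, value):
    return bool((bits >> index[value]) & 1)

a = to_bitset({1, 5, 22})
b = to_bitset({5, 7})
print(contains(a | b, 5))    # True: union is bitwise OR
print(contains(a & b, 1))    # False: intersection is bitwise AND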

Perhaps try one of these?

Also bear in mind that a set is a hash table behind the scenes. To give O(1) lookups, it needs to scatter the values - a saturated hash table is a slow hash table.

Thank you, I tested this with a different Python version and got the same result, but on Windows the memory does get freed by del results.
I also used guppy (from @dstromberg’s link, thanks). Before del results:

Partition of a set of 4391442 objects. Total size = 4669488395 bytes.
 Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)
     0 4014906  91 4585133761  98 4585133761  98 list
     1   9661   0 35037141   1 4620170902  99 dict of module
     2  96915   2  8329133   0 4628500035  99 types.CodeType
...

And after del results:

Partition of a set of 2391510 objects. Total size = 124367520 bytes.
 Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)
     0 2014924  84 57136036  46  57136036  46 list
     1   9659   0 17908691  14  75044727  60 dict of module
     2  96913   4  8328983   7  83373710  67 types.CodeType
...

So to guppy / the heap it looks like the memory has been freed, but btop (and htop) still show the memory as belonging to the python process.

So there seems to be a problem with my operating system.

Just for info: pickling the list of sets results in a small ~100 MB file, which can also be unpickled (which fills up the memory again…).

This is a very common phenomenon. Often, the release of memory doesn’t immediately free it up to the rest of the system (because that can only be done in fixed allocations, or because it’s inefficient to do so, or any other reason), but that memory IS available for reuse within that application. Try running this a second time to see if it uses more memory.

I’d be curious to see whether your results are comparable if you instead use a dictionary. Replace the calls to set() with calls to dict.fromkeys() and see what that does to your memory usage. In my testing, each dictionary came in at roughly half the size of the corresponding set.
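
For reference, here is a minimal sketch of that swap applied to some_function from the original post; the dict values are all None and are simply ignored:

def some_function(arg1, arg2=None):
    # return set(list(range(23)))      # the original version
    return dict.fromkeys(range(23))    # dict keys give the same uniqueness semantics

d = some_function(0)
print(5 in d)    # membership tests on the keys work just like they do on a set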

Side note though: I don’t know what sort of figures you’re getting from sys.getsizeof(results), but to my knowledge, that is ONLY going to give you the size requirement for the list itself. So I would expect that figure to be basically the same for any list of the same size, and most likely will approximate to len(results) * word_size where the word size would be 4 bytes for a 32-bit Python and 8 bytes for a 64-bit Python. It may be somewhat larger than that due to overallocation, though by a bounded amount. Your figure shown there of 17MB seems pretty plausible - that’s less than ten percent above the minimum.
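
The arithmetic behind that estimate, assuming 8-byte pointers on a 64-bit build, looks roughly like this:

n = 2_000_000
minimum_mb = n * 8 / 1_000_000      # 16.0 MB of 8-byte pointers, plus a small fixed header
observed_mb = 17.12828              # the figure printed in the original post
print(f"minimum ~{minimum_mb} MB, observed {observed_mb} MB "
      f"({observed_mb / minimum_mb - 1:.0%} over-allocation)")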

Here’s my figures for a much simpler test, showing the effective costs of different Python data types. The usual caveats apply: this is my system and may differ from yours, the results will vary across Python versions, the results will vary on 32-bit and 64-bit builds, and your operating system and CPU architecture may play a part too. With that said, though, there does seem to be a notable difference here.

>>> sys.getsizeof(tuple(range(10000)))
80040
>>> sys.getsizeof(list(range(10000)))
80056
>>> sys.getsizeof(dict.fromkeys(range(10000)))
294992
>>> sys.getsizeof(set(range(10000)))
524504
>>> sys.getsizeof(tuple(range(23)))
224
>>> sys.getsizeof(list(range(23)))
248
>>> sys.getsizeof(dict.fromkeys(range(23)))
1168
>>> sys.getsizeof(set(range(23)))
2264

The overhead for a list is pretty much constant. In my testing of lists constructed directly from ranges (as in your some_function example), the overhead is either 56 or 64 bytes, regardless of the length of the list; for a list constructed by repeated appends, the overhead also includes the slack space for future expansion, up to maybe 10%.

The overhead for a set, however, grows far more rapidly. There’s ever-growing overhead due to the internal slack space required for an efficient hash table. This varies considerably with the size of the set, but seems to be about 2KB for smallish sets, and settles in on about a 4:1 ratio as they get larger (that is to say, a set with N items takes up roughly as much memory as a list with 4*N items).

So I think what you’re seeing here is the inherent cost of sets: since they need random access by value, they can’t be as compact as lists, which only need lookup by index. That, coupled with some quirks of how operating systems and processes manage their memory, means that you’re allocating a lot more, and when it’s released it goes back to Python’s allocator for reuse rather than being fully returned to the OS. But…

… this part is curious. If you’re doing this repeatedly, and clearing out the old results before starting a new run, the same memory should be being reused. So maybe there’s more going on here. Can you post an example that actually crashes?

Not necessarily a “problem” per se. Operating systems will commonly leave memory assigned to a process that has “freed” it, making the prediction that it will want to use the memory again soon.

I have tested this; the memory gets reused. But the crashes occur because the real list of sets is much larger than the 2-million-element example I posted. When I posted the question, I still thought the list of sets wasn’t that large (because deleting it didn’t free up memory on my system). Now I’ll find a workaround where I just don’t use a very large list of sets.

Ah, I didn’t know this. And I can actually reuse the allocated memory to make a new list of sets. Just out of interest, do you know if there is a way to actually free the memory? gc.collect() (and del) don’t work. But maybe that question is too OS-specific…

Okay, sounds like things are basically working. The key is going to be not using a huge list of sets, then, and one neat trick would be to take advantage of dict’s compactness, but that’ll only get you a 2:1 advantage. Sadly, you may need to heavily rethink your algorithm to reduce memory usage.

What are the sets storing? Are they immutable? Can you share them? (For example, switch from sets to frozensets, and intern them via a dictionary.) Or can they be done as bitsets by assigning each element a power of two, and then storing integers?
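
Here is a minimal sketch of the interning idea, in case many of those frozensets turn out to be equal; the cache name _interned is just for illustration:

_interned = {}

def intern_set(values):
    fs = frozenset(values)
    # Return the previously stored object for an equal set, so duplicates
    # share one ~2 KB frozenset instead of each keeping its own copy.
    return _interned.setdefault(fs, fs)

a = intern_set(range(23))
b = intern_set(range(23))
print(a is b)    # True: both names refer to the same object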

Is 23 your real set size, or is that also wrong?

It’s variable, between 1 and 25 normally.

I’m storing nodes from connected subgraphs as integers. They are immutable and I do save them later in a dictionary (I am actually using frozensets). The bitset approach sounds interesting, but I’m not sure I even need to use sets here, I only ran into memory issues when I scaled my calculation up.

What is the actual calculation? What purpose do the sets serve?

Apparently none. I swapped the sets for tuples and it still works, but uses less memory now.
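
For comparison, the per-object figures measured earlier in the thread show why (sizes will vary between Python versions and builds):

import sys
print(sys.getsizeof(tuple(range(23))))   # 224 bytes on the 64-bit build measured above
print(sys.getsizeof(set(range(23))))     # 2264 bytes, roughly ten times as much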