Weird `set` slowdown

Can someone explain this?

I benchmarked the built-in set as well as these two functions, each of which creates two sets:

def sets1(arr):
    set(arr)
    set(arr)

def sets2(arr):
    hold = set(arr)
    set(arr)

Here are benchmark results with arr = [*range(10000)], where I always compare the runtimes of two of the functions (Attempt This Online!):

200.6 ± 5.0 μs  set
395.0 ± 4.7 μs  sets1

502.4 ± 96.9 μs  set
809.0 ± 5.5 μs  sets2

204.1 ± 1.7 μs  set
393.9 ± 3.2 μs  sets1

495.6 ± 102.1 μs  set
793.2 ± 3.0 μs  sets2

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]

Why does sets2 take twice as long (800 vs 400 μs) as sets1, just because it holds on to its first set? Even more strangely, why does the pure set take 2.5 times as long (500 vs 200 μs) when it’s measured alongside sets2? How does sets2 affect the set runtime at all, let alone so massively?

Full benchmark code (my test function runs the given functions in alternation, 100 times each, then shows statistics for each function's 10 fastest times):

from time import perf_counter as time
from statistics import mean, stdev
import gc, sys
     
arr = [*range(10000)]

def sets1(arr):
    set(arr)
    set(arr)

def sets2(arr):
    hold = set(arr)
    set(arr)

def test(funcs):
    times = {f: [] for f in funcs}
    for _ in range(100):
        for f in funcs:
            gc.collect()
            t0 = time()
            f(arr)
            times[f].append(time() - t0)
    for f in funcs:
        ts = [t * 1e6 for t in sorted(times[f])[:10]]
        print(f'{mean(ts):5.1f} ± {stdev(ts):3.1f} μs ', f.__name__)
    print()

for _ in range(2):
    test([set, sets1])
    test([set, sets2])

print('Python:', sys.version)

Results from a different site with an older Python version show the same effect, even more pronounced (Try it online!):

176.3 ± 6.2 μs  set
330.1 ± 10.5 μs  sets1

484.5 ± 43.7 μs  set
737.9 ± 19.3 μs  sets2

180.9 ± 5.6 μs  set
330.8 ± 12.1 μs  sets1

467.1 ± 38.0 μs  set
735.6 ± 12.7 μs  sets2

Python: 3.8.0b4 (default, Sep  2 2019, 08:08:37) 
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)]

My guess, without solid justification, is that you’re mostly seeing memory allocation and deallocation. When you construct one set, dispose of it, and then construct another, it’s possible that they’re reusing the same block of memory. Assuming this is CPython, try looking at the id of each object - are they the same? With sets2(), they cannot be the same, since both sets have to exist at the same time.
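For example, something along these lines would show it (an illustrative sketch, not necessarily how it has to be checked):

def sets1_ids(arr):
    print(id(set(arr)))    # this temporary set is freed as soon as id() returns
    print(id(set(arr)))    # ...so this one can reuse the same block

def sets2_ids(arr):
    hold = set(arr)
    print(id(hold))
    print(id(set(arr)))    # hold is still alive, so this one cannot reuse its block

arr = [*range(10000)]
print(id(set(arr)))        # the "pure set" case
sets1_ids(arr)
sets2_ids(arr)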


Memory (de)allocation was the only thing I could think of as well, but can that have such a huge impact? Take more time than Python working with the objects to build the set? I’d expect memory (de)allocation to be rather insignificant in comparison. And it doesn’t explain the effect on the pure set time, as at the end of both sets1 and sets2, both of their sets are gone.

I btw also tried with 10x bigger data (arr = [*range(100000)]) and got times like these:

2690.0 ± 50.2 μs  set
5326.8 ± 89.3 μs  sets1

5858.6 ± 972.8 μs  set
7692.6 ± 66.2 μs  sets2

2734.6 ± 57.1 μs  set
5327.0 ± 100.3 μs  sets1

5923.0 ± 733.4 μs  set
7728.7 ± 43.6 μs  sets2

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]

Here the slowdown factor was smaller but still large. I’d expect memory (de)allocation not only to be insignificant compared to building the set, but also to grow far less with the data size.

Maybe cache effects? That might explain how sets2 can affect the pure set time at all, although the magnitude would still surprise me. I write such benchmarks a lot, and I don’t think I’ve ever seen anything like this, with one function impacting a separate function this much.

Yes, it absolutely can, especially when you’re forcing a garbage collection every time. Memory allocation can be quite slow. Usually the fastest option is to reuse a previous memory block of the same size, and that’s why my suspicion is that the two sets in sets1() will have the same id.


Hmm, I still think memory allocation is very low-level and one of the fastest things you can possibly do with memory, and should be much faster than subsequently filling that memory in an elaborate process involving Python objects, not just somewhat slower. I rather suspect caching instead.

Yes, it’s CPython. And when testing set and sets1, all three sets have the same id. Whereas with set and sets2, the set built by the pure set call and hold have the same id, and the second set in sets2 has a different id.

Just noticed something in the “Timing details” shown by the site when I use only one of the test calls at the end.

With test([set, sets1]):

Real time: 0.642681285 s
Kernel time: 0.023972 s
User time: 0.608371 s
Maximum lifetime memory usage: 12576 KiB
Waits (voluntary context switches): 1
Preemptions (involuntary context switches): 48
Minor page faults: 2077
Major page faults: 0
Input operations: 0
Output operations: 0

With test([set, sets2]):

Real time: 0.884625526 s
Kernel time: 0.195504 s
User time: 0.675351 s
Maximum lifetime memory usage: 12836 KiB
Waits (voluntary context switches): 1
Preemptions (involuntary context switches): 276
Minor page faults: 54745
Major page faults: 0
Input operations: 0
Output operations: 0

More “Preemptions” and many more “Minor page faults”.
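On Linux, the resource module exposes the same counters in-process, so this can also be checked without the site's report. A rough sketch, reusing sets2 and arr from above:

import resource

def minor_faults():
    # ru_minflt: minor page faults of this process so far (Unix only)
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

before = minor_faults()
sets2(arr)
print('minor page faults during sets2:', minor_faults() - before)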

“I think” is only really good as a starting point for an investigation - it’s really REALLY easy to be wrong 🙂

Okay, so that confirms it: they ARE being reused. Unless there’s some other indication, I’d pin this one on unnecessary allocations and garbage collection.

Do you have an idea how I could measure the allocations? (Without modifying CPython, just by writing Python code.)

How does garbage collection play a role? My gc call is outside of the measurement, and the times are the same if I remove it. And the garbage collection for the pure set call is the same regardless of which of sets1 and sets2 it runs against, isn’t it? I.e., no garbage collection of previous sets, only the garbage collection of its one set that it built.

tracemalloc — Trace memory allocations — Python 3.12.0 documentation would be a good start. Not sure if it’ll tell you what you need to know.
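A minimal sketch of what using it could look like, though no idea whether it surfaces the allocations that matter here:

import tracemalloc

arr = [*range(10000)]

tracemalloc.start()
before = tracemalloc.take_snapshot()
hold = set(arr)
set(arr)
after = tracemalloc.take_snapshot()

# show the biggest differences in traced Python-level allocations
for stat in after.compare_to(before, 'lineno')[:5]:
    print(stat)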

It’s not something that absolutely definitely plays a role, but something that potentially could. If the times are the same without it, then it can be ruled out, so all of this is just hypothetical now; but the way it might have an effect would be by forcing both allocations to be done every time.

Many types of object in Python make use of “free lists” where recent deallocations get stashed in case they’re needed again. You may find that your performance figures are drastically different if you switch to tuples, for example.
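E.g. the analogous functions would be something like this (untested sketch; whether the tuple free list even applies at this size is a separate question):

def tuples1(arr):
    tuple(arr)
    tuple(arr)

def tuples2(arr):
    hold = tuple(arr)
    tuple(arr)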

It doesn’t, other than to waste time. gc only reclaims cycles, and your program doesn’t create any cycles. But by calling gc.collect() you’re forcing it to look for trash cycles hundreds of times.

Add this function near the top:

def showgc(phase, info):
    # gc calls this before ("start") and after ("stop") every collection,
    # with an info dict giving the generation and, on "stop", how much was collected
    print("phase", phase)
    print("info", info)

and then add:

gc.callbacks.append(showgc)

right after it. You’ll get hundreds of lines of new output, but they’re all forced by the gc.collect() calls. The output shows that no garbage is ever collected by gc (because there are never trash cycles - all trash is reclaimed by refcounts falling to 0).

Then comment out the collect() call, and you’ll see that gc never runs (well, not until your code is over, and Python is tearing itself down on exit).

I don’t mind that wasted time; the program is fast enough. I did it to set the collection counts to zero, to prevent gc from running during the timing. While I don’t have cycles, I still create objects, which could trigger a collection.

While true in general, in this test case it shouldn’t much matter. The int objects the sets contain were all constructed once, at the start, and sticking them in new sets reuses those objects. There are no new allocations of ints needed.

The biggest strain on Python’s small object allocator (obmalloc) here is actually the floats recording the times(!). There aren’t mountains of those, but there are more than the float free list can contain. Trivial regardless.

There are only a few lists too (lists of times), so that shouldn’t matter much here either.

The biggest memory hogs are the sets themselves, and the set objects here are far too large for obmalloc to allocate itself. Instead those are delegated to the system malloc/free family. So, in the end, I expect we’re seeing consequences of how the platform C malloc/free handle chains of operations.
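For a sense of the scale involved (sketch; 512 bytes is pymalloc's small-object cutoff in current CPython):

import sys

s = set(range(10000))
print(sys.getsizeof(s))   # roughly half a MiB here: the hash table alone dwarfs the 512-byte cutoff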

It’s a red herring here regardless: as already demonstrated, gc doesn’t run at all in this program in the absence of the program’s explicit collect() calls forcing it to run.

gc is triggered by a “large” excess of allocations over deallocations. Refcounting alone reclaims all the trash the program creates “right away”, so there’s never an excess of allocations over deallocations large enough to get anywhere near triggering a gc run.

BTW, the usual way to take possible gc mysteries out of the equation isn’t to run collect(), but to call gc.disable().
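I.e. something like this around the timing code (sketch, reusing the test calls from above):

import gc

gc.disable()          # no automatic collection can fire during the timing
try:
    for _ in range(2):
        test([set, sets1])
        test([set, sets2])
finally:
    gc.enable()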


That makes sense. I know that the free lists can make a HUGE difference to things like multiple-return (where you’re constantly making tiny tuples), but I wasn’t sure how large an object might be reused.

(By the way, it’s worth noting that when I originally said “allocations and garbage collection”, I wasn’t referring to the cyclic collector but to all “drop it on the floor for deallocation”. It would probably have been clearer if I’d said “allocations and deallocations”.)

fwiw I don’t see this on macOS with Python 3.8, 3.10, or 3.11. Is this on Windows or Linux?

edit: by “this” I meant the weird slowdown of set when paired with sets2. In all cases, both sets1 and sets2 take about double the time of set.


Whether I trigger a gc collection with my code depends on the prior counts. There’s a small chance that I’m close enough to trigger one (testing now showed that I appear to consistently start 75 away, but I’d rather be 700 away and know it).

Yes, perhaps gc.disable() is equivalent here, but I don’t know what exactly it does. It’s only documented with “Disable automatic garbage collection”. Which could mean also disabling the tracking/counting, not just collections. (I’ve read the whole gc doc and the whole devguide page about it a while back, but don’t remember whether they say more about it.)
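For reference, the kind of check I mean (not necessarily my exact code):

import gc

count0, _, _ = gc.get_count()          # tracked allocations minus deallocations for generation 0
threshold0, _, _ = gc.get_threshold()  # 700 by default
print('gen-0 headroom before a collection:', threshold0 - count0)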

I am reproducing the issue with 3.8 and 3.11 on Linux, using timeit from the command line with analogous code. (I’m also finding that 3.11 performs worse on this task.)

$ python3.11 -m timeit --setup "def sets1(arr): set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets1(arr)"
1000 loops, best of 5: 297 usec per loop
$ python3.11 -m timeit --setup "def sets2(arr): hold = set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets2(arr)"
500 loops, best of 5: 665 usec per loop
$ python3.8 -m timeit --setup "def sets1(arr): set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets1(arr)"
1000 loops, best of 5: 258 usec per loop
$ python3.8 -m timeit --setup "def sets2(arr): hold = set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets2(arr)"
500 loops, best of 5: 626 usec per loop

… This is really strange, as I try more things.

The slowdown also disappears if hold is made global. It persists, however, if the input list is global, and/or if separate input lists are used for the two set calls. I can also get a similar but much smaller effect from using list both times instead of set, but not from mixing and matching.

Edit: Making and assigning an empty set does not appear to trigger the issue.
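For concreteness, the variants I mean look roughly like this (illustrative names, not necessarily my exact code):

hold = None
arr2 = [*range(10000)]

def sets2_global(arr):        # hold made global: slowdown disappears
    global hold
    hold = set(arr)
    set(arr)

def sets2_two_inputs(arr):    # separate input lists: slowdown persists
    hold = set(arr)
    set(arr2)

def lists2(arr):              # list instead of set: similar but much smaller effect
    hold = list(arr)
    list(arr)

def sets2_empty(arr):         # holding an empty set instead: no slowdown
    hold = set()
    set(arr)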

Weird. My test was to make a file with the same code as @pochmann in the OP (unless I made a mistake somewhere)

(pytest311) ➜  jupyter python test.py
 93.0 ± 0.1 μs  set
185.3 ± 0.2 μs  sets1

 93.4 ± 0.2 μs  set
185.8 ± 0.5 μs  sets2

 92.8 ± 0.1 μs  set
185.0 ± 0.1 μs  sets1

 93.3 ± 0.2 μs  set
185.4 ± 0.1 μs  sets2

Python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:10:28) [Clang 15.0.7 ]

As noted in the output, this version of Python was built with Clang. I think this reinforces the idea that the underlying allocator is behind this?

Reaching, here, but something this strange justifies this kind of reach I guess.

My Pythons are both compiled by GCC 9.4.0 (the system compiler; 3.8 came with the OS and was presumably built with that GCC by the distro). Could that have something to do with it?

Using a C-level profiling tool would give insight into where the time is going in CPython.

I would be surprised if the compiler used, or the linker options, caused this type of odd slowdown.
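For the C-level profiling suggestion, perf on Linux would be one option, e.g. something like this (sketch; bench.py stands in for the benchmark script above, and a Python built with debug symbols makes the output far more readable):

$ perf record -g -- python3.11 bench.py
$ perf report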