Weird `set` slowdown

Can someone explain this?

I benchmarked the built-in `set` as well as these functions creating two sets:

``````
def sets1(arr):
    set(arr)
    set(arr)

def sets2(arr):
    hold = set(arr)
    set(arr)
``````

Here are benchmark results with `arr = [*range(10000)]`, where I always compare the runtimes of two of the functions (Attempt This Online!):

``````
200.6 ± 5.0 μs  set
395.0 ± 4.7 μs  sets1

502.4 ± 96.9 μs  set
809.0 ± 5.5 μs  sets2

204.1 ± 1.7 μs  set
393.9 ± 3.2 μs  sets1

495.6 ± 102.1 μs  set
793.2 ± 3.0 μs  sets2

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]
``````

Why does `sets2` take twice as long (800 vs 400 μs) as `sets1`, just because it holds on to its first set? Even more strangely, why does the pure `set` take 2.5 times as long (500 vs 200 μs) when it’s measured alongside `sets2`? How does `sets2` affect the `set` runtime at all, let alone so massively?

Full benchmark code (my `test` function runs the given functions alternatingly, each 100 times, then shows statistics for their 10 fastest times):

``````
from time import perf_counter as time
from statistics import mean, stdev
import gc, sys

arr = [*range(10000)]

def sets1(arr):
    set(arr)
    set(arr)

def sets2(arr):
    hold = set(arr)
    set(arr)

def test(funcs):
    times = {f: [] for f in funcs}
    for _ in range(100):
        for f in funcs:
            gc.collect()
            t0 = time()
            f(arr)
            times[f].append(time() - t0)
    for f in funcs:
        ts = [t * 1e6 for t in sorted(times[f])[:10]]
        print(f'{mean(ts):5.1f} ± {stdev(ts):3.1f} μs ', f.__name__)
    print()

for _ in range(2):
    test([set, sets1])
    test([set, sets2])

print('Python:', sys.version)
``````

Results from a different site with an older Python version show the same thing, even more extreme (Try it online!):

``````
176.3 ± 6.2 μs  set
330.1 ± 10.5 μs  sets1

484.5 ± 43.7 μs  set
737.9 ± 19.3 μs  sets2

180.9 ± 5.6 μs  set
330.8 ± 12.1 μs  sets1

467.1 ± 38.0 μs  set
735.6 ± 12.7 μs  sets2

Python: 3.8.0b4 (default, Sep  2 2019, 08:08:37)
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)]
``````

My guess, without solid justification, is that you’re mostly seeing memory allocation and deallocation. When you construct one set, dispose of it, and then construct another, it’s possible that they reuse the same block of memory. Assuming this is CPython, try looking at the `id` of each object: are they the same? With `sets2()`, they cannot be the same, since both sets must exist independently.
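
Something like this would show it. A CPython-specific sketch (the function names are mine); it relies on `id()` being the object’s memory address in CPython:

```python
# CPython-only sketch: compare ids of back-to-back sets.
# Equal ids mean the second set landed in the freed block of the first.

def ids_without_hold(arr):
    first = id(set(arr))   # this temporary set is freed right after id() returns
    second = id(set(arr))  # so the allocator may hand out the same block again
    return first == second

def ids_with_hold(arr):
    hold = set(arr)        # keep the first set alive...
    second = id(set(arr))  # ...so the second one must live at a different address
    return id(hold) == second

arr = [*range(10000)]
print(ids_without_hold(arr))  # often True on CPython (block reuse), not guaranteed
print(ids_with_hold(arr))     # False: two live objects can't share an address
```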


Memory (de)allocation was the only thing I could think of as well, but can it have such a huge impact? Can it take more time than the Python-level work of building the set from the objects? I’d expect memory (de)allocation to be rather insignificant in comparison. And it doesn’t explain the effect on the pure `set` time, since at the end of both `sets1` and `sets2`, both of their sets are gone.

By the way, I also tried with 10x bigger data (`arr = [*range(100000)]`) and got times like these:

``````
2690.0 ± 50.2 μs  set
5326.8 ± 89.3 μs  sets1

5858.6 ± 972.8 μs  set
7692.6 ± 66.2 μs  sets2

2734.6 ± 57.1 μs  set
5327.0 ± 100.3 μs  sets1

5923.0 ± 733.4 μs  set
7728.7 ± 43.6 μs  sets2

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]
``````

Here the slowdown factor was smaller but still large. I’d expect memory (de)allocation not only to be insignificant compared with building the set, but also to scale far less steeply.

Maybe cache effects? That might explain how `sets2` can affect the pure `set` time at all, although the magnitude would still surprise me. I write such benchmarks a lot and I don’t think I’ve ever seen anything like this, with one function impacting a separate function this much.

Yes, it absolutely can, especially when you’re forcing a garbage collection every time. Memory allocation can be quite slow. Usually the fastest option is to reuse a previous memory block of the same size, which is why my suspicion is that the two sets in `sets1()` will have the same id.


Hmm, I still think memory allocation is very low-level and one of the fastest things you can possibly do with memory. It should be much faster than subsequently filling that memory in an elaborate process involving Python objects, not just somewhat faster. I still suspect caching instead.

Yes, it’s CPython. And when testing `set` and `sets1`, all three sets have the same id. Whereas with `set` and `sets2`, the `set` set and `hold` have the same id and the second `sets2` set has a different id.

Just noticed something in the “Timing details” shown by the site when I use only one of the `test` calls at the end.

With `test([set, sets1])`:

``````
Real time: 0.642681285 s
Kernel time: 0.023972 s
User time: 0.608371 s
Maximum lifetime memory usage: 12576 KiB
Waits (voluntary context switches): 1
Preemptions (involuntary context switches): 48
Minor page faults: 2077
Major page faults: 0
Input operations: 0
Output operations: 0
``````

With `test([set, sets2])`:

``````
Real time: 0.884625526 s
Kernel time: 0.195504 s
User time: 0.675351 s
Maximum lifetime memory usage: 12836 KiB
Waits (voluntary context switches): 1
Preemptions (involuntary context switches): 276
Minor page faults: 54745
Major page faults: 0
Input operations: 0
Output operations: 0
``````

More “Preemptions” and many more “Minor page faults”.

“I think” is only really good as a starting point for an investigation: it’s really REALLY easy to be wrong.

Okay, so that confirms it: they ARE being reused. Unless there’s some other indication, I’d pin this one on unnecessary allocations and garbage collection.

Do you have an idea how I could measure the allocations? (Without modifying CPython, just by writing Python code.)

How does garbage collection play a role? My `gc` call is outside the measurement, and the times are the same if I remove it. And the garbage collection for the pure `set` call is the same regardless of whether it runs against `sets1` or `sets2`, isn’t it? I.e., there is no garbage collection of previous sets, only of the one set it built itself.

The `tracemalloc` module (“tracemalloc — Trace memory allocations”, Python 3.11.4 documentation) would be a good start. Not sure if it’ll tell you what you need to know.
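
For example, a minimal `tracemalloc` sketch (mine, just to show the shape of the API) that reports the peak allocation of a single throwaway set:

```python
import tracemalloc

arr = [*range(10000)]

tracemalloc.start()
set(arr)                  # build and immediately discard one set
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# peak includes the set's hash table (hundreds of KiB for 10000 ints),
# while current drops back down once the set has been freed.
print(f"current={current} B, peak={peak} B")
```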

It’s not something that absolutely definitely plays a role, but something that potentially could. If the times are the same without it, then it can be ruled out, so all of this is just hypothetical now; but the way it might have an effect would be by forcing both allocations to be done every time.

Many types of object in Python make use of “free lists” where recent deallocations get stashed in case they’re needed again. You may find that your performance figures are drastically different if you switch to tuples, for example.
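
A sketch of that experiment, with timing code of my own modeled on the original benchmark:

```python
from time import perf_counter

arr = [*range(10000)]

def tuples1(arr):
    tuple(arr)
    tuple(arr)

def tuples2(arr):
    hold = tuple(arr)  # keep the first tuple alive, as in sets2
    tuple(arr)

def best_us(f, reps=100):
    # best-of-reps wall-clock time for one call, in microseconds
    best = float('inf')
    for _ in range(reps):
        t0 = perf_counter()
        f(arr)
        best = min(best, perf_counter() - t0)
    return best * 1e6

print(f'{best_us(tuples1):.1f} μs  tuples1')
print(f'{best_us(tuples2):.1f} μs  tuples2')
```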

It doesn’t, other than to waste time. `gc` only reclaims cycles, and your program doesn’t create any cycles. But by calling `gc.collect()` you’re forcing it to look for trash cycles hundreds of times.

Add this function near the top:

``````
def showgc(phase, info):
    print("phase", phase)
    print("info", info)
``````

and then add:

``````
gc.callbacks.append(showgc)
``````

right after it. You’ll get hundreds of lines of new output, but they’re all forced by the `gc.collect()` calls. The output shows that no garbage is ever collected by `gc` (because there are never trash cycles - all trash is reclaimed by refcounts falling to 0).

Then comment out the `collect()` call, and you’ll see that `gc` never runs (well, not until your code is over, and Python is tearing itself down on exit).

I don’t mind that wasted time, the program is fast enough. I did it to set the collection counts to zero, to prevent gc from running during the timing. While I don’t have cycles, I still create objects which could trigger a collection.

While true in general, in this test case it shouldn’t much matter. The int objects the sets contain were all constructed once, at the start, and sticking them in new sets reuses those objects. There are no new allocations of ints needed.

The biggest strain on Python’s small object allocator (`obmalloc`) here is actually the floats recording the times(!). There aren’t mountains of those, but there are more than the float free list can hold. Trivial regardless.

There are only a few lists too (lists of times), so that shouldn’t matter much here either.

The biggest memory hogs are the sets themselves, and the set objects here are far too large for `obmalloc` to allocate itself. Instead those are delegated to the system `malloc/free` family. So, in the end, I expect we’re seeing consequences of how the platform C `malloc/free` handle chains of operations.

It’s a red herring here regardless: as already demonstrated, `gc` doesn’t run at all in this program in the absence of the program’s explicit `collect()` calls forcing it to run.

`gc` is triggered by a “large” excess of allocations over deallocations. Refcounting alone reclaims all the trash the program creates “right away”, so there’s never an excess of allocations over deallocations large enough to get anywhere near triggering a `gc` run.

BTW, the usual way to take possible `gc` mysteries out of the equation isn’t to run `collect()`, but to call `gc.disable()`.
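
A sketch of that pattern (the timed body here is just a placeholder workload):

```python
import gc

gc.disable()               # automatic collection cannot start in here
try:
    # ... the timed section goes here; placeholder workload:
    total = sum(set(range(1000)))
finally:
    gc.enable()            # restore normal behavior even on error

print(total)
```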


That makes sense. I know that the free lists can make a HUGE difference to things like multiple-return (where you’re constantly making tiny tuples), but I wasn’t sure how large an object might be reused.

(By the way, it’s worth noting that when I originally said “allocations and garbage collection”, I wasn’t referring to the cyclic collector but to all “drop it on the floor for deallocation”. It would probably have been clearer if I’d said “allocations and deallocations”.)

FWIW, I don’t see this on macOS with Python 3.8, 3.10, or 3.11. Is this on Windows, or Linux?

edit: by “this” I meant the weird slowdown of `set` when paired with `sets2`. In all cases, both `sets1` and `sets2` take about double the time of `set`.


Whether my code triggers a gc collection depends on the prior counts. There’s a small chance that I’m close enough to trigger one (testing now showed that I appear to consistently start 75 away, but I’d rather be 700 away and know it).

Yes, perhaps `gc.disable()` is equivalent here, but I don’t know exactly what it does. It’s only documented as “Disable automatic garbage collection”, which could mean also disabling the tracking/counting, not just the collections. (I’ve read the whole `gc` doc and the whole devguide page about it a while back, but I don’t remember whether they say more about it.)
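
For what it’s worth, the headroom can be inspected directly; a small sketch:

```python
import gc

gc.collect()                           # zero the generation counters
allocs, _, _ = gc.get_count()          # gen-0 allocations minus deallocations so far
threshold0, _, _ = gc.get_threshold()  # gen-0 trigger, 700 by default
print(f'{threshold0 - allocs} allocations of headroom before a gen-0 collection')
```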

I am reproducing the issue with 3.8 and 3.11 on Linux, using `timeit` from the command line with analogous code. (I’m also finding that 3.11 performs worse on this task.)

``````
$ python3.11 -m timeit --setup "def sets1(arr): set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets1(arr)"
1000 loops, best of 5: 297 usec per loop
$ python3.11 -m timeit --setup "def sets2(arr): hold = set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets2(arr)"
500 loops, best of 5: 665 usec per loop
$ python3.8 -m timeit --setup "def sets1(arr): set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets1(arr)"
1000 loops, best of 5: 258 usec per loop
$ python3.8 -m timeit --setup "def sets2(arr): hold = set(arr); set(arr)" --setup "arr=[*range(10000)]" -- "sets2(arr)"
500 loops, best of 5: 626 usec per loop
``````

… This is really strange, as I try more things.

The slowdown also disappears if `hold` is made global. It persists, however, if the input list is global, and/or if separate input lists are used for the two `set` calls. I can also get a similar, but much more minor, effect from using `list` both times instead of `set`, but not from mixing and matching.

Edit: Making and assigning an empty set does not appear to trigger the issue.
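
The global-`hold` variant I tried looks like this (sketch); rebinding the global drops the last reference to the previous call’s set:

```python
hold = None

def sets2_global(arr):
    global hold
    hold = set(arr)   # rebinding frees the set held from the previous call
    set(arr)          # second, throwaway set as before

sets2_global([*range(10)])
```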

Weird. My test was to make a file with the same code as @pochmann in the OP (unless I made a mistake somewhere)

``````
(pytest311) ➜  jupyter python test.py
93.0 ± 0.1 μs  set
185.3 ± 0.2 μs  sets1

93.4 ± 0.2 μs  set
185.8 ± 0.5 μs  sets2

92.8 ± 0.1 μs  set
185.0 ± 0.1 μs  sets1

93.3 ± 0.2 μs  set
185.4 ± 0.1 μs  sets2

Python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:10:28) [Clang 15.0.7 ]
``````

As noted in the output, this version of Python was built with `Clang`. I think this reinforces the idea that the underlying allocation is behind this?

Reaching, here, but something this strange justifies this kind of reach I guess.

My Pythons are both compiled by GCC 9.4.0 (the system compiler; 3.8 came with the OS and was presumably built with that GCC by the distro). Could that have something to do with it?

Using a C-level profiling tool would give insight into where the time is going in CPython.

I would be surprised if the compiler used, or the linker options, caused this type of odd slowdown.