Can someone explain this? I benchmarked the built-in set as well as these functions, each creating two sets:
def sets1(arr):
    set(arr)
    set(arr)

def sets2(arr):
    hold = set(arr)
    set(arr)
Here are benchmark results with arr = [*range(10000)], where I always compare the runtimes of two of the functions (Attempt This Online!):
200.6 ± 5.0 μs set
395.0 ± 4.7 μs sets1
502.4 ± 96.9 μs set
809.0 ± 5.5 μs sets2
204.1 ± 1.7 μs set
393.9 ± 3.2 μs sets1
495.6 ± 102.1 μs set
793.2 ± 3.0 μs sets2
Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]
Why does sets2 take twice as long (800 vs 400 μs) as sets1, just because it holds on to its first set? Even more strangely, why does the pure set take 2.5 times as long (500 vs 200 μs) when it's measured alongside sets2? How does sets2 affect the set runtime at all, let alone so massively?
Full benchmark code (my test function runs the given functions in alternation, 100 times each, then shows statistics for their 10 fastest times):
from time import perf_counter as time
from statistics import mean, stdev
import gc, sys

arr = [*range(10000)]

def sets1(arr):
    set(arr)
    set(arr)

def sets2(arr):
    hold = set(arr)
    set(arr)

def test(funcs):
    times = {f: [] for f in funcs}
    for _ in range(100):
        for f in funcs:
            gc.collect()
            t0 = time()
            f(arr)
            times[f].append(time() - t0)
    for f in funcs:
        ts = [t * 1e6 for t in sorted(times[f])[:10]]
        print(f'{mean(ts):5.1f} ± {stdev(ts):3.1f} μs ', f.__name__)
    print()

for _ in range(2):
    test([set, sets1])
    test([set, sets2])
print('Python:', sys.version)
Results from a different site with an older Python version show the same effect, even more extreme (Try it online!):
176.3 ± 6.2 μs set
330.1 ± 10.5 μs sets1
484.5 ± 43.7 μs set
737.9 ± 19.3 μs sets2
180.9 ± 5.6 μs set
330.8 ± 12.1 μs sets1
467.1 ± 38.0 μs set
735.6 ± 12.7 μs sets2
Python: 3.8.0b4 (default, Sep 2 2019, 08:08:37)
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)]