Weird `set` slowdown

The plot thickens:

$ python3.8 -m timeit --setup "def sets2(): hold=set(arr1); set(arr2)" --setup "arr1, arr2=[*range(4914)], [*range(10000)]" -- "sets2()"
2000 loops, best of 5: 186 usec per loop
$ python3.8 -m timeit --setup "def sets2(): hold=set(arr1); set(arr2)" --setup "arr1, arr2=[*range(4915)], [*range(10000)]" -- "sets2()"
500 loops, best of 5: 580 usec per loop

This result is very consistent for me.

@kknechtel.

>>> import sys
>>> s = set(range(4914))
>>> sys.getsizeof(s)
131288
>>> s.add(4914)
>>> sys.getsizeof(s)
524504

That is, in your latest, very simple code, the difference in arr1 between the two runs is that it sits exactly at the boundary at which the set internals vastly increase the size of the underlying Python object when one more element is added.

When that resize happens, every element already in the set has to be reinserted into the new, larger hash table.
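If you want to see where those boundaries fall on your build, here’s a quick way to watch the jumps (CPython-specific sketch; the exact thresholds vary by version and platform):

import sys

s = set()
size = sys.getsizeof(s)
for i in range(10000):
    s.add(i)
    new = sys.getsizeof(s)
    if new != size:
        print(f'resized at {len(s)} elements: {size} -> {new} bytes')
        size = new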

1 Like

This makes sense, but it’s noteworthy that the effect is far greater when the assignment to the local occurs. Without it:

$ python3.8 -m timeit --setup "def sets1(): set(arr1); set(arr2)" --setup "arr1, arr2=[*range(4914)], [*range(10000)]" -- "sets1()"
2000 loops, best of 5: 181 usec per loop
$ python3.8 -m timeit --setup "def sets1(): set(arr1); set(arr2)" --setup "arr1, arr2=[*range(4915)], [*range(10000)]" -- "sets1()"
1000 loops, best of 5: 214 usec per loop

Sure, more than one thing can be at work :wink:. But since (at least to my eyes) all signs here point to the platform C’s malloc/free behavior, why aren’t people trying to time simple C code instead? I expect we’re merely obscuring the underlying cause by tossing in thousands of lines of CPython implementation code driving that.

1 Like

That resize less than doubles that kind of work, and only for the smaller of the two lists/sets. So for that work I’d expect the overall time to grow by less than a factor of 1.5, not the observed factor of 3.12 (≈580/186).

You mean just timing mallocing/freeing 524504 bytes?

Already said there’s more than one thing at work. It certainly doesn’t help here to add a case with the additional complication of vastly increasing the memory allocated for a set between the runs.

I mean treat it seriously instead of as a word game :wink:. Go through several layers of emulating the essence of what a set needs to do, at the memory level.

For a start, your example has two allocations, and two frees, per function call. In one case both allocations are done before the first free. In the other (faster) case, malloc and free calls alternate.

For the next step (if needed), memory allocated for a set is then read and written into. That can be supremely important if, e.g., alternating malloc and free returns the same memory block that’s still high in the cache hierarchy.
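If that’s the mechanism, explicitly dropping the first set before building the second should restore the alternating pattern. A variant of the earlier harness to check (untested; the hypothesis predicts numbers close to the no-assignment case):

$ python3.8 -m timeit --setup "def sets3(): hold=set(arr1); del hold; set(arr2)" --setup "arr1, arr2=[*range(4915)], [*range(10000)]" -- "sets3()"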

Besides that, we already have reports of platforms with no apparent timing differences. On my platform (Windows), where there is one, I don’t have to use sets at all to see significant timing differences from the same kind of C allocation/free behavior the CPython implementation provokes; like:

$ py -m timeit -s "a = list(range(100000))" -s "def f(a): hold=a[:]; a[:]" "f(a)"
200 loops, best of 5: 1.26 msec per loop

$ py -m timeit -s "a = list(range(100000))" -s "def f(a): a[:]; a[:]" "f(a)"
200 loops, best of 5: 810 usec per loop

I reproduced the weirdness on a Linux machine. I’ll just note that swapping the order of the two loops (i.e. moving for f in funcs outside of for _ in range(100)) makes set consistent again. Seems like the hold in sets2 is kicking something out of a cache, making set slower.

1 Like

I don’t love C, so instead of just malloc/free, I created bytes from bytearray in Python, as large as the sets. That also mallocs/frees, right?

Added source data and functions:

barr = bytearray(os.urandom(sys.getsizeof(set(arr))))

def bytes1(barr):
    bytes(barr)
    bytes(barr)

def bytes2(barr):
    hold = bytes(barr)
    bytes(barr)

Results (including the sets results for comparison):

192.1 ± 6.2 μs  set
383.0 ± 11.8 μs  sets1

434.5 ± 5.3 μs  set
782.7 ± 14.3 μs  sets2

 42.6 ± 0.3 μs  bytes
 69.1 ± 0.8 μs  bytes1

 42.4 ± 0.3 μs  bytes
 71.2 ± 0.8 μs  bytes2

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]

The bytes creation times (35-42 μs) are much smaller than the time increases for the sets (240-400 μs), so I think that shows that malloc/free play only a minor role. (Yeah, I know, the sets seem to have multiple growing (re)allocations during their creation instead of just one like bytes presumably has, so this isn’t perfect).

Full code
from time import perf_counter as time
from statistics import mean, stdev
import gc, sys, os
     
arr = [*range(10000)]

def sets1(arr):
    set(arr)
    set(arr)

def sets2(arr):
    hold = set(arr)
    set(arr)

barr = bytearray(os.urandom(sys.getsizeof(set(arr))))

def bytes1(barr):
    bytes(barr)
    bytes(barr)

def bytes2(barr):
    hold = bytes(barr)
    bytes(barr)

def test(funcs, data):
    times = {f: [] for f in funcs}
    for _ in range(100):
        for f in funcs:
            gc.collect()
            t0 = time()
            f(data)
            times[f].append(time() - t0)
    for f in funcs:
        ts = [t * 1e6 for t in sorted(times[f])[:10]]
        print(f'{mean(ts):5.1f} ± {stdev(ts):3.1f} μs ', f.__name__)
    print()

for _ in range(2):
    test([set, sets1], arr)
    test([set, sets2], arr)
    test([bytes, bytes1], barr)
    test([bytes, bytes2], barr)

print('Python:', sys.version)

Attempt This Online!

1 Like

Whereas the first thing I tried with a bytearray was over 9 times(!) slower when doing both allocations before a free:

$ py -m timeit -s "b = bytearray(range(10)) * 100000" -s "def f(b): bytes(b); bytes(b)" "f(b)"
5000 loops, best of 5: 87.6 usec per loop

$ py -m timeit -s "b = bytearray(range(10)) * 100000" -s "def f(b): h = bytes(b); bytes(b)" "f(b)"
500 loops, best of 5: 848 usec per loop

Turns out “100000” there is seemingly uniquely bad. The relative difference is very much smaller with every other materially different multiplier I tried.

But still no reason I can see to believe this has anything to do with the CPython implementation, apart from how it happens to invoke the platform C memory management primitives.

I’m giving up on this :laughing:. I tried it in C, and it’s just plain hopeless. Under MSVC (Microsoft’s C compiler), malloc+fill+malloc+fill+free+free can have very different timing than doing malloc+fill+free+malloc+fill+free (faster), which are the two patterns we keep recreating in obscure, indirect ways with Python code here. But it’s extremely sensitive to the number of bytes requested. If “very large” or “very small”, scant difference, but in between (more in the 1/2 to 2 megabyte range) I see up to a factor of 10 difference. I’m not including code because this is trivial for a C programmer.
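(Not Tim’s code, but for anyone who wants to poke at the same pattern without a C toolchain, here’s a rough stand-in that drives libc directly from Python via ctypes. POSIX-only sketch; he timed native C under MSVC, so expect the numbers to differ.)

# Time the two allocation patterns by calling libc's malloc/free directly,
# so the pattern, not Python object machinery, dominates the work.
import ctypes
from timeit import timeit

libc = ctypes.CDLL(None)  # dlopen(NULL): the process's already-loaded libc
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

N = 1 << 20  # ~1 MiB, in the range where the gap was largest

def nested():       # malloc+fill+malloc+fill+free+free
    p = libc.malloc(N); ctypes.memset(p, 1, N)
    q = libc.malloc(N); ctypes.memset(q, 2, N)
    libc.free(p); libc.free(q)

def alternating():  # malloc+fill+free+malloc+fill+free
    p = libc.malloc(N); ctypes.memset(p, 1, N); libc.free(p)
    q = libc.malloc(N); ctypes.memset(q, 2, N); libc.free(q)

print('nested:     ', timeit(nested, number=2000))
print('alternating:', timeit(alternating, number=2000))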

Since Python has nothing to do with those results, the only thing to do next would be to reverse-engineer Microsoft’s libc implementation, which I have no interest in doing.

In any case, I doubt anyone will learn anything useful by pursuing this in Python. The CPython implementation just isn’t doing radically different things depending on the sizes of the sets, lists, bytearrays, … it’s allocating: it just passes the memory request on to the platform C library.

FYI, the largest objects for which Python does its own memory management are <= 512 bytes.
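(If you’re curious, CPython can dump that small-object allocator’s internal state; CPython-only, and the output format isn’t guaranteed:)

import sys
sys._debugmallocstats()  # prints pymalloc arena/pool statistics to stderr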

3 Likes

Thanks. Then I really suspect that the twice-as-large memory usage of malloc+fill+malloc+fill+free+free simply exceeds a CPU cache level (it might go away if you omit the fills), so that it constantly gets misses where the alternating pattern constantly gets hits. And maybe that explains those page faults I saw: staying in the CPU caches might avoid paging altogether, and with it the page faults.

Not sure what lesson to take from all this regarding benchmarking in Python. I’m still surprised that this is so very noticeable at the Python level with its much slower operations (yes, set(list_of_large_range) is almost all C, but with objects and iteration and hashing). Bummer. I generally like that benchmarking Python is relatively easy because Python rarely tries to “outsmart the programmer” (as someone put it just two days ago), unlike for example C++. So this huge impact of something outside of Python on Python is annoying.

No, the page fault is taken so the OS can allocate pages to the address range.
The CPU cache has no effect on that process.

Hmm, how do you explain this then?

#include <stdio.h>
#include <stdlib.h>
 
int main() {
    int n = 10000000;
    for (int j = 0; j < 100; ++j) {
        int* ptr = (int*)malloc(n * sizeof(int));
        for (int i = 0; i < n; ++i) {
            ptr[i] = i;
        }
        free(ptr);
    }
    return 0;
}

Result (Attempt This Online!):

Minor page faults: 61549

But if I replace ++i with i += 1000000 (Attempt This Online!):

Minor page faults: 5755

Baseline (if I don’t do anything) is about 4740, so the latter adds about 1000 faults, matching the 100*10 assignments ptr[i] = i. And it’s far fewer faults than the 61549 that I got when assigning every element. So it looks like the allocation doesn’t cause the faults, the assignments do.
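For reference, the same counter can be read from inside a process via getrusage. A Python equivalent of the touch-every-page case (Unix-only sketch):

# Count minor page faults around an allocation that writes every page.
import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
data = bytearray(40_000_000)  # zero-filled, so every page gets written
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
print(f'minor faults during allocation+fill: {after - before}')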

1 Like

Backing up a bit, you can get related “mysteries” with much simpler setups. Like this one:

$ py -m timeit -s "b = bytearray(2**19)" "bytes(b)"
10000 loops, best of 5: 26.9 usec per loop

$ py -m timeit -s "b = bytearray(2**20)" "bytes(b)"
1000 loops, best of 5: 260 usec per loop

What’s that? We double the size but it takes nearly 10x longer?

You can stare at the Python implementation until your eyes fall out and never get a clue as to why.

My guess is that we crossed a boundary across which the platform free() decides whether to give a chunk of memory back to the OS, or retain it in the current process for possible later reuse. malloc() doesn’t zero-out memory, so reuse is close to costless. But if you give it back to the OS, the virtual addresses are taken out of the process’s page table, so you’ll at best get a “minor page fault” if malloc() gets those addresses back from the OS again later and you access them. Shouldn’t really matter to this whether you store into them or merely read from them - for security reasons, the OS arranges to zero-fill all memory it hands out. But this is done one page at a time, the first time (if ever) a process accesses the page.

5 Likes

For large allocations that are bigger than the malloc arena size, malloc will map enough memory for the request. Only the front of the memory will be touched by malloc itself, so 1 page fault. None of the other pages have been allocated in the kernel.

When you write to all of the memory of the malloced buffer, it faults in the pages as needed.
When you only touch 4 bytes in every 1,000,000 ints, you only need 10 pages faulted in.

The Linux kernel concept is memory overcommit: the kernel will give a process a block of virtual memory, but not allocate the physical pages until you read or write to them.

The man malloc docs give some details of how malloc works and show the configuration you can control.
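For example, glibc lets you pin those knobs through environment variables (documented in man mallopt; glibc-specific, and these values are arbitrary probes, not recommendations):

$ MALLOC_MMAP_THRESHOLD_=67108864 python3 -m timeit -s "b = bytearray(2**20)" "bytes(b)"
$ MALLOC_MMAP_THRESHOLD_=4096 python3 -m timeit -s "b = bytearray(2**20)" "bytes(b)"

If the timing gap tracks those settings, the allocator’s give-back policy is implicated rather than anything in CPython.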

You could get the sources of glibc and read the malloc code to get a detailed understanding of the algorithms malloc is built on.

Beyond that you will need to read about how the kernel handles the memory allocation calls malloc uses.

3 Likes

Hi Tim,

I’m curious about the execution time gap (x10) between 512kB and 1MB, for example:

def test(x):
    b = bytearray(int(2**x))
    ## h = bytes(b)
    ## bytes(b)

Coincidentally, my PC has 512kB of L2 cache. Could it be related to the CPU cache?
Here is the result of the benchmark:

Test Code
from timeit import timeit

import numpy as np
from matplotlib import pyplot as plt

n = 1  # number timeit
k = 20 # times trial
X = np.arange(10, 24, 0.1) # Use 2^x bytes for test

def test(x):
    b = bytearray(int(2**x))
    ## h = bytes(b)
    ## bytes(b)

yy = []
for j in range(k):
    print(f"Trial {j}...")
    y = [timeit(lambda: test(x), number=n) / n for x in X]
    yy.append(y)
    plt.plot(X, y, '.', alpha=0.1)

Y = np.median(yy, axis=0)
ym = np.mean(yy, axis=0)
ys = np.std(yy, axis=0)

plt.semilogy(X, Y, 'b-', label='median')
plt.errorbar(X, ym, ys, alpha=0.2)
plt.xlabel('2^x [bytes]')
plt.ylabel('Time [s]')
plt.legend()
plt.grid(1, 'major', color='gray')
plt.grid(1, 'minor', color='lightgray')
plt.show()

2 Likes

Nice work! Thank you for sharing this.

I doubt caches are a primary factor. For one thing, my box’s L2 cache is twice as large, but falls off a cliff at the same size range. For another, you materially simplified the test code again: you’re not accessing the constructed bytearray at all, just creating it.

Under the covers, bytearray(an_int) just asks C for memory and then calls memset() to fill it with zero bytes. And that’s the end of it for your test code. memset() implementations are typically as cache-friendly as crawl-over-the-whole-thing code can be made.

I should clarify that I used Microsoft’s MSVC on Windows (the “standard” way to build Python on Windows). The OS and C runtime in use appear to be the whole story here, so identifying them is important.

Note: you’ll probably see very different results if you construct bytes(an_int) objects instead. Because, at least at present, Python doesn’t do malloc+memset for those, but calls calloc() instead. If an_int is large enough that the allocator maps in fresh memory for it, an intelligent calloc() will rely on the OS to 0-fill it “by magic”, lazily, as/if pages are accessed. Then test code that doesn’t go on to access the object itself won’t spend any time initializing the memory before the object is freed again.
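That claim is easy to probe; e.g. (my commands, no results attached; per the above, the bytes version should be far cheaper as long as the object is never read):

$ py -m timeit "bytearray(2**24)"   # malloc + memset: all 16 MiB written up front
$ py -m timeit "bytes(2**24)"       # calloc: zeroing deferred to the OS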

Note: while “minor”/“soft” page faults are much cheaper than “major”/“hard” faults (which require reading data up from disk), resolving a minor fault can still cost thousands of cycles in the kernel. They’re far from “free”. Not having a virtual address in a process’s page table at all is a whole different level of pain than the all-done-by-HW process of searching the cache hierarchy for an address that has been mapped into the process’s RAM.
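You can get a feel for that cost from Python with an anonymous mmap; a rough illustrative sketch (assumes 4 KiB pages; the loop overhead is identical in both passes, so the difference is mostly fault handling):

# First touch of each page takes a minor fault; the second pass does not.
import mmap, time

N = 64 * 1024 * 1024
m = mmap.mmap(-1, N)  # anonymous mapping; pages not faulted in yet

t0 = time.perf_counter()
for i in range(0, N, 4096):
    m[i] = 1          # first touch: one minor fault per page
t1 = time.perf_counter()
for i in range(0, N, 4096):
    m[i] = 2          # pages now resident: no faults
t2 = time.perf_counter()

print(f'first touch: {t1 - t0:.4f}s  re-touch: {t2 - t1:.4f}s')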

1 Like

Assuming you are on Linux, install the perf tool and let it look at the system while you run the benchmark.
Also, strace may give clues about which system calls are made at the different sizes.

I was wondering if it could be the malloc arena size, but that seems to be 128 kB, which is too small.
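Concretely, the perf/strace suggestion might look something like this (hypothetical invocations; mmap/munmap/brk are the calls to watch):

$ perf stat -e page-faults python3 -m timeit -s "b = bytearray(2**20)" "bytes(b)"
$ strace -c -e trace=mmap,munmap,brk python3 -c "for _ in range(100): bytes(bytearray(2**20))"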

Here’s one: modern systems - even just “consumer grade” - are exceedingly complex. Code is easier to write in Python, but so far as timing goes, using Python adds yet another layer of complexity to try to out-think.

There’s really nothing Python can do to hide the hardware realities of the extreme relative expense of accessing a hard drive, page faults, cache misses, or even mispredicted branches.

Well, that’s not entirely true: we could go back to Python’s very early days, long before its first public release. At first, even locals were looked up by string name in a dict. No matter what you tried to time, you were measuring Python’s dict lookup speed - and you stuck to one-letter names all the time to cut that expense :wink:.

My Windows box right now is running several thousand threads, almost none of which I intentionally started. 4 physical cores pretending to be 8 via “hyperthreading” tomfoolery, active threads frequently migrate among cores as the OS tries to keep CPU temps down, and likewise the OS changes CPU clock frequencies to keep temps down. If Chrome is running (which it almost always is), I consider myself lucky to get timings within 10% across two runs of a test case. Even then, if I try again a few minutes later, it may run twice as fast. Or twice as long.

Benchmarking on modern boxes is an exercise in unbounded optimism :wink:.

6 Likes

Quote of the day, right here.

5 Likes