Allow bytearray creation without zeroing

I use bytearrays extensively to receive data from a socket connection using recv_into, then process the data and write it to disk.
For each new operation I need a new empty bytearray(int) of around 1MB to fill, but I don’t care about its initial contents, since it will be overwritten with data from the socket.
I noticed that during the creation of a bytearray(int) object, the buffer is memset to 0.
This operation is quite expensive.
I created a C-API extension that creates a bytearray(int) without zeroing, and it is orders of magnitude faster.

Would there be interest in making this somehow available from the regular bytearray(int) call (for example with an extra parameter initialized=True)?

Why? The point of recv_into is to use a buffer that already exists (for example, one that you allocated once before repeatedly receiving data), so that you receive into the same buffer every time.

If that doesn’t suit your use pattern, that’s why there’s also .recv, which lets the C layer allocate and then read in.
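To make that reuse pattern concrete, here is a minimal sketch (the `recv_exactly` helper name is my own, not from the thread) of filling one preallocated buffer over and over; the demonstration uses a local socketpair in place of a real network connection:

```python
import socket

def recv_exactly(sock, buf):
    """Fill the whole preallocated buffer from sock via recv_into."""
    view = memoryview(buf)
    total = 0
    while total < len(buf):
        n = sock.recv_into(view[total:])
        if n == 0:
            raise ConnectionError("socket closed before buffer was full")
        total += n
    return total

# Demonstration with a local socketpair; a real client would connect out.
a, b = socket.socketpair()
buf = bytearray(5)           # allocated (and zeroed) once, then reused
a.sendall(b"hello")
recv_exactly(b, buf)
print(bytes(buf))            # b'hello'
a.sendall(b"world")
recv_exactly(b, buf)         # same buffer, no new allocation
print(bytes(buf))            # b'world'
```

With this pattern the zeroing cost is paid once, not per receive, which is the point being made above.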

I get results like yours if I ask timeit to do a thousand iterations. To be clear, the complaint is that it takes an extra 19 microseconds to prepare to receive 800 kilobytes of data into a mutable buffer?
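For reference, that measurement can be reproduced with timeit; absolute numbers are machine-dependent, but the thread saw figures on the order of 18-19 microseconds:

```python
import timeit

# Time the allocation (including zeroing) of an 800 KB bytearray.
per_call = timeit.timeit("bytearray(800_000)", number=1000) / 1000
print(f"bytearray(800_000): {per_call * 1e6:.1f} usec per call")
```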

Can you download anywhere near 40 gigabytes per second on this socket connection of yours?

In fact, you don’t need to do any timing to think about the problem clearly. Are you seriously concerned that the time the C layer spends writing useless zeros directly from the CPU into RAM will be significant compared to the time it takes to get the actual useful values from the Internet?
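The arithmetic behind the rhetorical question above: zeroing 800 KB in roughly 19 microseconds corresponds to a memory-write rate of about 42 GB/s, far beyond any realistic network throughput:

```python
size_bytes = 800_000
zero_time_s = 19e-6                       # the measured zeroing cost
implied_rate = size_bytes / zero_time_s   # bytes per second
print(f"{implied_rate / 1e9:.1f} GB/s")   # ~42.1 GB/s
```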

You will not find a lot of support for any kind of “uninitialized memory for performance” feature at the language or standard-library level, I suspect. However, if this kind of speed is somehow important to you for some other purpose, you can get a lot of the way there with Numpy by using the empty function to create a Numpy array without initializing the underlying storage. A 1-dimensional array with a dtype of 'B' (i.e. np.uint8) should suit your purposes (I’m pretty sure it implements the buffer protocol).
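A minimal sketch of that suggestion, assuming NumPy is installed; note that the contents of an np.empty array are arbitrary, so the buffer must never be read before it is filled:

```python
import numpy as np

# np.empty allocates without initializing the underlying storage.
buf = np.empty(800_000, dtype=np.uint8)

# The array supports the buffer protocol, so recv_into can fill it.
view = memoryview(buf)
view[:5] = b"hello"          # stand-in for sock.recv_into(view)
print(bytes(buf[:5]))        # b'hello'
```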


Some more context:
After the data is fully downloaded, we need to process it (outside the GIL) and write it to disk, so we cannot reuse the same buffer, or at least not until that processing is finished. So we opt to create a new buffer each time rather than copying the data into a new buffer to process it.
Note that there will be anywhere between 10 and 200 connections active at the same time.
Using regular recv will create a buffer for each chunk of data (since 800kb is rarely received in one go) that we then still need to stitch together, spending more time on memory copies.
Quite a lot of CPU time is spent on decryption of the SSL data by the SSLSocket and on the processing of the data that we do ourselves (both outside the GIL), so we would like to reduce any other work where possible.
Using profilers we identified the bytearray allocation to take a small but measurable portion of the whole process.

Understood! I will rely on our C-extension helper function.
I was just wondering whether others might have a similar use case, but I understand it is probably limited.


How large is that portion?

You said that you need a new empty bytearray(int) of around 1kb to fill. How much time does that take? 20-100 microseconds? On my computer, in Python 3.10, bytearray(800_000) takes 18.5 microseconds, and bytearray(1000) takes 0.1 microseconds. Is your computer 1000 times slower than my old computer?

Now, how much time does it take to receive data from a socket connection using recv_into, then process the data and write it to disk? I am afraid that it is orders of magnitude more than 0.1 microseconds, or even than 100 microseconds.

I tried to create a somewhat realistic test that reflects the real-world timings:
Using a simple SSL server socket, for which I set the buffer size to 1MB using setsockopt so we can send the 800kb of test data in one sendall command.
For the client we use our patched version of recv_into, which can read all 800kb of data in one function call, whereas the regular SSLSocket.recv_into can only fetch 16kb per call. See: does a GIL round-trip for every 16KB TLS record · Issue #81536 · python/cpython · GitHub (the title says it’s a GIL round-trip, but it actually just returns after every 16kb of data instead of reading all of it, as described later in the issue).
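The buffer-size tweak described above looks roughly like this (a sketch; the 1 MB value is from the post, and the kernel may round, double, or cap the requested size, for example to Linux’s net.core.rmem_max limit):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request 1 MB socket buffers so ~800 KB can be in flight at once.
# The OS may adjust the value (Linux, for instance, doubles it,
# and silently caps it at the system-wide maximum).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 20)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```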

When doing the timings, I get:
bytearray(800_000): 18.2 usec
unlocked_ssl_recv_into: 414 usec
decoding: 110 usec

So, of our total time of 542 usec, 3.3% is used on preparing the bytearray.
If we use the uninitialized version it takes 2 usec, only 0.4% of the total time.
Small, but measurable.
In addition, creating the bytearray is an operation that holds the GIL, while the other two operations do not.
This test of course excludes writing to disk, but that is also done outside the GIL.

I understand that it is likely not useful for other users; I just wanted to provide the context.

EDIT: I see that I wrongly wrote in my original first post 1KB while I meant 1MB. Apologies!