PyBytesWriter definitely looks a lot neater to me than using the existing _PyBytes_Resize pieces. I think it could be interesting as a backing for bytearray generally, which would also let the "efficiently oversize the grow buffer on repeated appends" logic be de-duplicated a bit. It still doesn't allow going from a "mutable array of bytes" to bytes in pure Python code, though: asyncio, encodings.punycode, zipfile, and _pyio all make a bytearray, extend/resize/mutate it, and then at the end of the function convert the bytearray to a bytes that is returned to the caller.
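The pattern described above can be sketched in a few lines; this is a simplified illustration (not the actual stdlib code), where the final conversion pays a full copy just to change the type:

```python
# Simplified version of the build-then-convert pattern used in asyncio,
# _pyio, zipfile, etc.: accumulate into a bytearray, then copy to bytes.
def build_payload(chunks):
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)   # amortized O(1) appends into an oversized buffer
    return bytes(buf)       # O(n) copy solely to produce an immutable bytes

result = build_payload([b"header", b"body", b"trailer"])
```

That final `bytes(buf)` copy is exactly the operation the thread is discussing how to eliminate.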
Agreed; I'm not sure there's a straightforward path for that with the overloads BytesIO has today. In my case I'm also specifically looking at "how does _pyio.BytesIO get faster?" Right now the large file test runs into two limitations: (a) it can't resize in place, and (b) it can't convert the internal bytearray to bytes without a copy (and it can't use bytes internally, since at the Python level bytes is immutable). It's one of the slower individual tests, especially on older/smaller hardware, and I suspect it adds up to quite a bit of CI runtime.
Both of those are part of why I'm proposing just the bytearray.take_bytes([n]) subset of the idea today. It simply removes the final copy, which enables a significant speedup for cases in asyncio, _pyio.BytesIO, etc. As other pieces (e.g. PyBytesWriter) are standardized, it can potentially be made even more efficient. There are some tradeoffs around "can a large single-reference bytes be put into a bytearray without copying"; I'm not sure how common that is.
bytearray.take_bytes([n]) can be implemented lazily on copy-out, as @encukou suggested, which keeps the change much more localized. We'd need to be careful around the extend-buffer code that currently "realigns" the data to remove leading padding, but I think that's very solvable; it means "start with enough padding at the front for the zero-copy out". In time I think the implementation could move to PyBytesWriter underneath, which would simplify it further, but that's largely independent. take_bytes([n]) is useful as a small step that lets code state its intent, enabling an optimization both with and without that migration.
It feels like, overall, there's a preference for adding new abstractions and using them rather than extending or modifying bytearray?
That has the secondary implication, to me, of "modify existing code to use the new / more efficient abstractions" rather than adding a new API here. If that's the case, would it be reasonable to make an issue and work on migrating existing code from bytearray to _io.BytesIO where it measurably improves performance?
I'm happy to work on BytesIO improvements so it covers more cases, if there's a particular API or case in mind. It sounds like .truncate() tweaks and/or a .reserve() are, at a high level, what's wanted?
To summarize (if for nothing else, my own future reference):
Shelving .take_bytes([n]) for bytearray until at least PEP 782 / PyBytesWriter lands. I'll look at measuring and migrating code to io.BytesIO, and study the tradeoffs / what it looks like to use it for the other cases where bytearray is commonly used, including individual ideas for modifying BytesIO (e.g. truncate changes).
For constructing and manipulating large blocks of bytes (4 KB - 1 GB+), there are two main options currently:
_io.BytesIO: allows getting the underlying bytes with a minimum number of copies in most cases; its file-like API works well for repeated read/write, but is less ideal for "make a block of memory and manipulate bytes in it" (e.g. building a file header or a network-protocol blob).
bytearray: allows manipulating a block of memory, changing/setting individual bytes and resizing easily, but requires a copy to get a bytes. That copy can dominate runtime when working with large buffers.
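The contrast between the two options can be shown side by side; a minimal sketch building the same 1 MiB block both ways (on CPython, BytesIO.getvalue() can avoid a final copy in common cases, while bytes(bytearray) always copies):

```python
import io

chunk = b"x" * 4096

# Option 1: _io.BytesIO -- file-like writes, minimal copies on getvalue()
bio = io.BytesIO()
for _ in range(256):
    bio.write(chunk)
via_bytesio = bio.getvalue()   # may avoid a final copy on CPython

# Option 2: bytearray -- easy in-place mutation and resizing
ba = bytearray()
for _ in range(256):
    ba.extend(chunk)
ba[0] = ord("y")               # direct byte-level mutation, awkward in BytesIO
ba[0] = ord("x")
via_bytearray = bytes(ba)      # always one extra O(n) copy
```

Both produce the same result; the difference is where the copies land and how natural byte-level manipulation is.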
Improving both of those to be more comprehensive would be nice. It's currently impossible to implement BytesIO-like functionality in pure Python (e.g. _pyio) without directly using the Unstable C API; in particular, there's no way today to allocate a block of memory with direct access to manipulate the machine bytes and then convert/finish it into a bytes without copying. PEP 782 will improve this situation!
Individual ideas for changes/improvements to BytesIO so it can cover even more cases are welcome today. Measurable performance improvements from moving code from bytearray to BytesIO may be okay. A more efficient _pyio.BytesIO implementation isn't viable at the moment and isn't somewhere to focus effort in CPython.
In any case, we cannot guarantee that an optimization like the one in BytesIO works on non-reference-counting implementations (e.g. PyPy or Jython). This is a CPython-specific optimization, like the optimization for in-place string concatenation.
We (CPython developers) obviously cannot guarantee anything about PyPy or Jython (they can even choose to violate the Python spec if they want), but it's still reasonable to expect them to provide the same zero-copy semantics as CPython would for a hypothetical buffer-detaching operation.
Yes, they can implement detach(), but the optimizations for the constructor and getvalue() cannot be fully implemented without reference counting.
It is possible to implement it only partially. For example, the constructor may have O(1) complexity, but the first modification will have to make a copy, even if the original bytes object is no longer referenced from outside.
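The partial scheme described above amounts to copy-on-first-write; a minimal sketch of what an implementation without reference counting could do (class and method names here are hypothetical, purely for illustration):

```python
# Copy-on-first-write sketch: O(1) construction from bytes, with the copy
# deferred until the buffer is actually mutated. Without refcounts, the
# implementation can't know the caller dropped its reference, so the first
# write must always copy.
class LazyBuffer:
    def __init__(self, data: bytes):
        self._data = data      # O(1): just hold a reference, no copy
        self._owned = False    # may still be sharing the caller's bytes

    def getvalue(self) -> bytes:
        # Zero-copy while unmodified; one copy back to bytes once owned.
        return bytes(self._data) if self._owned else self._data

    def write_at(self, index: int, value: int) -> None:
        if not self._owned:
            self._data = bytearray(self._data)  # copy on first write only
            self._owned = True
        self._data[index] = value

buf = LazyBuffer(b"abc")
assert buf.getvalue() == b"abc"   # no copy has happened yet
buf.write_at(0, ord("X"))         # triggers the one deferred copy
```

CPython can skip that first copy when the refcount proves the bytes is unshared, which is exactly the part other implementations can't replicate.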
People are also good at figuring out how to optimize even under a lot of constraints, in my experience; it just takes time and will. If it keeps coming up in profiles, slowly work out what the underlying problems are and ideas to address them.
Bytes slicing, and conversion of a memory allocation / malloc'd buffer / bytearray to bytes, is more expensive in CPython because PyBytesObject must always store its bytes (ob_sval) inline. That means a new PyVarObject head has to be allocated, with space for the data, and the data copied into it every time; a head can't be reused repeatedly (for cases like repeatedly yielding slices in a loop). We could build new abstractions that capture that, but it's a very different optimization point and hard to adapt CPython to without breaking a lot of other important established guarantees. The tradeoffs differ from C++, for instance, where std::string tends to store small strings inline and longer ones in a second allocation. For short strings, never having to do an extra memory hop can be really efficient, and that makes a lot of code work well; memoryview exists for slicing large objects without copying bytes when needed. Lots of interesting ways to optimize.
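The memoryview escape hatch mentioned above looks like this in practice; a small example showing that memoryview slices stay backed by the original buffer, so no data is copied until you explicitly materialize a bytes:

```python
# memoryview slicing is zero-copy: the slice is a view over the original
# buffer, not a new allocation of the data.
data = b"0123456789" * 100        # a larger immutable buffer
view = memoryview(data)

chunk = view[10:20]               # zero-copy slice of the big buffer
assert chunk.obj is data          # still backed by the original object
materialized = chunk.tobytes()    # only here does a copy happen
```

This covers the "repeated slices in a loop" case without paying a PyBytesObject allocation per slice, at the cost of the consumer needing to accept a buffer-protocol object instead of bytes.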
The big difference is that std::string is mutable (in particular, it can be arbitrarily resized), so it has to allow for a separately allocated data buffer.