Add zero-copy conversion of `bytearray` to `bytes` by providing `__bytes__()`

Well, I think BytesIO is almost there already thanks to getbuffer:

>>> bio = BytesIO()
>>> bio.write(b"x" * 10)
10
>>> bio.getbuffer()[1] = 42
>>> bio.getvalue()
b'x*xxxxxxxx'

The only thing lacking is a way to presize/enlarge a BytesIO without passing an actual bytestring to append. Currently, BytesIO.truncate doesn’t enlarge the buffer.

So perhaps we want to add a dedicated BytesIO.resize(length: int, bytefill: bytes | int = b'\0') method
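Until such a method exists, the gap can be worked around with today's API: seeking past the end and writing a single byte zero-fills the intermediate range. A small sketch of that presize trick (assuming no `BytesIO.resize()` exists):

```python
import io

# Presize a BytesIO by seeking past the end and writing one byte;
# BytesIO zero-fills the gap between the old end and the write position.
bio = io.BytesIO()
bio.seek(1024 - 1)
bio.write(b"\x00")      # buffer is now 1024 bytes, zero-filled
view = bio.getbuffer()
view[0:5] = b"hello"    # in-place edit of the presized region
view.release()
```

A dedicated `resize()` would make the intent explicit and skip the dummy write.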

Hmm, really? How come the address stays the same even if you enlarge the buffer to an arbitrarily large size?

Agreed. From my perspective both BytesIO and bytearray need the same fundamental engineering (ex. efficient non-linear resizing, import/export of bytes, construction from a buffer, offsets into a referenced contiguous buffer, …); they just present two different API styles:

  • BytesIO presents a file-like API for interaction.
  • bytearray presents a C “manually resizable block of bytes” + sequence / bytes operations.

I mentally model BytesIO as wrapping a bytearray (_pyio does exactly that). It could definitely be implemented so that bytearray uses BytesIO instead. The primary constraints I have been watching out for are that BytesIO inherits / brings code from IOBase, while bytearray is exposed in the C Stable API. I have yet to find, and am still hoping for, a good path to de-duplicating…

bytes is also very similar, but fully prevents mutation outside the C API _PyBytes_Resize. That C API mutation is relied on by _io.BytesIO and _io.FileIO.readall today. Pure-Python code cannot implement those as efficiently: CPython performs zero-copy resize and mutation of bytes as an implementation detail, but Python code can't access or rely on that behavior by design; instead it is pointed to bytearray for those needs.

getbuffer() has some caveats: exports must be deleted before the object can be resized (or .close()d), and they can't create bytes zero-copy. I definitely think a number of cases would be best served by just moving to BytesIO. But a lot of code, such as protocol bindings, is designed around a "mutable block of bytes" C-like structure, the API bytearray fills, and rewriting all of that to get the performance is a lot more work to me than moving to a bytes inside bytearray, which brings it closer to BytesIO, makes things faster by default, and, with a simple API addition from this thread (.take_bytes([n])), brings Python-native "read all" loops in an optimized build within 10% of the hand-rolled C FileIO.readall() (I have that implementation locally). We could definitely follow the path of "educate to use BytesIO" and discourage bytearray; I'm just not sure how you'd implement a _pyio.BytesIO efficiently in that case.
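The first caveat is easy to demonstrate. A small sketch (in CPython, a write that needs to grow the buffer fails while a `getbuffer()` export is alive):

```python
import io

bio = io.BytesIO(b"hello")
view = bio.getbuffer()
view[0] = ord("H")          # in-place mutation through the export works
blocked = False
try:
    bio.write(b" world")    # growing the buffer while exported fails
except BufferError:
    blocked = True
view.release()              # once released, resizing works again
bio.seek(0, io.SEEK_END)
bio.write(b" world")
```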


From Memory Management — Python 3.13.2 documentation

Resizes the memory block pointed to by p to n bytes. The contents will be unchanged to the minimum of the old and the new sizes.

If the request fails, PyMem_Realloc() returns NULL and p remains a valid pointer to the previous memory area.

Definitely possible that the documentation has diverged from in-practice implementations. It is somewhat different from man 3 realloc on my Linux box, which includes "if the area pointed to was moved, a free(ptr) was done". _PyBytes_Resize has a different API shape: it takes a PyBytes ** so it can modify the caller's pointer directly.

Well, “if the request fails” is the critical condition here. The original pointer will (probably) not be valid anymore if the resize request succeeds.

Indeed, the idea is that you call getvalue() to get the final bytes object. IIRC, that is zero-copy.


Slightly different path: the bytes CPython code already considers a bytes object to be mutable under specific circumstances (if (Py_REFCNT(op) == 1 && PyBytes_CheckExact(op))). What if, in that specific case, where the bytes can be (and is) modified for efficiency in the CPython implementation and via the C API, creating a writable memoryview() / Py_buffer from the bytes would succeed (so you can assign / set ranges of bytes)?

At that point, I think bytes has everything needed to efficiently back _pyio.BytesIO directly; the need for a .resize() becomes bytes.ljust(), which parallels str.ljust(), and can be a later step (bytes += b'\0' * pad works but is measurably slower than .ljust() or .resize()).
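For reference, the `ljust()` padding pattern looks like this (a sketch; the concatenation form is the slower equivalent the post mentions):

```python
# bytes.ljust() as the "resize" primitive: pad out to a target
# length with NUL bytes in a single operation.
data = b"abc"
padded = data.ljust(10, b"\x00")

# the equivalent, but per the post measurably slower, form:
also_padded = data + b"\x00" * (10 - len(data))
```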

Efficiency then relies on implementation-specific copy elimination, but all those copies are removable (and CPython removes them) while matching the API requirements; optimizing bytes to not copy unnecessarily should improve performance more broadly across implementations.

I prototyped this locally and it seems to work. The bytes change is very minimal (make a new helper bytes_is_mutable and change the Py_REFCNT(op) == 1 instances to use it, then change bytes_buffer_getbuffer to pass the negation of bytes_is_mutable as the readonly parameter of PyBuffer_FillInfo).

That’s an optimization related to the current reference-counting implementation. It should not affect behaviour: to the user, this mutation should be equivalent to destroying the old object and creating a new one (at the same address, coincidentally).

Wouldn’t you also need to ensure that the refcount stays at 1 while a mutable buffer is exported? That would be hard; PyBuffer_FillInfo itself does an incref.


"At the same address" isn't quite right here: _PyBytes_Resize may reallocate and copy the data to a new object; that is used by PyBytes_Concat (although not by bytes_concat, which is used in PySequenceMethods).

bytes is a PyVarObject with its machine-byte storage inline, which seems to be a big part of why going from a raw buffer of machine bytes (bytearray currently) to it requires copying. bytearray is converted to bytes often, and the copy can take a while: typically io.DEFAULT_BUFFER_SIZE (was 8 KB, now 128 KB in main), sometimes multi-GB buffers (ex. the _pyio large_file_test). That copy is a large percentage of the runtime in code I've been measuring, which led to this particular rabbit hole.

I'm trying to find a set of Python operations that let me mutate a contiguous buffer of machine bytes and then return it as bytes, without needing to write a C extension (which is what _io.BytesIO does; the C API has tools that solve this). Duck typing (ex. returning a bytearray where bytes is expected) unfortunately breaks surrounding code.
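A sketch of the closest Python-level options today: mutation through a memoryview is zero-copy, but producing an actual `bytes` still copies, and returning a read-only view instead breaks `isinstance` checks in callers.

```python
# Mutating a contiguous buffer in pure Python is easy and zero-copy...
ba = bytearray(16)
view = memoryview(ba)
view[0:5] = b"hello"
view.release()

# ...but every path to an actual bytes object copies today.
out = bytes(ba)  # full copy

# A read-only view avoids the copy, yet is not a bytes instance,
# so duck typing breaks callers that require exact bytes.
ro = memoryview(ba).toreadonly()
```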

From my perspective the refcount doesn't have to stay at one; it is actually preferable that it increases, since other code reallocating/resizing the object while the memoryview exists wouldn't be good. That you can't get a mutable buffer twice is a little weird. Given that exporting a mutable buffer from bytes was always disallowed (no code today does it), I think this is a restriction change that won't break working code and allows solving the use case I'm looking at.

Once you export a writable buffer, the original bytes object is not constant any more – it can be mutated via the buffer. Such bytes objects couldn’t be hashable any more.
How do you prevent it from being used as a bytes?


That problem exists today with PyBytes stable C API usage both in CPython and by external projects. I don’t know of any protections around it other than the general “don’t overflow/underflow the buffer” protections in debug builds and compiler hardening in release builds.

It may make sense, at least as a compile option in debug builds, to add a memory protection (ex. mprotect) that ensures the buffer isn't modified, plus a way for C API users to indicate "this bytes is now ready for use" / immutable. There are a number of design/performance implications in trying to do that (to me at least, memory fragmentation concerns and "packing" bytes objects into pages, as mprotect-style protection is per memory page). Definitely, exposing it via memoryview in Python makes it easier to construct problematic cases than today.

Tracking "might be mutable" + "definitely immutable" in bytes would, I suspect, help prevent and possibly find bugs today; doing that without breaking / changing the PyBytes ABI and performance feels intricate but likely worthwhile. It would also help C API users write the code they intended and catch bugs that are hard to spot in code review.

_io.BytesIO in C does this via limited exposure of the underlying PyBytesObject and checking "refcount" and "exports". Keeping a "mutable bytes" inside bytearray lets bytearray match this, and the underlying bytes becomes immutable at the take_bytes([n]) API call. bytearray containing a bytes is still my preference for how to improve this use case, but I'm exploring different directions per the questions and to build consensus. I want to be certain, as it changes the implementation of a fundamental concrete Python object (per the C API docs terminology), and there is definitely reason to be cautious in adding new APIs.


To restate and expand a bit from earlier: the copy to go from bytearray to bytes is over 20% of the runtime of python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read on my Linux dev box with an SSD, and 90% of the performance delta from the C _io.FileIO.readall. GzipFile.readinto reads full file before copying into the provided buffer · Issue #128646 · python/cpython · GitHub is a recent community-contributed case with measurement. psycopg 3, as a sample community project, uses bytearray + memoryview to reduce copies in cases that might show up in loops. There is no zero-copy path from bytearray to bytes currently.

I think this comes up often enough in shipping code, and is enough of a performance delta, for the implementation change plus an extra API function to be worth it. I don't think migrating all bytearray code to BytesIO is a better path for all these cases, and in some it isn't viable. Some code would be better with BytesIO, but to me that shouldn't block building options that would improve all the cases. Moving bytearray to contain bytes is a +65 / -25 line code change in my current implementation.

I can't say I fully follow this whole thread, but it does seem that this is indeed the core issue, and solving it would be helpful in multiple places. And if so, then rather than trying to add magic to the C API (checking for only one reference, etc.), couldn't a new method be added to bytearray:

bytearray.to_bytes():

returns a bytes object with the contents of the bytearray, with no copying, while clearing the bytearray object.

equivalent to:

b = bytes(a_bytearray)
a_bytearray.clear()

but without a memory copy.

bytearray already has a clear() method – so this is no more dangerous – it'd be up to the caller to know if it was OK to clear it.

I know that it breaks the tradition of mutating methods like this returning None, but it seems to me to be a pretty obvious API for this use case.
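A runnable pure-Python reference for the proposed semantics (the method name comes from the post above; this emulation still copies, which is exactly what the real method would avoid):

```python
def to_bytes(ba: bytearray) -> bytes:
    # reference semantics for the proposed bytearray.to_bytes(): return
    # the contents as bytes and leave the bytearray empty; this version
    # copies, while the proposed C implementation would be zero-copy
    result = bytes(ba)
    ba.clear()
    return result

buf = bytearray(b"payload")
data = to_bytes(buf)
```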


An API for mutating immutable objects is on the record as being problematic; it wouldn't be added today. More importantly, the protection is this:

The data must not be modified in any way, unless the object was just created using PyBytes_FromStringAndSize(NULL, size).

(Just a note in the docs – but this is C; mutating a read-only buffer is also “only” banned like this.)

The rule is that you can’t modify the data once you expose the bytes object to Python code. That precondition doesn’t make sense for an argument to a buffer export function.

One possibility for such a zero-copy function is this:

  • It would “steal” its argument, so its caller can’t use it any more.
  • To make sure a bytes object can’t be retrieved from Py_buffer.obj (or memoryview.obj), it would set the type to a new class that has the same memory layout as bytes, but no functionality.
    • In ~Python 3.2 this would be safe (equivalent to destroying the bytes and creating a new object) but nowadays you’d need to check with faster-cpython and free-threading teams if there are new assumptions this would break.
  • To make the optimization transparent externally, in the refcount>1 case it would need to copy the data.

I don’t see a way to expose that as a Python function, as those can’t steal their arguments.
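This can be seen from Python itself with sys.getrefcount (a CPython-specific sketch; the function name is hypothetical): by the time a callee sees its argument, more than one reference already exists, so the refcount == 1 "safe to steal" condition can never hold for an argument.

```python
import sys

def hypothetical_take(obj):
    # even for a temporary passed directly into the call, the caller's
    # frame and this parameter both hold references, so a Python-level
    # function can never observe the refcount == 1 "stealable" state
    return sys.getrefcount(obj)  # note: getrefcount itself adds one more

refs = hypothetical_take(bytes(range(100)))
```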

If we require extension authors to call an API to mark bytes as "finished", we might as well use a "writer" pattern: start with a struct with the same memory layout as bytes; fill it up; then initialize the PyObject header.
Victor’s recent proposal for this was rejected for unrelated reasons.

Yeah. That side of the equation is pretty clear.

I should have said this earlier, but: thank you for looking into this!
Sadly, I don’t see a solution myself. I hope you arrive at one, and I hope that pointing out the issues I see at this stage is helpful.

bytes data directly follows the header, so this would require reserving a bytes header[1] in every bytearray. Or adding a pointer to every bytes + a pointer indirection to every operation on bytes.

Looks like some variation of this is a possibility:

That’s a trade-off between speed and memory usage; the diff size doesn’t matter that much.
I don’t know whether the trade-off is worth it, but, here I don’t see any important invariants broken!


  1. 30 bytes on a 64-bit box. Possibly only 24 bytes in the future. ↩︎


That sounds reasonable to me. I don’t think tiny bytearrays are very common [1], so the relative overhead should be negligible.


  1. ideally, they shouldn’t, at least ↩︎


I am surprised this discussion is still going on. The idea is incompatible with the existence of PyByteArray_AS_STRING(), which returns a writable buffer. Even if we remove PyByteArray_AS_STRING() or make it return an immutable buffer (either of which is a major breaking change), the optimization benefit will be lost in many other cases (for example, PyBytes_AsString() would have to make a copy if the refcount is not 1).


Much more minimal proposal now:

  1. bytearray contains bytes
  2. An explicit API that equates to .to_bytes() + .clear(); my current favorite of the three bike-shed names is .take_bytes([n]).

So there isn’t any implicit detach or need to keep track of extra references. With that, cases in asyncio and other code can remove the end of function copy if they measure and find it worthwhile. It would be nice to be able to do other cases, but as you pointed out there are complications.

Already, PyByteArray_AS_STRING() returns a non-modifiable block of bytes if the underlying size is 0 / default-initialized (there is a shared buffer instance in that case). The buffer location changes on resize today (see the discussion around memory allocation / PyMem_Realloc() a bit earlier in the thread), so the return of PyByteArray_AS_STRING() is already non-constant over the object's lifetime in the presence of resizing. Given that, PyBytes_FromStringAndSize(NULL, size) + _PyBytes_Resize(bytes, size) can, I think, provide what is provided today. The ob_start member used by PyByteArray_AS_STRING stays valid, and changes at the same times.
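The same lifecycle is visible from Python via ctypes (a sketch; addresses and whether a resize actually moves storage are implementation details and must not be relied on):

```python
import ctypes

def buffer_address(ba):
    # take a transient buffer export to read the storage address; the
    # export must be dropped before the bytearray can be resized again
    c = (ctypes.c_char * len(ba)).from_buffer(ba)
    addr = ctypes.addressof(c)
    del c  # release the export so the bytearray is resizable again
    return addr

ba = bytearray(b"abcd")
before = buffer_address(ba)
ba.extend(b"x" * (1 << 20))  # a large resize may move the storage
after = buffer_address(ba)
# `before` and `after` may differ; no caller may assume the address is stable
```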

From my perspective, bytearray tends to be 1024+ bytes in size (DEFAULT_BUFFER_SIZE is 128 KB), so the memory overhead of an extra PyVarObject header vs. a pure buffer is non-zero but not large. There are more writes to set more fields in the PyVarObject case, but measured against the copy of large buffers that is required today, I think it is better most of the time, particularly for large I/O blocks.

The memory layout of PyByteArrayObject gains a new pointer at the end. The ob_alloc field in it today technically becomes redundant (but I don’t remove it in my PR). So slightly bigger base object, but not a lot.


Nobody expects that the return of PyByteArray_AS_STRING() is constant over lifetime. But it is expected that you can modify the buffer returned by PyByteArray_AS_STRING() immediately after the call.

As for .detach() or .take_bytes(), the benefit only applies when you need to convert the whole bytearray object to bytes, and there were no overallocations (so it will not help after .extend() or +=), and there were no insertions or deletions. In all other cases – no benefit, only small overhead.

I believe the current proposal matches current behavior. If there is a specific case you'd like to look at, I'm happy to step through it and double-check. It is good to validate that changes work as intended.

bytearray.resize() + no-copy .detach() at end of function is significantly cheaper than the existing required copy to bytes. Returning all or most of a bytearray copied to a bytes at the very end of a function is common in the examples found by other community members earlier in this discussion thread.

.extend() and += are both slow relative to bytearray.resize() (https://github.com/python/cpython/pull/129560#issuecomment-2635841580) + os.readinto() / the writable buffer protocol. .resize() + .readinto() is the preferred / "fast" code pattern; .extend() and += both require more memory allocation and more data copies. Even in .extend() or += code, this proposal allows removing the copy at the end of the function that returns bytes, enabling a speedup without breaking current code. The goal is to make a number of common cases better.
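A sketch contrasting the two growth patterns (a preallocated bytearray stands in for the proposed .resize(), since that method is not in released Python; function names are illustrative):

```python
import io

SRC = b"x" * (1 << 20)

def via_concat(chunk=65536):
    f = io.BytesIO(SRC)
    buf = bytearray()
    while data := f.read(chunk):
        buf += data          # repeated reallocation plus a copy of every chunk
    return bytes(buf)        # plus the final copy to bytes

def via_readinto(chunk=65536):
    f = io.BytesIO(SRC)
    buf = bytearray(len(SRC))  # preallocated; the proposed resize() would grow in place
    pos = 0
    while n := f.readinto(memoryview(buf)[pos:pos + chunk]):
        pos += n
    return bytes(memoryview(buf)[:pos])  # the copy a take_bytes() would remove
```

Both produce the same result; the readinto form avoids the per-chunk copies, and the proposal targets the remaining final copy.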

Part of the reason for .take_bytes([n]) over .detach() is that a bytearray is commonly used to build a buffer until a "marker" byte/event (ex. newline), then split in two, returning the portion before the marker as bytes. take_bytes([n]) simplifies calling code in that case compared to the current slice + copy or the alternative .detach() proposal, and lets the implementation decide which part is fastest to copy into a new allocation. An allocation and a copy are required because there need to be two distinct blocks of machine bytes at the end of the function. In that case, I believe it will still be faster than the alternatives, but I haven't implemented and measured it.
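A hypothetical pure-Python emulation of that marker-split use case (names taken from the thread; this version copies once, which the proposed C method could avoid in the whole-buffer case):

```python
def take_bytes(ba, n=None):
    # emulate the proposed bytearray.take_bytes([n]): return the first n
    # bytes as an immutable bytes and drop them from the bytearray
    if n is None:
        n = len(ba)
    out = bytes(memoryview(ba)[:n])  # one copy; the proposal removes it when n == len(ba)
    del ba[:n]
    return out

buf = bytearray(b"first line\nrest")
line = take_bytes(buf, buf.index(b"\n") + 1)
```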

If there is an "offset" in the bytearray, it would also require a copy. I have not looked at that case much, and any implementation would need to copy to make a bytes then. My personal preference is to disallow it / raise an exception (encouraging an equivalent explicit copy such as bytes(ba)), as that helps express design intent and guards it against unintended breakage by future changes. For reference, the bytes in both os.read and _io.FileIO.readall today are generally over-allocated at the start of the function by at least 1 byte beyond the expected size, then use _PyBytes_Resize to reduce to the "actual size" at the end of the loop. In the offset case, the proposal adds no slowdown.

An offset (ob_bytes != ob_start) is an internal detail, it should not affect the visible behaviour.

But, you could:

  • make bytearrays normally have an “offset” big enough to hold a bytes header (incidentally making ba[:0] = b'data' cheaper, however insignificant it is to optimize that)
  • if take_bytes finds it has enough space (and alignment), it can skip a copy and fill in the bytes header

noted: re:offset, makes sense.

The draft PR I made does the inline bytes-header implementation you suggest, although it doesn't do it lazily :). It does mean that bytearray(generate_bytes()) doesn't have to memcpy into the bytearray storage if it holds the only reference to the bytes, which speeds up some cases.

I think in time, with the move to bytes as the machine-byte storage inside bytearray, the bytearray constructor (and potentially other methods) could be refactored to defer to bytes directly more of the time, reducing code a bit and meaning that improving one improves both. Currently, code sharing happens by way of stringlib.