Motivation
Code that works with binary data and network protocols often crafts a specific byte sequence in a `bytearray` and then converts it to a `bytes`.
This currently requires a memcpy to create the `bytes` from the `bytearray`, as well as 2x the memory, since both objects must exist at the same time. This pattern appears in asyncio, base64, urllib, and other parts of the CPython standard library. Some third-party libraries avoid the copies by using `bytearray` everywhere. Since the `bytearray` no longer needs its copy of the data, it would be nice to get a `bytes` out without a copy.
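For concreteness, here is a minimal sketch of that pattern (the names are illustrative, not from any particular stdlib module); the final `bytes(buf)` pays a full memcpy:

```python
import socket

def read_exactly(sock: socket.socket, length: int) -> bytes:
    buf = bytearray()
    while len(buf) < length:
        chunk = sock.recv(min(4096, length - len(buf)))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    # Today this copies the whole buffer: both buf and the
    # returned bytes are alive at the same time.
    return bytes(buf)
```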
Proposal
I propose adding a way to remove the copy by:
- Changing `bytearray` to use `PyBytesWriter` from PEP 782
- Adding a `.take_bytes([n])` method
The first step makes the bytes stored in a `bytearray` convertible to a `bytes` without copying, using a supported mechanism; the second step exposes that to Python code. This makes it possible to resolve issues like gh-60107, and moves `bytearray` off the `_PyBytes_Resize` API, which PEP 782 soft-deprecates.
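Under the proposed API (a sketch; `.take_bytes()` does not exist yet), the conversion becomes a move rather than a copy:

```python
buf = bytearray()
buf += b"header|"
buf += b"payload"

data = buf.take_bytes()  # proposed: the storage moves to the bytes, no memcpy
assert data == b"header|payload"
assert len(buf) == 0     # the bytearray is left empty
```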
Sample Implementation
This adds `.take_bytes([n])` and migrates some CPython library code paths to use it. Running pyperformance against my branch shows no significant slowdowns in code that uses `bytearray`, while some paths show measurable improvements (e.g. asyncio_tcp (asyncio streams) 1.03x faster, regex_dna (regex) 1.08x faster). Code that uses very large buffers tends to speed up more; for example, the 4 GB buffers in `_pyio` exercised by test_io.test_largefile no longer get copied, more than halving the runtime.
The branch increases the size of `bytearray` by one pointer, while leaving the `PyBytesWriter` lazily allocated, much like the current `bytearray` storage is. There are cases where some copying is still needed (e.g. when the `bytearray`'s data is offset within its buffer, which is resolved via a memmove). The "remaining" bytes after a split must always be copied to a new `PyBytesWriter`.
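My reading of those two cases, sketched in Python (assumed behavior, not a guarantee of the implementation):

```python
ba = bytearray(b"0123456789")
del ba[:2]               # just offsets ba's start pointer, no copy
data = ba.take_bytes()   # assumed: a memmove fixes up the offset,
                         # then the storage moves out un-copied
assert data == b"23456789"

ba = bytearray(b"0123456789")
head = ba.take_bytes(4)  # the prefix moves out with the storage...
assert head == b"0123"
assert ba == b"456789"   # ...the remainder is copied into a new
                         # PyBytesWriter held by ba
```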
Acknowledgments
This idea evolved in the thread Add zero-copy conversion of `bytearray` to `bytes` by providing `__bytes__()`. Thanks @methane and @blhsing for coming up with the `.take_bytes([n])` API shape and motivating Python examples for it.
When would I use this?
Any code that builds a `ba = bytearray()`, modifies it, then calls `bytes(ba)`. Other common patterns are `bytes(ba); ba.clear()` or `bytes(ba[:n]); del ba[:n]`; see the sketches below. Note that if you want to discard data past a point, the most efficient pattern becomes `ba.resize(n)` followed by `ba.take_bytes()`, so that `take_bytes` doesn't need to keep the soon-to-be-discarded extra bytes around.
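Sketches of those replacements, again assuming the proposed method:

```python
ba = bytearray(b"HEADERpayload")

# bytes(ba); ba.clear()  ->  one move of the whole buffer:
whole = ba.take_bytes()

ba = bytearray(b"HEADERpayload")

# bytes(ba[:6]); del ba[:6]  ->  one move of the prefix:
header = ba.take_bytes(6)  # b"HEADER"; ba keeps b"payload"

# Keep only the first 4 remaining bytes, dropping the rest first:
ba.resize(4)               # bytearray.resize() is new in 3.14
kept = ba.take_bytes()     # b"payl"; the discarded bytes were
                           # never copied anywhere
```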
Why `.take_bytes([n])`?
The proposal started as just "convert the whole buffer", but in a number of byte-stream-processing use cases, such as TCP streams and console I/O, the code would often "process to a marker", then split the buffer at the marker n bytes in. `.take_bytes([n])` enables that use case (a sketch follows) while keeping it simple to take the whole buffer.
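A minimal "process to marker" sketch, here a line reader over a receive buffer, assuming the proposed method:

```python
def pop_line(buf: bytearray) -> bytes | None:
    """Pop one newline-terminated line off the front of buf, if present."""
    idx = buf.find(b"\n")
    if idx < 0:
        return None
    return buf.take_bytes(idx + 1)  # the line moves out; the rest of
                                    # the stream stays in buf
```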
The n index supports both positive (from the start) and negative (from the end) values, but always takes the portion of the buffer before the split point. The portion after the split point, if any, must always be copied into a new `PyBytesWriter`. To take just the end of a `bytearray`, delete the start of the `bytearray` first; `bytearray` already handles this efficiently by just advancing its internal "start" pointer within the buffer.
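A sketch of those index semantics (assumed behavior):

```python
ba = bytearray(b"payloadCRC!")

body = ba.take_bytes(-4)  # negative n counts from the end
assert body == b"payload"
assert ba == b"CRC!"      # the 4-byte tail was copied, not moved

# To take just the tail, shrink from the front first:
ba = bytearray(b"prefix|tail")
del ba[:7]                # cheap: bytearray advances its start pointer
tail = ba.take_bytes()    # b"tail" moves out without a copy
```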
I need more stream-like reading and writing
Use `io.BytesIO`! It already does a similar optimization, copying only when needed.
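For instance, in CPython today a `BytesIO` constructed from a `bytes` object shares that buffer until the first mutation (a copy-on-write optimization):

```python
import io

stream = io.BytesIO(b"one two three")  # shares the initial bytes,
                                       # no copy on construction
assert stream.read(4) == b"one "       # incremental, stream-style reads
stream.seek(0)
stream.write(b"ONE")                   # the first write triggers the copy
assert stream.getvalue() == b"ONE two three"
```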