file.read / os.read always allocates and returns a new immutable bytes object, which is then often copied into the bytearray
Convert the data-filled bytearray to a bytes using bytes(buf).
At the moment there doesn’t seem to be a way to do step 3, from bytearray to bytes, without allocating 2x the size of buf, at least temporarily, and copying the data from the bytearray to the new bytes object. There is fast-path code (bytearray implements the Buffer Protocol), but that still requires two buffers of buf’s length to be allocated (the bytearray and the new bytes) plus a memcpy.
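For concreteness, the pattern looks roughly like this (a sketch; the file name is hypothetical):

buf = bytearray(4096)
with open("data.bin", "rb") as f:   # hypothetical input file
    n = f.readinto(buf)             # fill the bytearray in place
del buf[n:]                         # trim unused capacity
data = bytes(buf)                   # step 3: a second buffer of the same size + memcpy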
I think the extra space + copy could be eliminated in some cases by:
Have bytearray internally use the Bytes Objects Stable C API to allocate the buffer it points to. This means the buffer is / can become a valid immutable bytes object, but is treated mutably as a single-owner, contiguously allocated buffer, as the C API allows today (ex. Modules/_io/fileio.c relies on this in main).
Use the unstable _PyBytes_Resize API to resize the buffer to the desired capacity when needed for bytearray’s existing API.
Provide __bytes__() on bytearray which, if there are no views (memoryview) of the bytearray, only one reference, and no offset inside the bytearray, returns the bytes storage resized from capacity down to the actual size (bytearray often has a capacity larger than its size). In other cases, __bytes__() would allocate a new bytes and copy the data into it, same as today.
This would slightly change the ABI / layout of PyByteArrayObject, as I think it would need an additional PyObject * pointing to the start of the bytes object. The ob_bytes and ob_start members could, I think, be kept and work as they do today; there would just be a PyBytesObject header immediately before ob_bytes, pointed to by the new member.
Can you elaborate on your actual use case? From the 3 steps above it seems as if your use case is simply to read from a file into a bytes object, in which case you can use file.read/os.read instead.
If you just need a read-only bytes-like object of an existing bytearray a, you can also use a.__buffer__(0).toreadonly() to obtain a read-only memoryview of the bytearray’s buffer.
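A minimal illustration (equivalent to memoryview(a).toreadonly()):

a = bytearray(b"abc")
ro = a.__buffer__(0).toreadonly()   # read-only view, no data copied
assert ro[0] == 97
# ro[0] = 0 would raise TypeError: cannot modify read-only memory
# Note: while any view is alive, the bytearray cannot be resized.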
Trying to do an efficient “read all” loop. One from _pyio (the pure-Python implementation of _io):
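That loop has roughly this shape (a sketch, not the exact _pyio source):

def readall(self):
    result = bytearray()
    while data := self.read(DEFAULT_BUFFER_SIZE):  # a new bytes allocated per chunk
        result += data       # realloc the bytearray + copy the chunk into it
    return bytes(result)     # one more allocation + copy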
Currently that has really poor performance relative to _io / the C implementation, because each read allocates a new bytes of size DEFAULT_BUFFER_SIZE for data; res += data then does a resize/realloc of the bytearray to fit, and the data is copied from the temporary bytes into the bytearray.
That means two extra memory allocations in Python (the alloc + the resize), one deallocation (the temporary bytes), and an in-Python copy on top of the kernel → Python memory copy.
Using bytes + bytes you always have to make two copies in Python. With an “expandable” bytearray you can eliminate the Python → Python copy as well as the temporary bytes. The only problem is that while bytearray looks like bytes in many ways, a lot of code doesn’t work the same if one is returned instead of bytes.
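For example, bytearray is unhashable, so code that uses the result as a dict key breaks if a bytearray is returned instead of bytes:

d = {b"key": 1}
d[b"key"]                # fine
d[bytearray(b"key")]     # TypeError: unhashable type: 'bytearray'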
I see. To specifically avoid making a copy of all the bytes chunks read from a file when building a bytes object, you can build it directly with the bytes(iterable) constructor instead of converting an intermediate bytearray:
from itertools import chain
from functools import partial
buffer_size = 10
with open(__file__, 'rb') as file:
    print(bytes(chain.from_iterable(iter(partial(file.read, buffer_size), b''))))
I think this is better, because we only need one contiguous huge buffer instead of two.
def readall(self):
    """Read until EOF, using multiple read() calls."""
    chunks = []
    while data := self.read(DEFAULT_BUFFER_SIZE):
        chunks.append(data)
    if chunks:
        return b''.join(chunks)
    else:
        # b'' or None
        return data
I suspect copying byte by byte + running an iterator (and the calls that requires) is going to be slower than the memcpy / memmove at the end, which is specialized by the libc implementation / has dedicated hardware instructions for doing it efficiently. It definitely reduces overall allocated space though.
I do like this pattern a lot. It still requires 2x the memory (all the data exists in chunks, then a contiguous bytes needs to be constructed), and you need number-of-chunks + 1 allocations, instead of just a single Python-managed memory area (the bytearray) and a single copy (os.readinto, which does one kernel → process copy).
(For better performance, each read should be non-linearly bigger than the last, so that really big reads still need only a small number of .read() calls + allocations, but that is a somewhat distinct problem.)
I like this proposal. I’ve seen similar patterns every now and then. It’s not super common, but it’s not rare either. Doing it right isn’t all that obvious either. So I’m +1 on adding something to the stdlib to better handle it. Another example where a similar pattern is used in the stdlib would be hashlib:
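(For reference, the readinto loop in hashlib.file_digest() looks roughly like this:)

buf = bytearray(2**18)   # reusable buffer to reduce allocations
view = memoryview(buf)
while True:
    size = fileobj.readinto(buf)
    if size == 0:
        break            # EOF
    digestobj.update(view[:size])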
This code, with a zero-copy bytearray → bytes, would match FileIO for reading a whole file in terms of memory usage + CPU (honestly it might be a useful io utility to add, so it doesn’t have to be copy/pasted as much). It’s what I’m hoping to make the _pyio implementation[0]:
import os  # assumes os.readinto() is available

DEFAULT_BUFFER_SIZE = 4096

def read_to_eof(fd: int, *, size_estimate: int | None = None):
    # size_estimate should be the "known" / estimated size,
    # or DEFAULT_BUFFER_SIZE (ex. 4096 bytes) when unknown.
    bufsize = size_estimate if size_estimate is not None else DEFAULT_BUFFER_SIZE
    result = bytearray(bufsize)
    bytes_read = 0
    while True:
        if bytes_read >= bufsize:
            # Parallels _io/fileio.c new_buffersize
            if bufsize > 65536:
                addend = bufsize >> 3
            else:
                addend = bufsize + 256
            if addend < DEFAULT_BUFFER_SIZE:
                addend = DEFAULT_BUFFER_SIZE
            bufsize += addend
            result.resize(bufsize)  # assumes bytearray.resize() is available
        assert bufsize - bytes_read > 0, "Should always try and read at least one byte"
        try:
            n = os.readinto(fd, memoryview(result)[bytes_read:])
        except BlockingIOError:
            if bytes_read > 0:
                break
            return None
        if n == 0:  # reached the end of the file
            break
        bytes_read += n
    del result[bytes_read:]
    return bytes(result)
Indeed. I guess a simpler proposal then is for the bytes(iterable) constructor to optionally take an iterable of bytes instead, so that internally it’ll grow the buffer according to the size of each incoming bytes chunk and efficiently memcpy from the chunk.
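A hypothetical before/after (the second form is the proposed behavior and does not work today, since bytes() currently requires an iterable of ints):

chunks = [b"spam", b"eggs"]
data = b"".join(chunks)   # today: materialize the list, then one big copy at the end
data = bytes(chunks)      # proposed: grow the buffer and memcpy each chunk (not valid today)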
Nice! I hadn’t run into the hashlib loop. I forgot about the “mutable intermediate buffer” cases for bytearray (I’ve been looking a lot just at I/O and “read efficiently” recently). Will definitely keep an eye out for more of those cases.
I was thinking about this idea. This is actually how BytesIO is implemented.
But what if the original bytearray is changed after creating a bytes? You need to check whether there are external references to the internal bytes buffer (i.e. whether the reference count is larger than 1) before every mutating operation. You can do this for the Python interface; it will add some overhead, but it may be tolerable. But in the C API there is PyByteArray_AS_STRING(). You cannot catch mutation of the memory array returned by PyByteArray_AS_STRING(), and it is not possible to make a copy in PyByteArray_AS_STRING(), because this C API never fails, so users never check it for errors. The only way out is to deprecate and finally remove PyByteArray_AS_STRING(), but that is too serious a change.
Could you have a detach operation that took ownership of the internal bytes buffer and returned it as a readonly bytes object, while resetting the bytearray by giving it a new empty buffer?
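Something like this, where detach() is hypothetical:

buf = bytearray()
buf += b"data"            # ... filled incrementally ...
data = buf.detach()       # hypothetical: steals the internal buffer as read-only bytes
assert data == b"data" and len(buf) == 0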
This might work, but only in one direction. In BytesIO it works in two directions: when you create a BytesIO from a bytes object it does not copy the data, but keeps a reference to the existing bytes object, and getvalue() returns the internal bytes object (truncating it if possible).
The “moving constructor” would be useful in many other cases, for example when you create a tuple from a list or a frozenset from a set and throw away the original list or set. But it is difficult to come up with a solution that would not look like premature optimization.
My thought was that if the reference count is larger than 1, bytearray → bytes must copy (largely the same performance implications as today). Ideally, to me, that conversion leaves the bytearray “empty” just in case code tries to manipulate / fill the bytearray afterwards. That’s what my head, coming from C++ “move” semantics, expects (no one should touch the vacated object, but it’s best to leave it in a safe state just in case). Definitely adds complexity though.
I hadn’t thought about / explored just using BytesIO in these cases, or whether that could be more efficient… One of the problems in the open-coded readall loop is that the buffer allocation ideally needs to keep increasing in a particular way for good performance (see fileio.c new_buffersize, from bugs like bpo-15758 / gh-59962). If that “expand the buffer efficiently” logic could stay contained in the BytesIO rather than needing to be hand-written in the loop…
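e.g. the loop would collapse to something like this sketch (fd is assumed to be an already-open, readable file descriptor):

import io, os

bio = io.BytesIO()
while chunk := os.read(fd, 65536):
    bio.write(chunk)          # BytesIO manages buffer growth internally
data = bio.getvalue()         # can return the internal bytes without a copy (see above)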
I looked some at adding an io.readall, but there seem to be too many design features in the hand-rolled versions (provide a size estimate, “cap” the max size that can be grown to, …).
I’ve been looking at memory usage as well as runtime. The particular OS + hardware you’re running on matters (my x86_64 Linux desktop box takes longer than my M3 macOS laptop, presumably because of system optimizations).
Running the FileIO test_large_read using ./python -m test -M8g -uall test_largefile -m test_large_read -v on my Linux box, removing the current _pyio copies + extra allocations halves memory usage (making it the same as the C _io) and reduces runtime from ~3.8s to ~2.4s.