Add `take_bytes([n])` to `bytearray` providing a zero-copy path to `bytes`

Motivation

Code that works with binary data and network protocols often crafts a specific byte sequence in a bytearray and then converts it to a bytes.

This currently requires a memcpy to create the bytes from the bytearray, as well as 2x the amount of memory, since both must exist in memory at the same time. This pattern appears in asyncio, base64, urllib, and other CPython standard libraries. Some third-party libraries avoid the copies by using bytearray everywhere. Given that the bytearray no longer needs its copy of the data, it would be nice to get a bytes out without a copy.

Proposal

I propose adding a way to remove the copy by:

  1. Changing bytearray to use PyBytesWriter from PEP 782
  2. Adding a .take_bytes([n]) method

The first step makes the bytes stored in bytearray convertible to a bytes without copying, using a supported mechanism; the second step exposes that to Python code. This makes it possible to resolve issues like gh-60107 ~~as well as move bytearray off the _PyBytes_Resize API that PEP 782 soft-deprecates~~.

Sample Implementation

This adds .take_bytes([n]) and migrates some CPython library code paths to use it. Running pyperformance against my branch shows no significant slowdowns in code which uses bytearray, while some paths have measurable improvements (ex. asyncio_tcp (asyncio streams) 1.03x faster, regex_dna (regex) 1.08x faster). Code which uses very large buffers tends to speed up more; for example, 4GB buffers in _pyio running test_io.test_largefile no longer get copied, more than halving the runtime.

The branch increases the size of bytearray by one pointer while leaving the PyBytesWriter lazily allocated, much like the current bytearray storage is. There are cases where we still need to copy some data (ex. when the bytearray’s data is offset within its allocation, which is resolved via memmove). The “remaining” bytes must always be copied to a new PyBytesWriter.

Acknowledgments

This idea evolved in Add zero-copy conversion of `bytearray` to `bytes` by providing `__bytes__()`. Thanks @methane and @blhsing for coming up with the .take_bytes([n]) API shape and for the motivating Python examples for it.

When would I use this?

Any code which makes a ba = bytearray(), modifies it, then calls bytes(ba). Other common patterns are bytes(ba); ba.clear() or bytes(ba[:n]); del ba[:n]. Note that if you want to discard data past a point, the most efficient pattern becomes ba.resize(n) then ba.take_bytes(), so that take_bytes doesn’t need to keep around the soon-to-be-discarded extra bytes.
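As a hedged sketch of the bytes(ba[:n]); del ba[:n] pattern (split_today runs now; split_proposed uses the proposed method and does not run today):

```python
def split_today(ba, n):
    out = bytes(ba[:n])   # copies the first n bytes
    del ba[:n]            # then drops them from the bytearray
    return out

def split_proposed(ba, n):
    return ba.take_bytes(n)  # proposed: no extra copy of the taken bytes

buf = bytearray(b"headerbody")
assert split_today(buf, 6) == b"header"
assert buf == bytearray(b"body")
```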

Why .take_bytes([n])

The proposal started as just “convert the whole buffer”, but in a number of byte-stream processing use cases, such as TCP streams and console I/O, it was seen that the code would often “process to a marker” then split the buffer at the marker, n bytes in. .take_bytes([n]) enables that use case while keeping it simple to get the whole buffer.

The n index supports both positive (from the start) and negative (from the end) values, but it always takes the buffer before the split point. The portion after the split point, if any, must always be copied into a new PyBytesWriter. To take just the end of a bytearray, delete the start of the bytearray first; bytearray already handles this efficiently by just changing its “start” pointer inside the buffer.
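A hedged sketch of both of those, assuming the proposed semantics (none of the take_bytes calls run today):

```python
buf = bytearray(b"GET / HTTP/1.1\r\nremaining stream data")

# “Process to marker”: split the buffer n bytes in, keeping the rest for later.
n = buf.find(b"\r\n") + 2
line = buf.take_bytes(n)                 # positive n counts from the start
assert line == b"GET / HTTP/1.1\r\n"
assert buf == bytearray(b"remaining stream data")

# Negative n counts from the end, but still yields the bytes before the split.
head = buf.take_bytes(-5)
assert head == b"remaining stream"
assert buf == bytearray(b" data")

# To take just the end, delete the start first (already cheap), then take the rest.
del buf[:1]
assert buf.take_bytes() == b"data"
```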

I need more stream-like reading and writing

Use BytesIO! It already does a similar optimization, copying only when needed.
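For example, a minimal stream-style round trip with BytesIO that works today:

```python
import io

buf = io.BytesIO()
buf.write(b"chunk one ")
buf.write(b"chunk two")
buf.seek(0)
assert buf.read(6) == b"chunk "                  # incremental, stream-style reads
assert buf.getvalue() == b"chunk one chunk two"  # whole contents when needed
```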

5 Likes

For the example, I think you can save even more memory and copying by using an iterator:

def _byteswap(data, width):
    # Reverse each width-byte group; bytes() collects the yielded ints.
    return bytes(data[i + width - 1 - j]
                 for i in range(0, len(data), width)
                 for j in range(width))

I suggest finding another motivating example to demonstrate the benefit.

Updated to strike out the _PyBytes_Resize piece (I had a local branch which used that; the code in main does not).

Nice, a different way of writing it! My goal with the example is a small, shipping-today, easy-to-understand piece of code. Inside the sample implementation change I update many more cases across the standard library; they just tend not to be as short and self-contained.

This does look like it’s worth the cost of an additional PyVarObject header in every bytearray.
A PyBytesWriter is too big though; it’s designed for one-shot use and (currently) includes a 512-byte “small buffer”. You’ll probably need to reimplement the idea rather than use it directly.

1 Like

bytearray supports fast deletion at the start and at the end; it avoids resizing the buffer in most cases. Using bytes or PyBytesWriter inside bytearray may lose this optimization, no?
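For reference, here is the behavior in question, which works today:

```python
ba = bytearray(b"0123456789")
del ba[:4]     # deleting from the front avoids resizing the underlying buffer
del ba[-2:]    # deleting from the end just shrinks the logical size
assert ba == bytearray(b"4567")
```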

1 Like

Is the concern here too much overhead for small bytearrays?

The current resize/expand code for bytearray has intentional over-allocation in some cases to avoid lots of small resizes while also giving low overhead if used as a fixed-size buffer; I want to make sure I can benchmark/measure this case and confirm performance meets expectations.

In the prototype I built this optimization is kept; the bytearray has member pointers which track both “start of allocation” and “start of data”, which begin the same but diverge when the start is deleted, and those are maintained as-is. take_bytes gets slightly more expensive when the two pointers differ, as the data needs to be moved (memmove) to the start of the space before finalizing the PyBytesWriter to the final size.
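As a very rough conceptual sketch of that bookkeeping (pure-Python model only; the real implementation is C and all names here are invented for illustration):

```python
class _ByteArrayModel:
    def __init__(self, data=b""):
        self._alloc = bytearray(data)  # stands in for the raw allocation
        self._start = 0                # offset of live data ("start of data")

    def delete_prefix(self, n):
        # del ba[:n]: only the data-start offset advances; no data moves.
        self._start += n

    def take_bytes(self):
        if self._start:
            # Offset data is moved to the front of the allocation first
            # (the memmove mentioned above).
            self._alloc[:] = self._alloc[self._start:]
            self._start = 0
        out = bytes(self._alloc)       # stands in for finalizing the PyBytesWriter
        self._alloc = bytearray()      # the bytearray is left empty
        return out
```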

Larger ones too. If I recall correctly (correct me if I’m wrong), once you have more than 512 bytes, a writer will leave the 512-byte “small buffer” unused. I think that’s too wasteful for a long-lived object.

1 Like

@vstinner To implement the “no small_buffer” case and minimize overhead for a long-lived PyBytesWriter, what would you think of changing PyBytesWriter to have a Py_ssize_t small_buffer_size; member plus char small_buffer[1]; as the last member, so it can operate like a PyVarObject with a variable-length tail for the small-buffer optimization?

My thought would be that PyBytesWriter_Create(0) gets a small_buffer of 256 bytes allocated. If a size is provided then, like the current code, byteswriter_resize is called, which always results in writer->obj being set. In the byteswriter_resize case no small_buffer is allocated (memory overhead ~ sizeof(Py_ssize_t) + 1 byte). small_buffer_size gets used instead of sizeof(writer->small_buffer) in the code.

edit: Did a prototype implementation exp: PyBytesWriter variable size small buffer · cmaloney/cpython@de6759a · GitHub (needs specific tests added; passes make test)
edit2: pybyteswriter_resize doesn’t always result in a writer->obj, so some flag other than size=0 is needed. Probably just do a CPython-internal C API to start.

It would make the freelist more complicated and prevent some micro-optimizations (fixed buffer of 256 bytes). I would prefer to leave PyBytesWriter as it is. Why not use a bytes object directly in bytearray?

I have a prototype implementation that does that :slight_smile: (see: Add zero-copy conversion of `bytearray` to `bytes` by providing `__bytes__()` - #53 by cmaloney). Doing that means there is a new internal use of PyBytes_FromStringAndSize(NULL, len) + _PyBytes_Resize(); in the other thread the hope was to avoid that and change bytearray’s internals only once, by moving to PyBytesWriter once it was added.

1 Like

In general, PyBytes_FromStringAndSize(NULL, len) and _PyBytes_Resize() should be avoided (PEP 782 soft-deprecated them), but IMO using them in bytearray makes perfect sense.

3 Likes

I don’t know about the internals, but speaking about the Python-side API:

What about a .freeze method that would just change things internally so that it effectively became a frozen_bytearray with the same buffer and essentially could be used directly as a bytes instance everywhere?

This could align with the drafted PEP 805 (and there’s no problem with it having its own public freeze method, just as dataclasses and datetime have a .replace method even though there is now copy.replace).

I think .freeze has a different goal (making things “immutable” and thus thread-safe) than .take_bytes, which focuses on APIs that need to return bytes after working with a “mutable block of bytes”. I prototyped changing functions to return “buffer protocol” (or “bytes-like”) objects rather than bytes and found that surrounding/calling code often broke because it required exactly bytes. There are some projects (ex. psycopg) which did a major-version break to move to bytearray in most places, because once a data buffer is returned to the user the library doesn’t need the bytes and modifying it is fine. For the Python standard library APIs I don’t think there is sufficient upside for that kind of breaking change.

As one example, io.RawIOBase.read is defined to return either bytes or None (no data available / blocked on non-blocking). Internally that means it needs to fill a block of memory with bytes from the OS (ex. using os.readinto) then turn that into a bytes object. Doing this in pure Python is maybe possible with BytesIO (which exposes a “file-like” API rather than a “bytes-like” one), but that internally relies on a clever CPython-specific implementation to keep track of references (and contains a mutable bytes which is later finalized, much like this proposal).
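A hedged sketch of how such a read could look with the proposed method (raw_read is a hypothetical helper, take_bytes() does not exist yet, and os.readinto is the function mentioned above, available in newer CPython versions):

```python
import os

def raw_read(fd, size):
    """Sketch of a RawIOBase.read-style helper using the proposed take_bytes()."""
    buf = bytearray(size)        # mutable block to fill from the OS
    n = os.readinto(fd, buf)     # fill it directly, no intermediate bytes
    if n == 0:
        return b""               # EOF
    del buf[n:]                  # drop the unused tail
    return buf.take_bytes()      # proposed: hand the filled bytes back without a copy
```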

It would be really cool if .freeze could uniformly transition between distinct “immutable” types with the same/similar API surface (ex. bytearray ↔ bytes, frozenset ↔ set, etc). That would enable another solution to this problem. To actually implement it I think you need most of the same pieces as here, because CPython’s C code / interpreter requires bytes to have a very specific memory layout (data inline with the object), while bytearray has a very different layout (data in a separate allocation from the object). Those distinct implementations are really efficient for current use cases, and I think it would be hard to get all the benefits in one common CPython object implementation.

1 Like

Planning to start implementing this shortly; created gh-139871 to track it. I will validate that having the extra PyVarObject header doesn’t significantly regress the pyperformance benchmarks (it could be created lazily on .take_bytes(), but that adds complexity and makes it harder to add some other optimizations).

I plan to leave migrating individual code paths to issues separate from adding .take_bytes().

–

One of the secondary optimizations I’ve been looking at alongside this is making bytearray construction from a bytes zero-copy. bytearray("test", encoding='utf8'), where .encode goes from str/unicodeobject → bytes, works because the encoder usually returns a single-refcount bytes. Unfortunately, with bytearray(b'\0\1\2\3' * 1024), which is common in code bases, the bytes PyObject* passed to bytearray’s tp_init has a reference count of 2, so it can’t be taken “safely” as far as I can find. If there’s some way to do that, it would be really neat.