Add zero-copy conversion of `bytearray` to `bytes` by providing `__bytes__()`

How about something like:

  1. Make the bytearray buffer a PyBytes, cloning techniques from _io.BytesIO
    • Ideally they can share most of their implementation
    • bytearray is directly exposed in the C API, but _io.BytesIO is not, so I don’t think a more direct swap of bytearray with _io.BytesIO is possible?
  2. Add an explicit bytearray.detach that replaces the buffer with a new empty bytes object and raises an exception if a copy would be required.
    • Code can show intent, get an error when programmer expectation is broken
    • No silent slowness in performance-critical code
  3. Add bytearray.__bytes__ which “detaches” when possible, and copies otherwise
    • bytes(bytearray(1024 * 1024 * 1024)) without copying
    • Should make common “make bytes at return” (ex. _pyio.FileIO.readall()) faster.
    • Other code keeps correctness and current cost.
  4. Teach bytearray to be constructable from an exact bytes or __bytes__ without copying
    • This relies on reference count being 1 for no-copy case.
    • bytearray(b'1234') should no longer copy.
    • Makes bytearray construction from objects which were just encoded to bytes efficient / copy-free. Makes the pattern of “build a Python object of state, then serialize to bytes and write out” that things like email.message.Message do faster out of the box.

At that point, code that exists today should be faster. Programmer-optimized code can do .detach() to get explicit behavior. bytes(bytearray(b'\0' * 1_000_000_000)) would not copy.
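To make the intent concrete, here is a small sketch; everything below runs today, and the comments mark where the proposed (hypothetical, not-yet-existing) semantics would remove a copy:

# Today each marked line copies; under proposals 2-4 it would not.
buf = bytearray(1024 * 1024 * 1024)
data = bytes(buf)                # today: copies 1 GiB; proposal 3: __bytes__ detaches, no copy
buf2 = bytearray(b"\0" * 1024)   # today: copies the temporary bytes; proposal 4: refcount-1 take, no copy
# Proposal 2 would add an explicit buf.detach() that returns the buffer as bytes,
# leaves buf empty, and raises instead of ever silently copying.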

I think at that point I could implement _pyio.BytesIO as a small wrapper around bytearray.

Sorry, my answer is not directly related to the topic, but the code you submitted makes me wonder about how stupid the reallocation/resizing of bytearray can be currently ^^

The snippet you gave to resize the buffer was probably derived from this snippet in fileio.c:
cpython/Modules/_io/fileio.c at main · python/cpython · GitHub

static size_t
new_buffersize(fileio *self, size_t currentsize)
{
    size_t addend;

    /* Expand the buffer by an amount proportional to the current size,
       giving us amortized linear-time behavior.  For bigger sizes, use a
       less-than-double growth factor to avoid excessive allocation. */
    assert(currentsize <= PY_SSIZE_T_MAX);
    if (currentsize > LARGE_BUFFER_CUTOFF_SIZE)
        addend = currentsize >> 3;
    else
        addend = 256 + currentsize;
    if (addend < SMALLCHUNK)
        /* Avoid tiny read() calls. */
        addend = SMALLCHUNK;
    return addend + currentsize;
}

Doing the ramp-up from 1 byte to 4 MB, the numbers look completely arbitrary.
It’s worse than that: most of the numbers are odd. They completely disregard CPU word size, page size, etc…
I’d think the memory allocator is optimized for specific allocation sizes, usually around powers of two and (large) page sizes?
I don’t think it makes any sense to have allocations of 8193 or 16642 bytes? That’s 2 pages + 1 byte, and just over 4 pages.

I wonder if there is a small performance gain to get with better magic numbers =)

        1 00000000000000000000000000000001
     8193 00000000000000000010000000000001
    16642 00000000000000000100000100000010
    33540 00000000000000001000001100000100
    67336 00000000000000010000011100001000
    75753 00000000000000010010011111101001
    85222 00000000000000010100110011100110
    95874 00000000000000010111011010000010
   107858 00000000000000011010010101010010
   121340 00000000000000011101100111111100
   136507 00000000000000100001010100111011
   153570 00000000000000100101011111100010
   172766 00000000000000101010001011011110
   194361 00000000000000101111011100111001
   218656 00000000000000110101011000100000
   245988 00000000000000111100000011100100
   276736 00000000000001000011100100000000
   311328 00000000000001001100000000100000
   350244 00000000000001010101100000100100
   394024 00000000000001100000001100101000
   443277 00000000000001101100001110001101
   498686 00000000000001111001101111111110
   561021 00000000000010001000111101111101
   631148 00000000000010011010000101101100
   710041 00000000000010101101010110011001
   798796 00000000000011000011000001001100
   898645 00000000000011011011011001010101
  1010975 00000000000011110110110100011111
  1137346 00000000000100010101101011000010
  1279514 00000000000100111000011000011010
  1439453 00000000000101011111011011011101
  1619384 00000000000110001011010110111000
  1821807 00000000000110111100110001101111
  2049532 00000000000111110100010111111100
  2305723 00000000001000110010111010111011
  2593938 00000000001001111001010010010010
  2918180 00000000001011001000011100100100
  3282952 00000000001100100001100000001000
  3693321 00000000001110000101101100001001
  4154986 00000000001111110110011001101010
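For reference, the sequence above can be reproduced with a direct Python transcription of new_buffersize (the SMALLCHUNK and LARGE_BUFFER_CUTOFF_SIZE values here are inferred from the table and may differ by platform):

# Python transcription of new_buffersize; constants inferred from the table above.
SMALLCHUNK = 8 * 1024
LARGE_BUFFER_CUTOFF_SIZE = 64 * 1024

def new_buffersize(currentsize: int) -> int:
    if currentsize > LARGE_BUFFER_CUTOFF_SIZE:
        addend = currentsize >> 3
    else:
        addend = 256 + currentsize
    if addend < SMALLCHUNK:
        addend = SMALLCHUNK
    return addend + currentsize

size = 1
while size < 4 * 1024 * 1024:
    print(f"{size:>9} {size:032b}")
    size = new_buffersize(size)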

This is different from my focus / I think other overheads are a bigger contributor to slowness for the cases I care most about (there’s a known file size and the read succeeds in 2 reads). I’m fairly confident a PR with a pyperf benchmark (sample in a PR) showing the improvements would be popular. Looking at git blame, new_buffersize’s buffer expansion is between 12 and 17 years old. I think Issue 15758: FileIO.readall() has worst case O(n^2) complexity - Python tracker was the last thing that tweaked it.

It will raise more often than you might expect. If you append() or extend() to a non-empty bytearray, you need to overallocate the internal buffer to avoid quadratic complexity. An overallocated bytes object needs to be shrunk in detach(). It will only work if you never resized the initial bytearray object or did a single extend() of an empty bytearray object. Raising and catching an exception adds a large overhead, so it is better to make it optional. I am not sure that the case for guaranteed no-copy is important.
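A quick way to observe that overallocation (exact numbers vary by build and allocator):

import sys

ba = bytearray()
for _ in range(1000):
    ba.append(120)          # ord("x")
# sys.getsizeof includes the allocated capacity, which is larger than len(ba)
# because repeated appends overallocate to stay amortized O(1). A no-copy
# detach() would have to hand out (or first shrink) that oversized buffer.
print(len(ba), sys.getsizeof(ba))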

It cannot “detach” (i.e. replace the internal buffer with an empty one). That would be a breaking change. __bytes__() should not modify the original object.

If you mean returning a reference to the buffer without replacing it with the empty one, this will not work.

Constructor cannot steal the argument. It would break the uses of the C API PyByteArray_FromObject() and PyObject_Call*(). After creating the bytearray object, there will be at least two references – in the bytearray object and owned by the caller.
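A small illustration of why stealing is not possible at construction time:

data = b"x" * 3
ba = bytearray(data)
ba[0] = ord("y")
# The caller still holds 'data', so the constructor could not have taken
# ownership of its buffer: mutating the bytearray must never be visible
# through the caller's bytes object.
print(data, bytes(ba))   # b'xxx' b'yxx'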

So we should check for reference count being 1 not in the constructor, but later. What PyByteArray_AS_STRING() should do if the reference count is not 1? It cannot copy, because allocation can fail and there is no way to return an error.

And there is a similar issue with PyBytes_AsString(). The user can modify or not modify the array, we don’t know. For safety we should assume that it will be modified, for performance there should not be copy if there is no modification.

The current C API does not allow such optimization. To make it possible we should deprecate and remove the parts of very old and widely used C API. This would be very painful.


Why explicit: Because for _pyio.FileIO.readall(), the “silent” copy from bytearray to bytes at the end of the function is why it’s currently slower than _io.FileIO.readall(). It matters for efficient operation, and an extra copy is a significant performance regression when it happens. I’ve spent a decent amount of time hotfixing code releases for that sort of regression and would like tools so I don’t have to again. I find it hard to spot in human review, hard to write a test for, and I have encountered it in live systems. Having a simple, explicit way to express intent would be a good step toward being able to have one io.FileIO.readall() implementation.

If you want to experiment locally:

make && ./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read -v
vs.
make && ./python -m test -M8g -uall test_largefile -m test.test_largefile.CLargeFileTest.test_large_read -v

For my workstation on an optimized build it’s currently 1.4s for _pyio and 0.679s for _io. The recent os.readinto and bytearray.resize changes eliminated a lot of the other inefficiencies. Ongoing work is in gh-129005.

Focus Cases for Me: I’ll work on brainstorming more paths. Just .detach() on bytearray objects would solve the _pyio.BytesIO and _pyio.FileIO.readall performance cases I’ve been looking at recently. There are some other cases I’m concerned with that I don’t have good solutions for. Definitely looking for ideas.

In particular, what I’d like is if there were zero-copy paths to do:

  1. bytearray(bytes(bytearray(1024 * 1024 * 1024))). A lot of I/O wrapper objects and code end up doing that pattern. Path("README2.rst").write_bytes(Path("README.rst").read_bytes()) is the code pattern I’ve commonly seen: read from one byte stream, write it to a different byte stream. Some cases have faster OS APIs (ex. copyfile), but most don’t. The same applies to text / encode/decode UTF-8 files (although that adds more layers / unicodeobject).

  2. bytearray(str(big_object).encode()) and bytearray("test".encode()), which are common in print() to stdout/stderr, write-to-network, and write-to-disk code paths. My understanding of the code today is that (in PYTHONUTF8 mode) the unicodeobject already contains UTF-8; to make the bytes it does a PyBytes_FromStringAndSize() copy as the fast path. Things like Path("email.msg").write_bytes(email.message.EmailMessage().as_bytes()) do this with the message filled in by other Python code.

  3. Path("README.rst").read_bytes()[10:], which currently always copies all the bytes except the first 10 / some common prefix (ex. a file header). I’ve written parsers which do that. For an example I can share: I rewrote netrc as an experiment a couple of months ago using match + slice, and the initial version was 200+ times slower. Running cProfile on it showed it was because of all the slice copies (many of which I eliminated in the current form by rewriting to only “_materialize” the slices lazily; a rough sketch of that idea follows this list). netrc is used by requests if no auth= is passed; command-line utilities which use it see the netrc time show up in their runtime directly (if you have a .netrc file, and especially if it is on an NFS home drive).

    (Currently the ~/.netrc file gets stat() called for every entry with a password… which I’ve been debating making an issue and PR for, esp. with file._stat_atopen. It’s just checking “does the file’s owner match the current user”, so checking once gives the same security protection as checking once per entry.)
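For reference, the lazy-slice idea mentioned in case 3 looks roughly like this (a hypothetical sketch, not the actual netrc experiment code):

class LazySlice:
    """Record (buffer, start, stop) and only copy when the value is needed."""

    def __init__(self, buf: bytes, start: int, stop: int):
        self._buf, self._start, self._stop = buf, start, stop

    def _materialize(self) -> bytes:
        # The copy happens once, here, instead of at every slicing site.
        return self._buf[self._start:self._stop]

header = LazySlice(b"PK\x03\x04...payload...", 0, 4)
print(header._materialize())   # b'PK\x03\x04'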

I’m starting to wonder if there would be some way to have an interpreter-internal “Buffer Protocol proxy” that could morph into str, bytes, or bytearray either zero-copy or minimal-copy (depending on the buffer protocol mode and the end type). I don’t have any concrete proposals yet though. Just needs zero-copy paths. I really welcome others’ exploration and thoughts; I’m very much still learning my way around CPython and its implementation.


Re-reading this, I don’t like how I presented and approached things in the “Why Explicit” section. I wish I had focused more on showing the particular case I was investigating, how I measured it, and what I inferred from that.

I opened a PR to move bytearray to use a buffer convertible to bytes without copying. So far not adding other optimizations: gh-129005: Move bytearray to use bytes as a buffer by cmaloney · Pull Request #130563 · python/cpython · GitHub

That enables adding bytearray._detach() which I plan to do in a separate PR.

With bytes + detach on my machine (AMD 64 bit Linux, Optimized build):
_io takes: ~0.791s and uses ~2GB of RAM
_pyio current: ~1.073s and uses ~4GB of RAM
_pyio with bytearray._detach: ~0.887s and uses ~2GB of RAM

# _io.FileIO.readall of a large file
./python -m test -M8g -uall test_largefile -m test.test_largefile.CLargeFileTest.test_large_read

# _pyio.FileIO.readall of a large file
./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read

Thanks for the great insight and analysis. I was not aware of this very useful implementation detail. Would there be any concern if this behavior is fully publicized in BytesIO.getvalue’s currently very brief documentation?

What’s truly magical about BytesIO.getvalue is that it not only creates a bytes object that shares the buffer with the BytesIO object, but, if the BytesIO object gets written to afterwards, then and only then does it detach the bytes object and allocate new memory for itself to make a copy.

But this magic is possible only because BytesIO has no C API function equivalent to PyByteArray_AS_STRING(), so there is no fear of untrackable mutation of the buffer. So, as you pointed out, it is not possible for bytearray to implement the same magic without deprecating PyByteArray_AS_STRING() and creating significant backwards incompatibility.
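A rough way to observe this from Python (the timings are only indicative; the copy cost shows up at the first write after getvalue(), not at getvalue() itself):

import io
import time

buf = io.BytesIO()
buf.write(b"\0" * (256 * 1024 * 1024))

t0 = time.perf_counter()
data = buf.getvalue()     # shares the internal buffer with 'data': no copy here
t1 = time.perf_counter()
buf.write(b"x")           # first write afterwards has to un-share: copies 256 MB
t2 = time.perf_counter()
print(f"getvalue: {t1 - t0:.4f}s  write-after-getvalue: {t2 - t1:.4f}s")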

Great work, and I do think that an explicit detach method that empties the bytearray is a great idea, but with the insight Serhiy offered, I can’t help but think BytesIO, as it is, is already a fully viable zero-copy solution to all of your aforementioned use cases. For example, _pyio.FileIO.readall can be rewritten as:

from io import DEFAULT_BUFFER_SIZE, BytesIO

def readall(self):
    res = BytesIO()
    while data := self.read(DEFAULT_BUFFER_SIZE):
        res.write(data)
    # if nothing was read, 'data' is b'' (EOF) or None (non-blocking, no data yet)
    return res.getvalue() or data

Maybe we can just fully document BytesIO’s behavior, or do you see a use case that BytesIO doesn’t already cover?


_pyio implements its own BytesIO, and without a bytearray.detach() it has the same “has to copy” problem / it isn’t currently able to do the same optimization the C _io.BytesIO does.


Ahh haha you’re right. _pyio does not use the C implementation of BytesIO because that’s the whole point of _pyio. I’m convinced then.

EDIT: And to answer myself as to why this optimization behavior of BytesIO isn’t documented, it’s because the Python version can’t perform the same magic due to the current limitation of bytearray. But now if your PR gets merged, we can then make the Python version of BytesIO perform the same optimization and publicly document the behavior.


Are there optimization opportunities outside the _pyio module for bytearray? _pyio is not used by CPython itself, and I don’t think that optimizing _pyio alone is enough to justify changing bytearray internals.


Constructing from str / unicodeobject can be optimized with a little bit of code to be significantly faster (1.13ms → 866us on a debug build for me):

import pyperf

runner = pyperf.Runner()

runner.timeit(
    name="bytearray_unicode",
    stmt="bytearray('a' * 1_000_000, encoding='utf8')")

In theory constructing from bytes should be able to do similarly, but my simple code is seeing a refcount of 2 on the bytes object passed as arg which I don’t understand yet.

For both bytes and str construction, I think we can also reduce the amount of code in bytearray rather than add more special cases: just defer to the bytes constructor, which has more optimized paths, while reducing code.


(working on release builds; latest main is not hitting recursion / stack errors when expected on some tests, which means I can’t do a fully optimized build currently)


Got an optimized build working (for other projects I’d upped my rlimit on stack size; reducing it to the standard 8192 KB got tests passing again):

# main:
.....................

bytearray_bytes: Mean +- std dev: 438 us +- 7 us

.....................

bytearray_unicode: Mean +- std dev: 678 us +- 13 us

# remove unneeded copy for bytes and str
.....................

bytearray_bytes: Mean +- std dev: 7.02 us +- 0.23 us

.....................

bytearray_unicode: Mean +- std dev: 437 us +- 9 us

Experimental demonstration/test patches:
experimental bytes not copying
experimental str less copying

More Testing Info

Benchmark code:

import pyperf

runner = pyperf.Runner()

runner.timeit(
    name="bytearray_unicode",
    stmt="bytearray('a' * 1_000_000, encoding='utf8')")


runner.timeit(
    name="bytearray_bytes",
    stmt="bytearray(b'a' * 1_000_000)")

Simple checks around whether the bytearray refcount handling works:

print("T0")
bytearray(b'a' * 1_000_000)

print("T1")
a = b'a' * 1_000_000
bytearray(a)

print("T2")
bytearray(b'123451234512345123451234512345123451234512345123451234512345' + b'123')

I checked asyncio.streams. It contains this pattern several times:

# convert the whole bytearray to bytes and clear the buffer
chunk = bytes(self._buffer)
self._buffer.clear()

# consume the first n bytes (n is equal to or less than len(self._buffer))
data = bytes(memoryview(self._buffer)[:n])
del self._buffer[:n]

The base64 module creates a bytearray as a buffer for encoding/decoding, then copies the buffer to bytes and throws away the local bytearray when returning:

Same in the zipfile module:

and the wave module:

and the punycode encoding module:
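The shared shape of those call sites is roughly this (an illustrative composite, not code from any one of the modules; _transform is a placeholder for the real per-chunk work):

def _transform(chunk: bytes) -> bytes:
    return chunk                   # placeholder so the sketch runs as-is

def encode_something(chunks):
    buf = bytearray()
    for chunk in chunks:
        buf += _transform(chunk)   # accumulate into a local bytearray buffer
    return bytes(buf)              # today: copies the whole buffer on return;
                                   # a detach / take_bytes would hand it over without the copy

print(encode_something([b"ab", b"cd"]))   # b'abcd'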


Gotta point out that in this particular case only the first n bytes are copied to bytes and sliced away from the bytearray, so the proposed detach method isn’t quite applicable.

I know. I meant some modification to the proposal would make it more usable. (e.g. bytearray.take_bytes([n]))


Great idea. I suppose we can generalize this as bytearray.take_bytes(n=None) to cover the proposed detach by default, and slice the first n bytes as bytes when n is positive, and the last n bytes when n is negative.
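A pure-Python model of those proposed semantics (the slices here copy; the point of the real method would be to avoid exactly those copies, and the name/signature is still just the proposal):

def take_bytes(ba: bytearray, n: int | None = None) -> bytes:
    if n is None:          # detach: take everything, leave the bytearray empty
        out = bytes(ba)
        ba.clear()
    elif n >= 0:           # take the first n bytes
        out = bytes(ba[:n])
        del ba[:n]
    else:                  # take the last |n| bytes
        out = bytes(ba[n:])
        del ba[n:]
    return out

buf = bytearray(b"abcdefgh")
print(take_bytes(buf, 3), buf)   # b'abc' bytearray(b'defgh')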

I think BytesIO is better for some of these. For the pattern of repeatedly appending bytes and then returning bytes, BytesIO is more efficient today and about the same code complexity. bytearray / explicit manual pre-allocation is most useful to me to reduce copies when I have readinto (ex. fp.readinto(memoryview(buf)[bytes_read:]) vs. buf += fp.read(); a sketch follows below) and/or to avoid lots of allocations + deallocations in a tight loop (ex. a temporary buffer when pumping data from a source to a particular hash function). If data is coming from an fp.read(), BytesIO is simpler and faster.
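A sketch of that readinto pattern, assuming the total size is known up front (fp and expected_size are placeholders, with a BytesIO standing in for a real file object):

import io

fp = io.BytesIO(b"z" * 10_000)   # stands in for any readable binary file object
expected_size = 10_000

buf = bytearray(expected_size)   # pre-allocate once
view = memoryview(buf)
bytes_read = 0
while bytes_read < expected_size:
    n = fp.readinto(view[bytes_read:])
    if not n:
        break
    bytes_read += n
# No intermediate bytes objects were created while filling buf.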

My measurement for “is this a win” is currently wall time + memory allocations via tracemalloc and memray. Are there tools in CPython’s testing suite so I can test / validate that optimizations added for individual cases don’t regress accidentally? For me, changing a test intentionally because I have a better algorithm/design is awesome; things accidentally getting slower is frustrating. If anyone has pointers, they would be appreciated (including prior designs and broader ecosystem tools).

I’d like to move forward with this / moving bytearray to contain a bytes object instead of a general block of C memory. In particular, I think we’re reaching consensus that:

  1. There’s a measurable overhead reduction from removing the copy when constructing from bytes or str, and when converting to bytes at the end of a function, both of which are common in existing code.
  2. There isn’t a way to do this efficiently in Python code that can’t use BytesIO or C APIs directly today. Because of existing C API guarantees, bytearray and BytesIO need to have distinct implementations.
  3. .take_bytes([n]) / “detach and leave in a good state” as an abstraction provides useful semantics and would help simplify code. .take_bytes([n]) is preferred over .detach().

As I’ve been experimenting with / reviewing the code to do that, I found there is a potentially significant behavior change between _PyBytes_Resize and PyMem_Realloc.

In particular that means PyByteArray_AsString may return a different address before and after a resize. Today, the address only changes when capacity changes between zero and non-zero, but should stay the same otherwise.

Does that change anyone’s thoughts or raise additional concerns?


Open questions:

  1. Is it worth changing, over a deprecation cycle, the parts of the bytearray C API that mean it can’t use all the same highly efficient techniques as BytesIO? (I think the consensus is no.)

  2. There are quite a few “write loops” and “read from” loops that return bytes. .take_bytes([n]) would be an improvement in those cases, but there’s still a lot of redundant code.

    I found a PSRT recommendation that in general these should use non-linear buffer resizing. I’m planning to open a separate ideas / discussion thread to add BytesIO.readfrom(file, /, *, estimate=None, limit=None), which would achieve that for read-from loops (a rough sketch is at the end of this list).

    The core validation of that improvement is that it should be able to serve as the implementation of FileIO.readall with minimal performance change (slightly more objects, but the same set of operations/syscalls/memory allocations). In that discussion I will propose migrating more specific loops. Without that migration, just moving to .take_bytes([n]) would still be an improvement.

    Write loops I think would be about the same amount of code and more efficient if migrated to BytesIO today.

    That means both “read from a bytes file-like object” and “write repeatedly to a bytes buffer” would have a common efficient class to use: BytesIO.
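For discussion, a very rough pure-Python model of what that proposed BytesIO.readfrom could do (the name and parameters come from the idea above; nothing like this exists today, and the growth policy is only illustrative):

import io

def readfrom(dest: io.BytesIO, file, /, *, estimate=None, limit=None) -> int:
    """Read everything available from 'file' into 'dest', growing the read size
    non-linearly instead of using a fixed chunk. Returns the number of bytes read."""
    total = 0
    chunk = max(estimate or 0, io.DEFAULT_BUFFER_SIZE)
    while limit is None or total < limit:
        data = file.read(chunk)
        if not data:
            break
        dest.write(data)
        total += len(data)
        chunk = min(chunk * 2, 8 * 1024 * 1024)   # illustrative non-linear growth with a cap
    return total

buf = io.BytesIO()
readfrom(buf, io.BytesIO(b"x" * 100_000), estimate=64 * 1024)
print(buf.tell())   # 100000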