PyBytesWriter definitely looks a lot neater to me than using the existing _PyBytes_Resize pieces. I think it could be interesting as a backing for bytearray generally, which would also let the "efficiently oversize the grow buffer on repeated appends" logic be de-duplicated a bit. It still doesn't allow going from a "mutable array of bytes" to bytes in pure Python code, though: asyncio, encodings.punycode, zipfile, and _pyio all make a bytearray, extend/resize/mutate it, and then at the end of the function convert the bytearray to a bytes that is returned to the caller.
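The pattern described above can be sketched in a few lines; this is a simplified illustration (not the actual stdlib code), where the final conversion pays a full copy just to change the type:

```python
# Simplified version of the build-then-convert pattern used in asyncio,
# _pyio, zipfile, etc.: accumulate into a bytearray, then copy to bytes.
def build_payload(chunks):
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)   # amortized O(1) appends into an oversized buffer
    return bytes(buf)       # O(n) copy solely to produce an immutable bytes

result = build_payload([b"header", b"body", b"trailer"])
```

That final `bytes(buf)` copy is exactly the operation the thread is discussing how to eliminate.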
Agreed; I'm not sure there's a straightforward path for that with the overloads BytesIO has today. In my case I'm also specifically looking at "how does _pyio.BytesIO get faster?" Right now the large file test runs into two limitations: (a) it can't resize in place, and (b) it can't convert the internal bytearray to bytes without a copy (and it can't use bytes internally, since at the Python level bytes is immutable). It's one of the slower individual tests, especially on older/smaller hardware, and I suspect it adds up to quite a bit of CI runtime.
Both of those are part of why I'm proposing just the bytearray.take_bytes([n]) subset of the idea today. It simply removes the final copy, which enables a significant speedup for cases in asyncio, _pyio.BytesIO, etc. As other pieces (e.g. PyBytesWriter) are standardized, it can potentially be made even more efficient. There are some tradeoffs around "can a large single-reference bytes be put into a bytearray without copying"; I'm not sure how common that is.
bytearray.take_bytes([n]) can be implemented lazily on copy-out, as @encukou suggested, which keeps the change much more localized. We'd need to be careful around the extend-buffer code that currently "realigns" the data to remove leading padding, but I think that's very solvable; it means "start with enough padding at the front for the zero-copy out". In time I think the implementation could move to PyBytesWriter underneath, which would simplify it further, but that's largely independent. take_bytes([n]) is useful as a small step that lets code state its intent, enabling an optimization both with and without that migration.
It feels like, overall, there's a preference for adding new abstractions and using them rather than extending or modifying bytearray?
That has the secondary implication, to me, of "modify existing code to use the new / more efficient abstractions" rather than adding a new API here. If that's the case, would it be reasonable to make an issue and work on migrating existing code from bytearray to _io.BytesIO where it measurably improves performance?
I'm happy to work on BytesIO improvements so it covers more cases, if there's a particular API or case in mind. It sounds like .truncate() tweaks and/or a .reserve() are, at a high level, what's wanted?
To summarize (if for nothing else, my own future reference):
Shelving .take_bytes([n]) for bytearray until at least PEP 782 / PyBytesWriter lands. I'll look at measuring and migrating code to io.BytesIO, and study the tradeoffs / what it looks like to use it for the other cases where bytearray is commonly used, including individual ideas for modifying BytesIO (e.g. truncate changes).
For constructing and manipulating large blocks of bytes (4 KB - 1 GB+), there are two main options currently:
_io.BytesIO: allows getting the underlying bytes with a minimum number of copies in most cases; its file-like API works well for repeated read/write, but is less ideal for "make a block of memory and manipulate bytes in it" (e.g. building a file header or a network-protocol blob).
bytearray: allows manipulating a block of memory, changing/setting individual bytes and resizing easily, but requires a copy to get a bytes. That copy can dominate runtime when working with large buffers.
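The contrast between the two options can be shown side by side; a minimal sketch building the same 1 MiB block both ways (on CPython, BytesIO.getvalue() can avoid a final copy in common cases, while bytes(bytearray) always copies):

```python
import io

chunk = b"x" * 4096

# Option 1: _io.BytesIO -- file-like writes, minimal copies on getvalue()
bio = io.BytesIO()
for _ in range(256):
    bio.write(chunk)
via_bytesio = bio.getvalue()   # may avoid a final copy on CPython

# Option 2: bytearray -- easy in-place mutation and resizing
ba = bytearray()
for _ in range(256):
    ba.extend(chunk)
ba[0] = ord("y")               # direct byte-level mutation, awkward in BytesIO
ba[0] = ord("x")
via_bytearray = bytes(ba)      # always one extra O(n) copy
```

Both produce the same result; the difference is where the copies land and how natural byte-level manipulation is.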
Improving both of those to be more comprehensive would be nice. It's currently impossible to implement BytesIO-like functionality in pure Python (e.g. _pyio) without directly using the Unstable C API; in particular, there's no way today to allocate a block of memory with direct access to manipulate the machine bytes and then convert/finish it into a bytes without copying. PEP 782 will improve this situation!
Individual ideas for changes/improvements to BytesIO so it can cover even more cases are welcome today. Measurable performance improvements from moving code from bytearray to BytesIO may be okay. A more efficient _pyio.BytesIO implementation isn't viable at the moment and isn't somewhere to focus effort in CPython.
In any case, we cannot guarantee that an optimization like the one in BytesIO works on non-reference-counting implementations (e.g. PyPy or Jython). This is a CPython-specific optimization, like the optimization for in-place string concatenation.
We (CPython developers) obviously cannot guarantee anything about PyPy or Jython (they can even choose to violate the Python spec if they want), but it's still reasonable to expect them to provide the same zero-copy semantics as CPython would for a hypothetical buffer-detaching operation.
Yes, they can implement detach(), but the optimizations for the constructor and getvalue() cannot be fully implemented without reference counting.
It is possible to implement it only partially. For example, the constructor may have O(1) complexity, but the first modification will have to make a copy, even if the original bytes object is no longer referenced from outside.
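The partial scheme described above amounts to copy-on-first-write; a minimal sketch of what an implementation without reference counting could do (class and method names here are hypothetical, purely for illustration):

```python
# Copy-on-first-write sketch: O(1) construction from bytes, with the copy
# deferred until the buffer is actually mutated. Without refcounts, the
# implementation can't know the caller dropped its reference, so the first
# write must always copy.
class LazyBuffer:
    def __init__(self, data: bytes):
        self._data = data      # O(1): just hold a reference, no copy
        self._owned = False    # may still be sharing the caller's bytes

    def getvalue(self) -> bytes:
        # Zero-copy while unmodified; one copy back to bytes once owned.
        return bytes(self._data) if self._owned else self._data

    def write_at(self, index: int, value: int) -> None:
        if not self._owned:
            self._data = bytearray(self._data)  # copy on first write only
            self._owned = True
        self._data[index] = value

buf = LazyBuffer(b"abc")
assert buf.getvalue() == b"abc"   # no copy has happened yet
buf.write_at(0, ord("X"))         # triggers the one deferred copy
```

CPython can skip that first copy when the refcount proves the bytes is unshared, which is exactly the part other implementations can't replicate.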
People are also good at figuring out how to optimize even under a lot of constraints, in my experience; it just takes time and will. If it keeps coming up in profiles, slowly work out what the underlying problems are and ideas to address them.
Bytes slicing, and conversion of a memory allocation / malloc'd buffer / bytearray to bytes, is more expensive in CPython because PyBytesObject must always store its bytes (ob_sval) inline. That means a new PyVarObject head has to be allocated, with space for the data, and the data copied into it every time; a head can't be reused repeatedly (for cases like repeatedly yielding slices in a loop). We could build new abstractions that capture that, but it's a very different optimization point and hard to adapt CPython to without breaking a lot of other important established guarantees. The tradeoffs differ from C++, for instance, where std::string tends to store small strings inline and longer ones in a second allocation. For short strings, never having to do an extra memory hop can be really efficient, and that makes a lot of code work well; memoryview exists for slicing large objects without copying bytes when needed. Lots of interesting ways to optimize.
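The memoryview escape hatch mentioned above looks like this in practice; a small example showing that memoryview slices stay backed by the original buffer, so no data is copied until you explicitly materialize a bytes:

```python
# memoryview slicing is zero-copy: the slice is a view over the original
# buffer, not a new allocation of the data.
data = b"0123456789" * 100        # a larger immutable buffer
view = memoryview(data)

chunk = view[10:20]               # zero-copy slice of the big buffer
assert chunk.obj is data          # still backed by the original object
materialized = chunk.tobytes()    # only here does a copy happen
```

This covers the "repeated slices in a loop" case without paying a PyBytesObject allocation per slice, at the cost of the consumer needing to accept a buffer-protocol object instead of bytes.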
The big difference is that std::string is mutable (in particular, it can be arbitrarily resized), so it has to allow for a separately allocated data buffer.