PEP 782: Add PyBytesWriter C API

Hi,

After multiple iterations on the API, I decided to write down a PEP for the PyBytesWriter C API. It’s easier to understand the API with its documentation, examples, and discussions around it. A single document (the PEP) should help the discussion.

→ Read PEP 782: Add PyBytesWriter C API

The API is now based on sizes rather than pointers, even if two functions using pointers are provided for convenience.

Abstract

Add a new PyBytesWriter C API to create bytes objects.

Soft deprecate PyBytes_FromStringAndSize(NULL, size) and _PyBytes_Resize() APIs. These APIs treat an immutable bytes object as a mutable object. They remain available and maintained, don’t emit deprecation warning, but are no longer recommended when writing new code.

Perhaps we should at least add a compile-time deprecation warning for _PyBytes_Resize?
I understand that it’s not possible to do it for PyBytes_FromStringAndSize(NULL, size) while still allowing PyBytes_FromStringAndSize(<non-null pointer>, size).

From my reading and assumptions, I guess PyBytesWriter_WriteBytes and PyBytesWriter_Format will extend the buffer? It should say so explicitly in their description, including how it relates to a prior overallocation (e.g. if I PyBytesWriter_Create(10) and then PyBytesWriter_WriteBytes(<10 bytes>), does it use the 10 I specified or does it allocate 10 more?).

If I call PyBytesWriter_WriteBytes multiple times, does it append or overwrite?

Does PyBytesWriter_GetData get the start of the data or the position where WriteBytes would next write to? (Obviously irrelevant if it overwrites from the start each time.)

Why disallow shrinking with PyBytesWriter_Grow? It can handle negative growth just fine (potentially with data loss, but that’s likely intentional). I have plenty of cases where I would PyBytesWriter_Grow(-1) to trim a trailing null or character.

What does “update the buf pointer” mean for PyBytesWriter_GrowAndUpdatePointer? Maybe it needs a realistic example, because it seems like Grow is going to allocate garbage/zeros and UpdatePointer is going to move my pointer past it so that I don’t write values into the new part of the allocation? Doesn’t seem useful.


Got up to the next example, and I see it would make sense to pass in the result of a strlen previously used to strcpy into the result of GetData? Is that the intent? Still not entirely clear how or when I’d use this API.


The “Overallocation” section should mention that (whether?) finishing the writer will trim overallocations.

The implementation allocates internally a bytes object …

Maybe call out that this isn’t a required part of the design? Other implementations may do it differently, and CPython may do it differently in the future.

There is no impact on the backward compatibility

I don’t think you can deprecate (even soft deprecate) functions and say there’s no impact :wink: A short example of how to change code using PyBytes_FromStringAndSize(NULL, size) into code that will compile with both the old API and the new API would be useful here - checking PY_VERSION_HEX is probably the best option here?

Hi Steve,

The key here is PyBytesWriter_GetSize which returns the current size of the writer.

PyBytesWriter_WriteBytes and PyBytesWriter_Format increase this size: they call Resize() internally to increase the size.

PyBytesWriter_Create(n) sets the writer size to n.

I will try to clarify that in the PEP.

It does append. The high-level API example is based on that.

It’s always the start of the data. It’s similar to PyBytes_AS_STRING() and PyByteArray_AS_STRING().
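The three answers above (Create(n) sets the size to n, WriteBytes appends, GetData returns the start of the data) can be modeled in plain C. This is only a toy sketch: `ToyWriter` and every `toy_*` function are hypothetical names invented for illustration, backed by `realloc()`. It is not the PyBytesWriter implementation and does not use the CPython API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a bytes writer. Hypothetical, NOT CPython's code. */
typedef struct {
    char *data;   /* start of the buffer */
    size_t size;  /* current writer size */
} ToyWriter;

/* Set the writer size, reallocating the buffer (it may move in memory). */
int toy_resize(ToyWriter *w, size_t new_size)
{
    char *p = realloc(w->data, new_size ? new_size : 1);
    if (p == NULL) {
        return -1;
    }
    w->data = p;
    w->size = new_size;
    return 0;
}

/* Models Create(n): the writer starts with size n, bytes uninitialized. */
ToyWriter *toy_create(size_t n)
{
    ToyWriter *w = malloc(sizeof(*w));
    if (w == NULL) {
        return NULL;
    }
    w->data = NULL;
    w->size = 0;
    if (toy_resize(w, n) < 0) {
        free(w);
        return NULL;
    }
    return w;
}

/* Models WriteBytes: grow the size by len and copy at the old end. */
int toy_write_bytes(ToyWriter *w, const void *bytes, size_t len)
{
    assert(w != NULL);
    size_t old_size = w->size;
    if (toy_resize(w, old_size + len) < 0) {
        return -1;
    }
    memcpy(w->data + old_size, bytes, len);
    return 0;
}

/* Models GetData: always the start of the data, never a write position. */
char *toy_get_data(ToyWriter *w) { return w->data; }

/* Models GetSize: the current writer size. */
size_t toy_get_size(ToyWriter *w) { return w->size; }

/* Release the toy writer without producing a result. */
void toy_discard(ToyWriter *w)
{
    free(w->data);
    free(w);
}
```

In this model, `toy_create(10)` followed by `toy_write_bytes(w, "abc", 3)` leaves a size of 13: the first 10 bytes stay reserved for the caller to fill through `toy_get_data()`, and the written bytes land after them.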

Ah. It’s just an arbitrary limit to help detect bugs in the code. I’m open to removing this limitation.

PyBytesWriter_Grow() is just a convenient helper around PyBytesWriter_Resize() and PyBytesWriter_GetSize().

Ah, I added pseudo-code in the hope that it would be enough to explain the behavior.

The UpdatePointer part only means that if the internal buffer is moved in memory, the pointer is updated to point to the new memory address. It’s only a helper function. The position inside the writer buffer is unchanged.
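To make the "update the pointer" step concrete, here is a plain-C sketch of the behavior described above: Grow is Resize applied to the current size plus a delta, and the pointer variant keeps the caller's offset when the buffer moves. `ToyWriter` and the `toy_*` functions are hypothetical names for illustration only; this is a `realloc()`-based model, not the CPython implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a bytes writer. Hypothetical, NOT CPython's code. */
typedef struct {
    char *data;   /* start of the buffer */
    size_t size;  /* current writer size */
} ToyWriter;

/* Set the writer size; realloc() may move the buffer in memory. */
int toy_resize(ToyWriter *w, size_t new_size)
{
    char *p = realloc(w->data, new_size ? new_size : 1);
    if (p == NULL) {
        return -1;
    }
    w->data = p;
    w->size = new_size;
    return 0;
}

/* Models Grow: a helper equivalent to resize(get_size() + extra). */
int toy_grow(ToyWriter *w, size_t extra)
{
    return toy_resize(w, w->size + extra);
}

/* Models GrowAndUpdatePointer: remember where buf sits inside the
 * buffer, grow, then return the same position in the (possibly moved)
 * buffer. The position itself is unchanged. */
char *toy_grow_and_update_pointer(ToyWriter *w, size_t extra, char *buf)
{
    assert(buf >= w->data && buf <= w->data + w->size);
    size_t pos = (size_t)(buf - w->data);
    if (toy_grow(w, extra) < 0) {
        return NULL;
    }
    return w->data + pos;
}
```

A caller that keeps a write cursor would use it as `cursor = toy_grow_and_update_pointer(&w, extra, cursor);` and then continue writing at `cursor`, without recomputing the offset from the start of the data.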

I’m sorry, I’m confused, which example are you referring to?

I only described the “reference implementation”, it’s not part of the Specification. I will try to clarify that.

There are only soft deprecations. Existing code will continue to work as if nothing happened. No warning is emitted. But I can repeat the soft deprecations there if you want.

Ok, I will add such an example.

Thanks, those sound like the changes I wanted to see in the PEP text.

Should’ve linked, sorry. I meant the GrowAndUpdatePointer example.

Yeah, but I’m sure people will misquote it later on (I’ve just come from an argument where someone was doing it…). Better to be clear now and not have to have arguments later.

I plan to backport the PEP 782 (PyBytesWriter) implementation to Python 3.13 in pythoncapi-compat. I wrote a draft PR to show that it’s doable. The backport might be less efficient (e.g. it cannot use a free list), but it allows a single code base to work on all Python versions (my backport works on Python 2.7 and PyPy 2/3).

You need to reference pythoncapi-compat in the PEP then, and it wouldn’t surprise me if a PEP delegate preferred you only used official/core examples - things in the past that have tried to refer to 3rd party projects have been contentious.

Overall looks good to me, two nitpick API addition ideas from me that definitely could be deferred:

  1. A way to get a “Buffer Protocol” view onto the PyBytesWriter without having to “finalize” it into a bytes first. The use case I’m thinking of is to build up a block of machine bytes, then introspect/debug/validate the bytes state in Python code or present it to a user (ex. a progress bar); the “partially” built bytes could then be further modified. There is a tradeoff in added complexity, since I suspect the buffer would need to be “locked” against resizing while there are any exports.
  2. PyBytesWriter_FromBytes(PyObject *) where a pre-existing bytes object is passed in. I’m thinking about cases where an API takes a bytes in its constructor (or a str + encoding=). It simplifies boilerplate in those cases and removes an otherwise unavoidable copy of the machine bytes from the initial buffer value.

The whole point of this API is to have the data in a state that isn’t valid or suitable for exposing to Python code. Particularly once we start worrying about threading, it’s nice to have an API that deliberately doesn’t support access from multiple threads (so nothing goes through locks or mutexes, unless the caller does it, which we usually won’t have to).

If you want to work with partially initialised buffers, just use bytearray.

Again, this API is intended for constructing a bytes object from other native data. The copy is indeed unavoidable, because the original bytes object has to remain immutable, so it only saves a PyBytes_AsString call and then PyBytesWriter_WriteBytes of that string.

The functions that would be involved in converting str+encoding into bytes are likely going to be using a bytes writer internally. So proposing to expose the partially initialised but un-finished writer from those falls under the first point.

I appreciate the thoughts and comments. I’m trying to build myself a picture of BytesIO, bytes, bytearray and now PyBytesWriter: what they do the same, what they do differently, and when to use each. There is also _BlocksOutputBuffer, but that is more specifically isolated. They have a lot of similarities and similarly structured code. I personally would like to reduce the amount of code and the “when do I use which” tradeoffs. I definitely understand this is a new and somewhat distinct use case. I’m hoping to find ways it can make the existing ones simpler as well (even just in implementation). Thanks for the thoughts.

That would break projects that treat compiler warnings as errors. Among the top 8,000 PyPI projects, 41 projects (C extensions) use _PyBytes_Resize(), which is significant.

I would prefer to have at least one Python version with the PEP 782 API (ex: in Python 3.14) before considering deprecating _PyBytes_Resize() (ex: in Python 3.15). Even if I really dislike _PyBytes_Resize() :smiley:

It would make the implementation more complicated, so I would prefer not to add it for now. Instead, functions should be specialized to use the PyBytesWriter C API directly (for now).

The PyUnicodeWriter API has such an optimization: it stores a read-only Unicode object which is copied on the next write. It allows some micro-optimizations in specific cases such as "%s" % "abc" or "{}".format("abc"). It makes the implementation more complicated.

I don’t think that it’s worth it for the PyBytesWriter API, since I didn’t find any function in the Python code base which would benefit from such a PyBytesWriter_FromBytes(PyObject *) function.
