The C-API has two methods to construct bytes, PyBytes_FromObject and PyObject_Bytes. In PyObject_Bytes, the __bytes__ method is tried first, then it defers to PyBytes_FromObject. PyBytes_FromObject never checks for __bytes__, but rather starts with the buffer interface.
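The difference between the two entry points can be observed from Python by calling them through ctypes. This is only a rough sketch, assuming CPython (ctypes.pythonapi is CPython-specific); the Sub class and the variable names are hypothetical and exist purely for illustration.

```python
import ctypes

# Bind the two C-API functions. Using py_object for argument and result
# lets ctypes pass Python objects and take care of the returned reference.
object_bytes = ctypes.pythonapi.PyObject_Bytes
object_bytes.argtypes = [ctypes.py_object]
object_bytes.restype = ctypes.py_object

from_object = ctypes.pythonapi.PyBytes_FromObject
from_object.argtypes = [ctypes.py_object]
from_object.restype = ctypes.py_object


class Sub(bytes):  # hypothetical subclass, for illustration only
    def __bytes__(self):
        return b"from __bytes__"


s = Sub(b"raw")
via_object_bytes = object_bytes(s)  # consults __bytes__ first
via_from_object = from_object(s)    # uses the buffer protocol, ignores __bytes__
print(via_object_bytes, via_from_object)
```

Here the subclass only overrides __bytes__ and inherits the buffer implementation from bytes, so the two functions disagree without needing __buffer__ at all.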
Most of the time this isn't significant, but it is possible for __bytes__ to return a different set of machine bytes than the buffer protocol exposes. This came up as a concern while working on removing a bytearray → bytes copy in int.from_bytes, since changing the resolution order can change behavior. An example Python bytes subclass that does this:
>>> class my_bytes(bytes):
...     def __bytes__(self):
...         return b"bytes"
...
...     def __buffer__(self, flags):
...         return memoryview(b"buffer")
...
... class distinct_bytes_buffer(bytes):
...     def __bytes__(self):
...         return my_bytes(b"ob_sval")
...
...     def __buffer__(self, flags):
...         return memoryview(b"distinct_buffer")
...
... a = distinct_bytes_buffer(b"distinct_ob_sval")
... bytes(a)
...
b'ob_sval'
>>> a
b'distinct_ob_sval'
>>> memoryview(a).tobytes()
b'distinct_buffer'
Since Python 3.12 (PEP 688), the buffer protocol can be implemented in Python code. That means it can be used in Python-only bytes subclasses.
I'd like to propose that the resolution order going to bytes should change to always be:
- PyBytes_CheckExact (this is ~1.2x faster than buffer protocol first)
- Buffer protocol
- __bytes__
- Other bytes construction methods as today (list, tuple, unicode, etc.)
I think that will both provide a moderate speedup over the current behavior, where __bytes__ is only sometimes tried first, and make behavior more consistent. The example above would return b'distinct_buffer' for all three cases. In-language bytes subclasses can implement the buffer protocol to get consistent behavior (as in the example above). Other objects that are bytes-like but not bytes, and that would need to copy their data to produce one (e.g. bytearray), can rely on the buffer protocol being used for zero-copy behavior.
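To make the proposed order concrete, here is a pure-Python model of it. It is a simplified sketch, not how CPython would implement it: proposed_bytes, Tagged and OnlyDunder are hypothetical names, and the last step only models iterables of integers (string encoding and the other existing paths are left out).

```python
def proposed_bytes(obj):
    """Hypothetical model of the proposed resolution order."""
    # 1. Exact bytes: return unchanged (mirrors PyBytes_CheckExact).
    if type(obj) is bytes:
        return obj
    # 2. Buffer protocol: memoryview() raises TypeError if unsupported.
    try:
        view = memoryview(obj)
    except TypeError:
        pass
    else:
        with view:
            return view.tobytes()
    # 3. __bytes__, looked up on the type as special methods are.
    method = getattr(type(obj), "__bytes__", None)
    if method is not None:
        result = method(obj)
        if not isinstance(result, bytes):
            raise TypeError(
                f"__bytes__ returned non-bytes (type {type(result).__name__})"
            )
        return result
    # 4. Simplification: only iterables of ints (list, tuple, ...) are modelled.
    return bytes(iter(obj))


class Tagged(bytes):  # bytes subclass: buffer protocol wins under the proposal
    def __bytes__(self):
        return b"from __bytes__"


class OnlyDunder:  # no buffer support, so __bytes__ is still honoured
    def __bytes__(self):
        return b"from __bytes__"


print(proposed_bytes(Tagged(b"raw")), bytes(Tagged(b"raw")))
print(proposed_bytes(OnlyDunder()), proposed_bytes(bytearray(b"abc")))
```

The first print contrasts the model with today's bytes() constructor for a subclass whose __bytes__ disagrees with its buffer; objects that only define __bytes__ behave the same under both.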
I know of at least three places this would mean changing: PyBytes_FromObject, int.from_bytes, and bytes_new_impl. I suspect there are more, both in CPython and in the broader Python ecosystem; I can do more comprehensive code searches. Generally I think pointing people towards the buffer protocol gives them a simple and effective way to efficiently get at the underlying data / machine bytes. My hope is that, if Python adopted this idea, the ecosystem would over time also move towards it for simple, consistent behavior.