Prefer the buffer protocol over __bytes__ in functions taking a "bytes-like" object

The C-API has two functions to construct bytes from an object: PyBytes_FromObject and PyObject_Bytes. PyObject_Bytes tries the __bytes__ method first, then defers to PyBytes_FromObject. PyBytes_FromObject never checks for __bytes__; it starts with the buffer interface.
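As a rough illustration, the two orders described above can be mimicked in pure Python. This is only a sketch: `from_object_order` and `object_bytes_order` are hypothetical names, not CPython APIs, and the fallback for lists, tuples and other iterables is simplified to a single `bytes(list(obj))` call.

```python
def from_object_order(obj):
    """Rough stand-in for PyBytes_FromObject: never consults __bytes__."""
    if type(obj) is bytes:
        return obj  # exact bytes is returned as-is
    try:
        view = memoryview(obj)  # buffer protocol
    except TypeError:
        view = None
    if view is not None:
        with view:
            return view.tobytes()
    if isinstance(obj, str):
        raise TypeError("cannot convert str to bytes")
    # Simplified stand-in for the list / tuple / iterable paths.
    return bytes(list(obj))


def object_bytes_order(obj):
    """Rough stand-in for PyObject_Bytes: __bytes__ first, then defer."""
    if type(obj) is bytes:
        return obj
    dunder = getattr(type(obj), "__bytes__", None)
    if dunder is not None:
        result = dunder(obj)
        if not isinstance(result, bytes):
            raise TypeError("__bytes__ returned non-bytes")
        return result
    return from_object_order(obj)
```

For an object that supports the buffer protocol and also defines `__bytes__`, the two helpers can disagree, which is the inconsistency this thread is about.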

Most of the time this isn’t significant, but it is possible for __bytes__ to return a different set of machine bytes than the buffer protocol. This came up as a concern while working on removing a bytearray-to-bytes copy in int.from_bytes, since changing the resolution order can change behavior. An example Python bytes subclass that does this:

>>> class my_bytes(bytes):
...     def __bytes__(self):
...         return b"bytes"
...  
...     def __buffer__(self, flags):
...         return memoryview(b"buffer")
...  
... class distinct_bytes_buffer(bytes):
...     def __bytes__(self):
...         return my_bytes(b"ob_sval")
... 
...     def __buffer__(self, flags):
...         return memoryview(b"distinct_buffer")
...  
... a = distinct_bytes_buffer(b"distinct_ob_sval")
... bytes(a)
... 
b'ob_sval'
>>> a
b'distinct_ob_sval'
>>> memoryview(a).tobytes()
b'distinct_buffer'

Since Python 3.12 / PEP 688, the buffer protocol is accessible from Python code. That means it can be implemented in Python-only bytes subclasses.

I’d like to propose that the resolution order when converting to bytes should change to always be:

  1. PyBytes_CheckExact (This is ~1.2x faster than buffer protocol first)
  2. Buffer Protocol
  3. __bytes__
  4. Other bytes construction methods as today (List, tuple, unicode, etc).

I think that would both provide a moderate speedup over the current sometimes-__bytes__-first behavior and make behavior more consistent. The example above would return b'distinct_buffer' in all three cases. Bytes subclasses written in Python can implement the buffer protocol to get consistent behavior (as in the example above). Other objects that are bytes-like but not a bytes, and would need to copy their data to produce one (e.g. bytearray), can rely on the buffer protocol being used for zero-copy behavior.
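The proposed order could be sketched in pure Python roughly as follows. `to_bytes_proposed` is a hypothetical helper for illustration only, not an existing or planned API, and step 4 is simplified to a single `bytes(list(obj))` call.

```python
def to_bytes_proposed(obj):
    """Sketch of the proposed bytes-like resolution order (hypothetical)."""
    # 1. Exact bytes: return unchanged (fast path).
    if type(obj) is bytes:
        return obj
    # 2. Buffer protocol.
    try:
        view = memoryview(obj)
    except TypeError:
        view = None
    if view is not None:
        with view:
            return view.tobytes()
    # 3. __bytes__, looked up on the type like other special methods.
    dunder = getattr(type(obj), "__bytes__", None)
    if dunder is not None:
        result = dunder(obj)
        if not isinstance(result, bytes):
            raise TypeError("__bytes__ returned non-bytes")
        return result
    # 4. Other construction methods (simplified to iterables of ints).
    if isinstance(obj, str):
        raise TypeError("cannot convert str to bytes without an encoding")
    return bytes(list(obj))
```

With this order, an object that exposes a buffer is always read through it, and `__bytes__` only matters for objects that have no buffer at all.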

I know of at least three places this would mean changing: PyBytes_FromObject, int.from_bytes, and bytes_new_impl. I suspect there are more, both in CPython and in the broader Python ecosystem; I can do more comprehensive code searches. Generally, I think pointing people towards the buffer protocol gives them a simple and effective way to efficiently get at the underlying data / machine bytes. My hope is that, if Python adopted this idea, the ecosystem would over time also move towards it for simple, consistent behavior.

JFR, OP’s PR: gh-132108: Add Buffer Protocol support to int.from_bytes to improve performance (by cmaloney, python/cpython Pull Request #132109)

Why? CC @sobolevn

I think that there are some implicit relations between dunder methods, even if they aren’t documented explicitly. E.g. __index__() and __float__() should return equal values on any Integral type (unless the second raises OverflowError).

The int.from_bytes docs say: "The argument bytes must either be a bytes-like object or an iterable producing bytes." It looks like the current implementation does not follow the docs.
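A small check of the point above, assuming CPython as it behaved at the time of this thread. `OnlyDunderBytes` is a hypothetical class that is neither bytes-like nor iterable, yet int.from_bytes appears to accept it because `__bytes__` is consulted:

```python
class OnlyDunderBytes:
    """Hypothetical object: no buffer, not iterable, only __bytes__."""

    def __bytes__(self):
        return b"\x01\x00"


# The two documented input kinds: a bytes-like object and an iterable.
from_buffer = int.from_bytes(bytearray(b"\x01\x00"), "big")
from_iterable = int.from_bytes([1, 0], "big")

# Undocumented path: accepted via __bytes__.
from_dunder = int.from_bytes(OnlyDunderBytes(), "big")
```

All three calls produce the same integer, even though the last argument does not match either category named in the docs.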

Very much this. I’d like to say bytes-like means a specific Method Resolution Order / conversion (MRO): buffer protocol first, where an exact bytes check produces the same result (it is just an optimization), followed by the other methods listed out. Special cases can definitely be kept for specific reasons (e.g. compatibility), but those would document their deviation from the bytes-like MRO rather than behaving slightly differently in some cases, as happens today. That gives a common direction for things to head in, and guidance on how to resolve bytes-like cases more generally.


Re: the int.from_bytes PR, I’m happy to add __bytes__ to the MRO or not. The big thing in that PR for me is removing the currently mandatory copy from a bytearray to a bytes, which is then converted to an int (and the bytes discarded). At the moment, from my perspective, it’s in limbo: I know of no way to get to a decision. I’m heading to PyCon US in May, hoping to explore and learn more there.