PEP 688: Making the buffer protocol accessible in Python

As evidenced by my message to typing-sig (apologies, sent before discussions-to was updated), I’m sympathetic to concerns about breaking bytes / bytearray compatibility. However, I find Jelle’s point about hashability fairly compelling — and in general it’s nice to be able to reflect immutability in one’s types.

but I don’t want to get too precise or restrictive here

Part of the issue is that using bytes as a shorthand prevents people who want to be precise from being precise. If we choose to remove the shorthand, maybe we could suggest use of typing.ByteString as an “imprecise” type.

Breaking down type annotation use cases

A) buffers that are passed to C code that use the buffer protocol
B) functions that use all the various methods of builtins.bytes (join, find, lower, etc) — these likely also work on bytearray “for free”
C) functions that hash bytes
D) functions that iterate over bytes or e.g. extend a bytearray

In the current world (where static duck typing lets us treat bytearray and memoryview as subtypes of bytes)

A) is not expressible
B) is typed as bytes. hooray static duck typing, type checkers let us pass bytearray and things work. but boo, static duck typing, we’re allowed to pass memoryview and things don’t work.
C) is typed as bytes. boo static duck typing, since type checkers let us pass bytearray and writable memoryview
D) is often typed as bytes, as shorthand for bytes | bytearray | memoryview, but should probably be typed as Sequence[int]
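
The runtime behaviour behind cases B and C can be checked directly. This is only a minimal sketch; the helper name `probe` is made up for illustration and is not part of any proposal.

```python
def probe():
    """Collect the runtime facts that the static bytes shorthand hides."""
    facts = {}

    # B) bytes methods work on bytearray "for free" ...
    facts["bytearray_lower"] = bytearray(b"ABC").lower()
    # ... but memoryview has none of them.
    facts["memoryview_has_lower"] = hasattr(memoryview(b"ABC"), "lower")

    # C) bytes is hashable, bytearray is not (TypeError).
    try:
        hash(bytearray(b"abc"))
        facts["bytearray_hashable"] = True
    except TypeError:
        facts["bytearray_hashable"] = False

    # A read-only memoryview hashes like the bytes it wraps,
    # while a writable memoryview refuses to hash (ValueError).
    facts["readonly_view_hash_matches"] = hash(memoryview(b"abc")) == hash(b"abc")
    try:
        hash(memoryview(bytearray(b"abc")))
        facts["writable_view_hashable"] = True
    except ValueError:
        facts["writable_view_hashable"] = False

    return facts
```

A type checker that accepts all three types wherever `bytes` is annotated cannot flag any of the failing cases above.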

In the PEP 688 world as proposed

A) is typed as types.Buffer
B) is typed as bytes | bytearray
C) is typed as bytes, or maybe bytes | memoryview if we’re willing to hope no one passes us a writable memoryview
D) probably should be typed as Sequence[int]

In a PEP 688 world but where we allow bytearray to duck type to bytes (but not memoryview)

A) is typed as types.Buffer
B) is typed as bytes
C) is typed as bytes but we risk someone passing us a bytearray (or bytes | memoryview, with similar caveat to above)
D) probably should be typed as Sequence[int]

Overall I think I could give or take duck type compatibility between bytearray and bytes, but I do strongly feel we should drop memoryview from the bytes shorthand. The only non Sequence[int]-like thing I can think of that works on bytes | bytearray | memoryview is b.hex()


types.Buffer provides no Python API at all, except that you are allowed to pass the object to C functions that accept a buffer (such as memoryview()).

I realized that this is a problem, because it means with PEP 688 as written now there is no way to express a type like "supports the buffer protocol and __getitem__ and __len__" (which is approximately the common interface of bytes and memoryview).

So to support that, we need a different approach. Perhaps we can make the core create a __buffer__ attribute on types that support the buffer protocol. Then in typing, we can simply create Protocols that check that attribute.

What should the value of the attribute be? We can’t support the full buffer protocol in a Python call, but perhaps we could make x.__buffer__() return memoryview(x). Alternatively, we could just set it to True.
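
To make the idea concrete, here is a rough sketch of what a Protocol built on such an attribute could look like, assuming the method variant (`__buffer__` returning a memoryview). The names `SizedIndexableBuffer` and `Packet` are hypothetical and exist only for this illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SizedIndexableBuffer(Protocol):
    """Hypothetical protocol: a buffer that also supports indexing and len()."""

    def __buffer__(self, flags: int) -> memoryview: ...
    def __getitem__(self, index: int) -> int: ...
    def __len__(self) -> int: ...


class Packet:
    """Toy class that provides all three methods by delegating to bytes."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def __buffer__(self, flags: int) -> memoryview:
        return memoryview(self._payload)

    def __getitem__(self, index: int) -> int:
        return self._payload[index]

    def __len__(self) -> int:
        return len(self._payload)


def first_byte(buf: SizedIndexableBuffer) -> int:
    # Only uses the part of the interface the protocol promises.
    return buf[0] if len(buf) else -1
```

The runtime `isinstance` check only looks for the presence of the three methods, so whether built-in types such as bytes match depends on the interpreter actually exposing a `__buffer__` attribute on them.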

Initially I thought that it has some relation to Allow objects implemented in pure Python to export PEP 3118 buffers · Issue #58006 · python/cpython · GitHub.

Thanks, that’s interesting. Support for buffer classes written in Python sounds like a job for a different PEP. But I learned that __buffer__ already has a meaning in PyPy (docs), so we shouldn’t use that name for flagging buffers unless we actually replicate PyPy’s functionality. I’d suggest __get_buffer__ if we go with a method and __is_buffer__ if we make it just a boolean.

There is an old problem with the documentation. It uses the terms “buffer” (unclear), “object that implements the buffer protocol” (too long), “bytes string”, “bytestring”, and “bytes-like object”.

The problem is that there are different degrees of “bytelikeness”.

  1. Supports the buffer protocol.
  2. Additionally supports len() which returns the size in bytes.
  3. Additionally supports indexing and iteration.
  4. Has most of the str methods (except encode() of course).
  5. Supports the contiguous buffer protocol.
  6. bytes and bytearray (they not only support the contiguous buffer protocol, they are also NUL-terminated).
  7. bytes only (it is hashable).

Different functions have different requirements for their arguments, and it is difficult to describe them correctly and unambiguously. We need to establish non-vague terminology and use it consistently.
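
Several of these degrees can be probed from Python today. The sketch below is an approximation only: the function name `bytelikeness` and the label strings are invented here, the "str methods" check just looks for `lower`, and NUL-termination (degree 6) is not observable from Python at all.

```python
from array import array


def bytelikeness(obj):
    """Return a set of labels describing how bytes-like obj is."""
    levels = set()
    try:
        view = memoryview(obj)  # degree 1: supports the buffer protocol
    except TypeError:
        return levels
    levels.add("buffer")
    if view.contiguous:  # degree 5: contiguous buffer
        levels.add("contiguous")
    if hasattr(obj, "__len__") and len(obj) == view.nbytes:
        levels.add("len_in_bytes")  # degree 2: len() is the size in bytes
    if hasattr(obj, "__getitem__"):
        levels.add("indexable")  # degree 3 (indexing only)
    if hasattr(obj, "lower"):
        levels.add("str_methods")  # degree 4, crude stand-in check
    try:
        hash(obj)
        levels.add("hashable")  # degree 7
    except (TypeError, ValueError):
        pass
    view.release()
    return levels


# array("i", ...) is a buffer, but its len() counts items, not bytes.
example = bytelikeness(array("i", [1, 2, 3]))
```

Running it on bytes, bytearray, memoryview and array gives four different label sets, which is the terminology problem described above in miniature.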

There is collections.abc.ByteString, which I expected could be used for type 4, but it seems it is going to be deprecated (Deprecate collections.abc.ByteString · Issue #91896 · python/cpython · GitHub).


I just talked about PEP 688 at a PyCon lightning talk presenting two options for supporting typing for the buffer protocol:

  1. Adding a new __buffer__(flags) dunder method

This would work similarly to PyPy’s __buffer__ method. We’d map this dunder to the bf_getbuffer slot in Python; Python objects would implement it by returning a memoryview.

Open question: What about the bf_releasebuffer slot?

  2. Adding an __isbuffer__ = True attribute to buffer objects

This is simpler and avoids having to deal with more of the complexities of the buffer protocol. However, this behavior would be unlike any other dunder, and it may be confusing to users if they set the field on a Python class and the class doesn’t actually become a buffer.


I like option 1 best, but I’d like to make sure it works well with the C buffer protocol.

To Serhiy’s point, I think the documentation is often a bit vague about terms like “buffer” or “sequence”. (Is a “sequence” a collections.abc.Sequence, or just a class that accepts ints in __getitem__, or something in between?) I would like to restrict the term buffer to “supports the buffer protocol”, and use more precise terms for the other possibilities.

I’ve written typesheds for MicroPython (GitHub - hlovatt/PyBoardTypeshed: Typesheds (a.k.a.: interface stubs, `pyi` files, and type hints) for MicroPython.) and this would be a great help because it will allow custom buffer types. I currently use:

AnyReadableBuf: Final = TypeVar("AnyReadableBuf", bytearray, array, memoryview, bytes)

AnyWritableBuf: Final = TypeVar("AnyWritableBuf", bytearray, array, memoryview)

Which brings me to the second point that distinguishing between readonly and readwrite is common in MicroPython and would be a valuable addition.

Would also suggest names: AnyReadableBuf and AnyWritableBuf to be consistent with AnyStr.

Thanks for the feedback! I’ll continue to try to think of ways to support writability in an elegant way.

(Also, a constrained TypeVar doesn’t make sense for this use case. We can talk about this further in Discussions · python/typing · GitHub if you like.)
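
For readers following along, one possible alternative to the constrained TypeVar is a plain union alias. This is an assumption about what might work, not necessarily what was meant above; the alias names `ReadableBuf` and `WritableBuf` and the helper `fill` are hypothetical.

```python
from array import array
from typing import Union

# Plain union aliases: each argument is checked on its own, with no
# requirement that several arguments share one exact type.
ReadableBuf = Union[bytes, bytearray, memoryview, array]
WritableBuf = Union[bytearray, memoryview, array]


def fill(buf: WritableBuf, value: int) -> None:
    """Set every byte of buf to value."""
    view = memoryview(buf)
    # A memoryview can be read-only, which the alias cannot express,
    # so writability still has to be checked at runtime.
    if view.readonly:
        raise TypeError("buffer is read-only")
    raw = view.cast("B")  # view the underlying memory as unsigned bytes
    for i in range(len(raw)):
        raw[i] = value
```

The runtime check shows the limitation: memoryview appears in both aliases, so the static types alone cannot distinguish a read-only view from a writable one.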

Sadly, I missed your lightning talk.

Is this instead of the Buffer type you’re proposing in PEP 688, or in addition?

Could you show a complete example?

We can talk about this further in Discussions · python/typing · GitHub if you like.

Is there an existing topic, or are you proposing to start one? If there is a better solution than a constrained type, I’m all for it.

Please open a new topic.

This would replace current PEP 688’s types.Buffer. I’ll write out some complete explanations.

Option 1: __buffer__ special method

This will allow implementing buffer types in Python too, so it’s also a significant non-typing change.

  • Buffer types implemented in C automatically get a __buffer__ method exposed in Python. It takes a flags: int argument and returns a memoryview wrapping the Py_buffer object returned by the underlying slot.
  • flags is the same as in C, an OR of various fields documented around Buffer Protocol — Python 3.11.0a7 documentation. For convenience, perhaps we should expose those flags in the stdlib somewhere (a types.BufferFlags enum?).
  • Types implemented in Python that define a __buffer__ method automatically get it mapped to the bf_getbuffer slot. They will then be usable as buffers (e.g., they can be passed
  • Not sure yet how this affects the bf_releasebuffer slot.
  • To check for buffers in typeshed or elsewhere, we can now simply define a Protocol with def __buffer__(self, flags: int) -> memoryview: ....
  • For convenience, we can add a typing.SupportsBuffer protocol defining this method. (Or it can go into collections.abc?)
  • For backporting, we can add typing_extensions.Buffer, and we can lie in typeshed that the __buffer__ method existed before 3.12.

Some code samples:

# typeshed builtins.pyi
class bytes:
    def __buffer__(self, flags: int) -> memoryview: ...

# typeshed typing.pyi
class SupportsBuffer(Protocol):
    def __buffer__(self, flags: int) -> memoryview: ...

# user code
from typing import SupportsBuffer

def need_buffer(bf: SupportsBuffer) -> None:
    memoryview(bf)

class MyBuffer:
    def __buffer__(self, flags: int) -> memoryview:
        return memoryview(b"hi")

need_buffer(MyBuffer())  # works

Option 2: __isbuffer__ = True attribute

This will allow checking for buffer types through a protocol, but not defining them in Python.

  • Buffer types implemented in C automatically expose an attribute __isbuffer__ = True.
  • If a Python type sets this attribute, nothing happens, except that it’s now lying about being a buffer.
  • As with Option 1, we can use Protocols to check for buffers, and add a typing.SupportsBuffer protocol for convenience.

How useful would it be for a Python object to define a __buffer__() method except in something like a Mock?

It seems that calling b.__buffer__() does the same thing as memoryview(b) except for something with the flags. If we only cared about the runtime behavior we could just add the flags to the memoryview() constructor? It seems that the relationship between __buffer__ and memoryview is similar to that between __len__ and len().

The bf_releasebuffer slot is called by memoryview(b).release() and after that the memory view is no longer usable (most operations just give errors).
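
The effect of releasing a buffer is visible from Python with a bytearray. A small sketch follows; the function name `demo_release` is made up for this example.

```python
def demo_release():
    """Show what an outstanding export and a release look like from Python."""
    data = bytearray(b"abc")
    view = memoryview(data)

    # While the view is alive, the bytearray cannot be resized.
    try:
        data.append(0x64)
        resize_blocked = False
    except BufferError:
        resize_blocked = True

    view.release()

    # After release(), the view itself is unusable.
    try:
        view[0]
        view_unusable = False
    except ValueError:
        view_unusable = True

    # With the export gone, resizing works again.
    data.append(0x64)
    return resize_blocked, view_unusable, bytes(data)
```

Any Python-level `__buffer__` design would need some way to give a Python class the same "export finished" notification that bytearray gets here.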

I like type checks using the presence of a method (i.e., __buffer__) better than checks for a data field (__isbuffer__).

I presume the part of PEP 688 about replacing arg: bytes with arg: Buffer is also invalidated? It just doesn’t mean the same thing. If we want to do something about arg: bytes implying arg: bytes | bytearray we should think harder IMO. I could easily be convinced that memoryview doesn’t belong in that union though.

It is important that bf_releasebuffer is set to NULL in bytes and to non-NULL in mutable types. Some C code only accepts types with this slot set to NULL.

Thanks, that’s an important detail that I forgot. Is it important enough to distinguish between Buffer and MutableBuffer?


Hi!

I am writing this on behalf of the Python Steering Council. Thanks a lot for submitting this PEP to us, and apologies for the delay in the response! We have discussed the PEP in detail and are generally happy to accept it, but there is one item we would like to discuss. The PEP proposes overloading __buffer__, but we noticed that several third-party libraries and projects already use this name, potentially with different semantics. Examples include PyPy, pyzmq and other popular projects (found through a simple GitHub search). We want the impact of this to at least be discussed in the PEP, and ideally we would like you to reach out to the maintainers if the plan continues to be to keep __buffer__ in the proposal.

To be clear: the only hard requirement here that we are asking for is having this aspect included in the document and the risk analyzed, but any further effort would be greatly appreciated.

Please, reach out to me or any other member of the Steering Council if you have any questions, or if you need any clarifications or if we can help in any way!

Thanks a lot for the great work!


Also, one clarification: we are aware that we have documented that dunder methods are subject to change without warning. The reason we want this aspect to be discussed is so we can consider openly the effects of the change between us and the community, not because we think anything has to change here in the proposal (in the sense that a different dunder needs to be used) necessarily.

Thanks @pablogsal!

Here’s some initial discussion of third-party __buffer__ methods:

So for the most part, existing uses of __buffer__ are compatible with PEP 688, but I’ll reach out to PyPy to hear their thoughts.

I agree that while dunder names are documented as reserved (2. Lexical analysis — Python 3.11.1 documentation; thanks @zware for finding the link), we shouldn’t break compatibility lightly if major third-party users are using a name that is technically reserved. However, in this case I think using the __buffer__ name actually enhances compatibility, so I’m leaning towards keeping the name, but documenting this decision in the PEP.

PyPy would love to see this adopted. As mentioned above, PyPy has a PyPy-specific __pypy__.bufferable base class with the __buffer__ method. The PyPy interpreter will look for that method on instances of classes that inherit from bufferable to implement memoryview(obj). This is what allows us to implement memoryview(ctypes.Array(...)) in pure Python.

I removed the undocumented use of __buffer__ from numpy in order to clear up any confusion around this convention.


Sorry for the delay (we had a couple of weeks that we could not meet due to holidays and stuff).

I am very happy to report here that the Steering Council approves PEP 688: Making the buffer protocol accessible in Python with the latest changes. Thanks a lot @JelleZijlstra for the fantastic work and the patience. This is a fantastic improvement and we are very happy to see it happening :slight_smile:

Congratulations! :tada: :metal:
