PEP 688: Making the buffer protocol accessible in Python

I just posted PEP 688 to support static typing of buffer objects:

Summary:

  • Add types.Buffer for use in isinstance/issubclass checks for the buffer protocol. (Implemented through an __instancecheck__ written in C that checks for the bf_getbuffer slot; a rough Python approximation is sketched below.)
  • Remove the static typing special case that makes bytes also mean bytearray and memoryview to static type checkers.
  • The PEP is intended for Python 3.12; the 3.11 feature freeze is too close.
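
For reference, here is a rough pure-Python approximation of the proposed isinstance check. The real implementation is in C and inspects the bf_getbuffer slot directly; from pure Python the closest we can get is attempting to take a buffer, which is close but not identical:

class _BufferMeta(type):
    def __instancecheck__(cls, instance) -> bool:
        # The C implementation checks the type's bf_getbuffer slot.
        # From pure Python, the closest approximation is trying to
        # take a buffer from the object.
        try:
            memoryview(instance)
        except TypeError:
            return False
        return True

class Buffer(metaclass=_BufferMeta):
    """Rough Python approximation of the proposed types.Buffer."""

assert isinstance(b"data", Buffer)
assert isinstance(bytearray(b"data"), Buffer)
assert isinstance(memoryview(b"data"), Buffer)
assert not isinstance("text", Buffer)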

I’m happy to hear any feedback, but some specific questions:

  • Are there use cases other than static typing for runtime checking of buffer types? Does the PEP’s design support those use cases?
  • Should we attempt to support checking for writable buffers as opposed to readonly buffers? If so, how should we do that?
1 Like

Presumably C code can already check for the presence of bf_getbuffer to determine whether a type is a buffer? Could typing-extensions have a small C module that provides types.Buffer exactly as in CPython?

Could you explain more about removing the “bytes also means bytearray” special case? As the referenced test shows, Werkzeug makes some assumptions that rely on it. I’m not sure what I think yet. It’s not that we explicitly do anything to support bytearray along with bytes; it just happens to work. It seems like that should still type check: if something is marked as accepting bytes, then passing a bytearray should still pass as long as only valid things are done to it.

That’s possible. Currently typing-extensions is a single Python file. I worry that making it into a package including C code will cause problems: more risk of breakage on unusual platforms; more complicated wheel builds; harder to get it approved for users who need security approval to install third-party packages; breakage for people who vendor it.

This is discussed in PEP 688 – Making the buffer protocol accessible in Python | peps.python.org.

This is definitely a point I’m open to changing if there’s consensus in favor of keeping the special case, but I’d rather not:

  • In general, when you put a nominal type in an annotation, that means you accept only that type. Exceptions to this rule make the type system harder to understand for users.
  • bytearray is mostly compatible with bytes, but not entirely. Notably, it’s not hashable. If you have an API that accepts bytes and you decide to add caching (using the bytes object as a dict key), suddenly your API won’t work for bytearray any more.
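
A minimal illustration of that pitfall (the function and cache are hypothetical, just for the example):

cache: dict[bytes, str] = {}

def decode_cached(data: bytes) -> str:
    # Membership testing and insertion both hash the key.
    if data not in cache:
        cache[data] = data.decode("utf-8")
    return cache[data]

decode_cached(b"spam")             # fine
decode_cached(bytearray(b"spam"))  # TypeError: unhashable type: 'bytearray'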
3 Likes

I have to admit I haven’t had time to read the PEP yet, but I worry that this is going to cause a lot of churn and permanently add a lot of ugliness to all signatures that currently support bytes (since most of them also support bytearray and many also support memoryview).

1 Like

I did read that section, but my concern is the same as Guido’s. This makes the signature more complex, which makes it harder for general users to understand.

From my perspective it doesn’t make the types more accurate. Bytearray is duck-type compatible with bytes in the vast majority of cases where it is implicitly allowed right now.

This mypy issue is a great example of why this PEP is needed, in my opinion: Surprising error involving bytes duck-type compatibility · Issue #12643 · python/mypy · GitHub

It’s true that the status quo “works” in a lot of situations. But when it doesn’t work, it can have utterly bewildering effects for end users. In this case, mypy’s special-casing for bytes duck-typing has led it to assess that a line of code is simultaneously reachable and yet also unreachable.

You might argue that this specific case is an issue with mypy’s implementation, rather than an issue with the general rule of “type checkers should pretend bytearray and memoryview are compatible with bytes”. But it’s hard for me to see what the “correct” solution for mypy is here, if we want to maintain the fiction that memoryview and bytearray are subtypes of bytes.
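
For readers who haven’t followed the typing side closely, a small illustration of what the special case permits (not the code from the linked issue, just the general rule):

def takes_bytes(data: bytes) -> None:
    print(data.hex())

# Under the current promotion rule, type checkers accept all three calls,
# even though only the first argument is actually a bytes object:
takes_bytes(b"abc")
takes_bytes(bytearray(b"abc"))
takes_bytes(memoryview(b"abc"))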

@Jelle, in the PEP, perhaps it would be good to discuss in a bit more depth the problems that the status quo has caused over the years? There are a few links to external discussions, but not much discussion in the text itself.

The advantage of the status quo is that it almost always works (hashing bytes isn’t that common) and makes the annotations a lot more readable. With this proposal, every

def foo(arg: bytes) -> bytes:
    ...

may have to become

def foo(arg: bytes|bytearray|memoryview) -> bytes|bytearray|memoryview:
    ...

and that’s just a lot harder to grok for a human. Plus it may be incorrect – depending on what the code does it may never return a memoryview even if the input is one.

I worry that this is just going to be one of the many papercuts that will turn off the “no typing in my backyard” crowd while not solving an important pain point for typing users.

I would also like to remind readers that there are plenty of other situations where types don’t capture the whole picture. The intention of typing for Python was always to be useful, not necessarily to be 100% correct and precise.

2 Likes

It certainly feels like that to me (as someone who is at least mildly type-averse).

Huge +1 on this point. The more “correct and precise” typing gets, the less practical and useful it feels to me. That’s not a criticism of the people working hard on typing, it just seems to me to reflect very different needs and experiences.

The current behaviour of bytes as an annotation never even occurred to me - it “just works” the way I expect it to (by which I mean I say bytes, but actually I’m fine with anything that acts sufficiently like a bytes object - duck typing at its best!). Whereas the proposed approaches - whether bytes|bytearray|memoryview or types.Buffer - act as a stumbling block for me, because suddenly I have to think “what do I really mean?” and yet I can’t express the answer I want to give, which is “bytes, or stuff that works like bytes but I don’t want to get too precise or restrictive here…”

I’m mostly in favour of having a type that expresses “exposes the buffer protocol”, if people have a need for that. But I’d rather it were thought of as a specialised thing, and “normal” usage remained bytes (with the existing special case that makes it “do what I mean” for casual use retained).

2 Likes

Thanks for the feedback; it sounds like I should reconsider the bytes/bytearray promotion. I’ll do some more research in this area and report back.

One more question though for those who like the bytes shortcut: Should it really also include memoryview? Unlike bytearray, memoryview lacks most of the interface bytes provides; it’s just a buffer that provides sequence access.
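
A quick illustration of that gap, using only builtins:

data = memoryview(b"hello world")
len(data)      # works: 11
data[0]        # works: 104
data.hex()     # works: '68656c6c6f20776f726c64'
data.split()   # AttributeError: 'memoryview' object has no attribute 'split'
data.upper()   # AttributeError as well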

Maybe I’m misunderstanding, but per the primary change in this PEP, couldn’t this instead become just

from types import Buffer
def foo(arg: Buffer) -> Buffer:
    ...

unless you specifically needed those exact types, and only those types? Or is that not adequate here?

Also, without this, how would I explicitly specify that I want specifically bytes and not bytearray or memoryview?

Just one data point, but as a non-expert user with some experience using static typing on certain projects and not others (after initially being rather averse to it), I’d find this sort of implicit magic rather confusing and complicated to remember and keep track of. And I’m not sure it really changes things for those who aren’t fans of static typing.

Or perhaps, as a compromise, type checkers could be encouraged to expose an option controlling their behavior in this case, e.g. --strict could turn the promotion off, or it could require some flag to turn on?

1 Like

I am not sure what Python-level API types.Buffer promises. E.g. does it have a .lower() method? I can’t even find the docs for it. Now, as long as we’re talking about arguments in stub files, that doesn’t matter, since all we need is that bytes and bytearray subclass Buffer. But for return types this would limit what the recipient can do, and for code using inline type annotations (i.e. in .py files) it’s also important.

The simplest Python API is memoryview: you can do memoryview(obj) if obj implements the buffer protocol.

As evidenced by my message to typing-sig (apologies, sent before discussions-to was updated), I’m sympathetic to concerns about breaking bytes / bytearray compatibility. However, I find Jelle’s point about hashability fairly compelling — and in general it’s nice to be able to reflect immutability in one’s types.

but I don’t want to get too precise or restrictive here

Part of the issue is that using bytes as a shorthand prevents people who want to be precise from being precise. If we choose to remove the shorthand, maybe we could suggest use of typing.ByteString as an “imprecise” type.

Breaking down type annotation use cases

A) buffers that are passed to C code that use the buffer protocol
B) functions that use all the various methods of builtins.bytes (join, find, lower, etc) — these likely also work on bytearray “for free”
C) functions that hash bytes
D) functions that iterate over bytes or e.g. extend a bytearray

In the current world (where static duck typing lets us treat bytearray and memoryview as subtypes of bytes)

A) is not expressible
B) is typed as bytes. hooray static duck typing, type checkers let us pass bytearray and things work. but boo, static duck typing, we’re allowed to pass memoryview and things don’t work.
C) is typed as bytes. boo static duck typing, since type checkers let us pass bytearray and writable memoryview
D) is often typed as bytes, as shorthand for bytes | bytearray | memoryview, but should probably be typed as Sequence[int]

In the PEP 688 world as proposed

A) is typed as types.Buffer
B) is typed as bytes | bytearray
C) is typed as bytes, or maybe bytes | memoryview if we’re willing to hope no one passes us a writable memoryview
D) probably should be typed as Sequence[int]

In a PEP 688 world but where we allow bytearray to duck type to bytes (but not memoryview)

A) is typed as types.Buffer
B) is typed as bytes
C) is typed as bytes but we risk someone passing us a bytearray (or bytes | memoryview, with similar caveat to above)
D) probably should be typed as Sequence[int]

Overall I think I could take or leave duck-type compatibility between bytearray and bytes, but I do strongly feel we should drop memoryview from the bytes shorthand. The only non-Sequence[int]-like thing I can think of that works on bytes | bytearray | memoryview is b.hex().
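
For concreteness, a sketch of what signatures for the four cases could look like in the PEP 688 world as proposed (function names are invented for illustration; Buffer is the name from the PEP draft and doesn’t exist yet):

from collections.abc import Sequence
from types import Buffer  # proposed in the PEP draft, not available today

def write_to_socket(data: Buffer) -> None: ...           # case A
def find_delimiter(data: bytes | bytearray) -> int: ...  # case B
def cache_lookup(key: bytes) -> str: ...                 # case C
def checksum(data: Sequence[int]) -> int: ...            # case D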

3 Likes

types.Buffer provides no Python API at all, except that you are allowed to pass the object to C functions that accept a buffer (such as memoryview()).

I realized that this is a problem, because it means that with PEP 688 as currently written there is no way to express a type like "supports the buffer protocol and __getitem__ and __len__" (which is approximately the common interface of bytes and memoryview).

So to support that, we need a different approach. Perhaps we can have the core interpreter set a __buffer__ attribute on types that support the buffer protocol. Then in typing, we can simply create Protocols that check for that attribute.

What should the value of the attribute be? We can’t support the full buffer protocol in a Python call, but perhaps we could make x.__buffer__() return memoryview(x). Alternatively, we could just set it to True.
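
A rough sketch of what such Protocols could look like if __buffer__ were a method returning a memoryview (both the name and the shape are exactly what’s being discussed here, so treat this as illustrative only):

from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsBuffer(Protocol):
    # Hypothetical dunder; it could just as well be a boolean flag.
    def __buffer__(self) -> memoryview: ...

@runtime_checkable
class SupportsIndexableBuffer(SupportsBuffer, Protocol):
    # "Buffer plus __getitem__ and __len__": roughly the common
    # interface of bytes and memoryview mentioned above.
    def __len__(self) -> int: ...
    def __getitem__(self, index: int) -> int: ...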

Initially I thought this had some relation to Allow objects implemented in pure Python to export PEP 3118 buffers · Issue #58006 · python/cpython · GitHub.

Thanks, that’s interesting. Support for buffer classes written in Python sounds like a job for a different PEP. But I learned that __buffer__ already has a meaning in PyPy (docs), so we shouldn’t use that name for flagging buffers unless we actually replicate PyPy’s functionality. I’d suggest __get_buffer__ if we go with a method and __is_buffer__ if we make it just a boolean.

There is an old problem with the documentation. It uses the terms “buffer” (not clear), “object that implements the buffer protocol” (too long), “bytes string”, “bytestring”, and “bytes-like object”.

The problem is that there are different degrees of “bytelikeness”.

  1. Supports the buffer protocol.
  2. Additionally supports len() which returns the size in bytes.
  3. Additionally supports indexing and iteration.
  4. Has most of str methods (except encode() of course).
  5. Supports the contiguous buffer protocol.
  6. bytes and bytearray (they not only support the contiguous buffer protocol, they are also NUL-terminated).
  7. bytes only (it is hashable).

Different functions have different requirements for their arguments, and it is difficult to describe them correctly and unambiguously. We need to establish non-vague terminology and use it consistently.

There is collections.abc.ByteString, which I expected could be used for degree 4, but it seems it is going to be deprecated (Deprecate collections.abc.ByteString · Issue #91896 · python/cpython · GitHub).

3 Likes