Question: PEP 3118 format strings and the buffer protocol

The Buffer protocol specification says that the buffer format string is

A NULL-terminated string in struct-module-style syntax describing the contents of a single item. If this is NULL, “B” (unsigned bytes) is assumed.

PEP 3118 says that a bunch of new format strings will be added to the struct module. Numpy uses these strings. But neither the struct module nor its documentation has been updated. This makes the documentation for the buffer protocol a bit misleading, since a buffer protocol consumer that is based on the docs.python.org buffer documentation does not work with numpy buffers.

So some questions:

  1. what is the status of PEP 3118 format strings?
  2. where should consumers of the buffer protocol go to learn what they should support in format strings?
  3. It would be nice if there were an API to validate format strings, so that it could act as a source of truth on whether a format string is valid. Or is it explicitly intended that producers can put other stuff in the format string?

I am getting some users complaining that my buffer protocol consumer isn’t compatible with numpy buffers, and that is why I’m curious about the status of this.

(Also, is this the right place for this question?)

The relevant PEP section lists proposed additions and says that the struct module will be updated. I don’t know if that happened. Do you know (by actual experiment)? Which is to say, is numpy using the stdlib struct or an augmented version? You may need to look at the numpy docs and ask on a numpy forum.

The ‘u’ and ‘w’ formats for UCS-2 and UCS-4 were appropriate in Python 2 and up to 3.2, when unicode strings were one or the other, but seem less so now that unicode strings use 1, 2, or 4 bytes per char, and a user cannot necessarily know which.

says that the struct module will be updated. I don’t know if that happened.

Per Jelle’s comment here, the c and ? codes have been added but the other changes have not happened. I have also directly checked that they are not implemented in the struct module.
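For anyone who wants to repeat that check, here is a minimal sketch. The helper name struct_accepts is made up for illustration; it only asks whether struct.calcsize can parse a given format string, and the answers may differ between Python versions.

```python
import struct


def struct_accepts(fmt: str) -> bool:
    """Return True if the stdlib struct module can parse `fmt`.

    Hypothetical helper for illustration; it is not part of any library.
    """
    try:
        struct.calcsize(fmt)
    except struct.error:
        return False
    return True


# "c" and "?" are the codes reported as added; the rest are PEP 3118
# extensions that producers such as numpy may emit.
for fmt in ["c", "?", "d", "O", "g", "T{d:a:l:b:}"]:
    print(f"{fmt!r:16} accepted by struct: {struct_accepts(fmt)}")
```

A consumer that relies on struct alone will therefore reject a struct-layout format like the one numpy produces for structured dtypes.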

The O code seems useful and simple to implement (I implemented it in 16 lines), though it is an easy way to segfault the interpreter. I agree that the u and w codes look… questionable. Maybe it would be nice to have a way to put utf8 into the struct, but it isn’t super important.

Numpy also uses:

  • O for type object
  • g for type np.longdouble
  • Zf, Zd, Zg for types cfloat, cdouble, and clongdouble
  • The w format for e.g., dtype="U25"
  • Struct layout T{} and optional name of preceding element :name::
>>> memoryview(np.ones([10], dtype=[('a', float), ('b', int)])).format
'T{d:a:l:b:}'

The full logic they use to generate the format string is here:
https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/buffer.c#L182-L193

But my primary concern isn’t the struct module. My question is:

what is valid for a producer of the buffer protocol to put in the format string?

And I think that ought to be documented somewhere.

is numpy using the stdlib struct or an augmented version?

Numpy isn’t using the struct module at all. The way that the struct module gets involved is that instead of documenting the buffer protocol format field in the buffer protocol documentation, the buffer protocol docs just have a link to the struct module docs. But this organization means that any features that are allowed as part of the buffer protocol but are not supported by the struct module go undocumented entirely.

Looking at the implementation of the buffer protocol, PyBuffer_SizeFromFormat uses struct.calcsize, so it will only work if the buffer format is understood by the struct module. Other than that, none of the buffer protocol implementation looks at the format field. The memoryview tolist() method also looks at the format; for many numpy buffers it will just raise NotImplementedError: memoryview: unsupported format.
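The tolist() behaviour can be reproduced without numpy, since ctypes structures also export a PEP 3118 struct-layout format. This is a small sketch; the Pair class is a made-up example type, and the exact format string depends on the platform.

```python
import ctypes


class Pair(ctypes.Structure):  # hypothetical example type
    _fields_ = [("a", ctypes.c_double), ("b", ctypes.c_long)]


view = memoryview(Pair(1.0, 2))
print(view.format)  # a struct-layout string beginning with "T{"

# memoryview cannot interpret the items of such a buffer...
try:
    view.tolist()
    tolist_failed = False
except NotImplementedError as exc:
    tolist_failed = True
    print(exc)

# ...but the raw bytes are still exported, so the producer/consumer
# exchange itself works; only the interpretation of items is missing.
raw = view.tobytes()
print(len(raw), ctypes.sizeof(Pair))
```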


Can you file a documentation issue on github.com/python/cpython to track fixing this?

PEP 3118 is the authority here. The fact that the struct module and the memoryview object may not support all operations on all buffer types is irrelevant. The buffer protocol allows interoperation between a producer and a consumer, and both can be arbitrary third-party libraries.

What is undocumented exactly? PEP 3118 (Revising the buffer protocol, on peps.python.org) is quite terse, but it still counts as documentation :)

Can you file a documentation issue on github.com/python/cpython to track fixing this?

Okay will do.

What is undocumented exactly? PEP 3118 (Revising the buffer protocol, on peps.python.org) is quite terse, but it still counts as documentation

Well…

  1. by documented I mean documented on docs.python.org. At least a link to the proposal would be good, though an actual explanation would be better. I think the usual standard for being documented isn’t “the information exists in a proposal but docs.python.org does not mention the proposal and contradicts it.”
  2. Meador Inge and others seem to suggest that PEP 3118 is overly vague and, in particular, leaves it unclear what the grammar for the format strings is. See "implement PEP 3118 struct changes" (Issue #47382, python/cpython on GitHub).

Thankfully we can use the Numpy implementation as a reference. I will try to help out with this.


Ah, fair enough. Yes, PEP 3118 should be linked to in the relevant portions of those docs. For example, it should be mentioned in the C API doc for Py_buffer::format.

Hmm, to be honest, I don’t think the most complex parts of PEP 3118 will ever be implemented anywhere (does any library make use of the “pointer” prefix? or of suboffsets?).

I opened another issue about this on numpy.

The following lark grammar seems to accept the formats used in ctypes and numpy:

?start: root
root: entry+
?entry: (array | _normal_entry ) name?

array: shape _normal_entry
shape: "(" _shape_body ")"
_shape_body: (NUMBER ",")* NUMBER

_normal_entry: pointer | (byteorder? repeat? ( padding | struct | prim ))
pointer: "&" entry

struct: "T{" entry* "}"
padding: "x"


name:  ":" IDENTIFIER ":"
byteorder: BYTEORDER
repeat: NUMBER
prim: PRIMITIVE


IDENTIFIER: /[^:\s]+/
NUMBER: ("0".."9")+
BYTEORDER: "@" | "=" | "<" | ">" | "^" | "!"
PRIMITIVE: "X{}" | "Zf" | "Zd" | "Zg" | /[?cbBhHiIlLqQfdgOs]/

%ignore /\s+/
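
For consumers that don't want a lark dependency, the same rules can be approximated with a small hand-written checker. This is only a sketch that mirrors the grammar above, not an official validator: the names is_valid_format and FormatError are made up, identifiers are assumed to be any run of characters without ':' or whitespace, and whitespace is only skipped between entries and their parts.

```python
import re

_WS = re.compile(r"\s*")
_NUMBER = re.compile(r"[0-9]+")
_SHAPE = re.compile(r"\(\s*[0-9]+(?:\s*,\s*[0-9]+)*\s*\)")
_PRIM = re.compile(r"X\{\}|Z[fdg]|[?cbBhHiIlLqQfdgOs]")
_NAME = re.compile(r":[^:\s]+:")  # assumption: no ':' or whitespace in names
_BYTEORDER = "@=<>^!"


class FormatError(ValueError):
    """Raised internally when the format string does not match the rules."""


def _skip(s, i):
    # Advance past optional whitespace.
    return _WS.match(s, i).end()


def _entry(s, i):
    """Parse one entry starting at offset i; return the offset after it."""
    i = _skip(s, i)
    m = _SHAPE.match(s, i)              # optional array shape, e.g. "(2,3)"
    if m:
        i = _skip(s, m.end())
    if s.startswith("&", i):            # pointer: "&" followed by an entry
        return _entry(s, i + 1)
    if i < len(s) and s[i] in _BYTEORDER:
        i = _skip(s, i + 1)
    m = _NUMBER.match(s, i)             # optional repeat count
    if m:
        i = _skip(s, m.end())
    if s.startswith("T{", i):           # nested struct: "T{" entry* "}"
        i = _skip(s, i + 2)
        while not s.startswith("}", i):
            i = _skip(s, _entry(s, i))  # raises at end of input if unclosed
        i += 1
    elif s.startswith("x", i):          # padding byte
        i += 1
    else:
        m = _PRIM.match(s, i)
        if m is None:
            raise FormatError(f"unexpected input at offset {i}: {s[i:i + 8]!r}")
        i = m.end()
    m = _NAME.match(s, _skip(s, i))     # optional ":name:" suffix
    return m.end() if m else i


def is_valid_format(fmt: str) -> bool:
    """Return True if fmt consists of one or more well-formed entries."""
    try:
        i = _skip(fmt, 0)
        if i == len(fmt):
            return False                # the grammar requires at least one entry
        while i < len(fmt):
            i = _skip(fmt, _entry(fmt, i))
        return True
    except FormatError:
        return False


for fmt in ["T{d:a:l:b:}", "(2,3)<d", "Zd", "T{d", "w"]:
    print(f"{fmt!r:16} valid: {is_valid_format(fmt)}")
```

Like the grammar it follows, this checker rejects the w code, because w is not in the PRIMITIVE list above.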