A NULL-terminated string in struct-module-style syntax describing the contents of a single item. If this is NULL, “B” (unsigned bytes) is assumed.
PEP 3118 says that a bunch of new format strings will be added to the struct module. Numpy uses these strings. But the struct module and the struct module documentation have not been updated. This makes the documentation for the buffer protocol a bit misleading, since a buffer protocol consumer that is based on the docs.python.org buffer documentation does not work with numpy buffers.
So some questions:
what is the status of PEP 3118 format strings?
where should consumers of the buffer protocol go to learn what they should support in format strings?
It would be nice if there were some api to validate format strings so that it could act as a point of truth on whether a format string is valid. Or is it explicitly intended that producers can put other stuff in the format string?
I am getting some users complaining that my buffer protocol consumer isn’t compatible with numpy buffers, and that is why I’m curious about the status of this.
(Also, is this the right place for this question?)
The relevant PEP section lists proposed additions and says that the struct module will be updated. I don’t know if that happened. Do you know (by actual experiment)? (Which is to say, is numpy using the stdlib struct or an augmented version? You may need to look at the numpy docs and ask on a numpy forum.)
The ‘u’ and ‘w’ formats for ucs-2 and ucs-4 were appropriate in Python2 and up to 3.2, when unicode strings were one or the other, but seem less so when unicode strings use 1, 2, or 4 bytes per char, and a user cannot necessarily know.
says that the struct module will be updated. I don’t know if that happened.
Per Jelle’s comment here, the c and ? codes have been added but the other changes have not happened. I have also directly checked that they are not implemented in the struct module.
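A direct check like the one described can be sketched with struct.calcsize, which raises struct.error for a format it does not understand. The helper name struct_accepts and the list of candidate codes are my own choices for illustration, and the result depends on the interpreter version you run it on.

```python
import struct

def struct_accepts(fmt):
    """Return True if this interpreter's struct module can size `fmt`."""
    try:
        struct.calcsize(fmt)
    except struct.error:
        return False
    return True

# "c" and "?" are existing struct codes. The others are PEP 3118 proposals
# or NumPy usages mentioned in this thread (chosen here for illustration).
candidates = ["c", "?", "O", "g", "Zd", "w", "T{i:x:}"]
support = {fmt: struct_accepts(fmt) for fmt in candidates}

for fmt, ok in support.items():
    print(f"{fmt!r:12} {'accepted' if ok else 'rejected'} by struct")
```

Running this on your own interpreter shows which of the PEP 3118 codes, if any, the stdlib struct module handles there.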
The O code seems useful and simple to implement (I implemented it in 16 lines), though it is an easy way to segfault the interpreter. I agree that the u and w codes look… questionable. Maybe it would be nice to have a way to put utf8 into the struct, but it isn’t super important.
Numpy also uses:
O for type object
g for type np.longdouble
Zf, Zd, Zg for types cfloat, cdouble, and clongdouble
The w format for e.g., dtype="U25"
Struct layout T{} and optional name of preceding element :name:
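To make the list above concrete, here is a rough sketch of how a consumer might tokenize a format string that uses these extensions. This is not an official grammar: the function name tokenize, the token names, and the exact set of accepted codes (struct codes plus O, g, Zf/Zd/Zg, u, w, the "^" prefix, T{} and :name:) are assumptions based on this thread and PEP 3118, and the sketch ignores other PEP 3118 features such as shapes and pointers.

```python
import re

# Hypothetical token grammar: not taken from CPython or NumPy.
_TOKEN = re.compile(
    r"(?P<order>[@=<>!^])"        # byte-order / alignment prefix
    r"|(?P<open>T\{)"             # start of a struct layout
    r"|(?P<close>\})"             # end of a struct layout
    r"|:(?P<name>[^:{}]+):"       # name of the preceding element
    r"|(?P<count>\d*)(?P<code>Z[fdg]|[xcbB?hHiIlLqQnNefdgspPOuw])"
)

def tokenize(fmt):
    """Split `fmt` into (kind, value) tuples or raise ValueError."""
    tokens, depth, pos = [], 0, 0
    while pos < len(fmt):
        m = _TOKEN.match(fmt, pos)
        if m is None:
            raise ValueError(f"unrecognized format at offset {pos}: {fmt[pos:]!r}")
        if m.group("order"):
            tokens.append(("order", m.group("order")))
        elif m.group("open"):
            depth += 1
            tokens.append(("struct_open", "T{"))
        elif m.group("close"):
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced '}'")
            tokens.append(("struct_close", "}"))
        elif m.group("name") is not None:
            tokens.append(("name", m.group("name")))
        else:
            # A type code with an optional repeat count (default 1).
            count = int(m.group("count") or 1)
            tokens.append(("item", (count, m.group("code"))))
        pos = m.end()
    if depth:
        raise ValueError("unbalanced 'T{'")
    return tokens

print(tokenize("T{<i:x:<d:y:}"))
print(tokenize("25w"))
```

A consumer could use something like this to reject formats it cannot interpret with a clear error, instead of failing later when it tries to read items.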
But my primary concern isn’t about the struct module, my question is:
what is valid for a producer of the buffer protocol to put in the format string?
And that I think it ought to be documented somewhere.
is numpy using the stdlib struct or an augmented version?
Numpy isn’t using the struct module at all. The way that the struct module gets involved is that instead of documenting the buffer protocol format field in the buffer protocol documentation, the buffer protocol docs just have a link to the struct module docs. But this organization means that any features that are allowed as part of the buffer protocol but are not supported by the struct module go undocumented entirely.
Looking at the implementation of the buffer protocol, PyBuffer_SizeFromFormat uses struct.calcsize, so it will only work if the buffer format is understood by the struct module. Other than that, none of the buffer protocol implementation looks at the format field. The memoryview tolist() method also looks at the format; for many numpy buffers it will just raise NotImplementedError: memoryview: unsupported format.
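The same behaviour can be reproduced without numpy, since ctypes structures also export buffers with a PEP 3118 struct layout as their format. The following is a small sketch under that assumption; the Point class and the raises helper are made up for the example, and struct.calcsize stands in for the C-level PyBuffer_SizeFromFormat.

```python
import ctypes
import struct

class Point(ctypes.Structure):
    # Hypothetical two-field struct, used only to obtain a "T{...}" format.
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_double)]

view = memoryview(Point(1, 2.0))
fmt = view.format  # a struct layout string beginning with "T{"

def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False

# memoryview cannot interpret the items, and struct cannot size the format.
tolist_fails = raises(NotImplementedError, view.tolist)
calcsize_fails = raises(struct.error, struct.calcsize, fmt)

print(fmt, tolist_fails, calcsize_fails)
print(view.itemsize == ctypes.sizeof(Point))
```

Note that itemsize is still reported correctly by the producer, so a consumer that only needs the raw bytes does not have to parse the format at all.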
PEP 3118 is the authority here. The fact that the struct module and the memoryview object may not support all operations on all buffer types is irrelevant. The buffer protocol allows interoperation between a producer and a consumer, and both can be arbitrary third-party libraries.
By documented I mean documented on docs.python.org. At least a link to the proposal would be good, though an actual explanation would be better. I think the usual standard for being documented isn’t “the information exists in a proposal but docs.python.org does not mention the proposal and contradicts it.”
Ah, fair enough. Yes, PEP 3118 should be linked to in relevant portions of those docs. For example, it should be mentioned in the C API doc to Py_buffer::format.
Hmm, to be honest, I don’t think the most complex parts of PEP 3118 will ever be implemented anywhere (does any library make use of the “pointer” prefix? or of suboffsets?).