Buffer Protocol with PEP3118 struct string syntax

Hi everyone.

I have been digging through the buffer protocol and realized that it appears to specify native Python support for arbitrary nested data structures, pretty much like what you can do with nested C structures, plus arbitrary N-dimensional arrays. Previously, my assumption was that - for example - the struct module supports only flat structures, and no arrays at all.

However, PEP3118 goes further, see here. The “Additions to the struct string-syntax” seem to be as powerful as structured Numpy arrays (or record arrays) with custom (and possibly nested) dtypes. PEP3118 specifically mentions the intention to support more complex memory layouts as in Numpy and the ctypes module.

The PEP is pretty clear that (referring to the additions):

The struct module will be changed to understand these as well and return appropriate Python objects on unpacking.

However, those additions seem to be unsupported in Python 3.13.

The two examples given in PEP3118 (“Nested structure” and “Nested array”) are rejected when passed to struct.calcsize or struct.pack, raising a

struct.error: bad char in struct format.
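A minimal reproduction (the format string below is a simplified nested-structure format in the PEP 3118 extended syntax, not the PEP's exact example):

```python
import struct

# A simplified format using PEP 3118's nested-structure syntax:
# a double named "a" followed by an array of three doubles named "b".
fmt = "T{d:a:(3)d:b:}"

try:
    struct.calcsize(fmt)
except struct.error as exc:
    print(exc)  # bad char in struct format
```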

Do I misunderstand something, or was it really forgotten to implement this section of PEP3118?

Some background information on how I came across this issue:

For quite a long time now, I have found it convenient to use Numpy arrays to make more complex C structures accessible in Python. I specifically use Numpy arrays with custom structured dtypes to interface with native code written in C (and compiled as Python C extensions). I work with numerical algorithms and often need to pass large double arrays back and forth between Python and compiled C code. Numpy is quite convenient here because, according to its documentation, numpy.dtype(…, align=True) guarantees binary compatibility with a standard C compiler. So I can manipulate my memory structures in native Python while compiled C code operates on the same data structure. I use this mechanism a lot, and most of the time the data structures are not hardcoded but parametric (the C code is generated code).
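As an illustration of that workflow (the struct layout here is a made-up example, not from my actual code): with align=True, Numpy inserts the same padding a standard C compiler would.

```python
import numpy as np

# Hypothetical layout mirroring this C struct:
#   struct Sample { int32_t id; double values[4]; };
dt = np.dtype([("id", "i4"), ("values", "f8", 4)], align=True)

# With align=True, Numpy pads "id" out to the 8-byte alignment of double,
# matching what a standard C compiler produces for the struct above.
print(dt.itemsize)             # 40 = 4 (int32) + 4 (padding) + 4 * 8 (doubles)
print(dt.fields["values"][1])  # 8: offset of "values" after the padding
```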

Having a Python-native way to access structured binary data is definitely a good idea, and superior to relying on third-party libraries. That’s why I like the approach chosen in PEP3118.

I also found that Numpy actually implements the syntax additions from PEP3118, as in this example:

>>> import inspect
>>> import numpy
>>> dt = numpy.dtype([("a", "f8"), ("b", "f8", 3), ("c", "u4", (2, 2))])
>>> arr = numpy.zeros(shape=(), dtype=dt)
>>> buf = arr.__buffer__(inspect.BufferFlags.FORMAT)
>>> buf.format
'T{d:a:(3)d:b:(2,2)I:c:}'

In plain words: we define a numpy.dtype for a custom data structure, then request a memoryview via Python’s buffer protocol, and find that Numpy gives us a format string compliant with the extended PEP3118 syntax, and equivalent to the custom dtype.
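The same format string is also exposed through a plain memoryview, without going through the __buffer__ dunder and inspect.BufferFlags (a sketch using the same dtype as above):

```python
import numpy as np

dt = np.dtype([("a", "f8"), ("b", "f8", 3), ("c", "u4", (2, 2))])
arr = np.zeros(shape=(), dtype=dt)

# memoryview requests the full buffer info, including the format string,
# so the extended PEP 3118 syntax is visible here as well.
mv = memoryview(arr)
print(mv.format)    # 'T{d:a:(3)d:b:(2,2)I:c:}'
print(mv.itemsize)  # 48 = 8 + 3 * 8 + 2 * 2 * 4 (no padding without align=True)
```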

So I’m actually wondering how I can interface with such an array using only the Python standard library - i.e. using the struct module, which I would expect to support the same syntax.

Thanks for any help.

Indeed, this was left out of the implementation of PEP 3118, back in 2006. You currently can’t do this with struct.
You can use ctypes for some related use cases, but the API is not convenient or complete, the encoding may differ slightly, and it’s not well tested:

import ctypes
class S(ctypes.Structure):
    #_pack_ = 1  # (try this)
    _fields_ = [
        ("field_a", ctypes.c_float),
        ("field_b", ctypes.c_double * 3),
        ("field_c", ctypes.c_char * 2 * 2),
    ]
    
print(memoryview(S()).format)
# → T{<f:field_a:4x(3)<d:field_b:(2,2)<c:field_c:4x}
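For the other direction, ctypes can at least overlay a structure on an existing buffer without copying; a minimal sketch (the structure and field names here are made up for illustration):

```python
import ctypes
import struct

class Point(ctypes.Structure):
    # Hypothetical layout for illustration: four contiguous native doubles.
    _fields_ = [
        ("x", ctypes.c_double),
        ("y", ctypes.c_double * 3),
    ]

buf = bytearray(ctypes.sizeof(Point))  # 32 zeroed bytes
p = Point.from_buffer(buf)             # zero-copy view over the bytearray
p.x = 1.5
p.y[2] = 2.5

# Since this particular layout is flat (no padding), struct can still
# read the same memory back as four native doubles.
print(struct.unpack("@4d", buf))  # (1.5, 0.0, 0.0, 2.5)
```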

As for “finishing the implementation” of PEP 3118, I think the ship has sailed and adding these to struct would need a new design. (And I propose to clarify that in the PEP.)

Since you posted in Core Development: yes, I’m willing to mentor someone who’d want to put in the work of researching and implementing something that would suit struct, ctypes and numpy (and others?).


If there’s interest in improving the buffer protocol implementation in Python, something that would make this much more attractive from the NumPy side is to make it easier to work with arbitrary data types. See this thread for the last time NumPy developers broached this.

While the Arrow C Data interface would definitely help in many use cases, it’s not quite what NumPy needs (e.g. >1D data, strides, etc), and the buffer protocol has the nice advantage that it’s already used throughout the ecosystem.

It would also be nice to fix the thread safety issues inherent in the design of the buffer protocol. Producers of buffers ought to have a way to do borrow checking on the buffer.

Sorry if this thread is actually unrelated to coming up with a successor to the buffer protocol. I just thought I’d throw it out there because it would unlock a lot of cool functionality in the future to allow zero-copy data sharing with arbitrary data layouts.


I commented there.
tl;dr: I don’t see how we can put a stdlib-sized (small!) scope around arbitrary data types like datetimes or sparse arrays.