Buffer protocol and arbitrary (data) types

The buffer protocol is hugely important in many scientific applications. However, it is limited to a fairly fixed set of types based on the single-character C type codes (the struct interface).

We could extend this. For example, that may make sense for bfloat16, which is gaining popularity; for bfloat16 it would make sense to simply agree on a new character. But overall, there are more use-cases that I suspect would benefit (datetimes, or just arbitrary user-defined types).

One option would be to try to use name metadata on valid buffer formats. But memoryview() doesn't care about the buffer's format being valid (unless you do indexing): memoryview(invalid_buf_fmt_buffer) works, and it only raises a NotImplementedError when you actually index an "invalid" buffer.
That means we can stuff pretty much arbitrary content into the buffer format, so long as we are certain that it is a clearly invalid format.

So I am wondering if we could extend the buffer protocol by defining a new type-code like:

[module$qualname$param]

as a generic extension “type-code” (or really any variations of something like this). The [] should make it clearly invalid currently, I think.

When e.g. NumPy sees such a [] enclosed format, it could then define a protocol on top of it like:

getattr(sys.modules[module], qualname)._numpy_dtype_from_buffer_fmt(param, byteorder="=")

which would have to return a valid NumPy dtype instance. An important part here is that NumPy doesn't have to recognize the dtype directly. It could be defined by a downstream library. (If there are no security concerns, NumPy could import the module. If there are concerns, NumPy could raise an error asking the user to import it.)
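
For concreteness, a rough Python sketch of that consumer-side lookup, under the assumptions above (the [module$qualname$param] layout and the _numpy_dtype_from_buffer_fmt hook are the proposal here, not existing API; the helper name is made up):

    import re
    import sys

    _EXT_FMT = re.compile(r"^\[([^$\]]+)\$([^$\]]+)\$([^\]]*)\]$")

    def resolve_custom_dtype(fmt, byteorder="="):
        # Parse "[module$qualname$param]" and dispatch to the defining type.
        m = _EXT_FMT.match(fmt)
        if m is None:
            raise ValueError(f"not an extension format: {fmt!r}")
        module, qualname, param = m.groups()
        if module not in sys.modules:
            # If importing on the exporter's behalf is a security concern,
            # ask the user to import the defining module instead.
            raise TypeError(f"import {module!r} to use buffers of format {fmt!r}")
        obj = sys.modules[module]
        for name in qualname.split("."):
            obj = getattr(obj, name)
        return obj._numpy_dtype_from_buffer_fmt(param, byteorder=byteorder)

    # Hypothetical usage: resolve_custom_dtype("[numpy.dtypes$StringDType$]")
    # would call numpy.dtypes.StringDType._numpy_dtype_from_buffer_fmt("", byteorder="=")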

Cython should be able to do the same thing, just that the user would already type the memoryview at compile time and attach the correct format to it (of course this would need API in Cython to do that).


Would such a format extension, making use of currently "invalid" formats, be problematic in any way? As long as we define a protocol (and hope that nobody already does something similar), Python itself doesn't seem to need to add any support (maybe beyond error message improvements).

Another or additional approach might be to just start the format with an invalid single character to opt into a whole new version.


I would like to unblock the scientific Python community to work on improving the situation in this regard (and maybe further things as well).
(This may or may not happen soon, but chances are it will never happen if there is a feeling that it requires an unclear and slow upstream decision process; there has been a need for this for many years.)

While maybe not perfect, I think extending the buffer protocol is better than inventing yet another protocol. And its inclusion in the language does come with some perks.

So there is a question of how to proceed: if there is no opinion from Python, we should just press ahead and decide something on our side. But I wouldn't complain if we can reach some rough agreement first. Right now, it seems to me like there is no real opinion, which to me means silent consensus, so long as the scientific Python side of the ecosystem can agree on something?


My sense is that most of the energy in the Scipy community around data interchange is happening with Apache Arrow or the Consortium for Python Data API standards, which are admittedly much broader in scope than the Python buffer protocol. I’m not saying the buffer protocol shouldn’t be improved around the edges, and I think having something simple built into the language makes a lot of sense, especially for basic data exchange with C libraries, etc., but I think the efforts to build something more universal for data science are likely to happen outside of the language.

Yes, I am aware (and involved in some of them), although I admit I don't know Arrow well enough; but I am not aware that it generalizes well to arbitrary user-defined DTypes. The Data API standard and DLPack definitely don't even talk about it.
Generally though, amending the buffer protocol seems a far more straightforward way to get typed memoryviews in Cython that support, in principle, anything. Plus, we might get much better reach (soonish).

These efforts wish to cross language boundaries better, but beyond that they are in parts narrower in scope than such an amendment to the buffer protocol could allow.

And yes, I also still dislike that we never even discuss improving it because of the feeling that it is too hard to evolve (for no reason other than unclear ownership and the fact that maybe nobody feels quite confident about pushing it).

Frankly, I even suspect we could reasonably hack device (i.e. GPU/cuda, etc.) support into the buffer protocol and piggy-back it in a way that means it can be used on older Python versions. Although that would of course create buffer objects that the core memoryview wouldn’t support (until it does maybe).

Nobody besides Mike has responded. I wonder if there are others in the SciPy community interested in this? Together you could co-author a PEP (but it would be best to have a discussion of a straw man proposal here first).

I’ve pinged the cython mailing list to see if they’re interested, as the most interesting use case for this will require cython support.


I tried to reply to the Cython-devel mailing list, but I got a “mail-bounced” message, so presumably I’m not trusted today for some reason. Therefore I’ve posted my response below here (apologies if it doesn’t quite make sense in this thread, since it’s more a response to your mailing list message):

So my superficial thoughts:

  1. The buffer protocol has two bits. The first says “given this predictable memory layout, you can look up an item in memory with these rules”; the second describes what the items in memory are. I think you’re only proposing to change the second part of it. I’d encourage you not to change the first part - the nice thing about the first part is that it’s relatively simple and doesn’t try to do anything. For example I’d be sceptical about trying to support ragged arrays.

  2. As you identify, for a more advanced memoryview to be useful in Cython, Cython really has to be able to know an underlying C type for your data at compile-time and be able to validate that the buffer it's passed matches that C type at runtime. The validation could have varying degrees of strictness (i.e. in the worst case we could just check the size matches and trust the user). We already support that to an extent (packed structs with structured arrays), but that doesn't cover everything.

  3. For your variable length string example, the C struct to use is fairly obvious (just your struct ss). The difficult bit is likely to be memory management of that. I’d kind of encourage you not to expect Cython to handle the memory management for this type of thing (i.e. it can expose the struct to the user, but it becomes the user’s own problem to work out if they need to allocate memory when they modify the struct).

  4. Things like the datetime for Pandas, or a way of having a float16 type seems like the sort of thing we should definitely be able to do.

  5. In terms of Apache Arrow - if there was demand we probably could add support for it. Their documentation says: “The Arrow C data interface defines a very small, stable set of C definitions that can be easily copied in any project’s source code” - so that suggests it need not be a dependency.

  6. One of the points of the “typed memoryview” vs the older “np.ndarray” interface is that it was supposed to be more generally compatible. While we could extend it to match any non-standard additions that Numpy tries to make, that does feel dodgy and likely to conflict when other projects do their own thing. I think it would be better if the Python standard could be extended (even if it was just something like a code to indicate “mystery structure of size X”)

Don’t know if these thoughts are useful. They’re a bit scattered. I guess the summary is “we could definitely do more with custom data types, but don’t break the things that made the buffer protocol nice”.


Right, I don't want to push that here, and it wouldn't be something for Cython to support. One could do such extensions (just like support for exporting device memory) by extending the Py_buffer struct with new fields.
You can do that safely in a backwards compatible way that works on current Python, NumPy, Cython by introducing a PyBUF_EXTENDED and pre-initialize any new fields. All current consumers just ignore the flag (which is fine).
(Downstream can backport the flag and extended struct to old Python versions.)

That would be a distinct discussion. But, based on that thought: if we have concerns about safety, one could add a PyBUF_EXTENDED_FORMAT request flag to ensure new format strings will never be seen by current consumers.
The downside would be that memoryview(requires_special_format) fails. That isn't a showstopper, but wrapping in memoryview is a nice pattern to simplify ownership tracking (NumPy does this).
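
To illustrate that negotiation, a minimal Python-level sketch (Python 3.12+, which added __buffer__): the PyBUF_EXTENDED_FORMAT name and bit value are hypothetical, and actually attaching a "[...]" format string would still need C:

    # Hypothetical request flag; the bit value is an assumption (unused today).
    PyBUF_EXTENDED_FORMAT = 0x10000

    class RequiresSpecialFormat:
        def __init__(self, nbytes):
            self._data = bytearray(nbytes)

        def __buffer__(self, flags):
            # Refuse to export unless the consumer opted in; a real exporter
            # would also fill Py_buffer.format with the new syntax (C only).
            if not flags & PyBUF_EXTENDED_FORMAT:
                raise BufferError("consumer does not support extended formats")
            return memoryview(self._data)

    # memoryview(RequiresSpecialFormat(8)) raises BufferError, since
    # memoryview() does not request the (hypothetical) new flag.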

but it becomes the user’s own problem to work out if they need to allocate memory when they modify the struct).

Right, for types without references this wouldn't matter. I guess it would be cool if it is at least plausible for Cython to be extended in a way that helps deal with embedded references (reference counting for embedded objects, or even custom allocations).

In the simplest case, Cython would have to check if the format matches a user-provided format string exactly. It would be cool if there is a plausible extension so that numpy.datetime64 can match any unit and expose it neatly to the author of def func(datetime64[:] times) (sketched below).
In the case of datetime64, the possible units are limited so this could also be solved with a union type, though.
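
As a sketch of exact vs. parametric matching (the bracketed format string below is invented for illustration; it is not a settled spelling):

    # Hypothetical format for a datetime64 with a unit parameter.
    _DT64_PREFIX = "[numpy$numpy.dtypes.DateTime64DType$"

    def match_datetime64_unit(fmt, unit=None):
        # Exact matching: pass a specific unit and require it.
        # Parametric matching: unit=None accepts any unit and reports it.
        if not (fmt.startswith(_DT64_PREFIX) and fmt.endswith("]")):
            return None
        found = fmt[len(_DT64_PREFIX):-1]
        if unit is not None and found != unit:
            return None
        return found

    # match_datetime64_unit("[numpy$numpy.dtypes.DateTime64DType$s]") -> "s"
    # match_datetime64_unit("[numpy$numpy.dtypes.DateTime64DType$s]", unit="D") -> None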

I think it would be better if the Python standard could be extended (even if it was just something like a code to indicate “mystery structure of size X”)

I agree that it would be best if Python prescribes it. We can already spell “X random bytes with name Y”, but to me it seems safer to:

  • Use a currently invalid format string (“mystery struct” isn’t just random bytes with a name)
  • Prescribe a naming scheme to ensure clashes don’t happen (e.g. include the defining module name)

One question I’m not clear on:

Taking the simple np.datetime64 type. If I understand correctly, it’s a 64 bit signed integer representing a number of “intervals” since 1970. The dtype object encodes what interval it is (i.e. year, day, second, nanosecond, and a bunch of others).

What extra functionality would you ideally like Cython to provide for a memoryview of datetime64?

  • Just view it as a 64 bit int (which I think is what you do now)?
  • Make it a "distinct" 64 bit int (so you know that it's a 64 bit int, but you also know it's a special 64 bit int, and can't easily mix it with regular 64 bit ints)?
  • Make each datetime64[unit] combination a distinct 64 bit int type (so you can add datetime64[s] to datetime64[s] but not to datetime64[D])?
  • Call a NumPy-defined conversion function to convert it to/from a different struct on access (i.e. indexing a datetime64 memoryview calls NpyDatetime_ConvertDatetime64ToDatetimeStruct(?)) so that the user has access to it in a more convenient form?
  • Something else that I haven't thought of?

Right, datetime is already very complicated! For most use-cases, things would just be plain C (or C++) types with some constant format string.

For datetimes, it would be OK to just map it to the C npy_datetime64 (or int64, it is just an alias anyway), or a C++ type. But the unit information is problematic:

  1. Users need a way to get it (even if they reach into memoryview.view.fmt to do so).
  2. Whatever customization we have to match the format specifier, it needs to be able to deal with parameters.

So, it may be that the only way to do this would be to allow (limited?) subclassing of the Cython memoryview class, at least for the parametric case. And no, I wouldn't want to tag on NumPy-defined stuff unless that turns out to be the convenient way in some future.

Since this was mentioned as still relevant:

Could you start writing a PEP? The recommended sections should guide you – why we need a solution, what we need from the solution (what exactly are the use cases? what do we need from CPython?), and then how it can be implemented.

My thoughts on this: it might be useful to separate the underlying “C” struct from mapping to Python types. The buffer protocol might be extendable to do the former well, but for things like “these 8 bytes are a DateTime” or “this uint8 should be treated as a Python bool”, up to “wrap this struct in a custom class”… I’m not convinced that a universal bytecode-based format can work.

Ideally, whatever needs to be in CPython would be tightly scoped, so we could call it done rather soon, leaving further iterations to external projects.

Perhaps CPython can “simply” reserve prefixes like [np:...] or [arrow:...] for external specifications? I mean, it already doesn’t really care, maybe we just need to document it and add nice error messages.

Thanks for the note/ping, I hope @ngoldbaum and I can author a short PEP some time soonish. Looking at the module$qualname thing I wrote, I have to think about it, but yeah. Likely it is better if NumPy doesn't have to look for that object and do a subclass check, but rather only does that after it has already found np:.

@ngoldbaum and I have written a PEP draft for this proposal with some small changes/extension and @da-woods was so kind to help with a Cython PoC implementation.

You can find the draft here: PEP Draft buffer protocol custom dtypes - HackMD
The PoC implementations can be found here for numpy and here for cython (NumPy one requires the Cython one to build right now, but this is not strictly necessary).
(Just to note, I have thoughts on further extensions [1] but that is for a different thread!)

To summarize the main points (of course flexible on details):

  • We use [] for such a custom dtype. To deal with aliases, we decided to allow ; as a way to include multiple aliases within the brackets (hopefully not used much!).
  • Each type identifier always starts with unique_name$, e.g. numpy$... after which we have arbitrary printable ASCII characters (minus ;[]). Pointers, etc. will need to be encoded.
    For example, NumPy can then define that a type name follows after the $. (EDIT: Finish sentence)
  • We have double-checked that none of the large packages (Cython, NumPy, Python, torch, …) have problems with this.[2]
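
As a rough illustration of how a consumer might split the bracket syntax (the function name and the example format are made up for illustration; the draft is authoritative on details):

    def parse_custom_format(fmt):
        # Split "[name$payload;name$payload]" into (prefix, payload) alias pairs.
        if not (fmt.startswith("[") and fmt.endswith("]")):
            raise ValueError(f"not a custom dtype format: {fmt!r}")
        entries = []
        for alias in fmt[1:-1].split(";"):
            prefix, sep, payload = alias.partition("$")
            if not sep:
                raise ValueError(f"missing unique_name$ prefix in {alias!r}")
            entries.append((prefix, payload))
        return entries

    # e.g. parse_custom_format("[numpy$numpy.dtypes.StringDType;mylib$utf8]")
    # -> [("numpy", "numpy.dtypes.StringDType"), ("mylib", "utf8")]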

We would be happy for feedback, and are hoping to create a proper (pre-?)PEP out of this soon!


  1. I am thinking about extending the protocol further to allow storing more things, including non-CPU memory; here is an earlier start. But I think this is much simpler and more directly useful with NumPy/Cython (and further extensions become more useful if they can use this). ↩︎

  2. Some tend to ignore the format fully; this is already unsafe, e.g. for object arrays. ↩︎


Here’s what came to my mind:

  • In the post, “For example, NumPy can then define” is missing the rest of the sentence
  • should this PEP specify a struct prefix for fallbacks? e.g. [mylib$datetime64;struct$8B] = "if you don't understand my datetime64, treat this as 8 bytes"? That would need a patch to CPython as well :)
  • Could you add a concrete example to “adopters are encouraged to honor this byte-order and size state where it makes sense”?
  • Move the PoC links from Abstract to Reference Implementation
  • bikeshedding: is $ best? A : might end a “heading” more naturally.

Note: I’ll be on paternity leave for a month; don’t block the work on me!


Thanks a lot for the feedback! I'll expand/change the draft soon.

Could you add a concrete example to “adopters are encouraged to honor this byte-order and size state where it makes sense”?

Yes, will do. To clarify here: E.g. NumPy timedelta64 can be stored as little or big endian. If the format is >[numpy$numpy.dtypes:TimeDelta64DType:...], the > state modifies the byte-order, and I think it makes sense to use that (but I don’t think it matters enough to strictly prescribe it).
It could even make sense to use something like Z[cpp$std:bfloat16] for a complex bfloat16 (but of course C++ knows how to spell that directly also!).

For neither of these do I think it is important to strictly prescribe it, but especially for the byte-order, I think it makes sense to nudge towards honoring it (it is easy/typical to just raise an error for non-native byte-order; in practice I don't expect much non-native use anyway).
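
For illustration, honoring the leading byte-order modifier could be as simple as peeling it off before interpreting the bracketed part (a hedged sketch; the helper name is made up):

    def split_byteorder(fmt):
        # Peel an optional leading struct byte-order character, as in
        # ">[numpy$numpy.dtypes:TimeDelta64DType:...]", defaulting to
        # native ("@") like the struct module does.
        if fmt and fmt[0] in "@=<>!":
            return fmt[0], fmt[1:]
        return "@", fmt

    # split_byteorder(">[numpy$numpy.dtypes:TimeDelta64DType:s]")
    # -> (">", "[numpy$numpy.dtypes:TimeDelta64DType:s]")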

missing the rest of the sentence

sorry about that.

is $ best? A : might end a “heading” more naturally.

I don’t have a strong opinion for $ or : (or any other symbol like #). Avoiding : seemed slightly clearer for numpy$numpy.dtypes:StringDType (using the __module__:__qualname__ convention) as the first numpy refers to the prefix and the second to the module (the NumPy dtype need not live in NumPy).
But of course it is just as well defined to use numpy:numpy.dtypes:StringDType, and NumPy is likely the odd one out w.r.t. the duplication anyway.

should this PEP specify a struct prefix for fallbacks? e.g. […] That would need a patch CPython as well

I agree we should at least say that struct$ would have this meaning. And yeah, I guess that means that memoryview should learn about it in a timely fashion!
(A question is whether to use buffer$ as the buffer protocol uses an extended syntax, or struct$ but strictly limit to struct.struct syntax.)

That said, I wouldn’t use it for the timedelta example. IMO, the buffer protocol should indicate a logical type, not a physical one. So if we were interested in indicating the size or “physical type”, we may want a more explicit provision for that.

OTOH, there are likely times where it is OK to work with the physical struct even though we have a more precise logical type available. And also use-cases where practicality just beats purity (because it is clear enough for the user that this will happen).

I like the overall idea, but a few comments.

  1. Like Petr, I dislike $ and prefer :, which is a natural, more readable separator. (My emacs habits of editing remote files with /ssh:host:... would actually like it to be /numpy:... with no end character, but [...] is fine! I wondered about URIs, with, say, numpy:// and struct://, but that really seemed not all that helpful.)
  2. Perhaps more important: there will be mistakes and hence there likely will be a need for different versions. Should the ability to add a version number be included from the get-go? Of course that can just be a different prefix, but perhaps good to define? (In terms of the code being a pip-installable package, one could think of allowing numpy>=2 – protocol versions will probably usually go with package versions…)
  3. Giving byte-order before the element, like <[...], seems weird. Should it not be part of the definition itself? At least, for aliases, I think something like struct:>dd will be clearer than having the byte order at the start.

p.s. Small thing: do add links to the buffer protocol and the standard struct format to the text, to save people like me from having to look them up: Buffer Protocol — Python 3.13.3 documentation and struct — Interpret bytes as packed binary data — Python 3.13.3 documentation

Separately, would it be possible to give a concrete example in the draft PEP, perhaps how StringDType could be implemented? The numpy PR linked to does not make clear how it would work. I think effectively the buffer would need to have two parts, the pointers/lengths and a blob with the actual (medium and long) strings. What would shape, strides, suboffsets, etc., be?

I do think having a struct alias would generically be a great idea. For the StringDType case, how would the struct alias be represented?

Am travelling and going to get back in a bit.

Sounds like one more nudge toward just using :. I gave my small reason for avoiding it and don't mind that at all (i.e. we get numpy:numpy.dtypes:StringDType, with : as both the identifier separator and the module separator)!
So will probably just change it in the next iteration.

there will be mistakes and hence there likely will be a need for different versions.

The question is at which level? NumPy can version as much as it wants via the name, and that doesn't need a provision, e.g. numpy-2. I wouldn't even do that. For numpy:numpy.dtypes.StringDType:<...> only the ... part will need care if extended (i.e. dtype-specific versioning). And yes, we may want to do that, but it seems like a NumPy discussion, not a PEP one.

At the [] level, I suspect we already have enough flexibility (characters) to do versioning somehow, but if much desired, I would be happy to provision something. (But I think you meant at the dtype level.)

Giving byte-order before the element, like <[...] , seems weird.

It can be, but the buffer protocol/struct provides for a leading > to change everything to big endian. I opted for nudging to honor that, although I agree (and I think that is how it is written) that it is fine not to (and even more convenient, probably!).
I had two small nudges towards that. First, for types without parameters it may actually be convenient (i.e. cpp:bfloat16 if we get there) and second, I thought that if we do this, it may nudge towards more care about refusing to read the wrong byte-order.

would it be possible to give a concrete example in the draft PEP, perhaps how StringDType could be implemented?

The Cython proof of concept does exactly two things with this addition:

  1. Get access to the elements, which are an opaque struct/typedef in NumPy.
  2. Get access to the NumPy dtype object through the custom format string.

After that, we can work with the data via the NumPy C-API. The actual ABI is opaque here. Maybe it is enough to clarify this in the PR and/or the next PEP draft iteration?
I don’t want to explain more here! This is enough for NumPy at the moment (we may expose the ABI, but I am not convinced it is even interesting here).
I don't want to diverge into how strings are stored in practice and the merits/trade-offs of what we can or cannot do in NumPy (or here).