Buffer protocol and arbitrary (data) types

The buffer protocol is hugely important in many scientific applications. However, it is limited to a fairly fixed set of types based on single-character C type codes (the struct interface).

We could extend this, which may for example make sense for bfloat16 support, which is gaining popularity. For bfloat16 it would make sense to simply agree on a new character. But overall, I suspect there are more use cases where an extension would be useful (datetimes or just arbitrary user types).

One option would be to try to use name metadata on valid buffer formats. But memoryview() doesn’t care whether the buffer format is valid unless you do indexing: memoryview(invalid_buf_fmt_buffer) works, and a NotImplementedError is only raised once you try to read an element from the “invalid” buffer.
That means we can stuff pretty much arbitrary content into the buffer format, so long as we are certain that it is a clearly invalid format.
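To illustrate the memoryview() behaviour described above, here is a minimal sketch. It uses a ctypes struct array as a stand-in, because its exported format is a multi-character string that memoryview does not interpret; this is an assumption on my part to get such a buffer from the standard library, not one of the proposed “invalid” formats.

```python
import ctypes


class Point(ctypes.Structure):
    # ctypes exports a multi-character struct format for this type.
    _fields_ = [("x", ctypes.c_int), ("y", ctypes.c_int)]


points = (Point * 3)()

# Creating the memoryview succeeds even though memoryview cannot
# interpret the format string.
view = memoryview(points)
fmt = view.format

# Only element access fails, because memoryview has to unpack the item.
try:
    view[0]
    indexing_error = None
except NotImplementedError as exc:
    indexing_error = exc
```

Metadata such as view.nbytes and view.itemsize stays available, which is what would let a consumer like NumPy take over the interpretation of the format.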

So I am wondering if we could extend the buffer protocol with something like defining a new type-code like:

[module$qualname$param]

as a generic extension “type-code” (or really any variations of something like this). The [] should make it clearly invalid currently, I think.
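A rough sketch of how a consumer might recognise and split such a type-code. The function name parse_extension_format and the choice to return None for anything else are hypothetical; nothing here is an agreed-upon specification.

```python
def parse_extension_format(fmt):
    """Split '[module$qualname$param]' into (module, qualname, param).

    Returns None when fmt is not a bracket-enclosed extension code, so the
    caller can fall back to the normal struct-style format handling.
    """
    if not (fmt.startswith("[") and fmt.endswith("]")):
        return None
    # Split on the first two '$' only, so param may itself contain '$'.
    parts = fmt[1:-1].split("$", 2)
    if len(parts) != 3:
        return None
    module, qualname, param = parts
    if not module or not qualname:
        return None
    return module, qualname, param


parsed = parse_extension_format("[mylib.dtypes$BFloat16$]")
```

Here mylib.dtypes and BFloat16 are made-up names; an empty param is allowed for types that need no parameters.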

When e.g. NumPy sees such a [] enclosed format, it could then define a protocol on top of it like:

getattr(sys.modules[module], qualname)._numpy_dtype_from_buffer_fmt(param, byteorder="=")

which would have to return a valid NumPy dtype instance. An important part here is that NumPy doesn’t have to recognize the dtype directly. It could be defined by a downstream library. (If there are no security concerns, NumPy could import the module. If there are concerns, NumPy could raise an error asking the user to import it.)
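The lookup could be sketched roughly as follows, using the conservative variant that never imports anything on its own. The helper resolve_extension_dtype, the module fake_dtype_lib and the class UnitDType are all hypothetical, and the class returns a plain Python object rather than a real NumPy dtype so that the sketch runs without NumPy.

```python
import sys
import types


def resolve_extension_dtype(module, qualname, param, byteorder="="):
    """Look up the dtype factory named by an extension type-code."""
    mod = sys.modules.get(module)
    if mod is None:
        # Security-conscious variant: ask the user to import the module.
        raise ImportError(
            f"buffer format refers to module {module!r}; import it first"
        )
    factory = getattr(mod, qualname)
    return factory._numpy_dtype_from_buffer_fmt(param, byteorder=byteorder)


class UnitDType:
    """Stand-in for a dtype defined by a downstream library."""

    def __init__(self, unit, byteorder):
        self.unit = unit
        self.byteorder = byteorder

    @classmethod
    def _numpy_dtype_from_buffer_fmt(cls, param, byteorder="="):
        return cls(param, byteorder)


# Register a fake downstream module so the lookup has something to find.
fake_module = types.ModuleType("fake_dtype_lib")
fake_module.UnitDType = UnitDType
sys.modules["fake_dtype_lib"] = fake_module

dtype = resolve_extension_dtype("fake_dtype_lib", "UnitDType", "ms")
```

Note that a plain getattr only handles top-level names; a dotted qualname (a nested class) would need to be walked attribute by attribute.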

Cython should be able to do the same thing, just that the user would already type the memoryview at compile time and attach the correct format to it (of course this would need API in Cython to do that).


Would such a fmt extension, making use of a currently “invalid” format, be problematic in any way? As long as we define a protocol (and hope that nobody already does something similar), Python itself doesn’t seem to need to add any support (maybe beyond error message improvements).

Another or additional approach might be to just start the format with an invalid single character to opt into a whole new version.

I would like to unblock the scientific python community working on improving the situation in this regard (and maybe further things also).
(This may or may not happen soon, but chances are it will never happen if there is a feeling that it requires an unclear and slow upstream decision process, and there has been a need for this for many years.)

While maybe not perfect, I think extending the buffer protocol is better than inventing yet another protocol. And its inclusion in the language does come with some perks.

So there is a question of how to proceed: If there is no opinion from Python, we should just press ahead and decide something from our side. But I wouldn’t complain if we can have some rough agreement first. Right now, it seems to me like there is no real opinion, which to me means silent consensus so long as the scientific Python side of the ecosystem can agree on something?

My sense is that most of the energy in the Scipy community around data interchange is happening with Apache Arrow or the Consortium for Python Data API standards, which are admittedly much broader in scope than the Python buffer protocol. I’m not saying the buffer protocol shouldn’t be improved around the edges, and I think having something simple built into the language makes a lot of sense, especially for basic data exchange with C libraries, etc., but I think the efforts to build something more universal for data science are likely to happen outside of the language.

Yes, I am aware (and involved in some), although I admit I don’t know Arrow well enough, but I am not aware that it generalizes well to arbitrary user-defined DTypes. The Data API standard or DLPack definitely doesn’t even talk about it.
Generally though, amending the buffer protocol seems a far more straightforward way to get typed memoryviews in Cython which support, in principle, anything. Plus, we might get much better reach (soonish).

These efforts aim to cross language boundaries better, but beyond that they are in parts narrower in scope than what such an amendment to the buffer protocol could allow.

And yes, I also still dislike that we never even discuss improving it because of the feeling that it is too hard to evolve (for no reason other than unclear ownership and the fact that maybe nobody feels quite confident about pushing it).

Frankly, I even suspect we could reasonably hack device (i.e. GPU/CUDA, etc.) support into the buffer protocol and piggy-back it in a way that means it can be used on older Python versions. Although that would of course create buffer objects that the core memoryview wouldn’t support (until it maybe does).

Nobody besides Mike has responded. I wonder if there are others in the SciPy community interested in this? Together you could co-author a PEP (but it would be best to have a discussion of a straw man proposal here first).