Buffer protocol and arbitrary (data) types

The buffer protocol is hugely important in many scientific applications. However, it is limited to a fairly fixed set of types based on the C single-character codes (the struct interface).

We could extend this, which may for example make sense for bfloat16 support, which is gaining popularity. For bfloat16 it would make sense to simply agree on a new character. But beyond that there are more use-cases which I suspect would be useful (datetimes or just arbitrary user types).
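For context, the existing single-character codes come from the struct module, and they already include 'e' for IEEE float16 but nothing for bfloat16 (the 'E' below is a purely hypothetical new character, not anything agreed on):

```python
import struct

# The buffer protocol's item type codes are fixed single characters from
# the struct module: 'e' already covers IEEE float16 (2 bytes), but there
# is no code for bfloat16 today.
print(struct.calcsize("e"))   # 2 -- float16 exists
try:
    struct.calcsize("E")      # hypothetical bfloat16 code -- rejected today
except struct.error as exc:
    print("no bfloat16 code:", exc)
```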

One option would be to try to use name metadata on valid buffer formats. But memoryview() doesn’t care about buffer.format being valid (unless you do indexing): memoryview(invalid_buf_fmt_buffer) itself works, and memoryview is happy to raise a NotImplementedError only when it actually has to interpret an “invalid” format.
That means we can stuff pretty much arbitrary content into the buffer format, so long as we are certain that it is a clearly invalid format.
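The deferred validation is easy to see with ctypes: a struct type exports a PEP 3118 “T{...}” format that memoryview wraps without complaint, and the NotImplementedError only shows up when an item has to be interpreted (the format here is well-formed rather than invalid, but it exercises the same code path):

```python
import ctypes

# ctypes structs export a PEP 3118 "T{...}" format string.
class Pair(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]

view = memoryview((Pair * 3)())
print(view.format, view.itemsize)  # e.g. T{<d:x:<d:y:} 16

# Construction works fine; interpreting an item does not:
try:
    view[0]
except NotImplementedError as exc:
    print(exc)
```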

So I am wondering if we could extend the buffer protocol by defining a new type-code like:


as a generic extension “type-code” (or really any variation of something like this). The [] should make it clearly invalid currently, I think.

When e.g. NumPy sees such a [] enclosed format, it could then define a protocol on top of it like:

getattr(sys.modules[module], qualname)._numpy_dtype_from_buffer_fmt(param, byteorder="=")

which would have to return a valid numpy dtype instance. An important part here is that NumPy doesn’t have to recognize the dtype directly. It could be defined by a downstream library. (If there are no security concerns, NumPy could import the module. If there are concerns, NumPy could raise an error asking the user to import it.)
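As an illustration, a pure-Python sketch of that resolution step. Only the _numpy_dtype_from_buffer_fmt hook is taken from the proposal above; the exact "[module:qualname:param]" spelling and the fakelib stand-in are assumptions:

```python
import re
import sys
import types

# Assumed layout of the extension format string: "[module:qualname:param]".
_EXT_FMT = re.compile(r"^\[(?P<module>[\w.]+):(?P<qualname>[\w.]+):(?P<param>.*)\]$")

def resolve_extended_format(fmt, byteorder="="):
    """Resolve an extension buffer format to a dtype-like object.

    The named module must already be imported (raising instead of
    importing sidesteps the security concern mentioned above), and the
    resolved object must provide _numpy_dtype_from_buffer_fmt.
    """
    match = _EXT_FMT.match(fmt)
    if match is None:
        raise ValueError(f"not an extension format: {fmt!r}")
    module = match["module"]
    if module not in sys.modules:
        raise TypeError(f"import {module!r} to use buffers with format {fmt!r}")
    obj = sys.modules[module]
    for attr in match["qualname"].split("."):
        obj = getattr(obj, attr)
    return obj._numpy_dtype_from_buffer_fmt(match["param"], byteorder=byteorder)

# A stand-in "downstream library" registering a bfloat16-like type:
fake = types.ModuleType("fakelib")
class bfloat16:
    @staticmethod
    def _numpy_dtype_from_buffer_fmt(param, byteorder="="):
        return f"bfloat16(param={param!r}, byteorder={byteorder!r})"
fake.bfloat16 = bfloat16
sys.modules["fakelib"] = fake

print(resolve_extended_format("[fakelib:bfloat16:]"))
```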

Cython should be able to do the same thing, just that the user would already type the memoryview at compile time and attach the correct format to it (of course this would need API in Cython to do that).

Would such a format extension, making use of currently “invalid” formats, be problematic in any way? As long as we define a protocol (and hope that nobody already does something similar), Python itself doesn’t seem to need to add any support (maybe beyond error-message improvements).

Another or additional approach might be to just start the format with an invalid single character to opt into a whole new version.


I would like to unblock the scientific python community working on improving the situation in this regard (and maybe further things also).
(This may or may not happen soon, but chances are it will never happen if there is a feeling that it requires an unclear and slow upstream decision process; there has been a need for this for many years.)

While maybe not perfect, I think extending the buffer protocol is better than inventing yet another protocol. And its inclusion in the language does come with some perks.

So there is a question of how to proceed: if there is no opinion from Python, we should just press ahead and decide something on our side. But I wouldn’t complain if we could reach some rough agreement first. Right now, it seems to me like there is no real opinion, which to me means silent consensus, so long as the scientific Python side of the ecosystem can agree on something?


My sense is that most of the energy in the Scipy community around data interchange is happening with Apache Arrow or the Consortium for Python Data API standards, which are admittedly much broader in scope than the Python buffer protocol. I’m not saying the buffer protocol shouldn’t be improved around the edges, and I think having something simple built into the language makes a lot of sense, especially for basic data exchange with C libraries, etc., but I think the efforts to build something more universal for data science are likely to happen outside of the language.

Yes, I am aware (and involved in some), although I admit I don’t know Arrow well enough; but I am not aware that it generalizes well to arbitrary user-defined DTypes. The Data API standard and DLPack definitely don’t even talk about it.
Generally though, amending the buffer protocol seems a far more straightforward way to get typed memoryviews in Cython which support in principle anything. Plus, we might get much better reach (soonish).

These efforts aim to cross language boundaries better, but beyond that they are in parts more narrow in scope than such an amendment to the buffer protocol could allow.

And yes, I also still dislike that we never even discuss improving it because of the feeling that it is too hard to evolve (for no reason other than unclear ownership and the fact that maybe nobody feels quite confident about pushing it).

Frankly, I even suspect we could reasonably hack device (i.e. GPU/cuda, etc.) support into the buffer protocol and piggy-back it in a way that means it can be used on older Python versions. Although that would of course create buffer objects that the core memoryview wouldn’t support (until it does maybe).

Nobody besides Mike has responded. I wonder if there are others in the SciPy community interested in this? Together you could co-author a PEP (but it would be best to have a discussion of a straw-man proposal here first).

I’ve pinged the cython mailing list to see if they’re interested, as the most interesting use case for this will require cython support.


I tried to reply to the Cython-devel mailing list, but I got a “mail-bounced” message, so presumably I’m not trusted today for some reason. Therefore I’ve posted my response below here (apologies if it doesn’t quite make sense in this thread, since it’s more a response to your mailing list message):

So my superficial thoughts:

  1. The buffer protocol has two parts. The first says “given this predictable memory layout, you can look up an item in memory with these rules”; the second describes what the items in memory are. I think you’re only proposing to change the second part. I’d encourage you not to change the first part - the nice thing about the first part is that it’s relatively simple and doesn’t try to do anything. For example I’d be sceptical about trying to support ragged arrays.

  2. As you identify, for a more advanced memoryview to be useful in Cython, Cython really has to be able to know an underlying C type for your data at compile time and be able to validate at runtime that the buffer it’s passed matches that C type. The validation could have varying degrees of strictness (i.e. in the worst case we could just check the size matches and trust the user). We already support that to an extent (packed structs with structured arrays), but that doesn’t cover everything.

  3. For your variable length string example, the C struct to use is fairly obvious (just your struct ss). The difficult bit is likely to be memory management of that. I’d kind of encourage you not to expect Cython to handle the memory management for this type of thing (i.e. it can expose the struct to the user, but it becomes the user’s own problem to work out if they need to allocate memory when they modify the struct).

  4. Things like the datetime for Pandas, or a way of having a float16 type, seem like the sort of thing we should definitely be able to do.

  5. In terms of Apache Arrow - if there was demand we probably could add support for it. Their documentation says: “The Arrow C data interface defines a very small, stable set of C definitions that can be easily copied in any project’s source code” - so that suggests it need not be a dependency.

  6. One of the points of the “typed memoryview” vs the older “np.ndarray” interface is that it was supposed to be more generally compatible. While we could extend it to match any non-standard additions that Numpy tries to make, that does feel dodgy and likely to conflict when other projects do their own thing. I think it would be better if the Python standard could be extended (even if it was just something like a code to indicate “mystery structure of size X”)
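The layout half mentioned in point 1 really is just arithmetic on strides; a minimal pure-Python sketch of the item lookup rule (the helper name is mine, purely illustrative):

```python
import struct

def item_offset(indices, strides):
    # Byte offset of an element: dot product of indices and strides.
    return sum(i * s for i, s in zip(indices, strides))

# A 2x3 C-contiguous buffer of little-endian 4-byte ints has
# strides (12, 4); element [1, 2] of [[0,1,2],[3,4,5]] lives at offset 20.
buf = struct.pack("<6i", *range(6))
offset = item_offset((1, 2), (12, 4))
value, = struct.unpack_from("<i", buf, offset)
print(value)  # 5
```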

Don’t know if these thoughts are useful. They’re a bit scattered. I guess the summary is “we could definitely do more with custom data types, but don’t break the things that made the buffer protocol nice”.


Right, I don’t want to push that here, and it wouldn’t be something for Cython to support. One could do such extensions (just like supporting to export device memory) by extending the Py_buffer struct with new fields.
You can do that safely in a backwards-compatible way that works on current Python, NumPy, and Cython by introducing a PyBUF_EXTENDED flag and pre-initializing any new fields. All current consumers just ignore the flag (which is fine).
(Downstream can backport the flag and extended struct to old Python versions.)

That would be a distinct discussion. But, based on that thought: if we have concerns about safety, one could add a PyBUF_EXTENDED_FORMAT request flag to ensure new format strings will never be seen by current consumers.
The downside would be that memoryview(requires_special_format) fails. That isn’t a show-stopper, but wrapping in memoryview is a nice pattern to simplify ownership tracking (numpy does this).

but it becomes the user’s own problem to work out if they need to allocate memory when they modify the struct).

Right, for types without references this wouldn’t matter. I guess it would be cool if it is at least plausible for Cython to be extended in a way that helps with dealing with embedded references (reference counting for embedded objects, or even custom allocations).

In the simplest case, Cython would have to check whether the format matches a user-provided format string exactly. It would be cool if there is a plausible extension so that numpy.datetime64 can match any unit and expose it neatly to the author of def func(datetime64[:] times).
In the case of datetime64 the possible units are limited, so this could also be solved with a union type, though.

I think it would be better if the Python standard could be extended (even if it was just something like a code to indicate “mystery structure of size X”)

I agree that it would be best if Python prescribes it. We can already spell “X random bytes with name Y”, but to me it seems safer to:

  • Use a currently invalid format string (“mystery struct” isn’t just random bytes with a name)
  • Prescribe a naming scheme to ensure clashes don’t happen (e.g. include the defining module name)

One question I’m not clear on:

Taking the simple np.datetime64 type. If I understand correctly, it’s a 64 bit signed integer representing a number of “intervals” since 1970. The dtype object encodes what interval it is (i.e. year, day, second, nanosecond, and a bunch of others).
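To make the “intervals since 1970” point concrete with just the standard library (a sketch for the seconds unit; the helper names are illustrative):

```python
from datetime import datetime, timezone

def to_datetime64_s(dt):
    # A datetime64[s]-style value: a signed 64-bit count of seconds since
    # the Unix epoch. The unit lives in the dtype, not in the data itself.
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

def from_datetime64_s(count):
    return datetime.fromtimestamp(count, tz=timezone.utc)

count = to_datetime64_s(datetime(2000, 1, 1))
print(count)  # 946684800
print(from_datetime64_s(count).isoformat())  # 2000-01-01T00:00:00+00:00
```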

What extra functionality would you ideally like Cython to provide for a memoryview of datetime64?

  • Just view it as a 64 bit int (which I think is what you do now)?
  • Make it a “distinct” 64 bit int (so you know it’s a 64 bit int, but you also know it’s a special 64 bit int, and can’t easily mix it with regular 64 bit ints)?
  • Make each datetime64[unit] combination a distinct 64 bit int type (so you can add datetime64[s] to datetime64[s] but not to datetime64[D])?
  • Call a Numpy-defined conversion function to convert it to/from a different struct on access (i.e. indexing a datetime64 memoryview calls NpyDatetime_ConvertDatetime64ToDatetimeStruct(?)) so that the user has access to it in a more convenient form?
  • Something else that I haven’t thought of?

Right, datetime is already very complicated! After all, most use-cases would just be plain C (or C++) types with some constant format string.

For datetimes, it would be OK to just map it to the C npy_datetime64 (or int64, it is just an alias anyway), or to a C++ type. But the unit information is problematic:

  1. Users need a way to get it (even if they reach into memoryview.view.fmt to do so).
  2. Whatever customization we have to match the format specifier, it needs to be able to deal with parameters.

So, it may be that the only way to do this would be to allow (limited?) subclassing of the Cython memoryview class, at least for the parametric case. And no, I wouldn’t want to tack on NumPy-defined stuff unless that turns out to be the convenient way at some point in the future.