Adding PyBytes_FromBuffer? (and similar for array.array)

sthibaul · December 6, 2022, 6:11pm

I am looking for a way to get a bytes array from a buffer-protocol object, without a copy.

We are processing huge amounts of data (think GB) that we expose in Python as byte arrays, and we want to avoid copying it back and forth between processes etc. We have a C char* pointer, from which PyMemoryView_FromMemory gives us a memory view that implements the buffer protocol. We tried PyBytes_FromObject to get a bytes array from this view, but it actually calls PyBytes_FromStringAndSize, which makes a copy, and that is what we want to avoid.
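To make the copy visible from the Python side, here is a minimal sketch. It assumes a bytearray can stand in for the C char* buffer (the real one would come from the C library); the names c_buffer, view and snapshot are only for illustration.

```python
# A bytearray stands in for the C char* buffer from the post (an assumption
# made for illustration; the real buffer would be owned by the C library).
c_buffer = bytearray(b"payload from C")

view = memoryview(c_buffer)  # shares storage with c_buffer, no copy
snapshot = bytes(view)       # allocates a new bytes object and copies into it

# Change the underlying buffer after both objects exist.
c_buffer[0] = ord("P")

# The memoryview reflects the change; the bytes object does not.
view_sees_change = view[0] == ord("P")
snapshot_unchanged = snapshot[0] == ord("p")
```

The memoryview tracks the buffer, while the bytes object keeps the old content because it holds its own copy.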

We know that the converse is possible: getting a memory view from a bytes array. But that doesn’t allow sharing the memory storage between e.g. different instances of Python.

NumPy, for instance, provides PyArray_FromBuffer which does what we want: it just maps the data, without performing any copy. But we’d rather be able to manage different types of data, not only numpy arrays.

guido · December 6, 2022, 8:58pm

This is not possible using built-in APIs. The implementation of the bytes object requires the object header to be contiguous with the data, so if you have a buffer pointing to just the data, it will have to be copied.
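The layout described here can be observed from Python on CPython. The sketch below relies on two CPython implementation details, so treat it as an illustration rather than a supported technique: id() returns the object's address, and bytes.__basicsize__ covers the header plus one trailing NUL byte.

```python
import ctypes
import sys

data = b"header then payload"

# CPython detail (assumption): bytes.__basicsize__ is the header size plus one
# byte for the trailing NUL, so the payload starts one byte before that offset.
payload_offset = bytes.__basicsize__ - 1

# Read len(data) raw bytes directly after the object header.
raw = ctypes.string_at(id(data) + payload_offset, len(data))

# The object's total size is the header plus the payload, in one allocation.
total_size = sys.getsizeof(data)
```

Since the payload sits in the same allocation as the header, a bytes object cannot simply point at memory that lives elsewhere.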

Can you explain why memoryview does not satisfy your needs?

sthibaul · December 6, 2022, 10:59pm

The implementation of the bytes object requires the object header to be contiguous with the data,

Ah

Is that also the case with array.array?

Can you explain why memoryview does not satisfy your needs?

It’s just that, from the point of view of the applications using our library, they would want to use byte arrays, and we would want to optimize how we expose the data to them. If that’s not possible, we’ll have to document that they should migrate to NumPy etc. if they want the best performance, ok.

guido · December 6, 2022, 11:38pm

Why don’t you read the source code? You know where to find it, right?

Have you asked them why they prefer byte arrays? Do they want to modify the data in place?

One issue that’s making all of this hard is ownership of the memory buffer. When the last reference to the data goes away, what should happen? Both bytearray and array.array assume that the object owns the memory containing the data and that it has been allocated in a certain way so that they can deallocate it accordingly.
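The buffer protocol already handles one half of this ownership question: an exporter can refuse to free its memory while views still exist. A minimal sketch, assuming an anonymous mmap region as a stand-in for externally managed memory (the thread does not mention mmap):

```python
import mmap

# Anonymous mapping used as a stand-in for memory owned by something other
# than a bytes/bytearray object (an assumption for illustration).
region = mmap.mmap(-1, 4096)
view = memoryview(region)  # zero-copy view; region stays the owner

try:
    region.close()         # owner tries to free memory that is still exported
except BufferError:
    close_blocked = True   # refused: a live view still points into the region
else:
    close_blocked = False

view.release()             # the last user lets go of the memory
region.close()             # now the owner can deallocate
```

Here the owner decides when to deallocate, and the view only borrows. bytearray and array.array have no such split, which is why they assume they own the memory.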

I’m guessing that numpy has the functionality you require because your use case occurs regularly in that space, so it’s likely that that’s your best bet regardless, if your users aren’t happy with memoryview.

sthibaul · December 7, 2022, 12:29am

Guido van Rossum via Discussions on Python.org, on Tue. 06 Dec. 2022 23:48:33 +0000, wrote:

● Samuel Thibault:

Is that also the case with array.array?

Why don’t you read the source code? You know where to find it, right?

Yes, but it’s not clear which current assumptions are there to stay and
which are not. For PyBytesObject the current implementation indeed embeds
the content inside the structure itself, and apparently that’s not going
to change; for arrayobject it is an allocated pointer, which is currently
assumed to come from PyMem_NEW() since array_dealloc always uses PyMem_Free,
but possibly that could be revisited to extend the support.
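The separate allocation used by array.array can be seen from Python through buffer_info(). A small sketch, relying on the CPython detail that id() is the object's address; the variable names are only for illustration:

```python
import array
import ctypes

source = bytearray(b"abc")
arr = array.array("B", source)  # the constructor copies from source

# buffer_info() returns the address of the item storage and the item count.
address, length = arr.buffer_info()
raw = ctypes.string_at(address, length * arr.itemsize)

# CPython detail (assumption): id() is the address of the object header. The
# item storage is a separate allocation, so it lies outside the header.
header_start = id(arr)
header_end = header_start + array.array.__basicsize__
storage_is_separate = not (header_start <= address < header_end)

# Changing the source afterwards does not affect the array's own copy.
source[0] = ord("X")
array_kept_copy = arr[0] == ord("a")
```

So the data pointer is already decoupled from the object header here, but the array still owns its storage and fills it by copying.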

● Samuel Thibault:

It’s just from the application (using our library) point of view, they
would want to use byte arrays and we would want to optimize exposing it to
them. If it’s not possible, we’ll thus have to document that they should
migrate to NumPy etc. if they want best performance, ok.

Have you asked them why they prefer byte arrays? Do they want to modify the
data in place?

Possibly, we’ll have to check with them.

One issue that’s making all of this hard is ownership of the memory buffer.
When the last reference to the data goes away, what should happen? Both
bytearray and array.array assume that the object owns the memory containing the
data and that it has been allocated in a certain way so that they can
deallocate it accordingly.

Indeed, in what I’m suggesting the creator of a mapped array would have
to be responsible for deallocation.

I’m guessing that numpy has the functionality you require because your use case
occurs regularly in that space, so it’s likely that that’s your best bet
regardless, if your users aren’t happy with memoryview.

Ok, we’ll try to stay with that for now.

Thanks!
Samuel

guido · December 7, 2022, 12:46am

(Indeed; it would break too many internal APIs, and it would make bytes objects more expensive to use – they are performance critical so that’s undesirable.)

You could write an extension module that subclasses array.array and overrides the deallocation function. (Subclassing bytes to do the same isn’t sensible, alas.)

malemburg · December 7, 2022, 11:16am

You may want to have a look at PyArrow and its C / Cython API, documented in "Using pyarrow from C++ and Cython Code" (Apache Arrow v10.0.1).

Apache Arrow is designed for exactly the purpose you seem to have in mind and works not only with Python, but also many other implementation languages.