PEP 467: Minor API improvements for binary sequences

stoneleaf · December 27, 2023, 8:03pm

PEP 467: Minor API improvements for binary sequences is ready for further discussion. The one point I’m a little unsure of is the exclusion of memoryview (because I have a project that would benefit from it).

Thoughts?

CAM-Gerlach · December 27, 2023, 8:12pm

Happy to see this moving again! Is there an accessible summary of the changes from the last-discussed version?

Also, just a reminder to update the PEP with a Discussions-To link in the header pointing to this thread. Thanks!

stoneleaf · December 27, 2023, 8:35pm

IIRC, the only change is the removal of a late-added ascii constructor.

Updates to PEP done, thanks.

hauntsaninja · December 27, 2023, 8:35pm

I have wanted to do the equivalent of bytes.fromint and bytes.iterbytes in the past and found it unnecessarily tricky, so I’d be in favour of this. If it were up to me, I would add getbyte and iterbytes to memoryview; I think it’s important that it duck types.
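For readers unfamiliar with the problem, here is a minimal sketch of the present-day workaround, assuming the proposed getbyte and iterbytes return length-1 bytes objects rather than ints. The variable names are illustrative only.

```python
data = b"abc"

# Indexing and iterating a bytes object both yield ints.
first = data[0]        # an int, not b'a'
as_ints = list(data)   # a list of ints

# Slicing yields length-1 bytes objects, which is the result the
# proposed data.getbyte(0) and data.iterbytes() are assumed to give.
first_byte = data[0:1]
pieces = [data[i:i + 1] for i in range(len(data))]
```

The slice-per-index spelling works today, but it is easy to get wrong, which seems to be the "unnecessarily tricky" part.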

It looks like this PEP has a history that’s almost longer than my knowledge of Python, so thanks for sticking with it!

jeff5 · December 28, 2023, 7:35am

The elements of a memoryview are not always unsigned bytes, raising additional questions about the proper semantics. E.g. is iterbytes legitimate for a non-byte memoryview and what if it is signed? I can understand why this is out of scope for the PEP right now.

jeff5 · December 28, 2023, 8:00am

Minor nit: “To replace the now discouraged behavior” → “To replace now discouraged behavior”, or tell us (under “Motivation” perhaps), what is the discouraged behaviour (usage?) that this replaces.

pitrou · December 28, 2023, 9:36am

Well, memoryview objects already have a tobytes method that works in non-bytes contexts, so adding iterbytes and getbyte sounds reasonable to me.

>>> m = memoryview(b"abcd")
>>> list(m)
[97, 98, 99, 100]
>>> list(m.cast('h'))
[25185, 25699]
>>> m.cast('h').tobytes()
b'abcd'

jeff5 · December 28, 2023, 9:54am

I did this check before writing:

>>> b = b'ab'
>>> b'abracadabra'.count(b)
2
>>> b'abracadabra'.count(memoryview(b).cast('h'))
2

even though:

>>> memoryview(b).cast('h')[0]
25185

I’m a tad uncomfortable that a memoryview should be treated casually as bytes-like irrespective of the item size. If it’s a settled view that it should (rather than an accident of the C API) I’ll get used to it. If it isn’t a settled view, it would be controversial in the PEP too.

stoneleaf · December 28, 2023, 8:16pm

Proper handling of the data is the responsibility of the programmer.

If part of the memoryview is ascii-encoded text, odds are high that the data contained therein is not homogeneous, and that is the use-case targeted by this PEP.

jeff5 · December 29, 2023, 7:29am

It’s out of scope so it doesn’t really matter right now.

Daverball · December 29, 2023, 8:11am

Whether or not it should be is the point of this discussion, though. The PEP is still a draft, so it can change, and @stoneleaf is one of the authors, so you shouldn’t dismiss discussing the inclusion/exclusion of memoryview based on the current text of the draft.

jeff5 · December 29, 2023, 9:20am

Well, if you insist. My position is that the exclusion of memoryview from scope, while it is a gap in duck-typing, is understandable because the itemsize (or element type) makes it a more complex animal.

To answer @stoneleaf directly, if the owner of the data has responsibly declared that it is properly handled as a sequence of signed 16-bit integers, and some library method subsequently treats it as bytes-like without checking, that might be considered an error. When it came to a comparable decision in some code I wrote, ISTR I chose to raise an error. I was surprised the str-like bytes operations allowed what I showed.

The owner of that data could explicitly cast the memoryview, or a slice, to unsigned bytes to permit the interpretation, e.g. to decode bytes as ascii.
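A minimal sketch of that explicit cast, using only existing memoryview methods. The names owner_view and raw, and the sample data, are illustrative.

```python
# The owner exposes the data as signed 16-bit items.
owner_view = memoryview(b"spam").cast('h')

# An explicit cast back to unsigned bytes opts in to the bytes interpretation.
raw = owner_view.cast('B')

item_size_before = owner_view.itemsize   # 2
item_size_after = raw.itemsize           # 1

# With the cast made explicit, decoding as ASCII is unambiguous.
text = raw.tobytes().decode('ascii')
```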

So I am mildly against what @hauntsaninja appears mildly in favour of, bringing memoryview into scope, and would stick with what @stoneleaf has drafted. No disrespect intended to either.

If it is already the settled core-dev view that an unsigned bytes interpretation is always implicitly allowable, then my code is wrong, and so may be my guess about why memoryview has been excluded, leaving me moderately in favour of inclusion. (Interested in both, but have maybe taken up too much space already.)

ofek · December 29, 2023, 6:21pm

I’ve always defined a utility helper:

def int_to_bytes(num: int) -> bytes:
    return num.to_bytes((num.bit_length() + 7) // 8 or 1, 'big')

I think since the new proposed method requires ASCII then I will have to continue using that. Is there a better way?

stoneleaf · December 29, 2023, 7:08pm

@ofek, your utility will work with any size integer, while the proposed .fromint() is specifically for integers in range(256) – it takes the place of the Python 2 chr function.
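To make the difference concrete, here is a rough sketch. fromint_sketch is a hypothetical stand-in for the proposed bytes.fromint, assuming it behaves like bytes([value]); it is not the PEP's implementation.

```python
def int_to_bytes(num: int) -> bytes:
    # The helper from the post above: as many bytes as the integer needs.
    return num.to_bytes((num.bit_length() + 7) // 8 or 1, 'big')


def fromint_sketch(value: int) -> bytes:
    # Hypothetical stand-in for bytes.fromint: exactly one byte.
    # bytes([value]) raises ValueError when value is outside range(256).
    return bytes([value])


single = fromint_sketch(67)    # one byte, like Python 2's chr(67)
small = int_to_bytes(67)       # same result for values in range(256)
large = int_to_bytes(1000)     # two bytes, big-endian

try:
    fromint_sketch(1000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```

The two only overlap for values in range(256), so the helper remains the right tool for arbitrary-size integers.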

pitrou · December 30, 2023, 10:38am

This discussion is about behaviour explicitly triggered by method calls such as getbyte and iterbytes. The existence of the method tobytes sets a precedent, and I don’t think anyone has complained about it (note that this method is itself modelled on NumPy’s ndarray.tobytes method, which works on any datatype).

Whether or not arbitrary memoryviews should be implicitly accepted as raw bytes is a different discussion.

Daverball · December 30, 2023, 10:46am

I think the only minor controversial part is the optional order argument, since you could argue that iterbytes and getbyte on memoryview should be able to accept that argument as well in order to remain consistent with tobytes.

pitrou · December 30, 2023, 11:17am

Adding an optional order argument to these methods on memoryviews sounds reasonable to me. That said, if no one wants to bother implementing it, it can be omitted and the methods can raise NotImplementedError on multi-dimensional memoryviews.

steve.dower · December 30, 2023, 12:19pm

~~+1 for including memoryview for the sake of duck-typing (which is the same reason why memoryview.count behaves like list.count). As Antoine says, these are explicitly invoked methods.~~

Along those lines, if an order argument is added, it had better be added everywhere or else the duck-typing doesn’t work. If the provided argument (presumably non-zero) doesn’t make sense for the value, raising an error is a standard ValueError case.

[Edit] Actually, I changed my mind. -1 on including memoryview, but still for the sake of duck-typing.

The point of duck-typing is to let the caller choose the type of the argument, not the callee. Passing a memoryview is a clear decision by the caller that their intention is to pass a list of a specific type - they have their own mechanism to pass a sequence of bytes if that’s what they intended. Just as the callee can do a type check for a memoryview and treat it specially.

stoneleaf · December 30, 2023, 2:04pm

iterbytes and getbyte are not related to int.tobytes, and don’t need to support the same parameters.

That’s not the workflow I’m trying to support. When memoryview is used with files containing a mixture of ascii-encoded text along with binary chunks (such as .dbf and .pdf), you have the same issues as when using a bytearray:

record_data_type = mv[offset_to_data_type]  # want a 'C', get a 67
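A small sketch of that situation and the slicing workaround available today. The record contents and the offset are made up for illustration.

```python
# Hypothetical record: a one-character ASCII type code followed by binary data.
record = b"C\x0a\x00"
mv = memoryview(record)
offset_to_data_type = 0

# Indexing gives the integer value of the byte.
as_int = mv[offset_to_data_type]

# A length-1 slice converted with tobytes() gives the bytes object instead.
as_bytes = mv[offset_to_data_type:offset_to_data_type + 1].tobytes()
```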

Daverball · December 30, 2023, 2:53pm

I am talking about memoryview.tobytes, not int.to_bytes. There, order determines the walk order of the dimensions in multidimensional views. It has nothing to do with endianness, which is what byteorder on int.to_bytes is for.
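A short sketch of the distinction, assuming Python 3.8 or later (where memoryview.tobytes accepts order). The 2x3 shape and the variable names are illustrative.

```python
# A 2x3 view over six bytes: rows are b"abc" and b"def".
grid = memoryview(b"abcdef").cast('B', shape=[2, 3])

# order selects how the dimensions are walked when flattening.
row_major = grid.tobytes(order='C')      # rows first
column_major = grid.tobytes(order='F')   # columns first

# byteorder on int.to_bytes is unrelated: it selects endianness.
big = (1).to_bytes(2, byteorder='big')
little = (1).to_bytes(2, byteorder='little')
```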