PEP 467: Minor API improvements for binary sequences

PEP 467: Minor API improvements for binary sequences is ready for further discussion. The one point I’m a little unsure of is the exclusion of memoryview (because I have a project that would benefit from it).

Thoughts?

Happy to see this moving again! Is there an accessible summary of the changes from the last-discussed version?

Also, just a reminder to update the PEP with a Discussions-To link in the header pointing to this thread. Thanks!

IIRC, the only change is the removal of a late-added ascii constructor.

Updates to PEP done, thanks.

I have wanted to do the equivalent of bytes.fromint and bytes.iterbytes in the past and found it unnecessarily tricky, so I’d be in favour of this. If it were up to me, I would add getbyte and iterbytes to memoryview; I think it’s important that it duck types.
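To illustrate the friction described here, a minimal sketch using only what exists today. `iter_single_bytes` is a hypothetical helper, and it reflects an assumption about what the proposed iterbytes would yield (length-1 bytes objects), not the PEP's implementation.

```python
def iter_single_bytes(data):
    """Yield each element of a bytes-like object as a length-1 bytes object.

    Hypothetical stand-in for the proposed iterbytes; name and behavior
    are assumptions made for illustration.
    """
    # Slicing (rather than indexing) keeps each element a bytes-like object
    # instead of turning it into an int.
    for i in range(len(data)):
        yield bytes(data[i:i + 1])


data = b"abc"
ints = list(data)                        # plain iteration yields integers
singles = list(iter_single_bytes(data))  # length-1 bytes objects instead
```

The same helper also accepts a bytearray, since it relies only on `len` and slicing.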

It looks like this PEP has a history that’s almost longer than my knowledge of Python, so thanks for sticking with it!

The elements of a memoryview are not always unsigned bytes, raising additional questions about the proper semantics. E.g. is iterbytes legitimate for a non-byte memoryview and what if it is signed? I can understand why this is out of scope for the PEP right now.
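A small sketch of the point about element types, using only existing memoryview behavior. The same memory gives different item values depending on the declared format, which is where the question about signedness comes from.

```python
raw = b"\xff\xfe\x01\x02"

unsigned = memoryview(raw)      # default format 'B': unsigned bytes
signed = unsigned.cast("b")     # same memory, viewed as signed bytes
shorts = unsigned.cast("h")     # same memory, viewed as 16-bit integers

unsigned_first = unsigned[0]    # the byte 0xff read as unsigned
signed_first = signed[0]        # the byte 0xff read as signed
item_sizes = (unsigned.itemsize, signed.itemsize, shorts.itemsize)
```

Here `unsigned_first` is 255 while `signed_first` is -1, and the 'h' view has half as many items as the byte views.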

Minor nit: “To replace the now discouraged behavior” → “To replace now discouraged behavior”, or tell us (under “Motivation” perhaps), what is the discouraged behaviour (usage?) that this replaces.

Well, memoryview objects already have a tobytes method that works in non-bytes contexts, so adding iterbytes and getbyte sounds reasonable to me.

>>> m = memoryview(b"abcd")
>>> list(m)
[97, 98, 99, 100]
>>> list(m.cast('h'))
[25185, 25699]
>>> m.cast('h').tobytes()
b'abcd'

I did this check before writing:

>>> b = b'ab'
>>> b'abracadabra'.count(b)
2
>>> b'abracadabra'.count(memoryview(b).cast('h'))
2

even though:

>>> memoryview(b).cast('h')[0]
25185

I’m a tad uncomfortable that a memoryview should be treated casually as bytes-like irrespective of the item size. If it’s a settled view that it should (rather than an accident of the C API) I’ll get used to it. If it isn’t a settled view, it would be controversial in the PEP too.

Proper handling of the data is the responsibility of the programmer.

If part of the memoryview is ascii-encoded text, odds are high that the data contained therein is not homogeneous, and that is the use-case targeted by this PEP.

It’s out of scope so it doesn’t really matter right now.

Whether or not it should be is the point of this discussion, though. The PEP is still a draft, so it can change, and @stoneleaf is one of the authors, so you shouldn’t dismiss discussing the inclusion or exclusion of memoryview based on the current text of the draft.

Well, if you insist. My position is that the exclusion of memoryview from scope, while it is a gap in duck-typing, is understandable because the itemsize (or element type) makes it a more complex animal.

To answer @stoneleaf directly, if the owner of the data has responsibly declared that it is properly handled as a sequence of signed 16-bit integers, and some library method subsequently treats it as bytes-like without checking, that might be considered an error. When it came to a comparable decision in some code I wrote, ISTR I chose to raise an error. I was surprised the str-like bytes operations allowed what I showed.

The owner of that data could explicitly cast the memoryview, or a slice, to unsigned bytes to permit the interpretation, e.g. to decode bytes as ascii.
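A minimal sketch of that explicit opt-in, using only existing memoryview methods. The data and variable names are made up for illustration.

```python
# The owner declares the buffer as 16-bit integers...
typed = memoryview(b"ABCD").cast("h")

# ...and explicitly casts back to unsigned bytes where a bytes
# interpretation is intended.
as_bytes = typed.cast("B")
text = as_bytes[:4].tobytes().decode("ascii")
```

The cast makes the change of interpretation visible at the call site, so a library that rejects non-byte views would still accept `as_bytes`.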

So I am mildly against what @hauntsaninja appears mildly in favour of, bringing memoryview into scope, and would stick with what @stoneleaf has drafted. No disrespect intended to either.

If it is already the settled core-dev view that an unsigned bytes interpretation is always implicitly allowable, then my code is wrong, and so may be my guess about why memoryview has been excluded, leaving me moderately in favour of inclusion. (Interested in both, but have maybe taken up too much space already.)

I’ve always defined a utility helper:

def int_to_bytes(num: int) -> bytes:
    return num.to_bytes((num.bit_length() + 7) // 8 or 1, 'big')

Since the newly proposed method requires ASCII, I think I will have to continue using that. Is there a better way?

@ofek, your utility will work with any size integer, while the proposed .fromint() is specifically for integers in range(256) – it takes the place of the Python 2 chr function.
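To make the contrast concrete, a rough sketch. `fromint_sketch` is a hypothetical stand-in for the proposed bytes.fromint, and its behavior is an assumption based on the description above (a single byte for an integer in range(256)); `int_to_bytes` is the utility quoted earlier in the thread.

```python
def fromint_sketch(value: int) -> bytes:
    """Hypothetical stand-in for the proposed bytes.fromint (an assumption)."""
    if not 0 <= value <= 255:
        raise ValueError("value must be in range(256)")
    return bytes([value])


def int_to_bytes(num: int) -> bytes:
    # The utility quoted above: big-endian, as many bytes as the value needs.
    return num.to_bytes((num.bit_length() + 7) // 8 or 1, "big")


single = fromint_sketch(67)   # one byte for one small integer
wide = int_to_bytes(1024)     # two bytes, because 1024 does not fit in one
zeros = bytes(3)              # three NUL bytes, not the byte with value 3
```

The last line shows the existing constructor behavior that makes a dedicated single-byte spelling attractive.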

This discussion is about behaviour explicitly triggered by method calls such as getbyte and iterbytes. The existence of the method tobytes sets a precedent, and I don’t think anyone has complained about it (note that this method is itself modelled on Numpy’s ndarray.tobytes method, which works on any datatype).

Whether or not arbitrary memoryviews should be implicitly accepted as raw bytes is a different discussion.

I think the only minor controversial part is the optional order argument, since you could argue that iterbytes and getbyte on memoryview should be able to accept that argument as well in order to remain consistent with tobytes.

Adding an optional order argument to these methods on memoryviews sounds reasonable to me. That said, if no one wants to bother implementing it, it can be omitted and the methods can raise NotImplementedError on multi-dimensional memoryviews.

+1 for including memoryview for the sake of duck-typing (which is the same reason why memoryview.count behaves like list.count). As Antoine says, these are explicitly invoked methods.

Along those lines, if an order argument is added, it had better be added everywhere or else the duck-typing doesn’t work. If the provided argument (presumably non-zero) doesn’t make sense for the value, raising a ValueError is the standard response.


[Edit] Actually, I changed my mind. -1 on including memoryview, but still for the sake of duck-typing.

The point of duck-typing is to let the caller choose the type of the argument, not the callee. Passing a memoryview is a clear decision by the caller that their intention is to pass a list of a specific type; they have their own mechanism to pass a sequence of bytes if that’s what they intended, just as the callee can do a type check for a memoryview and treat it specially.

iterbytes and getbyte are not related to int.to_bytes, and don’t need to support the same parameters.

That’s not the workflow I’m trying to support. When memoryview is used with files containing a mixture of ascii-encoded text along with binary chunks (such as .dbf and .pdf), you have the same issues as when using a bytearray:

record_data_type = mv[offset_to_data_type]  # want a 'C', get a 67
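A runnable sketch of the situation, with a made-up record layout (the field contents and offset are assumptions for illustration only). It shows the slice-based workaround available today; the proposed getbyte is not used because it does not exist yet.

```python
# Hypothetical record: a 10-byte padded name, a 1-byte type code, 2 binary bytes.
record = bytearray(b"NAME".ljust(10) + b"C" + b"\x14\x00")
mv = memoryview(record)
offset_to_data_type = 10

as_int = mv[offset_to_data_type]  # indexing gives the integer 67
# Slicing and converting gives the length-1 bytes object instead.
as_bytes = mv[offset_to_data_type:offset_to_data_type + 1].tobytes()
```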

I am talking about memoryview.tobytes, not int.to_bytes. There, order determines the walk order of the dimensions in multidimensional views. It has nothing to do with endianness, which is what byteorder on int.to_bytes is for.
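A short sketch of the difference between the two parameters, using existing APIs (the order argument of memoryview.tobytes requires Python 3.8 or later). The 2x3 shape is just an example.

```python
flat = bytes(range(6))
# View the six bytes as a 2x3 grid: rows [0, 1, 2] and [3, 4, 5].
grid = memoryview(flat).cast("B", shape=[2, 3])

# order controls how the dimensions are walked when flattening.
row_major = grid.tobytes(order="C")
column_major = grid.tobytes(order="F")

# byteorder controls how one integer is laid out in memory.
big_endian = (258).to_bytes(2, byteorder="big")
little_endian = (258).to_bytes(2, byteorder="little")
```

With order="F" the bytes come out column by column, while byteorder only swaps the bytes within a single integer.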