PEP 467: Minor API improvements for binary sequences is ready for further discussion. The one point I’m a little unsure of is the exclusion of memoryview
(because I have a project that would benefit from it).
Thoughts?
Happy to see this moving again! Is there an accessible summary of the changes from the last-discussed version?
Also, just a reminder to update the PEP with a Discussions-To link in the header pointing to this thread. Thanks!
IIRC, the only change is the removal of a late-added ascii constructor.
Updates to PEP done, thanks.
I have wanted to do the equivalent of bytes.fromint and bytes.iterbytes in the past and found it unnecessarily tricky, so I’d be in favour of this. If it were up to me, I would add getbyte and iterbytes to memoryview; I think it’s important that it duck types.
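For anyone who has not hit this before, here is a rough sketch of how the same things can be spelled in current Python. The names fromint, getbyte and iterbytes below are plain stand-in functions, not the proposed methods, and their behaviour is my assumption based on how the PEP describes them.

```python
def fromint(value: int) -> bytes:
    """Return a length-1 bytes object for an integer in range(256)."""
    # bytes([value]) raises ValueError for values outside range(256).
    return bytes([value])


def getbyte(data, index: int) -> bytes:
    """Return the item at index as a length-1 bytes object instead of an int."""
    # Indexing (rather than slicing) keeps the usual IndexError for bad indexes.
    return bytes([data[index]])


def iterbytes(data):
    """Yield each item of a byte-oriented buffer as a length-1 bytes object."""
    for value in data:
        yield bytes([value])


first = getbyte(b"abc", 0)
pieces = list(iterbytes(bytearray(b"ab")))
```

The slicing spelling data[i:i+1] also yields a length-1 bytes object, but it silently returns an empty result when the index is out of range, which is part of what makes the workaround tricky.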
It looks like this PEP has a history that’s almost longer than my knowledge of Python, so thanks for sticking with it!
The elements of a memoryview are not always unsigned bytes, raising additional questions about the proper semantics. E.g. is iterbytes legitimate for a non-byte memoryview, and what if it is signed? I can understand why this is out of scope for the PEP right now.
Minor nit: “To replace the now discouraged behavior” → “To replace now discouraged behavior”, or tell us (under “Motivation” perhaps), what is the discouraged behaviour (usage?) that this replaces.
Well, memoryview objects already have a tobytes method that works in non-bytes contexts, so adding iterbytes and getbyte sounds reasonable to me.
>>> m = memoryview(b"abcd")
>>> list(m)
[97, 98, 99, 100]
>>> list(m.cast('h'))
[25185, 25699]
>>> m.cast('h').tobytes()
b'abcd'
I did this check before writing:
>>> b = b'ab'
>>> b'abracadabra'.count(b)
2
>>> b'abracadabra'.count(memoryview(b).cast('h'))
2
even though:
>>> memoryview(b).cast('h')[0]
25185
I’m a tad uncomfortable that a memoryview should be treated casually as bytes-like irrespective of the item size. If it’s a settled view that it should (rather than an accident of the C API) I’ll get used to it. If it isn’t a settled view, it would be controversial in the PEP too.
Proper handling of the data is the responsibility of the programmer.
If part of the memoryview is ascii-encoded text, odds are high that the data contained therein is not homogeneous, and that is the use-case targeted by this PEP.
It’s out of scope so it doesn’t really matter right now.
Whether or not it should be is the point of this discussion, though. The PEP is still a draft, so it can change, and @stoneleaf is one of the authors, so you shouldn’t dismiss discussing the inclusion/exclusion of memoryview based on the current text of the draft.
Well, if you insist. My position is that the exclusion of memoryview from scope, while it is a gap in duck-typing, is understandable because the itemsize (or element type) makes it a more complex animal.
To answer @stoneleaf directly, if the owner of the data has responsibly declared that it is properly handled as a sequence of signed 16-bit integers, and some library method subsequently treats it as bytes-like without checking, that might be considered an error. When it came to a comparable decision in some code I wrote, ISTR I chose to raise an error. I was surprised the str-like bytes operations allowed what I showed.
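To make the "raise an error" choice concrete, here is a minimal sketch of such a check. The helper name require_unsigned_bytes is made up for illustration, and treating only format "B" as acceptable is an assumption; a real library might also want to accept "b" or "c".

```python
from array import array


def require_unsigned_bytes(obj) -> memoryview:
    """Hypothetical helper: accept only buffers whose items are unsigned bytes."""
    view = memoryview(obj)
    # bytes and bytearray expose format "B"; array("h") exposes format "h".
    if view.format != "B":
        raise TypeError(f"expected unsigned bytes, got format {view.format!r}")
    return view


accepted = require_unsigned_bytes(b"abcd")

try:
    require_unsigned_bytes(array("h", [1, 2]))
    rejected = False
except TypeError:
    rejected = True
```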
The owner of that data could explicitly cast the memoryview, or a slice, to unsigned bytes to permit the interpretation, e.g. to decode bytes as ascii.
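A short sketch of that explicit opt-in, using only existing memoryview methods. The sample data mirrors the b"abcd" example earlier in the thread; the variable names are illustrative.

```python
# The owner declares the data to be signed 16-bit items.
shorts = memoryview(b"abcd").cast("h")

# Explicitly opt back in to the unsigned-byte interpretation.
raw = shorts.cast("B")

# Slices of the byte view can now be decoded as ASCII text.
text = raw[0:2].tobytes().decode("ascii")
```

Because the cast is spelled out by the caller, a library that checks the format can stay strict without blocking this use.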
So I am mildly against what @hauntsaninja appears mildly in favour of, bringing memoryview into scope, and would stick with what @stoneleaf has drafted. No disrespect intended to either.
If it is already the settled core-dev view that an unsigned bytes interpretation is always implicitly allowable, then my code is wrong, and so may be my guess about why memoryview has been excluded, leaving me moderately in favour of inclusion. (Interested in both, but have maybe taken up too much space already.)
I’ve always defined a utility helper:
def int_to_bytes(num: int) -> bytes:
    return num.to_bytes((num.bit_length() + 7) // 8 or 1, 'big')
Since the newly proposed method requires ASCII, I think I will have to continue using that. Is there a better way?
@ofek, your utility will work with any size integer, while the proposed .fromint() is specifically for integers in range(256); it takes the place of the Python 2 chr function.
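To illustrate the difference, here is a small comparison. The fromint function below is only a stand-in for the proposed method, assumed to behave like bytes([value]); it is not the real API.

```python
def int_to_bytes(num: int) -> bytes:
    # Minimal big-endian encoding of a non-negative integer of any size.
    return num.to_bytes((num.bit_length() + 7) // 8 or 1, "big")


def fromint(value: int) -> bytes:
    # Stand-in for the proposed bytes.fromint(): one integer, one byte.
    return bytes([value])


small = int_to_bytes(65)     # one byte, same result as fromint(65)
large = int_to_bytes(1000)   # two bytes: b'\x03\xe8'

try:
    fromint(1000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```

So the two overlap only for values in range(256); anything larger still needs int.to_bytes or a helper like the one above.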
This discussion is about behaviour explicitly triggered by method calls such as getbyte and iterbytes. The existence of the method tobytes sets a precedent, and I don’t think anyone has complained about it (note that this method is itself modelled on NumPy’s ndarray.tobytes method, which works on any datatype).
Whether or not arbitrary memoryviews should be implicitly accepted as raw bytes is a different discussion.
I think the only mildly controversial part is the optional order argument, since you could argue that iterbytes and getbyte on memoryview should be able to accept that argument as well in order to remain consistent with tobytes.
Adding an optional order argument to these methods on memoryviews sounds reasonable to me. That said, if no one wants to bother implementing it, it can be omitted and the method can raise NotImplementedError on multi-dimensional memoryviews.
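For context, this is what order already does on the existing memoryview.tobytes (Python 3.8 and later). The 2x3 layout is just an illustrative choice; nothing here uses the proposed methods.

```python
# View six bytes as a 2x3 grid: rows are [0, 1, 2] and [3, 4, 5].
grid = memoryview(bytes(range(6))).cast("B", shape=[2, 3])

# "C" walks the grid row by row, "F" walks it column by column.
c_order = grid.tobytes(order="C")
f_order = grid.tobytes(order="F")
```

A getbyte or iterbytes on a multi-dimensional view would have to pick one of these walks, which is why the order argument comes up at all.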
+1 for including memoryview for the sake of duck-typing (which is the same reason why memoryview.count behaves like list.count). As Antoine says, these are explicitly invoked methods.
Along those lines, if an order argument is added, it had better be added everywhere or else the duck-typing doesn’t work. Raising an error if the provided argument (presumably non-zero) doesn’t make sense for the value is a standard ValueError.
[Edit] Actually, I changed my mind. -1 on including memoryview, but still for the sake of duck-typing.
The point of duck-typing is to let the caller choose the type of the argument, not the callee. Passing a memoryview is a clear decision by the caller that their intention is to pass a list of a specific type - they have their own mechanism to pass a sequence of bytes if that’s what they intended. Just as the callee can do a type check for a memoryview and treat it specially.
iterbytes and getbyte are not related to int.to_bytes, and don’t need to support the same parameters.
That’s not the workflow I’m trying to support. When memoryview is used with files containing a mixture of ascii-encoded text along with binary chunks (such as .dbf and .pdf), you have the same issues as when using a bytearray:
record_data_type = mv[offset_to_data_type] # want a 'C', get a 67
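A self-contained version of the same problem, with the slice workaround that is needed today. The record contents and the offset are made up for illustration.

```python
record = bytearray(b"C\x00\x0a")   # hypothetical record: type code, then binary data
mv = memoryview(record)
offset_to_data_type = 0

# Indexing returns an int, not a length-1 bytes object.
as_int = mv[offset_to_data_type]

# Current workaround: take a one-item slice and copy it out as bytes.
as_bytes = mv[offset_to_data_type:offset_to_data_type + 1].tobytes()
```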
I am talking about memoryview.tobytes, not int.to_bytes. There, order determines the walk order of the dimensions in multidimensional views; this order has nothing to do with endianness, which is what byteorder on int.to_bytes is for.