PEP 467: Minor API improvements for binary sequences

Ah, thank you for the clarification.

I think it is still irrelevant for the purpose of this PEP, since the PEP’s goal is treating a block of bytes as individual bytes and not individual integers.

I’m not necessarily making a point for or against including the order parameter. I’ve never personally had to use it, since I don’t interface with Fortran code, so I don’t really have strong feelings about it.

But the status quo on memoryview is that it does exist, so it’s reasonable to anticipate that the people who currently rely on order in tobytes might argue for its inclusion in iterbytes and getbyte. In some sense those are just convenience methods for accessing the same information without having to convert the entire buffer to bytes ahead of time. So I think it’s at least worth mentioning in the PEP if you decide to include memoryview.

In my experience, many use cases of memoryview are simply motivated by the zero-copy nature of memoryviews. In this context, the memoryview just represents a bytes view.
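
To make the zero-copy point concrete, here is a minimal sketch using only built-in types. The variable names are illustrative and not taken from the PEP.

```python
# A memoryview slice shares memory with the underlying buffer.
buf = bytearray(b"hello world")
view = memoryview(buf)[6:]   # slicing the view copies no bytes

buf[6] = ord("W")            # mutate the original buffer in place
shared = bytes(view)         # the view observes the change

assert shared == b"World"
```

Slicing the bytearray directly (`buf[6:]`) would instead produce an independent copy.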

This was my first thought before my head went into type systems and I changed my mind :smiley:

I guess I can be persuaded either way, so maybe don’t count my vote in any direction. I’ll settle for provoking discussion and hopefully consensus can be reached without me. (Though as an aside, I’d love more “view” type objects - C# got a huge amount of value out of their string span type, and I’d imagine we could make str[:] -> strview and bytes[:] -> bytesview pretty transparent to users.)

Is the Motivation section of PEP-467 still accurate? Specifically the “will aid in … in porting any remaining Python 2 wire format code” text, which sounds a little unusual many years after the EOL of Python 2. I understand why that was a timely motivation in 2014 (when PEP-461, re-adding % to bytes, was accepted for related reasons), but I’d leave that claim out today.

There are people and organizations with legacy Python 2 code still running and sitting around un-ported. But a lack of this PEP’s APIs is not the reason why. They’ve got code changes to make no matter what and have not invested in making them. Adding new APIs doesn’t change that non-technical situation.

Reading through the PEP, my first questions are pretty much exactly what PEP 467 itself answers. (“We have at least one way to do this, why is it useful to have more?” being a natural PEP-review question.)

.fromsize thoughts

B.fromsize(5, fill=b'\x5a') should be expanded to allow fill values that are more than a single byte, because this is the first thing I expect users of that to ask for. Effectively the equivalent of B(fill) * (size // len(fill)), with a proper error message or well-defined truncation behavior when size is not a multiple of len(fill). Initializing memory with a value need not be restricted to one-byte values.
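
A rough sketch of what a multi-byte fill could look like, written as a free function. The name `fromsize_fill` is hypothetical, and raising ValueError on a non-multiple size is only one of the two options mentioned above (truncation being the other); none of this is specified by the PEP.

```python
def fromsize_fill(size: int, fill: bytes = b"\x00") -> bytes:
    """Hypothetical helper: build `size` bytes by repeating `fill`."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if not fill:
        raise ValueError("fill must not be empty")
    repeats, remainder = divmod(size, len(fill))
    if remainder:
        # Assumption: reject sizes that do not divide evenly,
        # rather than silently truncating the last repetition.
        raise ValueError(
            f"size {size} is not a multiple of len(fill) == {len(fill)}"
        )
    return fill * repeats


single = fromsize_fill(5, b"\x5a")   # one-byte fill, as in the PEP example
pattern = fromsize_fill(4, b"ab")    # two-byte fill repeated twice
```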

.fromint thoughts

I like it. It fills a need. And it is kept simple with the 0…255 range. Good. I do wonder about the name as it wasn’t entirely obvious to me what it did without reading the PEP, meaning I expect the same to be true for others who haven’t read the docs. But I think that is okay in this situation, it’ll be quickly learned.

An alternative name such as .frombyte could be confusing as “what is a byte?” is a valid question in a Python context given that a length 1 bytes object and 8-bit unsigned integer values are both valid interpretations. .fromint removes that potential awkward misunderstanding.

I appreciate all of the explicit proposed documentation cross-linking.

.getbyte and .iterbytes methods

Both of these seem very clearly useful and so much more readable than [x:x+1] slice syntax.
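
For readers who have not hit this, a small sketch of the current behaviour and of stand-ins for the proposed methods. The free functions `getbyte` and `iterbytes` below are illustrative approximations, not the PEP’s implementation.

```python
data = b"abc"

# Today: indexing and iteration yield ints, slicing yields bytes.
first_as_int = data[0]       # 97
first_as_bytes = data[0:1]   # b"a"


def getbyte(seq, index):
    """Stand-in for the proposed method: length-1 bytes at `index`."""
    # Plain indexing raises IndexError when out of range.
    return bytes([seq[index]])


def iterbytes(seq):
    """Stand-in for the proposed method: iterate as length-1 bytes."""
    return (bytes([value]) for value in seq)


pieces = list(iterbytes(data))
```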

bchr built-in

I’m glad this was removed.

The question of memoryview support.

If we’re going to leave that out of this PEP and further the divergence in duck-type similarity between its API and that of bytes and bytearray, we need to explicitly state why in the PEP and call out that we are accepting a divergence.

I do expect some negative impact if we leave it out, as Python duck typing means people enjoy just passing this variety of types around to APIs, not necessarily expecting one or the other. Suddenly the shapes won’t match, and a library author thinking of, say, only bytes when writing code may start using the new APIs, only to belatedly break users who were passing in memoryview.

So to me, keeping memoryview in scope would be a good thing.

Specifically: memoryview.getbyte and memoryview.iterbytes seem like they should exist and work on any memoryview with single byte elements and a single dimension (.format == "B" and .ndim == 1). Raising a well defined exception otherwise (TypeError? ValueError?).
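
A sketch of that restriction as a free function. The name `mv_iterbytes` is hypothetical, and ValueError is picked arbitrarily here since the exception type is an open question.

```python
from array import array


def mv_iterbytes(view: memoryview):
    """Hypothetical: iterate a 1-D 'B' memoryview as length-1 bytes."""
    # Check eagerly so the error surfaces at call time, not on first next().
    if view.format != "B" or view.ndim != 1:
        # Assumption: the exception type is undecided in this discussion.
        raise ValueError("only 1-D views with format 'B' are supported")
    return (bytes([value]) for value in view)


ok = list(mv_iterbytes(memoryview(b"ab")))

try:
    mv_iterbytes(memoryview(array("h", [1, 2])))   # format 'h', rejected
    rejected = False
except ValueError:
    rejected = True
```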

Probably NotImplementedError then, because support could be later added. Since tobytes is available on all memoryview objects, including n-dimensional and non-contiguous, there’s no conceptual reason why getbyte and iterbytes shouldn’t be exposed as well:

>>> import numpy as np
>>> obj = np.arange(24, dtype='int16').reshape(2,3,4).swapaxes(1, 2)[...,::2]
>>> obj
array([[[ 0,  8],
        [ 1,  9],
        [ 2, 10],
        [ 3, 11]],

       [[12, 20],
        [13, 21],
        [14, 22],
        [15, 23]]], dtype=int16)
>>> obj.shape, obj.strides
((2, 4, 2), (24, 2, 16))
>>> m = memoryview(obj)
>>> m.tobytes()
b'\x00\x00\x08\x00\x01\x00\t\x00\x02\x00\n\x00\x03\x00\x0b\x00\x0c\x00\x14\x00\r\x00\x15\x00\x0e\x00\x16\x00\x0f\x00\x17\x00'
# note that `tobytes` walks the values in logical order

This doesn’t mention int.to_bytes as an alternative to bytes.fromint

I don’t see enough need for bytes.fromint when we have int.to_bytes

But if it is added,

>>> (512).to_bytes()
Traceback [...]
OverflowError: int too big to convert

It seems like it would be good for the type of exception to match.
(OverflowError rather than ValueError)

Also, why not include the other functionality from int.to_bytes in bytes.fromint? (like length with endianness, and a signed option)
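
For comparison, a short sketch of what already works today. Explicit length and byteorder arguments are passed so it also runs on Python versions older than 3.11, where `to_bytes()` has no defaults.

```python
value = 65

# Two existing spellings for "one byte from an int in range(256)".
via_constructor = bytes([value])
via_to_bytes = value.to_bytes(1, "big")

# The two spellings fail differently when the value is out of range.
try:
    (512).to_bytes(1, "big")
except OverflowError as exc:
    to_bytes_error = type(exc).__name__

try:
    bytes([512])
except ValueError as exc:
    constructor_error = type(exc).__name__

# Extra int.to_bytes functionality: length, endianness, signedness.
two_bytes_le = (1).to_bytes(2, "little")
signed_byte = (-1).to_bytes(1, "big", signed=True)
```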

I would prefer
from_size over fromsize
from_int over fromint
get_byte over getbyte

and probably
iter_bytes over iterbytes
though I’m not as sure about this last one since “iter” isn’t a word.

I thought that using multiple words without underscore was mostly leftover from older Python, and newer Python was moving towards underscores.
But maybe I’m just imagining that…

The 2->3 conversion text has been dropped. (Still applies to me, though!)

Agreed, and added.

If memoryview is in scope of the PEP, but the specification is limited to 1D ‘B’ arrays as @gpshead suggests, then I think the output of tobytes is the same whatever is passed in order.
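
A quick check of that claim, as a sketch. It assumes Python 3.8 or later, where `memoryview.tobytes` accepts `order`. The 2x3 cast is only there to show that order does start to matter once there is more than one dimension.

```python
flat = memoryview(bytes(range(6)))          # 1-D, format 'B'

# For a 1-D 'B' view, every order gives the same bytes.
c_flat = flat.tobytes(order="C")
f_flat = flat.tobytes(order="F")
a_flat = flat.tobytes(order="A")

# Reinterpret the same memory as a 2x3 array of bytes.
grid = flat.cast("B", shape=[2, 3])
c_grid = grid.tobytes(order="C")
f_grid = grid.tobytes(order="F")            # column-major walk differs
```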

For the output of memoryview.tobytes to be useful to the client that calls it, it needs at least the shape and element format. The use case for including memoryview is duck-similarity with bytes and bytearray. So the client is assuming the 1D ‘B’ case and no enquiry will be made into shape and format. Either that or the client has a code path specific to ducks more closely resembling memoryview.

I would therefore not include the order argument in the PEP’s specification of the API.

But I think you are right about their signatures in memoryview when implemented, with the same default as in memoryview.tobytes. Anything else would be confusing. Personally, I expected a default of order='A' (a copy of the memory is what PEP-3118 proposed), but it isn’t the place to argue that, and it is moot in the use case.

You cannot know at the point memoryview.[tobytes|getbyte|iterbytes] is called what the client is assuming about shape and format, or whether it will subsequently enquire about them in order to interpret the data correctly. So I think it cannot be a type or value error to call them on a memoryview of whatever shape, but as Antoine @pitrou points out, they might not be implemented.

The presence of an additional optional parameter for one of the types doesn’t really prevent duck typing, it’s just that if you are duck typing you cannot rely on it, so the person passing in the memoryview is responsible for the data already being in the correct order, at least if you ever decide to allow multi-dimensional memoryviews to work with getbyte/iterbytes.

My argument mostly was that if order is useful on tobytes, it can probably be just as useful on getbyte and iterbytes, so sooner or later someone might ask for it. It’s also fine to defer implementing getbyte/iterbytes for multi-dimensional views until someone demonstrates a need for it. So it doesn’t necessarily need to be part of this PEP, but I also don’t think it would hurt its chances if it were.

We agree, but it is reassuring you also think those methods in memoryview could offer an optional order parameter without the API requiring it of bytes, etc…

I think a section on how this applies to memoryview would make sense, discussing order.

I would rather say they are responsible for understanding that the receiving code is going to take a particular 1D bytes view that may not be meaningful in the context. A bit like bytes.__contains__ doing this:

>>> np.array([[28518, 8312], [30058, 28781]], dtype='int16') in b'The quick brown fox jumps over the lazy dog'
True
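
The same effect can be sketched without NumPy. Those four int16 values, laid out as little-endian bytes, happen to spell a substring of the sentence, and the membership test only ever sees the flat bytes. The use of `struct` here is my own illustration, not part of the original example.

```python
import struct

sentence = b"The quick brown fox jumps over the lazy dog"

# Pack the four int16 values explicitly as little-endian.
packed = struct.pack("<4h", 28518, 8312, 30058, 28781)

# bytes.__contains__ accepts any bytes-like object and compares raw
# bytes; shape and element format play no part in the test.
found = packed in sentence
found_via_view = memoryview(packed) in sentence
```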