Array and memoryview endianness

dg-pb · February 20, 2025, 6:16am

struct module supports explicit specification of endianness, but I can not find anything about it in array or memoryview docs.

2 questions:

Default is either native or little. Which one is it? Why?
Is explicit definition allowed for array and memoryview.cast? If not, why?

tim.one · February 20, 2025, 6:57am

Native:

When no prefix character is given, native mode is the default. It packs or unpacks data based on the platform and compiler on which the Python interpreter was built.

This is to ease building structs usable as-is by platform C code.

No.

struct is also intended to build structures that can be ported across platforms. In particular, various network protocols demand “big endian” formats. array and memoryview are not intended to supply x-platform formats. For that reason, e.g., they don’t supply any notion of “standard sizes” either. For example, an array format code of “L” is the platform C’s “unsigned long” type, and is only guaranteed to consume at least 4 bytes.

struct, again for x-platform use, allows forcing specific sizes.

tim.one · February 20, 2025, 7:10am

BTW, I answered “native” to the implied question of what struct’s default is. If you’re asking instead about the behavior of array/memoryview, same answer (native) but different reason: an array that didn’t use the native storage format would be computationally useless.

dg-pb · February 20, 2025, 7:32am

Thanks

Trying to put it in comparison to numpy, which does allow explicit specification.
Any chance you know what strategy they employ for it not to be “computationally useless”?

seberg · February 20, 2025, 10:46am

NumPy just implements the cast and uses it (if necessary transparently for the user). I can see that it is a perfectly practical choice that memoryview doesn’t deal with byte-swapping.

FWIW, byte-swapping is important for sharing data e.g. on disk, but dunno if memoryview needs to support it (but to me it is mostly a facilitator for the buffer protocol, not to actually work with).
The low-level integration in NumPy probably isn’t really needed (i.e. nobody should do math with byte-swapped data, all you need is to be able to convert it once if you e.g. memory-mapped a file).
(EDIT: but it is probably practical/nice and saves memory in some cases.)

tim.one · February 20, 2025, 5:47pm

A Python array.array is a different type than a numpy.ndarray. The latter stores endian information, but the former doesn’t. numpy dynamically swaps bytes when extracting a scalar, if needed, to match native endianness. Ditto when storing a scalar.

Too much hidden magic for my tastes . I’ve never had to deal with endianness issues in numpy so have no feel for actual practice. It would be good to hear from someone who did, Offhand, if I had to read up non-native data into a numpy array, same as in Python I’d byteswap it to native order immediately after reading

seberg · February 21, 2025, 7:33am

Well, you need byte-swapping casts for any form of serialization, so you do need to be able to represent it (but could try to limit it to places related to serialization). That could be a specialized IO library, but when NumPy started it effectively was that specialized IO library :).

As long as NumPy has some support for unaligned memory or byte-swapping, the extra burden of threading it almost everywhere is that you have to check for both almost everywhere (and often we also check for contiguous memory).

So, unless you get rid (sandbox) both unaligned and byte-swapped it is probably not much really easier to try to limit the support only to IO or so.
I.e. for NumPy I would the main difficulty isn’t code complexity (based on its internal design), but just fact itself that you should know about byte-swapping as a contributor.
(Yes, on first sight there will be a lot of complexity, but most of that is necessary for any parametric dtype such as structs or datetimes.)

EDIT: I suppose that it is easy to forget these checks, is a bit of a problem…

Rosuav · February 21, 2025, 9:04am

I can’t speak specifically to numpy or arrays, but whenever I’m reading something that has a defined endianness (such as a binary file), I prefer to explicitly state the desired endianness as a read operation, and then work in the abstract after that. If that means setting one flag on the array saying “you are little-endian 32-bit integers”, then that seems pretty convenient to me, but most of what I work with is structures, so that’s already covered.