What should be the default value for `int.to_bytes(..., byteorder=?, ...)`

barry · September 14, 2021, 2:58pm

I’m not the only one who’s thought about this as the “obvious” way.

mdickinson · September 14, 2021, 4:13pm

I’ve failed to clearly articulate why I consider a system-dependent default to be problematic. Let me try to fix that:

Suppose you’re writing some piece of code that requires a 24-bit unsigned int (probably as part of lots of other pieces of data) to be communicated over byte-oriented channels. With the convenience of the new default values for int.to_bytes and int.from_bytes, your code for this might contain a pair of functions that look like this:

def encode_my_data(my_data):
    value = my_data.important_int
    encoded_value = value.to_bytes(3)
    < ...  more encoding ... >

def decode_my_data(encoded_data):
    < ... more decoding ... >
    encoded_value = < ... extract relevant 3 bytes here ...>
    my_data.important_int = int.from_bytes(encoded_value)

You write appropriate unit tests and request a review. The code passes tests and review, the feature works in manual testing, and clearly the decoding and encoding match - all looks good.

But with a system-dependent default this code is subtly buggy: if both functions run on the same machine, all is well. But if the two functions run on different machines, the correctness relies on the two machines having the same native byteorder. And that might be a valid assumption right now, and on all test and CI machines, and only fail much later in a deployed environment, or when the context that the code is run in changes, or when the code is copy-and-pasted to a different codebase, or …

In contrast, with a fixed default (FWIW there seem to be good reasons to prefer "big"), this code is fine whether the two functions are run on the same machine or different machines, and whether the two functions are run at the same time or at widely separated times, or in a different context.
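To make the failure mode concrete, here is a minimal sketch that simulates the two machines by passing the byte order explicitly. The value `0x123456` and the variable names are illustrative assumptions, not part of the original example.

```python
# Simulate two machines with different native byte orders by
# passing the order explicitly (the value is an arbitrary 24-bit int).
value = 0x123456

# "Machine A" is little-endian and encodes using its native order.
encoded_on_a = value.to_bytes(3, "little")

# "Machine B" is big-endian and decodes using *its* native order.
decoded_on_b = int.from_bytes(encoded_on_a, "big")

print(hex(value))         # 0x123456
print(hex(decoded_on_b))  # 0x563412: silently wrong, no exception raised

# With one fixed order on both sides, the round trip is faithful
# no matter where each half runs.
fixed = int.from_bytes(value.to_bytes(3, "big"), "big")
print(fixed == value)     # True
```

Note that the mismatch raises no error; the decoded integer is simply a different number.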

There’s a strong analogy with the encoding argument to the open builtin. A piece of code that does

with open(my_config_file, "r") as f:
    ...

is easy to write, easy to read, easy to review, might well pass all tests, and is again subtly buggy in many cases: it uses a (hidden) platform-dependent encoding. And 99.9% of the time (including on all test machines), that platform-dependent encoding might happen to match the one that was used to encode my_config_file. But then you deploy this code to a Windows machine in Japan whose system encoding is Shift JIS, and you discover that your config file backslashes have been turned into Yen signs, and your code fails in some horribly obscure manner.

I regret that I’m in a position to report that the above scenario is not a hypothetical - it’s a real source of actual late-discovered bugs in tested, reviewed code in production. (And I’m not alone in that discovery - this is what motivated PEP 597, of course.)
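The same kind of silent mismatch can be reproduced without a Japanese Windows machine. The sketch below is only an illustration under stated assumptions: it writes a temporary file as UTF-8 and then reads it back as Latin-1, which stands in for "whatever the platform default happens to be". The file contents and variable names are made up for the example.

```python
import os
import tempfile

text = "name = café\n"

# Create a scratch file to play the role of the config file.
fd, path = tempfile.mkstemp(suffix=".cfg")
os.close(fd)
try:
    # Write the file as UTF-8, stating the encoding explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with a different encoding raises no error here;
    # it just yields different text.
    with open(path, "r", encoding="latin-1") as f:
        misread = f.read()

    # Reading with the matching, explicit encoding is portable.
    with open(path, "r", encoding="utf-8") as f:
        correct = f.read()
finally:
    os.remove(path)

print(repr(misread))  # 'name = cafÃ©\n'
print(repr(correct))  # 'name = café\n'
```

Passing `encoding=` on both sides removes the dependence on the platform default, in the same way that passing `byteorder=` does for the int methods.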

A good API should not only make it easy to do the right thing, but also make it hard to accidentally do the wrong thing. Using a system-dependent default for int.from_bytes and int.to_bytes makes it easy to accidentally do the wrong thing.

barry · September 14, 2021, 6:54pm

Thanks Mark. I understand the problem, of course. I think you’d have to admit that the same subtle bug already exists if you were to use the struct module instead.

Personally, I’m skeptical that people with this use case would reach for int.to_bytes() and int.from_bytes() rather than struct.pack() and struct.unpack(), since I think a binary data interchange format would more likely contain structured data than plain ints. And there they already have to be explicit about byte order, size, padding, etc. If that’s the case, why would these APIs be different?

Maybe there is no difference, but we’re stuck with struct’s defaults because we can’t change that in a backward compatible way. Then the argument might well be, “yes we have that problem over there, but let’s not repeat that here, so it’s better to break precedent.”
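For reference, a small sketch of how byte order is selected in struct. With no prefix character, struct uses native byte order, size and alignment (the same as the `@` prefix); `=` keeps native byte order but uses standard sizes; `<` and `>` fix the order. The value packed here is an arbitrary choice.

```python
import struct
import sys

n = 1

# "=": native byte order with the standard 4-byte size for "I".
# The result depends on the machine this runs on.
native = struct.pack("=I", n)

# "<" and ">": explicit little- and big-endian, same result everywhere.
little = struct.pack("<I", n)
big = struct.pack(">I", n)

# The int methods produce the same bytes when given the same order.
print(native == n.to_bytes(4, sys.byteorder))  # True
print(little == n.to_bytes(4, "little"))       # True
print(big == n.to_bytes(4, "big"))             # True
```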

tim.one · September 14, 2021, 7:03pm

I’m with Mark, but will go on to say “make the default ‘big’, period, end of story”.

Very easy to code, document, and understand, and by default creates bytes objects that convert back faithfully regardless of which platform(s) the two halves of the round-trip are run on.

“Native” order can make sense in struct, because it’s restricted there to integers of a few small power-of-2 byte lengths. That reflects hardware realities. But there is no HW reality to prefer either ordering for, say, a 37-byte integer.
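A short sketch of the 37-byte case, with an arbitrarily chosen value. The int methods accept any length, while struct's integer formats stop at 8 bytes, so an integer this size cannot be packed with a single struct integer code.

```python
import struct

# An integer that needs exactly 37 bytes (296 bits); the value is arbitrary.
n = (1 << 295) + 1
print((n.bit_length() + 7) // 8)  # 37

# Both orders are available, and each round-trips with itself.
big = n.to_bytes(37, "big")
little = n.to_bytes(37, "little")
print(big == little[::-1])              # True
print(int.from_bytes(big, "big") == n)  # True

# struct's widest integer format is 8 bytes ("Q"), so this value is rejected.
try:
    struct.pack(">Q", n)
    rejected = False
except (struct.error, OverflowError):
    rejected = True
print(rejected)  # True
```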

But even in struct, native ordering typically isn’t used unless it’s trying to pack/unpack blocks of memory that need also to be understood as native C structs by the platform C compiler. That’s the primary reason struct exists, and the default to “native sizes and alignments” there reflects that primary purpose.
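As an illustration of "native sizes and alignments", the sketch below compares the size of the same two-field layout in standard and native mode. The format `bI` (a signed char followed by an unsigned int) is an arbitrary choice for the example, and the native result is deliberately not hard-coded because it depends on the platform C compiler.

```python
import struct

# Standard mode ("="): fixed sizes and no padding, so 1 + 4 bytes.
standard_size = struct.calcsize("=bI")
print(standard_size)  # 5

# Native mode ("@", also the default with no prefix): sizes and padding
# follow the platform C compiler, so the int may be aligned after the char.
native_size = struct.calcsize("@bI")
print(native_size)    # platform-dependent, at least 5

# Omitting the prefix is the same as "@".
print(struct.calcsize("bI") == native_size)  # True
```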

The int methods at issue here are far removed from that. I’m hard pressed to think of a real use case where I’d want “native” ordering for the int methods. What struct does made sense for struct. But the context at issue here is not struct.

“A foolish consistency is the hobgoblin of little minds.”

mdickinson · September 14, 2021, 7:06pm

Happily!

Thanks; yes, that’s a perfect summation of my position. As with any design decision, it’s all about trade-offs, and for me the break in consistency with the struct module feels like a lesser weevil than the potential issues from the system dependence.

pitrou · September 14, 2021, 7:24pm

I think it would be a bug magnet if int.to_bytes had a different default than the struct module. So it should either be the same default, or not default at all.

For the verbosity issue, perhaps we can allow shorthands, e.g. int.to_bytes(n, '<') (little endian) and int.to_bytes(n, '>') (big endian).
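The shorthand idea can be approximated today with a thin wrapper. This is only a sketch: `to_bytes_short` and `_BYTEORDER_SHORTHANDS` are hypothetical names invented for illustration, not an existing or proposed API, and the wrapper simply maps the struct-style prefix characters onto the existing `byteorder` strings.

```python
# Hypothetical mapping from struct-style prefixes to byteorder names.
_BYTEORDER_SHORTHANDS = {"<": "little", ">": "big"}


def to_bytes_short(value, length, order):
    """Call int.to_bytes, accepting '<' or '>' as well as 'little' or 'big'."""
    byteorder = _BYTEORDER_SHORTHANDS.get(order, order)
    return value.to_bytes(length, byteorder)


print(to_bytes_short(258, 2, "<"))    # b'\x02\x01'
print(to_bytes_short(258, 2, ">"))    # b'\x01\x02'
print(to_bytes_short(258, 2, "big"))  # b'\x01\x02'
```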

barry · September 15, 2021, 4:48am

"big" by default it is then. And they lived happily ever after.

barry · September 15, 2021, 4:49am

Let’s go with "big" and be done with it! I’ll let someone else take up the shorthand sword!