Hello all,
Why is the Python 3 interpreter printing byte strings by decoding bytes as UTF-8 (perhaps actually only true for the ASCII analogue codepoints), instead of printing \xNN values? Am I missing something?
Thank you
Because 1) Python 2 used byte strings for text, and 2) many internet protocols still do so. So printable ASCII codes are printed as characters, both because that is more often what people want and to avoid breaking existing code.
I do not know that much, but I don’t think that “text” is a data type. Bytes of memory used as data have values, and then there are encodings which allow for abstraction of that data as characters. So when representing (printing) data, I think it is best not to mix abstraction levels.
It’s easier to debug:
b = b'POST /users HTTP/1.1\r\n'
print(b[0:4] == b'POST')
It isn’t. It is treating them as ASCII: printable ASCII bytes are shown as characters, and everything else gets an escape such as \xNN. The reason is historical: originally Python aimed no higher than C in multi-lingual support. In those days, character meant byte, and countries with funny keyboards would make do with a local work-around (like codepages and other bad ideas).
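A minimal sketch of that behaviour, in case it helps. The variable names are just for illustration, and the sample string is an arbitrary choice:

```python
# 'é' is not ASCII, so UTF-8 encodes it as the two bytes 0xC3 0xA9.
data = "café\n".encode("utf-8")

# The repr keeps printable ASCII as characters and escapes the rest.
shown = repr(data)
print(shown)  # b'caf\xc3\xa9\n'

# The repr is pure ASCII, so nothing was decoded as UTF-8 to build it.
assert shown.isascii()

# Turning the bytes back into text is a separate, explicit step.
text = data.decode("utf-8")
assert text == "café\n"
```

Note that the two bytes of 'é' appear as two separate \xNN escapes, not as one character, which is what you would see if the interpreter were really decoding UTF-8.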
Officially, bytes and bytearray are arrays of small integers. They aren’t “binary strings”. The default repr() is fairly unreadable if it isn’t ASCII text, but it is easy to code something more suitable to your immediate purpose, like:
' '.join(f'{x:02x}' for x in b)
Or a bit more concisely: b.hex(" ")
>>> "café".encode("utf-8").hex(" ")
'63 61 66 c3 a9'
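To illustrate the "arrays of small integers" point with the request line from earlier in the thread, here is a rough sketch. The names first, head, values, buf and hex_view are made up for this example, and the separator argument to hex() assumes Python 3.8 or later:

```python
b = b'POST /users HTTP/1.1\r\n'

first = b[0]          # indexing gives an int: 80, the ASCII code for 'P'
head = b[0:4]         # slicing gives bytes again: b'POST'
values = list(head)   # [80, 79, 83, 84]

# bytearray is the mutable counterpart; items are assigned as ints.
buf = bytearray(head)
buf[0] = 0x47         # 0x47 is 'G', so buf is now bytearray(b'GOST')

# The same four bytes, viewed as hex instead of ASCII.
hex_view = head.hex(" ")
print(first, values, bytes(buf), hex_view)
```

So the characters in the repr are only a display convention; the values underneath are integers in range(256) either way.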
Thank you all for responding. This behavior is a coherence flaw in my opinion, albeit a minor one. I’ve found the proper workaround to be:
b_utf16 = 'café'.encode('utf-16')
print("b'" + ''.join(f'\\x{byte:02x}' for byte in b_utf16) + "'")
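If you go this route, one possible way to package it is a small helper. The name escape_all is hypothetical, not anything from the standard library. Checking the output with ast.literal_eval is my own suggestion for confirming that the result is still a valid bytes literal:

```python
import ast


def escape_all(data: bytes) -> str:
    """Return a bytes literal in which every byte is written as \\xNN."""
    return "b'" + "".join(f"\\x{byte:02x}" for byte in data) + "'"


b_utf16 = "café".encode("utf-16")  # starts with a byte order mark
literal = escape_all(b_utf16)
print(literal)

# The all-escaped form evaluates back to exactly the same bytes.
assert ast.literal_eval(literal) == b_utf16
```

The default repr round-trips in the same way, so the two forms differ only in how they read, not in what they represent.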
You’re missing the point: it wasn’t meant to print all bytes using the \x prefix, just the non-printable ones. The standard string encoding has been ASCII since dinosaurs roamed the terminal, and it’s still the de jure and de facto standard in almost all protocols.
For what purpose is that result better?
There is no point that you are making.
Probably up to you to find it.