Unusual behavior of python3* when printing hex values

I’m new to Python but have experience with other languages.
I discovered an unusual behavior of python3* on my Kali Linux. All I wanted was to print 4 values given in hex:

This is OK and works as it should:
Python 2.7.18 (default, Mar 28 2022, 20:47:09) [GCC 11.2.0] on linux2
$ python2.7 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 8a 04 08 0a |7…|

This is my problem:
Python 3.10.4 (main, Mar 24 2022, 13:07:27) [GCC 11.2.0] on linux
$ python3.10 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 c2 8a 04 08 0a
-------------------^^ ??

Python 3.9.12 (main, Mar 24 2022, 13:02:21) [GCC 11.2.0] on linux
$ python3.9 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 c2 8a 04 08 0a |7…|
-------------------^^ ??

Did I do something wrong? Any suggestions?
Thanks to the community for help.

According to this Stack Overflow post by Benjamin Peterson, you can write the raw bytes directly:

import sys
sys.stdout.buffer.write(b"\x37\x8a\x04\x08")

In Python 3, strings are Unicode strings by default. To work with byte strings, you have to convert explicitly using str.encode and bytes.decode.
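As a minimal sketch of that explicit conversion (the variable names are just for illustration): Latin-1 maps code points U+0000 to U+00FF onto the byte with the same value, so it is one way to get from the str in the question to the four bytes the poster wanted.

```python
# A str holds Unicode code points, not bytes.
text = "\x37\x8a\x04\x08"

# str.encode turns code points into bytes. Latin-1 maps each code point
# in the range U+0000..U+00FF to the single byte with the same value.
raw = text.encode("latin-1")

# bytes.decode goes the other way and recovers the original str.
back = raw.decode("latin-1")

print(raw)           # the bytes object
print(back == text)  # the round trip preserves the text
```

The resulting bytes object can then be passed to sys.stdout.buffer.write, which accepts bytes rather than str.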

Thank you - that helps. I will keep this in mind.

I understand why the 0a at the end is there in the output, but why is the c2 there in the output?

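Because the default encoding for the stdout filehandle (on Linux) is UTF-8. In UTF-8, every code point from U+0080 to U+07FF takes two bytes: the encoding for \x80 is <\xc2, \x80>, and the encoding for your \x8a is <\xc2, \x8a>.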
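You can check this from Python itself. A small sketch (utf8_hex is a helper name made up for this example, and the separator argument to bytes.hex needs Python 3.8 or later):

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of a string as space-separated hex bytes."""
    return ch.encode("utf-8").hex(" ")

# Code points below U+0080 encode to one byte with the same value.
print(utf8_hex("\x37"))
print(utf8_hex("\x7f"))

# From U+0080 upward, UTF-8 needs at least two bytes.
print(utf8_hex("\x80"))
print(utf8_hex("\x8a"))
```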
https://www.utf8-chartable.de/

Yes. I see that now, thanks.

So, with Python 2.7 (or maybe any 2.x), is the print() function handling this in a different way, hence the different output?

It’s not really print itself being different. It’s that python2 strs are just bytes, while they’re unicode in python3. And that the default encoding on stdout is ascii in python2 and utf-8 in python3.
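To see that the stream's encoding, not print, decides the bytes, here is a rough sketch that prints the same str to in-memory text streams with different encodings. printed_bytes is a hypothetical helper for this illustration, and the Latin-1 stream is only a stand-in that happens to reproduce the Python 2 byte output; it is not what Python 2 does internally.

```python
import io

def printed_bytes(text, encoding):
    """Return the bytes print() produces on a text stream with this encoding."""
    buffer = io.BytesIO()
    # newline="\n" keeps the line ending untranslated on every platform.
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return buffer.getvalue()

text = "\x37\x8a\x04\x08"
print(printed_bytes(text, "utf-8").hex(" "))    # six bytes, as in the python3 runs
print(printed_bytes(text, "latin-1").hex(" "))  # five bytes, as in the python2.7 run
```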

Thank you. I get it.

I was looking at the bit patterns and noticed the output change when bit 7 became set, but it seems that while my ‘bark’ may have been correct, I had the wrong ‘tree’.

The Python 2.7 example uses a byte-string. It outputs the bytes

  • \x37 = ‘7’ in ASCII
  • \x8a = undefined in ASCII; VTS (Vertical Tab Set) in Latin-1; ‘ä’ in MacRoman
  • \x04 = EOT (End of Transmission) in ASCII
  • \x08 = BACKSPACE in ASCII

followed by a newline (linefeed) generated by the print statement.

The Linux hexdump utility sees those five bytes.

The Python 3 examples use a Unicode text string. They output the Unicode characters:

  • ‘\x37’ = ‘7’
  • code point U+008A (VTS)
  • code point U+0004 (EOT)
  • code point U+0008 (BACKSPACE)

plus the newline generated by print. These have to be converted to bytes before being written to stdout.

On almost all Linux systems, the encoding used to convert Unicode characters to bytes is UTF-8.

In UTF-8, four of those five Unicode characters are encoded to a single byte with the same numeric value as the escape code, but the VTS character (U+008A) is encoded to the two-byte sequence c2 8a (in hex).

So hexdump sees six bytes, not five.
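The byte count can be tallied per character. This is only an illustrative sketch: the trailing "\n" stands in for the newline that print adds, and the names sizes and total are made up for the example.

```python
# The four characters from the question plus the newline added by print.
text = "\x37\x8a\x04\x08\n"

# Pair each code point with the number of bytes its UTF-8 encoding needs.
sizes = [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]
total = sum(count for _, count in sizes)

for code_point, count in sizes:
    print(code_point, count)
print("total bytes:", total)
```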

You can learn more about Unicode and encodings here:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Pragmatic Unicode

Thanks for posting the extremely useful links.

These go a long way to explain why I get a strange looking cli prompt when I ssh into one of my Linux boxes:
 NUC6-1  rob  ~ 