I'm new to Python but have experience with other languages.
I discovered some unusual behavior of python3* on my Kali Linux. All I wanted was to print four values given in hex:
This is OK and works as it should:
Python 2.7.18 (default, Mar 28 2022, 20:47:09) [GCC 11.2.0] on linux2
$ python2.7 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 8a 04 08 0a |7…|
This is my problem:
Python 3.10.4 (main, Mar 24 2022, 13:07:27) [GCC 11.2.0] on linux
$ python3.10 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 c2 8a 04 08 0a
-------------------^^ ??
Python 3.9.12 (main, Mar 24 2022, 13:02:21) [GCC 11.2.0] on linux
$ python3.9 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 c2 8a 04 08 0a |7…|
-------------------^^ ??
Did I do something wrong? Any suggestions?
Thanks to the community for the help.
In Python 3, strings are Unicode strings by default. To work with byte strings, you have to convert explicitly using str.encode and bytes.decode.
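As a rough illustration of that conversion, here is a minimal Python 3 sketch using the string from the question. The variable names are only for illustration; str.encode, bytes.decode and bytes.hex (with a separator, Python 3.8+) are standard library calls.

```python
# The text string from the question: four Unicode code points.
text = "\x37\x8a\x04\x08"

# Encoding turns code points into bytes; the result depends on the codec.
utf8_bytes = text.encode("utf-8")      # U+008A becomes two bytes: c2 8a
latin1_bytes = text.encode("latin-1")  # U+008A becomes one byte: 8a

print(utf8_bytes.hex(" "))    # 37 c2 8a 04 08
print(latin1_bytes.hex(" "))  # 37 8a 04 08

# A bytes literal holds exactly the bytes written in the source.
raw = b"\x37\x8a\x04\x08"

# Decoding goes the other way: bytes back to a text string.
print(raw.decode("latin-1") == text)  # True
```

Latin-1 maps each code point below 256 to the byte with the same value, which is why it reproduces the four bytes seen in the Python 2 output.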
It's not really print itself being different. It's that Python 2 strs are just bytes, while they're Unicode in Python 3. Python 2 writes those bytes to stdout unchanged, whereas Python 3 has to encode the text first, and on most Linux systems it uses UTF-8 for that.
I was looking at the bit patterns and noticed the output change when bit 7 became set, but it seems that while my ‘bark’ may have been correct, I had the wrong ‘tree’.
The Python 2.7 example uses a byte-string. It outputs the bytes:
\x37 = ‘7’ in ASCII
\x8a = undefined in ASCII; VTS (Vertical Tab Set) in Latin-1; ‘ä’ in MacRoman
\x04 = EOT (End of Transmission) in ASCII
\x08 = BACKSPACE in ASCII
followed by a newline (linefeed) generated by the print statement.
The Linux hexdump utility sees those five bytes.
The Python 3 examples use a Unicode text string. They output the Unicode characters:
‘\x37’ = ‘7’
code point U+008A (VTS)
code point U+0004 (EOT)
code point U+0008 (BACKSPACE)
plus the newline generated by print. These have to be converted to bytes before being written to stdout.
On almost all Linux systems, the encoding used to convert Unicode characters to bytes is UTF-8.
In UTF-8, all of those five Unicode characters except the VTS are encoded to a single byte with the same numeric value as the escape code, but the VTS character (U+008A) is encoded to the two-byte sequence c2 8a (in hex).
So hexdump sees six bytes, not five.
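To check the byte counts without a pipe, the sketch below imitates what print does when stdout has a given encoding. The helper name bytes_written_by_print is made up for this example; it wraps an in-memory buffer in io.TextIOWrapper, which is the same kind of object Python 3 uses for sys.stdout.

```python
import io

def bytes_written_by_print(text, encoding):
    """Return the bytes print() would emit on a stream with this encoding."""
    raw = io.BytesIO()
    # newline="\n" keeps the line ending as a single linefeed byte.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()

text = "\x37\x8a\x04\x08"

# UTF-8 stdout, as on most Linux systems: six bytes.
print(bytes_written_by_print(text, "utf-8").hex(" "))    # 37 c2 8a 04 08 0a

# Latin-1 stdout: five bytes, matching the Python 2 output.
print(bytes_written_by_print(text, "latin-1").hex(" "))  # 37 8a 04 08 0a
```

If the goal is simply to emit those exact bytes from Python 3, one option is to bypass the text layer and write a bytes object to the underlying binary stream:
$ python3 -c 'import sys; sys.stdout.buffer.write(b"\x37\x8a\x04\x08\n")' | hexdump -C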
You can learn more about Unicode and encodings here: