I’m new to Python but have experience with other languages.
I discovered an unusual behavior of python3* on my Kali Linux. All I wanted was to print an output of 4 values given in hex:
This is OK and works as it should:
Python 2.7.18 (default, Mar 28 2022, 20:47:09) [GCC 11.2.0] on linux2
$ python2.7 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 8a 04 08 0a |7…|
This is my problem:
Python 3.10.4 (main, Mar 24 2022, 13:07:27) [GCC 11.2.0] on linux
$ python3.10 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 c2 8a 04 08 0a
Python 3.9.12 (main, Mar 24 2022, 13:02:21) [GCC 11.2.0] on linux
$ python3.9 -c 'print ("\x37\x8a\x04\x08");' | hexdump -C
00000000 37 c2 8a 04 08 0a |7…|
Did I do something wrong? Any suggestions?
Thanks to the community for help.
According to this Stackoverflow post by Benjamin Peterson, you can use
In Python 3, strings are Unicode strings by default, and in order to use byte-strings, you have to convert things explicitly using
Thank you - that helps. I will keep this in mind.
I understand why the 0a at the end is there in the output, but why is the c2 there in the output?
Because the default encoding for the stdout filehandle (on Linux) is UTF-8. In UTF-8, the encoding for \x80 is <\xc2, \x80>, and in the same way the encoding for your \x8a is <\xc2, \x8a>.
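To see where the extra byte comes from, here is a minimal sketch that encodes each character of the original string on its own. The names `text`, `encoded` and `utf8_bytes` are just illustrative; only `str.encode` and `bytes.hex` from the standard library are used (the separator argument to `hex` needs Python 3.8 or later).

```python
# The same four characters as in the question, as a Python 3 text string.
text = "\x37\x8a\x04\x08"

# Encode every character separately to see which ones need more than one byte.
encoded = {f"U+{ord(ch):04X}": ch.encode("utf-8") for ch in text}
for code_point, raw in encoded.items():
    print(code_point, "->", raw.hex(" "))

# The whole string encoded at once: this is what print() hands to stdout
# (before the trailing newline) when the stream encoding is UTF-8.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))
```

Only U+008A is above the ASCII range (0x7F), so it is the only character that becomes two bytes.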
Yes. I see that now, thanks.
So, with Python 2.7 (or maybe any 2.x), is the print() function handling this in a different way, hence the different output?
It’s not really print itself being different. It’s that python2 strs are just bytes, while they’re unicode in python3. And that the default encoding on stdout is ascii in python2 and utf-8 in python3.
Thank you. I get it.
I was looking at the bit patterns and noticed the output change when bit 7 became set, but it seems that while my ‘bark’ may have been correct, I had the wrong ‘tree’.
The Python 2.7 example uses a byte-string. It outputs the bytes
- \x37 = ‘7’ in ASCII
- \x8a = undefined in ASCII; VTS (Vertical Tab Set) in Latin-1; ‘ä’ in MacRoman
- \x04 = EOT (End of Transmission) in ASCII
- \x08 = BACKSPACE in ASCII
followed by a newline (linefeed) generated by the print statement.
The Linux hexdump utility sees those five bytes.
The Python 3 examples use a Unicode text string. It outputs the Unicode characters:
- ‘\x37’ = ‘7’
- code point U+008A (VTS)
- code point U+0004 (EOT)
- code point U+0008 (BACKSPACE)
plus the newline generated by print. These have to be converted to bytes before being written to stdout.
On almost all Linux systems, the encoding used to convert Unicode characters to bytes is UTF-8.
In UTF-8, all of those five Unicode characters except the VTS are encoded to a single byte with the same numeric value as the escape code, but the VTS character is encoded to the two-byte sequence c2 8a (in hex).
So hexdump sees six bytes, not five.
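If the goal is to get exactly the original four bytes out of Python 3, one possible approach is to skip text encoding altogether and write a bytes object to a binary stream. The sketch below is only an illustration: `write_raw` and `payload` are hypothetical names, and it assumes `sys.stdout` is the usual text wrapper that exposes a `.buffer` attribute (which may not hold if stdout has been replaced).

```python
import io
import sys

# A bytes literal: these are raw byte values, no text encoding is involved.
payload = b"\x37\x8a\x04\x08"


def write_raw(data, stream=None):
    """Write bytes to a binary stream; defaults to stdout's underlying buffer."""
    out = stream if stream is not None else sys.stdout.buffer
    written = out.write(data)
    out.flush()
    return written


# Demonstrate against an in-memory binary stream instead of the terminal.
buf = io.BytesIO()
write_raw(payload, buf)
print(buf.getvalue().hex(" "))

# Alternative for an existing text string: Latin-1 maps code points
# U+0000..U+00FF to the single byte with the same value.
latin1_bytes = "\x37\x8a\x04\x08".encode("latin-1")
print(latin1_bytes == payload)
```

Calling `write_raw(payload + b"\n")` with no stream argument and piping the script into `hexdump -C` should then show the same five bytes as the Python 2.7 example.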
You can learn more about Unicode and encodings here:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Thanks for posting the extremely useful links.
These go a long way to explain why I get a strange looking cli prompt when I ssh into one of my Linux boxes:
NUC6-1 rob ~