Unusual behavior of python3* print hex values

The Python 2.7 example uses a byte string. It outputs the bytes:

  • \x37 = ‘7’ in ASCII
  • \x8a = undefined in ASCII; VTS (Vertical Tab Set) in Latin-1; ‘ä’ in MacRoman
  • \x04 = EOT (End of Transmission) in ASCII
  • \x08 = BACKSPACE in ASCII

followed by a newline (linefeed) generated by the print statement.

The Linux hexdump utility sees those five bytes.
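The byte-level view can be reproduced in Python 3 with a bytes literal. This is only a sketch: the original example is Python 2.7, and the name `data` is made up for illustration.

```python
# A Python 3 bytes object behaves like the Python 2.7 byte string:
# each \xNN escape is exactly one byte.
data = b"\x37\x8a\x04\x08" + b"\n"  # newline added by hand, as print would

# Show the bytes the way a hex dump would list them.
hex_view = " ".join(f"{b:02x}" for b in data)
print(len(data))   # 5
print(hex_view)    # 37 8a 04 08 0a
```

In Python 3, `sys.stdout.buffer.write(data)` would send these five bytes to stdout unchanged, with no encoding step in between.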

The Python 3 examples use a Unicode text string. They output the Unicode characters:

  • code point U+0037 (‘7’)
  • code point U+008A (VTS)
  • code point U+0004 (EOT)
  • code point U+0008 (BACKSPACE)

plus the newline generated by print. These have to be converted to bytes before being written to stdout.

On almost all Linux systems, the encoding used to convert Unicode characters to bytes is UTF-8.
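If you want to check which encoding your own interpreter uses, something like the following should work. The variable names are illustrative, and the values printed depend on your locale and environment.

```python
import locale
import sys

# Encoding Python uses when writing text to stdout (may be None if
# stdout has been replaced by an object without an encoding).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding the locale prefers for text data.
preferred_encoding = locale.getpreferredencoding(False)

print(stdout_encoding, preferred_encoding)
```

On a typical Linux system both values are some spelling of UTF-8, but that is not guaranteed.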

In UTF-8, four of those five Unicode characters are encoded as a single byte with the same numeric value as the escape code. The VTS character is the exception: code points above U+007F need more than one byte, so U+008A is encoded as the two-byte sequence c2 8a (in hex).

So hexdump sees six bytes, not five.
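You can see the byte count for yourself by encoding the string explicitly. This is a small sketch; `text` and `encoded` are just names picked for the example.

```python
# The same escapes in a Python 3 str are Unicode code points, not bytes.
text = "\x37\x8a\x04\x08" + "\n"  # newline added by hand, as print would

# Encode each character separately to see how many bytes it needs.
for ch in text:
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex()}")

encoded = text.encode("utf-8")
print(len(text), len(encoded))  # 5 6
```

Only U+008A comes out as two bytes (c2 8a); the other four characters are one byte each. For comparison, `text.encode("latin-1")` gives five bytes, identical to the Python 2.7 output, because Latin-1 maps each code point up to U+00FF to the single byte with the same value.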

You can learn more about Unicode and encodings here:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Pragmatic Unicode
