How to convert character to hexadecimal and back?

I tried to write code that converts a string to hexadecimal and back again, as a check, as follows:

input = 'Україна'
table = []
for item in input:
  table.append(item.encode('utf-8').hex())
output = ''
for item in table:
  output += chr(int(item, 16))
print(output)

To my surprise, the output is totally different from the input. What have I done wrong?

The output was: 킣킺톀킰톗킽킰

UTF-8 is one particular encoding. If you want a reversible transformation, the chr function’s counterpart is ord.

I tried to write code that converts a string to hexadecimal and back
again, as a check, as follows:

input = 'Україна'
table = []
for item in input:
 table.append(item.encode('utf-8').hex())

At this point you have a list of hexadecimal strings, one per character
(well, technically Unicode codepoints). Example:

 >>> input = 'Україна'
 >>> hexcodes = [ item.encode('utf-8').hex() for item in input ]
 >>> hexcodes
 ['d0a3', 'd0ba', 'd180', 'd0b0', 'd197', 'd0bd', 'd0b0']

However, UTF-8 is a variable-width multibyte encoding. Its value is that
for the first 128 codes (the ASCII range) the byte encoding is the same
at 1 byte per code. (This made plain ASCII files automatically UTF-8
compatible and made a lot of western European text compactly
represented. The flip side is that later values have a longer encoding.)
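To make the variable width concrete, here is a small sketch that counts the UTF-8 bytes for a few characters. The sample characters and the names samples and widths are my own choices for illustration, not part of the original code.

```python
# Sample characters chosen for illustration: ASCII, Latin-1, Cyrillic,
# the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ['A', 'é', 'У', '€', '😀']

# Map each character to the number of bytes in its UTF-8 encoding.
widths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in widths.items():
    encoded = ch.encode('utf-8').hex()
    print(f'{ch!r}: U+{ord(ch):04X} -> {encoded} ({n} byte(s))')
```

The ASCII letter takes one byte, while the Cyrillic letter from the original string takes two, which is why each entry in hexcodes has four hex digits.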

Turning each hex string into a single integer and calling chr() on it
does not reverse this encoding. The high-order bits of each byte indicate
the length of the encoded sequence, and do not themselves contribute to
the code value itself.

To undo this:

  • undo your hex() into bytes
  • decode the bytes as UTF-8: bs.decode('utf-8') if your bytes value
    is in a variable named bs
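The two steps above can be sketched as follows. This version encodes the whole string at once rather than character by character; the names data, hex_dump, raw and restored are mine, chosen for illustration.

```python
data = 'Україна'

# Forward: str -> UTF-8 bytes -> hexadecimal text.
hex_dump = data.encode('utf-8').hex()

# Backward, step 1: undo hex() by turning the hex text back into bytes.
raw = bytes.fromhex(hex_dump)

# Backward, step 2: decode those bytes as UTF-8 to get the str back.
restored = raw.decode('utf-8')

print(hex_dump)
print(restored)
```

Working on the whole string avoids having to know where one character's bytes end and the next begin, since the UTF-8 decoder works that out itself.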

Cheers,
Cameron Simpson cs@cskk.id.au

This is a separate suggestion.

In Python a str, like the value of output, is immutable. So when we do output += chr(int(item, 16)), a new str is created on each iteration of the for loop and the previous one is discarded. We only really need the final string to be created.

You could use str.join like

output = ''.join(chr(int(item, 16)) for item in table)

The use of list comprehension can also be applied to the creation of table, as

table = [item.encode('utf-8').hex() for item in input]

Code style hints, first:

  1. Please do not use input as a variable name - that causes shadowing, meaning that the built-in function input is no longer available (the name input can only mean one thing at a time).

  2. As Franklin suggested, consider using list comprehensions, generator expressions etc. to iterate and collect data - it’s much simpler and more direct. The code could be as simple as:

data = 'Україна'
table = [item.encode('utf-8').hex() for item in data]
output = ''.join(chr(int(item, 16)) for item in table)
print(output)

The line table = [item.encode('utf-8').hex() for item in data] means that each character in the input string will be converted into bytes using the UTF-8 encoding, and then a string representing those byte values in hexadecimal will be created. Each such string is added to the list.

The line output = ''.join(chr(int(item, 16)) for item in table) means that each of the hex strings will be converted into a single integer, and then the corresponding Unicode code point will be looked up.

This does not give the same result because UTF-8 encoding does not convert characters into the bytes used for an integer representation of that element of the string. It uses a variable number of bytes for each element, and sets some “flag” bits as a way of signalling, in-band, how many bytes to use.

For example, 'У' contains a single element with Unicode code point 1059. Stored as a 2-byte integer, that would require the bytes 0x23 0x04 in little-endian, or 0x04 0x23 in big-endian. UTF-8 is conceptually “big-endian”, but it also sets some flag bits, in such a way that the encoding is instead 0xd0 0xa3 - as an integer, 53411.
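The numbers in that example can be checked with a short sketch. The variable names here are my own, chosen for illustration.

```python
ch = 'У'

code_point = ord(ch)                        # Unicode code point as an int
little = code_point.to_bytes(2, 'little')   # plain 2-byte integer, little-endian
big = code_point.to_bytes(2, 'big')         # plain 2-byte integer, big-endian
utf8 = ch.encode('utf-8')                   # UTF-8 encoding, flag bits included

# Reading the UTF-8 bytes as one big-endian integer is what
# int(item, 16) did in the original code.
as_int = int.from_bytes(utf8, 'big')

print(code_point, little.hex(), big.hex(), utf8.hex(), as_int)
print(repr(chr(as_int)))  # a different character from 'У'
```

The last line prints the first character of the unexpected output in the original question, which shows where the Hangul text came from.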

Two-byte UTF-8 sequences use eleven bits as actual information-carrying bits. The top three bits of the first byte (110) mean “this is the first byte of a 2-byte UTF-8 sequence”, and the top two bits of the second byte (10) mean “this is part of a multi-byte UTF-8 sequence (not the first byte)”. (This is a bit redundant, but encoding this way means that it’s easy to detect corruption when a code point gets sliced in half.)
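As an illustration of that bit layout, here is a minimal sketch that decodes a 2-byte sequence by hand. The function decode_two_byte_utf8 is hypothetical, written only for this explanation; it does not reject overlong encodings, and in real code bytes.decode('utf-8') is the right tool.

```python
def decode_two_byte_utf8(pair: bytes) -> str:
    """Decode exactly one 2-byte UTF-8 sequence by masking off the flag bits."""
    if len(pair) != 2:
        raise ValueError('expected exactly two bytes')
    first, second = pair  # iterating over bytes yields ints

    # The first byte must look like 110xxxxx.
    if (first & 0b1110_0000) != 0b1100_0000:
        raise ValueError('not the first byte of a 2-byte sequence')
    # The second byte must look like 10xxxxxx.
    if (second & 0b1100_0000) != 0b1000_0000:
        raise ValueError('not a continuation byte')

    # Keep 5 payload bits from the first byte and 6 from the second.
    code_point = ((first & 0b0001_1111) << 6) | (second & 0b0011_1111)
    return chr(code_point)


print(decode_two_byte_utf8(bytes.fromhex('d0a3')))
```

For 0xd0 0xa3 the payload bits are 10000 and 100011, which combine to 0x423, the code point of 'У'.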

To undo the encoding, we should instead get the corresponding bytes from the hex dump (rather than a single integer), and decode it:

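data = 'Україна'  # renamed from input to avoid shadowing the built-in
table = []
for item in data:
  table.append(item.encode('utf-8').hex())
output = ''
for item in table:
  # hex text -> bytes -> str, the exact reverse of encode().hex()
  output += bytes.fromhex(item).decode('utf-8')
print(output)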

To get hex dumps of the actual Unicode code point values, we should use ord as the opposite of chr (which gives an integer rather than bytes), and convert the integer to a hex dump using string formatting:

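data = 'Україна'  # renamed from input to avoid shadowing the built-in
table = []
for item in data:
  # ord gives the code point as an integer; :x formats it as hex
  table.append(f'{ord(item):x}')
output = ''
for item in table:
  output += chr(int(item, 16))
print(output)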

Note that this approach will use whatever number of hexadecimal digits is needed to represent the values (even, as here, if that’s an odd number).
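If a fixed width is preferred, the format specification can zero-pad the value. The sketch below assumes six hex digits, which is enough for the largest Unicode code point (0x10FFFF); the sample string and variable names are mine, chosen for illustration.

```python
data = 'Aї😀'

# Variable width: as many hex digits as each code point needs.
variable = [f'{ord(ch):x}' for ch in data]

# Fixed width: zero-padded to six hex digits.
padded = [f'{ord(ch):06x}' for ch in data]

# Leading zeros do not affect int(), so the round trip still works.
restored = ''.join(chr(int(h, 16)) for h in padded)

print(variable)
print(padded)
print(restored)
```

A fixed width is mainly useful if the hex strings are later concatenated, because the boundaries between characters can then be recovered by position.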

Could you please give me your code as an example of how I can reverse it? I understand code much better.
Thank you very much; I’m not an expert.