Code style hints, first:
- Please do not use input as a variable name. That causes shadowing, meaning that the built-in function input is no longer available (the name input can only mean one thing at a time).
- As Franklin suggested, consider using list comprehensions, generator expressions etc. to iterate and collect data; it’s much simpler and more direct. The code could be as simple as:
data = 'Україна'
table = [item.encode('utf-8').hex() for item in data]
output = ''.join(chr(int(item, 16)) for item in table)
print(output)
The line that builds table means that each character in the input string will be converted into bytes using the UTF-8 encoding, and then a hex string representing those byte values will be created. Each such string is added to the list.
The line that builds output means that each of the hex strings will be converted into a single integer, and then the Unicode code point with that number will be looked up.
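To see where the two steps stop matching, it may help to print the intermediate values. This is only a small sketch using the same example string; the name first is just for illustration.

```python
data = 'Україна'

# Each character becomes the hex dump of its UTF-8 bytes.
table = [item.encode('utf-8').hex() for item in data]
print(table)  # ['d0a3', 'd0ba', 'd180', 'd0b0', 'd197', 'd0bd', 'd0b0']

# Reading a whole hex dump as one integer does not give back the code point.
first = int(table[0], 16)
print(first, ord(data[0]))  # 53411 1059

# So chr() looks up different characters and the round trip fails.
output = ''.join(chr(int(item, 16)) for item in table)
print(output == data)  # False
```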
This does not give back the original string, because UTF-8 does not simply store the bytes of an integer representation of each element of the string. It uses a variable number of bytes for each element, and sets some “flag” bits as a way of signalling, in-band, how many bytes to use.
For example, 'У' contains a single element with Unicode code point 1059 (0x423). Stored as a 2-byte integer, that would require the bytes 0x23 0x04 in little-endian, or 0x04 0x23 in big-endian. UTF-8 is conceptually “big-endian”, but it also sets some flag bits, in such a way that the encoding is instead 0xd0 0xa3, which read as a single integer is 53411.
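Those numbers can be checked from Python itself. The following is a quick sketch using int.to_bytes and int.from_bytes; the variable names are only illustrative.

```python
code_point = ord('У')
print(code_point, hex(code_point))  # 1059 0x423

# The code point stored as a plain 2-byte integer, in both byte orders.
little = code_point.to_bytes(2, 'little')
big = code_point.to_bytes(2, 'big')

# The UTF-8 encoding of the same character.
utf8 = 'У'.encode('utf-8')
print(little.hex(), big.hex(), utf8.hex())  # 2304 0423 d0a3

# The UTF-8 bytes read back as one big-endian integer.
as_integer = int.from_bytes(utf8, 'big')
print(as_integer)  # 53411
```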
Two-byte UTF-8 sequences use eleven of their sixteen bits as actual information-carrying bits. The top three bits of the first byte are fixed at 110 to mean “this is the first byte of a 2-byte UTF-8 sequence”, and the top two bits of the second byte are fixed at 10 to mean “this is part of a multi-byte UTF-8 sequence (not the first byte)”. (This is a bit redundant, but encoding this way means that it’s easy to detect corruption when a code point gets sliced in half.)
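As a rough illustration of that bit layout, here is a hand-rolled encoder for the 2-byte case only. The function encode_two_byte is a hypothetical helper written for this example (in real code, str.encode does the job), and it is compared against the built-in encoder.

```python
def encode_two_byte(code_point):
    """Encode a code point from the 2-byte UTF-8 range by hand (illustration only)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError('code point is outside the 2-byte UTF-8 range')
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: top 5 payload bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low 6 payload bits
    return bytes([first, second])

encoded = encode_two_byte(ord('У'))
print(encoded.hex())                   # d0a3
print(encoded == 'У'.encode('utf-8'))  # True

# A sequence that has been sliced in half is rejected by the decoder.
try:
    encoded[:1].decode('utf-8')
    sliced_ok = True
except UnicodeDecodeError:
    sliced_ok = False
print(sliced_ok)  # False
```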
To undo the encoding, we should instead get the corresponding bytes from each hex dump (rather than a single integer), and decode them:
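data = 'Україна'
table = []
for item in data:
    # Hex dump of the UTF-8 bytes of each character.
    table.append(item.encode('utf-8').hex())
output = ''
for item in table:
    # Turn the hex dump back into bytes, then decode those bytes.
    output += bytes.fromhex(item).decode('utf-8')
print(output)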
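The same round trip can also be written with a comprehension and a generator expression, in line with the style hint at the top. This is just one possible way to write it; the last part additionally shows that the whole string can be dumped and restored in one go, with dump being a name made up for this sketch.

```python
data = 'Україна'

# Per-character hex dumps, then bytes.fromhex() + decode() to undo them.
table = [item.encode('utf-8').hex() for item in data]
output = ''.join(bytes.fromhex(item).decode('utf-8') for item in table)
print(output == data)  # True

# Encoding the whole string at once also round-trips.
dump = data.encode('utf-8').hex()
print(dump)  # d0a3d0bad180d0b0d197d0bdd0b0
restored = bytes.fromhex(dump).decode('utf-8')
print(restored == data)  # True
```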
To get hex dumps of the actual Unicode code point values, we should use ord as the opposite of chr (which gives an integer rather than bytes), and convert the integer to a hex dump using string formatting:
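data = 'Україна'
table = []
for item in data:
    # Hex digits of the code point itself, not of its UTF-8 bytes.
    table.append(f'{ord(item):x}')
output = ''
for item in table:
    # int() and chr() undo the formatting and ord() respectively.
    output += chr(int(item, 16))
print(output)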
Note that this approach will use whatever number of hexadecimal digits is needed to represent the values (even, as here, if that’s an odd number).
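If a fixed width is preferred, the format spec can ask for zero padding. The sketch below assumes four digits are wanted, which is my choice for the example rather than anything required; 04x is only a minimum width, so code points above U+FFFF would still come out with more digits.

```python
data = 'Україна'

# Unpadded: as many hex digits as each code point needs (three here).
table = [f'{ord(item):x}' for item in data]
print(table)  # ['423', '43a', '440', '430', '457', '43d', '430']

# Zero-padded to at least four hex digits.
padded = [f'{ord(item):04x}' for item in data]
print(padded)  # ['0423', '043a', '0440', '0430', '0457', '043d', '0430']

# int() ignores the leading zeros, so the round trip still works.
output = ''.join(chr(int(item, 16)) for item in padded)
print(output == data)  # True
```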