Encoding Errors

Harmon · March 1, 2021, 3:14pm

I'm having a problem getting around an encoding error on Python 3.6:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 14-15: unexpected end of data
Any suggestions?
Thanks

jeanas · March 1, 2021, 7:23pm

That means that the data cannot be decoded. Either the file is corrupted, or it is being read using the wrong encoding. UTF-16 is not so common as far as I know; are you overriding the encoding manually? Have you tried opening the file or stream with explicit encoding, e.g., open(..., encoding="utf8")?
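As a rough illustration of opening a file with an explicit encoding, here is a minimal sketch. The sample text and the temporary file are made up so the snippet runs on its own; in practice you would pass your real path, and the encoding that the file was actually written in (which may not be UTF-8).

```python
import os
import tempfile

# Hypothetical sample text containing Turkish characters outside ASCII.
sample = "Türkçe metin: ığüşöç"

# Write a temporary UTF-8 file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write(sample)
    path = f.name

try:
    # Name the encoding explicitly instead of relying on the platform default.
    with open(path, encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print(text == sample)
```

If the encoding passed to open() does not match the bytes on disk, the read either raises UnicodeDecodeError or silently produces wrong characters, so it is worth confirming how the file was produced.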

Harmon · March 1, 2021, 8:05pm

Thank you for responding. I am using the following statement, and several variations of it:
    if type(content) == str:
        content = content.encode('utf-8')
This works with other Turkish files I have been working with.

jeanas · March 1, 2021, 8:23pm

So, how are you reading the files? Could you post a code example that fails along with the complete error traceback?

Harmon · March 2, 2021, 2:05pm

Thanks - This is a work script and a lot of it is packaged so I don’t have access to a lot of the code. I will pursue this with our engineering team.
Thank you so much for taking the time to help me.

pepoluan · March 2, 2021, 7:19pm

The “unexpected end of data” message suggests that the bytes in positions 14-15 are the first half (the high surrogate) of a “surrogate pair”, so the bytes in positions 16-17 should hold the second half (the low surrogate), but the data ends before them.

Check whether the bytes in positions 14-15, read as a little-endian 16-bit value, lie between D800 and DBFF. If so, you’ll need to grab the next two bytes to complete the character.
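The check on positions 14-15 could be sketched like this. The helper name is_high_surrogate_le and the sample data are invented for illustration; the real bytes would come from whatever your script is trying to decode. The last part assumes the data arrives in chunks, which may or may not be what happens in your case.

```python
import codecs
import struct


def is_high_surrogate_le(data: bytes, offset: int) -> bool:
    """Return True if the UTF-16-LE code unit at `offset` is a high surrogate."""
    (unit,) = struct.unpack_from("<H", data, offset)  # little-endian 16-bit value
    return 0xD800 <= unit <= 0xDBFF


# Hypothetical data: seven BMP characters (14 bytes) followed by U+1F600,
# which UTF-16 encodes as a surrogate pair (4 bytes).
full = ("abcdefg" + "\U0001F600").encode("utf-16-le")
truncated = full[:16]  # cut off right after the high surrogate

looks_truncated = is_high_surrogate_le(truncated, 14)

try:
    truncated.decode("utf-16-le")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# If the data arrives in chunks, an incremental decoder keeps the dangling
# high surrogate buffered until the next two bytes show up.
decoder = codecs.getincrementaldecoder("utf-16-le")()
text = decoder.decode(full[:16]) + decoder.decode(full[16:], final=True)
```

If the check comes back True, the likely cause is that the data was cut or split at a fixed byte count somewhere upstream, rather than that the file itself is in the wrong encoding.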