UTF-8 and ANSI encoding issue

Hi all,

I have code that writes a file with UTF-8 encoding, but when I try to open that same file later in the same script, I get an error saying it can't be read because the file is in ANSI.

Here is the code of creating the file:

with open(new_file, 'w', encoding='utf-8') as f:
    for item in items:
        f.write('%s\n' % item)
with open(new_file, 'a', encoding='utf-8') as f:
    for line in lines:
        f.write('%s\n' % line)
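(As an aside, the two open() calls above can be collapsed into a single one; a sketch with placeholder data standing in for my real items and lines:)

```python
import os
import tempfile

# Placeholder data for illustration; the real script builds these elsewhere.
items = ['Format: Layer, Start, End']
lines = ['Dialogue: 0,0:00:25.00']
new_file = os.path.join(tempfile.mkdtemp(), 'subs.txt')  # hypothetical path

# One 'w' open instead of a 'w' pass followed by an 'a' pass:
with open(new_file, 'w', encoding='utf-8') as f:
    for text in items:
        f.write(f'{text}\n')
    for text in lines:
        f.write(f'{text}\n')
```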

This is the code which is supposed to open the file created previously:

with open(new_file, 'r', encoding='utf-8') as asf:
    content = asf.readlines()

The error message:

File "C:\Users\XXX\AppData\Local\Programs\Python\Python310\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 541: invalid continuation byte

Any help will be appreciated.

I agree this will work given the code above, but not if the file being read is not the same as the file that was written.
I would double-check that the file being read contains what you expect.

You can also read the file as bytes and print the data around the point where the decode fails, using repr(), to get a clue about where the problem data came from.

Determine first whether the item writes or the line writes cause the problem by commenting out one block, then the other. Then reduce the number of writes in whichever block it turns out to be.

Content-wise, the file is exactly what I expected.

I ran this code to get the file's encoding:

import chardet

with open('file.csv', 'rb') as f:
    result = chardet.detect(f.read())
print(result['encoding'])

The result was 'windows-1255'. I replaced 'utf-8' with 'windows-1255' in the reading section and it works with no problem.

So now my question is: why is the encoding not 'utf-8', as I specified when I created the file?
I can only guess that my OS system locale is the problem. I'll try enabling "Beta: Use Unicode UTF-8 for worldwide language support".

On second thought, if I want my code to work across platforms and languages, I need to add code that checks the file's encoding after creating it and uses that encoding when reading.

The obvious conclusion from the evidence is:
You have a code path that is writing into the file and it is not using UTF-8 on that code path.

problem_offset = 541
with open('file.csv', 'rb') as f:
    data = f.read()
    print(repr(data[problem_offset - 40:problem_offset + 40]))

If you run the above code, what does the output suggest about where the non-UTF-8 data is coming from?


I’m getting this:

b'End, Style, Actor, MarginL, MarginR, MarginV, Effect, Text\r\nDialogue: 0,0:00:25.'

Now, from 'End' till 'Text' is part of the list of strings written by the first part of the file-creation code.
Then 'Dialogue: 0,0:00:25.' is part of a string coming from the second part of the code, which appends a string.

I have no idea where the \r is coming from.

The '\r\n' is just a carriage-return character plus a newline, used on Windows for line endings. It does not cause trouble for UTF-8 encoding or decoding.
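A quick demonstration that CRLF round-trips cleanly through UTF-8:

```python
# '\r\n' is two plain ASCII bytes (0x0D 0x0A); UTF-8 maps ASCII bytes to
# themselves, so CRLF line endings can never cause a UnicodeDecodeError.
raw = b'Dialogue: 0,0:00:25.\r\n'
text = raw.decode('utf-8')   # decodes without complaint
print(repr(text))
```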

Windows-1255 is, according to Wikipedia, a codepage used for writing Hebrew.

I don’t know what is going on, but I do not believe that the code for writing the file that you originally quoted is actually the code used to write the file (at least it’s not the whole story). It may be that the original file is encoded as windows-1255, which causes problems for Hebrew letters when decoding as utf-8, even though ascii characters (abc etc) don’t lead to trouble. For instance:

>>> (chr(0x05d4) + '\r\n').encode('windows-1255').decode('utf-8')  # ה\r\n
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: invalid continuation byte

But… I wouldn't necessarily trust what chardet tells you (unless you happen to know that you are indeed writing Hebrew). If the file is somehow corrupt, then I don't know how chardet behaves. The decoding error could have other causes too.

You could try writing a marker, such as a line of *, at the end of the code that writes the file, and then, when reading the file, check it by reading it as bytes and checking that the last line is the expected line of *. If it’s not, then there must be some other code somewhere that you’ve overlooked that’s also writing to the file, possibly with the wrong encoding or not being explicit about the encoding.
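A sketch of that marker idea, using a temporary file as a stand-in for the real one (the path and sample line are placeholders):

```python
import os
import tempfile

MARKER = b'*' * 40
path = os.path.join(tempfile.mkdtemp(), 'file.csv')  # stand-in for the real file

# Simulate the original writing code:
with open(path, 'w', encoding='utf-8') as f:
    f.write('Dialogue: 0,0:00:25.\n')

# Last thing the writer does: append the sentinel line.
with open(path, 'ab') as f:
    f.write(MARKER + b'\n')

# First thing the reader does: check the sentinel as raw bytes.
with open(path, 'rb') as f:
    last_line = f.read().rstrip(b'\r\n').rsplit(b'\n', 1)[-1].rstrip(b'\r')

if last_line != MARKER:
    print('Some other code wrote to this file after the marker!')
```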

There is nothing shown in this example that would cause a problem with UTF-8 or many other encodings - all the bytes are plain ASCII values that correspond to text just like what you see.

The conclusion is that this output comes from a different file with the same name as the one that is causing the problem.

Please keep in mind that when you open a file with just its name, that is a relative path - Python looks directly in the program’s “current working directory”, which is not necessarily the path that was shown in the terminal when you started the program. In particular, in a larger program, it’s easy to have something like os.chdir at some point. If you write the file in one directory and then try to read it from another, you will get whatever file is in the other directory, separate from what was written.
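One way to see this in action: resolve the bare name before and after a directory change (the chdir here just simulates one buried somewhere in a larger program):

```python
import os
import tempfile
from pathlib import Path

new_file = 'file.csv'            # a bare name is a relative path

before = Path(new_file).resolve()
os.chdir(tempfile.mkdtemp())     # stand-in for an os.chdir buried in a big program
after = Path(new_file).resolve()

print(before)
print(after)   # same name, but now a different file
```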



Thanks, everyone.
I decided to add code that checks the file's encoding and then opens the file with whatever encoding it detects.
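Roughly like this (a sketch; the list of candidate encodings is my own assumption, with UTF-8 tried first so correct files stay on the intended path):

```python
def read_text_with_fallback(path, encodings=('utf-8', 'windows-1255')):
    """Decode the WHOLE file, trying each candidate encoding in order.

    Unlike a prefix-based detector, this fails over only when the full
    decode raises, so a bad byte late in the file is still caught.
    """
    with open(path, 'rb') as f:
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f'none of {encodings} could decode {path!r}')
```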

:pray:

It seems that your files are not being written as your design dictates.
By doing this detection workaround you are potentially hiding a bug that may break your code later.

Usually detection code only looks at the start of a file, so it may not stop encoding issues from breaking your code if the problem text appears after the bytes checked by the detection logic.
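To illustrate that limitation with made-up data: a single windows-1255 byte hiding after a long ASCII run looks perfectly clean to any check that only samples the start of the file.

```python
# One windows-1255 byte (0xE4, Hebrew 'ה') after 10 000 ASCII bytes:
data = b'A' * 10_000 + '\u05d4'.encode('windows-1255')

sample = data[:1024]        # what a prefix-only detector would inspect
sample.decode('utf-8')      # pure ASCII: no hint of trouble

try:
    data.decode('utf-8')    # decoding the whole file still fails
except UnicodeDecodeError as e:
    print(e)
```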
