UTF-8 and ANSI encoding issue

Hi all,

I have code that writes a file with UTF-8 encoding, but when I try to open that same file later in the same script, I get an error saying it can't be read because the file is in ANSI.

Here is the code of creating the file:

with open(new_file, 'w', encoding='utf-8') as f:
    for item in items:
        f.write('%s\n' % item)
with open(new_file, 'a', encoding='utf-8') as f:
    for line in lines:
        f.write('%s\n' % line)
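(As an aside, the two open() calls above can be collapsed into a single one; a sketch with placeholder data standing in for my real items and lines:)

```python
import os
import tempfile

# Placeholder data for illustration; the real script builds these elsewhere.
items = ['Format: Layer, Start, End']
lines = ['Dialogue: 0,0:00:25.00']
new_file = os.path.join(tempfile.mkdtemp(), 'subs.txt')  # hypothetical path

# One 'w' open instead of a 'w' pass followed by an 'a' pass:
with open(new_file, 'w', encoding='utf-8') as f:
    for text in items:
        f.write(f'{text}\n')
    for text in lines:
        f.write(f'{text}\n')
```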

This is the code which is supposed to open the file created previously:

with open(new_file, 'r', encoding='utf-8') as asf:
    content = asf.readlines()

The error message:

File "C:\Users\XXX\AppData\Local\Programs\Python\Python310\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 541: invalid continuation byte

Any help will be appreciated.

I agree this will work given the code above, but not if the file being read is not the same as the file that was written.
I would double-check that the file being read contains what you expect.

You can also read the file as bytes and print the data around the point where the decode fails, using repr(), to get a clue about where the problem data came from.

Determine first whether the item writes or the line writes cause the problem by commenting out one block, then the other. Then reduce the number of writes in whichever block it turns out to be.

Content-wise, the file is exactly what I expected.

I ran this code to get the file's encoding:

import chardet

with open('file.csv', 'rb') as f:
    result = chardet.detect(f.read())
print(result['encoding'])

The result was 'windows-1255'. I replaced 'utf-8' with 'windows-1255' in the reading section and it works with no problem.

So now my question is: why is the encoding not 'utf-8', as I specified when I created the file?
I can only guess that my OS system locale is the problem. I'll try enabling "Beta: Use Unicode UTF-8 for worldwide language support".

On second thought, if I want my code to work across platforms and languages, I need to add code that checks the file's encoding after creating it and uses that encoding when reading.

The obvious conclusion from the evidence is:
You have a code path that is writing into the file and it is not using UTF-8 on that code path.

problem_offset = 541
with open('file.csv', 'rb') as f:
    data = f.read()
    print(repr(data[problem_offset - 40:problem_offset + 40]))

If you run the above code, what does the output suggest about where the non-UTF-8 data is coming from?


I’m getting this:

b'End, Style, Actor, MarginL, MarginR, MarginV, Effect, Text\r\nDialogue: 0,0:00:25.'

Now, from 'End' till 'Text' is part of the list of strings written by the first part of the file-creation code.
Then 'Dialogue: 0,0:00:25.' is part of a string coming from the second part of the code, which appends a string.

I have no idea where the \r is coming from.

The '\r\n' is just a carriage-return character plus a newline, used on Windows for line endings. It does not cause trouble for UTF-8 encoding or decoding.
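A quick demonstration that CRLF round-trips cleanly through UTF-8:

```python
# '\r\n' is two plain ASCII bytes (0x0D 0x0A); UTF-8 maps ASCII bytes to
# themselves, so CRLF line endings can never cause a UnicodeDecodeError.
raw = b'Dialogue: 0,0:00:25.\r\n'
text = raw.decode('utf-8')   # decodes without complaint
print(repr(text))
```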

Windows-1255 is, according to Wikipedia, a codepage used for writing Hebrew.

I don’t know what is going on, but I do not believe that the code for writing the file that you originally quoted is actually the code used to write the file (at least it’s not the whole story). It may be that the original file is encoded as windows-1255, which causes problems for Hebrew letters when decoding as utf-8, even though ascii characters (abc etc) don’t lead to trouble. For instance:

>>> (chr(0x05d4) + '\r\n').encode('windows-1255').decode('utf-8')  # ה\r\n
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: invalid continuation byte

But… I wouldn't necessarily trust what chardet tells you (unless you happen to know that you are indeed writing Hebrew). If the file is somehow corrupt, then I don't know how chardet behaves. The decoding error could have other causes too.

You could try writing a marker, such as a line of *, at the end of the code that writes the file, and then, when reading the file, check it by reading it as bytes and checking that the last line is the expected line of *. If it’s not, then there must be some other code somewhere that you’ve overlooked that’s also writing to the file, possibly with the wrong encoding or not being explicit about the encoding.
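A sketch of that marker idea, using a temporary file as a stand-in for the real one (the path and sample line are placeholders):

```python
import os
import tempfile

MARKER = b'*' * 40
path = os.path.join(tempfile.mkdtemp(), 'file.csv')  # stand-in for the real file

# Simulate the original writing code:
with open(path, 'w', encoding='utf-8') as f:
    f.write('Dialogue: 0,0:00:25.\n')

# Last thing the writer does: append the sentinel line.
with open(path, 'ab') as f:
    f.write(MARKER + b'\n')

# First thing the reader does: check the sentinel as raw bytes.
with open(path, 'rb') as f:
    last_line = f.read().rstrip(b'\r\n').rsplit(b'\n', 1)[-1].rstrip(b'\r')

if last_line != MARKER:
    print('Some other code wrote to this file after the marker!')
```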

There is nothing shown in this example that would cause a problem with UTF-8 or many other encodings - all the bytes are plain ASCII values that correspond to text just like what you see.

The conclusion is that this output comes from a different file with the same name as the one that is causing the problem.

Please keep in mind that when you open a file with just its name, that is a relative path - Python looks directly in the program’s “current working directory”, which is not necessarily the path that was shown in the terminal when you started the program. In particular, in a larger program, it’s easy to have something like os.chdir at some point. If you write the file in one directory and then try to read it from another, you will get whatever file is in the other directory, separate from what was written.
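One way to see this in action: resolve the bare name before and after a directory change (the chdir here just simulates one buried somewhere in a larger program):

```python
import os
import tempfile
from pathlib import Path

new_file = 'file.csv'            # a bare name is a relative path

before = Path(new_file).resolve()
os.chdir(tempfile.mkdtemp())     # stand-in for an os.chdir buried in a big program
after = Path(new_file).resolve()

print(before)
print(after)   # same name, but now a different file
```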



Thanks, everyone.
I decided to add code that checks the file's encoding and then opens the file with whatever encoding it detects.
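Roughly like this (a sketch; the list of candidate encodings is my own assumption, with UTF-8 tried first so correct files stay on the intended path):

```python
def read_text_with_fallback(path, encodings=('utf-8', 'windows-1255')):
    """Decode the WHOLE file, trying each candidate encoding in order.

    Unlike a prefix-based detector, this fails over only when the full
    decode raises, so a bad byte late in the file is still caught.
    """
    with open(path, 'rb') as f:
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f'none of {encodings} could decode {path!r}')
```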

:pray:

It seems that your files are not being written as your design dictates.
By doing this detection workaround you are potentially hiding a bug that may break your code later.

Usually detection code only looks at the start of a file, so it may not stop encoding issues from breaking your code if the problem text appears after the bytes checked by the detection logic.
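To illustrate that limitation with made-up data: a single windows-1255 byte hiding after a long ASCII run looks perfectly clean to any check that only samples the start of the file.

```python
# One windows-1255 byte (0xE4, Hebrew 'ה') after 10 000 ASCII bytes:
data = b'A' * 10_000 + '\u05d4'.encode('windows-1255')

sample = data[:1024]        # what a prefix-only detector would inspect
sample.decode('utf-8')      # pure ASCII: no hint of trouble

try:
    data.decode('utf-8')    # decoding the whole file still fails
except UnicodeDecodeError as e:
    print(e)
```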
