How to fix utf-8 error when reading text file?

c-rob · April 2, 2024, 1:37pm

I have Python 3.12 on Windows 10.

I have a program to find a string in a 12MB file .dat file which was exported from Excel to be a tab-delimited file. However when the file is read I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in byte position 7997: invalid continuation byte

When I open the file in my text editor (Notepad++) and go to position 7997 I don’t see any special characters when I turn on “Show special characters”. The cursor is between 2 normal letters: H and S.

Here’s my function which reads the text file and searches through it.

def search_file_lookahead(filename, search_strings):
    r"""Searches a text file for occurrences of 
    multiple strings in any order using lookahead assertions.
    """

    with open(filename, 'r', encoding='utf-8') as file:
        lines = file.readlines() # ERROR is here then "lines" has no value
    
    pattern = f"(?=.*{')(?=.*'.join(search_strings)})"  # Construct a single pattern with lookaheads

    matches = []
    for line_number, line in enumerate(lines):
        if re.search(pattern, line):
            matches.append((line_number + 1, line.strip()))

    if matches:
        print(f"Found matches in {filename}:")
        for line_number, line in matches:
            print(f"- Line {line_number}: {line}")
    else:
        print(f"No matches found in {filename} for the given strings.")

I trap the error and this error causes the program to quite even though I put a “pass” in the exception clause. This also causes the variable lines to be undefined.

How do I ignore UTF-8 errors like this and read the text file to process it and use regex on it to search for strings?

Thank you. I’m stumped after searching for some web pages about this.

MegaIng · April 2, 2024, 1:45pm

Notepad++ tells you what encoding your file is in. Use that instead of trying to ignore utf-8 errors.

c-rob · April 2, 2024, 1:49pm

I did. Notepad++ reports the file is UTF-8. I still get the error above.

I updated the function to show the error if someone wants to run it.

FIXED: I added “errors=‘ignore’” to open statement.

with open(filename, 'r', encoding='utf-8', errors='ignore') as file:

barry-scott · April 2, 2024, 3:20pm

Clearly it is not utf-8, the error proves that.

onePythonUser · April 2, 2024, 3:45pm

Hi,

per your original post, you stated that you are getting a UnicodeDecodeError: utf-8. From the following link, could it be that you have it encoded in another unicode version. From the following post:

https://sites.pitt.edu/~naraehan/python3/reading_writing_methods.html

"UnicodeDecodeError" means you have a file encoding issue. Each computer has its own system-wide default encoding, and the file you are trying to open is encoded in something different, most likely some version of Unicode. If this happens, you should specify the encoding using the encoding='xxx' switch while opening the file. If you are not sure which encoding to use, try 'utf-8', 'utf-16', and 'utf-32'

jeff5 · April 2, 2024, 5:06pm

Errors are there to help you. @barry-scott is correct: the file is not encoded with UTF-8. Windows is not very good at that, by default at least: too much history of byte-encoded workarounds.

So if you care about the data, I recommend you find out what the encoding actually is. \xe9 is “é” in code page 1252, which is quite a common choice for file encoding on Windows and crops up a bit in otherwise English text.

MRAB · April 2, 2024, 5:52pm

Read the file as, say, Latin-1, and look for that character and the surroundng text.

For example:

import re

with open(filename, 'r', encoding='latin-1') as file:
    text = file.read()

print(re.findall(r'\w*[\x80-\xFF]\w*', text))

Does it look OK? Is it in a word that contains é?

barry-scott · April 2, 2024, 9:21pm

It is typical of a Windows program to save as the local 8-bit code page, cp1252 being common.

Using encoding="cp1252" is often needed when reading Windows files.

The other char that breaks utf-8 assumption is 0xa0 its the right-hand smart quote.
Its so common that a web search for “0xa0” (and only that) takes you to sites telling you how to fix your code!

kknechtel · April 2, 2024, 11:28pm

Exactly what does this mean? Walk me through the process of creating this file. The extension .dat is highly suspicious here; this is almost never used for a file that should be interpreted as containing text. Is all the data created by typing things in Excel, or is it ultimately sourced from somewhere else? When the data is “exported”, what options are chosen for export?

The position given in the error message is a byte position, while Notepad++ will presumably show you a character position (and might also translate newlines).

Let’s see if we can diagnose the file’s encoding (and verify that it is, in fact, supposed to be interpreted as text) by looking at some raw byte values around the (first) problematic position. Please show the result you get from:

import binascii

with open(filename, 'rb') as file:
    file.seek(7900)
    for i in range(16):
        data = file.read(16)
        print(*map('{:02x}'.format, data), sep=' ')

petersuter · April 3, 2024, 6:21am

To investigate file encoding problems it can also be useful to view the raw file in a hex editor. For example just open https://hexed.it/ in a browser and drag the file into it to open it locally. Enter 7997 on the right in the “Go to” field.

c-rob · April 3, 2024, 9:20am

A tab-delimited file uses tabs to separate columns in each line, instead of commas in a CSV file.

I use a macro to save the current spreadsheet tab as a tab-delimited file. But a tab-delimited file from Excel is made like this:

Open the Excel file to export.
Go to File, click More, then Export.
Click Change File Type. Choose tab-delimited .txt file.
Click Save Icon.

The position given in the error message is a byte position, while Notepad++ will presumably show you a character position (and might also translate newlines).

No, it’s a programmer’s text editor so it can take me to a line position, or byte position. I thought it had a hex view mode but I couldn’t find it. Maybe it has a plugin to show stuff in hex mode.

c-rob · April 3, 2024, 9:24am

That’s a great tool I will save, but I cannot use it for this file as the file has customer data. Can you recommend a Windows Desktop app for a hex editor? I haven’t used one in 15 years. I will need to find and probably edit certain hex values in the file then save the 12MB file.

kknechtel · April 3, 2024, 9:58am

And there’s no option in this process anywhere to select a text encoding?

How did data get input to begin with? Did the customer just give you a .xlsx file to start with or something?

Reading the file in binary mode and translating the byte values into hex digits works. It’s certainly not user-friendly, but it shows you the important part for debugging. That was the point of the code I suggested to you.

petersuter · April 3, 2024, 11:08am

This is not UTF-8 encoded here, but uses some locale, probably cp1252.

You can instead click “Save as Another File Type” there and then “Unicode Text”, that’s then tab-delimited UTF-16 I think.

I think it doesn’t upload the data anywhere, it’s a local app running in your browser only, but I understand.

“010 Editor” is excellent, but this question is probably offtopic on this forum.

@kknechtel’s advice above to use Python is also excellent of course.

MRAB · April 3, 2024, 3:42pm

I use EditPad Pro (paid), but there’s also EditPad Lite (free).