I have a program to find a string in a 12MB file .dat file which was exported from Excel to be a tab-delimited file. However when the file is read I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in byte position 7997: invalid continuation byte
When I open the file in my text editor (Notepad++) and go to position 7997 I don’t see any special characters when I turn on “Show special characters”. The cursor is between 2 normal letters: H and S.
Here’s my function which reads the text file and searches through it.
def search_file_lookahead(filename, search_strings):
r"""Searches a text file for occurrences of
multiple strings in any order using lookahead assertions.
"""
with open(filename, 'r', encoding='utf-8') as file:
lines = file.readlines() # ERROR is here then "lines" has no value
pattern = f"(?=.*{')(?=.*'.join(search_strings)})" # Construct a single pattern with lookaheads
matches = []
for line_number, line in enumerate(lines):
if re.search(pattern, line):
matches.append((line_number + 1, line.strip()))
if matches:
print(f"Found matches in {filename}:")
for line_number, line in matches:
print(f"- Line {line_number}: {line}")
else:
print(f"No matches found in {filename} for the given strings.")
I trap the error and this error causes the program to quite even though I put a “pass” in the exception clause. This also causes the variable lines to be undefined.
How do I ignore UTF-8 errors like this and read the text file to process it and use regex on it to search for strings?
Thank you. I’m stumped after searching for some web pages about this.
per your original post, you stated that you are getting a UnicodeDecodeError: utf-8. From the following link, could it be that you have it encoded in another unicode version. From the following post:
"UnicodeDecodeError" means you have a file encoding issue. Each computer has its own system-wide default encoding, and the file you are trying to open is encoded in something different, most likely some version of Unicode. If this happens, you should specify the encoding using the encoding='xxx' switch while opening the file. If you are not sure which encoding to use, try 'utf-8', 'utf-16', and 'utf-32'
Errors are there to help you. @barry-scott is correct: the file is not encoded with UTF-8. Windows is not very good at that, by default at least: too much history of byte-encoded workarounds.
So if you care about the data, I recommend you find out what the encoding actually is. \xe9 is “é” in code page 1252, which is quite a common choice for file encoding on Windows and crops up a bit in otherwise English text.
It is typical of a Windows program to save as the local 8-bit code page, cp1252 being common.
Using encoding="cp1252" is often needed when reading Windows files.
The other char that breaks utf-8 assumption is 0xa0 its the right-hand smart quote.
Its so common that a web search for “0xa0” (and only that) takes you to sites telling you how to fix your code!
Exactly what does this mean? Walk me through the process of creating this file. The extension .dat is highly suspicious here; this is almost never used for a file that should be interpreted as containing text. Is all the data created by typing things in Excel, or is it ultimately sourced from somewhere else? When the data is “exported”, what options are chosen for export?
The position given in the error message is a byte position, while Notepad++ will presumably show you a character position (and might also translate newlines).
Let’s see if we can diagnose the file’s encoding (and verify that it is, in fact, supposed to be interpreted as text) by looking at some raw byte values around the (first) problematic position. Please show the result you get from:
import binascii
with open(filename, 'rb') as file:
file.seek(7900)
for i in range(16):
data = file.read(16)
print(*map('{:02x}'.format, data), sep=' ')
To investigate file encoding problems it can also be useful to view the raw file in a hex editor. For example just open https://hexed.it/ in a browser and drag the file into it to open it locally. Enter 7997 on the right in the “Go to” field.
The position given in the error message is a byte position, while Notepad++ will presumably show you a character position (and might also translate newlines).
No, it’s a programmer’s text editor so it can take me to a line position, or byte position. I thought it had a hex view mode but I couldn’t find it. Maybe it has a plugin to show stuff in hex mode.
That’s a great tool I will save, but I cannot use it for this file as the file has customer data. Can you recommend a Windows Desktop app for a hex editor? I haven’t used one in 15 years. I will need to find and probably edit certain hex values in the file then save the 12MB file.
And there’s no option in this process anywhere to select a text encoding?
How did data get input to begin with? Did the customer just give you a .xlsx file to start with or something?
Reading the file in binary mode and translating the byte values into hex digits works. It’s certainly not user-friendly, but it shows you the important part for debugging. That was the point of the code I suggested to you.