UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Hi team,
I have this function in the script

def read_file(file):
    GZIP_MAGIC_NUMBER = "1f8b"
    f = open(file)
    if f.read(2).encode("hex") == GZIP_MAGIC_NUMBER:
        f.close()
        f = gzip.GzipFile(file, "r")
    else:
        f.close()
        f = open(file, "r")
    return f

But when I try to read a gzip-compressed file, I get this error:
$ python3 findHHIvan.py -s 86VRPQ2GD6EE6M0G2GLY0M -f message.log.2024-05-06_1128.2024-05-06_1131.gz -d /cxpslogs/powerBI/pruebasTransaction
searching in specified directories…
first search
file /cxpslogs/powerBI/pruebasTransaction/message.log.2024-05-06_1128.2024-05-06_1131.gz |
Traceback (most recent call last):
  File "findHHIvan.py", line 658, in <module>
    found = search(searching_criterias, files, found)
  File "findHHIvan.py", line 313, in search
    arch = read_file(file)
  File "findHHIvan.py", line 126, in read_file
    if f.read(2) == GZIP_MAGIC_NUMBER:
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Can you help me understand why this error occurs? The script runs in Python 2.7, but not in 3.x.

Regards

Hi Ivan, welcome to the forum.

Before future posts, please read the pinned thread to understand how to post code properly. This time the problem is simple, but normally we need to see the code properly, or it won’t be possible to analyze it.

This file contains compressed data, not text. Therefore, please open it in binary mode.

In Python 3, using text mode (the default) for a file will decode the text (to make a proper Unicode string) for you. But this can only work if it actually is text, and the encoding is correct.
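To make that concrete, here is a minimal sketch that reproduces the error. The sample file name and contents are made up for the demonstration, and the encoding is passed explicitly so the result does not depend on the machine's locale.

```python
import gzip
import os
import tempfile

# Create a small gzip file in a temporary directory (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.log.gz")
with gzip.open(gz_path, "wb") as out:
    out.write(b"[2024-05-06] hello\n")

# Text mode (the default) decodes the bytes it reads, and 0x8b is not valid UTF-8.
try:
    with open(gz_path, encoding="utf-8") as f:
        f.read(2)
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode returns the raw bytes without decoding anything.
with open(gz_path, "rb") as f:
    first_two = f.read(2)

print(type(text_mode_error).__name__)  # UnicodeDecodeError
print(first_two)                       # b'\x1f\x8b'
```

The same file opens without complaint in binary mode because no decoding step is involved.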

As a reminder, 2.7 is not supported. It has not been, for more than 4 years.

Hi Karl,
I need to work with files whether they are compressed or not. It works in Python 2.7, but not in Python 3.8.

def read_file(file):
    GZIP_MAGIC_NUMBER = bytes.fromhex('1f8b')
    f = open(file)
    if f.read(2) == GZIP_MAGIC_NUMBER:
        f.close()
        f = gzip.GzipFile(file, "r")
    else:
        f.close()
        f = open(file, "r") #open not compressed file
    return f

I have this output
$ python3 findHHIvan.py -s 86VRPQ2GD6EE6M0G2GLY0M -f message.log.2024-05-06_1128.2024-05-06_1131.gz -d /cxpslogs/powerBI/pruebasTransaction
searching in specified directories…
first search
file /cxpslogs/powerBI/pruebasTransaction/message.log.2024-05-06_1128.2024-05-06_1131.gz |
Traceback (most recent call last):
  File "findHHIvan.py", line 657, in <module>
    found = search(searching_criterias, files, found)
  File "findHHIvan.py", line 312, in search
    arch = read_file(file)
  File "findHHIvan.py", line 125, in read_file
    if f.read(2) == GZIP_MAGIC_NUMBER: #check if compressed
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Thanks a lot!

Please read the answer that @kknechtel gave.

Yes, I understand that. The purpose of your code is to uncompress it.

When you check for the GZIP_MAGIC_NUMBER, you need to open the file in binary mode. Otherwise, f.read(2) means to get a string from the first two characters of the file. But when the file is a Gzip file, it does not start with valid characters. Besides that, GZIP_MAGIC_NUMBER does not contain characters. It is a bytes object. Strings are completely separate. A comparison between bytes and str will always be False in Python 3.

You can process the file in binary mode no matter what the format is, because every file contains bytes. But not all bytes can be correctly interpreted as text.
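A minimal sketch of that check, assuming a hypothetical helper named is_gzip and two sample files created only for the demonstration:

```python
import gzip
import os
import tempfile

GZIP_MAGIC_NUMBER = bytes.fromhex("1f8b")  # two bytes: 0x1f 0x8b


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:  # binary mode: read() returns bytes
        return f.read(2) == GZIP_MAGIC_NUMBER


# Hypothetical sample files, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.log.gz")
txt_path = os.path.join(tmp_dir, "sample.log")
with gzip.open(gz_path, "wb") as out:
    out.write(b"[2024-05-06] compressed line\n")
with open(txt_path, "w", encoding="utf-8") as out:
    out.write("[2024-05-06] plain line\n")

print(is_gzip(gz_path))   # True
print(is_gzip(txt_path))  # False

# bytes and str never compare equal, even when they look alike.
print(b"ab" == "ab")      # False
```

Because the probe reads bytes and compares them with bytes, it works the same way for compressed and uncompressed files.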

It worked in 2.7 because Python 2 had wrong ideas about what text is: its str type was really a sequence of bytes, so reading a file never involved decoding.

Thanks Karl.

Now I can read the compressed file, but I have a problem with the if condition.
I need to read both compressed files and text files.
This is the function:

def read_file(file):
    GZIP_MAGIC_NUMBER = bytes.fromhex('1f8b')
    f = open(file)
    if f.read(2) == GZIP_MAGIC_NUMBER:
        f.close()
        f = gzip.open(file, "rt")
    else:
        f.close()
        f = open(file, "r") #open not compressed file
    return f

And the error
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The line f = open(file) says to open the file as text. That is the opposite of what you are supposed to do.

I get this error when I read both compressed files and text files:

$ python3 findHHIvan.py -s 86VRPQ2GD6EE6M0G2GLY0M -f message.log.2024-05-06_1128.2024-05-06_1131.gz -d /cxpslogs/powerBI/pruebasTransaction
searching in specified directories…
first search
file /cxpslogs/powerBI/pruebasTransaction/message.log.2024-05-06_1128.2024-05-06_1131.gz |
7551057 lines, 427440 messages, 2 matches found, time elapsed

total lines: 7551057; total messages: 427440; total matches found: 2; total time: 13.595723390579224
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
second search
file /cxpslogs/powerBI/pruebasTransaction/message.log.2024-05-06_1128.2024-05-06_1131.gz |
7551057 lines, 427440 messages, 2 matches found, time elapsed

total lines: 7551057; total messages: 427440; total matches found: 2; total time: 13.542299270629883
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
third search
file /cxpslogs/powerBI/pruebasTransaction/transaction.log.2024-05-06_10.cexpgtap4p.log |
Traceback (most recent call last):
  File "findHHIvan.py", line 663, in <module>
    found2 = search3(found2)
  File "findHHIvan.py", line 571, in search3
    for line in arch:
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/gzip.py", line 305, in read1
    return self._buffer.read1(size)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/gzip.py", line 427, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'[2')

The function is now:

def read_file(file):
    #GZIP_MAGIC_NUMBER = bytes.fromhex('1f8b')
    GZIP_MAGIC_NUMBER = "1f8b"
    f = open(file)
    #if f.read(2).encode("hex") == GZIP_MAGIC_NUMBER:
    f.close()
    f = gzip.open(file, "rt")
    #else:
    #f.close()
    #f = open(file, "r") #open not compressed file
    return f

This is what you’re supposed to be doing:

def read_file(file):
    GZIP_MAGIC_NUMBER = bytes.fromhex('1f8b')
    f = open(file, 'rb')         # open as binary data
    if f.read(2) == GZIP_MAGIC_NUMBER:
        f.close()
        f = gzip.open(file, 'r') # open compressed file
    else:
        f.close()
        f = open(file, 'r')      # open as text
    return f

Alternatively, instead of checking yourself whether it’s a compressed file, you could try opening the file using gzip and fall back to opening it as a text file if gzip complains:

def read_file(file):
    try:
        f = gzip.open(file, 'r')
    except gzip.BadGzipFile:
        f = open(file, 'r')
    return f

Thanks Matthew.

Now I have this error:
  File "findHHIvan.py", line 311, in search
    if line.find("<MSG>[") > -1: #reads each line and captures whole messages to later evaluate
TypeError: argument should be integer or bytes-like object, not 'str'


for file in files: #search all files sequentially
    start = time.time() #takes initial time for file read
    print ("file", file, " | "),
    if print_lines_flag == 0:
        print_lines_flag = 1
    arch = read_file(file)
    for line in arch:
        lines += 1
        if line.find("<MSG>[") > -1: #reads each line and captures whole messages to later evaluate
            message = ""
            message = message + line
            starting_line = 1
        if line.find("]") > -1: #this is to get those oneliner messages
            messages += 1
            if not starting_line:
                message = message + line
            for searching_criteria in searching_criterias:
                found = check_message(message, searching_criteria, file, found) #evaluate each individual message to see if all terms you are looking for are found
            starting_line = 0
        else:
            if not starting_line:
                message = message + line
            starting_line = 0

Are these values OK?

GZIP_MAGIC_NUMBER = bytes.fromhex("1f8b")

if f.read(2).encode('utf-8') == GZIP_MAGIC_NUMBER: #check if compressed

In order to preserve formatting, please select any code or traceback that you post and then click the </> button, as you’ve been asked to do repeatedly.

The traceback is saying that line is bytes, not str. Presumably, it’s from one of the compressed files, so it needs to be decoded. What encoding is the text in?


The “magic number” isn’t text, it’s 2 bytes, so it shouldn’t be encoded. Just let gzip complain if the file isn’t a GZIPped file.

I need to read both text files and compressed files with this script.
The script runs fine in Python 2.7, but not in Python 3.8.

Matthew Barnett wrote:

 def read_file(file):
     try:
         f = gzip.open(file, 'r')
     except gzip.BadGzipFile:
         f = open(file, 'r')
     return f

which we presume you’re using.
Your error message:

 File "findHHIvan.py", line 311, in search
     if line.find("<MSG>[") > -1:                                    #reads each line and captures whole messages to later evaluate
 TypeError: argument should be integer or bytes-like object, not 'str'

sounds like the file is open in binary mode, not text mode.

The gzip.open docs say:

 The mode argument can be any of 'r', 'rb', 'a', 'ab', 'w', 'wb', 'x'
 or 'xb' for binary mode, or 'rt', 'at', 'wt', or 'xt' for text mode.
 The default is 'rb'.

so unfortunately Matthew’s suggestion gives you a binary mode gzip file.
Replace his 'r' with 'rt'. You can do that in both opens for
consistency.
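Here is a small sketch of the difference between the two modes. The sample file and its contents are invented for the demonstration; only the gzip.open modes quoted above are taken from the docs.

```python
import gzip
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.log.gz")
with gzip.open(gz_path, "wb") as out:
    out.write("<MSG>[first message]\n".encode("utf-8"))

# 'r' means binary mode for gzip.open, so each line is a bytes object.
with gzip.open(gz_path, "r") as f:
    binary_line = f.readline()

# 'rt' asks for text mode, so each line is decoded to str.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    text_line = f.readline()

# Searching bytes with a str argument fails, as in the traceback above.
try:
    binary_line.find("[")
    find_error = None
except TypeError as exc:
    find_error = exc

print(type(binary_line).__name__)  # bytes
print(type(text_line).__name__)    # str
print(text_line.find("["))         # 5
```

With 'rt', the existing line.find(...) calls in the script can keep using str arguments.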

Python 2’s str is approximately equivalent to Python 3’s bytes, but for handling text in Python 3 you should really be using Python 3’s str which is equivalent to Python 2’s unicode, and for that you need to know the encoding of the text. Is it UTF-8?
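As a short illustration of why the encoding matters (the sample string is invented, and Latin-1 is used only as an example of a wrong guess):

```python
# str -> bytes needs an encoding, and bytes -> str needs the same one.
raw = "café\n".encode("utf-8")
print(raw)  # b'caf\xc3\xa9\n'

# Decoding with the right encoding gives back the original text.
text = raw.decode("utf-8")

# Decoding with the wrong encoding does not fail here, but the text is garbled.
wrong = raw.decode("latin-1")

print(text == "café\n")   # True
print(wrong == "café\n")  # False
```

A wrong guess may raise UnicodeDecodeError or may silently produce garbled text, so it is worth confirming how the log files were written.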

Assuming that you’re using UTF-8:

import codecs
import gzip

def read_file(file):
    try:
        f = gzip.open(file, 'r') # open as binary
        f = codecs.getreader('utf-8')(f) # decode to text, assuming it's UTF-8
    except gzip.BadGzipFile:
        f = open(file, 'r', encoding='utf-8') # open as text, assuming it's UTF-8
    return f

Thanks Matthew, but the issue continues.

$ python3 findHHIvan.py -s 86VRPQ2GD6EE6M0G2GLY0M -f message.log.2024-05-06_1128.2024-05-06_1131.gz -d /cxpslogs/powerBI/pruebasTransaction
searching in specified directories…
first search
file /cxpslogs/powerBI/pruebasTransaction/message.log.2024-05-06_1128.2024-05-06_1131.gz |
7551057 lines, 427440 messages, 2 matches found, time elapsed
second search
file /cxpslogs/powerBI/pruebasTransaction/message.log.2024-05-06_1128.2024-05-06_1131.gz |
7551057 lines, 427440 messages, 2 matches found, time elapsed
total lines: 7551057; total messages: 427440; total matches found: 2; total time: 37.806538343429565
third search
file /cxpslogs/powerBI/pruebasTransaction/transaction.log.2024-05-06_10.cexpgtap4p.log |
Traceback (most recent call last):
  File "findHHIvan.py", line 675, in <module>
    found2 = search3(found2)
  File "findHHIvan.py", line 583, in search3
    for line in arch:
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 645, in __next__
    line = self.readline()
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 558, in readline
    data = self.read(readsize, firstline=True)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/codecs.py", line 498, in read
    newdata = self.stream.read(size)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/gzip.py", line 427, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'[2')

It appears that gzip.open doesn’t check the file until you start reading from it, so:

import codecs
import gzip

def read_file(file):
    try:
        # Assume that it's a gzip file.
        f = gzip.open(file, 'r')
        # Try to read a byte.
        f.read(1)
    except gzip.BadGzipFile:
        # It's not a gzip file, so assume that it's a text file encoded in UTF-8.
        f = open(file, 'r', encoding='utf-8')
    else:
        # It looks like a gzip file, so rewind to the start...
        f.seek(0)
        # ...and prepend a decoder for UTF-8.
        f = codecs.getreader('utf-8')(f)
    return f
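
For comparison, here is a sketch of a variant that combines the two ideas from this thread: probe the magic number in binary mode, then open in text mode with 'rt'. It is not the code that was confirmed to work above. The UTF-8 default and the sample files are assumptions made for the demonstration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC_NUMBER = bytes.fromhex("1f8b")


def read_file(file, encoding="utf-8"):
    """Open a plain or gzip-compressed file for reading as text.

    The encoding is an assumption; change it if the logs are not UTF-8.
    """
    # Probe the first two bytes in binary mode, so nothing is decoded.
    with open(file, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC_NUMBER
    if is_gzip:
        return gzip.open(file, "rt", encoding=encoding)
    return open(file, "r", encoding=encoding)


# Hypothetical sample files, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "message.log.gz")
txt_path = os.path.join(tmp_dir, "transaction.log")
with gzip.open(gz_path, "wt", encoding="utf-8") as out:
    out.write("<MSG>[compressed message]\n")
with open(txt_path, "w", encoding="utf-8") as out:
    out.write("<MSG>[plain message]\n")

with read_file(gz_path) as f:
    gz_lines = list(f)
with read_file(txt_path) as f:
    txt_lines = list(f)

print(gz_lines)   # ['<MSG>[compressed message]\n']
print(txt_lines)  # ['<MSG>[plain message]\n']
```

Both branches return a text-mode file object, so the calling loop gets str lines in either case.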

Thanks a lot, Matthew! Issue resolved!
I also thank Cameron and Karl for the help they offered.