Find the header location in a .bin file without reading the whole file at once

The code below works for finding the header locations in a .bin file, but I don’t want to read the whole file at once. The header has a mixed byte format: if I read the whole file as uint16, I won’t be able to get those bytes (some are uint8 and uint32). What is the solution so that I can read the file to find the header location and then assign the datatypes for the header information? Basically, discard all data before the header and read the header (2 bytes - 2 bytes [little endian] - 2 bytes [big endian] - 1 byte - 1 byte - …). Thanks in advance!

import numpy as np

with open(filename, mode='rb') as f:
    b = f.read()
    np_data = np.frombuffer(b, dtype=np.uint16)
    findIndex = np.where(np_data == int("00000050", 16))

Are there any restrictions on where the header can be?

(It seems strange to me that the header isn’t at the start of the file
and can instead be anywhere, but whatever.)

If the file is small enough to read into memory easily, say, less than 100 MB, then any code you write in Python will be much, much slower than reading the whole file into memory and letting numpy do the work. You should only consider something different if the file is so huge that you can’t read it all into memory, or if you are reading a massive file from really slow media (say, reading from a CD-ROM over an unreliable network).

Anyway, instead of reading the entire file all at once, you can read it in chunks, search each chunk, and only go on to read the next chunk if the header isn’t found.

f.read(64*1024*1024)  # Read 64 MB at a time.

Don’t forget to take into account that the header might be split between
two chunks. If the maximum size of the header is 32 bits (4 bytes), you
can prepend the last 4 bytes of the previous chunk to the current chunk.

But honestly, unless your file is huge, this extra complexity will
probably just make it slower.


The code below works for finding the header locations in a .bin file, but I don’t want to read the whole file at once.

Indeed not. That would use a heap of memory and I/O.

The header has a mixed byte format: if I read the whole file as uint16, I won’t be able to get those bytes (some are uint8 and uint32). What is the solution so that I can read the file to find the header location and then assign the datatypes for the header information? Basically, discard all data before the header and read the header (2 bytes - 2 bytes [little endian] - 2 bytes [big endian] - 1 byte - 1 byte - …). Thanks in advance!

The stdlib "struct" module provides functions for parsing basic binary types, like a little-endian uint16 and so forth, from binary data. You will still need to read the file (in pieces, not all at once!)
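
For instance, your 80 marker as a 16-bit value, in either byte order (note that unpack always hands back a tuple, even for a single field):

import struct

struct.unpack('<H', b'\x50\x00')   # -> (80,)  little endian
struct.unpack('>H', b'\x00\x50')   # -> (80,)  big endian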

In terms of avoiding I/O, you can probably use the mmap module to map the file into memory, and use struct on those data. That avoids reading the whole file with f.read(); the OS will read pages from the file as needed when you access the memory.
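
A rough sketch of that approach; the b'\x00\x50' marker bytes are an assumption here and may need to be b'\x50\x00' instead, depending on your file's byte order:

import mmap
import struct

with open(filename, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        p = mm.find(b'\x00\x50')    # the OS pages the file in as it is touched
        if p != -1:
            # unpack_from reads straight out of the mapping at an offset,
            # e.g. one little-endian uint16 right at the marker:
            value, = struct.unpack_from('<H', mm, p)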

Stepping beyond the standard library for a bit:

I’ve been reading a lot of binary data recently, and tend to want to
read it as a stream, like you would read lines from a file.

For this purpose I’ve got 2 modules on PyPI:

cs.buffer, which will take any iterable of bytes-like things and present you with pieces in the sizes you want (e.g. 2 bytes for a 16-bit value). It has factories for making buffers from files (like your open(filename, mode='rb')), mmapped files, lists or bytes, etc. That way you can use it on all sorts of things, depending on where your data are coming from.

cs.binary, a suite of binary data structure parsing classes which are
crafted to operate on a buffer from cs.buffer.

This includes a bunch of classes like UInt16LE to read a 2-byte little
endian value from a buffer; these actually use the struct module
internally and for structures with multiple fields they return a
namedtuple instead of a plain tuple like struct does.

But you can easily make your own classes for whatever structure you need to parse. There’s a BinaryMultiStruct factory which takes a struct format string and a list of matching field names, and returns you a class for parsing that struct, which hands you namedtuples.

They all have common methods like parse (gives you a class instance from a buffer), parse_value (for single-value things like a 16-bit number, gives you the value), and scan, which yields instances as an iterator, e.g.:

for obj in MyStruct.scan(bfr):
    ... do stuff with obj, which has been parsed from the buffer ...

Your use case would look something like this:

from cs.buffer import CornuCopyBuffer
from cs.binary import UInt16LE

with open(filename, mode='rb') as f:
    bfr = CornuCopyBuffer.from_file(f)
    v16 = UInt16LE.parse_value(bfr)
    ... do whatever you need to parse the .bin file ...

Cheers,
Cameron Simpson cs@cskk.id.au

A bit tangential, but in telecommunications, this is actually a real-life problem if one has a bitstream and needs to first “sync” to find where a “frame” begins. This is especially common when processing T-carrier, E-carrier, and SONET transmissions.

Thanks all for your valuable insights. I was trying to use the struct package to read the file; the code is below, but it is unable to find the header position, while the code above runs okay. Do you know what I am doing wrong? I will try all those suggested ways.

with open(fileName, mode='rb') as f:
    x = os.stat(fileName).st_size
    y = int(x/4)
    print('x,y:', x, y)
    for i in range(0, y, 4):
        x = st.unpack('<I', f.read(4))
        if x == int("00000050", 16):
            print(" Got it: ")

I don’t understand; what do you mean by restrictions? This is very common: sometimes there are some garbage values before the header. In this .bin file the header starts with 80, so I am trying to find it. The size of the file is 2 GB, so there are a couple of disadvantages to reading the whole file at once with a fixed datatype.

Thanks

Thanks for the suggestion. I was trying to use the struct package, but with no result. What is wrong in the code below? I am reading 4 bytes at a time and comparing.

```
with open(fileName, mode='rb') as f:
    x = os.stat(fileName).st_size
    y = int(x/4)
    print('x,y:', x, y)
    for i in range(0, y, 4):
        x = st.unpack('<I', f.read(4))
        if x == int("00000050", 16):
            print("Got it: ")
```

I’d precompute int("00000050", 16) up the front, something like:

MARKER_VALUE = int("00000050", 16)

and use that inside the loop. But there’s nothing obviously wrong here.
Do you have a documentation reference for your .bin files?

Cheers,
Cameron Simpson cs@cskk.id.au

I only have the following information:
header starting value: 80 (= int("00000050", 16)), which is 16 bits.
Total header size is 16 bytes.
First, I have to find the 80, which is what I am trying to do.

My question is: np.where is working fine, but unpack is not working. Maybe I am doing something wrong that I am not seeing.

I wouldn’t use the struct module to search for the header; that requires you to read the values four bytes at a time, which will be slow as anything!

If we can assume that the header is word aligned (it can only occur at
an even address) then the search code is simple:

# Untested.
BUFFER_SIZE = 64*1024*1024  # read a 64MiB buffer
assert BUFFER_SIZE % 2 == 0
START_HEADER = b'\x00\x50'  # 80 in 16-bits
with open(filename, 'rb') as f:
    count = 0
    buffer = f.read(BUFFER_SIZE)
    while buffer:
        p = buffer.find(START_HEADER)
        if p != -1:
            print("found at offset", p + count*BUFFER_SIZE)
            break
        else:
            count += 1
            buffer = f.read(BUFFER_SIZE)
    else:  # while...else
        # executed if no break
        print('not found')

Once you find the offset, then you can read ten bytes from the buffer to
form the header:

header = buffer[p:p+10]
if len(header) != 10:
    header += f.read(10 - len(header))
    if len(header) != 10:
        assert f.read(1) == b''  # End Of File.
        print('not enough bytes in header')

and then, once you have the ten byte header, you can use the struct
module to unpack it into fields.
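
For example; the field layout below is invented, since only the rough "2 bytes, 2 bytes little endian, 2 bytes big endian, 1 byte, 1 byte" shape has been described, and struct cannot mix byte orders within a single format string, so the big-endian field is unpacked separately:

import struct

marker, value_le = struct.unpack_from('<2H', header, 0)   # bytes 0-3, little endian
value_be, = struct.unpack_from('>H', header, 4)           # bytes 4-5, big endian
b0, b1, b2, b3 = struct.unpack_from('4B', header, 6)      # bytes 6-9, single bytes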

If the header might not be word aligned, then you have to be a bit more
careful. Each time you read a new buffer, you can prepend it with the
last byte from the previous buffer.
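
A minimal sketch of that (untested), reusing the names from the code above and carrying one byte over between reads:

tail = b''
offset = 0      # file offset where the current window starts
buffer = f.read(BUFFER_SIZE)
while buffer:
    window = tail + buffer
    p = window.find(START_HEADER)
    if p != -1:
        print("found at offset", offset + p)
        break
    tail = window[-1:]              # carry the last byte into the next window
    offset += len(window) - 1
    buffer = f.read(BUFFER_SIZE)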

Another thought: I don’t know what numpy does when it reads an array of uint16. Depending on the endianness of your file, you might need to look for either of these byte patterns:

b'\x00\x50'
b'\x50\x00'

Either of those could, possibly, count as 80 in 16-bits, depending on
the endianness of your file.
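
(For reference, plain np.uint16 means the machine’s native byte order; numpy also accepts explicit byte-order dtypes, so you can spell out which one you expect:)

import numpy as np

little = np.frombuffer(b'\x50\x00', dtype='<u2')   # -> array([80]), little endian
big = np.frombuffer(b'\x00\x50', dtype='>u2')      # -> array([80]), big endian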


It works! Thanks a lot. I am curious about the buffer size: why BUFFER_SIZE = 64 MiB? In this project the data have been written in 16384-byte blocks, and when I set that as the buffer size it doesn’t work.

When unpacking the header

header = buffer[p:p+16]
header_info = st.unpack('<8H', header)

header info is
(20480, 0, 2048, 0, 62464, 36870, 25857, 43265).

But for the following code

header_info_tst = st.unpack('<8H', f.read(16))

the header info is different: (80, 0, 31, 0, 1780, 400, 357, 22791). That shows the correct values; the only problem is that it shows 31 where it should be 8. Maybe it is missing headers 8 to 30.

Data are written in the following way

    [header information] start with 80 line 9
    [header information] start with 80 line 10
    ...................
    [header information] start with 80 line n

One more question: how do I prepend the last byte from the previous buffer?

I really appreciate your time and help.

You’re reading 4-byte integers ('I'). You want to read(2) and use 'H' (unsigned short, 16 bits).
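
Untested, but something along these lines, keeping your st alias and unpacking the one-element tuple that unpack returns:

MARKER = int("00000050", 16)
with open(fileName, mode='rb') as f:
    while True:
        chunk = f.read(2)
        if len(chunk) < 2:
            break                          # end of file
        value, = st.unpack('<H', chunk)    # unpack always returns a tuple
        if value == MARKER:
            print("Got it at offset", f.tell() - 2)
            break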

Cheers,
Cameron Simpson cs@cskk.id.au

If the underlying I/O is being done in nice big chunks it won’t be nearly as bad as you imagine: read(2) or read(4) just pulls from a buffer. The reads that populate that buffer should normally be much larger.

Pulling 2 or 4 bytes from the buffer and applying a struct shouldn’t
really be a lot slower than your batches-of-ints approach.

Cheers,
Cameron Simpson cs@cskk.id.au

Thanks !!!