Skip certain lines of a file, or match lines in order

CC.L · March 23, 2023, 3:52am

Hi All,
I am facing a problem with my code when reading a file. The file is a log file.
The log file is shown below. Basically, I want to find two strings, in this order:
Step 1 find this string : m>>> DL- ingress traffic
Step 2 find this string: >>> DL- Mcs= 26.0 m then find >>> DL- ingress traffic

Sometimes the log file is incomplete: as you can see below, the DL- Mcs line comes first, which means the start of the log is incomplete. Is there any way to skip the beginning of the file?

It’s pretty hard to explain, so let’s look at the log and my code below.

log file:

[20230310.232915.783022][info]:[DL- UE[28]: Tput=   14.699137 Mbps, Mcs= 26.0(Sigma= 0.0)]
[20230310.232915.783051][info]:[DL- UE[29]: Tput=   14.699133 Mbps, Mcs= 26.0(Sigma= 0.0)]
[20230310.232915.783061][info]:[>>> DL- Mcs= 26.0, RbNum=  99.0, Layers= 4.0]
[20230311.012825.882186][info]:[e[40;32m>>> DL- ingress traffic: 117.590118(Mbps), egress traffic: 120.919250(Mbps), ReTx: 2.519794(Mbps)e[0m]
[20230311.012825.882339][info]:[>>> DL- Mcs= 26.0, RbNum=  71.2, Layers= 4.0]
[20230311.012825.882189][info]:[e[40;32m>>> DL- ingress traffic: 119.000(Mbps), egress traffic: 125.919250(Mbps), ReTx: 2.819794(Mbps)e[0m]
[20230311.012835.882479][info]:[>>> DL- Mcs= 26.0, RbNum=  70.7, Layers= 4.0]

My code:

import re
elogfileName="elog2.txt"

with open(elogfileName, 'r') as filedata:    
    for line in filedata:   
        #print(re.findall(r"(m>>>\ DL\- ?)", line))         
        if re.findall(r"(m>>>\ DL\- ?)", line):
            print(line.strip())

        if re.search(r'\[(\d+\.\d+\.\d+)\].*?(>>> DL- Mcs=[^]]+)', line):
            print(line.strip())

Please refer to the picture; I have labeled the parts A, B and C.
A is an incomplete log, which I wish to skip.
B is where I want to start searching ([>>> DL- Mcs).
C is what I look for after finding B (2m>>> DL- ).

Is there any way to ignore part A? The A lines should come after B, but they appear first because the log is incomplete.

I can manually remove part A from the log, but is there a better way to do this in code?

rob42 · March 23, 2023, 10:11am

Maybe using a generator to read the elogfile would help?

This is not a full solution, rather a suggestion:

filedata = (line for line in
            open("elog2.txt", mode="r", encoding="UTF-8"))

You can now use data = next(filedata) to read each line from elog2.txt, and maybe use a try/except block to catch the StopIteration exception, which is raised if you call next() more times than there are lines to read (7, in your example).
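A minimal sketch of that next()/StopIteration idea, assuming a plain substring test is enough to recognise the line. The helper name first_line_containing and the in-memory sample are made up for illustration; a real file object opened with open() could be passed in the same way.

```python
import io


def first_line_containing(lines, needle):
    """Advance the iterator with next() until a line contains needle.

    Returns the matching line, or None once the iterator is exhausted.
    """
    while True:
        try:
            data = next(lines)
        except StopIteration:
            # No more lines to read: give up quietly.
            return None
        if needle in data:
            return data


# In-memory stand-in for elog2.txt so the sketch runs on its own.
sample = io.StringIO(
    "[20230310.232915.783022][info]:[DL- UE[28]: Tput=   14.699137 Mbps, Mcs= 26.0(Sigma= 0.0)]\n"
    "[20230310.232915.783061][info]:[>>> DL- Mcs= 26.0, RbNum=  99.0, Layers= 4.0]\n"
)
print(first_line_containing(sample, ">>> DL- Mcs="))
```

Because the iterator keeps its position, any later next() call or for loop on the same object continues with the line after the match.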

Possibly, you could use the likes of:
if 'some_search_term' in data: rather than regex, but again, it’s just a suggestion.

Maybe this will be of some help, maybe not.

CAM-Gerlach · March 23, 2023, 12:48pm

It might be most efficient to run a regex on the entire file rather than one by one on each line. Either run re.search with your first regex to find the first line on which B occurs (or, just use in on the string directly), and then subset the file to just the lines following it and re.search the remaining portion for C. Or, use a lookbehind to do it all in a single re.search call.
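As a rough sketch of the first option (search the whole text for B, then search only the remainder for C): the pattern names B_LINE and C_LINE and the function find_b_then_c are hypothetical, and the two patterns are simplified guesses based on the sample log, not the OP's full regexes.

```python
import re

# re.MULTILINE lets ^ and $ match at each line boundary inside the whole text.
B_LINE = re.compile(r"^.*>>> DL- Mcs=.*$", re.MULTILINE)
C_LINE = re.compile(r"^.*>>> DL- ingress traffic.*$", re.MULTILINE)


def find_b_then_c(text):
    """Return (b_line, c_line) where C is the first match after the first B.

    Returns None if either line is missing.
    """
    b = B_LINE.search(text)
    if b is None:
        return None
    # Start the second search where the first match ended, so anything
    # before B (the incomplete part of the log) is never considered.
    c = C_LINE.search(text, b.end())
    if c is None:
        return None
    return b.group(0), c.group(0)


# Typical use, assuming the log fits comfortably in memory:
# with open("elog2.txt", encoding="UTF-8") as f:
#     result = find_b_then_c(f.read())
```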

Note that you’re not specifying the file encoding, which can result in a UnicodeDecodeError or mojibake on platforms that don’t default to UTF-8 (assuming the file is written as such). You should explicitly specify it with encoding="UTF-8" in the open() call.

I don’t see how the generator suggestion solves the problem; it behaves essentially the same as the original example, as in both cases the file is iterated line by line, reading each line as it is processed (since you just create a generator that does the same thing as simply iterating over filedata directly). However, unlike the OP’s code, it doesn’t include the with context manager and thus leaves the file open afterward, which is a very bad practice. Therefore, it is a strict regression over the original.

Furthermore, if iterating directly over the file object, you can seek() to skip to a specific position in the file without having to read it, which may or may not end up being useful here depending on the exact circumstances.
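A rough sketch of how seek() could fit in, assuming you can record a byte offset on an earlier pass (for example, where the first complete record starts) and want to jump straight there later. offset_of_first_match and lines_from_offset are hypothetical helpers; binary mode with readline() is used so that tell() returns plain byte offsets.

```python
def offset_of_first_match(path, marker):
    """Return the byte offset of the first line containing marker (bytes), or None."""
    with open(path, "rb") as f:
        while True:
            offset = f.tell()      # position of the line about to be read
            line = f.readline()
            if not line:           # empty bytes object means end of file
                return None
            if marker in line:
                return offset


def lines_from_offset(path, offset, encoding="UTF-8"):
    """Yield decoded lines starting at a previously recorded byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)             # jump without reading what comes before
        for raw in f:
            yield raw.decode(encoding)
```

Whether this beats simply skipping lines depends on whether the offset can be reused; finding it the first time still means reading the start of the file.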

CC.L · March 23, 2023, 1:15pm

So is there any example you can show me? Is it better to use the next() method or the seek() method?

aivarpaalberg · March 23, 2023, 2:02pm

Why not skip all the lines until >>> DL- Mcs= 26.0 m is encountered and only then start # do_something ?
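One possible way to write that skip, sketched with itertools.dropwhile from the standard library. The MARKER string and the function name skip_incomplete_head are assumptions for illustration; adjust the marker to whatever reliably identifies the first complete line.

```python
from itertools import dropwhile

MARKER = ">>> DL- Mcs="  # assumed marker for the first complete line


def skip_incomplete_head(lines, marker=MARKER):
    """Return an iterator over lines, starting at the first one containing marker."""
    # dropwhile discards items while the predicate is true, then passes
    # everything else through unchanged (including later non-matching lines).
    return dropwhile(lambda line: marker not in line, lines)


# Typical use:
# with open("elog2.txt", encoding="UTF-8") as filedata:
#     for line in skip_incomplete_head(filedata):
#         ...  # do_something
```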

steven.rumbalski · March 23, 2023, 6:55pm

You can nest loops so that you only search for the second line once you’ve found the first.

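import re

elogfileName = "elog2.txt"  # same name as in the original post

with open(elogfileName, 'r') as filedata:
    for line in filedata:
        if "m>>> DL- " in line:
            # found the first line; keep reading the same file object for the second
            for nextline in filedata:
                if re.search(r'\[(\d+\.\d+\.\d+)\].*?(>>> DL- Mcs=[^]]+)', nextline):
                    print(line, nextline, end='')
                    break # so you can start looking for the first match again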

You could also handle this by using a boolean flag, but in this case I like the nesting better.
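For comparison, a sketch of what the flag-style version might look like. Instead of a bare True/False it keeps the pending first line in a variable (None means "still looking for the first line"); pair_lines and its default substrings are illustrative assumptions, and substring tests stand in for the regex.

```python
def pair_lines(lines, first=">>> DL- ingress traffic", second=">>> DL- Mcs="):
    """Pair each 'first' line with the next 'second' line that follows it."""
    pairs = []
    pending = None  # the 'first' line still waiting for its partner, if any
    for line in lines:
        if pending is None:
            # State 1: looking for the first line; anything else is skipped,
            # which is what drops the incomplete start of the log.
            if first in line:
                pending = line
        elif second in line:
            # State 2: the first line is known, and this is its partner.
            pairs.append((pending, line))
            pending = None  # go back to looking for the first line
    return pairs
```

This has the same behavior as the nested loops for a single pass over the file, at the cost of tracking the state by hand.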

kknechtel · March 24, 2023, 12:35am

When you say that there are two steps to this… does that mean, the string from step 2 has to be later in the file than the string from step 1?
Should it be on the same line? On the immediately next line? On any later line? Something else?
What should happen, if we find the string from step 1 but don’t find the string from step 2? Is that okay - are we just done? Do we need to wait until we find the string from step 2, before we print the string from step 1?
What should happen after we find the string from step 2? Are we finished with the file? Should we look for a step 1 string again? Something else?

Wrapping the file in a generator doesn’t accomplish very much; the open file object is already iterable, and calling next on it already works.

CC.L · March 24, 2023, 3:16am

Hi Karl,
When you say that there are two steps to this… does that mean, the string from step 2 has to be later in the file than the string from step 1? Yes.
As you can see below.

The lines below appear at the top of the file, but they are an incomplete log, so I wish to ignore them:

[20230310.232915.783022][info]:[DL- UE[28]: Tput= 14.699137 Mbps, Mcs= 26.0(Sigma= 0.0)]
[20230310.232915.783051][info]:[DL- UE[29]: Tput= 14.699133 Mbps, Mcs= 26.0(Sigma= 0.0)]

Below is the complete log, where I want to start searching:

[20230310.232915.783061][info]:[>>> DL- Mcs= 26.0, RbNum= 99.0, Layers= 4.0]
[20230311.012825.882186][info]:[e[40;32m>>> DL- ingress traffic: 117.590118(Mbps), egress traffic: 120.919250(Mbps), ReTx: 2.519794(Mbps)e[0m]
[20230311.012825.882339][info]:[>>> DL- Mcs= 26.0, RbNum= 71.2, Layers= 4.0]

The reason I use two regular expressions on different lines is that both strings contain values that I want to extract. The code I posted is a simplified version, not my full code.

def main():
    with open(elogfileName, 'r') as filedata:
        for line in filedata:
            # step 1
            if re.findall(r"(m>>>\ DL\- ?)", line):
                parse(line, givenString)
            # step 2
            if re.search(r'\[(\d+\.\d+\.\d+)\].*?(>>> DL- Mcs=[^]]+)', line):
                parse_bler(line, 'DL')

I hope you can understand my explanation.
The front part of the file is an incomplete log, so I want to ignore it.

CC.L · March 24, 2023, 6:26am

Thanks a lot, your code works.