Taking only the lines with numbers from text files and packing everything into one new file

Hello Python fellows.

I have many .txt files (I will give only a few examples).

The file names look like this:
(f=1),theta=0,phi=-45,VP(1) (Total) [pw].txt
(f=1),theta=0,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=0,phi=45,VP(1) (Total) [pw].txt
(f=1),theta=30,phi=-45,VP(1) (Total) [pw].txt
(f=1),theta=30,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=30,phi=45,VP(1) (Total) [pw].txt
(f=1),theta=60,phi=-45,VP(1) (Total) [pw].txt
(f=1),theta=60,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=60,phi=45,VP(1) (Total) [pw].txt
(f=1),theta=90,phi=-45,VP(1) (Total) [pw].txt
(f=1),theta=90,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=90,phi=45,VP(1) (Total) [pw].txt
(f=1),theta=120,phi=-45,VP(1) (Total) [pw].txt
(f=1),theta=120,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=120,phi=45,VP(1) (Total) [pw].txt
(f=1),theta=150,phi=-45,VP(1) (Total) [pw].txt
(f=1),theta=150,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=150,phi=45,VP(1) (Total) [pw].txt
(f=1),theta=180,phi=-45,VP(1) (Total) [pw].txt
(f=1),theta=180,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=180,phi=45,VP(1) (Total) [pw].txt

So basically, theta and phi are variables in the names of all these files.
In every one of those files I have something like this:

Theta [deg.] Phi [deg.] Abs(E )[V/m ] Abs(Theta)[V/m ] Phase(Theta)[deg.] Abs(Phi )[V/m ] Phase(Phi )[deg.] Ax.Ratio[ ]

0.000 0.000 3.970e+01 3.970e+01 267.906 2.258e-03 214.586 3.162e+02

So, the first line contains only variable names (characters), and the second line contains the numbers.

In the first step, I want to choose only the files with, for example, phi=0. My files will then be something like:

(f=1),theta=0,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=30,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=60,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=90,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=120,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=150,phi=0,VP(1) (Total) [pw].txt
(f=1),theta=180,phi=0,VP(1) (Total) [pw].txt

In the second step, I want to take the line with numbers from each of these 7 new files (so for every theta), and pack those lines from all 7 files into one new .txt file.
At the end I will have one new .txt file with 7 rows containing only numbers, not characters.

Any help is greatly appreciated.

Best regards,

Pajtonko

I would approach this by breaking the task down into smaller steps, solving the smallest step I can think of, and combining little solutions into a bigger one.[1]

For example, what stood out to me immediately was:

“Check whether a file name matches the phi and/or theta value I want.”

To do this, I need to know what values a particular file has. I’d start with a function that solves just this one thing for one file:

def get_file_vars(filename: str): ...

Personally, I would use a regular expression for this. They are good at string matching and have a feature called “capture groups” that helps you extract little pieces of a larger string. Here is the one I came up with; you can see the values it matches and the capture group results here: https://regex101.com/r/4pmqMF/1

With a pattern, our function can be something like this:

import re

# capture group 1 is the theta value, capture group 2 is the phi value
file_vars_pattern = re.compile(r".*,theta=(.+),phi=(.+),.*txt")

def get_file_vars(filename: str):
    maybe_match = file_vars_pattern.match(filename)
    if maybe_match is None:
        return None
    return maybe_match.group(1, 2)

This tries to use the pattern on a filename, and if it works, returns the theta and phi values as a tuple. I can run this function against some strings and/or write unit tests and see whether it works. Now that I have the function, I no longer have to worry about “what variable values does a particular file have?”, because I can just call the function.
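
For example, here are a couple of quick checks by hand, using one filename from your list and one made-up name that shouldn’t match:

print(get_file_vars("(f=1),theta=30,phi=-45,VP(1) (Total) [pw].txt"))  # ('30', '-45')
print(get_file_vars("some_unrelated_file.txt"))  # None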

Back to the larger question of “Check whether a filename matches a particular phi and/or theta”. That might look something like this:

def file_matches_variables(filename: str, theta: str | None, phi: str | None) -> bool: ...

Since you may not always care about both variables, I marked their parameters optional - they may be present or they may be “None”, meaning we don’t have a value for that one. We can already use the first function we wrote to help write this one, just taking care in case the pattern doesn’t match like we want:

def file_matches_variables(filename: str, theta: str | None, phi: str | None) -> bool:
    maybe_vars = get_file_vars(filename)
    if maybe_vars is None:
        print(f"couldn't parse a filename! '{filename}'")
        return False
    
    file_theta, file_phi = maybe_vars
    # ...

You could of course do something other than printing if the file name pattern match doesn’t work, depending on how unexpected that is. For instance, if you really don’t think that should ever happen, throw an exception instead so the program stops.
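
If you went that route, the early check might look something like this instead (ValueError is just one reasonable choice of exception here):

    if maybe_vars is None:
        # stop the whole program rather than silently skipping the file
        raise ValueError(f"couldn't parse a filename: '{filename}'")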

Checking the two variables is pretty straightforward. The following is my personal style; there are many ways you could write this, but I tend to use this “test something, early return” style because it reduces nesting:

    # ... see above
    if theta is not None and file_theta != theta:
        return False
    
    if phi is not None and file_phi != phi:
        return False
    
    # else, it passes all our tests!
    return True

There are a few edge cases and strict-correctness things I’m not dealing with, for brevity. For example, you could convert file_theta and file_phi into integers, and use int | None for the function parameters instead, since they are technically numbers (and it looks like they’re all whole integers, based on your sample file names?).
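
If you wanted to go down that road, a rough sketch of the integer version might look like this (note that callers would then pass numbers like 30 or -45 instead of strings):

def file_matches_variables(filename: str, theta: int | None, phi: int | None) -> bool:
    maybe_vars = get_file_vars(filename)
    if maybe_vars is None:
        print(f"couldn't parse a filename! '{filename}'")
        return False

    # convert the captured strings to integers, so e.g. "030" and "30" would compare equal
    file_theta, file_phi = (int(value) for value in maybe_vars)

    if theta is not None and file_theta != theta:
        return False

    if phi is not None and file_phi != phi:
        return False

    return True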

We’ve now solved the subproblem “how do I know if a filename matches the variable values I want?”, but only for one file. We also aren’t reading filenames out of a folder or reading file contents yet. Whenever I have to deal with file paths, directories, filenames and the like, I tend to use the pathlib module: https://docs.python.org/3/library/pathlib.html.

We can use pathlib to get all the filenames in a single folder with something like this:

from pathlib import Path

def files_example(folder: Path):
    for item in folder.iterdir():
        if item.is_file():
            ...  # do stuff with each file here!

In our case, what we can do is get all the files in a folder that match our variables:

def get_matching_files(folder: Path, match_theta: str | None, match_phi: str | None) -> list[Path]:
    matching_file_paths = []

    for item in folder.iterdir():
        if item.is_file() and file_matches_variables(item.name, match_theta, match_phi):
            matching_file_paths.append(item)

    return matching_file_paths

You can be fancier with the above and use a list comprehension, but I’ve used a full for loop because without more refactoring it would be an uncomfortably long list comprehension.
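
For reference, the comprehension version of the same thing would be roughly this; whether it reads better is a matter of taste:

def get_matching_files(folder: Path, match_theta: str | None, match_phi: str | None) -> list[Path]:
    return [
        item
        for item in folder.iterdir()
        if item.is_file() and file_matches_variables(item.name, match_theta, match_phi)
    ]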

There isn’t much left to do, except reading from the files we find and collecting their lines into a new file. (pathlib can help us with those also.) Here’s a little helper function that takes a file path, and gets the second line:

def get_line_two(file: Path) -> str:
    return file.read_text().splitlines()[1]

read_text is a handy function that opens the file for us and collects the whole file into one big string. Don’t use it if your files are very large and you only need some of the data, but since it looks like your files are small, reading the whole file at once is inconsequential. Here is a snippet showing how to put it all together; this uses a list comprehension but could use a for loop like in “get_matching_files” and work exactly the same way:

input_folder = Path("your folder path goes here!")
# example values
want_theta = None
want_phi = "-45"

lines_to_write = [
    get_line_two(file) for file in get_matching_files(input_folder, want_theta, want_phi)
]

Now you have a list containing the second line of every file in your folder that matches the theta and phi values you set. The only thing left is to write them to a file. There are several equivalent ways to do this as well; here is one:

output_file_name = Path("path to write output file!")
with output_file_name.open(mode='w') as output:
    # splitlines stripped the newlines earlier, so put them back when writing
    output.write("\n".join(lines_to_write) + "\n")
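
If you prefer, Path.write_text does the open, write and close in one call; assuming the same joined string as above:

output_file_name.write_text("\n".join(lines_to_write) + "\n")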

And there you have it!

There are many ways you could make such a script more user friendly or more robust. For example, I like to use argparse to add command line options to my processing scripts. This would let you set things like the input folder path, theta value, and phi value when you run the script, instead of having to edit it every time you want to use different values. It might also be nice to use the theta and phi values to generate the name of the output file - that would make it easier to run the script several times with different variables without having to move the output files around by hand.
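
As a very rough sketch of the argparse idea (the argument names and the output-name format are just suggestions, adjust them to taste):

import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Collect the data lines from matching files into one file.")
parser.add_argument("folder", type=Path, help="folder containing the .txt files")
parser.add_argument("--theta", default=None, help="only keep files with this theta value")
parser.add_argument("--phi", default=None, help="only keep files with this phi value")
args = parser.parse_args()

# build an output name from the chosen values, e.g. "collected,theta=None,phi=0.txt"
output_file_name = Path(f"collected,theta={args.theta},phi={args.phi}.txt")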

That is a long post, but I hope it helps break things down for you in a way that you can apply to other tasks and scripts in the future.

Do note that I wrote this without testing the code examples! You may need to fix things up. I’m mostly trying to give an outline.


  1. This is sometimes called “bottom up” design; there is also “top down” design, which is another way of thinking about the same thing depending on the situation or how your individual brain likes to think about problems. ↩︎

Hello Matt.

Dude, thank you very much for such an educational reply. I will need some time to digest everything you wrote, but BIG BIG thank you.

Best regards,

Nikola
