Parsing Text File

TazFleck · June 23, 2022, 1:03pm

Hello everyone,

I would like to ask how to parse a specific part of a text file. I want to extract one specific column of the text, but there are many lines before it, and I don’t know how to extract just that column because the lines contain a mix of strings and floats.

Regards,

vbrozik · June 23, 2022, 1:34pm

Hello,
could you please paste at least a few lines as text here? It is always better to send textual information as text, not as an image.

For example, without the text it is not possible to test a solution.

Please put the text between triple backticks to prevent Markdown from mangling it:

```
your text
```

TazFleck · June 23, 2022, 1:38pm

11688 Data ; Jmax 90 ; St Dev 0.159

#5. 5. 2. 3. 3. .5 Spin Statistics , Spin Y
P1 D66 0 1 0 1 P0 D6 0 0 0 0 D1 dip
801.0 264.0 1031.5 388.4 0.13778094357E+00 0.42182248646E-07
72 0.d+00 0 Para Number ; Model Accuracy Parameters
28SiF4
Dim 21 fév 2021 16:09:29 CET Hmn Frdm Value/cm-1 St.Dev./cm-1
1 2(0,0A1) 0000A1 0000A1 A1 02 224 0.13778023448E+00 0.3915693E-06
2 4(0,0A1) 0000A1 0000A1 A1 04 139 -0.41039338392E-07 0.6560125E-10
3 4(4,0A1) 0000A1 0000A1 A1 04 536 -0.33591716068E-08 0.4290270E-11
4 6(0,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
5 6(4,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
6 6(6,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
7 8(0,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
8 8(4,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
9 8(6,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
10 8(8,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
11 0(0,0A1) 0100E 0100E A1 20 330 0.26421941002E+03 0.3967863E-04
12 2(0,0A1) 0100E 0100E A1 22 130 -0.14303321917E-03 0.3393096E-07
13 2(2,0E ) 0100E 0100E E 22 248 -0.46790609420E-04 0.2657215E-07
14 3(3,0A2) 0100E 0100E A2 23 197 0.14085216624E-06 0.2969422E-09
15 4(0,0A1) 0100E 0100E A1 24 152 0.38404874052E-09 0.6656298E-11
16 4(2,0E ) 0100E 0100E E 24 204 -0.10234422562E-09 0.3485302E-11

I would like a column.

TazFleck · June 23, 2022, 1:39pm

I don’t know how to use markdown sadly…

I pasted it this way; I would like to have the “Value/cm-1” column.

#11688 Data ; Jmax 90 ; St Dev 0.159
5. 5. 2. 3. 3. .5 Spin Statistics , Spin Y
P1 D66 0 1 0 1 P0 D6 0 0 0 0 D1 dip
801.0 264.0 1031.5 388.4 0.13778094357E+00 0.42182248646E-07
72 0.d+00 0 Para Number ; Model Accuracy Parameters
28SiF4
Dim 21 fév 2021 16:09:29 CET Hmn Frdm Value/cm-1 St.Dev./cm-1
1 2(0,0A1) 0000A1 0000A1 A1 02 224 0.13778023448E+00 0.3915693E-06
2 4(0,0A1) 0000A1 0000A1 A1 04 139 -0.41039338392E-07 0.6560125E-10
3 4(4,0A1) 0000A1 0000A1 A1 04 536 -0.33591716068E-08 0.4290270E-11
4 6(0,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
5 6(4,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
6 6(6,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
7 8(0,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
8 8(4,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00

steven.daprano · June 23, 2022, 4:32pm

As Václav said: “Please put the text between triple backticks”

On standard American keyboards the backtick is on the “~” key, next to the 1 key.

Or if you are posting on the Discuss website, use the fancy editor widgets and use the </> button to format code.

petersuter · June 23, 2022, 4:52pm

There are many possible ways to do this (regex, csv, …).
But it looks like very simple character counting could work:

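import pathlib

path = pathlib.Path('example.txt')
started = False  # becomes True once the table header line has been seen
values = []
for line in path.read_text(encoding='utf-8').splitlines():
    if line.startswith('Dim'):  # header line of the table
        started = True
        continue
    if not started or not line.strip():  # skip the preamble and any blank lines
        continue
    values.append(float(line[40:57]))  # estimated character positions of the value column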

I didn’t test this or double check if the column numbers 40 and 57 are correct, but maybe just try adjusting it if not.

vbrozik · June 23, 2022, 8:51pm

Here is a more robust solution with regex and a generator which allows processing of very large files:

Input data:

input_lines = """
11688 Data ; Jmax 90 ; St Dev 0.159
#5. 5. 2. 3. 3. .5 Spin Statistics , Spin Y
P1 D66 0 1 0 1 P0 D6 0 0 0 0 D1 dip
801.0 264.0 1031.5 388.4 0.13778094357E+00 0.42182248646E-07
72 0.d+00 0 Para Number ; Model Accuracy Parameters
28SiF4
Dim 21 fév 2021 16:09:29 CET Hmn Frdm Value/cm-1 St.Dev./cm-1
1 2(0,0A1) 0000A1 0000A1 A1 02 224 0.13778023448E+00 0.3915693E-06
2 4(0,0A1) 0000A1 0000A1 A1 04 139 -0.41039338392E-07 0.6560125E-10
3 4(4,0A1) 0000A1 0000A1 A1 04 536 -0.33591716068E-08 0.4290270E-11
4 6(0,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
5 6(4,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
6 6(6,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
7 8(0,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
8 8(4,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
9 8(6,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
10 8(8,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
11 0(0,0A1) 0100E 0100E A1 20 330 0.26421941002E+03 0.3967863E-04
12 2(0,0A1) 0100E 0100E A1 22 130 -0.14303321917E-03 0.3393096E-07
13 2(2,0E ) 0100E 0100E E 22 248 -0.46790609420E-04 0.2657215E-07
14 3(3,0A2) 0100E 0100E A2 23 197 0.14085216624E-06 0.2969422E-09
15 4(0,0A1) 0100E 0100E A1 24 152 0.38404874052E-09 0.6656298E-11
16 4(2,0E ) 0100E 0100E E 24 204 -0.10234422562E-09 0.3485302E-11
""".splitlines()

import re

ITEMS_RE = r'''(?x)      # verbose regex
        \S*\([^)]*\)\S*  # sequence with parenthesis can contain whitespace
        | \S+            # or sequence of any non-whitespace characters
        '''

def iterate_values(lines, data_column=-2, data_header='Value/cm-1'):
    """Iterate values from text lines."""
    min_fields = data_column + 1 if data_column >= 0 else -data_column
    in_table = False
    for line in lines:
        line_fields = re.findall(ITEMS_RE, line)
        if len(line_fields) >= min_fields:
            if in_table:
                yield float(line_fields[data_column])
            elif line_fields[data_column] == data_header:
                in_table = True

# data from the variable:
values = list(iterate_values(input_lines))

# data from a file:
with open('2022-06-23_parsing_text.txt') as input_file:
    values = list(iterate_values(input_file))

values

[0.13778023448,
 -4.1039338392e-08,
 -3.3591716068e-09,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 264.21941002,
 -0.00014303321917,
 -4.679060942e-05,
 1.4085216624e-07,
 3.8404874052e-10,
 -1.0234422562e-10]
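
As a minimal sketch of what the regex does on individual lines: it is applied here to the header line and to row 13, whose second field contains a space. The pattern is repeated from the code above so the snippet runs on its own; the names `header`, `row`, `header_fields` and `row_fields` are only illustrative.

```python
import re

# Same pattern as above, repeated so that this snippet is self-contained.
ITEMS_RE = r'''(?x)      # verbose regex
        \S*\([^)]*\)\S*  # sequence with parenthesis can contain whitespace
        | \S+            # or sequence of any non-whitespace characters
        '''

header = "Dim 21 fév 2021 16:09:29 CET Hmn Frdm Value/cm-1 St.Dev./cm-1"
row = "13 2(2,0E ) 0100E 0100E E 22 248 -0.46790609420E-04 0.2657215E-07"

header_fields = re.findall(ITEMS_RE, header)
row_fields = re.findall(ITEMS_RE, row)

print(header_fields[-2])       # Value/cm-1  (this is what switches in_table on)
print(row_fields[1])           # 2(2,0E )    (kept as a single field)
print(len(row_fields))         # 9
print(float(row_fields[-2]))   # -4.679060942e-05
```

Counting the column from the end (`-2`) means the number of fields before it does not matter.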

TazFleck · June 23, 2022, 9:59pm

Thank you, I will try.

vbrozik · June 23, 2022, 10:14pm

Just an interesting note: Dim is short for dimanche, which is the French word for Sunday.

The whole part Dim 21 fév 2021 16:09:29 CET is a timestamp (date and time) in the Central European time zone (UTC+1). It is a reminder that assumptions not based on real information can lead to wrong code.

mlgtechuser · June 24, 2022, 1:09am

Fence your posted code with “backticks” like this (you can copy and paste the [ ``` ] from here):

```
<paste or type code here>
```

The backticks will make the code look like this:

volunteers = 0
print("Posting with a reader-friendly code format...")
volunteers += 1
print(f"...will probably encourage at least {volunteers} person to help me.")
while True:
    volunteers += 1
    print(f"or maybe {volunteers} people...")

You can also use the monospace font inline by using single backticks before and after the text.

`use the monospace font`

TazFleck · June 25, 2022, 10:15am

Hi guys,

I tried your solution but I have the same problem as before, I could not convert string to float with many words.

mlgtechuser · June 25, 2022, 1:04pm

#11688 Data ; Jmax 90 ; St Dev 0.159
5. 5. 2. 3. 3. .5 Spin Statistics , Spin Y
P1 D66 0 1 0 1 P0 D6 0 0 0 0 D1 dip
801.0 264.0 1031.5 388.4 0.13778094357E+00 0.42182248646E-07
72 0.d+00 0 Para Number ; Model Accuracy Parameters
28SiF4
Dim 21 fév 2021 16:09:29 CET  Hmn  Frdm         Value/cm-1  St.Dev./cm-1
   1  2(0,0A1) 0000A1 0000A1 A1 02   224  0.13778023448E+00 0.3915693E-06
   2  4(0,0A1) 0000A1 0000A1 A1 04   139 -0.41039338392E-07 0.6560125E-10

   3  4(4,0A1) 0000A1 0000A1 A1 04   536 -0.33591716068E-08 0.4290270E-11
   4  6(0,0A1) 0000A1 0000A1 A1 06     0  0.00000000000E+00 0.0000000E+00
   5  6(4,0A1) 0000A1 0000A1 A1 06     0  0.00000000000E+00 0.0000000E+00
   6  6(6,0A1) 0000A1 0000A1 A1 06     0  0.00000000000E+00 0.0000000E+00
   7  8(0,0A1) 0000A1 0000A1 A1 08     0  0.00000000000E+00 0.0000000E+00
   8  8(4,0A1) 0000A1 0000A1 A1 08     0  0.00000000000E+00 0.0000000E+00
   9  8(6,0A1) 0000A1 0000A1 A1 08     0  0.00000000000E+00 0.0000000E+00
  10  8(8,0A1) 0000A1 0000A1 A1 08     0  0.00000000000E+00 0.0000000E+00
  11  0(0,0A1)  0100E 0100E  A1 20   330  0.26421941002E+03 0.3967863E-04
  12  2(0,0A1)  0100E 0100E  A1 22   130 -0.14303321917E-03 0.3393096E-07

  13  2(2,0E )  0100E 0100E  E  22   248 -0.46790609420E-04 0.2657215E-07
  14  3(3,0A2)  0100E 0100E  A2 23   197  0.14085216624E-06 0.2969422E-09
  15  4(0,0A1)  0100E 0100E  A1 24   152  0.38404874052E-09 0.6656298E-11
  16  4(2,0E )  0100E 0100E  E  24   204 -0.10234422562E-09 0.3485302E-11

Applying the K.I.S.S. principle (“Keep It Super-Simple”)…
…this data is separated by spaces with no meaningful spaces inside the column values, so it can be made to behave like a ‘space-delimited’ text file. After removing the meaningless space characters, we can break each line at the spaces and find the columns that way.

The process is:

1. Read each line from the CSV file.
2. Find the first line of data.
   - as Václav pointed out, the ‘Dim’ is not reliable since it’s a day of the week and is almost certain to change.
   - the numeral '1 ' with a space after it appears to be reliable, but this should be thoroughly investigated.
   - The code below assumes that the '1 ' is reliable (‘1’ + <space>). If that turns out to be not reliable, we could assume that the data consistently starts on the 8th line OR look for two ‘:’ that are two characters apart OR a number of any other methods. The programmer needs to decide what method works best for this file format. (Ideally, the file has a firm specification that gives some certainty on how the header is structured, like “data ALWAYS starts on line 8”.)
3. Read each line of data. Remove the padded space from the 2nd column.
4. Find the location of each string of space characters and pull that column’s data into the corresponding column of a two-dimensional list. BONUS: this is exactly what the Python split() function does!
5. Read the target column with a for: loop and list[row][col] reference.

NOTE: Step 3 can just read the target column if none of the other column data are needed. The code below reads all columns and is probably more useful.

data_start_marker = '1 '

with open("KikiData.csv",'r') as csv_file:  #the file is closed automatically after the 'with' block
    csv_rows = csv_file.readlines()         #reads ALL lines into memory; for a huge file, loop over csv_file instead

for line_num,row in enumerate(csv_rows):
    if row.lstrip().startswith(data_start_marker):   #find the first data line (ignoring the left padding)
        data_start = line_num
        break                               #stop looping; go to the next line of code after the loop

col_num = 7     # ←←this is the column you asked for (first item in a list is position 0)
data_table = [row.replace( ' )' , ')' ) for row in csv_rows]    #remove the padding inside the 2nd column
data_table1 = [row.split() for row in data_table [data_start:] if row.strip()]   #split into columns; skip blank lines
data_col = [data_table1[i][col_num] for i in range(len(data_table1))]   #collect the column

The code below has print() loops to print the columns vertically AND also has a for: loop that shows what the data_table1 = [row.split() for row… line does.

new_row = []
data_table2 = []
data_start_marker = '1 '

with open("KikiData.csv",'r') as csv_file:  #the file is closed automatically after the 'with' block
    csv_rows = csv_file.readlines()         #reads ALL lines into memory; for a huge file, loop over csv_file instead

for line_num,row in enumerate(csv_rows):
    if row.lstrip().startswith(data_start_marker):   #find the first data line (ignoring the left padding)
        data_start = line_num
        break                               #stop looping; go to the next line of code after the loop

col_num = 7     #this is the column you asked for (first item in a list is position 0)
data_table = [row.replace( ' )' , ')' ) for row in csv_rows]    #remove the padding inside the 2nd column
data_table1 = [row.split() for row in data_table [data_start:] if row.strip()]   #split into columns; skip blank lines
data_col = [data_table1[i][col_num] for i in range(len(data_table1))]   #collect the column
for item in data_col:                       #print the column vertically
    print(item)
#THIS LOOP ↓↓↓ DOES THE SAME THING AS 'data_table1 =' ABOVE ↑↑↑  Use the one that is clearest to you.
for row in data_table[data_start:]:         #process the cleaned data rows from data_start row to end of the list
    if not row.strip():                     #skip blank lines
        continue
    new_row = row.split()                   #break the columns on this row into a list; with no argument, split() treats any run of whitespace as one separator
    data_table2.append(new_row)

data_col = [data_table2[i][col_num] for i in range(len(data_table2))]   #collect the column
for item in data_col:
    print(item)

vbrozik · June 25, 2022, 2:12pm

I am sorry, but we cannot help you if you do not show any details. It is essential to see exactly what you ran and what the exact results were, including any complete error messages.

I have no idea what “I could not convert string to float with many words” means exactly.

The code I sent works exactly as I posted it (for the input data I posted). Just copy and paste both the input_lines initialization and the code, comment out the file reading example, and add print(values). It should work with any supported version of Python (tested in Python 3.10.4).

Not really, there is at least one column which contains whitespace characters. Look carefully at the second column. Unfortunately, because of this, your code will not work.

…but we can probably assume that the columns will be fixed-position. The simple code from Peter is good for that, just change the start condition to a more robust one and use the real column position (instead of the estimated one).
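
A minimal sketch of that combination, assuming the value column occupies character positions 40 to 58 as measured on the space-preserved sample rows in this thread (the real file may differ, so check the positions against it). Starting at the line that contains 'Value/cm-1', and the names `VALUE_SLICE` and `extract_values`, are illustrative choices and not part of Peter's code.

```python
# Assumed character range of the "Value/cm-1" column, measured on the sample rows.
VALUE_SLICE = slice(40, 59)

def extract_values(lines, header_marker='Value/cm-1'):
    """Return the floats found at VALUE_SLICE in every row after the header line."""
    values = []
    in_table = False
    for line in lines:
        if not in_table:
            # The table starts after the line holding the column title,
            # which is more stable than the date at the start of that line.
            in_table = header_marker in line
            continue
        if line.strip():                      # ignore blank lines
            values.append(float(line[VALUE_SLICE]))
    return values

sample = [
    "28SiF4",
    "Dim 21 fév 2021 16:09:29 CET  Hmn  Frdm         Value/cm-1  St.Dev./cm-1",
    "   1  2(0,0A1) 0000A1 0000A1 A1 02   224  0.13778023448E+00 0.3915693E-06",
    "   2  4(0,0A1) 0000A1 0000A1 A1 04   139 -0.41039338392E-07 0.6560125E-10",
    "  13  2(2,0E )  0100E 0100E  E  22   248 -0.46790609420E-04 0.2657215E-07",
]
print(extract_values(sample))
```

Because the slice is taken by position, the space inside the second column of row 13 has no effect on the result.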

mlgtechuser · June 25, 2022, 5:39pm

I did see that and then grabbed the data in my post from the OP’s pasted data to replace the screenshot. HTML collapsed the spaces. Unformatted code strikes again! I’ll add spaces to the test data in my KikiData.csv file and re-paste it in my post.

there is at least one column which contains whitespace characters.

There are several columns with padded spaces because of course the data is actually in column format.

because of this your code will not work.

Au contraire, mon frère. Thanks to split() (see the Python docs), the extra whitespace is stripped out.

Built-in Data Types > Strings > split():
If [a separator] is not specified or is None , […] consecutive whitespace are regarded as a single separator

As long as the data within a column isn’t broken by a space, split() won’t break up any data fields. The lesson there, as mentioned above, is that knowing the data file structure is extremely important. (Thank you for bringing this up. I’ll edit my post to clarify that it only “behaves like” space-delimited data for the reasons just stated.)

To be proper purists about the column format, the lines can be broken up with a dictionary or list of tuples that defines the column widths, of course.
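
For illustration, a sketch of that column-width approach using a list of (name, start, end) tuples. The column names and the character positions are assumptions read off the space-preserved sample rows in this thread, not taken from any file specification, and only four of the columns are listed.

```python
# (name, start, end) for each column of interest; names and positions are assumptions.
COLUMNS = [
    ("index", 0, 4),
    ("term", 6, 14),
    ("value", 40, 59),
    ("st_dev", 59, 73),
]

def split_fixed_width(line, columns=COLUMNS):
    """Cut one table row into a dict of stripped strings, one entry per column."""
    return {name: line[start:end].strip() for name, start, end in columns}

row = "  13  2(2,0E )  0100E 0100E  E  22   248 -0.46790609420E-04 0.2657215E-07"
record = split_fixed_width(row)

print(record["index"])          # 13
print(record["term"])           # 2(2,0E )
print(float(record["value"]))   # -4.679060942e-05
```

The padded space inside `2(2,0E )` survives because the row is never split on whitespace.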

vbrozik · June 25, 2022, 7:21pm

Now I understand. You used the incomplete data where there are no spaces inside the second column. See the data in the post before that, in the screenshot or in my post (collapsed) as I used them in my code.

No, those are not the spaces I was talking about… You are referring to the whitespace characters between the fields. Yes, str.split() takes care of them. I was referring to the whitespace characters inside the field of the second column.
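
To make the difference concrete, a short sketch on row 13 of the sample data (the variable names are only illustrative). Plain str.split() cuts the second field in two, so an index counted from the left shifts by one, while an index counted from the right, like data_column=-2 in the regex solution, is not affected.

```python
row = "13 2(2,0E ) 0100E 0100E E 22 248 -0.46790609420E-04 0.2657215E-07"

fields = row.split()    # the space inside "2(2,0E )" also acts as a separator

print(len(fields))      # 10 instead of the expected 9
print(fields[1:3])      # ['2(2,0E', ')']  the second column was cut in two
print(fields[7])        # 248  index 7 is no longer the value column
print(fields[-2])       # -0.46790609420E-04
```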

mlgtechuser · June 25, 2022, 10:01pm

That’s the truth!! I did not click to open the screenshot for a clear view and then only copied the first three lines of data because the last few lines in your post were behind my editor, so it looked consistent. Good catch, Václav!

I have made the following edits to my ever-changing post above:

Posted enough of the data set to include those spaces, reformatted to match the screenshot.
(Take a look at the folded data posting layout. I think you’ll like it. You can click anywhere in the gap.)
Added a simple list comprehension to remove all padded right parentheses ‘)’. [1]
- Since the extra padding in the 2nd column is only to make the right parentheses line up for human readability, it would need to be removed eventually for performing any data processing on that column; might as well do it now.

At this rate, I’ll be posting another version that divides the lines by column.

[1] No, I didn’t change any internal states.

TazFleck · June 26, 2022, 10:09am

To be honest, as a beginner at parsing with Python, I find you are going very deep and I don’t understand everything ^^’

vbrozik · June 26, 2022, 11:20am

That is OK, we can explain the parts that are difficult for you. We just do not know what is difficult for you.

You wrote that you were not able to make the code work, so show us exactly what you are running, what the results are, and how they differ from what you would like to get.

I think the following code does what you want. I put it together here as I described earlier. Just paste it into a text file, e.g. parse_column.py, and run it: python3 parse_column.py

input_lines = '''
11688 Data ; Jmax 90 ; St Dev 0.159
#5. 5. 2. 3. 3. .5 Spin Statistics , Spin Y
P1 D66 0 1 0 1 P0 D6 0 0 0 0 D1 dip
801.0 264.0 1031.5 388.4 0.13778094357E+00 0.42182248646E-07
72 0.d+00 0 Para Number ; Model Accuracy Parameters
28SiF4
Dim 21 fév 2021 16:09:29 CET Hmn Frdm Value/cm-1 St.Dev./cm-1
1 2(0,0A1) 0000A1 0000A1 A1 02 224 0.13778023448E+00 0.3915693E-06
2 4(0,0A1) 0000A1 0000A1 A1 04 139 -0.41039338392E-07 0.6560125E-10
3 4(4,0A1) 0000A1 0000A1 A1 04 536 -0.33591716068E-08 0.4290270E-11
4 6(0,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
5 6(4,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
6 6(6,0A1) 0000A1 0000A1 A1 06 0 0.00000000000E+00 0.0000000E+00
7 8(0,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
8 8(4,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
9 8(6,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
10 8(8,0A1) 0000A1 0000A1 A1 08 0 0.00000000000E+00 0.0000000E+00
11 0(0,0A1) 0100E 0100E A1 20 330 0.26421941002E+03 0.3967863E-04
12 2(0,0A1) 0100E 0100E A1 22 130 -0.14303321917E-03 0.3393096E-07
13 2(2,0E ) 0100E 0100E E 22 248 -0.46790609420E-04 0.2657215E-07
14 3(3,0A2) 0100E 0100E A2 23 197 0.14085216624E-06 0.2969422E-09
15 4(0,0A1) 0100E 0100E A1 24 152 0.38404874052E-09 0.6656298E-11
16 4(2,0E ) 0100E 0100E E 24 204 -0.10234422562E-09 0.3485302E-11
'''.splitlines()

import re

ITEMS_RE = r'''(?x)      # verbose regex
        \S*\([^)]*\)\S*  # sequence with parenthesis can contain whitespace
        | \S+            # or sequence of any non-whitespace characters
        '''

def iterate_values(lines, data_column=-2, data_header='Value/cm-1'):
    """Iterate values from text lines."""
    min_fields = data_column + 1 if data_column >= 0 else -data_column
    in_table = False
    for line in lines:
        line_fields = re.findall(ITEMS_RE, line)
        if len(line_fields) >= min_fields:
            if in_table:
                yield float(line_fields[data_column])
            elif line_fields[data_column] == data_header:
                in_table = True

# data from the variable:
values = list(iterate_values(input_lines))

print(values)

TazFleck · June 26, 2022, 11:21am

I will try, thank you.

TazFleck · June 26, 2022, 8:20pm

I tried it and it works, thank you. But I would like to store the values in another text file. Do I have to use fichier.write to do that?
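
One possible way, as a sketch: yes, calling write() on an open file object works, for example with one value per line. The file name values.txt and the short values list are placeholders; in practice values would be the list produced by iterate_values above.

```python
values = [0.13778023448, -4.1039338392e-08, 264.21941002]   # placeholder data

# Mode 'w' creates the file, or overwrites it if it already exists.
with open('values.txt', 'w', encoding='utf-8') as fichier:
    for value in values:
        fichier.write(f'{value}\n')     # one value per line

# Read the file back to check that nothing was lost on the way.
with open('values.txt', encoding='utf-8') as fichier:
    restored = [float(line) for line in fichier]

print(restored == values)
```

The with statement closes the file automatically, so no explicit fichier.close() is needed.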