I would like to ask how to parse a specific part of a text file. I want to extract one specific column of the text, but there are many lines before it, and I don’t know how to extract just that column because the lines mix strings and floats.
There are many possible ways to do this (regex, csv, …).
But it looks like simple character counting could work:
import pathlib

path = pathlib.Path('example.txt')
started = False
values = []
for line in path.read_text(encoding='utf-8').splitlines():
    if line.startswith('Dim'):
        started = True
        continue
    if not started:
        continue
    values.append(float(line[40:57]))
I didn’t test this or double check if the column numbers 40 and 57 are correct, but maybe just try adjusting it if not.
import re

ITEMS_RE = r'''(?x)      # verbose regex
    \S*\([^)]*\)\S*      # sequence with parenthesis can contain whitespace
    | \S+                # or sequence of any non-whitespace characters
'''

def iterate_values(lines, data_column=-2, data_header='Value/cm-1'):
    """Iterate values from text lines."""
    min_fields = data_column + 1 if data_column >= 0 else -data_column
    in_table = False
    for line in lines:
        line_fields = re.findall(ITEMS_RE, line)
        if len(line_fields) >= min_fields:
            if in_table:
                yield float(line_fields[data_column])
            elif line_fields[data_column] == data_header:
                in_table = True

# data from the variable:
values = list(iterate_values(input_lines))

# data from a file:
with open('2022-06-23_parsing_text.txt') as input_file:
    values = list(iterate_values(input_file))
Just an interesting note: Dim is short for dimanche, which is the French word for Sunday.
The whole part Dim 21 fév 2021 16:09:29 CET is a time (including the date) in the Central European time zone (UTC+1). It is just a reminder that assumptions not based on real information can lead to wrong code.
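One possible way to avoid depending on the weekday name is to look for the time itself. This is only a sketch: it assumes the marker line always contains an HH:MM:SS time and that no earlier line does, which would need to be checked against the real file. The names `TIME_RE` and `is_timestamp_line` are made up for this example.

```python
import re

# Assumption: the marker line holds a time such as 16:09:29.
# Matching the time avoids depending on the weekday or month name.
TIME_RE = re.compile(r'\b\d{2}:\d{2}:\d{2}\b')

def is_timestamp_line(line):
    """Return True if the line contains an HH:MM:SS time."""
    return TIME_RE.search(line) is not None

print(is_timestamp_line('Dim 21 fév 2021 16:09:29 CET'))  # True
print(is_timestamp_line('Lun 22 fév 2021 08:00:01 CET'))  # True
print(is_timestamp_line('1 some data row'))               # False
```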
Fence your posted code with “backticks” like this (you can copy and paste the [ ``` ] from here):
```
<paste or type code here>
```
The backticks will make the code look like this:
volunteers = 0
print("Posting with a reader-friendly code format...")
volunteers += 1
print(f"...will probably encourage at least {volunteers} person to help me.")
while True:
    volunteers += 1
    print(f"or maybe {volunteers} people...")
You can also use the monospace font inline by using single backticks before and after the text.
Applying the K.I.S.S. principle (“Keep It Super-Simple”)…
…this data is separated by spaces with no meaningful spaces inside the column values, so it can be made to behave like a ‘space-delimited’ text file. After removing the meaningless space characters, we can break each line at the spaces and find the columns that way.
The process is:
1. Read each line from the CSV file.
2. Find the first line of data.
   - As Václav pointed out, the ‘Dim’ is not reliable since it’s a day of the week and is almost certain to change.
   - The numeral '1 ' with a space after it appears to be reliable, but this should be thoroughly investigated.

   The code below assumes that the '1 ' is reliable (‘1’ + <space>). If that turns out to be not reliable, we could assume that the data consistently starts on the 8th line OR look for two ‘:’ that are two characters apart OR a number of any other methods. The programmer needs to decide what method works best for this file format. (Ideally, the file has a firm specification that gives some certainty on how the header is structured, like “data ALWAYS starts on line 8”.)
3. Read each line of data. Remove the padded space from the 2nd Column.
   - Find the location of each string of space characters and pull that column’s data into the corresponding column of a two-dimensional list. BONUS: this is exactly what the Python split() function does!
4. Read the target column with a for: loop and a list[row][col] reference.
NOTE: Step 3 can just read the target column if none of the other column data are needed. The code below reads all columns and is probably more useful.
data_start_marker = '1 '
# readlines() loads the whole file into a list of lines; if the file is too
# large to fit into memory, loop over the file object line by line instead
with open("KikiData.csv", 'r') as csv_file:
    csv_rows = csv_file.readlines()
for line_num, row in enumerate(csv_rows):
    if row.startswith(data_start_marker):  #find the first data line
        data_start = line_num
        break  #stop looping; go to the next line of code after the loop
col_num = 7  # ←←this is the column you asked for (first item in a list is position 0)
data_table = [row.replace(' )', ')') for row in csv_rows]  #remove the padding before ')'
data_table1 = [row.split() for row in data_table[data_start:]]
data_col = [data_table1[i][col_num] for i in range(len(data_table1))]  #collect the column
The code below has print() loops to print the columns vertically AND also has a for: loop that shows what the data_table1 = [row.split() for row… line does.
new_row = []
data_table2 = []
data_start_marker = '1 '
# readlines() loads the whole file into a list of lines; if the file is too
# large to fit into memory, loop over the file object line by line instead
with open("KikiData.csv", 'r') as csv_file:
    csv_rows = csv_file.readlines()
for line_num, row in enumerate(csv_rows):
    if row.startswith(data_start_marker):  #find the first data line
        data_start = line_num
        break  #stop looping; go to the next line of code after the loop
col_num = 7  #this is the column you asked for (first item in a list is position 0)
data_table = [row.replace(' )', ')') for row in csv_rows]  #remove the padding before ')'
data_table1 = [row.split() for row in data_table[data_start:]]
data_col = [data_table1[i][col_num] for i in range(len(data_table1))]  #collect the column
for item in data_col:  #print the column
    print(item)

#THIS LOOP ↓↓↓ DOES THE SAME THING AS 'data_table1 =' ABOVE ↑↑↑ Use the one that is clearest to you.
for row in data_table[data_start:]:  #process the cleaned rows from data_start row to the end of the list
    new_row = row.split()  #break the columns on this row into a list; with no argument, split() breaks at any run of whitespace (unlike string.split(" "))
    data_table2.append(new_row)
    new_row = []
data_col = [data_table2[i][col_num] for i in range(len(data_table2))]  #collect the column
for item in data_col:  #print the column
    print(item)
I am sorry, but we cannot help you if you do not show any details. It is essential to see exactly what you ran and what the exact results were, including any complete error messages.
I have no idea what this means exactly.
The code I sent works exactly how I posted it (for the input data I posted). Just copy, paste both the input_lines initialization and the code. Comment out the file reading example and add print(values). It should work with any supported version of Python (tested in Python 3.10.4).
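For anyone who wants to try that without the original data, here is a minimal, self-contained sketch. The regex and the function are the ones posted above; the `input_lines` content is a made-up stand-in (the real file has more columns and different values), so treat it only as an illustration of how the pieces fit together.

```python
import re

ITEMS_RE = r'''(?x)      # verbose regex
    \S*\([^)]*\)\S*      # sequence with parenthesis can contain whitespace
    | \S+                # or sequence of any non-whitespace characters
'''

def iterate_values(lines, data_column=-2, data_header='Value/cm-1'):
    """Iterate values from text lines."""
    min_fields = data_column + 1 if data_column >= 0 else -data_column
    in_table = False
    for line in lines:
        line_fields = re.findall(ITEMS_RE, line)
        if len(line_fields) >= min_fields:
            if in_table:
                yield float(line_fields[data_column])
            elif line_fields[data_column] == data_header:
                in_table = True

# Hypothetical stand-in for the input_lines variable; the real data differs.
input_lines = [
    'Dim 21 fév 2021 16:09:29 CET',
    'No. Assignment Value/cm-1 Err',
    '1 A(1, 0) 1234.5 0.1',
    '2 B( 2, 1) 2345.25 0.2',
]

values = list(iterate_values(input_lines))
print(values)  # [1234.5, 2345.25]
```

Note that the second data row has spaces inside the parentheses and is still read as four fields, because the first alternative of the regex keeps a parenthesized group together.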
Not really, there is at least one column which contains whitespace characters. Look carefully at the second column. Unfortunately because of this your code will not work.
…but we can probably assume that the columns will be fixed-position. The simple code from Peter is good for that, just change the start condition to a more robust one and use the real column position (instead of the estimated one).
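A rough sketch of that idea follows. It assumes the values sit directly under the 'Value/cm-1' header text and are never wider than it, which may not hold for the real file; in that case use the real start and end positions instead. The sample rows, `make_row` and `fixed_column_values` are invented for this example.

```python
def make_row(number, assignment, value, error):
    """Build one fixed-width sample line (hypothetical layout)."""
    return '{:<4}{:<13}{:<12}{}'.format(number, assignment, value, error)

# Hypothetical sample; the real file layout and column positions may differ.
sample_lines = [
    'Dim 21 fév 2021 16:09:29 CET',
    make_row('No.', 'Assignment', 'Value/cm-1', 'Err'),
    make_row('1', 'A(1, 0)', '1234.5', '0.1'),
    make_row('2', 'B( 2, 1)', '2345.25', '0.2'),
]

def fixed_column_values(lines, header='Value/cm-1'):
    """Yield floats from the character range under the header text."""
    start = None
    for line in lines:
        if start is None:
            # Start condition: the header line, not the weekday name.
            pos = line.find(header)
            if pos >= 0:
                start = pos
                end = pos + len(header)
            continue
        field = line[start:end].strip()
        if field:  # skip blank lines
            yield float(field)

print(list(fixed_column_values(sample_lines)))  # [1234.5, 2345.25]
```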
I did see that and then grabbed the data in my post from the OP’s pasted data to replace the screenshot. HTML collapsed the spaces. Unformatted code strikes again! I’ll add spaces to the test data in my KikiData.csv file and re-paste it in my post.
there is at least one column which contains whitespace characters.
There are several columns with padded spaces because of course the data is actually in column format.
because of this your code will not work.
Au contraire, mon frère. Thanks to split(), [python.doc] the extra whitespace is stripped out.
Built-in Data Types > Strings > split():
If [a separator] is not specified or is None , […] consecutive whitespace are regarded as a single separator
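A quick way to see the difference between the two forms (the sample row is made up):

```python
row = '1   A(1,0)      1234.5'

# No argument: runs of whitespace count as one separator.
print(row.split())     # ['1', 'A(1,0)', '1234.5']

# Explicit ' ' separator: every single space splits, so empty strings appear.
print(row.split(' '))
```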
As long as the data within a column isn’t broken by a space, the split() won’t break up any data fields. The lesson there, as mentioned above, is that knowing the data file structure is extremely important. (Thank you for bringing this up. I’ll edit my post to clarify that it only “behaves like” space-delimited data for the reasons just stated.)
To be proper purists about the column format, the lines can be broken up with a dictionary or list of tuples that defines the column widths, of course.
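A minimal sketch of that approach, using a list of tuples. The column names and character positions here are hypothetical; the real ones would have to be measured from the actual file.

```python
# Hypothetical column layout: (name, start, end) character positions.
COLUMNS = [
    ('number', 0, 4),
    ('assignment', 4, 17),
    ('value', 17, 29),
    ('error', 29, None),  # None means "to the end of the line"
]

def split_fixed(line, columns=COLUMNS):
    """Cut one line into a dict of stripped fields by character position."""
    return {name: line[start:end].strip() for name, start, end in columns}

# Made-up row that follows the layout above.
row = '{:<4}{:<13}{:<12}{}'.format('2', 'B( 2, 1)', '2345.25', '0.2')
fields = split_fixed(row)
print(fields['assignment'])    # B( 2, 1)
print(float(fields['value']))  # 2345.25
```

Because the fields are cut by position, spaces inside a field do no harm.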
Now I understand. You used the incomplete data where there are no spaces inside the second column. See the data in the post before that, in the screenshot or in my post (collapsed) as I used them in my code.
non, pas les espaces dont je parlais… You are referring to the whitespace characters between the fields. Yes, str.split() takes care of them. I was referring to the whitespace characters inside the field of the second column.
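To illustrate the difference with two made-up rows: whitespace inside a field changes the number of items that split() returns, so a fixed index counted from the left no longer points at the same column.

```python
row_a = '1 A(1,0) 1234.5 0.1'
row_b = '2 B( 2, 1) 2345.25 0.2'  # spaces inside the parentheses

fields_a = row_a.split()
fields_b = row_b.split()

print(len(fields_a), len(fields_b))  # 4 6
print(fields_a[2], fields_b[2])      # 1234.5 2,

# Counting from the right still works here, because the extra pieces
# all come from a column to the left of the value.
print(fields_a[-2], fields_b[-2])    # 1234.5 2345.25
```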
C’est la vérité !! I did not click to open the screenshot for a clear view and then only copied the first three lines of data because the last few lines in your post were behind my editor, so it looked consistent. Good catch, Václav!
Posted enough of the data set to include those spaces, reformatted to match the screenshot.
Added a simple list comprehension to remove the padding before all right parentheses ‘)’.
Since the extra padding in the 2nd column is only to make the right parentheses line up for human readability, it would need to be removed eventually for performing any data processing on that column; might as well do it now.
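If the padding can be more than one character wide, a regular expression might be a more forgiving alternative to a plain replace(). This is only a sketch with a made-up row, and `strip_paren_padding` is a name invented for the example.

```python
import re

def strip_paren_padding(row):
    """Remove any run of spaces or tabs directly before a ')'."""
    return re.sub(r'[ \t]+\)', ')', row)

print(strip_paren_padding('2   B(2,1   )   2345.25'))  # 2   B(2,1)   2345.25
```

This only handles padding in front of the closing parenthesis; spaces after the opening parenthesis or after a comma would still split the field.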
At this rate, I’ll be posting another version that divides the lines by column.
That is OK; we can explain the parts that are difficult for you. We just do not know what is difficult for you.
You wrote that you were not able to make the code work, so show us exactly what you are running, what the results are, and how they differ from what you would like to get.
I think the following code does what you want. I put it together here as I described earlier. Just paste it into a text file, e.g. parse_column.py, and run it: python3 parse_column.py