I used OCR on thousands of PDF files to create a CSV dataset of parliamentary speeches. The dataset has two columns: one contains the text of the speeches and the other contains the names of the speakers who gave them.
The problem is the following: sometimes the OCR merged two speeches together. In particular, the name of speaker B and his speech end up inside the text of speaker A's speech (in the speeches column).
Now, given that the name of a speaker is always entirely in capital letters, is there a method to fix this in Python? For instance, is there a way to tell Python that, wherever there is a series of words entirely in capital letters inside the text of a speech, it should take those words and put them in the speakers column, take the words that follow and put them in the corresponding speeches column, and create a new row?
Yes, it can be done. How complex (or not) this task would be depends on how consistent the data is. It seems (from what you've posted) that every speaker name is followed by a space (ASCII 32) and a tab (ASCII 9), which is a little odd, considering you describe this as a 'CSV' file. Do you mean a 'TSV' file?
Although this would be an interesting exercise in Python coding, you could simply use the 'find and replace' function that's built into any half-decent text editor.
Hi, unfortunately it is not that consistent, as it is the result of OCR reading old scanned documents from the late 1940s. Furthermore, the reason I am avoiding the find-and-replace function is that 1) there are thousands of those files, so it could take decades, and 2) there is still the task of creating a new row and copying and pasting the text that follows the 'NAME SURNAME', which makes the task quite lengthy.
Okay, well that's going to be a point of failure.
This is a bit of a hack-up and I'm sure that others here can suggest some improvements, but as a first draft…
# a sample string from the OP
string = 'SPEAKER ALPHA Lorem ipsum. SPEAKER BETA dolor sit amet SPEAKER GAMMA Nunc tincidunt tincidunt erat'

# new string to hold a clean version: replace any space-plus-tab pairs with a single space
nstring = string.replace(' \t', ' ')

# two lists to hold the speaker names and the speeches made
speaker = []
speech = []

# two variables used to construct the data for the list objects
speaker_name = ''
speech_text = ''

# split nstring into words
words = nstring.split(' ')

# build a list of speakers
for word in words:
    if word.isupper():
        speaker_name += word + ' '
    elif speaker_name:
        speaker.append(speaker_name.strip())
        speaker_name = ''

# build a list of speeches
for word in words:
    if not word.isupper():
        speech_text += word + ' '
    elif speech_text:
        speech.append(speech_text.strip())
        speech_text = ''
speech.append(speech_text.strip())

# output the results
for item in range(len(speaker)):
    print(speaker[item], speech[item])
{edit made for a couple of minor changes: these things keep me from sleeping}
import re

input_lines = ['SPEAKER ALPHA Lorem ipsum. SPEAKER BETA dolor sit amet SPEAKER GAMMA Nunc tincidunt tincidunt erat']

def split_speaker_lines(lines):
    for line in lines:
        split = re.split(r'([A-Z]{2,}(?:\s+[A-Z]{2,})*)', line)[1:]
        yield from zip(split[::2], split[1::2])

for line in split_speaker_lines(input_lines):
    print(line)
It finds the speaker as a sequence of at least two capital letters, [A-Z]{2,}, optionally followed by whitespace and repetitions of the same: (?:\s+[A-Z]{2,})*. If this definition is not robust enough, it can be improved.
Possible text on the line before the first speaker name is thrown away by the [1:]. Change it if it should behave differently.
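For what it's worth, here is one possible variant that can keep that leading text instead of discarding it; the keep_preamble flag and the empty-string speaker for the preamble are my own additions, not part of the original snippet:

```python
import re

# same pattern as above: a run of two or more capitals,
# optionally followed by more all-capital words
SPEAKER = re.compile(r'([A-Z]{2,}(?:\s+[A-Z]{2,})*)')

def split_speaker_lines(lines, keep_preamble=False):
    for line in lines:
        parts = SPEAKER.split(line)
        preamble, parts = parts[0], parts[1:]
        if keep_preamble and preamble.strip():
            # emit any leading text with an empty speaker name
            yield ('', preamble.strip())
        for name, text in zip(parts[::2], parts[1::2]):
            yield (name.strip(), text.strip())
```

With keep_preamble=False this behaves like the original; with True, text before the first all-capitals name comes out as an extra pair with an empty speaker field.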
@Ercole
As you can see, there are a couple of solutions here, giving you the option to run a test and see if either of them will be of any use to you.
The output from the solution that I put together can be modified to write the data to a file rather than (as it is now) to the screen.
For file output, I would not use the CSV format, because you may run into issues with the 'speech' containing commas if said speech has been correctly punctuated; rather, use the TSV format (tab-separated values), the mod to my script being:
# output the results
print('Speaker' + '\t' + 'Speech')
for item in range(len(speaker)):
    print(speaker[item] + '\t' + speech[item])
As I've alluded to, it would be very easy to have the output written to a TSV file.
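As a minimal sketch of that, the standard csv module accepts a tab delimiter; the speeches.tsv filename and the sample data here are made up:

```python
import csv

# sample data standing in for the lists built earlier
speaker = ['SPEAKER ALPHA', 'SPEAKER BETA']
speech = ['Lorem ipsum.', 'dolor sit amet']

# write a header row, then one speaker/speech pair per line,
# separated by tabs rather than commas
with open('speeches.tsv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['Speaker', 'Speech'])
    writer.writerows(zip(speaker, speech))
```

Using csv.writer (rather than hand-joining with '\t') also takes care of quoting, should a speech ever contain a literal tab.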
I can see that you've made some progress with this; nice.
I don't know that you need five different packages in order to achieve your goal, but then again, you've done far more work on this than I have, so maybe you do.
Myself, I've not used any of the packages that you've chosen, and as such I'm in no position to either judge or help, but I'm sure that other members here are in a better position. If I feel that I can add value, I'll certainly try.
edit to add:
The only observations I can offer are
Check that the filename is complete (as in, the full path) and correct; an easy attribute to check with a simple print(filename) call. You could also check the time/date stamp of the file and see if it's current.
Check that the df does in fact contain data.
That aside, I don't know what else to suggest.
It may be helpful to others if you say what you have already done/checked, so that any suggestions made are not rebutted with 'Yes, I've already checked that.', which is not going to be of any help to anyone.
I do not have (significant) experience with the libraries used either.
In addition to what Rob wrote:
I almost never write a piece of code longer than about 10 lines without some kind of testing. What I do:
write a part of code
add suitable diagnostic output (for example print() calls) or in bigger projects tests (I use pytest)
check whether it works as intended; if not, I make changes and check again
if it works, I continue with the following parts of the code
Some parts of the code I test separately in the interactive Python interpreter (just run python3 or py without parameters) or in a Jupyter notebook.
So, as Rob already suggested: add diagnostic prints to your code.
What I noticed in your code:
Here you always replace the content that df points to. So when the loop finishes, df will contain just the content of the last file.
The last two statements do not change across loop iterations, so there is no point in executing them repeatedly inside the loop. If you need them in their current form, just move them out of the loop by unindenting them.
for filename in filenames:
    df = pd.read_csv(filename)
    df2 = pd.read_csv('C:/Users/~~~/names.csv')
    col_names = ['legislature', 'session', 'date', 'speech_order', 'speaker', 'chair', 'position', 'speech', 'terms']
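One common way to fix the overwriting is to collect each file's frame in a list and concatenate once after the loop. A sketch, assuming plain pandas; load_all and its read_csv parameter are my own names, not from your code:

```python
import pandas as pd

def load_all(filenames, read_csv=pd.read_csv):
    """Read every file and stack the rows into one DataFrame,
    instead of overwriting df on each loop iteration."""
    frames = [read_csv(name) for name in filenames]
    return pd.concat(frames, ignore_index=True)
```

After the call, the returned frame holds the rows of all the files, and anything that does not depend on the current file (like reading names.csv or defining col_names) can live outside the loop entirely.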
The convention in Python is to use uppercase identifiers for constants, but these are not constants, which makes the code very confusing. Do not be afraid of using longer_descriptive_names.
NAMES = []
INDICES = []
The large number of empty lines makes the code hard to read. For example, here it is hard to see that the two statements are inside the if branch. Normally we do not insert empty lines in such contexts. Empty lines are sometimes used to separate parts of the code that perform separate sub-tasks; for example, we can separate file reading, data processing, and file writing.
BTW, upperflag should rather be upper_flag per the Python naming convention. Also, if it can only contain bool values (True or False), just use this equivalent and easier-to-read code: if upper_flag:
Here you write to just the last filename from the for loop, which is far back at the beginning of your program. …and do you really want to overwrite the original file?
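To avoid clobbering the inputs, one option is to derive a separate output path for every input file; output_path and the 'processed' directory name here are hypothetical:

```python
import os

def output_path(filename, out_dir='processed'):
    # hypothetical helper: keep the original file untouched and
    # write the processed version under a separate directory,
    # reusing the input's base name
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, os.path.basename(filename))
```

Inside the loop you would then write to output_path(filename) once per file, rather than to a single name computed before the loop.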