I used OCR on thousands of PDF files to create a CSV dataset of parliamentary speeches. The dataset has two columns: one contains the text of the speeches and the other contains the names of the speakers who gave them.
The problem is the following: sometimes the OCR merged two speeches together. In particular, the name of speaker B and his speech end up inside the text of speaker A's speech (in the speeches column).
Now, given that the name of a speaker is always entirely in capital letters, is there a method to fix this in Python? For instance, is there a way to tell Python that, wherever there is a series of words entirely in capital letters inside the text of a speech, it should take those words and put them in the speakers column, take the words that follow and put them in the corresponding speeches column, and create a new row?
Yes, it can be done. How complex (or not) this task would be depends on how consistent the data is. It seems (from what you've posted) that every speaker name is followed by a space (ASCII 32) and a tab (ASCII 9), which is a little odd, considering you describe this as a 'CSV' file. Do you mean a 'TSV' file?
Although this would be an interesting exercise in Python coding, you could simply use the 'find and replace' function that's built into any half-decent text editor.
Hi, unfortunately it is not that consistent, as it is the result of OCR reading old scanned documents from the late 1940s. Furthermore, the reason I am avoiding the find-and-replace function is that 1) there are thousands of those files, so it could take decades, and 2) there is still the task of creating a new row and copying and pasting the text that follows the 'NAME SURNAME', which makes the task quite lengthy.
Okay, well that's going to be a point of failure.
This is a bit of a hack-up and I'm sure that others here can suggest some improvements, but as a first draft…
# a sample string from the OP
string = 'SPEAKER ALPHA Lorem ipsum. SPEAKER BETA dolor sit amet SPEAKER GAMMA Nunc tincidunt tincidunt erat'

# new string to hold a clean version: replace any space-plus-tab pairs with a single space
nstring = string.replace(' \t', ' ')

# two lists to hold the speaker names and the speeches made
speaker = []
speech = []

# two variables used to construct the data for the list objects
speaker_name = ''
speech_text = ''

# split nstring into words
words = nstring.split(' ')

# build a list of speakers
for word in words:
    if word.isupper():
        speaker_name += word + ' '
    elif speaker_name:
        speaker.append(speaker_name.strip())
        speaker_name = ''

# build a list of speeches
for word in words:
    if not word.isupper():
        speech_text += word + ' '
    elif speech_text:
        speech.append(speech_text.strip())
        speech_text = ''
speech.append(speech_text.strip())

# output the results
for item in range(len(speaker)):
    print(speaker[item], speech[item])
{edit made for a couple of minor changes: these things keep me from sleeping}
import re

input_lines = ['SPEAKER ALPHA Lorem ipsum. SPEAKER BETA dolor sit amet SPEAKER GAMMA Nunc tincidunt tincidunt erat']

def split_speaker_lines(lines):
    for line in lines:
        split = re.split(r'([A-Z]{2,}(?:\s+[A-Z]{2,})*)', line)[1:]
        yield from zip(split[::2], split[1::2])

for line in split_speaker_lines(input_lines):
    print(line)
It finds the speaker as a sequence of at least two capital letters, [A-Z]{2,}, optionally followed by whitespace and repetitions of the same: (?:\s+[A-Z]{2,})*. If this definition is not robust enough, it can be improved.
Possible text on the line before the first speaker name is thrown away by the [1:]. Change it if it should behave differently.
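For what it's worth, here is one possible variant that can keep that leading text instead of discarding it; the keep_preamble flag and the empty-string speaker for the preamble are my own additions, not part of the original snippet:

```python
import re

# same pattern as above: a run of two or more capitals,
# optionally followed by more all-capital words
SPEAKER = re.compile(r'([A-Z]{2,}(?:\s+[A-Z]{2,})*)')

def split_speaker_lines(lines, keep_preamble=False):
    for line in lines:
        parts = SPEAKER.split(line)
        preamble, parts = parts[0], parts[1:]
        if keep_preamble and preamble.strip():
            # emit any leading text with an empty speaker name
            yield ('', preamble.strip())
        for name, text in zip(parts[::2], parts[1::2]):
            yield (name.strip(), text.strip())
```

With keep_preamble=False this behaves like the original; with True, text before the first all-capitals name comes out as an extra pair with an empty speaker field.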
@Ercole
As you can see, there are a couple of solutions here, giving you the option to run a test and see if either of them will be of any use to you.
The output from the solution that I put together can be modified to write the data to a file rather than (as it is now) to the screen.
For file output, I would not use the CSV format, because you may run into issues with the 'speech' containing commas if said speech has been correctly punctuated; rather, use the TSV format (tab-separated values), the mod to my script being:
# output the results
print('Speaker' + '\t' + 'Speech')
for item in range(len(speaker)):
    print(speaker[item] + '\t' + speech[item])
As I've alluded to, it would be very easy to have the output written to a TSV file.
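As a minimal sketch of that, the standard csv module accepts a tab delimiter; the speeches.tsv filename and the sample data here are made up:

```python
import csv

# sample data standing in for the lists built earlier
speaker = ['SPEAKER ALPHA', 'SPEAKER BETA']
speech = ['Lorem ipsum.', 'dolor sit amet']

# write a header row, then one speaker/speech pair per line,
# separated by tabs rather than commas
with open('speeches.tsv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['Speaker', 'Speech'])
    writer.writerows(zip(speaker, speech))
```

Using csv.writer (rather than hand-joining with '\t') also takes care of quoting, should a speech ever contain a literal tab.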
I can see that you've made some progress with this; nice.
I don't know that you need five different packages in order to achieve your goal, but then again, you've done far more work on this than I have, so maybe you do.
Myself, I've not used any of the packages that you've chosen, and as such I'm in no position to either judge or help, but I'm sure that other members here are in a better position. If I feel that I can add value, I'll certainly try.
edit to add:
The only observations I can offer are
Check that the filename is complete (as in, the full path) and correct; an easy attribute to check with a simple print(filename) call. You could also check the time/date stamp of the file and see if it's current.
Check that the df does in fact contain data.
That aside, I don't know what else to suggest.
It may be helpful to others if you say what you have already done/checked, so that any suggestions made are not rebutted with 'Yes, I've already checked that.', which is not going to be of any help to anyone.
I do not have (significant) experience with the libraries used either.
In addition to what Rob wrote:
I almost never write a piece of code longer than about 10 lines without some kind of testing. What I do:
write a part of code
add suitable diagnostic output (for example print() calls) or in bigger projects tests (I use pytest)
check whether it works as intended; if not, I make changes and check again
if it works, I continue with the following parts of the code
Some parts of the code I test separately in the interactive Python interpreter (just run python3 or py without parameters) or in a Jupyter notebook.
So, as Rob already suggested: add diagnostic prints to your code.
What I noticed in your code:
Here you always replace the content that df points to. So when the loop finishes, df will contain just the content of the last file.
The last two statements do not change across loop iterations, so there is no point in executing them repeatedly inside the loop. If you need them in their current form, just move them out of the loop by unindenting them.
for filename in filenames:
    df = pd.read_csv(filename)
    df2 = pd.read_csv('C:/Users/~~~/names.csv')
    col_names = ['legislature', 'session', 'date', 'speech_order', 'speaker', 'chair', 'position', 'speech', 'terms']
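One common way to fix the overwriting is to collect each file's frame in a list and concatenate once after the loop. A sketch, assuming plain pandas; load_all and its read_csv parameter are my own names, not from your code:

```python
import pandas as pd

def load_all(filenames, read_csv=pd.read_csv):
    """Read every file and stack the rows into one DataFrame,
    instead of overwriting df on each loop iteration."""
    frames = [read_csv(name) for name in filenames]
    return pd.concat(frames, ignore_index=True)
```

After the call, the returned frame holds the rows of all the files, and anything that does not depend on the current file (like reading names.csv or defining col_names) can live outside the loop entirely.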
The convention in Python is to use uppercase identifiers for constants, but these are not constants, which makes the code very confusing. Do not be afraid of using longer_descriptive_names.
NAMES = []
INDICES = []
The large number of empty lines makes the code hard to read. For example, here it is hard to see that the two statements are inside the if branch. Normally we do not insert empty lines in such contexts. Empty lines are sometimes used to separate parts of the code that perform separate sub-tasks; for example, we can separate file reading, data processing, and file writing.
BTW, upperflag should rather be upper_flag per the Python naming convention. Also, if it can only contain bool values (True or False), just use this equivalent and easier-to-read code: if upper_flag:
Here you write to just the last filename from the for loop, which is far back at the beginning of your program. …and do you really want to overwrite the original file?
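To avoid clobbering the inputs, one option is to derive a separate output path for every input file; output_path and the 'processed' directory name here are hypothetical:

```python
import os

def output_path(filename, out_dir='processed'):
    # hypothetical helper: keep the original file untouched and
    # write the processed version under a separate directory,
    # reusing the input's base name
    os.makedirs(out_dir, exist_ok=True)
    return os.path.join(out_dir, os.path.basename(filename))
```

Inside the loop you would then write to output_path(filename) once per file, rather than to a single name computed before the loop.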