Preferred data type for text analysis

Flammers · October 10, 2022, 7:53am

I am using Selenium to download press releases from companies, which I would like to analyze in a second step. The analyses include, for example:

Search each press release for the words “Source”, “Contact” or “About”.
Examine whether the company name is found in the first X characters after the words “Source” or “Contact”.

The website lets me save each press release as a separate file in PDF, DOCX or RTF format. I am completely new to text analysis with Python and was wondering whether I should prefer one of these formats for the analysis I have in mind. I would also welcome any recommendations of Python libraries for this kind of analysis.
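
Once a press release is available as plain text, the two checks described above can be sketched with the standard `re` module. This is only a minimal sketch under assumptions: the function names, the `window` default of 100 characters (standing in for "X"), the case-insensitive company match, and the sample text are all made up for illustration.

```python
import re

# Keywords from the question; matched as whole words, case-sensitively.
KEYWORDS = ("Source", "Contact", "About")


def find_keywords(text, keywords=KEYWORDS):
    """Return {keyword: [start offsets]} for every whole-word match."""
    return {
        kw: [m.start() for m in re.finditer(rf"\b{re.escape(kw)}\b", text)]
        for kw in keywords
    }


def company_near_keyword(text, company, keywords=("Source", "Contact"), window=100):
    """Return True if `company` appears within `window` characters
    after any occurrence of one of the keywords.

    The company comparison ignores case (an assumption, not a requirement
    stated in the question).
    """
    for kw in keywords:
        for m in re.finditer(rf"\b{re.escape(kw)}\b", text):
            following = text[m.end():m.end() + window]
            if company.lower() in following.lower():
                return True
    return False


# Hypothetical sample text, not taken from a real press release.
sample = "Acme Corp announces results. Source: Acme Corp. Contact: Jane Doe, press office."

hits = find_keywords(sample)
near = company_near_keyword(sample, "Acme Corp", window=20)
```

Here `hits` maps each keyword to the positions where it occurs (an empty list if it never does), and `near` tells whether the company name follows "Source" or "Contact" closely enough.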

TeamSpen210 · October 10, 2022, 1:20pm

RTF is the simplest of those file formats: it is mostly a regular text document with markup commands layered on top. You could probably parse a lot of it with no libraries at all, while the other two would definitely need a library to make sense of them.
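
To illustrate the "no libraries" idea, here is a rough sketch that strips RTF control words with regular expressions only. It is deliberately naive and rests on assumptions: the function name `rtf_to_text_naive` is made up, hex escapes are assumed to be Windows-1252, and header groups such as font or colour tables are not skipped, so their contents would leak into the output as stray text. Escaped braces and backslashes in the body are not handled either. It may still be good enough for spotting keywords.

```python
import re


def rtf_to_text_naive(rtf):
    """Very rough RTF-to-text conversion using only the standard library.

    Limitations: does not skip header groups (font table, colour table,
    and so on) and does not handle escaped braces or backslashes.
    """
    # Paragraph and line-break control words become newlines.
    text = re.sub(r"\\(par|line)\b ?", "\n", rtf)
    # Hex escapes such as \'e9 become characters (cp1252 is an assumption).
    text = re.sub(
        r"\\'([0-9a-fA-F]{2})",
        lambda m: bytes.fromhex(m.group(1)).decode("cp1252", errors="replace"),
        text,
    )
    # Remaining control words: backslash, letters, optional numeric
    # parameter, and one optional delimiting space.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", text)
    # Group braces carry no text of their own.
    text = re.sub(r"[{}]", "", text)
    return text.strip()


# Hypothetical minimal RTF document for illustration.
sample_rtf = r"{\rtf1\ansi\fs24 Source: Acme Corp\par Contact: Jane Doe\par}"
plain = rtf_to_text_naive(sample_rtf)
```

For real press releases it is worth opening one RTF file in a text editor first to see how much header material and formatting it contains before relying on a shortcut like this.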

barry-scott · October 10, 2022, 7:39pm

Have a look at the RTF and see if you can find an easy-to-spot pattern that gets you the text you want.
I do recall that RTF is a complex markup.

Flammers · October 11, 2022, 9:50am

Thank you very much for the helpful answers, @TeamSpen210 and @barry-scott