Preferred data type for text analysis

I am using Selenium to download press releases from companies, which I would like to analyse in a second step. The analyses include, for example:

  • Search each press release for the words “Source”, “Contact” or “About”.
  • Check whether the company name appears in the first X characters after the words “Source” or “Contact”.

The website lets me save each press release as a separate file in PDF, DOCX or RTF format. I am completely new to text analysis with Python and was wondering whether I should prefer one of these formats for the analysis I have in mind. I would also welcome any recommendations for Python libraries for this kind of analysis.

RTF is the simplest of those file formats: it is mostly a regular text document with markup commands layered on top. You’d probably be able to parse a lot of it with no libraries at all, while PDF and DOCX would definitely need one to make sense of them.
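
To illustrate the "no libraries" route, here is a rough sketch using only Python's standard `re` module. It is not a real RTF parser: the function name `rtf_to_text` and the sample string are made up for illustration, hex escapes are assumed to be Windows-1252, and text inside groups such as font or colour tables would leak into the output, so a real press release will probably need extra cleanup.

```python
import re

# One pattern covering the few RTF tokens this sketch understands.
_TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"      # group 1: hex-escaped byte, e.g. \'e9
    r"|\\([a-zA-Z]+)-?\d* ?"    # group 2: control word, e.g. \par or \fs24
    r"|\\([\\{}])"              # group 3: escaped backslash or brace
    r"|[{}\r\n]"                # group delimiters and raw line breaks
)


def rtf_to_text(rtf: str) -> str:
    """Crudely strip RTF markup (hypothetical helper, simple files only)."""

    def replace(match: re.Match) -> str:
        hex_byte, word, escaped = match.groups()
        if hex_byte:
            # Assumption: the document uses the Windows-1252 code page.
            return bytes.fromhex(hex_byte).decode("cp1252", errors="replace")
        if word in ("par", "line"):
            return "\n"
        if word == "tab":
            return "\t"
        if escaped:
            return escaped
        return ""  # drop every other control word, brace and raw newline

    return _TOKEN.sub(replace, rtf).strip()


# Made-up sample, not taken from a real press release.
sample = r"{\rtf1\ansi {\b Source:} Acme Corp\par Contact: Jane Doe, caf\'e9\par}"
print(rtf_to_text(sample))
```

If the output of something like this turns out too noisy on the real files, that is a sign a dedicated RTF library is the better choice.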

Have a look at the RTF and see if you can spot an easy pattern for getting the text you want.
I do recall that RTF is a complex markup.
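
Whichever format is used, once the press release is available as a plain string, the two checks from the question could be sketched roughly like this. The function names, the default window of 100 characters (the "X" in the question) and the sample text are all assumptions for illustration; matching is whole-word and case-insensitive, which may or may not be what is wanted.

```python
import re

KEYWORDS = ("Source", "Contact", "About")


def find_keywords(text, keywords=KEYWORDS):
    """Return {keyword: [start offsets]} for whole-word, case-insensitive hits."""
    hits = {}
    for kw in keywords:
        pattern = re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
        hits[kw] = [m.start() for m in pattern.finditer(text)]
    return hits


def company_near_keyword(text, company, keywords=("Source", "Contact"), window=100):
    """True if `company` occurs within `window` characters after any keyword."""
    company = company.lower()
    for kw in keywords:
        for m in re.finditer(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE):
            # Slice the text that immediately follows the keyword.
            following = text[m.end(): m.end() + window].lower()
            if company in following:
                return True
    return False


# Made-up sample text for illustration only.
text = "Acme Corp announces results.\nSource: Acme Corp\nContact: Jane Doe"
print(find_keywords(text))
print(company_near_keyword(text, "Acme Corp"))
```

A plain substring test like this will miss spelling variants of the company name (for example "Acme Corporation" versus "Acme Corp."), so some normalisation may be needed on real data.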


Thank you very much for the helpful answers, @TeamSpen210 and @barry-scott