I am using Selenium to download press releases from companies, which I would like to analyze in a second step. The analyses include, for example:
- Search each press release for the words “Source”, “Contact” or “About”.
- Examine whether the company name is found in the first X characters after the words “Source” or “Contact”.
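
To make the two checks concrete, here is a minimal sketch of what I have in mind, assuming the text of a press release has already been extracted from the file into a plain Python string. The function names, the sample text, and the default window of 100 characters are placeholders I made up for illustration. Matching keywords as whole words and comparing the company name case-insensitively are also my own assumptions.

```python
import re

KEYWORDS = ("Source", "Contact", "About")


def find_keywords(text, keywords=KEYWORDS):
    """Return {keyword: [start offsets]} for whole-word, case-sensitive matches."""
    hits = {}
    for kw in keywords:
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b")
        hits[kw] = [m.start() for m in pattern.finditer(text)]
    return hits


def company_near_keyword(text, company, keywords=("Source", "Contact"), window=100):
    """Return True if `company` occurs within `window` characters after any keyword."""
    needle = company.lower()
    for kw in keywords:
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b")
        for m in pattern.finditer(text):
            # Slice the X characters that follow the keyword (X = window).
            following = text[m.end():m.end() + window]
            if needle in following.lower():
                return True
    return False


# Hypothetical sample text, standing in for an extracted press release.
sample = (
    "Acme Corp announces results. "
    "Source: Acme Corp. Contact: Jane Doe, jane@example.com"
)

hits = find_keywords(sample)
found = company_near_keyword(sample, "Acme Corp", window=20)
```

Since this only needs a plain string, the choice of file format would mainly affect how reliably I can get that string out of the file.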
The website lets me save each press release as a separate file in PDF, DOCX, or RTF format. I am completely new to text analysis with Python and was wondering whether I should prefer one of these formats over the others for the analysis I have in mind. I would also welcome any recommendations for Python libraries suited to this kind of analysis.