I’m currently using this to extract the main article from an HTML file:
cat BruceLee.html | trafilatura >> BruceLee.txt
trafilatura is excellent, BTW.
I’d like to do this in batch for all HTML files in a nested directory structure.
I know how to use bash to convert all PDFs in a nested directory to TXT, but I don’t know how to do this with Python. Or can I execute Python from within bash?
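Since trafilatura is already being run as an ordinary command in the one-liner above, a plain shell loop may be all that is needed. Here is a rough sketch of what that might look like, assuming the `trafilatura` command reads HTML on stdin and writes text to stdout exactly as it does above. The function name `batch_extract` and its optional second argument are made up for illustration and are not part of trafilatura.

```bash
# Hypothetical helper: write a .txt file next to every .html file under a directory.
# Usage: batch_extract ROOT_DIR [EXTRACTOR_COMMAND]
batch_extract() {
    root=$1
    # Assumption: the extractor reads HTML on stdin and prints text on stdout.
    extractor=${2:-trafilatura}
    # find passes the matching paths to an inner shell as arguments,
    # so file names containing spaces are handled safely.
    find "$root" -type f -name '*.html' -exec sh -c '
        extractor=$1
        shift
        for html in "$@"; do
            # BruceLee.html -> BruceLee.txt, written in the same directory
            "$extractor" < "$html" > "${html%.html}.txt"
        done
    ' sh "$extractor" {} +
}
```

Something like `batch_extract ~/articles` would then walk the whole tree. Note that this sketch uses `>` and so overwrites an existing TXT file, whereas the `>>` in the one-liner appends to it. The second argument is only there so the loop can be tried out with a stand-in command such as `cat` before running the real extraction.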
Any suggestions?
Thanks in advance.