How to extract main article from HTML files in directory structure

ericlindellnyc · September 20, 2021, 3:06pm

I’m currently using this to extract the main article from an HTML file.

cat BruceLee.html | trafilatura >> BruceLee.txt

trafilatura is excellent, BTW.
I’d like to do this in batch for all HTML files in a nested directory structure.

I know how to use bash to convert all PDFs in a nested directory to TXT, but I don’t know how to do this with Python. Or can I execute Python from within bash?

Any suggestions?
Thanks in advance.

Milton_Mobley · September 28, 2021, 7:07pm

You can execute bash commands from within Python. See the docs for Python's os.system function.
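As a rough sketch of what that could look like for the batch case: the function below walks a directory tree, pipes each .html file into the trafilatura command, and writes the output to a .txt file alongside it. It uses subprocess.run from the standard library instead of os.system, since it makes redirecting stdin and stdout easier. The name convert_tree and its command parameter are made up for this example, and it assumes the trafilatura command is on your PATH and reads HTML from stdin as in the one-liner above. Note that it overwrites each .txt file rather than appending with >>.

```python
import subprocess
from pathlib import Path


def convert_tree(root, command=("trafilatura",)):
    """Run `command` on every .html file under `root`.

    Each file is fed to the command on stdin and the command's stdout
    is written to a .txt file with the same name, in the same folder.
    Returns the list of .txt paths that were written successfully.
    """
    written = []
    # rglob descends into all nested subdirectories.
    for html_path in sorted(Path(root).rglob("*.html")):
        txt_path = html_path.with_suffix(".txt")
        # Rough equivalent of: cat file.html | trafilatura > file.txt
        with html_path.open("rb") as src, txt_path.open("wb") as dst:
            result = subprocess.run(list(command), stdin=src, stdout=dst)
        if result.returncode != 0:
            # The .txt file may be empty or partial in this case.
            print(f"failed: {html_path}")
            continue
        written.append(txt_path)
    return written


# Example call (hypothetical folder name):
# convert_tree("my_html_folder")
```

trafilatura can also be imported as a Python library (trafilatura.extract), which would avoid starting a new process for every file.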