I need to parse HTML files with the Python 3.8 xml package. This must be possible because some of the xml.etree.ElementTree methods have parameters that take "html" as a value, but I can’t find an example of how it’s done.
I get an exception when I try to parse the HTML file:
htmlRoot = etree.ElementTree.parse(filepathname).getroot()
The parser throws an “undefined entity” exception when it encounters an HTML entity. I assume this is because HTML entities are predefined, while XML entities are not.
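For reference, here is a minimal sketch that reproduces the failure. It assumes a made-up sample string in place of my actual file; the names sample and try_parse are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the real file.
# &nbsp; is predefined in HTML but is not one of XML's built-in entities.
sample = "<html><body><p>Hello&nbsp;world</p></body></html>"


def try_parse(markup):
    """Return the root element, or the ParseError message if parsing fails."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)


result = try_parse(sample)
print(result)  # a ParseError message that mentions an undefined entity

# By contrast, the entities XML does predefine (such as &amp;) parse fine.
ok = try_parse("<p>a &amp; b</p>")
print(ok.text)
```

So the default parser seems to handle only XML's own predefined entities and rejects everything else that HTML takes for granted.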
As the statement shows, I’m using the default parser. Maybe there’s an HTML parser but I haven’t found one. I’m not even sure there are other parsers; maybe I’m expected to roll my own.
I don’t want to use Python’s html package because I need to walk a complete parsed tree like xml.etree provides. The html package doesn’t work that way.
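To illustrate what I mean, here is a rough sketch of how html.parser behaves as far as I can tell: it fires callbacks as it scans the markup instead of returning a tree. The EventLogger class and the sample string are hypothetical, written only to show the shape of the API.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records the callbacks HTMLParser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello&nbsp;<b>world</b></p>")
logger.close()

# The result is a flat sequence of events, not a tree of elements.
for event in logger.events:
    print(event)
```

It does accept HTML entities without complaint, but I would have to assemble the parent/child structure from these events myself.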
I’ve found examples of parsing HTML with the lxml package, but lxml isn’t part of the standard Python configuration. That would be a problem for coworkers who don’t know Python and need a “plug and play” application.