Parsing HTML with the XML module

Jonathan-Sachs-NVIDIA · November 16, 2021, 7:25pm

I need to parse HTML files with the Python 3.8 xml package. This must be possible because some of the xml.etree.ElementTree methods have parameters that take "xml" or "html" as a value, but I can’t find an example of how it’s done.

I get an exception when I try to parse the HTML file:

htmlRoot = etree.ElementTree.parse(filepathname).getroot()

The parser throws an “undefined entity” exception when it encounters an HTML entity. I assume this is because HTML entities are predefined, while XML entities are not.

As the statement shows, I’m using the default parser. Maybe there’s an HTML parser but I haven’t found one. I’m not even sure there are other parsers; maybe I’m expected to roll my own.

I don’t want to use Python’s html package because I need to walk a complete parsed tree like xml.etree provides. The html package doesn’t work that way.

I’ve found examples of parsing HTML with the lxml package, but lxml isn’t part of the standard Python configuration. That’s would be a problem for coworkers who don’t know Python and need a “plug and play” application.

MartinPacker · November 17, 2021, 8:21am

Note:

HTML isn’t necessarily XML.
There could be a namespace issue.
What’s wrong with Beautiful Soup?

tjol · November 17, 2021, 12:21pm

The only functions in xml.etree.ElementTree that take a method="html" argument are functions for writing markup, not for reading markup.

It might be possible to feed HTML into ElementTree by providing a custom clone or subclass of xml.etree.ElementTree.XMLParser, but that really doesn’t sound like something most people would want to do, especially as BeautifulSoup exists.

Jonathan-Sachs-NVIDIA · November 17, 2021, 2:06pm

So what would you do in a case like this? As far as I know, Python’s html module cannot be used to build an element tree, so I would have to reinvent the wheel (the element tree) to make it do what I want.

Part of my original problem statement seems to have been overlooked, so I’ll repeat it: “…lxml isn’t part of the standard Python configuration. That would be a problem for coworkers who don’t know Python and need a “plug and play” application.” The same would be true of Beautiful Soup.

As a side note, Beautiful Soup would give me no advantage over lxml, because the HTML I’m processing is program-generated, and will never have the kind of faulty syntax that Beautiful Soup is designed to salvage.

tjol · November 17, 2021, 3:26pm

Python comes with ‘batteries included’, but there will always be many tasks that are cumbersome using just the standard library. This appears to be one of them.

I’d just use a third-party module. Such as lxml, or whatever. And tell your users how to install the dependencies with pip. Or you could bundle your dependencies. Or provide an installer that installs everything.

Or you can just use the standard html.parser module. I imagine it’s easier to build your own DOM-like tree with the help of html.parser than repurpose ElementTree, but I’m really just speculating here.

fdrake · November 17, 2021, 3:57pm

You could even use html.parser to build a DOM using the xml.dom.minidom implementation. Should be fairly straightforward.

-Fred