HTML document generation and editing using the XML package

HTML is a language that many people encounter in their day-to-day lives and work, as demonstrated by the popularity of the many web development libraries out there (Django, Flask, …).

Python has a useful but limited html package, capable of parsing a document and not much else, despite being described as “utilities to manipulate HTML”. However, Python also has a powerful XML package, with built-in XPath support (a limited subset) among other useful features.
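To make the contrast concrete, here is a small sketch of what each package offers today. The sample markup is made up for illustration, and it has to be well-formed because xml.etree.ElementTree is an XML parser.

```python
import html
import xml.etree.ElementTree as ET

# The html package today: escaping/unescaping helpers (plus html.parser).
escaped = html.escape("<b>Tom & Jerry</b>")
print(escaped)  # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;

# xml.etree.ElementTree: parse a well-formed document and query it
# with its XPath subset.
doc = ET.fromstring(
    "<html><body>"
    "<div id='content'><p>Hello</p><p>World</p></div>"
    "</body></html>"
)
paragraphs = doc.findall(".//div[@id='content']/p")
print([p.text for p in paragraphs])  # ['Hello', 'World']
```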

I believe there’s potential in providing the XML package’s functionality inside the html package, with a few HTML-specific modifications.

This would provide an HTML-specific interface with very little effort.

Also, any improvements to the xml package would directly benefit this one as well.

Proposal:

Add two new classes to the html package, subclasses of xml’s ElementTree and Element (maybe called HTMLTree and HTMLElement?), with any HTML-specific changes the community finds useful.

Although some features would provide safeguards, the proposed implementation would not offer any strong guarantees, much in line with utilities such as the sqlite implementation or the http server.

A subset of the following additions should be agreed upon. If none of them is found useful, this entire proposal is moot.

Proposed additions:

  • Replace the “text” attribute with the “innerText” (or “inner_text”) and “innerHTML” properties, where innerText is automatically escaped.

  • If no initial html code is provided/parsed, the tree automatically contains <html></html>.

  • Add a custom HTMLStyle class to manage styles. When serializing the HTML, all the attributes in its __dict__ (if any) are converted to a string and sanitized so that they cannot inject code outside of the style attribute.

  • Allow some basic templating capabilities, where the user can replace a given tag with a subtree.

For example, given a simple parsed HTML tree <html><body><header></header><content></content></body></html> in the my_html variable, one could do something like: my_html.replace("header", my_header_subtree).

The name for the function is also open for suggestions.

The actual implementation details could also be discussed: we might benefit from taking an XPath expression as the first argument instead of a simple tag name.
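A rough sketch of the tag-replacement idea on top of today's xml.etree.ElementTree. The free function replace_tag is a hypothetical stand-in for the proposed method, and matching on a plain tag name (rather than an XPath expression) is just the simplest option.

```python
import copy
import xml.etree.ElementTree as ET


def replace_tag(root, tag, subtree):
    """Replace every descendant of `root` named `tag` with a copy of `subtree`.

    Hypothetical helper, not an existing API.
    """
    # Elements do not know their parent, so walk the parents instead.
    # The list() snapshot avoids iterating over a tree while mutating it.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == tag:
                new = copy.deepcopy(subtree)
                new.tail = child.tail  # keep any trailing text in place
                parent[index] = new


my_html = ET.fromstring(
    "<html><body><header></header><content></content></body></html>"
)
my_header_subtree = ET.fromstring("<header><h1>My site</h1></header>")
replace_tag(my_html, "header", my_header_subtree)

result = ET.tostring(my_html, encoding="unicode", method="html")
print(result)
# <html><body><header><h1>My site</h1></header><content></content></body></html>
```

Each match gets its own deep copy, so the same subtree can be reused for several placeholders without the copies sharing state.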

  • Add a new way of adding elements with the readability of indentation.

This one is a proposal to allow an alternative way of creating simple html trees within Python.

It is not a need per se, but the proposal arises from the fact that writing an HTML file requires either a lot of .append/.extend calls (which gets messy really quickly) or external HTML files, which force the user to edit two files for one simple task.

This could be done in many ways. My initial and sketchy idea is:

with body.div(id="content") as div:

or

with body.tag.div(id="content") as div:

where body, an HTMLElement, generates a LazyHTMLElement (terrible name, but the best I came up with), which does not add the Element to the tree until it is __enter__-ed.

This could be implemented so that the lazy element keeps a reference to the element that created it and calls self.parent.append(Element(*self.data)) when it is __enter__-ed.
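A minimal sketch of that mechanism, using the first spelling (body.div(...)). The class names are the placeholders from this post, and using __getattr__ to accept any tag name is only one possible choice, not a settled design.

```python
import xml.etree.ElementTree as ET


class LazyHTMLElement:
    """Remembers a parent, a tag and attributes; creates nothing until entered."""

    def __init__(self, parent, tag, attrib):
        self.parent = parent
        self.data = (tag, attrib)

    def __enter__(self):
        # The real element is created and attached only at this point.
        element = HTMLElement(*self.data)
        self.parent.append(element)
        return element

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions


class HTMLElement(ET.Element):
    def __getattr__(self, tag):
        # Only called when normal attribute lookup fails, so real
        # attributes and methods (append, text, ...) are unaffected.
        if tag.startswith("_"):
            raise AttributeError(tag)
        return lambda **attrib: LazyHTMLElement(self, tag, attrib)


body = HTMLElement("body")
with body.div(id="content") as div:
    with div.p() as p:
        p.text = "Hello"

body.span()  # never entered, so nothing is added to the tree

rendered = ET.tostring(body, encoding="unicode", method="html")
print(rendered)  # <body><div id="content"><p>Hello</p></div></body>
```

One downside of the __getattr__ approach is that a misspelled method name silently becomes a tag factory; the body.tag.div(...) spelling would avoid that by keeping tags in their own namespace.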

If this proposal is well received, it should be decided whether to allow any tag or just a selected few of the most common ones.
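Going back to the first and most conservative addition, here is a sketch of how inner_text and inner_html properties might sit on top of Element. The names follow the suggestions in this post. It relies on the assumption that inner_html only accepts well-formed markup, since the underlying parser is an XML parser, and parsed children come back as plain Element instances.

```python
import html
import xml.etree.ElementTree as ET


class HTMLElement(ET.Element):
    @property
    def inner_text(self):
        # Concatenated text of this element and all of its descendants.
        return "".join(self.itertext())

    @inner_text.setter
    def inner_text(self, value):
        # Drop the children but keep the attributes. The text is stored
        # raw; ElementTree escapes it when the tree is serialized.
        del self[:]
        self.text = value

    @property
    def inner_html(self):
        parts = [html.escape(self.text or "", quote=False)]
        # tostring() also emits each child's tail text.
        parts += [
            ET.tostring(child, encoding="unicode", method="html")
            for child in self
        ]
        return "".join(parts)

    @inner_html.setter
    def inner_html(self, markup):
        # Assumption: `markup` is a well-formed fragment.
        wrapper = ET.fromstring(f"<wrapper>{markup}</wrapper>")
        del self[:]
        self.text = wrapper.text
        self.extend(wrapper)


div = HTMLElement("div")
div.inner_text = "1 < 2 & <b>not bold</b>"
print(div.inner_html)  # 1 &lt; 2 &amp; &lt;b&gt;not bold&lt;/b&gt;

div.inner_html = "<b>bold</b> text"
print(div.inner_text)  # bold text
```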


Any ideas and improvements are more than welcome.

Note that all the names used here are placeholders and open to suggestions.


The proposed additions are ordered from most to least conservative.

Are you familiar with Beautiful Soup? In addition to being a reference to a less-well-known Alice in Wonderland song, it’s also a really amazing HTML parsing library. It can do all the things you’re describing, and it can even parse malformed HTML (which a lot of XML parsers will choke on). I’ve used it for a lot of transformation scripts, great for fixing up the literal thousands of HTML pages on a web site I’m curator of.

I suspect it can do everything you need here, with the possible exception of templating, which you’d have to do with code (or insert something like Jinja or a Markdown parser).
