HTML is a language that a lot of people encounter in their day to day and work, as demonstrated by the populariry of the many web development libraries out there (Django, Flask, …).
Python has a useful but limited html package capable of parsing a document and not much else, despite it being described as “utilities to manipulate HTML“. However, Python also has a powerful XML package, with built in XPATH among other useful features.
I believe there’s potential in providing the XML package functionalities inside the html package, with a few html-specific modifications.
This would provide a html-specific interface, with very little effort.
Also, any improvements to the xml package would directly benefit this one as well.
Proposal:
Add a two new classes inside the HTML package, children of xml’s ElementTree and Element (maybe called HTMLTree and HTMLElement?), with any html-specific changes the community finds useful.
Despite some features providing safeguards, the proposed implementation would not provide any strong guarantees, much in the line of utilities such as the sqlite implementation or the http server.
A subset of the following additions should be agreed upon. If no addition is found to be useful, this entire proposal is rendered unuseful.
Proposed additions:
-
Replace the “text” attribute with the “innerText” (or “inner_text”) and “innerHTML” properties, where innerText is automatically escaped.
-
If no initial html code is provided/parsed, the tree automatically contains
<html></html>. -
Add a custom HTMLStyle class to manage styles. When serializing the HTML, all its attributes in
__dict__(if any) are converted to a string, and sanitized not to inject code outside of thestyleattribute -
Allow some basic templating capabilities, where the user can somehow replace some tag with a subtree
E.g. A simple parsed html tree <html><body><header></header><content></content></body></html> in the my_html variable
One could do something like: my_html.replace(“header”, my_header_subtree).
The name for the function is also open for suggestions.
The actual implementation details could also be discussed, because maybe we could benefit from having a XPATH as the first argument instead of a simple tag name.
- Add a new way of adding elements with the readability of indentation.
This one is a proposal to allow an alternative way of creating simple html trees within Python.
It is not a need per se, but the proposal arises from the fact that to write an html file you either need a lot of .append/.extend calls (which gets really messy really quick), or external html files, which requires the user to edit two files for one simple task.
This could be done in many ways. My initial and sketchy idea is:
with body.div(id="content") as div:
or
with body.tag.div(id=”content”) as div:
where body, an HTMLElement generates a LazyHTMLElement (terrible name, but the best I came up with) which does not add the Element to the Tree until it is __enter__-ed.
This could be implemented in a way that the lazy element keeps a reference to the element that created it and it calls self.parent.append(Element(*self.data))when it is __enter__-ed.
If this proposal is liked, it should be decided whether to allow any tag or just a selected most common few.
Any ideas and improvements are more than welcome.
Note that all names provided are made up as reference but are open to name suggestions.