We need to add some helper functions to test whether a string can be used as the name of an XML element or attribute, and whether it can be represented in XML at all, because the existing XML serialization code does not check this. There is little point in checking this when the data was read from an existing XML document (it is already valid), but if you construct XML from arbitrary user data, you can accidentally create invalid XML, or even valid but incorrect XML. So we need to give users the tools to check this when they need to.
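To make the two checks concrete, here is a minimal sketch, assuming the rules are the Name and Char productions of XML 1.0 (Fifth Edition). The function names is_valid_name() and is_valid_text() are hypothetical and are not the proposed API.

```python
import re

# Character ranges from the XML 1.0 (Fifth Edition) NameStartChar production.
_NAME_START = (
    r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, the middle dot and some combining marks.
_NAME_CHAR = _NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
_NAME_RE = re.compile(f"[{_NAME_START}][{_NAME_CHAR}]*")

# Anything outside the Char production cannot appear in an XML 1.0
# document at all, not even as a character reference.
_INVALID_CHAR_RE = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_name(name):
    """Return True if name can be used as an element or attribute name."""
    return _NAME_RE.fullmatch(name) is not None


def is_valid_text(text):
    """Return True if every character of text is representable in XML 1.0."""
    return _INVALID_CHAR_RE.search(text) is None
```

With this sketch, is_valid_name("my item") and is_valid_text("a\x00b") are both false, which are exactly the cases that arbitrary user data can trigger. Namespace-aware checks (for example, restricting the use of ":") are left out here.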
I initially thought of adding the new functions to the xml.sax.saxutils module, which already contains a few common functions: escape(), unescape() and quoteattr(). But @vstinner proposed creating a new xml.utils module for the new functions, or putting them directly in the top-level xml package. Neither the new functions nor the existing ones are specific to SAX parsing. They ended up in xml.sax.saxutils because some SAX-specific code needed escape() and quoteattr(), and then they were made public.
The corresponding HTML functions, escape() and unescape(), live in the top-level html module, but the html package is much smaller than the xml package (for now). Not every user of XML needs these functions: escape() and quoteattr() are only needed if you write XML by hand (instead of using ElementTree or minidom), and the new validation functions will only be needed if you serialize arbitrary user data.
Should we create a new xml.utils module for the new functions and move the existing functions from xml.sax.saxutils there? Or move them directly into xml?
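For illustration, a small sketch of the two situations, using only functions that exist today. The element name and sample values are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

title = 'Tom & "Jerry"'
text = "1 < 2 & 3 > 2"

# Writing XML by hand: the caller is responsible for escaping.
# quoteattr() escapes the value and adds the surrounding quotes.
by_hand = f"<note title={quoteattr(title)}>{escape(text)}</note>"

# Building a tree: ElementTree escapes text and attributes on serialization.
elem = ET.Element("note", {"title": title})
elem.text = text
by_tree = ET.tostring(elem, encoding="unicode")

# Both documents parse back to the original values.
for doc in (by_hand, by_tree):
    parsed = ET.fromstring(doc)
    assert parsed.get("title") == title
    assert parsed.text == text
```

Neither path checks that "note" or "title" is a valid name, or that the values contain only characters XML allows, which is the gap the new functions are meant to fill.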
I would be in favor of moving everything to xml.utils instead of xml. For a “large” package example, we already have email.utils.quote, so I guess it would make sense to have xml.utils.quote.
One reason I would put it in a dedicated module is that xml/__init__.py is empty and feels more like a global entry point than a place to put utilities. While I agree that XML and HTML are closely related, as Serhiy observed, the html package is currently much smaller (and it only needs 2 utility functions).
While I agree with having xml.escape mirror html.escape, I would suggest re-exporting (a possibly limited subset of) xml.utils directly from xml in this case, but keeping the implementation in xml.utils in case we need to add more than that.
I’d prefer if they were in the xml namespace. Mainly because people are much more likely to use something like xml.escape(blah).
If it’s tucked under xml.utils, you’ll often end up with patterns like:
from xml.utils import escape
# 100s of lines later
def foobar(s):
    s = something(s)
    s = escape(s)  # cryptic
    return s
Whereas with:
import xml
# 100s of lines later
def foobar(s):
    s = something(s)
    s = xml.escape(s)  # explicit, obvious
    return s
…it’s immediately clear what’s happening.
And yes, I’m well aware you can just do from xml.utils import escape as xmlescape, but that’s never going to be as universal. Even within the same project, it could drift between as xmlescape and as xml_escape. It’s also a manual step, unlike just typing xml and letting auto-import take care of the rest before continuing naturally with .escape.
Alternatively, if it ends up in something like xml.utils, I would suggest prefixing all free functions in that module with xml, following the precedent set by urllib.parse (e.g., urlsplit, urlparse, urlencode).
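As a rough illustration of the prefix idea: the urllib.parse names below are real, while xmlescape and xmlquoteattr are hypothetical names, simulated here by aliasing the existing saxutils functions. The link_element() helper is made up for the example.

```python
from urllib.parse import urlencode, urlsplit

# Hypothetical prefixed names, simulated with import aliases.
from xml.sax.saxutils import escape as xmlescape
from xml.sax.saxutils import quoteattr as xmlquoteattr


def link_element(base, params):
    """Build a small XML fragment describing a URL (illustrative only)."""
    url = base + "?" + urlencode(params)
    host = urlsplit(url).netloc
    # Every bare call still says which domain it belongs to.
    return f"<link host={xmlquoteattr(host)}>{xmlescape(url)}</link>"


fragment = link_element("https://example.org/search", {"q": "a&b", "page": 2})
```

With prefixed names, a bare import stays readable at the call site, at the cost of names like xml.utils.xmlescape that repeat the package name when accessed through the module.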
The difference between xml and html is not only size. For XML, there are two ways to load a document into memory or build a tree in memory. If you use them, you do not need escape() for serialization; you need it only if you write XML by hand. For HTML, there is no in-memory tree option (for now). So html.escape() is used relatively more often (100%) than xml.sax.saxutils.escape() (less than 100%). The latter is also simpler than the former, so most code (difflib, plistlib, pydoc, xmlrpc, ElementTree, minidom) simply uses its own version rather than importing it from xml.sax.saxutils. Maybe adding a fast implementation to xml or xml.utils will change this.
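To show how small such a private version can be, here is a sketch of the kind of local helper a module might carry. It is not copied from any of the modules listed above, and local_escape() is a made-up name.

```python
import html
from xml.sax.saxutils import escape as sax_escape


def local_escape(text):
    """Escape the three characters that are special in XML character data."""
    # "&" must be replaced first, otherwise the "&" introduced by the
    # other replacements would be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    return text.replace(">", "&gt;")


sample = 'a < b && "c" > \'d\''

# The three-line helper matches saxutils.escape() with default arguments.
assert local_escape(sample) == sax_escape(sample)

# html.escape() also escapes quotes by default, so it does more work.
assert html.escape(sample) != local_escape(sample)
assert html.escape(sample, quote=False) == local_escape(sample)
```

Because the helper is this short, copying it is cheaper than reaching into xml.sax.saxutils, which may explain the duplication.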
I should note that those articles advise avoiding utils because it tends to grow into a package where people put whatever they want as soon as it is utils-like. I disagree that this observation applies to our case, because xml.utils is well-scoped (and it’s under the xml package, so I fail to see why someone would put something unrelated to XML utilities in this module). One article also says:
Yes, there is a couple of utils packages in Django. Shame on them for using utils name. However, notice that at least some of them could be separated from the framework and bundled as optional dependencies. Also, at least they are grouped in cohesive sub-packages - e.g. django.utils.timezone or django.utils.translation.
Unless you are writing a framework, stay away from utils.
Choosing xml.utils would not be against this recommendation either! Similarly, I don’t think having xml_utils.py (as recommended here) applies as we’re already in the xml package.
The only argument that I could see in favor of not having a utils submodule is the following:
Socially: Having a module named “utils” can set a precedent, implicitly encouraging the creation of more such modules. This can lead to the unfortunate situation where you have to resolve naming conflicts between multiple “utils” modules.
However, I don’t think it is a good idea to risk bloating the top-level xml module with more utilities. It’s perfectly fine to re-export some functions from xml.utils in xml if it eases usage, but IMO, the implementation should be kept separate.
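A minimal sketch of the re-export layout, assuming a hypothetical package named xmlsketch so that the real xml package is left untouched. The two modules are built in memory here only to keep the example self-contained; in a real package they would be utils.py and __init__.py.

```python
import types
from xml.sax import saxutils

# Hypothetical implementation module (would be xmlsketch/utils.py).
utils = types.ModuleType("xmlsketch.utils")
utils.escape = saxutils.escape
utils.unescape = saxutils.unescape
utils.quoteattr = saxutils.quoteattr
utils.__all__ = ["escape", "unescape", "quoteattr"]

# Hypothetical top-level package (would be xmlsketch/__init__.py).
# It holds no implementation and re-exports only a limited subset.
pkg = types.ModuleType("xmlsketch")
pkg.utils = utils
pkg.__all__ = ["escape", "quoteattr"]
for name in pkg.__all__:
    setattr(pkg, name, getattr(utils, name))

# Both spellings reach the same function object.
assert pkg.escape is pkg.utils.escape
```

In an actual __init__.py the loop would just be "from .utils import escape, quoteattr". One trade-off worth noting: the top-level package then imports its utils module eagerly.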