Place for XML helper functions

storchaka · October 9, 2025, 9:11am

We need to add some helper functions to test whether a string can be used as the name of XML element or attribute and whether it can be represented in XML at all, because existing XML serialization code does not check this. It is impractical to check this if the data was read from an existing XML (it is already valid), but if you construct XML from arbitrary user data, you can accidentally create an invalid XML or even valid incorrect XML. So we need to give the user the tools to check this if they need.

I initially thought to add new functions in the xml.sax.saxutils module which already contains few common functions: escape(), unescape() and quoteattr(). But @vstinner proposed to create a new xml.utils module for the new function or put them directly in the xml top-level package. Neither new functions, nor old existing functions are specific to SAX parsing. They were in xml.sax.saxutils because some SAX specific code needed escape() and quoteattr(), and then they were made public.

Corresponding HTML functions escape() and unescape() live in the top-level html module, but the html package is much smaller than the xml package (for now). Not every user of XML need these functions, escape() and quoteattr() are only needed if you write an XML by hand (instead of using ElementTree or minidom), and new validation functions will be needed only if you serialize an arbitrary user data.

Should we create a new xml.utils module for new functions and move existing functions from xml.sax.saxutils there? Or move them directly in xml?

storchaka · October 9, 2025, 9:13am

The corresponding issue:

github.com/python/cpython

Add utility functions to test for legal XML characters and names

відкрито 08:44AM - 02 Oct 25 UTC

serhiy-storchaka

type-feature stdlib topic-XML

# Feature or enhancement It is well known fact that some characters (such as `<…` or `&`) must be escaped in XML and HTML, and that attribute values must be quoted. It is less known fact (unless you specially looked for it) that not all characters can be included in XML, even if escaped. For example, XML cannot contain the null character. There is also restriction on names of elements and attributes. They cannot contain `<`, `>`, `/`, `!`, `?`, spaces and many other characters. But unlike to Python identifiers, `-`, `:`, `.`, etc are acceptable. The list of valid and invalid characters is pretty long. In the case when the user input is used for element or attribute names without validation, this can even lead to XML injection vulnerability ([CVE-2025-9375](https://nvd.nist.gov/vuln/detail/CVE-2025-9375)) So, I think that it would be useful to provide standard functions to validate XML characters and names in the stdlib. `xml.sax.saxutils` looks an appropriate place, it already has `escape()` and `quoteattr()` utilities. We can also provide functions to "sanitize" XML characters, similar to `sanitize_xml()` in `Lib/test/libregrtest/utils.py`, but more general. This is similar to old issue #63014. Now, the problem is that there are two standards of XML: 1.0 and 1.1. The former is much more popular. And they have different definitions of legal characters. There are also restricted characters in XML 1.1 which cannot be used in "well-formed" documents and parsed entities. There is also a set of characters (version depending) using which legal but is discouraged. Should we have several functions or several parameters to specify the XML version and other options? https://www.w3.org/TR/xml/#charsets https://www.w3.org/TR/xml11/#charsets Fortunately, the syntax for names is the same in XML 1.0 and 1.1. https://www.w3.org/TR/xml/#NT-Name https://www.w3.org/TR/xml11/#NT-Name We can also add a similar set of functions for HTML. ### Linked PRs * gh-139768

picnixz · October 9, 2025, 9:21am

I would be in favor of moving everything to xml.utils instead of xml. For a “large” package example, we already have email.utils.quote, so I guess it would make sense to have xml.utils.quote.

doerwalter · October 9, 2025, 10:16am

I’d vote for putting them directly in xml. xml.escape reads nicely and mirrors html.escape.

Its unfortunate that for email those functions live in email.utils, but xml and html are more closely related than xml and email.

picnixz · October 9, 2025, 2:00pm

One reason why I would put it in a dedicated module is because xml/__init__.py is empty and feels more like a global entry-point rather than a location where I would put utilities. While I agree that XML and HTML are more related to each other, as Serhiy observed, the package is currently way smaller (and it only needs 2 util functions).

While I agree with mimicking html.escape and xml.escape, I would suggest re-exporting (possibly limited) xml.utils directly in this case but keep the implementation in xml.utils in case we need to add more than what is needed.

Monarch · October 9, 2025, 3:48pm

I’d prefer if they were in the xml namespace. Mainly because people are much more likely to use something like xml.escape(blah).

If it’s tucked under xml.utils, you’ll often end up with patterns like:

from xml.utils import escape

# 100s of lines later
def foobar(s):
    s = something(s)
    s = escape(s)  # cryptic
    return s

Whereas with:

import xml

# 100s of lines later
def foobar(s):
    s = something(s)
    s = xml.escape(s)  # explicit, obvious
    return s

…it’s immediately clear what’s happening.

And yes, I’m well aware you can just do from xml.utils import escape as xmlescape, but that’s never going to be as universal. Even within the same project, it could drift between as xmlescape and as xml_escape. It’s also a manual step, unlike just typing xml and letting auto-import take care of the rest before continuing naturally with .escape.

Monarch · October 9, 2025, 4:11pm

Alternatively, if it ends up in something like xml.utils, I would suggest prepending xml to all free functions in that module, following the precedent set by urllib.parse (e.g., urlquote, urlsplit, urlparse).

storchaka · October 12, 2025, 8:38am

The difference between xml and html is not only the size. There are two ways to load XML in memory or create a tree in memory. If you use them, then you do not need escape() for serialization. You need it only if you write it by hand. For HTML, there is no the in-memory tree option (for now). So html.escape() is used relatively more often (100%) than xml.sax.saxutils.escape() (less than 100%). The latter is also simpler than the former, so most code (difflib, plistlib, pydoc, xmlrpc, ElementTree, minidom) simply use their own version rather than importing from xml.sax.saxutils. Maybe adding a fast implementation in xml or xml.utils will change this.

storchaka · October 12, 2025, 8:48am

So, lets take a poll.

Where to add utility functions for XML?

In the top-level xml module
In the new xml.utils module
Leave them in xml.sax.saxutils
Other (explain in comments)

0 voters

hugovk · October 12, 2025, 11:03am

I voted for the top-level xml module because utils isn’t a great name.

picnixz · October 12, 2025, 11:28am

I should note that those articles advise avoiding utils because it tends to grow into a package where one would just put whatever stuff they want inside as soon as it’s utils-like. I disagree that this observation applies to our case here because xml.utils is well-scoped (and it’s under the xml package, so I fail to see how someone can’t see why we would something that is not related to xml utilities in this module). One article also says:

Yes, there is a couple of utils packages in Django. Shame on them for using utils name. However, notice that at least some of them could be separated from the framework and bundled as optional dependencies. Also, at least they are grouped in cohesive sub-packages - e.g. django.utils.timezone or django.utils.translation.

Unless you are writing a framework, stay away from utils.

Choosing xml.utils would not be against this recommendation either! Similarly, I don’t think having xml_utils.py (as recommended here) applies as we’re already in the xml package.

The only argument that I could see in favor of not having a utils submodule is the following:

Socially: Having a module named “utils” can set a precedent, implicitly encouraging the creation of more such modules. This can lead to the unfortunate situation where you have to resolve naming conflicts between multiple “utils” modules.

However, I think having the possibility of blowing up the top-level xml file with more utilities is not a good idea. It’s perfectly fine to re-export some functions from xml.utils to xml if it’s to ease usage, but IMO, the implementation should be separate.

storchaka · April 30, 2026, 4:54pm

The result is a draw, 9:9. Half for toplevel xml module, and half for xml.utils.

I do not have strong opinion either. I am now going to implement new funtions in xml.utils, but expose them in xml. It will be easier to deprecate one of places when we changed our mind if an alias was always available at other place. If we leave this as is, it will be just an implementation detail. If we decided to add many new functions in xml.utils without exposing in xml, we can move the implementation of old functions to xml and export them in xml.utils.