How to prevent etree from filling in the namespace when parsing

Gouvernathor · May 28, 2024, 12:50am

I’m parsing SVG images, which are XML documents.
They all declare a namespace (xmlns) on the root node, looking like <svg xmlns="http://www.w3.org/2000/svg" version="1.1"....
When I parse one with the xml.etree.ElementTree module, it expands that namespace and prepends it to every element tag, so the root node ends up with the tag {http://www.w3.org/2000/svg}svg and an attrib where version is present but xmlns is not.
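
The behaviour described above can be reproduced with a few lines. This is only a minimal sketch; the inline SVG string is a made-up stand-in for a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal SVG, standing in for a real file
svg_source = '<svg xmlns="http://www.w3.org/2000/svg" version="1.1"><rect/></svg>'

root = ET.fromstring(svg_source)
print(root.tag)     # {http://www.w3.org/2000/svg}svg
print(root.attrib)  # {'version': '1.1'}
print(root[0].tag)  # {http://www.w3.org/2000/svg}rect
```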

So, to keep a manageable tree, I used this code:

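import re
import xml.etree.ElementTree as ET

template_str = template_file.read()
# Strip the default xmlns declaration so tags are not prefixed with {namespace}
namespace_match = re.search(r'xmlns="([^"]+)"\s*', template_str)
namespace = None
if namespace_match is not None:
    namespace = namespace_match.group(1)
    namespace_to_replace = namespace_match.group(0)
    template_str = template_str.replace(namespace_to_replace, "")
template_ET = ET.fromstring(template_str)
if namespace is not None:
    # Put it back as a plain attribute so it survives the export
    template_ET.set("xmlns", namespace)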

The final set() call is there so that, when I export the tree back to an SVG file, it keeps its namespace declaration as before.

Is there a way to tell etree to treat xmlns as an ordinary attribute, so I can drop this messy hack?

jamestwebber · May 28, 2024, 4:07am

Not the most helpful solution as it requires a dependency, but if you can use lxml it supports this with the nsmap attribute:

>>> from lxml import etree as et
>>> e = et.parse("my.svg")
>>> e.getroot().nsmap
{None: 'http://www.w3.org/2000/svg',
 'xlink': 'http://www.w3.org/1999/xlink',
 'serif': 'http://www.serif.com/'}
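
If lxml is not an option, a similar prefix-to-URI mapping can probably be gathered with the standard library by listening for "start-ns" events in iterparse. The sketch below assumes a small inline document; collect_namespaces is a hypothetical helper name, not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document
svg_source = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1"/>'
)

def collect_namespaces(source):
    """Return {prefix: uri} for every namespace declared in the document."""
    namespaces = {}
    # Each "start-ns" event carries a (prefix, uri) tuple
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces

ns = collect_namespaces(io.BytesIO(svg_source.encode("utf-8")))
print(ns)
```

Unlike lxml's nsmap, the default namespace is keyed by the empty string rather than None, and a prefix that is redeclared deeper in the document overwrites the earlier entry.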

PopGreene · May 28, 2024, 8:39pm

I’m not quite sure I know what you’re asking, but I do have a problem with the decision to expand each namespace into its URI in curly braces, e.g. {…}.

To fix that problem for myself, I wrote the short script below. The idea is to replace the “{…}” namespace with the “…:” for each occurrence. It has served my purpose. I hope it can help you.

import xml.etree.ElementTree as etree

def fixNs(stack, tag):
    # Turn "{uri}local" into "prefix:local", or "local" for the default namespace
    if tag[0] == '{':
        n, l = tag[1:].split('}', 1)
        for x in reversed(stack):
            if x[1] == n:
                if len(x[0]):
                    return x[0] + ":" + l
                else:
                    return l
    # Not namespaced, or no matching declaration in scope
    return tag

def parse(f):
    root = None
    stack = []   # (prefix, uri) pairs currently in scope
    nsinel = []  # declarations seen since the last start tag
    for ev, x in etree.iterparse(f, events=("start", "end", "start-ns", "end-ns")):
        if ev == "start":
            if root is None:
                root = x
            # Re-add the declarations as ordinary attributes
            for n, u in reversed(nsinel):
                if len(n):
                    x.attrib["xmlns:" + n] = u
                else:
                    x.attrib["xmlns"] = u
            nsinel = []
        elif ev == "end":
            x.tag = fixNs(stack, x.tag)
            # Copy the keys first: the dict is modified inside the loop
            for k in list(x.attrib):
                kf = fixNs(stack, k)
                if kf != k:
                    x.attrib[kf] = x.attrib[k]
                    del x.attrib[k]
        elif ev == "start-ns":
            stack.append(x)
            nsinel.append(x)
        elif ev == "end-ns":
            stack.pop()

    return etree.ElementTree(root)
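
When the document only uses a single default namespace, as with the SVG files in the original question, a shorter variant of the same idea may be enough: parse normally, then strip the "{uri}" prefix from each tag. This is a sketch under that assumption; strip_namespace, SVG_NS and the sample string are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Hypothetical sample document
svg_source = f'<svg xmlns="{SVG_NS}" version="1.1"><g><rect width="1"/></g></svg>'

def strip_namespace(root, uri):
    """Remove the "{uri}" prefix from every element tag, in place."""
    prefix = "{" + uri + "}"
    for elem in root.iter():
        # Guard against non-string tags (comments, processing instructions)
        if isinstance(elem.tag, str) and elem.tag.startswith(prefix):
            elem.tag = elem.tag[len(prefix):]

root = ET.fromstring(svg_source)
strip_namespace(root, SVG_NS)
# Put the declaration back as a plain attribute so it survives export
root.set("xmlns", SVG_NS)

print(root.tag)  # svg
print(ET.tostring(root, encoding="unicode"))
```

This avoids editing the raw text with a regular expression, but it does not handle prefixed namespaces or namespaced attributes the way the longer script does.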

Gouvernathor · May 29, 2024, 12:12pm

@jamestwebber I don’t have the means to test that at the moment, but I’ll come back to it.

@PopGreene your code is more complex than mine, so I’m not sure it’s a better alternative. But I’ll look into it if I have the time.

Kxnr · May 29, 2024, 3:03pm

XPath is also useful if you’d like to ignore namespaces. It doesn’t remove them from the document, but lets you query for tags with, e.g.:


tree.findall("{*}sometag")
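
As a self-contained illustration of that wildcard, here is a small sketch with a made-up document. Note that the {*} form in ElementTree paths requires Python 3.8 or later, and that ".//" is added here to search all descendants rather than only direct children.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document
svg_source = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><rect id="a"/><rect id="b"/></g><circle id="c"/></svg>'
)
tree = ET.ElementTree(ET.fromstring(svg_source))

# "{*}rect" matches rect elements in any namespace (or in none)
rects = tree.findall(".//{*}rect")
print([r.get("id") for r in rects])  # ['a', 'b']
```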

If lxml is an option, as @jamestwebber suggested, it generally makes working with namespaces easier. When it hasn’t been available, I’ve often written little functions that can take a dict of namespaces to help build XPath expressions.
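
One possible shape for such a helper, using only the standard library, is sketched below. find_all and NAMESPACES are hypothetical names; the only ElementTree feature relied on is the optional namespaces argument of findall.

```python
import xml.etree.ElementTree as ET

# Prefixes chosen here are local to the query; they need not match the document
NAMESPACES = {"svg": "http://www.w3.org/2000/svg"}

def find_all(root, path, namespaces=NAMESPACES):
    """Hypothetical helper: run findall with a fixed prefix-to-URI mapping."""
    return root.findall(path, namespaces)

# Hypothetical sample document
root = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><rect id="a"/></g><rect id="b"/></svg>'
)
print([r.get("id") for r in find_all(root, ".//svg:rect")])  # ['a', 'b']
```

The tree itself is left untouched; only the query is written with short prefixes instead of the full {uri} form.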