Status of defusedxml and recommendation in docs

The Python XML documentation explicitly recommends the use of the defusedxml package for security purposes.

However, this package has not received published updates since early 2021 and only explicitly declares support up to Python 3.9.

Given this, is the use of defusedxml still an appropriate recommendation to make in the docs? Or do we think newer versions of Python are secure enough in this respect? (In which case the documentation should also be updated.)

@tiran owns that package and is in the best position to comment.

The Python standard library XML libraries are long-term-stale and not likely to ever see improvements.

If we have to make a recommendation, it should be lxml, which is the main package based on community consensus.

@gpshead you typed “stale”, is there any chance you meant “stable”?

@ofek I have no opinion on lxml specifically but my concern is that defusedxml is being recommended explicitly for the purposes of mitigating potential security vulnerabilities. Is lxml recommended for the same reason? In either case it seems somewhat strange to recommend third-party libraries as being more secure than the standard library implementation itself.

I should say that I understand that defusedxml filled a gap at a previous point in time. What I’m really wondering is whether the use of defusedxml is still explicitly recommended given that it does not seem to receive support. And if not, do we believe the standard library has been updated with the fixes from defusedxml?

Based on this discussion I believe lxml has addressed all previous security related critiques and they even have FAQs about safe usage.

Nice, it sounds like we should update the documentation recommendation then! Could you file a docs issue about XML Processing Modules — Python 3.12.0 documentation pointing at this thread?

My writing “stale” was intentional snark. It means both stable and stale. =)

The larger point behind that is that the standard library is not a great place for libraries in need of a major overhaul or evolution, because users write code that depends on all APIs and behaviors, which limits what changes we can make without causing disruption, and users often need better things long before they’ll be able to upgrade their Python runtime. So when possible it is often better to look for externally maintained modules from PyPI, which is why we sometimes link to external things.

Those evolve though, so any time our recommendation doesn’t seem great, updating it makes sense.

I would assume not.

The list of changes in CPython’s xml libraries (History for Lib/xml - python/cpython · GitHub) does not appear to contain work along those lines since the warning was added to the docs 10.5 years ago in Issue 17538: Document XML vulnerabilties · python/cpython@7380a67 · GitHub.

I recently looked into security issues with xml parsing in Python, so this topic is timely.

There are a few things that confuse me:

In the lxml FAQ entry on parsing safely on a server, it says to set resolve_entities=False. But the default for that setting is True. How can this be addressed when using libraries that themselves use lxml?

Also, the last sentence of that FAQ entry refers to defusedxml. Isn’t it a bit circular to say that defusedxml is no longer needed, but the way to make lxml safer is to use the examples in defusedxml?

The defusedxml package comes with an example setup and a wrapper API for lxml that applies certain counter measures internally.

I agree that the docs need to be updated, but so far I’m unsure what they should be updated to recommend.
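For reference, the FAQ advice above might look roughly like the sketch below. It assumes lxml is installed; the helper names `make_hardened_parser` and `parse_with_lxml` are made up for illustration, and the extra keyword arguments beyond `resolve_entities=False` are my own defensive choices, not something the FAQ is quoted as requiring here.

```python
def make_hardened_parser():
    """Build an lxml parser that does not expand entity references.

    Hypothetical helper: the option set is an assumption, not an
    official lxml recipe.
    """
    # Imported lazily so the sketch only needs lxml when it is called.
    from lxml import etree

    return etree.XMLParser(
        resolve_entities=False,  # the setting the FAQ entry calls out
        no_network=True,         # do not fetch DTDs or entities over the network
        load_dtd=False,          # do not load an external DTD
        huge_tree=False,         # keep libxml2's built-in size limits
    )


def parse_with_lxml(data):
    """Parse untrusted XML with an explicitly configured parser."""
    from lxml import etree

    return etree.fromstring(data, parser=make_hardened_parser())
```

This only helps for code that passes the parser explicitly. lxml also has `etree.set_default_parser()`, which sets the parser used in the current thread when a caller supplies none, but whether that reaches a given third-party library depends on how that library creates its parsers, so I would check before relying on it.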

Christian updated defusedxml today, as it happens, and added a note that the vendored libexpat used since 3.8 includes ‘billion laughs’ protection, and that the SAX and DOM parsers don’t load external entities since Python 3.7.1. This clearly isn’t everything, but the state of play in Python does seem to have improved.
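Both of those points can be checked on a given interpreter with the standard library alone. A minimal sketch follows; the function name `stdlib_xml_status` is made up, and the 2.4.1 threshold is taken from the Python XML docs' statement about Expat rather than from this thread.

```python
import xml.sax
from xml.parsers import expat
from xml.sax.handler import feature_external_ges

# Assumption: the Python docs describe Expat 2.4.1 and newer as not
# vulnerable to "billion laughs" and "quadratic blowup".
MIN_SAFE_EXPAT = (2, 4, 1)


def stdlib_xml_status():
    """Report the Expat version in use and the SAX external-entity default."""
    parser = xml.sax.make_parser()
    return {
        # Version of the Expat library pyexpat was built against.
        "expat_version": expat.version_info,
        "billion_laughs_protected": expat.version_info >= MIN_SAFE_EXPAT,
        # Whether the default SAX parser loads external general entities.
        "sax_loads_external_entities": bool(
            parser.getFeature(feature_external_ges)
        ),
    }


if __name__ == "__main__":
    for key, value in stdlib_xml_status().items():
        print(f"{key}: {value}")
```

The Expat result depends on how Python was built: a build linked against an older system libexpat can report a lower version than the vendored copy would.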

A

Sorry, my snark detector was off when I read this post :sweat_smile:

I noticed the exact same thing in the lxml docs. Also the linked GitHub thread explicitly calls out the fact that lxml is not safe against XXE out of the box.

I want to quote the current Python documentation here:

defusedxml is a pure Python package with modified subclasses of all stdlib XML parsers that prevent any potentially malicious operation. Use of this package is recommended for any server code that parses untrusted XML data.

If we changed that recommendation to lxml today, it seems like a lot of people would (unknowingly!) wind up with insecurely configured XML parsers. It seems that we would need to be extremely careful before making any documentation updates.

To raise the stakes a bit more, the defusedxml package is widely recommended in the security community as the proper way to protect against various vulnerabilities related to XML. Just for a few examples:

  • It is mentioned in various places in the Bandit documentation: blacklist_imports — Bandit documentation
  • There is a public semgrep rule that mentions the Python documentation explicitly: Semgrep

    The Python documentation recommends using defusedxml instead of xml
    because the native Python xml library is vulnerable to XML External
    Entity (XXE) attacks. These attacks can leak confidential data and “XML
    bombs” can cause denial of service

  • The OWASP cheatsheet on XXE links to the Python documentation page: XML External Entity Prevention - OWASP Cheat Sheet Series
  • Various security tool vendors also mention defusedxml in their recommendations

Point being: any update to the docs here needs to carefully consider the implications. It would be nice if one of the following things were true:

  1. The stdlib implementations were updated with secure defaults (seems unlikely given @gpshead’s comment)
  2. defusedxml guaranteed some kind of continuing support, ideally from PSF

I would prefer to remove recommendations instead.

Is there a third option?
3. lxml (or another widely-used PyPI package) reviews the current state of security in their package compared to defusedxml, and makes some statement regarding what the library does or doesn’t provide protection for.

For me, the main problem right now is trying to figure out where each library is at in terms of security. Do I need to do more or change defaults to be protected from the various xml issues?

I think the problem is that PSF has basically endorsed defusedxml as the standard way to enforce XML security and the security community has taken that as more or less gospel truth.

Removing the recommendation at this point means basically washing our hands of it, while a decent number of people will continue to assume that defusedxml is the standard, recommended way to remediate potential XML vulnerabilities.

After taking a closer look at lxml it is clear that it does not provide the same security guarantees out of the box, even if it can be configured to be more secure.

Given that the recommendation has been in place for quite a while now, it seems like PSF has some responsibility to provide an updated recommendation, guarantee some level of support for defusedxml, or both. My opinion is that defusedxml seems to be a thoughtful and elegant solution to the security issue, and that it deserves ongoing maintenance. I might go so far as to say that it deserves to become part of the standard library itself.

I’m not a Python core developer but if I were I would hard reject that as there are efforts underway to strip the standard library of things.

The optimal solution is to either remove the recommendation entirely or recommend the project that has by far the most maintenance and widespread usage (lxml) with an example of how to use it safely.

So I think that’s fair. But I am looking at this from the perspective of someone who has to make security recommendations, and it is clear that the security community has adopted defusedxml as the de-facto solution, mostly on the basis of the Python docs themselves. So it does feel like PSF has some kind of obligation in that respect. It’s entirely possible that lxml should be the preferred implementation but it does not currently provide the kind of (default) security guarantees provided by defusedxml.

Also defusedxml is a drop-in replacement for the standard library. And so purely on the basis of making a security recommendation, it is a very easy solution to adopt for someone who is currently using the standard library. Switching to lxml would seem to involve breaking API changes on top of some insecure defaults, and so it does not seem to serve the same purpose in this respect.
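To illustrate what "drop-in" means here: for ElementTree users the change is essentially the import. A minimal sketch, assuming defusedxml is installed; the wrapper name `parse_with_defusedxml` is hypothetical and exists only to keep the example self-contained.

```python
def parse_with_defusedxml(data):
    """Parse untrusted XML through defusedxml's ElementTree wrapper.

    In existing code the equivalent change is usually just replacing
        import xml.etree.ElementTree as ET
    with
        import defusedxml.ElementTree as ET
    """
    # Imported lazily so the sketch only needs defusedxml when it is called.
    from defusedxml import ElementTree as SafeET

    # Documents that declare entities are rejected with an exception
    # instead of being expanded.
    return SafeET.fromstring(data)
```

The returned object is a regular standard library `Element`, so code that walks the tree afterwards should not need to change.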

(I am a core dev, but I have little or no experience with XML). I think the obligation here is to keep the docs up to date, and that’s all. If that means removing a no-longer-accurate recommendation, then that’s what we should do (as usual, “what we should do” means “we’d be happy for someone to submit a PR” :wink:).

Personally I don’t see a conflict here.

The stdlib docs recommend defusedxml to be installed to patch problems in the stdlib XML modules.
People who aren’t restricted to the stdlib and know that there’s a world on PyPI will find out that lxml is the best-in-class library for XML processing and use it. lxml docs should point out the security settings for sure, but I don’t know that the stdlib docs should.

It might be true that some linters and static analysis tools check for that in large part due to the documentation, but I would question the stance that there is a clear consensus in the security community. Where I have worked (places where security has multiple teams to encompass different threat models) I have never used that library but rather used lxml with the safe options.

Bringing defusedxml into the stdlib really just means disabling some default XML features that are already in the stdlib. So if we’re prepared to break users’ XML parsing in the name of security, we should be able to do that pretty easily whenever we like.
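As a rough illustration of "disabling some default XML features", here is a stdlib-only sketch that rejects documents declaring entities before handing them to ElementTree. The names `ForbiddenXML` and `fromstring_strict` are hypothetical, and the two-pass approach is a simplification of my own, not how defusedxml is actually implemented.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class ForbiddenXML(ValueError):
    """Raised when a document uses a feature this sketch refuses to process."""


def _forbid(feature):
    """Return an Expat handler that rejects the named feature."""
    def handler(*args):
        raise ForbiddenXML(f"{feature} are not allowed")
    return handler


def fromstring_strict(data, forbid_dtd=False):
    """Parse XML, refusing entity declarations (and optionally any DTD)."""
    # First pass: a bare Expat parser whose handlers raise as soon as a
    # forbidden construct is seen. The exception propagates out of Parse().
    checker = expat.ParserCreate()
    checker.EntityDeclHandler = _forbid("entity declarations")
    checker.ExternalEntityRefHandler = _forbid("external entity references")
    if forbid_dtd:
        checker.StartDoctypeDeclHandler = _forbid("DTDs")
    checker.Parse(data, True)

    # Second pass: the document passed the check, so build the tree as usual.
    return ET.fromstring(data)
```

Parsing twice costs time on large inputs, and any existing code that legitimately relies on internal entities would start failing, which is exactly the breakage trade-off being discussed.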

So far, we’ve not been pushed into it.

This is a misconception. The PSF is not in an authoring, reviewing, or endorsing role for anything you find in CPython or its documentation. The PSF’s role in this situation is as the mere copyright assignee. They, like anyone, are welcome to file issues and propose changes to the CPython project code and docs and processes.

The contents of CPython (code, docs, releases, etc.) are created and reviewed by CPython Core Developers, the majority of whom are volunteers. Something being in the documentation is no more than a statement that the core dev(s) who put it there thought, at that specific moment in time, that it would improve the state of the world.

If something no longer seems right, file an issue and offer a PR with an explanation that you believe relevant core devs will agree with.

defusedxml has been updated. It still seems quite relevant. (As a result, I doubt you’ll find core devs interested in removing that link today).