However, this package has not received published updates since early 2021 and only explicitly declares support up to Python 3.9.
Given this, is the use of defusedxml still an appropriate recommendation to make in the docs? Or do we think newer versions of Python are secure enough in this respect? (In which case the documentation should also be updated)
@gpshead you typed “stale”, is there any chance you meant “stable”?
@ofek I have no opinion on lxml specifically but my concern is that defusedxml is being recommended explicitly for the purposes of mitigating potential security vulnerabilities. Is lxml recommended for the same reason? In either case it seems somewhat strange to recommend third-party libraries as being more secure than the standard library implementation itself.
I should say that I understand that defusedxml filled a gap at a previous point in time. What I’m really wondering is whether the use of defusedxml is still explicitly recommended given that it does not seem to receive support. And if not, do we believe the standard library has been updated with the fixes from defusedxml?
My writing “stale” was intentional snark. It means both stable and stale. =)
The larger point behind that is that the standard library is not a great place for libraries in need of major overhaul or evolution because users write code that depends on all APIs and behaviors, limiting what changes we can make without causing disruption, and users often need better things long before they’ll be able to upgrade their Python runtime. So when possible it is often better to look for externally maintained modules from PyPI. Which is why we sometimes link to external things.
Those evolve though, so any time our recommendation doesn’t seem great, updating it makes sense.
Christian updated defusedxml today, as it happens, and added a note that the vendored libexpat used since 3.8 includes ‘billion laughs’ protection, and that the SAX and DOM parsers don’t load external entities since Python 3.7.1. This clearly isn’t everything, but the state of play in Python does seem to have improved.
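For anyone wanting to check where their own interpreter stands on those two points, a minimal stdlib-only sketch might look like the following. The 2.4.1 threshold comes from the vulnerabilities note in the Python `xml` docs rather than from this thread, and the helper name `describe_xml_defaults` is made up for illustration.

```python
from xml.parsers import expat
from xml.sax import make_parser
from xml.sax.handler import feature_external_ges


def describe_xml_defaults():
    """Report two XML hardening facts about the running interpreter."""
    # (major, minor, micro) of the libexpat this Python is linked against.
    expat_version = expat.version_info

    # Assumption: 2.4.1 is the libexpat version the stdlib docs cite as no
    # longer vulnerable to "billion laughs" / "quadratic blowup".
    has_expansion_limits = expat_version >= (2, 4, 1)

    # Since Python 3.7.1 the default SAX parser reports this feature as off,
    # i.e. external general entities are not processed.
    sax_loads_external = bool(make_parser().getFeature(feature_external_ges))

    return {
        "expat_version": expat_version,
        "has_expansion_limits": has_expansion_limits,
        "sax_loads_external_entities": sax_loads_external,
    }


if __name__ == "__main__":
    for key, value in describe_xml_defaults().items():
        print(f"{key}: {value}")
```

This only reports defaults; it says nothing about code that re-enables those features or about parsers outside the stdlib.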
Sorry, my snark detector was off when I read this post
I noticed the exact same thing in the lxml docs. Also the linked GitHub thread explicitly calls out the fact that lxml is not safe against XXE out of the box.
I want to quote the current Python documentation here:
defusedxml is a pure Python package with modified subclasses of all stdlib XML parsers that prevent any potentially malicious operation. Use of this package is recommended for any server code that parses untrusted XML data.
If we changed that recommendation to lxml today, it seems like a lot of people would (unknowingly!) wind up with insecurely configured XML parsers. It seems that we would need to be extremely careful before making any documentation updates.
To raise the stakes a bit more, the defusedxml package is widely recommended in the security community as the proper way to protect against various vulnerabilities related to XML. Just for a few examples:
There is a public Semgrep rule that mentions the Python documentation explicitly:
The Python documentation recommends using defusedxml instead of xml because the native Python xml library is vulnerable to XML External Entity (XXE) attacks. These attacks can leak confidential data and “XML bombs” can cause denial of service.
Is there a third option?
3. lxml (or another widely-used PyPI package) reviews the current state of security in their package compared to defusedxml, and makes some statement regarding what the library does or doesn’t provide protection for.
For me, the main problem right now is trying to figure out where each library is at in terms of security. Do I need to do more or change defaults to be protected from the various XML issues?
I think the problem is that PSF has basically endorsed defusedxml as the standard way to enforce XML security and the security community has taken that as more or less gospel truth.
Removing the recommendation at this point means basically washing our hands of it, while a decent number of people will continue to assume that defusedxml is the standard, recommended way to remediate potential XML vulnerabilities.
After taking a closer look at lxml it is clear that it does not provide the same security guarantees out of the box, even if it can be configured to be more secure.
Given that the recommendation has been in place for quite a while now, it seems like PSF has some responsibility to provide an updated recommendation and/or guarantee some level of support for defusedxml. My opinion is that defusedxml seems to be a thoughtful and elegant solution to the security issue, and that it deserves ongoing maintenance. I might go so far as to say that it deserves to become part of the standard library itself.
So I think that’s fair. But I am looking at this from the perspective of someone who has to make security recommendations, and it is clear that the security community has adopted defusedxml as the de-facto solution, mostly on the basis of the Python docs themselves. So it does feel like PSF has some kind of obligation in that respect. It’s entirely possible that lxml should be the preferred implementation but it does not currently provide the kind of (default) security guarantees provided by defusedxml.
Also defusedxml is a drop-in replacement for the standard library. And so purely on the basis of making a security recommendation, it is a very easy solution to adopt for someone who is currently using the standard library. Switching to lxml would seem to involve breaking API changes on top of some insecure defaults, and so it does not seem to serve the same purpose in this respect.
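To make the "drop-in" point concrete, here is a small sketch of what the switch tends to look like. It assumes `defusedxml` is installed from PyPI (the import is guarded so the snippet still loads without it); `parse_untrusted` and `ENTITY_DOC` are made-up names for illustration.

```python
import xml.etree.ElementTree as StdET  # what existing code typically imports

try:
    # Third-party package from PyPI; mirrors the stdlib function names.
    import defusedxml.ElementTree as SafeET
    from defusedxml import EntitiesForbidden
except ImportError:
    SafeET = None
    EntitiesForbidden = None

# A document declaring an internal entity, the building block of an XML bomb.
ENTITY_DOC = b'<!DOCTYPE x [<!ENTITY a "aaa">]><x>&a;</x>'


def parse_untrusted(data):
    """Parse XML with defusedxml; never fall back silently to the stdlib."""
    if SafeET is None:
        raise RuntimeError("defusedxml is not installed")
    return SafeET.fromstring(data)


if __name__ == "__main__":
    # The stdlib parser expands the entity without complaint.
    print(StdET.fromstring(ENTITY_DOC).text)
    try:
        parse_untrusted(ENTITY_DOC)
    except Exception as exc:  # EntitiesForbidden, or RuntimeError if missing
        print(f"rejected: {type(exc).__name__}")
```

The only change at the call site is which module `fromstring` comes from, which is why the swap is cheap for code already built on the stdlib API.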
(I am a core dev, but I have little or no experience with XML). I think the obligation here is to keep the docs up to date, and that’s all. If that means removing a no-longer-accurate recommendation, then that’s what we should do (as usual, “what we should do” means “we’d be happy for someone to submit a PR”).
The stdlib docs recommend defusedxml to be installed to patch problems in the stdlib XML modules.
People who aren’t restricted to the stdlib and know that there’s a world on PyPI will find out that lxml is the best-in-class lib for XML processing and use it. lxml docs should point out the security settings for sure, but I don’t know that the stdlib docs should.
It might be true that some linters and static analysis tools check for that in large part due to the documentation, but I would question the stance that there is a clear consensus in the security community. Where I have worked (places where security has multiple teams to encompass different threat models) I have never used that library but rather used lxml with the safe options.
Bringing defusedxml into the stdlib really just means disabling some default XML features that are already in the stdlib. So if we’re prepared to break users’ XML parsing in the name of security, we should be able to do that pretty easily whenever we like.
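As a rough illustration of what "disabling some default XML features" could mean, the sketch below rejects any document carrying a DOCTYPE before handing it to `ElementTree`. This is not how defusedxml itself is implemented; it is a two-pass approximation using only stdlib calls, and `ForbiddenDTD`, `reject_dtd`, and `strict_fromstring` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class ForbiddenDTD(ValueError):
    """Raised when a document contains a DOCTYPE declaration."""


def reject_dtd(data):
    """First pass: scan with raw expat and fail on any DOCTYPE."""
    parser = expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Exceptions raised in an expat handler propagate out of Parse().
        raise ForbiddenDTD(f"DOCTYPE {name!r} is not allowed")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)


def strict_fromstring(data):
    """Second pass: build the tree only if no DOCTYPE was seen."""
    reject_dtd(data)
    return ET.fromstring(data)


if __name__ == "__main__":
    print(strict_fromstring(b"<a><b/></a>").tag)
    try:
        strict_fromstring(b'<!DOCTYPE x [<!ENTITY a "aaa">]><x>&a;</x>')
    except ForbiddenDTD as exc:
        print(f"rejected: {exc}")
```

Without a DOCTYPE a document cannot declare entities at all, so this closes off both entity expansion and external entity loading, at the cost of refusing legitimate documents that carry a DTD. That is the kind of breakage the post above is weighing.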
This is a misconception. The PSF is not in an authoring, reviewing, or endorsing role for anything you find in CPython or its documentation. The PSF’s role in this situation is as the mere copyright assignee. They, like anyone, are welcome to file issues and propose changes to the CPython project code and docs and processes.
The contents of CPython (code, docs, releases, etc) are created and reviewed by some CPython Core Developers, the majority of whom are volunteers. Something being in the documentation is no more than a statement that the core dev(s) involved in putting it in there at the specific moment in time it went in thought it being there would improve the state of the world.
If something no longer seems right, file an issue and offer a PR with an explanation that you believe relevant core devs will agree with.
defusedxml has been updated. It still seems quite relevant. (As a result, I doubt you’ll find core devs interested in removing that link today).