By now the Internet is rapidly becoming aware that a backdoor was injected into xz by one of its maintainers.
The part that interests me here is that the backdoor was not injected into the Git repository on GitHub, but only in the release tarballs. Due to how Autoconf & co. work, it is common for tarballs of Autoconf projects to include extra files not in the Git repository (like the configure script). Apparently, the attacker exploited the fact that differences between the release tarball and the repository were not considered particularly suspicious by downstream redistributors in order to make the attack less discoverable.
Should any conclusions be drawn from this for Python’s packaging model? Currently, all releases on PyPI normally contain an sdist, which may or may not contain the same files as the source repository (apart from PKG-INFO). This can potentially be used as a vector for making an exploit easier to hide.
Many (most?) packages also distribute wheels on PyPI, whose contents can be completely different to the sources in the Git repository. Wheels contain only the actual code and native libraries, without including irrelevant-for-installation files (like README), and differences are fairly normal there. Even for sdists, there isn’t much point in including the CI pipeline definitions.
Also, even if there were parity between the Git repository and PyPI artifacts, the backdoor could live in the Git repository for years, undetected, or be well hidden and activated years later by a benign change.
Normally, the wheel building process should be repeatable. Not all build backends do reproducible builds by default, unfortunately, but I know at least hatchling and flit-core put some effort into that. So I should be able to trust a wheel without inspecting it, by building the source (repo or sdist) and checking the hash. (Whether anybody actually does that for PyPI is another problem…)
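For what it's worth, the "rebuild and check the hash" step can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, and it assumes the build really is bit-for-bit reproducible, which not every backend guarantees.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def wheels_match(local_wheel: Path, published_wheel: Path) -> bool:
    """True if a locally rebuilt wheel is byte-identical to the published one.

    Hypothetical helper: both paths are assumed to point at wheel files
    that the caller has already built or downloaded.
    """
    return sha256_of(local_wheel) == sha256_of(published_wheel)
```

A mismatch does not by itself prove tampering; it may just mean the build is not reproducible (different timestamps, compiler versions and so on), so it is a prompt to look closer rather than a verdict.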
AFAIU, distributors like Linux distros usually don’t use wheels but build sdists, but I may be wrong on that.
Point is, the sdist adds one more thing to audit (with repo and wheels), and because it’s assumed by people to be “more or less the same thing as the repo” (unlike wheels), my guess is that it’s less likely to come under scrutiny.
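To make that audit step a bit more concrete, here is a rough sketch of comparing an sdist against a checkout of the repository. The function name and the returned dictionary layout are my own invention for illustration; it assumes the usual sdist layout with a single top-level "name-version/" directory, and it only reads files, it does not extract anything.

```python
import tarfile
from pathlib import Path


def sdist_vs_repo(sdist_path: str, repo_dir: str) -> dict:
    """Compare regular files in an sdist tarball against a source checkout.

    Returns the paths that exist only in the sdist and the paths whose
    contents differ from the checkout. (Hypothetical helper, not part of
    any packaging tool.)
    """
    repo = Path(repo_dir)
    only_in_sdist, differing = [], []
    with tarfile.open(sdist_path, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # Drop the top-level "name-version/" directory of the sdist.
            parts = Path(member.name).parts[1:]
            if not parts:
                continue
            rel = Path(*parts)
            data = tar.extractfile(member).read()
            candidate = repo / rel
            if not candidate.is_file():
                only_in_sdist.append(rel.as_posix())
            elif candidate.read_bytes() != data:
                differing.append(rel.as_posix())
    return {"only_in_sdist": sorted(only_in_sdist), "differing": sorted(differing)}
```

PKG-INFO would be expected to show up under "only_in_sdist"; the interesting entries are the unexpected ones, such as pregenerated files, or anything under "differing".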
Increasingly, there’s a two-part answer to that. “Don’t use wheels”: true. “Build sdists”: might be false. I’ve been told by a few distro packagers that they don’t use sdists and require a “proper” release tarball. But the one generated by GitHub might qualify, which might loop back to the beginning of this topic…
Well, other than that we’re vulnerable to the same thing, probably not?
IIUC, one of the benefits of Trusted Publishers is enabling auditable release pipelines for projects that tie the release artifact to a specific commit hash+timestamp.
I think we’re very very far away from any kind of enforcement of audit trails/provenance on such things and there are wider efforts to do so (partly motivated IIUC by draft/upcoming US and EU regulations around software supply chain).
Theoretically, a build farm for PyPI (see “Lack of a build farm for PyPI” on pypackaging-native) could also be viewed as an avenue for resolving this, ensuring consistency across the build process end-to-end by also recording the incoming source tarball.
What are the obstacles for independent verification of build artifacts? From what I can think of, there’s just full reproducibility: going from source version-control checkout (or potentially source archive, if commit metadata isn’t required) to wheel with the same inputs (eg same file modification-dates, and for platform-specific wheels, same statically-linked libraries).
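As an illustration of the "same inputs" point: file modification times, ownership and file ordering are typical things that make two builds of identical sources differ. Below is a minimal sketch of an archive builder that pins those. The function name and the fixed timestamp of 0 are arbitrary choices for the example; this is not what any particular build backend does.

```python
import gzip
import io
import tarfile
from pathlib import Path


def deterministic_tar_gz(src_dir: str, mtime: int = 0) -> bytes:
    """Pack src_dir into .tar.gz bytes that depend only on file names and contents."""
    root = Path(src_dir)
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),  # stable ordering
    )
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path in files:
            info = tarfile.TarInfo(path.relative_to(root).as_posix())
            info.size = path.stat().st_size
            # Pin every piece of metadata that could vary between machines.
            info.mtime = mtime
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            with path.open("rb") as f:
                tar.addfile(info, f)
    out = io.BytesIO()
    # The gzip header carries its own timestamp, so pin that as well.
    with gzip.GzipFile(filename="", fileobj=out, mode="wb", mtime=mtime) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()
```

Two checkouts with identical contents but different file timestamps should then produce identical bytes, and therefore identical hashes. The harder part in practice is everything this sketch ignores, such as compiler output and statically linked libraries in platform-specific wheels.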
Using Trusted Publishers and a PyPI build farm still requires trusting third parties (which for most I suspect is fine, but I think there should still be a path for independent verification without requiring any trust).
Yep – we designed Trusted Publishing to seal the step between “a release is minted on GitHub” (or other supported CI/CD providers) and “one or more distributions of that release appear on PyPI.”
In practice, I think this is about as good as it gets: Python packaging has largely migrated off of source distributions for what I think are good reasons (including security reasons, like not running setup.py on every single install edge). Consequently we need some kind of separate distribution format, which means some kind of differential versus the “ground truth” of the source repo. That means that some variant of this technique (which isn’t itself a vulnerability, IMO) will always exist, including wherever package distributions are redistributed (e.g. as dpkgs, Homebrew bottles, etc.).
(One significant missing piece of the Trusted Publishing story is how end users can verify the provenance of TP’d packages. PEP 740 is our WIP PEP for that.)
Thanks everyone, and sorry for the late reply. I take it that trusted publishing is the way to go for users who want this kind of check, and generalizing it should be the way to make sure dependency sets can more frequently be verified in this way.
Based on the social factors behind how the backdoor managed to infiltrate, there are also lessons to learn about 1) the security implications of Code of Conduct violations (rude conduct turning into guilt-tripping, with the apparent purpose of…) 2) social pressure on weak-link maintainers and the potential consequences.
However, I reckon that the PSF already does a quite good job with both of these things.
Would trusted publishers have made any difference in this scenario at all? The way I see it, the chain of events would have been:
Our evil JiaT75 would have added the malicious adjustments to the release tarballs as before
He/she would have then signed them against his/her GitHub account
Downstream repositories would have seen a valid signature authentically proving that the tarball came from the owner of a GitHub account of a highly active and trusted xz maintainer
Red Hat and Ubuntu maintainers would have slept soundly that night, comfortably within the illusion that nothing is amiss
I guess if we steer people towards building and signing on CI then at least the tampering steps have to be recorded in source control but I can’t see it being that difficult to obfuscate that trail too.
I don’t think it would have made a difference in terms of stopping Jia, since (in terms of attacker models) they were the perfect insider threat.
What Trusted Publishing does is close one of the gaps that Jia exploited to remain surreptitious (the disconnect between source code as it appears in the public repo and package distributions as they appear on a host/index/repo/etc.). With Trusted Publishing, the source repo is linked to the publishing step and can (but isn’t yet, see below) be exposed to the public for auditing and review.
More generally: IMO, it’s impossible to stop this kind of insider threat. Instead, all we can do is change the parameters of the game to force attackers into the open, making their attack riskier for them. That’s what PEP 740 will enable: it will enable PyPI to redistribute the Trusted Publisher state for downstream clients to verify individual release distributions against. That state includes things like the git commit that the distribution was built against, meaning that the attacker has to forfeit their stealth to produce a verifiable package.
TL;DR: I think there’s no perfect defense here, and TP is not a panacea. But, when coupled with PEP 740, it will force the attacker into a higher visibility position, which will help with incident response and timeline reconstruction.
I thought all the xz exploit files were already in git, i.e. there was no surreptitious changing of the output .tar.gz file (maybe I misunderstood what happened)? In which case Trusted Publishing wouldn’t have helped (as the sdist equivalent would have been built on CI, so the exploit is ready), and PyPI already does have pregenerated content in sdists (e.g. Cython C files, or compiled/minified JS for browsers), so it would seem entirely plausible for this same attack to affect PyPI.
No, there were files in the released artifact that weren’t present in the source repository (this is “expected” due to the use of autoconf) and the attacker made changes to these files causing them to differ from what should have been generated by autoconf.
Ah, I missed that build-to-host.m4 wasn’t committed, and instead the file was excluded via .gitignore (I thought it was, and I wonder if anyone would have noticed if it had been added with all the other m4 files…).
My thought was more along the lines of “how would I make it hard to spot such a thing using Python packaging”. The use of pregenerated files (and the expectation of those), and the potential divergence between source tree → wheel (which might happen on CI) and source tree → sdist → wheel, would seem to be the easiest places to hide, which to some extent is out of scope of Trusted Publishing (given it only proves that the artifact was generated on a Trusted Publisher).