Should any conclusions be drawn from the xz backdoor?

By now, news is spreading rapidly across the Internet that a backdoor was injected into xz by one of its maintainers.

The part that interests me here is that the backdoor was not injected into the Git repository on GitHub, but only into the release tarballs. Due to how Autoconf & co. work, it is common for tarballs of Autoconf projects to include extra files that are not in the Git repository (such as the configure script). Apparently, the attacker exploited the fact that downstream redistributors did not consider differences between the release tarball and the repository particularly suspicious, which made the attack harder to discover.

Should any conclusions be drawn from this for Python’s packaging model? Currently, releases on PyPI normally include an sdist, which may or may not contain the same files as the source repository (apart from PKG-INFO). This could potentially be used as a vector for making an exploit easier to hide.
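To make the concern concrete, here is a minimal sketch of the kind of check a downstream auditor could run: listing the files inside an sdist tarball and flagging anything that isn’t in the repository checkout (the demo builds a synthetic sdist in a temp directory; the file names, including `evil.m4`, are hypothetical):

```python
import pathlib
import tarfile
import tempfile

def sdist_extra_files(sdist_path, repo_files):
    """Return files present in the sdist but not in the repo listing.

    Paths inside an sdist are prefixed with '<name>-<version>/', so the
    first path component is stripped before comparing. PKG-INFO is
    expected to be sdist-only, so it is ignored.
    """
    with tarfile.open(sdist_path) as tf:
        members = {
            m.name.split("/", 1)[1]
            for m in tf.getmembers()
            if m.isfile() and "/" in m.name
        }
    return sorted(members - set(repo_files) - {"PKG-INFO"})

# Demo with a synthetic sdist: 'evil.m4' is in the tarball but not the repo.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    for name in ("setup.py", "PKG-INFO", "evil.m4"):
        (root / name).write_text("...\n")
    sdist = root / "pkg-1.0.tar.gz"
    with tarfile.open(sdist, "w:gz") as tf:
        for name in ("setup.py", "PKG-INFO", "evil.m4"):
            tf.add(root / name, arcname=f"pkg-1.0/{name}")
    extras = sdist_extra_files(sdist, ["setup.py"])
    print(extras)  # -> ['evil.m4']
```

In real use, `repo_files` would come from something like `git ls-files` on the tagged commit; the point is only that the comparison is mechanical once you decide which differences (like PKG-INFO) are expected.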

Has this been discussed in the past?


Many (most?) packages also distribute wheels on PyPI, whose contents can be completely different from the sources in the Git repository. Wheels contain only the actual code and native libraries, omitting files that are irrelevant for installation (like the README), so differences are entirely normal there. Even for sdists, there isn’t much point in including the CI pipeline definitions.

Also, even if there were parity between the Git repository and the PyPI artifacts, a backdoor could live in the Git repository for years, undetected or well hidden, and be activated years later by a seemingly benign change.


I completely agree, but note that

  • Normally, the wheel building process should be repeatable. Unfortunately, not all build backends produce reproducible builds by default, but I know that at least hatchling and flit-core put some effort into this. So I should be able to trust a wheel without inspecting it, by building from the source (repo or sdist) and checking the hash. (Whether anybody actually does that for PyPI is another problem…)
  • AFAIU, distributors like Linux distros usually don’t use wheels but build from sdists, though I may be wrong on that.

The point is, the sdist adds one more thing to audit (alongside the repo and the wheels), and because people assume it is “more or less the same thing as the repo” (unlike wheels), my guess is that it is less likely to come under scrutiny.


Increasingly, there’s a two-part answer to that. “They don’t use wheels”: true. “They build from sdists”: possibly false. I’ve been told by a few distro packagers that they don’t use sdists and instead require a “proper” release tarball, but the tarball generated by GitHub might qualify, which loops back to the beginning of this topic…


Well, other than that we’re vulnerable to the same thing, probably not?

IIUC, one of the benefits of Trusted Publishers is enabling auditable release pipelines for projects, tying each release artifact to a specific commit hash and timestamp.

I think we’re very, very far away from any kind of enforcement of audit trails/provenance for such things, though there are wider efforts in that direction (partly motivated, IIUC, by draft/upcoming US and EU regulations around the software supply chain).

Theoretically, Lack of a build farm for PyPI - pypackaging-native could also be viewed as an avenue for resolving this, ensuring consistency across the build process end-to-end by also recording the incoming source tarball.


With my distributor hat on, this scenario is a large part of the reason why I prefer building from GitHub tags rather than tarballs on PyPI.


What are the obstacles to independent verification of build artifacts? As far as I can tell, the main one is achieving full reproducibility: going from a version-control checkout (or potentially a source archive, if commit metadata isn’t required) to a wheel with the same inputs (e.g. the same file modification dates and, for platform-specific wheels, the same statically linked libraries).
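One way to sidestep the modification-date obstacle, short of full byte-for-byte reproducibility, is to compare a wheel’s payload while ignoring the zip timestamps. A minimal sketch, assuming only that a wheel is an ordinary zip archive (the in-memory “wheels” below are synthetic):

```python
import hashlib
import io
import zipfile

def payload_digest(wheel):
    """Digest of a zip archive's member names and contents, ignoring the
    per-member timestamps that often break byte-for-byte comparisons."""
    h = hashlib.sha256()
    with zipfile.ZipFile(wheel) as zf:
        for name in sorted(zf.namelist()):
            h.update(name.encode())
            h.update(zf.read(name))
    return h.hexdigest()

def make_wheel(date_time):
    """Build a tiny in-memory 'wheel' whose one member has the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("pkg/__init__.py", date_time=date_time)
        zf.writestr(info, "VERSION = '1.0'\n")
    buf.seek(0)
    return buf

a = make_wheel((2020, 1, 1, 0, 0, 0))
b = make_wheel((2024, 3, 29, 12, 0, 0))
# Raw bytes differ (timestamps), but the payload digests match:
print(a.getvalue() == b.getvalue(), payload_digest(a) == payload_digest(b))
# -> False True
```

This only handles timestamps, of course; differences in statically linked libraries or compiler versions would still show up in the member contents themselves.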

Using Trusted Publishers and a PyPI build farm still requires trusting third parties (which is probably fine for most users, but I think there should still be a path to independent verification that doesn’t require trusting anyone).


Yep – we designed Trusted Publishing to seal the step between “a release is minted on GitHub” (or other supported CI/CD providers) and “one or more distributions of that release appear on PyPI.”

In practice, I think this is about as good as it gets: Python packaging has largely migrated off of source distributions for what I think are good reasons (including security reasons, like not running arbitrary code on every single install). Consequently, we need some kind of separate distribution format, which means some kind of differential versus the “ground truth” of the source repo. That means that some variant of this technique (which isn’t itself a vulnerability, IMO) will always exist, including wherever package distributions are redistributed (e.g. as dpkgs, Homebrew bottles, etc.).

(One significant missing piece of the Trusted Publishing story is how end users can verify the provenance of TP’d packages. PEP 740 is our WIP PEP for that.)


Thanks everyone, and sorry for the late reply. I take it that Trusted Publishing is the way to go for users who want this kind of check, and that generalizing it is the way to ensure dependency sets can be verified like this more often.


That’s my opinion, at least!


Given the social factors behind how the backdoor managed to infiltrate, there are also lessons to learn about 1) the security implications of Code of Conduct violations (rude conduct turning into guilt-tripping, with the apparent purpose of…) and 2) social pressure on weak-link maintainers and its potential consequences.

However, I reckon the PSF already does quite a good job with both of these things.