Addressing dependency version confusion exploit chain

Thanks for your response @uranusjr

Where is the documentation for this? I’m not disbelieving you, I just don’t know where it is.

But yes, overall, I agree with your sentiment, I just wonder if there isn’t more that we can do.

I don’t believe there is much documentation on this. It’s inherently something that larger companies building Python applications would be interested in, but we’ve not had much contribution from such organisations in the form of documentation of best practices like this.

The majority of PyPA members are volunteers, so ideally this sort of documentation would be contributed by third parties who have experience with this sort of model.

1 Like

I gave a talk at PyCon US 2019 explaining how the SecureDrop project does this. I also wrote a blog post today explaining the newer changes we have since brought into that workflow. All of our code is available in the related git repository.

We have

  • reproducible source tarballs
  • reproducible wheels (from both public and private packages)
  • OpenPGP signature verification of the hashes
  • and finally reproducible Debian packages from them
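
This is not the actual SecureDrop tooling, and it shows only pip’s own hash checking rather than the OpenPGP verification step, but it sketches the hash-pinning part such a workflow relies on, assuming a project in the current directory, the “build” package installed, and an illustrative wheel file name:

```bash
# Build a wheel with a fixed timestamp so repeated builds produce identical
# bytes (bdist_wheel honours SOURCE_DATE_EPOCH for timestamps inside the archive).
export SOURCE_DATE_EPOCH=1613000000
python -m build --wheel

# Record the expected hash in requirements format...
pip hash dist/mypackage-1.0.0-py3-none-any.whl

# ...and enforce it at install time:
pip install --require-hashes -r requirements.txt
```
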
3 Likes

This is really great. So how do we go about introducing this stuff to the wider community?

The disclosure of this attack is so far the best way of getting people’s attention on it. It’s just going to have to be announced repeatedly in the hope that people will look for solutions.

I got Microsoft to publish a generic (and almost entirely non-advertisement) whitepaper on it with high-level guidance. We didn’t know that “dependency confusion” would stick as the name.

It’s deliberately not a list of products or companies that offer workarounds, but the hope is that it’ll get people looking for them. Conference talks and blog posts (which are dated and not expected to be “forever” references like a paper) are better venues for sharing specifics like that, and hopefully citing our paper will give them enough legitimacy to be accepted by those who write them.

6 Likes

Pulp and DevPi have docs on this:

And there’s this:

Are there CWE (Common Weakness Enumeration) or CAPEC URIs for this new attack pattern? Is it a:

  • “CWE-610: Externally Controlled Reference to a Resource in Another Sphere”
  • “CWE-829: Inclusion of Functionality from Untrusted Control Sphere”
  • “CWE-1104: Use of Unmaintained Third Party Components”

Is this correct: if you specify a non-PyPI index-url, it’s possible that you’ll install compromised packages (that have a higher version number); and the potential impact is the same as for other software supply chain issues?

Does TUF specify a signed mapping between package names and keys? AFAIU, this issue is not resolved in e.g. the DEB and RPM ecosystems either (any GPG key imported to the GPG keyring can sign for any version of any package from any index-url); though you can specify:

  • channel priority
  • which packages to allow to be installed from a given channel: allow/block
  • per-package hashes for all platforms (like pipenv)
  • per-package hashes for just one platform (e.g. “arm64” in pip requirements.txt hashes)

It’s also an option to solve the deptree locally, freeze it to requirements.txt with package hashes for the target platform, and then install that with pip install --no-deps; or generate a distributable binary with bundled static deps and no at-dev-time/at-install-time dynamic retrieval of updated dependencies with something like blaze or pants.
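
A minimal sketch of that lock-and-install flow, assuming pip-tools is available and a top-level requirements.in listing only the direct dependencies (both file names are illustrative):

```bash
# Resolve the dependency tree locally and freeze it with per-package hashes
# (pip-compile is from the pip-tools project).
pip install pip-tools
pip-compile --generate-hashes --output-file requirements.txt requirements.in

# Install exactly what was locked: no resolution, no unpinned transitive pulls,
# and every downloaded artifact must match one of the recorded hashes.
pip install --no-deps --require-hashes -r requirements.txt
```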

Keep in mind that for the specific attack published last week, it is sufficient to just specify pinned versions. That’s all. The attack was predicated entirely on being able to publish a version number higher than any legitimately-published version number, and relying on unpinned dependencies in installation processes.

Hash-comparison, signature-checking, etc. are all welcome as additional ways to protect against tampered packages, illegitimate packages, etc. but aren’t strictly necessary for this attack.
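
To make the distinction concrete (the package name here is made up):

```bash
# Unpinned requirement -- the pattern the published attack relied on, because
# whichever index offers the highest version of the name wins:
#     acme-internal-utils
#
# Pinned requirement -- sufficient against that specific attack:
#     acme-internal-utils==1.4.2
pip install "acme-internal-utils==1.4.2"
```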

3 Likes

Afraid not, at least not in the general case (possibly for pip as it stands today).

All the package managers out there treat package name+version as the primary key, so if two indexes claim to have the same version, the package manager is free to choose whichever it likes. NuGet takes the fastest response (presumed lowest latency); when I looked, pip didn’t seem to choose one reliably. My guess would be that it normally ends up with --index-url rather than any --extra-index-url, but it may also take the first processed response (presumably the lowest latency).

If your private server has MyPackage==1.0.0 and one day PyPI also has it, you’ve got some chance of getting either. There are no defined or specified rules for how pip decides (nor for any other package manager, except Conda).

For this attack, it is sufficient to claim the name of each internal package on every single feed you reference and ensure that you claim a different version than the one you intend to install. To avoid spam, we didn’t recommend this; instead we suggested using a proxying feed (which will return its own MyPackage==1.0.0 rather than any MyPackage from an upstream feed) and ensuring you only ever pass a single --index-url to pip.
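
Concretely, that suggestion amounts to something like this (the proxy URL is made up):

```bash
# Point pip at exactly one index: an internal proxy that serves your private
# packages itself and mirrors the approved public ones. Avoid --extra-index-url,
# which gives a second index equal standing.
pip install --index-url https://pypi.example.internal/simple/ MyPackage==1.0.0

# The same default can live in pip.conf / pip.ini:
#     [global]
#     index-url = https://pypi.example.internal/simple/
```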

2 Likes

pip doesn’t seem to choose one reliably. My guess would be that it normally ends up with --index-url rather than any --extra-index-url, but it may also get the first processed response (presumably, lowest latency)

So this is a practical thing that we can fix by making the choice deterministic and documenting it.

Yep, except it’ll break users, so that has to be managed. But the break is unavoidable - there’s no way to fix the behaviour without changing it from what it currently is (but at least now we can say it’s a security fix without risking disclosure - that wasn’t the case when I suggested it privately a few months back).

1 Like

Guess we can do the normal delayed thing, where we add it in as an option with a warning on the command line that the new behaviour will become the default after X amount of time. And probably just leave in the unsafe option for people who have no choice.

Thanks @steve.dower, I was assuming that only a single index/feed was in use inside the enterprise network, and so my statement was based on that assumption. If users use multiple feeds, they’re opening the door to many more complications as you point out :slight_smile:

Certainly in our network, since we have the occasional need to block packages from PyPI, or provide patched versions of them, our users cannot connect pip to multiple feeds, so using a namespace-based technique will be reliable for us.

2 Likes

A single custom index with vetted packages is currently the best and safest approach for medium to large organizations. For small organizations, a centrally maintained lock file with pinned versions and hashes can be a feasible solution, too. Lock and constraint files don’t scale well, though, and developers must be extra careful that they don’t pull in unpinned packages.

Unfortunately not every organization uses a custom PyPI clone consistently. Just earlier this week a lead engineer from a major telco company complained that an update of a package broke their “staging environment for a distributed hardware testing platform” (quote).

In the distant future PyPI might grow additional features to reduce this risk. Crowd-sourcing package reviews could be one path to a curated view of PyPI for popular packages: packagers would still upload to PyPI, trusted users would then review and vote on packages or even individual updates, and eventually the package/update would end up on curated.pypi.org.

If you don’t keep a local copy of the package dependencies, you cannot have locally-reproducible builds (because your build depends upon external data sources (e.g. PyPI) that you do not control and have no local copy of).
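
One way to keep such a local copy with stock pip (the directory name is illustrative):

```bash
# Snapshot every dependency into a directory you control...
pip download --dest ./wheelhouse -r requirements.txt

# ...then build or install later without touching any remote index at all.
pip install --no-index --find-links ./wheelhouse -r requirements.txt
```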

You can cache ~/.pip/cache in a multi-stage Docker image.

Why don’t I see this as a new issue?

  • We trust the package index

  • If you add one or more indices, you trust those too: “shadowed” packages needn’t even have a greater version number, FWIU.

  • Does the PyPI implementation of TUF prevent package shadowing with or without greater version numbers?

  • We trust the DNS and the network

  • It is still the case that the first DNS response will be taken; DNSSEC and DoH/DoT help here.

  • One compromised CA cert and SSL/TLS with PKI is broken.

  • A MITM needn’t specify a greater package version number to succeed (when nothing is checking signed hashes, or checking that those are authorized signatures for that software, which is itself composed of many unsigned PRs that need review)

Perhaps it’s not said enough: pip is for development. If you want to install packages and set permissions such that the code can’t overwrite itself (a W^X vulnerability), you’ll need to run pip as a different non-root user and/or set the permissions and/or use an OS package manager with per-file hashes.
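
A rough sketch of that kind of permission split, assuming a dedicated deploy account and an /opt/app prefix (both made up):

```bash
# Install as a deploy user into a prefix the runtime account cannot write to.
sudo -u deploy python -m venv /opt/app/venv
sudo -u deploy /opt/app/venv/bin/pip install --require-hashes -r requirements.txt

# Strip write permission for everyone else, so the running service cannot
# modify its own code (the W^X-style concern above).
sudo find /opt/app/venv -exec chmod go-w {} +
```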

Again, does TUF have a signed list of who is authorized to sign releases of each package? How can legit package shadowing work with that?

(Edit: this post has been moderated? Good luck with this.)

Your statement is incorrect. You need access to a verifiable and available copy. It doesn’t have to be a local copy.

That statement is at least misleading and IMHO also wrong. pip is designed for production deployment, but you have to implement additional checks to make it secure.

PEP 458’s use of TUF does not prevent malicious code on PyPI. Simply speaking, every upload to PyPI will get a valid TUF signature. PEP 458 only protects the integrity of PyPI and its mirrors after packages have been uploaded. A correct PEP 458 signature does not imply trust in a package, only trust in the PyPI infrastructure after uploads have been processed.

Precisely. A key point about this exploit is that it attacks environments that don’t implement sufficient additional infrastructure (dependency pinning, audited and curated internal PyPI proxies). While it’s reasonable for pip and PyPI to try to mitigate the risks here, IMO the real message of this report is that there are a lot of companies who should know better but are not investing enough in that sort of infrastructure.

One thing that could be done is better documentation and publicity on “how to build a robust and secure Python development environment”. But the only people who can do that are companies who have built such an environment, so it’s a bit of a chicken-and-egg problem :slightly_frowning_face:

3 Likes

Let’s be clear here. Pip doesn’t prevent anything, because pip by design downloads and runs arbitrary code from the internet (that’s what an install from source of a setuptools-based project does). Pip has mechanisms that people can use to limit exposure, but none of them are mandatory, and I don’t expect them to ever be mandatory.
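
Two of those opt-in mechanisms, as a sketch for anyone who wants them:

```bash
# Refuse source distributions entirely, so no setup.py runs at install time,
# and require every artifact to match a pre-recorded hash.
pip install --only-binary :all: --require-hashes -r requirements.txt
```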

Pip is a component in an ecosystem. And the ecosystem is not designed to be end-to-end secure and curated - it’s designed to promote easy sharing of code, in the spirit of open source. And I, for one, hope that if the goals of more security and code sharing ever come into conflict, code sharing will win - people who need security and auditability have the means to pay for it, and should be encouraged to do so. Python is a product of the open source culture, and should by default support that culture, IMO.

8 Likes

For comparative production readiness with regard to e.g. DNF and APT, Python packages could gain per-manifest-file hashes for eggs and wheels (which don’t execute setup.py as the package-manager user), so that something like debsums or rpm -Va becomes possible, along with a database of installed packages and their global/user/conda_root/venv/condaenv install paths, and a way to detect package install path collisions.

Is it not the case that - to exploit this - the version number can be the exact same as the actual version number, because once you’ve specified an additional --index-url/--extra-index-url, you’ve trusted that source with equal priority to the default index server?

Is pip sufficient with e.g. sudo and umask, or find -exec chmod (edit: and semanage fcontext for SELinux extended attrs)? That’s not portable and it’s not built into pip, a tool that’s useful for read/write development and building one-file software releases.

Solutions for this, AFAICT:

  • add channel priorities (to the CLI, config syntax, and pip._internals)
  • add per-channel allow and deny package lists
  • require per-package release signing keyrings
    • Where do I specify which GPG keys are authorized to release a given package name (e.g. pkgx) on PyPI, but not other packages such as pip?

I’m sorry to be so blunt, but you are talking nonsense.

Distro packages are not less likely to contain malicious code because they contain checksums and are signed. They contain less malicious code because they go through a lengthy review and validation process. Packages are scanned, verified, tested, and checked before they even reach a staging area. Multiple humans and multiple independent systems do manual and automatic verification of code. Users have to become proven and trusted packagers before they are able to get involved.

The process is even more rigorous and complex for enterprise distributions. To give you an example: RHEL 8 is based on Fedora 28. F28 was released on May 1, 2018; RHEL 8.0 on May 7, 2019. RHEL contains only a fraction of the Fedora packages and it took a year to finalize QE. FIPS and CC validation took even longer or is still ongoing.

It would cost tens of millions of USD to apply a similar level of scrutiny to the top 1,000 packages on PyPI. It would also mean each release on PyPI would have to be delayed by days, weeks, perhaps even months.

PyPI and pip are an open ecosystem with all the pros and cons of an open ecosystem. For the most part they work as designed and intended. It’s by design that everybody can upload code to PyPI.

1 Like