Addressing dependency version confusion exploit chain

Hello,

In light of recent articles about package managers preferring higher-version packages, I'd like to discuss how this affects pip's defaults.

The exploit path is essentially this (a rough sketch of why the squatted version wins follows the list):

  1. An organisation creates a custom Python library with a certain version number
  2. The org lists the library in a requirements file with a relaxed version specifier (or none at all), alongside packages that are available on PyPI
  3. Someone squats on the custom library's name on PyPI, publishing a higher version
  4. pip will then install the squatted library from PyPI, as pip prefers newer versions
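
To make the failure mode concrete, here is a minimal sketch of what "prefer the newest version across all configured indexes" means for an unpinned internal package. This is not pip's actual resolution code; the package name, versions, and index URLs are invented, and it assumes the third-party packaging library is installed:

```python
from packaging.version import Version

# Hypothetical candidates for an internal package called "acme-internal-utils".
candidates = [
    (Version("1.4.0"), "https://pypi.internal.example.com/simple"),  # legitimate release
    (Version("99.0.0"), "https://pypi.org/simple"),                  # squatted upload
]

# A resolver that simply takes the highest version, regardless of which index
# offered it, ends up picking the attacker's package.
version, index = max(candidates, key=lambda candidate: candidate[0])
print(f"would install {version} from {index}")  # -> 99.0.0 from pypi.org
```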

We actually talked about this very problem here:

If the version in svn is later than any on PyPI, it will be picked. If not, it won’t. If the latest version on PyPI is the same as the version number in svn, then either can be picked, but they should be identical (because that’s what having the same version number means) and so why would you care?
If you want to pick the version in svn in preference to those on PyPI, give it a higher version number.

Only, in that ticket we were just talking about svn.

I suppose we can say that orgs have a responsibility to pin their version numbers, but I think this warrants a discussion about what our defaults are.

One of the suggestions in the linked issue is to use --no-index, but that requires splitting requirements into internal and public parts. This may be fine, but at least we should note and document it somewhere as a special use case.
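
For illustration, a split like that might look roughly like the following. The file and directory names are hypothetical; the flags (--no-index, --find-links, --index-url, -r) are standard pip options:

```python
import subprocess
import sys

def pip_install(*args: str) -> None:
    """Run pip inside the current interpreter's environment."""
    subprocess.run([sys.executable, "-m", "pip", "install", *args], check=True)

# Internal packages: never consult an index, only a vetted local wheel directory.
pip_install("--no-index", "--find-links", "./internal-wheels",
            "-r", "requirements-internal.txt")

# Public packages: installed from PyPI (or a private mirror) as usual.
pip_install("--index-url", "https://pypi.org/simple",
            "-r", "requirements-public.txt")
```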

The general recommendation for organisational usage like this is to build a private proxy instead of accessing PyPI directly, since files distributed on PyPI (those with different names from your private packages) are not inherently any more trustworthy to you either. There are multiple tools for building an auditable private PyPI mirror for your organisation, such as bandersnatch and devpi.

If you really want to arbitrarily pick and choose what you believe is safe, there are also “pass-through proxy” tools that provide half-assed security guarantees. devpi provides many related functionalities, and I built another one myself for cases where you don’t want to set up a full-fledged, high-volume web server. The idea is that you point --index-url to your own server, so you have full control over which package name means what.
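
The core of that idea is just a name-to-index routing rule. A purely illustrative sketch (real tools such as devpi implement this properly; the package names and URLs below are made up):

```python
# Names owned by the organisation; everything else is treated as public.
INTERNAL_PACKAGES = {"acme-internal-utils", "acme-billing-client"}

PRIVATE_INDEX = "https://pypi.internal.example.com/simple"
PUBLIC_INDEX = "https://pypi.org/simple"

def index_for(project_name: str) -> str:
    """Decide which upstream the proxy consults for a given project.

    Internal names are only ever served from the private index, so a
    same-named upload on PyPI can never shadow them.
    """
    normalized = project_name.lower().replace("_", "-")
    return PRIVATE_INDEX if normalized in INTERNAL_PACKAGES else PUBLIC_INDEX

assert index_for("acme_internal_utils") == PRIVATE_INDEX
assert index_for("requests") == PUBLIC_INDEX
```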

1 Like

Thanks for your response @uranusjr

Where is the documentation for this? I’m not disbelieving you, I just don’t know where it is.

But yes, overall, I agree with your sentiment, I just wonder if there isn’t more that we can do.

I don’t believe there is much documentation on this. It’s inherently something that larger companies building Python applications would be interested in, but we’ve not had much contribution from such organisations in the form of documentation of best practices like this.

The majority of PyPA members are volunteers, so ideally this sort of documentation would be contributed by third parties who have experience with this sort of model.

1 Like

I gave a talk at PyCon US 2019 explaining how the SecureDrop project does this. I also wrote a blog post today explaining the newer changes we brought into the workflow for the same. All of our code is available in the related git repository.

We have

  • reproducible source tarballs
  • reproducible wheels (from both public and private packages)
  • OpenPGP signature verification of the hashes
  • and finally reproducible Debian packages from them
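
As a rough illustration of the hash-verification step (this is not the SecureDrop project's actual code; it assumes a sha256sums.txt whose OpenPGP signature has already been checked, e.g. with gpg --verify, and a dist/ directory of built artifacts):

```python
import hashlib
from pathlib import Path

def verify_artifacts(sums_file: Path, artifact_dir: Path) -> None:
    """Compare local sdists/wheels against a signature-verified sha256sums.txt
    containing lines of '<hexdigest>  <filename>'."""
    for line in sums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split()
        actual = hashlib.sha256((artifact_dir / name).read_bytes()).hexdigest()
        if actual != expected:
            raise SystemExit(f"hash mismatch for {name}: {actual} != {expected}")
    print("all artifact hashes match the signed list")

verify_artifacts(Path("sha256sums.txt"), Path("dist"))
```
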
4 Likes

This is really great. So how do we go about introducing this stuff to the wider community?

The disclosure of this attack is so far the best way of getting people’s attention on it. It’s just going to have to be announced repeatedly in the hope that people will look for solutions.

I got Microsoft to publish a generic (and almost entirely non-advertisement) whitepaper on it with high level guidance. We didn’t know that “dependency confusion” would stick as the name.

It’s deliberately not a list of products or companies that offer workarounds, but the hope is that it’ll get people looking for them. Conference talks and blog posts (which are dated and not expected to be “forever” references like a paper) are better venues for sharing specifics like that, and hopefully citing our paper will give them enough legitimacy to be accepted by those who write them.

6 Likes

Pulp and DevPi have docs on this:

And there’s this:

Are there CWE (Common Weakness Enumeration) or CAPEC URIs for this new attack pattern? Is it a:

  • “CWE-610: Externally Controlled Reference to a Resource in Another Sphere”
  • “CWE-829: Inclusion of Functionality from Untrusted Control Sphere”
  • “CWE-1104: Use of Unmaintained Third Party Components”

Is this correct: if you specify a non-PyPI index-url, it’s possible that you’ll install compromised packages (that have a higher version number); and the potential impact is the same as for other software supply chain issues?

Does TUF specify a signed mapping between package names and keys? AFAIU, this issue is not resolved in e.g. the DEB and RPM ecosystems either (any GPG key imported to the GPG keyring can sign for any version of any package from any index-url); though you can specify:

  • channel priority
  • which packages to allow to be installed from a given channel: allow/block
  • per-package hashes for all platforms (like pipenv)
  • per-package hashes for just one platform (e.g. “arm64” in pip requirements.txt hashes)

It’s also an option to solve the deptree locally, freeze it to requirements.txt with package hashes for the target platform, and then install that with pip install --no-deps; or generate a distributable binary with bundled static deps and no at-dev-time/at-install-time dynamic retrieval of updated dependencies with something like blaze or pants.
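
A minimal version of that freeze-and-install flow could look like this. It is only a sketch: pip-compile comes from the third-party pip-tools project (pipenv, or hand-maintained pip hash output, would work too), and the requirements.in/requirements.txt file names are just a convention:

```python
import subprocess
import sys

def run(*args: str) -> None:
    subprocess.run(list(args), check=True)

# Resolve the dependency tree locally and freeze it with per-package hashes.
run("pip-compile", "--generate-hashes",
    "--output-file", "requirements.txt", "requirements.in")

# Install exactly those artifacts: no resolution of unpinned transitive
# dependencies, and every downloaded file must match a recorded hash.
run(sys.executable, "-m", "pip", "install",
    "--no-deps", "--require-hashes", "-r", "requirements.txt")
```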

Keep in mind that for the specific attack published last week, it is sufficient to just specify pinned versions. That’s all. The attack was predicated entirely on being able to publish a version number higher than any legitimately-published version number, and relying on unpinned dependencies in installation processes.

Hash-comparison, signature-checking, etc. are all welcome as additional ways to protect against tampered packages, illegitimate packages, etc. but aren’t strictly necessary for this attack.
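
As a CI guard for the "just pin everything" approach, something like this sketch can reject requirements files with unpinned entries before they ever reach pip (a hypothetical helper, not part of pip, and deliberately simplistic about extras and environment markers):

```python
import re
import sys
from pathlib import Path

# A line counts as pinned only if it uses an exact `==` specifier.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*\S+")

def unpinned_lines(requirements: Path) -> list:
    bad = []
    for line in requirements.read_text().splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and option lines such as --hash continuations.
        if not stripped or stripped.startswith(("#", "-")):
            continue
        if not PINNED.match(stripped):
            bad.append(stripped)
    return bad

if __name__ == "__main__":
    bad = unpinned_lines(Path(sys.argv[1]))
    if bad:
        sys.exit("unpinned requirements found:\n" + "\n".join(bad))
```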

3 Likes

Afraid not, at least not in the general case (though possibly it is enough for pip as it stands today).

All the package managers out there treat package name+version as the primary key, so if two indexes claim to have the same version, the installer is free to choose whichever it likes. NuGet takes the fastest response (presumed lowest latency). When I looked, pip doesn’t seem to choose one reliably. My guess would be that it normally ends up with --index-url rather than any --extra-index-url, but it may also get the first processed response (presumably, lowest latency).

If your private server has MyPackage==1.0.0 and one day PyPI also has it, you’ve got some chance of getting either. There are no defined or specified rules for how pip decides (nor for any other package manager, except Conda).

For this attack, it is sufficient to claim the name of each internal package on every single feed you reference and ensure that you claim a different version than what you intend to install. To avoid spam, we didn’t recommend this, instead suggesting a proxying feed (which will return its own MyPackage==1.0.0 rather than any MyPackage from an upstream feed) and ensuring you only ever pass a single --index-url to pip.
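
For the "claim the name on every feed" route, the first step is knowing whether anyone else already has. A small sketch that checks each internal name against PyPI's public JSON API (the package names here are made up):

```python
import urllib.error
import urllib.request

# Hypothetical names that should only ever exist on the private index.
INTERNAL_PACKAGES = ["acme-internal-utils", "acme-billing-client"]

def exists_on_pypi(name: str) -> bool:
    """Return True if a project with this name is registered on pypi.org."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

for name in INTERNAL_PACKAGES:
    if exists_on_pypi(name):
        print(f"WARNING: {name} exists on PyPI -- check who owns that release")
```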

2 Likes

pip doesn’t seem to choose one reliably. My guess would be that it normally ends up with --index-url rather than any --extra-index-url, but it may also get the first processed response (presumably, lowest latency)

So this is a practical thing that we can fix by making the choice deterministic and documenting it.

1 Like

Yep, except it’ll break users, so that has to be managed. But the break is unavoidable - there’s no way to fix the behaviour without changing it from what it currently is (but at least now we can say it’s a security fix without risking disclosure - that wasn’t the case when I suggested it privately a few months back).

1 Like

Guess we can do the normal delayed thing, where we add it in as an option with a warning on the command line that the new behaviour will become the default after a certain amount of time. And probably just leave in the unsafe option for people who have no choice.

Thanks @steve.dower. My statement was based on the assumption that only a single index/feed is in use inside the enterprise network. If users use multiple feeds, they’re opening the door to many more complications, as you point out :slight_smile:

Certainly in our network, since we have the occasional need to block packages from PyPI, or provide patched versions of them, our users cannot connect pip to multiple feeds, so using a namespace-based technique will be reliable for us.

2 Likes

A single custom index with vetted packages is currently the best and safest approach for medium to large organizations. For small organizations, a centrally maintained lock file with pinned versions and hashes can be a feasible solution, too. Lock and constraint files don’t scale well, though; developers must be extra careful that they don’t pull in unpinned packages.
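
For the small-organization case, the constraints-file flavour of that approach is roughly this (a sketch only; the file names are hypothetical, and -c/-r are standard pip options):

```python
import subprocess
import sys

# A centrally maintained constraints file pins every allowed version; each
# project's requirements.txt then cannot drift above those pins.
subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "-c", "constraints.txt",    # org-wide exact pins (name==version)
     "-r", "requirements.txt"],  # this project's direct dependencies
    check=True,
)
```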

Unfortunately, not every organization uses a custom PyPI clone consistently. Just earlier this week a lead engineer from a major telco company complained that an update of a package broke their “staging environment for a distributed hardware testing platform” (quote).

In the distant future PyPI might grow additional features to reduce this risk. Crowd-sourcing package reviews could be a path to create a curated view of PyPI for popular packages: packagers would still upload to PyPI, trusted users would then review and vote on packages or even individual updates, and eventually the package/update would end up on curated.pypi.org.

If you don’t keep a local copy of the package dependencies, you cannot have locally-reproducible builds (because your build depends upon external data sources, e.g. PyPI, that you do not control and have no local copy of).

You can cache pip’s cache directory (~/.cache/pip on Linux) in a multi-stage Docker image.

Why I don’t see this as a new issue:

  • We trust the package index

  • If you add one or more indices, you trust those too: “shadowed” packages needn’t even have a greater version number, FWIU.

  • Does the PyPI implementation of TUF prevent package shadowing with or without greater version numbers?

  • We trust the DNS and the network

  • It is still the case that the first DNS response will be taken; DNSSEC and DoH/DoT help here.

  • One compromised CA cert, and SSL/TLS with PKI is broken.

  • A MITM needn’t specify a greater package version number to succeed (when nothing is checking signed hashes and verifying that they’re authorized signatures for that software (which is composed of many unsigned PRs that need review))

Perhaps it’s not said enough: pip is for development. If you want to install packages and set permissions such that the code can’t overwrite itself (a W^X vulnerability), you’ll need to run pip as a different non-root user and/or set the permissions and/or use an OS package manager with per-file hashes.

Again, does TUF have a signed list of who is authorized to sign releases of each package? How can legit package shadowing work with that?

(Edit: this post has been moderated? Good luck with this.)

Your statement is incorrect. You need access to a verifiable and available copy. It doesn’t have to be a local copy.

That statement is at least misleading and IMHO also wrong. pip is designed for production deployment, but you have to implement additional checks to make it secure.

PEP 458’s use of TUF does not prevent malicious code on PyPI. Simply speaking, every upload to PyPI will get a valid TUF signature. PEP 458 only protects the data integrity of PyPI and its mirrors after packages have been uploaded. A correct PEP 458 signature does not imply trust in a package, only trust in the PyPI infrastructure after uploads have been processed.

1 Like

Precisely. A key point about this exploit is that it attacks environments that don’t implement sufficient additional infrastructure (dependency pinning, audited and curated internal PyPI proxies). While it’s reasonable for pip and PyPI to try to mitigate the risks here, IMO the real message of this report is that there are a lot of companies that should know better but are not investing enough in that sort of infrastructure.

One thing that could be done is better documentation and publicity on “how to build a robust and secure Python development environment”. But the only people who can do that are companies who have built such an environment, so it’s a bit of a chicken-and-egg problem :slightly_frowning_face:

3 Likes

Let’s be clear here. Pip doesn’t prevent anything, because pip by design downloads and runs arbitrary code from the internet (that’s what an install from source of a setuptools-based project does). Pip has mechanisms that people can use to limit exposure, but none of them are mandatory, and I don’t expect them to ever be mandatory.

Pip is a component in an ecosystem. And the ecosystem is not designed to be end-to-end secure and curated - it’s designed to promote easy sharing of code, in the spirit of open source. And I, for one, hope that if the goals of more security and code sharing ever come into conflict, code sharing will win - people who need security and auditability have the means to pay for it, and should be encouraged to do so. Python is a product of the open source culture, and should by default support that culture, IMO.

8 Likes