Proposal: Preventing dependency confusion attacks with the map file

trishankatdatadog · February 1, 2023, 7:53pm

Hello everyone,

I am exploring how to prevent dependency confusion attacks with an implementation of an idea called the map file (aka TAP 4). The reader can find more information in that TAP, but hopefully this example is illuminating:

{
	"repositories": {
		"PyTorch": ["https://download.pytorch.org/whl/nightly"],
		"PyPI": ["https://pypi.org/simple"]
	},
	"mapping": [
		{
			"paths": ["torch*"],
			"repositories": ["PyTorch"],
			"terminating": true
		},
		{
			"paths": ["*"],
			"repositories": ["PyPI"]
		}
	]
}

Example 1: A map file instructing pip to download all torch* packages only from the PyTorch nightly index, and all other packages from PyPI.

The careful reader would note that the map file in Example 1 would have prevented the recent dependency confusion attack on torchtriton.
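To make the lookup order concrete, here is a minimal sketch of how an installer might interpret the map file in Example 1. This is not the POC's code: the function name `repositories_for` is hypothetical, and matching `paths` with `fnmatch`-style globs is an assumption.

```python
import json
from fnmatch import fnmatchcase

# The map file from Example 1, inlined so the sketch is self-contained.
MAP_FILE = json.loads("""
{
  "repositories": {
    "PyTorch": ["https://download.pytorch.org/whl/nightly"],
    "PyPI": ["https://pypi.org/simple"]
  },
  "mapping": [
    {"paths": ["torch*"], "repositories": ["PyTorch"], "terminating": true},
    {"paths": ["*"], "repositories": ["PyPI"]}
  ]
}
""")


def repositories_for(project, map_file):
    """Return the ordered groups of index URLs to consult for `project`.

    Mapping entries are tried in order. A matching entry with
    "terminating": true ends the search, so later entries (here, PyPI)
    are never consulted for that project.
    """
    groups = []
    for entry in map_file["mapping"]:
        if any(fnmatchcase(project, pattern) for pattern in entry["paths"]):
            urls = [
                url
                for name in entry["repositories"]
                for url in map_file["repositories"][name]
            ]
            groups.append(urls)
            # Assumption: a missing "terminating" key means False.
            if entry.get("terminating", False):
                break
    return groups


print(repositories_for("torchtriton", MAP_FILE))
print(repositories_for("requests", MAP_FILE))
```

Under these assumptions, `torchtriton` resolves only to the PyTorch nightly index, while `requests` falls through to the catch-all PyPI entry.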

Note that although this proposal borrows the idea of the map file from The Update Framework (TUF), it does not require implementing or using TUF whatsoever (which is being discussed and implemented elsewhere with PEP 458).

The interested reader can find a working proof of concept (POC) implementing the map file in pip. Note that the POC is currently probably incorrect: it most likely does not correctly account for backtracking dependency resolution, which only highlights @pradyunsg’s point that it’s not trivial to implement any solution to the problem (not to mention that there might be other bugs). I am also not claiming that this is the best approach, but I am posting this now to raise a public discussion on how we can prevent this entire class of attacks going forward.

Thanks in advance for your time, and looking forward to your thoughts!

trishankatdatadog · February 1, 2023, 7:55pm

BTW, the output currently looks as follows (note that it correctly resolves where torchtriton should be found, but that it ultimately fails for unrelated reasons):

✗ pip -vvv download --map-file src/pip/_internal/map/map.json torchtriton
Created temporary directory: /private/var/folders/9v/tcls3j_d1_zbn209nj978zph0000gn/T/pip-build-tracker-k6pzzplc
Initialized build tracking at /private/var/folders/9v/tcls3j_d1_zbn209nj978zph0000gn/T/pip-build-tracker-k6pzzplc
Created build tracker: /private/var/folders/9v/tcls3j_d1_zbn209nj978zph0000gn/T/pip-build-tracker-k6pzzplc
Entered build tracker: /private/var/folders/9v/tcls3j_d1_zbn209nj978zph0000gn/T/pip-build-tracker-k6pzzplc
Created temporary directory: /private/var/folders/9v/tcls3j_d1_zbn209nj978zph0000gn/T/pip-download-afsu76rb
paths: ['torch*']
torchtriton matches torch*
threshold: 1
repositories: ['PyTorch']
index: https://download.pytorch.org/whl/nightly
index_urls: ['https://download.pytorch.org/whl/nightly']
Getting page https://download.pytorch.org/whl/nightly/torchtriton/
Looking up "https://download.pytorch.org/whl/nightly/torchtriton/" in the cache
Request header has "max_age" as 0, cache bypassed
Starting new HTTPS connection (1): download.pytorch.org:443
https://download.pytorch.org:443 "GET /whl/nightly/torchtriton/ HTTP/1.1" 200 3912
Updating cache with response from "https://download.pytorch.org/whl/nightly/torchtriton/"
Response header has "no-store"
Fetched page https://download.pytorch.org/whl/nightly/torchtriton/ as text/html
found torchtriton: ['https://download.pytorch.org/whl/nightly']
index_urls_locations: ['https://download.pytorch.org/whl/nightly']
1 location(s) to search for versions of torchtriton:
* https://download.pytorch.org/whl/nightly
Fetching project page and analyzing links: https://download.pytorch.org/whl/nightly
Getting page https://download.pytorch.org/whl/nightly
Looking up "https://download.pytorch.org/whl/nightly" in the cache
Request header has "max_age" as 0, cache bypassed
https://download.pytorch.org:443 "GET /whl/nightly HTTP/1.1" 200 1171
Updating cache with response from "https://download.pytorch.org/whl/nightly"
Response header has "no-store"
Fetched page https://download.pytorch.org/whl/nightly as text/html
  Skipping link: not a file: https://download.pytorch.org/whl/Pillow/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/certifi/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/charset-normalizer/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/cmake/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/filelock/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/idna/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/mpmath/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/nestedtensor/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/networkx/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/numpy/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/packaging/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/pytorch-triton/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/requests/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/sympy/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torch/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torcharrow/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torchaudio/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torchcsprng/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torchdata/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torchdistx/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torchrec/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torchtext/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/torchvision/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/typing-extensions/ (from https://download.pytorch.org/whl/nightly)
  Skipping link: not a file: https://download.pytorch.org/whl/urllib3/ (from https://download.pytorch.org/whl/nightly)
Skipping link: not a file: https://download.pytorch.org/whl/nightly
Given no hashes to check 0 links for project 'torchtriton': discarding no candidates
ERROR: Could not find a version that satisfies the requirement torchtriton (from versions: none)
ERROR: No matching distribution found for torchtriton
Exception information:
[...tracebacks...]
Remote version of pip: 22.3.1
Local version of pip:  23.0.dev0
Was pip installed by pip? False
Removed build tracker: '/private/var/folders/9v/tcls3j_d1_zbn209nj978zph0000gn/T/pip-build-tracker-k6pzzplc'

brettcannon · February 1, 2023, 8:11pm

Does “paths” map to the URL or the downloaded file?

Would another way to help with this problem be to have a priority order for indexes, where you can only continue on if all indexes are reachable? So in the PyTorch case the PyTorch index would have said it handled a package and the indexing could have stopped there, while letting other packages fall through to PyPI. In other words, make "terminating": true what always happens when an index has a project and let people specify the order of indexes that an installer queries.

trishankatdatadog · February 1, 2023, 8:25pm

To the URLs of indices

Yes, exactly! Note that the mapping attribute uses a list precisely to specify the order in which indices should be searched. The terminating attribute is used precisely to stop backtracking, and thus to stop searching any following indices. The threshold attribute, while not used in Example 1, can be used to enforce that two or more separate indices must agree on the same packages, like so:

{
	"repositories": {
		"PyTorch": ["https://download.pytorch.org/whl/nightly"],
		"PyPI": ["https://pypi.org/simple"]
	},
	"mapping": [
		{
			"paths": ["torch*"],
			"repositories": ["PyTorch", "PyPI"],
			"terminating": true,
			"threshold": 2
		},
		{
			"paths": ["*"],
			"repositories": ["PyPI"]
		}
	]
}

Example 2: A map file where both PyTorch and PyPI must agree on all torch* packages, and where all other packages are entrusted only to PyPI.
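One way to read "must agree" in Example 2 is that at least `threshold` repositories report the same hash for a given file. The sketch below illustrates that reading only. The function name `agreed_hashes`, the file name, and the digests are placeholders made up for illustration, not data from any real index.

```python
from collections import Counter


def agreed_hashes(reported, threshold):
    """Return {filename: digest} for files on which enough repositories agree.

    `reported` maps a repository name to {filename: digest}. A file is
    accepted only if at least `threshold` repositories report the same
    digest for it.
    """
    votes = Counter()
    for files in reported.values():
        for filename, digest in files.items():
            votes[(filename, digest)] += 1
    return {
        filename: digest
        for (filename, digest), count in votes.items()
        if count >= threshold
    }


# Placeholder file name and digests, for illustration only.
WHEEL = "torchtriton-0.0.0-py3-none-any.whl"

conflicting = {"PyTorch": {WHEEL: "aa11"}, "PyPI": {WHEEL: "bb22"}}
matching = {"PyTorch": {WHEEL: "aa11"}, "PyPI": {WHEEL: "aa11"}}

print(agreed_hashes(conflicting, threshold=2))
print(agreed_hashes(matching, threshold=2))
```

With a threshold of 2, the conflicting case yields no acceptable file, so an installer following this rule would refuse to install rather than pick one of the two sources.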

steve.dower · February 1, 2023, 10:04pm

I’ve been using simpleindex to do this. The syntax is quite a bit simpler than this, but the functionality appears to be much the same (it’s also nicely extensible, which I’ve been using to hide credentials for authenticated feeds from pip).

pf_moore · February 1, 2023, 10:55pm

Agreed. I’d much rather encourage people to use index proxies like simpleindex for this sort of thing.

If the need to have a separate index service running is problematic (people shy away from index proxies, and that’s the only reason I can really think of), it might be better to add a pip option that takes a Python script, and when pip runs, starts that script and uses the server it creates as the index. So something like pip install --index-server=startindex.py .... This is a really unformed idea at the moment, but I really do think that “making it easier for people to use index proxies” combined with an ecosystem of index proxy configurations, is a much better way of addressing these issues than adding yet more complex options to pip (which means that any other installers don’t have them - for example, I think PDM implements its own finder logic so the proposed map file approach wouldn’t help PDM users).

pf_moore · February 1, 2023, 11:26pm

This discussion prompted me to actually write this proposal down.

github.com/pypa/pip

An option to start a local index proxy when running pip

opened 11:24PM - 01 Feb 23 UTC

pfmoore

type: feature request S: needs triage

### What's the problem this feature will solve?

Many problems people have with …index server handling (such as index priority, dependency confusion attacks, filtering available package releases by age, etc) can relatively simply be solved using an index proxy, that presents a "view" of the underlying index(es). However, users are typically reluctant to use such a proxy index. Generally it's difficult to get people to articulate the reasons they don't like this option, but the most common complaint is the need to manage a separate running service.

### Describe the solution you'd like

If pip had an option that specified a script which started an index server for the duration of the pip invocation, people could use this to avoid the need to have a permanently running index proxy. For example, the user creates[^1] a script `my_index.py`, which starts up an index server. Then, they invoke pip using the command `pip install --index-script=my_index.py ...`. When started, pip will run the index script, and communicate with it to agree on a proxy URL that it will provide. The rest of the pip invocation works as normal, using the temporary index. When pip completes, it shuts down the proxy automatically. The details of the communication between pip and the script will need to be established, but it can probably be as simple as pip choosing a port number, and passing it as an argument to the script.

[^1]: In an ideal world, an ecosystem of proxy implementations will become available, so the user simply downloads a suitable script and configures it.

### Alternative Solutions

The current approach, where the user has to manually start a proxy before running pip, is viable, but appears unattractive to users as a solution. Alternative solutions to individual issues have been proposed as pip feature requests - for example, #8606 and https://discuss.python.org/t/proposal-preventing-dependency-confusion-attacks-with-the-map-file/23414. These solve individual problems, but are not as general as the proposed solution.

### Additional context

Creating an index proxy script is potentially more complexity than many users will be comfortable with, which is likely to limit the adoption of this proposal. This can be addressed, at least in part, by publishing a set of scripts that handle well-known cases like index prioritisation. Unfortunately, pip has a limited amount of developer resource, and that makes it difficult to implement solutions to the various issues raised in a timely manner. Also, it's necessary to make sure that any solution works for *every* user's situation, which further delays resolution. By, in effect, "outsourcing" the work of solving the issue to the end user, simple, tightly focused solutions can be delivered in a much more timely manner, and pressure is taken off the volunteers supporting pip.

### Code of Conduct

- [X] I agree to follow the [PSF Code of Conduct](https://www.python.org/psf/conduct/).

I’m sufficiently motivated that I’ll probably work on a PR for this. It’s not guaranteed that the other maintainers will approve of it, but it won’t be complete vapourware
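For illustration, the kind of index script the issue describes might look roughly like the sketch below. Everything here is an assumption: the `--index-script` option is only proposed, the convention of passing the port as the first argument is just the possibility floated in the issue, and the routing table and names (`ROUTES`, `upstream_for`, `RedirectingIndex`) are hypothetical. The script answers `/simple/<project>/` requests with a redirect to one chosen upstream index. Whether a given installer handles such redirects as intended has not been verified here.

```python
import sys
from fnmatch import fnmatchcase
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical routing table: the first matching pattern wins.
ROUTES = [
    ("torch*", "https://download.pytorch.org/whl/nightly"),
    ("*", "https://pypi.org/simple"),
]


def upstream_for(project):
    """Return the single upstream index URL responsible for `project`."""
    for pattern, base_url in ROUTES:
        if fnmatchcase(project, pattern):
            return base_url
    raise LookupError(f"no upstream configured for {project!r}")


class RedirectingIndex(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only paths of the form /simple/<project>/ are handled.
        parts = [part for part in self.path.split("/") if part]
        if len(parts) != 2 or parts[0] != "simple":
            self.send_error(404)
            return
        project = parts[1]
        self.send_response(302)
        self.send_header("Location", f"{upstream_for(project)}/{project}/")
        self.end_headers()


def main(port):
    """Serve the redirecting index on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectingIndex).serve_forever()


# Assumed convention: the port number arrives as the only argument.
# The server starts only when that argument is present.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(int(sys.argv[1]))
```

The routing decision lives in one small function, which is the part a user would actually customise. The HTTP plumbing around it stays the same from script to script.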

dstufft · February 1, 2023, 11:30pm

If I am understanding things correctly, the only way for this to work as it stands is that the end user has to provide this map file, because only the end user knows the collection of repositories they are using, and which packages are valid from which repositories.

I don’t think that PyPI could provide this information, because it doesn’t know about https://download.pytorch.org/whl/nightly, nor do I think https://download.pytorch.org/whl/nightly could provide it, because while it does (presumably) know about PyPI, it doesn’t know about any other indexes that the user may want to install from (other nightly indexes, a local cache, whatever).

I recognize that if someone did correctly set up their mapping file in advance, then this would prevent dependency confusion attacks arising from multiple independent repositories. However, I think it is basically analogous to the idea that you can implement package signing by having the end user maintain a mapping of projects to signing keys-- technically true, but in practice the overhead of doing so means (almost?) nobody actually uses that feature.

To safely use this feature, I would need to:

1. Investigate my entire dependency tree and determine where the authors of every package in that tree intended I install their package from.
2. Write this out in a mapping file, and make sure I never invoke pip (and that no tool I use ever invokes pip) without passing this mapping file.
3. Continuously maintain this mapping file, so that if a new dependency (à la torchtriton) is added I am aware of it and investigate where it is supposed to come from.
   - Hopefully I wrote my initial mapping file in such a way that it fails closed, not open, so I’m implicitly notified through a failing install.

I dunno, I’m pretty skeptical of things that boil down to “ask the end user to audit their entire dependency graph to determine the correct location for every one of their dependencies” to gain the benefits of the proposal ^[1].

I think that index proxies are good for some things, but I think there’s another aspect to this: ultimately the end user is the only person who actually has all of the information available to them to make these choices, and I think it’s kind of silly to say that every user who wants to use multiple indexes should set up an index proxy.

It’s a good solution for a lot of use cases where a set of users are sharing a set of indexes for a specific reason, but not really a great general solution, and I think most users would be confused and resistant to actually using it for this use case.

Ultimately I don’t think we can, realistically, prevent this entire class of attacks going forward without going to a system where the name of a dependency has a very strong connection to the location of the dependency, or, more fundamentally, globally unique names (something like Go’s use of URLs instead of abstract names for instance). However, doing that makes situations like mirroring or forking much more complicated. Though I think the real killer for that is I don’t think it’s possible to migrate the entire Python ecosystem to using globally unique names.

One idea I can think of that doesn’t prevent this entire class of attack, but that does make it much harder to pull off, is to change pip (and other installers) such that they expect packages to only live in a singular repository by default, and if it finds the same package in multiple repositories it takes some protective action ^[2].

This means that for the common ^[3] case where packages only come from one repository or another, the end user doesn’t have to do anything but they are protected against dependency confusion ^[4] in the case where they’re actually being attacked (in the pytorch example, torchtriton is available from both PyPI and https://download.pytorch.org/whl/nightly).

The downside here is that this particular heuristic isn’t perfect, because there are legitimate reasons to do this, so you would still need some way to tell pip “hey for X package, you should install it from Y repository”, which ultimately is what the mapping file is doing, so you could re-use that idea, and treat this idea as a way to “close the gap”. Or you could do something simpler, the specific mechanism doesn’t matter as long as you have a way to tell pip what to do besides fail (or warn or whatever) in that false positive case.
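The "one repository per project by default, with an explicit override" idea can be sketched as follows. This is an illustration, not anything pip implements: `choose_repository`, `AmbiguousSourceError`, and the `pinned` override mapping are hypothetical names, and hard failure is just one of the possible protective actions.

```python
class AmbiguousSourceError(Exception):
    """Raised when a project is offered by more than one repository."""


def choose_repository(project, found_in, pinned=None):
    """Pick the single repository `project` may be installed from.

    `found_in` lists the repositories that advertise the project.
    `pinned` optionally maps a project name to the repository the user
    has declared authoritative (the override for legitimate duplicates).
    """
    pinned = pinned or {}
    if project in pinned:
        if pinned[project] not in found_in:
            raise LookupError(f"{project!r} is not in pinned repository {pinned[project]!r}")
        return pinned[project]
    if not found_in:
        raise LookupError(f"{project!r} was not found in any repository")
    if len(found_in) > 1:
        # Protective action: refuse to guess which source is legitimate.
        raise AmbiguousSourceError(f"{project!r} found in {found_in}; pin one to proceed")
    return found_in[0]


print(choose_repository("requests", ["PyPI"]))
try:
    choose_repository("torchtriton", ["PyTorch", "PyPI"])
except AmbiguousSourceError as error:
    print("refused:", error)
print(choose_repository("torchtriton", ["PyTorch", "PyPI"], {"torchtriton": "PyTorch"}))
```

The unpinned duplicate is the only case that asks anything of the user, which matches the intent that the common single-source case needs no configuration.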

[1] I also suspect that the specific proposal here is way more complicated than we would need, but that’s not really important.
[2] This protective action could just be warning the user, or it could be hard failing and requiring the user to tell pip et al which one of the two locations is authoritative.
[3] At least, common in the situation where someone is using multiple repositories, which itself isn’t that common I think.
[4] Technically there is a gap here, if a project disappears from a repository OR the user didn’t configure the repository where it was intended to come from, then pip would only see the “malicious” dependency from the other repositories, and silently install it. IMO this is a very rare case though, and the small gap it allows more than makes up for the secure by default nature of this idea.

dustin · February 1, 2023, 11:45pm

I think hash pinning generally would prevent these types of attacks? This is an interesting idea but I don’t see how it’d be necessary if we have fully specified, version/hash pinned lockfiles (and, people are using them).
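As a reminder of what hash pinning checks, here is a minimal sketch. pip's hash-checking mode (`--require-hashes`, with `--hash=sha256:...` entries in a requirements file) does this for real. The function name `verify_pinned` and the sample bytes below are made up for illustration.

```python
import hashlib


def verify_pinned(data, pinned_sha256):
    """Raise ValueError unless the SHA-256 of `data` is one of the pinned digests."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in pinned_sha256:
        raise ValueError(f"hash mismatch: got sha256:{digest}")


# Placeholder bytes standing in for a downloaded wheel.
trusted = b"contents of the wheel that was originally reviewed"
pinned = {hashlib.sha256(trusted).hexdigest()}

verify_pinned(trusted, pinned)  # passes silently

try:
    verify_pinned(b"contents served by a different index", pinned)
except ValueError as error:
    print("rejected:", error)
```

The check is indifferent to which index served the file, so a same-named package from another repository fails unless its bytes match a pinned digest.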

pf_moore · February 1, 2023, 11:46pm

Oh, absolutely! But it’s not every user who insists that indexes must be processed in order, for example. People can use multiple indexes now. They can’t say “make my local index take priority over PyPI”, except by using a proxy. Someday, maybe pip will have “index priority” functionality (making it easier to use index proxies doesn’t stop us adding custom solutions for individual cases) but in the meantime, it’s a proxy or nothing.

I agree with you that this proposal has enough problems as a result of it expecting a user to create and maintain a map file, but even ignoring that issue, I think index proxies are useful for the following reasons:

1. They allow prototyping proposals like this.
2. They offer a solution that’s portable to any installer (yeah, I know, pip’s the only one for all practical purposes, but maybe PDM - specifically, the unearth library it uses - is chipping away at that idea?)
3. They act as a sanity check - is a proposal sufficiently better than using a proxy to justify it?

I’d like to make proxies easier to use so that the bar in (3) is higher - people have to work harder to justify saying “pip needs to implement this”. They still can justify it, but they need to do a bit more work

steve.dower · February 2, 2023, 12:14am

Hash pinning prevents the attack, but it doesn’t provide a useful alternative for cases where you really want to get a package from your own index rather than upstream. And it’s often prohibitively… annoying, so it doesn’t get used when it should (or it gets automated in a way that negates the value).

Azure Artifacts handles this by forwarding package requests to upstream (when configured), until you’ve pushed your own version to your feed, at which point you stop getting newer versions from PyPI (until you override it). So even without hash pinning, you can reliably use multiple indexes for the most common scenario.

dstufft · February 2, 2023, 12:28am

Yea, I mentioned it in your issue, but I’m not really against the idea of some flag to spin up a local proxy during that pip invocation or whatever. I think index proxies are great, and I’m very much a fan of making them easier and more convenient to use. I just think that a lot of these features do make sense inside pip itself, and I think a lot of people aren’t going to like the requirement to use a proxy server, so I suspect it won’t actually stop those requests coming in

I think hash pinning solves them, but it has the same problem in that it requires the end user to explicitly use some mechanism and maintain their list of local hashes to get the benefit of it. Doing something with like, automatic lock files like some languages do could narrow the gap, but it still is something that users have to think about and manage to gain the protection at all-- IOW the default is “off”.

This is roughly similar to my above idea of having pip bail out if it finds the same project defined in multiple repositories. The Azure Artifacts example presents a slightly nicer UX since it implicitly has an idea that the Azure Artifacts repository should take precedence, whereas pip doesn’t really have any sort of implicit or explicit ordering to the repositories.

Of course we could implement this by letting users control the ordering of repositories, and only fetch candidates from the first repository that returns any candidates. That’s still roughly the same idea, that by default a project can only come from a single repository, and just differs in what we do when the project comes from multiple.
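The ordered variant described above can be sketched in a few lines. The data shapes and the name `candidates_from_first_match` are hypothetical, and the version numbers are placeholders.

```python
def candidates_from_first_match(project, ordered_indexes):
    """Return (index_name, versions) from the first index that has `project`.

    `ordered_indexes` is a list of (name, {project: [versions]}) pairs in
    priority order. Lower-priority indexes are never consulted once a
    higher-priority index returns any candidates.
    """
    for name, contents in ordered_indexes:
        versions = contents.get(project, [])
        if versions:
            return name, versions
    return None, []


# Placeholder contents: PyPI carries a higher-numbered "torchtriton",
# as it would in a dependency confusion attack.
indexes = [
    ("PyTorch", {"torchtriton": ["2.0.0"]}),
    ("PyPI", {"torchtriton": ["3.0.0"], "requests": ["2.28.2"]}),
]

print(candidates_from_first_match("torchtriton", indexes))
print(candidates_from_first_match("requests", indexes))
```

Because the search stops at the first index with any candidates, the higher version number on the lower-priority index never enters resolution.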

zmallen · February 2, 2023, 1:52am

Specifying an index proxy as a local file via the --index-server flag is interesting. I could see other uses of shimming pip install to not only prevent typosquatting, but verifying packages via sigstore or scanning packages and bailing if something looks awry

trishankatdatadog · February 2, 2023, 5:04am

Thanks for the replies, everyone!

To reply to a few points:

Index proxies are indeed another way to solve the problem (the ultimate version of which is using a single network proxy to intercept all package managers, not just pip, but now you have two problems instead of one), but I don’t think they correctly interact with backtracking dependency resolution. If there are different sets of indices that can provide different sets of candidates for the same requirements, how would pip explore all the possibilities in this case? That is why I agree with @dstufft that it would be useful to have any such feature baked into pip.
Hash pinning works to an extent, but unfortunately, users are still susceptible to dependency confusion attacks on the first resolution of hashes.

I concur with @dstufft that the map file imposes some UX requirements on users, but the feature is designed for mostly enterprise/security use cases, and these files could conceivably be written and shared.

kpfleming · February 2, 2023, 6:59am

Many (most?) of those use cases will result in an enterprise-wide proxy or other form of intermediary already, especially when there are ‘internal’ packages involved. Asking users inside the org to configure pip’s global index-url to point at that is much simpler than distributing and maintaining a ‘map file’.

trishankatdatadog · February 2, 2023, 3:08pm

I’m skeptical, unless things have changed since the blog post on these attacks (a mere 2 years ago): is that why Apple, Microsoft, Tesla, Yelp, and others were affected? Not to mention the fact that even if you do deploy custom internal proxies, you may still be susceptible to these attacks when you first resolve and pull packages from the public.

kpfleming · February 2, 2023, 3:15pm

I was referring to the situation now(-ish), which changed in many ways as a result of those attacks two years ago. Two of the more common enterprise registry/proxy tools (Artifactory and Nexus Repository Manager) gained features specifically to address this type of problem (for many artifact types too, not just Python).

In my experience with those tools, they permit the admin to set up repository priorities so that higher-priority repos shadow lower-priority ones: if package ‘A’ exists in both repos, ignoring version numbers, it will never be pulled from the lower-priority repo. If an enterprise using such a tool is making use of PyTorch and pulling from the PyTorch repository, they’d insert it into their priority list below their repos of internal packages and above PyPI.

trishankatdatadog · February 2, 2023, 3:18pm

That is nice, but what should we tell people (not necessarily enterprises) who are not using these tools for one reason or another?

kpfleming · February 2, 2023, 3:35pm

I think that’s been covered in this thread to some degree; protection against dependency confusion can be gained multiple ways, and users are free to choose the path which works best for them. Your proposed solution could be one of those options.

The thing is, every single one of the options requires work; maybe for the end user, maybe for the publisher, maybe for someone else in between, or possibly more than one of them. Users and publishers who are not using any of the existing options “for one reason or another” have made a choice to avoid that work, which may be understandable but leaves them vulnerable.

trishankatdatadog · February 2, 2023, 4:19pm

Regardless of whatever solution(s) we may end up with, here are what I think are some important properties:

1. Priority. A solution must take into account that for some projects, some indices must be searched before others.
2. Backtracking. The solution may search different indices for the same package in order of priority.
3. Termination. Given (2), the solution should be able to terminate the search at any time given some condition (e.g., this project is missing in this index) even if not all indices have been searched.
4. Dependency resolution. Given (2), the solution should interact correctly with any backtracking dependency resolver (e.g., search all possible indices for a given project before moving on).

I think (1) is the bare minimum. (2)-(4) can be excluded if the indices are assumed to be mutually exclusive in the projects they provide.