If I am understanding things correctly, the only way for this to work as it stands is that the end user has to provide this map file, because only the end user knows the collection of repositories they are using, and which packages are valid from which repositories.
I don’t think that PyPI could provide this information, because it doesn’t know about https://download.pytorch.org/whl/nightly, nor do I think https://download.pytorch.org/whl/nightly could provide it, because while it does (presumably) know about PyPI, it doesn’t know about any other indexes that the user may want to install from (other nightly indexes, a local cache, whatever).
I recognize that if someone did correctly set up their mapping file in advance, then this would prevent dependency confusion attacks arising from multiple independent repositories. However, I think it is basically analogous to the idea that you can implement package signing by having the end user maintain a mapping of projects to signing keys: technically true, but in practice the overhead of doing so means (almost?) nobody actually uses that feature.
To safely use this feature, I would need to:
- Investigate my entire dependency tree and determine where the authors of every package in that tree intended I install their package from.
- Write this out in a mapping file, and make sure I never invoke pip (and that no tool I use ever invokes pip) without passing this mapping file.
- Continuously maintain this mapping file, so that if a new dependency (à la torchtriton) is added I am aware of it and investigate where it is supposed to come from.
- Hope that I wrote my initial mapping file in such a way that it fails closed rather than open, so that I’m implicitly notified through a failing install.
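To make the "fails closed" point concrete, here is a rough sketch of what a lookup against such a mapping could look like. Everything in it is hypothetical: `INDEX_MAP`, `allowed_index`, and `UnmappedProjectError` are names I made up for illustration, the map entries are just examples, and none of this is an existing pip API. The only thing taken from a standard is the name normalization, which follows PEP 503.

```python
import re


class UnmappedProjectError(Exception):
    """Raised when a project has no entry in the mapping (fail closed)."""


# Hypothetical mapping: normalized project name -> the one index it may come from.
# The entries are illustrative, not a recommendation.
INDEX_MAP = {
    "torch": "https://download.pytorch.org/whl/nightly",
    "requests": "https://pypi.org/simple",
}


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def allowed_index(project: str, index_map: dict[str, str]) -> str:
    """Return the single index a project may be installed from.

    An unmapped project raises instead of falling back to "search every
    index", so a newly added dependency shows up as a failed install
    rather than a silent guess.
    """
    key = normalize(project)
    try:
        return index_map[key]
    except KeyError:
        raise UnmappedProjectError(
            f"{key!r} is not in the mapping file; refusing to guess its index"
        ) from None
```

With a map like this, a new transitive dependency such as torchtriton would raise `UnmappedProjectError` until someone decides where it should come from, which is exactly the ongoing maintenance cost described in the list above.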
I dunno, I’m pretty skeptical of things that boil down to “ask the end user to audit their entire dependency graph to determine the correct location for every one of their dependencies” to gain the benefits of the proposal.
I think that index proxies are good for some things, but there’s another aspect to this: ultimately the end user is the only person who actually has all of the information needed to make these choices, and I think it’s kind of silly to say that every user who wants to use multiple indexes should set up an index proxy.
It’s a good solution for a lot of use cases where a set of users are sharing a set of indexes for a specific reason, but not really a great general solution, and I think most users would be confused and resistant to actually using it for this use case.
Ultimately I don’t think we can, realistically, prevent this entire class of attacks going forward without moving to a system where the name of a dependency has a very strong connection to the location of the dependency, or, more fundamentally, to globally unique names (something like Go’s use of URLs instead of abstract names, for instance). However, doing that makes situations like mirroring or forking much more complicated. Though I think the real killer is that I don’t think it’s possible to migrate the entire Python ecosystem to globally unique names.
One idea I can think of that doesn’t prevent this entire class of attack, but that does make it much harder to pull off, is to change pip (and other installers) such that they expect a package to live in only a single repository by default, and if they find the same package in multiple repositories they take some protective action.
This means that for the common case where packages only come from one repository or another, the end user doesn’t have to do anything but they are protected against dependency confusion in the case where they’re actually being attacked (in the pytorch example,
torchtriton is available from both PyPI and
The downside here is that this particular heuristic isn’t perfect, because there are legitimate reasons for a package to live in more than one repository, so you would still need some way to tell pip “hey, for X package, you should install it from Y repository”. That is ultimately what the mapping file is doing, so you could re-use that idea and treat this heuristic as a way to “close the gap”. Or you could do something simpler; the specific mechanism doesn’t matter as long as you have a way to tell pip what to do besides fail (or warn or whatever) in that false positive case.
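Roughly, the heuristic plus an escape hatch could be sketched like this. This is only an illustration of the idea, not how pip works today: `choose_index`, `AmbiguousSourceError`, and the `overrides` argument are all hypothetical names, and "raise an error" is just one of the possible protective actions (a warning would fit the same shape).

```python
import re


class AmbiguousSourceError(Exception):
    """Raised when a project is found on several indexes with no override."""


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def choose_index(project, found_on, overrides=None):
    """Pick the index to install `project` from.

    found_on:  index URLs on which the project was found.
    overrides: optional {project name: index URL} map, i.e. the user saying
               "for X package, install it from Y repository".
    """
    name = normalize(project)
    # Normalize override keys too, so "Torch_Triton" and "torch-triton" match.
    overrides = {normalize(k): v for k, v in (overrides or {}).items()}
    candidates = sorted(set(found_on))

    if name in overrides:
        pinned = overrides[name]
        if pinned not in candidates:
            raise LookupError(f"{name!r} was not found on pinned index {pinned}")
        return pinned

    if not candidates:
        raise LookupError(f"{name!r} was not found on any configured index")

    if len(candidates) > 1:
        # Protective action for the suspicious case: refuse instead of guessing.
        raise AmbiguousSourceError(
            f"{name!r} is available from multiple indexes: {', '.join(candidates)}"
        )

    return candidates[0]
```

In the common case (one index per package) the user never has to write any configuration; the override map is only needed for the legitimate multi-repository cases, so it stays small compared to a mapping of the whole dependency tree.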