If I am understanding things correctly, the only way for this to work as it stands is that the end user has to provide this map file, because only the end user knows the collection of repositories they are using, and which packages are valid from which repositories.
I don't think that PyPI could provide this information, because it doesn't know about https://download.pytorch.org/whl/nightly, nor do I think https://download.pytorch.org/whl/nightly could provide it, because while it does (presumably) know about PyPI, it doesn't know about any other indexes that the user may want to install from (other nightly indexes, a local cache, whatever).
I recognize that if someone did correctly set up their mapping file in advance, then this would prevent dependency confusion attacks arising from multiple independent repositories. However, I think it is basically analogous to the idea that you can implement package signing by having the end user maintain a mapping of projects to signing keys: technically true, but in practice the overhead of doing so means (almost?) nobody actually uses that feature.
To safely use this feature, I would need to:
- Investigate my entire dependency tree and determine where the authors of every package in that tree intended I install their package from.
- Write this out in a mapping file, and make sure I never invoke pip (and that no tool I use ever invokes pip) without passing this mapping file.
- Continuously maintain this mapping file, so that if a new dependency (à la torchtriton) is added I am aware of it and investigate where it is supposed to come from.
- Hope that I wrote my initial mapping file in such a way that it fails closed rather than open, so that I'm implicitly notified through a failing install.
I dunno, I'm pretty skeptical of things that boil down to "ask the end user to audit their entire dependency graph to determine the correct location for every one of their dependencies" to gain the benefits of the proposal.
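To make the "fails closed" point from the list above concrete, here is a rough sketch of what I mean. None of this is pip's behavior or the proposal's file format: `INDEX_MAP`, `index_for`, and `UnmappedProjectError` are names I made up, and the entries are only illustrative. The only thing borrowed from a real spec is the PEP 503 name normalization.

```python
import re

# Illustrative index URLs; the mapping entries below are assumptions.
PYPI = "https://pypi.org/simple"
PYTORCH_NIGHTLY = "https://download.pytorch.org/whl/nightly"

# Hypothetical user-maintained mapping: project name -> the one index
# that project is allowed to come from.
INDEX_MAP = {
    "torch": PYTORCH_NIGHTLY,
    "torchtriton": PYTORCH_NIGHTLY,
    "requests": PYPI,
}


class UnmappedProjectError(LookupError):
    """Raised when a project has no entry in the mapping."""


def normalize(name):
    # Project name normalization as described in PEP 503.
    return re.sub(r"[-_.]+", "-", name).lower()


def index_for(project, index_map=INDEX_MAP):
    """Return the single index `project` may be installed from.

    Fails closed: an unmapped project is an error, never a silent
    fallback to "search every configured index".
    """
    try:
        return index_map[normalize(project)]
    except KeyError:
        raise UnmappedProjectError(
            f"no index mapped for {project!r}; refusing to guess"
        ) from None
```

A fail-open version would return some default index in the `except` branch, and that is exactly the case where a newly added dependency slips through without anyone looking at where it should come from.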
I think that index proxies are good for some things, but I think there's another aspect of this: ultimately the end user is the only person who actually has all of the information available to them to make these choices, and I think it's kind of silly to say that every user who wants to use multiple indexes should set up an index proxy.
It's a good solution for a lot of use cases where a set of users are sharing a set of indexes for a specific reason, but not really a great general solution, and I think most users would be confused and resistant to actually using it for this use case.
Ultimately I don't think we can, realistically, prevent this entire class of attacks going forward without going to a system where the name of a dependency has a very strong connection to the location of the dependency, or, more fundamentally, globally unique names (something like Go's use of URLs instead of abstract names, for instance). However, doing that makes situations like mirroring or forking much more complicated. Though I think the real killer for that is that I don't think it's possible to migrate the entire Python ecosystem to using globally unique names.
One idea I can think of that doesn't prevent this entire class of attack, but that does make it much harder to pull off, is to change pip (and other installers) such that they expect packages to only live in a single repository by default, and if they find the same package in multiple repositories they take some protective action.
This means that for the common case where packages only come from one repository or another, the end user doesn't have to do anything, but they are protected against dependency confusion in the case where they're actually being attacked (in the pytorch example, torchtriton is available from both PyPI and https://download.pytorch.org/whl/nightly).
The downside here is that this particular heuristic isn't perfect, because there are legitimate reasons to have the same package in multiple repositories, so you would still need some way to tell pip "hey, for X package, you should install it from Y repository", which ultimately is what the mapping file is doing, so you could re-use that idea, and treat this idea as a way to "close the gap". Or you could do something simpler; the specific mechanism doesn't matter as long as you have a way to tell pip what to do besides fail (or warn or whatever) in that false positive case.
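Roughly, the heuristic plus the escape hatch could look something like the sketch below. To be clear, this is not how pip works today and not a worked-out design: `choose_index`, `AmbiguousSourceError`, and the `pins` argument are all hypothetical, and I've picked "raise an error" as the protective action only because it is the simplest one to show.

```python
# Illustrative index URLs used in the usage example.
PYPI = "https://pypi.org/simple"
PYTORCH_NIGHTLY = "https://download.pytorch.org/whl/nightly"


class AmbiguousSourceError(RuntimeError):
    """The same project was found on more than one index."""


def choose_index(project, found_on, pins=None):
    """Pick the index to install `project` from.

    found_on: index URLs on which the installer found `project`.
    pins:     optional explicit overrides (project -> index URL),
              i.e. the "for X package, install from Y" escape hatch.
    """
    pins = pins or {}
    candidates = sorted(set(found_on))

    # An explicit pin always wins, but it must point at an index
    # that actually serves the project.
    if project in pins:
        pinned = pins[project]
        if pinned not in candidates:
            raise LookupError(f"{project!r} is not available from {pinned}")
        return pinned

    if not candidates:
        raise LookupError(f"{project!r} was not found on any index")

    # Common case: the project lives in exactly one place, so the
    # user does not have to configure anything.
    if len(candidates) == 1:
        return candidates[0]

    # Protective action: refuse to guess between several sources.
    raise AmbiguousSourceError(
        f"{project!r} found on multiple indexes: {', '.join(candidates)}"
    )
```

With this, `choose_index("torchtriton", [PYPI, PYTORCH_NIGHTLY])` raises, while passing `pins={"torchtriton": PYTORCH_NIGHTLY}` resolves the false positive without the user having to map every other package in their tree.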