Many (most?) of those use cases will result in an enterprise-wide proxy or other form of intermediary already, especially when there are ‘internal’ packages involved. Asking users inside the org to configure pip’s global index-url to point at that is much simpler than distributing and maintaining a ‘map file’.
I’m skeptical, unless things have changed since the blog post on these attacks (a mere two years ago): is that why Apple, Microsoft, Tesla, Yelp, and others were affected? Not to mention the fact that even if you do deploy custom internal proxies, you may still be susceptible to these attacks when you first resolve and pull packages from the public index.
I was referring to the situation now(-ish), which changed in many ways as a result of those attacks two years ago. Two of the more common enterprise registry/proxy tools (Artifactory and Nexus Repository Manager) gained features specifically to address this type of problem (for many artifact types too, not just Python).
In my experience with those tools, they permit the admin to set up repository priorities so that higher-priority repos shadow lower-priority ones: if package ‘A’ exists in both repos, it will never be pulled from the lower-priority repo, regardless of version numbers. If an enterprise using such a tool is making use of PyTorch and pulling from the PyTorch repository, they’d insert it into their priority list below their repos of internal packages and above PyPI.
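To make that shadowing rule concrete, here is a minimal sketch (the repository names and package sets are invented for illustration): the first repository in priority order that contains a project wins outright, no matter which versions the lower-priority repositories carry.

```python
# Hypothetical "priority shadowing" as described above: if a project exists
# in a higher-priority repository, lower-priority repositories are never
# consulted for it, regardless of version numbers.

def resolve_repo(project, repos):
    """Return the first repository (in priority order) that owns `project`,
    or None. `repos` is an ordered list of (name, set_of_projects)."""
    for name, projects in repos:
        if project in projects:
            return name
    return None

repos = [
    ("internal", {"acme-utils"}),                   # highest priority
    ("pytorch", {"torch", "triton"}),
    ("pypi", {"requests", "torch", "acme-utils"}),  # lowest priority
]

# Even though PyPI also lists these names, shadowing wins:
assert resolve_repo("acme-utils", repos) == "internal"
assert resolve_repo("torch", repos) == "pytorch"
assert resolve_repo("requests", repos) == "pypi"
```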
That is nice, but what should we tell people (not necessarily enterprises) who are not using these tools for one reason or another?
I think that’s been covered in this thread to some degree; protection against dependency confusion can be gained multiple ways, and users are free to choose the path which works best for them. Your proposed solution could be one of those options.
The thing is, every single one of the options requires work; maybe for the end user, maybe for the publisher, maybe for someone else in between, or possibly more than one of them. Users and publishers who are not using any of the existing options “for one reason or another” have made a choice to avoid that work, which may be understandable but leaves them vulnerable.
Regardless of whatever solution(s) we may end up with, here are what I think are some important properties:
1. Priority. A solution must take into account that, for some projects, some indices must be searched before others.
2. Backtracking. The solution may search different indices for the same package in order of priority.
3. Termination. Given (2), the solution should be able to terminate the search at any time given some condition (e.g., this project is missing from this index), even if not all indices have been searched.
4. Dependency resolution. Given (2), the solution should interact correctly with any backtracking dependency resolver (e.g., search all possible indices for a given project before moving on).
I think (1) is the bare minimum. (2)-(4) can be excluded if the indices are assumed to be mutually exclusive in the projects they provide.
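As a rough sketch of how properties (1)–(3) could fit together (the MISSING/OWNED_EMPTY markers are invented; today’s Simple API has no such signal):

```python
# Search indices in priority order, allow falling through to the next
# index, and terminate early when an index authoritatively says "this
# project does not exist here". Markers are illustrative only.

MISSING = "missing"      # index does not know the project -> try next index
OWNED_EMPTY = "owned"    # index owns the name but has no files -> stop

def find_candidates(project, indices):
    """`indices` is a priority-ordered list of dicts mapping
    project -> list of versions, or one of the markers above."""
    for index in indices:
        result = index.get(project, MISSING)
        if result == MISSING:
            continue          # (2) backtrack to the next index
        if result == OWNED_EMPTY:
            return []         # (3) terminate: the name is claimed here
        return result         # (1) priority: the first index wins
    return []
```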
That they should investigate them, or a free alternative (like devpi, for instance)?
I fail to see why every issue has to be handled in the installer. People have a responsibility to decide for themselves and build a solution out of available components.
This starts to feel like the discussion about wanting one tool that does everything, extended to even include non-workflow things like proxying/filtering indexes…
Unfortunately, for better or worse, this is where everyone tends to look at first. ↩︎
My first thought is that you talk about the solution, but what’s the problem? Presumably “dependency confusion attacks”? But I’m not vulnerable to those - because I don’t specify multiple indexes. So you need to be clearer about what the problem is, before stating what properties a solution “must have”.
For example, we’ve seriously considered dropping --extra-index-url and just making pip use a single index. It may not happen (probably won’t, because of compatibility), but it’s “a solution to the problem” and it doesn’t relate at all to your “important properties” because it bypasses them. Any “index proxy” solution is the same, when looked at from a pip perspective.
Ignoring that issue for the moment, though, I’m not sure I’m comfortable with your properties anyway.
Regarding (1), there’s the problem of whether the absence of a project from one index should mean it doesn’t exist, or whether it should mean “look in the next index”. That’s not decidable without knowledge of the individual package. My understanding of the pytorch case was that it exploited this because the malicious package was found on PyPI rather than being not found because it wasn’t on the pytorch index. But failing to find requests because it’s not on the pytorch index is obviously wrong.
Regarding (2-4), you’re talking about rather deep integration into the resolver, if I’m understanding correctly. That’s tricky and fragile code, and already the source of a lot of confusion and frustration for users with complex dependency trees. I get that you want to ensure that we address potential attacks, but doing so in a way that negatively impacts the experience of a completely different class of users who may or may not even have an exposure to that problem, seems risky. You also need to realise that the resolver (resolvelib) is an external library, not part of pip, and as such we can’t simply do what we want. I’ve no idea, for example, if resolvelib will let us implement (2) (there’s an argument that affects the order of backtracking, but even the library author considers its behaviour to be arcane and unpredictable!)
Does that help? It’s basically “no, I don’t agree with you”, so probably not as much as you would have liked.
FWIW, Trishank’s list of properties is still valid even if --extra-index-url is dropped – it’d “just” shift the question to the index proxy that users would be forced to use instead.
It’s tractable. We can order the candidates we feed into the resolver, for a given requirement, based on which index it comes from.
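A hedged sketch of what that ordering could look like (the data shapes are invented; pip’s actual finder is considerably more involved): sort candidates first by the priority of the index they came from, then by descending version within each index.

```python
# Order candidates for one requirement by (index priority, newest version).
# `candidates` is a list of (index_url, version_tuple);
# `index_priority` maps index_url -> rank (lower = preferred).

def order_candidates(candidates, index_priority):
    return sorted(
        candidates,
        key=lambda c: (index_priority[c[0]], tuple(-v for v in c[1])),
    )

cands = [("pypi", (2, 0)), ("internal", (1, 0)), ("pypi", (1, 5))]
prio = {"internal": 0, "pypi": 1}

# The internal index wins even though PyPI has a newer version:
assert order_candidates(cands, prio) == [
    ("internal", (1, 0)),
    ("pypi", (2, 0)),
    ("pypi", (1, 5)),
]
```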
As I read this, it felt self-evident to me that it meant “look in the next index”.
Then can’t we move towards that instead of introducing another configuration option?
I think this, combined with requiring all indices to be up and responding, goes a long way towards helping with this problem. After that you can run your own proxy to get the desired result if you want something fancier like a fall-through situation.
We seem to have gone in a circle. This discussion is linked in index-url extra-index-url install priority order · Issue #8606 · pypa/pip · GitHub, which is an extensive thread on the complexities and practical difficulties of making index ordering part of pip.
I repeatedly suggested index proxies in that thread, and was told they would help with the problem but people still didn’t switch to using them, for reasons which aren’t clear to me (but sadly, it’s easy to uncharitably interpret the explanations as “it’s easier to ask you to do work than to do some myself”).
I suggest those participating here read that discussion. Especially the parts where multiple people say that a proxy solves the issue for them…
Perhaps equally uncharitably, but the reason may be “pip got it wrong in the first place” (with the obvious to us caveat that pip got it right in the first place, but we’re no longer in that place, and so people who weren’t around back then don’t have that context).
If you start from the baseline that “pip is wrong,” telling everyone else to work around that wrong-ness can be interpreted precisely as “it’s easier to ask you to do work…”
Explaining that “pip’s --[extra-]index-url options are only intended for specifying a set of equivalent mirrors of the same index” ought to resolve a lot of that tension. The response will be “well, that doesn’t help me”, to which you can say “here’s a suggestion that will” (at which point we get into the fix-it-once-for-all vs. make-everyone-work-around-it discussion, but at least that’ll be on the right topic).
That’s entirely fair. I thought in that thread that I had made it clear that this is a consequence of pip’s underlying model and changing that model is fundamentally hard, not that “pip is right and that’s the end of it”. But it’s hard to be sure you communicate the message you think you’re trying to get across.
Also, to be clear, I didn’t intend to give the impression that I actually believe that uncharitable interpretation. I was just expressing frustration that it’s way too difficult to get anyone to tell you why they don’t want to use an index, that isn’t just “it’s a nuisance”. And we can’t do much about a statement like that.
One problem here which isn’t immediately obvious is that there are two (separate but related) issues. The “underlying model” I am talking about is the idea that given two artifacts, both for project FOO version X.Y, they are equivalent. Outside the context of dependency confusion, people ask pip for index priority because they want to force pip to pick “our version of foo-x.y, not the version on PyPI”. And that is the version of index priority that’s most problematic (both in implementation terms, and in conceptual terms). We haven’t implemented index priority because there’s all sorts of non-obvious decisions to make there. If foo-x.y is installed, that’s OK isn’t it? Even if the installed version came from PyPI?
For dependency confusion, there’s a whole bunch of other non-obvious questions. Mostly around the idea that a project not being in index A needs to signal “this project doesn’t exist” even if it’s present in index B. So that’s two independent sets of questions/decisions to cover. And there’s no guarantee that there’s a set of answers that covers both situations.
All of which can be handled, right now, with no complicated debates or frustrating misunderstandings, by an appropriately configured index proxy applying the exact rules you want for your situation.
Yeah. We said all that. More than once. I get that most people haven’t read the (many thousands of) words that have already been written on the matter, but even so… Apologies if it’s hard not to include a bit of mild snark.
This isn’t actually the right idea. I know some people have suggested it needs to be solved, but it’s both completely unworkable and also the one case that already has a solution (don’t refer to index B at all! Bingo! Anything that doesn’t exist in index A doesn’t exist).
Dependency confusion mitigation only requires that Package==X on index A is preferred over Package>=X on index B. The complexity is that index A needs to respond with a recognisable “I don’t know that package” instead of a generic 404 so that the tool knows whether to go to the next index (first case) or to abort the whole thing (second case).
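To illustrate that distinction (the status values are hypothetical; the current Simple API has no structured “unknown project” response):

```python
# Decide what an installer should do after querying one index, assuming
# the index can distinguish "I don't own this name" from a generic 404.

def next_action(response):
    if response["status"] == "ok":
        return "use-these-files"      # candidates found; stop searching
    if response["status"] == "unknown-project":
        return "try-next-index"       # index explicitly disclaims the name
    # A bare 404 is ambiguous: the index may be misconfigured or the URL
    # wrong, so the safe choice is to abort rather than silently fall
    # through to a lower-priority (possibly attacker-controlled) index.
    return "abort"
```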
I agree it’s not obvious. But we have spent the time to dig through it all and there’s really only a small set of options for users or tool/index developers. No doubt another factor in the frustration being directed towards the pip team on this one…
Unless I’m misunderstanding you, that needs a change to the simple index specification, to allow it to distinguish that case. If so, then why do people keep suggesting this is a problem for pip to solve? The solution is to get the spec change, then pip will change to support that. If in doing so, we need to deal with index priority, then we’ll do so (the spec for the “I don’t know that package” response would cover that consumers are expected to try “the next index” which will force recognition of the concept of an index order). It might not happen fast, but it’s a well-defined way forward.
In fact, from what you’re saying, I can close down the whole “pip needs index priority to fix dependency confusion” argument by just saying “not until the index spec is fixed, because we can’t fix the problem without index support”
I don’t think it would be particularly hard to add ordering to pip, unless the finder stuff has changed dramatically since I implemented the JSON PEP. I also don’t think it (or any changes outside of pip) is required to close the gap here.
If we say that pip will error out (by default) if it finds the same package in multiple repositories, then dependency confusion attacks are virtually eliminated. Nothing else is required: no options to pip, no repository proxy, no changes to the index spec, nothing.
This isn’t even hard to implement in pip; it already generates a list of all of the candidates for a particular project, but it does it in a multi-step process that filters out links in each phase (to ignore wheels that are invalid for the current system, etc.). That would just need adjusting to build out the entire list first, then check the source of all of the links to make sure they all come from the same repository, and then filter out the links that need filtering for one reason or another.
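A minimal sketch of that check, assuming a made-up link shape of (repository URL, filename) and an invented error type:

```python
# Collect every candidate link for a project across all configured
# repositories *before* any filtering, and error out if more than one
# repository supplies it.

class MultipleRepositoriesError(Exception):
    pass

def check_single_origin(project, links):
    """`links` is a list of (repo_url, filename) for `project`."""
    repos = {repo for repo, _ in links}
    if len(repos) > 1:
        raise MultipleRepositoriesError(
            f"{project} was found in multiple repositories: {sorted(repos)}"
        )
    return links
```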
Unfortunately, and here is where some sort of configuration in pip comes in, I can almost guarantee that the moment someone runs into this hypothetical “project comes from multiple repositories” error, they’re going to look for a way to tell pip which repository it should use so that they can do the right thing, and if pip doesn’t have some kind of an answer that they find satisfactory, they’re likely to be unhappy.
There are many possible shapes that pip’s answer to “how do I resolve the multiple repositories error?” could take, such as:
- Tell them that pip doesn’t support doing that, and if they need it then they should use a repository proxy that will give them better control over mix and matching multiple repositories.
- Give them some way to pick repository server priority, and pick the highest priority server as the “winner”.
- Give them some configuration option that gives them fine-grained control over mapping a project name to a repository.
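The third option might, purely as an illustrative sketch, look like a simple project-to-repository mapping with a default (all names and URLs here are hypothetical):

```python
# A user-supplied mapping from project name to the single repository
# allowed to serve it, with a default for everything else.

REPO_MAP = {
    "acme-utils": "https://pypi.internal.example/simple",
    "torch": "https://download.pytorch.org/whl",
}
DEFAULT_REPO = "https://pypi.org/simple"

def repo_for(project):
    """Return the only repository this project may be fetched from."""
    return REPO_MAP.get(project, DEFAULT_REPO)
```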
I actually don’t think there is any change to the repository API that can fix this, because the repository doesn’t have enough information to know what it’s supposed to be serving. In the typical case where a person has some internal project that is named foo in an internal repository, and then someone comes along and registers foo on PyPI… PyPI has no way to know that, for that person, foo should be served from some other repository, nor does that other repository know anything about PyPI or what it expects to be able to serve.
Modifying the repository spec to distinguish between “404 because this project doesn’t exist” and “404 because this project has no files” doesn’t tell you anything other than that repository X thinks it should be serving that project, but doesn’t have any files for it.
The only program that has enough information to know what to do here is pip (and other installers), nothing else does nor can (and by extension, the only person who has that information, is the person(s) invoking pip with those multiple repositories).
The hardest part is figuring out how to surface the UI to let people specify the order. Once you have that, you’d just take that order into account; pip already has to order the list of candidates to get something resembling what users expect, so it would just be a matter of adding index ordering to the already existing sort key function. ↩︎
The only time they would be possible is if the dependency doesn’t exist at all in the repository it is supposed to be in, or if that repository hasn’t been configured for some reason. The latter problem is a problem no matter what we do, because it’s basically “I forgot to configure pip so it knows I don’t want XYZ from PyPI”, and the former is a relatively minor gap IMO. ↩︎
Other attacks like typosquatting are, of course, still possible. ↩︎
A change to the index spec would be required if we wanted to enable a clean way for the index to say “I own X project, but I don’t have any files for it”, but that particular gap is very minor IMO, and not worth thinking about. ↩︎
This actually would be a bit of a different thing than generating an error if a project is found in multiple repositories, since it would just implicitly turn the error into a “get it from the highest priority index it is found in”, unless another option was also added to say “and also ignore the multiple repository error for X project”. ↩︎
This could be the mapping file proposed in this thread or it could be something simpler. ↩︎
Ok, explicitly for PyPI they could, because PyPI is special, but the class of attacks doesn’t have to revolve around PyPI, it’s any multiple of repositories. ↩︎
I guess I wasn’t clear, but I meant distinguishing something like “pypi.org doesn’t have package foo, hence 404” vs. “ppyi.org is a typo and the whole site doesn’t exist, hence 404”.
A project with no files isn’t even a project in most cases, but if it did exist, then I’d expect to get a non-404 response that lists no files, and so the install will error out. Which seems to me like the natural behaviour in the single index case.
This is already basically handled by PEP 691 content types, or at least it will be if/when we deprecate and remove the generic text/html content type from pip et al. If we enter a hypothetical future where pip does not support text/html, and it only supports PEP 691 content types, then a typo wouldn’t return a valid content type, and pip could error out.
We could either amend that PEP or write another PEP to be more explicit about using those content types for 404 responses if we wanted to I suppose, but I don’t think that is required either.
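A sketch of that content-type check (the two media types are the ones PEP 691 defines; the function itself is illustrative):

```python
# A response whose Content-Type is one of the PEP 691 media types is
# recognisably "something implementing the Simple Repository API", while
# anything else (e.g. a typo'd domain serving a generic HTML error page)
# is not.

PEP691_TYPES = {
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
}

def is_simple_api_response(content_type):
    # Strip any parameters like "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in PEP691_TYPES
```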
The hypothetical implementation I put above doesn’t error out if multiple repositories define a project, but all but one of them have zero files in them. That’s mostly because the way pip’s repository support is implemented right now, it doesn’t keep track of non-404 responses that result in 0 files discovered. However, it wouldn’t be particularly difficult I think to implement it so that a repository can effectively signal that it knows of project X, but has no files for project X, so that it would trigger the hypothetical “project X is coming from multiple repositories” error case even if it doesn’t itself provide files.
I suspect that might cause more breakage than it’s worth though, since it would mean that anyone who preemptively registered a project on PyPI to hold the name, but didn’t upload any files to it, would fall into the error bucket, rather than just silently doing the right thing for them. However, it is true that it is a very small gap that may be left open in some obscure situations; I just think the preemptive registration is far more likely of a case, and would rather not break them.
Unless they somehow managed to typo to another valid repository URL of course! ↩︎
This does NOT mean having to drop support for HTML encoding completely; a repository can use application/vnd.pypi.simple.v1+html from PEP 691 to return an HTML response, which pip can distinguish as “something that implements the Python Repository API”, which it can’t from a generic text/html response. ↩︎
Internally, pip basically has a method that just iterates over lists of links from various sources, and a source with 0 links is a zero length list. ↩︎
Projects also have a workaround available to them: they can just publish a 0.0.0 version that is a placeholder package that would trigger that behavior as well. ↩︎
Isn’t this all taken care of implicitly by the repository’s index? If you don’t jump straight to a distribution’s URL and instead ask the repository upfront for its list of projects, then you can check whether the index has the project at all before hitting distribution URLs.
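Sketching that idea (the data shapes are invented): fetch the repository’s root project listing once, and only issue per-project requests for names the index actually claims to have.

```python
# Filter the projects worth a per-project fetch against the set of names
# advertised by the repository's root listing page.

def projects_to_query(wanted, index_listing):
    """`index_listing` is the set of project names the repository's root
    page advertises; only names it claims are worth a per-project fetch."""
    return [name for name in wanted if name in index_listing]
```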