Proposal: Preventing dependency confusion attacks with the map file

pf_moore · February 2, 2023, 4:21pm

That they should investigate them, or a free alternative (like devpi, for instance)?

I fail to see why every issue has to be handled in the installer. People have a responsibility to decide for themselves and build a solution out of available components.

This starts to feel like the discussion about wanting one tool that does everything, extended to even include non-workflow things like proxying/filtering indexes…

trishankatdatadog · February 2, 2023, 4:35pm

Paul, I understand your concerns, especially adding complexity to pip^[1], but please see my post about desired properties that any solution should have. I’d be curious to know what you think.

^[1] Unfortunately, for better or worse, this is where everyone tends to look first.

pf_moore · February 2, 2023, 6:23pm

My first thought is that you talk about the solution, but what’s the problem? Presumably “dependency confusion attacks”? But I’m not vulnerable to those - because I don’t specify multiple indexes. So you need to be clearer about what the problem is, before stating what properties a solution “must have”.

For example, we’ve seriously considered dropping --extra-index-url and just making pip use a single index. It may not happen (probably won’t, because of compatibility), but it’s “a solution to the problem” and it doesn’t relate at all to your “important properties” because it bypasses them. Any “index proxy” solution is the same, when looked at from a pip perspective.

Ignoring that issue for the moment, though, I’m not sure I’m comfortable with your properties anyway.

Regarding (1), there’s the problem of whether the absence of a project from one index should mean it doesn’t exist, or whether it should mean “look in the next index”. That’s not decidable without knowledge of the individual package. My understanding of the pytorch case was that it exploited this because the malicious package was found on PyPI rather than being not found because it wasn’t on the pytorch index. But failing to find requests because it’s not on the pytorch index is obviously wrong.

Regarding (2-4), you’re talking about rather deep integration into the resolver, if I’m understanding correctly. That’s tricky and fragile code, and already the source of a lot of confusion and frustration for users with complex dependency trees. I get that you want to ensure that we address potential attacks, but doing so in a way that negatively impacts the experience of a completely different class of users who may or may not even have an exposure to that problem, seems risky. You also need to realise that the resolver (resolvelib) is an external library, not part of pip, and as such we can’t simply do what we want. I’ve no idea, for example, if resolvelib will let us implement (2) (there’s an argument that affects the order of backtracking, but even the library author considers its behaviour to be arcane and unpredictable!)

Does that help? It’s basically “no, I don’t agree with you”, so probably not as much as you would have liked.

pradyunsg · February 2, 2023, 11:47pm

FWIW, Trishank’s list of properties is still valid even if pip dropped --extra-index-url – it’d “just” shift the question to the index proxy that users would be forced to use instead.

It’s tractable. We can order the candidates we feed into the resolver, for a given requirement, based on which index it comes from.
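As a rough illustration of that ordering (a sketch only, not pip's actual finder code), the candidates for one requirement can be sorted by the rank of the index they came from before they are handed to the resolver. The `Candidate` type, its `index_url` field, and `order_candidates` are hypothetical names invented for this example.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    project: str
    version: Tuple[int, ...]  # simplified stand-in for a parsed version
    index_url: str            # hypothetical field: which index served the file


def order_candidates(
    candidates: Sequence[Candidate], index_order: Sequence[str]
) -> List[Candidate]:
    """Group candidates by index priority, newest first within each index."""
    rank = {url: position for position, url in enumerate(index_order)}
    # sorted() is stable, so sort by version first and by index rank second.
    newest_first = sorted(candidates, key=lambda c: c.version, reverse=True)
    return sorted(newest_first, key=lambda c: rank[c.index_url])
```

With this ordering, a resolver that tries candidates front to back would exhaust everything from the first index before considering anything from the second.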

As I read this, it felt self-evident to me that it meant “look in the next index”.

brettcannon · February 3, 2023, 12:40am

Then can’t we move towards that instead of introducing another configuration option?

I think this, combined with requiring all indices to be up and responding, goes a long way towards helping with this problem. After that you can run your own proxy to get the desired result if you want something fancier like a fall-through situation.

pf_moore · February 3, 2023, 8:25am

We seem to have gone in a circle. This discussion is linked in pypa/pip issue #8606, “index-url extra-index-url install priority order”, which is an extensive thread on the complexities and practical difficulties of making index ordering part of pip.

I repeatedly suggested index proxies in that thread, and was told they would help with the problem but people still didn’t switch to using them, for reasons which aren’t clear to me (but sadly, it’s easy to uncharitably interpret the explanations as “it’s easier to ask you to do work than to do some myself”).

I suggest those participating here read that discussion. Especially the parts where multiple people say that a proxy solves the issue for them…

steve.dower · February 3, 2023, 10:58am

Perhaps equally uncharitably, but the reason may be “pip got it wrong in the first place” (with the obvious to us caveat that pip got it right in the first place, but we’re no longer in that place, and so people who weren’t around back then don’t have that context).

If you start from the baseline that “pip is wrong,” telling everyone else to work around that wrong-ness can be interpreted precisely as “it’s easier to ask you to do work…”

Explaining that “pip’s --[extra-]index-url options are only intended for specifying a set of equivalent mirrors of the same index” ought to resolve a lot of that tension. The response will be “well that doesn’t help me”, to which you can say “here’s a suggestion that will” (at which point we get into the fix-it-once-for-all vs. make-everyone-work-around-it discussion, but at least that’ll be on the right topic).

pf_moore · February 3, 2023, 11:37am

That’s entirely fair. I thought in that thread that I had made it clear that this is a consequence of pip’s underlying model^[1] and changing that model is fundamentally hard, not that “pip is right and that’s the end of it”. But it’s hard to be sure you communicate the message you think you’re trying to get across.

Also, to be clear, I didn’t intend to give the impression that I actually believe that uncharitable interpretation. I was just expressing frustration that it’s way too difficult to get anyone to give a reason for not wanting to use an index proxy that isn’t just “it’s a nuisance”. And we can’t do much about a statement like that.

One problem here which isn’t immediately obvious is that there are two (separate but related) issues. The “underlying model” I am talking about is the idea that given two artifacts, both for project FOO version X.Y, they are equivalent. Outside the context of dependency confusion, people ask pip for index priority because they want to force pip to pick “our version of foo-x.y, not the version on PyPI”. And that is the version of index priority that’s most problematic (both in implementation terms, and in conceptual terms). We haven’t implemented index priority because there’s all sorts of non-obvious decisions to make there. If foo-x.y is installed, that’s OK isn’t it? Even if the installed version came from PyPI?

For dependency confusion, there’s a whole bunch of other non-obvious questions. Mostly around the idea that a project not being in index A needs to signal “this project doesn’t exist” even if it’s present in index B. So that’s two independent sets of questions/decisions to cover. And there’s no guarantee that there’s a set of answers that covers both situations.

All of which can be handled, right now, with no complicated debates or frustrating misunderstandings, by an appropriately configured index proxy applying the exact rules you want for your situation.

Yeah. We said all that. More than once. I get that most people haven’t read the (many thousands of) words that have already been written on the matter^[2], but even so… Apologies if it’s hard not to include a bit of mild snark.

^[1] Which, by the way, I’d argue is implicit in our standards, so it’s not just pip’s model.
^[2] Or to put it less charitably, “done some research on the topic before asking”.

steve.dower · February 3, 2023, 12:09pm

This isn’t actually the right idea. I know some people have suggested it needs to be solved, but it’s both completely unworkable and also the one that has a solution (don’t refer to index B at all! Bingo! Anything that doesn’t exist in index A doesn’t exist).

Dependency confusion mitigation only requires that Package==X on index A is preferred over Package>=X on index B. The complexity is that index A needs to respond with a recognisable “I don’t know that package” instead of a generic 404 so that the tool knows whether to go to the next index (first case) or to abort the whole thing (second case).
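A minimal sketch of that rule, assuming an index could report the three outcomes separately (the simple repository API does not currently make this distinction, which is the complexity described above). `Status`, `IndexUnreachable`, and the `query` callable are hypothetical names for illustration.

```python
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class Status(Enum):
    FOUND = "found"
    UNKNOWN_PROJECT = "unknown-project"  # healthy index, no such project
    INDEX_ERROR = "index-error"          # bad URL, outage, unrecognised reply


class IndexUnreachable(Exception):
    """Raised when an index fails in a way that must abort the install."""


Query = Callable[[str, str], Tuple[Status, List[str]]]


def versions_from_first_index(
    project: str, index_urls: Sequence[str], query: Query
) -> Optional[Tuple[str, List[str]]]:
    """Return (index_url, versions) from the first index that knows the project."""
    for index_url in index_urls:
        status, versions = query(index_url, project)
        if status is Status.INDEX_ERROR:
            # Second case: never fall through past a broken index.
            raise IndexUnreachable(f"{index_url} did not answer like an index")
        if status is Status.FOUND:
            # Later indexes are never consulted, even if they have newer files.
            return index_url, versions
        # First case: the index positively said "I don't know that package".
    return None
```

Because the loop stops at the first index that knows the project, a higher version published on a later index cannot displace it.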

I agree it’s not obvious. But we have spent the time to dig through it all and there’s really only a small set of options for users or tool/index developers. No doubt another factor in the frustration being directed towards the pip team on this one…

pf_moore · February 3, 2023, 1:35pm

Unless I’m misunderstanding you, that needs a change to the simple index specification, to allow it to distinguish that case. If so, then why do people keep suggesting this is a problem for pip to solve? The solution is to get the spec change, then pip will change to support that. If in doing so, we need to deal with index priority, then we’ll do so (the spec for the “I don’t know that package” response would cover that consumers are expected to try “the next index” which will force recognition of the concept of an index order). It might not happen fast, but it’s a well-defined way forward.

In fact, from what you’re saying, I can close down the whole “pip needs index priority to fix dependency confusion” argument by just saying “not until the index spec is fixed, because we can’t fix the problem without index support”

dstufft · February 3, 2023, 2:17pm

I don’t think it would be particularly hard to add ordering to pip, unless the finder stuff has changed dramatically since I implemented the JSON PEP ^[1]. I also don’t think it (or any change outside of pip) is required to close the gap here.

If we say that pip will error out (by default) if it finds the same package in multiple repositories, then dependency confusion attacks are virtually eliminated ^[2]^[3]. Nothing else is required: no options to pip, no repository proxy, no changes to the index spec ^[4], nothing.

This isn’t even hard to implement in pip. It already generates a list of all of the candidates for a particular project, but it does so in a multi-step process that filters out links in each phase (to ignore wheels that are invalid for the current system, etc.). That would just need adjusting to build out the entire list first, then check the source of all of the links and make sure they all come from the same repository, and only then filter out the links that need to be filtered for one reason or another.
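Roughly, the reordered process might look like the sketch below. This is not pip's real link-collection code; `MultipleRepositoriesError`, `find_candidates`, and the `(repository_url, file_url)` pair representation are assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple

Link = Tuple[str, str]  # (repository_url, file_url), a simplified link record


class MultipleRepositoriesError(Exception):
    """Hypothetical error: one project was found in more than one repository."""


def find_candidates(
    project: str, links: Iterable[Link], is_compatible: Callable[[str], bool]
) -> List[str]:
    # Step 1: build the complete, unfiltered list of links first.
    all_links = list(links)

    # Step 2: check where the links came from *before* any filtering, so an
    # incompatible file on a second repository still counts as a collision.
    sources = sorted({repository for repository, _ in all_links})
    if len(sources) > 1:
        raise MultipleRepositoriesError(
            f"{project} was found in multiple repositories: {', '.join(sources)}"
        )

    # Step 3: only now drop links that do not apply to the current system.
    return [file_url for _, file_url in all_links if is_compatible(file_url)]
```

Running the source check ahead of the compatibility filter is the important part: if filtering happened first, a file that is unusable on the current platform could hide the fact that a second repository also serves the project.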

Unfortunately, and here is where some sort of configuration in pip comes in, I can almost guarantee that the moment someone runs into this hypothetical “project comes from multiple repositories” error, they’re going to look for a way to tell pip which repository it should use so that they can do the right thing, and if pip doesn’t have some kind of an answer that they find satisfactory, they’re likely to be unhappy.

There are many possible shapes that pip’s answer to “how do I resolve the multiple repositories error?” could take, such as:

- Tell them that pip doesn’t support doing that, and if they need it then they should use a repository proxy that will give them better control over mixing and matching multiple repositories.
- Give them some way to pick repository server priority, and pick the highest priority server as the “winner” ^[5].
- Give them some configuration option that gives them fine grained control over mapping a project name to a repository ^[6].
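To make the last two shapes concrete, here is a small sketch of how a collision could be resolved once a project is known to be served by several repositories. The `priority` and `project_map` parameters are hypothetical and do not correspond to any existing pip option, nor to the format of the map file proposed in this thread.

```python
from typing import Mapping, Optional, Sequence


def choose_repository(
    project: str,
    found_in: Sequence[str],
    *,
    priority: Optional[Sequence[str]] = None,
    project_map: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the repository to use for a project, or raise LookupError."""
    if not found_in:
        raise LookupError(f"{project} was not found in any repository")

    # Fine-grained mapping: an explicit project -> repository pin wins.
    if project_map and project in project_map:
        pinned = project_map[project]
        if pinned not in found_in:
            raise LookupError(f"{project} is pinned to {pinned}, which lacks it")
        return pinned

    if len(found_in) == 1:
        return found_in[0]

    # Server priority: the highest-priority repository serving it wins.
    if priority:
        for repository in priority:
            if repository in found_in:
                return repository

    # No configuration given: keep the error-by-default behaviour.
    raise LookupError(
        f"{project} is served by multiple repositories: {', '.join(found_in)}"
    )
```

With neither option supplied, the function keeps the proposed default of refusing to guess.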

I actually don’t think there is any change to the repository API that can fix this, because the repository doesn’t have enough information to know what it’s supposed to be serving. In the typical case where a person has some internal project that is named foo in an internal repository, and then someone comes along and registers foo on PyPI… PyPI has no way to know that, for that person, foo should be served from some other repository, nor does that other repository know anything about PyPI or what it expects to be able to serve ^[7].

Modifying the repository spec to distinguish between “404 because this project doesn’t exist” and “404 because this project has no files” doesn’t tell you anything other than that repository X thinks it should be serving that project, but doesn’t have any files for it.

The only program that has enough information to know what to do here is pip (and other installers), nothing else does nor can (and by extension, the only person who has that information, is the person(s) invoking pip with those multiple repositories).

^[1] The hardest part is figuring out how to surface the UI to let people specify the order. Once you have that, you’d just take that order into account: pip already has to order the list of candidates to get something resembling what users expect, so it would just be adding that index ordering to the already existing sort key function.
^[2] The only time they would be possible is if the dependency doesn’t exist at all in the repository it is supposed to be in, or if that repository hasn’t been configured for some reason. The latter problem is a problem no matter what we do, because it’s basically “I forgot to configure pip so it knows I don’t want XYZ from PyPI”, and the former is a relatively minor gap IMO.
^[3] Other attacks like typosquatting are, of course, still possible.
^[4] A change to the index spec would be required if we wanted to enable a clean way for the index to say “I own X project, but I don’t have any files for it”, but that particular gap is very minor IMO, and not worth thinking about.
^[5] This actually would be a bit of a different thing than generating an error if a project is found in multiple repositories, since it would just implicitly turn the error into a “get it from the highest priority index it is found in”, unless another option was also added to say “and also ignore the multiple repository error for X project”.
^[6] This could be the mapping file proposed in this thread or it could be something simpler.
^[7] Ok, explicitly for PyPI they could, because PyPI is special, but the class of attacks doesn’t have to revolve around PyPI; it applies to any set of multiple repositories.

steve.dower · February 3, 2023, 2:23pm

I guess I wasn’t clear, but I meant distinguishing something like “pypi.org doesn’t have package foo, hence 404” vs. “ppyi.org is a typo and the whole site doesn’t exist, hence 404”.

A project with no files isn’t even a project in most cases, but if it did exist, then I’d expect to get a non-404 response that lists no files, and so the install will error out. Which seems to me like the natural behaviour in the single index case.

dstufft · February 3, 2023, 2:48pm

This is already basically handled by PEP 691 content types, or at least it will be if/when we deprecate and remove the generic text/html content type from pip et al. If we enter a hypothetical future where pip does not support text/html, and it only supports PEP 691 content types, then a typo wouldn’t return a valid content type ^[1], and pip could error out ^[2].
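A sketch of the check that hypothetical future would allow, covering only the two v1 response content types defined by PEP 691. The function name is made up for illustration, and pip's real handling of content negotiation is more involved than this.

```python
# Response content types defined by PEP 691 (version 1 of the API).
REPOSITORY_CONTENT_TYPES = frozenset(
    {
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html",
    }
)


def looks_like_repository(content_type_header: str) -> bool:
    """True if a response declares itself to be a Python repository API reply."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in REPOSITORY_CONTENT_TYPES
```

A generic `text/html` page, such as a parked domain's landing page, fails this check, so an installer could stop rather than treat the URL as an index that simply lacks the project.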

We could either amend that PEP or write another PEP to be more explicit about using those content types for 404 responses if we wanted to I suppose, but I don’t think that is required either.

The hypothetical implementation I put above doesn’t error out if multiple repositories define a project, but all but one of them have zero files in them. That’s mostly because the way pip’s repository support is implemented right now, it doesn’t keep track of non 404 responses that result in 0 files discovered ^[3]. However, it wouldn’t be particularly difficult I think to implement it so that a repository can effectively signal that it knows of project X, but has no files for project X, so that it would trigger the hypothetical “project X is coming from multiple repositories” error case even if it doesn’t itself provide files.

I suspect that might cause more breakage than it’s worth though, since it would mean that anyone who preemptively registered a project on PyPI to hold the name, but didn’t upload any files to it, would fall into the error bucket, rather than pip silently doing the right thing for them. It is true that this leaves a very small gap open in some obscure situations, but I think preemptive registration is a far more likely case, and I would rather not break those projects ^[4].

^[1] Unless they somehow managed to typo to another valid repository URL of course!
^[2] This does NOT mean having to drop support for HTML encoding completely. A repository can use application/vnd.pypi.simple.v1+html from PEP 691 to return an HTML response, which pip can distinguish as “something that implements the Python Repository API”, which it can’t from a generic text/html.
^[3] Internally, pip basically has a method that just iterates over lists of links from various sources, and a source with 0 links is a zero length list.
^[4] Projects also have a workaround available to them: they can just publish a 0.0.0 version that is a placeholder package, which would trigger that behavior as well.

brettcannon · February 3, 2023, 11:32pm

Isn’t this all taken care of implicitly by the repository’s index? If you don’t jump straight to a distribution’s URL and instead ask the repository upfront for the list of projects then you can do an initial check if the index has the project before hitting distribution URLs.

dstufft · February 4, 2023, 1:32am

In theory, yes. In reality, installers don’t (or shouldn’t) use that if they can get away with it, because it’s quite large on PyPI.

pf_moore · February 4, 2023, 10:58am

I believe it’s also up to 24 hours out of date, so it could result in unexpected errors.

7om · February 4, 2023, 11:34pm

While I fully support the idea of having a mechanism in pip to prevent dependency confusion, we needed a quick solution about a year ago and came up with this pypi proxy which can be used by anyone: https://pypi.coherentminds.de/

trishankatdatadog · February 6, 2023, 3:50pm

Here is why an index or network proxy alone won’t generally solve the problem:

Let’s say you’re given 2 projects to resolve: “foo>=1.0.0” and “bar”
- “foo” can be provided by indices A or B (in that priority order)
- B provides “bar-2.0.0”
- A provides “foo-1.0.0” which conflicts with “bar-2.0.0”
- B provides “foo-2.0.0” which doesn’t conflict with “bar-2.0.0”
- Typical backtracking dependency resolver
  - Check for bar
    - Fetch bar-2.0.0 from B
    - Check for foo>=1.0.0
      - Fetch foo-1.0.0 from A
        
        But this foo conflicts with bar
      - Fetch foo-2.0.0 from B [NOTE: this won’t actually happen unless you tell it to]
        
        This is OK

As I noted before, if you don’t care about backtracking when it comes to choice of projects across indices (because, say, you assume or require indices to be mutually exclusive in the provision of projects), then there is no problem. However, the moment you want or need “network-level” backtracking, then a simple network proxy will do the wrong thing, because the pip backtracking dependency resolver will be completely oblivious to the fact that it needs to backtrack from A to B when it comes to “foo”. Again, this is why we need to concretely state the problem.
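The scenario can be reproduced with a toy model. This is a brute-force sketch, not how pip or resolvelib work internally; the index contents and the conflict come from the example above, while all function names are invented for illustration.

```python
from itertools import combinations, product

INDEX_A = {"foo": ["1.0.0"]}
INDEX_B = {"foo": ["2.0.0"], "bar": ["2.0.0"]}
# foo-1.0.0 conflicts with bar-2.0.0.
CONFLICTS = {frozenset({("foo", "1.0.0"), ("bar", "2.0.0")})}


def first_index_wins(project, indexes):
    """What a simple proxy exposes: only the first index that has the project."""
    for index in indexes:
        if project in index:
            return list(index[project])
    return []


def every_index(project, indexes):
    """What an index-aware resolver could see: versions from all indexes."""
    return [version for index in indexes for version in index.get(project, [])]


def resolve(projects, indexes, view):
    """Try every combination in priority order; return the first without a conflict."""
    options = [[(name, v) for v in view(name, indexes)] for name in projects]
    for combo in product(*options):
        if not any(frozenset(pair) in CONFLICTS for pair in combinations(combo, 2)):
            return dict(combo)
    return None
```

Here `resolve(["foo", "bar"], [INDEX_A, INDEX_B], first_index_wins)` returns `None`, because foo-2.0.0 is never visible behind the proxy, while the same call with `every_index` returns `{"foo": "2.0.0", "bar": "2.0.0"}` after backtracking from A to B.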

trishankatdatadog · February 6, 2023, 3:57pm

Meanwhile, based on the discussion I started, @dstufft had privately proposed a simple heuristic that enables pip to error out when multiple indices happen to provide the same project, which doesn’t solve the problem, but goes a long way. Dan Lorenc has implemented a working(?) POC here.

uranusjr · February 6, 2023, 4:43pm

That is a restriction from how certain existing solutions are implemented, not a restriction of the proxy concept. I believe there are solutions (devpi?) that implement file-level merging.
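For illustration only, file-level merging could look something like the sketch below. It is an assumption about the general idea, not a description of how devpi or any other proxy is actually implemented, and `merge_file_lists` is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple


def merge_file_lists(pages: Sequence[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Merge per-index file lists for one project into filename -> index_url.

    `pages` is given in priority order; when two indexes offer the same
    filename, the earlier index keeps it.
    """
    merged: Dict[str, str] = {}
    for index_url, filenames in pages:
        for filename in filenames:
            merged.setdefault(filename, index_url)
    return merged
```

A proxy serving the merged listing would expose files from every upstream index for a project, so the installer's resolver could still reach a version that exists only on a lower-priority index.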