Proposal: Preventing dependency confusion attacks with the map file

So (ignoring for a second the two indexes) you have

  • foo-1.0.0 (conflicts with bar-2.0.0)
  • foo-2.0.0
  • bar-2.0.0

I don’t see why index priority or proxying is relevant. The only valid solution is foo-2.0.0, bar-2.0.0. And I’m not sure what the relationship is to dependency confusion. Is it that you want the resolve to fail, because the resolver shouldn’t see foo-2.0.0 “because it comes from the wrong index”? If that’s the case, why would a proxy that served foo just from A and bar from B not do what you want? That’s easy - serve stuff from A by default, and only if something is requested that is not on A, do you access B for that project.
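
For illustration only, here is a minimal sketch of that fallback rule, assuming a hypothetical proxy written in Python with requests (the index URLs and the whole setup are made up; this is nowhere near a real proxy implementation):

import requests

INDEX_A = "https://index-a.example.com/simple"   # hypothetical primary index
INDEX_B = "https://index-b.example.com/simple"   # hypothetical fallback index

def project_page(project):
    """Serve the project page from A if A knows the project, else fall back to B."""
    resp = requests.get(f"{INDEX_A}/{project}/")
    if resp.status_code == 200:
        return resp.content   # A has the project, so B is never consulted for it
    resp = requests.get(f"{INDEX_B}/{project}/")
    resp.raise_for_status()
    return resp.content       # only projects absent from A ever reach B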

Or is it as @uranusjr suggests, you’re assuming too limited a view of what a proxy could do?

Sorry, IDK what “file-level merging” means, but my point is precisely that a simple, unidirectional (proxy->pip) arrangement may not be sufficient.

So I’m not actually sure what the current behavior is in Trishank’s hypothetical, but I think the question roughly boils down to:

If two different indexes both have a foo-2.0-py3-none-any.whl, with different dependencies, what happens?

I think the assumption Trishank is making is that currently pip will treat both of those as distinct installation candidates, and will potentially backtrack between them if needed, and that an index proxy cannot have two entries for foo-2.0-py3-none-any.whl that exhibit the same behavior.

What I’m not sure of is:

  1. What does pip do if multiple indexes present the “same” filename, but with different dependencies?
  2. What does pip do if one index presents the “same” filename multiple times, but with different dependencies?

I suspect that the “defined” answer is that the behavior exhibited under either of those conditions is undefined and may change at any time, but the “real” answer is that the package finder will treat those two cases as roughly identical, and include both links in the list of discovered files… but I have no idea what the resolver itself does with it.

As far as I can tell and remember, there is nothing in the repository API spec that prevents (2) from happening, but specific implementations of both the repository and the clients may assume that filenames are unique per repository, so it may well not work in practice.


Pip treats “candidates” as uniquely identified by name and version, so this is, in effect, a case of “garbage in, garbage out”. We will de-duplicate and pick one of the two wheels in an essentially arbitrary manner. The fact that the dependencies differ is irrelevant. At the point we make the choice, we’ve not even looked at dependency metadata (and we can’t, without a major performance hit).

I’m pretty sure the finder de-duplicates, and I’d characterise “which it chooses” as being undefined. There probably is a logic to which gets picked, but it’s an implementation detail subject to change at any time. It may even depend on arbitrary details like where the option is specified.

The resolver, on the other hand, is very straightforward. Candidates are unique up to name and version, so the resolver simply cannot see two candidates with the same name and version - it’s definitely the case that any deduplication happens before the resolver sees the files (and changing that would require some very fundamental redesign). If we want to do anything in pip, it would of necessity have to happen before the resolver gets invoked.

But if we ignore weird cases like local source trees or git URLs, everything that matters happens in the finder. That gets a list of sdist and wheel filenames, and infers the name/version metadata from the filenames. It returns a set of valid candidates, stripping out anything that’s not compatible with the install parameters, and the set will contain one file per name/version pair (so the decision “do we use a wheel or a sdist” happens here, for instance).
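
As a rough sketch (not pip’s actual code) of that last step, assuming hypothetical helpers parse_filename and is_compatible:

def select_candidates(filenames):
    chosen = {}  # (name, version) -> filename
    for filename in filenames:
        name, version = parse_filename(filename)  # e.g. "foo-2.0-py3-none-any.whl" -> ("foo", "2.0")
        if not is_compatible(filename):           # wrong platform, unsupported Python, etc.
            continue
        # Later files silently overwrite earlier ones here; pip's real tie-breaking
        # (wheel vs sdist, build tags, ...) is more involved than this.
        chosen[(name, version)] = filename
    return chosen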

That’s why I have a hard time understanding why a proxy isn’t the right solution here. There is no information that pip’s finder has which a proxy can’t also access[1].

It’s certainly arguable that the step where we arbitrarily choose one file if we get two with the same name would be better if it were deterministic, at least from a security point of view. But designing a UI to specify that order, in a way that can be implemented without major disruption to the finder logic, is the sticking point. And making such a UI easily usable for people wanting to protect against dependency confusion attacks is a further constraint on top of that. So far, lots of people have said we “should” do things like this, but no-one has actually provided a PR demonstrating that it’s achievable in practice. Personally, I think the implementation is likely to be possible, but the UI will be an endless source of contention, and I have little or no interest in arguing over UI myself, or any personal need for the feature, so I’m unlikely to work on it.


  1. Well, technically, an internal pip solution has access to the requirements specified on the command line, but I can’t see how having that would help, and it’s quite possible that pip’s finder actually doesn’t (currently) have access to that data either. ↩︎
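
Purely to illustrate the kind of deterministic choice discussed above (nothing like this exists in pip today; INDEX_PRIORITY and index_of are hypothetical), a rule such as “when several files share a name and version, prefer the one from the earlier-listed index” might look like:

INDEX_PRIORITY = ["https://internal.example.com/simple", "https://pypi.org/simple"]

def pick(links_for_same_name_and_version):
    # links_for_same_name_and_version: the duplicate files the finder found;
    # index_of(link) would return the index URL a link was discovered on.
    return min(links_for_same_name_and_version,
               key=lambda link: INDEX_PRIORITY.index(index_of(link)))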

Not that it matters a ton, since the answer is the same anyways (“the behavior is undefined, and pip treats candidates as uniquely identified by name and version”), but just FWIW, I went and tested it:

Given a page like:

<a href="/p1/foo-1.0.tar.gz">foo-1.0.tar.gz</a>
<a href="/p2/foo-1.0.tar.gz">foo-1.0.tar.gz</a>

PackageFinder.find_all_candidates() returns both links:

[<InstallationCandidate('foo', <Version('1.0')>, <Link http://127.0.0.1:8080/p1/foo-1.0.tar.gz (from http://127.0.0.1:8080/foo/)>)>, <InstallationCandidate('foo', <Version('1.0')>, <Link http://127.0.0.1:8080/p2/foo-1.0.tar.gz (from http://127.0.0.1:8080/foo/)>)>]

And it ultimately uses the second of those links.

I don’t have time right now to construct a more elaborate test case to see what happens if I try to trick backtracking into selecting a different file; if it’s uniquely keying by name+version then it won’t really matter (and I guess that means that even in the somewhat normal case of multiple wheels that implement the same version, pip’s backtracking resolver can’t differentiate between them).

That means that an index proxy should be roughly as capable as pip is of implementing any of these options, though I’m not sure whether that’s a particularly satisfactory answer in the general case.


I went ahead and opened an issue on pip about not allowing the same package to come from two repositories, since that’s something concrete that we can do in pip to prevent almost all dependency confusion attacks, and I didn’t want the idea to get lost in Discourse threads.


I’m not sure that this is generally the “correct” solution (depending on use cases: I can easily imagine a case where a user may wish to try the p1 version before backtracking to p2 in case the former fails). If only one version is somehow arbitrarily considered, then, yes, network proxies will do the job; otherwise, I think the story is a bit more complicated (e.g., somehow the proxy needs to tell pip to first try p1, then p2, at which point it seems to me that you might as well implement the feature in pip). It all depends on use cases.

In any case, I agree that a default “fail-fast” heuristic should prevent most accidental dependency confusion attacks.

Either you are using the term “backtracking” in a way that doesn’t match what pip actually does, or this isn’t something that pip can do anyway, as I explained above.

Considering the two different interpretations:

  1. Could you explain why? Do you mean that a network proxy can tell pip to consider ["p1", "p2"], in that order? What did I miss?
  2. If (1) isn’t possible, then it may be something worth discussing (not now, but later, depending on use cases from the community).

Pip never prefers lower versions over higher versions (unless yanked); it’s a built-in mechanism, unrelated to a proxy. That’s how versions work. Of course you can come up with scenarios where you want this specific behaviour, but I would argue that it implies versions are being used incorrectly and should not be supported.


I see, thanks! But does this answer Donald’s Q? What if the versions are the same (as in the p1 vs p2 example above)? Is there a way to tell pip to try different “subvariants” (for the lack of a better word) of the same, say, foo-1.0.0 package in some order? It’s fine if it’s not currently possible, but I’d just like to know.


It’s not possible. Packages are uniquely identified by name and version. There’s no such thing as “subvariants” - “there can be only one” :slightly_smiling_face:

(For any pedants[1] reading along - this is very over-simplified. But let’s make sure we’re on the same page over the basics before worrying about that).


  1. Like me :slightly_smiling_face: ↩︎


Yea, I’m not super familiar with the new resolver work so I wasn’t sure.

There’s two things in pip related to this:

  1. The PackageFinder, whose job it is to scan all of the repositories and produce a list of available files that pip could install.
  2. The Resolver, whose job it is to take a set of desired packages, and resolve it into a set of versions to install, using the list of files from PackageFinder.

I’m familiar with (1), which does actually support “subvariants” like you’re thinking. It does that mostly by accident, because its API surface is basically “return a list of links to files”, so the most obvious way to implement that is roughly [1]:


import requests  # needed for the sketch; pip's real network handling is more involved

repositories = [...]  # the configured index URLs
links = []

for repo in repositories:
    resp = requests.get(repo + "/package/")  # fetch the project page from each index
    links += extract_links(resp.content)     # extract_links stands in for pip's HTML parsing

print(links)  # [<InstallationCandidate('foo', <Version('1.0')>, <Link http://127.0.0.1:8080/p1/foo-1.0.tar.gz (from http://127.0.0.1:8080/foo/)>)>, <InstallationCandidate('foo', <Version('1.0')>, <Link http://127.0.0.1:8080/p2/foo-1.0.tar.gz (from http://127.0.0.1:8080/foo/)>)>]

What I wasn’t sure of is what the resolver would do with the above, since you could implement it to treat each link as a separate candidate that it could resolve against, but you obviously don’t have to do that.

What @pf_moore and @uranusjr have explained now is that, roughly speaking, internal to the resolver is a mapping of (project, version) to the file to install, so even though the PackageFinder can feed it multiple files per version, it will ultimately end up picking one of them as “the” file for that version, and will just discard the rest and won’t consider them at all during the resolving phase.

That means that an index proxy can do the exact same thing: pick one of the duplicate files as the winner and just serve that, OR it can just list both of them and pip will see them as two distinct files until they get passed into the resolver, which will pick one of them as “the” file for that version. In either case, the behavior is basically the same regardless of whether there’s one index or two indexes.


  1. It’s obviously more complicated than this; this section of code also filters out wheels that aren’t valid for the target platform, things that have a python-requires that doesn’t match our current Python, etc. ↩︎
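
To make those two proxy strategies concrete, here’s a hypothetical sketch (fetch_links is an assumed helper returning (filename, url) pairs from one upstream index):

def merged_links(indexes, project, deduplicate=True):
    links = []
    for index in indexes:
        links.extend(fetch_links(index, project))
    if not deduplicate:
        return links                       # strategy (b): list everything, pip picks a winner later
    winners = {}
    for filename, url in links:
        winners.setdefault(filename, url)  # strategy (a): the first index listed wins
    return list(winners.items())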


But that’s only an issue if you need it for PyPI. If it’s the last index in the order (or the only index), then when you get a 404 you don’t care whether the index is down or simply doesn’t have the distribution: the distribution isn’t available either way, since you’ve exhausted your options at that point.

But is there a technical reason for that, or is that just how it is right now because it hasn’t been a priority to make it refresh faster? And once again, this is only a concern if PyPI is not the last index and you have multiple indexes to check.

It’s not only an issue for PyPI; PyPI is just the place where the badness of doing so gets very extreme.

Besides, there’s no requirement that a project with 0 files appears in the repository index anyways; PyPI does that because it was the simplest thing to implement. PyPI also doesn’t return a 404 for projects with 0 files.

Both of those things are implementation details of PyPI though, and if you’re going to mandate that repositories list projects they know about, then we might as well mandate that they don’t 404 on the project specific page.

Refreshing that page is slow and memory intensive, so we try to avoid doing it. We could invest time and energy into solving the technical challenges that caused us to limit it, but there’s little reason to do so IMO.


So, I think my summary was correct: an index/network proxy will “work” only if your use case doesn’t depend on the dependency resolver exploring all the (index/network) possibilities (in some order). Any solution to this problem should keep this in mind.

An index proxy could do its own resolution based on whatever information it likes, though. There’s no reason to assume a “dumb” proxy.

But I suspect in reality, people who use a proxy will configure it for their needs and then let it route packages by name, rather than trying to resolve everything automatically. (I say this as a real-life user of index proxies, who also supports other real-life users of index proxies.)
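
A hand-maintained routing table for that kind of name-based setup could be as small as this (the project names and URLs are invented for the example):

ROUTE = {
    "our-internal-lib": "https://pypi.internal.example.com/simple",
    "our-other-lib": "https://pypi.internal.example.com/simple",
    # anything not listed here falls through to the public index
}
DEFAULT_INDEX = "https://pypi.org/simple"

def index_for(project):
    # the proxy only ever consults one index per project name
    return ROUTE.get(project.lower(), DEFAULT_INDEX)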

I guess so. But if your use case does depend on that, you’re using an installer other than pip (because pip doesn’t support that) so this discussion seems like it’s irrelevant for you?

In pip, nobody’s use case depends on that, because pip doesn’t do that now :slight_smile:

Maybe some other resolver does do that, but I’m not aware of one. All the other resolvers I know in Python are even less willing to support variants.

A proxy couldn’t do the kind of resolution that Trishank is talking about, because it requires knowing the context that something is being resolved into, and the repository API doesn’t provide a way to communicate that.

It’s a moot point though because, as has been mentioned, pip doesn’t do that kind of resolving anyways, and even if it did, the repository API doesn’t require a single entry per filename, so the repository could just list multiple files with the same filename.


Hmm, maybe we’re finally on to something. If it is safe to assume that at most one index would serve at most one[1] package for a project, then any solution to this problem could be simplified (such as @pradyunsg’s proposal).


  1. This is not strictly true, but true enough for our purposes. ↩︎