I’m assuming this is still ongoing. Let me know if there’s anything I can do to help.
Alright, I’ve updated the PEP, however in doing so I realized that a reasonably important use case wasn’t well served, so I’m going to ask that we hold off on pronouncement for at least a few days to make sure that nobody has a problem with the updated PEP.
You can see the diff here, but the summary of changes is:
- Reworded the Repository Tracks “intro”, as @pf_moore asked, to (I hope) better clarify it.
- Allowed tracking multiple repositories, to support the use case where a proxy merges multiple repositories for you, but you also happen to specify those repositories yourself.
Otherwise, everything is the same.
I added some review comments to the diff (sorry it was after you’d merged it!)
No worries, those suggestions all look fine so I’ll incorporate in a few minutes and update.
Gonna let that ride for a day or two in case anyone else notices something.
I plan to ask for a pronouncement on this PEP on Friday, June 16th. That will be ~10 days since posting the update that made a change to the PEP, after I had already asked for pronouncement.
If anyone has anything to say about this PEP, I hope they do so (even if only to ask for more time) before the 16th.
Thank you for preparing this PEP. The problem it is solving is important and I’m very happy to see the progress here. I’m professionally very close to this problem, and operate a repository for a group of 400+ specialists using Python (from novice to expert level). That repository operates behind an internet-less segregated network realm, and combines very many domestically developed packages with those on PyPI.org. Unlike some others that I have observed, we have avoided “claiming” names on PyPI.org and instead rely on our repository to mitigate the risks of dependency confusion (and other risks), and we configure pip to have a single index-url pointing to our own repository. Indeed, I am giving a presentation at EuroPython next month on the subject (and need to start writing the presentation soon).
Despite all of this investment in the problem, I do not mind admitting that I have read this PEP at least 3 times in detail, and find it to be a complex document for a relatively simple concept. It leaves me unsure of whether it actually solves the problem (I think it does), and yet it doesn’t really cover in any detail the company proxy use case similar to the one I’ve mentioned above (combination of domestic packages + proxying).
Concretely, the motivation, rationale, and “how to communicate this” sections are very long, and perhaps parts of them could be moved to an appendix of real use cases. At the same time, there is a need, IMO, for a relatively short paragraph that gives some technical explanation of how this solves some of the real-world problems (much like you would when explaining it in person) - there is plenty of discussion of background, motivation, and the specification, but not much on the high-level approach the specification follows and how that addresses the problems discussed.
On the technical side, there are a few questions that I have asked myself (some of which I’ve answered, and not all of which need responses):
- How would a proxy handle this PEP?
It could strip the PEP 708 metadata, or it can replace all upstream URLs with its own. In both cases, this doesn’t work well if you need to augment the index (e.g. for piwheels or pytorch). To solve this, you must also proxy the augmenting index.
- How does this look if you have domestic packages which must take precedence over those on pypi.org?
I think for domestic packages (which may have name collisions with those on pypi.org), there is no advantage to PEP 708, and you are forced to handle the merging correctly in your repository. It looks like this is communicated in the PEP with:
For private repositories that host private projects, it is recommended that you mirror the public projects that your users depend on into your own repository, taking care not to let a public project merge with a private project.
In short, this PEP doesn’t really cover the very often cited post/vulnerability relating to dependency confusion and Python: “Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies” by Alex Birsan, on Medium.
- Does this mean that the repository must know its own URL? (asked because it isn’t currently the case, and makes operating a repo behind a reverse-proxy a bit harder)
It looks like this is avoided thanks to the following:
When using alternate locations, clients MUST implicitly assume that the url the response was fetched from was included in the list
- Does a project have to declare its alternate-locations in order to be tracked?
It isn’t clear from the spec whether I should be allowed to run a service such as piwheels which declares appropriate “tracks” metadata without that also being declared as a known service on the upstream repo with “alternate-locations” metadata. It seems desirable to be able to run such a service, but it isn’t clear if I need to mirror/proxy all files from upstream if I am not a recognised “alternate-locations” declared provider.
I think that it is intended that “alternate-locations” and “tracks” are independent concepts, and that the caveat:
“tracks” MUST be under the control of the repository operators themselves, not any individual publisher using that repository
is handling this independence. The key thing that is going on here is that if a client adds a repo explicitly, they are saying “I trust the operators of this repo, but I don’t necessarily trust all of the people who have the right to add projects to that repo” (as is the case with pypi.org).
Technically, if these are independent concepts, perhaps independent PEPs would have made sense?
- If I declare an “alternate-location” do repository clients automatically follow those locations, or do they need to explicitly add those alternate-locations as indexes too?
I assume you do need to explicitly add them (otherwise we can do some scary stuff with a repository). If that is the case, why couldn’t that “alternate-location” repo simply use “tracks” metadata? At this point, I think I am confused and will stop - hopefully some clarifications will help resolve this. Perhaps it really is the case that the client (pip) is expected to follow the alternate-locations of a project automatically? Perhaps the difference between “tracks” metadata and “alternate-locations” metadata is that one is designed to be a new index-url and the other is designed to be an extra-index-url?
A few editorial notes:
- There is a “See TBD for more information” still in the PEP
- I proposed a typo fix in https://github.com/python/peps/pull/3019#discussion_r1223968592
To summarise: I find the length of the document to be a risk. It makes internalising what otherwise may be a simple concept difficult (for this human and for ChatGPT). As you can see from my questions, I am not even clear whether or not this PEP is introducing two independent concepts solving similar needs, or if the two concepts are critically linked.
Yet again, I apologise for the long post. I have tried to give detail where it is important, and be open when it is clear that I’ve not understood something properly. I hope it is helpful feedback and will be happy to add more context on anything that isn’t clear.
Cheers,
Phil
I think a good short summary of this (and I trust @dstufft to correct me and @pelson to say if it’s not good) is that PEP 708 makes all PyPI-like indexes incompatible with each other, such that packages from one are not interchangeable with those from another even if they have the same name, and then provides two mechanisms to explicitly claim compatibility, either for the entire index or particular packages.
It’s definitely one of those cases where the basic fix is so trivial as to get little space in the PEP, and so it appears that the exceptional cases are actually the main ones.
So perhaps the first part of the specification needs a section like this:
Reject incompatible packages
When an installer is:
- sourcing a package from multiple indexes, and
- more than one index is able to provide any versions of a particular package, and
- overriding metadata (specified below) is absent or inconsistent,
the installer MUST refuse to install the package.
For clarity, where only a single index is being used, or where only a single index provides the package, the installer should not reject it.
The remainder of this specification covers the “overriding metadata” that should be used by an installer to decide when a package may be installed from one of multiple indexes.
(Maybe add a point to that list for “and no user overrides” to allow installers to offer command line arguments or something?)
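The rule proposed above could be sketched as installer logic like this (hypothetical function and parameter names; a sketch of the proposed behaviour under the assumptions in the quoted text, not pip’s actual implementation):

```python
def may_install(project, indexes_providing, overrides):
    """Decide whether an installer may proceed, per the proposed rule.

    indexes_providing: index URLs able to serve any version of `project`.
    overrides: mapping of project name -> set of index URLs that the
               "overriding metadata" (or the user) declared equivalent.
    """
    # Single index configured, or only one index offers the project:
    # no ambiguity, so do not reject.
    if len(indexes_providing) <= 1:
        return True
    # Otherwise the overriding metadata must consistently cover every
    # index that provides the project.
    declared = overrides.get(project)
    if declared is not None and set(indexes_providing) <= declared:
        return True
    # Ambiguous: the installer MUST refuse rather than guess.
    return False

# "holygrail" offered by two indexes with no overriding metadata -> rejected.
print(may_install("holygrail",
                  ["https://pypi.org/simple/", "https://example.com/simple/"],
                  {}))  # False
```

The same function also shows the “user overrides” point: a client-level option could simply populate `overrides` for a given project.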
Good luck with your EuroPython session! If I’m free, I’ll try to come along. I covered the issue (briefly) in my session last year, so it’ll be great to see it followed up by an upstream fix.
I agree with @steve.dower’s summary. The “What is changing” section actually already has a good succinct summary of the PEP, but it’s sort of buried in the document, and could probably stand to be brought to the top in some way.
Nobody has ever accused me of being too terse
What @steve.dower summarized the PEP as is essentially correct:
Except both mechanisms are scoped per project/package; one mechanism is simpler but relies on the trust a user must place in a repository operator, and the other is a little more involved but is “trust-less”.
How would a proxy handle this PEP?
If a proxy is intended to replace PyPI, e.g. only be used with -i and without any --extra-index-url (e.g. the proxy is intended to be the sole source of packages), then it doesn’t have to do anything, and the new behavior will enforce that for it.
If you want to allow augmenting with piwheels or pytorch then you just need to emit the “tracks” metadata, so:

<meta name="pypi:tracks" content="https://pypi.org/simple/holygrail/">

or

{
  "meta": {
    "tracks": ["https://pypi.org/simple/holygrail/"]
  },
  ...
}
this is done per project, so you can make project level decisions.
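As a client-side illustration (a hypothetical sketch, not pip’s code; the page body below is an assumption in the style of the JSON Simple API), reading per-project “tracks” metadata might look like:

```python
import json

# Hypothetical JSON project page as a proxy might serve it, with the
# "tracks" key carried under "meta".
page = json.loads("""
{
  "meta": {"tracks": ["https://pypi.org/simple/holygrail/"]},
  "name": "holygrail",
  "files": []
}
""")

# A client can then treat files from this project page and from the
# tracked upstream project page as the same logical project.
tracks = page.get("meta", {}).get("tracks", [])
print(tracks)  # ['https://pypi.org/simple/holygrail/']
```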
How does this look if you have domestic packages which must take precedence over those on pypi.org?
You don’t emit any new metadata for that project, and pip will refuse to “merge” your packages with the name on PyPI. If the same name is used on PyPI and your index, and users have configured both, then they will get an error and will require explicit configuration to inform pip of which repository they want to get that particular package from.
Again, if your users are only using your repo with -i and no --extra-index-url, then nothing at all changes.
Does this mean that the repository must know its own URL? (asked because it isn’t currently the case, and makes operating a repo behind a reverse-proxy a bit harder)
The proxy is not required to know its own URL, only the URLs of any upstream repositories that it is mirroring.
Does a project have to declare its alternate-locations in order to be tracked?
No, as you mentioned these are wholly independent mechanisms.
If I declare an “alternate-location” do repository clients automatically follow those locations, or do they need to explicitly add those alternate-locations as indexes too?
No, clients are always in control of what locations they fetch from. Both tracks and alternate locations are about controlling the “merging” of multiple repositories that pip etc. do, not about redirecting clients to multiple locations.
A few questions kind of touched on why there are two mechanisms, so to answer that: it basically comes down to trust, and what assumptions we can make.
We assume that there are these parties involved:
- The user who is invoking pip (or another client).
- The repository operators running each repository (each repository with a different set of operators).
- The project authors who upload things to a repository.
We also assume that the following trust relationships exist:
- The user is trusted (since they presumably trust themselves!).
- The repository operator is trusted by the user (since it’s a fundamental requirement currently).
- The project owner that the client intended to install is trusted.
- The project owner that the client did not intend to install is not trusted.
Then we end up with multiple distinct scenarios:
1. The user has a single index server (-i with no --extra-index-url) and they want to install the pkg X from it.
2. The user has multiple index servers, which each host different projects named X.
3. The user has multiple index servers, all of which host X, but X is “owned” by the same project owner on all of the index servers configured.
4. The user has multiple index servers, but one of them is the “canonical location” for X, and the rest are extending that X package to add additional files (binaries for different platforms, a simple internal mirror, whatever).
In (1), PEP 708 makes no changes, and the user is already secure. The “source” of the files on that index doesn’t matter; whether X is the same name as somewhere else simply doesn’t matter. It is entirely unambiguous what the user means when they say X.
The problem with the status quo for 2, 3, and 4 is that when someone does pip install X, it is ambiguous what they mean, and pip currently chooses to just assume that the user wants to treat X as equivalent across all indexes, which we know is wrong.
So PEP 708 says that for the (2) case, pip will now fail by default and generate an error. The same name (X) coming from multiple repositories is ambiguous, and pip will require some mechanism to tell it that X actually is the same across these repositories.
However, we really have two distinct cases where X might be spread across multiple repositories:

1. The repository operators (or the repository software itself) are in charge of the name X, and they are attesting that X on their repository (A) is actually the same as X on this other repository (B).
   - We believe we can trust this information from A without confirmation from B, because we’ve already assumed the repository operators are trusted.
   - This is made harder by the fact that in many cases, B has no knowledge of A, because they’re mirrors or some other “downstream” provider.
2. The project authors who control X on the repository A also control X on the repository B.
   - We cannot blindly trust this metadata, because we only trust the project author that the user intended to install from, but we cannot determine if the user intended to install from A or B. However, what we can do is say that if both A and B can mutually agree that they are the same, then we’ve resolved the ambiguity without blindly trusting one project owner over another.
   - This only works if all repositories can mutually agree on the set of repositories that are equal.
The first of these is “tracks”, and the second of these is “alternate-locations”.
We have two mechanisms because we have two different scenarios, with varying levels of trust, that require different mechanics to be used safely. Now in many cases, the repository operator may also be the project owner (the pytorch mirror is a good example of this), so they would be free to use either mechanism since they are filling both roles (presumably they’d use tracks since it’s “easier”).
Ultimately though, it boils down to:
- Tracks: The trusted repository operator is attesting that X on A is also X on B.
- Alternate locations: The untrusted (at this point in the resolution) owners of X on A and X on B are able to mutually agree that their X is equivalent.
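A rough sketch of how a client might combine the two mechanisms when deciding whether to merge (hypothetical function and key names; real clients must also normalize URLs and follow the PEP’s exact rules, which this elides):

```python
def can_merge(pages):
    """Given {repo_project_url: metadata_dict} for the same name across
    repositories, decide whether a client may treat them as one project."""
    repos = set(pages)
    # Tracks: a trusted operator attests that its X is the X on some
    # upstream; here we accept the case where every other configured repo
    # tracks one common upstream repo's project.
    for upstream in repos:
        others = repos - {upstream}
        if others and all(upstream in pages[r].get("tracks", [])
                          for r in others):
            return True
    # Alternate locations: every repo must declare the exact same set,
    # and each repo implicitly includes itself in its own list.
    sets = [set(meta.get("alternate-locations", [])) | {repo}
            for repo, meta in pages.items()]
    return all(s == repos for s in sets)

# Two cooperating repos that mutually list each other -> mergeable.
pages = {
    "https://pypi.org/simple/x/":
        {"alternate-locations": ["https://example.com/simple/x/"]},
    "https://example.com/simple/x/":
        {"alternate-locations": ["https://pypi.org/simple/x/"]},
}
print(can_merge(pages))  # True
```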
As an additional note, the thing that makes pip install X secure by default is extremely simple; it’s basically:

- The name X is abstract, and requires pip to resolve it to a “concrete” name like https://pypi.org/simple/x/.
- When there is only one repository, the resolution from abstract to concrete is unambiguous, and can be assumed safe.
- When there are multiple repositories, this resolution is ambiguous: does X mean https://pypi.org/simple/x/ or does it mean https://example.com/simple/x/? We cannot know without further information.
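That abstract-to-concrete step can be illustrated with a tiny resolver (hypothetical names, and it assumes every configured repository hosts the project; this is not pip’s actual logic):

```python
def resolve(name, repos):
    """Map an abstract project name to a concrete project URL,
    failing loudly when the mapping is ambiguous."""
    candidates = [f"{repo.rstrip('/')}/{name}/" for repo in repos]
    if len(candidates) == 1:
        return candidates[0]  # unambiguous: safe to assume
    # Refuse to guess in the face of ambiguity.
    raise LookupError(f"{name!r} is ambiguous across "
                      f"{len(candidates)} repositories")

print(resolve("x", ["https://pypi.org/simple/"]))
# https://pypi.org/simple/x/
```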
And what PEP 708 does is require pip (and other clients) to fail rather than guess in the face of ambiguity, and this is where PEP 708 makes pip install safe by default against dependency confusion.
Of course, failing to install X with no way to resolve the error is pretty awful. So PEP 708 instructs clients to provide a means for users to explicitly configure X to come from a given set of repositories, but does not specify that means, because how configuration is handled is a client-level decision.
However, we also recognize that the simple proposal of refusing to guess when resolution is ambiguous will be a major breaking change, and it requires every user to re-solve the same problem over and over again.
For example, we know that if you trust piwheels.org, then you can assume that X on piwheels.org is also X on PyPI, and it feels wrong to require each individual user of piwheels to explicitly opt into considering X on piwheels and PyPI to be equivalent.
We also don’t want to train users to just blindly configure the client to merge names, we want it to be an exceptional case that they’ll think about, not something that they run into so often that they get alert fatigue and just start mashing whatever it takes to proceed.
So, to reduce the amount of breakage that PEP 708 will inflict upon the world, to reduce the amount of toil end users have to cope with, and to try to limit the failure to truly exceptional cases, we introduce the tracks and alternate locations metadata to allow repository operators or project owners to “fix” the breakage caused by PEP 708 for all of their users, rather than each user doing it individually.
Thanks for all this additional context (and apologies for not having got back to you sooner).
I’ve been reflecting on this a lot, and have the feeling that the PEP is overstated in its aim. It isn’t solving all of the dependency confusion issues that exist, and even the title could more simply be expressed as “Safely enabling repositories to be extended (e.g. via pip’s --extra-index-url)”.
In that context, I understand that the two cases that this PEP is aiming to address are:
- The ability to safely extend all/many of a repository’s projects with specific builds (the piwheels case)
- The ability to safely extend a repository with a small set of projects and/or project builds, for which the name is owned both on the upstream and the extending repository (the pytorch case)
And the mechanisms proposed are:
- tracks: Repository-level definition (exposed through the project page endpoint) for declaring an owner/upstream repository. Defined on the “extending” repository (e.g. not pypi.org). The upstream/owner doesn’t need to acknowledge the tracking repository. (EDIT for clarification: it is the case that you “track” an upstream project, not a repository.)
- alternate-locations: Project-level definition for cooperative repositories (they each know about the other). Must be consistent across all cooperating repositories.
The problem though is that the torchtriton case with the tracks concept would still be vulnerable to dependency confusion, no? (The pytorch repo would track pypi.org, have a project called torchtriton, and somebody subsequently registers the name on pypi.org.) If I’ve understood correctly, alternate-locations would be the only way to go.
The problem here, IMO, is that the two mechanisms are very subtle, and it would be easy to overlook such details. The alternate-locations approach is the more robust IMO, and it essentially looks like consensus-based namespacing - I don’t know if the consensus part is necessary vs. having a namespace authority. At the same time, for use case 1 (piwheels) it could be argued that operating a complete index is reasonable/logical, and may be better served by a service (e.g. on pypi.org) or library to do so conveniently… I don’t know how palatable that would be though?
What kind of dependency confusion attacks do you think are still possible?
Both pieces of metadata are per-project. So the pytorch repo would have $root/pytorch/, which would set tracks to pypi.org/simple/pytorch/, and $root/torchtriton/, which did not.
OK, got it. Thanks. So a repository operator would have to set tracks for each project that tracks an upstream project, rather than tracking an entire repo (I’ll add a clarifying edit to my post to avoid future confusion).
The attack that gave the name to “dependency confusion” in the first place is not solved by this PEP directly. I happily acknowledge that this PEP improves the situation by preventing the installation of confused names, but it is still easy to cause business disruption by registering names on PyPI which collide with internal names - the effect is orders of magnitude less severe (i.e. there is no arbitrary code execution), but it is disruptive nevertheless (and it is not obvious that individual developers can solve the problem).
To solve that problem today you need to run a repository which groups multiple indexes by priority order, or introduce some form of namespacing (either a prefix/pattern, or something like an NPM scope / Maven groupId) which the repository can use in order to choose which project is the one it should expose.
It is therefore surprising to me that we would rather invent our own concepts (two of them), which are easily confused (I’m a case in point) and which don’t fully solve the underlying dependency confusion issue (you still have to prioritise and/or namespace). Perhaps it is right to reject both the ordering and scopes/groupId/namespace concepts for Python, but I believe they should be comprehensively rejected, and not lightly dismissed as comes across in the PEP currently.
I don’t understand your concern here. The purpose of this PEP is to ensure that end users don’t find themselves installing the wrong package due to a dependency confusion attack. As you said, it achieves that aim. You seem concerned with the work needed by the operator of the extra index. That’s a fair concern, but not directly related to the core aim of the PEP. Ultimately, someone is going to have to do extra work to maintain the necessary infrastructure to prevent dependency confusion. The PEP makes that the responsibility of the people setting up an infrastructure that’s intended to use multiple indexes. That’s a choice the PEP makes (and a lot of the points in the PEP, and the discussions that led to the PEP, are about the reasoning behind that choice). You can disagree, but the PEP needs to make such choices. The main question is whether enough people disagree with the PEP, and/or the arguments against the PEP are compelling enough to persuade the PEP delegate (me!) to reject the PEP.
They aren’t lightly dismissed. As you note yourself, they are implementable today by anyone who is willing to set up and maintain the appropriate “grouping” repository. What isn’t available today is an out of the box implementation of that functionality in the standard tools. That’s a conscious choice, because there’s a significant cost to developing and maintaining such a solution, which would need to be paid by the PyPI and pip maintainers (among others). As a result, the cost impacts all of the users who don’t use multiple indexes and hence gain nothing from the extra work. That’s why all of the existing proposals to implement index ordering, scoped names, etc, have been rejected in the past.
And the PEP does explain why repository ordering and scopes have been rejected, so the explanation is there. If anyone disagrees with the reasons given, they can of course write a PEP proposing an alternative solution. Of course, in doing the research to put together such a PEP, they may get a better understanding of why PEP 708 rejected the idea, but that’s sort of the point of writing a PEP…
It’s even better than that - with this PEP if you do no work, you prevent dependency confusion. All the work goes into bringing it back if you want it.
It seems Phil is more concerned about someone intending to install a private package that doesn’t exist on their private feed but does on PyPI. This is essentially the same as typosquatting, and is (still) only solved by using a curated feed instead of PyPI, not in conjunction with it.
The attacks described in the original dependency confusion article are solved by this, in that the attack itself is no longer possible. Your definition of “solved” doesn’t match what I suspect most people’s is - for instance, TLS solves the problem of a MITM attacker, and it does so by ensuring that a MITM attacker can, at most, only DoS the service, not arbitrarily read and write on the connection.
It is not possible to generically solve dependency confusion attacks, such that the presence of an attack causes no disruption at all, without drastic changes to how packaging works. Solving that problem requires knowing in advance which repositories the user prefers to get particular packages from, which isn’t a knowable thing.
Prioritizing, scopes, namespaces etc do not solve this problem generically. They only solve the problem if the user has already configured their client to prioritize (or map scopes/namespaces) correctly.
The fundamental problem comes from the fact that project names in Python are ambiguous, they don’t indicate where they should come from (versus something like Go, which uses URLs that do indicate where they should come from).
Given that, we could solve it by making it so that instead of pip install foo you have to do pip install pypi.org:foo, but migrating to that (even if we wanted to) would probably be the hardest thing we’ve done in packaging to date.
There is actually something else that could be done, and it’s really more of a pip thing, but I’m not sure if it’s currently done because I never use pip to install things from anywhere but pypi (and even that not much!).
What could be done is have pip, by default, not auto-install whatever it’s going to install, but rather show the “plan” for what will be installed, including (crucially) the repo from which each will be grabbed. Then the user can review this before hitting “y” to proceed. Some similar stuff was discussed on some of the other packaging threads.
I suppose switching to an interactive install mode would be considered a big change so maybe people won’t want to do this. And of course people can still just blindly hit yes and still screw up. But conda does this and in practice I’ve found it useful for noticing when it’s getting confused between defaults and conda-forge and I need to do something to straighten it out before actually executing the install.
I’m strongly in favor of this behavior by default (or at the very least, behind a configurable flag), but it seems to me it would be more appropriate for a pip issue (or at least a separate post in packaging) than the discussion thread for PEP 708?
That doesn’t really work for a few reasons (some specific to Python):
- Unless you’re wheel only, it’s not possible to generate a plan without possibly executing code in a sdist.
- It relies on users being aware of where every dependency in the transitive closure is supposed to come from - something that is unlikely to happen in larger dependency sets [1].
- It’s infeasible to review where everything comes from, every time you install, when dealing with large closures. I’ve worked on projects that have > 1,000 deps in their transitive set of dependencies.
- It trains people to just blindly “click” through.
- It doesn’t work in non-interactive cases, which is a large % of installs.
[1] For example, torchtriton was a transitive dependency, not a direct dependency. It’s likely that a large number of people would have no idea which is the correct registry.