PEP 708 - Extending the Repository API to Mitigate Dependency Confusion Attacks

Just re-reading the last weeks feedback, it looks like so far people are pretty okay with the PEP. At least I haven’t seen any real concerns about it.

Just some follow ups on specific feedback items:

  • I’m currently going to continue to handwave the specific implementation detail of how installers allow people to configure what repo a project comes from, but I’m going to explicitly ask the poetry folks to weigh in on whether they would want that or not. If anyone has another installer they’d want to ask, please have them weigh in too!

  • I’m going to keep the HTML implementation of the ideas. I do agree with @pf_moore’s original point on the pre-PEP discussion that this feature is important enough to add it to the HTML form, at least until we formally decide to deprecate that and stop adding features to it. I’m not going to do that with this PEP, I think something like that would deserve it’s own PEP to deal with how we manage the likely breakages that come out of that.

  • I’m going to make the source of the alternate locations metadata more explicit.

  • I’m going to change some of the wording around repository operator to try and expand on the idea that it’s also the people who manage the namespace that we’re trusting with the tracks metadata. On PyPI that’s one and the same as who operates the software, but in other cases there may be people who aren’t “operating” the repository in a strict sense, but are still in charge of who owns what name.

4 Likes

LGTM. One other thing I did think of, in the “What should I do” section for “private repositories that host private projects” the recommendation is to mirror everything locally. While I agree that’s the ideal approach, and should be recommended, I think it would be useful to address the situation where that’s not possible - for example, an environment where users are allowed to use projects from PyPI as needed, and mirroring everything isn’t viable, or a case when a specific project needs an exception (such as relying on pytorch nightlies, which are too frequent and too large for the “approve and mirror” workflow to be viable).

I do think the existing recommendation is right, but heading off the inevitable “that’ll never work for us” responses[1] might be worthwhile.


  1. Not all of which will be unreasonable, TBH. ↩︎

I’ve reached out to folks at Artifactory about this, asking them to provide inputs on this as well FWIW.

1 Like

Look to see if the discovered files span multiple repositories; if they do then determine if either “Tracks” or “Alternate Locations” metadata allows safely merging ALL of the repositories where files were discovered together. If that metadata does NOT allow that, then generate an error, otherwise continue.

I was discussing this with @fridex: what does “safely merging” mean here? I think I can guess, but could you give a concrete algorithm and a few examples to clarify what you mean, please?

1 Like

The list of files are sorted by a preference order by installers to pick a file[1]. I expect that the merging would be adding the two (or more) lists together before this sort by preference order.


  1. they don’t use whatever the order is in the index response ↩︎

Ok, so please help me understand how this would prevent or at least mitigate the pytorch incident.

Users do a pip install with the --extra-index-url pointing to pytorch.org. What would alternate locations say there? List both pytorch.org and pypi.org?

And what would the alternate location for same project say on PyPI? Would it also list both of these indices? And would you install pytorch only if both indices happen to agree on alternate locations?

pytorch.org would have pypi.org (with an implicit pytorch.org or it could be explicit if someone wanted).

pypi.org would have pytorch.org (with an implicit `pypi.org, or it could be explicit if someone wanted).

If files for torchtriton was found on both indexes, it would only be installed if they agreed on alternate locations.

If the user had a third repository, like piwheels configured, which didn’t have that metadata, and it found torchtriton there it would fail.

Presumably this expands to: “all versions from both indexes would be considered for installation, and where package name+version(?) matches, both indexes are assumed to provide interchangeable files”?

Roughly yea.

The PEP itself doesn’t have an opinion on how clients treat the files once the mechanisms in the PEP have ensured that the combination of repositories is safe/expected. Currently in pip (and poetry I believe) it assumes name+version being equal on multiple repositories means those files are interchangeable.

This PEP doesn’t attempt to alter the behavior of an installer in the “safe” cases, however the safe cases are derived, it just attempts to turn unsafe cases into errors [1].


  1. Other than through the explicit configuration option, which also doesn’t really change the behavior of the safe case, other than the suggested approach applying a broad filter to the list of files to pretend they only come from a subset of the configured repositories. ↩︎

Some thoughts from a poetry guy. (Opinions of other poetry folks may differ.)

tldr: I’m a little torn and can’t give a clear yes or no to the mapping file.

I think the PEP is very useful for pip, but it is not yet clear to me how we will implement it in poetry.

Maybe, it’s only nitpicking and not relevant but in case it helps understanding the following remarks, I want to make clear that the PEP or the mapping file is only relevant for poetry’s dependency resolution mechanism (i.e. locking) but not for poetry’s installer because resolved packages are locked with sources so that the installer has no choice anymore.

The PEP mentions that the order of repositories was never meaningful to pip. However, it has always been meaningful to poetry. Thus, it is more difficult for poetry to give up this stance.

If I understand correctly, the PEP requires to look up all repositories for a given dependency (which makes sense if repositories are not ordered anyway) but is more controversial if repositories are ordered. At the moment, even though repositories are ordered, poetry looks up all repositories, too. However, that’s a performance issue for many users. Thus, a common wish is that a dependency should not be looked up in subsequent repositories if it is found in a previous one. Currently, we are planning to implement this feature: Replace the `secondary` source type with more granular types · Issue #6713 · python-poetry/poetry · GitHub

The PEP also mentions that order is difficult when considering multiple locations. That might be one reason why repositories have to be defined in the pyproject.toml for poetry. Another reason is that the source of each chosen package is written to the lock file and the only input for the lock file is the pyproject.toml. If we would define repositories in another file, the pyproject.toml is not self-contained anymore.

In my opinion, the mentioned mapping file poses the same issue to poetry. Poetry already supports defining an explicit source for a dependency. However, that’s defined in the pyproject.toml.

There are no wildcards like in the proposed mapping file, you can only define explicit sources for specific dependencies. However, a typical use case that might not be covered by the mapping file - I did not read the whole thread so forgive me if it is - is defining different sources for different versions of a package. E.g. (in poetry syntax):

somelib = [
    { version = "1.2", markers = "sys_platform != 'linux'", source = "PyPI" },
    { version = "1.2+local", markers = "sys_platform == 'linux'", source = "myPrivateRepo" },
]

I do not want to hide the fact that some poetry users wish to define repositories globally (or overwrite repository URLs), but that’s something we haven’t yet taken into consideration. I could imagine that we might allow a mapping file that does not define repositories but only which repository is preferred for which dependency, preserving the option to define a specific source in the pyproject.toml (and thus overwriting the mapping file) and falling back to the ordered repositories if there is no entry in the mapping file. However, that may result in different solutions with and without the mapping file, which could cause confusion…

The PEP calls that out somewhat in the rejected ideas section, where it rejected relying on hash checking-- the same would apply for poetry’s lock file (which you seem to confirm if I read you correctly), but it’s rejected as a solution because it doesn’t really solve the problem, it just moves it from install time to resolve time. That certainly narrows the gap of vulnerability, but doesn’t prevent it like this PEP enables.

Whether order is meaningful or not isn’t exactly the right metric. The question is whether order matters above all else. if you have Index A > Index B in poetry, and Index A has foo-1.0 and index B has foo-2.0, will it install foo 1.0? What about if the dependency is on “foo>1.5”? To have order make things safe means that you have to halt searching as soon as you find any file, even ones that don’t match the current solution set.

I don’t use poetry, so maybe “install foo 1.0” is what people expect for poetry in the above case, I would expect foo 2.0, but that might be my biases at play, but this is one of the reasons why I rejected using ordering to fix this, because it’s subtle to get right and have it still safe.

I’m not sure if that’s the case from your statement or not. One thing I’d note is the PEP doesn’t require looking up all repositories, it’s just saying that any repositories you’re finding files in, should follow the PEP. If you’re only searching one repository (for whatever reason, because you implemented safe order based searching, because you have an explicit source = ..., whatever) it doesn’t require you to go out and fetch other repositories that you weren’t even going to use anyways. It just makes sure when you do search multiple repositories, it’s done safely.

One problem I foresee with putting everything in pyproject.toml here [1] is that it doesn’t have a clean mechanism for handling transitive dependencies. The torchtriton case isn’t one where any of the torch users were directly installing it, but rather it was a transitive dependency that got pulled in implicitly. If those users put the torch repository first AND poetry implements it safely as I mentioned above, then that’s fine, but if not it means users need to audit every transitive dependency and/or lift the entire dependency tree up to be a direct dependency.

FWIW the wildcard was just one suggestion of what the repository file could look like, I personally wouldn’t include it because I think it makes it too hard to read and more likely to get it wrong, which are a bad pair for security sensitive stuff.

The idea of a repository file in my head wouldn’t support version specific things like that, or platform specific or whatever. It’s not really meant to change the UX of how repositories are defined for any installer, it’s more like a filter over what the allowed values are. So personally if I had a case like the above, I would say that if you listed somelib in the repository file, then it would have to list both PyPI and myPrivateRepo, and then the installer specific features layer on top of that if they want to restrict it further.

But this is also something I’m not sure of! If the poetry project thinks that their existing solution for specifying repository is good enough for them, then unless another installer steps forward and really wants a standard one I suspect that it’s good enough to just treat that as part of the installer UX without needing to standardize it.


  1. Well really in putting it in the dependency specifiers, the name of the file doesn’t really matter, it’s tying it to the actual dependency specifiers that causes it. ↩︎

That’s a good point. Poetry will choose foo 2.0. This is a case where we would have to apply the PEP.

One problem I foresee with putting everything in pyproject.toml here is that it doesn’t have a clean mechanism for handling transitive dependencies. The torchtriton case isn’t one where any of the torch users were directly installing it, but rather it was a transitive dependency that got pulled in implicitly. If those users put the torch repository first AND poetry implements it safely as I mentioned above, then that’s fine

As it is currently implemented, it’s not safe. As soon as we resolve the issue I linked in my previous post, it will be possible for the user to define it in a safe way. However, it will also be possible to define it in an unsafe way until we implement this PEP.

Thanks for your detailed answer. I think that helped me a bit to identify where we would need to apply this PEP.

I forgot to mention here, I made a few small tweaks to this PEP, you can see them at PEP 708: Clarifications + Reject Defining the Explicit Configuration by dstufft · Pull Request #3060 · python/peps · GitHub

Not changing anything, just documenting the questions raised above.

Since the discussion has died out on this, I’m going to assume that folks are pretty happy (or at least OK) with where it landed, and if nobody speaks up next week I’ll ask for formal pronouncement.

2 Likes

Alright then, @pf_moore I think we’re g2g for a formal pronouncement :slight_smile:

Cool. Ping me in a week or so if I haven’t made a decision by then.

FWIW, just discussed the PEP with Donald, who was kind enough to take the time to explain it to me, and it LGTM. Thanks for all your hard work, and everyone else for their feedback!

1 Like

Having taken a look at the PEP, I have one concern, about the section Repository “Tracks” Metadata. The explanation in this section seems confused. It starts off talking about this metadata as being at the repository level but the details clearly imply that it’s intended to be applied at the project level. I think the intended behaviour is fine, but I think the specification needs to be worded better to make the intention explicit.

Apart from that point, the PEP overall looks good to me. If you can try to word that section better, I think I’ll be fine to approve the PEP.

1 Like

Agree: I was confused about this myself until Donald explained it to me. Maybe calling it Project-to-Repository “Tracks” Metadata is better, at the expense of being quite a bit more verbose, but maybe someone can think of a better name.

@dstufft just a quick reminder, I’m waiting on a rewording of that one section before approving.

1 Like

Sounds good, I’ll get that sorted this week.