Proposal: Preventing dependency confusion attacks with the map file

uranusjr · February 9, 2023, 12:32pm

What do package and project mean in this context?

trishankatdatadog · February 9, 2023, 1:59pm

Sorry, maybe not using standard nomenclature.

Project: django, numpy, etc
Package: django-1.0.0.tar.gz, etc

kpfleming · February 9, 2023, 3:19pm

Where you are using ‘package’ the PyPA term is ‘distribution’; it’s either a source distribution (sdist) or a binary distribution (wheel).

The term ‘package’ in Python means something very different!

trishankatdatadog · February 9, 2023, 3:49pm

Ok, so is it accurate to say instead “at most one index would serve at most one ~~package~~ distribution for a project”? If so, you get the idea.

steve.dower · February 9, 2023, 4:11pm

I think you may be inverting responsibility in that statement to the point where it doesn’t make much sense.

The index will list all the distribution files it has for a particular project, and the installer gets to choose which one(s) it wants to download and use. In general, it’s up to the installer to make these decisions.

So what you seem to be saying is “an installer will select at most one distribution from at most one index for a project”, which is true, but isn’t very helpful. And since it’s so unhelpful, I have to assume I don’t know what you’re actually trying to get at.

trishankatdatadog · February 9, 2023, 4:28pm

Sorry, but I’ve tried my best to explain to no avail. I’m planning to write a document elsewhere that will describe these issues in more detail, and will share it here when ready.

dstufft · February 10, 2023, 3:08pm

There’s been discussion happening on the pip issue tracker about this issue as well, and I think that a pretty reasonable idea has surfaced out of that. I wrote that idea up in a semi-formal proposal doc so people can see it without having to decipher multiple discussions (some going back years). I’ve also included some ideas on communicating this out and what our messaging to end users might end up being.

This isn’t any sort of formal proposal of exactly what we should do. For things like the exact option names, the names of files, etc., I’ve just kind of jammed some name in there for now that can serve as a placeholder, to make it easier to reason about until we get to dealing with specifics.

If folks think this path is generally a reasonable one to go down, we can start figuring out how to break this out into an actual PEP (or PEPs maybe?), but I figured we can get some rough consensus on something like this as an approach first.

You can find that at Securing pip against Dependency Confusion.

brettcannon · February 10, 2023, 10:49pm

Am I reading the proposal correctly in that alternate-locations is for someone like PyPI for projects to list alternative places to look, while tracks is for those who stand up their index or proxy?

dstufft · February 11, 2023, 5:35am

Roughly yea, tracks and alternate-locations basically serve the same purpose, but alternate-locations is “harder” to use, because it’s intended for cases where you don’t know if you trust the metadata yet.

So for example, Piwheels would use tracks, because if you’re choosing to use piwheels you’re choosing to inherently trust the repository operators behind Piwheels, so if they say “hey we’re tracking the same namespace as PyPI” then we can assume that’s trustworthy.
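
A minimal sketch of the trust decision described above, assuming each repository's page for a project can carry a list of URLs it tracks. The function name and data shapes are hypothetical and not part of pip or any published API; the project URLs are for illustration only.

```python
def tracks_links(url_a, tracks_a, url_b, tracks_b):
    """Return True if either repository declares that it tracks the other.

    `tracks` is unidirectional: a declaration from one side is enough,
    because the user already chose to trust that repository's operators.
    """
    return url_b in tracks_a or url_a in tracks_b


PYPI = "https://pypi.org/simple/numpy/"
PIWHEELS = "https://www.piwheels.org/simple/numpy/"

# Piwheels says it tracks PyPI; PyPI says nothing about piwheels.
linked = tracks_links(PIWHEELS, [PYPI], PYPI, [])

# Neither side declares anything, so the two names stay unrelated.
unlinked = tracks_links(PIWHEELS, [], PYPI, [])
```

The one-sided check is what makes tracks simpler to use than alternate-locations, which needs both sides to agree.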

However, let’s say that someone decides to test their project on TestPyPI ^[1], and someone registers the name mypy on TestPyPI and wants to allow using that with mypy on PyPI. TestPyPI couldn’t use the tracks metadata here because, since it’s unidirectional, TestPyPI has no mechanism to validate that the person who owns mypy on TestPyPI is the same person who owns it on PyPI.

Which is where the alternate-locations metadata comes in.

The mypy project on PyPI would set alternate-locations = ["https://test.pypi.org/simple/mypy/"] and the mypy project on TestPyPI would set alternate-locations = ["https://pypi.org/simple/mypy/"].

By default this doesn’t do anything to change where pip looks (so it’s not like the old days where pip would implicitly add testpypi to its places to look for files), but instead if someone did:

$ pip install --extra-index-url https://test.pypi.org/simple/ mypy

Pip would basically follow this flow:

1. Fetch https://pypi.org/simple/mypy/
   a. See the alternate-locations = ["https://test.pypi.org/simple/mypy/"]
   b. Add the implicit https://pypi.org/simple/mypy/
2. Fetch https://test.pypi.org/simple/mypy
   a. See the alternate-locations = ["https://pypi.org/simple/mypy/"]
   b. Add the implicit https://test.pypi.org/simple/mypy/
3. Scan the list of files and build a set of all of the locations they came from.
4. Do file locations - (pypi_locations & testpypi_locations), and if that isn’t an empty set, return an error.
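
The flow above can be sketched roughly as follows. This is only an illustration of the set math, not pip's implementation: the function name and the shape of `responses` are hypothetical, URLs are compared as exact strings, and every fetched page is assumed to have served at least one candidate file.

```python
def check_locations(responses):
    """Return the file locations that are not agreed on by every repository.

    `responses` maps the URL a project page was fetched from to the
    `alternate-locations` list found on that page (hypothetical shape).
    An empty result means the repositories vouch for each other.
    """
    if not responses:
        return set()

    declared = []
    for url, alternates in responses.items():
        # Each page implicitly includes its own URL.
        declared.append(set(alternates) | {url})

    # Locations that every repository agrees on.
    agreed = set.intersection(*declared)
    # Locations the candidate files actually came from.
    file_locations = set(responses)
    return file_locations - agreed


PYPI = "https://pypi.org/simple/mypy/"
TEST = "https://test.pypi.org/simple/mypy/"

# Both sides list each other: nothing is left over, so no error.
ok = check_locations({PYPI: [TEST], TEST: [PYPI]})

# Only TestPyPI claims the link: TestPyPI is left over, so this is an error.
bad = check_locations({PYPI: [], TEST: [PYPI]})
```

An installer following this model would refuse to continue whenever the returned set is non-empty.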

So essentially they do the same thing (allow pip to associate the same name on two different repositories with each other, without intervention from the end user), but they operate in different ways due to the different trust models that exist between the person who operates a repository and the person who just happens to own a name on a repository.

To be honest, I suspect in most cases the tracks metadata will be used, because on most repositories out there that aren’t PyPI or TestPyPI, all of the projects are owned by the same people who are operating the repository. Since that means we can inherently trust them, they can just use the tracks metadata, which is simpler and easier to use.

However I didn’t want to special case PyPI here and I didn’t want to assume that no other repository like PyPI, where the repository operators and the project owners are different groups of people, exists or would ever exist, so I designed the feature such that it would allow an untrusted project owner to prove they own that name on multiple repositories.

^[1] TestPyPI is bad, but it’s the most well-known example I have offhand of this.

steve.dower · February 13, 2023, 2:06pm

Hopefully the package author could set alternate-locations = ["https://test.pypi.org/simple/mypy/", "https://pypi.org/simple/mypy/"] in both and then the index only returns locations other than itself?

“Intended index” isn’t a build-time parameter for most people, though there could be value in making people decide up front (particularly if we ever get any kind of package/metadata signing integrated into builds).

trishankatdatadog · February 14, 2023, 1:14am

@dstufft Could we get comment access on your GDoc? Unwieldy to read there but comment here (also important context lost). Thx!

dstufft · February 14, 2023, 3:04am

The link should have commenter access now.

dstufft · February 19, 2023, 6:10pm

Yes, or it could report both anyways. The important thing is that all of the values agree with each other. The implicit addition of the current index is just there to prevent mistakes where someone leaves out the current location (it also allows some minor size optimization for repositories that are particularly bandwidth sensitive).

The set math will work either way, so it should be fine either way.
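
A small sketch of why either reporting style gives the same result, assuming the current page's URL is always added implicitly. The helper name is hypothetical.

```python
def effective_locations(page_url, alternate_locations):
    """Declared locations plus the implicit current location."""
    return set(alternate_locations) | {page_url}


PYPI = "https://pypi.org/simple/mypy/"
TEST = "https://test.pypi.org/simple/mypy/"

# The PyPI page lists both URLs, including itself.
explicit = effective_locations(PYPI, [TEST, PYPI])

# The PyPI page lists only the other repository.
implicit = effective_locations(PYPI, [TEST])
```

Both calls produce the same set, so a repository is free to omit its own URL from the list.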

dstufft · February 19, 2023, 6:22pm

I’m posting this here and to the pip issue just so nobody misses it.

It’s been about 10 days since I posted my proposal and, other than a few questions, I haven’t seen anyone raise an objection to the overall idea. Previously folks had seemed on board with the idea (the longer proposal was designed to make sure everyone was on the same page and to make it easier for people to jump in without having to read both threads in their entirety).

Given nobody has objected, I’m going to take that as a sign that it’s worthwhile to take this to a PEP, so I’ll go ahead and start working on that. I plan to focus that PEP on the changes to the repository protocol and what those implications are for installers. I will likely include non-normative recommendations that give installers some high-level guidance for matching the rough behavior in the proposal, but I won’t spell out specific UX for installers.

wedwards · March 5, 2023, 12:30pm

As a new user, I am unable to post over 2 links. Therefore, I have placed the links corresponding to the numbers in brackets here: https://pastebin.com/raw/4FLbUL9z

–

My $0.02 as a ‘regular’ pip user:

My use case:

Install packages from both PyPI and an internal simple index[1].
Internal packages should be installed only from the internal index. Never from PyPI.
Dependencies of internal packages may be installed from PyPI.

I have investigated the following options:

--extra-index-url

Installs a package from PyPI if it exists, or from the given index URL.

Do not use this if the package should be installed only from the given index URL. An attacker could upload a package with the same name to PyPI, causing the wrong package to be installed. This is called ‘dependency confusion’.
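
A deliberately simplified model of how the confusion can arise, assuming the installer pools candidates from every configured index and prefers the highest version regardless of where it came from. Real installers weigh more criteria (compatibility tags, yanked files, and so on); the data shape and names here are hypothetical.

```python
def pick_candidate(candidates):
    """Pick the highest version across all indexes, ignoring the source."""
    return max(candidates, key=lambda c: c["version"])


candidates = [
    # The genuine internal release.
    {"index": "internal", "version": (1, 2, 0)},
    # A same-named upload on the public index with an inflated version.
    {"index": "pypi", "version": (99, 0, 0)},
]

chosen = pick_candidate(candidates)
```

Because the source index plays no part in the comparison, the inflated public version wins.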

Remedy:

Use hash-checking mode[2]. If the hash does not match, the package will not be installed. This prevents the wrong package from being installed if it does exist on PyPI.

However, this remedy has several issues:

The wrong hash is used if the package exists on PyPI at generation time.
Hash is created for a specific version. Therefore, dependencies must be pinned to specific versions, which is generally discouraged for libraries[3].

Conclusion: not suitable.
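
The idea behind hash checking can be sketched as below: a downloaded file is accepted only if its digest equals the pinned one. This illustrates the principle with the standard library and is not pip's own code; the byte strings stand in for real distribution files.

```python
import hashlib


def matches_pinned_hash(data, pinned_sha256):
    """Return True if the SHA-256 digest of `data` equals the pinned digest."""
    return hashlib.sha256(data).hexdigest() == pinned_sha256


genuine = b"contents of the internal distribution"
impostor = b"contents of a same-named upload elsewhere"

# The digest is recorded once, when the requirements are generated.
pinned = hashlib.sha256(genuine).hexdigest()

accepted = matches_pinned_hash(genuine, pinned)
rejected = not matches_pinned_hash(impostor, pinned)
```

The check is only as good as the pinned digest: if the impostor is what gets hashed at generation time, it is the impostor that passes.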

--index-url

Installs a package from the given index URL. Never from PyPI.

Use this if the package should be installed only from the given index URL. If a package with the same name exists on PyPI, it is not installed. After all: only the index URL is used. Not PyPI. This solves ‘dependency confusion’.

However, this option has several issues:

When using requirements.txt: --index-url applies to the entire file. It is not possible to use --index-url for a specific package. Workaround: use several requirements.txt files, and include them using -r.
It is not possible for a package which should be installed from a specific index URL to depend on packages from PyPI. After all: only the index URL is used. Workaround: use an index proxy (see below).

Conclusion: not suitable.

Index proxy

Proxies requests to PyPI and/or custom indices.

This solves all issues with --index-url. After all: the issues with --index-url stem from not being able to properly use multiple indices. With an index proxy, pip uses only one index.

The most-mentioned options are:

simpleindex[4]
devpi[5]

I also came across Thoth[6]. It was created for the very purpose of avoiding ‘dependency confusion’. However, its documentation does not seem to be straightforward. Additionally, it uses ‘AI’ and ‘moves the resolution process to the cloud’. I will pick an effective and elegant solution over marketing blah-blah.

Conclusion: suitable.
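
The routing decision an index proxy makes can be sketched as follows. This is not how simpleindex or devpi are configured; the internal index URL and project names are hypothetical, and only the name normalization follows the rule from PEP 503.

```python
import re

INTERNAL_INDEX = "https://pypi.internal.example/simple/"  # hypothetical
PUBLIC_INDEX = "https://pypi.org/simple/"
INTERNAL_PROJECTS = {"acme-billing", "acme-utils"}  # hypothetical names


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def upstream_for(project):
    """Return the single upstream page the proxy would consult."""
    name = normalize(project)
    if name in INTERNAL_PROJECTS:
        # Internal names are never looked up on the public index.
        return f"{INTERNAL_INDEX}{name}/"
    return f"{PUBLIC_INDEX}{name}/"


internal_url = upstream_for("Acme_Billing")
public_url = upstream_for("requests")
```

Since every name maps to exactly one upstream, a same-named public upload is never considered for an internal project, while public dependencies still resolve.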

–

Based on my investigation, an index proxy seems like the only suitable solution at the moment.

In my view, being able to install packages from multiple repositories - without being vulnerable to ‘dependency confusion’ - is a simple and common use case, and should therefore not require an end user to run and maintain additional software.

At the time of writing, the best proposal that I - an end user - have come across is specifying an index URL per package[7]. This solves all issues with --index-url (see above).

sinoroc · March 5, 2023, 12:53pm

Really? I had no idea this (the workaround described here) would work. Is that true? Is that a usage supported by pip? Is that something that can be recommended to users?

dstufft · March 5, 2023, 9:00pm

I have not tested this, but knowing what I know about pip internals, I would be surprised if this worked.

allevato · August 3, 2023, 6:26pm

Using several requirements.txt files, and installing each with its own --index-url, is doable. However, it has the same issue pointed out here, which is that if a package in one index depends on packages that come from another index, there is no way to automatically retrieve the dependencies from the other index during installation.
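
A sketch of that approach, building one pip invocation per requirements file without running anything. The index URL and file names are hypothetical; `--index-url` and `-r` are existing pip options.

```python
import sys


def pip_install_commands(groups):
    """Build one `pip install` command per (index URL, requirements file)."""
    return [
        [sys.executable, "-m", "pip", "install",
         "--index-url", index_url, "-r", requirements_file]
        for index_url, requirements_file in groups
    ]


commands = pip_install_commands([
    ("https://pypi.internal.example/simple/", "requirements-internal.txt"),
    ("https://pypi.org/simple/", "requirements-public.txt"),
])
```

Each command sees only its own index, which is exactly the limitation described: dependencies that live on the other index are not found during that invocation.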
