Proposal: Preventing dependency confusion attacks with the map file

What do package and project mean in this context?

Sorry, maybe not using standard nomenclature.

Project: django, numpy, etc
Package: django-1.0.0.tar.gz, etc

Where you are using ‘package’ the PyPA term is ‘distribution’; it’s either a source distribution (sdist) or a binary distribution (wheel).

The term ‘package’ in Python means something very different!

Ok, so is it accurate to say instead “at most one index would serve at most one package distribution for a project”? If so, you get the idea.

I think you may be inverting responsibility in that statement to the point where it doesn’t make much sense.

The index will list all the distribution files it has for a particular project, and the installer gets to choose which one(s) it wants to download and use. In general, it’s up to the installer to make these decisions.

So what you seem to be saying is “an installer will select at most one distribution from at most one index for a project”, which is true, but isn’t very helpful. And since it’s so unhelpful, I have to assume I don’t know what you’re actually trying to get at :)

Sorry, but I’ve tried my best to explain to no avail. I’m planning to write a document elsewhere that will describe these issues in more detail, and will share it here when ready.

There’s been discussion happening on the pip issue tracker about this issue as well, and I think that out of that a pretty reasonable idea has surfaced. I wrote that idea up in a semi-formal proposal doc so people can see it without having to decipher multiple discussions (some going back years). I’ve also included some ideas on communicating this out and what our messaging to end users might end up being.

This isn’t any sort of formal proposal of exactly what we should do. For things like the exact option names, the names of files, etc., I’ve just kind of jammed some name in there for now as a placeholder to make it easier to reason about until we get to dealing with specifics.

If folks think this path is generally a reasonable one to go down, we can start figuring out how to break this out into an actual PEP (or PEPs maybe?), but I figured we can get some rough consensus on something like this as an approach first.

You can find that at Securing pip against Dependency Confusion.

Am I reading the proposal correctly in that alternate-locations is for someone like PyPI for projects to list alternative places to look, while tracks is for those who stand up their index or proxy?

Roughly yeah. tracks and alternate-locations basically serve the same purpose, but alternate-locations is “harder” to use, because it’s intended for cases where you don’t know if you trust the metadata yet.

So for example, Piwheels would use tracks, because if you’re choosing to use Piwheels you’re choosing to inherently trust the repository operators behind Piwheels, so if they say “hey, we’re tracking the same namespace as PyPI” then we can assume that’s trustworthy.

However, let’s say that someone decides to test their project on TestPyPI [1], and someone registers the name mypy on TestPyPI and wants to allow using that with mypy on PyPI. TestPyPI couldn’t use the tracks metadata here because, since it’s unidirectional, TestPyPI has no mechanism to validate that the person who owns mypy on TestPyPI is the same person who owns it on PyPI.

Which is where the alternate-locations metadata comes in.

The mypy project on PyPI would set alternate-locations = ["https://test.pypi.org/simple/mypy/"] and the mypy project on TestPyPI would set alternate-locations = ["https://pypi.org/simple/mypy/"].

By default this doesn’t do anything to change where pip looks (so it’s not like the old days where pip would implicitly add TestPyPI to its places to look for files), but instead if someone did:

$ pip install --extra-index-url https://test.pypi.org/simple/ mypy

Pip would basically follow this flow:

  1. Fetch https://pypi.org/simple/mypy/
    a. See the alternate-locations = ["https://test.pypi.org/simple/mypy/"]
    b. add the implicit https://pypi.org/simple/mypy/
  2. Fetch https://test.pypi.org/simple/mypy/
    a. See the alternate-locations = ["https://pypi.org/simple/mypy/"]
    b. add the implicit https://test.pypi.org/simple/mypy/
  3. Scan the list of files and build a set of all of the locations they came from.
  4. Do file locations - (pypi_locations & testpypi_locations), and if that isn’t an empty set, return an error.
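The set arithmetic in that flow can be sketched in a few lines of Python. This is only an illustration of the idea, not pip code: the function names, the `pages` mapping, and the trailing-slash normalization are all assumptions made for the example.

```python
def normalize(url: str) -> str:
    """Treat URLs with and without a trailing slash as the same location."""
    return url.rstrip("/") + "/"


def trusted_locations(pages: dict[str, list[str]]) -> set[str]:
    """Intersect the location sets declared by every fetched project page.

    `pages` (hypothetical structure) maps each fetched page URL to the
    alternate-locations list found there. Each page implicitly adds itself.
    """
    declared = [
        {normalize(page)} | {normalize(alt) for alt in alternates}
        for page, alternates in pages.items()
    ]
    return set.intersection(*declared) if declared else set()


def check_file_locations(pages: dict[str, list[str]], file_locations: set[str]) -> None:
    """Raise if any file came from a location that not every page vouches for."""
    unvouched = {normalize(url) for url in file_locations} - trusted_locations(pages)
    if unvouched:
        raise RuntimeError(f"locations not mutually declared: {sorted(unvouched)}")


PYPI = "https://pypi.org/simple/mypy/"
TEST = "https://test.pypi.org/simple/mypy/"

# Both pages point at each other: the difference is empty, so no error.
mutual = {PYPI: [TEST], TEST: [PYPI]}
check_file_locations(mutual, {PYPI, TEST})

# Only TestPyPI claims the link; PyPI says nothing, so TEST is rejected.
one_sided = {PYPI: [], TEST: [PYPI]}
```

Calling `check_file_locations(one_sided, {PYPI, TEST})` raises, because the intersection only contains the PyPI location and the TestPyPI location is left over after the subtraction.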

So essentially they do the same thing (allow pip to associate the same name on two different repositories with each other, without intervention from the end user), but they operate in different ways due to the different trust models that exist between a person who operates a repository and a person who just happens to own a name on a repository.

To be honest, I suspect in most cases the tracks metadata will be used, because for most repositories out there that aren’t PyPI or TestPyPI, all of the projects are owned by the same people who operate the repository. Since that means we can inherently trust them, they can just use the tracks metadata, which is simpler and easier to use.

However, I didn’t want to special case PyPI here and I didn’t want to assume that no other repository like PyPI, where the repository operators and the project owners are different groups of people, exists or would ever exist, so I designed the feature such that it would allow an untrusted project owner to prove they own that name on multiple repositories.


  1. TestPyPI is bad, but it’s the most well known example I have offhand of this.

Hopefully the package author could set alternate-locations = ["https://test.pypi.org/simple/mypy/", "https://pypi.org/simple/mypy/"] in both and then the index only returns locations other than itself?

“Intended index” isn’t a build-time parameter for most people, though there could be value in making people decide up front (particularly if we ever get any kind of package/metadata signing integrated into builds).

@dstufft Could we get comment access on your GDoc? Unwieldy to read there but comment here (also important context lost). Thx!

The link should have commenter access now.

Yes, or it could report both anyways. The important thing is that all of the values agree with each other. The implicit addition of the current index is just there to prevent mistakes where someone leaves out the current location (it also allows some minor size optimization for repositories that are particularly bandwidth sensitive).

The set math will work either way, so it should be fine either way.

I’m posting this here and to the pip issue just so nobody misses it.

It’s been about 10 days since I posted my proposal and other than a few questions I haven’t seen anyone raise an objection to the overall idea, and previously folks had seemed on board with the idea (the longer proposal was designed to make sure everyone was on the same page and to make it easier for people to jump in without having to read both threads in their entirety).

Given nobody has objected, I’m going to take that as a sign that it’s worthwhile to take this to a PEP, so I’ll go ahead and start working on that. I plan to focus that PEP around the changes to the repository protocol and what those implications are for installers. I will likely include non-normative recommendations that give installers some high-level guidance for matching the rough behavior in the proposal, but I won’t spell out specific UX for installers.

As a new user, I am unable to post over 2 links. Therefore, I have placed the links corresponding to the numbers in brackets here: https://pastebin.com/raw/4FLbUL9z

--

My $0.02 as a ‘regular’ pip user:

My use case:

  • Install packages from both PyPI and an internal simple index[1].
  • Internal packages should be installed only from the internal index. Never from PyPI.
  • Dependencies of internal packages may be installed from PyPI.

I have investigated the following options:

--extra-index-url

Looks for a package on both PyPI and the given index URL, and installs the best-matching version regardless of which index it came from.

Do not use this if the package should be installed only from the given index URL. An attacker could upload a package with the same name to PyPI, causing the wrong package to be installed. This is called ‘dependency confusion’.
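To make the risk concrete, here is a deliberately simplified model of candidate selection when two indexes are merged. It is a sketch under assumptions, not pip's actual resolver: the project name, the version tuples, and `pick_candidate` are made up for illustration, and real installers also weigh wheel tags, yanked releases, and so on.

```python
def pick_candidate(candidates: list[tuple[str, tuple[int, ...]]]) -> tuple[str, tuple[int, ...]]:
    """Pick the highest version, ignoring which index it came from."""
    return max(candidates, key=lambda candidate: candidate[1])


# Hypothetical internal project "acme-utils", published internally as 1.2.0.
internal_index = [("internal", (1, 2, 0))]

# An attacker registers the same name on the public index with a huge version.
public_index = [("pypi", (99, 0, 0))]

# With --extra-index-url, candidates from both indexes are considered together.
chosen = pick_candidate(internal_index + public_index)
print(chosen)  # ('pypi', (99, 0, 0))
```

Because the origin of a file plays no part in the comparison, the attacker only needs a higher version number to win.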

Remedy:

Use hash-checking mode[2]. If the hash does not match, the package will not be installed. This prevents the wrong package from being installed if it does exist on PyPI.
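The principle behind hash checking can be sketched with the standard library. This only illustrates the comparison step; `matches_pinned_hash` and the byte strings are hypothetical stand-ins, and pip's own implementation is not shown here.

```python
import hashlib


def matches_pinned_hash(data: bytes, allowed_sha256: set[str]) -> bool:
    """Return True only if the downloaded bytes hash to a pinned digest."""
    return hashlib.sha256(data).hexdigest() in allowed_sha256


# Stand-ins for the contents of two same-named distribution files.
internal_file = b"contents of the internal distribution"
attacker_file = b"contents of a same-named upload on PyPI"

# The digest recorded when the requirements were pinned.
pinned = {hashlib.sha256(internal_file).hexdigest()}

print(matches_pinned_hash(internal_file, pinned))  # True
print(matches_pinned_hash(attacker_file, pinned))  # False
```

The check is only as good as the digest that was pinned: whichever file was hashed at pinning time is the file that will be accepted later.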

However, this remedy has several issues:

  • The wrong hash is recorded if a package with the same name already exists on PyPI when the hashes are generated.
  • Hash is created for a specific version. Therefore, dependencies must be pinned to specific versions, which is generally discouraged for libraries[3].

Conclusion: not suitable.

--index-url

Installs a package from the given index URL. Never from PyPI.

Use this if the package should be installed only from the given index URL. If a package with the same name exists on PyPI, it is not installed. After all: only the index URL is used. Not PyPI. This solves ‘dependency confusion’.

However, this option has several issues:

  • When using requirements.txt: --index-url applies to the entire file. It is not possible to use --index-url for a specific package. Workaround: use several requirements.txt files, and include them using -r.
  • It is not possible for a package which should be installed from a specific index URL to depend on packages from PyPI. After all: only the index URL is used. Workaround: use an index proxy (see below).

Conclusion: not suitable.

Index proxy

Proxies requests to PyPI and/or custom indices.

This solves all issues with --index-url. After all: the issues with --index-url stem from not being able to properly use multiple indices. With an index proxy, pip uses only one index.
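As a rough sketch of what such a proxy decides for each request, the routing step could look like the following. This is not how simpleindex or devpi are implemented; the `route` function, the `acme-*` pattern, and the internal index URL are assumptions for illustration. Only the name normalization follows the rule from PEP 503.

```python
import re
from fnmatch import fnmatchcase

PYPI = "https://pypi.org/simple/"


def normalize_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def route(project: str, routes: list[tuple[str, str]], default: str = PYPI) -> str:
    """Return the single upstream page that should serve this project."""
    name = normalize_name(project)
    for pattern, index_url in routes:
        if fnmatchcase(name, pattern):
            return f"{index_url}{name}/"
    return f"{default}{name}/"


# Hypothetical rule: everything named acme-* lives on the internal index only.
routes = [("acme-*", "https://pypi.internal.example/simple/")]

print(route("Acme_Utils", routes))  # https://pypi.internal.example/simple/acme-utils/
print(route("requests", routes))    # https://pypi.org/simple/requests/
```

Each project name maps to exactly one upstream, so a same-named upload on the other index is never consulted.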

The most-mentioned options are:

  • simpleindex[4]
  • devpi[5]

I also came across Thoth[6]. It was created for the very purpose of avoiding ‘dependency confusion’. However, its documentation does not seem to be straightforward. Additionally, it uses ‘AI’ and ‘moves the resolution process to the cloud’. I will pick an effective and elegant solution over marketing blah-blah.

Conclusion: suitable.

--

Based on my investigation, an index proxy seems like the only suitable solution at the moment.

In my view, being able to install packages from multiple repositories - without being vulnerable to ‘dependency confusion’ - is a simple and common use case, and should therefore not require an end user to run and maintain additional software.

At the time of writing, the best proposal that I - an end user - have come across is specifying an index URL per package[7]. This solves all issues with --index-url (see above).

Really? I had no idea this (the workaround described here) would work. Is that true? Is that a usage supported by pip? Is that something that can be recommended to users?

I have not tested this, but I would be surprised, knowing what I know about pip internals, if this worked.

Using several requirements.txt files, and installing each with its own --index-url, is doable. However, it has the same issue pointed out here, which is that if a package in one index depends on packages that come from another index, there is no way to automatically retrieve the dependencies from the other index during installation.
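For what it's worth, the "one invocation per index" approach could be scripted roughly like this. It is a sketch that only builds the command lines: the file names and the internal index URL are hypothetical, and it does nothing about the cross-index dependency problem described above.

```python
import sys


def pip_commands(requirements_by_index: dict[str, str]) -> list[list[str]]:
    """Build one `pip install` command per (index URL, requirements file) pair."""
    return [
        [sys.executable, "-m", "pip", "install", "--index-url", index_url, "-r", path]
        for index_url, path in requirements_by_index.items()
    ]


# Hypothetical split: internal projects in one file, public ones in another.
commands = pip_commands(
    {
        "https://pypi.internal.example/simple/": "requirements-internal.txt",
        "https://pypi.org/simple/": "requirements-public.txt",
    }
)

for command in commands:
    print(" ".join(command))
```

Each command could then be handed to `subprocess.run(command, check=True)`, one after the other.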