What do package and project mean in this context?
Sorry, maybe not using standard nomenclature.
Project: django, numpy, etc
Package: django-1.0.0.tar.gz, etc
Where you are using "package" the PyPA term is "distribution"; it's either a source distribution (sdist) or a binary distribution (wheel).
The term "package" in Python means something very different!
Ok, so is it accurate to say instead "at most one index would serve at most one package distribution for a project"? If so, you get the idea.
I think you may be inverting responsibility in that statement to the point where it doesn't make much sense.
The index will list all the distribution files it has for a particular project, and the installer gets to choose which one(s) it wants to download and use. In general, it's up to the installer to make these decisions.
So what you seem to be saying is "an installer will select at most one distribution from at most one index for a project", which is true, but isn't very helpful. And since it's so unhelpful, I have to assume I don't know what you're actually trying to get at.
Sorry, but I've tried my best to explain to no avail. I'm planning to write a document elsewhere that will describe these issues in more detail, and will share it here when ready.
There's been discussion happening on the pip issue tracker about this issue as well, and I think that out of that a pretty reasonable idea has surfaced. I wrote that idea up in a semi-formal proposal doc so people can see it without having to decipher multiple discussions (some going back years). I've also included some ideas on communicating this out and what our messaging to end users might end up being.
This isn't any sort of formal proposal of exactly what we should do. For things like the exact option names, the names of files, etc., I've just kind of jammed some name in there for now that can serve as a placeholder, to make it easier to reason about until we get to dealing with specifics.
If folks think this path is generally a reasonable one to go down, we can start figuring out how to break this out into an actual PEP (or PEPs maybe?), but I figured we can get some rough consensus on something like this as an approach first.
You can find that at Securing pip against Dependency Confusion.
Am I reading the proposal correctly in that alternate-locations is for someone like PyPI for projects to list alternative places to look, while tracks is for those who stand up their index or proxy?
Roughly yea, tracks and alternate-locations basically serve the same purpose, but alternate-locations is "harder" to use, because it's intended for cases where you don't know if you trust the metadata yet.
So for example, Piwheels would use tracks, because if you're choosing to use piwheels you're choosing to inherently trust the repository operators behind Piwheels, so if they say "hey we're tracking the same namespace as PyPI" then we can assume that's trustworthy.
However, let's say that someone decides to test their project on TestPyPI [1], and someone registers the name mypy on TestPyPI and wants to allow using that with mypy on PyPI. TestPyPI couldn't use the tracks metadata here because, since it's unidirectional, TestPyPI has no mechanism to validate that the person who owns mypy on TestPyPI is the same person who owns it on PyPI.
Which is where the alternate-locations metadata comes in.
The mypy project on PyPI would set alternate-locations = ["https://test.pypi.org/simple/mypy/"] and the mypy project on TestPyPI would set alternate-locations = ["https://pypi.org/simple/mypy/"].
By default this doesn't do anything to change where pip looks (so it's not like the old days where pip would implicitly add testpypi to its places to look for files), but instead if someone did:
$ pip install --extra-index-url https://test.pypi.org/simple/ mypy
Pip would basically follow this flow:
- Fetch https://pypi.org/simple/mypy/
  a. See the alternate-locations = ["https://test.pypi.org/simple/mypy/"]
  b. Add the implicit https://pypi.org/simple/mypy/
- Fetch https://test.pypi.org/simple/mypy/
  a. See the alternate-locations = ["https://pypi.org/simple/mypy/"]
  b. Add the implicit https://test.pypi.org/simple/mypy/
- Scan the list of files and build a set of all of the locations they came from.
- Do file locations - (pypi_locations & testpypi_locations), and if that isn't an empty set, return an error.
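The set check in this flow can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not pip's implementation: the function name unexpected_locations and the shape of the declared mapping are made up for the example, and URLs are assumed to be already normalized (trailing slash included).

```python
PYPI = "https://pypi.org/simple/mypy/"
TESTPYPI = "https://test.pypi.org/simple/mypy/"


def unexpected_locations(declared, file_locations):
    """Return the file locations that not every repository vouches for.

    `declared` maps each fetched project page URL to the
    alternate-locations list that page advertised (hypothetical shape).
    """
    agreed = None
    for page_url, alternates in declared.items():
        # Each page implicitly includes its own location.
        locations = set(alternates) | {page_url}
        agreed = locations if agreed is None else agreed & locations
    # Anything left over came from a location the pages do not all agree on.
    return set(file_locations) - (agreed or set())


# Both projects point at each other: nothing is unexpected.
mutual = {PYPI: [TESTPYPI], TESTPYPI: [PYPI]}
# Only the PyPI project points at TestPyPI: the claim is not reciprocated.
one_sided = {PYPI: [TESTPYPI], TESTPYPI: []}
```

With mutual, the result for files from both locations is an empty set, so installation would proceed. With one_sided, the PyPI location is left over, which is the case where an installer would return an error.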
So essentially they do the same thing (allow pip to associate the same name on two different repositories with each other, without intervention from the end user), but they operate in different ways due to the different trust models that exist between a person who operates a repository and a person who just happens to own a name on a repository.
To be honest, I suspect in most cases the tracks metadata will be used, because on most repositories out there that aren't PyPI or TestPyPI, all of the projects are owned by the same people who are operating the repository, and since that means we can inherently trust them, they can just use the tracks metadata since it's simpler and easier to use.
However I didnāt want to special case PyPI here and I didnāt want to assume that no other repository like PyPI, where the repository operators and the project owners are different groups of people, exists or would ever exist, so I designed the feature such that it would allow an untrusted project owner to prove they own that name on multiple repositories.
[1] TestPyPI is bad, but it's the most well known example I have offhand of this.
Hopefully the package author could set alternate-locations = ["https://test.pypi.org/simple/mypy/", "https://pypi.org/simple/mypy/"] in both and then the index only returns locations other than itself?
"Intended index" isn't a build-time parameter for most people, though there could be value in making people decide up front (particularly if we ever get any kind of package/metadata signing integrated into builds).
@dstufft Could we get comment access on your GDoc? Unwieldy to read there but comment here (also important context lost). Thx!
The link should have commenter access now.
Yes, or it could report both anyways. The important thing is that all of the values agree with each other. The implicit addition of the current index is just there to prevent mistakes where someone leaves out the current location (it also allows some minor size optimization for repositories that are particularly bandwidth sensitive).
The set math will work either way, so it should be fine either way.
I'm posting this here and to the pip issue just so nobody misses it.
It's been about 10 days since I posted my proposal and other than a few questions I haven't seen anyone raise an objection to the overall idea, and previously folks had seemed on board with the idea (the longer proposal was designed to make sure everyone was on the same page and to make it easier for people to jump in without having to read both threads in their entirety).
Given nobody has objected, I'm going to take that as a sign that it's worthwhile to take this to a PEP, so I'll go ahead and start working on that. I plan to focus that PEP around the changes to the repository protocol and what those implications are for installers. I will likely include non-normative recommendations that give installers some high-level guidance for matching the rough behavior in the proposal, but I won't spell out specific UX for installers.
As a new user, I am unable to post over 2 links. Therefore, I have placed the links corresponding to the numbers in brackets here: https://pastebin.com/raw/4FLbUL9z
My $0.02 as a "regular" pip user:
My use case:
- Install packages from both PyPI and an internal simple index[1].
- Internal packages should be installed only from the internal index. Never from PyPI.
- Dependencies of internal packages may be installed from PyPI.
I have investigated the following options:
--extra-index-url
Installs a package from PyPI if it exists, or from the given index URL.
Do not use this if the package should be installed only from the given index URL. An attacker could upload a package with the same name to PyPI, causing the wrong package to be installed. This is called "dependency confusion".
Remedy:
Use hash-checking mode[2]. If the hash does not match, the package will not be installed. This prevents the wrong package from being installed if it does exist on PyPI.
However, this remedy has several issues:
- If a package with the same name already exists on PyPI when the hashes are generated, the hash of the wrong package may be recorded.
- A hash is created for a specific version. Therefore, dependencies must be pinned to specific versions, which is generally discouraged for libraries[3].
Conclusion: not suitable.
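For readers unfamiliar with hash-checking mode, here is a minimal Python sketch of how a pinned requirement line with a sha256 hash is formed. The helper name pinned_requirement, the project name internal-lib, and the byte string standing in for a downloaded file are all made up for illustration; in practice the digest would be computed over the actual sdist or wheel.

```python
import hashlib


def pinned_requirement(name, version, artifact_bytes):
    """Build a requirements.txt line for pip's hash-checking mode.

    The hash covers one exact file, which is why the version has to be
    pinned with == alongside it.
    """
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return f"{name}=={version} --hash=sha256:{digest}"


# Stand-in bytes; a real digest would come from the trusted distribution file.
line = pinned_requirement("internal-lib", "1.2.0", b"trusted distribution file")
```

The sketch also shows where the first issue above comes from: whichever file is hashed at generation time is the one that gets trusted, so hashing a file fetched from the wrong index pins the wrong package.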
--index-url
Installs a package from the given index URL. Never from PyPI.
Use this if the package should be installed only from the given index URL. If a package with the same name exists on PyPI, it is not installed. After all: only the index URL is used. Not PyPI. This solves "dependency confusion".
However, this option has several issues:
- When using requirements.txt: --index-url applies to the entire file. It is not possible to use --index-url for a specific package. Workaround: use several requirements.txt files, and include them using -r.
- It is not possible for a package which should be installed from a specific index URL to depend on packages from PyPI. After all: only the index URL is used. Workaround: use an index proxy (see below).
Conclusion: not suitable.
Index proxy
Proxies requests to PyPI and/or custom indices.
This solves all issues with --index-url. After all: the issues with --index-url stem from not being able to properly use multiple indices. With an index proxy, pip uses only one index.
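The routing decision such a proxy makes for my use case can be sketched roughly as follows. This is not how simpleindex or devpi are configured; the function upstream_for, the internal index URL, and the project names are hypothetical. Only the name normalization follows the rule from the simple repository API (PEP 503).

```python
import re

PYPI_INDEX = "https://pypi.org/simple/"
INTERNAL_INDEX = "https://pypi.internal.example/simple/"  # hypothetical URL


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def upstream_for(project, internal_projects):
    """Pick the single upstream index a request should be forwarded to.

    Internal names are only ever looked up on the internal index, so a
    same-named upload on PyPI is never consulted for them.
    """
    internal = {normalize(p) for p in internal_projects}
    return INTERNAL_INDEX if normalize(project) in internal else PYPI_INDEX


internal_names = ["my-internal-lib"]  # hypothetical internal project
```

Because each project name maps to exactly one upstream, internal packages can still depend on packages from PyPI while pip itself only ever talks to the proxy.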
The most-mentioned options are:
- simpleindex[4]
- devpi[5]
I also came across Thoth[6]. It was created for the very purpose of avoiding "dependency confusion". However, its documentation does not seem to be straightforward. Additionally, it uses "AI" and "moves the resolution process to the cloud". I will pick an effective and elegant solution over marketing blah-blah.
Conclusion: suitable.
Based on my investigation, an index proxy seems like the only suitable solution at the moment.
In my view, being able to install packages from multiple repositories - without being vulnerable to "dependency confusion" - is a simple and common use case, and should therefore not require an end user to run and maintain additional software.
At the time of writing, the best proposal that I - an end user - have come across is specifying an index URL per package[7]. This solves all issues with --index-url (see above).
Really? I had no idea this (the workaround described here) would work. Is that true? Is that a usage supported by pip? Is that something that can be recommended to users?
I have not tested this, but knowing what I know about pip internals, I would be surprised if this worked.
Using several requirements.txt files, and installing each with its own --index-url, is doable. However, it has the same issue pointed out here, which is that if a package in one index depends on packages that come from another index, there is no way to automatically retrieve the dependencies from the other index during installation.
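As a rough sketch of what "installing each with its own --index-url" could look like, the snippet below only builds the pip command lines and does not run them. The file names and the internal index URL are hypothetical; --index-url and -r are existing pip options.

```python
import sys


def install_commands(requirements_by_index):
    """Build one `pip install` invocation per index.

    Each invocation sees only its own index, so dependencies that live
    on another index are not resolved automatically.
    """
    return [
        [sys.executable, "-m", "pip", "install",
         "--index-url", index_url, "-r", path]
        for index_url, path in requirements_by_index.items()
    ]


commands = install_commands({
    "https://pypi.org/simple/": "requirements-public.txt",
    # Hypothetical internal index and file name.
    "https://pypi.internal.example/simple/": "requirements-internal.txt",
})
```

Each list could be passed to subprocess.run. The limitation described above still applies: anything an internal package needs from PyPI would have to be listed explicitly in the public requirements file.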