External hosting linked to via PyPI

Allow external hosting on PyPI
We get a lot of requests for this, but most folks aren’t aware that this used to be a feature of PyPI that was removed via PEP 470 because in practice it wasn’t great for end-users, similar to reasons mentioned above.

After reading PEP 470 I have not found a solid reason why allowing external hosting is a bad idea. The concerns I saw are:

  • Security of packages coming from an external repo.
  • End-user experience
    • The previous, now-reverted implementation of this proposal may have had a bad UX, but I haven’t seen technical arguments why there can’t be a good UX, with reasonable rules and error messages, etc. Why not start discussing what a good UX should be, before rejecting this idea?
    • At the end of the day having pip install external-package working automatically is better user experience than manually adding/using external index.
  • Package author’s experience
    • If this is too much overhead, they can continue to ask for size exemptions like today. But this is at least going to help many of the big organizations who are releasing the biggest packages on pypi, as they probably can afford such overhead in exchange for a better user experience for their end-users.

Would love to hear more insights about more concrete reasons against external hosting, or whether this idea can be revisited today.

(Maybe this should be split into a separate thread)

If an organization whitelists destinations, pip could become unusable unless the required external hosts are whitelisted as well. Of course one could argue an organization should put up a (proxy) cache in this kind of situation.
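To make the whitelist concern concrete, here is a minimal sketch of how one might check which hosts a set of file URLs would require beyond an existing whitelist. The function name and the example URLs are hypothetical; only `pypi.org` and `files.pythonhosted.org` are real PyPI hosts.

```python
from urllib.parse import urlsplit


def hosts_outside_whitelist(file_urls, allowed_hosts):
    """Return the sorted hosts in file_urls that are not whitelisted."""
    allowed = {host.lower() for host in allowed_hosts}
    blocked = set()
    for url in file_urls:
        # urlsplit().hostname is already lowercased; it is None for malformed URLs.
        host = urlsplit(url).hostname or ""
        if host not in allowed:
            blocked.add(host)
    return sorted(blocked)


# Hypothetical example: one file on PyPI's own file host, one hosted externally.
urls = [
    "https://files.pythonhosted.org/packages/example-1.0.tar.gz",
    "https://downloads.example.org/big-package-2.0-py3-none-any.whl",
]
print(hosts_outside_whitelist(urls, {"pypi.org", "files.pythonhosted.org"}))
```

Every host this reports would have to be added to the whitelist (or served through a proxy cache) before an install could succeed.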

In Nixpkgs we store the URLs in our expressions to fetch packages. Would PyPI provide redirections? Otherwise, are there any requirements on the URLs of external hosts? I am thinking here about being able to support mirrors.

Thanks! Also wondering if this should be discussed separately. Anyway I’m responding to some concerns below:

whitelist destinations

This is true but I think it is not worse than the situation where large libraries like pytorch host their own index. Fundamentally, unless pypi.org is able to provide enough storage for everyone (seems unlikely?), this issue cannot be avoided.

Would PyPI provide redirections?

Maybe? This loses the ability to checksum the integrity of external packages, so in general it feels like an undesirable and surprising change. But the use case is still valid, because Nixpkgs seems to have its own checksums anyway (I’m not familiar with it).
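For illustration, a minimal sketch of the kind of integrity check at stake: if the expected digest is recorded somewhere trusted (for example in the index or in a Nix expression), the downloaded bytes can be verified no matter which host served them. The function names are hypothetical, and SHA-256 is assumed as the hash.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file at path matches the expected digest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A plain redirect with no recorded digest gives the client nothing to compare against, which is the loss described above.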

being able to support mirrors

If pypi were to provide external URLs, I think mirrors should be able to decide whether to serve external URLs like pypi, or to also mirror external packages, on a per-package basis (or even per-file). It’s a decision they should make based on their users’ needs and storage capacity. As an example, in the anaconda mirror of TUNA, we selectively mirror a few anaconda cloud channels based on users’ requests (https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/).

Just to be up front, if someone has a proposal to add a new external links feature, they’re free to make that proposal and we can discuss its merits and decide on it.

To fully understand the context of PEP 470, you should also read PEP 438, but there were aspects of that deprecation that were specific to how external link hosting was implemented on PyPI (tbh, external file hosting on PyPI was never really designed, it was organically arrived at by the intersection of multiple tools’ random implementation details, so it was pretty messy in particular).

If you ignore the parts that were specific to the old implementation, the big check marks against it were:

  • It reduces the availability of pip install, since (assuming you’re using the defaults) you’re always going to depend on PyPI itself being up to discover the links. Adding additional points of failure just means that it’s more likely that pip install fails.
    • This oftentimes manifested as the PyPI team being called on to investigate issues of “PyPI being down”, when in reality it was some third party host that was down, which put further strain on our volunteer team.
  • It makes mirroring harder and incomplete. PyPI doesn’t require that something is F/OSS to be released to PyPI; when you upload you agree to a terms of service that basically allows us, and anyone else, to distribute the files you upload. With external hosting those files are no longer covered by that ToS (we would have to ask a lawyer if they could be), so mirrors can’t just automatically assume they have the right to download and mirror those files, which means that mirrors can only cover the files on PyPI itself, and still have to refer to the external file, which makes the mirror only a partial mirror.
  • PEP 470 believed that users should be in control of where pip, acting on their behalf, talks to. By default it talks to PyPI, and that’s sort of a well known thing, but we can’t expect users to know every package that hosts itself externally. Thus implicitly talking to third party hosts means that those users are no longer in direct control of where pip will contact on their behalf, and that is instead delegated to random projects.
  • Having things “in control” of PyPI makes implementing future features easier, for instance there’s now a PEP where we’re going to extract metadata files and host them (probably alongside the original file) to make version resolution faster. Backfilling that, and/or mandating it for new releases, would be difficult or impossible with externally hosted files.
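As a rough illustration of the availability point in the first bullet: if an install needs PyPI and one or more external hosts to all be up, and their failures are assumed to be independent, the chance of success is the product of the individual availabilities. The figures below are made-up assumptions, not measurements.

```python
from math import prod


def combined_availability(*availabilities):
    """Probability that every required host is up, assuming independent failures."""
    return prod(availabilities)


# Hypothetical figures: PyPI up 99.9% of the time, an external host up 99%.
pypi_only = combined_availability(0.999)
with_external = combined_availability(0.999, 0.99)
print(f"PyPI only: {pypi_only:.5f}, PyPI + external host: {with_external:.5f}")
```

Under those assumptions the combined figure is always at or below the least available host, so each additional host can only lower it.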

Like I said though, all decisions can be revisited if there’s someone who wants to write a PEP with a compelling argument, but those are the biggest concerns I personally have with the abstract concept of external file hosting.
