PEP 759, External Wheel Hosting

This PEP proposes what I think is a unique approach to safely hosting external wheels, while keeping PyPI as the source of truth for project and release metadata. The PEP proposes a new uploadable file format, easily derived from .whl files, which includes metadata to point to an external wheel hosting service, tied to organizational accounts. PEP 759 acknowledges the history behind PEPs 438 and 470, and explains how this approach is different.

This PEP comes out of the wheel-next initiative, a collaboration among many folks who are looking to evolve the various standards to handle the needs of some organizations publishing packages to PyPI. Other initiatives under this umbrella include wheel 2.0, variants, etc.

All due credit and thanks goes to my coauthor @ethanhs and various reviewers who helped shape the initial seed of an idea into this proposal. Should this PEP be accepted, we plan on implementing the feature in warehouse.

4 Likes

Seems like a plausible approach. The PEP definitely does a good job of anticipating potential objections.

Some potential issues/questions:

  • artifacts that aren’t uploaded to PyPI aren’t intrinsically covered by the “worldwide non-revocable binary distribution permission” included in the PyPI terms of service. For PSF and PyPI client legal safety, obtaining permission to enable external hosting needs to extend that license to externally hosted files referenced via rim files.
  • should PyPI archive all externally hosted wheels as a risk mitigation activity? (the PEP doesn’t really need to specify one way or another, just ensure the legal right to do that exists)
  • how will TUF metadata be handled? (should be OK given the way the PEP works, but readers shouldn’t have to work through the implications on their own)
  • similar question for PyPI mirrors (caching proxies in particular may want to catch when external hosting is used and cache those files anyway)
  • what happens if a rim file is uploaded, replaced by a whl file, then the whl file is deleted? Also deleting the rim file in that case would be consistent with the other index requirements in the PEP.
  • when retrieving support information for a failed external download, how do clients get the rim file URL or the related support metadata if the index server API is only reporting the location of the external whl archive?
  • it would be nice if the support metadata could directly include a reference to a service status URL like https://status.python.org/

And some backronym bikeshedding (since the “Installable” bit reads oddly to me):

  • Remote Installation Metadata (it’s the download for installation that is remote from the index, rather than the metadata itself being remotely installed)
  • Remote Index Metadata (the metadata index upload is remote from the wheel artifact upload)
  • Remote Index/Installation Metadata (both terms are accurate, it just depends on your point of view)

Edit: “Retrieval Indirection Metadata” would be an unambiguous backronym that doesn’t rely on the ambiguously relative “Remote” term.

4 Likes

this feature MUST be explicitly enabled by PyPI admins at their discretion. Since this will not be a common request, we don’t expect the overhead to be nearly as burdensome as […] file/project size increase requests

Why do you expect this, when the requests would come from the same people for the same reasons? If the current process “can take some time to resolve”, wouldn’t that apply to the new process as well? And if the problems with the existing process were fixed, would the complexity added by this PEP still be justified?

The PEP currently starts off from the premise that the existing process cannot be fixed. If this is true, then the PEP should explain why, including quantifying the exception request volumes that we’re currently facing, the sizes they’re requesting, and the time they take to resolve. If there are any existing external indexes which could never be hosted on PyPI, either because of their size or other reasons, that should be explained as well.

2 Likes

How about .rind for Remote INstallation Data. This is because wheels are cheese.

5 Likes

Thanks for this @barry! I also think this is a plausible approach, and the technical design proposed makes a lot of sense to me (with my “hacking on Warehouse” hat on).

Some scattered thoughts:

  1. The RIM contains hashes for the remote wheel, which the index is expected to serve via the hashname=hash fragment. If I understand correctly, this is behaviorally compatible with pip (which uses the fragment hash to check for accidental corruption), but not semantically compatible (pip’s docs say the hash fragment is only for accidental corruption, whereas this will promote it to protecting against a third-party server going rogue and serving something malicious). So, I think pip’s docs and/or this PEP should probably clarify that the hash fragment MUST be checked and should be treated as a security boundary rather than a data corruption boundary. (A minimal sketch of such a check follows this list.)
  2. RIM-supplied URLs are de facto identifiable (since they won’t point to pythonhosted.org), but maybe it makes sense to make that explicit? In other words, if the index receives a RIM, perhaps the resulting index entry should have data-externally-hosted=true (and similar for JSON)? I’m not sure if this makes sense to do, though.
  3. How does this interact with mirrors? This is related to @ncoghlan’s question about licensing, but also on a technical level: I believe most mirrors currently use the journaling mechanism via XML-RPC, so the journal might also need to distinguish RIMs from wheels to help mirroring clients fetch and persist the externally hosted wheel.
  4. On a policy level: should orgs that are approved for RIM uploads be required to commit to a certain level of uptime/availability? I think it’s in PyPI’s interest to impose some kind of availability requirement here, since a single (but very critical) externally hosted wheel can potentially cause widescale disruption across package consumers.
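
On point 1 above, here’s a minimal sketch (not pip’s actual implementation) of what treating the fragment as a security boundary looks like on the installer side: a mismatch is a hard failure rather than a retryable corruption error. The function name and error handling are illustrative only.

```python
# Minimal sketch, assuming the index link carries a "#<hashname>=<hexdigest>"
# fragment; this is not pip's actual code, just the shape of the check.
import hashlib
from urllib.parse import urlsplit

def verify_fragment_hash(url: str, payload: bytes) -> None:
    """Reject a downloaded wheel whose hash doesn't match the URL fragment."""
    fragment = urlsplit(url).fragment  # e.g. "sha256=abc123..."
    if not fragment or "=" not in fragment:
        raise ValueError("externally hosted wheel link has no hash fragment")
    hashname, _, expected = fragment.partition("=")
    actual = hashlib.new(hashname, payload).hexdigest()
    if actual != expected:
        # Treat this as possible compromise of the external host, not as
        # transient corruption: fail hard rather than retrying the download.
        raise ValueError(f"{hashname} mismatch: expected {expected}, got {actual}")
```

The same check applies unchanged to files on pythonhosted.org, so installers wouldn’t need to special-case externally hosted wheels beyond tightening the failure semantics.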

My understanding is that the bigger problem with the current process is not the time-to-resolution (which is a problem!), but the fact that it needs to be periodically repeated (i.e. repeatedly asking for more quota as the cumulative project size grows). This PEP would turn that into a one-time process (per my understanding), since an org could upload as many RIMs as it pleases once approved.

I agree about fixing the underlying process, though – to my understanding the PyPI admins/maintainers are currently working on organization-level quota support, which would also offer a solution here (by allowing corporate orgs to pay for quota, or potentially be uncapped entirely). So I think it’d be good for this PEP to explain when external wheel hosting should be used vs. org-level quotas (I think it potentially makes sense to have both, since both have tradeoffs).

Some thoughts as I read the PEP:

Regarding rationale: I don’t think it’s clear from the PEP what the rationale is. I agree that the “wheel stub” pattern is not great, but from my perspective, limit increases are working. 99.9% of projects (this is an actual statistic) are below the per-file upload limit, and those that do need a larger limit almost always receive it. Is the rationale that there is a problem that the current system of limit increases doesn’t currently solve, or can’t solve in the future?

If the rationale is that PyPI is not well equipped to handle the administrative burden of handling file limit requests (entirely possible), I’m not convinced this proposal improves that situation: it a) adds significant upfront engineering work to support the PEP and b) adds roughly a similar amount of ongoing burden from having to manage requests for external hosting.

I’m also struggling to understand how PyPI staff will make a determination on such external hosting requests: how can we verify that an external host will meet the availability needs to uphold the existing standard of service? I wouldn’t want to approve a host that has less uptime than PyPI, or would provide a download that took 10x longer, but I have no easy way to verify that upfront.

Regarding user experience: I don’t think this proposal addresses other downsides of removing limits on file sizes – large files generally result in a poor user experience due to increased install times, bandwidth consumption, disk consumption and more opportunities for network timeouts. My concern is that implementation of this proposal would allow file sizes to grow unchecked and this experience would get worse. Larger files also make mirroring PyPI harder; for example, many mirrors provide filtering (e.g. Mirror filtering — bandersnatch 6.6.0.dev0 documentation) for exactly this reason.

Regarding availability: I think this proposal is generally underestimating the challenge of availability and longevity needed here. PyPI is arguably the most committed host for Python software: it’s dedicated to hosting the last ~20 years of artifacts and continuing to host them for the next ~20 years and beyond. I am dubious that any other host would be able to provide a similar level of commitment, despite their best efforts and intentions. I also don’t see the proposal addressing what should happen if an external host goes dark, which I think will invariably happen over a long enough period of time.

Regarding security: This proposal seems to rely on hash-checking to protect against external host compromise, but unless hash checking is mandated when using external hosts, I don’t think it has a high enough adoption rate to be an effective mitigation here. I see some discussion on tools needing to relax restrictions on what external domains they would trust, but no clear solution for how they should do it. It seems like the assumption is that any user-provided external domain should be trusted by both PyPI and the end user. Additionally, it should be noted that many PyPI users also restrict which external IP addresses they trust (enough that we have a FAQ about it: Help · PyPI), and introducing additional potential IP addresses would complicate those restrictions.

Overall, this seems like a lot of effort (for PyPI, installers, and external hosts to implement, and for users to understand) with not much upside (a faster & unrestricted release process for a very, very small subset of PyPI users), which would likely introduce more friction for everyone (challenges around trust and availability for the end user, a support queue for PyPI).

12 Likes

Very good points, we will update the PEP regarding these.

Good point! I think we may need to require that the hash in the RIM is SHA-512, as is used in TUF metadata.

I think this will be up to the mirror software and mirror host. Some will block mirroring files not from pythonhosted.org, while some may choose to mirror those.

Just like if a wheel is uploaded then deleted, neither a RIM nor a wheel may be uploaded again for the same artifact name. We could probably be clearer about spelling this out in the PEP.

Agreed, I think we didn’t want to be too prescriptive of what the status/support setup looked like, but I think this would be useful.

I agree with pretty much all of what @dustin said, but this point in particular concerns me. I don’t think it’s at all clear what the user experience will be for someone trying to install a package with a complex dependency tree if some external host, deep in the dependencies, disappears. Yes, at the simplest level, the user gets a “the download failed” message, but people simply don’t expect packages to “just disappear” - PyPI is viewed as a permanent store and this proposal undermines that, to a certain extent. I think that needs to be addressed, and not just covered by “we’ll require people to promise not to do that”.

Similarly, what happens if an external host gets renamed[1]? Would it be allowed to replace the .rim file? How would that affect caching of PyPI responses, or things like PEP 710 (which proposes that the source URL for an installed package gets recorded in the installed .dist-info metadata)? Or lockfiles - the download URL for a wheel is recorded in a lockfile, so if the external host gets renamed, all existing lockfiles are invalidated.


  1. companies get taken over and rebranded, and as part of that domain names can change ↩︎

2 Likes

(The following could very quickly go off-topic, and be better in its own thread, so I’ve posted separately to try to make it easier to split if needed).

Unlike @dustin, I do think there’s a problem that needs solving here, but I’m not entirely convinced that it’s the one this PEP is trying to solve. Specifically, the PEP notes that external hosting could be solved with separate indexes, but claims that approach is unsuitable. That’s superficially correct, but it seems to me that the correct response here should be to make additional indexes better.

The pip tracker contains many examples of cases where an additional index would be a great solution for someone’s problem, but the cost of hosting such an index, and the complexity of using it, is simply too high. If hosting and using a custom index were straightforward, that would unlock a lot of opportunities that are currently blocked.

The difficulty is that there are a number of problems that would need to be solved here, and most of them are not things that can be easily fixed by defining a standard. So, to me, PEP 759 feels like an attempt to use the tool we’ve got (standardisation) to solve a problem that needs a different tool. If all you’ve got is a hammer, everything looks like a nail, as it were.

The advantage of framing this issue (and others) as being about multiple indexes, is that the index is a unit of trust. By that, I mean that people trust PyPI - they trust it to be available, and to publish projects that are from the people they say they are from. On the other hand, they don’t (or shouldn’t) trust PyPI to curate the packages it hosts. The implied contract for externally hosted packages is different, and so the trust levels are not the same.

Building a UI for selecting indexes that makes that model of picking your trust levels explicit, would allow users to choose what level of curation, what level of reliability, etc., they wanted. It would also allow selection based on other criteria - does this index host only formal releases, or does it host nightly builds? Or does it host only official components for project X, or does it include community contributions? Etc.

The problem is that designing such a UI isn’t something we can handle via the standards process. It needs a different kind of effort. Maybe just someone tasked to work on a pip (or uv, or both?) feature. Maybe a funded project involving UX specialists as well as developers. Or maybe just development of an interface specification that the community agrees on (in a way that matches the standards process, but is separate from it) and which installers agree to implement.

The second aspect of this is the cost (both financial and administrative) of hosting. Many users are put off using an extra index because they don’t have the means to do so easily. But maybe that could be solved by PyPI offering a new option, to make additional indexes available to users (under something like https://orgname.pypi.org). To avoid getting into questions of hosting, those indexes could be built using some form of proxy index software like simpleindex, that exposes an index API, but explicitly redirects hosting to external storage. This would mean that these indexes would have different reliability contracts than PyPI itself, but that’s explicit, and as noted above, the end user is aware from the fact that they are using an additional index, that the implied contract is different.

To be perfectly honest, I feel that if we’re not going to spend the time and effort to make multiple indexes robust and user friendly, then we’d be far better simply removing that option, and deprecating --extra-index-url altogether. But while we support multiple indexes as a valid way of consuming packages, it feels wrong to propose functionality that could be handled via multiple indexes, but dismiss that possibility “because multiple index support isn’t good enough”.

9 Likes

I agree that this solution also solves the problem set out by the PEP, and I think it’ll do it in a way that is healthier for our ecosystem as a whole.

PEP 708 solves the biggest argument against (as soon as installers implement it), and the rest is basically just the UX of helping users specify an additional index.

I’m not convinced that hosting the index is prohibitive - it’s a fairly trivial format, and there are plenty of companies that already offer free or paid hosting for a PyPI-like feed. (Even more if you’re willing to generate the index and use static file storage, which is absolutely doable.)

The issue is that users won’t do it, dependents can’t do it, and malicious users will take advantage of it. So if we can solve for those three, then I don’t think there’s any real issue in having large packages published to their own index.

3 Likes

You might not even need to generate the index, as the simple repository API is compatible with the auto-generated directory index pages of many webservers.

For a good user experience[1], you need the JSON API and separate metadata downloads. It’s not trivial to maintain a production quality index (it’s not hard, but it’s not trivial). That’s why I was suggesting that PyPI offer a proxying indexing service, so that users can just handle hosting and leave details of the API to others. It doesn’t have to be PyPI, though. Anyone could offer this service - and from what @steve.dower says, people already do (although I’ll freely admit I wouldn’t know myself how to find one, and I’m theoretically a packaging expert :slightly_smiling_face:)

What is true is that anyone able or willing to provide external hosting for wheels, and manage the production and publication of .rim files, definitely does have the ability to host an index. It would be pretty easy for someone to write a program that took a flat directory full of wheels, and generated a directory structure that could be exposed over https as an index for those wheels. With that, publishing becomes nothing more than running that script to regenerate the static index data whenever a new wheel is added to the index. And it can all be hosted on literally any static web publishing site.
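
For illustration, here’s a rough sketch of such a script, assuming the wheels live in one flat directory and the output is meant to be served as static files; the function name and layout are made up for this example, not an existing tool.

```python
# Rough sketch, not a production tool: turn a flat directory of wheels into
# a PEP 503 "simple" index layout that any static file host can serve.
import hashlib
import re
from pathlib import Path

def normalize(name: str) -> str:
    # PEP 503 project name normalization.
    return re.sub(r"[-_.]+", "-", name).lower()

def make_simple_index(wheel_dir: Path, out_dir: Path) -> None:
    # Group wheel filenames (with their hashes) by normalized project name.
    projects = {}
    for wheel in sorted(wheel_dir.glob("*.whl")):
        project = normalize(wheel.name.split("-")[0])
        digest = hashlib.sha256(wheel.read_bytes()).hexdigest()
        link = f'<a href="/wheels/{wheel.name}#sha256={digest}">{wheel.name}</a><br>'
        projects.setdefault(project, []).append(link)

    # One page per project, plus a root page listing every project.
    for project, links in projects.items():
        page = out_dir / "simple" / project / "index.html"
        page.parent.mkdir(parents=True, exist_ok=True)
        page.write_text("<html><body>\n" + "\n".join(links) + "\n</body></html>\n")

    root_links = [f'<a href="{p}/">{p}</a><br>' for p in sorted(projects)]
    root = out_dir / "simple" / "index.html"
    root.parent.mkdir(parents=True, exist_ok=True)
    root.write_text("<html><body>\n" + "\n".join(root_links) + "\n</body></html>\n")
```

Run it whenever a new wheel lands (manually, or from CI), upload the result alongside the wheels, and the publishing side is essentially done.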

The key is addressing the bad reputation multiple indexes have - which is down to bad UI, the slow adoption of PEP 708, and how long we’ve been struggling to improve things in this area.


  1. i.e., performance ↩︎

5 Likes

PEP critique: the API field should probably be url, not uri, as an identifier in general isn’t useful for locating a file for download.


I think it should archive the files, with an expectation that these files aren’t ever needed: I suggest saving these files to tape storage.

Restoration would be manual (ie no automatic failover mechanism) and the only commitment PyPI would give to users is “when a PyPI admin notices an outage”.

This would address my availability concern


As an extension of archiving, I suggest having PyPI periodically download the files and check their hash. This checks for both status and integrity. If the check fails, notify PyPI admins and maybe stop serving the link to the file?

PyPI could run this checking during periods of low activity (may as well use the compute right?).
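
For concreteness, a rough sketch of such a periodic check, assuming a hypothetical list of (external URL, expected SHA-256) pairs derived from the stored RIM data; nothing here reflects Warehouse’s actual internals.

```python
# Hypothetical monitoring sketch: re-download each externally hosted wheel
# and compare it against the hash recorded in its RIM.
import hashlib
import urllib.request

def check_external_wheel(url: str, expected_sha256: str) -> bool:
    """Return True if the external file is reachable and matches its RIM hash."""
    try:
        with urllib.request.urlopen(url, timeout=60) as response:
            digest = hashlib.sha256(response.read()).hexdigest()
    except OSError:
        return False  # host unreachable: count it as an availability failure
    return digest == expected_sha256

def run_checks(entries: list[tuple[str, str]]) -> list[str]:
    """Return the URLs that failed, for escalation to PyPI admins."""
    return [url for url, sha256 in entries if not check_external_wheel(url, sha256)]
```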


On the subject of relying on the promise of external hosts, isn’t there a concern about the grace of the companies providing free resources (storage, compute, CDN) to PyPI? It’s a bit hard to uphold availability guarantees if PyPI has nowhere to run its machines.


I agree, but I don’t think it should be part of this PEP, rather a change to the repo API spec.


I think this is a problem with those specs. I only view download source URLs as convenient documentation anyway, as at the end of the day the only thing that matters (outside of privacy, which has orthogonal solutions [1]) is the exact file being downloaded.


  1. eg certain hosts will track usage statistics by IP, which can be mitigated with proxying (eg VPN) ↩︎

According to the maintainers of NumPy, “PyPI has a serious issue with approving limit increase requests”. Which is why they deleted a bunch of pre-releases last month to stay within their existing quota, and inadvertently broke some people’s workflows.

Clearly there’s something wrong here, because a project as important as NumPy should not need to worry about whether it can get a few dollars a month worth of extra storage. But I’m not clear on what exactly the issue is, or how it could be improved.

4 Likes

The PEP allows a whl upload to supersede a previous rim upload, but doesn’t state explicitly whether that deletes the previous upload or not. Requiring automatic deletion of the superseded file would eliminate the ambiguity.

That said, I find @pf_moore and @steve.dower’s points persuasive - this feels like a case where any effort invested is likely to be somewhat flexible in the implementation goals it targets, and improving the ergonomics of querying and hosting secondary indexes would offer a better overall outcome.

3 Likes

As far as I can tell, numpy has made a single request for a 40GB limit and received it the same day: https://github.com/pypi/support/issues/2480. I’m also unsure what the perceived problem with that is.

3 Likes

Thanks everyone for the great feedback and questions! We are keeping an eye on them and plan to respond in more detail on technical matters and administrative questions, but I think the most important thing to discuss right now is this:

The question then is whether we believe that a more diverse federation of independent indexes is better, feasible, and usable. PEP 759 definitely takes the position that PyPI is the primary, authoritative index for the Python community, and[1] that additional indexes are unwieldy to maintain and difficult to use. Unfixably so? Let’s dive in and see where it leads us.

PEP 708, which has been provisionally accepted, is part of the trust mix that reduces the chances for dependency confusion attacks across multiple indexes. It doesn’t address the ease of setting up and maintaining alternate indexes and doesn’t address the end user UX of enabling multiple indexes[2]. Even so, it’s not clear what the status and prognosis are for the full acceptance criteria and adoption across the ecosystem.

Let’s say that simpleindex makes it at least as easy to set up and maintain an alternative index as it would be to set up and maintain a PEP 759 external host. There’s an example in simpleindex’s README for enabling S3 routes, and in our minds we were thinking about S3 as a possibility for PEP 759 external hosted URLs. Let’s further say that simpleindex or something like it could handle the sustained and peak loads it could potentially be put under for large, popular indexes.

The question then is, can we make the UX for multiple indexes work?

Imagine a complex dependency graph that ends up hitting four indexes: PyPI, and indexes AI, BI, and CI. All four host packages that need to be installed for the user’s application to work. Their top-level dependency lives on PyPI, so they just pip install mydep. That would break at some point when a package they need lives only on AI. So then they add --extra-index-url for AI and try to install again, but now it fails a little later on a package that lives on BI. Rinse and repeat.

Or worse, it seems to work but in fact doesn’t really install the correct mix of packages and versions.

There’s no way for that user, who may not even know anything about extra indexes or deep transitive dependencies, to get a working venv out of the box, let alone after minutes or hours of frustrating interwebs searching. That scenario will leave a bad taste, so the “out of the box experience for non-experts” has to be concretely addressed. I’d go so far as to say that’s the most important UX problem to solve.

A quick read of @sinoroc’s idea suggests it wouldn’t help here, but also, how do you make such pre-configurations portable to other desktops or CI machines? I don’t think you really want to have to copy-and-paste long pip command lines either.

But let’s say we solve that too. Now comes the problem of index priority, which has been brought up elsewhere[3]. E.g. if each of AI, BI, and CI hosts rootpkg, how will pip know which index is the “right” one to get it from? How will it know that BI’s version is out of date, so it should get it from CI? Extend that to many packages in the transitive dependency graph, and now you have an even more complex configuration problem. I’d need to tell pip that AI is the highest priority for pkgA, BI is the highest priority for pkgB and pkgZ, CI is the highest priority for pkgC, and PyPI should be used for everything else. Both discovering and configuring this is going to be extremely challenging, I suspect.
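
To make that burden concrete, here’s a purely hypothetical sketch of the per-package routing a user would have to discover and keep up to date; no installer exposes configuration like this today, and the names come from the example above.

```python
# Purely hypothetical configuration, only to illustrate the discovery and
# maintenance burden of per-package index priorities.
index_priority = {
    "rootpkg": ["CI", "PyPI"],  # BI also hosts rootpkg, but its copy is out of date
    "pkgA":    ["AI", "PyPI"],
    "pkgB":    ["BI", "PyPI"],
    "pkgZ":    ["BI", "PyPI"],
    "pkgC":    ["CI", "PyPI"],
    "*":       ["PyPI"],        # everything else comes from PyPI
}
```

And that mapping would have to be replicated, and kept current, on every developer machine and CI runner that installs the same dependency graph.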

A further challenge is implementing all of this across the ecosystem. As @dustin points out, these aren’t challenges that can be overcome by standards. You’d have to work it out in pip and uv at a minimum, find UX and configuration that works for both tools, wait for those changes to roll out, and have some suboptimal answer for the long tail.

I think that’s an enormous effort, and I’m skeptical as a practical matter that it can happen. But I’m also open and eager to read the persuasive refutations I expect will follow soon! :smiley:

In contrast, PEP 759 unequivocally doubles down on PyPI being both the canonical source of truth for package metadata, and the default such index, with all the trust that implies.


  1. perhaps implicitly ↩︎

  2. although I see @sinoroc has started a separate thread proposing something like apt repos ↩︎

  3. w.r.t. variants support, though I can’t find an appropriate link atm ↩︎

I do think this is a solvable problem, based on the fact that it’s routine to use multiple conda channels[1] in the same project without much issue. There’s a learning curve, sure, but it’s not that complicated to document: “install our package with -c bioconda, get the right version of pytorch with -c pytorch and everything else will come from your default channel conda-forge”.

It is theoretically a really hairy problem, but I think in practice most use-cases are not that complicated. It’s not a tangled network of interdependent channels, it’s a hub-and-spoke with one main repository for almost everything, and a few extra ones for specialized areas.


  1. e.g. bioconda, pytorch, and conda-forge ↩︎

1 Like

Because the version number is greater on CI (BTW, “AI” and “CI” definitely name clash in my head for things not related to abstract indexes :wink:)? I think it’s a question of whether you search all indexes or stop at the first one. And that does tie into index priority, but it might suggest you don’t prioritize BI over CI if it isn’t keeping up.

I think it’s also a question as to whether it’s going to be common for multiple indexes to be hosting the same packages, or will they typically be disjoint in terms of where you want to get a specific package from?

1 Like

Agreed that both index priority and index “coverage” are key questions here!