PEP 759, External Wheel Hosting

I’ve been slowly working through what is required to implement PEP 708 in pip.
The questions raised in this thread over the last few days are things I’ve been thinking about, and it’s great to see other people are thinking about them too!

As just one aspect (you can see the key part of the implementation work in progress in the pip PR), the question of trust, and exactly what kind of trust, is important. If projects and repositories do use PEP 708, what is actually being presented to users? Essentially what @pf_moore said:

However, PEP 708 isn’t really about allowing user choice, it’s about projects and repositories building the “graph” of trust on behalf of users, by specifying alternate locations and/or tracking repositories they trust. But I’m really not sure how this should be presented to users. Some users might want more choice or control, or to get more insight into that trust. Can UI/UX in tooling do that? Or maybe it’s a question of making the “graph” visible somehow, including who (e.g. which PyPI user) specified the edges, and where exactly the nodes (repositories / projects) are located.

That’s a bunch of questions without much resolution, but this seemed like the right place to put these thoughts.

This is very much my position as well. I don’t have any answers here, but I think we need to be asking the questions. Are there any insights we can get from other communities? I’m not aware of distributors like Debian, npm, or cargo, or even conda, having the sorts of debates we have. Maybe that’s just because I’m not very involved with those communities - but maybe they have a model that works better?

Is that really true? It seems to me that PEP 759 is only really tackling a very specific issue - people wanting to store PyPI-hosted packages somewhere that isn’t managed by PyPI. As I understand it, the main reason for that is basically a dissatisfaction with the quota management process on PyPI, which I would have thought is something that we could address in other ways (not least of which is to find out what is causing the dissatisfaction, and address it within PyPI, via funding if needed).

If PEP 759 (or any other proposal) genuinely does want to strengthen the idea that PyPI is the canonical source of all package metadata, then should we not be deprecating, and ultimately removing, the idea of having multiple indexes at all? That’s something of a strawman argument, but my point is that extra indexes are currently in a bad state where people are unhappy enough with them that they aren’t willing to use them for obvious cases where they would be useful (the cases targeted by PEP 759) and yet they are useful enough that no-one wants us to remove them.

Some of that is down to pip, as we’ve historically had a pretty simplistic view of extra indexes (merge them all together on an equal basis) and we don’t have the resources to spend on designing anything more complex. I know that uv is working on better support for multiple indexes, though, so maybe they will come up with a better design - if volunteer community effort isn’t sufficient, we can always nick the design that the business-funded project manages to develop :slightly_smiling_face:

The other part of the problem, and the one that no-one seems to be addressing (other than generic but non-specific suggestions that hosting solutions exist “out there”), is simplifying the process of hosting a secondary index. My suggestion that PyPI offer a service was basically because PyPI is the only service organisation that the PSF has - so if it has to come from “the Python organisation”, PyPI is where it would need to be. But I see no problem if somewhere like Github [1] were to offer a package index service. I feel that in order to have credibility, it would need to have a clear guaranteed-free option, so we didn’t get stuck with the question of “what do organisations or individuals who can’t afford hosting costs do?”


  1. or Microsoft, Google, Amazon, … ↩︎

4 Likes

I agree as well that these are great questions to ask. It’s difficult because you have to build the infrastructure for the trust you’re trying to model, and come up with UX that can be used to express the trust you want, based on the model. Then you have to explain it to users in a way they can understand how to manage that trust in the environments where they will be using the model. And I think you have to focus both on “casual” users who just want things to work, and “expert” users who need to deeply control that trust.

Perhaps the PEP could do a better job of explaining the distinction, but yes, it is true. PEP 759 is explicit in separating two important purposes of PyPI - functioning as the source of truth for package metadata, and serving as the central repository for package/release artifacts. PyPI didn’t always serve release artifacts, as PEP 759’s historical section explains. PEP 759 preserves the central trusted role for package metadata and standardizes a mechanism to allow for trusted external hosting of release artifacts.

That’s a very good topic to discuss. While quota management doesn’t affect most packagers, it’s often pretty painful for packagers who do encounter it, as others in this thread have explained.

Why are quotas in place at all? Is it to preserve costly and scarce resources, such as donated bandwidth, disk space, etc.? Is it to prevent PyPI from being used to share inappropriate material smuggled in large binary wheels? Something else? All of the above? I sense that some of these constraints were imposed in a different era and may be worth re-examining. The PyPI FAQ doesn’t give any rationale behind the quotas.

If there were no quotas, PEP 759 wouldn’t be needed. If we decide that quotas are still needed to maintain the economics or safety of PyPI and its consumers, then let’s spell that out and see if there are ways to accomplish the same goals in a different way.

I feel like this is the more tractable issue. My employer runs its own index, and others do as well so it’s not that hard. Plus simpleindex seems pretty easy to set up and run, though I haven’t tried it myself. It isn’t that expensive for even smaller index providers, in terms of infrastructure costs, though it might be prohibitive in terms of maintainer/administrative costs. I don’t feel like this is the primary factor why we don’t have more indexes, but I could be wrong.

GitLab already does, and it’s free. JFrog’s Artifactory does too, though it’s commercial[1]. Why don’t people use these?


  1. and I don’t know whether it can be configured to host a public index ↩︎

Do we have any way of asking the people who would benefit from PEP 759 why hosting an external index on one of these offerings wouldn’t be a suitable alternative? If the answer is “it’s because the UI for extra indexes in pip and other installers is bad”, then we can further ask what precisely needs fixing in that UI.

The biggest problem we always have with pip features is getting any clear feedback from potential users on what they actually need. It’s possible we have an invaluable opportunity here to avoid that problem, for this issue at least.

I’ve shared the PEP and this thread with folks on other forums, private and public (including e.g. the PyTorch slack). Give it a few days[1] and let’s see if they can chime in.

My take on this is that it’s just a yard too far to ask people to extend their pip CLI with --extra-index-url or add stuff to pip.conf to enable those extra indexes. There could be other reasons, but I think this is a lot to ask of the general consumer of libraries who isn’t steeped in Python packaging expertise but just wants to GSD[2].
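For concreteness, the ask in question looks roughly like this (the index URL here is a hypothetical example, not a real service):

```shell
# One-off, on every install command:
pip install --extra-index-url https://wheels.example.org/simple torch

# Or persistently, by editing pip.conf (pip.ini on Windows):
#   [global]
#   extra-index-url = https://wheels.example.org/simple
```

Small as it looks, it’s an extra step that every consumer of the package has to discover, understand, and get right, which is exactly the friction being described.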


  1. it’s the weekend after all :smiley: ↩︎

  2. Get Stuff Done ↩︎

1 Like

This is only tangentially related, but if consumers of packages have issues with using --extra-index-url, then publishers of those packages will run into problems on behalf of their users.

From the standpoint of someone who used to be in charge of a company’s internal python packages, --extra-index-url has a few undesirable behaviors, and by the time I had handed off this responsibility to someone else, we were mirroring all dependencies locally and requiring that be the only index used.

The largest issue with --extra-index-url is the inability to specify which index a given package should be found on. For developers, the use of --extra-index-url would be dangerous, as it would require that all dependencies always be pinned by hash, especially internal ones during the development stage (which are otherwise subject to dependency confusion). Such a requirement on pinning had significantly more friction than a reasonable, but manual, process to mirror specific needed dependencies internally and, in some cases, choose to automatically mirror them and continue trusting (to an extent).

You can use --index-url (not --extra-index-url) plus --no-deps multiple times, and a correctly ordered sequence of install commands will work around this, but that’s also not without friction, and it’s prone to user error.
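A sketch of that workaround, with hypothetical index URLs and package names:

```shell
# Install the internal package strictly from the internal index,
# without letting pip resolve its dependencies from anywhere:
pip install --index-url https://pypi.internal.example/simple --no-deps internal-lib

# Then install its public dependencies strictly from PyPI:
pip install --index-url https://pypi.org/simple --no-deps requests urllib3 certifi

# Both the ordering and the full transitive dependency list have to be
# maintained by hand, which is where the friction and user error come in.
```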

I still think an --extra-index-url (or something akin to it) might be a suitable answer, but for it to be one, I think existing issues around --extra-index-url need a better story.

1 Like

All of these and, also, as noted by Dustin earlier: large files are a worse UX for users – those installing stuff as well as those mirroring it. Also, note that a lot of the mirror traffic PyPI sees happens when a release is first seen… so for not-very-popular projects, mirroring might actually be a substantial portion of the downloads, and it’s extremely “spiky” in terms of bandwidth in the case of large downloads.

The pypi/support repository on GitHub (the issue tracker for support requests related to using https://pypi.org) also has a list of use cases that we reject for upload limit requests.

I’m pretty sure this is the answer but I also don’t think the users are the best people to give us an answer to the “what” around it.

I think what we need is more akin to “what is the problem you’re trying to solve” and… honestly, we have this data from various user reports on the pip / Poetry / PDM / uv / etc. issue trackers, asking for things like index pinning, index priorities, and more. The work needed here is someone looking through what we have already and pulling out “data” from it (to help others understand things) + generating insights from that “data” (because they’ll probably be the person who has seen the most requests made).

I still think that installers are the wrong place to push complexity for this stuff, and that we should push for projects/products for hosting indexes with custom rules to be used instead. There’s no reason that a $company index couldn’t forward users to files hosted on PyPI (for example).

2 Likes

Yeah, I was trying to avoid repeating the statement that we should invest in getting some proper UX specialists involved, because I feel like I make that point way too often, but this sort of research is precisely what we need here.

(Also, it’s what I was hinting at when I suggested “stealing uv’s solution” - at the moment, the uv team is the closest thing we have to funded research in this area, so we should take advantage of whatever UI insights they can come up with).

2 Likes

FWIW, this doesn’t need a UX specialist per se. I think having a person with some Python experience + some experience with designing for “large number of users” stuff + some domain understanding would also be able to tackle this stuff.

2 Likes

This assumes there are no changes on the PyPI end to allow easier definition of PEP 708 tracks entries on PyPI that refer to authoritative external servers (rather than PyPI being the authoritative source and the external servers the supplementary ones).

It should already be possible to define those by uploading a single stub package and immediately yanking it (so there is somewhere on PyPI to attach the project-level PEP 708 metadata).

With that approach, it doesn’t matter if AI, BI, and CI publish projects with conflicting names, as only one of them would be listed as authoritative in the PyPI project metadata.

There’d still be UX work to do for installers to handle the case where an index included in the resolution process doesn’t provide any usable files, but also lists tracks metadata for an index that isn’t included, but that’s UX work that needs to be done for PEP 708 anyway.

2 Likes

Do we want to have this discussion as a separate topic? I have some thoughts, but I also don’t want to derail this conversation. I also don’t want to distract if we are not up to talking about this right now.

But I wouldn’t call that “the good ol’ days” either. :wink:

I think this is fairly common for large companies. The next step is building your own wheels from source.

Over the past year, I have been tracking various packages in the PyTorch, JAX, and other GPU communities, aiming to address the challenges discussed here: What to do about GPUs and the built distributions that support them. These communities have consistently emphasized the need to keep pypi.org as the central distribution platform.

In response, we started developing some experiments under Wheel Next, which serves as a neutral space, not tied to any specific community, and adheres to the same open-source principles as Python.

To those who argue this isn’t an issue, I can point to several examples of file size approval processes that took over a month.

In collaboration with the PSF, we’ve sought to secure more funding for ecosystem support, but the existing limits are already affecting how the community packages its software. For instance, some groups use nightly labels on conda, which is not something supported on pypi.org. I also see more and more breaking up of large projects into lots of smaller ones. Additionally, PyTorch bundles many vendored packages, making it harder to build optional dependencies.

The concept of external wheels, or the “wheel-stub pattern,” allows these communities to participate in the ecosystem without burdening users with the complexity of identifying which indexes are trusted. While people often cite poor download speeds as bad user experience, it’s much worse when a package is unavailable due to platform limits. A service that marks when the last package was downloadable and updates the metadata in Wheelhouse could improve the UI for external wheels, helping users better understand what’s available.

The community wants to remain within the pip ecosystem. They don’t want to switch tools just to access these libraries, so this PEP helps us keep true to that goal.

9 Likes

I think that view depends on who the user is blaming for the download speed/availability issues. :wink: I think the only way for this to work is for tools like pip to make it abundantly clear the only thing PyPI is providing is metadata and to clearly assign blame for download issues to the file host. But even then I’m sure the issues will still get sent to PyPI for being the cause.

I think both sides are concerned about their respective UX. I think the question is whether .rim files make sense on their own, or only in the face of the identified issues with indexes, and whether .rim files would still make sense if we think we can fix those index issues.

That’s certainly an issue. I think in addition to that, there’s the problem of reproducibility and portability. If I’m a dev working on something locally, and now I want to share that with my colleagues, port my solution to CI, or document how to make it work with downstream customers, I’ve got to include potentially complicated pip command lines or arcane pip.conf settings to make it work, and even then those instructions will be error prone.

As Andy mentioned, it is worse UX for those files not to be available. Don’t most (all?) installers cache downloads? If so, yes, you pay the price on the initial download, but that will hopefully be amortized somewhat to lessen the pain. The state of “pip ecosystem” packaging just isn’t yet up to the task of making it easy to ensmallen many of the packages we’re talking about.

Mirrors are different, but under PEP 759 I suspect the PyPI pressure would lessen considerably. Only .rim files would have to be mirrored from PyPI and if they choose to mirror external wheels, those won’t be requests coming to PyPI, so I think it could actually help both the PyPI bandwidth and storage “problems”.

I kind of think that’s a big benefit of PEP 759. It requires almost no changes to existing installers. It pushes the complexity to PyPI instead of installers or users.

I really am looking for solutions that preserve the simplicity of pip install foo.

Take pip for example, run with -v. It (kind of) tells you where it’s getting the wheels from, e.g.

% pip install -v --no-cache-dir torch
Using pip 24.2 from /playground/.venv/lib/python3.12/site-packages/pip (python 3.12)
Collecting torch
  Obtaining dependency information for torch from https://files.pythonhosted.org/packages/79/78/29dcab24a344ffd9ee9549ec0ab2c7885c13df61cde4c65836ee275efaeb/torch-2.2.2-cp312-none-macosx_10_9_x86_64.whl.metadata
  Downloading torch-2.2.2-cp312-none-macosx_10_9_x86_64.whl.metadata (25 kB)

uv pip install also tells you what it’s doing:

DEBUG Searching for a compatible version of mpmath (>=1.1.0, <1.4)
DEBUG Selecting: mpmath==1.3.0 [compatible] (mpmath-1.3.0-py3-none-any.whl)
DEBUG No cache entry for: https://files.pythonhosted.org/packages/43/e3/7d92a15f894aa0c9c4b49b8ee9ac9850d6e63b03c9c32c0367a13ae62209/mpmath-1.3.0-py3-none-any.whl.metadata
DEBUG No cache entry for: https://files.pythonhosted.org/packages/2a/d2/4cda4f2c9a21b426c5f5b80a70991dc26b78bcecd7b03a8e8a22cc1cddc1/MarkupSafe-3.0.0-cp312-cp312-macosx_10_13_universal2.whl.metadata

Of course, you have to know that files.pythonhosted.org is actually PyPI anyway, but let’s ignore that for the moment. :wink: Who would the user blame for those slow downloads?

One of the things the PEP tries to do is point any “blame” for availability at the org running the external host. It’s not too much of a stretch to believe the same blame for any download slowness would also go to that org’s support contact.

PyPI and pip get bug reports for source package builds failing, so I can understand their concern that they will get mistargeted bug reports regardless of what any error messages say (since the assumption that everyone reads error messages before filing bug reports with the tool they’re using is unfortunately not valid).

Still, as long as a proposal does its best to mitigate that problem, that’s reasonable (since actually solving it isn’t possible).

3 Likes

I think this would be a mistake without other (rather radical) changes to PyPI.[1] We need a better experience for multi-channel distribution, rather than doubling down on trying to force everything into one channel.

I’ve got a bunch of thoughts about this that I can’t fully organize right now, but the short version is that I think the friction in this area cannot really be alleviated without some holistic thinking. I see connections between this thread, the lockfile thread, the namespace thread, and various possible changes to pip or other install tools. All of these are different angles on an underlying issue, which is the need for a clean and smooth system for package authors to publish packages and for users to install those packages, maximizing the probability of “when I try to install stuff, what winds up installed is what I hoped would be installed”.

Based on what I see in other threads it does seem like one part of the problem is pip’s lack of clear channel separation/specification when installing packages. Another part though is the lack of a thoroughgoing environment manager that could manage such choices at the environment level[2].


  1. One example of such a change would be not allowing anyone to just upload anything to PyPI. ↩︎

  2. i.e., so that you could say “in this environment, I want this set of channels to be consulted in this order” and then not have to repeatedly remember some long command-line incantation to ensure you get the right packages from the right place every time you install something ↩︎

1 Like

Pip supports a per-environment configuration file specifically for this reason (it looks for {sys.prefix}/pip.conf, or pip.ini on Windows).
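For reference, a minimal per-environment config along those lines might look like this (the internal index URL is a made-up example):

```ini
# {sys.prefix}/pip.conf (pip.ini on Windows) - applies only to installs
# into this environment, so no per-command flags are needed.
[global]
index-url = https://pypi.internal.example/simple
extra-index-url = https://pypi.org/simple
```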

Don’t forget the variants and index priority threads.

We also have to keep in mind that pip isn’t the only game in town today, even if it’s the most popular tool. Thinking about standards and interoperability, while giving tools and the ecosystem the freedom to experiment is also crucially important, but complicates matters even more.

Agreed! I hope we can do that and keep an eye on what it will take to achieve these goals, and the time frame to get there. I think we have some medium term urgency to address while keeping our eye on the long term vision.

Can you please show an example of a requirements.txt-style file with an enumeration of mixed default-source and custom source? Or at least, how that ecosystem would define an environment with a single file.

I just did research as part of PEP 752 and yes, they have a better model, but we seem to be limited by the historical precedent of installer index flags in our thinking. Although easy to implement technically, users don’t want that in most cases. What users want, and what tooling in other ecosystems provides, is the ability to override the source of particular dependencies. I don’t know much about what modern Java does, but here is what I found among the other language ecosystems:

| Language | Tool | Resolution behavior |
| --- | --- | --- |
| Rust | Cargo | Dependency resolution can be modified within Cargo.toml using the [patch] table. |
| JS | Yarn | Although they have the concept of protocols (which are similar to the URL schemes of our direct references), users configure the resolutions field in the package.json file. |
| JS | npm | Users can configure the overrides field in the package.json file. |
| Ruby | Bundler | The Gemfile allows for specifying an explicit source for a gem. |
| C# | NuGet | It’s possible to override package versions by configuring the Directory.Packages.props file. |
| PHP | Composer | The composer.json file allows for specifying repository sources for specific packages. |
| Go | go | The go.mod file allows for specifying a replace directive. Note that this is used for direct dependencies as well as transitive dependencies. |
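As a concrete illustration of the Cargo entry, the [patch] table redirects a dependency for the whole graph without touching any install commands (the fork URL here is hypothetical):

```toml
# Cargo.toml
[patch.crates-io]
# Every use of `serde` in the dependency graph, direct or transitive,
# now resolves against this fork instead of crates.io:
serde = { git = "https://github.com/example/serde-fork" }
```

This is the kind of per-dependency source override, declared once in a project file, that the table is contrasting with our per-invocation index flags.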

As I just mentioned, it doesn’t matter if every company magically has their own index starting right now: users don’t have a way (not even with uv, yet) to override where particular dependencies come from.

User experience should be the highest of priorities, and anything that forces a user to override a source should be treated as a symptom of a problem. For example, a dependency might not have been released with a fix yet, so a fork is maintained in the interim; in that case the root cause is a bug, and bugs are discouraged.

Similarly, I view the concept of quotas as a poor reason to force extra configuration on users. For good reasons we cannot get rid of them, and therefore I am in strong support of this proposal, so that when our tooling improves to support granular overrides, users will not need to configure any of them due to quotas.

2 Likes