That’s a great question! Maybe we should split the topic of quotas into a separate thread so we can make progress on a common understanding of the need for quotas, the problems they cause and for which projects, and how we can solve that either by tech, policy, administration, or all of the above.
Because of abuse: people upload entire movies or pirated software as PyPI packages so they can be redistributed for free (including splitting them into multiple files so that each file stays under the limit).
The only other way around this that has any success at all is to do enough identity verification that, if someone is found to be abusing the service, their identity can be handed over to local law enforcement (though usually the “verification” is a stolen credit card, and nothing is really gained). And identity verification is likely to be worse (as in, less popular) than quotas and manual increases.
This wouldn’t apply if we lifted quotas for verified organizations. I think most, if not all, producers of large packages are legitimate orgs who would never do this intentionally. Even accidental abuse (e.g. a stolen credential) would be quickly remedied.
Yeah. The “verified” bit also means we’ve got enough identity to follow up properly if abuse does occur, so lifting them for anyone who’s trusted is probably fine. You could even argue that a certain number of releases or years of clean history could make a maintainer eligible to exceed quotas, but that’s probably more complicated.
I know someone suggested the quota discussion should be separated out - I’d just like to say that I think it’s on topic as the most relevant argument against limiting deletions. If deletions are limited, the quota becomes a very hard block for many maintainers, whereas today they can unblock themselves by deleting old versions.
I’ll admit, I’m not the expert on quotas so far, and lean a lot on historical context. But I’ll give it a shot!
Probably a good idea to split this out, and/or revive the topic: @dstufft authored this perspective a couple of years ago, with both historical reasoning and some potential steps forward (apologies if it’s already linked somewhere).
That perspective probably needs some further refreshing [1], but overall: having to download huge files from PyPI is often a poor user experience, especially over lower-speed connections. File storage is also not free, so there’s a cost (or at the very least, credit budgeting) involved if quotas go away. [2]
One more thought is that since we receive a donation of CDN cache services, there’s the potential that adding more bytes overall to the “cache pool” that the provider allocates may push other packages out of the cache - that’s something I have little visibility into today. That’s not really a quota problem per se, but with no limits on sizes, I could see that being a concern we’d need to address.
in ~2 years, PyPI storage effectively doubled to ~24TB ↩︎
Technically, we rarely delete files from long-term storage even after the project is removed from the index, but that’s a totally different conversation. ↩︎
While I agree completely that this is a bad user experience, I don’t see that it relates to quotas. The only link I can see is if we somehow view quotas as “penalising” projects that create large wheels, to encourage them to somehow stop doing so. But that’s misguided, IMO - the bad user experience is almost certainly enough of a penalty for those projects in itself, so punishing them further for a situation that they are presumably already trying to address is hardly fair.
That seems to me to be very similar to the way that limiting deletions will penalise projects with large files, by removing their main means of mitigating the impact of upload quotas.
So what I’m hearing from these comments is that the real issue here is the problem of projects which publish large wheels, especially those that for whatever reason want to do so frequently (possibly publishing test or nightly builds). Which suggests to me that we should actually be looking at how we can address that problem rather than trying to “fix” the symptoms.
PEP 759 is one possible approach here, as is improving the UX for using extra indexes (which would involve getting PEP 708 implemented, as well as looking at the UI of the front end tools). It’s also possible that what we really need to do is engage with some of the key projects like pytorch, and understand why they ship such massive wheels, and what other options they have considered (and in particular, why they discarded those options!)
While I appreciate the appeal of tackling smaller, more manageable changes, I don’t think we’re helping if we divert our limited resources (especially PyPI maintainer resource) from looking at the root cause of our problems here.
I believe this also ties into PEP 725 (external dependencies), or something that solves the same problems. As I understand it, the reason pytorch ships huge wheels is that they include a lot of different binaries for different platforms/hardware.
I have once or twice seen projects that don’t have large wheels but publish very frequent releases (like every merged PR triggers a PyPI release) and that publish wheels for a wide array of platforms. I’m sure I saw a packaging thread about this where someone was asking for a quota increase but when I looked at PyPI it seemed like the project was being very profligate with resources like this. If there are no limits on what projects can upload to PyPI then there will be projects that will set up their CI to spew terabytes of unnecessary binaries into PyPI. It is too easy to set these things up without realising how much resource they consume.
I think that the nightly wheels case is better handled with a separate index where there is no expectation of long term storage and where it is very clear that someone needs to go out of their way to end up installing those wheels. The scientific python nightly wheels index keeps up to 5 nightly versions of each project and then each new upload overwrites the oldest previous one. That is a reasonable way to manage an index for nightly wheels. Some of the projects that upload there don’t even bother to update the version number so each upload overwrites the previous one every time.
I don’t remember the exact numbers but I read somewhere that the target for the nightly index is to keep total size of all files below something like 20GB so that Anaconda provides free hosting. If those nightly wheels were pumped into PyPI then just for those few projects it would add terabytes per year of permanent storage for binaries that are already irrelevant within a week of upload.
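To give a sense of how opt-in that index is: installing from it means explicitly pointing pip at it and enabling pre-releases, roughly like this (the index URL is the one I believe the scientific-python docs give, and numpy is just an example of a project that publishes there):

```
# Opt-in install from the scientific-python nightly index (URL from memory, may change)
pip install --pre \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    numpy
```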
I think at least some of the organizations that have large wheels today probably also have some budget for helping to defray those costs. That requires an organizational structure both in PyPI and the PSF to collect, track, and utilize those funds of course.
We should definitely reach out to our CDN provider to get their take on that.
Agreed! We have to weigh any potential “bad UX” of hosting large wheels on PyPI against the downsides of not hosting them on PyPI at all, and I don’t think those downsides are insignificant. The reality is that these large wheels are useful, so we need to find the lowest-friction way of getting them into users’ environments.
+1
Some of the publishers of large wheels do watch these threads so I hope they will chime in. And I can try to facilitate discussions around this topic.
Yep, and the variant work is related too. @msarahan
There are two quotas in play here: the total project size quota and the per-file (package) size quota. I know most of the folks here know that, but I just wanted to be explicit about it.
One way that difference could come into play is by adding a time-based project quota. I think most big-wheel publishers understand that PyPI isn’t a good index for nightlies, and probably want to be good citizens in this regard. A time-based quota could help enforce that without limiting legitimate uploads (even ones where an “oops” requires a quick bug-fix version bump).
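To make that concrete, here is a minimal sketch of what a rolling-window check could look like; the window length, byte limit, and function names are all made up for illustration and don’t reflect anything PyPI/Warehouse actually implements:

```python
# Hypothetical sketch of a time-based (rolling window) upload quota.
# The window and limit are invented example values, not PyPI policy.
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=30)
WINDOW_LIMIT_BYTES = 10 * 1024**3  # e.g. 10 GiB of new uploads per 30 days


def may_upload(upload_log: list[tuple[datetime, int]], new_file_size: int) -> bool:
    """upload_log holds (upload_time, size_in_bytes) for a project's uploads."""
    cutoff = datetime.now(timezone.utc) - WINDOW
    recent = sum(size for when, size in upload_log if when >= cutoff)
    return recent + new_file_size <= WINDOW_LIMIT_BYTES
```

A nightly publisher would hit the window quickly, while the occasional “oops” bug-fix release would sail through.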
I also think that PEP 694 could help here. I could imagine that if an org wanted to support nightlies on PyPI, they could open an upload session without a token, so that the endpoint is guessable. They would upload the nightlies to the session, but never publish it. Consumers testing the nightly could access the files in the session. Probably the best approach (and one allowed by the PEP) would be to just delete all the files in the unpublished session and upload new nightlies to the same session. Alternatively, there could be a special version specifier so that the session endpoint stays the same even if the unpublished nightly session (and thus all its files) is deleted. That would make it easier for CI to access the session, since its URL wouldn’t change every night.
Which reminds me: even though PEP 694 is still in draft[1], PEP 763 should at least mention that PEP 694 session files are not subject to deletion limits. Specifically, if we decide to limit deletions, that should only apply to published releases, not to unpublished session files.
and I have an outstanding PR to update it, though I’ve been thinking about further improvements ↩︎
If PyPI is going to host nightlies, then the best way would be to have a separate nightly.pypi.org index. What it would need, beyond what test-pypi offers, is validation that the same pypi.org user accounts control a given project in both the pypi.org and nightly.pypi.org indexes, plus some policy/UI for deleting/overwriting uploads. Then projects could test nightlies of all dependencies with something like:
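```
# nightly.pypi.org is hypothetical; the exact flags would depend on how the index is set up
pip install --pre --extra-index-url https://nightly.pypi.org/simple some-project
```

(where `some-project` stands in for whatever is being tested).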
I agree that is likely a major factor, and it’s one of the reasons why some such projects recommend using conda instead. But I think fixing that problem on PyPI would require major changes to the conceptual model of what PyPI is. As ever, I think another useful path forward would be for official Python sources of information to explicitly mention conda’s existence as an alternative, so that we can break the chicken-and-egg cycle of people feeling like they have to provide a PyPI version of their package because that’s where people look.
Not sure if this is the one you were thinking of, but ruff asked for a quota increase about two years ago and said they were doing a new release every day. (I think the issue was linked from a thread on these forums.) That certainly raised my eyebrows when I saw it and personally I think it would be fine to just tell such projects “Sorry, you need to rethink your release schedule”.
I’m curious whether we have stats on storage usage that would tell us how much is consumed by the “big body” of projects releasing large wheels vs. the “long tail” of projects with many small wheels. Another question is how much storage is used by “junk” like test uploads. I tend to think it’s not much, because such things are often small, but there do seem to be a lot of them.
The BigQuery dataset for distribution_metadata may prove helpful here, as it contains name, size, and upload_time - all of which should help you find the answers you are looking for.
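For example, something along these lines would give a per-project storage total (a rough sketch, assuming the public `bigquery-public-data.pypi.distribution_metadata` table and that you have BigQuery credentials configured locally):

```python
# Rough sketch: total bytes stored per project on PyPI, via the public
# BigQuery dataset. Assumes google-cloud-bigquery is installed and
# application-default credentials are available.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT
  name,
  SUM(size) / POW(1024, 4) AS total_tib
FROM `bigquery-public-data.pypi.distribution_metadata`
GROUP BY name
ORDER BY total_tib DESC
LIMIT 25
"""

for row in client.query(query).result():
    print(f"{row.name}: {row.total_tib:.2f} TiB")
```

Comparing the head of that list against the overall total should go a long way toward answering the “big body” vs. “long tail” question.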
So from those stats, literally half of PyPI consists of various tensorflow distributions. Maybe we should be specifically focusing on “what to do about tensorflow?”
A community I participate in maintained its own PyPI mirror years ago to improve performance in CI jobs, but we had to abandon that when Tensorflow started publishing packages. The amount of storage required to mirror those basically exploded beyond our ability to control them. Eventually that situation led to Bandersnatch (the primary mirroring tool at the time) gaining a feature to filter out packages based on specific pattern matches, but by then we’d switched to caching proxies for PyPI instead.