Removing quota limits on PyPI

barry · October 29, 2024, 3:36pm

That’s a great question! Maybe we should split the topic of quotas into a separate thread so we can make progress on a common understanding of the need for quotas, the problems they cause and for which projects, and how we can solve that either by tech, policy, administration, or all of the above.

steve.dower · October 29, 2024, 5:42pm

Because of abuse, by people who upload entire movies/pirated software as PyPI packages so they can be redistributed for free (including splitting them up into individual files so that each file can be below the limit).

The only other way around this that has any success at all is to do enough identity verification that if you’re found to be abusing it, it can be provided to local law enforcement (though usually the “verification” is a stolen credit card, and nothing is really gained). And identity verification is likely to be worse (as in, less popular) than quotas and manual increases.

barry · October 29, 2024, 6:40pm

This wouldn’t apply if we lifted quotas for verified organizations. I think almost, if not all, producers of large packages are valid orgs who would never do this intentionally. Even accidental abuse (e.g. some credential got stolen) would be quickly remedied.

steve.dower · October 29, 2024, 6:58pm

Yeah. The “verified” bit also means we’ve got enough identity to follow up properly if abuse does occur, so lifting them for anyone who’s trusted is probably fine. You could even argue that a certain number of releases or years of clean history could make a maintainer eligible to exceed quotas, but that’s probably more complicated.

I know someone suggested the quota discussion should be separated out - I’d just like to say that I think it’s on topic as the most relevant argument against limiting deletions. If deletions are limited, the quota becomes a very hard block for many maintainers, whereas today they can unblock themselves by deleting old versions.

miketheman · October 30, 2024, 3:29pm

I’ll admit, I’m not the expert on quotas so far, and lean a lot on historical context. But I’ll give it a shot!

Probably a good idea to split this out, and/or revive the topic @dstufft authored this perspective a couple of years ago with both historical reasoning and some potential steps forward (apologies if it’s linked already somewhere).

github.com/pypi/warehouse

Evaluate if the Quota System is providing meaningful value

opened 12:20AM - 21 Jul 22 UTC

dstufft

needs discussion

Something that was surfaced in the [discussion around deletions](https://discuss….python.org/t/stop-allowing-deleting-things-from-pypi/17227/28?u=dstufft) was a concern that the quota system on PyPI, as it is currently implemented, is causing a less than ideal experience for both authors and users of PyPI. I've also gone back and read previous discussions or posts like [What to do about GPUs? (and the built distributions that support them)](https://discuss.python.org/t/what-to-do-about-gpus-and-the-built-distributions-that-support-them/7125?u=dstufft). The problems from the maintainer side, that I have seen surfaced: - Projects are being forced to delete older releases in order to make room for newer releases, though thankfully it is largely pre-releases currently [^1]. - Projects are avoiding uploading wheels for a variety of platforms, because they're worried about hitting the quota limits and having to ask for a quota increase and not knowing whether they'd actually get the increase [^2]. - When they do ask for a quota increase, it can take weeks or even months for the maintainer to get a reply, blocking their ability to do releases [^3]. Just to make sure that everyone is on the same page, the background of how file hosting/quotas has evolved on PyPI is roughly: - Originally PyPI did not support file uploads at all, nor was it intended to be used as a software repository for tools to consume. - At some point setuptools was written that started finding files to fetch from PyPI through a variety of mechanisms. - At some other point (not sure if before or after the last one), PyPI added the ability to host files on PyPI, and as a basic sanity check the... Apache I think it was at the time, host had a default limit on the total request body size (as most servers do), and over time this eventually got increased to 60M, effectively limiting files on PyPI to no more than 60M in size. - At some point PEP 470 removes external file hosting from PyPI, which meant that in order to have a good experience with `pip install ...` by default, projects are required to upload to PyPI unless they want to require their users to configure an additional repository. - As part of the migration to Warehouse, we switched from having a web server fronting Warehouse that buffered the entire request body to one that let Warehouse itself handle pulling those bytes off the wire, which no more buffering meant that Warehouse itself was responsible for setting limits, and originally just hard coded the same 60M limit that PyPI originally had. - In https://github.com/pypi/warehouse/issues/346, Richard noted that we were starting to get requests for larger files sizes for some projects, which was implemented https://github.com/pypi/warehouse/pull/655 to allow having that 60M limit changed on a per project basis. - In https://github.com/pypi/warehouse/issues/4288 it was surfaced that PyPI's on disk size was currently larger than 2TB but we didn't have a great mechanism to show the information on what projects were involved in that, which was implemented in https://github.com/pypi/warehouse/pull/4469 [^4] to add a ``/stats/`` route that showed the top N packages and how much storage they consume. - In https://github.com/pypi/warehouse/issues/7446 the idea of limiting the total size of a project was proposed and implemented in https://github.com/pypi/warehouse/pull/8128 and https://github.com/pypi/warehouse/pull/8129. That brings us to where we are today. I don't have really good information for how large PyPI has grown over time other than we're currently at 12TB and in 2018 we were at "> 2TB", but the per project quotas were implemented in 2020. It was mentioned in a [comment](https://github.com/pypa/pypi-support/issues/50#issuecomment-553495968) on Nov 13 13, 2019 that PyPI was currently at 6.5TB Picking 10GB as our default project quota in PyPI was done with this comment: > I grandfathered in all existing projects with Project.total_size >= 10GB. I set their limits to roughly twice their current size, minus ~20%, rounded to the nearest 10GB. My thought is that PyPI's total size is roughly doubling every year, and that the rate of growth of these should probably fall under that curve. > > I wouldn't expect any of the projects on https://pypi.org/stats to request total size increases in the next ~1 year. I think we can give them file size increases liberally though. At the time, the grandfathered in projects at >= 10G was 73 total projects. Currently our process for people to ask for increased limits is to have them post a ticket on https://github.com/pypa/pypi-support, and one of the PyPI team will come around and look into it. I went ahead and did some looking at those requests, and what I found was: - The oldest request in that repo goes back to Nov 13, 2019 asking for an increased file size limit [^5]. - There are a total of 383 requests in that time period, averaging to a limit request every 2.5 days since the first request. - Limit requests are split pretty evenly between requests for increase file size limit and increased project size limit, but there is a 10% tilt towards project size. - It appears out of 383 requests, 369 of them have been closed. Of those 369, 274 of them have been accepted, or about 75% of them, 11, or about 2% have been denied, and 19, or about 5% the user was guided towards alternative strategies to reduce their file size. The remaining ones were generally just ones where the issue was closed due to no response to asking questions [^6]. - Of the 11 that were denied, most of them were denied for the user hosting a large data file (including java jars, etc) in the project. - Of the 19 that were guided towards alternative strategies, it was largely split between: - Breaking the project up into sub projects, each getting its own limit [^7]. - Side loading large data files through some other fashion (e.g. a ``download()`` method). - Removing files from the wheels (tests, docs, etc) or getting the user to try different compilation strategies or even just paying attention to their file size causing them to notice something they can adjust to reduce file size. That's a lot of information there, but ultimately the questions for this issue are: - Is the quota system providing value? - Is our process for requesting an increase providing value? - Is there anything that we can change to reduce the friction? [^1]: This kind of flies in the face of how we typically expect PyPI to be used, as a stable archive of artifacts with deletions being rare. [^2]: This directly hurts the consumers of Python packages, as they lose out on the ability to install from wheels on those platforms. [^3]: Obviously this is due to the fact PyPI has no staff available to process these requests, relying on when volunteers are able/willing to do pretty tedious work going through issues. [^4]: This was ultimately reverted, then reworked, then had more changes to it over the years, but this was the initial PR to add it. [^5]: Since per project limits weren't added until 2020, that should mean that all of our project quota requests ended up here. [^6]: Categorizing this was kind of lossy, I had to go through all of those issues manually and skim through them, so there very well might have been some miscategorizations in my tally. [^7]: This feels kind of like approving the limit in spirit? If a project wants a single 20GB limit, that doesn't feel materially different to me than splitting the project into two, with two 10GB limits.

This probably needs some further refreshing ^[1], overall having to download huge files from PyPI is often a poor user experience, especially over lower-speed connections. File storage is also not free, so there’s a cost (or at the very least credit budgeting) involved if quotas go away.
^[2]

One more thought is that since we receive a donation of CDN cache services, there’s the potential that adding more bytes overall to the “cache pool” that the provider allocates may push other packages out of the cache - that’s something I have little visibility into today. That’s not really a quota problem per se, but with no limits on sizes, I could see that being a concern we’d need to address.

in ~2 years, PyPI storage effectively doubled to ~24TB ↩︎
Technically, we rarely delete files from long-term storage even after the project is removed from the index, but that’s a totally different conversation. ↩︎

pf_moore · October 30, 2024, 3:49pm

While I agree completely that this is a bad user experience, I don’t see that it relates to quotas. The only link I can see is if we somehow view quotas as “penalising” projects that create large wheels, to encourage them to somehow stop doing so. But that’s misguided, IMO - the bad user experience is almost certainly enough of a penalty for those projects in itself, so punishing them further for a situation that they are presumably already trying to address is hardly fair.

That seems to me to be very similar to the way that limiting deletions will penalise projects with large files, by removing their main means of mitigating the impact of upload quotas.

So what I’m hearing from these comments is that the real issue here is the problem of projects which publish large wheels, especially those that for whatever reason want to do so frequently (possibly publishing test or nightly builds). Which suggests to me that we should actually be looking at how we can address that problem rather than trying to “fix” the symptoms.

PEP 759 is one possible approach here, as is improving the UX for using extra indexes (which would involve getting PEP 708 implemented, as well as looking at the UI of the front end tools). It’s also possible that what we really need to do is engage with some of the key projects like pytorch, and understand why they ship such massive wheels, and what other options they have considered (and in particular, why they discarded those options!)

While I appreciate the appeal of tackling smaller, more manageable changes, I don’t think we’re helping if we divert our limited resources (especially PyPI maintainer resource) from looking at the root cause of our problems here.

jamestwebber · October 30, 2024, 3:53pm

I believe this also ties into PEP 725 (external dependencies), or something that solves the same problems. As I understand it, the reason pytorch packages huge wheels is because they are including a lot of different binaries for different platforms/hardware.

oscarbenjamin · October 30, 2024, 5:04pm

I have once or twice seen projects that don’t have large wheels but publish very frequent releases (like every merged PR triggers a PyPI release) and that publish wheels for a wide array of platforms. I’m sure I saw a packaging thread about this where someone was asking for a quota increase but when I looked at PyPI it seemed like the project was being very profligate with resources like this. If there are no limits on what projects can upload to PyPI then there will be projects that will set up their CI to spew terabytes of unnecessary binaries into PyPI. It is too easy to set these things up without realising how much resource they consume.

I think that the nightly wheels case is better handled with a separate index where there is no expectation of long term storage and where it is very clear that someone needs to go out of their way to end up installing those wheels. The scientific python nightly wheels index keeps up to 5 nightly versions of each project and then each new upload overwrites the oldest previous one. That is a reasonable way to manage an index for nightly wheels. Some of the projects that upload there don’t even bother to update the version number so each upload overwrites the previous one every time.

I don’t remember the exact numbers but I read somewhere that the target for the nightly index is to keep total size of all files below something like 20GB so that Anaconda provides free hosting. If those nightly wheels were pumped into PyPI then just for those few projects it would add terabytes per year of permanent storage for binaries that are already irrelevant within a week of upload.

barry · October 30, 2024, 5:05pm

I think at least some of the organizations that have large wheels today probably also have some budget for helping to defray those costs. That requires an organizational structure both in PyPI and the PSF to collect, track, and utilize those funds of course.

We should definitely reach out to our CDN provider to get their take on that.

Agreed! We have to weigh any potential “bad UX” for large wheels hosted on PyPI against the alternative for not hosting them on PyPI. I don’t think those are insignificant. The reality is that these large wheels are useful so we need to find the lowest friction way of getting them into users’ environments.

+1

Some of the publishers of large wheels do watch these threads so I hope they will chime in. And I can try to facilitate discussions around this topic.

Yep, and the variant work is related too. @msarahan

barry · October 30, 2024, 5:14pm

There are two quotas in play here, the project size quota and the package size quota. I know most of the folks here know that, but I just wanted to be explicit about that.

One way that difference could come into play is by adding a time-based project quota. I think most big wheel publishers understand that PyPI isn’t a good index for nightlies, so probably want to be good citizens in this regard. A time based quota could help enforce that without limiting legitimate uploads (even ones where an “oops” requires a quick bug fix version bump).

I also think that PEP 694 could help here. I could imagine that if an org wanted to support nightlies on PyPI, they could open an upload session without a token, so the endpoint is guessable. They would upload the nightlies to the session, but never publish it. Consumers testing the nightly could access the files in the session. Probably (and allowed by the PEP) the best approach would be to just delete all the files in the unpublished session and upload new nightlies to the same session. Alternatively, there could be a special version specifier so that the session endpoint would be the same even if the nightly session is deleted unpublished (thus deleting all files). This would help CI access to the session so that it doesn’t change every night.

Which reminds me, even though PEP 694 is still in draft^[1], PEP 763 should at least mention that PEP 694 session files are not subject to deletion limitation. Specifically, if we decide to limit deletions, it should only be for published releases, not unpublished session files.

and I have an outstanding PR to updated it, though I’ve been thinking about further improvements ↩︎

oscarbenjamin · October 30, 2024, 5:31pm

If PyPI is going to host nightlies then the best way would be to have a separate nightly.pypi.org index. What it needs that is different from test-pypi is validation that the same pypi.org user accounts control a given project in both the pypi.org and nightly.pypi.org indexes and then some policy/UI for deleting/overwriting uploads. Then projects can test nightlies of all dependencies with something like:

pip install -i nightly.pypi.org -r requirements.txt

BrenBarn · October 30, 2024, 8:08pm

I agree that is likely a major factor and is one of the reason why some such projects recommend using conda instead. But I think fixing that problem on PyPI would require major changes to the conceptual model of what PyPI is. As ever, I think another useful path forward would be for official Python sources of information to explicitly mention conda’s existence as an alternative, so that we can break the chicken-and-egg cycle of people feeling like they have to provide a PyPI version of their package because that’s where people look.

Not sure if this is the one you were thinking of, but ruff asked for a quota increase about two years ago and said they were doing a new release every day. (I think the issue was linked from a thread on these forums.) That certainly raised my eyebrows when I saw it and personally I think it would be fine to just tell such projects “Sorry, you need to rethink your release schedule”.

I’m curious whether we have stats on storage usage that would tell us how much usage is in the “big body” of projects releasing large wheels vs. the “long tail” of projects with many small wheels. Another question is how much storage is used by “junk” like test uploads. I tend to think its not much because such things are often small, but there do seem to be a lot of them.

miketheman · October 30, 2024, 9:47pm

The BigQuery dataset for distribution_metadata may prove helpful here, as it contains name, size, and upload_time - all of which should help you find the answers you are looking for.

dustin · October 31, 2024, 1:54pm

I think https://pypi.org/stats/ has what you’re looking for here.

pf_moore · October 31, 2024, 2:53pm

So from those stats, literally half of PyPI consists of various tensorflow distributions. Maybe we should be specifically focusing on “what to do about tensorflow?”

fungi · October 31, 2024, 3:11pm

So from those stats, literally half of PyPI consists of various tensorflow distributions. Maybe we should be specifically focusing on “what to do about tensorflow?”

A community I participate in maintained its own PyPI mirror years ago to improve performance in CI jobs, but we had to abandon that when Tensorflow started publishing packages. The amount of storage required to mirror those basically exploded beyond our ability to control them. Eventually that situation led to Bandersnatch (the primary mirroring tool at the time) gaining a feature to filter out packages based on specific pattern matches, but by then we’d switched to caching proxies for PyPI instead.