I just rejected a request for a 2.5GB file size limit on PyPI and think it’s probably time we have a discussion about why and the future of the ecosystem here.
File sizes on PyPI have been slowly increasing for a while now, mostly driven by certain projects’ need to support specific GPU ABIs. Newer ABIs, like CUDA 11, also seem to result in even larger distributions than before, so this problem is getting worse over time.
You can get a sense of the overall size of these large projects at https://pypi.org/stats/. (Note that not all of these projects are large due to individual file size – some just release small distributions very frequently and are unaffected by these issues.)
Why are large files challenging for PyPI?
There are several reasons why the PyPI maintainers are currently unwilling to raise the limit above 1GB:
- CDN constraints:
  - We likely have an upper bound on the cache size that our CDN provider holds for us (it’s hard to know for sure what it is, but they are almost certainly not holding all 7TB of PyPI in memory). The more large files in this cache, the fewer files it can hold overall, the less likely a given file is to be in the cache, and thus the more churn our cache experiences (which leads to more backend requests, longer response times, and increased bandwidth to our backends).
  - Our current CDN “costs” are ~$1.5M/month and not getting smaller. This is generously supported by our CDN provider but is a liability for PyPI’s long-term existence.
- Networking / bandwidth constraints:
  - These packages are already a large drain on PyPI’s non-CDN infrastructure (bandwidth from backends to the CDN and storage). Our current infrastructure “costs” are ~$10K/month and also not getting smaller. This is also supported by one of our cloud providers, but is a liability (albeit smaller than our CDN liability).
  - Larger overall size of PyPI on disk makes it harder to host mirrors, requiring mirrors like bandersnatch to implement features to block certain large projects from being mirrored.
- Upload experience:
  - The current PyPI upload API is synchronous, so a >1GB upload is one long blocking request to that endpoint, which is more likely to fail or consume excess resources.
- Download experience:
  - End users trying to download large files get a poor user experience in terms of install time and reliability, especially if they are on poor connections.
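To make the cache-churn argument concrete, here is a toy model, not a description of how our CDN actually behaves. It assumes a byte-limited LRU cache (the real eviction policy and capacity are unknown to us), and the file names and sizes are made up for illustration.

```python
from collections import OrderedDict


def hit_rate(requests, sizes, capacity):
    """Replay requests through a byte-limited LRU cache and return the hit ratio."""
    cache = OrderedDict()  # file name -> size, least recently used first
    used = 0
    hits = 0
    for name in requests:
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        size = sizes[name]
        if size > capacity:
            continue  # too large to cache at all
        # Evict least recently used files until the new one fits.
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / len(requests)


# Hypothetical workload: four small files plus one large "fat wheel".
sizes = {"a": 1, "b": 1, "c": 1, "d": 1, "big": 3}
small_only = ["a", "b", "c", "d"] * 10
with_big = ["a", "b", "c", "d", "big"] * 8

small_rate = hit_rate(small_only, sizes, capacity=4)
mixed_rate = hit_rate(with_big, sizes, capacity=4)
```

In this model, adding a single large file to the request mix lowers the hit rate for every other file, because each request for it evicts several small files at once.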
What are our options?
A few options have been proposed or considered. I’ll try to list them here, along with their challenges and downsides:
GPU tags
Similar to the existing platform and CPU architecture tags, we could introduce a new GPU ABI tag for wheels that corresponds to the GPU ABI that the wheel supports. Something like:
my_project-2.2.0-cp38-cp38-manylinux2010_x86_64-cuda90.whl
The challenge here is that there needs to be a reliable, cross-platform (and pure-Python) way to detect GPU ABIs, something like platform.gpu(), that can be used by installers like pip to determine what architectures the host supports. There isn’t a standard for detecting the various GPU ABIs in a reliable way.
Another challenge is that we haven’t added new tags to the Wheel spec, and it’s unclear how the addition of a new tag would be supported by the various tooling that produces/consumes wheels.
[edit]: As @pradyunsg notes, this is unlikely to work as-is, because wheel filenames already have an optional build tag.
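The build-tag conflict can be seen with a deliberately naive filename parser. This is a simplified sketch, not how pip or any real tool parses wheels; the function name and the trailing "cuda90" tag are only illustrative.

```python
def split_wheel_filename(filename):
    """Naively split a wheel filename on dashes (illustrative, not spec-complete)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        # Six fields already has a meaning: the third one is the optional build tag.
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected number of fields")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


standard = split_wheel_filename(
    "my_project-2.2.0-cp38-cp38-manylinux2010_x86_64.whl"
)
with_gpu = split_wheel_filename(
    "my_project-2.2.0-cp38-cp38-manylinux2010_x86_64-cuda90.whl"
)
```

With the extra trailing field, every tag shifts by one position: the first "cp38" lands in the build tag slot and "cuda90" is read as the platform. Stricter tools would more likely reject the filename outright, since the wheel spec requires a build tag to start with a digit, but either way existing tooling would not see a GPU tag.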
Environment markers
As @pradyunsg notes, an option in the vein of PEP 496 is to have environment markers like:
install_requires=["packagename > 1.0 : sys.gpu == 'cuda11'"],
This still requires the same reliable, cross-platform (and pure-Python) way to detect GPU ABIs as above.
A downside here is that maintainers need to publish N+1 different projects for every N GPU ABIs they want to support (with one ‘parent’ project that requires all the ABI-specific projects, with markers), which has other challenges discussed in the next option.
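As a rough sketch of what the ‘parent’ project would have to declare, assuming a hypothetical sys_gpu marker (no such marker exists today, and no installer would evaluate it) and made-up project names, written here with the semicolon marker syntax from PEP 508:

```python
def parent_requirements(base_name, gpu_abis, marker_name="sys_gpu"):
    """Build one requirement string per GPU ABI for a 'parent' project.

    marker_name is a hypothetical environment marker; it is not part of
    any current packaging standard.
    """
    return [
        f"{base_name}-{abi}; {marker_name} == '{abi}'"
        for abi in gpu_abis
    ]


# N ABI-specific projects, plus the parent that lists them: N+1 in total.
requirements = parent_requirements(
    "packagename", ["cuda90", "cuda101", "cuda110"]
)
```

Each entry in the list is a separate project that the maintainers would need to build, upload, and keep in sync for every release.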
Tell publishers to split their projects up
Some projects split their project namespace up based on the GPU ABI they are providing. For example, the cupy-cuda* projects are split into cupy-cuda80, cupy-cuda90, etc., with one for each GPU ABI.
The downsides here are that while this looks like it’s reducing average package size, the total size on disk is still basically the same. And while this works for cupy-cuda*, it likely will not work for bigger frameworks like the project whose limit request I rejected. It’s also not very friendly to the user, as they need to figure out themselves which project to install.
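The “figure it out yourself” step amounts to something like the following, which each user currently does by hand. This sketch assumes the naming pattern seen in cupy-cuda80 and cupy-cuda90 extends to other versions, and it takes the CUDA version as input because there is no standard way to detect it; the function name is hypothetical.

```python
def cupy_project_for(cuda_version):
    """Map a 'major.minor' CUDA version string to a cupy-cuda* project name.

    Assumes the cupy-cuda<major><minor> naming pattern; it does not check
    that such a project actually exists on PyPI.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"cupy-cuda{int(major)}{int(minor)}"


project = cupy_project_for("9.0")
```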
Tell publishers to host elsewhere
One option is to just tell publishers they must self-host and tell their users how to install. This is what pytorch does, for example, and they tell their users to install with something like:
pip install torch==1.7.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
The downside to this is that it fractures the ecosystem, adds extra steps for the end user, and means that publishers need to set up and maintain their own PEP 503-compliant repository (or pay a third party to do so). It also means that someday this external index could go away or become compromised, as it doesn’t have the same support that projects published to PyPI have.
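For a sense of what a publisher has to generate and serve, here is a minimal sketch of one PEP 503 project page. The host name, filename, and hash are placeholders, and a real repository also needs a root index page, correct hosting, and ongoing maintenance.

```python
import html
import re


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_page(name, files):
    """Render a minimal project page.

    files is a list of (filename, url, sha256_hex) tuples.
    """
    links = []
    for filename, url, sha256 in files:
        # The hash goes in the URL fragment so installers can verify the download.
        href = html.escape(f"{url}#sha256={sha256}", quote=True)
        links.append(f'    <a href="{href}">{html.escape(filename)}</a><br/>')
    body = "\n".join(links)
    return (
        "<!DOCTYPE html>\n<html>\n  <body>\n"
        f"    <h1>Links for {html.escape(normalize(name))}</h1>\n"
        f"{body}\n  </body>\n</html>\n"
    )


# Placeholder values; example.invalid is not a real host.
page = project_page(
    "My_Project",
    [(
        "my_project-2.2.0-py3-none-any.whl",
        "https://example.invalid/files/my_project-2.2.0-py3-none-any.whl",
        "0" * 64,
    )],
)
```

The page would be served at /simple/my-project/ on the publisher’s host, and users would then have to point their installer at that index.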
Allow external hosting on PyPI
We get a lot of requests for this, but most folks aren’t aware that this used to be a feature of PyPI that was removed via PEP 470 because in practice it wasn’t great for end users, for reasons similar to those mentioned above.
I think we’re fairly unlikely to add this back, but perhaps if we consider “fat GPU wheels” as a special case (i.e., not make this available to all packages) it could be worth reconsidering.
Charge for large file hosting on PyPI
PyPI is currently entirely free to use. It’s possible that publishers who want larger file sizes would be willing to pay to cover the costs of infrastructure, as well as improvements to tools that support larger file sizes (e.g. a new, asynchronous upload API), which would mitigate the infrastructure liabilities of hosting larger files.
Challenges here are that PyPI currently has no infrastructure to handle any paid features or payments. We also don’t have a great sense of what this would be worth to publishers.
Selector packages
This was proposed in the “Idea: selector packages” thread, but it is ultimately attempting to solve a much bigger problem than just the “fat GPU wheels” problem.
Downsides to this are that it introduces more “dynamic” dependencies, which we are generally trying to move away from (e.g. with setup.py). More details are in that thread.
What other challenges do we have?
I think one additional challenge here is that we (the Python Packaging Authority) don’t seem to have many folks with a ton of experience / deep knowledge about GPUs and the needs of these frameworks. I personally don’t have more than what I’ve outlined here.
If you feel like you do, I’d appreciate your thoughts here, but otherwise it seems like we probably can’t fix this on our own and will also need to work with multiple competing projects as well as GPU providers themselves to find a solution that works for everyone.