There’s a set of related problems here with non-Python dependencies, in addition to file size alone. Two weeks ago I wrote a blog post about key Python packaging issues from the point of view of scientific computing, data science and machine learning projects and users: Python packaging in 2021 - pain points and bright spots | Quansight Labs.
The PyPI request that triggered this thread came from a small, relatively obscure package, but the same issue will show up for many other ML/AI projects needing CUDA. A few related issues and observations:
- Statistics · PyPI says that TensorFlow alone uses about 1 TB.
- PyTorch, MXNet, CuPy, and other such libraries all have similar problems. Right now they tend to use CUDA 10.2 and end up with wheel sizes close to 1 GB (already very painful). Soon they will all want CUDA 11 to be the default. See for example torch · PyPI and mxnet-cu102 · PyPI.
- RAPIDS - a large GPU ML project from NVIDIA - already gave up on PyPI completely due to issues around PyPI and wheels.
- PyTorch self-hosts most wheels in a wheelhouse on S3, which makes it impossible for downstream libraries to depend on those wheels.
- large parts of the scientific computing, data science, machine learning, and deep learning ecosystems rely on GPUs. Their users are a significant and growing fraction of all Python users.
- a bit further into the future: ROCm (the CUDA equivalent for AMD GPUs) may end up being in the same position.
- ABI tags do not solve the problem that there’s no separate package that can be depended on; every package author must bundle whatever libraries they need into their own wheel. That’s not limited to CUDA or GPUs; the same is true for other key dependencies like OpenBLAS or GDAL, whose authors don’t have a special interest in Python.
The issue with CUDA 11 in particular is not just that CUDA 11 is huge, but that anyone who wants to depend on it needs to bundle it in, because there is no standalone `cuda11` package. It’s also highly unlikely that there will be one in the near to medium term, because (leaving aside practicalities like ABI issues and wheel tags) there’s a social/ownership issue. The only entities in a position to package CUDA as a separate package are NVIDIA itself or the PyPA/PSF, and both are quite unlikely to want to do this. For a Debian, Homebrew or conda-forge CUDA package it’s clear who would own it - there’s a team and a governance model for each. For PyPI it’s much less clear. And it took conda-forge two years to get permission from NVIDIA to redistribute CUDA, so it’s not as if some individual can just step in and get this done.
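To make the "no standalone package" point concrete: with nothing to declare a dependency on, the best a project can do today is probe the system at runtime. Here is a minimal sketch of such a probe; the library names are illustrative, and in practice projects vendor their own copy of the CUDA runtime into the wheel instead.

```python
# Sketch: probing for a system CUDA runtime, since there is no
# installable `cuda11` package on PyPI to declare a dependency on.
# The candidate names below are illustrative, not exhaustive.
import ctypes.util


def find_cuda_runtime():
    """Return the name of a loadable CUDA runtime library, or None."""
    for name in ("cudart", "cudart64_110", "cudart64_102"):
        found = ctypes.util.find_library(name)
        if found:
            return found
    return None


# Prints None unless a CUDA runtime is on the loader search path.
print(find_cuda_runtime())
```

A probe like this is exactly what dependency metadata would make unnecessary: it can only tell you *something* named `cudart` is present, not which version, which ABI, or whether it actually works with your wheel.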
Interfacing with other package managers
One potential solution that @dustin did not list - and which may be the most desirable one, since it would also solve related issues like GDAL for the geospatial stack and other non-Python dependencies - is: create a mechanism through which Python packages uploaded to PyPI can declare that they depend on a non-PyPI package. There are clearly a lot of potential issues to work out and it would be a large job, but if it could be made to work it would be a major win. It would help define the scope of PyPI more clearly, and prevent more important packages from staying off PyPI altogether in the future.
This is kind of similar to @steve.dower’s “selector package”, but the important difference is that it allows you to actually install (or check for the installation of) the dependency you need, built in a way you can rely on and test. A selector package is much more limited: specifying `openmp-gnu` would solve the PyTorch self-hosting issue, but not the file size and “bundle libs in” issues. If you get your external dependency from conda, Homebrew or your Linux package manager, the builds may not have the same ABI or even the same functionality (for, e.g., GDAL or FFmpeg there are many ways of building them, and two package managers are unlikely to provide the same thing for each library).
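To illustrate what such a declaration mechanism might look like, here is a hypothetical sketch. Neither the metadata shape nor the manager names-per-requirement mapping exists today; this is purely an assumption about one possible design, in which a wheel maps an abstract external requirement onto names known to various package managers, and an installer checks which managers are available.

```python
# Hypothetical "external dependency" metadata a wheel could carry.
# The key names and mapping format are invented for illustration.
EXTERNAL_REQUIRES = {
    "gdal": {
        "conda": "gdal>=3.2",
        "apt": "libgdal-dev",
        "brew": "gdal",
    },
}

import shutil


def available_providers():
    """Return the known package managers found on PATH."""
    return [mgr for mgr in ("conda", "apt", "brew") if shutil.which(mgr)]


def resolve(requirement):
    """Map an abstract requirement to candidate (manager, name) pairs."""
    names = EXTERNAL_REQUIRES.get(requirement, {})
    return [(mgr, names[mgr]) for mgr in available_providers() if mgr in names]


print(resolve("gdal"))
```

The hard parts this sketch glosses over are exactly the ones mentioned above: verifying that whatever conda or apt installs has the ABI and feature set the wheel was built against.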
Possibility of a solution fully within PyPI
None of the solutions @dustin listed are a structural fix. If interfacing with external package managers won’t fly, then a solution within PyPI alone would need at least these ingredients:
- An ownership model where the PyPA or another well-defined team can be in charge of packages like `openblas`, in the absence of the original project or company packaging their own non-Python code for PyPI.
- A way to rebuild many packages for a new compiler toolchain, platform or other such thing.
In other words: have a conda-forge-like organization and technical infrastructure focused on packaging for PyPI. That is a far larger exercise than interfacing with an external package manager, and it would still leave a large question mark around the scope of what should be packaged for PyPI. I assume people would not be unhappy with standalone CUDA, OpenBLAS or GDAL packages if they just worked - but where do you stop? Scientific computing and ML/AI depend on ever more large and exotic libraries, hardware and compilers, and I can imagine that packaging all of those is neither desirable nor sustainable.
Expertise & resources
Maybe I can help, either personally or (more likely) by getting other people and companies with the right expertise and resources involved. I am a maintainer of NumPy and SciPy, and have been involved in packaging them for a long time. I also lead a team working on PyTorch.
At Quansight Labs, which I lead together with Tania Allard, one of our main goals is to tackle PyData ecosystem-wide problems like these. We have already decided to spend part of our budget for this year to hire a packaging lead (currently hiring), who could spend a significant amount of time on working on topics like this one. If there’s interest in, for example, working out a PEP on interfacing with other package managers, we could work on that - and also on getting funding for the no doubt significant amount of implementation work. Possibly also including funded time for one or more PyPA members to help work through or do an in-depth review of the key issues for PyPI and pip/wheel/etc.