What to do about GPUs? (and the built distributions that support them)

I’m proposing a “small group” here just to keep the implementation and risk/reward ratio tractable and predictable for PyPI, but the assumption is also that this would be corporate participants who are also contributing some of the largest wheels. PyPI’s hosting is donated but certainly not “free” in the sense that it can just keep scaling indefinitely - this proposal would allow some of the asymmetric load to be shared and distributed across additional (paid) CDNs. More importantly, however, this would also allow commercial entities (who have multiple incentives to do so) to continue to contribute to the global PyPI index while also continuing to control their own file sizes and destinies, so to speak.

Hello all, I’m a maintainer of the CuPy project.
Thank you very much for all your efforts in keeping the PyPI ecosystem healthy!

Although this is not a direct solution to the “large files on PyPI” issue, let me share the recent news related to GPU & Python.

  • CUDA now follows the CUDA Enhanced Compatibility policy introduced in CUDA 11.1 (September 2020). This provides binary compatibility within the same CUDA major version, e.g., a binary built with CUDA 11.1 can run on CUDA 11.1, CUDA 11.2, … but not on CUDA 12.0.
    In general, packages built with CUDA 11.1 will work with CUDA 11.2, so this should contribute to reducing the number of packages on PyPI. There was a technical limitation (related to the “NVRTC” module) in CUDA 11.2 so we had to release cupy-cuda111 (for CUDA 11.1) and cupy-cuda112 (for CUDA 11.2) separately, but I heard that they’re fixing this issue in upcoming releases.

  • NVIDIA is about to release a CUDA Python module that provides a Python/Cython wrapper for CUDA Toolkit libraries. I’m not sure how this library is going to be packaged, but some Python packages may rely on this unified library instead of releasing a package for each CUDA version.

  • AMD GPUs (ROCm Platform, which is a similar concept to NVIDIA’s CUDA Toolkit) are becoming popular. CuPy, PyTorch, and TensorFlow now all provide ROCm wheels. So solutions like environment markers may need to be designed in a vendor-independent way.
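
To make the compatibility rule in the first bullet concrete, here is a minimal sketch of it as a version check. The function names are hypothetical, and the rule is simplified to what is stated above (same major version, installed minor version at least the build version). It deliberately ignores the NVRTC limitation mentioned for CUDA 11.2, and the exact-match behaviour for versions before 11.1 is a conservative assumption rather than something NVIDIA documents in this form.

```python
def parse_cuda_version(version: str) -> tuple[int, int]:
    """Split a 'major.minor' CUDA version string into two integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def runs_on(built_with: str, installed: str) -> bool:
    """Return True if a binary built with `built_with` should run on `installed`.

    Simplified reading of the rule described above, not an official check.
    """
    built = parse_cuda_version(built_with)
    target = parse_cuda_version(installed)
    if built < (11, 1):
        # Assumption: before the policy was introduced, require an exact match.
        return built == target
    # Same major version, and the installed minor is not older than the build.
    return built[0] == target[0] and target[1] >= built[1]


if __name__ == "__main__":
    for installed in ("11.1", "11.2", "12.0"):
        print("built with 11.1, installed", installed, "->", runs_on("11.1", installed))
```

Under this rule a single build per CUDA major version would be enough, which is where the reduction in the number of packages comes from.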


In conda-forge, there was a naming discussion on the previous scheme of using cpu / gpu, and while it’s not uniformly rolled out yet, cuda builds now use a cuda extension, leaving room also in the future for rocm builds or others.


3 posts were split to a new topic: External hosting linked to via PyPI

If GPU tags are considered, will CPU tags also be considered? AFAIK, pip cannot decide between versions of packages that have been compiled with different optimizations (e.g. SSE4, AVX, AVX512). It would be great if that could come at the same time.

Looks like this discussion got posted to Hacker News, there’s a fair amount of comments about it here: What to do about GPU packages on PyPI? | Hacker News

CPU architectures are already included in tags. There is no current plan to extend this to include various CPU optimizations as well.
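
As an illustration of what the tags do and do not carry, here is a small sketch that pulls the three compatibility tags out of a wheel filename. The filename and the helper name are made up for the example; the only thing assumed about the format is that the last three dash-separated components are the Python tag, the ABI tag, and the platform tag.

```python
def split_wheel_tags(filename: str) -> dict[str, str]:
    """Extract the python, abi, and platform tags from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    # The last three components are always python tag, abi tag, platform tag.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}


if __name__ == "__main__":
    # Hypothetical package name, used only for illustration.
    tags = split_wheel_tags("mypkg-1.0.0-cp39-cp39-manylinux2014_x86_64.whl")
    print(tags)
```

The platform tag names the architecture (x86_64 here), but there is no slot in the filename for instruction set extensions such as AVX, or for a CUDA version.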

I meant CPU feature or version tags, not arch tags indeed. That seems analogous to these suggested CUDA version tags.

I know neither are planned, but I would propose that the CUDA version tag proposal be amended with a CPU feature tag.

Dustin Ingram via Discussions on Python.org <python1@discoursemail.com> wrote on 22 May 2021 07:10:30 CEST:

Is this still relevant, given the direction NVIDIA has taken with CudaPython?


From what I understand CudaPython is not helpful to the issue here.

CudaPython is an interface between cuda and python (e.g. allow calling nvrtc and other cuda APIs from python). The huge cuda-enabled libraries we’re discussing are all talking to cuda from C and the distribution of these C extensions is the challenge.
Unless all these projects rewrite their code to stop using CUDA’s C API, but use CudaPython API instead, the challenge still exists.


The motivating concern here is that CUDA-related binary blobs make wheels big and lead to a suboptimal install experience (different package name, based on GPU arch).

With CUDAPython, the projects can move to using the Python/Cython API instead of the C API, and benefit from a better install story that also eliminates the need to bundle CUDA binaries in their wheels.

Is there any reason to believe projects that currently bundle CUDA in their wheels, cannot move to using CUDAPython instead?

As far as I know PyTorch, JAX, and TensorFlow use CUDA via templated C++ code, not via Python. So the path provided by CudaPython (write a CUDA kernel in a Python text string, use NVIDIA tools to compile it) will not help projects like these. They write CUDA kernels in C++ code and compile them inside a C++ library. This compiled library is the part that is platform- and CUDA-version specific (2 dimensions needed for packages: platform, CUDA version), then Python adds a third dimension of Python version, leading to a combinatorial explosion in the number of packages needed.
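
A quick sketch of how the three dimensions multiply. The lists below are illustrative values I picked for the example, not the actual build matrix of any of these projects.

```python
from itertools import product

# Illustrative values only, not any project's real build matrix.
PLATFORMS = ["linux_x86_64", "win_amd64"]
CUDA_VERSIONS = ["10.2", "11.1", "11.2"]
PYTHON_VERSIONS = ["3.7", "3.8", "3.9"]


def build_matrix(platforms, cuda_versions, python_versions):
    """Return every (platform, CUDA version, Python version) combination."""
    return list(product(platforms, cuda_versions, python_versions))


if __name__ == "__main__":
    builds = build_matrix(PLATFORMS, CUDA_VERSIONS, PYTHON_VERSIONS)
    # 2 platforms * 3 CUDA versions * 3 Python versions
    print(len(builds), "wheels per release")
```

Each extra CUDA version adds one wheel per platform and Python version, so with wheels in the hundreds of megabytes the storage per release grows quickly.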

What if the libraries build against a version of CUDAPython and depend on the binaries in that package, rather than statically linking everything? I’m pretty sure this is how the Conda equivalent has been working.

That’s not quite possible, at least not yet (it may be in the future, but likely not for the largest libraries). CUDA Python is not yet available on PyPI, and I suspect that when it does become available it will be the interface layer that allows lazy loading, not a full copy of the CUDA Toolkit (which itself is ~1 GB, see Files :: Anaconda.org). So the docs of a package will continue to say "you must install CUDA version xx before pip install mypkg".

CUDA Python docs (Overview - CUDA Python 12.3.0 documentation) say that it’s based on PTX which is then JIT compiled when it is used. JIT compilation times can be prohibitive for large projects - this will be on demand so not 100% clear, but e.g. compiling all PyTorch kernels as PTX on import currently takes >30 minutes.

Switching to CUDA Python will be a large operation, so I don’t expect it to happen for packages like PyTorch or TensorFlow any time soon (the official PyTorch conda packages don’t even do this yet, they’re also >1GB for CUDA 11).

CUDA Python only supports CUDA >= 11.0 at the moment; CUDA 10.2 will stay relevant for quite a while (and is the default CUDA version on PyPI for, e.g., PyTorch).

For deep learning libraries cuDNN has more impact on final binary size than CUDA Toolkit, and CUDA Python does not support cuDNN.

Given both the effort needed to move to CUDA Python and the lack of support for cuDNN and CUDA 10.2, I think this discussion is still relevant.


Ah okay, I was under the impression that it would include all the files necessary.

Perhaps we should be asking NVIDIA to also publish toolkit packages?

These projects may be backed by their own C or C++ code that talks to the CUDA C API (for example TensorFlow). One example is PyArrow: we don’t provide CUDA-enabled wheels, but if we did, they would go through the CUDA C API, not CUDAPython.

Perhaps we should be asking NVIDIA to also publish toolkit packages?

We are currently focusing on improvements to Python packaging at NVIDIA. Much of the recent conda packaging work for the CUDA Toolkit (CTK) has been published to the nvidia channel. Feedback is very much appreciated. Also, it would be helpful to understand how these components map to PyPI.

CUDA Python wheels are coming soon!


I saw this discussion online; I thought I should say something about the idea of using GPU tags or environment markers.

There are use cases where one would like to use a CPU wheel even when GPUs are available on a system. For example, as a JAX user, it’s quite common to use GPU wheels from JAX for model training while using the CPU version of TensorFlow for data-loading & processing. It’s advantageous to do so because the CPU wheels are usually much smaller. To me, this makes GPUs distinct from CPUs in the sense that GPUs are optional.
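
To sketch what "optional" could mean for a selection mechanism: hardware detection would only supply a default, and an explicit per-package choice from the user would win. Everything below is hypothetical (the function name, the "cpu"/"gpu" labels, the overrides mapping); it is not something pip or any other installer does today.

```python
def choose_variant(package: str, gpu_detected: bool, overrides: dict[str, str]) -> str:
    """Pick 'cpu' or 'gpu' for a package, letting the user's choice win.

    Hypothetical selection logic, for illustration only.
    """
    if package in overrides:
        # An explicit user preference beats whatever hardware was detected.
        return overrides[package]
    return "gpu" if gpu_detected else "cpu"


if __name__ == "__main__":
    # The JAX + TensorFlow setup described above, on a machine with a GPU.
    user_overrides = {"tensorflow": "cpu"}
    for name in ("jax", "tensorflow"):
        print(name, "->", choose_variant(name, True, user_overrides))
```

A tag or marker that is resolved purely from the detected hardware would have no place for that override, which is the difference from CPU architecture tags.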

Do we have any progress on this topic? I mean, are there any existing workarounds for building wheels for multiple different HW accelerators?

The existing workaround for now would be to use Anaconda or conda-forge.