What to do about GPUs? (and the built distributions that support them)

In preparation for the packaging summit, where I’m hoping we’ll get to discuss this topic, I wanted to collect disparate threads and make it easier for others to come up to speed on this stuff. I’m posting it here for visibility, and because Dustin’s post here is already such a great centralized collection of knowledge.

The most pressing issue is that GPU wheel files are especially large, which leads to high bandwidth consumption. Today’s wheel files are arguably larger than they need to be, with respect to two characteristics:

  • Packages bundle many libraries, and not all of the functionality contained therein is used. Refactoring the package bundles could improve distribution efficiency.
  • Packages bundle support for many microarchitectures. These are referred to as “fat” binaries.

I assume here that the trade-off for improving either of these characteristics is increased packaging complexity: more specific dependency relationships, new metadata for previously unaccounted-for hardware specificity, or both. This is a well-discussed topic, going back at least to 2013, when Alyssa Coghlan and others debated it in the context of CPU SIMD features. That discussion was largely bypassed as BLAS implementations such as OpenBLAS and MKL gained the ability to select CPU SIMD features at runtime. This runtime selection is the same mechanism that the “fat” GPU packages provide today, except that the GPU packages are larger, and bundling multiple microarchitectures has a correspondingly larger impact on their size.

Improving package metadata can provide more dependable ways to avoid software conflicts. It may also open up new avenues for sharing libraries among many packages, which would deduplicate and shrink the footprint of installations. Better metadata will also facilitate efforts to simplify and unify the user experience of installing and maintaining implementation variants of packages, which is currently cumbersome and divergent.

I aim to document the state of GPU packages today, in terms of what metadata is being used, how it is being represented, and how end users select from variants of a package. Several potential areas of recent development that may be useful for expanding metadata are highlighted, but the goal of this document is explicitly not to recommend any particular solution. Rather, it is meant to consolidate discussions across hundreds of forum posts and provide common ground for future discussions of potential solutions.

This document is written from an NVIDIA employee’s point of view and with NVIDIA terminology, as those examples are most readily at hand to the author. However, this is neither an NVIDIA-specific nor even a GPU-specific problem, and these issues extend to any software that ships implementation variants. This document also focuses on Linux, as that is where most of the deep learning packages at issue are run.

Status Quo

From the package consumers’ side

Users of PyTorch:

  • Install pytorch with a pip command:

pip3 install torch torchvision torchaudio

  • This implies a CUDA 12 build of pytorch that can fall back to CPU-only mode if no CUDA GPU is available. There are also CUDA 11, ROCm, and CPU builds, but these require passing a --index-url parameter to pip:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

  • The CUDA 12 pytorch packages depend on narrowly-focused CUDA packages that are specific to a particular CUDA version, for example nvidia-cufft-cu12 and nvidia-cudnn-cu12.
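
For example, the declared dependencies of an installed torch wheel can be inspected with importlib.metadata. A minimal sketch, assuming a CUDA 12 build of torch is installed (exact package names and versions vary by release):

# List torch's declared NVIDIA dependencies from its wheel metadata.
from importlib import metadata

for req in metadata.requires("torch") or []:
    if req.startswith("nvidia-"):
        print(req)  # e.g. nvidia-cufft-cu12==11.0.2.54 (plus any environment markers)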

Key user experience aspects:

  • Default implementation assumes CUDA 12, but has fallback to CPU if CUDA 12 initialization fails
  • Decouples hardware-specific implementation details from the package name, shifting them to the repository URL instead.
  • Each hardware-specific repo has identically named packages
  • Pip list can’t distinguish between implementation variants of torch (see the example after this list)
  • The narrowly-scoped CUDA 12 component packages that torch uses are more size-efficient than the kitchen-sink CUDA 11 packages, but they are still large. They are “fat” binaries with support for multiple microarchitectures.
  • Environment still contains packages that have hardware-specific details in them. The NVIDIA library packages have CUDA version as part of their package name. Reproducing environments is difficult.
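
To illustrate the pip list point above: the installed distribution is named torch regardless of variant, so the variant can only be discovered at runtime. A minimal sketch (the printed values are illustrative):

# The distribution name and version can look identical for CPU and CUDA
# variants; only runtime introspection reveals which one is installed.
from importlib import metadata
import torch

print(metadata.version("torch"))  # e.g. "2.3.0" for either variant
print(torch.version.cuda)         # e.g. "12.1" for a CUDA build, None for CPU-only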

Users of JAX:

  • Specify hardware implementation using the “extras” syntax:

pip install -U "jax[cpu]"

  • The CPU implementations are hosted on PyPI. Others require the user to specify an additional URL with the --find-links (-f for short, not to be confused with --extra-index-url, which implies a PEP 503 simple index) parameter:

pip install -U "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

  • There is also a cuda12 extra that uses a plugin mechanism in the CPU build of jaxlib to provide GPU support:

pip install -U "jax[cuda12]"

Key user experience aspects:

  • Requires the user to make an explicit choice of computational backend (no jaxlib dependency without the extra spec)
  • Erroneous extras specs (e.g. misspellings) only produce a warning rather than an error, which can lead to silently unexpected behavior.
  • Extras specs whose packages are not hosted on PyPI appear to pip as unknown extras if the --find-links parameter is omitted. For example, the cuda12_cudnn89 extra is defined in JAX’s setup.py. If the user tries to install it without the repository URL:

pip install jax[cuda12_cudnn89]

pip warns the user that cuda12_cudnn89 is an unknown extra:

WARNING: jax 0.4.28 does not provide the extra 'cuda12-cudnn89'

When the proper JAX wheel index is provided with --find-links, the desired installation proceeds correctly:

pip install jax[cuda12_cudnn89] -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

  • Environment still contains packages that have hardware-specific details in them. The NVIDIA library packages have CUDA version as part of their package name. Reproducing environments is difficult.

Users of NVIDIA RAPIDS:

RAPIDS libraries such as cuDF have historically not been available on PyPI because they are too large. Until recently, pip installing these packages would fail with an intentional (if unfriendly) error directing the user to retry with the --extra-index-url parameter pointed at pypi.nvidia.com. A new strategy using a PEP 517-compliant build backend package now allows transparently fetching packages from NVIDIA’s external repository without requiring extra user input.
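
A rough sketch of how such a redirecting backend can work is below. This is illustrative, not RAPIDS’ actual implementation; the index URL, the naive wheel selection, and the hardcoded project name are all assumptions:

# stub_backend.py — illustrative PEP 517 backend whose sdist is a stub:
# "building" it means fetching the real wheel from an external index.
import pathlib
import re
import urllib.request

EXTERNAL_INDEX = "https://pypi.nvidia.com"  # assumed external simple index

def _find_wheel_url(project):
    # Naive selection: take a wheel link from the project's simple-index page.
    # A real backend would use packaging.tags to match the current platform.
    page = urllib.request.urlopen(f"{EXTERNAL_INDEX}/{project}/").read().decode()
    return re.findall(r'href="([^"#]+\.whl)', page)[-1]

def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    url = _find_wheel_url("cudf-cu12")  # hypothetical target project
    dest = pathlib.Path(wheel_directory) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, dest)
    return dest.name  # PEP 517: return the basename of the wheel produced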

Key user experience aspects:

  • Users can’t ask for just the library; they must specify hardware details in the package name (-cu12).
  • Although the need for the --extra-index-url parameter has been obviated by the new build backend approach, these packages still rely on an external repository. This requires extra effort when mirroring PyPI for air-gapped use. The build backend approach is not subject to dependency confusion attacks like other uses of --extra-index-url because the build backend resolves the appropriate wheel file and downloads it directly, rather than relying on any sort of repository priority.

From the package providers (library author) side

  • PyTorch maintains hardware-specific repositories for each supported hardware configuration. Because PyTorch’s instructions specify using the --index-url parameter, these repositories must mirror all dependencies of the torch packages.
  • PyTorch avoids the problem of multiple projects on PyPI by supporting only one default configuration/variant on PyPI and hosting other variants in independent subfolders of its self-hosted repositories.
  • JAX separates hardware implementations using PEP 440 local version tags (see the example after this list). This avoids some difficulty in expressing dependencies in environment specifications, and would allow multiple implementations on PyPI if local version tags were permitted there.
  • NVIDIA dynamically edits pyproject.toml files when building packages to coordinate the CUDA version-specific suffixes.
  • A result of NVIDIA’s version-specific suffixes is that each variant (i.e. build against a CUDA version) is a separate project on PyPI. This makes the project page for any single library hard to find.
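
For reference, the shape of such a local version tag can be seen with the packaging library; the version string below is illustrative of JAX’s scheme:

# How a PEP 440 local version separates the public version from the
# hardware-specific suffix. The version string is illustrative.
from packaging.version import Version

v = Version("0.4.28+cuda12.cudnn89")
print(v.public)  # 0.4.28
print(v.local)   # cuda12.cudnn89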

Rules that Python packaging standards impose

Rules that PyPI imposes

PyPI is the de facto central repository for Python, in which practically all Python packagers participate, and as such, the metadata it allows or denies defines how all packaging tools and environment managers work. Any novel support for additional metadata must be supported on PyPI if it is to succeed.

  • Package size limit (100 MB default; up to 1 GB with manual approval)
  • Simple repository API (PEP 503/691) - package resolution is a process of iteratively parsing and retrieving files and their dependencies, so the filename is a very important part of resolution. PEP 658 improves the download situation by avoiding the need to download a whole package just to obtain its metadata (see the example after this list).
  • PEP 440 local version tags are not allowed. Nominally, local version tags shouldn’t be uploaded to any simple index, but not all indexes enforce this; PyPI does.
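
As a concrete illustration of the simple API point above, the JSON form (PEP 691) exposes each file’s name and whether standalone metadata (PEP 658) is available, without downloading any wheels:

# Query PyPI's JSON simple API (PEP 691) for a project's files and check
# for standalone metadata (PEP 658).
import json
import urllib.request

req = urllib.request.Request(
    "https://pypi.org/simple/torch/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)
index = json.load(urllib.request.urlopen(req))
for f in index["files"][:3]:
    # The key is "core-metadata" (PEP 714) or, in the original PEP 658/691
    # spelling, "dist-info-metadata"; check both to be safe.
    meta = f.get("core-metadata", f.get("dist-info-metadata"))
    print(f["filename"], bool(meta))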

Elements of a satisfactory solution

Client-side

  • Opting in to a feature should not require specific knowledge of hardware. For example, a user should be able to opt in to something like “gpu-nvidia” instead of “cuda12-sm89”. The resolution of gpu-nvidia into a specific combination of hardware options should be handled by a vendor-provided solution (a PEP 517 build backend, maybe; a hypothetical sketch follows this list).
  • Opting in to a feature such as gpu-nvidia should enable that functionality across other packages that support it. It should not need to be specified with subsequent install commands.
  • Package install tools and environment managers should be able to utilize hardware vendor-provided information to select the correct packages. Vendor-provided information should not be part of pip’s implementation, but should be something that vendors maintain. Again, a PEP 517 build backend may be useful here.
  • Environment specifications would need to account for hardware differences. This can be thought of as a special case of cross-platform lock files, where different variants or even different combinations of packages are necessary depending on the hardware. This could mean that hardware-specific information and packages would need to be captured separately from normal Python dependencies, as described in the “Idea: selector packages” thread and in PEP 725 (Specifying external dependencies in pyproject.toml).
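
Regarding the first bullet above, the resolution of a generic request like “gpu-nvidia” could look something like the following. Everything here is an assumption, not an existing API: the function name, the output format, and the reliance on nvidia-smi’s compute_cap query (available in recent drivers):

# Hypothetical vendor-side hook mapping a generic "gpu-nvidia" request to a
# concrete hardware tag. Illustrative only; not an existing API.
import shutil
import subprocess

def resolve_gpu_nvidia():
    if shutil.which("nvidia-smi") is None:
        return None  # no visible NVIDIA driver; caller falls back to CPU packages
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    major, minor = out.split()[0].split(".")
    return f"cuda12-sm{major}{minor}"  # e.g. "cuda12-sm89" on an Ada-class GPU

print(resolve_gpu_nvidia())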

Repository-side

  • Multiple variants of a given package should all be grouped together on PyPI under one project. Filenames of variants must differ for this to work (see the example after this list).
  • Variants of packages must mutually exclude one another. Sharing a package name is the only mechanism the Python packaging ecosystem has for excluding other packages, so any annotation used to distinguish variants must not be part of the package name.
  • Metadata must be consistent across different variants
  • Metadata should minimize use of “dynamic” entries
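
For context on where a variant annotation could live (per the first bullet above): wheel filenames already encode name, version, build tag, and compatibility tags, which can be inspected with the packaging library. The filename below is illustrative:

# The fields a wheel filename already encodes. Any variant annotation would
# have to fit into (or extend) this structure without changing the name.
from packaging.utils import parse_wheel_filename

name, version, build, tags = parse_wheel_filename(
    "torch-2.3.0-cp312-cp312-manylinux2014_x86_64.whl"
)
print(name, version, build, sorted(str(t) for t in tags))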

Potentially useful developments
