What to do about GPUs? (and the built distributions that support them)

Or extras (e.g. foo[xpu]) or specifically named distributions (e.g. foo-xpu).

I’ve seen and used both styles in the past. They seem to work OK.


In preparation for the packaging summit, where I’m hoping we’ll get to discuss this topic, I wanted to collect disparate threads and make it easier for others to come up to speed on this stuff. I’m posting it here for visibility, and because Dustin’s post here is already such a great centralized collection of knowledge.

The most pressing issue is that GPU wheel files are especially large, which leads to high bandwidth consumption. The status quo wheel files are arguably larger than they need to be, with respect to two characteristics:

  • Packages bundle many libraries. Not all functionality contained therein is utilized. A refactoring of the package bundle could improve distribution efficiency.
  • Packages bundle support for many microarchitectures. These are referred to as “fat” binaries.

I assume here that the trade-off for improving either of these characteristics is increased packaging complexity, whether in more specific dependency relationships, new metadata for previously unaccounted-for hardware specificity, or both. This is a well-discussed topic, going back at least to 2013, when Alyssa Coghlan and others debated this in the context of CPU SIMD features. That discussion was largely bypassed as BLAS implementations such as OpenBLAS and MKL provided the ability to select CPU SIMD features at runtime. This runtime selection is the same approach that the “fat” binaries of GPU packages provide today, except that the GPU packages are larger, and providing multiple microarchitectures has a greater size impact on them.

Improving metadata for packages can open up more dependable ways to avoid software conflicts. Doing so may open up new avenues of sharing libraries among many packages, which would deduplicate and shrink the footprint of installations. Better metadata will also facilitate efforts to simplify and unify the user experience for maintaining and installing implementation variants of packages, which is currently cumbersome and divergent.

I aim to document the state of GPU packages today, in terms of what metadata is being used, how it is being represented, and how end users select from variants of a package. Several potential areas of recent development that may be useful for expanding metadata are highlighted, but the goal of this document is explicitly not to recommend any particular solution. Rather, it is meant to consolidate discussions across hundreds of forum posts and provide common ground for future discussions of potential solutions.

This document is written from an NVIDIA employee’s point of view with NVIDIA terminology, as examples are more readily at hand to the author. However, this is not an NVIDIA-specific nor even GPU-specific problem, and these issues should be extrapolated for any software that is associated with implementation variations. This document is also written with a focus on Linux, as that is where most of the deep learning packages at issue are run.

Status Quo

From the package consumers’ side

Users of PyTorch:

  • Install PyTorch with a pip command:

pip3 install torch torchvision torchaudio

  • This implies a CUDA 12 build of pytorch that can fall back to CPU-only mode if no CUDA GPU is available. There are also CUDA 11, ROCm, and CPU builds, but these require passing a --index-url parameter to pip:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

  • The CUDA 12 PyTorch packages depend on narrowly-focused CUDA packages that are specific to a particular CUDA version, for example nvidia-cufft-cu12 and nvidia-cudnn-cu12.

Key user experience aspects:

  • Default implementation assumes CUDA 12, but has fallback to CPU if CUDA 12 initialization fails
  • Decouples hardware-specific implementation details from package name. Shifts hardware implementation details to repo URL instead.
  • Each hardware-specific repo has identically named packages
  • pip list can’t distinguish between implementation variants of torch
  • The narrowly-scoped CUDA 12 component packages that torch uses are more size-efficient than the kitchen-sink CUDA 11 packages, but they are still large. They are “fat” binaries with support for multiple microarchitectures.
  • Environment still contains packages that have hardware-specific details in them. The NVIDIA library packages have CUDA version as part of their package name. Reproducing environments is difficult.

Users of JAX:

  • Specify hardware implementation using the “extras” syntax:

pip install -U "jax[cpu]"

  • The CPU implementations are hosted on PyPI. Others require the user to specify an additional URL with the --find-links (-f for short, not to be confused with --extra-index-url, which implies a PEP 503 simple index) parameter:

pip install -U "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

  • There is also a cuda12 extra that uses a plugin mechanism in the CPU build of jaxlib to provide GPU support:

pip install -U "jax[cuda12]"

Key user experience aspects:

  • Requires the user to make an explicit choice of computational backend (no jaxlib dependency without the extra spec)
  • Erroneous extras specs (e.g. misspellings) only warn the user instead of raising an error, which may lead to silently unexpected behavior.
  • Extras specs that are not hosted on PyPI show up as missing from pip if the --find-links parameter is omitted. For example, the cuda12_cudnn89 extra is defined in JAX’s setup.py. If the user tries to install this without the repo URL:

pip install jax[cuda12_cudnn89]

pip warns the user that cuda12_cudnn89 is an unknown extra:

WARNING: jax 0.4.28 does not provide the extra 'cuda12-cudnn89'

When the proper JAX wheel index is provided with --find-links, the desired installation proceeds correctly:

pip install jax[cuda12_cudnn89] -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

  • Environment still contains packages that have hardware-specific details in them. The NVIDIA library packages have CUDA version as part of their package name. Reproducing environments is difficult.


RAPIDS libraries such as cuDF have historically not been available on PyPI because they are too large. Until recently, pip installing these packages would result in an unfriendly, intentional error, directing the user to try installing with the --extra-index-url parameter, pointed at pypi.nvidia.com. A new strategy using a PEP-517-compliant build backend package has allowed transparently fetching packages from NVIDIA’s external repository without requiring extra user input.

Key user experience aspects:

  • User can’t ask for just the library. They must specify hardware details in the package name (-cu12).
  • Although the need for the --extra-index-url parameter has been obviated by the new build backend approach, these packages still rely on an external repository. This requires extra effort when mirroring PyPI for air-gapped use. The build backend approach is not subject to dependency confusion attacks like other uses of --extra-index-url because the build backend resolves the appropriate wheel file and downloads it directly, rather than relying on any sort of repository priority.

From the package providers (library author) side

  • PyTorch maintains hardware-specific repositories for each supported hardware configuration. Because PyTorch’s instructions specify using the --index-url parameter, these repositories must mirror all dependencies of the torch packages.
  • PyTorch avoids the problem of multiple projects on PyPI by only supporting one default configuration/variant on PyPI, and hosting other variants in independent subfolders of its self-hosted repositories.
  • JAX separates hardware implementations using PEP 440 local version tags. This avoids some difficulty in expressing dependencies in environment specifications, and would allow multiple implementations on PyPI if local version tags were permitted there.
  • NVIDIA dynamically edits pyproject.toml files when building packages to coordinate the CUDA version-specific suffixes.
  • A result of NVIDIA’s version-specific suffixes is that each variant (i.e. build against a CUDA version) is a separate project on PyPI. This makes the project page for any single library hard to find.

Rules that Python packaging standards impose

Rules that PyPI imposes

PyPI is the de facto central repository for Python, where practically all Python packagers participate, and as such, the metadata it allows or denies defines how all packaging tools and environment managers work. Any novel support for additional metadata must be supported on PyPI if it is to succeed.

  • Package size limit (100 MB default; up to 1 GB with manual approval)
  • Simple PyPI API (PEP 503/691) - package resolution is a process of iteratively parsing and retrieving files and their dependencies. The filename is a very important part of resolution. PEP 658 improves the download situation by avoiding the need to download the whole package to obtain its metadata.
  • PEP 440 Local version tags are not allowed. Nominally, local version tags shouldn’t be uploaded to any simple index, but not all indexes enforce this. PyPI does enforce it.
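To make the local-version mechanics concrete, here is a minimal stdlib-only sketch of splitting off a PEP 440 local version label (the third-party packaging library’s Version class handles this properly; the version strings below are just examples):

```python
def split_local_version(version):
    """Split a PEP 440 version string into public and local parts.

    The local version label is everything after the first '+',
    e.g. the 'cpu' in torch's '2.3.0+cpu' style variant builds.
    """
    public, sep, local = version.partition("+")
    return public, (local if sep else None)

# Variant builds carry the hardware flavor in the local segment;
# PyPI rejects uploads whose version contains such a segment.
print(split_local_version("2.3.0+cpu"))  # ('2.3.0', 'cpu')
print(split_local_version("0.4.28"))     # ('0.4.28', None)
```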

Elements of a satisfactory solution


  • Opt-in to a feature should not require specific knowledge of hardware. For example, a user should be able to opt-in to something like “gpu-nvidia” instead of “cuda12-sm89”. The resolution of gpu-nvidia into a specific combination of hardware options should be handled by a vendor-provided solution (PEP 517 build backend, maybe).
  • Opting in to a feature such as gpu-nvidia should enable that functionality across other packages that support it. It should not need to be specified with subsequent install commands.
  • Package install tools and environment managers should be able to utilize hardware vendor-provided information to select the correct packages. Vendor-provided information should not be part of pip’s implementation, but should be something that vendors maintain. Again, a PEP 517 build backend may be useful here.
  • Environment specifications would need to account for hardware differences. This can be thought of as a special case of cross-platform lock files, where different variants or even different combinations of packages are necessary depending on the hardware. This could mean that hardware-specific information and packages would need to be captured separately from normal Python dependencies, as described in the “Idea: selector packages” thread and PEP 725 (Specifying external dependencies in pyproject.toml).


  • Multiple variants of a given package should all be grouped together on PyPI under one project. Filenames of variants must differ for this to work.
  • Variants of packages must mutually exclude one another. Package name overlap is the only mechanism that the python package ecosystem has for exclusion of other packages. Any annotation that is used to distinguish between variants must not be part of the package name.
  • Metadata must be consistent across different variants
  • Metadata should minimize use of “dynamic” entries
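Since variant filenames must differ while the project name stays the same, any variant marker has to fit into one of the existing wheel-filename fields. A rough sketch of that constraint (a naive split that ignores the spec’s escaping rules):

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its standard components.

    Per the binary distribution format:
    {name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl
    Any variant annotation would have to fit into one of these fields,
    with the build, ABI, or platform tag being the usual candidates.
    """
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    name, version = parts[0], parts[1]
    # An optional build tag makes the split 6 parts instead of 5.
    build = parts[2] if len(parts) == 6 else None
    python_tag, abi_tag, platform_tag = parts[-3:]
    return name, version, build, python_tag, abi_tag, platform_tag

print(parse_wheel_filename("torch-2.3.0-cp312-cp312-manylinux2014_x86_64.whl"))
```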

Potentially useful developments


These points are basically saying “the additional information has to go in the ABI or platform tag”. Perhaps there’s value in letting people reach that conclusion themselves, but I don’t see the harm in saying it. (You could suggest changing the wheel filename format if you want to get everyone off side immediately :wink: I wouldn’t recommend it.)

I’d vote for the platform tag, since it’s already an arbitrary string, and then enhance packaging.tags to create a longer list (whether by default or via plugin). Then the only change PyPI-side is to accept more platform tags (other repositories should already accept arbitrary strings, but if not, they have work to do too).

There wouldn’t even be a change on pip’s side, other than vendoring the newer tags, but designing the opt-in/extension point will consume the effort.

I won’t assume it goes without saying, but consider me supportive of solving this problem. I won’t be at the summit to discuss it, but am interested in what approaches people think are feasible. Good luck!


I have a vague recollection of reading that the platform tag isn’t a great option because it doesn’t support optional entries well. Something about its regex. I haven’t been able to find that again. My memory is that Pradyun wrote it, but I might be getting mixed up with the “GPU tags” suggestion in Dustin’s OP, where the optional part is the build tag.

Anyway, I didn’t want to commit to any implementation details here, because I want to agree on shared context for what the problems are, and what the overall goals should look like. I felt like past efforts have fizzled when arguing about implementation details. I know we’ll have to work them out at some point, but 30 minutes at the packaging summit (if this is even an accepted topic) is better spent doing things that we can all agree on.


Personally I think 30 minutes in person is the best time to have the argument :smiley: There’s a good chance everyone already agrees on all the background, and just needs to be convinced that their one pet feature isn’t going to disappear.


I just want to note that this is still a problem for other libraries besides OpenBLAS and MKL. Maintaining fat builds is extra work so not all libraries or Python packages provide them. In many cases the end result is just not using SIMD etc capabilities even if there would be significant benefit in doing so.

I had been wondering whether x86_64 platform tags might be extended to support the new x86_64 psABI levels. That is a separate discussion, but it could also be made moot by an extensible system for defining and matching platform tags.


Yes! Thanks for reiterating that. I didn’t mean to say that OpenBLAS and MKL had solved the metadata problem, just that by shipping fat binaries, they were good enough for many applications, and that reduced the pressure for a proper fix. You’ve been in this discussion for a really long time, Oscar; thank you for staying involved! Are there other approaches to work within the current constraints that you would point to? Are there specific goals or design criteria you would add?

One thing about SIMD that is somewhat special is that there is a cascade of preferences. It’s not a simple match of one hardware attribute to one specific CPU type. I think this kind of resolution order can happen in what I call “vendor provided code” - likely in a PEP 517 build backend that does the job of translating a package name and version spec into a fully-resolved variant wheel.


One thing you allude to, which something like selector packages doesn’t address, is the ability to pre-generate a dependency graph for a given environment, i.e. pip can fully determine the set of files with static input on any arbitrary machine.

Is this a high priority goal of the community? Or is install-time arbitrary code execution fine for the foreseeable future?


That is difficult if there’s any kind of dynamic resolution. Part of what has been percolating for me is whether it makes sense to have the hardware implementation metadata be separate from the python package metadata. This is swimming into implementation, so take it with a grain of salt, but here goes:

If you allow there to be a kind of placeholder python package that takes care of installing appropriate hardware support when it is itself installed or otherwise “activated”, then you would not necessarily need to capture the hardware support in an environment spec. If you can factor out the hardware-specific parts, then you leave the environment flexible to be instantiated on other hardware. For the sake of reproducibility, you’d still capture the hardware packages, but by keeping them separate from the “normal” packages, you’d maintain the ability to pre-generate dependency graphs - they’d just stop at stubs for the hardware implementations.

Is this a high priority goal of the community?

I hope so! It is a high priority at NVIDIA. I’ll do what I can to help make it happen.

Or is install-time arbitrary code execution fine for the foreseeable future?

Realistically, I think this is likely. The good news is that it’s not arbitrary code execution in setup.py. The PEP 517 build backend approach puts the redirection logic into a wheel that can be statically inspected without running it. It also means the redirection logic is centralized and shared among many packages, instead of being copied and scattered across all the packages.

In the absence of a PEP 517 build backend to do redirection, I think we’d probably be stuck with vendor-provided metadata plugins to pip and other package managers. The thing I really like about the build backend approach is that it obviates the need for the user to do any kind of pre-installation to set things up or activate some behavior.
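As a sketch of that redirection idea (every name, preference order, and filename below is hypothetical, and real hardware detection would query the driver rather than return a constant):

```python
# Hypothetical sketch of variant resolution inside a PEP 517 build
# backend: translate (package, version) plus detected hardware into
# a fully-resolved variant wheel. Names and filenames are illustrative.

VARIANT_PREFERENCE = ["cu12", "cu11", "cpu"]  # cascade of preferences

def detect_supported_variants():
    """Stand-in for vendor-provided hardware detection.

    A real backend might query the driver here; this sketch just
    pretends a CUDA 12 capable GPU was found alongside the CPU.
    """
    return {"cu12", "cpu"}

def resolve_variant_wheel(name, version, supported=None):
    """Pick the most-preferred variant that the hardware supports."""
    if supported is None:
        supported = detect_supported_variants()
    for variant in VARIANT_PREFERENCE:
        if variant in supported:
            return f"{name}-{version}+{variant}-py3-none-manylinux2014_x86_64.whl"
    raise RuntimeError(f"no supported variant of {name} found")

print(resolve_variant_wheel("somelib", "1.0", {"cpu"}))
```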

This said, if you need fully resolved package graphs and you can’t tolerate having stubs for hardware implementations, I don’t see any way to completely avoid dynamic behavior at some point in the process.


Definitely. Building different binaries for different SIMD levels (even if later combined into “fat” libraries or packages) would add to the combinatorial explosion of binary builds that non-pure Python packages already experience. It would also add to the distribution size, which is already a problem for some packages.

The better solution to this is to implement runtime dispatch for performance-critical codepaths, but that requires development work and is not always easy to retrofit into a codebase that was not designed with it in mind.


There has been a wild west in terms of x86_64 extensions, but the psABI levels that I referred to are an attempt to tame that into something manageable. Those levels are now supported by multiple compilers and languages and are being used as the OS architecture for some Linux distros. It isn’t a combinatorial explosion because they are only defined for System V (i.e. not Windows) and for x86_64 (i.e. not ARM, and so not macOS either). In practice only Linux wheels would use these on PyPI.

I imagine that most Python packages would not get much benefit from having more than an x86_64 wheel but some very particular packages would. On the other hand some packages might want to set e.g. x86-64-v2 as the minimum supported CPU and not bother providing wheels for plain x86-64.

An example I know of is python-flint, where the underlying C library has optional features that are implicitly disabled by --host x86_64-pc-linux-gnu as is used for the Linux x86_64 wheels. Currently the only platform whose wheels include all features is macOS arm64, because on that platform the wheels don’t need to be compatible with CPUs from 20 years ago. Being able to use x86-64-v3 would make it possible to enable some of those features, and x86-64-v4 would enable all of them.

I would expect python-flint to provide x86-64, x86-64-v3 and x86-64-v4 wheels for manylinux if it were possible to do so. That would make the wheels more suitable for use in HPC environments which invariably run Linux, are likely to support v3/v4 instruction sets, and really do want maximum possible computing speed.

This is a bit of a tangent from the main discussion here but I just wanted to mention reasons besides GPUs to want to have more platform specificity than is currently possible. The problem of varying x86-64 capabilities that was discussed 10 years ago is still with us and has in fact become worse over time with newer CPUs and instruction sets. This case is different from GPUs in that it is not inconceivable to extend the list of hard-coded platform tags to accommodate the psABI levels rather than use an extensible system. A solution that would work for GPUs would likely also be able to handle this case though.


Regardless of “where” they are supposed to be defined, they are applicable (and probably desirable) on Windows as well. Just set the respective compiler flags for the enabled ISA features of each “microarchitecture level”.

Optional SIMD instructions definitely exist on non-x86 architectures as well, for example you might compile ARM code with SVE (or SVE2!) enabled or not.

That’s what we morally do for PyArrow, since for years our default build flags have enabled SSE4.2 and POPCOUNT. But some people would object that they’d prefer AVX2 or even AVX512-enabled builds for better performance.

I think it would be reasonable at this point to consider standardising such practices, for example saying that build tools MAY assume x86-64-v2 when building x86_64 wheels.

Adding x86_64_v{1,2,3,4} platform tags is another possibility, but it would require a means of detecting the appropriate version (in pure Python, because many packaging tools can’t or won’t depend on C extensions).
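Such pure-Python detection seems feasible in principle. A rough sketch (flag lists abbreviated from the psABI document for illustration, and the /proc/cpuinfo parsing is Linux-only):

```python
# Rough pure-Python sketch of classifying a CPU into an x86-64 psABI
# microarchitecture level from its /proc/cpuinfo feature flags.

LEVELS = [
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe"}),
    ("x86-64-v2", {"popcnt", "sse4_1", "sse4_2", "ssse3"}),
]

def psabi_level(flags):
    """Return the highest psABI level whose flags are all present.

    Each level implies the ones below it on real hardware, so checking
    from the highest level down is sufficient for this sketch.
    """
    for name, required in LEVELS:
        if required <= flags:
            return name
    return "x86-64"  # baseline

def current_level(cpuinfo_path="/proc/cpuinfo"):
    """Classify the running CPU (Linux only) by parsing /proc/cpuinfo."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return psabi_level(set(line.split(":", 1)[1].split()))
    return "x86-64"

print(psabi_level({"popcnt", "sse4_1", "sse4_2", "ssse3"}))  # x86-64-v2
```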

I don’t think build tools should do anything like that by default, except if it’s part of their advertised feature set. Offering flags to easily select x86-64-v<N> however could be desirable, as long as the default remains the baseline x86-64 feature set.


As someone that’s familiar with this problem, but hasn’t engaged in prior discussions around the solutions, I found this summary exceedingly helpful – thank you for putting it together.

A question on the “elements of a satisfactory solution”: should users be required (or not) to indicate in advance which accelerators they want to include in their installation?

E.g., compare specifying jax[gpu-nvidia] as the input dependency, vs. a system that automatically inferred whether GPU support was present and automatically selected jax[gpu-nvidia] given the jax requirement.


I hesitate to pursue any automatic selection of implementation. There’s such a broad spectrum of hardware and of problems that I don’t think we could do it justice. We should loudly advertise what hardware is available to be selected, but I think we should stop short of choosing a default.

For example, if we always default to gpu-nvidia if any sort of NVIDIA hardware is present, then we could end up picking my workstation’s GT 1030 barebones GPU, when there’s a Threadripper that could better serve the problem. Perhaps that’s an edge case, with CPUs typically being less powerful, and GPUs typically being more powerful.

Separately, I would prefer for the “selector” to be a configuration parameter or package unto itself, such that instead of specifying jax[gpu-nvidia], you would instead have a config section like:
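For instance (the section and key names here are invented for illustration, not an existing standard):

```toml
# Hypothetical configuration -- all names invented for illustration
[tool.hardware]
select = ["gpu-nvidia"]    # preferred hardware implementation
fallback = ["cpu-x86_64"]  # used when no GPU package is available
```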


Where gpu-nvidia would dispatch to some optimal match for the hardware that’s present, and cpu-x86_64 would dispatch to some optimal match for the CPU that’s present. The CPU fallback would take effect if GPU packages were not available (perhaps like Flint in the examples that Oscar has provided).

Alternatively, it could be specified with a package installation, though this makes hierarchies harder or impossible to express in nice, general ways.

pip install hardware-select-gpu-nvidia
pip install hardware-fallback-cpu-x86_64
pip install jax
# jax for NVIDIA GPU is installed if available, or else CPU fallback gets installed

The point is that we should decouple hardware implementation from individual package specs, and instead prefer it to be a broader, perhaps environment-level, if not system-level property.


I opened a separate discussion to describe a variant of selector packages that makes the metadata more statically analysable.


Not quite – you can do something like cp39-none+cuda12-manylinux2010_x86_64 or something equivalent in the existing compatibility tags. What we don’t have is the ability to add hyphenated selectors, but we could augment platform and ABI tags to enable selection within those.

It might be worth looking at Spack, Conda etc to identify the various ways that we could encode the ABI compatibility story as well as some encoding of platform support beyond the OS alone.


Thanks for that clarification! It definitely makes things less restricted than I thought.

The way that Conda encodes arbitrary metadata is that it tracks any kind of variant input. It does so by knowing what variables are present in a recipe, and then finding if/where those variables get used. It also takes account of whether more than one value is present. These are designated “used variables” and they go into a file, “hash_input.json,” which is a dictionary with variable name as key, and variable value as value. This hash_input.json file gets hashed, and that hash is appended to the filename.

This hash approach seems like it might make it hard or impossible to sort filenames without understanding the metadata that went into the hash. To Conda, that doesn’t matter, because any of this metadata that matters would already be extracted to the central, consolidated repodata.json. Perhaps PEP 658 would make this feasible.
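In code, the scheme described above amounts to something like the following (the exact serialization and hash details in conda-build differ; this only shows the shape of the idea):

```python
# Sketch of Conda's variant-hash scheme as described above: the "used"
# variant variables form a hash_input.json-style mapping, that mapping
# is hashed, and a truncated hash lands in the build string / filename.

import hashlib
import json

def variant_hash(used_variables):
    """Hash the variant-input mapping, conda-build style in spirit.

    Serialize the used variables deterministically, then keep a short
    prefix of the digest (algorithm and length here are illustrative).
    """
    payload = json.dumps(used_variables, sort_keys=True)
    return "h" + hashlib.sha256(payload.encode()).hexdigest()[:7]

variants = {"cuda_compiler_version": "12.0", "python": "3.12"}
print(f"somelib-1.0-py312_{variant_hash(variants)}_0.conda")
```

The point the post makes follows directly from this: two filenames with different hashes cannot be meaningfully compared or sorted without the hash inputs, which Conda sidesteps by exposing the relevant metadata in repodata.json.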


Not trying to stoke a fire or pass judgement, but is this another example of a similar style of selector package from tensorrt?