Or extras (eg foo[xpu]) or specifically named distributions (eg foo-xpu).
I’ve seen and used both styles in the past. They seem to work ok.
In preparation for the packaging summit, where I’m hoping we’ll get to discuss this topic, I wanted to collect disparate threads and make it easier for others to come up to speed on this stuff. I’m posting it here for visibility, and because Dustin’s post here is already such a great centralized collection of knowledge.
The most pressing issue is that GPU wheel files are especially large, which leads to high bandwidth consumption. The status quo wheel files are arguably larger than they could be, with respect to two characteristics:
I assume here that the trade-off for improving either of these characteristics is increased packaging complexity, whether in more specific dependency relationships, new metadata for previously unaccounted-for hardware specificity, or both. This is a well-discussed topic, going back at least to 2013, when Alyssa Coghlan and others debated it in the context of CPU SIMD features. That discussion was largely bypassed as BLAS implementations such as OpenBLAS and MKL provided the ability to select CPU SIMD features at runtime. This runtime selection is the same approach that the “fat” binaries of today’s GPU packages provide, except that GPU packages are larger to begin with, so providing multiple microarchitectures has a larger size impact on them.
Improving metadata for packages can open up more dependable ways to avoid software conflicts. It may also create new avenues for sharing libraries among many packages, which would deduplicate and shrink the footprint of installations. Better metadata would also facilitate efforts to simplify and unify the user experience of installing and maintaining implementation variants of packages, which is currently cumbersome and divergent.
I aim to document the state of GPU packages today, in terms of what metadata is being used, how it is being represented, and how end users select from variants of a package. Several potential areas of recent development that may be useful for expanding metadata are highlighted, but the goal of this document is explicitly not to recommend any particular solution. Rather, it is meant to consolidate discussions across hundreds of forum posts and provide common ground for future discussions of potential solutions.
This document is written from an NVIDIA employee’s point of view with NVIDIA terminology, as examples are more readily at hand to the author. However, this is not an NVIDIA-specific nor even GPU-specific problem, and these issues should be extrapolated for any software that is associated with implementation variations. This document is also written with a focus on Linux, as that is where most of the deep learning packages at issue are run.
pip3 install torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Key user experience aspects:
pip install -U "jax[cpu]"
pip install -U "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install -U "jax[cuda12]"
Key user experience aspects:
pip install jax[cuda12_cudnn89]
Warns the user that cuda12_cudnn89 is an unknown feature:
WARNING: jax 0.4.28 does not provide the extra 'cuda12-cudnn89'
When the proper JAX PEP 503 repository is provided with --find-links, the desired installation proceeds correctly:
pip install jax[cuda12_cudnn89] -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
RAPIDS libraries such as cuDF have historically not been available on PyPI because they are too large. Until recently, pip installing these packages would result in an unfriendly, intentional error, directing the user to try installing with the --extra-index-url parameter, pointed at pypi.nvidia.com. A new strategy using a PEP-517-compliant build backend package has allowed transparently fetching packages from NVIDIA’s external repository without requiring extra user input.
Key user experience aspects:
PyPI is the de facto central repository for Python, where practically all Python packages participate, and as such, the metadata that it allows or denies defines how all packaging tools and environment managers work. Any novel support for additional metadata must be supported on PyPI if it is to be successful.
PEP 658 (approved and implemented) ensures that the core METADATA files are served alongside package files in PEP 503 Simple repositories. Core metadata presents a place where implementation and compatibility information can be expressed far more descriptively than in a filename. Filenames would still need something extra to keep variants from overlapping, but the actual content matters less, and something generic like a hash may be workable.
PEP 708 (provisionally approved but not implemented) helps avoid issues with dependency confusion attacks that are introduced by needing to use repositories outside of PyPI.
PEP 725 (draft) expresses arbitrary dependency metadata that any external package manager can use to satisfy shared library dependencies.
The proposal for dynamic metadata plugins would allow build backends to obtain metadata from plugins. This would be a great place to implement hardware detection, such as the CUDA version and SM microarchitecture.
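To make that last point concrete, here is a rough, hypothetical sketch of the kind of detection such a plugin might perform. The plugin interface itself does not exist yet; this simply shells out to nvidia-smi, and the compute_cap query requires a reasonably recent driver:

import shutil
import subprocess

def detect_nvidia_gpu():
    # Hypothetical helper a dynamic-metadata plugin might call. Returns None
    # when no NVIDIA driver/tooling is present.
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # First GPU only, e.g. "8.6, 550.54.14" -> SM 8.6, driver 550.54.14
    compute_cap, driver_version = (x.strip() for x in out.splitlines()[0].split(","))
    return {"sm": compute_cap, "driver": driver_version}

print(detect_nvidia_gpu())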
These points are basically saying “the additional information has to go in the ABI or platform tag”. Perhaps there’s value in letting people reach that conclusion themselves, but I don’t see the harm in saying it. (You could suggest changing the wheel filename format if you want to get everyone offside immediately, but I wouldn’t recommend it.)
I’d vote for the platform tag, since it’s already an arbitrary string, and then enhance packaging.tags to create a longer list (whether by default or via plugin). Then the only change PyPI-side is to accept more platform tags (other repositories should already accept arbitrary strings, but if not, they have work to do too).
There wouldn’t even be a change on pip’s side, other than vendoring newer tags, but designing the opt-in/extension point will consume the effort.
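For illustration, here is a rough sketch of what that could look like with today’s packaging.tags API; the “cuda12” suffix and the underscore-joining convention are invented here purely as an example, not a proposal:

from packaging.tags import Tag, sys_tags

def hardware_qualified_tags(hw_suffixes=("cuda12",)):
    # Yield hardware-qualified variants of each supported tag first (more
    # specific, so preferred by installers), then the plain tag as a fallback.
    for tag in sys_tags():
        for suffix in hw_suffixes:
            yield Tag(tag.interpreter, tag.abi, f"{tag.platform}_{suffix}")
        yield tag

for tag in list(hardware_qualified_tags())[:6]:
    print(tag)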
I won’t assume it goes without saying, but consider me supportive of solving this problem. I won’t be at the summit to discuss it, but am interested in what approaches people think are feasible. Good luck!
I have a vague recollection of reading that the platform tag isn’t a great option because it doesn’t support optional entries well. Something about its regex. I haven’t been able to find that again. My memory is that Pradyun wrote it, but I might be getting mixed up with the “GPU tags” suggestion in Dustin’s OP, where the optional part is the build tag.
Anyway, I didn’t want to commit to any implementation details here, because I want to agree on shared context for what the problems are, and what the overall goals should look like. I felt like past efforts have fizzled when arguing about implementation details. I know we’ll have to work them out at some point, but 30 minutes at the packaging summit (if this is even an accepted topic) is better spent doing things that we can all agree on.
Personally I think 30 minutes in person is the best time to have the argument. There’s a good chance everyone already agrees on all the background, and just needs to be convinced that their one pet feature isn’t going to disappear.
I just want to note that this is still a problem for other libraries besides OpenBLAS and MKL. Maintaining fat builds is extra work so not all libraries or Python packages provide them. In many cases the end result is just not using SIMD etc capabilities even if there would be significant benefit in doing so.
I had been wondering whether x86_64 platform tags might be extended to support the new x86_64 psABI levels. That is a separate discussion, but it could also be made moot by an extensible system for defining and matching platform tags.
Yes! Thanks for reiterating that. I didn’t mean to say that OpenBLAS and MKL had solved the metadata problem, just that by shipping fat binaries they were good enough for many applications, which reduced the pressure for a proper fix. You’ve been in this discussion for a really long time, Oscar; thank you for staying involved! Are there other approaches to work within the current constraints that you would point to? Are there specific goals or design criteria you would add?
One thing about SIMD that is somewhat special is that there is a cascade of preferences. It’s not a simple match of one hardware attribute to one specific CPU type. I think this kind of resolution order can happen in what I call “vendor provided code” - likely in a PEP 517 build backend that does the job of translating a package name and version spec into a fully-resolved variant wheel.
One thing you allude to which something like selector packages doesn’t address is the ability to pre-generate a dependency graph for a given environment, ie pip can fully determine the set of files with static input on any arbitrary machine.
Is this a high priority goal of the community? Or is install-time arbitrary code execution fine for the foreseeable future?
That is difficult if there’s any kind of dynamic resolution. Part of what has been percolating for me is whether it makes sense to have the hardware implementation metadata be separate from the python package metadata. This is swimming into implementation, so take it with a grain of salt, but here goes:
If you allow there to be a kind of placeholder python package that takes care of installing appropriate hardware support when it is itself installed or otherwise “activated”, then you would not necessarily need to capture the hardware support in an environment spec. If you can factor out the hardware-specific parts, then you leave the environment flexible to be instantiated on other hardware. For the sake of reproducibility, you’d still capture the hardware packages, but by keeping them separate from the “normal” packages, you’d maintain the ability to pre-generate dependency graphs - they’d just stop at stubs for the hardware implementations.
Is this a high priority goal of the community?
I hope so! It is a high priority at NVIDIA. I’ll do what I can to help make it happen.
Or is install-time arbitrary code execution fine for the foreseeable future?
Realistically, I think this is likely. The good news is that it’s not arbitrary code execution in setup.py. The PEP 517 build backend approach puts the redirection logic into a wheel that can be statically inspected without running it. It also means the redirection logic is centralized and shared among many packages, instead of being copied and scattered across all the packages.
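For readers unfamiliar with the pattern, here is a heavily simplified, hypothetical sketch of what such a redirecting backend’s build_wheel hook could look like. The package names, the “cu12”/“cpu” split, and the detection logic are all stand-ins; a real backend (such as the one RAPIDS uses) also implements the other PEP 517 hooks and produces a fully valid wheel:

import shutil
import zipfile
from pathlib import Path

def _detect_variant():
    # Stand-in detection: presence of the NVIDIA driver utility.
    return "cu12" if shutil.which("nvidia-smi") else "cpu"

def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    variant = _detect_variant()
    wheel_name = "foo_selector-0.1-py3-none-any.whl"
    metadata = (
        "Metadata-Version: 2.1\n"
        "Name: foo-selector\n"
        "Version: 0.1\n"
        # The only job of this wheel: pull in the concrete variant package.
        f"Requires-Dist: foo-{variant}==0.1\n"
    )
    path = Path(wheel_directory) / wheel_name
    with zipfile.ZipFile(path, "w") as wheel:
        wheel.writestr("foo_selector-0.1.dist-info/METADATA", metadata)
        wheel.writestr(
            "foo_selector-0.1.dist-info/WHEEL",
            "Wheel-Version: 1.0\nRoot-Is-Purelib: true\nTag: py3-none-any\n",
        )
        wheel.writestr("foo_selector-0.1.dist-info/RECORD", "")
    return wheel_name

The point is that everything the hook does is visible in the backend’s own wheel, which can be inspected without executing it.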
In the absence of a PEP 517 build backend to do redirection, I think we’d probably be stuck with vendor-provided metadata plugins to pip and other package managers. The thing I really like about the build backend approach is that it obviates the need for the user to do any kind of pre-installation to set things up or activate some behavior.
This said, if you need fully resolved package graphs and you can’t tolerate having stubs for hardware implementations, I don’t see any way to completely avoid dynamic behavior at some point in the process.
Definitely. Building different binaries for different SIMD levels (even if later combined into “fat” libraries or packages) would add to the combinatorial explosion of binary builds that non-pure Python packages already experience. It would also add to the distribution size, which is already a problem for some packages.
The better solution to this is to implement runtime dispatch for performance-critical codepaths, but that requires development work and is not always easy to retrofit into a codebase that was not designed with it in mind.
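As a purely illustrative Python analogue of that pattern (real libraries do this in compiled code, typically via cpuid, per hot codepath), selecting an implementation once at import time might look like the following; the kernels here are trivial stand-ins:

def _cpu_flags():
    # Linux-only for this sketch: read the CPU flag list from /proc/cpuinfo.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def _dot_generic(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def _dot_avx2(xs, ys):
    # Stand-in for a vectorized implementation built with AVX2 enabled.
    return sum(x * y for x, y in zip(xs, ys))

# Dispatch once, at import time, based on what the CPU actually supports.
dot = _dot_avx2 if "avx2" in _cpu_flags() else _dot_generic
print(dot([1, 2, 3], [4, 5, 6]))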
There has been a wild west in terms of x86_64 extensions, but the psABI levels that I referred to are an attempt to tame that into something manageable. Those levels are now supported by multiple compilers and languages and are being used as the OS architecture for some Linux distros. It isn’t a combinatorial explosion because they are only defined for System V (i.e. not Windows) and for x86_64 (i.e. not ARM, and so not macOS either). In practice only Linux wheels would use these on PyPI.
I imagine that most Python packages would not get much benefit from having more than an x86_64 wheel but some very particular packages would. On the other hand some packages might want to set e.g. x86-64-v2 as the minimum supported CPU and not bother providing wheels for plain x86-64.
An example I know of is python-flint, where the underlying C library has optional features that are implicitly disabled by --host x86_64-pc-linux-gnu as is used for the Linux x86_64 wheels. Currently the only platform whose wheels include all features is macOS arm64, because on that platform the wheels don’t need to be compatible with CPUs from 20 years ago. Being able to use x86-64-v3 would make it possible to enable some of those features, and x86-64-v4 would enable all of them.
I would expect python-flint to provide x86-64, x86-64-v3 and x86-64-v4 wheels for manylinux if it were possible to do so. That would make the wheels more suitable for use in HPC environments which invariably run Linux, are likely to support v3/v4 instruction sets, and really do want maximum possible computing speed.
This is a bit of a tangent from the main discussion here but I just wanted to mention reasons besides GPUs to want to have more platform specificity than is currently possible. The problem of varying x86-64 capabilities that was discussed 10 years ago is still with us and has in fact become worse over time with newer CPUs and instruction sets. This case is different from GPUs in that it is not inconceivable to extend the list of hard-coded platform tags to accommodate the psABI levels rather than use an extensible system. A solution that would work for GPUs would likely also be able to handle this case though.
It isn’t a combinatorial explosion because they are only defined for System V (i.e. not Windows) and for x86_64 (i.e. not ARM)
Regardless of “where” they are supposed to be defined, they are applicable (and probably desirable) on Windows as well. Just set the respective compiler flags for the enabled ISA features of each “microarchitecture level”.
Optional SIMD instructions definitely exist on non-x86 architectures as well, for example you might compile ARM code with SVE (or SVE2!) enabled or not.
On the other hand some packages might want to set e.g. x86-64-v2 as the minimum supported CPU and not bother providing wheels for plain x86-64.
That’s what we morally do for PyArrow, since for years our default build flags have enabled SSE4.2 and POPCOUNT. But some people would object that they’d prefer AVX2 or even AVX512-enabled builds for better performance.
That’s what we morally do for PyArrow, since for years our default build flags have enabled SSE4.2 and POPCOUNT. But some people would object that they’d prefer AVX2 or even AVX512-enabled builds for better performance.
I think it would be reasonable at this point to consider standardising such practices, for example saying that build tools MAY assume x86-64-v2 when building x86_64 wheels.
Adding x86_64_v{1,2,3,4} platform tags is another possibility, but it would require a means of detecting the appropriate version (in pure Python, because many packaging tools can’t or won’t depend on C extensions).
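A rough sketch of such pure-Python detection on Linux, reading /proc/cpuinfo. The flag sets follow the psABI level definitions only approximately (some required flags are spelled differently or folded into others in /proc/cpuinfo), so treat this as an illustration rather than a reference implementation:

# Approximate x86-64 psABI level detection from /proc/cpuinfo (Linux only).
_LEVEL_FLAGS = {
    2: {"cx16", "lahf_lm", "popcnt", "sse4_1", "sse4_2", "ssse3"},
    3: {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe", "xsave"},
    4: {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}

def x86_64_level():
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                break
    level = 1
    for candidate in (2, 3, 4):
        # Each level requires all flags of that level plus the previous ones.
        if _LEVEL_FLAGS[candidate] <= flags:
            level = candidate
        else:
            break
    return level

print(f"x86-64-v{x86_64_level()}")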
I think it would be reasonable at this point to consider standardising such practices, for example saying that build tools MAY assume x86-64-v2 when building x86_64 wheels.
I don’t think build tools should do anything like that by default, except if it’s part of their advertised feature set. Offering flags to easily select x86-64-v<N>, however, could be desirable, as long as the default remains the baseline x86-64 feature set.
As someone that’s familiar with this problem, but hasn’t engaged in prior discussions around the solutions, I found this summary exceedingly helpful – thank you for putting it together.
A question on the “elements of a satisfactory solution”: should users be required (or not) to indicate in advance which accelerators they want to include in their installation?
E.g., compare specifying jax[gpu-nvidia] as the input dependency, vs. a system that automatically inferred whether GPU support was present and automatically selected jax[gpu-nvidia] given the jax requirement.
I hesitate to pursue any automatic selection of implementation. There’s such a broad spectrum of hardware and of problems that I don’t think we could do it justice. We should loudly advertise what hardware is available to be selected, but I think we should stop shy of defaulting.
For example, if we always default to gpu-nvidia if any sort of NVIDIA hardware is present, then we could end up picking my workstation’s barebones GT 1030 GPU when there’s a Threadripper that could better serve the problem. Perhaps that’s an edge case, with CPUs typically being less powerful and GPUs typically being more powerful.
Separately, I would prefer for the “selector” to be a configuration parameter or package unto itself, such that instead of specifying jax[gpu-nvidia], you would have a config section like:
[preferred.compute]
gpu-nvidia
cpu-x86_64
Where gpu-nvidia would dispatch to some optimal match for the GPU hardware that’s present, and cpu-x86_64 would dispatch to some optimal match for the CPU that’s present. The CPU entry would come into play only if GPU packages were not available (perhaps like Flint in the examples that Oscar has provided).
Alternatively, it could be specified with a package installation, though this makes hierarchies harder or impossible to express in nice, general ways.
pip install hardware-select-gpu-nvidia
pip install hardware-fallback-cpu-x86_64
pip install jax
# jax for NVIDIA GPU is installed if available, or else CPU fallback gets installed
The point is that we should decouple hardware implementation from individual package specs, and instead prefer it to be a broader, perhaps environment-level, if not system-level property.
One thing you allude to which something like selector packages doesn’t address is the ability to pre-generate a dependency graph for a given environment, ie pip can fully determine the set of files with static input on any arbitrary machine.
I opened a separate discussion to describe a variant of selector packages that makes the metadata more statically analysable.
Not quite – you can do something like cp39-none+cuda12-manylinux2010_x86_64 or something equivalent in the existing compatibility tags. What we don’t have is the ability to add hyphenated selectors, but we could augment platform and ABI tags to enable selection within those.
It might be worth looking at Spack, Conda etc to identify the various ways that we could encode the ABI compatibility story as well as some encoding of platform support beyond the OS alone.
Thanks for that clarification! It definitely makes things less restricted than I thought.
The way that Conda encodes arbitrary metadata is that it tracks any kind of variant input. It does so by knowing what variables are present in a recipe, and then finding if/where those variables get used. It also takes account of whether more than one value is present. These are designated “used variables” and they go into a file, “hash_input.json,” which is a dictionary with variable name as key, and variable value as value. This hash_input.json file gets hashed, and that hash is appended to the filename.
This hash approach seems like it might make it hard or impossible to sort filenames without understanding the metadata that went into the hash. To Conda, that doesn’t matter, because any of this metadata that matters would already be extracted to the central, consolidated repodata.json. Perhaps PEP 658 would make this feasible.
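The mechanism is roughly the following; this is a sketch of the idea rather than conda-build’s exact code, and the variable names are just examples:

import hashlib
import json

# The "used variables" for one build of one recipe, as recorded in
# hash_input.json.
hash_input = {"c_compiler": "gcc", "cuda_compiler_version": "12.4"}

digest = hashlib.sha1(
    json.dumps(hash_input, sort_keys=True).encode("utf-8")
).hexdigest()

# A short prefix of the hash ends up in the build string / filename,
# e.g. "pkg-1.0-h1a2b3c4_0.conda".
build_string = f"h{digest[:7]}_0"
print(build_string)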
Not trying to stoke a fire or pass judgement, but is this another example of a similar style of selector package from tensorrt?
I’ll admit I felt a little sick in my stomach when I stumbled upon this gem:

class InstallCommand(install):
    def run(self):
        # pip-inside-pip hack ref #3080
        run_pip_command(
            [
                "install",
                "--extra-index-url",
                nvidia_pip_index_url,
                *tensorrt_submodules,
            ],
            subprocess.check_call,
        )
        super().run()

You can see some of the underlying motivation for the pip-ception he…