PEP Proposal - Platform‑Aware GPU Packaging and Installation for Python

(Please be gentle, it’s my first submission here)

Abstract

This PEP standardizes packaging and installation of GPU‑accelerated Python libraries by introducing:

  • GPU‑specific wheel platform tags
  • GPU‑specific environment markers (cuda_version, rocm_version, has_cuda, has_rocm)
  • gpu‑pyres – an optional, pluggable hardware‑detection utility
  • Backward compatibility with today's +cuXXX local-version wheels
  • Build‑backend hooks and installation extras (e.g. [gpu])
  • Explicit user‑override mechanisms

The goal is an opt‑in workflow that lets pip (or wrappers) install the best CUDA, ROCm, or CPU variant automatically while preserving reproducibility.


Motivation

GPU builds are fragile today:

  • Multiple binaries (CPU, several CUDA toolkits, ROCm)
  • PyPI forbids publishing several different binaries under one version string (and rejects local version identifiers)
  • Users juggle custom indexes and manual downloads
  • CI reproducibility breaks across GPU/CPU hosts

A standards‑driven solution removes guesswork and custom scripts.


Proposal

1 Wheel platform tags

Add an accelerator suffix to the existing platform tag:

manylinux2014_x86_64.cuda124
manylinux2014_x86_64.rocm57
manylinux2014_x86_64.cpu
win_amd64.cuda121

Examples:

torch-2.2.0-cp310-cp310-manylinux2014_x86_64.cuda124.whl
torch-2.2.0-cp310-cp310-win_amd64.rocm57.whl
torch-2.2.0-cp310-cp310-manylinux2014_x86_64.cpu.whl
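
A resolver could treat everything after the final dot as the accelerator suffix. A minimal sketch of that split, assuming the last-dot rule proposed here (the helper is hypothetical, not part of any existing library):

    # Hypothetical helper: split a proposed platform tag into base
    # platform and accelerator suffix.
    KNOWN_ACCELERATORS = ("cuda", "rocm", "cpu")

    def split_accelerator(platform_tag: str) -> tuple[str, str | None]:
        base, _, suffix = platform_tag.rpartition(".")
        if base and suffix.startswith(KNOWN_ACCELERATORS):
            return base, suffix
        return platform_tag, None

    print(split_accelerator("manylinux2014_x86_64.cuda124"))
    # ('manylinux2014_x86_64', 'cuda124')
    print(split_accelerator("win_amd64"))
    # ('win_amd64', None)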

2 Environment markers

Marker         Type   Example   Meaning (runtime)
cuda_version   str    "12.4"    Highest CUDA driver API level present
rocm_version   str    "5.7"     ROCm driver level
has_cuda       bool   true      ≥1 CUDA‑capable GPU detected
has_rocm       bool   false     ≥1 ROCm‑capable GPU detected

Example usage:

flash-attn ; cuda_version >= "12.0"
torch      ; has_cuda == true
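
To make the semantics concrete, this is the check those two requirement lines imply, written out in plain Python (packaging's marker evaluator does not know these variables today, so this only sketches the intended, version-aware comparison):

    from packaging.version import Version

    # Marker environment as the detector would (hypothetically) inject it.
    gpu_env = {"cuda_version": "12.4", "has_cuda": True, "has_rocm": False}

    # flash-attn ; cuda_version >= "12.0"
    install_flash_attn = (
        gpu_env["has_cuda"] and Version(gpu_env["cuda_version"]) >= Version("12.0")
    )

    # torch ; has_cuda == true
    install_torch = gpu_env["has_cuda"]

    print(install_flash_attn, install_torch)  # True True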

3 Local‑version wheels (+cu124)

Supported for transition; must be pinned exactly (torch==2.2.0+cu124) and remain disallowed on the main PyPI index.
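
For example, this mirrors the current PyTorch workflow, using a version/index pair that exists today:

    pip install torch==2.2.0+cu121 --index-url https://download.pytorch.org/whl/cu121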

4 Hardware detection — gpu‑pyres

  • Pure‑Python; 2 s default per probe (configurable).
  • Emits env vars PIP_CUDA_VERSION, PIP_ROCM_VERSION, HAS_CUDA, HAS_ROCM.
  • Multi‑GPU rigs: highest capability by default; override via --gpu-index N / PIP_GPU_INDEX=N.
  • Public test matrix (bare metal, WSL2, Docker) gates releases.
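
A sketch of what a single probe might look like (the function name is illustrative; it assumes nvidia-smi's "CUDA Version:" banner line and the 2 s default budget):

    import re
    import subprocess

    def detect_cuda_version(timeout_s: float = 2.0) -> str | None:
        """Best-effort probe for the CUDA driver API level."""
        try:
            # nvidia-smi's banner reports e.g. "CUDA Version: 12.4".
            out = subprocess.run(
                ["nvidia-smi"], capture_output=True, text=True, timeout=timeout_s
            ).stdout
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return None  # no NVIDIA driver, or the probe exceeded its budget
        match = re.search(r"CUDA Version:\s*([\d.]+)", out)
        return match.group(1) if match else None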

User overrides

Purpose                  Variable / Flag
Force version            PIP_CUDA_VERSION, PIP_ROCM_VERSION
Force CPU                PIP_FORCE_CPU=true
Skip detection           PIP_GPU_SKIP_DETECT=1
Pick specific GPU index  --gpu-index N or PIP_GPU_INDEX=N

A pip plugin must respect any preset vars and only detect unset markers.
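
In code, that precedence might look like this (a hypothetical helper, not an existing pip API):

    import os

    def resolve_cuda_marker(detect) -> str | None:
        """Return the cuda_version marker value, honouring user overrides."""
        if os.environ.get("PIP_FORCE_CPU", "").lower() == "true":
            return None  # forced CPU resolution
        preset = os.environ.get("PIP_CUDA_VERSION")
        if preset:
            return preset  # preset vars always win and are never re-detected
        if os.environ.get("PIP_GPU_SKIP_DETECT") == "1":
            return None  # detection disabled, marker stays unset
        return detect()  # e.g. the gpu-pyres probe sketched in section 4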

5 Integration modes

  • Mode A – pip plugin (requires resolver‑hook PEP)

    pip install torch --prefer-gpu

  • Mode B – wrapper

    gpu-pyres detect --export-env && pip install torch --prefer-gpu

6 Resolver flow (--prefer-gpu)

With detected cuda_version = 12.4:

  1. .cuda124
  2. Decrease minor: .cuda123 … .cuda120
  3. Previous majors down to PIP_GPU_MIN_VERSION (default "11.0"): .cuda118 …
  4. CPU fallback.
  5. Without --prefer-gpu, the resolver behaves exactly as it does today.
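
The candidate ordering could be generated roughly as follows (a sketch: the minor-version walk assumes minors 0-9 per major, where a real implementation would consult the list of actually released CUDA versions):

    def cuda_tag_preference(detected: str, minimum: str = "11.0") -> list[str]:
        """Candidate accelerator suffixes, most preferred first."""
        major, minor = (int(p) for p in detected.split("."))
        floor = tuple(int(p) for p in minimum.split("."))
        tags = []
        while (major, minor) >= floor:
            tags.append(f".cuda{major}{minor}")
            if minor > 0:
                minor -= 1
            else:
                major, minor = major - 1, 9  # simplification; real minors vary
        tags.append(".cpu")  # step 4: CPU fallback
        return tags

    print(cuda_tag_preference("12.4"))
    # ['.cuda124', '.cuda123', ..., '.cuda110', '.cpu']  (middle entries elided)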

7 Build backend hooks

setuptools

[tool.setuptools-gpu]
gpu-tag = "auto"          # detect at build
timeout = 2000            # ms; default 2000 if omitted

Cross‑backend

Entry‑point group gpu_taggers returns the tag; Hatchling/Flit/PDM may implement helpers.
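
A hypothetical tagger could be as small as the following (module path and detection shortcut are illustrative; a real tagger would reuse the gpu-pyres probe):

    # mypkg/tagger.py, registered under the gpu_taggers entry-point group
    import shutil

    def gpu_tag() -> str:
        """Return the accelerator suffix for the current build host."""
        # Shortcut for the sketch: only check whether the NVIDIA driver
        # utility is on PATH instead of running full detection.
        if shutil.which("nvidia-smi"):
            return ".cuda124"  # placeholder; detection would report the real level
        return ".cpu"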

8 Extras

[project.optional-dependencies]
gpu  = ["torch ; has_cuda == true"]
cpu  = ["torch ; has_cuda == false"]
rocm = ["torch ; has_rocm == true"]

pip install mypkg[gpu] --prefer-gpu resolves using injected markers.

9 Version semantics

Version numbers map to the driver API level reported by nvidia-smi, not to the CUDA toolkit build version.

10 Extensibility

Grammar supports future accelerators (e.g., .oneapi2024, .metal3).


Reproducibility & CI

  • Detection is opt‑in.
  • Selected wheels are recorded and pinned verbatim, accelerator tag included.
  • GPU‑less CI: set PIP_FORCE_CPU=true.
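
For example, a GPU-less CI job stays deterministic with:

    PIP_FORCE_CPU=true pip install -r requirements.txt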

Error handling

  • If gpu‑pyres fails, it prints a clear message and the plugin falls back to CPU markers.
  • pip never hangs on detection.

Environment variables reference

Variable                              Meaning
PIP_CUDA_VERSION / PIP_ROCM_VERSION   Override detected driver level
HAS_CUDA / HAS_ROCM                   Boolean markers (set by the detector)
PIP_PREFER_GPU                        Same as --prefer-gpu
PIP_GPU_INDEX                         Select GPU by PCI index
PIP_FORCE_CPU                         Force CPU resolution
PIP_GPU_SKIP_DETECT                   Disable detector
PIP_GPU_MIN_VERSION                   Lowest driver level the resolver may consider

Reference implementation

  • gpu‑pyres 0.1 with public test matrix and 2 s probe time‑outs.
  • pip plugin prototype after resolver‑hook PEP merges.
  • setuptools-gpu plus gpu_taggers entry‑point; helpers for Hatchling/Flit/PDM.
  • Coordination with major ML libraries to publish dual‑tag wheels.
  • Warehouse PR to accept .cuda*, .rocm*, .cpu tags.

Author

d8ahazard

Hey @d8ahazard! You’d probably be interested in WheelNext, an initiative to solve problems like “how do we package accelerated Python code so users get libraries for their platform(s)?”

In particular, the Variants Proposal is a more generic mechanism to achieve what you are proposing here. Please join us if you’re interested in working together!


Hi @d8ahazard, thanks for sharing your thoughts here.

"Add an accelerator suffix to the existing platform tag"

Note that per https://peps.python.org/pep-0425/#compressed-tag-sets, the . character is already in use to delineate multiple tags in a compressed tag set, so using it to create an ‘accelerator suffix’ here would probably not be possible.

Since GPUs are not mutually exclusive, how would a resolver handle a platform that might have both CUDA and ROCm available to automatically select a single wheel without the user explicitly specifying it? Are ‘capabilities’ comparable across these architectures? Would specification always be required?

There is a lot of prior discussion about these issues at https://discuss.python.org/t/what-to-do-about-gpus-and-the-built-distributions-that-support-them/, you may want to review that thread if you haven’t already.


I'd like to elaborate on why the wheel-variant problem is more complex than this PEP proposal suggests. The CUDA or ROCm version alone is not sufficient to select the right package. The package also needs to include metadata about which GPUs are supported. For example, NVIDIA CUDA 12.8 supports CUDA compute capabilities 5.0 (Maxwell) through 12.0 (Blackwell). AMD ROCm has the AMD GCN ISAs (e.g. gfx942 for Instinct MI300 cards). There are also relationships between driver version and CUDA/ROCm version. You somehow need to include this information in the wheel metadata and then implement logic to select the right wheel variant.

Example: a system with a GH200 (Grace Hopper) on ARM64 SBSA with NVIDIA driver 550.163.01 needs a wheel that has been compiled for aarch64, targets CUDA 12.4.x or below, and includes CUDA compute capability 9.0.
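
Concretely, the selection constraints in that example go beyond a single version tag. A sketch with illustrative data structures (values taken from the scenario above):

    from packaging.version import Version

    # GH200 example: driver 550.163.01 exposes CUDA 12.4; the GPU reports
    # compute capability 9.0; the platform is aarch64 (ARM64 SBSA).
    system = {"arch": "aarch64", "cuda": Version("12.4"), "cc": "9.0"}

    def wheel_usable(wheel: dict) -> bool:
        return (
            wheel["arch"] == system["arch"]               # CPU architecture
            and Version(wheel["cuda"]) <= system["cuda"]  # driver vs. CUDA version
            and system["cc"] in wheel["compute_caps"]     # binary targets this GPU
        )

    print(wheel_usable({"arch": "aarch64", "cuda": "12.4", "compute_caps": {"9.0"}}))  # True
    print(wheel_usable({"arch": "aarch64", "cuda": "12.8", "compute_caps": {"9.0"}}))  # False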
