PEP Proposal - Platform‑Aware GPU Packaging and Installation for Python

(Please be gentle, it’s my first submission here)

Abstract

This PEP standardizes packaging and installation of GPU‑accelerated Python libraries by introducing:

  • GPU‑specific wheel platform tags
  • GPU‑specific environment markers (cuda_version, rocm_version, has_cuda, has_rocm)
  • gpu‑pyres – an optional, pluggable hardware‑detection utility
  • Backward compatibility with today's +cuXXX local-version wheels
  • Build‑backend hooks and installation extras (e.g. [gpu])
  • Explicit user‑override mechanisms

The goal is an opt‑in workflow that lets pip (or wrappers) install the best CUDA, ROCm, or CPU variant automatically while preserving reproducibility.


Motivation

GPU builds are fragile today:

  • Multiple binaries (CPU, several CUDA toolkits, ROCm)
  • PyPI forbids publishing several different binaries under one version string (and rejects local version identifiers)
  • Users juggle custom indexes and manual downloads
  • CI reproducibility breaks across GPU/CPU hosts

A standards‑driven solution removes guesswork and custom scripts.


Proposal

1 Wheel platform tags

Add an accelerator suffix to the existing platform tag:

manylinux2014_x86_64.cuda124
manylinux2014_x86_64.rocm57
manylinux2014_x86_64.cpu
win_amd64.cuda121

Examples:

torch-2.2.0-cp310-cp310-manylinux2014_x86_64.cuda124.whl
torch-2.2.0-cp310-cp310-win_amd64.rocm57.whl
torch-2.2.0-cp310-cp310-manylinux2014_x86_64.cpu.whl
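
A resolver could treat everything after the final dot as the accelerator suffix. A minimal sketch of that split, assuming the last-dot rule proposed here (the helper is hypothetical, not part of any existing library):

    # Hypothetical helper: split a proposed platform tag into base
    # platform and accelerator suffix.
    KNOWN_ACCELERATORS = ("cuda", "rocm", "cpu")

    def split_accelerator(platform_tag: str) -> tuple[str, str | None]:
        base, _, suffix = platform_tag.rpartition(".")
        if base and suffix.startswith(KNOWN_ACCELERATORS):
            return base, suffix
        return platform_tag, None

    print(split_accelerator("manylinux2014_x86_64.cuda124"))
    # ('manylinux2014_x86_64', 'cuda124')
    print(split_accelerator("win_amd64"))
    # ('win_amd64', None)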

2 Environment markers

Marker         Type   Example   Meaning (runtime)
cuda_version   str    "12.4"    Highest CUDA driver API level present
rocm_version   str    "5.7"     ROCm driver level
has_cuda       bool   true      ≥1 CUDA‑capable GPU detected
has_rocm       bool   false     ≥1 ROCm‑capable GPU detected

Example usage:

flash-attn ; cuda_version >= "12.0"
torch      ; has_cuda == true
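
To make the semantics concrete, this is the check those two requirement lines imply, written out in plain Python (packaging's marker evaluator does not know these variables today, so this only sketches the intended, version-aware comparison):

    from packaging.version import Version

    # Marker environment as the detector would (hypothetically) inject it.
    gpu_env = {"cuda_version": "12.4", "has_cuda": True, "has_rocm": False}

    # flash-attn ; cuda_version >= "12.0"
    install_flash_attn = (
        gpu_env["has_cuda"] and Version(gpu_env["cuda_version"]) >= Version("12.0")
    )

    # torch ; has_cuda == true
    install_torch = gpu_env["has_cuda"]

    print(install_flash_attn, install_torch)  # True True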

3 Local‑version wheels (+cu124)

Supported for transition; must be pinned exactly (torch==2.2.0+cu124) and remain disallowed on the main PyPI index.
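
For example, this mirrors the current PyTorch workflow, using a version/index pair that exists today:

    pip install torch==2.2.0+cu121 --index-url https://download.pytorch.org/whl/cu121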

4 Hardware detection — gpu‑pyres

  • Pure‑Python; 2 s default per probe (configurable).
  • Emits env vars PIP_CUDA_VERSION, PIP_ROCM_VERSION, HAS_CUDA, HAS_ROCM.
  • Multi‑GPU rigs: highest capability by default; override via --gpu-index N / PIP_GPU_INDEX=N.
  • Public test matrix (bare metal, WSL2, Docker) gates releases.
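
A sketch of what a single probe might look like (the function name is illustrative; it assumes nvidia-smi's "CUDA Version:" banner line and the 2 s default budget):

    import re
    import subprocess

    def detect_cuda_version(timeout_s: float = 2.0) -> str | None:
        """Best-effort probe for the CUDA driver API level."""
        try:
            # nvidia-smi's banner reports e.g. "CUDA Version: 12.4".
            out = subprocess.run(
                ["nvidia-smi"], capture_output=True, text=True, timeout=timeout_s
            ).stdout
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return None  # no NVIDIA driver, or the probe exceeded its budget
        match = re.search(r"CUDA Version:\s*([\d.]+)", out)
        return match.group(1) if match else None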

User overrides

Purpose                  Variable / Flag
Force version            PIP_CUDA_VERSION, PIP_ROCM_VERSION
Force CPU                PIP_FORCE_CPU=true
Skip detection           PIP_GPU_SKIP_DETECT=1
Pick specific GPU index  --gpu-index N or PIP_GPU_INDEX=N

A pip plugin must respect any preset vars and only detect unset markers.
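
In code, that precedence might look like this (a hypothetical helper, not an existing pip API):

    import os

    def resolve_cuda_marker(detect) -> str | None:
        """Return the cuda_version marker value, honouring user overrides."""
        if os.environ.get("PIP_FORCE_CPU", "").lower() == "true":
            return None  # forced CPU resolution
        preset = os.environ.get("PIP_CUDA_VERSION")
        if preset:
            return preset  # preset vars always win and are never re-detected
        if os.environ.get("PIP_GPU_SKIP_DETECT") == "1":
            return None  # detection disabled, marker stays unset
        return detect()  # e.g. the gpu-pyres probe sketched in section 4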

5 Integration modes

  • Mode A – pip plugin (requires resolver‑hook PEP)

    pip install torch --prefer-gpu

  • Mode B – wrapper

    gpu-pyres detect --export-env && pip install torch --prefer-gpu

6 Resolver flow (--prefer-gpu)

With detected cuda_version = 12.4:

  1. .cuda124
  2. Decrease minor: .cuda123 … .cuda120
  3. Previous majors down to PIP_GPU_MIN_VERSION (default "11.0"): .cuda118 …
  4. CPU fallback.
  5. Without --prefer-gpu, the resolver behaves exactly as it does today.
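
The candidate ordering could be generated roughly as follows (a sketch: the minor-version walk assumes minors 0-9 per major, where a real implementation would consult the list of actually released CUDA versions):

    def cuda_tag_preference(detected: str, minimum: str = "11.0") -> list[str]:
        """Candidate accelerator suffixes, most preferred first."""
        major, minor = (int(p) for p in detected.split("."))
        floor = tuple(int(p) for p in minimum.split("."))
        tags = []
        while (major, minor) >= floor:
            tags.append(f".cuda{major}{minor}")
            if minor > 0:
                minor -= 1
            else:
                major, minor = major - 1, 9  # simplification; real minors vary
        tags.append(".cpu")  # step 4: CPU fallback
        return tags

    print(cuda_tag_preference("12.4"))
    # ['.cuda124', '.cuda123', ..., '.cuda110', '.cpu']  (middle entries elided)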

7 Build backend hooks

setuptools

[tool.setuptools-gpu]
gpu-tag = "auto"          # detect at build
timeout = 2000            # ms; default 2000 if omitted

Cross‑backend

Entry‑point group gpu_taggers returns the tag; Hatchling/Flit/PDM may implement helpers.
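
A hypothetical tagger could be as small as the following (module path and detection shortcut are illustrative; a real tagger would reuse the gpu-pyres probe):

    # mypkg/tagger.py, registered under the gpu_taggers entry-point group
    import shutil

    def gpu_tag() -> str:
        """Return the accelerator suffix for the current build host."""
        # Shortcut for the sketch: only check whether the NVIDIA driver
        # utility is on PATH instead of running full detection.
        if shutil.which("nvidia-smi"):
            return ".cuda124"  # placeholder; detection would report the real level
        return ".cpu"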

8 Extras

[project.optional-dependencies]
gpu  = ["torch ; has_cuda == true"]
cpu  = ["torch ; has_cuda == false"]
rocm = ["torch ; has_rocm == true"]

pip install mypkg[gpu] --prefer-gpu resolves using injected markers.

9 Version semantics

Version numbers map to the driver API level reported by nvidia-smi, not to the CUDA toolkit build version.

10 Extensibility

Grammar supports future accelerators (e.g., .oneapi2024, .metal3).


Reproducibility & CI

  • Detection is opt‑in.
  • Selected wheels are recorded and pinned verbatim, accelerator tag included.
  • GPU‑less CI: set PIP_FORCE_CPU=true.
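
For example, a GPU-less CI job stays deterministic with:

    PIP_FORCE_CPU=true pip install -r requirements.txt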

Error handling

  • If gpu‑pyres fails, it prints a clear message and the plugin falls back to CPU markers.
  • pip never hangs on detection.

Environment variables reference

Variable                              Meaning
PIP_CUDA_VERSION / PIP_ROCM_VERSION   Override detected driver level
HAS_CUDA / HAS_ROCM                   Boolean markers (set by the detector)
PIP_PREFER_GPU                        Same as --prefer-gpu
PIP_GPU_INDEX                         Select GPU by PCI index
PIP_FORCE_CPU                         Force CPU resolution
PIP_GPU_SKIP_DETECT                   Disable detector
PIP_GPU_MIN_VERSION                   Lowest driver level the resolver may consider

Reference implementation

  • gpu‑pyres 0.1 with public test matrix and 2 s probe time‑outs.
  • pip plugin prototype after resolver‑hook PEP merges.
  • setuptools-gpu plus gpu_taggers entry‑point; helpers for Hatchling/Flit/PDM.
  • Coordination with major ML libraries to publish dual‑tag wheels.
  • Warehouse PR to accept .cuda*, .rocm*, .cpu tags.

Author

d8ahazard

Hey @d8ahazard! You’d probably be interested in WheelNext, an initiative to solve problems like “how do we package accelerated Python code so users get libraries for their platform(s)?”

In particular, the Variants Proposal is a more generic mechanism to achieve what you are proposing here. Please join us if you’re interested in working together!


Hi @d8ahazard, thanks for sharing your thoughts here.

"Add an accelerator suffix to the existing platform tag"

Note that per https://peps.python.org/pep-0425/#compressed-tag-sets, the . character is already in use to delineate multiple tags in a compressed tag set, so using it to create an ‘accelerator suffix’ here would probably not be possible.

Since GPUs are not mutually exclusive, how would a resolver handle a platform that might have both CUDA and ROCm available to automatically select a single wheel without the user explicitly specifying it? Are ‘capabilities’ comparable across these architectures? Would specification always be required?

There is a lot of prior discussion about these issues at https://discuss.python.org/t/what-to-do-about-gpus-and-the-built-distributions-that-support-them/, you may want to review that thread if you haven’t already.


I'd like to elaborate on why the wheel-variant problem is more complex than this PEP proposal suggests. The CUDA or ROCm version alone is not sufficient to select the right package. The package also needs to include metadata about which GPUs are supported. For example, NVIDIA CUDA 12.8 supports CUDA compute capabilities 5.0 (Maxwell) through 12.0 (Blackwell). AMD ROCm has the AMD GCN ISAs (e.g. gfx942 for Instinct MI300 cards). There are also relationships between driver version and CUDA/ROCm version. You somehow need to include this information in the wheel metadata and then implement logic to select the right wheel variant.

Example: a system with a GH200 (Grace Hopper) on ARM64 SBSA with NVIDIA driver 550.163.01 needs a wheel that has been compiled for aarch64, targets CUDA 12.4.x or below, and includes CUDA compute capability 9.0.
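
Concretely, the selection constraints in that example go beyond a single version tag. A sketch with illustrative data structures (values taken from the scenario above):

    from packaging.version import Version

    # GH200 example: driver 550.163.01 exposes CUDA 12.4; the GPU reports
    # compute capability 9.0; the platform is aarch64 (ARM64 SBSA).
    system = {"arch": "aarch64", "cuda": Version("12.4"), "cc": "9.0"}

    def wheel_usable(wheel: dict) -> bool:
        return (
            wheel["arch"] == system["arch"]               # CPU architecture
            and Version(wheel["cuda"]) <= system["cuda"]  # driver vs. CUDA version
            and system["cc"] in wheel["compute_caps"]     # binary targets this GPU
        )

    print(wheel_usable({"arch": "aarch64", "cuda": "12.4", "compute_caps": {"9.0"}}))  # True
    print(wheel_usable({"arch": "aarch64", "cuda": "12.8", "compute_caps": {"9.0"}}))  # False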
