(Please be gentle, it’s my first submission here)
## Abstract

This PEP standardizes packaging and installation of GPU‑accelerated Python libraries by introducing:

- GPU‑specific wheel platform tags
- GPU‑specific environment markers (`cuda_version`, `rocm_version`, `has_cuda`, `has_rocm`)
- `gpu-pyres` – an optional, pluggable hardware‑detection utility
- Backward‑compatibility support for current `+cuXXX` wheels
- Build‑backend hooks and installation extras (e.g. `[gpu]`)
- Explicit user‑override mechanisms

The goal is an opt‑in workflow that lets pip (or wrappers) install the best CUDA, ROCm, or CPU variant automatically while preserving reproducibility.
## Motivation
GPU builds are fragile today:
- Multiple binaries (CPU, several CUDA toolkits, ROCm)
- PyPI forbids duplicate version strings with different binaries
- Users juggle custom indexes and manual downloads
- CI reproducibility breaks across GPU/CPU hosts
A standards‑driven solution removes guesswork and custom scripts.
## Proposal

### 1. Wheel platform tags
Add an accelerator suffix to the existing platform tag:

```
manylinux2014_x86_64.cuda124
manylinux2014_x86_64.rocm57
manylinux2014_x86_64.cpu
win_amd64.cuda121
```

Examples:

```
torch-2.2.0-cp310-cp310-manylinux2014_x86_64.cuda124.whl
torch-2.2.0-cp310-cp310-win_amd64.rocm57.whl
torch-2.2.0-cp310-cp310-manylinux2014_x86_64.cpu.whl
```
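A minimal sketch of how a tool might split the proposed grammar into base tag and accelerator suffix (the function name and regex are illustrative, not part of the proposal itself):

```python
import re

def split_accelerator(platform_tag: str) -> tuple[str, str]:
    """Split a platform tag into (base, accelerator suffix).

    Sketch only: assumes the accelerator is a trailing dot-separated
    component of the form cpu, cudaNNN, or rocmNN; future accelerator
    names would extend the alternation.
    """
    m = re.fullmatch(r"(.+)\.(cpu|cuda\d+|rocm\d+)", platform_tag)
    if m is None:
        return platform_tag, ""  # legacy tag with no accelerator suffix
    return m.group(1), m.group(2)

print(split_accelerator("manylinux2014_x86_64.cuda124"))
# → ('manylinux2014_x86_64', 'cuda124')
```

Legacy tags without a suffix simply come back with an empty accelerator, so existing wheels keep resolving as before.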
### 2. Environment markers

| Marker | Type | Example | Meaning (runtime) |
|---|---|---|---|
| `cuda_version` | str | `"12.4"` | Highest CUDA driver API level present |
| `rocm_version` | str | `"5.7"` | ROCm driver level |
| `has_cuda` | bool | `true` | ≥1 CUDA‑capable GPU detected |
| `has_rocm` | bool | `false` | ≥1 ROCm‑capable GPU detected |

Example usage:

```
flash-attn ; cuda_version >= "12.0"
torch ; has_cuda == true
```
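Today's `packaging.markers` grammar rejects unknown variables, so an implementation would have to extend PEP 508 evaluation. A stdlib-only sketch of evaluating a single clause against detected values (the function and the environment dict are illustrative):

```python
import operator
import re

OPS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge,
       "<=": operator.le, ">": operator.gt, "<": operator.lt}

def _ver(s: str) -> tuple[int, ...]:
    """Parse a dotted version string into a comparable tuple."""
    return tuple(int(p) for p in s.split("."))

def evaluate_gpu_marker(marker: str, env: dict) -> bool:
    """Evaluate one 'name OP value' clause, e.g. 'cuda_version >= "12.0"'."""
    m = re.fullmatch(r'\s*(\w+)\s*(==|!=|>=|<=|>|<)\s*"?([\w.]+)"?\s*', marker)
    if m is None:
        raise ValueError(f"unsupported marker: {marker!r}")
    name, op, literal = m.groups()
    value = env[name]
    if isinstance(value, bool):                 # has_cuda / has_rocm
        return OPS[op](value, literal == "true")
    return OPS[op](_ver(value), _ver(literal))  # version comparison

env = {"cuda_version": "12.4", "has_cuda": True, "has_rocm": False}
print(evaluate_gpu_marker('cuda_version >= "12.0"', env))  # → True
print(evaluate_gpu_marker('has_rocm == true', env))        # → False
```

A real implementation would of course reuse the full PEP 508 grammar (parenthesized `and`/`or` clauses) rather than single comparisons.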
### 3. Local‑version wheels (`+cu124`)

Supported for the transition period; such wheels must be pinned exactly (e.g. `torch==2.2.0+cu124`) and remain disallowed on the main PyPI index.
### 4. Hardware detection — gpu‑pyres

- Pure Python; 2 s default time‑out per probe (configurable).
- Emits the environment variables `PIP_CUDA_VERSION`, `PIP_ROCM_VERSION`, `HAS_CUDA`, `HAS_ROCM`.
- Multi‑GPU rigs: the highest‑capability device is chosen by default; override via `--gpu-index N` / `PIP_GPU_INDEX=N`.
- A public test matrix (bare metal, WSL2, Docker) gates releases.
#### User overrides

| Purpose | Variable / Flag |
|---|---|
| Force version | `PIP_CUDA_VERSION`, `PIP_ROCM_VERSION` |
| Force CPU | `PIP_FORCE_CPU=true` |
| Skip detection | `PIP_GPU_SKIP_DETECT=1` |
| Pick specific GPU index | `--gpu-index N` or `PIP_GPU_INDEX=N` |

A pip plugin must respect any preset variables and only detect markers that are unset.
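That precedence rule could be sketched like this (the `detect_fn` callable stands in for gpu‑pyres, and the logic covers only the CUDA variables for brevity):

```python
import os

def effective_cuda_markers(detect_fn, environ=None) -> dict:
    """Respect preset override variables; detect only what is unset.

    Sketch of the plugin-side precedence: PIP_FORCE_CPU wins, then a
    preset PIP_CUDA_VERSION, then detection unless PIP_GPU_SKIP_DETECT.
    """
    env = os.environ if environ is None else environ
    if env.get("PIP_FORCE_CPU") == "true":
        return {"has_cuda": False}
    preset = env.get("PIP_CUDA_VERSION")
    if preset:
        return {"has_cuda": True, "cuda_version": preset}
    if env.get("PIP_GPU_SKIP_DETECT") == "1":
        return {}
    detected = detect_fn()  # e.g. probes nvidia-smi; may return None
    if detected:
        return {"has_cuda": True, "cuda_version": detected}
    return {"has_cuda": False}

# A preset variable short-circuits detection entirely:
print(effective_cuda_markers(lambda: "12.4",
                             environ={"PIP_CUDA_VERSION": "12.1"}))
# → {'has_cuda': True, 'cuda_version': '12.1'}
```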
### 5. Integration modes

- **Mode A – pip plugin** (requires a resolver‑hook PEP):
  `pip install torch --prefer-gpu`
- **Mode B – wrapper:**
  `gpu-pyres detect --export-env && pip install torch --prefer-gpu`
### 6. Resolver flow (`--prefer-gpu`)

With detected `cuda_version = 12.4`:

1. Exact match: `.cuda124`
2. Decrease minor: `.cuda123` … `.cuda120`
3. Previous majors down to `PIP_GPU_MIN_VERSION` (default `"11.0"`): `.cuda118` …
4. CPU fallback.

Without `--prefer-gpu` the resolver behaves as it does today.
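The fallback order above could be generated as follows (a sketch that assumes minor versions occupy a single digit, matching the two‑digit tags shown earlier):

```python
def candidate_tags(cuda_version: str, min_version: str = "11.0") -> list[str]:
    """List accelerator suffixes in --prefer-gpu resolution order."""
    major, minor = (int(p) for p in cuda_version.split("."))
    lo = tuple(int(p) for p in min_version.split("."))
    tags = []
    while (major, minor) >= lo:
        tags.append(f".cuda{major}{minor}")
        if minor:
            minor -= 1
        else:                              # roll over to previous major series
            major, minor = major - 1, 9    # assumption: minors run 0-9
    tags.append(".cpu")                    # final CPU fallback
    return tags

print(candidate_tags("12.4")[:6])
# → ['.cuda124', '.cuda123', '.cuda122', '.cuda121', '.cuda120', '.cuda119']
```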
### 7. Build‑backend hooks

setuptools:

```toml
[tool.setuptools-gpu]
gpu-tag = "auto"   # detect at build
timeout = 2000     # ms; default 2000 if omitted
```

Cross‑backend: an entry‑point group `gpu_taggers` returns the tag; Hatchling/Flit/PDM may implement helpers.
### 8. Extras

```toml
[project.optional-dependencies]
gpu = ["torch ; has_cuda == true"]
cpu = ["torch ; has_cuda == false"]
rocm = ["torch ; has_rocm == true"]
```

`pip install mypkg[gpu] --prefer-gpu` resolves using the injected markers.
### 9. Version semantics

Numbers map to the driver API level (as reported by `nvidia-smi`), not to the toolkit build version.
### 10. Extensibility

The tag grammar supports future accelerators (e.g., `.oneapi2024`, `.metal3`).
## Reproducibility & CI

- Detection is opt‑in.
- Selected wheels are pinned verbatim.
- GPU‑less CI: set `PIP_FORCE_CPU=true`.
## Error handling

- If gpu‑pyres fails, it prints a clear message and the plugin falls back to CPU markers.
- pip never hangs on detection; probes are bounded by the configurable time‑out.
## Environment variables reference

| Variable | Meaning |
|---|---|
| `PIP_CUDA_VERSION` / `PIP_ROCM_VERSION` | Override the detected driver level |
| `HAS_CUDA` / `HAS_ROCM` | Boolean markers (set by the detector) |
| `PIP_PREFER_GPU` | Same as `--prefer-gpu` |
| `PIP_GPU_INDEX` | Select GPU by PCI index |
| `PIP_FORCE_CPU` | Force CPU resolution |
| `PIP_GPU_SKIP_DETECT` | Disable the detector |
| `PIP_GPU_MIN_VERSION` | Lowest driver level the resolver may consider |
## Reference implementation

- `gpu-pyres` 0.1 with a public test matrix and 2 s probe time‑outs.
- A pip plugin prototype, once a resolver‑hook PEP merges.
- `setuptools-gpu` plus the `gpu_taggers` entry point; helpers for Hatchling/Flit/PDM.
- Coordination with major ML libraries to publish dual‑tag wheels.
- A Warehouse PR to accept `.cuda*`, `.rocm*`, `.cpu` tags.
## Author
d8ahazard