Indeed. To put it another way, what’s the backward compatibility plan here? Do installers treat existing sdists as “OK to use for installing” or not?
For what it’s worth, I did a very superficial analysis of how many projects on PyPI would be affected. Looking at the latest version of every project on PyPI as of early March (the most recent snapshot I had lying around), I see:
- 147,211 (28%) only distribute wheels
- 103,155 (20%) only distribute sdists
- 193,048 (37%) distribute both
- 77,782 (15%) distribute neither (assumed to be projects with no files, or which only distribute obsolete formats such as eggs or .zip format sdists)
From this, I imagine we can conclude that if we did add a new “not for installing” type of sdist, roughly 20% of all projects would ignore it because they only distribute sdists, so they “obviously” want them to be usable for installs. A further 28% would likely do nothing because they don’t distribute sdists now, and while they might choose to start distributing “not for install” sdists, I suspect they’ve made their choice and won’t think it’s worth changing. We can ignore the 15% that don’t distribute wheels or sdists.
That leaves 37% (around 193k projects) that might switch to the new form of sdist. Of those, 188,161 (97%) distribute a generic wheel alongside the sdist, whereas 4,887 (3%) don’t. The 97% would gain nothing from a switch, because installers will always prefer the generic wheel over the sdist anyway.
So we’re down to 4,887 projects out of half a million - just under 1%. From a quick scan I can’t tell how many of those are significant, but the list does include obvious cases like numpy, scipy, pandas and matplotlib.
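As a sanity check on the arithmetic, the funnel above can be reproduced in a few lines. The counts are copied from this post; the variable names are only illustrative and are not part of the original analysis script.

```python
# Counts quoted in the post (latest version of each project in the snapshot).
counts = {
    "wheels only": 147_211,
    "sdists only": 103_155,
    "both": 193_048,
    "neither": 77_782,
}
total = sum(counts.values())

# Share of all projects in each category, rounded to a whole percent.
shares = {kind: round(100 * n / total) for kind, n in counts.items()}

# Within "both": projects with and without a generic wheel.
with_generic, without_generic = 188_161, 4_887
assert with_generic + without_generic == counts["both"]

# Projects that might benefit from a new sdist type, as a share of everything.
affected_pct = 100 * without_generic / total

print(total, shares, f"{affected_pct:.2f}%")
```

The four categories add up to 521,196 projects, so “half a million” is a slight round-down and the final figure comes out at about 0.94%.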
None of the above is intended to argue for or against the proposal, just to provide some context in terms of numbers. I do think it’s worth remembering that for the overwhelming majority of projects, this is not a problem that needs to be solved, though…
Code used to do the analysis
Yes, this is ugly, it was a quick hack. The JSON file I used looked like
{"projects": [{"name": "...", "files": [{"filename": "..."}, ...]}, ...]}
(plus other data I ignored).
import json
from itertools import groupby
from collections import Counter

with open("PyPI_simple.2024-03-07-11-26.json", "rb") as f:
    data = json.load(f)

# Map each project name to the list of filenames it has uploaded.
project_files = {}
for p in data["projects"]:
    name = p["name"]
    files = [f["filename"] for f in p.get("files", [])]
    project_files[name] = files

def fv(file, project):
    # Version part of a wheel or .tar.gz sdist filename; None for anything else.
    if file.endswith(".whl"): return file.split("-")[1]
    if not file.endswith(".tar.gz"): return None
    if not file.startswith(project + "-"): return None
    return file[len(project)+1:-7]

# project -> version -> files for that version.
# groupby only merges adjacent items, so this relies on the snapshot
# listing each version's files together.
pfvs = {
    n: {
        v: list(fs)
        for v, fs in groupby(project_files[n], lambda f: fv(f, n))
        if v is not None
    }
    for n in project_files
}

def types(l):
    has_sdist = any(f.endswith(".tar.gz") for f in l)
    has_wheels = any(f.endswith(".whl") for f in l)
    if has_wheels and not has_sdist: return "Wheels only"
    elif has_sdist and not has_wheels: return "Sdist only"
    else: return "Both"

# project -> version -> "Wheels only" / "Sdist only" / "Both".
pfvtypes = {
    n: {
        v: types(list(fs))
        for v, fs in groupby(project_files[n], lambda f: fv(f, n))
        if v is not None
    }
    for n in project_files
}

def has_generic_wheel(files):
    for f in files:
        if not f.endswith(".whl"): continue
        if "-none-any" in f: return True
    return False

# Tally the type of the last-listed version of each project
# (projects with no usable files show up as the empty tuple).
c = Counter(tuple(pfvtypes[p].values())[-1:] for p in pfvtypes)
print(c)

# Of the "Both" projects, how many ship a generic (-none-any) wheel?
generics = Counter([has_generic_wheel(list(pfvs[p].values())[-1]) for p in pfvtypes if list(pfvtypes[p].values())[-1:] == ["Both"]])
print(generics)

# The "Both" projects with no generic wheel.
sdist_is_generic = [p for p in pfvtypes if list(pfvtypes[p].values())[-1:] == ["Both"] and not has_generic_wheel(list(pfvs[p].values())[-1])]
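One caveat about the script above: `itertools.groupby` only merges adjacent items, so the grouping assumes that all files for a given version are listed together in the snapshot. If that doesn't hold, a dict-based grouping avoids the dependence on ordering. This is a minimal sketch, not part of the original analysis; `file_version`, `files_by_version` and the `demo` filenames are hypothetical names made up for illustration.

```python
from collections import defaultdict

def file_version(filename, project):
    """Same idea as fv(): version for wheels and .tar.gz sdists, else None."""
    if filename.endswith(".whl"):
        return filename.split("-")[1]
    if filename.endswith(".tar.gz") and filename.startswith(project + "-"):
        return filename[len(project) + 1:-len(".tar.gz")]
    return None

def files_by_version(project, filenames):
    """Group filenames by version without assuming they are listed contiguously."""
    grouped = defaultdict(list)
    for filename in filenames:
        version = file_version(filename, project)
        if version is not None:
            grouped[version].append(filename)
    return dict(grouped)

# Hypothetical file list in which the two 1.0 files are not adjacent.
example = [
    "demo-1.0.tar.gz",
    "demo-1.1.tar.gz",
    "demo-1.0-py3-none-any.whl",
]
print(files_by_version("demo", example))
```

Note that with this approach the “last” version in the dict is the one whose first file appears latest in the list, which may or may not match what the original script treats as the latest version.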