Preventing unwanted attempts to build sdists

With regards to the extra files, it’s alarming to me when a
project already uses version control and includes files that have
no traceable source. Some of that is heightened awareness after
the recent issues with xz, but even in a situation where the
extra files are reviewed and everything makes sense, it’s often
more fragile in the long term than if at least the code which
could be used to generate those files were checked in. This
raises all sorts of thoughts, from “this might be a path that
receives less testing” to “this project normally has a high
standard for code review, but this code, though provided by the
project, never went through it.”

Projects I’m involved in these days that put “extra” files in
their sdists do so in order to ship context from the Git
repository state at the time the sdist was generated. We don’t
want to duplicate that metadata in files checked into the
repository, but once the source code is taken out of and isolated
from the repository (source tarballs used and reshipped
downstream, e.g. in GNU/Linux distributions), we also don’t want
that context lost.

Keep in mind that sdists already have “extra” files that don’t come
directly from revision control worktrees, notably Python package
metadata.

I believe this proposal is incomplete unless it addresses backtracking in some form (even if it is to say you gotta purge PyPI of your regular sdists). If your latest version has a “manually built” sdist (or no sdist) and one of your older versions has a regular sdist which should be “manually built”, you are back to square one.

Indeed. To put it another way, what’s the backward compatibility plan here? Do installers treat existing sdists as “OK to use for installing” or not?

For what it’s worth, I did a very superficial analysis of how many projects on PyPI would be affected. Looking at the latest version of every project on PyPI from early March (the last date I had a snapshot for lying around), I see:

  • 147,211 (28%) only distribute wheels
  • 103,155 (20%) only distribute sdists
  • 193,048 (37%) distribute both
  • 77,782 (15%) distribute neither (assumed to be projects with no files, or which only distribute obsolete formats such as eggs or .zip format sdists)

From this, I imagine we can conclude that if we did add a new “not for installing” type of sdist, roughly 20% of all projects would ignore it because they only distribute sdists, so they “obviously” want them to be usable for installs. A further 28% would likely do nothing because they don’t distribute sdists now, and while they might choose to start distributing “not for install” sdists, I suspect they’ve made their choice and won’t think it’s worth changing. We can ignore the 15% that don’t distribute wheels or sdists.

That leaves 37% (around 200k projects) who might switch to the new form of sdist. Of those, 188,161 (97%) distribute a generic wheel alongside the sdist, whereas 4887 (3%) don’t. The 97% would gain nothing from a switch, because installers will always prefer the generic wheel over the sdist anyway.

So we’re down to 4887 projects out of half a million - just under 1%. From a quick scan [1] I don’t know how many of those are significant, but they do include obvious cases like numpy, scipy, pandas and matplotlib.

None of the above is intended to argue for or against the proposal, just to provide some context in terms of numbers. I do think it’s worth remembering that for the overwhelming majority of projects, this is not a problem that needs to be solved, though…

Code used to do the analysis

Yes, this is ugly, it was a quick hack. The JSON file I used looked like

{"projects": [{"name": "...", "files": [{"filename": "..."}, ...]}, ...]}

(plus other data I ignored).

import json
from itertools import groupby
from collections import Counter

with open("PyPI_simple.2024-03-07-11-26.json", "rb") as f:
    data = json.load(f)

# Map each project name to its list of distribution filenames.
project_files = {}
for p in data["projects"]:
    name = p["name"]
    files = [f["filename"] for f in p.get("files", [])]
    project_files[name] = files

def fv(file, project):
    # Extract the version from a filename; return None for anything
    # that isn't a wheel or a .tar.gz sdist.
    if file.endswith(".whl"):
        return file.split("-")[1]
    if not file.endswith(".tar.gz"):
        return None
    if not file.startswith(project + "-"):
        return None
    return file[len(project) + 1:-7]

# Group each project's files by version. Note that groupby() only groups
# consecutive items, so this assumes the index lists files in version order.
pfvs = {
    n: {
        v: list(fs)
        for v, fs in groupby(project_files[n], lambda f: fv(f, n))
        if v is not None
    }
    for n in project_files
}

def types(files):
    # Classify a version's files as wheels only, sdist only, or both.
    has_sdist = any(f.endswith(".tar.gz") for f in files)
    has_wheels = any(f.endswith(".whl") for f in files)
    if has_wheels and not has_sdist:
        return "Wheels only"
    elif has_sdist and not has_wheels:
        return "Sdist only"
    else:
        return "Both"

pfvtypes = {
    n: {
        v: types(list(fs))
        for v, fs in groupby(project_files[n], lambda f: fv(f, n))
        if v is not None
    }
    for n in project_files
}

def has_generic_wheel(files):
    # A "generic" wheel is a pure-Python, platform-independent one.
    for f in files:
        if not f.endswith(".whl"):
            continue
        if "-none-any" in f:
            return True
    return False

# Count distribution types for the latest version of each project.
c = Counter(tuple(pfvtypes[p].values())[-1:] for p in pfvtypes)
print(c)

# Of the projects distributing both, how many also ship a generic wheel?
generics = Counter(
    has_generic_wheel(list(pfvs[p].values())[-1])
    for p in pfvtypes
    if list(pfvtypes[p].values())[-1:] == ["Both"]
)
print(generics)

# Projects whose latest version ships both an sdist and wheels but no
# generic wheel, i.e. the sdist is the only platform-independent artifact.
sdist_is_generic = [
    p
    for p in pfvtypes
    if list(pfvtypes[p].values())[-1:] == ["Both"]
    and not has_generic_wheel(list(pfvs[p].values())[-1])
]

  1. although a3b2bbc3ced97675ac3a71df45f55ba seems like an odd name… ↩︎

6 Likes

I think this is an interesting idea, but my concern here is that it effectively tries to work around a temporary problem with a permanent wart.

What I mean by this is that if we were defining the ecosystem from scratch, there are probably two logical ways we could have solved this problem:

  • Have installers refuse to install from sdists (either at all, or by default).
  • Have projects be able to indicate whether their sdist should be automatically built or not.

Both of these solutions end up with the world in a pretty reasonable place, but the problem becomes how we transition from the status quo to one of them.

Obviously, for the first of these there’s no good way to transition that doesn’t break a large number of existing projects that have chosen not to provide wheels (or are old enough to predate wheels).

For the latter, we run into the problem that old versions of pip won’t know to respect the new “don’t use this” metadata, so presumably projects won’t want to upload sdists because they can’t rely on it.

To be honest though, I’m less concerned about this. This is the same basic problem that we had with Requires-Python, and all it really meant is that the early adopters had to come up with a fallback strategy to prevent their sdists from being used implicitly (in many cases by generating an error in the setup.py). I think that pattern could also easily apply here, where we standardize some sort of Archival-Only flag, and we just… wait until a version of the various installers is common enough that it’s OK.
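
As an illustration, the Requires-Python-era fallback was roughly a guard at the top of setup.py, and the same shape of check could serve an Archival-Only flag; the version cutoff and error text here are illustrative, not any particular project’s code:

import sys

# Old installers ignored the Requires-Python metadata, so the sdist itself
# refused to build rather than silently installing an incompatible version.
if sys.version_info < (3, 8):
    sys.exit("This project requires Python 3.8 or newer.")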

Another possible idea (though this has long term caching implications) is that /simple/$project/ won’t contain links to “archival only” sdists by default, and you have to fetch /simple/$project/?archival=y to include them.
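
To make that concrete, a client that wanted the archival-only files might fetch the page along these lines; the ?archival=y parameter is purely hypothetical, not anything PyPI supports today:

from urllib.request import Request, urlopen

def fetch_simple_page(project, include_archival=False):
    # Hypothetical opt-in: append ?archival=y to include "archival only" sdists.
    url = f"https://pypi.org/simple/{project}/"
    if include_archival:
        url += "?archival=y"
    # Request the PEP 691 JSON form of the simple index page.
    req = Request(url, headers={"Accept": "application/vnd.pypi.simple.v1+json"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")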

Another possible idea (though this also has caching implications, but possibly not long term ones) is that we do UA detection at the edge and for UAs that are “too old”, we don’t include these archival only sdists [1].
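
A rough sketch of what that edge check might look like, with an entirely hypothetical cutoff version:

import re

# Hypothetical: the first pip version that understands archival-only sdists.
MIN_PIP = (25, 0)

def include_archival_files(user_agent):
    # Hide archival-only sdists from UAs that are "too old"; unknown
    # clients are treated as old, to be safe.
    m = re.match(r"pip/(\d+)\.(\d+)", user_agent)
    if not m:
        return False
    return (int(m.group(1)), int(m.group(2))) >= MIN_PIP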

Really though, I view packaging changes as designing for a decade+, so a multi year wait until you can really start relying on something doesn’t seem real bad to me :slight_smile:


  1. I’m normally very -1 on UA detection here, but if we time-boxed it so that we’d only host it for, say, 2 or 3 years, it’s maybe OK? ↩︎

8 Likes

Thanks for doing that analysis, it gives us some food for thought. I think the real meat of the question, though, is, as you say, how many of those are significant. It seems quite likely that a tiny number of packages cause most of the pain because they: A) are popular; B) distribute an sdist; C) won’t actually build seamlessly from that sdist for many users. The examples you give fall into this category.

I would not be surprised if less than 1% of packages are causing more than 90% of the unwanted-sdist-install pain.

This is, of course, not least because, of those half a million packages, an unknown but presumably large percentage will be totally unaffected by any change we make, because they are broken, empty, abandoned, or otherwise unusable. This problem keeps coming up in discussions of proposed PyPI changes and I’m not sure how to handle it, but my hunch is that the most payoff will come simply from finding a solution that makes the UX nicer for installing the top N packages (or the top N packages with sdists). A solution that works for those is likely to also provide collateral benefits to other packages. Our problem is then reduced to debating what value of N to choose. :slight_smile:

4 Likes

Just to throw in my 2 cents as an implementer of an alternative to CPython: we very much like that pip installs sdists when it cannot find binary wheels, because most package maintainers cannot be expected to build binary wheels for alternative implementations that they don’t care about. The fact that pip falls back to sdists means that for many packages, the only problem for our users is that “pip install numpy seems to be taking a long time”, which is better than if they said “pip install numpy just doesn’t work”.

3 Likes

Interesting. The basis of this proposal is that for some packages (of which numpy is one) installing from sdist will never be the right thing to do, regardless of whether there is a binary wheel available or not. If there’s no binary, you can build one from the sdist, but doing so is a much more complex and manual process than a simple pip install can handle.

Can you explain what is different about your user base that means pip install numpy (from sdist) works for them?

3 Likes

I’m sorry to say I’m not sure why pip install numpy from sdist wouldn’t always work? I’ve never encountered problems with it, but then, I always have development tools available on all my machines. I think the same might be true for our (GraalPy’s) typical user.

In my experience working on GraalPy and using pip install for the past 7 years, it has always worked as long as I had the required native dependencies and toolchains on my system and the package had an sdist available. To this day I don’t think there are any binary wheels at all for GraalPy on PyPI, and yet our users can install numpy, pandas, scipy, scikit-learn, matplotlib, … simply via pip install. Of course, maybe some accelerator won’t be compiled in if you don’t have a BLAS library or CUDA, but it still works. I’ve seen occasional problems with packages like SciPy, where we had to tell users to downgrade gfortran or set some environment variable. But that’s simple to document, and (to me) much preferable to having to tell them “forget about pip install for native extensions”.

But maybe that’s the difference? People expect pip install to install dependencies; if you have to do something manually beforehand, like make required native dependencies available, then that’s not really “pip install works”, it’s “doing manual steps plus pip install works”

4 Likes

I’ll save the big sdist rant for another day (it was fun, but not really beneficial to this topic :wink: )

What’s to stop projects that want to discourage building from source from adding their own environment variable check to their build script?

$> pip install numpy
Downloading numpy.0.0.0.tar.gz
Running build for 'numpy':
Build failed. The following output is from the 'numpy' package's build:

WARNING: No pre-built wheels for numpy were found for your platform.
Building from source can take a long time and require tools and libraries
to be installed on your machine. Visit <docs> for more information.
Set NUMPY_BUILD_SDIST=1 in order to build anyway.

$> NUMPY_BUILD_SDIST=1 pip install numpy
Using numpy.0.0.0.tar.gz from cache
Running build for 'numpy':
<One eternity later.png>
Installing collected packages: numpy
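
A minimal sketch of what such a check might look like at the top of a project’s build script (setup.py here); the NUMPY_BUILD_SDIST variable matches the hypothetical output above and is not something numpy actually implements:

import os
import sys

# Hypothetical opt-in guard for source builds, mirroring the output above.
if not os.environ.get("NUMPY_BUILD_SDIST"):
    sys.exit(
        "WARNING: No pre-built wheels for numpy were found for your platform.\n"
        "Building from source can take a long time and require tools and\n"
        "libraries to be installed on your machine. Visit <docs> for more\n"
        "information. Set NUMPY_BUILD_SDIST=1 in order to build anyway."
    )
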
9 Likes

It seems misleading to count the number of potentially affected projects, when in reality it’s a problem that affects end users. So you would have to build weighted statistics taking into account the download count of each project.

For the record, many foundational libraries in the scientific space may want to make use of this. Anything that wraps non-trivial C, C++, Fortran or Rust code could apply.

10 Likes

But maybe that’s the difference? People expect pip install to
install dependencies; if you have to do something manually
beforehand, like make required native dependencies available, then
that’s not really “pip install works”, it’s “doing manual steps
plus pip install works”

You also need an operating system. And a Python interpreter. There’s
always “manual steps” that need doing, but maybe what you’re saying
is “some people don’t read instructions?”

Setting up your environment so you can install things (at least on
POSIX systems) has pretty much always been necessary, and most
software developers I know tend to assume users understand the tools
which come with their operating systems and can follow basic
building and installation instructions. It seems insulting to say
“sorry, your platform is slightly different from what we’ve
pre-built binary packages for, and we discourage you from trying to
build it yourself because we assume you probably won’t get it right,
it’ll be easier if you just get a different system.” With that
attitude, what’s the point of bothering to write properly portable
software at all?

If “people expect” something which isn’t true, doesn’t it make more
sense to correct those false assumptions?

It depends on whether “native dependencies” in the original message means system dependencies or Python extension modules! @timfelgentreff can you clarify?

The Python ecosystem has changed a lot since the early days, and many users are not software developers.

But in any case, the issue isn’t that users should know better how to compile these packages, it’s that they almost certainly shouldn’t be compiling them, and the fact that they are is a sign of something else wrong with their environment.

4 Likes

If your users are likely to have any version of gfortran installed by default, that’s the difference. Heck, a user base that is even capable of installing gfortran is very much an outlier compared to typical Python users. Having said that, of course, wheels are provided for “typical” users, so the real question is what proportion of the exceptional cases will have a build environment set up.

I guess I still don’t have a clear picture in my mind of who we’re trying to help here.

Maybe the real solution is for installers just to not report the build backend output, but simply to give a generic error, something like

No suitable binary distribution was available for XXX, and building from source failed. If you wish to build the package yourself, please check the documentation for XXX to find out how to do so.
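
In installer terms, the suggested behaviour is roughly the following; the function and flag names are illustrative, not pip internals:

def report_build_failure(project, backend_output, verbose=False):
    # Suppress the build backend's output unless explicitly requested,
    # and emit a generic, user-oriented message instead.
    if verbose:
        print(backend_output)
    print(
        f"No suitable binary distribution was available for {project}, "
        f"and building from source failed. If you wish to build the "
        f"package yourself, please check the documentation for {project} "
        f"to find out how to do so."
    )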

3 Likes

Combining this with a version of @steve.dower’s suggestion does seem like a better user experience. Hide the output from the backend behind a flag, so that people who want to build from source and need to debug can do so without changing their workflow.

I’d be a little leery of just adding a message saying “please check the documentation for X” when that documentation might not exist or it might be insufficient. Right now it might just say “just use pip!” which is going to be frustrating for the user.

1 Like

Having ended up with an unintended self-compiled numpy or scipy on more than one occasion when I wasn’t paying close attention on a remote cluster, I would much prefer if building from source was opt-in:

No suitable binary distribution was available for XXX. If you wish to attempt building the package yourself, please run pip with the --build-from-sdist flag.
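
The --build-from-sdist flag above is hypothetical, but pip’s existing --only-binary option already gives you the strict behaviour per invocation today:

$> pip install --only-binary :all: numpy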

6 Likes

That’s a separate question, the relevant pip issue was quoted in the original post here. I still think we should do that, but it will cause issues for users of projects that only distribute sdists[1], which is why it’s taking time to work out the roll-out plan.

I’m honestly not 100% sure how this proposal relates to the idea of pip only installing wheels by default. I think it’s either:

  1. A workaround for projects that would prefer pip didn’t install from sdist, but don’t want to wait for that to happen.
  2. An alternative proposal intended to make it unnecessary for pip to switch to wheel-only by default.

IMO, the problem with (1) is that it’s a workaround that will have to stay forever, even after pip changes. And the problem with (2) is that it ignores the other benefits of installing from wheel only by default (most notably security, as installing a wheel doesn’t execute arbitrary code).


  1. Not all such projects are hard to build from sdist. It seems that some of them are pure Python, and simply use a sdist as if it were a universal wheel. ↩︎

3 Likes

Yes, but that’s not the expectation created by Pip, or the level of user-friendliness associated with Python. After all, “install the dependencies first” is in the category of “setting up your environment” (e.g. by installing things with the system package manager), and Pip is already expected not only to do that but resolve dependencies and determine versions of packages to install.

It doesn’t have to come across like that. The intent is more along the lines of “please acknowledge that you understand the situation and that if something goes wrong, you’re willing to put in some work to troubleshoot the problem, and that you won’t blame the Pip maintainers”.

Some things are basically not ever going to be portable outside of explicitly built wheels. Some things will just work on every platform because they’re pure Python; but of course these should build a -none-any- wheel anyway. In the in-between space, it’s worth warning people about what they’re getting into - especially Windows users who are rarely properly set up to compile anything. Again, it’s not “have you considered just switching to Linux instead and learning how gcc works?”, but “now might be a good time to step back, check our documentation, check our project’s discussion forum, consider other packages for this task, …”.

I’m afraid I don’t follow. The reason a wheel wouldn’t be found is because of what platform I’m on, not because I haven’t yet installed whatever system packages or set whatever environment variables etc. The latter are problems detected during the compilation attempt.

The fact that such projects exist seems like one of the biggest objections to any of the ideas proposed in Speculative: --only-binary by default? · Issue #9140 · pypa/pip · GitHub.

It’d be nice if there were some kind of automatic handling for that, like @uranusjr’s suggestion in that thread.

But also, part of the problem as I see it is that there are several wheel-preference policies that make some amount of sense, but generally no reasonable, unambiguous, concise way to describe them all on the command line.

2 Likes

Their Python version counts as part of their environment, surely?

The four examples listed in the original post are all cases where the user installed Python in such a way that they didn’t find a wheel, and (probably, not necessarily) they didn’t realize that’s what they were doing. It can be as simple as installing the latest Python before packages have cut a release for it.

Of course, some users know exactly what they want, and they intend to build these packages from source. But a large chunk of people are not in that category and this leads some packages to avoid uploading an sdist at all (I’m just repeating the OP at this point).

1 Like