Preventing unwanted attempts to build sdists

geofft · May 24, 2024, 7:26pm

pypackaging-native has a very good writeup “Unsuspecting users getting failing from-source builds” that describes one of the practical usability problems with the pip/PyPI ecosystem. In brief, there are several popular packages where building from source is terribly complicated and unlikely to work unless you are intentionally building from source and have carefully set up a build environment, and where the preferred option for users of the software is to install a wheel. But there are a few reasons your resolver might fail to find a wheel, get an sdist instead, and attempt in vain to build it:

You’re on a platform / Python version that the authors don’t (yet) build wheels for.
You’re on a platform they do build wheels for, but your installer doesn’t know that (e.g., there’s a new manylinux tag).
Sdists and wheels are not uploaded at the same time, and you happened to run pip install between the two uploads and lost the race.
There’s an incompatible change in the wheel format, as is being proposed over in How to reinvent the wheel, and your installer can’t handle those wheels.
In these cases it would be better to get an error message up front mentioning the lack of a wheel than to imply to users that they should debug why the build failed.

A couple of very popular projects have started addressing this by not uploading sdists at all, which is unfortunate.

pypa/pip#9140 “--only-binary by default?” proposes to address this by changing pip’s behavior to ignore sdists if it sees any wheels at all for a package, or some similar heuristic. But, being a heuristic, it seems hard to roll out without breaking some existing use case.

I think the following design is fully backwards-compatible and accomplishes the goal:

Define a new type of sdist, which I’m going to call a “manual-build sdist” but we can probably find a better name. A manual-build sdist is just like any other sdist, except that the filename ends in .manualbuild, e.g., foo-1.0.tar.gz.manualbuild. The unsuffixed filename and the contents can be treated like any other sdist.
Extend PyPI to accept files ending in .manualbuild, if it doesn’t already accept them.
Extend the code in pip, uv, etc. to download manual-build sdists as if they were normal sdists when some command-line flag / configuration is enabled (e.g., pip install --manually-build :all: numpy). If that flag is not specified, they keep their current behavior, which should be to ignore these files.
Ask the maintainers of projects that have complicated build processes and that intend to comprehensively provide wheels to rename their sdists to .manualbuild before uploading them.
Ideally, ask the maintainers of projects that currently aren’t uploading sdists at all to start uploading manual-build sdists.
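
The installer-side change in the steps above could look roughly like the following. This is only a sketch: `classify`, `visible_files`, and the `allow_manual_build` parameter are hypothetical names standing in for whatever flag is eventually chosen, and none of this is actual pip or uv code.

```python
MANUAL_SUFFIX = ".manualbuild"  # suffix proposed in this post
SDIST_EXTENSIONS = (".tar.gz", ".zip")  # common sdist archive extensions


def classify(filename: str) -> str:
    """Return 'wheel', 'sdist', 'manual-sdist', or 'unknown' for an index filename."""
    if filename.endswith(".whl"):
        return "wheel"
    if filename.endswith(SDIST_EXTENSIONS):
        return "sdist"
    if filename.endswith(MANUAL_SUFFIX):
        # Strip the suffix and check that what remains looks like a normal sdist.
        inner = filename[: -len(MANUAL_SUFFIX)]
        if inner.endswith(SDIST_EXTENSIONS):
            return "manual-sdist"
    return "unknown"


def visible_files(filenames, allow_manual_build=False):
    """Hypothetical filter: the files an installer would consider as candidates."""
    allowed = {"wheel", "sdist"}
    if allow_manual_build:
        allowed.add("manual-sdist")
    return [name for name in filenames if classify(name) in allowed]


files = [
    "foo-1.0-cp312-cp312-manylinux_2_17_x86_64.whl",
    "foo-1.0.tar.gz.manualbuild",
]
print(visible_files(files))                           # default: wheel only
print(visible_files(files, allow_manual_build=True))  # opt-in: both files
```

By default the manual-build sdist is never a candidate, so a missing wheel surfaces as "nothing to install" rather than as a failed build.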

This is opt-in on a per-project basis, and therefore it is fully backwards-compatible: any project that is not uploading manual-build sdists gets versions resolved exactly like they do today. A package that only has sdists has users build sdists; a package that only has wheels has users install wheels; a package that has an sdist for a new version but a wheel only for an older version has users install the latest version by building its sdist.

Essentially this recognizes, as pypackaging-native puts it, the dual purposes of PyPI. A regular (auto-build) sdist now conveys the intention that it’s meant for users to actually build under normal circumstances, which has so far been the implicit effect. A manual-build sdist is for publication and for consumption by redistributors or people with unusual use cases, and is just as easy for someone to download from the web interface or for a tool to grab, but explicitly conveys the intention that it’s not for routine installation.

This is even backwards compatible with the use case of build pipelines that do a pip install xyz in CI to produce the official wheel for that version of xyz from its sdist. If they don’t hear about this new feature, they’ll continue to upload normal (automatic-build) sdists, and pip install xyz will build it. If they do switch to uploading manual-build sdists, then they’re aware of this new feature, and they can adjust their CI to use pip install --manually-build :all: xyz; there’s no behavior change that they’re not aware of.

No changes to build tools are strictly required (as there would be with the proposals to avoid heuristics by adding metadata); all you need to do is rename the file. We certainly could make this nicer but it’s not a hard requirement.

This is backwards-compatible with existing versions of pip (and other installers): for projects that only provide manual-build sdists, existing versions of installers will just ignore the .manualbuild files that they don’t recognize, and they’ll gracefully degrade to the same behavior as if sdists are not uploaded at all, which is what we want. But this will only happen for projects that explicitly choose to switch to manual-build sdists.
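
To illustrate the graceful-degradation claim, here is a toy model of an existing installer. It assumes, as the post does, that such installers skip files whose extension they don't recognize; `old_installer_candidates` and `pick` are hypothetical names, and matching wheel tags by substring is a deliberate simplification of real tag matching.

```python
KNOWN_EXTENSIONS = (".whl", ".tar.gz", ".zip")  # what this toy "old installer" recognizes


def old_installer_candidates(filenames):
    """Keep only files whose extension the old installer recognizes."""
    return [name for name in filenames if name.endswith(KNOWN_EXTENSIONS)]


def pick(filenames, compatible_platform_tags):
    """Pick a compatible wheel, else an sdist, else fail before any build starts."""
    candidates = old_installer_candidates(filenames)
    for name in candidates:
        # Simplification: a wheel is "compatible" if a platform tag appears in its name.
        if name.endswith(".whl") and any(tag in name for tag in compatible_platform_tags):
            return name
    for name in candidates:
        if not name.endswith(".whl"):
            return name  # an ordinary sdist the installer would try to build
    raise LookupError("no matching distribution found")


project = [
    "foo-1.0-cp312-cp312-manylinux_2_17_x86_64.whl",
    "foo-1.0.tar.gz.manualbuild",
]
print(pick(project, ["manylinux_2_17_x86_64"]))  # the wheel is selected

try:
    pick(project, ["win_amd64"])
except LookupError as exc:
    # Same outcome as a project that uploads no sdist at all.
    print("error:", exc)
```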

So I think this is a much easier rollout than any of the other proposals to address this problem: if no projects made any behavior changes, the new versions of pip etc. would not change any behavior either, so there’s no coordination problem. ^[1] There will be some work in encouraging popular projects to adopt this scheme for it to be actually useful, but it’s in their own interest to not have users post build failures in their support forums, so I think that’s easy. After all, a couple of major projects are already doing the worse version of this.

One thing that might break is automation by redistributors to find and download an sdist that they specifically intend to build from source - pypi2deb, pyp2rpm, etc. But those are already broken by the projects that choose to stop uploading sdists, and it’s an easy fix, and it only needs to be fixed once per tool. It is also most likely to affect those packages where the redistributors know they will have to spend manual effort getting the packaging right. So I think this is an acceptable impact.

By contrast, I think every proposal on pip#9140 to make a change in pip alone has some edge case where the heuristics do the wrong thing in a use case where people are not well equipped to figure out what happened, and so there’s some hesitation to make a change without a lot of care and planning and publicity. Some of the proposals also involve implementing better error reporting inside pip in a way that would be difficult to get right. Alternatively, there are proposals to accomplish roughly the same thing as this proposal via metadata, which would be a more involved change and also wouldn’t be as backwards-compatible. Empirically - there’s been a quick consensus on the issue that the change is a good idea, but no movement for a couple of years.

I welcome feedback about whether I’m missing anything that makes this approach complicated or unworkable - or whether one of the other proposals actually is straightforward. I’ll turn this proposal into a relatively short PEP if people feel this is a decent plan, and I’m happy to do the code changes needed, too.

[1] There is a valid question about whether automated tools that generate and upload sdists should generate them as auto-build or manual-build sdists. But do any such tools exist? I was worried about cibuildwheel, but it just documents the CI steps you need; it doesn’t upload sdists on its own.

notatallshaw · May 24, 2024, 7:58pm

I was thinking about this idea of “Preventing unwanted attempts to build sdists” and at the risk of completely fracturing this conversation, my idea was:

At either the project (or version release) level of the simple API, add a new tag “expected-wheels”, which contains a list of tags (for platforms?) that the project expects wheels to be available for
Then frontends can implement Pip’s --prefer-binary-like behavior by default for projects which have this tag
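
A rough sketch of how a frontend might act on such a tag. Everything here is hypothetical: the "expected-wheels" key does not exist in the simple API today, the metadata is modelled as a plain dict, and `choose_version` only mimics the documented effect of pip's --prefer-binary (prefer a version with a wheel even if a newer version has only an sdist).

```python
def prefer_binary_by_default(project_metadata: dict) -> bool:
    """Hypothetical: prefer wheels when the project declares expected wheels."""
    return bool(project_metadata.get("expected-wheels"))


def choose_version(candidates, prefer_binary):
    """Choose a version from (version_tuple, has_compatible_wheel) pairs.

    With prefer_binary, pick the newest version that has a compatible wheel,
    falling back to the newest version overall; otherwise pick the newest version.
    """
    newest_first = sorted(candidates, reverse=True)
    if prefer_binary:
        for version, has_wheel in newest_first:
            if has_wheel:
                return version
    return newest_first[0][0]


# Version 1.1 has only an sdist so far; version 1.0 has a compatible wheel.
candidates = [((1, 0), True), ((1, 1), False)]

tagged = {"name": "foo", "expected-wheels": ["manylinux_2_17_x86_64", "win_amd64"]}
untagged = {"name": "foo"}

print(choose_version(candidates, prefer_binary_by_default(tagged)))    # (1, 0)
print(choose_version(candidates, prefer_binary_by_default(untagged)))  # (1, 1)
```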

There are a few wrinkles that would need to be hashed out, such as what level the tags should be at and what the tags should represent exactly. But I think it achieves the same goal in a simpler way?

It would still require index, project maintainer, and frontend tool participation, so it’s a non-trivial amount of effort unfortunately.

mikeshardmind · May 24, 2024, 7:58pm

So, I think if we accept this is a problem, the separation between a normal sdist meant to be installed by end users, and one provided but expected to require user intervention to install makes sense. I don’t understand the perspective or think it’s a problem that needs solving; I think maintainers have already solved it by not uploading sdists anymore if they don’t intend to support installation in that manner.

As someone who doesn’t upload sdists (I can’t reasonably support people building from source on a platform I don’t have access to test, or it’s pure python and the wheel contains the source already, so why waste more resources of the index), I’d like to ask why you see this as unfortunate or a problem that needs solving.

From my perspective, the canonical source is the git repository, and wheels hosted on an index are an ecosystem convenience. (and pip install “git+…” works fine for advanced users)

MegaIng · May 24, 2024, 8:25pm

Then don’t. I don’t think providing a sdist should come with this expectation, and in my understanding this is what this thread is about, i.e. reducing this expectation by changing the tools (in some way)

I much prefer opening up a .tar.gz file which is clearly intended to be opened manually and well known to end users, compared to a .whl file, which in addition to (a subset of) the source code also contains unfamiliar-to-end-user metadata files, and it doesn’t contain the expected pyproject.toml file users would want to look at. If preventing the waste of resources on the index was a goal, then the wheel format should not be used for pure python wheels, instead of dropping the more complete and useful sdist.

This assumes a git repo and host that continue to exist for as long as PyPI exists. I would much prefer if the ecosystem attempts to reduce the damage link rot can cause.

mikeshardmind · May 24, 2024, 8:31pm

With all due respect, none of those address the big question all of the parts you quoted were around here, so let me state it with less surrounding context:

Why is this something that we should have another format or metadata or thing to consider for?

The parts you quoted me on explained my perspective, and were there to allow someone to address that question in a way that would reach me persuasively.

And before someone quotes at me about supply chains, with my security hat on, sdists only provide an illusion of security and aren’t guaranteed to have artifacts that match. Attestation from a trusted build server, on the other hand, does, without providing an sdist.

I also don’t see link rot or code browsing in formats meant for machines as persuasive, but that’s really a whole other set of concerns.

BrenBarn · May 24, 2024, 8:34pm

This issue was mentioned in passing in some of the other packaging threads.

I still think the best solution is to totally separate wheels and sdists. Pip (or whatever other package installer) should never attempt to install an sdist. Sdists shouldn’t be hosted in the same installer-checked index as installable artifacts (aka wheels). If people want to host sdists on python.org, that’s cool, but there should be some separate “pypsi” (python package source index) repository for those.^[1]

The audiences of “people who want to install python packages” and “people who want the source to repackage a python package” are so different, that I don’t think it makes sense to try to serve those audiences in the same index, and the former is so enormous compared to the latter that I think the latter is the one that should be split off into a new, separate index.

[1] Deciding if it’s better to host your source on python.org or on your own external site/repo would then be “taking the pypsi challenge”.

MegaIng · May 24, 2024, 8:41pm

Let me spell it out more clearly and slightly more focused:

Because not providing sdists is a terrible solution that hurts other parts of the ecosystem.

Or did I misunderstand your argument?

You yourself said that installers trying to build sdist is problematic.
Your suggested solution is to not provide sdists

I am now arguing that your solution is not a good one, so a different one is needed. (whether what OP suggests is a good solution, I am less sure about. Which is why I didn’t comment on that)

That might well be true, but I really don’t want to see sdist completely vanish. Even if only as historical documentation, they are something that has value, and for example will make future forks to continue maintenance after the original maintainer vanished (taking all gits with him) way easier. If there is a different index that people would need to upload to, we will see sdist slowly vanish. If it’s just a different API for installers, that would probably be fine.

mikeshardmind · May 24, 2024, 8:47pm

I don’t see how, and you’ve yet to explain how either.

I have seen multiple people claim that the lack of sdists is a problem, but no tangible negative impact to not having them in the presence of better formats for offering the source (such as a link back to where the project is developed and built)

Trying to maintain a project from only point-in-time snapshots of the project is going to be painful no matter what. While I can appreciate preservation efforts, I can’t see sdists as actually serving that purpose. Change history, issues and commit history tracking why things were done are integral to making well informed decisions when inheriting or forking an existing non-trivial project.

MegaIng · May 24, 2024, 8:49pm

I have tried, but since we are apparently completely failing to communicate, I am not going to continue here.

(Oh, but I do want to call out the fact that you then reply to one of my attempts to explain clearly showing that “you’ve yet to explain how either” is wrong)

mikeshardmind · May 24, 2024, 8:52pm

I see no tangible negative impact to not providing an sdist in any of your messages. I agree there’s something where we’re clearly not on the same page here, and I did not mean that comment to be read as hostilely as it seems to have been.

geofft · May 24, 2024, 8:53pm

I think the question of “why do we even need sdists” is an important one; if the consensus is that we should get rid of sdists, that would certainly solve the problem.

I’m just getting off the train, can I ask for a bit of time to get home and formulate the reply? I think there are good reasons, but they are fairly detailed. I don’t think we need to get too heated in the meantime.

But, in the meantime, the pypackaging-native site covers a few of these in thoughtful detail - see the links in my original message.

BrenBarn · May 24, 2024, 8:56pm

Any packages that currently don’t upload sdists could also vanish now in the same way. There’s also no reason sdists have to vanish just because they’re not searched by the install tool. It sounds like you’re saying you want a historical archive or mirror of various stages of a project, which I agree is nice, but, again, I don’t see it as something that should be mixed with “what happens when I try to install a package”.

In a way, having a different index might make sdists more attractive, since now package authors wouldn’t have to worry that uploading an sdist would create potential pain for users (when it incorrectly is chosen as something to install). You could still have upload tools like twine automatically upload the sdist, they just wouldn’t upload it to the same place.

leifwalsh · May 25, 2024, 2:05am

Just dropping in to say that I don’t think a link to a git repo satisfies the use case that an sdist satisfies, in the case where you’re trying to build from source (possibly even though you know it’s going to be difficult but you need to anyway).

A git repo will in many cases contain a different set of files than an sdist. Specifically it’s common for more files to be in git than are needed for building (e.g. test data, code in other languages, git repos containing multiple separate distributions), and sometimes an sdist contains files the git repo does not (e.g. code-generated files you wouldn’t check in).

I think there’s a need to have, somewhere, a prepared copy of source ready to be built, for some projects. I’d also point out that this “prepared source” tarball is an attack vector as we just saw with xz-utils, and it’s regrettable that the world uses them, but it’s still a fact that some of the world uses them and can’t reasonably build directly from what’s in git right now.

I’ll avoid wading deeper into the discussion of where that should be or what it should look like, at least for now.

geofft · May 25, 2024, 3:44am

So, on rereading, I think I phrased the proposal poorly by saying “Define a new type of sdist…” It’s not a new type of sdist at all. It’s the exact same sdist as we’ve always had, just with a different filename. This is intended to be a very small spec with a very small implementation - certainly smaller than the impact of even a single package going from uploading sdists to not uploading them, let alone encouraging the whole ecosystem to do so. We have sdists now; this is just about tweaking the UX around how they get used, not changing how they work.

But I do think sdists are in fact valuable for a handful of reasons, from technical to philosophical:

A git repo might have a different layout from a buildable sdist. I’ve actually run into two examples of this recently. One is tokenizers, a Rust library with bindings to Python and other languages. https://github.com/huggingface/tokenizers is a monorepo of all of those. The repo has a bindings/python subdirectory with pyproject.toml, and inside that, Rust code using pyo3 and pure-Python code. But if you download the sdist of tokenizers on pypi, there’s a pyproject.toml at top level, as well as the Rust library, the pyo3 bindings, and the pure-Python code. It might be possible to use git+https://github.com/huggingface/tokenizers#subdirectory=bindings/python or something, but that’s complicated.

The other example is PyTorch Lightning, which has been rebranding itself to just Lightning: existing users who import pytorch_lightning can use pypi project pytorch-lightning, whose sdist has a substantial src/pytorch_lightning/ directory, but src/pytorch_lightning/ in their Git repo is almost empty - because it gets copied over from src/lightning/ and a few other places. The #subdirectory trick won’t work here.

A git repo might require more dependencies to get something buildable. (This is analogous to how in autotools projects ./configure is usually not checked in but is in the distribution tarballs.) The entire purpose of Hatch’s hatchling and Flit’s flit-core is to support other people building projects that use Hatch and Flit without them having to install and learn and use Hatch and Flit. If we get rid of sdists, building from source becomes more complicated.

You can’t guarantee that everyone is using Git. Python itself only moved to Git in 2017. As late as 2015 I remember sending in patches to pyasn1 while it was using CVS on SourceForge! Maybe everyone really is on Git today, but there’s a lot of interesting work on version control systems such as Jujutsu and Sapling. If you want to use a new (or old) VCS, without sdists, you’re insisting your users learn your version control system (or you have to run a mirror to Git, I guess.) Meanwhile, tar is a pretty stable format, and exporting to one is not hard.

(Also, people are bad at tagging commits in practice, but it’s easy to see if a wheel on PyPI has a corresponding sdist or not.)

It may well be the case that your Git repos don’t do anything weird, and running the backend described at pyproject.toml just works. But I have no idea what weird things someone else’s Git repo might be doing.

Or, in other words: an sdist hosted on a central service is a standardized way to get some buildable source code and build it. Right now, you resolve a name on PyPI, download an sdist, and do the thing in PEP 517 to acquire build dependencies and do a build, and for most packages this works. This can be done programmatically and several people do. If we move away from sdists, we’re essentially giving up on all the work in PEP 517 etc. to define a standardized build process for Python code, and the way to build from source is now to find their VCS repo by hand and follow their build instructions by hand, which could be entirely custom and manual. Maybe that is a better model, but making that argument seems tough. In particular, it seems to me that PEP 517 is what enabled the Python ecosystem to start exploring more tools like Flit and Hatch that move away from ./setup.py. I remember running across a Flit project in the pre-PEP 517 days and seeing people ask “how am I supposed to build this? why do you want me to install and learn your weird tool?” The social pressure to use setuptools was very strong, and I think that was bad for the ecosystem. I’m worried that the effect of dropping sdists in favor of git+https:// will be to reintroduce that social pressure against new build tools, along with new social pressure against new VCSes, new source-level layouts, etc.

So if you want people to have the option of installing from source at all - and I think that’s important for the ecosystem because that’s what FOSS is about - it’s important to make that as straightforward and well-supported as possible. But this does not at all imply that everyone should be routinely installing from source! Debian, Fedora, etc. care very much about the FOSS ideal of everyone being able to build and modify the software, but they still distribute binaries, and there is no condition that would cause apt, dnf, etc. to silently try to build something from source.

Continuing to move in the philosophical direction, pypackaging-native says this better than I can, but PyPI is not just a tool for people installing software. There are users of PyPI who are not doing something directly equivalent to pip install, but are instead getting code to build and distribute in some other way. This is the use case of pyp2rpm et al.

(In fact, this proposal is basically implementing their suggestion: “allow upload of sdist’s for archival and ‘source code to distributors flow’ without making them available for direct installation.”)

One key part of this is that PyPI provides standardized names for the Python ecosystem. If a package is installable with pip install xyz, you know that if it’s in Debian it’s named python3-xyz, if it’s in Nix it’s named python312Packages.xyz, etc. If you’re installing a package with a git+https:// URL, the name is much less obvious. (Yes, for packages that distribute wheels, they’ll still claim a name on PyPI, but for users who build from source, there’s no straightforward link from the name to the source or vice versa.)

There are a few other neat things that happen from PyPI being a level of abstraction for naming. If you are installing from git+https://github.com/some_user/xyz and they transfer it to another user or an organization, you need to update your dependencies; if you are installing from sdists, you don’t need to do anything. Apart from maintainer-initiated transfers, the PyPI maintainers can use their judgment to provide continuity of a name if the maintainer of an important package is inactive as well as to remove malware. To be clear, there are downsides of a model with an intermediary too, but we’re currently treating all of these as positives, and there’s no proposal to make PyPI stop being an intermediary for wheels.

Note that none of these arguments are about security (as the previous comment mentioned, the VCS/sdist discrepancy is worse for security), archiving, or ease of applying patches or developing local changes. We do have some current practices to use sdists for all of those things, but that’s just another case of existing consumers of sdists that we should not break without reason; it’s not that sdists are inherently better for any of those.

I think we could eventually get to a world without sdists, if PyPI starts offering VCS mirroring and if we come up with a PEP 517-like way of describing how to build a wheel from a source code checkout, but those would require some detailed design.

mikeshardmind · May 25, 2024, 5:25pm

Thank you for the detailed explanation of the benefits you see to sdists here. I can’t say that I agree with any of them as many of the reasons why an sdist could be needed over a VCS link (git or otherwise) are the kind of things that raise alarm bells personally, but I do see how projects stopping providing sdists could be seen as disruptive to those who were relying on their continued existence.

Yes, that’s my expectation with building from source though, and we were sold wheels as a solution so that typical end users wouldn’t need to anymore. Usually solved with documentation of what dependencies are required, or some other means of acquiring them properly, sometimes some build scripts, and platform-specific redistributors adding appropriate dependency information for the ecosystem (not reliant on sdists)

I actually find this to be an argument against the status quo of sdists. Check in your code generation tools. Mark generated files as generated, and use build servers and attestation to ensure nobody ever has to implicitly trust a generated file or review something that wasn’t written to be reviewable; they can see how it was made and where it came from.

With that said, I understand the current situation isn’t perfect and there’s a lot of non-ideal code out there. I’ll leave it to others to figure out if trying to make it possible to upload sdists but not have pip/uv/other resolver use them is desirable to some use case.

Thank you (to everyone who responded) for your time spent explaining your perspectives on sdists remaining important.

jamestwebber · May 25, 2024, 6:08pm

Just to nitpick: pip can install from git, svn, mercurial and bazaar repositories. This isn’t about assuming “everything’s on GitHub” or “everyone uses git”. If a new VCS tool became popular I would expect install tools to learn how to install from them.

fungi · May 25, 2024, 7:26pm

I can’t say that I agree with any of them as many of the reasons
why an sdist could be needed over a VCS link (git or otherwise)
are the kind of things that raise alarm bells personally

“Personally” being important here. Some of us who have been
distributing free/libre open source software since the days when CVS
was in its infancy as an RCS replacement may not find it all that
alarming. An over-reliance on revision control systems as “the one
true form of distribution” gets my hackles up, “personally.” Yes,
Git is really cool and I do use it regularly, but it’s not the only
way, and sometimes not even the best way, to serve source code,
depending on the situation.

bryevdv · May 25, 2024, 7:44pm

Just to be clear, others of us who have been around since CVS and before would very much love to be able to stop distributing sdists. Our sdists have to jump through hacky hoops to bundle pre-built typescript components, which I guess is an “abuse” of the notion of “source distribution” but also the only solution I’m willing to tolerate for my sanity. Not sure what tenure has to do with anything here.

fungi · May 25, 2024, 8:46pm

fungi:

Some of us who have been
distributing free/libre open source software since the days when CVS
was in its infancy as an RCS replacement may not find it all that
alarming.

Just to be clear, others of us who have been around since CVS and
before would very much love to be able to stop distributing
sdists. Our sdists have to jump through hacky hoops to bundle
pre-built typescript components, which I guess is an “abuse” of
the notion of “source distribution” but also the only solution
I’m willing to tolerate for my sanity. Not sure what tenure has
to do with anything here.

I didn’t say that, I just don’t find tarballs of source code
“alarming” (inconvenient at times, sure). I’m wagering it may be a
matter of perspective, since I know there are many in this ecosystem
who don’t remember a time before ~everything was managed in revision
control systems.

mikeshardmind · May 25, 2024, 9:00pm

FWIW, I don’t find source tarballs inherently alarming. I do find it alarming that one of the arguments made to say sdists are important, and that it’s a loss that some projects would stop providing them, was to include other files that aren’t tracked in version control for projects using version control.

My perspective is not that source dists are inherently bad, just that a maintainer choosing not to provide one is not inherently a problem that needs solving.

With regards to the extra files, it’s alarming to me when a project already uses version control and includes files that have no traceable source. While some of that is heightened awareness with recent issues with xz, it’s also, even in a situation where the extra files are reviewed and everything makes sense, often a case of something that makes things more fragile in the long term than if at least the code which could be used to generate that file was checked in. This raises all sorts of thoughts from “This might be a path that receives less testing” to “this project normally has a high standard for code review, but this code provided by the project never was”