Sdists for pure-Python projects

Recently, I’ve had to deal with some Python packaging issues, which pushed me to familiarize myself with the PyPA tools and ecosystem.

There is one thing that still confuses me. Suppose I have a pure-Python project: no C, no build complexity, no compilation, just plain and simple Python code. When I run python -m build, I get

  • a .tar.gz sdist,
  • a -py3-none-any.whl wheel.
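
For a hypothetical project named myproj at version 1.0.0 (names purely illustrative), that looks like:

  $ python -m build
  $ ls dist/
  myproj-1.0.0-py3-none-any.whl  myproj-1.0.0.tar.gz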

My question is: why do I need both? Since the wheel is universal (not platform-specific), wouldn’t the wheel suffice? Is it only to include tests/docs in the sdist but not in the wheel? But then, who will consume my sdist and might need those tests, given that I very likely host my source code on GitLab/GitHub/whatever?

On “An Overview of Packaging for Python” in the Python Packaging User Guide, I read (emphasis mine):

“In fact, Python’s package installer, pip, always prefers wheels because installation is always faster, so even pure-Python packages work better with wheels.

Binary distributions are best when they come with source distributions to match. Even if you don’t upload wheels of your code for every operating system, by uploading the sdist, you’re enabling users of other platforms to still build it for themselves. Default to publishing both sdist and wheel archives together, unless you’re creating artifacts for a very specific use case where you know the recipient only needs one or the other.”

Which makes me think that the sdist is only needed if my package contains C/C++/Rust/<insert compiled language> code.

Consider this both as a question and feedback about how the UX could be improved, whether through changes in the tools or in the packaging documentation 🙂

Conceptually, a sdist is “what you need to build the project” and a wheel is “what you need to run (use) the project”. It’s a subtle, but rather important, distinction (for some use cases).
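
A minimal sketch of that distinction in practice (file names hypothetical): installing from the sdist requires pip to run a build first, while installing from the wheel is just an unpack-and-copy:

  # pip must call the build backend to turn the sdist into a wheel before installing
  $ pip install ./myproj-1.0.0.tar.gz
  # pip just unpacks the wheel straight into site-packages
  $ pip install ./myproj-1.0.0-py3-none-any.whl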

2 Likes

Yeah… but “build” is where I’m lost. What do you need to “build” in a pure-Python project?

(FWIW, I imagine that a newcomer who doesn’t know about writing C extensions can find the term “build backend” confusing as well, because Python code doesn’t need compilation.)

Let’s ask the question differently. If I upload only a wheel to PyPI and not an sdist, how bad is that? Are there people who cannot use my project anymore?

Distributors like Conda, Debian and Red Hat build from source. That’s an explicit policy, and I believe it’s because building from source lets them lay the resulting files out according to their own conventions rather than the ones the wheel standard requires. A wheel doesn’t let them do that, because it doesn’t include pyproject.toml. There are also people who simply have a personal policy of “build everything from source”, who could use wheels, but choose not to.

Let me put the question the other way round - given that it’s the convention, and it’s barely any extra effort over shipping just a wheel, why would you even want to not ship an sdist? Or is the question simply out of curiosity (in which case I hope my answers helped)?

1 Like

There are several steps required to transform a raw source tree into something that can actually be copied into place by an installer:

  • locating the actual modules/packages
  • constructing the manifest, processing the core metadata
  • performing various checks, constructing the entrypoints
  • moving everything into the correct subdirectories to be prepared for installation
  • generating the rest of the metadata required by the built distribution output format (which may or may not be a wheel).

While things have gotten simpler with many modern build backends, historically (and still at present for many projects) this can involve any number of arbitrary dynamic transformations of the source tree into the packaged sdist, and the sdist into the built wheel (or other built/installable output). This could include:

  • fixing the version (setuptools-scm, many backends, etc.; see the sketch below)
  • generating or transforming the source files
  • moving things around
  • including arbitrary data
  • dynamically constructing the metadata
  • constraining the deps more tightly
  • etc.

All of these things can be independent of building binary extension modules, and all of them can potentially vary based on how the build is invoked and configured, and on the desired built output format. Historically there were many more output formats than there are today (bdist_msi, bdist_rpm, etc.), but third-party distributors often can and do make different choices than the particular ones made for the project’s own PyPI wheel.
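
As a concrete sketch of the version-fixing point above (project name hypothetical): setuptools-scm derives the version from Git tags when the sdist is created and writes it down statically inside it, so the sdist and the wheel built from it agree:

  $ cat pyproject.toml
  [build-system]
  requires = ["setuptools>=64", "setuptools-scm>=8"]
  build-backend = "setuptools.build_meta"

  [project]
  name = "myproj"
  dynamic = ["version"]

  [tool.setuptools_scm]
  # no options needed; the version comes from the latest Git tag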

Python code isn’t compiled all the way down to machine code in a binary executable image, but it does need to be tokenized, parsed to an AST and compiled to bytecode before the interpreter executes it at runtime. The bytecode is cached to disk (as .pyc files) so that subsequent imports are non-trivially faster.
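
You can see that compilation step explicitly with the stdlib (paths illustrative):

  $ python -m compileall -q src/myproj/   # precompile every module to bytecode
  $ ls src/myproj/__pycache__/
  __init__.cpython-312.pyc

Wheels typically ship only the .py sources; the interpreter writes the .pyc cache on first import.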

As @pf_moore mentioned, nearly all distributors’ tooling is set up to build from sdists rather than wheels, because they have their own built distribution formats. Nowadays that often (but far from always) involves building a wheel and extracting the contents, but it is done under a specifically controlled/customized build environment that ensures the tooling works reliably. Furthermore, downstream redistributors typically need the tests (to verify their packages work properly), the docs (to bundle in their installers), and other metadata, config files and assets from the source (e.g. .desktop files).

Besides serving redistributors, an sdist provides a canonical, checksummed (and potentially signed), complete and buildable source archive for that version of the project, independent of platform, Python version and binary format. It contains the complete project source and metadata, can be built to any supported output format as required, and can be built under conditions the user controls. And if, say, there ends up being a bug, compatibility issue or limitation in wheel, build, the project’s build backend, etc. that causes something to go wrong in specific cases or future versions, or causes required files (licenses, etc.) to be omitted from the built distributions, those can be rebuilt as needed from source.

5 Likes

There’s quite a few things – there’s the conversion of the metadata (e.g. name, version, classifiers, dependencies) into the Core Metadata format, generating the RECORD file, and a few more things.
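
For instance, you can peek at both inside any wheel (the project name and output here are a sketch); METADATA carries the Core Metadata fields, and RECORD lists every file with its hash and size:

  $ unzip -p dist/myproj-1.0.0-py3-none-any.whl myproj-1.0.0.dist-info/METADATA | head -4
  Metadata-Version: 2.1
  Name: myproj
  Version: 1.0.0
  Summary: An example project
  $ unzip -p dist/myproj-1.0.0-py3-none-any.whl myproj-1.0.0.dist-info/RECORD | head -2
  myproj/__init__.py,sha256=...,137
  myproj-1.0.0.dist-info/METADATA,sha256=...,982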

Certain pure-Python projects also involve code generation as part of their “sdist → wheel” process (e.g. pradyunsg/sphinx-theme-builder on GitHub generates pure-Python wheels for Sphinx themes, and its build process from sdist → wheel involves compiling web assets).

1 Like

I had written something about this a while ago. It is a bit outdated now (it was before [build-system], and I did not know as much about the topic as I do now), but I guess some points still stand. Nowadays, I would keep it as short as this: not everyone does, wants to, or can use the wheel format, but the sdist format, by virtue of its “raw” nature, is quite universal. Anyway, it seems I am late to the party and this has already been said.

I would say no one, assuming access to the source code repository is guaranteed; one could always build an sdist from it if they needed one. But on PyPI there is no guarantee that the source code repository is available.

Maybe it is worth its own page on the Packaging User Guide, similar to Wheel vs Egg.

You can see the difference at https://inspector.pypi.io/project/sampleproject/3.0.0/ as well – the contents of the two are quite different and the build process is the process of transforming the sdist’s contents into the wheel’s contents.
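
You can do the same comparison locally, assuming you’ve downloaded sampleproject’s artifacts into dist/ (paths illustrative):

  $ tar tzf dist/sampleproject-3.0.0.tar.gz              # pyproject.toml, src/, tests/, README, ...
  $ unzip -l dist/sampleproject-3.0.0-py3-none-any.whl   # importable packages plus *.dist-info/ only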

Please build sdists, and include tests and documentation sources in them. In Gentoo, we find the ability to run tests prior to installing very important, and our users have also requested offline documentation multiple times.

While technically we could use the underlying repository, there are problems with that. There’s a huge established infrastructure behind downloading archives over HTTPS: mirrors, proxies, checksums. Every VCS requires reinventing this infrastructure, and that never worked well for us.

We have fallen back to using automatically generated Git snapshots, e.g. from GitHub, but they are not guaranteed to be reliable. Only recently, GitHub had to revert a change because all hell broke loose when checksums suddenly changed. Also note that the recommended setuptools_scm setup has recently been discovered to cause unstable archives.

Even if a package had no tests (which would be very bad), an sdist has advantages over wheels for Gentoo:

  • it uses the same structure as the source repository, so patches usually apply as-is
  • tar and gzip are part of the base Gentoo system while unzip is not, so one could end up having to temporarily install unzip just to unpack the wheel, and we need to unpack it in case a user wants to apply patches (see the sketch below)
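
For reference, unpacking a wheel without unzip is possible, since wheels are plain zip archives (file name hypothetical), though it doesn’t change the patching point above:

  $ python -m zipfile -e myproj-1.0.0-py3-none-any.whl ./myproj-1.0.0/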

So yes, sdists are more important than wheels. An sdist without wheels works for everyone. A wheel without an sdist doesn’t.

3 Likes

This was mentioned somewhere on one of the other packaging threads, and on pypackaging-native, but I think it should be reconsidered whether this idea of distributing sdists needs to be part of what PyPI does. It absolutely makes sense to provide a source tarball for various kinds of third-party distributors, and I can see how it makes sense for Python to provide a tool (like build) to create one. From my perspective, though, it absolutely does not make sense for that to be in the official package repository that is automatically searched and installed from when using the official package installer that comes with Python. Those are two quite different audiences, and one of them (people installing Python packages) is vastly larger than the other (people maintaining third-party distributions).

For the case described in this post, yes. But, just to clarify, that is not the case in general. For those who don’t have the necessary compilation tools on their system, an sdist does not work and an appropriate wheel does (again, as described on pypackaging-native). So it’s not as simple as a blanket recommendation to always provide sdists, because that causes problems for packages that aren’t pure Python.

2 Likes

Provided that the wheel was built for their platform. When this isn’t the case, trying the sdist has a better chance of working than failing outright with “your platform is not supported, sorry”.

I think projects should always provide an sdist to be a good citizen of the ecosystem. It enables repackaging the code for conda and other third-party packagers, and provides a more standardized snapshot of the project’s release than a git-release tarball.

If we could make progress with --only-binary by default in pip, then the sdist would only be visible to those who really need it: people doing pip install would not get an sdist as a fallback.
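
That behaviour is already available opt-in today (package name illustrative):

  $ pip install --only-binary :all: myproj   # fail if no suitable wheel, never fall back to the sdist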

5 Likes

Right, for non-pure-Python packages, though the primary solution here is to use cibuildwheel (or a similar tool) to build wheels for all the major platforms, which helps far more users than providing an sdist would (since the proportion of users on the “big three” platforms is typically much higher than of those who happen to have the requisite non-Python build dependencies installed, or know how to install them). Of course, an sdist should still be provided to at least give advanced users an option to build it, and for the other reasons mentioned. [1]

Of course, this is all getting somewhat OT, since the OP is asking about a pure-Python package.


  1. though there are some corner cases like PyTorch where it is extremely complex for non-experts to build the package themselves, and keeping the sdist available results in pip trying to install it on unsupported platforms, which makes for a worse user experience than just not having an sdist at all. ↩︎

1 Like

Building and publishing sdists is just one more potential point of failure in our release automation. I’d like to point out that asking thousands of package maintainers to complicate their situation in perpetuity for your convenience is not really convincing, at all. We currently publish sdists, but they are very much complicated by a need to bundle separately-built TypeScript libraries that must identically match a CDN we maintain. So you aren’t truly building “from scratch” anyway. If I could drop all that complexity, I would in a heartbeat.

4 Likes

I’m having trouble understanding the complexity that is added here, though perhaps there’s something particularly specific to your project. Sdists are not “built”, by definition, whereas wheels are canonically (and by default with build, unless you specifically tell it otherwise) built by first creating the sdist and then building the wheel from that. For a pure-Python project, you’d have to pass additional arguments to the base build and twine invocations to actively not build and upload the sdist. And likewise for cibuildwheel, you’d have to manually modify the standard minimal release automation workflow it provides so that it does not build and upload the sdist.
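
To make that concrete, a sketch of the two paths:

  $ python -m build            # default: create the sdist, then build the wheel from that sdist
  $ python -m build --wheel    # opt out: build only the wheel, straight from the source tree
  $ twine upload dist/*.whl    # and upload only the wheel, skipping the sdist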

As a maintainer, you’re of course not obligated to go out of your way to make these sdists easy for average users to build on their own (by e.g. building dependencies or providing individual support; you could even emit an error message at the start of the build, unless an opt-in config var is set, stating that building from source is not officially supported). What I’m unclear on is how uploading the sdists that are presumably already generated during your build automation adds the amount of additional maintenance burden you describe (bundling in TypeScript binaries seems to be a concern for the wheels, but not for the sdist), so I’m assuming there’s something more to it here specific to your particular project that I’m missing.

Speaking as someone who maintains 100+ packages in conda-forge (a fair number of them non-Python), I disagree on this point (to be more precise, with the latter half). For the packages I work on, I vastly prefer the GitHub sources, because:

  • I don’t have to worry about undocumented transformations of the source code between the git tag and the sdist (which – aside from being a potential attack vector – makes it unnecessarily hard to apply patches to the source in case that ever becomes necessary)
  • sdists often don’t contain the tests for a package, which I’d like to run as part of our CI (to verify everything runs correctly)

The recent GitHub incident with the changed hashes that @mgorny mentions is IMO an outlier (and the speed with which it was reverted gives an indication of how many people rely on, and were affected by, stable archives).

I actually agree much more with @rgommers’ point about unexpected from-source builds because the installation for some reason falls back to the sdist. In pure-Python packages this will usually be fine, but that’s decidedly not the case when packages become more complex.

4 Likes

I’ll pile on and say that (IMO) sdists on PyPI are not to be treated as source checkouts or source tarballs. Folks generally wouldn’t try to run tests off of a file named foo-1.0.sdist; it’s the use of .tar.gz as the extension that makes people think they should treat sdists like source tarballs. They’re not, and treating them as such typically complicates things for package maintainers.

2 Likes

The bigger reason is to verify that the source is buildable, and corresponds to the built artifact.
While we tend to take “builds from source with minimal tweaks” for granted, it makes the ecosystem more resilient. In the long term, it’s IMO absolutely worth the additional bit of maintenance cost.

Whether the source should be an sdist or a Git checkout is another question. There’s value in having common tools and processes for both the simple case (pure Python) and more complex cases.
I think it would be great if, in the simple cases, the sdist could be the source archive, rather than something you build separately (i.e. you’d just do git archive to build it).
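
In the simplest case that could be as little as (tag and names hypothetical):

  $ git archive --format=tar.gz --prefix=myproj-1.0.0/ -o dist/myproj-1.0.0.tar.gz v1.0.0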

2 Likes

I’ll pile on and say that the sdists on PyPI for the thousands of pure Python projects I help maintain are absolutely to be treated as source checkouts or source tarballs. Our community goes out of its way to include tests, documentation and anything else from our Git worktrees in our sdists, and we even run automated tests when they’re built, before uploading, to ensure that happens correctly.

We incorporate (sometimes legally necessary) content in our sdists which is extracted from various Git metadata and so isn’t present in a naive tarball dump of the worktree. We maintain a signing infrastructure and publish corresponding detached OpenPGP signatures for all our sdists and wheels, for verification by downstream distribution package maintainers.

We were not impacted at all by GitHub’s compression change, for a couple of reasons, the above being one. The other is that our community values open source software and services, so it does not rely on GitHub or similar proprietary, closed-source freemium platforms.

3 Likes