Purpose of an sdist

EpicWink · July 2, 2020, 9:53pm

What is the purpose of a source distribution? I wasn’t able to find a definition of sdist, let alone its purpose, in any PEP or documentation.

From what I can tell, it’s one of the following:

A means of distributing the original source of a package (ie no generated code), to be installed or built into a wheel with possibly some intermediary programs (Cython, cmake, etc)
A way to distribute the package to platforms which don’t have a wheel built for them, to be directly installed or built into a wheel

Currently, it seems it’s the former.

In the case of Cython, their documentation recommends including the generated C code in the source distribution, which means the distribution is a combination of the two.

PS: In my opinion, it should be the latter, as you can archive your repository to achieve the former (assuming a repo is used).

bernatgabor · July 2, 2020, 10:12pm

Many enterprise/OS distribution environments don’t allow installation of wheels for auditing reasons, and all software to be installed must be built from source only. This is even more important for C extensions, where the C/C++ code gets obfuscated during the wheel build.

steve.dower · July 2, 2020, 11:41pm

They’re also for “unpredictable” platforms, where the Python ABI may not match the developer’s build machine. Portable code can be built across hundreds of variations of distros, whereas a binary is only really going to work with one.
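To make the ABI point concrete, here is a minimal sketch of how a wheel's filename pins it to one interpreter, ABI and platform, while an sdist filename carries no such tags. The helper name `wheel_tags` and the `example` project are hypothetical; the filename layout follows the wheel filename convention.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    python: str    # e.g. "cp38"
    abi: str       # e.g. "cp38"
    platform: str  # e.g. "manylinux2014_x86_64"


def wheel_tags(filename: str) -> Optional[WheelTags]:
    """Return the compatibility tags of a wheel filename, or None for other files."""
    if not filename.endswith(".whl"):
        # An sdist such as "example-1.0.tar.gz" carries no compatibility tags.
        return None
    # Wheel names are {name}-{version}[-{build}]-{python}-{abi}-{platform}.whl,
    # so the last three dash-separated fields are always the tags.
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"not a valid wheel filename: {filename!r}")
    return WheelTags(*parts[-3:])


print(wheel_tags("example-1.0-cp38-cp38-manylinux2014_x86_64.whl"))
print(wheel_tags("example-1.0.tar.gz"))
```

The wheel is only usable where all three tags match the target interpreter; the sdist makes no such promise and leaves the build to the target machine.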

I often describe wheels as a compiler optimisation (the kind that completely optimises out the compiler). An sdist is the canonical form of the package.

pf_moore · July 3, 2020, 7:27am

For me a sdist is exactly that: it’s a distribution format using source code. So it does not (need to) include any files that aren’t relevant for installing the package, but it also should not contain platform-specific binaries.

As a “distribution format” it should also conform to a standard layout so that installers can use it automatically. Currently we don’t have a standard defining that layout, so what we have at the moment is a somewhat ad hoc “historical” standard, but we’re working on that.

tiran · July 3, 2020, 7:31am

An sdist should also include tests and documentation. The test suite allows vendors to verify that a build works correctly on a given platform and dependencies. This allows packagers to detect all sorts of problems before a package is published for general consumption. Linux distributions usually run tests during package builds to gate new versions.
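As a rough illustration of how a packager might check this, the sketch below lists the top-level entries of an sdist archive and reports which expected directories are missing. The directory names `tests` and `docs`, the `example-1.0` project and both helper functions are assumptions for the demo; real projects lay these out differently.

```python
import io
import tarfile


def top_level_entries(sdist_bytes: bytes) -> set:
    """Return the names directly under the sdist's single root directory."""
    entries = set()
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:gz") as archive:
        for name in archive.getnames():
            parts = name.split("/")
            if len(parts) > 1 and parts[1]:
                entries.add(parts[1])
    return entries


def build_demo_sdist() -> bytes:
    """Build a tiny in-memory archive standing in for a real sdist."""
    paths = (
        "example-1.0/setup.py",
        "example-1.0/example/__init__.py",
        "example-1.0/tests/test_example.py",
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in paths:
            data = b"# placeholder\n"
            info = tarfile.TarInfo(path)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


entries = top_level_entries(build_demo_sdist())
missing = {"tests", "docs"} - entries
print(sorted(missing))
```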

EpicWink · July 3, 2020, 8:13am

I would expect at least some predictability in platforms. Is there a minimal toolchain and library set that’s expected (eg manylinux2014’s environment)? Or are sdists true source, including Fortran, configuration scripts, etc.?

One could argue that you can get the tests from the corresponding tag of the package’s repo, and the full documentation from the hosted documentation, both of which could be linked to in the core-metadata (or however it’s stored in the sdist).

tiran · July 3, 2020, 8:44am

manylinux wheels only work for a very limited combination of operating system, libc, and CPU architecture. If you are running something other than a major Linux distro with glibc on X86_64 or X86, then you are out of luck. If you are on ARM (Raspberry Pi), any BSD, macOS, Alpine (musl libc), Solaris, or more exotic platforms then manylinux won’t help.

Source tarballs also permit users to compile a binary package with special compiler flags for debugging, performance, or better security. They also allow packagers to backport fixes to older releases.

For security reasons build systems have limited to no internet access. A distro packager reviews every new release for security issues, signs/hashes the source tar ball and uploads it to an internal CDN. This prevents anybody from tampering with the release.
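The hashing step can be sketched as follows. This is only an illustration of the idea, not any distro's actual tooling: the function names are hypothetical, SHA-256 is an assumed choice of digest, and the byte string stands in for a real tarball.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the archive contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_hash(data: bytes, recorded_hex: str) -> bool:
    """Check an archive against the digest recorded at review time."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sha256_hex(data), recorded_hex.lower())


# Stand-in for the bytes of a reviewed source tarball.
reviewed_tarball = b"contents of example-1.0.tar.gz"
recorded = sha256_hex(reviewed_tarball)

print(matches_recorded_hash(reviewed_tarball, recorded))
print(matches_recorded_hash(reviewed_tarball + b"tampered", recorded))
```

An offline build system would then refuse any archive whose digest differs from the one recorded by the reviewer.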

steve.dower · July 3, 2020, 9:00am

Today you could, but sdists predate distributed version control, and Python predates ubiquitous internet connections.

So in a sense, an sdist is a tag in a repo, though as Paul says they are specifically laid out in a way that the tools can interpret them (much like a git repo is laid out for git). And since sdists have already out-lasted multiple version control systems, it seems likely they’ll continue to be the canonical way of sharing Python packages.

pf_moore · July 3, 2020, 9:33am

If those are “relevant for installing the package”, that fits with my description. Whether they are relevant depends somewhat on the policies of consumers - as you point out, Linux distributors typically want them. That’s not something that I expect we’ll standardise or mandate in the short term, though.

encukou · July 3, 2020, 10:18am

Well, as metadata is put in pyproject.toml in increasingly accessible ways, I feel sdists are becoming more and more obsolete. Which is good! It makes creating the source tarball easier, whether your VCS is Git or exchanging floppies full of patches.
IMO, preparing a source release archive is generally/historically about leaving out build artifacts and other things you don’t want shared. “Whatever’s added to Git” is so much easier to manage than MANIFEST.in or make distclean. And if the tests, docs and licence end up included, well, all the better.

ncoghlan · July 11, 2020, 7:03am

While sdists started as essentially Linux distro source tarballs that contained a setup.py script, I expect over time we’ll see them move more and more into the second role suggested: partial build artifacts that have had platform independent parts of their build pipeline executed.

We already see that with Cython, for example - many projects treat Cython as an sdist creation dependency rather than as a binary build dependency.

EpicWink · July 11, 2020, 9:20am

That’s related to what my ideas were coming in to this topic: I want to distribute my Cython-code projects without requiring the user to have Cython installed and without having to build wheels for all of our target platforms. It felt, however, like sdists were specifically for original¹ source to build the compiled packages on any possible platform.

However, since reading the comments on this topic, I see sdists now as really something to ship around Python packages in whatever way the distributor wishes, whether it includes tests and build-configuration scripts, or is nothing more than the (pure-Python) wheel contents with a different file and metadata format.

I guess my original question is: how can you specify metadata for an object if you don’t know the object’s raison d’être? I think that has been answered with the comments above.

¹ as in, the code actually written by a person, not generated by an application. You could argue that templated projects are generated, and that you should only consider the template configuration as part of the original source, for example

sinoroc · July 11, 2020, 10:04am

In the same line of questions (maybe a step too far as it’s not a platform-portability concern, and thus the answer is probably obvious, but maybe it helps with defining what a sdist can/should contain)…

Consider a project with gettext I18N/L10N: should the sdist contain the *.po (portable object) or the *.mo (machine object) files?

pf_moore · July 11, 2020, 11:00am

My view has changed over time, as wheels (for my environment) have become more ubiquitous, I think. But I still find both of those aspects of sdists to be important.

I strongly value sdists as a standardised way of publishing project source through the same channel as binaries - so we don’t encourage “binary only” projects, or even “binary available on PyPI, go here for the source”. As an example, the demise of Google code hosting resulted in source for some projects getting lost. Python projects that publish sdists are much less exposed to this risk.
I also value sdists as a means of making projects available for environments the authors don’t provide wheels for. Shipping artifacts that minimise the build complexity as much as possible makes the sdist more widely usable for this scenario.

I’ve yet to be convinced that conflating these two usages causes any significant conflict, although I think it’s worth being aware of the tension between the objectives.

In particular, I don’t think that the tension causes a problem in including static metadata in a sdist. The static metadata is firmly in the “partial build artifact” area, and as such, we should be:

Encouraging projects as much as possible to encode the underlying metadata in a way that can be built in the “platform independent” stage of building (so, for example, use markers rather than calculating dependencies at build time based on the target environment).
Defining sdist metadata as that pre-computed platform independent portion of the final wheel metadata.
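The difference between markers and build-time calculation can be sketched like this. The package names are only illustrative, and `dynamic_requires` is a hypothetical helper; the marker syntax is the standard environment-marker form for dependency specifiers.

```python
import sys

# Static form: the condition travels with the metadata as an environment
# marker, so the same list is valid on every platform and can be written
# into the sdist once. The installer evaluates it on the target machine.
STATIC_REQUIRES = [
    "requests",
    'pywin32; sys_platform == "win32"',
]


def dynamic_requires() -> list:
    """Build-time form: the result depends on the machine running the build."""
    requires = ["requests"]
    if sys.platform == "win32":
        requires.append("pywin32")
    return requires


print(STATIC_REQUIRES)
print(dynamic_requires())
```

Only the static list can be recorded as pre-computed, platform-independent metadata; the dynamic one has to be recomputed on every build.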

Replying to myself from above, I’m aware that what I’m saying now is a weaker statement than I made previously. I guess I’m arguing that “platform independent distribution format” is a little more important than “source archive”, but I don’t see why both can’t be achieved. In the case of Cython, for example, why not just ship both the Cython source and the generated C in the sdist?
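One common way to use an sdist that ships both is a fallback in the build script: compile from the .pyx input when Cython is importable, otherwise from the pre-generated C file. The sketch below shows only the selection logic, under the assumption that the generated file sits next to the .pyx with a .c suffix; the function names and the `example/_speedups.pyx` path are hypothetical.

```python
import importlib.util
from pathlib import PurePosixPath


def cython_installed() -> bool:
    """Report whether the Cython package can be imported in this environment."""
    return importlib.util.find_spec("Cython") is not None


def extension_sources(pyx_path: str, cython_available: bool) -> list:
    """Pick the source file an extension module should be built from."""
    if cython_available:
        # Regenerate the C code from the original input.
        return [pyx_path]
    # Fall back to the pre-generated C file shipped alongside the .pyx file.
    return [str(PurePosixPath(pyx_path).with_suffix(".c"))]


print(extension_sources("example/_speedups.pyx", cython_installed()))
```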

ncoghlan · July 12, 2020, 3:09am

Right, for things like Cython, the sdists should always include the input files no matter what, so the question is then whether to declare a build dependency on Cython or to include pre-generated Cython output in the sdist.

Historically, the lack of automatically created build environments meant that the latter option was almost always more convenient for end users, at the cost of potential future compatibility issues with Python C API changes that require regeneration of Cython output with a newer version of Cython.

Now, though, there’s the middle ground of using pyproject.toml to declare a build dependency on the cython PyPI package, which should give the best of both worlds: future builds can benefit from Cython level performance improvements and compatibility updates without needing to update the project itself, but end users won’t need to pre install Cython globally either.

pf_moore · July 12, 2020, 8:36am

Aren’t you contradicting yourself here? Cython-generated code definitely constitutes one of the results of a platform independent part of the build pipeline being executed… (and see this post from Cython’s author for further arguments in favour of shipping Cython-generated files).

Regardless, we’ve just circled back to the point I made earlier, that it may depend on what you consider “necessary to build” - and that’s not something we should try to mandate.

ncoghlan · July 12, 2020, 5:40pm

There are differences between “What’s acceptable to include in an sdist” and “What’s desirable to include in an sdist”, and the big downside of including the output of a code generator like Cython is that the only way to update the default build output is to update the sdist.

There are projects on PyPI that can’t be installed on 3.7 or later due to this, as their latest sdist release was built on Python 3.6 with a version of Cython that generated code that accessed private or deprecated C APIs that have since been removed.

The right trade-off to make will vary by project, though, so at the ecosystem level we want to continue to allow both approaches.

brettcannon · July 14, 2020, 12:51am

And this is what sdists are to me: whatever is required to build a wheel.

ofek · July 14, 2020, 1:41am

I share this conception.

As an example, for coincurve I ship the (often updated) pinned version of a C library so build systems don’t require network access.

steve.dower · July 15, 2020, 7:41am

Add to this “and meet the distribution requirements of any included third party code” and I’m happy to support it.