Should sdists include docs and tests?

While I don’t have a strong opinion on defaults for sdist
generation, the projects I’m involved in consider sdists to be
“true” source tarballs, “truer” than a naive archive of the Git
worktree since we can bake in relevant Git metadata which would
otherwise be lost. Because we want distributions to be able to use
them as a basis for their own packaging efforts, we make sure any
files tracked by Git (except for things which make no sense outside
revision control, like .gitignore) are included in the sdists we
publish, and we test to make certain that’s the case. If we have
docs or tests or even CI configuration checked into the repo, we
always include it all in the sdist.
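The completeness test described above can be sketched with nothing but the stdlib: compare `git ls-files` output against the members of the built sdist tarball. The repo and sdist paths, and the exact set of VCS-only files to exclude, are hypothetical placeholders here.

```python
# Sketch: every file tracked by Git (minus VCS-only files) should
# appear in the built sdist. Paths and the VCS_ONLY set are examples.
import subprocess
import tarfile

VCS_ONLY = {".gitignore", ".gitattributes"}

def git_tracked_files(repo="."):
    out = subprocess.run(
        ["git", "ls-files"], cwd=repo,
        capture_output=True, text=True, check=True,
    ).stdout
    return {f for f in out.splitlines()
            if f.rsplit("/", 1)[-1] not in VCS_ONLY}

def sdist_files(path):
    # Strip the leading "name-version/" component from each member.
    with tarfile.open(path) as tf:
        return {m.name.split("/", 1)[1]
                for m in tf.getmembers() if "/" in m.name}

def missing_from_sdist(repo, sdist_path):
    return sorted(git_tracked_files(repo) - sdist_files(sdist_path))

# e.g. in a test: assert not missing_from_sdist(".", "dist/example-1.0.tar.gz")
```

A check like this is cheap to run in CI right after building the sdist, which is roughly what the post describes.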

We also usually include tests in our wheels, since we consider them
to be a part of the software itself and so locate them inside the
importable package rather than in some parallel directory tree, so
that one may import foo.tests.bar (yes, we use “tests” not “test”
too).

For what it’s worth, I put the test code in sdists, but not in wheels.

I use the test directory at the root of the project (not tests), because that is the distutils (setuptools) default, and I do not need to add it to MANIFEST.in.

My test code is usually quite vanilla. Although I use pytest to run it, it can usually be run without it (cd test && python -m unittest). If test dependencies are required, they are in a dev_test extra (which I should rename to dev-test now, I think), so that info is also available in the sdist. At some point I tried quite hard to make ./setup.py test work for my projects, but it never behaved well enough; anyway, that is gone now.
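A minimal example of that “vanilla” style: plain unittest with no pytest-specific features, so it runs under both `pytest` and `python -m unittest`. All names here are hypothetical, not taken from any particular project.

```python
# test/test_clamp.py (hypothetical): runs under pytest AND plain unittest.
import unittest

def clamp(x, lo, hi):
    """Stand-in for the real code under test."""
    return max(lo, min(hi, x))

class ClampTests(unittest.TestCase):
    def test_within_bounds(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_clamps_out_of_range(self):
        self.assertEqual(clamp(42, 0, 10), 10)
        self.assertEqual(clamp(-3, 0, 10), 0)

if __name__ == "__main__":
    unittest.main()
```

Keeping to the unittest subset means downstream packagers can run the suite with no test dependencies installed at all.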

I do not add documentation to sdists (or wheels). But after having read more about how Linux distro packagers work, I now think adding documentation to sdists makes sense (at least what is necessary for man pages and the like).

All in all, in the long term, I think a PEP 517-like standard interface for test runners and doc builders could make sense.

Yeah, that makes sense. Since your tooling (PBR, but the same is true of the popular setuptools-scm) not only ensures your sdists include everything checked into your repo, minus VCS-specific files, but also bakes in additional metadata, your sdists are indeed more complete representations of your source than a straight tarball of your worktree. In our case (the Spyder project), by contrast, we still rely on manual MANIFEST.in maintenance, though we also maintain our own tooling (LogHub) that takes care of things like updating the CHANGELOG and AUTHORS from the GitHub issues, PRs and contributors, all of which is checked into source control.

I’ve been suggesting switching to setuptools_scm for automatic version and manifest management, which would result in a setup very similar to yours, but there’s some amount of institutional inertia and reluctance to “fix what ain’t broke” and a number of places that Spyder itself uses the version information that would need to change, as well as higher short-term priorities (e.g. modernizing our packaging config/infra and CI setup) to spend our limited resources and change budget on.

This is our standard practice as well in the Spyder project, where we not only have tests as a subpackage but have a tests subdirectory for every directory, containing the tests for the code in each module within, i.e.:

spyder
|__ spam
|__ eggs
|__ ham
|   |__ foo
|   |__ bar
|   |__ tests
|       |__ test_foo
|       |__ test_bar
|__ tests
    |__ test_spam
    |__ test_eggs

As some background, Spyder is a 15-year-old project originally developed by a scientist with limited programming experience, and the original “tests” were just regular functions and function/method calls in the if __name__ == "__main__" blocks of the code under test, so this structure was a natural outgrowth of that.

In some other (typically smaller) projects I’m involved in, we use the src layout and locate our tests outside the package, organized by type (unit, integration, functional) and run from the source against the installed wheel.

I was actually completely unaware of that; for reference, could you point me to where that is mentioned?

Using _ is fine; PEP 685 doesn’t change that. _ is the most common non-alphanumeric character used in extras names, and is perfectly okay under the PEP 508 rules adopted by PEP 685. It’ll automatically get normalized to - via PEP 503 normalization when written out to core metadata, and consuming tools will also normalize it when comparing, so it doesn’t make any practical difference (except in one specific scenario, but changing it on your end now won’t help that at all).
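The normalization in question is small enough to show inline: PEP 503 (whose rule PEP 685 reuses for extras) collapses any run of `-`, `_` and `.` into a single `-` and lowercases the result, so `dev_test` and `dev-test` compare equal after normalization.

```python
# PEP 503 name normalization, as reused by PEP 685 for extras names.
import re

def normalize(name):
    # Runs of "-", "_" and "." become a single "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()

assert normalize("dev_test") == normalize("dev-test") == "dev-test"
```

This is why the choice of separator is cosmetic: any conforming consumer sees the same canonical name.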

If you do make a change, you might want to consider a test extra, and a dev extra that includes test plus all your other development dependencies, at least in terms of standard convention. But there’s no need unless you want to, at least until this is more standardized, though nominally the core metadata spec does reserve the test extra for running tests:

Two feature names test and doc are reserved to mark dependencies that are needed for running automated tests and generating documentation, respectively.

Correction, it says:

anything that looks like a test script: test/test*.py (currently, the Distutils don’t do anything with test scripts except include them in source distributions, but in the future there will be a standard for testing Python module distributions)

https://docs.python.org/3/distutils/sourcedist.html#specifying-the-files-to-distribute


Thanks. I did not know that. Interesting. I wonder if any tool actually recognizes those extras and handles them in any specific way.

No, but the regro-cf-autotick-bot uses PyPI releases to trigger automatic update PRs.

Gotcha, thanks. I wouldn’t rely on this, though: it’s an implicit default, distutils is deprecated, things may already have changed with Setuptools, and in particular are likely to change much further going forward.

(perhaps @abravalheri might have further comments on this)

Perhaps the legacy, deprecated/removed distutils commands for running tests and building docs automatically install them, but other than that I’m not aware of other tools doing so. Poetry has something like that, but AFAIK it uses their own tool config. There are also current plans to propose a standard PEP 517-like hook for invoking these, but it hasn’t reached the PEP stage yet.

Right. Does it only work with a url source set to PyPI in the recipe, or does it work with GitHub tar sources as well? I couldn’t find any answer in the docs about that.

Hi @CAM-Gerlach, currently the auto-discovery changes in setuptools are restricted to the package files. If the testing code resides within the package, it will be included.

For sdists, I personally recommend that people use something like setuptools-scm or setuptools-svn whenever they can. I think this is a much better alternative than re-implementing a way to decide which files are transient or not (we all know the fate of the MANIFEST.in format…).

Oh nice, it appears the bot does work with GitHub releases: [bot-automerge] libbson v1.19.1 by regro-cf-autotick-bot · Pull Request #4 · conda-forge/libbson-feedstock · GitHub. So it appears there is no strong reason from the conda-forge perspective to include the tests in the PyPI sdist.

In Nixpkgs we increasingly fetch from VCS because sdists don’t include the tests. It’s not bad to build from the actual source instead of a derived artifact; however, this way PyPI does lose part of its value for us.

You can’t guarantee that you know how to build from a git checkout. With modern sdists and the expectation/requirement of having pyproject.toml and PKG-INFO in the sdist, we are moving towards a world where you can just run build on an sdist to get a wheel. Compare that to a git checkout, where you have no guarantee that pyproject.toml or PKG-INFO is in the repo, as they may be auto-generated as part of sdist creation. You also can’t rely on those files being up to date at any point compared to the code in the repo (e.g. is the version number changed per commit, ahead of time, or only at release time?).
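The PKG-INFO file mentioned here is core metadata in RFC 822-style key/value form, so it can be read straight out of an sdist with the stdlib email parser. The sdist path below is a hypothetical placeholder.

```python
# Sketch: extract and parse the core metadata (PKG-INFO) from an sdist.
import tarfile
from email import message_from_string

def sdist_metadata(path):
    with tarfile.open(path) as tf:
        member = next(m for m in tf.getmembers()
                      if m.name.endswith("/PKG-INFO"))
        return message_from_string(
            tf.extractfile(member).read().decode("utf-8"))

# meta = sdist_metadata("dist/example-1.0.tar.gz")  # hypothetical path
# print(meta["Name"], meta["Version"])
```

Being able to read this statically, without running any build backend, is a big part of what makes sdists more convenient than raw checkouts for consumers.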

How common is it that a pyproject.toml is never checked into the repo and only generated for the sdist? Is there a conceivable benefit to not including even a minimal build-system table in version control? To be fair, you can’t guarantee a sdist even exists either (e.g. for projects like Tensorflow), which I’d wager is unfortunately more common than esoteric build-time pyproject.toml generation. Never mind that you certainly can’t guarantee a pyproject.toml with a PEP 518 [build-system] table is in the sdist either, given the huge volume of pre-PEP 517 projects (though of course, there’s a defined way to handle that).

At least as of now, though, isn’t the presence of a PKG-INFO unstandardized and tool-specific, since the only thing the sdist “spec” actually specifies is a pyproject.toml (and, previous to that, a setup.py)?

I wasn’t aware PKG-INFO was ever checked into the repo, nor that it’s necessary for either sdist or wheel building so long as source metadata is available?

Good point, though again, I’m not suggesting using the repo but rather tarballs of the repo as of the release tag, which should always be an exact snapshot of the source from which the release was generated (barring a rather seriously flawed release process).

The one big exception is release tooling (e.g. setuptools_scm) that generates this from the Git metadata and doesn’t check the result into the repo at release time, but at least in the context I was asking about (conda-forge), it’s possible to work around by setting an env variable with the release tag during the build.
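Concretely, the workaround relies on setuptools_scm’s documented `SETUPTOOLS_SCM_PRETEND_VERSION` environment variable, which overrides VCS-based version detection when there’s no `.git` directory (as in a GitHub tag tarball). The build invocation below is a hypothetical sketch, not any particular recipe.

```python
# Sketch (hypothetical): build from a tag tarball with no Git metadata
# by pretending the version via setuptools_scm's env variable.
import os
import subprocess

def build_env(version):
    # Copy the current environment and inject the pretend version.
    return dict(os.environ, SETUPTOOLS_SCM_PRETEND_VERSION=version)

def build_sdist(srcdir, version):
    # e.g. what a conda-forge build script might effectively do
    subprocess.run(["python", "-m", "build", "--sdist", srcdir],
                   env=build_env(version), check=True)
```

The key point is that the version flows in from the packaging system’s knowledge of the release tag rather than from the (absent) VCS metadata.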

By C.A.M. Gerlach via Discussions on Python.org at 29Mar2022 00:45:

How common is it that a pyproject.toml is never checked into the repo and only generated for the sdist?

That’d be me you’re talking about here :slight_smile: - Cameron

How common is it that a pyproject.toml is never checked into the
repo and only generated for the sdist?

Many projects I’m involved in still haven’t settled on an approach
for PEP 517 adoption, and there are some who have pushed for
generating pyproject.toml at sdist build time from another
serialization format like YAML. Concerns were raised that committing
TOML files is at least a tacit acceptance of historical behaviors of
the TOML specification’s author, and people have long memories about
such debacles, not wishing to have their Git repositories and thus
their work associated with him. It can be explained away as just
another file format, but I’m not about to de-legitimize the sense of
offense or revulsion they’ve expressed.

Our workflow has always been to build the sdist with setuptools-scm, which happens to include the tests and docs (except in the cases where the tests must include dozens of megabytes of data), but to run the tests and build the docs from the Git tag after installing the package from the sdist (and wheel). This is especially pertinent because the built container images only have the installed library/app, so the tests are mapped in from the Git tag.

So if I understand this correctly, there are projects that developed, implemented and maintain a whole system for build-time generation of all the metadata, tool and build-system configuration in pyproject.toml files, simply because the file format pyproject.toml happened to use was originally created by someone (who, as far as I’m aware, has had no involvement or connection with its development for many years) who apparently did something bad a long time ago?

As far as treating people pretty horribly goes, Linus has a rather notorious record on that count, and is the active maintainer of Linux (on which Android and much else is based) and the original creator of Git. Stallman, the founder of the FSF and the GNU ecosystem and creator of GCC, Emacs, the GPL, etc., has said some pretty awful stuff too, including pretty recently. Theo de Raadt created OpenSSH as well as co-founding NetBSD and founding OpenBSD, the latter after he was kicked out of the former for abusiveness toward others. Unfortunately, the list goes on… By a much stricter standard of association, it would seem all of those projects should be proscribed, so I don’t understand why pyproject.toml is being singled out here.

If it’s really such a non-trivial problem, maybe @pradyunsg should rename it the TOML Obvious Minimal Language, severing any remaining association with its long-gone creator, which would be a much more meaningful step than supporting such tortuous workarounds.

I think this is the heart of the question: is it about treating people horribly as a general concept, or is it about the specific actions involved?

The folk I know that don’t contribute to various projects are not because of a general ‘treating people horribly is bad’ concept, but because the specific actions they encountered were harmful.

Not using a project created by someone that acted harmfully is a larger step, and one we rarely see, but I can understand the drive. I think the idea that rebranding would be enough to wash away the feelings people have regarding the project reflects a failure to empathize with those feelings and their origins.

Now, @fungi didn’t say that there was a whole project, or ecosystem of projects, existing just to avoid TOML files - he said

Which to me just indicates that folk have things working fine already with no strong motivation to add pyproject.toml files yet (e.g. they are using pbr), and that those folk weren’t involved in our discussion around format selection years ago, and are only now having to process the impact of that decision and figure out what they will do.

How about other assets? An increasing number of packages need JavaScript assets fetched via npm. These aren’t kept in the repo, but are added at build time of the sdist and wheel. An example is panel.

No. See Source distribution format - Python Packaging User Guide

Which to me just indicates that folk have things working fine
already with no strong motivation to add pyproject.toml files yet
(e.g. they are using pbr), and that those folk weren’t involved in
our discussion around format selection years ago, and are only now
having to process the impact of that decision and figure out what
they will do.

And to be clear, this isn’t about avoiding implementing
pyproject.toml support in PBR itself. You can already use pbr.build
as a PEP 517 build-system.build-backend in a pyproject.toml and have
been able to for a while. The problem is not a technical one, but
rather a social one where technical solutions sometimes have to take
people’s feelings and emotions into account.

But yes, because PBR made it possible to have declarative package
configuration more than a decade ago, a lot of the projects relying
on it don’t see much benefit from switching to a different
declarative configuration other than to merely meet the expectations
of some new standard, and often that’s not enough to outweigh other
less technical concerns.

There are other cases too, e.g. re-using the same repository for publishing multiple packages.

For example, I noticed that pybind11 uses the same repository for pybind11 and pybind11_global, tweaking the metadata/config before generating the sdist. There is an approach being studied in DO NOT MERGE: trying out setuptools PR by henryiii · Pull Request #3711 · pybind/pybind11 · GitHub that uses nox to “edit” pyproject.toml during the build process.
