In Nixpkgs we increasingly fetch from VCS because sdists don’t include the tests. It’s not a bad thing to build from the actual source instead of a derived artifact; however, this way PyPI does lose part of its value for us.
You can’t guarantee that you know how to build from a git checkout. With modern sdists and the expectation/requirement of having pyproject.toml and PKG-INFO in the sdist, we are moving towards a world where you can just run build on an sdist to get a wheel. Compare that to a git checkout, where you have no guarantee that pyproject.toml or PKG-INFO is in the repo at all, as they may be auto-generated as part of sdist creation. You also can’t rely on those files being up to date at any given point compared to the code in a git repo (e.g. is the version number changed per commit, ahead of time, or only at release time?).
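For what it’s worth, the “just run build on the sdist” step really is that mechanical nowadays. A minimal sketch, assuming the build package is installed and with a hypothetical sdist filename:

```python
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path

# Hypothetical sdist; any sdist shipping a PEP 517/518 pyproject.toml should
# build the same way.
sdist = Path("example-1.0.tar.gz")

with tempfile.TemporaryDirectory() as tmp:
    with tarfile.open(sdist) as tf:
        tf.extractall(tmp)
    # An sdist unpacks to a single top-level "name-version/" directory.
    (srcdir,) = Path(tmp).iterdir()
    # Equivalent to running "python -m build --wheel" inside the unpacked tree.
    subprocess.run(
        [sys.executable, "-m", "build", "--wheel", "--outdir", "dist", str(srcdir)],
        check=True,
    )
```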
How common is it that a pyproject.toml is never checked into the repo and only generated for the sdist? Is there a conceivable benefit to not including even the minimal build-system table in version control? To be fair, you can’t guarantee an sdist even exists, either (e.g. for projects like Tensorflow), which I’d wager is unfortunately more common than esoteric build-time pyproject.toml generation… never mind the fact that you certainly can’t guarantee a pyproject.toml with a PEP 518 [build-system] table is in the sdist, either, for the huge volume of pre-PEP 517 projects (though of course, there’s a defined way to handle that).
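That “defined way” is the PEP 517/518 fallback: if there is no pyproject.toml, or no [build-system] table in it, frontends assume the legacy setuptools backend. Roughly what a frontend does, sketched below (assumes Python 3.11+ for the stdlib tomllib):

```python
import tomllib  # stdlib in Python 3.11+
from pathlib import Path

# PEP 517/518 defaults used when no [build-system] table is present.
LEGACY_BUILD_SYSTEM = {
    "requires": ["setuptools", "wheel"],
    "build-backend": "setuptools.build_meta:__legacy__",
}

def resolve_build_system(srcdir: Path) -> dict:
    pyproject = srcdir / "pyproject.toml"
    if not pyproject.exists():
        # Pre-PEP 517/518 project: fall back to the legacy setuptools backend.
        return LEGACY_BUILD_SYSTEM
    data = tomllib.loads(pyproject.read_text())
    return data.get("build-system", LEGACY_BUILD_SYSTEM)
```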
At least as of now, though, isn’t the presence of a PKG-INFO unstandardized and tool-specific, since the only thing the sdist “spec” actually specifies is a pyproject.toml (and, previous to that, a setup.py)? I wasn’t aware PKG-INFO was ever checked in to the repo, nor that it is necessary for either sdist or wheel building so long as source metadata is available?
Good point, though again, I’m not suggesting using the repo but rather tarballs of the repo as of the release tag, which should always be an exact snapshot of the source from which the release was generated (barring a rather seriously flawed release process).
The one big exception is release tooling (e.g. setuptools_scm) that generates this from the Git metadata and doesn’t check the result into the repo at release time, but at least in the context I was asking about (conda-forge), it’s possible to work around by setting an environment variable with the release tag during the build.
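Concretely, that workaround looks something like the following; the version string stands in for whatever the release tag is, and it assumes setuptools_scm and the build package are available:

```python
import os
import subprocess
import sys

# Building from a tag tarball that has no Git metadata: tell setuptools_scm
# the version explicitly instead of letting it inspect the (absent) repo.
# "1.2.3" is a placeholder for the actual release tag.
env = dict(os.environ, SETUPTOOLS_SCM_PRETEND_VERSION="1.2.3")
subprocess.run([sys.executable, "-m", "build", "--wheel"], check=True, env=env)
```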
By C.A.M. Gerlach via Discussions on Python.org at 29Mar2022 00:45:
How common is it that a pyproject.toml is never checked into the repo and only generated for the sdist?
That’d be me you’re talking about here - Cameron
How common is it that a pyproject.toml is never checked into the repo and only generated for the sdist?
Many projects I’m involved in still haven’t settled on an approach
for PEP 517 adoption, and there are some who have pushed for
generating pyproject.toml at sdist build time from another
serialization format like YAML. Concerns were raised that committing
TOML files is at least a tacit acceptance of historical behaviors of
the TOML specification’s author, and people have long memories about
such debacles, not wishing to have their Git repositories and thus
their work associated with him. It can be explained away as just
another file format, but I’m not about to de-legitimize the sense of
offense or revulsion they’ve expressed.
Our workflow has always been to build the sdist with setuptools-scm, which happens to include the tests and docs (except in the cases where the tests must include dozens of megabytes of data), but to run the tests and build the docs from the Git tag after installing the package from the sdist (and wheel). This is especially pertinent because the built container images only have the installed library/app, and so the tests are mapped in from the Git tag.
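For anyone wondering what “generating pyproject.toml at sdist build time from another serialization format like YAML” might look like in practice, the gist is just a small render step. A sketch, assuming PyYAML and tomli-w are available and with illustrative filenames:

```python
import tomli_w
import yaml

# Keep the build configuration in YAML in the repo, and render pyproject.toml
# only when the sdist is built. Filenames are illustrative.
with open("pyproject.yaml") as f:
    config = yaml.safe_load(f)

with open("pyproject.toml", "wb") as f:
    tomli_w.dump(config, f)
```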
So if I understand this correctly, there are projects that developed, implemented and maintain a whole system for build-time generation of all the metadata, tool and build-system configuration in pyproject.toml files simply because the file format pyproject.toml happened to use was originally created by someone who apparently did something bad a long time ago (and who, as far as I’m aware, has had no involvement or connection with its development for many years)?
As far as treating people pretty horribly goes, Linus has a rather notorious record on that count, and is the active maintainer of Linux (on which Android and much else is based) and the original creator of Git. Stallman, the founder of the FSF and the GNU ecosystem and creator of GCC, Emacs, the GPL, etc., has said some pretty awful stuff too, including pretty recently. Theo de Raadt created OpenSSH as well as co-founding NetBSD and founding OpenBSD, the latter after he was kicked out of the former for abusiveness toward others. Unfortunately, the list goes on… By a much stricter standard of association, it would seem all of those projects should be proscribed, so I don’t understand why pyproject.toml is being singled out here.
If it’s really such a non-trivial problem, maybe @pradyunsg should rename it the TOML Obvious Minimal Language, severing any remaining association with its long-gone creator, which would be a much more meaningful step than supporting such tortuous workarounds.
I think this is the heart of the question: is it about treating people horribly as a general concept, or is it about the specific actions involved?
The folk I know who don’t contribute to various projects stay away not because of a general “treating people horribly is bad” concept, but because the specific actions they encountered were harmful.
Not using a project created by someone that acted harmfully is a larger step, and one we rarely see, but I can understand the drive. I think the idea that rebranding would be enough to wash away the feelings people have regarding the project reflects a failure to empathize with those feelings and their origins.
Now, @fungi didn’t say that there was a whole project, or ecosystem of projects, existing just to avoid TOML files - he said
Which to me just indicates that folk have things working fine already with no strong motivation to add pyproject.toml files yet (e.g. they are using pbr), and that those folk weren’t involved in our discussion around format selection years ago, and are only now having to process the impact of that decision and figure out what they will do.
How about other assets? An increasing number of packages need JavaScript assets fetched via npm. These aren’t kept in the repo, but are added at build time of the sdist and wheel. An example is panel.
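For a rough idea of the pattern (not how panel itself does it), one common setuptools approach is to fetch the asset in a build hook; the URL, package name, and paths here are entirely made up:

```python
import urllib.request
from pathlib import Path

from setuptools import setup
from setuptools.command.build_py import build_py


class FetchAssetsThenBuild(build_py):
    # Placeholder URL; a real project would pin and verify the asset.
    ASSET_URL = "https://example.invalid/bundle.min.js"

    def run(self):
        # Download the JavaScript bundle into the package before the normal
        # build step copies package data into the build tree.
        target = Path("mypackage/static/bundle.min.js")
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(self.ASSET_URL, str(target))
        super().run()


# Wire the hook in so it runs whenever the package itself is built
# (e.g. when producing a wheel).
setup(cmdclass={"build_py": FetchAssetsThenBuild})
```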
Which to me just indicates that folk have things working fine
already with no strong motivation to add pyproject.toml files yet
(e.g. they are using pbr), and that those folk weren’t involved in
our discussion around format selection years ago, and are only now
having to process the impact of that decision and figure out what
they will do.
And to be clear, this isn’t about avoiding implementing
pyproject.toml support in PBR itself. You can already use pbr.build
as a PEP 517 build-system.build-backend in a pyproject.toml and have
been able to for a while. The problem is not a technical one, but
rather a social one where technical solutions sometimes have to take
people’s feelings and emotions into account.
But yes, because PBR made it possible to have declarative package
configuration more than a decade ago, a lot of the projects relying
on it don’t see much benefit from switching to a different
declarative configuration other than to merely meet the expectations
of some new standard, and often that’s not enough to outweigh other
less technical concerns.
There are other cases too, e.g. re-using the same repository for publishing multiple packages.
For example, I noticed that pybind11 uses the same repository for pybind11 and pybind11_global, tweaking the metadata/config before generating the sdist. There is an approach being studied in DO NOT MERGE: trying out setuptools PR by henryiii · Pull Request #3711 · pybind/pybind11 · GitHub that uses nox to “edit” pyproject.toml during the build process.
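The underlying idea is simple even if the real tooling is more elaborate: rewrite the fields that differ between the two packages before building each variant’s sdist. A rough sketch (the actual pybind11 PR drives this from nox; this assumes the metadata lives in a [project] table, and needs tomli-w plus tomllib from Python 3.11+):

```python
import tomllib
import tomli_w

# Load the shared pyproject.toml checked into the repo.
with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

# Publish a second distribution from the same repository by rewriting the
# name before building that variant's sdist.
config["project"]["name"] = "pybind11_global"

with open("pyproject.toml", "wb") as f:
    tomli_w.dump(config, f)
```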
Another repository with many many packages in it: azure-sdk-for-python/sdk at main · Azure/azure-sdk-for-python · GitHub (most of these directories contain at least one PyPI package). You certainly can’t assume that a git checkout is as predictable as an extracted sdist.
By separating the two concepts we allow teams to follow a dev process that suits them, rather than enforcing a process on them just because we thought it was cute. They can keep build scripts, sources, and tests wherever they like, and use build steps to produce conformant[1] sdists.
[1] Whenever we produce a definition of what a “conformant” sdist is.
Ah, yes, I should have taken a moment to re-read the standard before running my mouth late at night and embarrassing myself, sorry. I keep forgetting there is another copy in the root alongside the non-standard .egg-info subdir.
On that note, is there any plan to offer an (automated or on-demand) migration tool from PBR’s bespoke format to the pyproject.toml [build-system], [project] and [tool.pbr] tables, as @abravalheri did with ini2toml for Setuptools’ setup.cfg? This would likely greatly ease the burden of the transition for users, and ensure that they can interoperate more easily with other packaging tools, build backends, CI services, etc., as well as have a much less painful path forward in the future if PBR is ever retired or they want to migrate to a different tool.
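For reference, the setup.cfg migration with ini2toml is already about this small; the call below reflects its documented interface as I understand it, so treat the exact signature as an assumption:

```python
from pathlib import Path

from ini2toml.api import Translator

# Convert an existing setup.cfg into pyproject.toml content using the
# "setup.cfg" translation profile.
setup_cfg = Path("setup.cfg").read_text()
pyproject = Translator().translate(setup_cfg, "setup.cfg")
Path("pyproject.toml").write_text(pyproject)
```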
No idea, but it isn’t a spec to say, “a source repo containing a Python project should contain X”.
…but it’s not our call to require that either.
Correct, although Source distribution format - Python Packaging User Guide is properly vague about requiring anything involving sdists. The key thing is it would be nice if tools started to include PKG-INFO.
It’s not, but this is all hypothetical and about what may be in an sdist and not a VCS repo.
That “should” is properly in italics; none of this is mandated or specified anywhere; it’s just convention.
Right; I was going off the PyPA Glossary, which says
Project
A library, framework, script, plugin, application, or collection of data or other resources, or some combination thereof that is intended to be packaged into a Distribution.
Since most projects create Distributions using either PEP 518 build-system, distutils or setuptools, another practical way to define projects currently is something that contains a pyproject.toml, setup.py, or setup.cfg file at the root of the project source directory.
But obviously, that’s not a normative specification, but rather a heuristic definition. And it clearly seems there are a non-trivial number of existing and future use cases for a project not having one, judging by the varied examples presented here.
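Taken literally, that heuristic is about this much code; purely illustrative, just to make the definition concrete:

```python
from pathlib import Path

def looks_like_project(path: Path) -> bool:
    # The glossary's practical heuristic: a "project" is a directory with one
    # of the conventional packaging configuration files at its root.
    return any(
        (path / name).is_file()
        for name in ("pyproject.toml", "setup.py", "setup.cfg")
    )
```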
On that note, is there any plan to offer an (automated or on-demand) migration tool from PBR’s bespoke format to the pyproject.toml [build-system], [project] and [tool.pbr] tables, as @abravalheri did with ini2toml for Setuptools’ setup.cfg? This would likely greatly ease the burden of the transition for users, and ensure that they can interoperate more easily with other packaging tools, build backends, CI services, etc., as well as have a much less painful path forward in the future if PBR is ever retired or they want to migrate to a different tool.
As a descendant of distutils2, PBR’s “bespoke format” is called
setup.cfg, and was eventually integrated into the SetupTools
project as well. At this point, you can probably also configure it
through the pyproject.toml support in SetupTools 61, though I’ve had
a busy week so haven’t had a chance to test that theory yet.
Certainly if SetupTools decides to drop support for setup.cfg
files, current PBR-using projects will need to take some action to
move their packaging configuration to a new format.
Ultimately, you don’t so much configure PBR as you configure
SetupTools and then take advantage of PBR’s automated integrations
to inject dynamic metadata into SetupTools, build the manifest,
create changelogs and authors files, or whatever else. Calling
pbr.build as your build-backend entrypoint in pyproject.toml is (at
the moment anyway) really just passing through to SetupTools with
the PBR plug-in loaded, since other mechanisms for engaging the
plug-in early enough to perform metadata injection raised errors
about calling setup.py directly (even though setup.py wasn’t being
called, SetupTools put the deprecation warnings in a code path that
ended up being traversed).
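To make the “you configure SetupTools and PBR injects the rest” point concrete, the classic PBR pattern is a setup.py that is little more than this (a generic sketch, not any particular project’s configuration), with all the static metadata living in setup.cfg:

```python
# setup.py
import setuptools

# PBR hooks into setuptools to supply the dynamic pieces (version from Git,
# ChangeLog, AUTHORS, the manifest); everything static lives in setup.cfg.
setuptools.setup(
    setup_requires=["pbr"],
    pbr=True,
)
```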
Ah; I was aware PBR used setup.cfg, but based on what you’d said and Setuptools’ documentation, I’d thought that some of the option names and syntax were different. I wasn’t aware it was compatible enough nowadays to be migratable with the same tooling (or perhaps at least a close derivative of it). Thanks!
Ah; I was aware PBR used setup.cfg, but based on what you’d said and Setuptools’ documentation, I’d thought that some of the option names and syntax were different. I wasn’t aware it was compatible enough nowadays to be migratable with the same tooling (or perhaps at least a close derivative of it). Thanks!
Both are correct really. PBR inherited setup.cfg from distutils2 and
both made decisions about how to normalize setuptools.setup()
parameters in declarative form. When SetupTools itself added
setup.cfg support years later its maintainers made a few different
choices (and also changed their minds on some at one point), so PBR
adjusted by aliasing the old keys to the ones SetupTools chose and
deprecating those earlier forms.
For the most part at this point, SetupTools ends up reading the
setup.cfg and consuming the metadata there directly anyway, so
there’s little involvement from PBR in order to handle the
declarative packaging pieces if used with a fairly modern SetupTools
version. It stays backward-compatible, though, so as to still be
usable on older platforms with contemporary toolchains, since many
of the projects which rely on it target “enterprise” deployments on
LTS distros (every proposed change is still tested for compatibility
with Python 2.7 and 3.6 on four-year-old platforms, for example).
That example is not a good one.
(1) It is not a secure git tarball, as it uses a tag or branch name. Those can be changed at any moment. The only viable way to link to a GitHub-generated tarball is to link to a name based on the full git hash. (That assumes git doesn’t allow 160-bit hash strings as tag and branch names - does it?)
(2) This example is piping the output from GitHub to a hash algorithm. What’s the point? There is zero guarantee that a .tar.gz file is in a canonical format; it is no such thing. File timestamps, owners, and ordering within it are all entirely arbitrary, as is the zlib implementation used to do the compression. You must not depend on the hash of a dynamically generated archive, in a format that was never designed to be canonical, staying constant; it is not.
IIRC, GitHub has had “bugs” around this in the past: changes in their internal infrastructure altered these generated archives, and they jumped through hoops to undo the “breakage” because other language ecosystems’ packaging systems had made the unfortunate design mistake of relying on this false assumption. That is a very bad thing to rely on. Do not trust GitHub’s infrastructure, or anyone else’s, to guarantee that an archive format not designed to be canonical is in fact canonical. You will be burned.
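If you want a hash that survives re-compression of the same tree, one option is to hash the extracted paths and contents in a fixed order rather than the archive bytes. This is purely illustrative (real tools such as the Nix fetchers have their own normalization rules), and the filename is made up:

```python
import hashlib
import tarfile

def tree_digest(archive_path: str) -> str:
    # Hash file paths and contents in sorted order so that timestamps,
    # ownership, member ordering, and the compressor used don't matter.
    digest = hashlib.sha256()
    with tarfile.open(archive_path) as tf:
        for member in sorted(tf.getmembers(), key=lambda m: m.name):
            if member.isfile():
                digest.update(member.name.encode())
                digest.update(tf.extractfile(member).read())
    return digest.hexdigest()

print(tree_digest("example-1.0.0.tar.gz"))  # hypothetical tarball
```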