What do we want in standardized sdists?

“Standardizing sdists” has been proposed as a potential Packaging Summit 2020 topic and I’m curious: What would further standardization aim to do and which problems would it solve?

My understanding is that PEP 517 already formalizes some of what an sdist should be, and I’m not currently aware of any major deficiencies in this area.


Or to put it in a very different manner: WHAT DID YOU DREAM ABOUT @brettcannon?! :upside_down_face:
(in jest, context: Packaging Summit 2020 - Potential topics for discussion)

Primarily it’s naming. For instance, if you look at the code in pip to find the version number from an sdist you will notice it is not exactly based on a spec. :wink: Compare that to wheels where it’s very obvious how to get the version from the file name. This came up when I was trying to create a package to consume the simple repository API and wanted to group files by version and realized you can only reliably do that for wheels.
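
To make the ambiguity concrete, here is a minimal sketch. The helper names are hypothetical (this is not pip's code), and it assumes a well-formed wheel file name and a `.tar.gz` sdist:

```python
def wheel_name_and_version(filename):
    """Split a wheel file name into (name, version)."""
    stem = filename[: -len(".whl")]
    # Wheel components never contain "-", so "-" is an unambiguous separator.
    name, version, *_ = stem.split("-")
    return name, version


def sdist_candidates(filename, ext=".tar.gz"):
    """Return every possible (name, version) split of an sdist file name."""
    stem = filename[: -len(ext)]
    parts = stem.split("-")
    # Without a spec, any "-" could be the boundary between name and version.
    return [("-".join(parts[:i]), "-".join(parts[i:])) for i in range(1, len(parts))]


print(wheel_name_and_version("my_pkg-1.0-py3-none-any.whl"))
print(sdist_candidates("my-pkg-1.0.tar.gz"))
```

The wheel gives exactly one answer, while the sdist gives two candidates and a consumer has to guess which one is right.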

There’s also the issue of inconsistent file extensions. Once again, if you look at pip it looks for .tar.gz, .tgz, .tar., and .zip. That isn’t specified anywhere (closest is “whatever PyPI won’t reject upon upload”). This came up due to the simple repository API consumption package work and realizing there isn’t actually a way to group files into “wheels”, “sdists”, and “other” as there is no definition of what an sdist file extension is.
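
A rough sketch of the grouping problem follows. `SDIST_EXTENSIONS` and `classify` are hypothetical names, and the extension list is purely an assumption because no spec defines it:

```python
# Assumption: no spec defines this tuple; it is a guess at "what an sdist looks like".
SDIST_EXTENSIONS = (".tar.gz", ".tgz", ".zip")


def classify(filename):
    """Bucket a file from a simple repository index page by its extension."""
    lowered = filename.lower()
    if lowered.endswith(".whl"):
        return "wheel"  # the one extension that is actually specified
    if lowered.endswith(SDIST_EXTENSIONS):
        return "sdist"  # a guess: a .zip need not be an sdist at all
    return "other"


for name in ("demo-1.0-py3-none-any.whl", "demo-1.0.tar.gz", "demo-1.0.win32.exe"):
    print(name, "->", classify(name))
```

Only the `.whl` branch rests on a spec; everything else depends on a tuple somebody had to make up.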

Due to the lack of clear file name extension support, there’s also no clearly defined preference around archive format and/or compression. Once again, it currently is “whatever PyPI takes”, which isn’t tightly spec’ed anywhere. E.g. the default is .tar.gz but I believe .zip is still allowed. It also doesn’t allow for using e.g. zstd for those who may want the better performance. IOW it’s flexible in that it lets you choose a tarball or a zipfile, but restrictive in that it’s only those two options, for no specific reason, with no alternatives let in. This came about because I realized that we have no support for zstd in the packaging ecosystem, and yet we also let sdists be either of a couple of possibilities, which felt oddly flexible yet restrictive all at once.

After that an sdist is a source archive that pip can build. Once again, not a spec. If we said “that has a pyproject.toml” at least that’s enough of a spec to actually be able to do something based on PEPs. Or if that was amended to “pyproject.toml or a setup.py where python setup.py bdist_wheel works with setuptools and wheel installed and there will be a wheel in the dist directory” that’s still more of a spec than what we have now.
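
As a sketch of what even that minimal spec would let a tool check: the function names below are hypothetical, and the code assumes the archive is a tarball with a single top-level directory, which nothing currently guarantees.

```python
import tarfile

BUILD_FILES = {"pyproject.toml", "setup.py"}


def has_build_file(member_names):
    """True if pyproject.toml or setup.py sits directly inside the top-level directory."""
    for name in member_names:
        parts = name.strip("/").split("/")
        # Assumed layout: <top-level-dir>/<file>
        if len(parts) == 2 and parts[1] in BUILD_FILES:
            return True
    return False


def tarball_has_build_file(path):
    """Apply the same check to a tarball on disk."""
    with tarfile.open(path) as archive:
        return has_build_file(archive.getnames())
```

Something like `tarball_has_build_file("demo-1.0.tar.gz")` then answers “is this plausibly an sdist?” without running any build.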

I want basically:

  1. Standardised naming convention (like wheels) encoding project name and version.
  2. Standard archive format (to allow easy handling in code that consumes sdists).
  3. A standard location and format for the metadata within the archive.
  4. A standard location for a PEP 517 “Source tree” - legacy (setup.py) or standard (pyproject.toml) - within the archive.

It would also be nice to encode in the filename whether an sdist supports Python 2 and 3, or just Python 3. I’m thinking of this as a solution to the issue of projects that want to desupport Python 2, but can’t use data-requires-python because they need to support --find-links or other index formats that don’t handle that tag, and can’t publish .py3 wheels because that would just cause a fallback to the sdist. Basically a simplified version of platform tags, suitable for sdists.

(on mobile; wait, it’s 1am!?)

PEP 517 does touch on naming and format (in the section I linked to). Copy pasting is tricky for me - but I think that counts as standardisation.

WRT 3, I think that’s making sdist metadata “reliable” which has its own topic earlier - and is definitely worth further discussion.

WRT 4, is there any situation where it’s not the root of the directory?

WRT requires-python, I think that’s partly a gap in pip’s functionality, since requires-python is available in the metadata of the distribution and we should start considering it during dependency resolution (instead of only when available during the discover-from-index stage).

I’d forgotten about that. Yes, building on that would be good.

Yes, I think there is. Some existing sdists are a tar.gz containing a dist subdirectory containing a tar file containing the project directory containing the source tree. Here is a random example I happened to have on my PC. It might even be that this is the standard format that setup.py sdist generates. Let’s try to move away from this to something more sensible :roll_eyes:

Partly, but maybe having that information available would help other tools (that don’t need the complexity that pip has). I’m not wedded to the idea, just think it might be worth discussing. Wheels have it, and the fact that sdists don’t does make its existence in the wheel spec less useful.

Woah. That sounds rough but I haven’t ever seen this.

I do not know if pip’s unpacking code can even handle things like this, so I don’t even know if pip can install from something like this.

And looking at https://files.pythonhosted.org/packages/7d/db/efb104b26bffa7e729ab05d8275c50b3a48e9c9456cf9e34c90a2a7dd4dd/cs.binary-20191230.3.tar.gz, I don’t see the structure you’re describing.

Is PEP 517’s prepare_metadata_for_build_wheel sufficient as a standard mechanism to obtain metadata from an sdist?

Hmm, if I do gzip -l, I don’t see it either. Maybe it’s an artefact of 7-zip, because if I open the file in 7-zip, that’s the structure I see :frowning:

It is extractable using Python’s standard tarfile module, though, so it’s cosmetic rather than crucial.

That does a build step. I’m talking about standardising the PKG-INFO file in a (current) sdist, so that we can download and extract metadata without a build step. This has its own issues, as we get into dynamic metadata issues (setup.py could in theory generate a different version number depending on whether you’re building a wheel or an sdist!!!) but part of standardisation would be to declare such shenanigans as invalid/unsupported.

Basically, either make it so people can rely on PKG-INFO or remove it. The current “present but not reliable enough to use” state is not helpful.
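
For illustration, a minimal sketch of “extract metadata without a build step”. `read_pkg_info` is a hypothetical helper, and it assumes a tarball with PKG-INFO directly inside a single top-level directory (where current sdists usually have it, though no spec says so):

```python
import tarfile
from email.parser import HeaderParser


def read_pkg_info(fileobj):
    """Return the parsed PKG-INFO headers from an sdist tarball, or None if absent."""
    with tarfile.open(fileobj=fileobj) as archive:
        for member in archive.getmembers():
            parts = member.name.split("/")
            # Assumed layout: <name>-<version>/PKG-INFO
            if member.isfile() and len(parts) == 2 and parts[1] == "PKG-INFO":
                raw = archive.extractfile(member).read().decode("utf-8")
                # PKG-INFO uses RFC 822 style headers, so the email parser can read it.
                return HeaderParser().parsestr(raw)
    return None
```

With a file opened in binary mode, `read_pkg_info(f)["Version"]` would give the version with no build backend involved. Whether that value can be trusted is exactly the open question.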

Does it require a full build? I had understood it as a lightweight mechanism to obtain the metadata without doing the full build, and as such a reasonable mechanism to use during the resolution phase.

Agreed.

Let’s move the metadata-in-sdist discussion to “Why isn't source distribution metadata trustworthy? Can we make it so?”.

PEP 517 (“A build-system independent format for source trees”) seems to have the details, although they are a bit light, e.g. there’s no specification that escaping the package name should match the escaping done for wheels (although it seems to be implied by the sentence, “an sdist named {NAME}-{VERSION}.{EXT} will generate a wheel named {NAME}-{VERSION}-{COMPAT-INFO}.whl”). And that is definitely not happening in the wild, as there are plenty of projects with a - in their name, which is kept as-is in the sdist file name but escaped appropriately in the wheel file name.
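
A small sketch of the mismatch: `wheel_escape` follows the escaping rule given in the wheel spec (PEP 427), while `unescaped_sdist_filename` is a hypothetical helper that only mimics what shows up in the wild, assuming a `.tar.gz` extension:

```python
import re


def wheel_escape(component):
    """Escape a wheel file name component as described in PEP 427."""
    return re.sub(r"[^\w\d.]+", "_", component)


def unescaped_sdist_filename(name, version):
    """Mimic the common real-world sdist name, which keeps "-" in the project name."""
    return f"{name}-{version}.tar.gz"


name, version = "my-pkg", "1.0"
print(unescaped_sdist_filename(name, version))
print(f"{wheel_escape(name)}-{wheel_escape(version)}-py3-none-any.whl")
```

The two file names disagree on the `{NAME}` part, so the sentence quoted from PEP 517 cannot hold literally for such a project.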