PEP 625: File name of a Source Distribution

Following discussions in Pip download (just the source packages, no building, no metadata, etc.), I’m proposing a PEP that minimally specifies an sdist, covering only its file name, so pip and other installers have something to expect, without going into the details involved in sdist metadata consistency. The proposed naming scheme is based on existing, widely used conventions, and any project that works with standard Python packaging tools right now should be able to follow it.

@pf_moore would you mind being listed as an author? I copied some of your words from the pip download thread directly into the PEP.

The PEP is available in rendered form at https://www.python.org/dev/peps/pep-0625/

PEP: 625
Title: File name of a Source Distribution
Author: Tzu-ping Chung <uranusjr@gmail.com>,
        Paul Moore <p.f.moore@gmail.com>
Discussions-To: https://discuss.python.org/t/draft-pep-file-name-of-a-source-distribution/4686
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 08-Jul-2020
Post-History: 08-Jul-2020

That’s fine with me. In which case, though, we probably need an alternative PEP-delegate to pronounce on the PEP. Would anyone be interested in taking that role?

Would allowing more file-formats be considered in the future? If so, could that be mentioned in the PEP so implementers can be aware of that possibility?

Assuming file formats mean archive formats, it is indeed a possible consideration. I do not intend to explicitly call it out, however, since personally I think it’s not a good idea, at least not for the current usages of Python distributions. Gzipped tarballs are almost universally used at this point, and as far as I am aware, nobody has ever expressed any concerns about the format. This can change when new usages appear, but it is outside the scope of this PEP.

Note that PEP 517 defines an sdist as being .tar.gz format here - it even goes into specifics about the precise details of the format. So I would not expect to support other formats of sdist in the near future, and doing so would likely need some fairly substantial changes to both tools and specs.

Ooh, I did not realise it went into such details specifying the sdist archive format. Maybe I should revise the PEP to just mention PEP 517 when talking about the archive format.


Note: We’ll need to amend PEP 517 a bit as well, since it currently mandates an sdist to use the .tar.gz extension.

I really don’t like the idea of using .sdist as the name, when these are actually tarballs (I also don’t love the idea of ruling out .zip, since zip archives have some nice properties like “it’s easy to extract a single file from a zip file”). It will add some friction and confusion and I don’t think it solves a real problem. The rationale for rejecting “use a common sdist naming scheme” is listed as:

(Note also there’s a typo in there: s/arhieve/archive)

However, the motivation mainly talks about things like pip download, which will never give you a tarball that is not a source distribution. In what situation is someone likely to encounter a tarball named exactly like an sdist like this that they must be able to know that the tarball is an sdist without unpacking it to look inside and check? Without a situation where this sort of confusion is common, I don’t think it’s worthwhile to migrate every build tool over to a totally new naming convention that obscures the actual format of the data.

At the very least, I think we should use {distribution}-{version}.sdist.{tar.gz|zip} or something of that nature.

Even ignoring the odd naming convention, I also think that this is premature. I sympathize with the desire to keep the scope of the migration to standardized sdists tight, but I think standardizing the name before you standardize the contents may be a mistake. The reason for this is that if we standardize the contents of the source distribution before standardizing the name and we change the naming convention, we can use the change in the naming convention as an indication that the tarball in question is compliant (or intending to be compliant) with the new spec. If we change the naming convention first, you’ll be forced to unpack every source distribution using the new convention in order to tell if it’s using the new spec.


That ship sailed with PEP 517 and the build_sdist hook, though. Obviously we can revisit any standard, but the point is that it’s not this PEP that’s requiring compressed tar format.

The logic here is precisely the same as the logic for wheels. We didn’t name wheels with a .zip extension so that we could define a standard for the whole filename, and we didn’t argue “you can look at the content” for the same reason for wheels that we do here - tools like pip’s resolver routinely scan and discard many, many files¹ and downloading and opening each one is too costly.

Currently pip assumes files from an index have the format NAME-VERSION.tar.gz but we have to include checks that the assumption wasn’t wrong. And we can’t handle a failure of that assumption particularly gracefully, and we also can’t make that assumption for all .tar.gz files that we encounter, because some may be archives of “source trees” (the PEP 517 term), not actual sdists.
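
A minimal sketch of why that assumption needs checks, using a hypothetical helper `guess_name_version` (this is not pip's actual code). It splits on the last hyphen, which copes with hyphenated project names, but it also happily "parses" an archive of a source tree whose trailing component is not a version at all.

```python
def guess_name_version(filename):
    """Best-effort split of NAME-VERSION.tar.gz (hypothetical helper)."""
    suffix = ".tar.gz"
    if not filename.endswith(suffix):
        return None
    stem = filename[: -len(suffix)]
    # Project names may contain hyphens, so split on the LAST one.
    name, sep, version = stem.rpartition("-")
    if not sep or not name or not version:
        return None
    return name, version


# A typical sdist name splits as expected.
print(guess_name_version("zope-interface-5.1.0.tar.gz"))
# An archive of a source tree also "parses", but "master" is not a version,
# so the caller still has to validate the result.
print(guess_name_version("my-project-master.tar.gz"))
```

The file name alone cannot distinguish the two cases today, which is the gap a standardised name is meant to close.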

In my view, this is an 80% solution. By allowing tools to reliably get project name and version for an sdist, and know that they can be relied on (in the sense that building from that sdist will always result in a distribution with that name and version) I believe we’d address at least 80% of the use cases for getting metadata from an sdist². It’s also something we’ll need to standardise when we finally define a full standard for sdists, so we don’t lose anything by doing this part now.

Versioning is something we’ll always need to consider for a new sdist format. So why can’t we take the following approach here:

  1. sdist version 0 - file name is the only standardised thing. If a file has the standard name, it conforms to the sdist version 0 spec, which tells you how you interpret that filename.
  2. sdist version 1 - standard metadata, in a newly-defined file. If a level 0 sdist has that file, it’s at least level 1, and the file itself can contain a version number to allow further versions.

When defining the metadata filename we use for level 1 and above, all we need to do is choose a name that doesn’t clash with anything that might be generated in a current-format sdist. Honestly, that shouldn’t be too hard³.
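
The two-level scheme above could be sketched roughly as follows. Everything named here is an assumption for illustration: the metadata file name `SDIST-METADATA`, its `Sdist-Version:` line, and the helpers `make_sdist` and `sdist_level` are hypothetical, since no name or format has been chosen.

```python
import io
import tarfile

METADATA_NAME = "SDIST-METADATA"  # hypothetical; no name has been decided


def make_sdist(root, files):
    """Build an in-memory .tar.gz with every file placed under `root`/."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(f"{root}/{name}")
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def sdist_level(archive_bytes, root):
    """Return 0 if the metadata file is absent, else the declared level."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tf:
        try:
            member = tf.extractfile(f"{root}/{METADATA_NAME}")
        except KeyError:
            return 0  # standard name only: "version 0"
        if member is None:
            return 0  # present but not a regular file
        text = member.read().decode("utf-8")
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sdist-version":
            return int(value.strip())
    return 1  # file exists but declares nothing: at least level 1


old = make_sdist("pkg-1.0", {"PKG-INFO": b"Name: pkg\n"})
new = make_sdist("pkg-1.0", {METADATA_NAME: b"Sdist-Version: 2\n"})
print(sdist_level(old, "pkg-1.0"), sdist_level(new, "pkg-1.0"))
```

Note that this check still opens the archive; the file name is what lets a tool skip that step for the name and version.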

¹ Originally pip’s new resolver had a complexity limit of 100 backtracking operations - which correlates to “at least 100” files checked and discarded. We dropped that limit because it was way too low.
² And we could get at least 15% out of the remaining 20% if we could get dependency data - but fixing that at “build the sdist” time seems to be much more controversial.
³ Or we could extend this PEP to mandate such a file that contains nothing but “sdist version = 0”. But I think that’s being unnecessarily cautious. Existing sdists aren’t standardised but they do have a sufficiently well defined format that we can invent a new metadata file name that won’t clash.

The two naming choices are rejected for different reasons. The first paragraph only applies to the {distribution}-{version}.tar.gz scheme. .sdist.tar.gz was rejected because it would break all the existing tools due to how sdist names are currently parsed, as described in the PEP.
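
To illustrate the breakage, here is a sketch of a parser written for the existing convention. `legacy_parse` is a hypothetical stand-in, assumed to behave like typical tools that strip `.tar.gz` and split on the last hyphen; real tools may differ in detail.

```python
def legacy_parse(filename):
    """Parse NAME-VERSION.tar.gz the way an existing tool plausibly does."""
    stem = filename[: -len(".tar.gz")]
    name, _, version = stem.rpartition("-")
    return name, version


# The current convention parses cleanly.
print(legacy_parse("example-1.0.tar.gz"))
# With the extra suffix, ".sdist" leaks into the version string.
print(legacy_parse("example-1.0.sdist.tar.gz"))
```

Under this assumption the second call yields the version `1.0.sdist`, so an unmodified tool would silently misread every file using the new suffix.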

Paul already discussed the .tar.gz extension in depth. The only thing I want to add is that there is currently nothing that guarantees you’re getting an sdist for .tar.gz. Yes, that’s the case almost all of the time, and installers already assume that. But we are far from being able to guarantee that. At the very least (if we ignore all the historical stuff), PEP 503 will need to be extended to define what file formats a Simple Repository can serve.

To be honest, I dislike the .sdist suffix as much as you do, and would definitely be in favour of using .tar.gz if that’s possible. It reads well, and requires less work all around. The problem is, I don’t see how that is possible at all.

Wheels are zip files and we still name them .whl. IMHO, correspondingly having .sdist is fine.


I was actually going to propose moving to an .sdist extension anyway, so I’m glad @uranusjr beat me to it. :slight_smile:

I will say that I fall into the “make a zip” camp on this. It’s always bothered me that we have two archive formats between sdists and wheels for no reason other than historical accident. But if we are changing the file extension then I don’t see why asking tools to know that an sdist is a zip file is asking too much when they probably already deal with zip files thanks to wheels. It will also make working with sdists easier externally as zip tools are much more widely available than tar.


We will need to update PEP 517 anyway to accommodate the new extension, so I feel it is plausible to define the .sdist extension to mean a zipball containing source (structure TBD). Does anyone have insights on how this would affect backends (in particular, setuptools)?

I mentioned it in the other thread that’s going on, but I think that it would be great if we started… rebranding? the name of sdist. It’s kind of an awkward name to pronounce, and I think that the fact it feels wholly distinct from wheels makes it harder for people to self-discover that these are all pieces of the same pie and not wholly distinct things.

Thus I think it would be great if we actually treated sdists as a whole as this weird legacy format and started to transition to “source wheels” which used an extension like .src.whl or .swhl or something. This makes it much easier to handle the transition IMO, similarly to how .whl made it easier to handle the transition from eggs to wheels.

However, that doesn’t exactly mean that this PEP is wrong; I think it just needs to decide what exactly it is trying to do, and stick with that. If it’s just trying to make a small tweak to the existing sdist format to make it easier to parse, etc., then I think it’s fine to just do the very small thing and keep the .tar.gz extension, keep the sdist name, etc. and just provide a documented standard for what we expect the name to be. If we start adding new extensions and such, then I think we should go all in on a real new alternative and weave in the ideas like in Sdist idea: specifying static metadata that can be trusted and talk more about what we actually want our source format to look like.

Fundamentally changing the extension is going to come with a migration cost, and I don’t think we should pay that cost unless it comes with real benefits for end users, not just making it moderately easier for tools to parse the file names.

In other words, I think the middle ground that this PEP tries to tread ends up being a less ideal outcome than either just standardizing what we currently have OR going all in on defining a standardized source format.


I feel rebranding is a good idea. But wheel is already fairly established as the term for binary distributions, and it’d be extremely confusing if we extended it to mean something else. So if we’re rebranding, I’d suggest inventing something completely different, like .curd (because wheel). But again, no matter the name, that would make things easier since we can say everything that exists is not that new thing.

In other words, I’m perfectly fine if we make this PEP specify something that the current sdist is not; whatever that is would be decided later. But I also feel that does not really affect much of what currently exists in the PEP, since as Paul mentioned earlier upthread, whatever goes inside right now can be effectively treated as “metadata version 0” and be entirely revamped later. It’d be good enough as long as we can decide on the extension and archive format (at least no-one has issues with the file stem, yay).


In that case, I strongly suggest we use this PEP for documenting the status quo and go from there.

Also please file a PR against packaging.python.org adding a specification for the format, and reference that in the PEP – we’d want the specification on packaging.python.org to serve as the canonical source, with the PEP justifying the creation of the specification (in other cases, it would be the changes to the specification).


4 posts were split to a new topic: How to propose new specs

So it seems people are on board with “rebranding.” Here’s my summary of the proposed approach:

  • All existing source distributions are left as-is. The term sdist will continue to be used for them.
  • We invent a new source distribution format. I’ll call it Curd until someone provides a better name.
  • A Curd is a ZIP archive named {name}-{version}.curd, with the same name and archive format as specified by PEP 427 Wheels.
  • The content of a Curd archive is not specified at this time. The current (unspecified) content layout would be called “Curd Format 0.” Installation tools are expected to handle a Curd’s content the same way as a ZIP sdist.
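
A rough sketch of the file-name half of the summary above, assuming "the same name ... as specified by PEP 427" means the wheel escaping rule (runs of characters other than alphanumerics, underscore, and `.` become `_`). The helpers `curd_filename` and `parse_curd_filename` are hypothetical names, and `.curd` is still only a placeholder extension.

```python
import re


def _escape(component):
    # PEP 427-style escaping: collapse runs of other characters into "_".
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)


def curd_filename(name, version):
    """Build {name}-{version}.curd with both components escaped."""
    return f"{_escape(name)}-{_escape(version)}.curd"


def parse_curd_filename(filename):
    """Return (name, version), or None if the name does not fit the scheme."""
    suffix = ".curd"
    if not filename.endswith(suffix):
        return None
    parts = filename[: -len(suffix)].split("-")
    # After escaping, neither component contains "-", so exactly one
    # hyphen separates name from version.
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


fname = curd_filename("zope-interface", "5.1.0")
print(fname, parse_curd_filename(fname))
```

Because escaping removes hyphens from both components, the split is unambiguous, unlike the NAME-VERSION.tar.gz guesswork that installers perform today.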

Here are the undecided parts:

  1. Are there other name/extension proposals besides Curd/.curd?
  2. PEP 517 needs to be extended to produce a Curd. Do we reuse build_sdist, or invent a new function?
  3. Should PEP 517 itself be revised to include the functionality, or is a new PEP required?
  4. Do the above two points need to be done before or with PEP 625? Or should they be done after it is accepted?

I will draft and submit required PRs once we have conclusions to the above points.


I unfortunately can’t think of a Monty Python reference, so +0 on “curd”.

New concept, new function.

PEP 517 is still marked as provisional so it can be updated.

If we are considering sdists as a separate, old concept I don’t see why it can’t be done independently.

We should probably at a minimum mandate some mechanism by which to version these files yea? Wheels have the WHEEL file inside of the .dist-info directory that provides some basic metadata about the wheel itself (not about whatever was packaged inside it).

Here’s what that file looks like for pip’s latest release:

Wheel-Version: 1.0
Generator: bdist_wheel (0.34.2)
Root-Is-Purelib: true
Tag: py2-none-any
Tag: py3-none-any

It seems reasonable to have some file called like, CURD or something that has a structure basically like:

Curd-Version: 1.0
Generator: bdist_curd (0.1)

Or so.
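
For illustration, a file shaped like that could be read with the standard library's `email.parser`, since it uses the same `Key: value` header layout as the WHEEL file. The field names come from the example above; the helper `read_curd_metadata` and the rule of requiring `Curd-Version` are assumptions, not part of any spec.

```python
from email.parser import Parser


def read_curd_metadata(text):
    """Parse header-style metadata into a small dict (hypothetical helper)."""
    msg = Parser().parsestr(text)
    raw_version = msg["Curd-Version"]
    if raw_version is None:
        raise ValueError("missing Curd-Version field")
    # Keep the version as a tuple of ints so it compares numerically.
    version = tuple(int(part) for part in raw_version.split("."))
    return {"version": version, "generator": msg["Generator"]}


sample = "Curd-Version: 1.0\nGenerator: bdist_curd (0.1)\n"
print(read_curd_metadata(sample))
```

An installer could then compare the parsed version against the versions it understands before touching the rest of the archive.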

The biggest thing that would need to be decided is where to actually put that file and/or what to call it. We could just put it at the root of the archive (there is precedent for this with the PKG-INFO file in sdists), though there is a curd project already, so that would mean they couldn’t have a curd directory at the top level of their sdist. We could maybe do something like .CURD or something so it gets hidden by default and also won’t clash with projects named curd.
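
One way to picture the `.CURD` option, as a sketch only: the marker sits at the root of a ZIP archive next to a package directory that happens to be named `curd`. The archive layout, the marker name, and the helpers `build_archive` and `read_marker` are all assumptions for illustration.

```python
import io
import zipfile

MARKER_NAME = ".CURD"  # hypothetical; one of the names floated above


def build_archive(files):
    """Create an in-memory ZIP from a {path: text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path, text in files.items():
            zf.writestr(path, text)
    return buf.getvalue()


def read_marker(archive_bytes):
    """Return the marker file's text, or None if the archive has none."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        try:
            return zf.read(MARKER_NAME).decode("utf-8")
        except KeyError:
            return None


archive = build_archive({
    MARKER_NAME: "Curd-Version: 1.0\n",
    "curd/__init__.py": "",  # a project named "curd" keeps its own directory
})
print(read_marker(archive))
```

ZIP member names are case-sensitive, so `.CURD` and `curd/` coexist inside the archive; whether they stay distinct after extraction to a case-insensitive file system is a separate question that the leading dot also sidesteps.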

I dunno? Seems like something we should do at least.