PEP 625: File name of a Source Distribution

uranusjr · July 16, 2020, 5:32pm

That would imply we need to have some minimal plan on the archive layout. I don’t object; it seems like we’ll need a .dist-info directory at some point anyway, and it makes sense to have a CURD file in there.

dstufft · July 16, 2020, 5:42pm

Well, like I said, we could just put it at the root, which is basically the most minimal layout possible; namespacing it inside of a .dist-info directory is somewhat nicer for sure, though. It gets a little weird if we end up using pyproject.toml at the root as our metadata file inside of a curd, since we’d have a .dist-info directory with only CURD inside it, but if that’s the extent of our problems then we’re in a pretty good place.

brettcannon · July 16, 2020, 7:55pm

One interesting side-effect of putting something like a CURD file in .dist-info is that, if tools don’t strip it out, it gives a bit more provenance for how the wheel was ultimately created, since it records that the wheel went through a curd.

pganssle · July 16, 2020, 8:37pm

I have to say I kinda hate the name curd (sounds gross to me), and I also think that people will be confused by this re-branding if it doesn’t include any clear indication in the name that it’s a source distribution. I think most people think that wheel files are an opaque format and don’t realize that they’re basically zip files.

A few potential consequences of people not understanding that “curd” means “source distribution”:

People won’t realize that installing a curd executes arbitrary code (setup.py).
People who “always build from source” may increasingly build from the repo instead of the source distribution.
People will be confused as to why there are “curds” and “wheels” (no matter how much we try to educate them — how many times has Brett written concise and eloquent blog posts or given concise blurbs on podcasts about pyproject.toml and still we’ve got universal confusion on that point). There’s a good chance that a good fraction of people will think, “Oh, curds is the new one, I should use that!” and not even distribute wheels.

Once again I strongly disagree with this. We should be silently upgrading people’s source distributions in the background so that they have actual reliable metadata. Everything we’re talking about is essentially backwards-compatible with existing source distributions with the possible exception of changing the file extension (which is another reason to avoid that). I think we will have made a grave mistake if we put ourselves in a situation where someone would feel the need to generate both an old-style source distribution and a new-style source distribution.

pf_moore · July 16, 2020, 9:20pm

I agree with this. In an attempt to stop “curd” getting traction merely by repetition, how about “Truckle” (file extension .trk) by analogy with “wheel”?

The rest of your post, I’ll have to think about. But in principle, if we can make it work that we keep the name “sdist”, keep the existing build_sdist hook, and still standardise the filename in a way that lets tools like pip take a filename and reliably determine project name and version, and the fact that it’s an sdist, just from that filename, then I’d be fine with that.

But I’m not sure we can, because pip (for example) allows arbitrary archives of source trees - which are definitely not sdists - as input.

In other words, I think we need a dedicated file extension at a minimum. I’m willing to concede that it might be possible to discard the “rebranding”, though.

ofek · July 16, 2020, 10:54pm

I also am not a fan of .curd.

Based on the last few posts in Purpose of an sdist, what about calling the new format Intermediate Wheel? The extension would be .in, matching the convention for input files to dependency resolvers.

pradyunsg · July 17, 2020, 4:16am

curd

Uhm… Well… I really don’t want us to go too far down the rabbit hole of bikeshedding the heck out of this name but let’s not call it this during the rest of this discussion.

“source wheels” is a neutral-enough name, so I propose we use that for now (and bikeshed the name in a separate thread).

pradyunsg · July 17, 2020, 4:30am

A few questions:

Is the expectation that this would just be a bunch of files to extract and process as a directory, according to PEP 517? Or that we’d actually have some metadata associated with this (other than name + version from filename)?
What would the .dist-info folder contain, and what are the guarantees around the contents of that directory?
Do we want to provide random-access I/O to the pyproject.toml file (relevant if we plan on adding a second tier of compression, as was discussed for the wheel format)?

uranusjr · July 17, 2020, 5:22am

Yes, consumers are supposed to unarchive and use PEP 517 to get wheel metadata. This PEP specifically does not associate any information (except the CURD file Donald suggested) beyond the file name.
There is no guarantee a .dist-info directory would exist in the current format.
Since nothing in the archive is promised (again, except the file Donald suggested), there’s no guarantee of random access to anything inside it.

sbidoul · July 17, 2020, 6:33am

I think we should/can discuss the file naming separately from its content.

I’ve not had time to dig into that part of the pip code, but a quick test shows that pip download -f directory "pkg>=1", where directory contains pkg-0.1.tar.gz and pkg-1.0.tar.gz, does what we expect, i.e. it selects the correct version. Only then does it unpack to get metadata (for dependencies).

That part seems to work just fine both for sdists and tarballs (pip does not actually distinguish between them).

Since to get dependencies pip needs to unpack anyway, it can look into the archive to find some sort of marker that tells it what metadata can be considered as static.

So which part of pip would change with a new naming scheme?

uranusjr · July 17, 2020, 7:04am

What if dependencies are not needed (e.g. pip download --no-deps)? Since we never specified how an archive should be named (more on this later), pip always needs to unpack to actually make sure the name and version are correct. PEP 517 finally mandates how an sdist should be named, but does not offer a way to distinguish PEP 517 and legacy sdists, so pip still needs to unpack to make sure the downloaded sdist actually follows the naming rules. The main goal of this PEP is to give source archives a distinguishable name so pip can avoid this step.

sbidoul · July 17, 2020, 7:29am

Unpacking is cheap compared to building, and we could argue that the practical cases where unpacking is not needed are marginal.

If there is a way to know which metadata is static after unpacking, so we can avoid prepare_metadata in the happy path, I think it is a sufficient improvement for now.

I fear the disturbance coming with a new naming scheme, a new vocabulary etc, for sdists is too big, compared to the actual benefit we’d get out of it.

uranusjr · July 17, 2020, 7:34am

I agree the benefits are marginal, and everything can be solved if there’s a way to get static metadata—except we’ve already spent months on that exact issue and the solution is nowhere in sight. Why can we not have both? It would not hamper the static metadata discussion, can be achieved relatively quickly, and has non-zero benefits even after static metadata is achieved.

pf_moore · July 17, 2020, 9:17am

Pip needs to build to do this. Technically, it calls prepare_metadata_for_build_wheel (in pip terms, it “prepares” the install requirement), but that involves a lot of the costly bits of a build.

Agreed. For me, that’s the major improvement.

I would also like to be able to avoid prepare_metadata_for_build_wheel where possible when collecting dependencies, because that’s the other bit of metadata the resolver uses to decide what files we’ll actually build. The current view seems to be that this isn’t always going to be possible - but I’d still like to see the “easy” cases covered statically:

Confirm that a project has no dependencies (and building won’t add any).
Where a project can specify dependencies without needing a build, let them do so along with an assertion that they won’t be modified when building.

Yes, the fallback will always be to run prepare_metadata_for_build_wheel, but as you say, unpacking is cheap compared to building and IMO we need to avoid that “prepare” step (which is often effectively “build”) as much as we can.

Ideally, we’d measure the slowdowns involved here to inform our approach, but unless someone is willing to do that I think “remove avoidable costs” is a reasonable principle.

EpicWink · July 17, 2020, 9:58am

Would it be more frictionless to say that .dist-info/<whatever> (or even .dist-info itself) is not allowed in source-wheel/curd/sdist v0, and that its absence means that this archive conforms to source-wheel/curd/sdist spec v0? This will mean you have to guarantee that source-wheel/curd/sdist specs v1 and beyond have to include .dist-info/whatever to indicate that they aren’t v0.
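
As a rough illustration of the "absence means v0" idea, a consumer could scan the archive's member names for a .dist-info directory. This is only a sketch: the helper names (has_dist_info_dir, make_archive) and the archive layouts used below are hypothetical, and nothing in the thread fixes where such a directory would live.

```python
import io
import tarfile


def has_dist_info_dir(archive: tarfile.TarFile) -> bool:
    """Return True if any path component of any member ends with '.dist-info'."""
    for member_name in archive.getnames():
        if any(part.endswith(".dist-info") for part in member_name.split("/")):
            return True
    return False


def make_archive(paths):
    """Build an in-memory .tar.gz holding empty files at the given paths."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path in paths:
            tar.addfile(tarfile.TarInfo(name=path))  # zero-length file
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r:gz")


# Hypothetical layouts: a legacy-style archive and one carrying a marker file.
legacy = make_archive(["foo-0.1/setup.py", "foo-0.1/PKG-INFO"])
newer = make_archive(["foo-0.1/pyproject.toml", "foo-0.1/foo-0.1.dist-info/CURD"])

print(has_dist_info_dir(legacy))  # False
print(has_dist_info_dir(newer))   # True
```

Note that this still requires opening the archive, which is the cost the filename-based approach is trying to avoid.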

Also, is there currently any doubt that a file called foo-0.1.tar.gz hosted on a package index contains the distribution foo version 0.1 (regardless of whether it’s an sdist or a source archive)? If so, would this PEP make that guarantee?

pf_moore · July 17, 2020, 10:10am

I think that’s reasonable. We should confirm that no existing tools make sdists with a .dist-info subdirectory, but I don’t think they do.

That’s the assumption pip makes, and I believe it’s reliable on PyPI. I’m less sure it’s enforced on other indexes like devpi, and it’s definitely not a valid assumption for general URLs or --find-links.

Because it’s specific to indexes, I’d see it more as something we could mandate in the simple repository API (PEP 503). I’d do it in 2 parts:

sdists MUST conform to the naming convention NAME-VERSION.tar.gz (already covered in PEP 517)
Indexes MUST NOT expose files with the extension .tar.gz that do not conform to the sdist spec (extension to PEP 503). We should probably also include .whl here as well, although it’s less of an issue because .whl isn’t used for anything other than wheels.
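
A minimal sketch of what an index-side filename check for the NAME-VERSION.tar.gz convention might look like. The function name split_sdist_filename is hypothetical, and the code assumes the version part contains no hyphen; it only checks the shape of the filename and says nothing about whether the archive content is really an sdist.

```python
from typing import Optional, Tuple


def split_sdist_filename(filename: str) -> Optional[Tuple[str, str]]:
    """Split 'NAME-VERSION.tar.gz' into (name, version), or return None.

    Assumption: the version contains no hyphen, so the last hyphen
    separates the project name from the version.
    """
    suffix = ".tar.gz"
    if not filename.endswith(suffix):
        return None
    stem = filename[: -len(suffix)]
    name, sep, version = stem.rpartition("-")
    if not sep or not name or not version:
        return None
    return name, version


print(split_sdist_filename("pkg-1.0.tar.gz"))         # ('pkg', '1.0')
print(split_sdist_filename("my-package-1.0.tar.gz"))  # ('my-package', '1.0')
print(split_sdist_filename("notes.tar.gz"))           # None
print(split_sdist_filename("pkg-1.0.zip"))            # None
```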

ofek · July 17, 2020, 2:02pm

Neither Flit, Poetry, nor the upcoming version of Hatch does that.

uranusjr · July 20, 2020, 5:50am

To keep things flowing, here’s the current state of the discussion. To move things forward, we’ll need to decide whether we actually want to rebrand, or simply standardise upon the existing .tar.gz format. Here’s my understanding of the trade-offs.

For rebranding:

Pro: A clear distinction from the “sdist” thing, which is awkward in name and concept. Tools can assume a distribution follows the new scheme without verification (which may imply a lengthy build-from-source step).
Con: A big disruption with marginal benefits.

For building on sdist:

Pro: Minimal disruption. All tools need to do is to improve the PEP 517 implementation to produce the updated format.
Con: There is no way to tell definitively whether a package follows the newer source distribution format without physically inspecting the archive content.

As a tool implementer, I am personally inclined to prefer having a new naming scheme. The historical lack of specification has always been a big problem for both fetching and verifying sdists. For example, since sdists do not use the same package name escape rules as wheels, --find-links implementations cannot simply do name.split('-', 1)[0] to get the package name, but need to involve heuristics to match the user-input name to the file names (which may not even be normalised, to make things worse). The common .tar.gz and .zip suffixes also make it more difficult to implement messaging; the tool needs to download, extract, and inspect PKG-INFO to tell whether a file is actually an sdist. If the sdist were to have a dedicated suffix and name scheme, on the other hand, the error could be shown early and easily.
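
To make the name.split('-', 1)[0] point concrete, here is a small sketch. The helper naive_project_name and the example filenames are made up for illustration; the only behaviour relied on is that wheel filenames escape hyphens in the project name to underscores, while legacy sdist filenames may keep them.

```python
def naive_project_name(filename: str) -> str:
    """Take everything before the first hyphen, as in name.split('-', 1)[0]."""
    return filename.split("-", 1)[0]


# Wheel: hyphens in the project name are escaped to underscores, so the
# first hyphen always separates the name from the version.
print(naive_project_name("my_package-1.0-py3-none-any.whl"))  # my_package

# Legacy sdist: the project name may itself contain hyphens, so the same
# split cuts the name short.
print(naive_project_name("my-package-1.0.tar.gz"))  # my
```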

steve.dower · July 20, 2020, 9:11am

“Rebranding” to *.sdist seems like the most feasible option to me.

Renaming to some cheese-inspired name is pointless. We have a perfectly well established arbitrary name already.

Otherwise, I agree standardising the name part is most important, regardless of extension or compression format. Though I would prefer a single extension, as “.tar.gz” is annoying to deal with using path manipulation functions.
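
For readers who have not run into this, a short sketch of the path-manipulation annoyance using Python's pathlib. The filenames are illustrative, and the .sdist extension is just the one floated in this thread, not an adopted format.

```python
from pathlib import PurePosixPath

legacy = PurePosixPath("foo-0.1.tar.gz")
print(legacy.suffix)    # .gz
print(legacy.stem)      # foo-0.1.tar
print(legacy.suffixes)  # ['.1', '.tar', '.gz']  (the version's dot counts too)

# With a single dedicated extension, suffix and stem line up with
# "file type" and "NAME-VERSION" directly.
single = PurePosixPath("foo-0.1.sdist")
print(single.suffix)  # .sdist
print(single.stem)    # foo-0.1
```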

pf_moore · July 20, 2020, 9:24am

Good point. We seem to have got sidetracked over new names in the “rebranding” discussion, but the important part of that idea is making sdists into a file type of their own, rather than just a .tar.gz with a particular content.

With that in mind,

Using the “brand” sdist, with .sdist as the file extension:

Pro: Allows tools to recognise the new format from the filename alone.
Pro: Minimal disruption. All tools need to do is to improve the PEP 517 implementation to produce the updated format.
Pro: Gradual improvement is possible. The initial version is recognised just by having the new filename, future iterations can add an explicit version marker.
Con: Some possibility of confusion with “legacy” sdists.

Pro or con, depending on your perspective - it doesn’t do anything more than make filename semantics reliable. I view this as a “pro”, because it’s the quick win that would most help tools like pip. Others may view it as a “con” because it’s a relatively minor benefit in itself (the other benefits of the change are mostly about setting things up to make future improvements easier).