PEP 625: File name of a Source Distribution

I like where the whole discussion is going. It seems to me like this specification would be a great improvement.

I don’t have a good alternative suggestion or even a strong argument, but I believe a clean cut from the legacy sdist term would provide some value. I wish the situation with sdist were similar to bdist (in setuptools terms), so that we could have a sdist_curd (or whatever instead of curd), similar to how we have bdist_wheel, bdist_rpm, etc. And bdist itself is not really an artifact, it’s the general concept, or the exploded version on the file system (or something like that). It is clear what a built distribution is and that it can have many forms (wheel, rpm, egg, etc.). Speaking as someone who’s following the Python packaging topics on Stack Overflow, I am a bit worried that for sdist there would be some confusion between old and new. But most likely, the ship has sailed, and we’ll have to accommodate the situation.

Personally I would be perfectly fine with .sdist.
(Even better if it comes with the other improvements in the content of the file. Just the renaming would still be welcome but a bit disappointing.)

I like that we are getting rid of the unspecific .tar.gz extension, which wastes an opportunity to provide useful information.

I like that we are avoiding the wheel term here (.swhl), since this new source format might outlive the wheel format, and will be used to build other things than wheels.
(Maybe it’s just a matter of interpretation, maybe wheel is the global term and we should have had .bwhl and .swhl.)

Not sure if nowadays it’s still worth the headache of finding a three-letter extension (like .whl)?

It’s only a “minor” benefit if you have never written code to try a tool that works with an sdist. :wink: I think the confusion is very minor and most people will piece together what .sdist represents and eventually people will just wonder why there are all of these .tar.gz files floating around on PyPI.

I don’t think so. .html killed that concern ages ago.

Since old installer versions wouldn’t recognise the new extension, we need to expect package authors to produce both the old and new sdist formats for a while. The PEP 517 hooks will also need to be updated to either have a dedicated hook for each format, or amended to have build_sdist able to produce both formats in one call. Otherwise I fear developers (both package authors and PEP 517 front-end providers) would simply choose to not adopt the new format, since the disadvantage would be too significant.

Fair point. Would it not also apply to the full “rebranding” option, though? Needing a migration plan seems like it would be necessary for anything apart from “do nothing”.

Yes, all rebranding routes would need a similar migration plan. But I think the plan would be slightly trickier if we use the .sdist extension, since it would create a confusion around the build_sdist hook that other extensions don’t.

Is it just about the extension?
The content of the new format, would it be a strict breaking change or can it be made backwards compatible (with older pip versions, etc.)? In the latter case, PyPI could maybe present the new .sdist (or whatever the extension) also under the legacy .tar.gz extension (2 URLs pointing to the same file). But I guess it’s more complicated than that.

It’s actually a plausible idea, legacy sdist has so few requirements it’s very easy to make the modern sdists format 0 conform. But I don’t work on PyPI myself and can’t say whether it’d be implementable.

This is one of the biggest issues I have with changing the format (and with many other suggestions in other threads about this). I am convinced that we can find a reasonable way to standardize sdists “in place”, which will really aid the roll out here.

Another option that I think could work here: standardize on {distribution}-{version}.tar.gz, and do a more conservative version of my earlier idea to use the “extra fields” headers as an indicator that it’s a new-style sdist by adding a field like PYTHON_SDIST_VERSION=0. Anything that doesn’t match {distribution}-{version}.tar.gz and anything missing the extra field header is assumed to be a legacy sdist. If the extra-field metadata gets lost for some reason, NBD because (assuming we also adopt strict metadata at the same time), once you open up the tarball you’ll find the standardized metadata directory. It would be transparent to everything not looking for it, but you’d get “progressive enhancement” the closer your situation is to the happy path.
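To make the “extra fields” idea a bit more concrete, here is a minimal sketch. It assumes the marker would be stored as a pax global extended header, which is only one possible reading of “extra fields”; nothing in the proposal pins down the mechanism. The helper names are made up for illustration and only the standard library tarfile module is used.

```python
import io
import tarfile

# Field name taken from the post above. Storing it as a pax global
# header is an assumption made for this sketch.
SDIST_MARKER = "PYTHON_SDIST_VERSION"


def build_sdist_bytes(name, version, sdist_version=None):
    """Build a tiny in-memory .tar.gz, optionally tagged as new-style."""
    pax_headers = {}
    if sdist_version is not None:
        pax_headers[SDIST_MARKER] = sdist_version
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w:gz",
        format=tarfile.PAX_FORMAT,
        pax_headers=pax_headers,
    ) as tar:
        payload = f"Name: {name}\nVersion: {version}\n".encode()
        info = tarfile.TarInfo(f"{name}-{version}/PKG-INFO")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_sdist_marker(archive_bytes):
    """Return the marker value, or None if the archive does not carry it."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return tar.pax_headers.get(SDIST_MARKER)
```

Under this sketch, read_sdist_marker() returning None would mean “treat as a legacy sdist”. Tools that ignore pax headers would just see an ordinary tarball.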

If none of those ideas fly, I think @sinoroc’s suggestion is the best alternative: PyPI can either offer both .sdist and .tar.gz files or it automatically changes .sdist to .tar.gz by default, and tools can specify that they want the new file extension (in either case, we’re in for some breakage if we make new versions of pip download pull down the .sdist file by default, since the pip you use for downloading and the pip you use for installing may be different versions, but maybe this is acceptable?).

I think both options are common in that they both theoretically work, but introduce tradeoffs in different places. It would be quite difficult to roll out a new extension, but installers would have a much nicer time. Sticking to .tar.gz avoids all the migration trouble, but pushes the complexity onto tool implementers that need to deal with the distributions. To me, this seems like a design decision we have to make. (I’ve stated my preference and won’t repeat here.)

And to put the tool issue more concretely, if I look up files using the Simple API today I don’t have a reliable way to split the project name from the version, so I never know if I’m doing the split accurately. And asking one to download every sdist to check whether it has an internal archive field set to know how to split the file name is extremely costly. This is why mousebender launched with no sdist support and we didn’t group files for users by version as there wasn’t a reliable way to do that for sdists (pip’s code is just a massive guessing game of code for this sort of thing).

The only way I would ever be okay with keeping .tar.gz is if we say as a group that any project whose sdists are not using modern naming as outlined by PEP 517 better do the right separation via:

name, _, version = filename.removesuffix(".tar.gz").rpartition("-")

If we are not comfortable with leaving projects that don’t do that in the past and we don’t commit tools to do the proper naming going forward now, I don’t think trying to keep compatibility is helping us enough to not bite the bullet and do the more proper fix now so we have it for the next 30 years.

Not much more complicated (as simple as setting the anchor href in a sdist-duplicate of the .tar.gz file), but you’d have to ask private PyPI projects and PyPI hosters (e.g. Azure Artifacts) to update to support that as well, right? In addition, devpi and my proxy project will have to be checked for issues. Would twine also need to be updated? Finally, static file-hosting PyPI websites can’t ever dynamically support this, and both files will have to be uploaded (so I guess twine must be updated).

Yes. Seems like we wouldn’t win much if a download of the distribution is still required. Maybe we can duplicate this bit of information in a data-attribute on the <a> link (similar to data-requires-python), so that pip could decide to skip the download if it can know for sure that the file name is reliable.
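As a rough sketch of what a client could do with such an attribute: the attribute name data-sdist-reliable-name below is entirely hypothetical (no such attribute is specified anywhere), and the URLs are placeholders. Only the standard library html.parser module is used.

```python
from html.parser import HTMLParser

# Hypothetical attribute name, invented for this sketch only.
RELIABLE_NAME_ATTR = "data-sdist-reliable-name"


class SimplePageParser(HTMLParser):
    """Collect (href, reliable) pairs from a Simple API project page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is not None:
            # The mere presence of the attribute is treated as the signal.
            self.links.append((href, RELIABLE_NAME_ATTR in attr_map))


def files_with_reliable_names(page_html):
    """Return the hrefs whose file names the index vouches for."""
    parser = SimplePageParser()
    parser.feed(page_html)
    parser.close()
    return [href for href, reliable in parser.links if reliable]


PAGE = """
<a href="https://files.example.invalid/demo-1.0.tar.gz"
   data-sdist-reliable-name="true">demo-1.0.tar.gz</a>
<a href="https://files.example.invalid/demo-0.9-legacy.tar.gz">demo-0.9-legacy.tar.gz</a>
"""
```

Files without the attribute would still need the current guess-then-verify treatment, so this only helps to the extent that the index can actually vouch for the name.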

On the other hand, how good is PyPI at filtering out distributions that have unreliable file names? Is there any check being done at upload-time on PyPI’s side? On twine’s side? I genuinely ask, I have no idea how common it is to have unreliable .tar.gz file names on PyPI.

(This alias idea is probably too young to be worth going into the details here and now, so I will keep it short.)

If this alias trick ever makes it into PyPI then we already cover a lot of ground. Other server software might want to learn this feature later as well (maybe we’ll need a PEP), that’s their decision, nothing would break if they don’t. Twine will need an update if there is a new file format to upload, but nothing particular regarding this trick. Not much we can do for static file hosting (nothing comes to mind).

On PyPI, the splitting rule is pretty much guaranteed to be OK (pip has been assuming¹ it for years, so we’d have seen any big issues). I’m less sure whether the rule that the project name part must be normalised is OK, as pip defensively normalises names over and over, precisely because we can’t be sure.
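For readers who haven’t seen it, the “assume the split, normalise defensively, verify later” pattern looks roughly like this. It is a simplified sketch, not pip’s actual code: the function names are made up, the normalisation is the PEP 503 rule, and str.removesuffix needs Python 3.9+.

```python
import re


def normalize(name):
    """PEP 503 name normalisation: runs of -, _ and . become one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


def guess_name_version(filename):
    """Cheap guess from the file name alone, without downloading anything."""
    stem = filename.removesuffix(".tar.gz")
    name, _, version = stem.rpartition("-")
    # Normalise defensively, since the file name may not be normalised.
    return normalize(name), version


def guess_matches_metadata(filename, metadata_name, metadata_version):
    """Later check, once the definitive metadata is known after a build."""
    guessed_name, guessed_version = guess_name_version(filename)
    return (
        guessed_name == normalize(metadata_name)
        and guessed_version == metadata_version
    )
```

A formal standard would make the second step a sanity check rather than something every tool has to carry around.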

Having a standard basically just promotes a de facto standard to a formal one, as far as PyPI and pip are concerned. However it also extends that to other places where sdists can be found (custom or private indexes, direct URLs, pip’s --find-links feature, etc).

The problem with the .tar.gz extension is that many of those other places allow arbitrary files to be present², so we can’t make the assumption that .tar.gz means “sdist” (and if we can’t assume we have a sdist, we can’t apply the sdist filename rule reliably). Hence the suggestion of a new extension. But if we can’t agree on a new extension and a transition mechanism, formalising the de facto standard would still be useful in itself:

  • For code that doesn’t do a build, parsing the filename is the only way to know name and version (until we standardise sdist contents, which is a bigger problem).
  • New tools and ad hoc scripts can avoid needing to re-implement all of pip’s defensive code.

¹ Getting name and version from the filename is a really important performance optimisation for pip, so we assume the splitting rule and then defensively normalise the project name. We then check when we get the definitive project metadata that our earlier assumption was correct. The biggest benefit for pip of standardising this would be to get rid of a big chunk of defensive and error handling code, as well as guaranteeing that malformed filenames can’t be fed to pip and break it.
² Technically, indexes can contain arbitrary files too, according to PEP 503, so even in an index we shouldn’t assume a .tar.gz file is a sdist. But for all practical purposes, we can.

If folks want to stick with the sdist name, that’s OK with me. I kind of hate it and always found it super awkward to pronounce, so if we can come up with something better I’d prefer it (my idea was source wheels to mirror source rpms, but I don’t actually care about it).

I do think we should, regardless of what we call it, define a new extension so we can differentiate it from arbitrary tarballs. There’s an argument to be made that we can/should also move to using zips to align our two formats together, and to make random access of metadata files possible.

So I guess I’d say I’m in favor of:

  • Give some extension, ideally getting rid of the “sdist” name, but at a minimum define an extension.
  • Move to the zip format to make future introspection easier.

If we can also tie this together with providing metadata in a sdist that we can trust, then that provides a solid win for end users (faster installs from sdists) on top of the wins for tooling.
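To illustrate the random-access point, here is a small sketch using the standard library zipfile module. The archive layout ({name}-{version}/PKG-INFO at a known path) is an assumption borrowed from how sdists are commonly laid out today, not something the new format has settled on, and the helper names are made up.

```python
import io
import zipfile
from email import message_from_string


def build_zip_sdist(name, version):
    """Build a tiny in-memory zip archive with an assumed sdist layout."""
    root = f"{name}-{version}"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            f"{root}/PKG-INFO",
            f"Metadata-Version: 2.1\nName: {name}\nVersion: {version}\n",
        )
        zf.writestr(f"{root}/src/{name}/__init__.py", "")
    return buf.getvalue()


def read_metadata(archive_bytes, name, version):
    """Read only the metadata member; other members are never decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        raw = zf.read(f"{name}-{version}/PKG-INFO").decode("utf-8")
    return message_from_string(raw)
```

With a gzip-compressed tarball, a reader generally has to decompress the stream up to the member it wants; the zip central directory lets it jump straight to one file.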

The transition plan is not the greatest thing in the world, but it’s not too bad. PyPI would see both styles of sdist as separate artifacts, so projects would upload both, until a version of pip that supports the new-style sdist is sufficiently old that individual projects feel comfortable dropping the old .tar.gz option.

This seems to have quieted down a bit. To not lose momentum on dealing with the “sdist problem”, I am hoping we can at least drive this thread to completion before tackling any of the other threads like trustworthy metadata.

Classic sdists

Do we all agree that tools should be producing what PEP 517 specifies: {dist}-{version}.tar.gz with {version} normalized? I think so, but since tools need to know whether or not they can rely on the name being normalized, it’s something we should agree to. It also means pushing tools to start normalizing names if they aren’t already, even in non-PEP 517 scenarios.

How about parsing names? Can we say:

dist, _, version = filename.removesuffix(".tar.gz").rpartition("-")

Is good enough and those with wonky version numbers that would break are just out of luck? Or do we need to document what pip does? I will say that this not being documented is why mousebender doesn’t try parsing file names (yet).
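To show what “out of luck” would mean in practice, here is the one-liner wrapped in a small function with basic validation. The function name is made up, and str.removesuffix requires Python 3.9 or later.

```python
def split_sdist_filename(filename):
    """Split '{dist}-{version}.tar.gz' on the last hyphen."""
    if not filename.endswith(".tar.gz"):
        raise ValueError(f"not a .tar.gz file name: {filename!r}")
    stem = filename.removesuffix(".tar.gz")
    dist, sep, version = stem.rpartition("-")
    if not (dist and sep and version):
        raise ValueError(f"no '{{dist}}-{{version}}' structure in {filename!r}")
    return dist, version
```

A hyphenated project name is handled fine: "foo-bar-1.0.tar.gz" splits into "foo-bar" and "1.0". A legacy file such as "foo-1.0-beta.tar.gz", where the version itself contains a hyphen, silently splits into "foo-1.0" and "beta", which is the breakage being accepted.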

New-style sdists

From what I can tell, there are two sides.

One approach is what @uranusjr, @dstufft, @pf_moore, @steve.dower, and myself prefer, and that’s:

  1. .sdist file extension (barring someone coming up with a better name)
  2. Switching to a zip file format

The other approach is @pganssle’s: not changing the naming and instead:

  1. Marking new-style sdists via metadata stored in the tarball file’s metadata itself

I would say that Paul’s metadata approach is the most backwards-compatible (with the assumption that any metadata decisions we make are also backwards-compatible), but the .sdist approach is the most visible/discoverable. Each of those decisions impacts users and tools differently.

Is that a fair assessment as to where we currently are with this from a naming perspective?

I think we need to check what pip does. :slight_smile:

As someone who’s on the fence about change-naming vs indicate-via-metadata, I think so.

The relevant code is here: pip/src/pip/_internal/index/package_finder.py at 31299ee37058fafc84535ee3a47f0468433aeb20 · pypa/pip · GitHub

I think the docstring and comments of _find_name_version_sep() explains things quite well (I hope… I wrote them):

from pip._vendor.packaging.utils import canonicalize_name


def _find_name_version_sep(fragment, canonical_name):
    # type: (str, str) -> int
    """Find the separator's index based on the package's canonical name.

    :param fragment: A <package>+<version> filename "fragment" (stem) or
        egg fragment.
    :param canonical_name: The package's canonical name.

    This function is needed since the canonicalized name does not necessarily
    have the same length as the egg info's name part. An example::

    >>> fragment = 'foo__bar-1.0'
    >>> canonical_name = 'foo-bar'
    >>> _find_name_version_sep(fragment, canonical_name)
    8
    """
    # Project name and version must be separated by one single dash. Find all
    # occurrences of dashes; if the string in front of it matches the canonical
    # name, this is the one separating the name and version parts.
    for i, c in enumerate(fragment):
        if c != "-":
            continue
        if canonicalize_name(fragment[:i]) == canonical_name:
            return i
    raise ValueError("{} does not match {}".format(fragment, canonical_name))

Wheee! Thanks @uranusjr!

Looks like we straight up fail when things don’t match the name-version format, with the caveat of matching “name-foo-version” with “foo-version” being treated as the version?!
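A compact variant of the function above that returns both halves makes that behaviour easy to see. This is a sketch, not pip’s code: canonicalize_name here is a local stand-in implementing the PEP 503 rule (the real one lives in packaging.utils), and split_with_known_name is a made-up name.

```python
import re


def canonicalize_name(name):
    """Local stand-in for packaging.utils.canonicalize_name (PEP 503 rule)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def split_with_known_name(fragment, canonical_name):
    """Split at the first hyphen whose prefix canonicalizes to the name."""
    for i, c in enumerate(fragment):
        if c == "-" and canonicalize_name(fragment[:i]) == canonical_name:
            return fragment[:i], fragment[i + 1:]
    raise ValueError(f"{fragment} does not match {canonical_name}")
```

With the project known to be "name", the fragment "name-foo-1.0" splits into "name" and the version "foo-1.0". With the project known to be "name-foo", the same fragment splits into "name-foo" and "1.0". The catch is that this only works when the caller already knows which project it is looking for.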

Yes, I think saying “no hyphens in version” (for whatever we settle on) is reasonable. Plus, it’s currently implied by PEP 517’s naming specification. :slight_smile:

Yup, because foo-version is a totally legal version for packages before PEP 440. But PEP 517 mandates PEP 440 versions (because the version must match the wheel’s, which mandates PEP 508 which mandates PEP 440),¹ so an sdist returned by a PEP 517 backend must have a file name that can be split by rpartition("-"). The trick is to have a way to identify whether an sdist is using PEP 517.

¹ The fact I can write this entirely from memory is terrifying.

sigh

I think pip can break backward compatibility here, because I really don’t think we should take non-PEP 440 versions now – it’s been long enough. Let’s move that discussion to pip’s issue tracker tho.


puts on sarcasm hat

Well, there’s no trick needed - just enable PEP 517 by default on everything.

I can relate. Welcome to the club, I guess?