PEP 625: File name of a Source Distribution

And a PR to add a “Source Distribution” specification on packaging.python.org as well. :slight_smile:

Should it just go straight to a spec on packaging.python.org and skip any PEP? It’s just documenting current practices so there wouldn’t be any real decision making beyond what the algorithm is.


I think it can just be done as a PR on packaging.python.org, as long as it is worded along the lines of “if you have a sdist, this is what you can assume about it”.

If it says when you can assume a file that you’ve been given is a sdist, I feel that it needs to be a PEP, as that would be a more substantial change. But my understanding is that we’re not going to do that.


Just for the record, I know that PyPI is the easiest case here, but regardless of what extension we use, we can mandate that sdist names on PyPI are unambiguous if we so desire. I also think we could modify the build tools we have now so that they produce unambiguous sdists now. It doesn’t drastically make the situation better for pip, but it does put more files on the happy path.

Honestly I think we probably could even deprecate and eventually remove support for installing from archives that don’t use rules similar to wheel naming.

That being said, I think mandating that version number matches what’s inside the archive is probably more than we can do without some explicit signal that this item is a sdist.


Any update on this?

Unfortunately nothing. There are a lot of things going on and this is pretty low in terms of priority.

@woodruffw has written a very nice blog post that describes the problem and helps motivate this change: https://blog.yossarian.net/2022/05/09/A-most-vexing-parse-but-for-Python-packaging


It’s also not a bug in any of the current Python packaging code: utilities like packaging.utils.parse_sdist_filename are well-defined with respect to their expected inputs, and only fail because some possible inputs are not well-defined.

That is probably not true: 1.0-2 is a perfectly valid, though unnormalized, PEP 440 version, and PEP 440 has never required that versions be emitted in their normalized form. Tools that don’t accept it as a version string are broken, as PEP 440 requires tools that parse a version number to accept the unnormalized form.

The two files in the blog post are fundamentally pointing to different versions:

cffi-1.0.2-2-cp27-none-win_amd64.whl

Is a wheel for the cffi project, with version 1.0.2, and the second build of that wheel.

cffi-1.0.2-2.tar.gz

Is a sdist for the cffi project, with a version of 1.0.2-2, which normalizes to 1.0.2.post2.
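The two readings above can be sketched roughly as follows. This is a standard-library-only illustration, not how pip or `packaging` implement it: `split_wheel_filename`, `split_sdist_filename` and `normalize_implicit_post` are hypothetical helpers, the sdist helper assumes the project name is already known, and the normalisation covers only the single PEP 440 rule that a trailing `-N` is an implicit post-release.

```python
import re


def split_wheel_filename(filename):
    """Split a wheel filename into (name, version, build tag or None)."""
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: 5 parts, or 6 with a build tag.
    if len(parts) == 6:
        return parts[0], parts[1], parts[2]
    if len(parts) == 5:
        return parts[0], parts[1], None
    raise ValueError(f"not a wheel filename: {filename!r}")


def split_sdist_filename(filename, project):
    """Split an sdist filename, assuming the project name is already known."""
    stem = filename[: -len(".tar.gz")]
    if not stem.startswith(project + "-"):
        raise ValueError(f"{filename!r} does not belong to {project!r}")
    # Everything after "<project>-" is the version; sdists have no build tag.
    return project, stem[len(project) + 1 :]


def normalize_implicit_post(version):
    """Apply one PEP 440 rule only: a trailing "-N" means ".postN"."""
    return re.sub(r"-(\d+)$", r".post\1", version)


wheel = split_wheel_filename("cffi-1.0.2-2-cp27-none-win_amd64.whl")
sdist = split_sdist_filename("cffi-1.0.2-2.tar.gz", "cffi")
sdist_version = normalize_implicit_post(sdist[1])
```

Under these assumptions the wheel yields version `1.0.2` with build tag `2`, while the sdist yields version `1.0.2-2`, which is a different release once normalised.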

I don’t see anything in PEP 517 that would require sdists to escape their version numbers. The section that seems to be used as justification is:

We continue with the legacy sdist format, adding some new restrictions. This format is mostly undefined, but basically comes down to: a file named {NAME}-{VERSION}.{EXT} , which unpacks into a buildable source tree called {NAME}-{VERSION}/ . Traditionally these have always contained setup.py -style source trees; we now allow them to also contain pyproject.toml -style source trees.

Integration frontends require that an sdist named {NAME}-{VERSION}.{EXT} will generate a wheel named {NAME}-{VERSION}-{COMPAT-INFO}.whl .

The new restrictions for sdists built by PEP 517 backends are:

  • They will be gzipped tar archives, with the .tar.gz extension. Zip archives, or other compression formats for tarballs, are not allowed at present.
  • Tar archives must be created in the modern POSIX.1-2001 pax tar format, which uses UTF-8 for file names.
  • The source tree contained in an sdist is expected to include the pyproject.toml file.

Specifically, it’s stated that a front end will require an sdist named {NAME}-{VERSION}.{EXT} to generate a wheel named {NAME}-{VERSION}-{COMPAT-INFO}.whl.

That statement still holds in cases where the version is 1.0-2: the wheel escapes it to 1.0_2 in its filename and the sdist does not.
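To make the asymmetry concrete, here is a small sketch of the escaping rule from the wheel specification (PEP 427), which replaces runs of characters other than alphanumerics, `_` and `.` with `_`. The helper name `escape_wheel_component` and the project name `example` are assumptions made up for illustration; the sdist side simply has no equivalent rule.

```python
import re


def escape_wheel_component(component):
    """Escape one wheel filename component, following the PEP 427 rule."""
    # Pass the flag by keyword: the fourth positional argument of re.sub is `count`.
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)


version = "1.0-2"
sdist_name = f"example-{version}.tar.gz"  # sdist: version used as-is
wheel_prefix = f"example-{escape_wheel_component(version)}"  # wheel: escaped
```

Both filenames describe the same version; only the wheel filename hides the hyphen.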

Further, PEP 517 explicitly calls out what new restrictions are in place for an sdist, and none of them requires the version number to be escaped or to be in the PEP 440 normalized form.

Also, in looking at the blog post, it appears the reason pip rejects the download is not because of the use of a version number of 1.0.2-2, but because whatever produced that sdist produced it incorrectly. Sdists do not have build numbers like wheels do. That version of cffi appears to be 1.0.2; the wheel also happens to have a build number of 2, but the sdist should just be 1.0.2.


So… this is one of our “open PEPs”.

@pf_moore @uranusjr Do you know what’s needed here to get this over the line, so that this can get to the point that we get a decision from a PEP-Delegate on this?

  1. Someone needs to go back over the discussion here and ensure that the current version of the PEP reflects the discussion.
  2. Assuming I’m PEP delegate, I’d like to see a commitment from the various affected tools that they would support the new format - at a minimum that would be build backends, build frontends, warehouse, uploaders like twine, and possibly other index providers like devpi and artifactory. Otherwise it’s too easy for us to approve a standard like this and then no-one bothers to do anything about it.

Actually, regarding that second point, the “backward compatibility” section probably needs expanding. If an index doesn’t yet support the PEP (which could easily happen if a company doesn’t upgrade their index software straight away) what happens? If the index is mirroring PyPI, which now has the new filenames on it? Or if the internal developers want to upload sdists created with updated versions of tools?

You’re an author, so I’m pretty sure we’d want someone else to be the delegate. :slight_smile:

I forgot that! So I guess that means that as an author I can review the discussion here then :slightly_smiling_face:

I’ve only done a bit (it’s a long discussion!) but I noticed that as PEP 643 - Metadata for source distributions exists now, having it in the “rejected ideas” is a bit silly. And in fact, given that it was rejected as an alternative solution to the problem PEP 625 was intended to solve, do we even need PEP 625 any more?

I think we do, because it’s still necessary to get name and version without reading the file (index scanning, in particular). But maybe we can be a bit more pragmatic? Just mandate {name}-{version}.sdist.tar.gz, with name and version normalised as in the wheel spec, as the “new sdist filename format”. All sdists with this name format MUST conform to PEP 643, so the name and version can be read from the sdist static metadata. So tools that want to cross-check can confirm if a file with a new-format name is actually an old-format file with a weird version. Given that the pragmatic approach to parsing sdist filenames has been working for years, IMO that’s sufficient. We would start parsing foo-1.0.sdist.tar.gz as foo version 1.0, rather than as foo version 1.0.sdist until we checked the file content, but is that a real issue? There are currently no files on PyPI with a name *.sdist.tar.gz, for what it’s worth.

This also avoids all the complexities around indexes not supporting the new filename extension.
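As a rough sketch of the trade-off, compare how a new-format parser and a naive legacy parser would read the same filename. Both function names are invented for illustration, and the legacy parser is a deliberately simplified stand-in (split on the last hyphen), not pip's actual heuristic.

```python
NEW_SUFFIX = ".sdist.tar.gz"
OLD_SUFFIX = ".tar.gz"


def parse_new_format(filename):
    """Return (name, version) for {name}-{version}.sdist.tar.gz, else None."""
    if not filename.endswith(NEW_SUFFIX):
        return None
    stem = filename[: -len(NEW_SUFFIX)]
    # Normalised names and versions contain no hyphen, so expect exactly one.
    if stem.count("-") != 1:
        return None
    name, version = stem.split("-")
    return name, version


def parse_legacy(filename):
    """Simplified legacy reading: the text after the last hyphen is the version."""
    stem = filename[: -len(OLD_SUFFIX)]
    name, _, version = stem.rpartition("-")
    return name, version


new_reading = parse_new_format("foo-1.0.sdist.tar.gz")
old_reading = parse_legacy("foo-1.0.sdist.tar.gz")
```

The two readings disagree only on whether `.sdist` belongs to the version, which is the ambiguity a cross-check against the sdist contents would resolve.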

It does of course tie the new name to PEP 643 adoption, which is currently blocked on this Warehouse issue and after that this setuptools issue. I’m not sure if other backends support PEP 643, but I suspect that if they don’t, it’s less of a problem for them than it is for setuptools[1].

Maybe we should defer PEP 625 until PEP 643 is rolled out, and re-think at that point? @uranusjr what do you think?


  1. I’m assuming a “minimal” level of support, only requiring name and version to not be marked as “dynamic”. More would be better, of course, but that’s not needed here. ↩︎

Just FYI regarding Hatchling, I’m not officially supporting wheel & sdist features that PyPI rejects. So when it supports metadata versions >2.1 I’ll do PEP 643.

One benefit of not putting tar.gz in the name is that we could change the compression technology fairly easily, as long as the root METADATA file still exists. I’d argue we should do both of those, following a similar model to wheels on this front.

OK. The problem is that there are significant transition issues with a new extension, and I don’t know how to fix them. So if you prefer a new extension, please explain how we’d do the transition. I alluded to this problem above, but here’s some explicit scenarios:

  1. PyPI adds support for the new extension, and people start uploading them. Third party mirroring indexes that haven’t been upgraded yet, don’t recognise the new extension, and either fail to mirror them or start generating errors. Builds start failing because it looks like projects have stopped uploading sdists.
  2. Setuptools switches to the new extension. Again, a private index hasn’t added support for the new format, and builds fail because they can no longer upload artefacts.
  3. Technically, if build_sdist returns a filename that doesn’t have the .tar.gz extension, it’s not PEP 517 compliant (see this section of the PEP). There’s no API versioning in PEP 517, so how do we address that?

I’m suggesting we defer these issues until we actually want to change the compression format (which I think is a reasonable thing to want to do, but isn’t as pressing as getting a filename format that can be parsed reliably).

One other thought - when checking PEP 517 I note that sdists must contain a single directory named {name}-{version}. So we don’t actually need the PEP 643 constraint: it’s possible to confirm whether .sdist is part of the version just by looking at the top level directory in the tar.
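A minimal sketch of that check, using only the standard library `tarfile` module. The helper names are hypothetical, the archive is built in memory purely for demonstration, and the sketch assumes member names do not start with `./`.

```python
import io
import tarfile


def top_level_directory(fileobj):
    """Return the single top-level directory of a .tar.gz, or None."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        roots = {member.name.split("/", 1)[0] for member in archive.getmembers()}
    return roots.pop() if len(roots) == 1 else None


def make_example_sdist(root):
    """Build a tiny in-memory archive with one file under `root` (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz", format=tarfile.PAX_FORMAT) as archive:
        payload = b"[build-system]\n"
        info = tarfile.TarInfo(f"{root}/pyproject.toml")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


filename = "foo-1.0.sdist.tar.gz"
root = top_level_directory(make_example_sdist("foo-1.0"))
# If the filename is exactly "<root>.sdist.tar.gz", ".sdist" is a marker,
# not part of the version.
sdist_is_marker = filename == f"{root}.sdist.tar.gz"
```

This only reads the archive's member list, so it is cheaper than extracting metadata or invoking a build backend, though it still requires opening the file.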

Or to heck with it altogether, we could simply require that backends normalise versions when creating sdist (and wheel?) filenames, and state that consumers of distribution filenames MAY assume that versions are normalised, but SHOULD attempt to parse un-normalised names whenever possible (for backward compatibility). That just formalises the current heuristics and requires that file creators stop producing filenames that don’t match the heuristics.

I’ll do some checks on my PyPI data to get some feel for the scale of the issue in reality.


OK, so there are some pretty awful cases on PyPI.

There are 2,404,380 .tar.gz files on PyPI.
Of these, 881,605 (37%) do not have exactly one hyphen in the filename.
151 have no hyphen at all; the rest have multiple hyphens.

Breaking it down by number of hyphen-separated “parts”, we have:

 1:        151  0.01%
 2:  1,522,775 63.33%
 3:    440,766 18.33%
 4:    305,250 12.70%
 5:    120,948  5.03%
 6:     10,791  0.45%
 7:      3,075  0.13%
 8:        347  0.01%
 9:        273  0.01%
10:          1  0.00%
11:          3  0.00%
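A breakdown like the one above could be produced along these lines. This is only a sketch: `count_parts` is a hypothetical helper and the sample filenames are made up (apart from the cffi one discussed earlier), standing in for the real list of PyPI filenames.

```python
from collections import Counter


def count_parts(filename):
    """Count hyphen-separated parts in a .tar.gz filename, extension removed."""
    stem = filename[: -len(".tar.gz")]
    return len(stem.split("-"))


# Made-up sample data; the real analysis ran over every .tar.gz on PyPI.
sample = [
    "alpha-1.0.tar.gz",
    "beta.gamma-2.1.tar.gz",
    "delta-epsilon-0.3.tar.gz",
    "cffi-1.0.2-2.tar.gz",
    "nohyphen.tar.gz",
]

histogram = Counter(count_parts(name) for name in sample)
for parts, count in sorted(histogram.items()):
    print(f"{parts:2d}: {count:10,d} {count / len(sample):6.2%}")
```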

Looking at the problem from a different perspective, if we know the project name (which we do if we’re scanning an index) then only 52 filenames do not start with the project name, if we canonicalise aggressively:

import re

def canonicalize_name(name):
    return re.sub("[^a-zA-Z0-9]+", "-", name).lower()  # non-alphanumeric runs become "-"

So from this, I conclude that if we already know the project name, we can get the version in all but a vanishingly small number of cases.
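One way to use the known project name is sketched below. `version_given_project` is a hypothetical helper, not pip's algorithm: it tries each hyphen in the filename as the name/version boundary and accepts the first one where the canonicalised prefix equals the canonicalised project name. The boundary is searched in the original string because the aggressive canonicalisation would also rewrite dots inside the version.

```python
import re


def canonicalize_name(name):
    """Aggressive canonicalisation: non-alphanumeric runs become one hyphen."""
    return re.sub("[^a-zA-Z0-9]+", "-", name).lower()


def version_given_project(project, filename):
    """Return the version part of `filename` if it starts with `project`."""
    stem = filename[: -len(".tar.gz")]
    wanted = canonicalize_name(project)
    for index, char in enumerate(stem):
        # Try each hyphen as the boundary between name and version.
        if char == "-" and canonicalize_name(stem[:index]) == wanted:
            return stem[index + 1 :]
    return None  # filename does not start with the project name
```

For example, `version_given_project("cffi", "cffi-1.0.2-2.tar.gz")` gives `"1.0.2-2"`, and a filename that does not start with the project name gives `None`.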

The problem is when we don’t know the project name, in which case 37% of filenames are problematic.

However, the question here isn’t whether it’s possible to parse arbitrary sdist names on PyPI. We know it’s not already, or we wouldn’t be having this conversation. The question isn’t even whether we need to standardise this - we don’t have to, because pip’s current algorithm is sufficient for practical use. And even if we do standardise, we can’t reasonably say that tools can reject legacy sdist names any time soon. But what we can do, is say that:

  1. Producers of sdists will in future always use a filename of {name}-{version}.tar.gz, where {name} and {version} are normalised to ensure they do not contain hyphens.
  2. Consumers of sdists can check the number of hyphens in the filename, and if it is exactly 1, assume that the filename is in the standard format. This is useful because it can save the cost of extracting the metadata (in the case of PEP 643 compliant sdists) or of invoking the build backend for metadata (in the worst case of nothing more than PEP 517 compliance). The question is whether it might give the wrong answer.
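The two rules above might look like this in code. It is a sketch under stated assumptions: the function names are invented, name normalisation follows the wheel spec approach (PEP 503 normalisation with `-` replaced by `_`), and the producer expects a version that has already been normalised per PEP 440 rather than normalising it itself.

```python
import re


def normalise_name(name):
    """Normalise a project name so that it contains no hyphens."""
    return re.sub(r"[-_.]+", "-", name).lower().replace("-", "_")


def sdist_filename(name, normalised_version):
    """Producer side: build {name}-{version}.tar.gz with exactly one hyphen."""
    if "-" in normalised_version:
        raise ValueError("version must be normalised first")
    return f"{normalise_name(name)}-{normalised_version}.tar.gz"


def fast_parse(filename):
    """Consumer side: return (name, version) only on the one-hyphen fast path."""
    if not filename.endswith(".tar.gz"):
        return None
    stem = filename[: -len(".tar.gz")]
    if stem.count("-") != 1:
        return None  # fall back to heuristics or to reading the metadata
    name, version = stem.split("-")
    return name, version
```

A `None` result means "use the existing heuristics", so legacy filenames are handled as they are today.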

I have checked all 1,522,775 filenames which only have one hyphen. In 93 cases, the {name} part does not match the project name. In none of those would pip’s algorithm give any better result. In 62 of those cases, the {version} part is a valid version - so assuming the “standardised” format would give the wrong answer.

I am of course ignoring cases where the sdist filename is just flat-out wrong, and a build would give completely different name and version metadata. I think it’s fair to assume such cases are rare or pathological enough that we don’t care about them.

So, if we standardise {name}-{version}.tar.gz, with name and version normalised to have no hyphens, we would have to fall back to existing heuristics in 37% of cases, and when we can use the new logic, we would get the wrong answer in 62/1522775 (0.004%) of cases. I think that’s an acceptable error rate, personally.
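The percentages quoted above can be re-derived from the raw counts in this thread; the variable names are just for illustration.

```python
total = 2_404_380              # .tar.gz files on PyPI
one_hyphen = 1_522_775         # filenames with exactly one hyphen
wrong_on_fast_path = 62        # one-hyphen names that would parse incorrectly

needs_fallback = total - one_hyphen
fallback_share = needs_fallback / total
error_rate = wrong_on_fast_path / one_hyphen

print(f"fallback needed: {needs_fallback:,} files ({fallback_share:.0%})")
print(f"fast-path error rate: {error_rate:.3%}")
```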

I hope these numbers are useful. I can provide the scripts if anyone wants them, although they are very ad hoc (mostly just copy/pasted lines from the REPL, I must start using Jupyter notebooks more!). The raw data is basically a list of (project_name, filename) from PyPI.

I’d love to see similar analyses of private indexes, but I don’t know if anyone is likely to be able to do this.


Not in general, but not every tool is pip. Some tools can get away with only having first-class support if best practices are followed, and for those, it would help a lot to standardize the best practices so that packaging tools accept PRs to follow them.

Some tools can afford to ignore (or special-case) older/existing releases. For example, a rough estimator of the current popularity of an API, or a semi-automatic RPM generator that currently needs hand-holding in many other edge cases.

For private indexes, you’d be able to fix issues by re-releasing every affected sdist, and then if you can rely on future releases being standardized, you can avoid a special case in internal tooling.


FWIW, I think it’s reasonable for pip to get strict as well, after providing sufficiently long periods of time for transitioning (~6 months to 5 years, depending on the change). We’ve even encoded that model into our CLI and workflows (eg: the --use-feature flag).

If a future version of pip starts rejecting source distributions where it’s not able to figure out the version/name from the distribution, I think it’s a reasonable workaround for users to use an older version of pip – especially if the build backends have been updated to do the right thing (or at least, have a clear workaround/guidance for this).


Absolutely. My main motivation for this is that I want to write my scripts without worrying too much about legacy data. Sorry for not making that clearer.

My point was that even for tools that have to support legacy data, this proposal will do no harm.


Again 100% agreed. So the {normalised_name}-{normalised_version}.tar.gz proposal allows pip (and other tools) to implement that transitional period without compatibility issues when dealing with tools that are at a different point in the transition. My biggest problem with the “new suffix” proposals is that they give interoperability issues between tools for that transition.
