PEP 625: File name of a Source Distribution

I’ve said this before in other places, but not on discuss.python.org: one of the transition tools that we have, and that we don’t use enough in packaging tooling, is coupling behaviour changes to Python versions.

For example, in this case, we could also couple the strictness of {normalised_name}-{normalised_version}.tar.gz being required to a Python version (like 3.12+). To use more words, the packaging tooling would enforce that all sdists used on Python 3.12+ follow the stricter naming model.

That way, by the time 3.11 goes EoL, we can reasonably safely assume in all the tooling that all sdists will need to work with/follow the stricter format. And, it means that users who are affected by this will have to deal with this as part of their Python version upgrades – something that they should be budgeting time for and doing anyway. It’s also easier to communicate “I need a proper name on Python 3.12+” rather than “I need a proper name on pip 22.3+”.

Pragmatically, this should only need a change in pip – to reject distributions that don’t follow this naming on newer Python versions. The build backends only need to have some way to allow users to generate such source distributions then.
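
As a rough illustration of what such a version-gated check could look like on the installer side, here is a minimal sketch. It is not pip's code: the function name `accept_sdist`, the `STRICT_FROM` cut-off, and the exact regular expression are all assumptions made for the example, and the strict pattern assumes a normalised name that uses underscores so the only hyphen is the name/version separator.

```python
import re
import sys

# Assumption: the strict form is {normalised_name}-{normalised_version}.tar.gz,
# where the normalised name is lowercase and contains no hyphens, so the only
# hyphen left in the file name separates name from version.
STRICT_SDIST = re.compile(r"^(?P<name>[a-z0-9_]+)-(?P<version>[^-]+)\.tar\.gz$")

# Hypothetical cut-off; the post only offers "3.12+" as an example.
STRICT_FROM = (3, 12)


def accept_sdist(filename, python_version=None):
    """Return True if an installer should accept this sdist file name."""
    if python_version is None:
        python_version = sys.version_info[:2]
    if tuple(python_version) < STRICT_FROM:
        # Older interpreters keep the lenient behaviour.
        return filename.endswith((".tar.gz", ".zip"))
    # Newer interpreters only accept the strict naming.
    return STRICT_SDIST.match(filename) is not None
```

With this shape, the same file can be accepted on an older interpreter and rejected on a newer one, which is what ties the migration to a Python upgrade rather than to an installer release.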

(This approach is why I’d said up to 5 years in my previous post, FWIW.)

I’m here a bit late and don’t have much meaningful to say besides “yes”…

One problem with the scheme discussed in the recent messages is the reverse one: we won’t have a definitive way to tell that an archive is not an sdist. For example, if I package CPython itself into a tarball named python-3.11.tar.gz, packaging tools will happily pick it up. But I assume that’s an agreed-upon tradeoff to retain the widely accepted .tar.gz extension.

And I think we should defer this until PEP 643 is rolled out, and at least gains “reasonable” (famously vague criterion) adoption.

OK. Would you be happy for me to update the PEP with the new proposal (I might also add @pradyunsg’s transition suggestion as a possible approach under “backward compatibility”, mostly because I think it’s a nice idea) and mark it as “on hold until PEP 643 is implemented by PyPI and at least some common backends”? As I described it, we don’t need PEP 643 for this proposal, but I’m fine with waiting for it if that’s what you prefer.

Well, CPython itself also has a setup.py so… you can already try running pip install git+https://github.com/python/cpython. :slight_smile:

If something burns down because you ran the command above, I’m not responsible.

Do we want to normalise versions as per PEP 440? I’d prefer to (it reduces the amount of weirdness we have to deal with, as well as removing the possibility of including hyphens, which need escaping further). But I guess it’s an extra responsibility for producers. Although they need to validate the version, so is normalising that much extra effort?

As someone not having to produce sdists but having to parse sdist file names, I vote yes. :wink:

Sorry about the deleted posts. I’m generating above-average quantities of stupid today…

Hatchling already does this: see hatch/backend/src/hatchling/metadata/core.py at 876a5f3928d9de0a2cc0b72613e46e2fb5c95557 in pypa/hatch on GitHub.

For name in file it does pep427(pep503(name)):
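
The snippet quoted in the original post is not reproduced here. As a hedged sketch of what that composition means (this is not Hatchling's code), the two steps can be written with the regular expressions given in PEP 503 and PEP 427. The function names `pep503`, `pep427`, and `sdist_name` simply mirror the shorthand above and are assumptions for this example.

```python
import re


def pep503(name):
    """PEP 503 normalisation: collapse runs of '-', '_', '.' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pep427(name):
    """PEP 427 escaping: replace anything but word characters and '.' with '_'."""
    return re.sub(r"[^\w\d.]+", "_", name, flags=re.UNICODE)


def sdist_name(name):
    """Name component for a file name, composed as pep427(pep503(name))."""
    return pep427(pep503(name))
```

For example, `sdist_name("My.Package-Name")` gives `my_package_name`, so the name part of the file name no longer contains a hyphen.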

One other thought. A sdist is required to contain a single top-level directory named {name}-{version}. Should we require the same normalisation rules for this as for the filename?

I suspect that in practice, backends will use the same code to generate the top-level directory and the filename, so this will happen naturally, whether or not we mandate it. But it might be good to be explicit, in any case.

Good point. I don’t see why not, for consistency’s sake. If we don’t, tools could produce inconsistent results depending on whether they check the outer tar.gz filename or the inner directory name; requiring both to match would let them rely on either interchangeably.
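
A consumer that wanted to check this consistency could do something along the following lines. This is only a sketch using the standard library's `tarfile` module; the helper name `top_level_matches` is made up for the example, and it assumes member names are plain relative paths (no leading "./").

```python
import os
import tarfile


def top_level_matches(path):
    """True if the archive's single top-level directory matches its file name."""
    filename = os.path.basename(path)
    if not filename.endswith(".tar.gz"):
        return False
    # Expected directory is the file name without the .tar.gz extension.
    expected = filename[: -len(".tar.gz")]
    with tarfile.open(path, "r:gz") as archive:
        # Collect the first path component of every member.
        roots = {member.name.split("/", 1)[0] for member in archive.getmembers()}
    return roots == {expected}
```

If the filename and the directory are generated from the same normalised string, this check passes trivially; it only fails when a backend normalises one and not the other.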

@uranusjr as co-author, are you happy with this?

I can!

This is with Bloomberg’s internal-only Python packages (i.e. packages not on PyPI). I basically did if name.endswith((".tar.gz", ".zip")): counter[name.count("-")] += 1 on a loop over all the files. I’ll redact the exact file counts, though I can say that the package index size is on the order of TBs. :slight_smile:

0   0.0016%
1  94.4120%
2   4.4493%
3   0.8342%
4   0.2082%
5   0.0373%
6   0.0513%
7   0.0062%

No files have more than 7. That comes to ~5.6% of files having a count != 1.
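
For anyone wanting to run the same survey on their own index, the counting described above can be sketched as follows. The function name `hyphen_histogram` and the percentage output format are assumptions for this example; only the `endswith` filter and the `name.count("-")` tally come from the post.

```python
from collections import Counter


def hyphen_histogram(filenames):
    """Map hyphen count -> share (in percent) of sdist-like file names."""
    counter = Counter()
    for name in filenames:
        # Only consider files that look like sdists.
        if name.endswith((".tar.gz", ".zip")):
            counter[name.count("-")] += 1
    total = sum(counter.values())
    return {
        hyphens: 100 * count / total
        for hyphens, count in sorted(counter.items())
    }
```

Wheels and other files are skipped by the extension filter, so a mixed listing can be passed in as-is.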

Cool! So that suggests (based on a sample size of 1 :wink:) that private indexes are better than PyPI, and it’s reasonable to assume PyPI is a worst case.

I’d therefore like to move forward with the updated PEP 625. It needs a PEP delegate, as I’m a co-author and therefore can’t do it. @pradyunsg would you be interested in doing that?

I’m happy to be the PEP-Delegate on this, assuming no one raises concerns with that over the next week or so. :slight_smile:

I had a colleague run the numbers for Azure Artifacts as well (all users, not just internal to Microsoft), and it looks pretty similar to PyPI.

  • we didn’t see any files with more than 7 hyphens
  • most files had exactly one hyphen, regardless of the package name (i.e. probably the suggested format)
  • most other files had exactly one more hyphen than in their package name (i.e. the package name was not normalised when creating the file)
  • ~45 packages (<0.1% overall) were “weird”

This includes packages mirrored from PyPI, but a few quick % checks agree that it’s no worse than PyPI, so Paul’s numbers from earlier are still the worst case. Mine are also based on download telemetry, so there’s a bias towards “actively used” sdists, though I didn’t get to see any actual names.

Based on the percentages, it looks like it has either around 64 500 files, or some small multiple of that.
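
The reasoning behind an estimate like this is that every published percentage has to correspond to a whole number of files. The smallest share, 0.0016%, must be at least one file, which already puts the total in the tens of thousands. A minimal sketch of a brute-force check follows; the function name `consistent_totals` is hypothetical, and the approach assumes the percentages were rounded to four decimal places and that candidate totals stay well under a million.

```python
def consistent_totals(percentages, candidates, places=4):
    """Return candidate totals for which every percentage is a whole file count."""
    half_step = 0.5 * 10 ** -places  # largest error that rounding can introduce
    matches = []
    for n in candidates:
        ok = True
        for p in percentages:
            # Nearest whole number of files that could produce this share.
            k = round(p * n / 100)
            if abs(100 * k / n - p) > half_step:
                ok = False
                break
        if ok:
            matches.append(n)
    return matches
```

Calling it with the eight shares from the table and a range such as `range(60_000, 70_000)` lists the totals that are compatible with all of them; multiples of a compatible total will generally be compatible as well, which is why only "some small multiple" can be inferred.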

Make sure to open a new SC issue once you’re ready.

By “most”, do you mean something like 60%, or 90%, or 98%? I understand if you can’t provide exact figures due to proprietary data, but without at least ballpark numbers it’s kinda hard to compare it to expectations and the other numbers mentioned here, since it was always assumed that this number was > and quite possibly >> 50%, all of which could potentially count as “most”.

About 70% of packages seen had exactly one hyphen (matching Paul’s 63.33% with two parts).

The next ~29.9% had package_name.count('-')+1 hyphens, which wasn’t measured by anyone else, but is close to (the negative of) Paul’s “only 52 filenames do not start with the project name”.

So as I said, it’s not quite as “bad” as the PyPI numbers, and so those are the best worst-case scenario we have right now.

That’s not how it works. See the PyPA process.

If their self-nomination is accepted by the other PyPA core reviewers, the lead PyPI maintainer and the default PEP-Delegate for package distribution metadata PEPs, then they will have the authority to approve (or reject) that PEP.

In this case I’m “default PEP-delegate for package distribution metadata” and I approve. @pradyunsg’s “assuming no-one raises concerns” covers the rest of the necessary acceptance.

Packaging PEPs follow a separate process from core Python PEPs.

If I remember correctly, Pradyun helped write the packaging PEP process, so he knows what to do (which, as Paul pointed out, means not bothering the SC with this). :wink:

I’m terribly sorry. I’m aware of all that, but I somehow misremembered that @brettcannon opened an SC issue when he replaced you as PEP-Delegate for PEP 639; he had actually opened a discussion on PyPA committers (amidst some ambiguity in the spec process on the specifics of that point). I mixed that up with Barry opening an SC issue for being PEP delegate on PEP 676 around the same time (which also involved a somewhat similar area of ambiguity in the process).

At least when we switched PEP delegates for PEP 639, we interpreted the somewhat unclear phrasing “If their self-nomination is accepted by the other PyPA core reviewers” to imply that the PyPA core reviewers should actually be properly notified of said nomination and given a chance to raise concerns, i.e. by the standard specified mechanism of posting to the pypa-committers email list, which I assume @pradyunsg is going to do shortly (since I don’t see it there yet).