PEP 625: File name of a Source Distribution

From my understanding, PEP 440 allows the use of - in the local version segment.

I could not find the part where PEP 517 mandates the use of a PEP 440-normalized version.

Meaning that

dist, _, version = filename.removesuffix(".tar.gz").rpartition("-")

would fail where pip’s current implementation still succeeds (given we have access to the canonical name).

>>> pip._internal.index.package_finder._extract_version_from_fragment('some-useful-tool-1.2+hello-there', 'some-useful-tool')
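To illustrate the difference, here is the naive split from above misparsing a version with a hyphenated local segment (file name made up for the example):

```python
filename = "some-useful-tool-1.2+hello-there.tar.gz"

# Splitting at the *last* hyphen puts part of the version into the name.
dist, _, version = filename.removesuffix(".tar.gz").rpartition("-")
print(dist)     # → some-useful-tool-1.2+hello
print(version)  # → there
```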

But it would definitely make sense to have such requirements for the PEP 517 sdist {NAME}-{VERSION}.tar.gz:

  • VERSION must be a normalized PEP 440 version
  • NAME must be canonical_name.replace('-', '_')
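The two requirements could be sketched like this (canonicalization per PEP 503; a sketch of the proposal, not normative code):

```python
import re

def canonicalize_name(name):
    # PEP 503: collapse runs of -, _, . into a single "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()

def sdist_name_part(name):
    # Proposed NAME component: the canonical name with "-" turned into "_".
    return canonicalize_name(name).replace("-", "_")

print(sdist_name_part("Some.Useful-Tool"))  # → some_useful_tool
```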

PEP 440 does indeed allow -, but the wheel spec requires escaping it to _ when it is present in the file name:

Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _:
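Read literally, the quoted rule is the regex substitution given in PEP 427 (shown here as a sketch, not the exact code any tool ships):

```python
import re

def escape(component):
    # Replace each run of non-alphanumeric characters with one underscore.
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)

print(escape("some-useful-tool"))  # → some_useful_tool
print(escape("django-grappelli"))  # → django_grappelli
```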

PEP 517 requires matching names for sdist and wheel:

Integration frontends require that an sdist named {NAME}-{VERSION}.{EXT} will generate a wheel named {NAME}-{VERSION}-{COMPAT-INFO}.whl.

So a PEP 517 sdist cannot have dashes in the version part of the file name.


For wheels’ metadata directory {distribution}-{version}.dist-info, should the PEP 440 distribution part have hyphens replaced by underscores?

Good question! PEP 427 says the .dist-info directory follows PEP 376, which does not have the same escape rules. So the answer would be no if you implement to the spec.

But the same logic also applies to the name part, and none of the popular installers actually follow that—they do escape the name part:

$ pip install django-grappelli --no-deps -t ./tmp
$ ls ./tmp
django_grappelli-2.14.2.dist-info  grappelli

which indicates that the right answer should be yes, and we should fix the spec to reflect the reality.


For anyone else who didn’t know, PEP 503 defines the canonicalization of names.

Did this get moved to pip’s issue tracker?

And that would just change it to name, _, version = stem.partition("-") which is nice.
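With a normalized name (which can contain no hyphens), the split becomes unambiguous:

```python
stem = "some_useful_tool-1.2+hello.there"

# The first hyphen is guaranteed to be the name/version separator.
name, _, version = stem.partition("-")
print(name)     # → some_useful_tool
print(version)  # → 1.2+hello.there
```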

Right, but what about older sdists that are not following this? That was the motivator of my question: how much compatibility do we want to have here with old sdists?

So where do we sit on this? Document how pip handles the name parsing now but that sdists going forward are expected to follow PEP 517 for sdist v0 files?

Is this being tracked anywhere?

It wasn’t. I’m personally not in favour of doing this because I don’t see the merit. The wheel format already mandates PEP 440 versions, and if we can push out a new sdist format that does likewise, old-style sdists can keep doing whatever they want. We can’t get rid of those old-style sdists on PyPI anyway, and hopefully one day people would just forget about them like eggs.

I think that’s reasonable. pip can document how it parses old-style sdist names (maybe in the Architecture documentation), so alternative tools (like mousebender) can try to do the same if they want to. Once we can distinguish new sdists from legacy ones (either by file name or archive metadata), tools can decide whether to only care about the former and implement PEP 517-compatible parsing, or add support for legacy formats by copying pip’s implementation.

I don’t think so. This should probably be done by

  1. A new PEP to describe the .dist-info naming scheme.
  2. Modify PEP 427 (or have a new wheel spec version?) to refer to that new PEP instead of PEP 376.

I’ve personally been occupied recently (which is also the reason I’m not moving this thread forward) and likely will be for another month, so don’t expect much to happen from me until then :slightly_smiling_face:


I would assume such a thing would end up in ‘packaging’ eventually anyway.

Ditto. I’m just poking this enough to keep it alive, but not enough to help drive it to conclusion.

I’m coming to the conclusion that there’s not a lot of practical benefit to be had here if we don’t change the extension.

Parsing .tar.gz filenames from an index is relatively straightforward, and basically a solved problem. (It could actually be made better: by the time we get a filename from an index, we know the project name from the name of the index page, so we can use that to help parsing.) And detecting that a .tar.gz file is a sdist when it doesn’t come from a controlled source like an index is fundamentally unsolvable.

So unless we switch to a new extension, I see no real benefit to tools here. And I’m not sure there’s much appetite for a name change without standardising the content, at least to a basic extent.

For me, knowing how we do it today is helpful for things like mousebender and for any tools that want to support all the current sdists up on PyPI and elsewhere.

I’m getting the same impression.

I believe the comment has been made that switching to .sdist is great for tools, while .tar.gz is great for users from a backwards-compatibility perspective. I don’t think that’s been disputed, but rather there’s disagreement as to where the importance lies: making things work better for tools, or making things easier for users from a transition/backwards-compatibility perspective.

Surely we can standardise generating and parsing separately? In PEP 514 I defined a fallback policy for consumers for cases where the metadata creator wasn’t up to date.

That would let us separate “you must create names like this” and “you should read names like that” and at least start a migration. Unless we have an urgent need to deprecate all past packages, which we don’t, we really just need to tell packagers to normalise their names properly. Everyone else has to read old names anyway, so we tell them about the edge cases we know of.

Pragmatism beats purity in this case I think.

(Edit - on rereading this made very little sense)

The main idea is to specify normalisation for generators. We don’t really seem to need any changes other than that right now, and the same rules as for wheels seem easy.

Then the parsing rule is “if there’s exactly one hyphen, split at it to get name/version”, and it works for all “new” and many old names.

Parsers aren’t going to get to drop old names on the floor just yet, so the best thing we can do is offer guidance on how to deal with them. Continuing the logic above, we are really just resolving ambiguous hyphens. It’s late, so I’m not doing it now, but surely there are only a few heuristics needed (including “if you already know the name…”) that we can write up.

Do we need any changes to the file names besides normalising the parts?

That’s done, as noted earlier - PEP 517 specifies the name of a sdist¹. We’re already onto the “you should read names like this” discussion, and for that we need reliability - i.e., when reading we need to be sure the rules work. Otherwise we just have heuristics (which is what we have already), not a standard.

One question - on PyPI, the project name and version can be inferred accurately by code for nearly all sdists. Every sdist is visible on a project page, so we know the project, and by stripping that off the filename we are left with the version (the strip needs to ignore all non-alphanumeric characters, because many projects mess up normalisation, but the resulting approach works). Is it conceivable to run a mass renaming exercise on PyPI to standardise sdist names using an algorithm like this? We’d likely need a migration plan, and communication to inform projects if we rename their stuff, but is it a viable idea? It would address the biggest blockage to this whole exercise, which is “old PyPI stuff”…

¹ In theory, it specifies the name build_sdist must use, there’s nothing stopping a tool saving a sdist to disk with a different name, but let’s not be silly.
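The name-stripping approach described above could be sketched roughly like this (modelled on the idea behind pip’s `_extract_version_from_fragment`, not its actual code; the loose comparison ignores differences in separator characters):

```python
import re

def _canon(s):
    # Collapse runs of non-alphanumerics so e.g. "Some.Useful_Tool"
    # and "some-useful-tool" compare equal.
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")

def extract_version(filename, canonical_name):
    """Strip a known project name off a sdist filename and return the
    version part, or None if the name doesn't match."""
    stem = filename
    for suffix in (".tar.gz", ".zip"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    # Try each hyphen as the candidate name/version split point.
    for i, ch in enumerate(stem):
        if ch == "-" and _canon(stem[:i]) == canonical_name:
            return stem[i + 1:]
    return None

print(extract_version("Some.Useful_Tool-1.2+hello-there.tar.gz",
                      "some-useful-tool"))  # → 1.2+hello-there
```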

Sorry, it seemed like we were on the “should we change the extension and format as part of the naming change as well” part of the discussion.

If it’s already specified, this PEP should be withdrawn, yeah? And someone just needs to port Warehouse’s logic into packaging?

No, you’re right (in a way). The problem with the existing standard is that it doesn’t allow for recognising a sdist in a context where you’re not already sure you have a sdist. And that means we might want to change the current standard. So yes, we can look at that as purely a question of “this is how you write sdist names in the future”. But I’m not clear how it helps - any new standard will be trivially parseable, because it can be, and why would we make it otherwise?

Yes - to get any benefit for tools, we need a new extension. That was the point of my comment above. We’re not going to standardise the heuristics for interpreting existing .tar.gz filenames, precisely because they are heuristics. So tools will continue to do this regardless of any standard - and they already have the logic present. We may move “best practice” heuristics to a library like packaging or mousebender, but they won’t be a standard. They can be guaranteed accurate if you know you have a sdist conforming to PEP 517 naming, but degrading gracefully is really important, as we have no way of rejecting non-conformant names.

But as the only use case for PEP 625, standardising the filename before the internal structure, was to help tools reliably detect sdists and parse their names, if we can’t agree (yet) to use a new extension, then it should be withdrawn in favour of a future “standardise sdists” PEP.

Not “if it’s already specified”, but “if we aren’t going to get a new extension”. But otherwise yes.

It’s probably pip’s logic, and maybe should go to mousebender rather than packaging, but that’s optional, and it’s being worked on (I’m looking at it and I think Brett is too…)

This feels like it’s still aiming for purity. The only benefit in having a guaranteed accurate name is in the case where you literally have no “old-style” names at all, which isn’t going to happen for anyone any time soon.

Pragmatically, what you’re going to want is your first heuristic to be 100% reliable for conformant names, including those that happen to conform by accident, and that rejects any that may not be conformant. My proposal is:

  • .tar.gz or .zip suffix
  • basename only contains one hyphen
  • part between hyphen and suffix parses as a PEP 440 version

If these don’t work out, you switch into compatibility mode and do your heuristics.

If they do all work out, the only issue left is if the filename is a complete lie and the contents don’t match. But that’s always been a Bad Thing, and it will continue to be a risk even if we standardise on another approach, just as hypothetical files with a .sdist suffix might still have invalid version numbers or unnormalised names. The only advantage of changing the extension there is that you get to tell more users to go away and fix it themselves, rather than helping some of them succeed despite not using fully up-to-date dependencies.
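The three checks above might look like this in code (a sketch; the version pattern here is a simplified stand-in for a full PEP 440 parse, which real code should delegate to packaging.version):

```python
import re

# Simplified PEP 440 shape check, NOT the full grammar.
_VERSION_RE = re.compile(
    r"^\d+(\.\d+)*((a|b|rc)\d+)?(\.post\d+)?(\.dev\d+)?"
    r"(\+[a-z0-9]+([._][a-z0-9]+)*)?$"
)

def parse_conformant_sdist_name(filename):
    """Return (name, version) if the name looks conformant, else None."""
    for suffix in (".tar.gz", ".zip"):
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        return None                  # wrong suffix
    if stem.count("-") != 1:
        return None                  # ambiguous: fall back to heuristics
    name, _, version = stem.partition("-")
    if not _VERSION_RE.match(version):
        return None                  # version part isn't PEP 440-shaped
    return name, version

print(parse_conformant_sdist_name("some_useful_tool-1.2.3.tar.gz"))
# → ('some_useful_tool', '1.2.3')
print(parse_conformant_sdist_name("some-useful-tool-1.2.3.tar.gz"))
# → None (more than one hyphen, so drop into compatibility mode)
```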

I’m still not clear what you’re suggesting here. Precisely what (in terms of specific tools, and code in those tools) would change under your suggestion?

What I’m saying is that if we have a standard (which would need a new name), pip can handle local filenames and direct URLs differently - we can go direct to the sdist code path and take the name and version from the filename. With .tar.gz files (i.e., the current situation) we can’t do that, because a .tar.gz file might not be a sdist, so we have to download and unpack it, and call PEP 517 to get the metadata - because all we can assume is that it’s a buildable packed source tree.

Note that nothing will change for files downloaded from PyPI (or another index) as in that case we can reasonably assume that a .tar.gz file is a sdist.

Ah, I was fixating on PyPI/indexes rather than arbitrary links.

But why do you need the name and version info if you’ve already got a direct link? How are you going to select a better option if the version breaches constraints? Is this purely about rejecting optional packages? Or a more efficient error case?

And if all you’ve got is the name and version separately and are looking at a known list of files (local directory or an explicit list) for a match, you can normalise the candidates far more aggressively for the comparison and just ignore anything that isn’t “close enough”.

Maybe I’m still missing a key scenario, but it does seem pretty close to “in order to avoid breaking 1% of users we’re going to break 100% of users” (which is sometimes the right choice - I’ve done it myself - but I’m not convinced this is one of those occasions yet).

The better option is selected by backtracking, i.e. discard whatever specifies that direct link, and try something else. Not needing to build, or even download, the package for its version is a very significant performance gain for many sdists.

So I guess I’ve missed what specifies a selection of direct links that isn’t PyPI (and affects the majority of pip users). Or is this just for transitive dependencies, where you’ll reject A==1 because of what it depends on and look for a different version of A, because C depended on B==3?

What if we just put a big warning in the (pip?) docs that if you reference an archive without a parseable name (or version number in the fragment?), your users will be forced to download it and they’ll get mad at you? Then print a message during solving that blames the package with the dependency for the download.

I stand by my view that you can’t get rid of this completely anytime soon, and will have to deal with it regardless. If you provide a way for packagers to solve the actual issue themselves, they will. We don’t have to rename the world for it.

Transitive dependencies is the answer, yes. And all that you’d achieve putting up a big warning is people coming to pip’s issue tracker saying we’re wrong, pip should be able to infer the sdist’s name and version like wheels, the file name already seems parsable, why are you building my packages?

And what I don’t get after all the messages is, why exactly is that a problem? No, we’re never going to get rid of all the old sdist name parsing stuff. But that’s totally fine.

  1. Now: Every sdist needs to be built to know its version.
  2. Your name rule proposal: Most sdists on PyPI can be reliably parsed. Sdists elsewhere will continue to be generated without reliable name parsing rules because we can’t get rid of the old tooling people continue to use against our wish.
  3. New extension: Sdists using the new extension can always be parsed reliably, no matter where they come from. Sdists using the old extension behave exactly like now, and users can ask their project maintainers to improve them (by switching to the new extension).

As far as I can see, scenario 3 is strictly better than both scenarios 1 and 2, and scenario 2 has no advantages over 3 (except for project maintainers who need to generate a new format).

Edit: Come to think of it, why can’t we do both? The one-hyphen parsing rule would help fix the problem in many situations right now, while the new extension can fix the rest as projects migrate (and once we also standardise the format, projects that benefitted from the one-hyphen parsing rule gain those improvements as well). They don’t really conflict at all.

This is basically my point, except that since the parsing rule will apply equally to both, it no longer justifies creating a new format. So we should defer that to a time where it adds value (because we’ve come up with better ideas for the content/format of the sdist, not just the name).