Escaping versions for wheel, sdist, and .dist-info names

The discussion in Clarify naming of .dist-info directories has reminded me of a couple subtle problems with PEP 427 and its handling of versions in wheel filenames, and now that progress is being made on escaping of versions in sdist and .dist-info directory names, this seems like as good a time as any to bring it up.

Problem number 1: PEP 427 requires that the version component of a wheel filename only contain alphanumeric characters, underscores, and periods, with all other characters converted to underscores; however, this overlooks the fact that, per PEP 440, version strings can also contain exclamation points (to denote version epochs) and plus signs (to denote local version identifiers). Converting these two symbols to underscores causes a loss of information that makes it impossible in certain cases to compare the versions of two wheels just by inspecting their filenames; for example, 1!2, 1+2, and 1-2 all end up escaped as 1_2 (which, incidentally, is not a valid PEP 440 version; see below). The wheel project ran into this problem in issue 268, which led to them greatly loosening their wheel filename regex, and the author of PEP 427 has written that applying the same escaping rules to the version component as to the other filename components is “probably a mistake”, yet there does not appear to have ever been any follow-up on this.
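To make the collision concrete, here is a minimal sketch of the PEP 427 escaping rule applied to the three versions mentioned above. The helper name `pep427_escape` is made up for illustration; the regular expression is the one given in PEP 427's "Escaping and Unicode" section.

```python
import re


def pep427_escape(component: str) -> str:
    """Replace each run of characters other than alphanumerics,
    underscores, and periods with a single underscore."""
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)


versions = ["1!2", "1+2", "1-2"]
escaped = {v: pep427_escape(v) for v in versions}

# All three distinct versions end up with the same escaped form.
print(escaped)
```

Once the filename has been written, there is no way to tell which of the three versions the wheel originally had.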

I would thus like to request that the relevant standards be amended to allow ! and + in version components of wheel, sdist, and .dist-info names. For the record, a scan yesterday of the 1,396,899 wheels on PyPI found 58 with exclamation points in their versions and 244 with plus signs (the latter presumably uploaded before Warehouse started blocking local versions), in comparison to the 998 with underscores in their version components.

The second problem with version escaping as currently specified is its blanket transformation of all hyphens to underscores. Under PEP 440, hyphens and underscores in version strings are completely interchangeable, with one exception: the post in a post-release specifier can be replaced by a hyphen and only by a hyphen. (Interestingly, this restriction contradicts the statement later in the PEP that “[PEP 440] allows [the underscore’s] use anywhere that - is acceptable.”) So if we start out with a version string of the form 1.0-1 (an alternative spelling of the canonical 1.0.post1), it gets escaped to 1.0_1, which is not a valid PEP 440 version.
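A quick way to check this, assuming the third-party `packaging` library is installed (the `is_valid` helper is just for illustration):

```python
from packaging.version import InvalidVersion, Version


def is_valid(version: str) -> bool:
    """Return True if `packaging` accepts the string as a PEP 440 version."""
    try:
        Version(version)
    except InvalidVersion:
        return False
    return True


# The hyphen form is accepted as an implicit post-release.
post_release = Version("1.0-1")
print(str(post_release))

# The escaped form, with an underscore in place of the hyphen, is rejected.
print(is_valid("1.0_1"))
```

An underscore is still fine when the word post is spelled out, as in 1.0_post1; it is only the implicit form that has no underscore equivalent.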

Possible ways to handle this are:

  • Amend PEP 440 and packaging to permit underscores in place of post

  • Require versions to be canonicalized before escaping, thereby eliminating all hyphens without affecting PEP 440 validity. (The .dist-info name proposal already requires project names to be canonicalized, but not versions.)

  • Amend the escaping rules for version components to be “Replace all hyphens with underscores, except for those hyphens that indicate an implicit post release, which should instead be replaced with the string .post.”

  • Require versions to be escaped by converting them to an equivalent form modulo canonicalization that does not contain a hyphen and leave it up to the wheel, sdist, and .dist-info generators exactly what they want to do.

  • Document that version strings in file & directory names need to be unescaped before use. Assuming that ! and + are allowed unescaped in version components, this leaves the hyphen as the only character in a valid PEP 440 string that needs escaping, and so unescaping is just s.replace("_", "-").
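The last option can be sketched as follows. The function names are hypothetical, and the sketch assumes that ! and + are left unescaped so that the hyphen is the only character being replaced.

```python
def escape_version(version: str) -> str:
    """Escape a version for use in a file or directory name."""
    return version.replace("-", "_")


def unescape_version(escaped: str) -> str:
    """Undo escape_version() before handing the string to a version parser."""
    return escaped.replace("_", "-")


escaped = escape_version("1.0-1")
restored = unescape_version(escaped)
print(escaped, restored)
```

Note that unescaping does not always give back the exact original string: a version written as 1.0_post1 comes back as 1.0-post1. Since PEP 440 treats the two separators as interchangeable in that position, the result is still the same version, and unlike 1.0_1 it remains parseable.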

Following the same logic as the name field, the most straightforward would be to require version normalisation from now on (with the same “parser should expect unnormalised input” footnote), and replace dashes with underscores (which is really the only thing required).

Following the same logic as the name field

Last time I checked, the logic for normalizing project names is only going to be applied to sdists and .dist-info directories; is PEP 427 going to be amended to require normalization in wheel filenames as well?

replace dashes with underscores

If a version is normalized, it won’t contain any dashes.

I assume so, since it’s not worth changing the .whl extension (which would be a must if we were to create a new filename spec). And the existing rules already don’t work, as you described, so there are likely very few (if any) edge cases.

For wheels, yes. I believe .dist-info names could theoretically contain non-PEP-440 versions though. It also doesn’t really hurt to preemptively cover the possibility of dashes in the future IMO.

pip 20.3b1 just hit this problem: https://github.com/pypa/pip/issues/9083

So the discussion is quite timely :stuck_out_tongue:

Are exclamation marks ! and plus-signs + valid in all common filesystems? Also, will package indexes be required to quote these characters in the URL?

I checked, they are valid on Windows and POSIX-compliant systems. Both require escaping in URLs, but clients are already expected to handle them correctly (implied by Simple Repository API being HTML based).
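For illustration, this is roughly what percent-encoding such a filename looks like with the standard library. The wheel filename is a made-up example, and whether a given index quotes these characters in its links is not something this sketch settles.

```python
from urllib.parse import quote, unquote

# Hypothetical wheel filename with an epoch (!) and a local version (+).
filename = "example-1!2.0+local-py3-none-any.whl"

# quote() percent-encodes both characters by default.
encoded = quote(filename)
print(encoded)

# A client decoding the URL path gets the original filename back.
decoded = unquote(encoded)
print(decoded == filename)
```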

I (finally) proposed an update to PEP 427 for this.

I don’t think we have sufficient consensus here for a PEP update yet.

I need more time to think this through, but some initial thoughts that I’d like to be taken into consideration.

  • IMO, it’s probably about time that the wheel spec should be moved to the PyPA Specifications document, particularly if we’re making changes to it. Therefore, whatever conclusion we arrive at here, should not be implemented as a direct change to PEP 427, but as a PR to the packaging user guide that moves the wheel spec there and makes the agreed changes (ideally as 2 separate commits, so that it’s easy to review the move and the modification independently).
  • There’s talk in this thread of canonicalising versions. Assuming you mean normalization as defined in PEP 440, can we be explicit that this is what we are referring to? I assume someone has confirmed that normalised form doesn’t use dashes. It might also be worth giving example code of how to normalise (i.e., str(packaging.version.Version(v)) to get a normalised string).
  • I’m a strong -1 on anything that means that the version component of the wheel filename isn’t the same (in the sense of version equality) as the version in the metadata. “Canonicalising” characters like ! and + to underscore will break that.
  • The only actual requirement for wheel filenames is that the various “components” don’t contain a - character. I’d rather see that made explicit.
  • Making a rule that “normalised forms of stuff will never contain dashes” would be a better and more general solution to the problem of combining metadata, so we should do that, and make dash our official choice for “how we join stuff together”.
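As a sketch of the normalisation mentioned in the list above, assuming the third-party `packaging` library is available (`normalise_version` is a hypothetical wrapper, and the input strings are arbitrary examples):

```python
from packaging.version import Version


def normalise_version(version: str) -> str:
    """Return the PEP 440 normalised spelling of a version string."""
    return str(Version(version))


examples = ["1.0-1", "1.0_RC1", "v2.0", "1!2.0+Ubuntu-1"]
normalised = [normalise_version(v) for v in examples]

for raw, norm in zip(examples, normalised):
    # None of the normalised forms should contain a dash.
    print(f"{raw!r} -> {norm!r} (contains dash: {'-' in norm})")
```

The epoch marker and the local version separator survive normalisation, while every dash in the inputs is either dropped or turned into a period.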

My suggestion would be that the Escaping and Unicode section of PEP 427 be rewritten to say something like the following:

The components of the file name (distribution, version, build tag, python tag, abi tag, platform tag) MUST NOT contain dashes. Normalisation rules for the various items ensure that dashes are not valid within components, so the rule here is that every component should be normalised.

  • The distribution name must be normalised as follows: (put some words here about what that means¹).
  • The version must be normalised as defined in PEP 440 (include a link).
  • The build tag must only include alphanumeric and underscore characters.
  • The definition of compatibility tags (include a link) does not allow dashes.

¹ Distribution name normalisation is a mess. PEP 427 does one thing, PEP 503 does something different. We should clean this up, but it’s not going to happen in this change. So let’s just state the existing PEP 427 rules, but limit them to the distribution name field.

We also need to formally document the constraint that standard forms must not contain dashes somewhere in the relevant specs, so that future changes to the version or compatibility tag specifications don’t inadvertently break that rule. Maybe a “Normalisation rules” section in the PyPA Specifications document would be the best place.

Finally, I’d like to see some more participation in this discussion from PyPA members here. Would a change like this impact the setuptools (@jaraco, @pganssle), flit (@takluyver), wheel (@agronholm, @dholth) or packaging (@brettcannon, @pradyunsg) projects? @uranusjr and I cover the implications for pip, but other pip maintainers’ views are welcome too. I assume the answer is probably “no”, so I don’t expect everyone to weigh in, but I’d like some indication that the community is OK with any standards change we agree on.

It would prompt a change in Flit, but not an onerous one. This discussion was reawakened yesterday following a prod from me, because Flit does what PEP 427 currently says, and pip doesn’t expect that. Specifically, Flit turns a version number like 0.1+foo into 0.1_foo for the filename, and then pip complains. People reported this as a bug in Flit.

One minor footnote where it might matter: flit_core 2.x supported building projects from source on Python 2, but I’ve now re-dropped Python 2 support. Projects that still support Python 2 rely on old versions of flit_core, so wouldn’t get a change like this.

Thanks for the information. I agree that this is in fact a bug in the spec - the fact that 0.1_foo isn’t a valid PEP 440 version makes it an unacceptable “normalisation”. Glad to hear it won’t be a major problem for flit. (For me, this is also confirmation that having multiple tools implementing the specs is a good thing, as it exposes issues like this :slightly_smiling_face:).

This would probably require changes in wheel but nothing too difficult. At least wheel file name parsing would have to be adjusted, but I’m already planning to delegate that to “packaging” since it now has a function for that. The bdist_wheel command would have to be checked for compliance. But overall, nothing too drastic.

Thomas has said that he would prefer to repeat some of the setuptools normalization rules in the wheel spec so that it is easier to implement without having to read every spec. That wording ‘runs of …’ comes from pkg_resources.safe_name https://github.com/pypa/pkg_resources/blob/main/pkg_resources/__init__.py#L1415

But the only thing that is wheel-specific is to make sure there are no dashes, so that we can split on dashes. The other rules about name and version normalization (whether a name or version number as part of a wheel filename is otherwise valid in packaging) come from other specs. For example wheel also says “Unicode please” when other specs may restrict to ASCII.
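A minimal sketch of what "split on dashes" means in practice. The function name and the returned dictionary layout are invented for illustration; a real tool would more likely rely on an existing parser.

```python
def split_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its components.

    This only works because no individual component contains a dash.
    """
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel filename: {filename!r}")

    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")

    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = split_wheel_filename("example_pkg-1!2.0+local-py3-none-any.whl")
print(info["version"])
```

With ! and + left unescaped, the version comes out of the filename exactly as it would appear in the metadata.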

At a glance I would have the wheel spec say that dash to underscore was the only required escaping rule and to repeat the escaping rules from other PEPs as an aside.

Shall we move forwards? I’m broadly happy with @uranusjr’s proposed change, to say that - should be replaced with _ in all components.

I’d like to make the rules explicit within the spec where practical, though I’m not going to insist upon this point. E.g. if we refer to the recording installed packages spec, that points to PEP 503 for name normalisation, which says that a run of the characters -, _ and . in the name should be replaced with a single -. I think the wheel spec should say that runs of -, _ and . in the distribution name are replaced with a single _ in the wheel filename, even if it refers to other specs to explain why.

(I’m OK with referring to PEP 440 for version normalisation, because that’s a more complex topic)
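Spelled out in code, the rule for the distribution name might look something like this. The function name is hypothetical; the character class is the one PEP 503 uses (`[-_.]+`), with the replacement switched from a dash to an underscore. PEP 503 also lowercases the name, which this sketch leaves out because it is a separate question.

```python
import re


def wheel_distribution_name(name: str) -> str:
    """Replace each run of dashes, underscores, and periods with one underscore."""
    return re.sub(r"[-_.]+", "_", name)


print(wheel_distribution_name("my-cool.package__name"))
```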

The part I’m not sure about is how to proceed. Do I make a PR to packaging.python.org that mostly copies PEP 427, but rewrites the file name normalisation part?

I’ve just made a PR to copy the wheel spec with no changes as a first step.

Yes, that’s the way we should go. Also, a PR to PEP 427 noting that the canonical reference for the wheel format is now at packaging.python.org, with a link to the new page.

I don’t think we should do that. Replacing - with _ isn’t reversible, and IMO it’s essential that we can rely on recovering the correct value from the wheel filename. In my view, we should:

  1. Give the rules for normalising names. I’m neutral on whether we should link to other rules, copy the existing rule, or change it. All that matters to me is that normalising a name will ensure it doesn’t contain -. At some point we will need to agree on a canonical “normalised name” form, if only for everyone’s sanity, but for now I’m OK with being expedient here.
  2. State that versions must be in PEP 440 normalised form (with a link), and note that this format will never contain -.
  3. Say that it is invalid for any of the wheel components to contain a -, and tools must refuse to create wheels where a component contains a - character.

If anyone knows of a case where any component other than name or version is allowed by an existing standard to contain a dash, please speak up. But I don’t think there are any.
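Point 3 could be enforced along these lines. This is only a sketch: the function name and signature are invented, and it assumes the name and version have already been normalised by the caller.

```python
from typing import Optional


def build_wheel_filename(
    name: str,
    version: str,
    python_tag: str,
    abi_tag: str,
    platform_tag: str,
    build_tag: Optional[str] = None,
) -> str:
    """Join already-normalised components into a wheel filename.

    Refuses to produce a filename if any component contains a dash,
    rather than silently rewriting it.
    """
    components = [name, version]
    if build_tag is not None:
        components.append(build_tag)
    components += [python_tag, abi_tag, platform_tag]

    for component in components:
        if "-" in component:
            raise ValueError(f"wheel filename component contains '-': {component!r}")

    return "-".join(components) + ".whl"


print(build_wheel_filename("example_pkg", "1.0.post1", "py3", "none", "any"))
```

Raising an error instead of substituting an underscore keeps the filename reversible: whatever is in the filename is exactly what the tool was given.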

Can we already agree that the spec will be changed so it unambiguously requires a wheel with the version 1.0+x to have that exact string in its file name?

Then flit can be changed now in order to make pip play nice with it again (and fix our CI)

This may be tangential, but I wonder if we should have rules on how platforms should name themselves for a wheel. Currently the implementation is doing basically sysconfig.get_host_platform().replace("-", "_") because, well, dashes aren’t allowed, so let’s replace all of them with underscores. But there’s nothing stopping a platform from returning something that’d be ambiguous after the dash-underscore replacement.

That said, this is probably not a practical worry, since PyPA essentially controls what platforms are valid (by maintaining packaging.tags). But maybe this would be worthy of an Informational PyPA Specification.
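To illustrate the ambiguity, here is a small sketch. The two platform strings are invented and do not correspond to any real platform; `sysconfig.get_platform()` is used only to show the replacement on the current machine, and the helper covers just the dash replacement described above.

```python
import sysconfig


def platform_tag(platform: str) -> str:
    """Apply the dash-to-underscore replacement described above."""
    return platform.replace("-", "_")


# What the replacement gives on the machine running this code.
print(platform_tag(sysconfig.get_platform()))

# Two hypothetical platform names that differ only in where the dash sits.
tag_a = platform_tag("foo-bar_baz")
tag_b = platform_tag("foo_bar-baz")
print(tag_a, tag_b, tag_a == tag_b)
```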

Yes, I think everyone’s already on board with this from the beginning of the conversation (even predating this issue being surfaced in Flit). Please go ahead with the change! I think Flit can provide best forward-compatibility by normalising the version part (by either packaging.utils.canonicalize_version(s) or str(packaging.version.Version(s))), and using it verbatim in the wheel’s file name.
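A small comparison of the two calls, assuming the third-party `packaging` library is installed. As far as I can tell they agree on the Flit example but not on every input, so it is worth checking which behaviour is wanted before picking one.

```python
from packaging.utils import canonicalize_version
from packaging.version import Version

# The version from the Flit report: both calls leave it unchanged.
raw = "0.1+foo"
via_str = str(Version(raw))
via_canonicalize = canonicalize_version(raw)
print(via_str, via_canonicalize)

# With trailing zeros the two calls give different strings.
with_zeros = "1.0.0"
print(str(Version(with_zeros)), canonicalize_version(with_zeros))
```

With its default arguments, canonicalize_version() strips trailing zeros from the release segment, whereas str(Version(...)) keeps them.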

I’ve had a go at updating the spec in its new location: