Escaping versions for wheel, sdist, and .dist-info names

pip 20.3b1 just hit this problem: https://github.com/pypa/pip/issues/9083

So the discussion is quite timely :stuck_out_tongue:

Are exclamation marks ! and plus-signs + valid in all common filesystems? Also, will package indexes be required to quote these characters in the URL?

I checked, they are valid on Windows and POSIX-compliant systems. Both require escaping in URLs, but clients are already expected to handle them correctly (implied by Simple Repository API being HTML based).

I (finally) proposed an update to PEP 427 for this.

I don’t think we have sufficient consensus here for a PEP update yet.

I need more time to think this through, but some initial thoughts that I’d like to be taken into consideration.

  • IMO, it’s probably about time the wheel spec was moved to the PyPA Specifications document, particularly if we’re making changes to it. So whatever conclusion we reach here should not be implemented as a direct change to PEP 427, but as a PR to the packaging user guide that moves the wheel spec there and makes the agreed changes (ideally as 2 separate commits, so that it’s easy to review the move and the modification independently).
  • There’s talk in this thread of canonicalising versions. Assuming you mean normalization as defined in PEP 440, can we be explicit that this is what we are referring to? I assume someone has confirmed that normalised form doesn’t use dashes. It might also be worth giving example code of how to normalise (i.e., str(packaging.version.Version(v)) to get a normalised string).
  • I’m a strong -1 on anything that means that the version component of the wheel filename isn’t the same (in the sense of version equality) as the version in the metadata. “Canonicalising” characters like ! and + to underscore will break that.
  • The only actual requirement for wheel filenames is that the various “components” don’t contain a - character. I’d rather see that made explicit.
  • Making a rule that “normalised forms of stuff will never contain dashes” would be a better and more general solution to the problem of combining metadata, so we should do that, and make dash our official choice for “how we join stuff together”.

My suggestion would be that the Escaping and Unicode section of PEP 427 be rewritten to say something like the following:

The components of the file name (distribution, version, build tag, python tag, abi tag, platform tag) MUST NOT contain dashes. Normalisation rules for the various items ensure that dashes are not valid within components, so the rule here is that every component should be normalised.

  • The distribution name must be normalised as follows: (put some words here about what that means¹).
  • The version must be normalised as defined in PEP 440 (include a link).
  • The build tag must only include alphanumeric and underscore characters.
  • The definition of compatibility tags (include a link) does not allow dashes.

¹ Distribution name normalisation is a mess. PEP 427 does one thing, PEP 503 does something different. We should clean this up, but it’s not going to happen in this change. So let’s just state the existing PEP 427 rules, but limit them to the distribution name field.

We also need to formally document the constraint that standard forms must not contain dashes somewhere in the relevant specs, so that future changes to the version or compatibility tag specifications don’t inadvertently break that rule. Maybe a “Normalisation rules” section in the PyPA Specifications document would be the best place.

Finally, I’d like to see some more participation in this discussion from PyPA members here. Would a change like this impact the setuptools (@jaraco, @pganssle), flit (@takluyver), wheel (@agronholm, @dholth) or packaging (@brettcannon, @pradyunsg) projects? @uranusjr and I cover the implications for pip, but other pip maintainers’ views are welcome too. I assume the answer is probably “no”, so I don’t expect everyone to weigh in, but I’d like some indication that the community is OK with any standards change we agree on.

It would prompt a change in Flit, but not an onerous one. This discussion was reawakened yesterday following a prod from me, because Flit does what PEP 427 currently says, and pip doesn’t expect that. Specifically, Flit turns a version number like 0.1+foo into 0.1_foo for the filename, and then pip complains. People reported this as a bug in Flit.
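To make the 0.1+foo case concrete, here is a minimal sketch of the escaping expression given in PEP 427, applied to a version string. The helper name pep427_escape is made up for illustration, and whether Flit used exactly this expression is an assumption; the point is only that the rule as written turns + and ! into underscores.

```python
import re


def pep427_escape(component: str) -> str:
    """Replace each run of characters that are neither word characters
    nor "." with a single underscore, as the PEP 427 sample does."""
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)


print(pep427_escape("0.1+foo"))  # 0.1_foo
print(pep427_escape("1!2.0"))    # 1_2.0
```

Neither result is the same version as the input, which is the mismatch pip complains about.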

One minor footnote where it might matter: flit_core 2.x supported building projects from source on Python 2, but I’ve now re-dropped Python 2 support. Projects that still support Python 2 rely on old versions of flit_core, so wouldn’t get a change like this.

Thanks for the information. I agree that this is in fact a bug in the spec - the fact that 0.1_foo isn’t a valid PEP 440 version makes it an unacceptable “normalisation”. Glad to hear it won’t be a major problem for flit. (For me, this is also confirmation that having multiple tools implementing the specs is a good thing, as it exposes issues like this :slightly_smiling_face:).

This would probably require changes in wheel but nothing too difficult. At least wheel file name parsing would have to be adjusted, but I’m already planning to delegate that to “packaging” since it now has a function for that. The bdist_wheel command would have to be checked for compliance. But overall, nothing too drastic.

Thomas has said that he would prefer to repeat some of the setuptools normalization rules in the wheel spec so that it is easier to implement without having to read every spec. That wording ‘runs of …’ comes from pkg_resources.safe_name https://github.com/pypa/pkg_resources/blob/main/pkg_resources/init.py#L1415

But the only thing that is wheel-specific is to make sure there are no dashes, so that we can split on dashes. The other rules about name and version normalization (whether a name or version number as part of a wheel filename is otherwise valid in packaging) come from other specs. For example wheel also says “Unicode please” when other specs may restrict to ASCII.
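As a rough illustration of why "no dashes inside components" is the one wheel-specific requirement, here is a sketch of a parser that does nothing but split on dashes. The function name split_wheel_filename and the example filenames are hypothetical, and a real tool would validate much more than this.

```python
def split_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its components purely on dashes."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"expected 5 or 6 components, got {len(parts)}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


good = split_wheel_filename("example_pkg-1.0+x-py3-none-any.whl")
print(good["name"], good["version"])  # example_pkg 1.0+x

# A stray dash in the name shifts every later field by one position:
# the parse "succeeds" but the version is wrong.
bad = split_wheel_filename("example-pkg-1.0-py3-none-any.whl")
print(bad["name"], bad["version"], bad["build"])  # example pkg 1.0
```

Note that + in the version causes no trouble for splitting; only the dash does.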

At a glance I would have the wheel spec say that dash to underscore was the only required escaping rule and to repeat the escaping rules from other PEPs as an aside.

Shall we move forwards? I’m broadly happy with @uranusjr’s proposed change, to say that - should be replaced with _ in all components.

I’d like to make the rules explicit within the spec where practical, though I’m not going to insist upon this point. E.g. if we refer to the recording installed packages spec, that points to PEP 503 for name normalisation, which says that a run of -_. in the name should be replaced with a single -. I think the wheel spec should say that runs of -_. in the distribution name are replaced with a single _ in the wheel filename, even if it refers to other specs to explain why.

(I’m OK with referring to PEP 440 for version normalisation, because that’s a more complex topic)

The part I’m not sure about is how to proceed. Do I make a PR to packaging.python.org that mostly copies PEP 427, but rewrites the file name normalisation part?

I’ve just made a PR to copy the wheel spec with no changes as a first step.

Yes, that’s the way we should go. Also, a PR to PEP 427 noting that the canonical reference for the wheel format is now at packaging.python.org, with a link to the new page.

I don’t think we should do that. Replacing - with _ isn’t reversible, and IMO it’s essential that we can rely on recovering the correct value from the wheel filename. In my view, we should:

  1. Give the rules for normalising names. I’m neutral on whether we should link to other rules, copy the existing rule, or change it. All that matters to me is that normalising a name will ensure it doesn’t contain -. At some point we will need to agree on a canonical “normalised name” form, if only for everyone’s sanity, but for now I’m OK with being expedient here.
  2. State that versions must be in PEP 440 normalised form (with a link), and note that this format will never contain -.
  3. Say that it is invalid for any of the wheel components to contain a -, and tools must refuse to create wheels where a component contains a - character.

If anyone knows of a case where any component other than name or version is allowed by an existing standard to contain a dash, please speak up. But I don’t think there are any.
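A small sketch of the two points above: blanket dash-to-underscore replacement loses information, whereas refusing dashes outright keeps the filename unambiguous. Both function names are hypothetical, and this only illustrates the proposal rather than how any existing tool behaves.

```python
def naive_escape(component: str) -> str:
    """Blanket replacement: two different inputs can collapse to one output."""
    return component.replace("-", "_")


collision = naive_escape("my-pkg") == naive_escape("my_pkg")
print(collision)  # True: the original value cannot be recovered


def build_wheel_filename(name, version, python_tag, abi_tag, platform_tag, build=None):
    """Join already-normalised components, refusing any that contain a dash."""
    components = [name, version]
    if build is not None:
        components.append(build)
    components += [python_tag, abi_tag, platform_tag]
    for component in components:
        if "-" in component:
            raise ValueError(f"component {component!r} must not contain '-'")
    return "-".join(components) + ".whl"


print(build_wheel_filename("example_pkg", "1.0+x", "py3", "none", "any"))
# example_pkg-1.0+x-py3-none-any.whl
```

Callers are expected to normalise the name and version first; the builder only enforces the dash rule.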

Can we already agree that the spec will be changed so it unambiguously requires a wheel with the version 1.0+x to have that exact string in its file name?

Then flit can be changed now in order to make pip play nice with it again (and fix our CI)

This may be tangential, but I wonder if we should have rules on how platforms should name themselves for a wheel. Currently the implementation is doing basically sysconfig.get_host_platform().replace("-", "_") because, well, dashes aren’t allowed, so let’s replace all of them with underscores. But there’s nothing stopping a platform from returning something that’d be ambiguous after the dash-underscore replacement.

That said, this is probably not a practical worry, since PyPA essentially controls what platforms are valid (by maintaining packaging.tags). But maybe this would be worthy of an Informational PyPA Specification.
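To illustrate the ambiguity, here is a tiny sketch using fixed strings instead of the real platform query, so the result does not depend on the machine it runs on. The platform names "foo-bar_baz" and "foo_bar-baz" are invented purely for the example.

```python
def platform_to_tag(platform: str) -> str:
    """Dash-to-underscore replacement as described above."""
    return platform.replace("-", "_")


print(platform_to_tag("linux-x86_64"))  # linux_x86_64

# Two distinct (hypothetical) platform strings that produce the same tag.
tag_a = platform_to_tag("foo-bar_baz")
tag_b = platform_to_tag("foo_bar-baz")
print(tag_a == tag_b)  # True
```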

Yes, I think everyone’s already on board with this from the beginning of the conversation (even predating this issue being surfaced in Flit). Please go ahead with the change! I think Flit can provide best forward-compatibility by normalising the version part (by either packaging.utils.canonicalize_version(s) or str(packaging.version.Version(s))), and using it verbatim in the wheel’s file name.

I’ve had a go at updating the spec in its new location:

I just realised that there wasn’t much discussion here of what to do with distribution names, even though we ended up changing the rules there. @uranusjr was in favour of adopting the name normalisation from PEP 503, @pf_moore said “We should clean this up, but it’s not going to happen in this change”… and then the rule based on PEP 503 seems to have slipped in, while we were all focusing on the version part and where the spec lives. :confused:

This has come up because Flit has now implemented the updated spec - where . is translated to _ in the distribution name part of the .whl filename - and clashed with the rules implemented in Warehouse, which don’t allow that.

On the one hand, I think the new version is a clearly better spec in isolation. It gives us one rule for wheel filenames, sdist filenames and .dist-info directory names, and it means you can construct a wheel filename from a (PEP 503) normalised project name or an un-normalised project name and get the same result. On the other hand, it’s not what setuptools/wheel & warehouse currently do.

The change to the spec was merged on the 28th of February, so it’s been there for 9 months. Do we press forward and adapt the software to the spec? Or do we revert that part of the change as a mistake?

For extra complexity, the original spec is kind of ambiguous: the words say you replace “non-alphanumeric characters” with an underscore, which IMO would mean replacing . (".".isalnum() is False). But the code sample that accompanies it specifically exempts . from the regex-based replacement. I imagine we take the code sample as the definitive rule, so this is probably academic.
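The gap between the wording and the code sample can be seen side by side in a short sketch. escape_by_wording is my own literal reading of "replace runs of non-alphanumeric characters" (an assumption about what the words mean), while escape_by_code_sample uses the PEP 427 expression with the flags passed by keyword.

```python
import re


def escape_by_wording(component: str) -> str:
    """Literal reading: each run of characters failing str.isalnum() becomes "_"."""
    out = []
    for ch in component:
        if ch.isalnum():
            out.append(ch)
        elif not out or out[-1] != "_":
            out.append("_")
    return "".join(out)


def escape_by_code_sample(component: str) -> str:
    """The accompanying regex, which exempts "." from replacement."""
    return re.sub(r"[^\w\d.]+", "_", component, flags=re.UNICODE)


print(".".isalnum())                            # False
print(escape_by_wording("zope.interface"))      # zope_interface
print(escape_by_code_sample("zope.interface"))  # zope.interface
```

The two readings agree on names like my-pkg and differ only where a dot is involved.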

And another overlooked detail: case normalisation. The current wheel spec says:

In distribution names, any run of -_. characters (HYPHEN-MINUS, LOW LINE and FULL STOP) should be replaced with _ (LOW LINE). This is equivalent to PEP 503 normalisation followed by replacing - with _ .

But it’s not equivalent, because PEP 503 normalisation also includes lower-casing the name, which we missed here.
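The difference shows up as soon as the name has an upper-case letter. A minimal sketch, with pep503_normalize following the expression in PEP 503 and wheel_spec_escape being a hypothetical name for the rule as quoted above:

```python
import re


def pep503_normalize(name: str) -> str:
    """PEP 503: collapse runs of -, _ and . to a single "-", then lower-case."""
    return re.sub(r"[-_.]+", "-", name).lower()


def wheel_spec_escape(name: str) -> str:
    """The quoted wheel rule: collapse runs of -, _ and . to a single "_"."""
    return re.sub(r"[-_.]+", "_", name)


name = "Foo.Bar-baz"
via_pep503 = pep503_normalize(name).replace("-", "_")
via_wheel_rule = wheel_spec_escape(name)

print(via_pep503)      # foo_bar_baz
print(via_wheel_rule)  # Foo_Bar_baz
```

The two only coincide once the wheel-rule result is also lower-cased.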

Honestly, I’d rather we just fixed warehouse and wheel.

It’s a nuisance this happened - and it probably demonstrates that we really shouldn’t accept any changes to a standard without a proper PEP, and a better process to ensure that all affected tools will implement the agreed changes. But where we are now, we’ve basically just got a revised version of the PEP that’s not yet implemented everywhere, which is hardly unusual.

Am I right that the blocker here is warehouse, which is blocking upload of wheels that are now valid under the new rules? If so, has an issue been raised against warehouse to track that, and do we have any idea how big of a problem it will be to make the fix?

There’s an issue on Warehouse and even a pull request from @uranusjr.

It appears to be easy to allow the now valid names, with . replaced with _. But we presumably can’t immediately enforce the new normalisation rule in Warehouse, because setuptools/wheel would have to be changed first to make filenames this way. So we might need some extra complexity to prevent uploading two wheels for the same distribution & version with the same tag.
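One way to picture that extra check, purely as a sketch and not a description of Warehouse's actual code: derive a comparison key from each filename with the name part normalised, and reject an upload whose key is already taken. The helper upload_key and the filenames are hypothetical, and the version is compared as-is here, without PEP 440 normalisation.

```python
import re


def upload_key(filename: str) -> tuple:
    """Key that is identical for filenames differing only in name spelling."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    name, version, *tags = filename[: -len(".whl")].split("-")
    normalized_name = re.sub(r"[-_.]+", "-", name).lower()
    return (normalized_name, version, tuple(tags))


old_style = upload_key("zope.interface-5.0-py3-none-any.whl")
new_style = upload_key("zope_interface-5.0-py3-none-any.whl")
print(old_style == new_style)  # True: these two would count as the same file
```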