Escaping versions for wheel, sdist, and .dist-info names

The discussion in Clarify naming of .dist-info directories has reminded me of a couple of subtle problems with PEP 427 and its handling of versions in wheel filenames. Now that progress is being made on the escaping of versions in sdist and .dist-info directory names, this seems like as good a time as any to bring them up.

Problem number 1: PEP 427 requires that the version component of a wheel filename only contain alphanumeric characters, underscores, and periods, with all other characters converted to underscores; however, this overlooks the fact that, per PEP 440, version strings can also contain exclamation points (to denote version epochs) and plus signs (to denote local version identifiers). Converting these two symbols to underscores causes a loss of information that makes it impossible in certain cases to compare the versions of two wheels just by inspecting their filenames; for example, 1!2, 1+2, and 1-2 all end up escaped as 1_2 (which, incidentally, is not a valid PEP 440 version; see below). The wheel project ran into this problem in issue 268, which led to them greatly loosening their wheel filename regex, and the author of PEP 427 has written that applying the same escaping rules to the version component as to the other filename components is “probably a mistake”, yet there does not appear to have ever been any follow-up on this.
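
To make the collision concrete, here is a minimal sketch of the escaping rule as PEP 427 describes it. The function name `pep427_escape` is my own, not something from wheel or packaging:

```python
import re


def pep427_escape(component: str) -> str:
    """Replace each run of characters other than alphanumerics,
    underscores, and periods with a single underscore (per PEP 427)."""
    return re.sub(r"[^\w\d.]+", "_", component)


# Three distinct version strings that collapse to one escaped form.
versions = ["1!2", "1+2", "1-2"]
escaped = {v: pep427_escape(v) for v in versions}
print(escaped)  # {'1!2': '1_2', '1+2': '1_2', '1-2': '1_2'}
```

Once the filename is written, there is no way to tell which of the three versions it came from.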

I would thus like to request that the relevant standards be amended to allow ! and + in version components of wheel, sdist, and .dist-info names. For the record, a scan yesterday of the 1,396,899 wheels on PyPI found 58 with exclamation points in their versions and 244 with plus signs (the latter presumably uploaded before Warehouse started blocking local versions), in comparison to the 998 with underscores in their version components.

The second problem with version escaping as currently specified is its blanket transformation of all hyphens to underscores. Under PEP 440, hyphens and underscores in version strings are completely interchangeable, with one exception: the post in a post-release specifier can be replaced by a hyphen and only by a hyphen. (Interestingly, this restriction contradicts the statement later in the PEP that “[PEP 440] allows [the underscore’s] use anywhere that - is acceptable.”) So if we start out with a version string of the form 1.0-1 (an alternative spelling of the canonical 1.0.post1), it gets escaped to 1.0_1, which is not a valid PEP 440 version.
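
The asymmetry can be checked with a cut-down pattern. This is only a simplified excerpt modelled on the regex in PEP 440's Appendix B (release segment plus optional post-release, nothing else), and `is_valid` is a name I made up for illustration:

```python
import re

# Simplified excerpt: release segment followed by an optional post-release.
# Epochs, pre-releases, dev releases, and local versions are left out.
_RELEASE_POST = re.compile(
    r"""
    ^[0-9]+(?:\.[0-9]+)*                   # release segment
    (?:
        -[0-9]+                            # implicit post-release: hyphen only
        |
        [-_.]?(?:post|rev|r)[-_.]?[0-9]*   # explicit post-release: any separator
    )?$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_valid(version: str) -> bool:
    return _RELEASE_POST.match(version) is not None


for v in ["1.0.post1", "1.0-post1", "1.0_post1", "1.0-1", "1.0_1"]:
    print(v, is_valid(v))
# 1.0.post1 True
# 1.0-post1 True
# 1.0_post1 True
# 1.0-1 True
# 1.0_1 False
```

Only the implicit form insists on a hyphen, which is exactly the character the escaping rule rewrites.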

Possible ways to handle this are:

  • Amend PEP 440 and packaging to permit underscores in place of post

  • Require versions to be canonicalized before escaping, thereby eliminating all hyphens without affecting PEP 440 validity. (The .dist-info name proposal already requires project names to be canonicalized, but not versions.)

  • Amend the escaping rules for version components to be “Replace all hyphens with underscores, except for those hyphens that indicate an implicit post release, which should instead be replaced with the string .post.”

  • Require versions to be escaped by converting them to an equivalent form modulo canonicalization that does not contain a hyphen, and leave it up to the wheel, sdist, and .dist-info generators to decide exactly how to do that.

  • Document that version strings in file & directory names need to be unescaped before use. Assuming that ! and + are allowed unescaped in version components, this leaves the hyphen as the only character in a valid PEP 440 string that needs escaping, and so unescaping is just s.replace("_", "-").
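
For comparison, here is a rough sketch of what the third and fifth options could look like in code. It assumes ! and + are left unescaped, the pattern is my own condensed rendering of the PEP 440 Appendix B regex, and `escape_version` and `unescape_version` are hypothetical names, not existing APIs:

```python
import re

# Condensed from the structure of the PEP 440 Appendix B regex; only the
# implicit post-release number is captured by name.
_VERSION = re.compile(
    r"""
    ^v?
    (?:[0-9]+!)?                                                # epoch
    [0-9]+(?:\.[0-9]+)*                                         # release
    (?:[-_.]?(?:a|b|c|rc|alpha|beta|pre|preview)[-_.]?[0-9]*)?  # pre-release
    (?P<post>
        -(?P<implicit>[0-9]+)                                   # implicit post
        |
        [-_.]?(?:post|rev|r)[-_.]?[0-9]*                        # explicit post
    )?
    (?:[-_.]?dev[-_.]?[0-9]*)?                                  # dev release
    (?:\+[a-z0-9]+(?:[-_.][a-z0-9]+)*)?                         # local version
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)


def escape_version(version: str) -> str:
    """Third option: spell an implicit post-release as '.postN',
    then replace any remaining hyphens with underscores."""
    m = _VERSION.match(version)
    if m and m.group("implicit") is not None:
        start, end = m.span("post")
        version = f"{version[:start]}.post{m.group('implicit')}{version[end:]}"
    return version.replace("-", "_")


def unescape_version(escaped: str) -> str:
    """Fifth option: undo a plain hyphen-to-underscore escape."""
    return escaped.replace("_", "-")


print(escape_version("1.0-1"))              # 1.0.post1
print(escape_version("1.0-rc-1"))           # 1.0_rc_1
print(escape_version("1!2.0-3+local-tag"))  # 1!2.0.post3+local_tag
print(unescape_version("1.0_1"))            # 1.0-1
```

The third option keeps filenames valid under PEP 440 at the cost of needing a real parse of the version, while the fifth keeps generators trivial and pushes the work onto every consumer.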

Following the same logic as the name field, the most straightforward would be to require version normalisation from now on (with the same “parser should expect unnormalised input” footnote), and replace dashes with underscores (which is really the only thing required).

Following the same logic as the name field

Last time I checked, the logic for normalizing project names is only going to be applied to sdists and .dist-info directories; is PEP 427 going to be amended to require normalization in wheel filenames as well?

replace dashes with underscores

If a version is normalized, it won’t contain any dashes.

I assume so, since it’s not worth changing the .whl extension (which would be a must to create a new filename spec). And the existing rules already don’t work, as you described, so there are likely very few (if any) edge cases.

For wheels, yes. I believe .dist-info names could theoretically contain non-PEP-440 versions though. It also doesn’t really hurt to preemptively cover the possibility of dashes in the future IMO.

pip 20.3b1 just hit this problem: https://github.com/pypa/pip/issues/9083

So the discussion is quite timely :stuck_out_tongue:

Are exclamation marks ! and plus signs + valid in all common filesystems? Also, will package indexes be required to quote these characters in the URL?

I checked, and they are valid on Windows and POSIX-compliant systems. Both require escaping in URLs, but clients are already expected to handle them correctly (implied by the Simple Repository API being HTML-based).
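
As a quick illustration of the URL side, this is what percent-encoding looks like with the standard library. The wheel filename is made up for the example:

```python
from urllib.parse import quote, unquote

# Hypothetical wheel filename with an epoch and a local version left unescaped.
filename = "example-1!2.0+local-py3-none-any.whl"

quoted = quote(filename)
print(quoted)  # example-1%212.0%2Blocal-py3-none-any.whl

# A client that unquotes the link target recovers the original filename.
assert unquote(quoted) == filename
```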