Amending PEP 427 (and PEP 625) on package normalization rules

PEP-427 - Escaping and Unicode (for wheels) and PEP-625 - Specification (for sdists) both require the distribution name part of the filename to be normalised by replacing all non-alphanumeric characters with the _ character. This means that a package name of a.b-c.d gets transformed to a_b_c_d.

I’d like to propose not applying this escape to the . character. The . character is often used to define namespace packages. E.g. packages pypi.alpha and pypi.beta are both under the pypi organization, and are often namespace packages under the pypi root package. Having the . in the distribution name makes it easy for systems to determine whether a package belongs to a given namespace just by looking at the file names. After normalization this is no longer possible, because the package name a_b_c_d could be either a.b_c.d or a.b.c_d and so on.
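
To make the ambiguity concrete, here is a minimal sketch. The `escape_name` helper is hypothetical and ASCII-only; it is written only to mirror the rule as described above (every run of non-alphanumeric characters becomes `_`) and is not taken from any build backend.

```python
import re


def escape_name(name: str) -> str:
    """Replace each run of non-alphanumeric (ASCII) characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name)


# Several distinct dotted names collapse to the same escaped form,
# so the namespace boundary cannot be recovered from the filename.
candidates = ["a.b-c.d", "a.b_c.d", "a.b.c_d", "a_b.c.d"]
escaped = {name: escape_name(name) for name in candidates}

for original, result in escaped.items():
    print(f"{original!r} -> {result!r}")
```

All four inputs map to the same string, which is why a filename alone cannot say whether the project lives in the `a` namespace.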

The use case where I’ve run into this is setting up role-based upload policies for Artifactory. setuptools does not follow the above recommendation, so one can say that for packages in the a namespace (starting with a. in their distribution name) users 1, 2 and 3 are allowed to upload. Given the above normalization, such policies are no longer configurable, because you can no longer determine the namespace of the package just by looking at the filename. Having to open the package and look inside it makes URL-pattern permissioning impossible.
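
As a rough illustration of the kind of rule this enables, the sketch below matches upload filenames against glob patterns. The `UPLOAD_POLICY` table, the user names, and the `may_upload` helper are all hypothetical; Artifactory’s actual permission configuration looks different and is not modeled here.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: filename glob -> users allowed to upload matching files.
UPLOAD_POLICY = {
    "a.*": {"user1", "user2", "user3"},
}


def may_upload(user: str, filename: str) -> bool:
    """Return True if any pattern matching the filename lists the user."""
    return any(
        fnmatchcase(filename, pattern) and user in users
        for pattern, users in UPLOAD_POLICY.items()
    )


# With the dot preserved, the namespace is visible in the filename.
print(may_upload("user1", "a.b-1.0-py3-none-any.whl"))
# After escaping, "a_b" cannot be told apart from a top-level project
# named "a_b", so the "a.*" pattern no longer applies.
print(may_upload("user1", "a_b-1.0-py3-none-any.whl"))
```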

I ran into this while using hatchling, which follows this recommendation. The fact that setuptools does not follow it tells me it should be safe to make this change, unless someone with more understanding of those PEPs can tell me otherwise.

Just to clarify, take for example the package zope.sqlalchemy. The PyPI URI is Links for zope-sqlalchemy, which I’m happy with. But I’d like zope.sqlalchemy-1.6-py2.py3-none-any.whl to still be valid when you load that page or upload a file with that name to Links for zope-sqlalchemy.
The way PEP-427 is formulated, zope.sqlalchemy-1.6-py2.py3-none-any.whl should be zope_sqlalchemy-1.6-py2.py3-none-any.whl. I think normalizing characters other than . in the distribution name should be kept, but let’s not normalize the . character in the distribution name.

PS. PEP-503 - Normalized Names is what governs the rewrite of the name in https://pypi.org/simple/zope-sqlalchemy, which I think we can keep as is.

cc @dstufft @dustin @pf_moore @jaraco


I chatted with @bernatgabor on Discord, and just to be clear, the desire here is to fix PEP 427, because the text says:

Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _ :

but the code in the PEP says

re.sub("[^\w\d.]+", "_", distribution, re.UNICODE)

Which is how everyone implements it (and implementing it as the text says would turn 1.0 into 1_0 or py2.py3 into py2_py3, so it’s the only logical way for it to work).
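
The difference between the two readings can be sketched as follows. Both helper names are hypothetical, and the second pattern is a simplified form of the PEP’s regex (`\d` is redundant because `\w` already covers digits). Note that the PEP’s snippet passes `re.UNICODE` in the position of `re.sub`’s `count` parameter; the sketch leaves flags out entirely, since `str` patterns are Unicode-aware by default in Python 3.

```python
import re


def escape_per_text(component: str) -> str:
    """Escape as the PEP 427 prose describes: every non-alphanumeric run becomes "_"."""
    # [\W_] matches anything that is not a letter or digit (underscore included).
    return re.sub(r"[\W_]+", "_", component)


def escape_per_code(component: str) -> str:
    """Escape as the PEP 427 example code does: "." is left alone."""
    return re.sub(r"[^\w.]+", "_", component)


for component in ["zope.sqlalchemy", "1.0", "py2.py3", "my-package"]:
    print(component, escape_per_text(component), escape_per_code(component))
```

Under the prose reading the version and the compressed tag set both lose their dots, which is the breakage described above.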

The part about PEP 503 is actually asking for PEP 625 to not reference PEP 503, and just reference PEP 427 escaping rules.


Filed:

What would this do to package name comparison? For example, if there’s a package foo.bar that is listed in some install requirements as foo-bar, would those requirements no longer work? Would it be possible to create both a foo.bar package and a foo-bar package on PyPI?

See Revisiting distribution name normalization where this was discussed previously.


I think for installation/requirements it’s fine to normalize it (aka within METADATA and what not), but in the wheel/sdist filename . should be kept to signal namespace packages without needing to open the zip file and look into some metadata file.

I’m strongly in favor of using PEP 503 everywhere except the display name in METADATA. This reduces ambiguity and thus subtle bugs, and matches current specs:

Tools SHOULD normalize this name, as specified by PEP 503, as soon as it is read for internal consistency.

When comparing extra names, tools MUST normalize the names being compared using the semantics outlined in PEP 503 for names

distribution is the name of the distribution as defined in PEP 345, and normalised according to PEP 503
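
For reference, PEP 503 gives the normalization as a one-line regex substitution. The sketch below wraps it in a function (named `normalize`, as in the PEP) and adds a hypothetical `same_project` helper to show how a requirement written as `foo-bar` would still match a project published as `foo.bar`.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalization: collapse runs of "-", "_", "." to "-" and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def same_project(left: str, right: str) -> bool:
    """Hypothetical helper: compare two project names after normalization."""
    return normalize(left) == normalize(right)


print(normalize("zope.sqlalchemy"))
print(same_project("foo.bar", "foo-bar"))
print(same_project("Foo_Bar", "foo.bar"))
```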


FWIW, there’s a bunch of discussion on this that took place on the PyPA Discord, on the setuptools channel.

As a general summary, it’s probably fine for build-backends to emit names that replace . with _ in the distributions. Basically all packaging tooling will treat them the same. Custom/enterprise tooling does not. Correspondingly, this (. in name of a distribution) is something that’s allowed in the implementations of setuptools and pip today, so it is reasonable to bring all the specs to consistency with the implementation realities.


As for me, I’m firmly in the “don’t care” category for this, at this point. Personally, I don’t like this inconsistency, but it also already exists in the ecosystem (something I didn’t know in the earlier discussion that @jaraco had opened on the same topic) and people depend on it. So, we’re gonna have to deal with it for basically forever at this point unless we reckon this is worth forcing breakage on users. Whether individual backends support having the . is up to them.

There’s certainly value in updating the specs to reflect reality though, which is that we allow this to be uploaded and used as of today.

From Discord, this would logically mean a change to PyPI to no longer treat e.g. foo_bar-1.0.0-py3-none-any.whl and foo.bar-1.0.0-py3-none-any.whl as distinct.

I think this is a bad idea.


I think PyPI should do that regardless FWIW.

So to avoid conflicts we’d be solely relying on the registration of projects using PEP 503? Seems dangerous.

Maybe I’m missing something, but do you mean the opposite?

Sorry, yes.

edit: fixed


I don’t think whether the PEPs allow a . in the name or not matters at all for PyPI, either way we can normalize it for comparison, which would be allowed regardless.

Other than the service Artifactory, are there examples of using the . for special functionality? I’m trying to gauge whether that product’s influence is large enough to warrant us not at least recommending PEP 503, thus enshrining inconsistency in our specs forever.


I think that PyPI should not allow uploads of two files which consumers like pip would treat as identical, because they would normalise the names to the same value.

I’m pretty sure it would cause a significant risk of typosquatting attacks if someone could publish (for example) a file called zope_interface-5.4.0-cp37-cp37m-win_amd64.whl which would then be present on PyPI alongside zope.interface-5.4.0-cp37-cp37m-win_amd64.whl.
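
A sketch of the check that would prevent this, assuming uniqueness is decided on the normalized name rather than on the raw filename. The `normalized_key` and `find_collisions` helpers are hypothetical and are not Warehouse code; the parsing assumes well-formed wheel filenames in which `-` only separates components.

```python
import re
from collections import defaultdict


def normalized_key(wheel_filename: str) -> str:
    """Rebuild the filename with its distribution name normalized per PEP 503."""
    name, rest = wheel_filename.split("-", 1)
    return re.sub(r"[-_.]+", "-", name).lower() + "-" + rest


def find_collisions(filenames: list[str]) -> list[list[str]]:
    """Group filenames that differ only in how the distribution name is spelled."""
    groups: dict[str, list[str]] = defaultdict(list)
    for filename in filenames:
        groups[normalized_key(filename)].append(filename)
    return [group for group in groups.values() if len(group) > 1]


uploads = [
    "zope.interface-5.4.0-cp37-cp37m-win_amd64.whl",
    "zope_interface-5.4.0-cp37-cp37m-win_amd64.whl",
    "zope.sqlalchemy-1.6-py2.py3-none-any.whl",
]
print(find_collisions(uploads))
```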

I may have misunderstood the proposal here, though, as the above seems so obviously bad to me that I’m surprised it would even be considered…


PyPI currently does allow that, because it’s doing simple string comparison for filename uniqueness.

What I’m saying is that we can fix PyPI regardless of what happens in this discussion, so it shouldn’t be used as a justification one way or another.

What this thread is saying is this:

  • PEP 427’s text says that any non-alphanumeric character in a filename segment must be escaped into _.
  • PEP 427’s example code escapes any character in a filename segment that is neither alphanumeric nor . into _ (so . is allowed, even though the text says it’s not).
    • Parts of PEP 427 depend on not escaping the ., like the compressed tags where you can do py2.py3, or version numbers to allow differentiating between 1.0 and 1-0 (which gets escaped to 1_0).
  • PEP 625 requires normalizing the project name in the sdist filename as per the rules in PEP 503, which PEP 427 does not require.

It appears that the text of PEP 427 is just wrong, and should be fixed, because PEP 427 itself would break if we followed the text of PEP 427 instead of the example code.

Assuming we fix that, then PEP 427 and PEP 625 require different things of their filenames:

  • PEP 427 lets you put unnormalized values in the filename, you just effectively have to escape - to _ so that the filename can be parsed.
  • PEP 625 requires normalized values in the filename and also requires escaping - to _ so that the filenames can be parsed.
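
The reason `-` has to be escaped under either scheme is that the filename is parsed by splitting on it. A minimal sketch, with a hypothetical `split_wheel_filename` helper that ignores the optional build tag for brevity:

```python
def split_wheel_filename(filename: str) -> dict[str, str]:
    """Split a wheel filename into its components (build tags are not supported)."""
    stem = filename.removesuffix(".whl")
    parts = stem.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated components, got {len(parts)}")
    keys = ["distribution", "version", "python", "abi", "platform"]
    return dict(zip(keys, parts))


# A "." in the distribution name does not interfere with parsing.
print(split_wheel_filename("foo.bar-1.0-py3-none-any.whl"))

# An unescaped "-" in the name adds a component, so the parse fails.
try:
    split_wheel_filename("foo-bar-1.0-py3-none-any.whl")
except ValueError as error:
    print("rejected:", error)
```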

@bernatgabor wants things to standardize on PEP 427’s rules: don’t require normalization, just escaping.

@ofek wants things to standardize on PEP 625’s rules: don’t allow unnormalized values, and require escaping.

I personally don’t care that much, but practically speaking it is easier to adjust PEP 625 to match PEP 427, since PEP 427 is already widely deployed, than it is to further restrict PEP 427 (which would probably require its own PEP, or at least to be put into PEP 625). I do think it would be good to align PEP 427 and 625 to have the same overall rules, since it would be weird if foo.bar-1.0-py3-none-any.whl was the wheel for foo_bar-1.0.tar.gz.

PyPI is a non factor here, we don’t currently use normalization for comparison of filenames, but we should no matter what is decided, so that’s just an improvement to PyPI.

Thanks for the explanation.

Personally, I have a preference for requiring normalisation (specifically, name and version must be normalised, other components are already strictly defined enough that - is invalid and there is no normalisation needed).

I understand the use case of wanting to assign semantics to the segments of a package name that contains dots (my_org.my_package) but the analogy with namespace packages is IMO false - project names are not the same as (import) package names. If we want to allow namespaces for project names, we should standardise them properly and make them “official”. And in the meantime, an ad hoc approach (something along the lines of “org names must be alphanumeric, and we’ll infer the org name by taking the portion before the first underscore”) should be sufficient for private indexes (where the organisation has enough control over the names to enforce such a rule).
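
Such an ad hoc rule could be enforced with something like the following sketch. The `infer_org` helper is hypothetical and assumes the private index guarantees that organisation names are purely alphanumeric (so an org called `my_org` would have to be spelled `myorg`).

```python
def infer_org(wheel_filename: str) -> str:
    """Return the organisation prefix: the part of the name before the first "_"."""
    distribution = wheel_filename.split("-", 1)[0]
    org, separator, _ = distribution.partition("_")
    if not separator:
        raise ValueError(f"{distribution!r} has no organisation prefix")
    if not org.isalnum():
        raise ValueError(f"organisation name {org!r} must be alphanumeric")
    return org


print(infer_org("myorg_my_package-1.0-py3-none-any.whl"))
```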


A few points to note here:

  • For some time now, the canonical spec has been the PyPA Binary Distribution (Wheel) Format Specification, not PEP 427. Per the Canonical Specification notice at the top of PEP 427 (which we are making more prominent and consistent to aid visibility),

    The canonical version of the wheel format specification is now maintained at Binary distribution format - Python Packaging User Guide. This may contain amendments relative to this PEP.

  • The canonical spec does not contain the apparently contradictory wording/regex in question, and rather specifies that only the distribution name should be normalized by that procedure rather than a blanket regex for each component, following a lengthy discussion cited earlier:

    • In distribution names, any run of -_. characters (HYPHEN-MINUS, LOW LINE and FULL STOP) should be replaced with _ (LOW LINE), and uppercase characters should be replaced with corresponding lowercase ones. This is equivalent to PEP 503 normalisation followed by replacing - with _. Tools consuming wheels must be prepared to accept . (FULL STOP) and uppercase letters, however, as these were allowed by an earlier version of this specification.
    • Version numbers should be normalised according to PEP 440. Normalised version numbers cannot contain -.
    • The remaining components may not contain - characters, so no escaping is necessary.

    Tools producing wheels should verify that the filename components do not contain -, as the resulting file may not be processed correctly if they do.
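
The name rule quoted above can be sketched as below. Both helper names are hypothetical; `wheel_name` follows the quoted description (PEP 503 normalisation with `_` in place of `-`), and `matches_project` shows one way a consuming tool might stay tolerant of the legacy spellings the spec says it must accept.

```python
import re


def wheel_name(project_name: str) -> str:
    """Collapse runs of "-", "_", "." to "_" and lowercase, per the quoted rule."""
    return re.sub(r"[-_.]+", "_", project_name).lower()


def matches_project(filename_name: str, project_name: str) -> bool:
    """Accept legacy spellings (dots, uppercase) by normalizing both sides."""
    return wheel_name(filename_name) == wheel_name(project_name)


print(wheel_name("zope.sqlalchemy"))
print(matches_project("Zope.SQLAlchemy", "zope-sqlalchemy"))
```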


This was clarified in the corresponding section in the canonical spec, which states that only the distribution name component should be normalized by that regex, rather than each component of the filename. The version component is normalized by PEP 440 and the rest of the filename needs no normalization, as the canonical spec now reflects. Also, to note, 1-0 is not a normalized PEP 440 version identifier.

Yes, but the canonical, up to date binary distribution spec does. Presumably, both specs should be consistent.

Right, but it was already fixed in the canonical spec to only normalize the distribution name via that regex, which avoids the issue.

In fact, the proposed sdist spec in PEP 625 and the canonical wheel spec appear to be fully consistent here, as the latter directly references the former spec to define its distribution name normalization, and both specify the version must be normalized by PEP 440. I don’t see where PEP 625 separately requires escaping (which it shouldn’t, as the aforementioned normalization schemes already guarantee - will not be present in either component).

PEP 427 is not, as you point out, internally consistent, so there isn’t a single consistent interpretation of “PEP 427’s rules”, per se. The up to date, canonical wheel spec specifies the same thing as PEP 503 (except underscore) and is directly referenced by PEP 625.

Or rather, @ofek wants to retain the current specifications, as both the canonical wheel spec and the proposed PEP 625 specify identical name and version normalization, PEP 503 + underscore, and don’t require escaping.

Actually, at least specification-wise (if not yet adopted by all tools), the canonical wheel spec and the proposed sdist spec are already fully consistent, and what he is describing is the (specification) status quo. Regressing on that would require a substantive change to both specifications, and need to follow the PyPA specification update process:

If a change being considered this way has the potential to affect software interoperability, then it must be escalated to the Packaging category of the Python.org Discourse for discussion, where it will be either approved as a text-only change, or else directed to the PEP process for specification updates.

So unless there’s consensus here, it would presumably need a PEP.


Eh, I don’t think that’s actually true.

The section you’re pointing to was updated without a PEP, making previously compliant behavior no longer compliant. Nine months later, the author of that change called out that he had updated that section of the PEP without anyone really discussing it, because everyone focused on the version number and where the spec should live.

It was also pointed out that you currently can’t follow the updated spec and release to PyPI, because warehouse will reject your upload (pypi/warehouse#10030), which is still true today.

That was followed up by @pf_moore saying he’d rather we fixed Warehouse than go back onto the PEP, but it doesn’t seem like there was ever consensus for changing the rules to require normalization of distribution names; it just slipped in accidentally and has been in limbo since then.
