The replacement character itself does not matter, but the problem is safe_extra does not normalise - (and .). Under safe_extra rules, a£b and a-b are not equivalent—a£b normalises to a_b, and a-b is a normalised form.
This is a problem for a package currently using - in the extra name. For example, if a package declares an extra a-b and dependency foo ; extra == 'a-b', pip install package[a_b] currently does not install foo, while a PEP-503-based rule will. This is more problematic if the package has both a-b and a_b declared as extras, although this is so fundamentally user-hostile I’d hope no package would do it. The same applies to .; foo ; extra == 'a.b' can currently be selected with [a.b], but PEP 503 would allow [a-b].
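A minimal sketch of the difference, for illustration only. Both functions are stdlib re-implementations: `safe_extra` is assumed to mirror the regex in `pkg_resources.safe_extra()`, and `pep503_normalize` follows the normalisation rule from PEP 503.

```python
import re


def safe_extra(extra):
    # Assumed to mirror pkg_resources.safe_extra(): "." and "-" count as safe
    # characters and are left alone; everything else non-alphanumeric becomes "_".
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


def pep503_normalize(name):
    # PEP 503 rule: runs of "-", "_" and "." collapse into a single "-".
    return re.sub(r"[-_.]+", "-", name).lower()


for name in ("a-b", "a_b", "a.b"):
    print(f"{name!r}: safe_extra={safe_extra(name)!r}, pep503={pep503_normalize(name)!r}")
```

Under `safe_extra` the three spellings stay distinct, while the PEP 503 rule maps all of them to the same name.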
BTW, there is an additional minor difference between safe_extra and PEP 503 normalisation. PEP 503 specifies any running non-alphanumeric sequences be normalised into a single dash, which means a--b becomes a-b, but safe_extra only replaces running non-safe sequences, so a__b normalises to a__b. But again, anyone relying on this is probably too user-hostile to be a meaningful consideration.
It seems the next step here is to write a short PEP specifying what @uranusjr mentioned. @uranusjr or @hroncok, is this something you’re actively working on or plan to in the near future? If not, maybe I can help.
Just to confirm, what is the exact normalization procedure currently proposed? @uranusjr, you mentioned that the logic should follow pkg_resources.safe_extra(), but then later highlighted a couple of pathological user-hostile corner cases. Just to be clear, is
re.sub('[^A-Za-z0-9]+', '_', extra).lower()
the desired process, to avoid these issues while still preserving full meaningful backward compat?
Actually, as confirmed by my testing, in safe_extra(), any runs of the normalization character (_) are normalized to _, but runs of - and . are not normalized. So a__b does normalize to a_b, but a--b and a..b remain as-is. The above procedure handles this case more sensibly, as well as the other ones you mention.
In terms of spec implementation, it seems the PEP should mention the need to both revise the PEP 508 language on the topic, and update/correct the text in the Provides-Extra field of the Core Metadata spec. The former is not currently hosted on the PyPA specifications site; perhaps the PEP could take the opportunity to formally declare such? The latter is, and so can be updated there; given this tweak is just to match existing established practice and doesn’t add, remove or substantially change the semantics of a metadata field, I’d think it doesn’t need a new core metadata version? @pf_moore, any insight on either of these?
Finally, regarding implementation in packaging tools, @uranusjr, is your intent that this be implemented in packaging (e.g. packaging.utils.canonicalize_extra), and then pip can call that on both sides of the comparison when getting the extra, and setuptools and other backends can call it when writing Provides-Extra?
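For reference, a small sketch comparing the two rules on runs of punctuation. `safe_extra` here is a stdlib re-implementation assumed to match the regex in `pkg_resources.safe_extra()`, and `proposed_normalize` is just a name I made up for the one-liner quoted above.

```python
import re


def safe_extra(extra):
    # Assumed to mirror pkg_resources.safe_extra(): "." and "-" are kept as-is.
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


def proposed_normalize(extra):
    # The procedure quoted above: every run of non-alphanumerics becomes one "_".
    return re.sub(r"[^A-Za-z0-9]+", "_", extra).lower()


for name in ("a__b", "a--b", "a..b", "A-.b"):
    print(f"{name!r}: safe_extra={safe_extra(name)!r}, proposed={proposed_normalize(name)!r}")
```

With the proposed rule all four inputs end up as the same name, whereas `safe_extra` only collapses the run of underscores.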
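To make the question concrete, here is a rough sketch of what "call it on both sides of the comparison" could look like. Everything in it is hypothetical: `canonicalize_extra` is only the name floated above (implemented here with the stdlib, not imported from packaging), and `select_extras` is an illustrative helper, not how pip actually resolves extras.

```python
import re


def canonicalize_extra(extra: str) -> str:
    # Hypothetical helper using the rule discussed in this thread.
    return re.sub(r"[^A-Za-z0-9]+", "_", extra).lower()


def select_extras(requested, provided):
    """Return the declared extras (as written in metadata) matched by the request."""
    # Canonicalize the declared side once...
    by_canonical = {canonicalize_extra(name): name for name in provided}
    # ...and the requested side with the very same function.
    matched = []
    for name in requested:
        key = canonicalize_extra(name)
        if key in by_canonical:
            matched.append(by_canonical[key])
    return matched


print(select_extras(["A_B"], ["a-b", "docs"]))
```

The point is only that the same function is applied to the requested and the declared names, so whichever spelling the user types, the comparison is like for like.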
It looks like what’s happening here in the former case is that array_types is getting normalized to array-types per the rules for distribution names in PEP 503, just like the name part of the PEP 508 requirements specifiers in that context. However, the actual extras names themselves it is checking against are normalized per the rules implemented by safe_extra().
Unless I’m missing something, the fact that the normalization is not internally consistent on each side of the comparison seems like a bug, regardless of what the final normalization rule should be. @uranusjr, should this be addressed as such, or do you still prefer awaiting the outcome of this PEP as to what the normalization should be?
Moving discussion to "What extras names are treated as equal and why?" since this is ultimately a bug in the specification and pip can only follow what the rules allow it to. pip would receive a fix automatically once this gets resolved at the PEP level and implemented in packaging.
Right, which is why I asked @uranusjr, in light of the internal inconsistency of using two different normalization schemes for the two sides of the comparison (beyond just the choice of normalization itself), whether pip should still wait for this to be resolved before at least being consistent with normalization (or not) on each side of the comparison.
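A sketch of the mismatch as I understand it, using stdlib re-implementations of the two rules (assumed to match PEP 503 and `pkg_resources.safe_extra()` respectively). It only models the comparison and is not pip's actual code path.

```python
import re


def pep503_normalize(name):
    # PEP 503 rule for distribution names.
    return re.sub(r"[-_.]+", "-", name).lower()


def safe_extra(extra):
    # Assumed to mirror pkg_resources.safe_extra().
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


requested = "array_types"  # as typed in package[array_types]
declared = "array_types"   # as declared by the package

# Different rules on each side: identical spellings no longer compare equal.
print(pep503_normalize(requested), "vs", safe_extra(declared))

# Same rule on both sides: the comparison is at least self-consistent.
print(safe_extra(requested) == safe_extra(declared))
```

Whichever rule ends up being chosen, applying it to both sides avoids the case where an extra spelled exactly as declared fails to match.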
Great, thanks for the update Miro! Glad to hear it! Let me know if I can help copyediting, proofreading or providing other feedback.
In case it is helpful, my above post tries to summarize the normalization procedure generally agreed upon, what specs would need to be updated, and some general ideas for implementation, based on re-reading the full discussion here as well as general background, alongside the remaining relevant questions that appear to still be unmentioned, unresolved or potentially ambiguous given what’s already been said.
Thanks again, and looking forward to seeing the PEP!
To note, this previous thread has some substantial discussion about what extras names are allowed.
In particular, per PEP 508 and confirmed by my testing, at least in requirements specifiers (not necessarily metadata), extras names that break the following rules are rejected by packaging and the tools that rely on it for parsing (pip, etc.), resulting in a fast failure and an error:
No characters outside of [A-Za-z0-9._-] are allowed anywhere in specified extra names (regardless of unicode character class)
Otherwise valid punctuation (., - and _) is not allowed as the leading or trailing character
Case-folding is not specified, but neither is it explicitly contradicted by the spec, and it is implemented in practice in pip (AFAIK likewise at the packaging level, though I haven’t explicitly checked the code to verify)
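The first two rules above can be sketched as a single regex check. This is a stdlib approximation of the PEP 508 identifier grammar written for illustration; `is_valid_extra` is a hypothetical helper name, not something packaging exposes.

```python
import re

# Must start and end with an ASCII letter or digit; ".", "-" and "_" are
# only allowed in between.
_EXTRA_NAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_valid_extra(name: str) -> bool:
    return _EXTRA_NAME.fullmatch(name) is not None


for name in ("a-b", "a__b", "a£b", "_ab", "ab.", "é"):
    print(f"{name!r}: {is_valid_extra(name)}")
```

Note that runs of punctuation in the middle of a name (such as a__b or a--b) pass this check, which is why the normalization of runs still matters.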
As such, while backends may allow extras names not matching this spec to be stored in core metadata (though the nominal spec restricts this fairly similarly, to “valid Python identifiers” as of Python 2), it is not possible for anyone to actually specify them as requirements, so if any package has been using them, such extras have been uninstallable and unusable as-is anyway (both following the spec and in practice).
Thus, the rules for allowable distribution package names in requirements specifiers (both per PEP 508 and in practice) are identical to those for extras, since both are built on the base identifier specification. Therefore, the existing PEP 503 canonicalization logic can safely be applied (preferably using _ as the replacement char instead of -, for consistency with previous implementations and with Metadata 2.1), as @hroncok proposed originally, as any non-conforming extras names have been unspecifiable and uninstallable anyway. Therefore, there is only one delta to the existing safe_extra() that @uranusjr mentioned and I later clarified, i.e. runs of one or more - and . are not normalized to _, which as @uranusjr stated
So @hroncok , this would support your proposal that started this off,
Aside from the tweak of normalizing to _ instead of -, to conform to the discussion here, safe_extra() and Metadata 2.1. This would be particularly easy to implement in packaging, since it is only a one-character difference from canonicalize_name (or could even be used as-is, if we accept a change of _ to - as the normalization character).
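To show how small the difference would be, here is a sketch with both functions written against the stdlib. `canonicalize_name_pep503` re-implements the PEP 503 rule rather than importing packaging's `canonicalize_name`, and `canonicalize_extra` is the hypothetical variant; neither is claimed to be packaging's actual code.

```python
import re

_RUNS = re.compile(r"[-_.]+")


def canonicalize_name_pep503(name: str) -> str:
    # PEP 503: runs of "-", "_" and "." become a single "-".
    return _RUNS.sub("-", name).lower()


def canonicalize_extra(extra: str) -> str:
    # Hypothetical extras variant: identical except for the replacement character.
    return _RUNS.sub("_", extra).lower()


print(canonicalize_name_pep503("Array_Types"))
print(canonicalize_extra("Array-Types"))
```

Since names containing anything outside letters, digits, ".", "-" and "_" can't be specified in a requirement anyway, restricting the regex to those three punctuation characters should give the same results in practice as replacing every non-alphanumeric run.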