Can we make extra validation/normalization the same as package name validation/normalization? It looks like it’s pretty similar, at least. That would be useful for simplicity, and also to keep our options open for reifying extras as part of the package name in the future.
For marker evaluation: simplest might be to declare that extra can only appear on the left-hand side of a == or !=, and that this then uses normalizing comparison rules?
If we do that then I don’t really see the point of also mandating that tools produce a specific string form.
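The normalizing comparison floated above could be sketched roughly as follows. This is only an illustration: `normalize_extra` and `extra_marker_matches` are hypothetical names, and the normalization rule used here is the PEP 503 one, which is just one of the candidates under discussion.

```python
import re


def normalize_extra(name: str) -> str:
    """Hypothetical helper: PEP 503-style normalization applied to an extra name."""
    return re.sub(r"[-_.]+", "-", name).lower()


def extra_marker_matches(op: str, marker_value: str, requested_extra: str) -> bool:
    """Evaluate `extra <op> "<marker_value>"` using a normalizing comparison.

    Only == and != are accepted, mirroring the restriction suggested above.
    """
    if op not in ("==", "!="):
        raise ValueError(f"unsupported operator for extra: {op!r}")
    equal = normalize_extra(requested_extra) == normalize_extra(marker_value)
    return equal if op == "==" else not equal


# The marker `extra == 'a-b'` evaluated for a user who requested [A_B]
print(extra_marker_matches("==", "a-b", "A_B"))
```

With a rule like this, the string form a tool writes out matters less, since both sides are re-normalized at comparison time.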
It’s theoretically possible, but I’m not sure it’s a good idea to declare all existing package managers broken in pursuit of theoretical purity. You’d get no objection from me if you write that into a PEP, but I’m not going to try writing that PEP myself and defending the decision against user complaints.
I don’t think the replacement character matters, since users are always supposed to re-normalize before doing any comparisons. So if pip or whatever wants to prefer one replacement character or another internally, it doesn’t affect anything.
I guess the difference in £ is that safe_extra treats it as punctuation, a£b == a-b, while PEP 503 says that it’s illegal? And same for every other character that’s not ASCII alphanumerics, -, . or _?
The safe_extra approach doesn’t seem very useful – "sure, you can write your extra name in greek or cyrillic, but all extras written in those alphabets will be interpreted as if you had written a single "_"". And making those characters illegal probably wouldn’t be too disruptive – I doubt many people are using them? But idk if it’s like, “literally no-one” or “1 package” or “100 packages”, so maybe it would be disruptive enough to not be worth it, not sure.
The replacement character itself does not matter, but the problem is safe_extra does not normalise - (and .). Under safe_extra rules, a£b and a-b are not equivalent—a£b normalises to a_b, and a-b is a normalised form.
This is a problem for a package currently using - in the extra name. For example, if a package declares an extra a-b and dependency foo ; extra == 'a-b', pip install package[a_b] currently does not install foo, while a PEP-503-based rule will. This is more problematic if the package has both a-b and a_b declared as extras, although this is so fundamentally user-hostile I’d hope no package would do it. The same applies to .; foo ; extra == 'a.b' can currently be selected with [a.b], but PEP 503 would allow [a-b].
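To make the mismatch concrete, here is a small sketch comparing the two rules. Both functions are local reimplementations written for illustration (the `safe_extra` regex reflects my understanding of what `pkg_resources.safe_extra()` does; `pep503_normalize` follows the normalization given in PEP 503), so treat them as assumptions rather than the tools' actual code paths.

```python
import re


def safe_extra(extra: str) -> str:
    # Assumed equivalent of pkg_resources.safe_extra(): "-" and "." count as
    # safe characters and are left alone; everything else becomes "_".
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


def pep503_normalize(name: str) -> str:
    # PEP 503: runs of "-", "_" and "." collapse to a single "-".
    return re.sub(r"[-_.]+", "-", name).lower()


for raw in ("a-b", "a_b", "a.b"):
    print(f"{raw!r}: safe_extra={safe_extra(raw)!r}, pep503={pep503_normalize(raw)!r}")
```

Under `safe_extra` the three spellings stay distinct, so `[a_b]` does not select `extra == 'a-b'`; under the PEP 503 rule all three collapse to `a-b`.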
BTW, there is an additional minor difference between safe_extra and PEP 503 normalisation. PEP 503 specifies any running non-alphanumeric sequences be normalised into a single dash, which means a--b becomes a-b, but safe_extra only replaces running non-safe sequences, so a__b normalises to a__b. But again, anyone relying on this is probably too user-hostile to be a meaningful consideration.
It seems the next step here is to write a short PEP specifying what @uranusjr mentioned. @uranusjr or @hroncok , is this something you’re actively working on or plan to in the near future? If not, maybe I can help.
Just to confirm, what is the exact normalization procedure currently proposed? @uranusjr , you mentioned that the logic should follow pkg_resources.safe_extra(), but then later highlighted a couple pathological user-hostile corner cases. Just to be clear, is
re.sub('[^A-Za-z0-9]+', '_', extra).lower()
the desired process, to avoid these issues while still preserving full meaningful backward compat?
Actually, as confirmed by my testing, in safe_extra(), any runs of the normalization character (_) are normalized to _, but runs of - and . are not normalized. So a__b does normalize to a_b, but a--b and a..b remain as-is. The above procedure handles this case more sensibly, as well as the other ones you mention.
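For reference, a short sketch of the behaviour described here. `safe_extra` below is a local reimplementation based on my understanding of the `pkg_resources` function, and `proposed_normalize` is a hypothetical name wrapping the one-liner quoted above; neither is taken from an existing library API.

```python
import re


def safe_extra(extra: str) -> str:
    # Assumed equivalent of pkg_resources.safe_extra()
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


def proposed_normalize(extra: str) -> str:
    # The one-liner proposed above: any run of non-alphanumerics becomes "_"
    return re.sub(r"[^A-Za-z0-9]+", "_", extra).lower()


for raw in ("a__b", "a--b", "a..b"):
    print(f"{raw!r}: safe_extra={safe_extra(raw)!r}, proposed={proposed_normalize(raw)!r}")
```

The only inputs where the two differ are those containing `-` or `.`, which `safe_extra` leaves untouched.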
In terms of spec implementation, it seems the PEP should mention the need to both revise the PEP 508 language on the topic, and update/correct the text in the Provides-Extra field of the Core Metadata spec. The former is not currently hosted on the PyPA specifications site; perhaps the PEP could take the opportunity to formally declare such? The latter is, and so can be updated there; given this tweak is just to match existing established practice and doesn’t add, remove or substantially change the semantics of a metadata field, I’d think it doesn’t need a new core metadata version? @pf_moore, any insight on either of these?
Finally, regarding implementation in packaging tools, @uranusjr is your intent that this be implemented in packaging (e.g. packaging.utils.canonicalize_extra), and then pip can call that on both sides of the comparison when getting the extra, and setuptools and other backends can call it when writing Provides-Extra?
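A rough sketch of that flow, to make the question concrete. Everything here is hypothetical: `canonicalize_extra` is not an existing `packaging.utils` function, the regex is simply the one-liner discussed earlier in the thread, and `provides_extra_lines` / `select_extras` are made-up stand-ins for what a backend and an installer might do.

```python
import re


def canonicalize_extra(extra: str) -> str:
    """Hypothetical shared helper (not an existing packaging.utils function)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", extra).lower()


def provides_extra_lines(declared_extras):
    """Backend side: write normalized Provides-Extra metadata lines."""
    return [f"Provides-Extra: {canonicalize_extra(e)}" for e in declared_extras]


def select_extras(requested, provided):
    """Installer side: normalize BOTH sides before comparing."""
    provided_canon = {canonicalize_extra(e) for e in provided}
    return {canonicalize_extra(e) for e in requested} & provided_canon


print(provides_extra_lines(["Array-Types"]))
print(select_extras(["array_types"], ["Array-Types"]))
```

The point of the sketch is only that a single function is applied on every side, whatever the final rule turns out to be.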
It looks like what’s happening here in the former case is that array_types is getting normalized to array-types per the rules for distribution names in PEP 503, just like the name part of the PEP 508 requirements specifiers in that context. However, the actual extras names themselves it is checking against are normalized per the rules implemented by safe_extra().
Unless I’m missing something, the fact that the normalization is not internally consistent on each side of the comparison seems like a bug, regardless of what the final normalization rule should be. @uranusjr, should this be addressed as such, or do you still prefer awaiting the outcome of this PEP as to what the normalization should be?
Moving discussion to “What extras names are treated as equal and why?” since this is ultimately a bug in the specification and pip can only follow what the rules allow it to. pip would receive a fix automatically once this gets resolved at the PEP level and implemented in packaging.
Right, which is why I asked @uranusjr , in light of the internal inconsistency of using two different normalization schemes for the two sides of the comparison (beyond just the choice of normalization itself), whether pip should still wait for this to be resolved before at least being consistent with normalization (or not) on each side of the comparison.
Great, thanks for the update Miro! Glad to hear it! Let me know if I can help copyediting, proofreading or providing other feedback.
In case it is helpful, my above post tries to summarize the normalization procedure generally agreed upon, what specs would need to be updated, and some general ideas for implementation, based on re-reading the full discussion here as well as general background, alongside the remaining relevant questions that appear to still be unmentioned, unresolved or potentially ambiguous given what’s already been said.
Thanks again, and looking forward to seeing the PEP!
To note, this previous thread has some substantial discussion about what extras names are allowed.
In particular, per PEP 508 and confirmed by my testing, at least in requirements specifiers (not necessarily metadata), the following restrictions on extras names are enforced by packaging and the tools that rely on it for parsing (including pip); violating them results in a fast failure and an error:
No characters outside of [A-Za-z0-9._-] are allowed anywhere in specified extra names (regardless of unicode character class)
Otherwise valid punctuation (., - and _) is not allowed as the leading or trailing character
Case-folding is not specified, but neither is it explicitly contradicted by the spec, and it is implemented in practice in pip (AFAIK likewise at the packaging level, though I haven’t explicitly checked the code to verify)
As such, while backends may allow extras names not matching this spec to be stored in core metadata (though the nominal spec restricts this fairly similarly, to “valid Python identifiers” as of Python 2), it is not possible for anyone to actually specify them as requirements, so if any package has been using them, such extras have been uninstallable and unusable as-is anyway (both following the spec and in practice).
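The restrictions listed above can be approximated with a single regular expression. This is a sketch based on my reading of the PEP 508 identifier grammar, not the actual parser used by `packaging`; `is_valid_extra_name` is a hypothetical helper name.

```python
import re

# Starts and ends with an ASCII letter or digit; ".", "-" and "_" are only
# allowed in between. A single alphanumeric character is also valid.
_IDENTIFIER = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_valid_extra_name(name: str) -> bool:
    """Hypothetical check: does `name` fit the PEP 508 identifier shape?"""
    return _IDENTIFIER.fullmatch(name) is not None


for candidate in ("a-b", "a.b_c", "-a", "a-", "a£b"):
    print(f"{candidate!r}: {is_valid_extra_name(candidate)}")
```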
Thus, the rules for allowable distribution package names in requirements specifiers (both per PEP 508 and in practice) are identical to those for extras, since both are built on the base identifier specification. Therefore, the existing PEP 503 canonicalization logic can safely be applied (preferably using _ as the replacement char instead of -, for consistency with previous implementations and with Metadata 2.1), as @hroncok proposed originally, as any non-conforming extras names have been unspecifiable and uninstallable anyway. Therefore, there is only one delta to the existing safe_extra() that @uranusjr mentioned and I later clarified, i.e. runs of one or more - and . are not normalized to _, which as @uranusjr stated
So @hroncok , this would support your proposal that started this off,
Aside from the tweak of normalizing to _ instead of -, to conform to the discussion here, safe_extra() and Metadata 2.1. This would be particularly easy to implement in packaging, since it is only a one-character difference from canonicalize_name (or could even be used as-is, if we accept a change of _ to - as the normalization character).
I am rather swamped with things. @CAM-Gerlach If you are available to draft a PEP, I would not like to block you. It will certainly be easier for me to dedicate time to review it rather than write it from scratch.
Hey, thanks for the heads up. I have a bit of a backlog myself at the moment (then again, I always do), but since I’ve already read over the relevant threads and have a decent idea of what should go there, I can try to draft something over the next week or so (in my fork, and I can open a PR to main for your reviewing convenience when ready). Unless you prefer otherwise, I’ll list you as a co-author and can sponsor it myself. @pf_moore should I list you as PEP-Delegate, ask someone else or just leave it blank on the draft for now? Thanks!
I’d been meaning to, but my in-progress work updating PEP 639 with the various latest agreed changes, as well as regular duties as a PEP editor and with other projects, plus IRL stuff, has taken priority so far. If you are available, basically, the desired specification based on the discussion (assuming I’m remembering correctly) is codifying normalizing extras names via the equivalent of
re.sub(r'\W+', '_', raw_extra_name.lower())
for the various reasons discussed in previous comments on the thread. This ensures nearly full backward compat with extras as previously used (outside of a few contrived, deliberately user-hostile pathological corner cases) while normalizing the cases that users are likely to run into. It is consistent with PEP 503 normalization, except that it uses _ instead of - as the normalization character (consistent with wheel names and the current extras name spec) and that non-\w special characters are normalized too (per the extras name spec and current practice).
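For anyone wanting to check how that one-liner behaves on the inputs discussed in this thread, here is a quick sketch. `normalize_with_W` is a hypothetical wrapper name. One thing to keep in mind when reading the output: in Python's `re`, `\w` includes `_` itself (and, for `str` patterns, non-ASCII letters), so `\W+` does not touch underscores that are already present.

```python
import re


def normalize_with_W(raw_extra_name: str) -> str:
    # The expression quoted above, wrapped in a function for testing
    return re.sub(r"\W+", "_", raw_extra_name.lower())


for raw in ("a-b", "A.B", "a--b", "a__b", "a£b"):
    print(f"{raw!r} -> {normalize_with_W(raw)!r}")
```

Running this is a cheap way to confirm whether runs of `_` collapse the way the final PEP text intends, compared with the `[^A-Za-z0-9]+` variant mentioned earlier in the thread.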