PEP 685: Comparison of extra names for optional distribution dependencies

CAM-Gerlach · March 9, 2022, 8:32am

I submitted a pull request with some technical, proofreading and a few copyediting changes to the text of the PEP.

There was one substantive, rather significant issue with the PEP’s content, however, that should be discussed here: the normalization algorithm it specifies does not appear to be the one that represented the final rough consensus on the previous thread. Furthermore, its properties and quirks directly contradict several of the claimed advantages and stated motivations for it elsewhere in the PEP (unlike said algorithm), greatly diminish its practical benefit, and mean that it does not actually solve the original issue that sparked the PEP to begin with, as cited therein (that adhoc-ssl does not compare equal to adhoc_ssl).

The normalization algorithm currently cited in the PEP is:

re.sub('[^A-Za-z0-9.-]+', '_', name).lower()

However, as discussed on the previous issue, the algorithm should instead be:

re.sub('[^A-Za-z0-9]+', '_', name).lower()

(i.e., the previous algorithm, except with . and - also normalized to _).
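To make the difference concrete, here is a minimal sketch that runs both expressions on the names from the original issue. The function names `normalize_pep_draft` and `normalize_proposed` are just labels I made up for illustration; only the two `re.sub` expressions come from the text above.

```python
import re


def normalize_pep_draft(name: str) -> str:
    # Expression currently cited in the PEP: "." and "-" are left untouched.
    return re.sub('[^A-Za-z0-9.-]+', '_', name).lower()


def normalize_proposed(name: str) -> str:
    # Expression discussed on the previous thread: "." and "-" also become "_".
    return re.sub('[^A-Za-z0-9]+', '_', name).lower()


for name in ("adhoc-ssl", "adhoc_ssl"):
    print(f"{name}: draft={normalize_pep_draft(name)} proposed={normalize_proposed(name)}")

# Do the two spellings from the original issue compare equal after normalization?
print("draft equal:   ", normalize_pep_draft("adhoc-ssl") == normalize_pep_draft("adhoc_ssl"))
print("proposed equal:", normalize_proposed("adhoc-ssl") == normalize_proposed("adhoc_ssl"))
```

With the first expression the two spellings stay distinct; with the second they both become adhoc_ssl.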

In real-world practice, the latter is exactly equivalent to PEP 503 normalization except with _ as the replacement character, because per PEP 508 and as actually implemented in packaging tools, no characters outside of [A-Za-z0-9._-] have been allowed anywhere in specified extra names.
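The equivalence claim can be spot-checked by brute force. The sketch below assumes the PEP 503 expression `re.sub(r"[-_.]+", "-", name).lower()` and a small sample alphabet of my own choosing, limited to characters from [A-Za-z0-9._-]; it is an illustration over short strings, not a proof, and the helper names are hypothetical.

```python
import itertools
import re


def normalize_pep503(name: str) -> str:
    # Project-name normalization as given in PEP 503.
    return re.sub(r"[-_.]+", "-", name).lower()


def normalize_proposed(name: str) -> str:
    # Expression proposed for extras on the previous thread.
    return re.sub('[^A-Za-z0-9]+', '_', name).lower()


# Sample alphabet (an assumption for this sketch): a lowercase letter,
# an uppercase letter, a digit, and the three allowed separators.
ALPHABET = "aB1._-"

mismatches = []
for length in range(1, 5):
    for chars in itertools.product(ALPHABET, repeat=length):
        name = "".join(chars)
        # PEP 503 output with "-" swapped for "_" should match the proposal.
        if normalize_pep503(name).replace("-", "_") != normalize_proposed(name):
            mismatches.append(name)

print("mismatches:", mismatches)
```

Over this sample the list of mismatches comes back empty, which is consistent with the two normalizations differing only in the replacement character for names restricted to [A-Za-z0-9._-].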

Using the latter instead of the former means that:

Normalization is actually useful, as the only actual normalization the former algorithm does on currently possible extra names is making test__extra equivalent to test_extra, whereas the latter means that test_extra, test--extra and test.extra will all be normalized to test_extra.
The original issue that sparked the PEP, “the extra adhoc-ssl was not considered equal to the name adhoc_ssl by pip”, is actually solved.
The normalized form will always be a valid Python identifier, as currently required by the Extras spec (whereas the normalization proposed by the PEP, contradicting its claim, has no practical effect on any currently possible Extras name’s validity as a Python identifier, and allows both . and -, which are invalid characters anywhere in an identifier).
The strange, unexpected and confusing behavior with test__extra being normalized to test_extra, but test--extra being left alone, is avoided (by normalizing both to test_extra); to wit, the PEP itself is confused on that point, as it states “Runs of characters, unlike PEP 503, do not get collapsed, e.g. ___ stays the same.” when in fact ___ is collapsed (as I described on the previous thread), while --- is not.
The normalization is consistent between project and extras names, except for the replacement character.
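The points above about runs and identifier validity can be checked with a short script. This is only a sketch: the names `draft` and `proposed` and the sample list are mine, and every sample deliberately starts with a letter.

```python
import re

DRAFT = re.compile(r"[^A-Za-z0-9.-]+")    # expression currently in the PEP
PROPOSED = re.compile(r"[^A-Za-z0-9]+")   # expression from the previous thread


def draft(name: str) -> str:
    return DRAFT.sub("_", name).lower()


def proposed(name: str) -> str:
    return PROPOSED.sub("_", name).lower()


samples = ["test_extra", "test__extra", "test-extra", "test--extra", "test.extra"]

for s in samples:
    d, p = draft(s), proposed(s)
    # str.isidentifier() reports whether the result is a valid Python identifier.
    print(
        f"{s!r:15} draft={d!r:15} identifier={d.isidentifier()!s:5} "
        f"proposed={p!r} identifier={p.isidentifier()}"
    )
```

In this run the PEP's expression collapses test__extra but leaves test--extra, test-extra and test.extra as they are (none of which are identifiers), while the other expression maps all five samples to test_extra.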

As likewise discussed on the previous thread, this has effectively no greater real-world backward compatibility impact than the currently-specified behavior, as the only cases that would be meaningfully affected are very unlikely, fundamentally user-hostile and (based on pip’s behavior) appear to be mostly broken currently anyway:

(to note, given the problem identified by the OP and my later testing, it appears that these extras cannot even currently be selected with pip to begin with) and

which, to note, due to the strangeness of the currently-specified implementation, the above actually has it backwards: a--b is not normalized, but a__b is normalized to a_b.