OK, here’s some basic stats.
- I have the metadata from 2,124,900 wheels from PyPI (I don’t have data for packages that don’t ship wheels).
- There are a total of 7338 unique extras across all of those packages. That strikes me as surprisingly low.
- I’ve uploaded the list of all those extras as All extras used in wheels from PyPI · GitHub
The situation is a bit of a mess, though. The
Provides-Extra metadata says “A string containing the name of an optional feature. Must be a valid Python identifier.” However, PEP 508 defines extras via
identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
identifier = < letterOrDigit identifier_end* >
name = identifier
extras_list = identifier:i (wsp* ',' wsp* identifier)*:ids -> [i] + ids
extras = '[' wsp* extras_list?:e wsp* ']' -> e
I really hate that grammar, but if I read it right, it allows extras to be a string of letters, digits,
., starting with a letter or a digit (so “3.6” is a valid extra!)
Of the 7338 extras I identified, 104 don’t conform to PEP 508, and 1258 are not Python identifiers. The discrepancy in numbers is mostly because Python identifiers don’t allow dots or dashes.
I collected lists of all cases where the normalisation algorithm resulted in 2 different extras normalising to the same value. I did this across all extras, not by package, so these do not necessarily imply that normalising would cause clashes within a package (I’d be extremely surprised if that ever happened, but I’d have to do a re-scan of the database to verify that).
I looked at the following 3 algorithms:
- Option 1 -
re.sub('[^A-Za-z0-9.-]+', '_', name).lower()
- Option 2 -
re.sub('[^A-Za-z0-9]+', '_', name).lower()
- PEP 503 -
re.sub(r"[-_.]+", "-", name).lower()
In all cases, I removed any cases where the only reason for a clash was uppercase vs lowercase, on the assumption that we definitely want extras to be matched case insensitively, so we can assume such cases are intended to map to the same canonical form.
- Option 1 - 24 clashes
- Option 2 - 99 clashes
- PEP 503 - 73 clashes
The most common other difference was extras which contained spaces. I feel like we’d definitely want to canonicalise “tensorflow with gpu” and “tensorflow_with_gpu” to the same value. PEP 503 is the odd one out here, as it doesn’t normalise spaces, so “a b” and “a_b” are different under PEP 503 rules. I think that’s probably a strike against using pure PEP 503. However, it’s worth noting that values with spaces are not valid extras according to either the core metadata spec, or to PEP 508.
If we limit the checks to only valid extras according to PEP 508, option 1 generates no clashes other than case sensitivity, Option 2 and PEP 503 only generated
dev-test: dev_test, dev-test, dev.test
dev-lint: dev-lint, dev.lint, dev_lint
apache-beam: apache-beam, apache.beam
(which seem fine, to me). Limiting the result to just valid core metadata (Python identifiers) none of the approaches caused any clashes.
I don’t really know what to make of all this. I think there are probably a number of actions to take:
- Decide if PEP 685 wants to take a stand on how “invalid” extra names get normalised. If it dismisses that possibility, then PEP 503 normalisation probably wins due to being consistent with elsewhere, but all of the stated variations work, insofar as they enable case insensitive comparison that treats “.”, “-” and “_” the same.
- Fix the mess that is the definition of what constitutes a valid extra. We have 2 different specs which are inconsistent, and from a practical standpoint it doesn’t look like tools enforce either standard
Personally, I think that PEP 685 should accept that invalid extras exist, and explicitly note that tools can apply PEP 685 normalisation to such non-standard extras. Part of me wants to say that it should say that tools SHOULD warn if applying normalisation to an invalid extra, but without a clear definition of what’s valid, that seems like it will only cause more confusion As for standarising valid extras, I’d like to fix that, but the only solution that feels to me like it would be straightforward would be to update the metadata spec to state that
Provides-Extra must follow PEP 508 format.
In theory, yes, it probably should. But that would be a fairly significant undertaking, and we’d probably not make progress if we tied this proposal to doing that.
In fact, it might be good to treat moving the various packaging specs out of the existing PEPs and into PyPA specifications — Python Packaging User Guide as a standalone project, which might be something @smm would be able to help co-ordinate. It’s something that really needs people with technical writing skills, rather than coders, which is probably why it’s only getting done in bits, with no real momentum behind it.