PEP 685: Comparison of extra names for optional distribution dependencies

Thanks for doing this @brettcannon!

My initial thought is that if we’re normalising extra names so similarly to distribution names, why not just use the same canonicalisation as PEP 503’s distribution names?

Given that we’re standardising with a change in behaviour already, and that one shouldn’t be comparing normalised values against non-normalised values, I think it’s actually reasonable to just go all in and have all names (extras, distribution etc) get normalised in a consistent manner. IIUC, that’s just replacing the _ with a - in this PEP.

2 Likes

It’s also less cognitive complexity IMO - there would be only one set of rules for how names related to a distribution get managed and once you understand that, you understand how things will work.

1 Like

Fedora statistics. Do note that packaging extras is quite a new thing, and packages need to explicitly pick what extras they decide to include and are encouraged to skip extras that are not useful for other packages (for example build/development requirements, commonly named dev, doc or test), and those that have requirements that are not packaged in Fedora. We already normalize by lowercasing only.

  • 3749 Python packages providing ^python3dist\([^\[]+\) – that means “base” packages, no extras
  • 158 Python packages providing '^python3dist\(.+\[.+\]\)' – that means an extra
  • 11 Python packages having nonalphanumeric characters in the extra (^python3dist\(.+\[.*[^a-z0-9]+.*\]\)):
    • 10 Python packages having underscore in the extra ('^python3dist\(.+\[.*_.*\]\)'):
      • databases[mysql_asyncmy]
      • databases[postgresql_aiopg]
      • django-timezone-field[rest_framework]
      • python-engineio[asyncio_client]
      • python-socketio[asyncio_client]
      • sqlalchemy[mssql_pymssql]
      • sqlalchemy[mssql_pyodbc]
      • sqlalchemy[postgresql_asyncpg]
      • sqlalchemy[postgresql_pg8000]
      • webscrapbook[adhoc_ssl]
    • 1 Python package having dash in the extra ('^python3dist\(.+\[.*-.*\]\)'):
      • google-api-core[grpcio-gcp]
    • 0 Python packages having dot in the extra ('^python3dist\(.+\[.*\..*\]\)')

Not much data, but some.

4 Likes

From a conceptual standpoint, and perhaps somewhat a practical one, it certainly is very attractive to use the exact same algorithm both places. To note, though, it doesn’t handle any extras names that contain characters outside of [A-Za-z0-9.\-_]. While they are prohibited by one of the two referenced specs, and at least per my testing, are not allowed at least in the versions of the tools I tested, this would be a hard-break to backward compat if any are used in practice—hopefully @pf_moore 's results can clarify this.

Aside from that, a few reasons (though none of them hard blockers):

  • This requires a much more significant change to the current spec, which presently requires valid Python identifiers—I’m not sure what the rationale is, though.
  • This is a change to the existing (at least Setuptools) normalization, which uses _
  • Per limited results and anecdotal experience, _ is overwhelmingly more common in existing extras names, so the normalized names would be much more inconsistent with the original ones—it seems users are used to such

Given the spec, current normalization and actual user usage all prefer _, I’m not sure its worth the cognitive dissonance to change that now for the sake of nominal consistency…though I don’t feel too strongly about that.

FYI, this also requires the change I proposed above, as right now - and . don’t get normalized (nor collapsed) at all.

Since this is a change to packaging specifications, shouldn’t PEP 508 be converted to a spec first, with this PEP proposing a change to that spec?

OK, here’s some basic stats.

  • I have the metadata from 2,124,900 wheels from PyPI (I don’t have data for packages that don’t ship wheels).
  • There are a total of 7338 unique extras across all of those packages. That strikes me as surprisingly low.
  • I’ve uploaded the list of all those extras as All extras used in wheels from PyPI · GitHub

The situation is a bit of a mess, though. The Provides-Extra metadata says “A string containing the name of an optional feature. Must be a valid Python identifier.” However, PEP 508 defines extras via

identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
identifier    = < letterOrDigit identifier_end* >
name          = identifier
extras_list   = identifier:i (wsp* ',' wsp* identifier)*:ids -> [i] + ids
extras        = '[' wsp* extras_list?:e wsp* ']' -> e

I really hate that grammar, but if I read it right, it allows extras to be a string of letters, digits, -, _, or ., starting with a letter or a digit (so “3.6” is a valid extra!)

Of the 7338 extras I identified, 104 don’t conform to PEP 508, and 1258 are not Python identifiers. The discrepancy in numbers is mostly because Python identifiers don’t allow dots or dashes.

I collected lists of all cases where the normalisation algorithm resulted in 2 different extras normalising to the same value. I did this across all extras, not by package, so these do not necessarily imply that normalising would cause clashes within a package (I’d be extremely surprised if that ever happened, but I’d have to do a re-scan of the database to verify that).

I looked at the following 3 algorithms:

  • Option 1 - re.sub('[^A-Za-z0-9.-]+', '_', name).lower()
  • Option 2 - re.sub('[^A-Za-z0-9]+', '_', name).lower()
  • PEP 503 - re.sub(r"[-_.]+", "-", name).lower()

In all cases, I removed any cases where the only reason for a clash was uppercase vs lowercase, on the assumption that we definitely want extras to be matched case insensitively, so we can assume such cases are intended to map to the same canonical form.

The results:

  • Option 1 - 24 clashes
  • Option 2 - 99 clashes
  • PEP 503 - 73 clashes

The most common other difference was extras which contained spaces. I feel like we’d definitely want to canonicalise “tensorflow with gpu” and “tensorflow_with_gpu” to the same value. PEP 503 is the odd one out here, as it doesn’t normalise spaces, so “a b” and “a_b” are different under PEP 503 rules. I think that’s probably a strike against using pure PEP 503. However, it’s worth noting that values with spaces are not valid extras according to either the core metadata spec, or to PEP 508.

If we limit the checks to only valid extras according to PEP 508, option 1 generates no clashes other than case sensitivity, Option 2 and PEP 503 only generated

dev-test: dev_test, dev-test, dev.test
dev-lint: dev-lint, dev.lint, dev_lint
apache-beam: apache-beam, apache.beam

(which seem fine, to me). Limiting the result to just valid core metadata (Python identifiers) none of the approaches caused any clashes.

I don’t really know what to make of all this. I think there are probably a number of actions to take:

  1. Decide if PEP 685 wants to take a stand on how “invalid” extra names get normalised. If it dismisses that possibility, then PEP 503 normalisation probably wins due to being consistent with elsewhere, but all of the stated variations work, insofar as they enable case insensitive comparison that treats “.”, “-” and “_” the same.
  2. Fix the mess that is the definition of what constitutes a valid extra. We have 2 different specs which are inconsistent, and from a practical standpoint it doesn’t look like tools enforce either standard[1]

Personally, I think that PEP 685 should accept that invalid extras exist, and explicitly note that tools can apply PEP 685 normalisation to such non-standard extras. Part of me wants to say that it should say that tools SHOULD warn if applying normalisation to an invalid extra, but without a clear definition of what’s valid, that seems like it will only cause more confusion :slightly_frowning_face: As for standarising valid extras, I’d like to fix that, but the only solution that feels to me like it would be straightforward would be to update the metadata spec to state that Provides-Extra must follow PEP 508 format.

In theory, yes, it probably should. But that would be a fairly significant undertaking, and we’d probably not make progress if we tied this proposal to doing that.

In fact, it might be good to treat moving the various packaging specs out of the existing PEPs and into PyPA specifications — Python Packaging User Guide as a standalone project, which might be something @smm would be able to help co-ordinate. It’s something that really needs people with technical writing skills, rather than coders, which is probably why it’s only getting done in bits, with no real momentum behind it.


  1. Disclaimer: I didn’t check the age of the wheels I scanned, it’s possible that older versions of tools allowed arbitrary extras but that has since been fixed. Someone should check this. ↩︎

3 Likes

Ideally, yes, but I don’t have that sort of time. Also see Bring over PEPs 517, 518, and 660 to the specs section · Issue #955 · pypa/packaging.python.org · GitHub for other PEPs that still need to be brought over.

You’re reading it correctly.

Yeah, no clear winner. :sweat_smile: All normalization approaches seem acceptable.

I don’t think it’s worth explicitly addressing beyond suggesting tools warn users about them.

It also works with PEP 508, so that means only the core metadata spec requires a potential update to unify what a valid extra name is (not sure if that requires a new core metadata version since the old names would still be valid?).

I think “invalid” would be anything that doesn’t match the grammar specified in PEP 508 which is what the core metadata spec specifies for Name already:

r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$"

We could add a packaging.utils.check_name() function that checks if that regex matches a name (which I probably want anyway to help validate metadata before it’s written out).

I agree since it means preexisting extras based on the current core metadata spec are still valid.

1 Like

Good point. I was getting distracted by the fact that some pre-existing extras would be invalid under the new spec. But they are invalid under the old spec too, so that’s not particularly compelling.

I did a quick check, and it appears that current setuptools normalises extra names (“a sample” gets stored in the metadata as “a_sample”) so the invalid extras I identified are likely from older releases. I might do some checking at some point to confirm that.

I forgot to say, but I agree, it seems to me that can just be a PR to the spec rather than a new version / PEP.

The process says

If a change being considered this way has the potential to affect software interoperability, then it must be escalated to the Packaging category of the Python.org Discourse for discussion, where it will be either approved as a text-only change, or else directed to the PEP process for specification updates.

so I’d say that as long as no-one objects here, we’re OK to treat it as a text-only change.

2 Likes

I’ve updated the PEP based on the feedback:

  • Specify the versions of pip and setuptools.
  • Loosen naming requirements to match PEP 508.
  • Use PEP 503 normalization.
  • Add some more references.
  • Said tools SHOULD warn when the extra name is invalid.
3 Likes

Actually, Option 1 does not do that, as test-extra, test.extra and test_extra (and test--extra and test.extra) are all left alone, not normalized to one common form. The only change that Option 1 makes to PEP 508-valid extras names is normalizing test__extra to test_extra. This is the fundamental difference between Option 1 and Option 2 (and why you’re not seeing different spellings of dev_lint get normalized to the same name).

I did previously do some limited testing of this which appeared to suggest at least some current tools reject extras that have non-PEP 508-conforming names, at least in some contexts. However, I would be concerned that tools should be tolerant of extras produced by older or other tools that may not have conforming names, given the current situation.

Given my background and expertise being a pretty good match for that and it being a more valuable use of my skills than copyediting others’ PEPs, its something I’m been meaning to offer to help with as one of my next projects, but I want to finish PEP 639 and some remaining PEP infra/documentation work before overextending myself further.

To be fair, we have a PEP (clarifying this being part of the motivating purpose for said PEP), so isn’t this a moot point as we can just specify it there (as PEPs are change proposals to the existing canoncial PyPA specifications, rather than standalone specifications)?

I’ve been taking care of a family situation over the past couple days, so I’ll save a detailed pass on the specific changes in the PEP for tomorrow, but responding to a few higher-level points on the content:

I fully agree with standardizing on PEP 508 for valid extras names (which is identical to the requirements for Name in core metadata), provided the PEP 503 or Option 2 normalized form (which are identical for valid such names) is always used to compare them.

Just to note, it is stated a few places in the PEP that PEP 508 is looser than Provides-Extra spec, but that is not strictly the case. Namely, the former requires extras names neither start or end with _ (or . and -), whereas valid Python identifiers can start or end with _. Also, non-ASCII alphanumeric characters are valid in Python 3 identifiers (though not in Python 2 identifiers, which was presumably still relevant at the time the spec was written), so if tools or users have interpreted these as being valid under the Provides-Metadata spec, then these will also be invalid.

It would probably be worth mentioning such cases should at least be mentioned in the relevant place(s) and clarifying the current text.

To note, Option 2 (with - instead of _) is PEP 503 normalization, provided the source names are valid per PEP 508 (which for this is identical to the format specified for the Name core metadata field). Therefore, applying it to PEP 508-valid package names and extras produces the same result as the specific regex used by packaging for PEP 503.

The one area where it differs is for existing extras that do not conform to PEP 508, where it replaces invalid characters with valid ones. This allows users to still specify extras for existing packages that are invalid under the updated specification (and the existing PEP 508 one, but not necessarily the Provides-Extra spec). This would seem to follow the principle of “loose on input, strict on output” and addresses most remaining backward compatibility concerns, but most of these names were always invalid, it requires handling conflicts and adds complexity.

Runs of _, unlike PEP 503, do not get collapsed, e.g. ___ stays the same.

Seems like we’re still confused on what this code actually does :stuck_out_tongue: Runs of _, do get collapsed, as do runs of other characters outside alphanumeric and -/. (and normalized to _), while runs of - and . do not, unlike PEP 503.

Uhm… if we’re calling them invalid right now and they don’t work today, why bother stretching to ensure that we don’t reject them in the future?

Hmmm… maybe this could also say:

… tools SHOULD warn when the extra name is invalid and ignore such an extra name. Alternatively, tools MAY raise an error, in effect refusing to process handle such an extra name.

I think it’s probably ~20-30 hours of copy-pasting + copy-editting work, to be honest – for someone who knows what they’re doing.

I agree that we shouldn’t block other improvements on doing that, but I wouldn’t expect this to be that much work. It’s just that it’ll be grindy and repetitive. It also requires someone else to sit and review 1000s of lines of prose (~1-20 hours, depending on familiarity + how much we trust that the first person did it correctly). :slight_smile:

2 Likes

True. I will tweak the PEP. I have also created an open issue around whether we need to bump the core metadata version for this.

No, I’ve just had a lot going on IRL and so my brain wasn’t reading the regex properly.

:+1:

2 Likes
  • Bump the core metadata version to 2.3.
  • Have metadata writers raise errors when encountering invalid names (based on the core metadata version).
  • Have metadata writers warn when a name wouldn’t be valid in the future.
  • At least ignore invalid names when reading, potentially raise an error.
  • Minor tweaks.
3 Likes

Looking at this, if a sdist has metadata version 2.3 (or 2.2) but doesn’t have a “Dynamic” field at all, it’s not obvious to me from PEP 643 whether that would be treated as “everything is static because it’s not specified as dynamic”, or if it would mean that everything is dynamic (for backward compatibility) because the rules only apply “When (Dynamic is) found in the metadata of a source distribution”.

I don’t recall if I had a particular interpretation in mind when I wrote the PEP, unfortunately :slightly_frowning_face: But with hindsight, I think it would be better to go with the backward compatible approach, as otherwise tools might hold off on moving to later metadata standards because they’d need to handle Dynamic first.

Would anyone object if I submitted a clarification for PEP 643 which made that point explicit:

If a project specifies metadata version 2.2 or later, but the Dynamic field is not present at all, then for backward compatibility, all fields are assumed to be dynamic. However, projects SHOULD explicitly include the Dynamic field if at all possible, rather than relying on this behaviour.

(I’m specifically thinking that setuptools might want to add support for the new license expression metadata when it’s finalised, and this would have a metadata version >2.2. I wouldn’t want that to be blocked on getting support for dynamic sorted out).

Or am I being too cautious here? It’s pretty easy just to add a Dynamic field listing everything if you have no better way of knowing what might be static, and your tool has to change to bump the metadata version anyway.

1 Like

That doesn’t make sense to me. If you don’t have any dynamic fields, will you need to add a dummy Dynamic entry to denote that your metadata are static? What should its value then be?

Oops, good point. So I think this is a more substantial question than I’d originally thought. I’ll repost my question as a separate topic, to avoid hijacking this one.

1 Like

I feel the PEP text should explicit state what is considered a valid extra name. Currently the rule is sort of inferred by various other rules and it is not easy for a tool to know how a user-provided extra name should be validated before normalisation.

3 Likes

I opened a PR with some clarifications, copyedits and other fixes:

Also, for those interested, @pf_moore 's aforementioned thread above is here:

I had a few additional, more substantive questions and comments on the PEP text:

Tools generating metadata MUST raise an error if an invalid extra name is provided as appropriate for the specified core metadata version. If an older core metadata version is specified and the name would be invalid with newer core metadata versions, tools SHOULD warn the user.

How is the core metadata version determined/specified? AFAIK, there is no user-facing way to specify this, at least in the build backends I’m familiar with. Per the discussion in @pf_moore 's thread, should this just be the latest and we can just simplify this to just raise an error if an invalid extra names is provided (since the core metadata spec is only specifying the current version, I’m not sure it makes sense to discuss older versions)?

Tools SHOULD warn users when an invalid extra name is read and SHOULD not use the name to avoid ambiguity.

What does it mean to “not use the name”? Should tools error out? Or just not write/install the extra? That seems like rather unexpected and undesirable behavior; if the tool is going to not just try to use the name anyway (presumably with normalization), it would be better to just error out rather than do something other than what the user explicitly requested.

Moving to PEP 503 normalization and PEP 508 name acceptance, it allows for all preexisting, valid names to continue to be valid.

Perhaps this should be clarified to say that valid extras specifiers per PEP 508 will continue to remain valid, since as mentioned above, this isn’t strictly true for the existing spec for Provides-Extra in core metadata.

It’s good the rationale section follows the regex spec (re.sub(r"[-_.]+", "-", name).lower()) with a prose description (" This collapses any run of the substitution character down to a single character, e.g. --- gets collapsed down to - .")

The specification section also repeats the regex spec. It would be nice to also have a prose description here.

I realise (currently) the two are identical, but I can see people directly linking to https://peps.python.org/pep-0685/#specification so there’s value in seeing that relevant information immediately.

1 Like