PEP 685: Comparison of extra names for optional distribution dependencies

Since this is a change to packaging specifications, shouldn’t PEP 508 be converted to a spec first, with this PEP proposing a change to that spec?

OK, here’s some basic stats.

  • I have the metadata from 2,124,900 wheels from PyPI (I don’t have data for packages that don’t ship wheels).
  • There are a total of 7338 unique extras across all of those packages. That strikes me as surprisingly low.
  • I’ve uploaded the list of all those extras as “All extras used in wheels from PyPI” (GitHub).

The situation is a bit of a mess, though. The Provides-Extra metadata says “A string containing the name of an optional feature. Must be a valid Python identifier.” However, PEP 508 defines extras via

identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
identifier    = < letterOrDigit identifier_end* >
name          = identifier
extras_list   = identifier:i (wsp* ',' wsp* identifier)*:ids -> [i] + ids
extras        = '[' wsp* extras_list?:e wsp* ']' -> e

I really hate that grammar, but if I read it right, it allows extras to be a string of letters, digits, -, _, or ., starting with a letter or a digit (so “3.6” is a valid extra!)
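
To make that reading concrete, here is a minimal sketch that walks the grammar by hand rather than through a regex. It assumes letterOrDigit means ASCII letters and digits only, and the name matches_identifier is made up for illustration; it is not taken from PEP 508 or from any tool.

```python
import string

# Assumption: letterOrDigit in the grammar means ASCII letters and digits.
LETTER_OR_DIGIT = set(string.ascii_letters + string.digits)
SEPARATORS = set("-_.")


def matches_identifier(candidate: str) -> bool:
    """Return True if candidate fits the PEP 508 `identifier` rule as read above."""
    if not candidate:
        return False
    # The name must start with a letter or digit, and every identifier_end
    # finishes with a letter or digit, so the last character must be one too.
    if candidate[0] not in LETTER_OR_DIGIT or candidate[-1] not in LETTER_OR_DIGIT:
        return False
    # Everything in between may be a letter, a digit, or any run of separators.
    return all(ch in LETTER_OR_DIGIT or ch in SEPARATORS for ch in candidate)


for extra in ["3.6", "dev-test", "a-_.b", "_private", "docs.", "tensorflow with gpu"]:
    print(f"{extra!r}: {matches_identifier(extra)}")
```

Under this reading “3.6”, “dev-test” and even “a-_.b” are accepted, while names with a leading underscore, a trailing dot or a space are rejected.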

Of the 7338 extras I identified, 104 don’t conform to PEP 508, and 1258 are not Python identifiers. The discrepancy in numbers is mostly because Python identifiers don’t allow dots or dashes.

I collected lists of all cases where the normalisation algorithm resulted in 2 different extras normalising to the same value. I did this across all extras, not by package, so these do not necessarily imply that normalising would cause clashes within a package (I’d be extremely surprised if that ever happened, but I’d have to do a re-scan of the database to verify that).

I looked at the following 3 algorithms:

  • Option 1 - re.sub('[^A-Za-z0-9.-]+', '_', name).lower()
  • Option 2 - re.sub('[^A-Za-z0-9]+', '_', name).lower()
  • PEP 503 - re.sub(r"[-_.]+", "-", name).lower()
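
For anyone who wants to try these on their own names, here are the three candidates written out as functions. The expressions are exactly the ones listed above; the function names and the sample inputs are only illustrative.

```python
import re


def option_1(name: str) -> str:
    # Replace runs of anything other than letters, digits, "." and "-" with "_".
    return re.sub(r"[^A-Za-z0-9.-]+", "_", name).lower()


def option_2(name: str) -> str:
    # Replace runs of anything other than letters and digits with "_".
    return re.sub(r"[^A-Za-z0-9]+", "_", name).lower()


def pep_503(name: str) -> str:
    # Replace runs of "-", "_" and "." with a single "-".
    return re.sub(r"[-_.]+", "-", name).lower()


for extra in ["Dev.Test", "dev__test", "tensorflow with gpu"]:
    print(f"{extra!r}: {option_1(extra)!r} {option_2(extra)!r} {pep_503(extra)!r}")
```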

In all cases, I removed any cases where the only reason for a clash was uppercase vs lowercase, on the assumption that we definitely want extras to be matched case insensitively, so we can assume such cases are intended to map to the same canonical form.
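
Roughly, the clash collection works like the sketch below. This is a reconstruction of the approach described here, not the script that produced the numbers; find_clashes and the sample list are made up, and it assumes the normaliser lowercases its result, as all three candidates do.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


def pep_503(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


def find_clashes(
    extras: Iterable[str], normalise: Callable[[str], str]
) -> Dict[str, Set[str]]:
    """Group extras by normalised form, ignoring clashes that are case-only."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for extra in extras:
        # Store the lowercased spelling so "Dev" and "dev" count as one name.
        groups[normalise(extra)].add(extra.lower())
    return {canonical: names for canonical, names in groups.items() if len(names) > 1}


sample = ["Dev", "dev", "dev-test", "dev_test", "dev.test", "docs"]
clashes = find_clashes(sample, pep_503)
for canonical, names in clashes.items():
    print(canonical, "<-", sorted(names))
```

With this sample, only the three spellings of dev-test are reported; “Dev” and “dev” are treated as the same name and do not count as a clash.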

The results:

  • Option 1 - 24 clashes
  • Option 2 - 99 clashes
  • PEP 503 - 73 clashes

The most common other difference was extras which contained spaces. I feel like we’d definitely want to canonicalise “tensorflow with gpu” and “tensorflow_with_gpu” to the same value. PEP 503 is the odd one out here, as it doesn’t normalise spaces, so “a b” and “a_b” are different under PEP 503 rules. I think that’s probably a strike against using pure PEP 503. However, it’s worth noting that values with spaces are not valid extras according to either the core metadata spec, or to PEP 508.

If we limit the checks to only valid extras according to PEP 508, Option 1 generates no clashes other than case differences, while Option 2 and PEP 503 only generated:

dev-test: dev_test, dev-test, dev.test
dev-lint: dev-lint, dev.lint, dev_lint
apache-beam: apache-beam, apache.beam

(which seem fine to me). Limiting the results to just valid core metadata (Python identifiers), none of the approaches caused any clashes.

I don’t really know what to make of all this. I think there are probably a number of actions to take:

  1. Decide if PEP 685 wants to take a stand on how “invalid” extra names get normalised. If it dismisses that possibility, then PEP 503 normalisation probably wins due to being consistent with elsewhere, but all of the stated variations work, insofar as they enable case insensitive comparison that treats “.”, “-” and “_” the same.
  2. Fix the mess that is the definition of what constitutes a valid extra. We have 2 different specs which are inconsistent, and from a practical standpoint it doesn’t look like tools enforce either standard[1]

Personally, I think that PEP 685 should accept that invalid extras exist, and explicitly note that tools can apply PEP 685 normalisation to such non-standard extras. Part of me wants to say that tools SHOULD warn if applying normalisation to an invalid extra, but without a clear definition of what’s valid, that seems like it will only cause more confusion :slightly_frowning_face: As for standardising valid extras, I’d like to fix that, but the only solution that feels straightforward to me would be to update the metadata spec to state that Provides-Extra must follow PEP 508 format.

In theory, yes, it probably should. But that would be a fairly significant undertaking, and we’d probably not make progress if we tied this proposal to doing that.

In fact, it might be good to treat moving the various packaging specs out of the existing PEPs and into “PyPA specifications - Python Packaging User Guide” as a standalone project, which might be something @smm would be able to help co-ordinate. It’s something that really needs people with technical writing skills, rather than coders, which is probably why it’s only getting done in bits, with no real momentum behind it.


  1. Disclaimer: I didn’t check the age of the wheels I scanned. It’s possible that older versions of tools allowed arbitrary extras but that this has since been fixed. Someone should check this.

Ideally, yes, but I don’t have that sort of time. Also see “Bring over PEPs 517, 518, and 660 to the specs section” (Issue #955, pypa/packaging.python.org on GitHub) for other PEPs that still need to be brought over.

You’re reading it correctly.

Yeah, no clear winner. :sweat_smile: All normalization approaches seem acceptable.

I don’t think it’s worth explicitly addressing beyond suggesting tools warn users about them.

It also works with PEP 508, so that means only the core metadata spec requires a potential update to unify what a valid extra name is (not sure if that requires a new core metadata version since the old names would still be valid?).

I think “invalid” would be anything that doesn’t match the grammar specified in PEP 508 which is what the core metadata spec specifies for Name already:

r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$"

We could add a packaging.utils.check_name() function that checks if that regex matches a name (which I probably want anyway to help validate metadata before it’s written out).
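
A sketch of what such a function might look like is below. To be clear, check_name is only the name floated here, not an existing packaging API, and returning a bool (rather than raising) is just one possible design.

```python
import re

# The Name regex quoted above; the spec applies it case-insensitively.
_VALID_NAME = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE
)


def check_name(name: str) -> bool:
    """Hypothetical helper: True if name is valid as a project or extra name."""
    # fullmatch() stops `$` from quietly accepting a trailing newline.
    return _VALID_NAME.fullmatch(name) is not None


print(check_name("apache.beam"))
print(check_name("a sample"))
```

A metadata writer could call this on every extra before writing it out, and a reader could use it to decide whether to warn.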

I agree since it means preexisting extras based on the current core metadata spec are still valid.

Good point. I was getting distracted by the fact that some pre-existing extras would be invalid under the new spec. But they are invalid under the old spec too, so that’s not particularly compelling.

I did a quick check, and it appears that current setuptools normalises extra names (“a sample” gets stored in the metadata as “a_sample”) so the invalid extras I identified are likely from older releases. I might do some checking at some point to confirm that.

I forgot to say, but I agree, it seems to me that can just be a PR to the spec rather than a new version / PEP.

The process says

If a change being considered this way has the potential to affect software interoperability, then it must be escalated to the Packaging category of the Python.org Discourse for discussion, where it will be either approved as a text-only change, or else directed to the PEP process for specification updates.

so I’d say that as long as no-one objects here, we’re OK to treat it as a text-only change.

I’ve updated the PEP based on the feedback:

  • Specify the versions of pip and setuptools.
  • Loosen naming requirements to match PEP 508.
  • Use PEP 503 normalization.
  • Add some more references.
  • Said tools SHOULD warn when the extra name is invalid.

Actually, Option 1 does not do that, as test-extra, test.extra and test_extra (and test--extra and test.extra) are all left alone, not normalized to one common form. The only change that Option 1 makes to PEP 508-valid extras names is normalizing test__extra to test_extra. This is the fundamental difference between Option 1 and Option 2 (and why you’re not seeing different spellings of dev_lint get normalized to the same name).
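
A quick way to see the difference is to run both options over the same set of spellings. The expressions are the ones from earlier in the thread; the function names and the list of spellings are just for illustration.

```python
import re


def option_1(name: str) -> str:
    return re.sub(r"[^A-Za-z0-9.-]+", "_", name).lower()


def option_2(name: str) -> str:
    return re.sub(r"[^A-Za-z0-9]+", "_", name).lower()


spellings = ["test-extra", "test.extra", "test_extra", "test__extra", "test--extra"]

option_1_forms = {option_1(s) for s in spellings}
option_2_forms = {option_2(s) for s in spellings}

print("Option 1:", sorted(option_1_forms))
print("Option 2:", sorted(option_2_forms))
```

Option 1 still ends up with four distinct names (only test__extra is folded into test_extra), whereas Option 2 maps all five spellings to test_extra.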

I did previously do some limited testing of this which appeared to suggest at least some current tools reject extras that have non-PEP 508-conforming names, at least in some contexts. However, I would be concerned that tools should be tolerant of extras produced by older or other tools that may not have conforming names, given the current situation.

Given that my background and expertise are a pretty good match for that, and that it would be a more valuable use of my skills than copyediting others’ PEPs, it’s something I’ve been meaning to offer to help with as one of my next projects, but I want to finish PEP 639 and some remaining PEP infra/documentation work before overextending myself further.

To be fair, we have a PEP (clarifying this being part of the motivating purpose for said PEP), so isn’t this a moot point, as we can just specify it there (as PEPs are change proposals to the existing canonical PyPA specifications, rather than standalone specifications)?

I’ve been taking care of a family situation over the past couple days, so I’ll save a detailed pass on the specific changes in the PEP for tomorrow, but responding to a few higher-level points on the content:

I fully agree with standardizing on PEP 508 for valid extras names (which is identical to the requirements for Name in core metadata), provided the PEP 503 or Option 2 normalized form (which are identical for valid such names) is always used to compare them.

Just to note, it is stated in a few places in the PEP that PEP 508 is looser than the Provides-Extra spec, but that is not strictly the case. Namely, the former requires that extras names neither start nor end with _ (or . and -), whereas valid Python identifiers can start or end with _. Also, non-ASCII alphanumeric characters are valid in Python 3 identifiers (though not in Python 2 identifiers, which was presumably still relevant at the time the spec was written), so if tools or users have interpreted these as being valid under the Provides-Extra spec, then these will also be invalid.
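
To illustrate that neither rule is a superset of the other, here is a small sketch comparing str.isidentifier() with the Name regex from the core metadata spec. The classify helper and the sample names are made up for the example.

```python
import re

PEP508_NAME = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE
)


def classify(name: str) -> tuple:
    """Return (is a Python 3 identifier, is valid per the PEP 508 regex)."""
    return name.isidentifier(), PEP508_NAME.fullmatch(name) is not None


# "caf\u00e9" is "cafe" with an acute accent on the final letter.
samples = ["docs", "dev-test", "3.6", "_private", "trailing_", "caf\u00e9"]
for name in samples:
    is_identifier, is_pep508 = classify(name)
    print(f"{name!r}: identifier={is_identifier}, pep508={is_pep508}")
```

Names like dev-test and 3.6 pass PEP 508 but are not identifiers, while _private, trailing_ and the accented name are identifiers that PEP 508 rejects.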

It would probably be worth at least mentioning such cases in the relevant place(s) and clarifying the current text.

To note, Option 2 (with - instead of _) is PEP 503 normalization, provided the source names are valid per PEP 508 (which for this is identical to the format specified for the Name core metadata field). Therefore, applying it to PEP 508-valid package names and extras produces the same result as the specific regex used by packaging for PEP 503.
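
That equivalence is easy to check by brute force. The sketch below enumerates every string of up to five characters over a small alphabet, keeps the ones that are valid per the Name regex, and compares the two normalisations. The alphabet, the length limit and the name option_2_dash are arbitrary choices for the example.

```python
import itertools
import re

PEP508_NAME = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE
)


def option_2_dash(name: str) -> str:
    # Option 2, but substituting "-" instead of "_".
    return re.sub(r"[^A-Za-z0-9]+", "-", name).lower()


def pep_503(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


alphabet = "aB1-_."
candidates = (
    "".join(chars)
    for length in range(1, 6)
    for chars in itertools.product(alphabet, repeat=length)
)
valid_names = [name for name in candidates if PEP508_NAME.fullmatch(name)]
mismatches = [n for n in valid_names if option_2_dash(n) != pep_503(n)]

print(len(valid_names), "valid names checked,", len(mismatches), "mismatches")
# The two only diverge on names that are invalid to begin with:
print(repr(option_2_dash("a b")), "vs", repr(pep_503("a b")))
```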

The one area where it differs is for existing extras that do not conform to PEP 508, where it replaces invalid characters with valid ones. This allows users to still specify extras for existing packages that are invalid under the updated specification (and the existing PEP 508 one, but not necessarily the Provides-Extra spec). This would seem to follow the principle of “loose on input, strict on output” and addresses most remaining backward compatibility concerns. On the other hand, most of these names were always invalid, and it requires handling conflicts and adds complexity.

Runs of _, unlike PEP 503, do not get collapsed, e.g. ___ stays the same.

Seems like we’re still confused on what this code actually does :stuck_out_tongue: Runs of _ do get collapsed, as do runs of any other characters outside alphanumerics, - and . (all of which are normalized to _), while runs of - and . do not, unlike PEP 503.

Uhm… if we’re calling them invalid right now and they don’t work today, why bother stretching to ensure that we don’t reject them in the future?

Hmmm… maybe this could also say:

… tools SHOULD warn when the extra name is invalid and ignore such an extra name. Alternatively, tools MAY raise an error, in effect refusing to process such an extra name.

I think it’s probably ~20-30 hours of copy-pasting + copy-editing work, to be honest – for someone who knows what they’re doing.

I agree that we shouldn’t block other improvements on doing that, but I wouldn’t expect this to be that much work. It’s just that it’ll be grindy and repetitive. It also requires someone else to sit and review 1000s of lines of prose (~1-20 hours, depending on familiarity + how much we trust that the first person did it correctly). :slight_smile:

True. I will tweak the PEP. I have also created an open issue around whether we need to bump the core metadata version for this.

No, I’ve just had a lot going on IRL and so my brain wasn’t reading the regex properly.

:+1:

  • Bump the core metadata version to 2.3.
  • Have metadata writers raise errors when encountering invalid names (based on the core metadata version).
  • Have metadata writers warn when a name wouldn’t be valid in the future.
  • At least ignore invalid names when reading, potentially raise an error.
  • Minor tweaks.

Looking at this, if a sdist has metadata version 2.3 (or 2.2) but doesn’t have a “Dynamic” field at all, it’s not obvious to me from PEP 643 whether that would be treated as “everything is static because it’s not specified as dynamic”, or if it would mean that everything is dynamic (for backward compatibility) because the rules only apply “When (Dynamic is) found in the metadata of a source distribution”.

I don’t recall if I had a particular interpretation in mind when I wrote the PEP, unfortunately :slightly_frowning_face: But with hindsight, I think it would be better to go with the backward compatible approach, as otherwise tools might hold off on moving to later metadata standards because they’d need to handle Dynamic first.

Would anyone object if I submitted a clarification for PEP 643 which made that point explicit:

If a project specifies metadata version 2.2 or later, but the Dynamic field is not present at all, then for backward compatibility, all fields are assumed to be dynamic. However, projects SHOULD explicitly include the Dynamic field if at all possible, rather than relying on this behaviour.

(I’m specifically thinking that setuptools might want to add support for the new license expression metadata when it’s finalised, and this would have a metadata version >2.2. I wouldn’t want that to be blocked on getting support for dynamic sorted out).

Or am I being too cautious here? It’s pretty easy just to add a Dynamic field listing everything if you have no better way of knowing what might be static, and your tool has to change to bump the metadata version anyway.

That doesn’t make sense to me. If you don’t have any dynamic fields, will you need to add a dummy Dynamic entry to denote that your metadata are static? What should its value then be?

Oops, good point. So I think this is a more substantial question than I’d originally thought. I’ll repost my question as a separate topic, to avoid hijacking this one.

I feel the PEP text should explicitly state what is considered a valid extra name. Currently the rule is sort of inferred from various other rules, and it is not easy for a tool to know how a user-provided extra name should be validated before normalisation.

I opened a PR with some clarifications, copyedits and other fixes:

Also, for those interested, @pf_moore’s aforementioned thread is here:

I had a few additional, more substantive questions and comments on the PEP text:

Tools generating metadata MUST raise an error if an invalid extra name is provided as appropriate for the specified core metadata version. If an older core metadata version is specified and the name would be invalid with newer core metadata versions, tools SHOULD warn the user.

How is the core metadata version determined/specified? AFAIK, there is no user-facing way to specify this, at least in the build backends I’m familiar with. Per the discussion in @pf_moore’s thread, should this just be the latest, so that we can simplify this to just raising an error if an invalid extra name is provided (since the core metadata spec only specifies the current version, I’m not sure it makes sense to discuss older versions)?

Tools SHOULD warn users when an invalid extra name is read and SHOULD not use the name to avoid ambiguity.

What does it mean to “not use the name”? Should tools error out? Or just not write/install the extra? That seems like rather unexpected and undesirable behavior; if the tool is going to not just try to use the name anyway (presumably with normalization), it would be better to just error out rather than do something other than what the user explicitly requested.

Moving to PEP 503 normalization and PEP 508 name acceptance, it allows for all preexisting, valid names to continue to be valid.

Perhaps this should be clarified to say that valid extras specifiers per PEP 508 will continue to remain valid, since as mentioned above, this isn’t strictly true for the existing spec for Provides-Extra in core metadata.

It’s good the rationale section follows the regex spec (re.sub(r"[-_.]+", "-", name).lower()) with a prose description (“This collapses any run of the substitution character down to a single character, e.g. --- gets collapsed down to -.”)

The specification section also repeats the regex spec. It would be nice to also have a prose description here.

I realise (currently) the two are identical, but I can see people directly linking to https://peps.python.org/pep-0685/#specification so there’s value in seeing that relevant information immediately.

You only quoted parts of the relevant paragraph. Here’s the full paragraph:

Tools generating metadata MUST raise an error if a user specified two or more extra names which would normalize to the same name. Tools generating metadata MUST raise an error if an invalid extra name is provided as appropriate for the specified core metadata version. If an older core metadata version is specified and the name would be invalid with newer core metadata versions, tools SHOULD warn the user. Tools SHOULD warn users when an invalid extra name is read and not use the name to avoid ambiguity. Tools MAY raise an error instead of a warning when reading an invalid name if they so desire.
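
As a rough sketch of how a tool might implement that paragraph, assuming PEP 503 normalization and the Name regex as the validity rule: the function names below are hypothetical, and the metadata-version-dependent warning for producers is left out to keep it short.

```python
import re
import warnings
from typing import Dict, Iterable, List, Optional

VALID_NAME = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE
)


def normalize_extra(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


def extras_for_writing(extras: Iterable[str]) -> List[str]:
    """Producer side: error on invalid names and on names that collide."""
    seen: Dict[str, str] = {}
    for extra in extras:
        if VALID_NAME.fullmatch(extra) is None:
            raise ValueError(f"invalid extra name: {extra!r}")
        canonical = normalize_extra(extra)
        if canonical in seen:
            raise ValueError(
                f"extras {seen[canonical]!r} and {extra!r} "
                f"both normalize to {canonical!r}"
            )
        seen[canonical] = extra
    return list(seen)


def extra_when_reading(name: str) -> Optional[str]:
    """Consumer side: warn about an invalid name and do not use it."""
    if VALID_NAME.fullmatch(name) is None:
        warnings.warn(f"ignoring invalid extra name: {name!r}")
        return None
    return normalize_extra(name)


print(extras_for_writing(["Dev-Test", "docs"]))
```

A consumer that prefers the MAY clause would raise in extra_when_reading instead of warning.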

The only context in which “an older core metadata version” can be specified is in the case of a metadata consumer reading data generated by a tool that hasn’t been updated for the new spec. In that case, the consumer should warn, so that the user knows that the extra will have to change in future.

You seem to be thinking in terms of multiple valid metadata formats (one for each version). That’s not a helpful way of thinking of things - there’s only one metadata format, and it gets updated over time. The expectation is that all tools and projects will conform to the current metadata spec. The version numbering is solely to manage the fact that once metadata is generated, it doesn’t get rewritten, so handling legacy formats is a necessary evil, but only for metadata consumers. Producers should always follow the rules for the current (latest) specification[1].

Yes, the spec says

For broader compatibility, build tools MAY choose to produce distribution metadata using the lowest metadata version that includes all of the needed fields.

IMO that’s a mistake, and we should remove it, and expect build tools to produce metadata that conforms to the latest standard (yes, I will go back to my post on the other thread where I said this wasn’t worth the effort and update it :slightly_smiling_face:)


  1. Obviously there will always be delays in implementing spec changes, though.

Ah okay, thanks, that does make more sense in the context of a metadata consumer—I had been thinking just in terms of a producer.

Actually, the latter was more or less what I intended to mean by

but I was only thinking in terms of the context of a metadata producer rather than a consumer.

Why, you don’t say… :stuck_out_tongue:

I’ve opened a pull request as pypa/packaging.python.org#1063; should I open (yet) another dedicated thread to ensure this has appropriate visibility, or should we just wait for further discussion on the existing thread?

That’s how I interpreted the PEP.

It currently says …

What are you specifically after here? Do you want me to copy ^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$ into the PEP?

Feel free to send a PR, but people should read the entire PEP. :wink: Plus the actual specification will be reflected at packaging.python.org and not this PEP, so that won’t be important long-term (if this PEP gets accepted).

We seem to be getting into verbiage discussions more than technical details. And since this all has to be translated into a PR for packaging.python.org, I don’t want to get too hung up on how things are written as long as the semantic meaning is correct.

So is there any more technical feedback on this PEP?
