PEP 685: Comparison of extra names for optional distribution dependencies

brettcannon · March 9, 2022, 1:37am

As promised in “What extras names are treated as equal and why?”, here is the PEP to standardize how to normalize and compare extra names.

A rendered version can be found on peps.python.org as “PEP 685 – Comparison of extra names for optional distribution dependencies”.

PEP: 685
Title: Comparison of extra names for optional distribution dependencies
Author: Brett Cannon <brett@python.org>
PEP-Delegate: Paul Moore <p.f.moore@gmail.com>
Discussions-To: https://discuss.python.org/t/14141
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 08-Mar-2022
Post-History: 08-Mar-2022


Abstract
========

This PEP specifies how to normalize `distribution extra <Provides-Extra_>`_
names when performing comparisons.
This prevents tools from either failing to find an extra name, or
accidentally matching against an unexpected name.


Motivation
==========

The `Provides-Extra`_ core metadata specification states that an extra's
name "must be a valid Python identifier".
:pep:`508` specifies that the value of an ``extra`` marker may contain a
letter, digit, or any one of ``.``, ``-``, or ``_`` after the initial character.
Otherwise, there is no other `PyPA specification
<https://packaging.python.org/en/latest/specifications/>`_
which outlines how extra names should be written or normalized for comparison.
Due to the amount of packaging-related code in existence,
it is important to evaluate current practices by the community and
standardize on one that doesn't break most code, while being
something tool authors can agree to following.

The issue of there being no standard was brought forward by an
`initial discussion <https://discuss.python.org/t/7614>`__
noting that the extra ``adhoc-ssl`` was not considered equal to the name
``adhoc_ssl`` by pip 22.


Rationale
=========

:pep:`503` specifies how to normalize distribution names::

    re.sub(r"[-_.]+", "-", name).lower()

This collapses any run of the substitution character down to a single
character,
e.g. ``---`` gets collapsed down to ``-``.
This does **not** produce a valid Python identifier as specified by
the core metadata 2.2 specification for extra names.

`Setuptools 60 does normalization <https://github.com/pypa/setuptools/blob/b2f7b8f92725c63b164d5776f85e67cc560def4e/pkg_resources/__init__.py#L1324-L1330>`__
via::

    re.sub(r'[^A-Za-z0-9-.]+', '_', name).lower()

The use of an underscore/``_`` differs from PEP 503's use of a
hyphen/``-``.
Runs of ``.`` and ``-``, unlike PEP 503, do **not** get collapsed,
e.g. ``..`` stays the same.

For pip 22, its
"extra normalisation behaviour is quite convoluted and erratic" [pip-erratic]_,
and so its use is not considered.

.. [pip-erratic] https://discuss.python.org/t/what-extras-names-are-treated-as-equal-and-why/7614/10?


Specification
=============

When comparing extra names, tools MUST normalize the names being compared
using the semantics outlined in `PEP 503 for names <https://peps.python.org/pep-0503/#normalized-names>`__::

    re.sub(r"[-_.]+", "-", name).lower()

The `core metadata`_ specification will be updated such that the allowed
names for `Provides-Extra`_ matches what :pep:`508` specifies for names.
This will bring extra naming in line with that of the Name_ field.
Because this changes what is considered valid, it will lead to a core
metadata version increase to ``2.3``.

For tools writing `core metadata`_,
they MUST write out extra names in their normalized form.
This applies to the `Provides-Extra`_ field and the `extra marker`_
when used in the `Requires-Dist`_ field.

Tools generating metadata MUST raise an error if a user specified
two or more extra names which would normalize to the same name.
Tools generating metadata MUST raise an error if an invalid extra
name is provided as appropriate for the specified core metadata version.
If an older core metadata version is specified and the name would be
invalid with newer core metadata versions,
tools SHOULD warn the user.
Tools SHOULD warn users when an invalid extra name is read and not use
the name to avoid ambiguity.
Tools MAY raise an error instead of a warning when reading an
invalid name if they so desire.


Backwards Compatibility
=======================

Moving to :pep:`503` normalization and :pep:`508` name acceptance
allows for all preexisting, valid names to continue to be valid.

Based on research looking at a collection of wheels on PyPI [pypi-results]_,
the risk of extra name clashes is limited to 73 clashes when considering
even invalid names,
while *only* looking at valid names leads to only 3 clashes:

1. dev-test: dev_test, dev-test, dev.test
2. dev-lint: dev-lint, dev.lint, dev_lint
3. apache-beam: apache-beam, apache.beam

By requiring tools writing core metadata to only record the normalized name,
the issue of preexisting, invalid extra names should be diminished over
time.

.. [pypi-results] https://discuss.python.org/t/pep-685-comparison-of-extra-names-for-optional-distribution-dependencies/14141/17?u=brettcannon


Security Implications
=====================

It is possible that for a distribution that has conflicting extra names, a
tool ends up installing distributions that somehow weaken the security
of the system.
This is only hypothetical and if it were to occur,
it would probably be more of a security concern for the distributions
specifying such extra names rather than the distribution that pulled
them in together.


How to Teach This
=================

This should be transparent to users on a day-to-day basis.
It will be up to tools to educate/stop users when they select extra
names which conflict.


Reference Implementation
========================

No reference implementation is provided aside from the code above,
but the expectation is the `packaging project`_ will provide a
function in its ``packaging.utils`` that will implement extra name
normalization.
It will also implement extra name comparisons appropriately.
Finally, if the project ever gains the ability to write out metadata,
it will also implement this PEP.


Rejected Ideas
==============

Using setuptools 60's normalization
-----------------------------------

Initially this PEP proposed following setuptools to try and minimize
backwards-compatibility issues.
But after checking various wheels on PyPI,
it became clear that standardizing **all** naming on :pep:`508` and
:pep:`503` semantics was easier and better long-term.


Open Issues
===========

N/A


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


.. _core metadata: https://packaging.python.org/en/latest/specifications/core-metadata/
.. _extra marker: https://peps.python.org/pep-0508/#extras
.. _Name: https://packaging.python.org/en/latest/specifications/core-metadata/#name
.. _packaging project: https://packaging.pypa.io
.. _Provides-Extra: https://packaging.python.org/en/latest/specifications/core-metadata/#provides-extra-multiple-use
.. _Requires-Dist: https://packaging.python.org/en/latest/specifications/core-metadata/#requires-dist-multiple-use

CAM-Gerlach · March 9, 2022, 8:32am

I submitted a pull request with some technical, proofreading and a few copyediting changes to the text of the PEP.

There was one substantive, rather significant issue with the PEP’s content, however, that should be discussed here: the normalization algorithm it specifies does not appear to be the one that represented the final rough consensus on the previous thread. Furthermore, its properties and quirks directly contradict several of the claimed advantages and stated motivations for it elsewhere in the PEP (unlike said algorithm), greatly diminish its practical benefit, and mean that it does not actually solve the original issue that sparked the PEP to begin with, as cited therein (that adhoc-ssl does not compare equal to adhoc_ssl).

The normalization algorithm currently cited in the PEP is:

re.sub('[^A-Za-z0-9.-]+', '_', name).lower()

However, as discussed on the previous issue, the algorithm should instead be

re.sub('[^A-Za-z0-9]+', '_', name).lower()

(i.e., the previous algorithm, except with . and - also normalized to _).

In real-world practice, the latter is exactly equivalent to PEP 503 normalization except with _ as the replacement character, because per PEP 508 and as actually implemented in packaging tools, no characters outside of [A-Za-z0-9._-] have been allowed anywhere in specified extra names.

Using the latter instead of the former means that:

Normalization is actually useful, as the only actual normalization the former algorithm does on currently possible extras names is making test__extra equivalent to test_extra, whereas the latter means that test_extra, test--extra and test.extra will all be normalized to test_extra.
The original issue that sparked the PEP, “the extra adhoc-ssl was not considered equal to the name adhoc_ssl by pip”, is actually solved.
The normalized form will always be a valid Python identifier, as currently required by the Extras spec (whereas the normalization proposed by the PEP, contradicting its claim, has no practical effect on any currently possible Extras name’s validity as a Python identifier, and allows both . and - which are invalid characters anywhere in such).
The strange, unexpected and confusing behavior with test__extra being normalized to test_extra, but test--extra being left alone, is avoided (by normalizing both to test_extra); to wit, the PEP itself is confused on that point, as it states “Runs of characters, unlike PEP 503, do not get collapsed, e.g. ___ stays the same.” when in fact, ___ is collapsed (as I described on the previous thread), while “---” is not.
The normalization is consistent between project and extras names, except for the replacement character.
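
To make the difference concrete, here is a minimal sketch that runs both regexes quoted above over the example names from this post. The function names `option_1` and `option_2` are only labels chosen for illustration; they are not part of setuptools or any other tool.

```python
import re


def option_1(name: str) -> str:
    """Rule currently cited in the PEP: '.' and '-' are left untouched."""
    return re.sub(r"[^A-Za-z0-9.-]+", "_", name).lower()


def option_2(name: str) -> str:
    """Proposed rule: every run of non-alphanumerics becomes a single '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).lower()


for name in ("test__extra", "test--extra", "test.extra", "adhoc-ssl"):
    # Print both normalized forms side by side for comparison.
    print(f"{name:12} option 1: {option_1(name):12} option 2: {option_2(name)}")
```

Under the first rule only test__extra changes, so adhoc-ssl and adhoc_ssl still compare unequal; under the second rule all four names end up using a single underscore.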

As likewise discussed on the previous thread, this has effectively no greater real-world backward compatibility impact than the currently-specified behavior, as the only cases that would be meaningfully affected are very unlikely, fundamentally user-hostile and (based on pip’s behavior) appear to be mostly broken currently anyway:

(to note, given the problem identified by the OP and my later testing, it appears that these extras cannot even currently be selected with pip to begin with) and

which, to note, due to the strangeness of the currently-specified implementation, the above actually has it backwards—a--b is not normalized, but a__b is normalized to a_b.

Jelle · March 9, 2022, 6:27pm

Is it possible to query PyPI to find out whether this specification results in conflicting/duplicate extra names for any package?

pf_moore · March 9, 2022, 6:32pm

Is it possible to query PyPI to find out whether this specification results in conflicting/duplicate extra names for any package?

I have an offline database of PyPI metadata, I can do that - probably tomorrow.

dustin · March 9, 2022, 7:37pm

Unfortunately not, PyPI does not currently store the Provides-Extra metadata field.

brettcannon · March 9, 2022, 9:39pm

That’s not the same feeling I got from the thread, hence the direct lift from safe_extra() in setuptools (i.e. I wrote the PEP in an hour between meetings).

That all seems reasonable to me as reasons to modify the proposed regex. What do other people think?

Also check the proposal for writing metadata as it’s a bit more than what @uranusjr originally proposed by also normalizing Provides-Extra itself and not just the extra marker. Since PyPI doesn’t directly expose it I assumed it was best to write it down normalized and not how it might be written in a config or docs, but if people disagree do let me know.

hroncok · March 10, 2022, 7:12am

Thank you for actually doing this! Suggestion: When the PEP says setuptools does something or pip does something, it should probably mention the version of setuptools/pip, in case the behavior actually changes in the future.

hroncok · March 10, 2022, 7:19am

Setuptools does normalization via:
re.sub('[^A-Za-z0-9.-]+', '_', name).lower()
The use of an underscore/ _ differs from PEP 503’s use of a hyphen/ - . Runs of characters, unlike PEP 503, do not get collapsed, e.g. ___ stays the same.

This part alone is probably contradicting itself. The listed regex does collapse runs of characters. The setuptools function explicitly says that in its docstring:

Any runs of non-alphanumeric characters are replaced with a single ‘_’, and the result is always lowercased.

CAM-Gerlach · March 10, 2022, 7:28am

Yeah, it wasn’t 100% clear to me either—the initial feeling from the thread was to just be conservative and go with safe_extra() as is, but further discussion (including by main players who’d previously suggested the conservative approach) implied that it was sensible to handle these additional cases and any possible breakage would be extremely unlikely and to very user-hostile patterns, and also expressed confusion about the unintuitive, surprising and possibly even unintentional behavior of safe_extra() in the cases the revised approach handles consistently (both internally and vis a vis PEP 503).

It would also be particularly helpful if @pf_moore could check if the change actually affects any existing projects.

This makes sense to me. Do you have a link handy to the specific discussion you’re referring to?

hroncok · March 10, 2022, 7:31am

I am with @CAM-Gerlach on this. The currently proposed regex indeed does not solve the issue I’ve had when I started that discussion.

CAM-Gerlach · March 10, 2022, 7:33am

Actually (as discussed in my above somewhat lengthy comment), both these descriptions are not correct, due to how strange and unintuitive the current behavior is. Runs of _ are replaced by a single _, but runs of - and . (which are non-alphanumeric) are not. The revision I propose above actually does exactly what is documented in the Setuptools docstring (suggesting the implemented behavior may actually be an unintended bug).

pradyunsg · March 10, 2022, 7:38am

Thanks for doing this @brettcannon!

My initial thought is that if we’re normalising extra names so similarly to distribution names, why not just use the same canonicalisation as PEP 503’s distribution names?

Given that we’re standardising with a change in behaviour already, and that one shouldn’t be comparing normalised values against non-normalised values, I think it’s actually reasonable to just go all in and have all names (extras, distribution etc) get normalised in a consistent manner. IIUC, that’s just replacing the _ with a - in this PEP.
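
As a rough check of the "just swap the replacement character" claim, the sketch below compares PEP 503 normalization with the underscore-based regex proposed earlier in the thread over every short string built from a small alphabet of PEP 508 characters. The alphabet, the length limit and the helper names are arbitrary choices made for this illustration.

```python
import itertools
import re


def pep503(name: str) -> str:
    """PEP 503 normalization as quoted in the PEP."""
    return re.sub(r"[-_.]+", "-", name).lower()


def underscore_variant(name: str) -> str:
    """The underscore-based regex proposed earlier in this thread."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).lower()


# Small alphabet covering a letter in both cases, a digit and all three
# separators allowed by PEP 508.
ALPHABET = "aB1._-"

mismatches = []
for length in range(1, 5):
    for chars in itertools.product(ALPHABET, repeat=length):
        name = "".join(chars)
        if underscore_variant(name).replace("_", "-") != pep503(name):
            mismatches.append(name)

print(mismatches)
```

The list stays empty for this alphabet. The two rules only diverge once a name contains a character outside the PEP 508 set, such as a space.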

pradyunsg · March 10, 2022, 7:40am

It’s also less cognitive complexity IMO - there would be only one set of rules for how names related to a distribution get managed and once you understand that, you understand how things will work.

hroncok · March 10, 2022, 7:49am

Fedora statistics. Do note that packaging extras is quite a new thing, and packages need to explicitly pick what extras they decide to include and are encouraged to skip extras that are not useful for other packages (for example build/development requirements, commonly named dev, doc or test), and those that have requirements that are not packaged in Fedora. We already normalize by lowercasing only.

3749 Python packages providing ^python3dist\([^\[]+\) – that means “base” packages, no extras
158 Python packages providing '^python3dist\(.+\[.+\]\)' – that means an extra
11 Python packages having nonalphanumeric characters in the extra (^python3dist\(.+\[.*[^a-z0-9]+.*\]\)):
- 10 Python packages having underscore in the extra ('^python3dist\(.+\[.*_.*\]\)'):
  - databases[mysql_asyncmy]
  - databases[postgresql_aiopg]
  - django-timezone-field[rest_framework]
  - python-engineio[asyncio_client]
  - python-socketio[asyncio_client]
  - sqlalchemy[mssql_pymssql]
  - sqlalchemy[mssql_pyodbc]
  - sqlalchemy[postgresql_asyncpg]
  - sqlalchemy[postgresql_pg8000]
  - webscrapbook[adhoc_ssl]
- 1 Python package having dash in the extra ('^python3dist\(.+\[.*-.*\]\)'):
  - google-api-core[grpcio-gcp]
- 0 Python packages having dot in the extra ('^python3dist\(.+\[.*\..*\]\)')

Not much data, but some.

CAM-Gerlach · March 10, 2022, 8:02am

From a conceptual standpoint, and perhaps somewhat a practical one, it certainly is very attractive to use the exact same algorithm both places. To note, though, it doesn’t handle any extras names that contain characters outside of [A-Za-z0-9.\-_]. While they are prohibited by one of the two referenced specs and, at least per my testing, are not allowed in the versions of the tools I tested, this would be a hard break to backward compat if any are used in practice; hopefully @pf_moore’s results can clarify this.

Aside from that, a few reasons (though none of them hard blockers):

This requires a much more significant change to the current spec, which presently requires valid Python identifiers—I’m not sure what the rationale is, though.
This is a change to the existing (at least Setuptools) normalization, which uses _
Per limited results and anecdotal experience, _ is overwhelmingly more common in existing extras names, so the normalized names would be much more inconsistent with the original ones—it seems users are used to such

Given the spec, current normalization and actual user usage all prefer _, I’m not sure it’s worth the cognitive dissonance to change that now for the sake of nominal consistency…though I don’t feel too strongly about that.

FYI, this also requires the change I proposed above, as right now - and . don’t get normalized (nor collapsed) at all.

encukou · March 10, 2022, 10:54am

Since this is a change to packaging specifications, shouldn’t PEP 508 be converted to a spec first, with this PEP proposing a change to that spec?

pf_moore · March 10, 2022, 12:46pm

OK, here’s some basic stats.

I have the metadata from 2,124,900 wheels from PyPI (I don’t have data for packages that don’t ship wheels).
There are a total of 7338 unique extras across all of those packages. That strikes me as surprisingly low.
I’ve uploaded the list of all those extras as “All extras used in wheels from PyPI” on GitHub.

The situation is a bit of a mess, though. The Provides-Extra metadata says “A string containing the name of an optional feature. Must be a valid Python identifier.” However, PEP 508 defines extras via

identifier_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
identifier    = < letterOrDigit identifier_end* >
name          = identifier
extras_list   = identifier:i (wsp* ',' wsp* identifier)*:ids -> [i] + ids
extras        = '[' wsp* extras_list?:e wsp* ']' -> e

I really hate that grammar, but if I read it right, it allows extras to be a string of letters, digits, -, _, or ., starting with a letter or a digit (so “3.6” is a valid extra!)

Of the 7338 extras I identified, 104 don’t conform to PEP 508, and 1258 are not Python identifiers. The discrepancy in numbers is mostly because Python identifiers don’t allow dots or dashes.

I collected lists of all cases where the normalisation algorithm resulted in 2 different extras normalising to the same value. I did this across all extras, not by package, so these do not necessarily imply that normalising would cause clashes within a package (I’d be extremely surprised if that ever happened, but I’d have to do a re-scan of the database to verify that).

I looked at the following 3 algorithms:

Option 1 - re.sub('[^A-Za-z0-9.-]+', '_', name).lower()
Option 2 - re.sub('[^A-Za-z0-9]+', '_', name).lower()
PEP 503 - re.sub(r"[-_.]+", "-", name).lower()

In all cases, I removed any cases where the only reason for a clash was uppercase vs lowercase, on the assumption that we definitely want extras to be matched case insensitively, so we can assume such cases are intended to map to the same canonical form.

The results:

Option 1 - 24 clashes
Option 2 - 99 clashes
PEP 503 - 73 clashes

The most common other difference was extras which contained spaces. I feel like we’d definitely want to canonicalise “tensorflow with gpu” and “tensorflow_with_gpu” to the same value. PEP 503 is the odd one out here, as it doesn’t normalise spaces, so “a b” and “a_b” are different under PEP 503 rules. I think that’s probably a strike against using pure PEP 503. However, it’s worth noting that values with spaces are not valid extras according to either the core metadata spec, or to PEP 508.
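
A small sketch of the point about spaces, using the three regexes listed above. The function names are labels chosen for this example only.

```python
import re


def option_1(name: str) -> str:
    return re.sub(r"[^A-Za-z0-9.-]+", "_", name).lower()


def option_2(name: str) -> str:
    return re.sub(r"[^A-Za-z0-9]+", "_", name).lower()


def pep503(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


spaced, underscored = "tensorflow with gpu", "tensorflow_with_gpu"
for rule in (option_1, option_2, pep503):
    # True when the rule treats the two spellings as the same extra.
    print(rule.__name__, rule(spaced) == rule(underscored))
```

Only the PEP 503 rule reports the two spellings as different, because its regex never touches the space character.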

If we limit the checks to only valid extras according to PEP 508, option 1 generates no clashes other than case sensitivity, Option 2 and PEP 503 only generated

dev-test: dev_test, dev-test, dev.test
dev-lint: dev-lint, dev.lint, dev_lint
apache-beam: apache-beam, apache.beam

(which seem fine, to me). Limiting the result to just valid core metadata (Python identifiers) none of the approaches caused any clashes.
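
For anyone wanting to repeat the clash count on their own data, this is a minimal sketch of the approach described above: group names by normalized form and ignore groups that differ only by case. It is not the script used for the numbers in this post, and the sample list is made up for illustration from names mentioned in this thread.

```python
import re
from collections import defaultdict


def pep503(name: str) -> str:
    return re.sub(r"[-_.]+", "-", name).lower()


def find_clashes(names, normalize):
    """Map each normalized form to the distinct spellings that produce it.

    Spellings are lowercased first so that case-only differences are not
    reported as clashes.
    """
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name.lower())
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Hypothetical sample, not the real PyPI data set.
sample = ["dev_test", "dev-test", "dev.test", "apache-beam", "apache.beam",
          "Docs", "docs", "test"]

for normalized, spellings in find_clashes(sample, pep503).items():
    print(f"{normalized}: {', '.join(spellings)}")
```

Passing a different `normalize` callable gives the equivalent count for the other options.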

I don’t really know what to make of all this. I think there are probably a number of actions to take:

Decide if PEP 685 wants to take a stand on how “invalid” extra names get normalised. If it dismisses that possibility, then PEP 503 normalisation probably wins due to being consistent with elsewhere, but all of the stated variations work, insofar as they enable case insensitive comparison that treats “.”, “-” and “_” the same.
Fix the mess that is the definition of what constitutes a valid extra. We have 2 different specs which are inconsistent, and from a practical standpoint it doesn’t look like tools enforce either standard [1]

Personally, I think that PEP 685 should accept that invalid extras exist, and explicitly note that tools can apply PEP 685 normalisation to such non-standard extras. Part of me wants to say that it should say that tools SHOULD warn if applying normalisation to an invalid extra, but without a clear definition of what’s valid, that seems like it will only cause more confusion. As for standardising valid extras, I’d like to fix that, but the only solution that feels to me like it would be straightforward would be to update the metadata spec to state that Provides-Extra must follow PEP 508 format.

In theory, yes, it probably should. But that would be a fairly significant undertaking, and we’d probably not make progress if we tied this proposal to doing that.

In fact, it might be good to treat moving the various packaging specs out of the existing PEPs and into the “PyPA specifications” section of the Python Packaging User Guide as a standalone project, which might be something @smm would be able to help co-ordinate. It’s something that really needs people with technical writing skills, rather than coders, which is probably why it’s only getting done in bits, with no real momentum behind it.

[1] Disclaimer: I didn’t check the age of the wheels I scanned, it’s possible that older versions of tools allowed arbitrary extras but that has since been fixed. Someone should check this.

brettcannon · March 10, 2022, 9:39pm

Ideally, yes, but I don’t have that sort of time. Also see “Bring over PEPs 517, 518, and 660 to the specs section” (Issue #955 in pypa/packaging.python.org on GitHub) for other PEPs that still need to be brought over.

You’re reading it correctly.

Yeah, no clear winner. All normalization approaches seem acceptable.

I don’t think it’s worth explicitly addressing beyond suggesting tools warn users about them.

It also works with PEP 508, so that means only the core metadata spec requires a potential update to unify what a valid extra name is (not sure if that requires a new core metadata version since the old names would still be valid?).

I think “invalid” would be anything that doesn’t match the grammar specified in PEP 508 which is what the core metadata spec specifies for Name already:

r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$"

We could add a packaging.utils.check_name() function that checks if that regex matches a name (which I probably want anyway to help validate metadata before it’s written out).
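
A minimal sketch of what such a check could look like, assuming the regex above is applied case-insensitively as PEP 508 describes. `check_name` here is a hypothetical standalone function written for this example; it is not an existing `packaging.utils` API.

```python
import re

# Regex quoted above, compiled case-insensitively.
_PEP508_NAME = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE
)


def check_name(name: str) -> bool:
    """Return True if ``name`` matches the PEP 508 name grammar.

    Hypothetical helper; fullmatch() is used so that a trailing newline
    is rejected as well.
    """
    return _PEP508_NAME.fullmatch(name) is not None


for candidate in ("adhoc_ssl", "grpcio-gcp", "3.6", "tensorflow with gpu", "-dev"):
    # Contrast the PEP 508 rule with the "valid Python identifier" rule
    # from the current core metadata spec.
    print(f"{candidate!r}: PEP 508 {check_name(candidate)}, "
          f"identifier {candidate.isidentifier()}")
```

The output shows why the two specs disagree: grpcio-gcp and 3.6 pass the PEP 508 check but are not Python identifiers.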

I agree since it means preexisting extras based on the current core metadata spec are still valid.

pf_moore · March 10, 2022, 9:55pm

Good point. I was getting distracted by the fact that some pre-existing extras would be invalid under the new spec. But they are invalid under the old spec too, so that’s not particularly compelling.

I did a quick check, and it appears that current setuptools normalises extra names (“a sample” gets stored in the metadata as “a_sample”) so the invalid extras I identified are likely from older releases. I might do some checking at some point to confirm that.

I forgot to say, but I agree, it seems to me that can just be a PR to the spec rather than a new version / PEP.

The process says

If a change being considered this way has the potential to affect software interoperability, then it must be escalated to the Packaging category of the Python.org Discourse for discussion, where it will be either approved as a text-only change, or else directed to the PEP process for specification updates.

so I’d say that as long as no-one objects here, we’re OK to treat it as a text-only change.

brettcannon · March 10, 2022, 10:24pm

I’ve updated the PEP based on the feedback:

Specify the versions of pip and setuptools.
Loosen naming requirements to match PEP 508.
Use PEP 503 normalization.
Add some more references.
Said tools SHOULD warn when the extra name is invalid.
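
Put together, the updated rules for a tool writing metadata could be sketched roughly as below. This is an illustration under assumptions and not a reference implementation: `normalize_extra` and `normalized_extras` are hypothetical names, and treating an exact repeat of the same spelling as harmless is a choice made for this example.

```python
import re


def normalize_extra(name: str) -> str:
    """PEP 503 normalization applied to an extra name."""
    return re.sub(r"[-_.]+", "-", name).lower()


def normalized_extras(declared):
    """Return the normalized extras, rejecting spellings that collide."""
    seen = {}  # normalized form -> first spelling that produced it
    for name in declared:
        key = normalize_extra(name)
        if key in seen and seen[key] != name:
            raise ValueError(
                f"extras {seen[key]!r} and {name!r} both normalize to {key!r}"
            )
        seen.setdefault(key, name)
    return sorted(seen)


print(normalized_extras(["Dev_Test", "docs"]))

try:
    normalized_extras(["dev_test", "dev.test"])
except ValueError as exc:
    print(f"error: {exc}")
```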