What extras names are treated as equal and why?

Hello. In Fedora, we’ve recently started to deal with Python extras. For Python package names, we already generate RPM Provides and Requires of the form:

python3.Xdist(<normalized_package_name>)

E.g.:

python3.9dist(setuptools-scm)

Those are used at the RPM level to bring in the required packages. It works reasonably well.

We follow the normalization rules from PEP 503. It is crucial to us that the names are normalized, because at the RPM level, setuptools-scm != setuptools_scm. By using the normalized form for both Provides and Requires, we don’t have this problem.
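A minimal sketch of that normalization, following the rule given in PEP 503 (runs of -, _ and . collapse to a single -, then the result is lowercased). The helper rpm_dist_name and its python_version parameter are hypothetical, used only to show the shape of the generated string; this is not Fedora's actual generator code.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def rpm_dist_name(name: str, python_version: str = "3.9") -> str:
    """Hypothetical helper: build a python3.Xdist(...) string."""
    return f"python{python_version}dist({normalize_name(name)})"


# Different spellings of the same project end up identical.
print(rpm_dist_name("setuptools_scm"))
print(rpm_dist_name("Setuptools.SCM"))
```

Both calls produce the same string, which is what lets a Requires match a Provides regardless of how the upstream metadata spelled the name.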

We have recently started to generate such RPM Provides and Requires for extras as well, in the form:

python3.Xdist(<normalized_package_name>[<extras_name>])

E.g.

python3.9dist(setuptools-scm[toml])

Everything worked well :rainbow: :unicorn:

Until we started to deal with case sensitivity, underscores and such. It appears that pip treats dnspython[DNSSEC] and dnspython[dnssec] equally (good!), but it does not treat webscrapbook[adhoc_ssl] and webscrapbook[adhoc-ssl] equally :exploding_head:

What are the normalization rules for extras names and where are they defined? And if they are undefined, should they be defined? Would re-using the same normalizing function that exists for package names make sense?

Thanks.

There is some specification at Core metadata specifications — Python Packaging User Guide (important to note that an extra like foo[bar, baz] is legal) but normalization is not included there.

This is probably an implicit standard via the implementation details of distutils/setuptools/pip that could stand to be explicitly defined.

This is probably an implicit standard via the implementation details of distutils/setuptools/pip that could stand to be explicitly defined.

And like many things in this area, it’s complicated :stuck_out_tongue:

The de facto standard (implemented by setuptools) is to convert every run of characters outside [A-Za-z0-9.-] to a single underscore, and convert to lowercase. Users can specify effectively anything in setup.py.
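As a rough illustration, the rule described in this post can be written as a one-line regular expression substitution. The function name safe_extra_sketch is made up for this example, and it is a re-implementation from the description, not an import of setuptools' own code.

```python
import re


def safe_extra_sketch(extra: str) -> str:
    # Every run of characters outside [A-Za-z0-9.-] becomes one "_",
    # then the result is lowercased. "-" and "." are left untouched.
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


for raw in ["DNSSEC", "adhoc_ssl", "adhoc-ssl", "my extra!!name"]:
    print(raw, "->", safe_extra_sketch(raw))
```

Under this rule, case differences disappear, but adhoc_ssl and adhoc-ssl stay distinct because - is inside the allowed character set and is never rewritten.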

But Core Metadata (linked above) says an extra name (quote) Must be a valid Python identifier; in other words, it can’t contain - or ., can’t start with a digit, etc.[1] This is in direct conflict with setuptools’s implementation, which would happily produce extra names that are invalid according to Core Metadata. (I’m not sure if wheel implements additional logic to block those names from being written into a wheel.)
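To see how strict the "valid Python identifier" wording is, one could check a few extra names with str.isidentifier(). The helper below is hypothetical; the extra isascii() check reflects the assumption in the footnote that non-ASCII characters are also excluded.

```python
def is_identifier_extra(name: str) -> bool:
    # ASCII-only is an assumption taken from the footnote, since
    # str.isidentifier() on Python 3 also accepts non-ASCII letters.
    return name.isascii() and name.isidentifier()


candidates = ["toml", "adhoc_ssl", "adhoc-ssl", "my.extra", "2fa"]
for name in candidates:
    print(f"{name!r}: {is_identifier_extra(name)}")
```

Only toml and adhoc_ssl pass; names with a hyphen, a dot, or a leading digit are rejected even though they are common in practice.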

So if we’re going to have a standard on this, this would be a good opportunity to fix the available character set. I’m not sure what the best rule is here considering backwards compatibility. The general form is easy to agree on, but there are various edge cases like whether the name can begin with a digit or a non-alphanumeric character.

The normalisation rule is probably good enough as-is, although I’d personally hope we can change to a rule similar to PEP 503 to avoid user confusion.

[1]: Since the rule was written pre-Python 3, I assume non-ASCII characters are also out.

Packaging (and hence our tooling) considers this as two extra names and we reduce the problem to foo[bar] && foo[baz].

I’d happily help to define it explicitly.

That happens at “build time”. While definitely important, it is not what is determinative for this problem. I need to know what conversions (should) happen in installers when they resolve dependencies. Does pip follow the same logic?

It parses the requirement via packaging:

import string
from pyparsing import Combine, Word, ZeroOrMore

ALPHANUM = Word(string.ascii_letters + string.digits)
PUNCTUATION = Word("-_.")
IDENTIFIER_END = ALPHANUM | (ZeroOrMore(PUNCTUATION) + ALPHANUM)
IDENTIFIER = Combine(ALPHANUM + ZeroOrMore(IDENTIFIER_END))
EXTRA = IDENTIFIER

So it fails if it contains characters like ! or @. Good.

It seems that it does not consider any of the listed punctuation as equal.
It seems that it simply ignores the case.

Let’s standardize what packaging already uses to parse it? It can begin with a digit, but not with punctuation.
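For readers who don't use pyparsing, the grammar quoted above appears to be roughly equivalent to the regular expression below: the name starts and ends with an ASCII alphanumeric, and -, _ and . may appear only in between. EXTRA_RE and is_valid_extra are names invented for this sketch, and the equivalence is my reading of the grammar rather than something taken from packaging.

```python
import re

# Starts and ends with an ASCII alphanumeric; "-", "_" and "." may
# appear only between alphanumerics.
EXTRA_RE = re.compile(r"[A-Za-z0-9]+(?:[-_.]+[A-Za-z0-9]+)*")


def is_valid_extra(name: str) -> bool:
    return EXTRA_RE.fullmatch(name) is not None


for name in ["adhoc-ssl", "adhoc_ssl", "2fa", "-ssl", "ssl-", "a!b"]:
    print(f"{name!r}: {is_valid_extra(name)}")
```

The first three names are accepted; the last three are rejected because they begin or end with punctuation or contain a character outside the allowed set.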

For 100% backwards compatibility, we could define the normalization rule as:

def normalize(extra_name):
    return extra_name.lower()

If we are not afraid of changes, we could define it to what PEP 503 does for names:

import re
def normalize(extra_name):
    return re.sub(r"[-_.]+", "-", extra_name).lower()

That would require some changes in pip (or some of the vendored libraries). It would only blow up if some projects define multiple extras with names that only differ in punctuation. I’d say that is a no-risk, but maybe I am an optimist.
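To make the risk concrete, here is a small sketch that applies both candidate rules to a list of declared extras and reports names that would collide after normalization. The function names and the declared list are made up for illustration.

```python
import re
from collections import defaultdict


def normalize_lower(extra_name: str) -> str:
    return extra_name.lower()


def normalize_pep503_style(extra_name: str) -> str:
    return re.sub(r"[-_.]+", "-", extra_name).lower()


def find_collisions(extras, normalize):
    """Group declared extras that share a normalized form."""
    groups = defaultdict(list)
    for extra in extras:
        groups[normalize(extra)].append(extra)
    return {key: names for key, names in groups.items() if len(names) > 1}


declared = ["adhoc_ssl", "adhoc-ssl", "DNSSEC"]  # made-up extras list
print(find_collisions(declared, normalize_lower))
print(find_collisions(declared, normalize_pep503_style))
```

With the lowercase-only rule nothing collides; with the PEP 503 style rule, adhoc_ssl and adhoc-ssl fall into the same group, which is the only situation where the stricter rule would change behaviour for a project.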

Should I PEP this? The PEP would define the rules of “valid” extra names (defined by current packaging parser) and their normalized form (one of the above).

A gentle bump. I offer to work on a PEP, but I’d like to have a consensus first.

Sorry for the late reply (I was actually reminded by an entirely unrelated thread elsewhere that wants to add an uncommon extra name…)

Yes, pip normalises all extra names when they come in with the same safe_extra() function. So

  1. Only ASCII alphanumerics, -, _, and . are allowed in an extra name.
  2. - and . are normalised to _ for comparison. So foo[my-bar], foo[my_bar], and foo[my.bar] are equivalent.
  3. Case is folded.
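The three rules in this list can be sketched as follows. This transcribes the list literally and is not taken from pip's source; the function name is hypothetical, and replacing each - or . individually (rather than collapsing runs) is an assumption, since the list does not say.

```python
import re

# Rule 1: only ASCII alphanumerics, "-", "_" and "." are allowed.
_VALID_CHARS = re.compile(r"[A-Za-z0-9._-]+")


def normalize_extra_as_described(name: str) -> str:
    if _VALID_CHARS.fullmatch(name) is None:
        raise ValueError(f"invalid extra name: {name!r}")
    # Rule 2: "-" and "." become "_". Rule 3: case is folded.
    return re.sub(r"[-.]", "_", name).lower()


for name in ["my-bar", "my_bar", "my.bar", "My-Bar"]:
    print(name, "->", normalize_extra_as_described(name))
```

All four spellings map to my_bar under these rules.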

I think this would actually not cause any backwards compatibility issues for pip. Can’t say for other installers, but I don’t really care (they should’ve asked for a clarification like this thread before doing something different).

Does this count as a “small specification change” described in PyPA Governance - Specification Updates? I’d say let’s first send a clarification PR to PyPA specifications, and write a PEP only if someone thinks it is required. :slightly_smiling_face:

From the rules referenced in that thread:

If a change being considered this way has the potential to affect software interoperability, then it must be escalated to the distutils-sig mailing list for discussion, where it will be either approved as a text-only change, or else directed to the PEP process for specification updates.

It’s not entirely clear what constitutes “approval” here, but I’d take the view that getting a consensus is sufficient, with the PEP-delegate having a veto (not that the latter is relevant in this case - see next sentence :wink:).

I’m fine with doing this as a spec update, as long as no-one else wants to argue that it needs to be a PEP.

If that is the intended behavior, there must be a bug somewhere:

(venv) [tmp]$ pip install -U pip
...
Successfully installed pip-21.1

(venv) [tmp]$ pip install 'webscrapbook[adhoc-ssl]'
Collecting webscrapbook[adhoc-ssl]
  Downloading webscrapbook-0.40.0-py3-none-any.whl (146 kB)
     |████████████████████████████████| 146 kB 2.1 MB/s 
WARNING: webscrapbook 0.40.0 does not provide the extra 'adhoc-ssl'
...
Installing collected packages: MarkupSafe, werkzeug, jinja2, itsdangerous, click, lxml, flask, commonmark, webscrapbook
Successfully installed MarkupSafe-1.1.1 click-7.1.2 commonmark-0.9.1 flask-1.1.2 itsdangerous-1.1.0 jinja2-2.11.3 lxml-4.6.3 webscrapbook-0.40.0 werkzeug-1.0.1

(venv) [tmp]$ pip install 'webscrapbook[adhoc_ssl]'
...
Installing collected packages: pycparser, cffi, cryptography
Successfully installed cffi-1.14.5 cryptography-3.4.7 pycparser-2.20

That is why I started this thread.

Works for me! I’ve only said PEP because I assumed it was required.

Note that I said it needs “consensus” - I’d like to see a few more people agree that the proposed behaviour is acceptable here before it goes to a PR. The key is to give interested parties a chance to see the proposal and comment in a well-known forum, and the tracker for a PR doesn’t qualify for that IMO.

Indeed! I checked again and pip’s extra normalisation behaviour is quite convoluted and erratic. There is code to normalise extras, but I couldn’t find anywhere that the logic is ever reached, and I can’t help but wonder whether this worked at some point in the past and silently broke without anyone noticing. I guess that’s yet another reason to have an enforced specification around this…
