What extras names are treated as equal and why?

Hello. In Fedora, we’ve recently started to deal with Python extras. For Python package names, we already generate RPM Provides and Requires in the form:

python3.Xdist(<normalized_package_name>)

E.g.:

python3.9dist(setuptools-scm)

Those are used on RPM level to bring in the required packages. It works reasonably well.

We follow the normalization rules from PEP 503. It is crucial to us that the names are normalized, because at the RPM level, setuptools-scm != setuptools_scm. By using the normalized form for both Provides and Requires, we don’t have this problem.
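
To make the name handling concrete, here is a minimal sketch of PEP 503 normalization feeding the Provides string. The `rpm_dist_tag` helper and its `python_version` parameter are hypothetical names used only for illustration; they are not Fedora's actual generator code.

```python
import re


def normalize_name(name: str) -> str:
    """PEP 503: collapse runs of '-', '_' and '.' into one '-', then lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def rpm_dist_tag(name: str, python_version: str = "3.9") -> str:
    """Hypothetical helper: format a name as python3.Xdist(<normalized name>)."""
    return f"python{python_version}dist({normalize_name(name)})"


# Different spellings of the same project end up with the same tag.
print(rpm_dist_tag("setuptools_scm"))
print(rpm_dist_tag("Setuptools.SCM"))
```

Because both Provides and Requires go through the same function, the spelling used upstream no longer matters at the RPM level.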

We have recently started to generate such RPM Provides and Requires for extras as well, in the form:

python3.Xdist(<normalized_package_name>[<extras_name>])

E.g.

python3.9dist(setuptools-scm[toml])

Everything worked well :rainbow: :unicorn:

Until we started to deal with case sensitivity, underscores and such. It appears that pip treats dnspython[DNSSEC] and dnspython[dnssec] equally (good!), but it does not treat webscrapbook[adhoc_ssl] and webscrapbook[adhoc-ssl] equally :exploding_head:

What are the normalization rules for extras names and where are they defined? And if they are undefined, should they be defined? Would re-using the same normalizing function that exists for package names make sense?

Thanks.

There is some specification at Core metadata specifications — Python Packaging User Guide (important to note that an extra like foo[bar, baz] is legal) but normalization is not included there.

This is probably an implicit standard via the implementation details of distutils/setuptools/pip that could stand to be explicitly defined.

This is probably an implicit standard via the implementation details of distutils/setuptools/pip that could stand to be explicitly defined.

And like many things in this area, it’s complicated :stuck_out_tongue:

The de facto standard (implemented by setuptools) is to convert each run of characters outside [A-Za-z0-9.-] to a single underscore, and convert to lowercase. Users can specify effectively anything in setup.py.
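
As a rough sketch, the rule described above can be written as follows. This is a re-implementation from the description in this post, not a copy of the setuptools source, and the name `safe_extra_sketch` is made up for the example.

```python
import re


def safe_extra_sketch(extra: str) -> str:
    """Replace each run of characters outside [A-Za-z0-9.-] with one '_', then lowercase."""
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


print(safe_extra_sketch("DNSSEC"))     # case is folded
print(safe_extra_sketch("adhoc-ssl"))  # '-' is inside the allowed set, so it is kept
print(safe_extra_sketch("adhoc_ssl"))  # '_' is outside the set, replaced by '_'
```

Note that under this rule a hyphen and an underscore do not collapse to the same character, so adhoc-ssl and adhoc_ssl stay distinct.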

But Core Metadata (linked above) says an extra name “must be a valid Python identifier”. In other words, it can’t contain - or ., can’t start with a digit, etc.[1] This is in direct conflict with setuptools’s implementation, which would happily produce extra names that are invalid according to Core Metadata. (I’m not sure if wheel implements additional logic to block those names from being written into a wheel.)
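
For comparison, a quick sketch of what "valid Python identifier" would allow. The ASCII restriction follows the assumption in footnote [1]; `is_identifier_extra` is a hypothetical helper name, and `str.isascii()` needs Python 3.7 or later.

```python
def is_identifier_extra(extra: str) -> bool:
    """True if the extra is an ASCII-only Python identifier (assumption: ASCII only)."""
    return extra.isascii() and extra.isidentifier()


for candidate in ["adhoc_ssl", "adhoc-ssl", "x.y", "9lives"]:
    print(candidate, is_identifier_extra(candidate))
```

Only the underscore spelling passes, which is why names that setuptools accepts can fail the Core Metadata wording.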

So if we’re going to have a standard on this, this would be a good opportunity to fix the available character set. I’m not sure what the best rule is here considering backwards compatibility. The general form is easy to agree on, but there are various edge cases like whether the name can begin with a digit or a non-alphanumeric character.

The normalisation rule is probably good enough as-is, although I’d personally hope we can change to a rule similar to PEP 503 to avoid user confusion.

[1]: Since the rule was written pre-Python 3, I assume non-ASCII characters are also out.

Packaging (and hence our tooling) considers this as two extra names and we reduce the problem to foo[bar] && foo[baz].

I’d happily help to define it explicitly.

That happens at “build time”. While definitely important, it is not what determines the outcome for this problem. I need to know what conversions (should) happen in installers when they resolve dependencies. Does pip follow the same logic?

It parses the requirement via packaging:

import string

from pyparsing import Combine, Word, ZeroOrMore

ALPHANUM = Word(string.ascii_letters + string.digits)
PUNCTUATION = Word("-_.")
IDENTIFIER_END = ALPHANUM | (ZeroOrMore(PUNCTUATION) + ALPHANUM)
IDENTIFIER = Combine(ALPHANUM + ZeroOrMore(IDENTIFIER_END))
EXTRA = IDENTIFIER

So it fails if it contains characters like ! or @. Good.

It seems that it does not consider any of the listed punctuation as equal.
It seems that it simply ignores the case.

Let’s standardize what packaging already uses to parse it? It can begin with a digit, but not with punctuation.
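
If it helps the discussion, the grammar above can be approximated with a plain regular expression. This is my own translation, assuming I read the grammar correctly (starts and ends with an ASCII alphanumeric, with -, _ and . allowed in between); `is_valid_extra` is a hypothetical name.

```python
import re

# Starts and ends with an ASCII letter or digit; '-', '_' and '.' only in between.
_EXTRA_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_valid_extra(name: str) -> bool:
    """Check a candidate extra name against the sketched grammar."""
    return _EXTRA_RE.fullmatch(name) is not None


for candidate in ["adhoc-ssl", "9lives", "-foo", "foo-", "a!b"]:
    print(candidate, is_valid_extra(candidate))
```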

For 100% backwards compatibility, we could define the normalization rule as:

def normalize(extra_name):
    return extra_name.lower()

If we are not afraid of changes, we could define it to what PEP 503 does for names:

import re

def normalize(extra_name):
    return re.sub(r"[-_.]+", "-", extra_name).lower()

That would require some changes in pip (or some of the vendored libraries). It would only blow up if some projects define multiple extras with names that only differ in punctuation. I’d say that is a no-risk, but maybe I am an optimist.
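
To gauge that risk, one could scan a project's declared extras for names that collide after normalization. A minimal sketch, assuming the PEP 503-style rule; `find_collisions` is a hypothetical helper, and the input list would come from wherever the metadata is read.

```python
import re
from collections import defaultdict


def normalize(extra_name: str) -> str:
    """PEP 503-style rule applied to an extra name."""
    return re.sub(r"[-_.]+", "-", extra_name).lower()


def find_collisions(extras):
    """Map each normalized name to the distinct spellings that share it (2 or more)."""
    groups = defaultdict(set)
    for extra in extras:
        groups[normalize(extra)].add(extra)
    return {norm: sorted(names) for norm, names in groups.items() if len(names) > 1}


print(find_collisions(["adhoc_ssl", "adhoc-ssl", "toml"]))
```

An empty result means the project would be unaffected by the stricter rule.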

Should I PEP this? The PEP would define the rules of “valid” extra names (defined by current packaging parser) and their normalized form (one of the above).

A gentle bump. I offer to work on a PEP, but I’d like to have a consensus first.

Sorry for the late reply (I was actually reminded by an entirely unrelated thread elsewhere that wants to add an uncommon extra name…)

Yes, pip normalises all extra names when they come in with the same safe_extra() function. So

  1. Only ASCII alphanumerics, -, _, and . are allowed in an extra name.
  2. - and . are normalised to _ for comparison. So foo[my-bar], foo[my_bar], and foo[my.bar] are equivalent.
  3. Case is folded.

I think this would actually not cause any backwards compatibility issues for pip. Can’t say for other installers, but I don’t really care (they should’ve asked for a clarification like this thread before doing something different).

Does this count as a “small specification change” described in PyPA Governance - Specification Updates? I’d say let’s send a clarification PR to PyPA specifications first, and write a PEP only if someone thinks it is required. :slightly_smiling_face:

From the rules referenced in that thread:

If a change being considered this way has the potential to affect software interoperability, then it must be escalated to the distutils-sig mailing list for discussion, where it will be either approved as a text-only change, or else directed to the PEP process for specification updates.

It’s not entirely clear what constitutes “approval” here, but I’d take the view that getting a consensus is sufficient, with the PEP-delegate having a veto (not that the latter is relevant in this case - see next sentence :wink:).

I’m fine with doing this as a spec update, as long as no-one else wants to argue that it needs to be a PEP.

If that is the intended behavior, there must be a bug somewhere:

(venv) [tmp]$ pip install -U pip
...
Successfully installed pip-21.1

(venv) [tmp]$ pip install 'webscrapbook[adhoc-ssl]'
Collecting webscrapbook[adhoc-ssl]
  Downloading webscrapbook-0.40.0-py3-none-any.whl (146 kB)
     |████████████████████████████████| 146 kB 2.1 MB/s 
WARNING: webscrapbook 0.40.0 does not provide the extra 'adhoc-ssl'
...
Installing collected packages: MarkupSafe, werkzeug, jinja2, itsdangerous, click, lxml, flask, commonmark, webscrapbook
Successfully installed MarkupSafe-1.1.1 click-7.1.2 commonmark-0.9.1 flask-1.1.2 itsdangerous-1.1.0 jinja2-2.11.3 lxml-4.6.3 webscrapbook-0.40.0 werkzeug-1.0.1

(venv) [tmp]$ pip install 'webscrapbook[adhoc_ssl]'
...
Installing collected packages: pycparser, cffi, cryptography
Successfully installed cffi-1.14.5 cryptography-3.4.7 pycparser-2.20

That is why I started this thread.

Works for me! I’ve only said PEP because I assumed it was required.

Note that I said it needs “consensus” - I’d like to see a few more people agree that the proposed behaviour is acceptable here before it goes to a PR. The key is to give interested parties a chance to see the proposal and comment in a well-known forum, and the tracker for a PR doesn’t qualify for that IMO.

Indeed! I checked again and pip’s extra normalisation behaviour is quite convoluted and erratic. There is code to normalise extras, but I couldn’t find anywhere that the logic is ever reached, and I can’t help but wonder whether this worked at some point in the past and silently broke without anyone noticing. I guess that’s yet another reason to have an enforced specification around this…

There has been no disagreement here, but neither has there been agreement. Is there anywhere I should go to promote this discussion?

I don’t know, to be honest. Maybe it needs a PEP just to raise visibility and trigger a proper discussion. At a minimum, given that setuptools and pip are mentioned in the thread, I think you need to get explicit confirmation from the maintainers of those projects that they don’t have a problem with whatever you are proposing. @uranusjr has commented here already, so I guess that’s sufficient for pip (I’m also a pip maintainer but I haven’t researched the question and how it would affect pip).

If you can’t get sufficient voices confirming it can be done as a spec update, the fallback is that it’s a PEP.

I think a PEP is needed, but I have not had enough time to write it yet. The specification should be fairly short, but is substantial enough to be its own document IMO. It needs to standardise at least two things:

  1. What extra names should be considered equivalent. This should use pkg_resources.safe_extra() since it’s the only logic that can possibly work without breaking things.

  2. How extras should be compared in a PEP 508 environment marker. This is needed because PEP 508 does not sufficiently define how the value of extra should be compared. There are two possibilities:

    1. The standard should mandate all metadata-producing tools to normalise an extra before putting it into a marker (i.e. should write e.g. foo; extra == 'x.y' instead of foo; extra == 'X.Y').
    2. Marker evaluation logic (e.g. packaging.markers) must be amended to perform normalisation when comparing markers (i.e. Marker("extra == 'X.Y'").evaluate({'extra': 'x.y'}) must return True).

    I think the latter solution is likely more viable, since we can’t fix metadata in existing packages on PyPI.
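
A minimal sketch of the second possibility: normalise both sides before comparing, so that un-normalised metadata already on PyPI still matches. This does not use packaging's real marker machinery; `extra_marker_matches` and `safe_extra_sketch` are hypothetical names, and the normalisation rule is passed in because which rule to use is exactly what is undecided.

```python
import re
from typing import Callable


def safe_extra_sketch(extra: str) -> str:
    """Re-implementation of the safe_extra-style rule described earlier in the thread."""
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


def extra_marker_matches(
    marker_value: str,
    requested_extra: str,
    normalize: Callable[[str], str] = safe_extra_sketch,
) -> bool:
    """Evaluate `extra == marker_value` with both sides normalised first."""
    return normalize(marker_value) == normalize(requested_extra)


print(extra_marker_matches("X.Y", "x.y"))
print(extra_marker_matches("x.y", "toml"))
```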

This topic actually came up again when I was working on pip’s importlib.metadata support. It’s really a PITA that I wish more people would take an interest in.

(See _iter_egg_info_dependencies in pip/_internal/metadata/importlib/_dists.py.)

I think option (1) is better (we recently had a similar debate about whether project names should be explicitly normalised, which I think we should avoid repeating) but we should also require (2) for compatibility with existing un-normalised data. (That’s the “Transition plan” section I’m proposing we include in new PEPs :slightly_smiling_face:)

For anyone else who, like me, doesn’t know what that involves:

GitHub link

Seems reasonable: loose on the input, strict on the output.

Is it just how to normalize extras, or is there something bigger you’re referring to?

Can we make extra validation/normalization the same as package name validation/normalization? It looks like it’s pretty similar, at least. That would be useful for simplicity, and also to keep our options open for reifying extras as part of the package name in the future.

For marker evaluation: simplest might be to declare that extra can only appear on the left-hand side of a == or !=, and that this then uses normalizing comparison rules?

If we do that then I don’t really see the point of also mandating that tools produce a specific string form.

It’s theoretically possible, but I’m not sure if it’s a good idea to declare all existing package managers broken to pursue theoretical purity. You’d get no objection from me if you write that into a PEP, but I’m not going to try writing that PEP myself and defending the decision against user complaints.

Can you give some examples of extra names that would be broken? From a quick look it seems like safe_extra and PEP 503 normalization are equivalent but I could easily be missing some details.

a£b. Also, safe_extra uses _ as the replacement character whereas PEP 503 uses -.

I don’t think the replacement character matters, since users are always supposed to re-normalize before doing any comparisons. So if pip or whatever wants to prefer one replacement character or another internally, it doesn’t affect anything.

I guess the difference in £ is that safe_extra treats it as punctuation, a£b == a-b, while PEP 503 says that it’s illegal? And same for every other character that’s not ASCII alphanumerics, -, . or _?
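
To compare the two rules on that example, here is a small sketch. Both functions are re-implementations from the descriptions in this thread (not the library sources), and the validity check assumes the "ASCII alphanumerics plus -, _ and ." character set discussed above; all three names are hypothetical.

```python
import re


def safe_extra_sketch(extra: str) -> str:
    """safe_extra-style rule: runs outside [A-Za-z0-9.-] become one '_', then lowercase."""
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()


def pep503_normalize(name: str) -> str:
    """PEP 503 rule: runs of '-', '_' and '.' become one '-', then lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


_VALID_NAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_valid_name(name: str) -> bool:
    """Assumed validity check: ASCII alphanumerics with '-', '_' and '.' in between."""
    return _VALID_NAME.fullmatch(name) is not None


odd = "a\u00a3b"  # the a£b example, written with an escape for the pound sign

print(safe_extra_sketch(odd))   # the pound sign is treated like punctuation
print(pep503_normalize(odd))    # PEP 503 substitution alone does not touch it
print(is_valid_name(odd))       # a PEP 503-style validity check rejects it

# Re-normalising makes the choice of replacement character irrelevant:
print(pep503_normalize(safe_extra_sketch("my_bar")) == pep503_normalize("my-bar"))
```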

The safe_extra approach doesn’t seem very useful – “sure, you can write your extra name in Greek or Cyrillic, but all extras written in those alphabets will be interpreted as if you had written a single ‘_’”. And making those characters illegal probably wouldn’t be too disruptive – I doubt many people are using them? But idk if it’s like, “literally no-one” or “1 package” or “100 packages”, so maybe it would be disruptive enough to not be worth it, not sure.