What extras names are treated as equal and why?

Hello. In Fedora, we’ve recently started to deal with Python extras. For Python package names, we already generate RPM Provides and Requires in the form:

python3.Xdist(<normalized_package_name>)

E.g.:

python3.9dist(setuptools-scm)

Those are used at the RPM level to pull in the required packages. It works reasonably well.

We follow the normalization rules from PEP 503. It is crucial to us that the names are normalized, because at the RPM level, setuptools-scm != setuptools_scm. By using the normalized form for both Provides and Requires, we avoid this problem.
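For reference, the PEP 503 rule is small enough to sketch in a few lines. The helper names (`normalize_name`, `rpm_provide`) and the default version string are made up for illustration and are not Fedora's actual generator code; only the regular expression comes from PEP 503.

```python
import re

def normalize_name(name):
    """PEP 503: collapse runs of '-', '_' and '.' into a single '-', then lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()

def rpm_provide(name, python_version="3.9"):
    """Hypothetical helper: format a dependency string in the shape shown above."""
    return f"python{python_version}dist({normalize_name(name)})"

# Different spellings of the same project collapse to one RPM-level string.
print(rpm_provide("setuptools_scm"))
print(rpm_provide("Setuptools.SCM"))
```

Both calls produce python3.9dist(setuptools-scm), which is why a Requires generated from one spelling matches a Provides generated from another.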

We have recently started to generate such RPM Provides and Requires for extras as well, in the form:

python3.Xdist(<normalized_package_name>[<extras_name>])

E.g.

python3.9dist(setuptools-scm[toml])

Everything worked well :rainbow: :unicorn:

Until we started to deal with case sensitivity, underscores and such. It appears that pip treats dnspython[DNSSEC] and dnspython[dnssec] equally (good!), but it does not treat webscrapbook[adhoc_ssl] and webscrapbook[adhoc-ssl] equally :exploding_head:

What are the normalization rules for extras names and where are they defined? And if they are undefined, should they be defined? Would re-using the same normalizing function that exists for package names make sense?

Thanks.

There is some specification at Core metadata specifications — Python Packaging User Guide (important to note that an extra like foo[bar, baz] is legal) but normalization is not included there.

This is probably an implicit standard via the implementation details of distutils/setuptools/pip that could stand to be explicitly defined.

This is probably an implicit standard via the implementation details of distutils/setuptools/pip that could stand to be explicitly defined.

And like many things in this area, it’s complicated :stuck_out_tongue:

The de facto standard (implemented by setuptools) is to convert each run of characters outside [A-Za-z0-9.-] to a single underscore, and convert the result to lowercase. Users can specify effectively anything in setup.py.
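A minimal sketch of that rule, written from the description in this post rather than copied from setuptools (the function name `setuptools_style_extra` is made up here):

```python
import re

def setuptools_style_extra(extra):
    # Assumption: this mirrors the rule as described in this post.
    # Each run of characters outside [A-Za-z0-9.-] becomes one underscore,
    # then the whole name is lowercased.
    return re.sub(r"[^A-Za-z0-9.-]+", "_", extra).lower()

for name in ("DNSSEC", "dnssec", "adhoc_ssl", "adhoc-ssl"):
    print(name, "->", setuptools_style_extra(name))
```

Under this rule DNSSEC and dnssec end up identical, while adhoc_ssl and adhoc-ssl stay distinct, because the hyphen is inside the allowed set and is never rewritten.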

But Core Metadata (linked above) says an extra name “Must be a valid Python identifier”; in other words, it can’t contain - or ., can’t start with a digit, etc.[1] This directly conflicts with setuptools’s implementation, which would happily produce extra names that are invalid according to Core Metadata. (I’m not sure if wheel implements additional logic to block those names from being written into a wheel.)
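To make the conflict concrete, here is a rough check for the “valid Python identifier” wording. The function name is hypothetical, and the ASCII restriction is only the assumption from footnote [1], not something Core Metadata states.

```python
def is_core_metadata_extra(extra):
    # str.isidentifier() rejects '-', '.', and a leading digit.
    # Assumption (see footnote [1]): non-ASCII identifiers are excluded too.
    # Note: Python keywords such as "class" still pass isidentifier().
    return extra.isascii() and extra.isidentifier()

for name in ("toml", "adhoc_ssl", "adhoc-ssl", "2fa"):
    print(name, is_core_metadata_extra(name))
```

Here adhoc-ssl and 2fa are rejected, even though the setuptools rule described above leaves a name like adhoc-ssl untouched.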

So if we’re going to have a standard on this, this would be a good opportunity to fix the available character set. I’m not sure what the best rule is here considering backwards compatibility. The general form is easy to agree on, but there are various edge cases like whether the name can begin with a digit or a non-alphanumeric character.

The normalisation rule is probably good enough as-is, although I’d personally hope we can change to a rule similar to PEP 503 to avoid user confusion.

[1]: Since the rule was written pre-Python 3, I assume non-ASCII characters are also out.

Packaging (and hence our tooling) considers this as two extra names and we reduce the problem to foo[bar] && foo[baz].
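A rough illustration of that reduction follows. Both helpers are hypothetical and deliberately simplistic: they ignore version specifiers and environment markers and are not how packaging or the Fedora generators actually parse requirements.

```python
import re

def split_extras(requirement):
    """Hypothetical helper: split 'foo[bar, baz]' into ('foo', ['bar', 'baz'])."""
    match = re.fullmatch(r"\s*([^\[\]\s]+)\s*(?:\[([^\[\]]*)\])?\s*", requirement)
    if match is None:
        raise ValueError(f"cannot parse {requirement!r}")
    name, extras = match.group(1), match.group(2) or ""
    return name, [e.strip() for e in extras.split(",") if e.strip()]

def expand(requirement):
    """Turn one requirement with several extras into one entry per extra."""
    name, extras = split_extras(requirement)
    return [f"{name}[{extra}]" for extra in extras] or [name]

print(expand("foo[bar, baz]"))
```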

I’d happily help to define it explicitly.

That happens at “build time”. While definitely important, it is not the deciding factor for this problem. I need to know what conversions (should) happen in installers when they resolve dependencies. Does pip follow the same logic?

It parses the requirement via packaging:

import string
from pyparsing import Combine, Word, ZeroOrMore

ALPHANUM = Word(string.ascii_letters + string.digits)
PUNCTUATION = Word("-_.")
IDENTIFIER_END = ALPHANUM | (ZeroOrMore(PUNCTUATION) + ALPHANUM)
IDENTIFIER = Combine(ALPHANUM + ZeroOrMore(IDENTIFIER_END))
EXTRA = IDENTIFIER

So it fails if it contains characters like ! or @. Good.

It does not seem to treat any of the listed punctuation characters as equivalent.
It seems to simply ignore case.

Let’s standardize what packaging already uses to parse it? A name can begin with a digit, but not with punctuation.
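If it helps the discussion, the grammar quoted above can be approximated with a plain regular expression. This is my own hand translation (the names `_EXTRA_RE` and `is_valid_extra` are made up), so treat it as a sketch rather than what packaging does internally.

```python
import re

# Starts and ends with an ASCII letter or digit; '-', '_' and '.' may only
# appear in between. A single alphanumeric character is also accepted.
_EXTRA_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")

def is_valid_extra(extra):
    return _EXTRA_RE.fullmatch(extra) is not None

for name in ("toml", "2fa", "adhoc-ssl", "adhoc_ssl", "-ssl", "ssl-", "a!b"):
    print(name, is_valid_extra(name))
```

The first four names are accepted and the last three are rejected, which matches the reading that a name may start with a digit but must not start or end with punctuation.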

For 100% backwards compatibility, we could define the normalization rule as:

def normalize(extra_name):
    return extra_name.lower()

If we are not afraid of changes, we could define it to what PEP 503 does for names:

import re

def normalize(extra_name):
    return re.sub(r"[-_.]+", "-", extra_name).lower()

That would require some changes in pip (or some of the vendored libraries). It would only blow up if some projects define multiple extras with names that differ only in punctuation. I’d say the risk is close to zero, but maybe I am an optimist.
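One way to gauge that risk would be to look for extras within a single project that collapse to the same normalized name. The sketch below is illustrative only; `find_collisions` and the two normalizer names are made up, and the extras list would have to come from real project metadata.

```python
import re
from collections import defaultdict

def normalize_lower(extra):
    """The 100% backwards compatible proposal: lowercase only."""
    return extra.lower()

def normalize_pep503(extra):
    """The PEP 503 style proposal: collapse punctuation runs, then lowercase."""
    return re.sub(r"[-_.]+", "-", extra).lower()

def find_collisions(extras, normalize=normalize_pep503):
    """Return {normalized_name: [original names]} for names that collide."""
    groups = defaultdict(set)
    for extra in extras:
        groups[normalize(extra)].add(extra)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

declared = ["adhoc_ssl", "adhoc-ssl", "toml"]
print(find_collisions(declared, normalize_lower))
print(find_collisions(declared, normalize_pep503))
```

With lowercasing only, nothing collides; with the PEP 503 style rule, adhoc_ssl and adhoc-ssl become the same extra, which is exactly the case a project would have to avoid.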

Should I PEP this? The PEP would define the rules of “valid” extra names (defined by current packaging parser) and their normalized form (one of the above).
