PEP 639, Round 3: Improving license clarity with better package metadata

konstin · July 31, 2024, 7:42pm

When I just read the text without the context from the thread, it wasn't clear to me whether e.g. [!...] syntax was allowed or forbidden by this PEP, especially with the part about rejecting LICEN{CSE*. The PEP does tell you which syntax you need to support, but it says nothing about the remaining space of characters.

What about the following:

1. Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST NOT be assigned special meaning; they MUST be matched verbatim. Note that this includes all alphabetic characters, not only ASCII characters [1].
2. *, **, ? and /, as well as [] containing only the verbatim matched characters from the list in (1), MUST be supported [with the usual rules].
3. For the remaining characters (this mainly concerns non-alphanumeric ASCII), I propose one of two options:
   a. The behavior on all characters not mentioned in (1) or (2) is implementation defined: an implementation MAY reject them, it MAY match them verbatim, or it MAY apply an extended feature set (such as supporting [!...]). For example, LICEN{CSE* may or may not be rejected.
   b. Other characters MUST be rejected by the implementation. This can be implemented by a scan over all characters of the string plus a separate check for ...
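To make option 3b concrete, here is a minimal Python sketch of the character scan. The names `validate_glob` and `accepts` are hypothetical and not part of the PEP or any library. The sketch makes three assumptions: the separate check is for the parent indicator `..` as a path component, unbalanced brackets are rejected, and Python's `str.isalnum()` is an acceptable notion of "alphanumeric".

```python
VERBATIM_PUNCT = set("_-.")   # punctuation from item 1
WILDCARDS = set("*?/")        # item 2, outside of brackets ("**" is two "*")


def is_verbatim(ch: str) -> bool:
    """Item 1: alphanumeric (including non-ASCII), underscore, hyphen, dot."""
    return ch.isalnum() or ch in VERBATIM_PUNCT


def validate_glob(pattern: str) -> None:
    """Raise ValueError if the pattern goes beyond items 1 and 2 (option 3b)."""
    in_bracket = False
    for index, ch in enumerate(pattern):
        if in_bracket:
            if ch == "]":
                in_bracket = False
            elif not is_verbatim(ch):
                # Rejects e.g. "!" in "[!...]", so no extended set syntax.
                raise ValueError(f"{ch!r} not allowed inside [] at {index}")
        elif ch == "[":
            in_bracket = True
        elif not (is_verbatim(ch) or ch in WILDCARDS):
            # Also catches a stray "]" (assumption: unbalanced is an error).
            raise ValueError(f"{ch!r} not allowed at {index}")
    if in_bracket:
        raise ValueError("unclosed '['")
    # Assumption: the separate check is for ".." as a path component.
    if ".." in pattern.split("/"):
        raise ValueError("parent indicator '..' is not allowed")


def accepts(pattern: str) -> bool:
    """Convenience wrapper returning True or False instead of raising."""
    try:
        validate_glob(pattern)
    except ValueError:
        return False
    return True


print(accepts("LICEN[CS]E*"))   # True
print(accepts("LICEN{CSE*"))    # False
```

Because every character outside items 1 and 2 is rejected before the pattern reaches the underlying glob library, the library's own extensions cannot be reached through validated patterns.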

For option 3a, we change the text from:

To achieve better portability, the filenames to match should only contain the alphanumeric characters, underscores (_), hyphens (-) and dots (.).

to

Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST be matched verbatim, with the exception of the parent indicator rule for ... Note that this includes all alphabetic characters, not only ASCII characters.

The behavior of characters not mentioned is implementation defined. An implementation MAY reject them, it MAY match them verbatim or it MAY apply an extended feature set (for example, supporting [!...] for exclusions).

We remove the LICEN{CSE* error example.

For option 3b:

Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST be matched verbatim, with the exception of the parent indicator rule for ... Characters not mentioned MUST be rejected by an implementation, and implementations MUST NOT support additional semantics for glob matching.

The above works for both Rust and Python (and I assume most other languages too), since the extended features in both use non-alphanumeric ASCII characters. When we reject those characters, we can never trigger the additional behaviors in the glob implementations.
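As an illustration on the Python side, the standard library's `fnmatch` module shows both behaviors discussed here: `!` inside brackets is given extended meaning, while `{` is matched verbatim. This is only a demonstration with `fnmatch`, not a claim about what any particular build backend uses.

```python
import fnmatch

# "[!C]" is an extended feature: it matches any character except "C".
negated_hit = fnmatch.fnmatchcase("LICENSE", "LICEN[!C]E")
negated_miss = fnmatch.fnmatchcase("LICENCE", "LICEN[!C]E")

# "{" has no special meaning in fnmatch, so it is matched literally.
brace_literal = fnmatch.fnmatchcase("LICEN{CSE.txt", "LICEN{CSE*")

print(negated_hit, negated_miss, brace_literal)  # True False True
```

Both `!` and `{` are non-alphanumeric ASCII, so a pattern that passes the character restriction can contain neither of them.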

I can make a PR for each option.

[1] str.isalpha() says: "Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the Alphabetic property defined in the section 4.10 ‘Letters, Alphabetic, and Ideographic’ of the Unicode Standard." This is different from Rust's char::is_alphabetic, which says: "Alphabetic is described in Chapter 4 (Character Properties) of the Unicode Standard and specified in the Unicode Character Database DerivedCoreProperties.txt." I believe specifying that you need to support non-ASCII alphabetic characters is sufficient.
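A small Python sketch of the difference described in the footnote. The sample characters are my own choice. The last two are not in a Letter category, so `str.isalpha()` returns False for them, whereas Unicode's derived Alphabetic property (the one the Rust documentation points to) also covers category Nl and characters marked Other_Alphabetic, so results can differ for characters like these.

```python
import unicodedata

samples = {
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",   # category Ll
    "CJK IDEOGRAPH 4E2D": "\u4e2d",                # category Lo
    "ROMAN NUMERAL EIGHT": "\u2167",               # category Nl
    "CIRCLED LATIN CAPITAL LETTER A": "\u24b6",    # category So
}

# Map each sample to (general category, str.isalpha() result).
results = {
    name: (unicodedata.category(ch), ch.isalpha())
    for name, ch in samples.items()
}

for name, (category, alpha) in results.items():
    print(f"{name}: category={category} isalpha={alpha}")
```

For typical license file names this edge case is unlikely to matter, which fits the suggestion that requiring support for non-ASCII alphabetic characters is enough.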