I also don’t have much time so it would be cool if there was a brief summary of what was changed compared to a few months ago so I know what to do in Hatchling
remove the default glob values expected from build tools to match license-files (build backends can still define them if they wish, the standard just doesn’t make any recommendation about them)
change the error policy around deprecated license classifiers - tools MAY – not MUST – raise an error if they encounter them
flatten the value of license-files key - expect glob patterns, specify the glob characters supported (The glob patterns MAY contain special glob characters: ``*``, ``?``, ``**`` and character ranges: ``[]``, and tools MUST support them.)
Meaning, this is a valid license files declaration now:
license-files = ["LICENSE.txt", "licenses/*"]
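As a rough illustration of how a build backend might expand a declaration like that, here is a minimal sketch using `pathlib.Path.glob` from the standard library. The helper names `expand_license_files` and `demo` and the sample file names are made up for the example; it is not taken from Hatchling or any other backend.

```python
import tempfile
from pathlib import Path


def expand_license_files(root, patterns):
    """Return sorted, de-duplicated relative paths of files matching any pattern."""
    root = Path(root)
    matched = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            # Only regular files are collected; directories are skipped.
            if path.is_file():
                matched.add(path.relative_to(root).as_posix())
    return sorted(matched)


def demo():
    """Build a throwaway project tree and expand the patterns from the post."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "licenses").mkdir()
        for name in (
            "LICENSE.txt",
            "README.md",
            "licenses/MIT.txt",
            "licenses/Apache-2.0.txt",
        ):
            (root / name).write_text("placeholder\n", encoding="utf-8")
        return expand_license_files(root, ["LICENSE.txt", "licenses/*"])
```

`demo()` picks up `LICENSE.txt` and the two files under `licenses/`, and leaves `README.md` out.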
I believe the draft edits are now done and ready for review.
I’ll give people a week for any more comments and I will do another thorough read just to be safe. It would also be great to get a non-Python project that reads metadata (e.g., uv, so maybe @konstin or @zanie ?) to make sure we aren’t doing something that’s too specific to Python (i.e. the glob part).
The glob matcher is indeed somewhat difficult. The de-facto standard in Rust is the glob crate, which seems very similar to, but not bug-for-bug compatible with, the Python version. I assume the PEP is written to match Python’s std glob matcher, but std has some undocumented features, such as group negation:
(We found those because the glob crate does support them)
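For illustration, a minimal sketch of what negation inside a bracket expression looks like with Python's `fnmatch` module, which the `glob` documentation says it uses for matching. The pattern and file names are invented for the example.

```python
import fnmatch

# Hypothetical pattern: "[!0-9]" matches one character that is NOT a digit.
PATTERN = "LICENSE-[!0-9]"


def matches(name):
    # fnmatchcase compares case-sensitively on every platform.
    return fnmatch.fnmatchcase(name, PATTERN)
```

Here `matches("LICENSE-a")` is true and `matches("LICENSE-1")` is false, so a tool that only implements plain `[]` ranges would disagree with Python on such a pattern.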
One option is to say that ?, *, ** and [] must be supported, and leave the rest implementation-defined. Another option is to make the exact desired syntax and semantics of globbing clearer, and we have to build a matcher in Rust that matches precisely that.
I think this is the better option, and that’s how I interpreted Paul’s comment above: define the minimum and people who care about portability should only use those characters.
Its value is an array of strings which MUST contain valid glob patterns, as specified below. The glob patterns MAY contain special glob characters: *, ?, ** and character ranges: [], and tools MUST support them. Path delimiters MUST be the forward slash character (/), and parent directory indicators (..) MUST NOT be used. Tools MUST assume that license file content is valid UTF-8 encoded text, and SHOULD validate this and raise an error if it is not.
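The path and encoding rules in that paragraph could be checked roughly as follows. This is only a sketch: the function names are hypothetical, and treating any backslash as a forbidden path delimiter is my assumption, since the quoted text only says that delimiters MUST be forward slashes.

```python
def check_pattern(pattern):
    """Raise ValueError if the pattern breaks the path rules quoted above."""
    # Assumption: a backslash is read as a non-portable path delimiter.
    if "\\" in pattern:
        raise ValueError(f"use '/' as the path delimiter: {pattern!r}")
    # Parent directory indicators are rejected segment by segment.
    if ".." in pattern.split("/"):
        raise ValueError(f"parent directory indicator in: {pattern!r}")


def read_license_text(data):
    """Decode license file bytes, raising ValueError if not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("license file is not valid UTF-8") from exc
```

`check_pattern("licenses/*")` passes silently, while `check_pattern("../LICENSE")` raises.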
When I just read the text without the context from the thread, it wasn’t clear to me whether e.g. [!...] syntax was allowed or forbidden by this PEP, especially with the part about rejecting LICEN{CSE*. The PEP does tell you which syntax you need to support, but not what to do with the remaining space of characters.
What about the following:
1. Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST NOT be assigned special meaning, they must be matched verbatim. Note that this includes all alphabetic characters, not only ASCII characters[1].
2. *, **, ? and / as well as [] containing only the verbatim matched characters from the list in (1) MUST be supported [with the usual rules].
3. For the remaining characters (this concerns mainly non-alphanumeric ASCII), I propose one of two options:
a. The behavior on all characters not mentioned in (1) or (2) is implementation defined: An implementation MAY reject them, it MAY match them verbatim or it MAY apply an extended feature set (such as supporting [!...]). For example, LICEN{CSE* may or may not be rejected.
b. Other characters MUST be rejected by the implementation. This can be implemented by a scan over all characters of the string plus a separate check for ...
For option 3a, we change the text from:
To achieve better portability, the filenames to match should only contain the alphanumeric characters, underscores (_), hyphens (-) and dots (.).
to
Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST be matched verbatim, with the exception of the parent indicator rule for ... Note that this includes all alphabetic characters, not only ASCII characters.
The behavior of characters not mentioned is implementation defined. An implementation MAY reject them, it MAY match them verbatim or it MAY apply an extended feature set (for example, supporting [!...] for exclusions).
We remove the LICEN{CSE* error example.
For option 3b:
Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST be matched verbatim, with the exception of the parent indicator rule for ... Characters not mentioned MUST be rejected by an implementation, and implementations MUST NOT support additional semantics for glob matching.
The above works for both Rust and Python (and I assume most other languages too) since the extended features in both use non-alphanumeric ASCII characters, so when we reject those characters, we can never trigger the additional behaviors in the glob implementations.
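The character scan for option 3b might look something like the sketch below. The names are hypothetical, `str.isalnum()` is my stand-in for "alphanumeric" (it accepts non-ASCII letters and digits), and the separate parent indicator check is left out.

```python
# Characters with glob meaning that option 3b would still allow.
GLOB_CHARS = frozenset("*?/[]")
# Non-alphanumeric characters that are matched verbatim.
VERBATIM_CHARS = frozenset("_-.")


def rejected_characters(pattern):
    """Return the characters in pattern that option 3b would reject, in order."""
    return [
        ch
        for ch in pattern
        if not (ch.isalnum() or ch in VERBATIM_CHARS or ch in GLOB_CHARS)
    ]
```

With this scan, `LICEN{CSE*` is flagged because of `{`, and `licenses/[!a]*` is flagged because of `!`, so neither can reach the extended features of the underlying glob implementation.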
Thank you for the thorough response.
Here, I lean towards option 3a. I feel the specification shouldn’t add even more restrictions at this point.
Also, as noted by @hroncok when we discussed it briefly: option 3a will allow projects to use Python’s glob module in full, while 3b not really.
Would you be so kind as to send a PR with those changes?
Be careful about “implementation-defined behaviour”. It will mean that some pyproject.toml files are correct according to the spec, but not portable across tools.
I’m familiar with “implementation-defined behaviour” in C standards. There, as far as I know, it was mostly used when several pre-standard compilers behave differently, and the standard didn’t want to “pick sides”. The resulting distinction between valid code and portable code is painful.
Unlike C code that might reasonably be written for one specific compiler, pyproject.toml is an interoperability format. Implementation-defined behaviour doesn’t make sense to me here.
Any characters or character sequences not covered by (1) and (2) are invalid. Projects MUST NOT use such values. Tools consuming this field MAY reject invalid values with an error.
This puts the responsibility on project authors to use portable constructs, and makes it clear what those portable constructs are. That’s sufficient. IMO there’s no great benefit in requiring consumers to check and reject invalid constructs. For all practical purposes no-one is ever going to use them, so it’s extra busy-work.
We have plenty of other standards where we require specific constraints on data values, but leave it as a tool UI decision whether to validate. This seems like a good case for that approach as well.
Another Astral employee working on uv here. I’m also the author of Rust’s regex and globset crates.
I would echo @encukou’s concerns about implementation defined behavior. I think it would be great to avoid that whenever possible.
I don’t have a ton of familiarity with the PEP process, so my apologies if this suggestion isn’t appropriate, but another option here is to just say in the PEP, “The glob syntax and behavior should behave the same as Python’s standard library glob.glob function, assuming default keyword parameters.” And then file a bug to improve the documentation of the glob module, for example to cover the [!...] syntax. This will also help answer other questions that I think are not covered by the PEP, such as the behavior of ** (symlinks, hidden directories, etc). And in the case where documentation fails, implementors of this PEP can fall back to examining the actual behavior of the glob.glob function. Not ideal, but it’s easy to specify and unambiguous.
This also incidentally seems to match up with Rust’s glob crate pretty well. And even if it didn’t, I don’t think it would be a huge deal to port Python’s glob.glob function to Rust in order to copy its semantics exactly.
That’s a good point as being portable across tools is important (it’s why we created [project] in the first place).
I think that’s a nice solution to this: say you should write portable glob patterns, but don’t require tools to police users; very “consenting adults”.
You would have to invert that: get the docs fixed first, have the PEP make that statement, and make sure you’re clear which Python version you’re referring to. But this also runs the risk of shifting meaning if glob.glob changed in the future.
I wouldn’t change the PEP based on what I’m about to say, but my preference would actually be (I know it won’t happen) that we all standardize on Git-style glob patterns. This would require a third-party package but more closely matches what users are accustomed to. Rust would have no problem with this; only Python build backends would need a dependency, either explicit or vendored. Hatchling already depends on the best/only good option: pathspec.
I know it won’t happen but it would be the best option for users and could extend to file inclusion patterns for artifacts.