Is there a clear explanation anywhere of exactly how Git-style glob patterns differ from the Python standard library? (pathspec doesn’t seem to include one, and a quick internet search wasn’t enlightening either)
Reading the linked page, the only clear difference I could readily see was the Git-style glob patterns omitting negated character ranges (and that could potentially be handled by disallowing unescaped ! characters as the first character in a range definition). There might be some discrepancies lurking in the exact details of how ** is interpreted, but I didn’t dig that deep.
Summarising the key points from the git page:
Blank lines: not applicable
Comment lines: not applicable
Use / as directory separator (already in the PEP)
Use \ to escape metacharacters (not clearly stated as a general rule in either case, but consistent)
Use ? to match a single character (consistent)
Use * to match multiple characters, but not dir separators (consistent)
Use ** to match multiple characters, including directory separators (apparently consistent, but I didn’t check in detail)
Negated character ranges: currently allowed in the PEP, not allowed in git patterns (disallowing to improve portability is a good idea)
Overall pattern negation: not really part of the globbing engine itself (it’s part of how multiple glob entries are composed together into a set of paths), so accepting that usage when ! appears as the first character in a pattern doesn’t seem like a dealbreaker either. Alternatively, disallow it and instead encourage developers to structure their license files such that negated patterns aren’t needed (e.g. by moving the anomalous files to a different folder)
In my reading of the linked documentation for Git-style patterns, the biggest difference seems to be the semantics around whether a pattern is relative or not. In the PEP, all patterns are relative paths, while git allows a pattern to match at any directory level if and only if there is no directory separator at the start or in the middle of the pattern.
The pattern LICENSE will match all four, /LICENSE will match only the first, foo/LICENSE will match only the second (same meaning as /foo/LICENSE), and LICENSE/ will match only the fourth (the only directory match).
This is a good set of examples. Since we need to use glob in recursive mode for the ** pattern to work as desired (even though the PEP text doesn’t explicitly say that), I checked how that handles these cases:
While not unreasonable, I don’t think those outcomes are particularly great.
Instead, I think the git pattern matching is more desirable, and since this is the first pyproject.toml field to allow glob patterns, we can use it to set the precedent for future fields.
The git semantics can be obtained from glob.glob in recursive mode via the following pair of pattern pre-filtering rules (patterns that contain a non-leading directory separator do not require modification):
for any pattern that begins with a directory separator, add a leading .
for any pattern that does not include a directory separator, add a leading **/
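As a rough illustration, the two pre-filtering rules above could be sketched like this. The helper names apply_git_style_rules and git_style_glob are hypothetical, and the root_dir argument to glob.glob assumes Python 3.10 or later. Patterns whose only separator is a trailing one are passed through unchanged, since the rules as written do not single them out.

```python
import glob


def apply_git_style_rules(pattern: str) -> str:
    """Rewrite a pattern so recursive glob.glob approximates git's anchoring.

    Hypothetical helper: it only implements the two rules quoted above.
    """
    if pattern.startswith("/"):
        # Rule 1: a leading separator anchors the pattern at the root.
        return "." + pattern
    if "/" not in pattern:
        # Rule 2: no separator at all means "match at any depth".
        return "**/" + pattern
    # A non-leading separator already makes the pattern root-relative.
    return pattern


def git_style_glob(pattern: str, root_dir: str) -> list[str]:
    """Expand a pattern relative to root_dir using the rewritten form."""
    rewritten = apply_git_style_rules(pattern)
    return glob.glob(rewritten, root_dir=root_dir, recursive=True)
```

With this sketch, LICENSE becomes **/LICENSE, /LICENSE becomes ./LICENSE, and foo/LICENSE is left alone.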
glob patterns that contain a directory separator MUST be handled as references relative to the pyproject.toml file (adding a leading . if the pattern starts with a directory separator)
glob patterns that do not contain a directory separator MUST be handled as if they started with the **/ pattern
to avoid pattern ambiguity, build tools MAY emit a warning when license file patterns start with a directory separator or do not include a directory separator
pattern and character set negation is not supported and hence ! characters MUST NOT be used (unless escaped to ensure handling as a regular character)
Alternatively (if the implicit recursive search is deemed undesirable - having LICENSE match files in vendored libraries could be surprising!):
all license file patterns MUST be handled as references relative to the pyproject.toml file, including both patterns that start with a directory separator (the leading separator is ignored) and patterns that do not include a directory separator (only files adjacent to the pyproject.toml file will be matched)
to avoid pattern ambiguity, build tools MAY emit a warning when license file patterns start with a directory separator or do not include a directory separator
pattern and character set negation is not supported and hence ! characters MUST NOT be used (unless escaped to ensure handling as a regular character)
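A minimal sketch of this second option, under a few assumptions: the function names are hypothetical, the "ambiguous" flag is returned to the caller rather than emitted as a warning (the text only says tools MAY warn), and a backslash is taken to be the escape character for !. Note that the standard library glob module does not itself treat a backslash as an escape, so the check below only mirrors the wording of the rule.

```python
def has_unescaped_bang(pattern: str) -> bool:
    """Return True if the pattern contains a ! not preceded by a backslash."""
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if pattern[i] == "!":
            return True
        i += 1
    return False


def normalise_license_pattern(pattern: str) -> tuple[str, bool]:
    """Return (relative_pattern, is_ambiguous) for one license file pattern.

    Hypothetical helper: negation is rejected, a leading separator is
    ignored, and nothing is ever made recursive implicitly.
    """
    if has_unescaped_bang(pattern):
        raise ValueError(f"negation is not supported: {pattern!r}")
    # Patterns a tool MAY warn about: leading separator, or no separator.
    is_ambiguous = pattern.startswith("/") or "/" not in pattern
    relative = pattern[1:] if pattern.startswith("/") else pattern
    return relative, is_ambiguous
```

The returned pattern can then be expanded relative to the directory containing pyproject.toml, with recursion happening only where the author wrote ** explicitly.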
Edit: as per comments below, option 2 (patterns are not recursive by default) is considered preferable for this use case.
I concur with your conclusion that .gitignore patterns are compatible with Python’s glob with your suggested pre-filtering. There may be a couple of edge cases, but they’re unlikely to be important.
It’s not quite correct to say that ** matches multiple characters including separators: a pattern like foo/**/bar (which includes two separator characters) matches the path foo/bar (which includes only one) in most glob implementations, including Git’s. The 3.13+ pathlib docs have a bit more on this topic.
Based on the current discussion, I’d add the suggestions by @konstin and @pf_moore to the Add license_files key part of the specs in the PEP. This would explicitly disallow negated character ranges and overall pattern negation, and ensure portability across tools:
Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST be matched verbatim, with the exception of the parent indicator rule for the .. sequence. The sequences *, **, ? and /, as well as [] containing only the verbatim matched characters, MUST be supported.
Any characters or character sequences not covered by this specification are invalid. Projects MUST NOT use such values. Tools consuming this field MAY reject invalid values with an error.
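To make the proposed wording concrete, here is one way a tool might check a pattern against it. This is a sketch under assumptions, not the PEP's algorithm: the name is_valid_license_glob is hypothetical, "alphanumeric" is interpreted as whatever Python's \w matches (Unicode letters, digits and the underscore), and the parent indicator rule is assumed to forbid any path segment equal to two dots.

```python
import re

# One or more of: a verbatim character, a wildcard or separator,
# or a bracket expression containing only verbatim characters.
_ALLOWED = re.compile(r"(?:[\w.\-]|[*?/]|\[[\w.\-]+\])+")


def is_valid_license_glob(pattern: str) -> bool:
    """Return True if the pattern uses only the characters listed above."""
    if not _ALLOWED.fullmatch(pattern):
        return False
    # Assumption: the parent indicator rule rejects ".." path segments.
    return ".." not in pattern.split("/")


for candidate in ("LICEN[CS]E*", "**/LICENSE", "[!a]bc", "../LICENSE"):
    print(candidate, is_valid_license_glob(candidate))
```

A tool following the "MAY reject invalid values" sentence could raise an error whenever this check returns False.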
I’m more puzzled about the git-style pattern matching.
Thank you for all the examples which really help to grasp the topic.
My – possibly, quite naive – expectation when defining a pattern like LICENSE would be that only the file in the current directory would be matched. To match more, I’d look for the wildcard characters. Since ** are allowed by the PEP, I can achieve the recursive search via the explicit statement **/LICENSE. A pattern defined like that clearly states the intent both for tools and anyone reading the metadata.
There could also be corner cases when files not intended to make it into the distribution (e.g. test files) are being matched by an implicit recursive search, making it more challenging for the project authors to exclude them.
Hence it makes more sense to me to specify that:
all license file patterns MUST be handled as references relative to the pyproject.toml file, including both patterns that start with a directory separator (the leading separator is ignored) and patterns that do not include a directory separator (only files adjacent to the pyproject.toml file will be matched)
But here, I’m way out of my depth and don’t really know how to find the best way forward.
Personally I would say that patterns must not be absolute and are always relative to the directory containing pyproject.toml. So a leading slash is prohibited, and recursion must be explicitly requested via **. I don’t think there’s a good case for following git’s “recursive by default” approach - the use cases are very different.
Note that git’s format has a few smaller behaviours that we might not want:
If there is a separator at the end of the pattern then the pattern will only match directories, otherwise the pattern can match both files and directories.
If there is a separator at the beginning or middle (or both) of the pattern, then the pattern is relative to the directory level of the particular .gitignore file itself. Otherwise the pattern may also match at any level below the .gitignore level.
Honestly, I think we really shouldn’t be modelling this off of git and should instead use a plain list of globs here. They’re going to be easier to reason about for most package authors, it’s what basically every build backend that does custom includes does today AFAIK, and the nuances are more familiar to most people IMO.
I like this as well. Adding ** is not difficult and the concept is probably something that can be taught in a blog post, so it’s better to be explicit here than to risk unintended consequences from implicit recursion.
The fact a simple pattern could inadvertently match a vendored file concerned me when writing up the git based semantics, hence including the description of the less magical alternative.
I’ve now edited that reply to note that subsequent comments were strongly in favour of not making patterns recursive by default (and I agree that’s the better path for the PEP to take).
LGTM. However, I think it may be worth explicitly describing the semantics of ranges in [...], just because it’s subtle. Maybe something like this:
Within [...], the hyphen indicates a range: a-z. Hyphens at the start or end are matched literally.
Otherwise, people might interpret “containing only the verbatim matched characters” as meaning that the hyphen must be matched verbatim, not as a range.
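The distinction is easy to see with the standard library's fnmatch module, which behaves the way the suggested sentence describes. The list of cases below is just an illustration chosen for this reply.

```python
import fnmatch

# (name, pattern) pairs covering a range and hyphens at either edge.
CASES = [
    ("b", "[a-c]"),  # hyphen in the middle: a range from a to c
    ("-", "[a-c]"),  # so a literal hyphen is not part of that range
    ("-", "[-ac]"),  # hyphen at the start: matched literally
    ("-", "[ac-]"),  # hyphen at the end: matched literally
    ("b", "[-ac]"),  # and no range is implied in that case
]

RESULTS = {
    (name, pattern): fnmatch.fnmatchcase(name, pattern)
    for name, pattern in CASES
}

for (name, pattern), matched in RESULTS.items():
    print(f"{name!r} vs {pattern!r}: {matched}")
```

The first, third and fourth cases match; the second and fifth do not.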
It pains me to say so, but there’s one more thing I was reminded of.
We’ve allowed the character ranges in the glob patterns, which are locale-specific. Should we specify which locale to use when processing them? I’ve done some ddg-fu to realize the topic is complex, handled differently on POSIX and Windows and generally quite a can of worms. Is there a standard way of handling this that we could just adopt?
I’d say limit ranges to ASCII only. It’s somewhat exclusionary (projects may not be able to refer to a license named in the maintainer’s native language using ranges) but I doubt that will be a problem in practice (99% of use of this field will likely be a list of explicit filenames).
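If ranges were limited to ASCII, a tool could enforce that with a check along these lines. This is only a sketch of the idea: the function name is hypothetical, and it uses a deliberately simple reading of bracket expressions (no nesting, no escapes).

```python
import re

_BRACKET = re.compile(r"\[([^\]]+)\]")  # contents of each [...] group
_RANGE = re.compile(r"(.)-(.)")         # a lo-hi range inside a group


def has_only_ascii_ranges(pattern: str) -> bool:
    """Return True if every range endpoint in the pattern is ASCII."""
    for body in _BRACKET.findall(pattern):
        for low, high in _RANGE.findall(body):
            if not (low.isascii() and high.isascii()):
                return False
    return True


for candidate in ("LICENSE-[a-z]", "LICENSE-[a-ř]", "LICENSE-[č]"):
    print(candidate, has_only_ascii_ranges(candidate))
```

Individual non-ASCII characters inside brackets are not rejected by this check; only the endpoints of ranges are examined.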
The new text looks great! I hacked together a quick implementation in rust, with some python bindings to test against a python reference implementation: GitHub - konstin/pep639-globs
Within [...], the hyphen indicates a locale-agnostic range (e.g. a-z, order based on Unicode code points).
glob.glob uses fnmatch which seems to follow this behavior:
>>> import fnmatch
>>> fnmatch.fnmatch('č', '[a-z]') # Czech alphabet is abcčd...
False
>>> fnmatch.fnmatch('č', '[a-ř]') # I can still use č if I accept it comes after z
True
>>> import locale
>>> locale.setlocale(locale.LC_ALL)
'LC_CTYPE=cs_CZ.utf8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C'
>>> locale.setlocale(locale.LC_COLLATE, 'cs_CZ.utf8')
'cs_CZ.utf8'
>>> fnmatch.fnmatch('č', '[a-z]') # no difference
False
>>> fnmatch.fnmatch('č', '[a-ř]')
True
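The results above are what you would expect if ranges are compared by Unicode code point rather than by locale collation. A tiny sketch of that comparison (the helper name is made up for illustration):

```python
def in_codepoint_range(char: str, low: str, high: str) -> bool:
    """Return True if char lies between low and high by code point."""
    return ord(low) <= ord(char) <= ord(high)


# z is U+007A, č is U+010D and ř is U+0159, so č sorts after z here
# even though the Czech alphabet places it right after c.
print(in_codepoint_range("č", "a", "z"))
print(in_codepoint_range("č", "a", "ř"))
```

This mirrors the two fnmatch calls in the session: the first comparison fails and the second succeeds, regardless of the active locale.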