PEP 639, Round 3: Improving license clarity with better package metadata

ncoghlan · August 3, 2024, 5:55am

Is there a clear explanation anywhere of exactly how Git-style glob patterns differ from those supported by the Python standard library? (pathspec doesn’t seem to include one, and a quick internet search wasn’t enlightening either)

Reading the linked page, the only clear difference I could readily see was the Git-style glob patterns omitting negated character ranges (and that could potentially be handled by disallowing unescaped ! characters as the first character in a range definition). There might be some discrepancies lurking in the exact details of how ** is interpreted, but I didn’t dig that deep.

Summarising the key points from the git page:

Blank lines: not applicable
Comment lines: not applicable
Use / as directory separator (already in the PEP)
Use \ to escape metacharacters (not clearly stated as a general rule in either case, but consistent)
Use ? to match a single character (consistent)
Use * to match multiple characters, but not dir separators (consistent)
Use ** to match multiple characters, including directory separators (apparently consistent, but I didn’t check in detail)
Negated character ranges: currently allowed in the PEP, not allowed in git patterns (disallowing to improve portability is a good idea)
Overall pattern negation: not really part of the globbing engine itself (it’s part of how multiple glob entries are composed together into a set of paths), so accepting that usage when ! appears as the first character in a pattern doesn’t seem like a dealbreaker either. Alternatively, disallow it and instead encourage developers to structure their license files such that negated patterns aren’t needed (e.g. by moving the anomalous files to a different folder)
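
To make the two kinds of negation concrete, here is a minimal sketch. It shows how Python's fnmatch treats ! directly after an opening bracket, and adds a hypothetical helper (uses_negation is not from the PEP or any library) that a tool could use to flag either form. The helper is deliberately naive and does not attempt to handle escaped ! characters.

```python
from fnmatch import fnmatchcase

# In Python's fnmatch, "!" negates a character set only directly after "[".
negated_set_matches_b = fnmatchcase("b", "[!a]")   # "b" is not "a"
negated_set_matches_a = fnmatchcase("a", "[!a]")   # "a" is excluded


def uses_negation(pattern: str) -> bool:
    """Hypothetical check: True if the pattern negates a set or the whole pattern.

    Assumption: no escaping is considered, so any "[!" or leading "!" counts.
    """
    return pattern.startswith("!") or "[!" in pattern


if __name__ == "__main__":
    for candidate in ("LICENSE", "!LICENSE", "LICEN[!C]E"):
        print(candidate, uses_negation(candidate))
```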

gotmax23 · August 3, 2024, 8:08pm

Can the default license file glob be updated to include LICENSES/*.txt to match the (somewhat) popular REUSE standard for specifying licenses?

tungol · August 5, 2024, 6:02pm

In my reading of the linked documentation for Git-style patterns, the biggest difference seems to be the semantics around whether a pattern is relative or not. In the PEP, all patterns are relative paths, while the git pattern allows for non-relative paths if-and-only-if there is no directory separator at the start or in the middle of the pattern.

In the git-style globs, given this layout:

./LICENSE
./foo/LICENSE
./bar/foo/LICENSE
./baz/LICENSE/

The pattern LICENSE will match all four, /LICENSE will match only the first, foo/LICENSE will match only the second (same meaning as /foo/LICENSE), and LICENSE/ will match only the fourth (the only directory match).

ncoghlan · August 6, 2024, 9:06am

Stephen Morton:

In the git-style globs, given this layout:
./LICENSE
./foo/LICENSE
./bar/foo/LICENSE
./baz/LICENSE/
The pattern LICENSE will match all four, /LICENSE will match only the first, foo/LICENSE will match only the second (same meaning as /foo/LICENSE), and LICENSE/ will match only the fourth (the only directory match).

This is a good set of examples. Since we need to use glob in recursive mode for the ** pattern to work as desired (even though the PEP text doesn’t explicitly say that), I checked how that handles these cases:

>>> from glob import glob
>>> from functools import partial
>>> rglob = partial(glob, recursive=True)
>>> rglob("LICENSE")
['LICENSE']
>>> rglob("/LICENSE")
[]
>>> rglob("foo/LICENSE")
['foo/LICENSE']
>>> rglob("LICENSE/")
[]

While not unreasonable, I don’t think those outcomes are particularly great.

Instead, I think the git pattern matching is more desirable, and since this is the first pyproject.toml field to allow glob patterns, we can use it to set the precedent for future fields.

The git semantics can be obtained from glob.glob in recursive mode via the following pair of pattern pre-filtering rules (patterns that contain a non-leading directory separator do not require modification):

for any pattern that begins with a directory separator, add a leading .
for any pattern that does not include a directory separator, add a leading **/

>>> rglob("**/LICENSE")
['LICENSE', 'bar/foo/LICENSE', 'baz/LICENSE', 'foo/LICENSE']
>>> rglob("./foo/LICENSE")
['foo/LICENSE']
>>> rglob("./LICENSE")
['./LICENSE']
>>> rglob("**/LICENSE/")
['baz/LICENSE/']
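
The two pre-filtering rules can be sketched as a small string rewrite. The function name git_style_to_glob is hypothetical, and one detail is an assumption taken from the LICENSE/ example above rather than from the rule text: a trailing separator is not counted as "including a directory separator".

```python
def git_style_to_glob(pattern: str) -> str:
    """Rewrite a git-style pattern for use with glob.glob(..., recursive=True).

    Sketch only: it covers the two rules discussed here, nothing else
    from the gitignore format (no comments, negation, or escapes).
    """
    if pattern.startswith("/"):
        # Anchored pattern: make it relative to the current directory.
        return "." + pattern
    if "/" not in pattern.rstrip("/"):
        # No separator at the start or in the middle: match at any depth.
        # Assumption: a trailing "/" (directory-only match) does not count.
        return "**/" + pattern
    # A separator in the middle already makes the pattern relative.
    return pattern


if __name__ == "__main__":
    for raw in ("LICENSE", "/LICENSE", "foo/LICENSE", "LICENSE/"):
        print(f"{raw!r} -> {git_style_to_glob(raw)!r}")
```

The rewritten patterns are the same ones passed to rglob in the session above.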

In terms of the PEP, I’d frame it this way:

glob patterns that contain a directory separator MUST be handled as references relative to the pyproject.toml file (adding a leading . if the pattern starts with a directory separator)
glob patterns that do not contain a directory separator MUST be handled as if they started with the **/ pattern
to avoid pattern ambiguity, build tools MAY emit a warning when license file patterns start with a directory separator or do not include a directory separator
pattern and character set negation is not supported and hence ! characters MUST NOT be used (unless escaped to ensure handling as a regular character)

Alternatively (if the implicit recursive search is deemed undesirable - having LICENSE match files in vendored libraries could be surprising!):

all license file patterns MUST be handled as references relative to the pyproject.toml file, including both patterns that start with a directory separator (the leading separator is ignored) and patterns that do not include a directory separator (only files adjacent to the pyproject.toml file will be matched)
to avoid pattern ambiguity, build tools MAY emit a warning when license file patterns start with a directory separator or do not include a directory separator
pattern and character set negation is not supported and hence ! characters MUST NOT be used (unless escaped to ensure handling as a regular character)

Edit: as per comments below, option 2 (patterns are not recursive by default) is considered preferable for this use case.

brettcannon · August 7, 2024, 10:08pm

I think you’re assuming my .gitignore file is fancy enough to know how it differs from POSIX glob.

It doesn’t really matter since the default is just an idea/suggestion to tools if they want to be clever.

barneygale · August 8, 2024, 1:48pm

I concur with your conclusion that .gitignore patterns are compatible with Python’s glob with your suggested pre-filtering. There may be a couple of edge cases, but they’re unlikely to be important.

It’s not quite correct to say that ** matches multiple characters including separators: a pattern like foo/**/bar (which includes two separator characters) matches the path foo/bar (which includes only one) in most glob implementations, including Git’s. The 3.13+ pathlib docs have a bit more on this topic.
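
A quick way to see this with the standard library's glob in recursive mode is sketched below. The directory layout and the relative_matches helper are made up for the illustration.

```python
import glob
import os
import tempfile


def relative_matches(root: str, pattern: str) -> list[str]:
    """Glob recursively under root; return sorted '/'-separated relative paths."""
    found = glob.glob(os.path.join(glob.escape(root), pattern), recursive=True)
    return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in found)


with tempfile.TemporaryDirectory() as root:
    # Layout: foo/bar (directly under foo) and foo/x/bar (one level deeper).
    os.makedirs(os.path.join(root, "foo", "x"))
    for rel in ("foo/bar", "foo/x/bar"):
        open(os.path.join(root, *rel.split("/")), "w").close()
    # "**" may stand for zero directories, so foo/bar is matched as well.
    matches = relative_matches(root, "foo/**/bar")

print(matches)
```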

ksurma · August 16, 2024, 2:17pm

Based on the current discussion, I’d add the suggestions by @konstin and @pf_moore to the “Add license_files key” part of the specification in the PEP. This would explicitly disallow negated character ranges and overall pattern negation, and ensure portability across tools:

Alphanumeric characters, underscores (_), hyphens (-) and dots (.) MUST be matched verbatim, with the exception of the parent indicator rule for ...
*, **, ? and / as well as [] containing only the verbatim matched characters MUST be supported.
Any characters or character sequences not covered by this specification are invalid. Projects MUST NOT use such values. Tools consuming this field MAY reject invalid values with an error.
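
As a rough illustration of the "allow-list" approach in the three rules above, a validator could look like the sketch below. The name is_valid_license_glob is hypothetical, "alphanumeric" is interpreted via str.isalnum() (an assumption), empty brackets are rejected (also an assumption), and the parent indicator rule is not implemented because its wording is not spelled out here.

```python
def _is_verbatim(ch: str) -> bool:
    """Characters that are matched literally: alphanumerics, '_', '-' and '.'."""
    return ch.isalnum() or ch in "_-."


def is_valid_license_glob(pattern: str) -> bool:
    """Sketch of an allow-list check for a license file glob pattern."""
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if _is_verbatim(ch) or ch in "*?/":
            i += 1
        elif ch == "[":
            end = pattern.find("]", i + 1)
            # Reject unterminated or empty sets (the latter is an assumption).
            if end == -1 or end == i + 1:
                return False
            # Only verbatim characters may appear inside the brackets,
            # which also rules out "!" negation.
            if not all(_is_verbatim(c) for c in pattern[i + 1:end]):
                return False
            i = end + 1
        else:
            # Anything not covered by the rules is invalid.
            return False
    return True


if __name__ == "__main__":
    for candidate in ("LICEN[CS]E*", "licenses/**/*.txt", "[!a]bc", "!LICENSE"):
        print(candidate, is_valid_license_glob(candidate))
```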

I’m more puzzled about the git-style pattern matching.
Thank you for all the examples which really help to grasp the topic.

My (possibly quite naive) expectation when defining a pattern like LICENSE would be that only the file in the current directory is matched. To match more, I’d reach for wildcard characters. Since ** is allowed by the PEP, I can get the recursive search with the explicit pattern **/LICENSE. A pattern defined like that clearly states the intent, both for tools and for anyone reading the metadata.
There could also be corner cases when files not intended to make it into the distribution (e.g. test files) are being matched by an implicit recursive search, making it more challenging for the project authors to exclude them.

Hence it makes more sense to me to specify that:

all license file patterns MUST be handled as references relative to the pyproject.toml file, including both patterns that start with a directory separator (the leading separator is ignored) and patterns that do not include a directory separator (only files adjacent to the pyproject.toml file will be matched)

But here, I’m way out of my depth and don’t really know how to find the best way forward.

pf_moore · August 16, 2024, 4:22pm

Personally I would say that patterns must not be absolute and are always relative to the directory containing pyproject.toml. So a leading slash is prohibited, and recursion must be explicitly requested via **. I don’t think there’s a good case for following git’s “recursive by default” approach - the use cases are very different.
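
A minimal sketch of those semantics, assuming a build tool that expands patterns with glob.glob (the function name expand_license_patterns and the choice to raise ValueError are hypothetical, not taken from the PEP):

```python
import glob
import os


def expand_license_patterns(project_root: str, patterns: list[str]) -> list[str]:
    """Expand patterns relative to the directory containing pyproject.toml.

    Sketch only: a leading "/" is rejected, and recursion happens only
    where the pattern asks for it with "**".
    """
    matches: list[str] = []
    for pattern in patterns:
        if pattern.startswith("/"):
            raise ValueError(f"absolute pattern not allowed: {pattern!r}")
        found = glob.glob(
            os.path.join(glob.escape(project_root), pattern), recursive=True
        )
        # Keep files only and report them relative to the project root.
        matches.extend(
            sorted(
                os.path.relpath(path, project_root).replace(os.sep, "/")
                for path in found
                if os.path.isfile(path)
            )
        )
    return matches
```

With a project that contains both LICENSE and vendor/lib/LICENSE, the pattern LICENSE picks up only the top-level file, while **/LICENSE picks up both.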

pradyunsg · August 16, 2024, 5:30pm

Note that git’s format has a few smaller behaviours that we might not want:

If there is a separator at the end of the pattern then the pattern will only match directories, otherwise the pattern can match both files and directories.

If there is a separator at the beginning or middle (or both) of the pattern, then the pattern is relative to the directory level of the particular .gitignore file itself. Otherwise the pattern may also match at any level below the .gitignore level.

Honestly, I think we really shouldn’t be modelling this off of git and should instead use a list of globs here. They’re gonna be easier to reason about for most package authors IMO, it is what basically every build backend that does custom includes does today AFAIK, and the nuances are also more familiar to most people IMO.

brettcannon · August 16, 2024, 10:37pm

I like this as well. Adding ** is not difficult, and the concept is probably something that can be taught in a blog post, so it’s better to be explicit here than to risk unintended consequences from implicit recursion.

ncoghlan · August 16, 2024, 11:56pm

The fact a simple pattern could inadvertently match a vendored file concerned me when writing up the git based semantics, hence including the description of the less magical alternative.

I’ve now edited that reply to note that subsequent comments were strongly in favour of not making patterns recursive by default (and I agree that’s the better path for the PEP to take).

ksurma · August 19, 2024, 9:05am

I’ve opened: PEP 639: Make the policy around globs tighter by befeleme · Pull Request #3913 · python/peps · GitHub
I decided to keep the specification more focused on what’s allowed and stating “the rest is forbidden” rather than inventing an inherently incomplete list of such forbidden patterns. I hope it’s clear now, for example, that character negation is not supported.

pf_moore · August 19, 2024, 10:17am

LGTM. However, I think it may be worth explicitly describing the semantics of ranges in [...], just because it’s subtle. Maybe something like this:

Within [...], the hyphen indicates a range: a-z. Hyphens at the start or end are matched literally.

Otherwise, people might interpret “containing only the verbatim matched characters” as meaning that the hyphen must be matched verbatim, not as a range.
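
For reference, this is how Python's fnmatch (which glob.glob relies on) behaves for those cases. It is only an illustration of one existing implementation, not a statement of what the PEP requires.

```python
from fnmatch import fnmatchcase

candidates = "-amz"
patterns = ("[a-z]", "[-az]", "[az-]")

# For each pattern, collect which of the candidate characters it matches.
results = {
    pattern: [ch for ch in candidates if fnmatchcase(ch, pattern)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, matched)
```

The hyphen between two characters acts as a range, so [a-z] matches a, m and z but not the hyphen itself. At the start or end of the set it is matched literally, so [-az] and [az-] match only -, a and z.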

ksurma · August 19, 2024, 2:21pm

That’s a good remark, I added this mention to the PR.

brettcannon · August 19, 2024, 9:41pm

And I just merged it!

@ksurma do you consider the PEP done and want a pronouncement?

ksurma · August 20, 2024, 7:52am

It pains me to say so, but there’s one more thing I was reminded of.

We’ve allowed the character ranges in the glob patterns, which are locale-specific. Should we specify which locale to use when processing them? I’ve done some ddg-fu to realize the topic is complex, handled differently on POSIX and Windows and generally quite a can of worms. Is there a standard way of handling this that we could just adopt?

pf_moore · August 20, 2024, 8:07am

I’d say limit ranges to ASCII only. It’s somewhat exclusionary (projects may not be able to refer to a license named in the maintainer’s native language using ranges) but I doubt that will be a problem in practice (99% of use of this field will likely be a list of explicit filenames).
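
If it helps, an ASCII-only restriction could be checked along these lines. This is a sketch under my own assumptions: the helper name bracket_sets_are_ascii is made up, and it checks everything inside [...] rather than just range endpoints.

```python
import re

# Matches the contents of each [...] set; nested or escaped brackets are
# not considered in this sketch.
_BRACKET_SET = re.compile(r"\[([^\]]*)\]")


def bracket_sets_are_ascii(pattern: str) -> bool:
    """Hypothetical check: every [...] set in the pattern contains only ASCII."""
    return all(body.isascii() for body in _BRACKET_SET.findall(pattern))


if __name__ == "__main__":
    # "\u0159" is the Czech letter r with caron.
    for candidate in ("LICEN[CS]E", "licen[a-\u0159]e"):
        print(candidate, bracket_sets_are_ascii(candidate))
```

Non-ASCII characters outside of brackets are left alone, so explicit filenames in any language would still be accepted.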

konstin · August 20, 2024, 10:18am

The new text looks great! I hacked together a quick implementation in Rust, with some Python bindings to test against a Python reference implementation: GitHub - konstin/pep639-globs

hroncok · August 20, 2024, 10:51am

What if we said:

Within [...], the hyphen indicates a locale-agnostic range (e.g. a-z, order based on Unicode code points).

glob.glob uses fnmatch which seems to follow this behavior:

>>> import fnmatch
>>> fnmatch.fnmatch('č', '[a-z]')   # Czech alphabet is abcčd...
False
>>> fnmatch.fnmatch('č', '[a-ř]')  #  I can still use č if I accept it comes after z
True
>>> import locale
>>> locale.setlocale(locale.LC_ALL)
'LC_CTYPE=cs_CZ.utf8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C'
>>> locale.setlocale(locale.LC_COLLATE, 'cs_CZ.utf8')
'cs_CZ.utf8'
>>> fnmatch.fnmatch('č', '[a-z]')  # no difference
False
>>> fnmatch.fnmatch('č', '[a-ř]')
True
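
The results above are what you would expect if ranges are compared by Unicode code point. A small sketch of that comparison (in_codepoint_range is a hypothetical helper for illustration, not how fnmatch is implemented internally):

```python
C_CARON = "\u010d"  # the Czech letter c with caron
R_CARON = "\u0159"  # the Czech letter r with caron


def in_codepoint_range(ch: str, low: str, high: str) -> bool:
    """True if ch lies between low and high when ordered by code point."""
    return ord(low) <= ord(ch) <= ord(high)


# Code points: "z" is U+007A, c with caron is U+010D, r with caron is U+0159.
codepoints = {name: hex(ord(ch)) for name, ch in
              (("z", "z"), ("c_caron", C_CARON), ("r_caron", R_CARON))}

in_a_to_z = in_codepoint_range(C_CARON, "a", "z")              # outside a-z
in_a_to_r_caron = in_codepoint_range(C_CARON, "a", R_CARON)    # inside the wider range

print(codepoints, in_a_to_z, in_a_to_r_caron)
```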

ksurma · August 20, 2024, 12:59pm

Here: PEP 639: Character ranges are treated locale-agnostic by befeleme · Pull Request #3914 · python/peps · GitHub
If this is deemed sufficient, let’s merge and pronounce the draft as ready for further processing.