PEP 639, Round 3: Improving license clarity with better package metadata

I have just opened a PR with changes discussed here, most notably:

  • allowing custom SPDX identifiers per SPDX specification
  • making License-Expression and License fields mutually exclusive
  • not recommending a single reference library for expression parsing and validation
  • relaxing the validation and case-normalization to SHOULD, with the logical follow up: if tools decide to validate (meaning they have access to a license list), they also SHOULD normalize; otherwise there’s no sense in mandating case-normalization.
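To illustrate the last point, here is a rough sketch of what "validate, then normalize case" could look like for a tool that does have a license list. Everything here is an assumption for illustration: `KNOWN_LICENSES` is a tiny stand-in for the real SPDX list, `normalize_expression` is a hypothetical helper, and the grammar is not checked at all (no `+` suffix, no exception list after `WITH`). Custom `LicenseRef-` identifiers are simply passed through unchanged.

```python
import re

# Tiny stand-in for the SPDX license list (assumption, not the real data).
KNOWN_LICENSES = {
    "mit": "MIT",
    "apache-2.0": "Apache-2.0",
    "bsd-3-clause": "BSD-3-Clause",
}
OPERATORS = {"and", "or", "with"}
_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def normalize_expression(expression):
    """Case-normalize known identifiers and operators in an expression."""
    out = []
    for token in _TOKEN.findall(expression):
        lowered = token.lower()
        if token in ("(", ")"):
            out.append(token)
        elif lowered in OPERATORS:
            out.append(lowered.upper())
        elif lowered in KNOWN_LICENSES:
            out.append(KNOWN_LICENSES[lowered])
        elif token.startswith("LicenseRef-"):
            # Custom identifier: keep exactly as written.
            out.append(token)
        else:
            raise ValueError(f"unknown license identifier: {token!r}")
    # Re-join tokens and tidy the spacing around parentheses.
    return " ".join(out).replace("( ", "(").replace(" )", ")")
```

A tool without access to a license list would skip this step entirely, which is why normalization only makes sense together with validation.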

First off, thanks to @ksurma for revitalizing this PEP! I’ve read through the latest version based on Karolina’s PR that I just merged and I have some questions.

If the License-Expression field is present, build tools SHOULD and publishing tools MUST raise an error if one or more license classifiers is included in a Classifier field, and MUST NOT add such classifiers themselves.

I think this would hurt the transition too much. Tools ingesting licensing information will need time to switch to the new metadata. I think the publishing-tool requirement should be removed and the build-tool requirement relaxed to a MAY, since the trove classifier is not inherently wrong.

For all newly-uploaded distributions that include a License-Expression field, the Python Package Index (PyPI) MUST reject any that also specify any license classifiers.

See above as to why I think this is too strict.
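For what the relaxed behaviour might look like in a build tool, here is a minimal sketch. The function name `find_conflicting_classifiers` and the `strict` flag are hypothetical and not taken from the PEP or from any existing tool; the only assumption about the data is that license classifiers start with `License ::`.

```python
import warnings

LICENSE_CLASSIFIER_PREFIX = "License ::"


def find_conflicting_classifiers(license_expression, classifiers, strict=False):
    """Return license classifiers given alongside a License-Expression.

    With strict=False (the relaxed behaviour) a warning is issued;
    with strict=True (the stricter draft wording) an error is raised.
    """
    if not license_expression:
        return []
    conflicts = [
        c for c in classifiers if c.startswith(LICENSE_CLASSIFIER_PREFIX)
    ]
    if conflicts:
        message = (
            "License-Expression is set, but license classifiers are "
            f"also present: {conflicts}"
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message, stacklevel=2)
    return conflicts
```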

Its value is a table, which if present MUST contain one of two optional, mutually exclusive subkeys, paths and globs

I’m not sure if globs is important enough of an optimization to have it. Do projects end up w/ so many license files in so many places that this is that useful compared to listing all paths? And if they do have that many licenses, how often are they kept in a directory to begin with (because you could optimize to include all files in a directory and then make this field just an array of strings and rename the field license-paths)?

the globs subkey valid glob patterns, which MUST be parsable by the glob module in the Python standard library.

This is unfortunately a bit under-defined if you implement a build back-end in another language. I think at best you could get away with POSIX glob patterns.

If the license-files key is marked as dynamic (and not present), to preserve consistent behavior with current tools and help ensure the packages they create are legally distributable, build tools SHOULD default to including at least the license files matching the above patterns, unless the user has explicitly specified their own.

Do people like the defaults of ["LICEN[CS]E*", "COPYING*", "NOTICE*", "AUTHORS*"], or should it be left up to build back-ends to decide what to do to fill in a dynamic license-files? I’m leaning towards the latter since we don’t prescribe what build back-ends do for any other field specified as dynamic.
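For reference, a minimal sketch of what applying those default patterns at a project root could look like. `default_license_files` is a hypothetical helper, not something any back-end is known to expose, and it only looks at the top level of the given directory.

```python
from pathlib import Path

# The default patterns quoted above.
DEFAULT_PATTERNS = ["LICEN[CS]E*", "COPYING*", "NOTICE*", "AUTHORS*"]


def default_license_files(root):
    """Return top-level files under root matching the default patterns."""
    root = Path(root)
    found = set()
    for pattern in DEFAULT_PATTERNS:
        for path in root.glob(pattern):
            if path.is_file():
                # Report paths relative to the project root, with "/".
                found.add(path.relative_to(root).as_posix())
    return sorted(found)
```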

Otherwise I think everyone’s outstanding concerns were addressed w/ the latest PEP version and Pradyun volunteering us to maintain SPDX support in packaging (which I’m fine with). :sweat_smile:

I agree it should be left to the backend, for the reasons you give.

Re: license classifiers handling: I’m alright with making it less strict - I’ll submit a PR.

If we were to keep one from those two, I’d lean towards keeping the globs rather than paths.
Reason: literal paths can usually also be interpreted as globs, but globs allow a more concise approach for projects with a lot of vendored dependencies. If there are glob special characters in the filenames, there are escaping mechanisms in place.
For most of the projects, the defaults, now defined as an array of globs, will work out of the box.
With pip, the list of 20 paths can be achieved with just two globs:

In: glob.glob("src/pip/_vendor/*/LICENSE*")
Out: 
['src/pip/_vendor/cachecontrol/LICENSE.txt',
 'src/pip/_vendor/certifi/LICENSE',
 'src/pip/_vendor/distlib/LICENSE.txt',
 'src/pip/_vendor/idna/LICENSE.md',
 'src/pip/_vendor/packaging/LICENSE',
 'src/pip/_vendor/packaging/LICENSE.APACHE',
 'src/pip/_vendor/packaging/LICENSE.BSD',
 'src/pip/_vendor/pkg_resources/LICENSE',
 'src/pip/_vendor/requests/LICENSE',
 'src/pip/_vendor/resolvelib/LICENSE',
 'src/pip/_vendor/tenacity/LICENSE',
 'src/pip/_vendor/urllib3/LICENSE.txt',
 'src/pip/_vendor/distro/LICENSE',
 'src/pip/_vendor/platformdirs/LICENSE',
 'src/pip/_vendor/pygments/LICENSE',
 'src/pip/_vendor/rich/LICENSE',
 'src/pip/_vendor/tomli/LICENSE',
 'src/pip/_vendor/pyproject_hooks/LICENSE',
 'src/pip/_vendor/truststore/LICENSE']

In: glob.glob("src/pip/_vendor/*LICENSE*")
Out: ['src/pip/_vendor/typing_extensions.LICENSE']

This will also mean less fuss for the maintainers who won’t have to manually update the lists upon removing or adding dependencies.

Thank you for pointing that out, it’s an important bit to think about. Can do.

This default already exists in setuptools and in hatch. In setuptools it has been in place since version 56.0.0, for over three years now. In that regard, this PEP just codifies the existing practice.
FTR, Poetry includes LICENSE* in sdists and LICENSE, LICENSE.*, COPYING, COPYING.*, LICENSES/** in wheels.
I believe the common defaults would make it a tad easier for the project authors to decide whether their project needs a custom array without the need to go through the respective build backend documentation and transition between different build backends if needed.
Also, please note that the defaults apply only if there is no license-files key present. If the key is marked dynamic, something else is expected to fill in the value and the defaults are not taken into consideration.

In my experience with the first backend to support this, globs are in fact useful. I would strongly advise that we keep this.

I agree with this!

The defaults are pretty good based on my experience with Hatchling but I would like to change SHOULD to MAY. In that case I would probably add an additional pattern that I’ve noticed in the wild.

I consider the fact that the defaults you suggest are globs as an implementation detail.

Well, is that because that’s what the PEP initially said or because experience says that’s the right thing? I also don’t want to codify something as a standard just because someone else does it w/o justification as to why it’s a good idea.

Then I really don’t like the idea of the defaults. I can’t think of any other metadata where we have implicit defaults due to a lack of value and not specified as dynamic (closest is the README’s content-type and that’s not a full metadata field and it’s still inferred from data provided in [project]). A key part of [project] is that implicitness is relegated to dynamic, which is still explicitly specified as being calculated.

I realize that slows down adoption, but I’m fine w/ that as we have decades to make the transition. And I personally think that making fallbacks part of the spec is a bad thing, which is a lesson learned from PEP 518 and the default back-end.

I also think back-ends can suggest people add the key if they detect common license file names, or even offer to update their pyproject.toml for them.

Would it be worth making paths always support globs, instead of having to specify that fact? I just see users messing up by thinking paths support globs and then wondering why their back-end is saying the path doesn’t match anything. We would also be asking every back-end to try to detect that someone specified a glob pattern by mistake and to let the user know.

FYI backends would never automatically update because they try to keep dependencies to a minimum and that requires something like tomlkit.

I would be in favor of dropping the globs key and making paths support that.

Yes, setuptools followed one of the initial drafts of this PEP. Your point regarding the dynamic handling and having a strong justification for these concrete defaults makes sense to me. Upon reconsideration I lean towards not mandating any defaults in the PEP and leaving it up to the build backends to figure out. This does not prevent future standardization if a clear pattern emerges.

So, a combined approach would be to flatten the value of license-files to an array that can contain paths or globs. All files matching the globs and all the files on specified paths must be included.
Currently the specification says that if no file matches a path, tools MUST raise an error, and if no file matches a glob, tools SHOULD warn and MAY raise an error. What guidance would you propose for the combined array?

Example:

license-files = [
    "src/foo/vendor/*/LICENSE*",
    "bar/AUTHORS.rst",
    "COPYING.md"
]

Do you find it less confusing than what the current draft specifies?

Thank you, @ofek, for the continuous feedback!

Great!

Yep, that’s what I’m thinking.

Raise an error for each entry that doesn’t match a file, glob or literal path. My thinking is if you specified a path you meant for it to match something, so if nothing matched you probably have a typo.

I think it’s simpler and thus lowers the barrier to specifying a path.

When I thought about the globs+path support, I immediately wondered: why does there need to be explicit support for a path if I can use a glob that only matches that one path? So I am all in favor of making it flat, globs only. “LICENSE” is a perfectly valid glob after all.

The only problem I see is if somebody would like to use a license file with a literal *, ?, etc. in its name. While this is probably silly, I think the PEP should specify that such symbols can be escaped by backslashes.

I don’t think that’s a good idea. glob.glob uses fnmatch whose documentation states:

For a literal match, wrap the meta-characters in brackets. For example, '[?]' matches the character '?'.

I think we should use this syntax rather than a bespoke one that backends would need custom handling for.
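For illustration, the standard library already produces this bracket form via `glob.escape`, so a Python backend would not need custom handling. The wrapper name `literal_pattern` below is hypothetical and only there to make the example self-contained.

```python
import fnmatch
import glob


def literal_pattern(filename):
    """Return a glob pattern that matches filename literally."""
    # glob.escape wraps '*', '?' and '[' in brackets, e.g. '?' -> '[?]'.
    return glob.escape(filename)


pattern = literal_pattern("LICENSE?.txt")  # 'LICENSE[?].txt'

# The escaped pattern matches only the literal name, not "any character".
matches_literal = fnmatch.fnmatchcase("LICENSE?.txt", pattern)
matches_other = fnmatch.fnmatchcase("LICENSEX.txt", pattern)
```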

There’s no (suitable) way to make these characters valid on Windows in any filename, so it’s silly enough to make your package non-portable, which I think means it’s silly enough to simply disallow (or ignore) in a specification.

Is that from POSIX or a custom solution?

Yeah, I’m okay saying “don’t do that” or “use a build back-end that can support your esoteric path requirements”.

From here, I think it’s a POSIX convention.

Oh, sorry about not checking the docs. Either way, what I really wanted to say is:

  • let’s make it globs-only
  • let’s mention how to handle license files with literal glob symbols in them

And I don’t mind if the mention says “You can’t have this, rename your files not to have such symbols”.

I got carried away by suggesting the backslashes (that’s what works in Bash, but indeed not in glob.glob).

Please don’t do this. Even though the proposal is going to take away defaults, backends would still have their own default values and in that case most entries would not match and necessitate an error.

If you want to raise an error when there is no match at all then I would be okay with that but please note in that case we are sort of requiring a license file now which is disruptive.

My preference would be no error.

Is the stdlib’s implementation for glob/fnmatch on full parity with POSIX?

I was under the impression that Python’s implementation was rather basic, but it might be because I am used to bash which probably extends the POSIX standard…[1]

We probably should stick with the most basic of the two. It would be problematic if we specify POSIX patterns and then we cannot use Python’s glob/fnmatch for the implementation…


UPDATE: I had a look at sections 2.14.3 (“Patterns Used for Filename Expansion”) and 9.3.5 (referenced by 2.14.3) of IEEE SA - IEEE 1003.1-2024 and it seems to me that this part of the text does indeed mention negative matches using the ! character and named character classes (which I don’t think are covered in Python’s glob)…

If that is confirmed to be the case, I believe it is better not to mention the POSIX standard in the PEP and stick with Python’s glob/fnmatch.
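One way to check this is to run a couple of patterns through fnmatch directly. This is only a small probe, not a full comparison with POSIX, and `probe_fnmatch` is a hypothetical helper name. Going by the fnmatch documentation, `[!seq]` is supported as a negated set, while named classes such as `[[:alpha:]]` are not listed, so the second pattern is not expected to behave like a POSIX character class.

```python
import fnmatch


def probe_fnmatch():
    """Report how fnmatch treats two POSIX-style bracket expressions."""
    return {
        # Negated set: documented by fnmatch as [!seq].
        "negation": fnmatch.fnmatchcase("b", "[!a]"),
        # Named class: under POSIX this would match any letter.
        "named_class": fnmatch.fnmatchcase("b", "[[:alpha:]]"),
    }
```

If the second result is false, that supports describing the supported patterns explicitly rather than pointing at POSIX.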


  1. For example, I can see definitions for “complementation” and character classes in the man pages for glob, which would be supported in Python. ↩︎

Given that not all clients may be written in Python, I’d suggest simply specifying exactly what’s supported. The explanation of supported patterns in the glob docs takes a bit of reading to understand (what is a “hidden” file? are initial dots treated specially on Windows as well as Unix? is the spec assuming “recursive” is set so that ** is supported?)

I’d be fine with a spec that said consumers must support at least *, [...] (without [!...]) and **, and for portability users should restrict themselves to those patterns and alphanumeric filenames (plus a dot for extensions). That covers any realistic use cases without requiring a custom implementation from tools.
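As a rough sketch, such a restricted subset is easy to check without a full glob implementation. The helper `is_portable_pattern` is hypothetical, and allowing `_`, `-` and `/` in addition to alphanumerics and dots is an assumption made here so that typical vendored paths pass; it is not part of the suggestion above.

```python
import re

# One path segment: literal characters, '*', or a non-negated [...] set.
_SEGMENT = re.compile(r"(?:[A-Za-z0-9._-]|\*|\[[A-Za-z0-9._-]+\])+")


def is_portable_pattern(pattern):
    """Check a pattern against the restricted, portable glob subset."""
    if not pattern or pattern.startswith("/"):
        return False
    for segment in pattern.split("/"):
        if segment == "**":
            # Recursive wildcard is only allowed as a whole segment.
            continue
        if segment in ("", ".", "..") or "**" in segment:
            return False
        if not _SEGMENT.fullmatch(segment):
            return False
    return True
```

Patterns using `?`, `[!...]` or backslashes are rejected by this check, which matches the "don’t do that" stance on exotic filenames.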

For me, the most reasonable UX is this:

  • if a user supplies their own glob and it matches nothing, the build backend MUST error to prevent typos, forgotten files etc.
  • if the default glob is used and it matches nothing, the build backend MAY warn

This means, we do not require a license file.
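A minimal sketch of that policy, kept separate from the globbing itself. `check_matches` and its `user_supplied` flag are hypothetical names used only for illustration.

```python
import warnings


def check_matches(pattern, matches, *, user_supplied):
    """Apply the error-or-warn policy to the result of one glob."""
    if matches:
        return list(matches)
    message = f"license-files pattern {pattern!r} matched no files"
    if user_supplied:
        # A user-written pattern that matches nothing is likely a typo.
        raise ValueError(message)
    # A backend default that matches nothing is not an error.
    warnings.warn(message, stacklevel=2)
    return []
```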

For me, this behaviour does not need to be specified in the PEP beyond saying “We do not require a license file.”

Tools are welcome to innovate on their UX as much as they like. We shouldn’t try to specify it up-front.
