PEP 639, Round 3: Improving license clarity with better package metadata

Hello,

A new draft of PEP 639 has been published.

It proposes the adoption of SPDX license expression syntax as a way to declare the licenses of Python packages.

With Core Metadata version 2.4 the following changes apply:

  • a new field License-Expression must be present and must contain a valid SPDX expression
  • a new field License-File may be present zero or more times, and each occurrence must contain one path to a license file declared by the user. All files, whether matched by globs or given as literal paths, must be included in the distribution
  • license files are stored in the .dist-info/licenses/ subdirectory of the produced wheel.

License classifiers and the License field are deprecated.

From the user’s perspective:

  • the license is declared via a top-level string value of the license key in the [project] table of pyproject.toml. It has to contain a valid SPDX expression and maps to the License-Expression field of the Core Metadata.
  • the license files in the distribution are specified via a list in either license-files.globs or license-files.paths, which are mutually exclusive. If license-files is not present in the metadata, there’s a default value tools must assume (license-files.globs = ["LICEN[CS]E*", "COPYING*", "NOTICE*", "AUTHORS*"]). This maps to License-File entries in the Core Metadata.
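To make the mapping above concrete, here is a minimal sketch of how a tool might turn the [project] keys into Core Metadata lines. It assumes the [project] table has already been parsed into a dict, and that the tool has a flat list of root-relative file paths. The function name `license_metadata` and both of its inputs are hypothetical, not something the PEP defines.

```python
from fnmatch import fnmatchcase

# Default globs as quoted in the post above.
DEFAULT_GLOBS = ["LICEN[CS]E*", "COPYING*", "NOTICE*", "AUTHORS*"]


def license_metadata(project, files):
    """Return Core Metadata lines for a parsed [project] table (a dict).

    `files` is a list of paths relative to the project root. This is a
    hypothetical input; a real backend would walk the source tree itself.
    """
    lines = []
    license_value = project.get("license")
    if isinstance(license_value, str):
        # Only the string form maps to License-Expression.
        lines.append(f"License-Expression: {license_value}")

    # Fall back to the default globs when license-files is absent.
    spec = project.get("license-files", {"globs": DEFAULT_GLOBS})
    if "globs" in spec and "paths" in spec:
        raise ValueError("license-files.globs and license-files.paths are mutually exclusive")

    if "paths" in spec:
        # Literal paths must all exist.
        missing = [p for p in spec["paths"] if p not in files]
        if missing:
            raise ValueError(f"license files not found: {missing}")
        matched = list(spec["paths"])
    else:
        globs = spec.get("globs", [])
        matched = sorted(f for f in files if any(fnmatchcase(f, g) for g in globs))

    lines.extend(f"License-File: {path}" for path in matched)
    return lines


if __name__ == "__main__":
    example_files = ["LICENSE-MIT", "LICENSE-APACHE", "README.md"]
    for line in license_metadata({"license": "MIT OR Apache-2.0"}, example_files):
        print(line)
```

With no license-files key, the default globs pick up both LICENSE-* files and skip README.md. The sketch does no SPDX validation at all; it only shows which key feeds which field.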

The table values for the license key in the [project] table of pyproject.toml are deprecated. Reasoning is part of the PEP.

Please review the examples to get a practical idea of the proposed changes.

There are no immediate hard breaks in backwards compatibility.
The PEP specifies that when distributions contain the new License-Expression field, PyPI must validate and reject uploads that don’t conform to the specification.
The PEP leaves tools a great deal of freedom regarding the advice they give when they detect incorrect license expressions.
Also, the changes will require updates to a few additional specifications, listed in the PEP.


There are two open issues listed that may require further debate:

In the previous thread @pf_moore has raised a concern about the recommended tool to parse and normalize SPDX expressions.

The PEP recommends license-expression (on PyPI) for that.
If it is deemed insufficient (as @ofek mentioned with regard to hatchling), that’s a valid concern.
Since SPDX has become a de facto standard for license declaration in recent years, I gather this could be a candidate for an official lightweight library that would only parse and normalise SPDX expressions.

I’m looking forward to your input.

I’m very much in favour of allowing LicenseRef-<CUSTOM-TEXT>. I use this convention at work for licences which are well defined but aren’t in the SPDX list (I see no reason to get into SPDX inclusion criteria here).

It would be unfortunate for us to have to use a fixed string like LicenseRef-Custom (the licences in question are neither proprietary nor public domain) and fall back on license file detection (which we already do but is brittle).

Thanks for this. I think that the requirement for tools to be able to parse/normalise SPDX expressions needs to be looked at more closely. Based on the comments made by @ofek, I looked at some sizes. Installing just the named package into an empty virtual environment uses the following space:

Package             Size
license-expression  1.2M
flit-core           192K
pdm-backend         400K
hatchling           900K
setuptools          3.5M
meson-python        4.3M

As you can see, the license library would be a significant addition to the small backends, and non-trivial even for bigger ones. Given that builds are typically done in isolated environments, often in docker containers with limited space, this isn’t an issue that should be dismissed lightly. Also, library dependencies for build backends can be problematic - they often need to be vendored in order to avoid bootstrapping issues, and that adds maintenance costs (in this case, users can’t reliably use a new license until the whole ecosystem updates their vendored copies of the license data file).

And worse still, 800K of the license-expression package is simply the data file - which is needed to enforce the “official” capitalisation of license IDs. That seems like a significant overhead for very little benefit. It also suggests that a correct “lightweight library” isn’t actually going to be that much smaller than the current one.

I think it would make more sense to make validity of the license expression a requirement for publishing a package (so that tools like twine and indexes like PyPI must reject invalid licenses) but not for building one (so that build backends don’t need to include a license parsing library). Yes, that might mean that mistyped license expressions in internally-distributed projects could go unnoticed. But as they are private, is that such a big issue?

Maybe someone could write a simple license metadata checking tool, which could be invoked via pre-commit or similar - essentially a linter for license data. That seems like a far better approach than making every build backend responsible for the technicalities of license expressions.
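As a rough illustration of what such a linter could look like, here is a minimal sketch of a recursive-descent checker for SPDX-style expressions. Several things in it are assumptions rather than anything the PEP prescribes: the names `check_expression` and `lint`, the tiny hard-coded identifier sets, the requirement that input is already in normalized case, and the `allow_custom` flag for LicenseRef- identifiers (whether those should be allowed is still an open question here).

```python
import re

# Tiny illustrative subsets (assumption); a real tool would load the full SPDX lists.
KNOWN_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-2.0-only", "GPL-2.0-or-later"}
KNOWN_EXCEPTIONS = {"Classpath-exception-2.0"}

TOKEN = re.compile(r"\(|\)|[^\s()]+")
CUSTOM_ID = re.compile(r"LicenseRef-[A-Za-z0-9.-]+")


class LicenseError(ValueError):
    """Raised when an expression is malformed or uses an unknown identifier."""


def check_expression(text, allow_custom=True):
    """Raise LicenseError unless `text` is a well-formed, normalized expression."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise LicenseError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def atom():
        # A parenthesised group, or one identifier with an optional WITH clause.
        tok = take()
        if tok == "(":
            either()
            if take() != ")":
                raise LicenseError("expected ')'")
            return
        is_custom = allow_custom and CUSTOM_ID.fullmatch(tok)
        ident = tok[:-1] if tok.endswith("+") else tok
        if not is_custom and ident not in KNOWN_LICENSES:
            raise LicenseError(f"unknown license identifier: {tok!r}")
        if peek() == "WITH":
            take()
            exception = take()
            if exception not in KNOWN_EXCEPTIONS:
                raise LicenseError(f"unknown exception identifier: {exception!r}")

    def both():
        # AND binds tighter than OR.
        atom()
        while peek() == "AND":
            take()
            atom()

    def either():
        both()
        while peek() == "OR":
            take()
            both()

    either()
    if peek() is not None:
        raise LicenseError(f"unexpected token: {peek()!r}")


def lint(text):
    """Return a one-line problem description, or None if the expression passes."""
    try:
        check_expression(text)
    except LicenseError as exc:
        return str(exc)
    return None
```

A pre-commit hook could call `lint()` on the license value from pyproject.toml and fail when it returns a message. The grammar is small; nearly all of the weight in a real tool would be the identifier lists.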

On the consumer side of things, the PEP could say that consumers (installers being the obvious, but far from only, case) SHOULD be prepared to deal with malformed license expressions, but MAY simply refuse to process them. That seems to be flexible enough to cater for all levels of interest. And it allows enough flexibility for libraries that read metadata but don’t know what their callers might want to do with it, to make their own API decisions.

That data file compresses to 65K with gzip -9, so the whole package could be trimmed to perhaps <500K without too much fuss. But that’s still bigger than some backends.

edit: because it was easy: a pull request

I disagree. That data file contains a lot of duplicated metadata and a lot of information that is not going to be relevant in the context of parsing a license expression. We don’t need to know the name of the xml file associated with the license ID for example, when parsing the license expression.

Unless I’m misunderstanding, the only information we need for parsing license expressions is the short-form license identifiers (like MIT, BSD-3-Clause etc) and the short-form exception ids (like Asterisk-exception, etc). These two lists would be tiny relative to any of those packages. You can see the entire list of license identifiers in the SPDX License List.

It would also not be particularly difficult to keep this list up to date, since the SPDX folks maintain versioned machine-readable files with all the metadata (the json directory of the spdx/license-list-data repository on GitHub). These files can be used as the source of truth for this process and be regularly regenerated + released.
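The regeneration step could be as small as the sketch below. It assumes the SPDX JSON files keep entries under "licenses" and "exceptions", with "licenseId", "licenseExceptionId" and "isDeprecatedLicenseId" fields. That is my understanding of the published layout and should be checked against the real files. The sample documents are hand-written stand-ins, not real SPDX data, and `extract_ids` is a hypothetical name.

```python
import json

# Hand-written stand-ins for licenses.json and exceptions.json (assumed layout).
SAMPLE_LICENSES = json.loads("""
{
  "licenseListVersion": "0.0-sample",
  "licenses": [
    {"licenseId": "MIT", "name": "MIT License", "isDeprecatedLicenseId": false},
    {"licenseId": "GPL-2.0", "name": "GNU General Public License v2.0 only", "isDeprecatedLicenseId": true},
    {"licenseId": "GPL-2.0-only", "name": "GNU General Public License v2.0 only", "isDeprecatedLicenseId": false}
  ]
}
""")

SAMPLE_EXCEPTIONS = json.loads("""
{
  "licenseListVersion": "0.0-sample",
  "exceptions": [
    {"licenseExceptionId": "Asterisk-exception", "isDeprecatedLicenseId": false}
  ]
}
""")


def extract_ids(licenses_doc, exceptions_doc):
    """Keep only the identifiers; drop names, URLs and other per-entry metadata."""
    entries = licenses_doc["licenses"]
    return {
        "licenses": sorted(
            e["licenseId"] for e in entries if not e.get("isDeprecatedLicenseId")
        ),
        "deprecated": sorted(
            e["licenseId"] for e in entries if e.get("isDeprecatedLicenseId")
        ),
        "exceptions": sorted(
            e["licenseExceptionId"] for e in exceptions_doc["exceptions"]
        ),
    }


if __name__ == "__main__":
    ids = extract_ids(SAMPLE_LICENSES, SAMPLE_EXCEPTIONS)
    # The generated module would contain little more than these three lists.
    print(json.dumps(ids, indent=2))
```

Keeping deprecated identifiers in a separate list lets a tool warn about them without rejecting them outright, which is a design choice of this sketch and not something the thread has settled.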

FWIW, I wouldn’t mind if we ended up with these lists and the corresponding parser stuff living in packaging. There’s already a bunch of parsers in there for specific METADATA fields, so this would be an obvious addition in that regard, and it would fit in well with packaging.metadata’s goals as well.


My suggestion would be that we don’t specify a specific package in the PEP for parsing license expressions, and leave it as an implementation detail that we’d hash out separately. (Not all of the relevant tooling may be implemented such that it can use a Python package, for example.)

And, we need a baseline implementation for modern packaging PEPs to get to accepted state nowadays – this isn’t really an implementation design problem that we need to solve in the PEP’s text and we can leave this detail out of the PEP.


I’d prefer mutually exclusive. I think relaxing the strictness around these fields is something we can do in the future, if maintaining these as mutually exclusive is found to be problematic. We can’t become stricter without a backwards incompatible metadata 2.0 release.

The corresponding section in the PEP states (in support of the current position of “optionally? fill both”):

This would improve backwards compatibility and was suggested by some on the Discourse thread.

I don’t understand what the backwards compatibility benefit of this would be, and would appreciate it if someone could clarify this.

I think so.

As I see it, it’s an escape hatch provided by the folks behind SPDX because they understand that their approach doesn’t cover all possible license situations.

I don’t see any reason why someone might not need such an escape hatch in Python projects[1]. I argue the potential for user confusion is a (low cost) tradeoff in exchange for maintaining compliance with the entire SPDX license expression syntax rather than inventing our own subset of it with all the associated costs. That said, I also won’t mind starting stricter here with the specific allow list of names and expanding that if we get feedback that the restriction is annoying/harmful (which is what @RazerM seems to be suggesting, in their reply above).


  1. Unless Python has some special legal loophole/magic/lawyer-repellent properties, that I don’t know about. ↩︎

Here is the first PR: Add SPDX license data by ofek · Pull Request #799 · pypa/packaging · GitHub

Ah. I hadn’t looked into the implementation to that level. If that’s the case, then a lightweight implementation could well still be possible. And if packaging is willing to implement the necessary parsing and validation, then I’m happy to concede the point. Although I will note that flit-core doesn’t vendor packaging, so validating licenses would affect them more than other backends - so there may still be a reason to consider making validation optional for backends, while still being required for publishing.

I agree with that… That is a very strong requirement to be tackled by build tools.

Build tools already have the restriction that it is not trivial to add dependencies, and every dependency added comes with a cost. Unless there is support for parsing and validating licenses in a library that is already commonly used and vendored in build tools (yes, that would be packaging), I disagree with enforcing any form of validation/normalisation/processing of license expressions at build time.

I have a question that I did not find in the rejected ideas.

I apologise if it has already been discussed in other threads, but it is a bit hard to follow years long discussions :sweat_smile:.

Why is a warning in the absence of License-Expression necessary even if License-File is specified?

In my mind, if License-File is given, every consumer of the package has everything they need to figure out the licensing… At the end of the day, any expression will only be a proxy for the actual text of the license…

Therefore there should be no need to impose on the user to specify License-Expression and/or for tooling to devise logic to backfill the value, right? By that logic, when License-File is present License-Expression should be purely optional with no warning.

From the point of view of usage, I think it is easier for users to just select the license from the GitHub interface when creating a new project (and then it gets automatically picked up by the default glob) than to figure out or remember the correct SPDX expression. It is an extra burden for package developers in a process that many already think is too convoluted (especially people without previous packaging experience).

Now what I wrote above is just my thought process, I am not actually arguing that the PEP should change the way the warnings and optional exceptions work. But I do think that there should be an explanation for that in the rejected ideas at least.

TL;DR: The “Rejected Ideas” section is missing: “Don’t produce a warning or exception for the lack of License-Expression when License-File is given”.

Thank you for the replies!
Let me go through the open points and summarize the current state.

Ad Open Issues:

With the rise of adoption of the SPDX standard it feels to me less of a concern that the custom identifiers may be misused. “Maintaining compliance with the entire SPDX license expression syntax rather than inventing our own subset of it with all the associated costs” which @pradyunsg mentions, is a strong argument for allowing the custom identifiers.
OTOH, the current draft of the specification ensures that license expressions can always be fully validated. Custom identifiers will bring an element of the unknown if authors decide to use them, so the data may not be accurate. Is this an acceptable tradeoff?

I will try to do my best to dig the reasoning out and get back to that.

If the fields were mutually exclusive, should the tools be mandated to validate that there are never both of them present at the same time? Both the build and publishing ones?
(I can’t edit the original post, but License-Expression is an optional field, not a mandatory one, so it is possible to create a valid package that doesn’t produce any licensing metadata).

I’m sorry, I don’t understand the second part - if we don’t recommend any specific tool for validation, would it prevent the PEP from moving further?


So the path would look like this: an author may declare a license expression. If they do, the build tools produce the metadata containing License-Expression field. We expect it is a valid SPDX expression, but at this point there’s no requirement to validate that. Only when the author wants to upload it to the registry, the publishing tools would validate the expression and reject uploads that do not conform.

Looking at the specification, at this point, most of the requirements for build and publishing tools are a “SHOULD”, with the following being a “MUST”:

  • MUST store a case-normalized version of the License-Expression field using the reference case for each SPDX license identifier and uppercase for the AND, OR and WITH keywords.
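For a sense of what that requirement costs a tool, here is a minimal sketch of the case normalization it describes. The lookup table is a tiny illustrative subset, where a real tool would need every SPDX identifier in its reference case, and the function name `normalize` is hypothetical. The sketch rejects anything it does not recognise and does not check the grammar.

```python
import re

# Reference-case spellings, keyed by lowercase form (tiny illustrative subset).
REFERENCE_CASE = {
    ident.lower(): ident
    for ident in (
        "MIT",
        "Apache-2.0",
        "BSD-3-Clause",
        "GPL-2.0-or-later",
        "Classpath-exception-2.0",
    )
}
KEYWORDS = {"and", "or", "with"}
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def normalize(expression):
    """Return `expression` with reference-case identifiers and uppercase keywords."""
    out = []
    for tok in TOKEN.findall(expression):
        low = tok.lower()
        if tok in ("(", ")"):
            out.append(tok)
        elif low in KEYWORDS:
            out.append(low.upper())
        else:
            # A trailing "+" is kept but is not part of the lookup key.
            plus = low.endswith("+")
            key = low[:-1] if plus else low
            if key not in REFERENCE_CASE:
                raise ValueError(f"unknown identifier: {tok!r}")
            out.append(REFERENCE_CASE[key] + ("+" if plus else ""))
    # Re-join with single spaces, then tighten the spacing inside parentheses.
    return " ".join(out).replace("( ", "(").replace(" )", ")")


if __name__ == "__main__":
    print(normalize("mit or (apache-2.0 and bsd-3-clause)"))
```

The logic is trivial; the dependency on the complete identifier list is what makes this a “MUST” that backends would feel.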

If we relaxed this point to SHOULD for build tools and required validation for publishing tools (including PyPI), would this be an acceptable outcome?
Or would you prefer to downgrade all the validation requirements to MAY for build tools?

Changing that MUST to a SHOULD for build tools, and making no other change, would be my first suggestion. I’m not a backend author, so I don’t know how much of an issue it is if the spec says SHOULD rather than MAY (for all of the validation and normalisation requirements). My experience in other areas, though, is that there’s a certain amount of community pressure generated by a SHOULD requirement, which means that developers who don’t want to conform to it have to be prepared to defend their choice, often repeatedly.

What I frankly don’t understand is why making everything formally validated and normalised is so important for this field in particular. I support the idea of saying that License-Expression must be a valid SPDX expression; that seems sensible and in line with other languages. But I don’t see why it’s necessary to go into so much detail over who is responsible for validating and normalising the data.

I know we’re a long way down the road at this point, and I’m not PEP-delegate for this so my opinion is just a personal view, but I’m on record as having thought from the start that this proposal is over-engineered. I think it’s got a lot better, but I’d still prefer it to be much more lightweight.

Is there a reason why License-Expression should be optional? Shouldn’t it be mandatory, to help license consumers know what they are dealing with? License-Expression is much easier to consume than a license file. AFAIK, detecting with certainty what license is contained in a license file is not a simple task. First of all, I don’t think there is a common format for license files, which means that they can contain multiple licenses, etc. At least if these packages were forced to set an expression, consumers could know more easily what license packages are using.

No. Nothing should be mandatory. That’s not what this ecosystem is about. You do as much as you can or want to do to help the people who are sharing in your code, but that’s the extent of the obligation. PyPI has terms in place to ensure that all packages are sufficiently licensed for one level of distribution.

If this change makes it easier to do the right thing, great! But we should not make something mandatory without a solid practical reason (i.e. package is not installable by anyone for technical reasons, as opposed to merely being underinformed about licensing).

+1

Couldn’t it be made mandatory at least if License-File is specified? One common issue I’ve encountered is that a project specifies a GPL2 license file, but then it is unclear whether it is GPL2-or-later or GPL2-only. This would eliminate that confusion from the start of the project definition.

Hang on. IANAL, but surely if there’s a license file, that is precisely what applies? A license expression is only ever an abbreviation of the intent (or in the absence of a license file, a reference to a standard set of terms). Having a GPL2 license file and a license expression that says something that’s not in that file doesn’t “eliminate confusion” - what it actually does is add questions over which is considered authoritative. And that’s the point when you need a lawyer, and you need to have a conversation with the project author.

This sort of subtlety is precisely what we don’t want the standards to get sucked into - we’re providing a place where a project can record some data if they want, not offering any sort of legal framework or proforma for how to license your project.

The issue is that the GPL license file is identical for both cases. The “or later” clause is there to indicate that it is allowable, but it’s on the project to define whether it’s GPL-only or or-later. In the Linux project it’s clarified in the Copying file. With SPDX, you are meant to specify whether it’s “only” or “or-later”, since plain GPL2 is deprecated.

I think License-Expression should be mandatory and authoritative when the License-File is specified. It is the project’s duty to make sure the license file and expressions match. It would help both on the downstream packaging side and on the consumer side when they need to bundle a project.

That’s an issue for the project and/or the GPL people. The project has to decide what their intent is, and how to express it. The Python packaging standards shouldn’t get involved in it (because we don’t have the legal expertise, nor should we need it).

What if the license file is for GPL, but the expression says MIT? Or if the license file states a license that isn’t expressible as a SPDX expression? Do we want to get sucked into the legal implications if we say things like “expression takes precedence over license file”?

All of which brings up another point. The standard must allow for arbitrary licenses somehow. Python is used in corporate environments where all sorts of proprietary and restrictive licenses are used. We absolutely cannot have a standard that prohibits creating or publishing such packages - yes, they won’t be published on PyPI (the PyPI rules already prohibit that, I believe) but that doesn’t mean they can’t be published on a custom index, for example explicitly built to distribute licensed software to paying customers.

PyPI’s terms basically say “if your license doesn’t allow the PSF+mirrors to redistribute, you grant that license specifically to us when you publish to PyPI”. So you can redistribute on PyPI without granting recipients permission to further redistribute. (I believe this is exactly the kind of thing something like Intel MKL would have relied upon some years ago when they didn’t want people distributing the libraries outside of an application.)

But yeah, we can improve license clarity for common cases, but we can’t block out uncommon cases, no matter how much we may disagree with them.

Maybe authoritative was a bad choice there. I didn’t mean it from a legal standpoint, but rather in terms of trust, i.e. when scraping the license, e.g. when creating a Fedora package from PyPI, it would be a trusted source. It should still be mandatory, because license files alone are not sufficient for determining the accurate license.

If they differ it’s an issue with the project and not the packaging or specification.

Proprietary, unknown and custom licenses that are not specified in SPDX should absolutely be supported. There don’t seem to be codes for these in the standard (partly understandable, e.g. how would a custom license change to a coded license), but couldn’t the expression expand on this?
