Improving license clarity with better package metadata

There has been some recent discussion about improving the clarity of license-related metadata.

What’s the problem? We have two license fields (License and license-related Classifiers) and yet there are cases where we cannot convey a clear license.

I have started a draft PEP to foster a discussion on how we could improve this:

Copying a comment I made on one of the github issues, I think it’s important that we don’t insist that package authors become license lawyers in order to publish packages. There needs to be a really easy entry point for people who really don’t care.

At the moment, that’s probably “Hmm, license - (does some googling, picks something that superficially seems like what they want) - License: MIT (or maybe GPL, or BSD - something very common and generic)” Saying “Enter a SPDX license expression” to such a user would be immensely off-putting.

@pf_moore You are right: MIT is a perfectly valid license expression FWIW, so that should be easy in the easy case. WTFPL is also a valid one if that’s your style. So I do not think this should put off anyone.
I may also prefer to re-purpose the existing License field and have tools provide warnings to gently nag users for a start, rather than having a separate License-Expression field.

That’s part of the problem… BSD means many things as there are many variants. GPL is also vague as it could mean any version of the GPL ever published.

I think SPDX expressions are fairly human friendly in the simplest cases. Here are some examples:

  • BSD-3-Clause
  • MIT
  • LGPL-3.0-only
  • LGPL-3.0-or-later
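
To illustrate the difference between a vague name and a precise identifier, here is a minimal sketch that checks a string against a small set of SPDX ids. The `KNOWN_SPDX_IDS` set and the `is_known_id` helper are hypothetical names for this example only, and the set is a tiny subset of the real SPDX list.

```python
# Illustrative subset only; the real SPDX license list is much longer.
KNOWN_SPDX_IDS = frozenset({
    "MIT",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "LGPL-3.0-only",
    "LGPL-3.0-or-later",
})


def is_known_id(value: str) -> bool:
    """Return True if value is exactly one of the ids in the subset above."""
    return value.strip() in KNOWN_SPDX_IDS


# "BSD" and "GPL" are families of licenses, not identifiers, so they fail.
for candidate in ["MIT", "BSD", "BSD-3-Clause", "GPL"]:
    print(candidate, is_known_id(candidate))
```

A bare "BSD" is rejected because it does not say which variant is meant, while "BSD-3-Clause" names exactly one license text.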

You should put me down as BDFL-Delegate, unless someone else wants to propose themselves (I’m the default BDFL delegate for packaging, although I can delegate onwards to someone else if appropriate).

Also you should have a PEP sponsor. I’m happy to sponsor the PEP (which basically means, I’ll help and advise you with any admin around the PEP process itself) if you wish.

I have some questions and concerns about the PEP itself, but I’ll make a separate post for them.

Some general questions on the PEP.

  1. The new field is optional. But by deprecating all other license fields, and mandating that this field must follow SPDX rules, you’re in effect leaving package developers who want a license that’s not on the SPDX list (in particular, developers of proprietary packages) no option other than to omit license metadata altogether. That seems to go against the general intention of making license information more accessible.
  2. You say that tools should validate and warn if the field doesn’t follow SPDX rules. What tools? Package creation tools (setuptools, flit)? Or package upload tools (twine)? Or indexes themselves (warehouse, devpi)? The PEP isn’t clear on where this validation should take place.
  3. There’s also the question of how tools do this validation. Is there a reference library (in Python)? From what I read on the SPDX website, the rules seem to be based around “the definitive list of license IDs is obtained from our website”. So what options, if any, exist for offline (or behind a firewall) checking?
  4. What happens if the SPDX rules change? Or if SPDX drop an existing license from their list? A strict reading of the PEP is that existing packages on PyPI (or elsewhere) may suddenly have invalid metadata, through no fault of their own.

On a more general note, the PEP reads as if the longer term intention is to become more strict on what licenses are valid and/or acceptable. This is something we need to be very careful about. It’s one thing for PyPI to change its rules, for example to only allow upload of open source packages (although that would be a separate proposal), but it’s quite another to lock down the metadata that expresses a package author’s intention (independent of where they plan to publish their code). I think the wording of the PEP needs to be improved to make it clearer that authors aren’t being expected to choose only SPDX-sanctioned licenses, and how they “opt out” if they so wish. (Or, it needs to be more open about the long term intent, and you need to be prepared to address any concerns and/or complaints).

(Note - I’m using the phrase “SPDX rules” here, rather than just “SPDX syntax”, because the fact that SPDX syntax includes a list of allowable license names makes it more than just syntax, IMO, and takes it into “what licenses are acceptable” territory).

Thank you very much. This is very kind of you. I added your name as BDFL-Delegate to the draft PEP. Your sponsorship is very welcome too!

Just want to reiterate that in PEP 566, the latest metadata specification, we changed the canonical source for field specifications to the Core Metadata Specification reference document. This was specifically to eliminate the need to create a new PEP every time we wanted to change the specification.

As such, I don’t think that creating a new PEP is required here, just this discussion and an eventual PR against pypa/packaging.python.org.

EDIT: This is wrong, we still need a PEP.

(cc @ncoghlan).

Fair point, but I’d view it as (1) allowing a streamlined discussion process that doesn’t need a formal PEP, and (2) allowing non-controversial changes to be made without requiring the mechanism for “big” changes.

In this case, I don’t think the change is non-controversial, so we need some process. I’m happy if it’s not ultimately a PEP, but I think writing the proposal up in “PEP form” and presenting it here is far from wasted effort. I also think we need someone to make the final judgement on whether the proposal should be accepted. Everything else is just admin, agreed, and we should ditch it if it doesn’t help. Although where should we store the final version of the document? It will have useful background and explanation that certainly wasn’t in the original PR. IMO, writing the conclusions of the discussion up is valuable - expecting people to trawl Discourse for information isn’t ideal…

If anyone else wants to nominate themselves as “final arbiter” on the proposal (I’ll avoid using the term BDFL-delegate, if we want to avoid PEP-related terminology) I’d be happy with that. I don’t want to give the appearance of using my authority to promote a personal agenda. Or are you suggesting that we don’t need a decision maker at all, and as long as anyone with commit rights on the pypa/packaging.python.org repository (which, by the way, doesn’t include me :slight_smile:) agrees with the change, it can be applied?

BTW, if everyone but me thinks the proposal is non-controversial, that’s fine - a consensus of “it’s fine, let’s just approve it and move on” is perfectly valid :slight_smile:

I am split on whether to make this optional or not. On the one hand, it does not make sense to have no license. There is always one even if not stated, so it should be mandatory.

On the other hand, there are times when this needs to be stated and times when this does not matter. In particular, as long as a package is not published, this should be optional. So making this mandatory when publishing to PyPI (e.g. going through twine and the warehouse API) and optional otherwise could be an option.

The other thing (and maybe it should be addressed in the PEP) is that we may need some transition period to go from optional to mandatory?

On the topic of proprietary licenses, we could support extra identifiers for this and this is an area that needs development as I was explaining in this PR (packaging.python.org/pull/635#issuecomment-521544448)

There are some constraints in the short term for dealing with licenses that are not on the SPDX license list. This includes the common public domain dedications and any kind of proprietary licenses. A solution might be the emerging SPDX notion of “license namespaces”.
Another one would be to support a few extra license ids to capture these (that’s the way npm has dealt with this). But that would break strict adherence to the SPDX spec.

The npm folks resolved this by using an “UNLICENSED” id (which is IMHO a bad move, as this is way too close to the “UNLICENSE” license id) and a “SEE LICENSE IN” form, which I am not a big fan of either.
(See docs.npmjs.com/files/package.json#license)

Having an explicit License-File field would be better if we go down this route IMHO (and there have been specific things done on the wheel side to support multiple license files).

I am fine either way. In the end the strict changes to the metadata spec may not be big, but the implications for tools and users may be so I am trying to foster a sane discussion first. And we could go either way with or without a PEP as long as the discussion and debate was there IMHO.

I concur: having a single document that summarizes the essence of the changes, issues and discussions (in a PEP or a PEP-like document) is IMHO a good thing for such a change.

I guess projects won’t need to do anything until they update to metadata version 2.3 (assuming 2.2 introduces this as optional, and 2.3 makes it mandatory). So that’s probably sufficient transition process. Existing files will have older metadata versions and so will remain perfectly valid. Only newer files (generated by versions of the build tools that have been updated to emit metadata version 2.3) will have the new metadata, and the build tools can validate that (it would be a decision for the tool, as to whether omission of the “License Expression” data would be treated as an error, or simply as a signal to emit metadata version 2.2 rather than 2.3…)
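
As a rough sketch of the second option mentioned above (falling back to the older version instead of raising an error), a build tool could pick the metadata version from whether a license expression was supplied. The version numbers 2.2 and 2.3 are the hypothetical ones from this post, and `choose_metadata_version` is an invented helper name, not an existing API.

```python
from typing import Optional


def choose_metadata_version(license_expression: Optional[str]) -> str:
    """Pick the metadata version a build tool would emit.

    Assumption from the discussion: 2.2 treats the license expression as
    optional, 2.3 makes it mandatory.
    """
    if license_expression and license_expression.strip():
        # The mandatory field is available, so the newer version can be emitted.
        return "2.3"
    # No expression supplied: fall back rather than fail the build.
    return "2.2"


print(choose_metadata_version("MIT"))
print(choose_metadata_version(None))
```

A stricter tool could raise an error in the fallback branch instead; which behaviour is right would be a decision for each tool.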

It’s worth noting that consumers of package metadata won’t be able to assume that this value is present, even if it’s made mandatory - they’ll still have to deal with projects using older metadata standards.

That begs the question, where will tools find details of how to validate this field, if we’re adding our own extensions? Ideally, we’d need a library implementation for this, that tools could share.

In terms of the process, isn’t the draft PEP only supposed to be submitted to the official peps repo once the sponsor has deemed it ready? (Here is a link to the wording in PEP 1, the “Submitting a PEP” section.) IIRC, that language is there to avoid having PRs in the repo that are still in a preliminary state.

Chris:
I am fine with closing the pending PR and resubmitting when we are ready. There is a fringe benefit to the PR though: the file is checked for consistency and contributors are checked for CLA.

Paul:
I maintain a Python implementation of a license expression handling library at https://github.com/nexB/license-expression/ and also on PyPI (Apache-licensed). This is being integrated in the fuller Python suite of SPDX utilities (that I maintain) by a Google Summer of Code student at https://github.com/spdx/tools-python/ and is used in a few other projects such as the Free Software Foundation Europe reuse tool (at reuse.software) and the scancode-toolkit (that I also maintain).

So this should be a base implementation and if we go that route I would happily donate that to PyPA. It should be a good way for tools to integrate license expression parsing/validation/normalization/rendering and is flexible enough to accept the SPDX list of license ids and any extra additions.

Any tool that reads and writes core metadata should be able to do some validation of an SPDX license expression, but IMHO the most effective place to provide user guidance would be client-side first, with a clean and clear nag message when the expression is invalid:

  1. at package creation time (e.g. setuptools and similar)
  2. at package publishing time (e.g. twine and similar)

Indexes would want to validate too if they validate core metadata: e.g. warehouse already validates some fields, such as Classifiers.
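
A minimal sketch of what such a nag could look like, using only the standard `warnings` module. The `check_license_field` helper, its `strict` flag and the tiny `KNOWN_IDS` set are assumptions made for this example; the idea is that a client-side tool would warn, while a stricter consumer could choose to reject.

```python
import warnings

# Illustrative subset only; a real tool would use the full SPDX list.
KNOWN_IDS = frozenset({"MIT", "BSD-3-Clause", "Apache-2.0"})


def check_license_field(value, known_ids=KNOWN_IDS, strict=False):
    """Return True for a known id; otherwise warn, or raise when strict."""
    if value in known_ids:
        return True
    message = (
        f"License {value!r} is not a recognized SPDX identifier; "
        "consider a precise one such as 'MIT' or 'BSD-3-Clause'."
    )
    if strict:
        # Hypothetical strict mode, e.g. for a tool that rejects bad metadata.
        raise ValueError(message)
    # Default mode: nag gently and let the build continue.
    warnings.warn(message, UserWarning, stacklevel=2)
    return False
```

With `strict=False` the package still builds and the author only sees a warning, which keeps the entry point easy for people who do not care much about licensing details.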

As mentioned above, I maintain a library (“license-expression” on PyPI) that is the de facto reference in the not-so-crowded space of license expression tools. It can work with a list of licenses offline (e.g. by bundling the SPDX list as JSON package data).
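
To give an idea of how offline checking against a bundled list could work, here is a standard-library-only sketch. It does not use the license-expression library and does not reflect its API. The embedded JSON is a stand-in that assumes the `licenses` / `licenseId` layout of the SPDX JSON list; the helper names are invented, and only `AND`, `OR` and parentheses are handled (`WITH` exceptions and grammar checks are out of scope).

```python
import json
import re

# Stand-in for a bundled copy of the SPDX list; the real file is far larger.
BUNDLED_LIST_JSON = """
{
  "licenseListVersion": "illustrative",
  "licenses": [
    {"licenseId": "MIT"},
    {"licenseId": "BSD-3-Clause"},
    {"licenseId": "Apache-2.0"},
    {"licenseId": "LGPL-3.0-or-later"}
  ]
}
"""

OPERATORS = {"AND", "OR"}
PARENS = {"(", ")"}
# A token is either a single parenthesis or a run of non-space, non-paren chars.
TOKEN = re.compile(r"[()]|[^\s()]+")


def load_ids(raw_json):
    """Extract the set of license ids from the bundled JSON text."""
    data = json.loads(raw_json)
    return {entry["licenseId"] for entry in data["licenses"]}


def unknown_ids(expression, known_ids):
    """List identifiers in the expression that are not in known_ids.

    This only looks up identifiers; it does not verify the expression grammar.
    """
    return [
        token
        for token in TOKEN.findall(expression)
        if token not in OPERATORS
        and token not in PARENS
        and token not in known_ids
    ]


ids = load_ids(BUNDLED_LIST_JSON)
print(unknown_ids("MIT OR (BSD-3-Clause AND Apache-2.0)", ids))
print(unknown_ids("MIT OR BSD", ids))
```

Because the list travels with the package, no network access is needed at validation time, which addresses the firewall concern raised earlier.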

Actually, it would make sense and be a sane thing to have a package that implements the validations suggested in the PEP using both a bundled SPDX list and extra ids for proprietary and public domain. Let me work on that too.

Paul:
This is a valid question: the SPDX list and rules have been fairly stable and we are dealing with a responsible group of seasoned FOSS contributors and professionals (I am one of the co-founders). That said, I participated (heavily) in the effort of using SPDX ids to clean up and rationalize the licensing of the Linux kernel, and what we decided then was to actually freeze/copy down the SPDX rules in the kernel documentation such that we would be insulated from any unwanted changes in the future and could adopt such changes (if any) in an orderly fashion.

We could very much have the same approach here. That said, since npm, Rubygems, Composer, the Linux kernel, and so many projects depend on these rules and license ids today, there would be total uproar if the SPDX group went rogue. Rather than make the PEP heavier, we can more simply reference exact versions of the specs and list rather than copying a subset of it.

Or if SPDX drop an existing license from their list?

The SPDX group does not drop any license from their list: as a rule, they mark a license as deprecated instead. See the “Deprecated License Identifiers” at the bottom of this page: https://spdx.org/licenses/ though this is a rather rare event (the biggest of such events are two past events linked with the adoption of “license exceptions” on the one hand, and a request of the FSF on the other hand on how to better deal with clarity around the GPL and related licenses).
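
Since identifiers are deprecated rather than removed, a validator can keep accepting older metadata while nudging authors toward the current id. A minimal sketch under that assumption follows; `LICENSE_TABLE`, `IdStatus` and `classify` are hypothetical names, and the table is a small hand-written excerpt rather than the real list.

```python
from enum import Enum


class IdStatus(Enum):
    CURRENT = "current"
    DEPRECATED = "deprecated"
    UNKNOWN = "unknown"


# Maps license id -> is_deprecated. Hand-written excerpt for illustration:
# "GPL-2.0" and "GPL-2.0+" were deprecated in favour of the explicit
# "-only" / "-or-later" forms.
LICENSE_TABLE = {
    "GPL-2.0-only": False,
    "GPL-2.0-or-later": False,
    "GPL-2.0": True,
    "GPL-2.0+": True,
    "MIT": False,
}


def classify(license_id, table=LICENSE_TABLE):
    """Classify an id as current, deprecated (still valid), or unknown."""
    if license_id not in table:
        return IdStatus.UNKNOWN
    return IdStatus.DEPRECATED if table[license_id] else IdStatus.CURRENT


for license_id in ["GPL-2.0-only", "GPL-2.0", "GPL"]:
    print(license_id, classify(license_id).value)
```

Under this scheme, metadata that was valid when published stays recognizable later; a tool might warn on the deprecated status and treat only the unknown status as invalid.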