PEP 639, Round 3: Improving license clarity with better package metadata

steve.dower · May 14, 2024, 10:01pm

By leaving the expression field as optional, so that a custom license can be easily identified by the absence of the expression field.

pradyunsg · May 14, 2024, 11:18pm

No.

What would prevent this PEP from moving forward is the lack of an implementation of its design in a backend. And, to that end, my suggestion is to not specify a library for validation (eg: in case a new/better/different-design library comes out) and instead leave it as an implementation detail for the build backends to figure out entirely. That said, I also don’t feel strongly about this specific suggestion since (unless I’m misreading) the use of this library isn’t written to be mandatory anyway.

Lecris · May 15, 2024, 10:05am

Spit-balling here, could it be an empty string "" instead? What I am trying to do here is nudge project owners to populate this field, even if they have to populate it with an empty value. That would indicate that the project owner has checked the license against the available SPDX identifiers and confirmed it is one of custom/unknown/proprietary.

Other feedback from Fedora matrix forum:

it might be interesting to reference what Rust / cargo has ended up with:
both package.license and package.license-file fields in metadata exist.
the former is a free-form string but it’s supposed to be a valid SPDX expression.
the latter is only supposed to be used if the project license is “special” and cannot be expressed with SPDX.
both values cannot be present in the same project.
(for reference: The Manifest Format - The Cargo Book )
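
To make the either/or rule described above concrete, here is a minimal Python sketch. It only mirrors the rule as summarised in this feedback, not Cargo's actual implementation, and the function name `check_cargo_style_license` is hypothetical.

```python
def check_cargo_style_license(license_expr, license_file):
    """Return which field is in use, enforcing the either/or rule.

    Assumption: both arguments are optional strings, and supplying both
    is an error, as the feedback above describes.
    """
    if license_expr and license_file:
        raise ValueError("specify either 'license' or 'license-file', not both")
    if license_expr:
        return ("license", license_expr)
    if license_file:
        return ("license-file", license_file)
    # Neither field was provided.
    return None
```

A project with a standard license would pass only the expression, while one with a "special" license would pass only the file path.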

abravalheri · May 15, 2024, 11:53am

If I understand correctly, developers who opt in to this license are supposed to specify which version they are using in the “license notes” each file should contain (and/or in COPYING/NOTICE files).

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and “any later version”, you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

GNU General Public License v2.0 - GNU Project - Free Software Foundation (FSF).

So probably searching for “any later” is a good indicator.
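
A rough sketch of that heuristic, assuming it is run on the per-file license notes and not on the full license text (which itself contains the phrase “any later version” in the paragraph quoted above). The function name and the fallback to `GPL-2.0-only` are my own simplifications, not something any tool is known to do.

```python
import re

# Tolerate line wrapping and odd spacing inside the phrase.
_OR_LATER = re.compile(r"\bany\s+later\s+version\b", re.IGNORECASE)


def guess_gpl2_identifier(notice_text: str) -> str:
    """Guess an SPDX identifier for a GPLv2 license note.

    Heuristic only: a note that mentions "any later version" is treated
    as GPL-2.0-or-later, anything else as GPL-2.0-only.
    """
    if _OR_LATER.search(notice_text):
        return "GPL-2.0-or-later"
    return "GPL-2.0-only"
```

This is only an indicator: a note that specifies no version at all would need a human to look at it.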

To be honest my personal opinion is that it should not be mandatory…

I believe that the SPDX index is at best a proxy for the text of the license and license notes that the user includes in their software.

It is good if they decide to add an SPDX index, but it is also very good if they only add the correct license files and notes. We should strive to lower the barrier to entry for package authors, and this looks like a transfer of responsibility from consumers to authors.

abravalheri · May 15, 2024, 12:00pm

I think the relaxation from MUST to SHOULD is not enough for build-backends^[1]. MAY would be better. No need for validating would be the best.

Unless the tools for making validation/parsing/etc are made available without need for extra dependencies (e.g. via stdlib or maybe packaging – with the caveat that even if they are in packaging some backends might not be able to use). ↩︎

ofek · May 15, 2024, 2:52pm

Karolina Surma:

RazerM:

I’m very much in favour of allowing LicenseRef-<CUSTOM-TEXT>. I use this convention at work for licences which are well defined but aren’t in the SPDX list (I see no reason to get into SPDX inclusion criteria here).

With the rise of adoption of the SPDX standard it feels to me less of a concern that the custom identifiers may be misused. “Maintaining compliance with the entire SPDX license expression syntax rather than inventing our own subset of it with all the associated costs” which @pradyunsg mentions, is a strong argument for allowing the custom identifiers.
OTOH, the current draft of the specification ensures that license expressions can always be fully validated. Custom identifiers will bring an element of the unknown if authors decide to use them, so the data may not be accurate. Is this an acceptable tradeoff?

FYI Hatchling does not allow for arbitrary text there but rather does this: hatch/backend/src/hatchling/licenses/parse.py at hatchling-v1.24.2 · pypa/hatch · GitHub

pradyunsg · May 15, 2024, 8:34pm

I strongly disagree – not performing validation is the same as having an unstructured field for this purpose. If we’re doing that, we might as well not introduce a new field here.

abravalheri · May 15, 2024, 8:59pm

I am referring to validation in the build-backend only, because of the restrictions/difficulties of bringing dependencies in.

I am not referring to other tools like package indexes (I don’t have an opinion about those, because I don’t have experience implementing them).

pradyunsg · May 15, 2024, 9:08pm

I understand that – to elaborate a bit on what I’m saying: I disagree that this should not be marked as a thing that build-backends should do. We can pick between “MUST” (required), “SHOULD” (recommended) and “MAY” (optional) but we should not have it such that backends are not expected to do anything with this metadata.

The build-backend is the best place to provide immediate feedback to users, since it is the piece that converts arbitrary configuration provided by the user (via TOML, Python, what-have-you) into the Core Metadata format that is consumed downstream by all the other pieces. This means that the build-backend is best placed to provide explicit, clear and actionable error messages to users about what their misconfiguration is. Moving this to the package index effectively makes it significantly more likely that, for example, projects on package indexes other than PyPI don’t actually end up having this (eg: because not every index ends up enforcing this) which would mean that installers might end up being stricter than build-backends, and they can’t provide particularly actionable errors at that point.

IMO, it would be better to have this be specified as SHOULD (i.e. recommend) backends to implement it. If a backend has a strong reason to not do so, they’re fine to – IMO it’ll just be clearer to end users to have the backends do this and I expect it’ll be a better UX overall as well.

abravalheri · May 15, 2024, 9:17pm

I strongly disagree with creating this expectation of build backends by using wording like SHOULD.

In an ideal world that would be nice to do. But we have to be realistic with what we can achieve and with the responsibilities that we can take^[1].

As I previously mentioned, I have no problem if this validation/parsing/processing etc comes from stdlib or dependencies that are already vendored by all build backends. ↩︎

JeanChristopheMorinPerso · May 16, 2024, 1:35am

Thanks for clarifying. I’m sorry if my question/comment sounded like I want to force anything on this community. I replied too quickly and should have explained my thoughts a little bit more.

The way I see things right now is that there are two different audiences for the PEP. One is package producers (package maintainers) and the other is package consumers. Producing a package with a correct license is not too difficult today. Consuming packages and determining what licenses they use is a different story though.

The thing that makes consuming this information hard is how underspecified and unstructured the current core metadata spec is. PEP 639 would greatly improve the situation by supporting an SPDX expression.

Now my thinking is this: If the License-Expression field is optional, wouldn’t it put us in the same situation as before? An SPDX expression is a well known structured format while license files aren’t. So if a project was to decide that they don’t want to set License-Expression, then users of the package would have to resort to parsing license files to know what the license is at a quick glance.

Without a License-Expression on everything, tooling around license management will have a difficult time doing its job. For example, I can imagine that pip or another installer could grow functionality that would allow a user to limit which licenses are allowed to be installed. I believe having a License-Expression on each package would help manage supply chains more easily. It would obviously not solve all problems, because there can still be discrepancies between what a package says its license(s) are and what they really are, but at least it would be a start.
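
To illustrate the kind of installer-side filter I have in mind, here is a small Python sketch. Nothing like this exists in pip as far as I know; the function names are hypothetical, and the check is deliberately conservative: every identifier mentioned in the expression must be on the allowlist, without evaluating AND/OR semantics.

```python
import re

_OPERATORS = {"AND", "OR", "WITH"}


def license_ids(expression: str) -> set[str]:
    """Collect the identifiers mentioned in an SPDX-style expression."""
    tokens = re.findall(r"[A-Za-z0-9.+:-]+", expression)
    # Drop operators; parentheses are never matched by the pattern.
    return {t for t in tokens if t.upper() not in _OPERATORS}


def is_allowed(expression, allowed) -> bool:
    """True only if every identifier in the expression is allowlisted.

    A missing expression is treated as "not allowed", which is a policy
    choice for this sketch and not something the PEP mandates.
    """
    if not expression:
        return False
    return license_ids(expression) <= set(allowed)
```

For an installed distribution, the expression could be read with `importlib.metadata.metadata(name).get("License-Expression")`, which returns `None` when the field is absent. Note that exception identifiers after WITH are also treated as identifiers here, so they would need to be on the allowlist too.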

If all packages have a License-Expression, PyPI and other indexes would be able to show it to users in their UI. Without an expression, these services would have to either omit the license entirely or display all the license files. Neither option is user friendly. Another alternative would be for these services to parse the license files and try to determine what license they really are, which is not a simple thing to do and is probably impossible to do accurately.

I hope this clarifies my initial reply!

I might be naive (which is possible) and missing context (again possible), but why would we get sucked into legal obligations? This already happens today in lots of projects. I see this almost on a daily basis when packaging stuff at Anaconda. In what way would a different License-Expression and License-File be different than a currently differing classifier and license file? If it’s not a concern today, why would it be a concern with PEP 639?

This is already supported by the SPDX standard by using LicenseRef-MyCustomLicenseHere, see Clause 10: Other Licensing Information Detected - specification v2.3.0.
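
As a sketch of what checking such a custom identifier could look like, the pattern below follows my reading of the SPDX 2.3 license expression grammar, where the part after `LicenseRef-` is one or more letters, digits, `.` or `-`, optionally preceded by a `DocumentRef-...:` prefix. The function name is hypothetical.

```python
import re

# idstring = letters, digits, "." and "-" (at least one character).
_LICENSE_REF = re.compile(
    r"(DocumentRef-[A-Za-z0-9.-]+:)?LicenseRef-[A-Za-z0-9.-]+"
)


def is_license_ref(identifier: str) -> bool:
    """Check whether a string is shaped like an SPDX custom license reference."""
    return _LICENSE_REF.fullmatch(identifier) is not None
```

This only checks the shape of the identifier; it says nothing about whether the referenced license text actually ships with the package.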

pf_moore · May 16, 2024, 7:27am

Because you are suggesting that we make a statement about which takes priority. If 2 parties get into a dispute because they disagree over which value takes precedence, how well we publicise our rules could be an important factor in the decision. As I say, IANAL, but I’d rather keep out of that.

ncoghlan · May 16, 2024, 10:02am

As a concrete example, I would really appreciate it if the package metadata for the nVidia CUDA packages and the Intel MKL packages could meaningfully specify their real licenses. “Other/Proprietary”, while technically accurate, is not particularly useful. ^[1]

Since these are proprietary licenses they’re never going to get actual SPDX identifiers (see New license request: Intel-SmpL-2018 [SPDX-Online-Tools] · Issue #838 · spdx/license-list-XML · GitHub for example), but nVidia and Intel could pick their own (hopefully meaningful) strings as long as custom licenses are allowed.

With the rules for acceptable PyPI uploads only requiring artifact distribution rights rather than full open source licensing, we can reasonably assume these aren’t going to be the only packages where there’s no applicable standard SPDX license identifier.

For those that are curious the Intel MKL wheels appear to be covered by https://www.intel.com/content/www/us/en/developer/articles/license/end-user-license-agreement.html#inpage-nav-2 and the various nVidia CUDA wheels are covered by License Agreement for NVIDIA Software Development Kits — EULA ↩︎

pradyunsg · May 16, 2024, 12:14pm

Could I ask you to elaborate on the reasons for this? I don’t quite understand what you’re concerned about here.

I’ll say that, the current tentative implementation approach seems to be to end up implementing the parser for this in packaging, which does meet your “already vendored” mention in the footnote-rendered-inline thingie.

jamestwebber · May 16, 2024, 1:37pm

Even if it were mandatory, I don’t think it would be back-filled to existing packages on PyPI^[1]. So there will be a mix of packages with and without this classifier in either scenario.

Your examples make sense: I can see how using the License-Expression classifier as a filter would be very useful for some users. I do think it would be useful even if it wasn’t universally adopted, though–assuming that new packages gradually adopt usage, over time the majority of maintained packages will have such a classifier. It wouldn’t entirely eliminate the work of verifying individual licenses but it would greatly reduce it.

Packages usually aren’t modified like that, and also who would do it? ↩︎

abravalheri · May 16, 2024, 1:54pm

I kind of summarised the reasons before in PEP 639, Round 3: Improving license clarity with better package metadata - #8 by abravalheri and PEP 639, Round 3: Improving license clarity with better package metadata - #29 by abravalheri.

But yes, if it is possible for a backend to implement the validation/parsing (or any kind of processing needed) without pulling in any extra dependency (or having to reimplement it from scratch), then the issue with feasibility is gone, and then I have no problem with using a SHOULD.

fungi · May 16, 2024, 1:59pm

Even if it were mandatory, I don’t think it would be back-filled
to existing packages on PyPI^[1]. So there will be a mix of
packages with and without this classifier in either scenario.

If it were to be made “mandatory” then that would be in the context
of “mandatory when using a newer metadata version” and so even new
uploads for packages declaring older metadata versions would still
lack it; it wouldn’t just be those which were previously uploaded.
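
A small sketch of that gating, purely for illustration. The threshold of 2.4 and both function names are assumptions made for the example, and the field is only "required" here to show how a version-scoped rule would behave.

```python
def requires_license_expression(metadata_version: str, threshold=(2, 4)) -> bool:
    """True if the declared metadata version is at or above the threshold.

    Assumption: the threshold (2, 4) is a placeholder for whichever
    metadata version would introduce the requirement.
    """
    major, minor = (int(part) for part in metadata_version.split("."))
    return (major, minor) >= threshold


def check_upload(fields: dict) -> list:
    """Return a list of problems for one upload's core metadata fields."""
    problems = []
    needs_field = requires_license_expression(fields["Metadata-Version"])
    if needs_field and not fields.get("License-Expression"):
        problems.append("License-Expression is required for this metadata version")
    return problems
```

An upload declaring an older metadata version passes with no License-Expression at all, which is exactly why new uploads could still lack it.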

Packages usually aren’t modified
like that, and also who would do it? ↩︎

ncoghlan · May 16, 2024, 11:59pm

On the mandatory-or-optional front, publishing tools can’t realistically enforce the new field being semantically correct, they can only enforce syntactic validity.

This means making it mandatory is likely to make the field SNR worse rather than better, since rather than simply omitting the field when they haven’t fully thought through their licensing choices, publishers will be forced to pick a value that makes the tools happy.

By leaving the field as optional at a syntactic level, policy definitions can be left to the entity that has the most information on what matters for their use case: the folks actually downloading and installing packages (or the organisations they work for).

For the consumers most concerned about licensing details, even the new better specified field will only be viable as a first pass filter, since packages with acceptable nominal licences will need further analysis to check if their overall licence includes other terms (e.g. due to vendored libraries).

That doesn’t make the proposal useless, it just means making the new field syntactically mandatory would make it less effective at its intended purpose (allowing publishers that have put thought into their licensing choices to make those choices explicit in their published metadata) rather than being helpful.

ksurma · May 20, 2024, 9:42am

Thanks all! That’s a lot of interesting points raised.

How I see this: If there’s an expectation of a specific format of a value, it makes sense to have it validated somewhere. The field can be validated for semantics and normalized automatically, so there’s no need to leave that to project authors’ eyes (it’s easy to make a typo and carry it forever).
I lean towards @pradyunsg’s opinion that it feels reasonable to create an expectation that build tools will perform that validation. They could choose not to (e.g. being a lightweight build backend by design may be a sufficient reason to opt out of it). If the tools detect an invalid expression, they may block the users via an error, or they may warn that this could cause problems when distributing packages via public indices and still let them produce such a package.
I’ll also take a second look at the warning/error guidance in the PEP. Maybe not everything has to be specified upfront, and it’d be better to leave a margin of freedom for the tools.
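
To show the two behaviours side by side, here is a minimal Python sketch of how a build tool might report an invalid expression, either blocking or only warning. The `strict` flag, the exception class and the function name are all hypothetical; no backend is known to expose exactly this.

```python
import warnings


class InvalidLicenseExpression(ValueError):
    """Raised when a backend decides to block on an invalid expression."""


def report_invalid(expression: str, reason: str, strict: bool = True) -> None:
    """Report an invalid license expression by error or by warning."""
    message = f"invalid license expression {expression!r}: {reason}"
    if strict:
        # Blocking mode: the build stops here.
        raise InvalidLicenseExpression(message)
    # Lenient mode: the package is still built, but the user is told
    # that an index may refuse it later.
    warnings.warn(
        message + "; package indexes may reject this distribution",
        UserWarning,
        stacklevel=2,
    )
```

Which mode is the default could be left to each tool, which is the margin of freedom mentioned above.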

At this point it seems to me that allowing custom identifiers as defined by SPDX would be the most straightforward way to achieve this.

I believe that’s not the role of PyPI as it is currently designed. PyPI should not be blanket-considered a trusted source and I’d prefer this PEP doesn’t play a role in making it appear so. It’s still the responsibility of the project authors to make sure the declared metadata matches their intention and is coherent within a distribution (e.g. text in the license file matches the license expression). Packagers in Fedora are expected to validate that information in the process of including a package in the distribution and raise an issue with the project’s authors if they find discrepancies.

Currently, a project needs to define a name and a version (some build tools will initiate the version for the project author). I don’t see a reason to mandate including licensing information there - it’s not mandatory in the wide software world to license projects (this means no one can use and modify a project, nevertheless it’s perfectly possible to do). I feel the best place to nudge project owners to license things properly is in their issue trackers, and I believe most would happily comply if approached by the downstream packagers.
I don’t see a benefit of an empty string over not including the field at all.

With my Fedora packager hat on, I’m on the side of making the proposal as minimal as possible to achieve consensus over the implementation. The gist is that we need the SPDX standard and we need to be able to locate the license files. The feedback from folks being in that chain before us, like build and publishing tools authors, is much appreciated.

dstufft · May 27, 2024, 1:31am

For whatever it’s worth, I believe PyPI will almost certainly validate the syntax of a License-Expression (which includes validating that it contains valid identifiers), but will almost certainly not attempt to verify that a License-Expression clause is actually an accurate representation of the license for a given project.
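
For a sense of what "validate the syntax" could mean in practice, here is a rough Python sketch. It is not PyPI's or any library's implementation: the identifier sets are a tiny illustrative subset of the SPDX lists, operators are matched in upper case only, and all names are hypothetical.

```python
import re

# Assumption: tiny illustrative subsets, not the real SPDX lists.
KNOWN_IDS = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-2.0-only", "GPL-2.0-or-later"}
KNOWN_EXCEPTIONS = {"Classpath-exception-2.0"}
_TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z0-9.+:-]+)")


def tokenize(expression):
    """Split an expression into parentheses and word-like tokens."""
    expression = expression.strip()
    pos, tokens = 0, []
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse_compound(tokens, pos):
    """Parse: term ((AND | OR) term)*; return the next unread position."""
    pos = _parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("AND", "OR"):
        pos = _parse_term(tokens, pos + 1)
    return pos


def _parse_term(tokens, pos):
    """Parse: "(" compound ")" | license-id ["+"] [WITH exception-id]."""
    if pos >= len(tokens):
        raise ValueError("expression ends unexpectedly")
    token = tokens[pos]
    if token == "(":
        pos = _parse_compound(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return pos + 1
    base = token[:-1] if token.endswith("+") else token
    if base not in KNOWN_IDS:
        raise ValueError(f"unknown license identifier {token!r}")
    pos += 1
    if pos < len(tokens) and tokens[pos] == "WITH":
        if pos + 1 >= len(tokens) or tokens[pos + 1] not in KNOWN_EXCEPTIONS:
            raise ValueError("WITH must be followed by a known exception")
        pos += 2
    return pos


def is_valid_expression(expression: str) -> bool:
    """Syntactic check only: well-formed and built from known identifiers."""
    try:
        tokens = tokenize(expression)
        return _parse_compound(tokens, 0) == len(tokens)
    except ValueError:
        return False
```

A check like this can say that "MIT OR" is malformed or that an identifier is unknown, but it cannot tell whether "MIT" is really the license of the project, which is the distinction being drawn above.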

It was previously mentioned, but there’s also not a mechanism for us to backfill this metadata on PyPI, the METADATA file in an archive is what it is, and we can’t modify that.

We’re also unlikely to require this information unless the PEP requires us to do it. I don’t think we should require it though, as others have mentioned required fields tend to be filled in with garbage to “get things working”, so we should be careful that our list of required fields is kept pretty minimal.

I think that having SHOULD for a build backend to validate that the expression is syntactically correct is a reasonable thing to include. SHOULD is explicitly there to allow implementers to opt out of doing something when they have a good reason for doing so, while suggesting that it is a good idea for them to do it if possible.