PEP 639: Improving license clarity with better package metadata

Yes! That would be very welcome!

There is one thing I may want to address: a slight change to how we would use the License field. The suggestion I got from @thatch is to simply prefix the license expression string with spdx:, as in spdx: Apache-2.0 OR BSD-3-Clause, to signal that the License field contains an expression that can then be strictly validated. This would address some of the concerns raised about reusing the License field for expressions, namely that it would otherwise not be known whether the field can be validated strictly.
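To make the prefix idea concrete, here is a minimal sketch of how a tool might tell the two kinds of value apart. The function name `split_license_field` and the exact handling of whitespace are assumptions for illustration only; nothing here is specified by the PEP draft.

```python
# Hypothetical marker, as suggested in this thread; not part of any spec.
SPDX_PREFIX = "spdx:"


def split_license_field(value):
    """Return (is_expression, text) for a raw License field value.

    is_expression is True only when the value starts with the prefix,
    which is the signal that strict validation could be applied.
    """
    stripped = value.strip()
    if stripped.startswith(SPDX_PREFIX):
        return True, stripped[len(SPDX_PREFIX):].strip()
    return False, stripped


print(split_license_field("spdx: Apache-2.0 OR BSD-3-Clause"))
print(split_license_field("BSD-style, see LICENSE.txt"))
```

Values without the prefix would stay free-form text, so existing metadata would keep working unchanged under this scheme.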

@pf_moore @brettcannon feedback welcome on this refinement!

That sounds like a good change. Explicit configuration is always better than implicit configuration (or behavior) 🙂


At the risk of throwing an accelerant on this pile of embers…

Would it be in scope for this PEP to address the situation where the license expression for a wheel must be different from the expression for the sdist?

This isn’t an issue for any pure-Python packages, but any wheel constructed from a package which contains non-Python code will almost certainly include binary code derived from sources outside of the Python package. While it’s quite likely that the code in question is freely distributable, it will often not be distributed under the same license(s) as the source code in the sdist.

There are some obvious examples on pypi.org already. PyTorch is one: the wheels available from pypi.org contain CUDA-related code from NVIDIA, which is not even open source code 🙂 It is freely distributable, so the producers of those wheels are not violating any licenses, but the consumers of those wheels are being given an incorrect license declaration if they rely on the declaration in the sdist (as copied over into the wheel).

@kpfleming that is an issue, all right.

I recall entering these related tickets a long time ago on numpy and scipy:

I think that for now we should clarify, in the documentation, what the scope of the license is: the sdist or the built wheel.
I tend to feel that the license expression should be that of the built package, and not only that of the source code, if there is more than meets the eye.
BUT on the other hand, the set of code baked into a binary today may not be the same as the set baked in tomorrow, or may vary based on the local setup (such as using GCC vs. clang or glibc vs. musl, etc.), so it becomes rather difficult in some corner cases.

The case of PyTorch is a rather unique one, though I feel they would need to err towards documenting the built binary rather than just the sources.

I think that’s out of scope. Not only would you need a way to specify the license of the source separately from the wheel, but you would have to do it per wheel (e.g. CUDA versus non-CUDA wheels of PyTorch). PyPI doesn’t even provide a way to expose that information, let alone a way for your build tool of choice to let you say what license should be set for your built wheel.

I think that would be a separate discussion (and potentially a PEP).


That’s very logical and I assumed that would be the response.

My only remaining question on that topic (and I realize it’s pushing the envelope) is whether the PEP (and the metadata itself) could recommend some sort of indicator which would cause the build tool (the thing producing the wheel) to not copy the license expression from the sdist into the wheel. This would be used by package maintainers who know that their package will incorporate other software when a wheel is built, and that the license expression they are providing for the sdist will not be accurate if a wheel is produced from it.

I think that’s a separate PEP since the one proposed here is to simply use SPDX.


Thanks. I’ll put that in my plans and see if I can find some collaborators 🙂


I think this would be a welcome change and would reduce confusion and frustration.

Is this the last ‘open issue’ before this PEP can proceed to the next phase?

Thanks for pushing through on this @pombredanne!

I think the only thing preventing me from being +1 on this PEP is the overloading of the License field rather than creating a new License-Expression field. At this point in this thread, @pradyunsg, @dstufft, @brettcannon and I have all either stated a preference for a new field or raised concerns about reusing License, so I’m not sure we’re in the minority here anymore.

The only real reason I’ve seen for reusing License is:

Given my experience with packaging metadata, I don’t think “field inflation” is an actual issue, and instead I think we should prioritize the following:

  • if possible, we should avoid anything that changes the behavior of any existing field, even across different Metadata-Versions;
  • we should not make packaging tools guess the Metadata-Version based on some characteristic of the metadata value;
  • we should be able to tell users when they are trying to use a license expression but have a typo, wrong syntax or an invalid identifier, i.e. strictly validate.

Creating a new License-Expression field would resolve all of these.
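As a rough illustration of what “strictly validate” could mean for a dedicated field, here is a small sketch that checks a simplified subset of SPDX expression syntax (identifiers, an optional trailing `+`, `AND`, `OR`, `WITH` and parentheses). The identifier sets are tiny hand-picked samples, and the function names are made up for this example; a real tool would load the full SPDX license and exception lists and follow the complete grammar.

```python
import re

# Tiny illustrative subsets (assumption); a real tool would load the
# full SPDX license list and exception list instead.
KNOWN_LICENSES = {"Apache-2.0", "BSD-3-Clause", "MIT", "GPL-2.0-only"}
KNOWN_EXCEPTIONS = {"Classpath-exception-2.0"}

TOKEN_RE = re.compile(r"\s*(\(|\)|[A-Za-z0-9.+-]+)")


def tokenize(expression):
    """Split an expression into parentheses and word tokens."""
    tokens, pos = [], 0
    expression = expression.rstrip()
    while pos < len(expression):
        match = TOKEN_RE.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_term(tokens, pos):
    """Parse one license (optionally 'WITH exception') or a parenthesized group."""
    if pos >= len(tokens):
        raise ValueError("expression ends unexpectedly")
    token = tokens[pos]
    if token == "(":
        pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return pos + 1
    license_id = token[:-1] if token.endswith("+") else token
    if license_id not in KNOWN_LICENSES:
        raise ValueError(f"unknown license identifier {token!r}")
    pos += 1
    if pos < len(tokens) and tokens[pos] == "WITH":
        if pos + 1 >= len(tokens) or tokens[pos + 1] not in KNOWN_EXCEPTIONS:
            raise ValueError("unknown or missing license exception")
        pos += 2
    return pos


def parse_expr(tokens, pos):
    """Parse terms joined by AND / OR and return the next unread position."""
    pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("AND", "OR"):
        pos = parse_term(tokens, pos + 1)
    return pos


def validate(expression):
    """Raise ValueError if the expression is not valid in this simplified grammar."""
    tokens = tokenize(expression)
    pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")


validate("Apache-2.0 OR BSD-3-Clause")  # passes silently
try:
    validate("Apache-2.0 OR BSD3")
except ValueError as error:
    print(error)
```

With a separate field, a tool can run a check like this unconditionally and report the typo to the user, instead of first guessing whether the value was meant to be an expression.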


@dustin @kpfleming I am now convinced that knowing exactly whether a license is an expression is a must. We could prefix the license value with spdx:, but this amounts to creating a “field-in-a-field” where License would be either a plain string or a structured expression. It does not feel right to cram structure inside a value here.

A separate License-Expression field would end up being much cleaner and remove any ambiguity. @pf_moore would you agree?

If there is no objection, I will update the draft along these lines. And I will also explain the new migration path and add suggestions on how tools can help with it.

One question: what should happen if a metadata file contains both License and License-Expression?

Yes, I agree.


We could keep License as it is, i.e. an optional unstructured field where extra license commentary can be provided. Or we could deprecate it entirely, accept that a license is always an expression, and have any extra license commentary and peculiarities go in the content of a referenced License-File.

I’d rather follow that plan: if the metadata includes License-Expression, then it must not have License or any licensing-related classifiers.
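A sketch of how a tool might enforce that rule, assuming the metadata is in the usual email-header style and that licensing-related classifiers are the ones starting with `License ::`. The function name `check_license_fields` and the sample metadata are made up for this example.

```python
from email.parser import HeaderParser


def check_license_fields(metadata_text):
    """Return a list of problems found with the license-related fields.

    Rule sketched here: if License-Expression is present, neither License
    nor any "License ::" classifier may also be present.
    """
    message = HeaderParser().parsestr(metadata_text)
    problems = []
    if message.get("License-Expression") is None:
        return problems  # legacy metadata: nothing to enforce in this sketch
    if message.get("License") is not None:
        problems.append("License must not be combined with License-Expression")
    license_classifiers = [
        classifier
        for classifier in message.get_all("Classifier", [])
        if classifier.strip().startswith("License ::")
    ]
    if license_classifiers:
        problems.append(
            "license classifiers must not be combined with License-Expression"
        )
    return problems


# Hypothetical metadata that violates the rule in both ways.
SAMPLE = "\n".join(
    [
        "Name: example",
        "Version: 1.0",
        "License: MIT",
        "License-Expression: MIT",
        "Classifier: License :: OSI Approved :: MIT License",
        "Classifier: Programming Language :: Python :: 3",
        "",
    ]
)

for problem in check_license_fields(SAMPLE):
    print(problem)
```

Whether a violation should be a hard error or a warning during a migration period is a policy choice this sketch leaves open.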

While it seems incredibly unlikely today that there would ever be a new license expression specification that supersedes SPDX, it may be prudent to plan for that by naming the field in a way that makes it clear that it can only contain SPDX license expressions.


It is possible for build tools to put different metadata in wheels vs sdists. So, if built wheels should have a different license, it’s IMO very reasonable to have it change between them.

I don’t think this is a metadata issue, though. It is more a question of how the various build backends expose this ability and whether they allow users to do things like this.


It might also be an issue for metadata consumers, which may not be expecting per-file differences (PyPI, for example, displays the license at the project level, not at the file level). But that’s equally not a metadata issue: the mechanism is there to allow per-file data, and it’s just a matter of how the tools handle the situation.


https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
