We have encountered a difficulty with this PEP on the pip project, specifically something which will affect any project that vendors dependencies.
The license metadata field contains the project’s license (in pip’s case, MIT). That has been the case for many years, and is easy to translate from the old format to the new format. However, as pip vendors its dependencies, it’s reasonable to say that the combined license for a particular distribution file consists of pip’s license plus the licenses of the vendored dependencies. Again, there’s nothing new here - pip has vendored dependencies for years, and we bundle the dependencies’ license files, so the information is available.
Presumably, one of the advantages of the license expression is that it can be consumed by automated tools. However, there’s no metadata in PEP 639 that lists the license expressions of bundled dependencies. A suggestion has been made that pip should put the combined license into the License-Expression metadata, but if we do that, we lose any way to record the project license, and it looks like a change to the licensing of the project (this is made worse by the fact that PyPI reports License-Expression as the project license data, not the license data for an individual distribution file).
It’s worth noting that the pip documentation explicitly notes (albeit in a rather obscure place) that pip is released under the MIT license.
What is the best way forward here? I see three main options:
1. Pip records license = "MIT" and bundles dependency license files. This keeps the status quo, but doesn’t help tools that want to automate license scanning in a way that includes vendored dependencies.
2. PEP 639 gets updated to add a new field to hold license expressions for vendored dependencies.
3. Pip misrepresents its own license to include the dependency licenses, with something like license = "MIT AND (Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND ISC AND MIT AND MPL-2.0 AND PSF-2.0)". In case it’s not obvious, I’m pretty strongly against this option.
Would waiting for SBOMs to come along be an option?
I don’t really have an answer but a couple of factors spring to mind:
Licenses are a long way from being the only piece of metadata that ought to be captured for vendored packages, so a way of handling only the license field for vendored packages seems wrong.
Some of the consumers of this field are going to be repackagers, and they’ll likely be devendoring the dependencies anyway, so the license = "MIT AND (everything else)" option would just be getting in their way.
This needs to be changed. In e.g. python-flint’s case the sdist is MIT but the wheels on PyPI have other things bundled in by e.g. auditwheel.
I’m not sure what to do in terms of PEP 639 for python-flint. This is how it currently looks:
When building wheels for PyPI with cibuildwheel the venerable cat supports bundled licenses:
We need better tooling for this but really the pyproject.toml should say MIT but something else (auditwheel?) should update the metadata somehow when bundling things in.
It would be better I think to have a clear distinction in whichever metadata between the project license and the license of bundled things.
It implies that pip should specify all of the licenses but then that would mean that unvendoring redistributors of pip have to strip them out again and the metadata doesn’t distinguish which are for the vendored dependencies vs which are for pip itself. The guide makes no mention of projects that use e.g. auditwheel that would need to bundle licenses for things that are not in the sdist (i.e. cannot be specified statically in pyproject.toml).
Actually, yes. A SBOM feels like it’s the ideal (and correct) place to put information like this. I’d love to hear what the PEP 639 authors have to say, but I think this is what I’d want pip to do. And we have a tool (vendoring) which does our vendoring for us, so I’d hope that tool could add an option to generate the SBOMs automatically for us.
I think option 3 is the only correct one, as it is the licence for the files being distributed. The package I get from PyPI is not MIT licensed, it is MIT AND (...).
P.S. pip’s wheel file on PyPI doesn’t include licence files of the vendored dependencies, which I think is a violation of those licences.
The question is whether the license metadata field refers to the license of the project or to the license of the distribution package. A project can use one license while bundling or vendoring things that have other licenses in either sdist or wheels. PEP 639 seems ambiguous about the interpretation of what the license actually applies to.
If the license expression in the pyproject.toml in the sdist must apply to all the contents of the sdist then where can the license for the project itself be recorded?
SBOMs are explicitly intended to provide a way to document all the contents of a package and so PEP 770 is designed for tools that do bundling to be able to bundle that metadata without messing with core project metadata.
As I read it the PEP gives two different not entirely compatible definitions for the License Expression:
The first one in the rationale section defines “License-Expression” as related to the distribution (as “package” here presumably means “distribution package”):
License-Expression that provides an unambiguous way to express the license of a package using SPDX license expressions.
And the second one is in the terminology section, where “license expression” is defined in terms of the project:
license expression
SPDX expression
string with valid SPDX license expression syntax including one or more SPDX license identifier(s), which describes a Project’s license(s) and how they inter-relate. Examples: GPL-3.0-or-later, MIT AND (Apache-2.0 OR BSD-2-clause)
So I would ask the PEP authors to clarify whether the core metadata field License-Expression is intended to be:
1. The license(s) of the project
2. The license(s) of the distribution package
3. Either, at the discretion of the project
And in any case I would appreciate some clarifying text in the specification.
FWIW, my personal view is that the best resolution here would be (1), with clarification that the license of the distribution package should be calculated by combining the license of the project with any licenses of vendored packages, extracted from the project SBOM(s). If a more convenient way of getting the distribution package license is needed, a 3rd party library like packaging could add an API to extract the necessary data. The advantage of this solution is that it’s usable today.
I dislike (2) because it would require two significant changes before being usable. First, a new field would need to be added to hold the license of the project. Second, PyPI would need to be updated to use that field when displaying the project license.
IMO, (3) will simply become equivalent to (1), precisely because it’s the only option that allows recording all relevant data without a change to the spec and to PyPI.
To expand on this slightly, there is not a single distribution package: there are sdists, there are wheels, and then there are packages outside of PyPI. There can be vendored things that are:
In the sdist and all wheels but not necessarily in the package in non-PyPI packaging systems (the pip case).
In the wheels only (the python-flint case).
Only in some wheels for particular platforms (also the python-flint case e.g. extra things bundled on Windows).
In the sdist but not in the wheels (e.g. a vendored build dependency).
For an example of the last case the sdist for numpy has a vendored-meson subdirectory which contains numpy’s fork of meson. That is used for building numpy from the sdist but is not vendored into the binary wheels and is not part of the installed package.
If the license expression in each distribution package refers to the contents of that package then none of the packages can be taken as specifying the project license i.e.:
The terms under which contributions to the project are made.
The license that applies to the devendored package.
The license that applies if building from source rather than installing from wheels.
In many/most downstream repackaging scenarios it will be the project license that should be recorded. For example in python-flint’s case the conda package or any Linux distro package etc should show the MIT license regardless of whether the wheels on PyPI have other things bundled in.
The issue with recording the project source code license vs the license of distributed artifacts (i.e. wheels or other forms of binary packages) came up almost immediately when meson-python implemented support for PEP 639 (ENH: add support for PEP 639 by dnicolodi · Pull Request #681 · mesonbuild/meson-python · GitHub and subsequent comments). The focus of the discussion there was Meson subprojects pulled in by the package at build time, but the same of course applies to vendored dependencies or platform libraries added to the binary distribution by different mechanisms.
At the time it felt like the authors of the PEP did not consider these use-cases when drafting the PEP. IIRC, currently NumPy is forced to implement some hacks to get the correct redistribution license recorded in the wheels.
Clarifying whether License-Expression and License-File apply to the project source code or the distributed binary package would be helpful for both producers and consumers of the metadata.
I didn’t realise that the PEP actually has a whole separate page of rejected ideas where this is mentioned:
As an additional use case, it was asked whether it was in scope for PEP 639 to handle cases where the license expression for a binary distribution (wheel) is different from that for a source distribution (sdist), such as in cases of non-pure-Python packages that compile and bundle binaries under different licenses than the project itself. An example cited was PyTorch, which contains CUDA from Nvidia, which is freely distributable but not open source.
However, given the inherent complexity here and a lack of an obvious mechanism to do so, the fact that each wheel would need its own license information, lack of support on PyPI for exposing license info on a per-distribution archive basis, and the relatively niche use case, it was determined to be out of scope for PEP 639, and left to a future PEP to resolve if sufficient need and interest exists and an appropriate mechanism can be found.
I don’t think it is really possible to just leave this case “out of scope” because it is the situation for many of the most widely used packages and they will have to put some value in the metadata. It’s not only PyTorch but NumPy, SciPy, matplotlib, … Basically most packages with nontrivial wheels.
Overlooking this case in the PEP undermines the general meaning of the metadata because it cannot be accurate for all packages under the interpretation that many people would assume. The PEP makes it possible to specify a particular license unambiguously but the interpretation of what the license applies to is ambiguous.
For me it’s (2), because what people care about when they install a project is which licenses they are now bound to. As an example, let’s say pip vendors something that’s GPL (or some other viral licence). In that case no one will care that the source license for pip alone is MIT; they will care about the licenses for all the code being installed, because that affects the user’s code and how they can distribute their project, since GPL will take precedence.
Or put another way, if I can’t read the license expression to make sure I keep GPL code out of my project, so that it can’t force GPL requirements on me, then the license expression is nowhere near as useful. Now if people want a separate way to state the source license as distinct from everything that ends up being installed, then one idea is a Source-Licence-Expression field just for the project source itself, with License-Expression falling back to it when License-Expression isn’t defined.
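To make the "can I read the expression to keep GPL out" use case concrete, here is a minimal sketch of what such a check might look like. It is not a full SPDX parser: it only tokenises the expression, and the prefix denylist (`GPL-`, `AGPL-`) is an illustrative assumption, not a recommendation of which licenses to treat as copyleft.

```python
import re

# Operators from SPDX license expression syntax (matched case-insensitively here).
_OPERATORS = {"AND", "OR", "WITH"}
# Identifier-like tokens; parentheses and whitespace are simply skipped.
_TOKEN = re.compile(r"[A-Za-z0-9.+-]+")


def license_ids(expression):
    """Return the set of license identifiers mentioned in an expression."""
    ids = set()
    previous_operator = None
    for token in _TOKEN.findall(expression):
        upper = token.upper()
        if upper in _OPERATORS:
            previous_operator = upper
            continue
        # The token following WITH names an exception, not a license.
        if previous_operator != "WITH":
            ids.add(token)
        previous_operator = None
    return ids


def flagged_ids(expression, prefixes=("GPL-", "AGPL-")):
    """Return identifiers starting with one of the (assumed) denylist prefixes."""
    return {i for i in license_ids(expression) if i.upper().startswith(prefixes)}


combined = (
    "MIT AND (Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND ISC "
    "AND MIT AND MPL-2.0 AND PSF-2.0)"
)
print(flagged_ids(combined))                    # nothing flagged
print(flagged_ids("MIT AND GPL-3.0-or-later"))  # the GPL identifier is flagged
```

Note that a check like this only works if the expression it reads covers everything that gets installed, which is exactly the point under dispute.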
One very important group of people who should care are the people who contribute to pip. The project license is the license that governs the code that they contribute. The fact that the code gets bundled up with other things in some situations does not affect the project license itself. The license is shown clearly on GitHub for all who would consider contributing. Future tooling will no doubt parse this field from pyproject.toml.
I want the pyproject.toml on GitHub to state clearly what the license of the project is regardless of anything else that might be bundled into some of the distributions. If we want to have other metadata (e.g. SBOMs) that list the licenses of everything bundled into a distribution then that is fine. If the intended meaning was to convey the license of everything in a particular distribution then the PEP fails at that because it has not allowed for the possibility that different distribution files might have different things in them.
This is backwards: the license-expression in the project section should be the project license. Metadata governing the license of the contents of e.g. the sdist belongs in some sdist metadata.
Also, by making the user manually calculate the combined license expression to put in the metadata field, isn’t that a strictly worse UX than having the tool (vendoring/auditwheel/etc) record the necessary licenses automatically?
That’s fine, but that’s a different use-case and is also solvable more easily when looking at a source tree than it is to determine what other licenses apply to a project when installed based on all the files it contains.
That’s your opinion, not a fact, so it isn’t “backwards”, it just isn’t what you think it should be while I happen to hold the opposite view based on the current spec.
That very much depends on whether those tools can calculate such a thing. For instance, pip’s vendoring isn’t done at build time for wheels but ahead of time and is committed into the source tree. So in the case of the build back-end not doing the vendoring it doesn’t apply.
Now for projects that have some build back-end that injects code, that changes things a bit. If you assume they can gather all the licenses for what is to be injected and then AND them together, having a back-end-only field in the core metadata could allow for that. In that instance you could then take Oscar’s argument that license-expression is the expression for the whole source tree (e.g. pip still lists all the licenses for the code it vendors), but some new field that isn’t accessible in pyproject.toml is set by back-ends to represent the license of what’s being distributed and overrides license-expression when present in the final METADATA.
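As a rough sketch of how a consumer might read such an override from a wheel's METADATA: the field name `Distribution-License-Expression` below is purely hypothetical (no such field exists in the core metadata spec, and none is named above), and the sample metadata is made up for illustration.

```python
from email import message_from_string
from typing import Optional

# Hypothetical name for the back-end-only field; not part of any spec.
OVERRIDE_FIELD = "Distribution-License-Expression"


def effective_license(metadata_text: str) -> Optional[str]:
    """Prefer the back-end-set field, else fall back to License-Expression."""
    # Core metadata uses RFC 822-style headers, so the email parser can read it.
    message = message_from_string(metadata_text)
    return message.get(OVERRIDE_FIELD) or message.get("License-Expression")


# Made-up METADATA for an sdist-like case: only the static field is present.
static_only = (
    "Metadata-Version: 2.4\n"
    "Name: example-project\n"
    "License-Expression: MIT\n"
)
# The same project after a back-end bundled something into the wheel.
with_override = static_only + f"{OVERRIDE_FIELD}: MIT AND BSD-3-Clause\n"

print(effective_license(static_only))
print(effective_license(with_override))
```

The design choice here is that consumers which only care about "what am I installing" need a single lookup, while the static field keeps its original meaning.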
I haven’t read the entire PEP line by line to write this comment but when I looked before I didn’t see anything that unambiguously states what the license referred to by the license-expression field should be understood as applying to.
It does, because the vendoring tool that we use can add the necessary SBOMs (and should, for other reasons, anyway).
The key points here are:
SBOMs should contain license expressions anyway (at least, that’s what I understood from reading some information about SBOMs).
Tools (or people) doing vendoring should be adding SBOMs.
With the SBOMs and the project license, a distribution file license can be calculated. With just a distribution file license, the project license can’t be determined.
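For illustration, the calculation in the last point could be as simple as the sketch below. It assumes the vendored license expressions have already been extracted from the SBOMs (how to do that depends on the SBOM format, so it is left out), and the function name is made up for this example.

```python
def _wrap(expression):
    """Parenthesise compound expressions so they can be combined safely."""
    expression = expression.strip()
    return f"({expression})" if " " in expression else expression


def distribution_license(project_license, vendored_licenses):
    """Combine a project license with the licenses of vendored components.

    `vendored_licenses` is an iterable of SPDX expressions, assumed to have
    been read from the SBOMs shipped with the distribution file.
    """
    # De-duplicate and sort so the result is stable across builds.
    vendored = sorted({_wrap(e) for e in vendored_licenses})
    if not vendored:
        return project_license.strip()
    bundled = " AND ".join(vendored)
    if len(vendored) > 1:
        bundled = f"({bundled})"
    return f"{_wrap(project_license)} AND {bundled}"


print(distribution_license(
    "MIT",
    ["Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "MIT", "MPL-2.0", "PSF-2.0"],
))
```

Going the other way, from the combined string back to the project license, is not possible without extra information, which is the asymmetry described above.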
I’ll also note that the semantics of the license field in pyproject.toml and the License-Expression metadata field are different. The former is global to the project[1], and the latter is per distribution file. The two are only equivalent when the field is static - and the cases we’re discussing are precisely those where “static” isn’t correct…
[1] Even if it’s defined to be the distribution file license, it can only be specified once at the project level.