Expressing project vs. distribution licenses post-PEP 639

Okay that makes sense and I think you are right that it would be harder to approve mainly because it would be more complicated to define everything unambiguously.

Perhaps though this should be stated more explicitly in the PEP following Damian’s earlier point:

The use case is that it is the license for the project that governs the project’s own code. It is also the license that applies if devendored which is otherwise not stated anywhere. I can adapt code from pip into another MIT-licensed project regardless of the license of any vendored components. I can also open a PR to pip and know that my contribution will be licensed as MIT.

This is what “project license” means to me, and I think this is what most package maintainers are going to want to put in the project.license field in pyproject.toml.

These things are subtly different, though: “the resulting artifact” is not the same as what you “install from PyPI”. What you install from PyPI can differ per distribution, and what you get from a PyPI wheel might not be the same as what you would get if you built from the sdist yourself. If we want to be precise about the license of these things then we have to define these terms very clearly.

If the new format retains ambiguity about what it refers to then there is no advantage. I think the suggestion to do this was for projects where using the new format might be considered inaccurate.

This point comes up a lot (e.g. recent). As I read it dynamic refers to whether it is known before building a wheel what the metadata is. So license = dynamic would mean that somehow when building from sdist the resulting wheel would have a license that was determined dynamically by the build backend. That is not what happens with auditwheel because it runs afterwards as a separate step only in the specific situation that you want to make the wheels portable before uploading to PyPI. The files that are bundled into those PyPI wheels would not be bundled into anything if you installed from the sdist because then the extension modules would just link against existing libraries in your system.

So no this is not dynamic but the statically recorded metadata might be different in the sdist vs the wheels that are on PyPI. This is also quite an important thing to understand which is that the particular need of providing binaries on PyPI means that some things get bundled into the wheels that may have different licenses. Those licenses are irrelevant if you install from conda or if you build yourself etc and you don’t want them to get mixed up with the project license.
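
To make the sdist-versus-wheel point concrete, here is a minimal sketch that reads the License-Expression header from two core metadata documents and compares them. The project name, version and license values are made up for illustration, and it assumes the wheel was given a broader expression than the sdist; nothing here comes from a real package.

```python
from email.parser import Parser
from typing import Optional

# Hypothetical core metadata for one release of a made-up project.
# The sdist records only the license of the project's own code...
SDIST_PKG_INFO = """\
Metadata-Version: 2.4
Name: example-pkg
Version: 1.0
License-Expression: MIT
"""

# ...while a wheel that bundles a shared library might record more.
WHEEL_METADATA = """\
Metadata-Version: 2.4
Name: example-pkg
Version: 1.0
License-Expression: MIT AND LGPL-3.0-or-later
"""


def license_expression(metadata_text: str) -> Optional[str]:
    """Return the License-Expression header, or None if it is absent."""
    message = Parser().parsestr(metadata_text, headersonly=True)
    return message.get("License-Expression")


sdist_license = license_expression(SDIST_PKG_INFO)
wheel_license = license_expression(WHEEL_METADATA)
print("sdist:", sdist_license)
print("wheel:", wheel_license)
print("same?", sdist_license == wheel_license)
```

Both values are static in the sense discussed above: each file simply records a fixed string, and the two strings happen to differ.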

The problem with this is that the naming of the metadata fields looks like the reverse of what I think you want them to mean:

[project]
# project.license is the license of the sdist contents:
license = "MIT AND BSD3"
# project.source-license is the project license:
source-license = "MIT"

But then we lose the machine-readability for that information, which is the entire point of this PEP, right?

While installing wheels from PyPI might seem to some like the main way of dealing with Python packages, there are other distribution methods like conda that take the project source (for example as a sdist) and deal with external binary dependencies in an entirely different way.

Whose POV is that? A user’s? A packager’s? A maintainer’s? How are you sure you’re not gonna delete too many or too few licenses from that expression?

Speaking for pip, simply that the pip maintainers can’t agree on using license = "MIT" rather than a more elaborate license = "MIT AND (...)" form, because we all have different opinions on which license the new field is intended to record. Keeping the status quo (the legacy form) avoids that issue.

IMO, if the PR that switched pip to the new form had simply left the license as “MIT”, I don’t think we’d have ever had this discussion (at least, I wouldn’t have started it). But now that the question has been asked, we’re at an impasse.

IMO, yes, very much so. It’s a statement of what contributors should expect when contributing code. It defines the license for all the code that’s written explicitly for the project. It’s the license that’s controlled by the project.

The licenses of vendored libraries are important for understanding the rules for using and distributing the project as a whole, but they are not owned or controlled by the project, except in the trivial sense that the project won’t use dependencies with licenses they don’t agree with, or which are incompatible with the project’s license. But the vendored licenses, to me, are clearly subordinate to the main project license, and that relationship is lost if all you have is a “combined” license expression.

It’s very much this. And if the decision is to create a follow-up PEP that added a “project license” field[1], then I’d happily use that for pip once PyPI supports it.

What pip does in the interim is still open, though. We would have two choices that I would accept:

  1. Stick with the status quo, and continue using the old form.
  2. Take advantage of your statement that PEP 639 “keeps the same ambiguity as the legacy system” and use license = "MIT".

Both of those maintain the key thing that I consider non-negotiable, that PyPI displays “MIT” as the pip project license, on the human-focused PyPI project page for pip.

This is a debate for the pip project, though, and doesn’t really affect the standards. The only thing that matters to us from a standards perspective is whether your statement above - that projects can choose to use the license expression for the project license or the distribution file license, and the new standard retains the ambiguity of the legacy form - is the official clarification of the PEP’s position. If it is, then option (2) above is open for pip. If not, then I think we only have option (1) available.


  1. I prefer “project” over “source” here, but let’s leave the bikeshedding for later

FWIW I would have brought up the question, because I have the same use case as @brettcannon and had the same understanding as to the intention of this PEP as @ksurma. Also reading the pip PR discussion I don’t think I was the only one.

Now that the intent is clear, it seems all that needs doing with the PEP is fixing a few mistakes in the language, and users who need a project license core metadata field need to propose a new PEP. Unless the argument is to edit the PEP to say something that was not intended when it was accepted?

Um, @ksurma said that the intention was that PEP 639 didn’t address the ambiguity in the legacy format. If that’s what you mean by “fixing a few mistakes in the language” then that’s fine. But I don’t see how saying that PEP 639 remains ambiguous helps your use case (which, if I understand it, relies on having an unambiguous interpretation of License-Expression).

Sorry, I had to do a second read. Yes, the PEP relies on the ambiguity, but the author’s main use case is the distribution.

My point still mostly stands, the language should be updated to clarify that it’s ambiguous and up to the discretion of the project (option 3 on my original post asking for clarification).

Then someone sufficiently motivated could build on this for a specific project and/or distribution core metadata field.

Is there a reason why people expect the license expression field to be treated specially, by including vendored dependencies, but not any of the other metadata fields?

Like I’ve said before, the license is not the only piece of metadata that a vendored dependency has.

  • If I want to know all the packages I have in an application/environment/dependency tree then the Name field will not capture the names of vendored packages.
  • If I want to look for vulnerable dependency versions then Version does not capture vendored package versions.
  • If I want to generate a list of 3rd party contributors then Author/Maintainer won’t … you get the picture.
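
As a rough illustration of the bullets above, the sketch below lists what the standard library's importlib.metadata can see in the current environment. The helper name installed_distributions is made up for this example. The point is only that each entry comes from a distribution's own metadata directory, so code vendored inside another package (for example under a `_vendor` subpackage) has no entry of its own.

```python
from importlib.metadata import distributions


def installed_distributions() -> dict:
    """Map each installed distribution's Name to its Version.

    Only distributions with their own metadata directory are found.
    Vendored copies living inside another package are invisible here.
    """
    found = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with unreadable or incomplete metadata
            found[name] = dist.metadata.get("Version", "")
    return found


# A naive "freeze": vendored packages never appear in this output.
for name, version in sorted(installed_distributions().items()):
    print(f"{name}=={version}")
```

The same limitation applies to Author, Maintainer and every other field read this way, which is the sense in which the license field is not special.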

Is it just because licences → legal → scary stuff?

Isn’t this sort of thing what SBOMs are for?

It is because this PEP defines a license metadata field but people are unclear how to use it. The PEP text and the examples in the packaging guide suggest that this field might be expected to include all vendored things. That is not how all projects have recorded the license metadata in the old format making it unclear whether it is valid for them to just change the metadata format from license = {text = "MIT"} to license = "MIT" because maybe this new format was expected to have a more precise meaning.

The clarification here indicates that it is valid just to change the format:

On the flipside though this means that another PEP would be needed before the spec could be depended on by some applications that would consume this metadata.

I thought it was, but I’ve had pushback that SBOMs might not be present, or might not contain everything you need. I’m unconvinced by that argument, personally (anything optional can be omitted, what’s so different about SBOMs?) but didn’t feel like pushing the point.

Personally, I’d rather follow @bwoodsend’s logic, and say that all of the project metadata (in pyproject.toml and in the metadata fields of sdists and wheels) should refer solely to the project’s own data, ignoring vendoring. Vendored projects are what SBOMs cover, and anyone interested in the “full” picture including vendored code should be looking at SBOMs.

Sadly, I don’t have the luxury of being able to make the rules here, so all I can do is point out where my use case(s) aren’t covered by what we actually have.

As I understand it, it’s not the project’s metadata, it’s the distribution’s metadata, and should refer to the licenses that apply to using that distribution here. In the case of vendoring, omitting vendored licenses from distribution metadata is actually inaccurate.

Yes, and that’s kind of where I was leading. Currently I can’t even write a vendoring-aware pip freeze-like tool. I think we need a more general way of describing vendored dependencies and that’s exactly what SBOMs are.

What is “it” here?

To me, it’s self-evident that what is in the [project] section of the pyproject.toml file is project data[1]. In most cases, that translates directly into distribution file metadata, but in certain situations, that isn’t sufficient. That’s when we need extra data, but that extra data is not project data, it’s added by the build process (and as such, it will vary depending on exactly which build process is used).

Part of the problem is that in many ways, vendoring is about building an application, and Python packaging is bad at handling application building. Even bundling shared libraries in wheels is to an extent about application building, because whether a wheel bundles a shared library or links to one on the user’s system only matters when you try to run the final assembled application.


  1. the clue’s in the name :slightly_smiling_face:

Unfortunately this is not specified anywhere and if we want to talk about the distribution’s metadata then we need something that pins down what it actually means.

As an example consider this case. What I was trying to do was make it so that installing from the python-flint sdist could download the non-Python dependencies and build them for you. Meson unlike setuptools is capable of building the dependencies and it would be much nicer for anyone installing from source if we could do that for them. Ultimately I decided this should only be done as opt-in and it is still not implemented because meson-python can’t bundle the libs yet (unlike auditwheel etc).

What that would mean though is that you have an sdist that contains only MIT code but the build backend might download and install LGPLv3 code. All of a sudden “what gets installed” is not the same as “what’s in the sdist” even if you are building from the sdist. The imprecise language that people are using above about these things shows that the definitions need to be made very clear before anyone can depend on a precise interpretation of what the metadata means.

I agree that it is necessary to have some metadata that unambiguously describes the licenses for the contents of a distribution since among other things those licenses govern how it can be distributed. I think that we need to make it very clear what the metadata field refers to, though, and in general I think it is a mistake to try to retrofit a precise definition onto a metadata field that is already used imprecisely.

PEP 639 seemed clear to me that “it” (the license expression field) was intended to refer to the distribution. Most of the guidance in it was for how build tools should interpret the fields, and then there’s this verbatim:

For all newly-uploaded distribution archives that include a License-Expression field, the Python Package Index (PyPI) MUST validate that they contain a valid, case-normalized license expression with valid identifiers (as defined above) and MUST reject uploads that do not. Custom license identifiers which conform to the SPDX specification are considered valid. PyPI MAY reject an upload for using a deprecated license identifier, so long as it was deprecated as of the above-mentioned SPDX License List version.

It’s crystal clear to me that this is to be used for distribution purposes and refers to the distribution. If this isn’t actually the case, then this is a massive failure in something meant to improve clarity.

(There are many more indicators of this reading in the PEP, including in the separate page covering rejected ideas.)
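
The validation rule quoted above (a valid, case-normalized expression with valid identifiers) can be sketched roughly as follows. This is a toy under stated assumptions, not what PyPI runs: KNOWN_IDS is a tiny hand-picked subset of the SPDX License List, the function name normalize is invented here, and the sketch ignores WITH exceptions, the `+` suffix, custom LicenseRef- identifiers and operator placement. As far as I know, the `packaging` library provides a real implementation in `packaging.licenses`, which is what actual tooling should use.

```python
import re

# Tiny illustrative subset; the real SPDX License List is far longer.
KNOWN_IDS = {
    name.lower(): name
    for name in ("MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause")
}
OPERATORS = {"and", "or"}
TOKEN = re.compile(r"[()]|[^\s()]+")


def normalize(expression: str) -> str:
    """Case-normalize a license expression or raise ValueError.

    Checks identifiers and parenthesis balance only; it is not a
    parser for the full SPDX expression grammar.
    """
    tokens = []
    depth = 0
    for token in TOKEN.findall(expression):
        lowered = token.lower()
        if token == "(":
            depth += 1
            tokens.append(token)
        elif token == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
            tokens.append(token)
        elif lowered in OPERATORS:
            tokens.append(lowered.upper())
        elif lowered in KNOWN_IDS:
            tokens.append(KNOWN_IDS[lowered])
        else:
            raise ValueError(f"unknown license identifier: {token!r}")
    if depth != 0 or not tokens:
        raise ValueError("unbalanced parentheses or empty expression")
    return " ".join(tokens).replace("( ", "(").replace(" )", ")")


print(normalize("mit and (apache-2.0 or bsd-2-clause)"))
```

Note that a check like this is purely syntactic. It accepts "MIT" and "MIT AND (Apache-2.0 OR BSD-2-Clause)" equally and says nothing about whether either one describes the project or the distribution file.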

I’ll also point to the plain motivation section which said:

Software must be licensed in order for anyone other than its creator to download, use, share and modify it. Today, there are multiple fields where licenses are documented in Core Metadata, and there are limitations to what can be expressed in each of them. This often leads to confusion both for package authors and end users, including distribution re-packagers.

and then

As a result, on average, Python packages tend to have more ambiguous and missing license information than other common ecosystems.

Leaving this out of the distribution would go against the motivation as that would be missing license information.

I’m scanning through the PEP for language here, taking this as something I’ll want to learn from before I propose anything, and I think I’ve found clear evidence the authors explicitly meant to include vendoring:

The current license classifiers could be extended to include the full range of the SPDX identifiers while deprecating the ambiguous classifiers (such as License :: OSI Approved :: BSD License).

However, there are multiple arguments against such an approach:
[…]

  • It only covers packages under a single license; it doesn’t address projects that vendor dependencies (e.g. Setuptools), offer a choice of licenses (e.g. Packaging) or were relicensed, adapt code from other projects or contain fonts, images, examples, binaries or other assets under other licenses.

They rejected just expanding the existing mechanism because it wouldn’t cover vendoring.

I think the root issue here is that the word “package” is terribly overloaded in Python’s packaging world, so any use of it is inherently going to end up misunderstood by someone, possibly even because the author was unclear as a result of the overloading. I’m thankful the rest of the PEP spells out what’s meant to be included, but maybe this should be seen as a reason to fix the terminology we use to be less ambiguous and ensure all future packaging PEPs avoid ambiguous use of “package”?

We’ve already had this discussion about the language in the PEP, and it has been clarified: it was not intended to be less ambiguous than the existing license metadata field.

I agree that the PEP leans strongly towards describing a distribution, and that makes sense because that was the use case for the authors and the PEP delegate, but it was not the intent.

We will not get consensus on changing the intent of the PEP at this point, so let’s just get a clarification upfront so future readers aren’t also confused about the intent.

I don’t think it’s anywhere near as clear as you’re suggesting. Yes, it says that PyPI must validate license expressions, but that’s simply ensuring that invalid data doesn’t make it into the index. It says nothing about semantics.

Yes, you can infer that the field is intended to be related to the distribution file, but (1) that’s not the semantics of the legacy form (and changing semantics is something that a PEP really should be very explicit about) and (2) by displaying the data at the project level, PyPI clearly doesn’t conform to that interpretation.

By the way, surely the PEP should say indexes and not just PyPI? We really shouldn’t be writing interoperability standards that treat PyPI differently from other index implementations :slightly_frowning_face:

@ksurma has already said that the thinking behind the PEP was largely around distribution file metadata. Whether we agree or not, that’s no longer in question. What is still up for debate is what projects that have traditionally recorded the project license in the license field in pyproject.toml should do. @ksurma has suggested (by saying that PEP 639 retains the ambiguity of the legacy field) that such projects can continue to use the new license field in that way - but if that’s the case, it does impact the usability of the License-Expression metadata as distribution file metadata. Maybe that’s an inevitable consequence of allowing the ambiguity. Maybe it’s a fatal flaw in ambiguity, and we need a revised clarification. Or maybe it’s all just a little bit more precise, but still a mess in certain ways just like it has always been, and that’s OK. I don’t know. And to be honest, I don’t really have a strong opinion. All I care about is that pip can continue to display “MIT” as the license on its PyPI page, with a weaker preference that we don’t have to do so by sticking with the legacy format (because that’s not sustainable for the longer term).

As would not having a way to record the project license. Unfortunately, the PEP didn’t achieve its motivations as well as it might have wanted to.

As a user, I’d want indexes to advertise the license of the distribution I’m downloading from them :neutral_face: if they are going to display this at all. Otherwise I’d want them not to display anything about it and leave it up to the user to be responsible for learning about what they depend on. (I already lean towards telling people to do this, but I shouldn’t have to tell them “the specification lets this not reflect the distribution, so you have to”; it should be “you should know what you depend on and ship to your users”.)