Expressing project vs. distribution licenses post-PEP 639

No doubt true, but it’s a reasonable assumption, because the alternative (this license doesn’t describe what’s being distributed here) makes the information entirely useless to me.

If I’m about to get a package that contains copyleft components, that’s what I need to know, regardless of whether they were in the original sources or merged in later. And frankly, a package that advertised itself as (say) MIT but installed GPL components would probably get itself blocked entirely by our internal systems and red flagged in any repository that used it (whereas one that openly described itself as containing GPL would be merely yellow flagged, reviewed on a case-by-case basis, and typically approved).

So we can argue correctness and misconceptions as much as we like, but as someone who has to actually interpret and rely on this information, I expect the license that’s attached to the distribution (and PyPI is a distribution platform, unlike GitHub which is a collaboration platform) to reflect what’s being distributed. Since most open-source licenses focus on distribution, I would expect actual lawyers to also prefer that the correct licenses appear on the distribution platform.

4 Likes

You’re conflating pyproject.toml and core metadata.

The period before PEP 643, as opposed to before the dynamic field was added to pyproject.toml, is when “there was no issue with a wheel having different metadata from the sdist or source tree”. There actually was, and it’s one reason pip’s resolver is slower than we’d like. And as far as I’m aware, uv makes an invalid (but usually accurate) consistency assumption here that allows it to be fast. But yes, the standards had nothing to say on the matter before PEP 643.

But PEP 643 is careful to allow a metadata field to have a value and still be dynamic. The value is specified to be informational only, which is a very weak constraint, but tools can use it to be more reliable.
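As a rough illustration of that point, core metadata can be read with the standard library's email parser, and a field can carry a value while also being listed under Dynamic. The metadata text below is invented for the example, and `may_differ_in_wheel` is a hypothetical helper, not part of pip or any other tool.

```python
from email import message_from_string

# Hypothetical sdist PKG-INFO: License-Expression has a value,
# but is also marked Dynamic, so that value is informational only.
PKG_INFO = """\
Metadata-Version: 2.4
Name: example-project
Version: 1.0
License-Expression: MIT
Dynamic: License-Expression
"""


def may_differ_in_wheel(metadata_text: str, field: str) -> bool:
    """Return True if `field` is listed under Dynamic in the given metadata."""
    msg = message_from_string(metadata_text)
    # Field names are compared case-insensitively.
    dynamic = {name.strip().lower() for name in msg.get_all("Dynamic", [])}
    return field.lower() in dynamic


msg = message_from_string(PKG_INFO)
print(msg["License-Expression"])                            # the informational value
print(may_differ_in_wheel(PKG_INFO, "License-Expression"))  # True
print(may_differ_in_wheel(PKG_INFO, "Name"))                # False
```

A tool that wants to be reliable would treat the first `True` as a signal that it has to look at the wheel's own metadata rather than trust the sdist value.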

The pyproject.toml field is completely different. As far as I understand, it was added specifically to ensure that the standardised [project] table took priority over the build backend, and developers had to opt into allowing the backend to calculate values. It’s this field that is arguably over-strict, not allowing any middle ground where the backend can amend, but not ignore, the [project] table[1].

We can change the way PEP 621 defines dynamic. That would be a new PEP, and would probably be difficult to get right. Such a change might spill over into PEP 643, or it might not. It might also need different rules for different fields, which might be even more complex (there’s already a proposal for loosening the rules for dependencies, but that proposal probably won’t help for licenses).

One final point that I will note. The project license is something the project maintainers decide on, and it’s clearly worthy of being recorded somehow[2]. And the project license should be static. But the combined license that includes the constraints added by vendored content is a derived field, computed by taking the project license and the licenses of the vendored content and combining them. While it may not be possible in general to automate that computation[3] I’d still argue that the combined license is the one that should legitimately be represented by a dynamic field, and ideally calculated by a tool rather than by hand.


  1. I can understand why it’s strict - clearly defining the limits on such a “middle ground” is incredibly hard, and there was no practical evidence that it was needed at the time the PEP was written ↩︎

  2. Whether that’s License-Expression or something else isn’t important ↩︎

  3. although what’s wrong with PROJECT-LICENSE AND (VENDORED1 AND VENDORED2 AND ...)? ↩︎
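The formula in footnote 3 can be sketched as plain string composition. This is only an illustration: `combined_expression` is a hypothetical helper, it does not validate SPDX identifiers, and it says nothing about whether the licenses are legally compatible.

```python
def combined_expression(project_license: str, vendored: list[str]) -> str:
    """Build PROJECT-LICENSE AND (VENDORED1 AND VENDORED2 AND ...)."""

    def wrap(expr: str) -> str:
        # AND binds tighter than OR in SPDX expressions, so any operand
        # containing OR is parenthesised before being joined with AND.
        return f"({expr})" if " or " in expr.lower() else expr

    # Drop duplicates (keeping order) and anything identical to the project license.
    unique = [lic for lic in dict.fromkeys(vendored) if lic != project_license]
    if not unique:
        return project_license
    vendored_part = " AND ".join(wrap(lic) for lic in unique)
    return f"{wrap(project_license)} AND ({vendored_part})"


print(combined_expression("MIT", ["Apache-2.0", "BSD-2-Clause", "Apache-2.0"]))
# MIT AND (Apache-2.0 AND BSD-2-Clause)
```

The hard part in practice is not this join but producing the `vendored` list reliably, which is where the automation question in the footnote comes in.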

1 Like

This feels like a straw man. If a project is licensed under MIT, and vendors a GPL component, isn’t that combination simply in violation of one, if not both, of the two licenses? I think that at a minimum we should be assuming that license data is describing a legal situation…

Nope, not at all. Depending on how they’re integrated, it probably means the distribution has to be under the terms of the GPL (this is why copyleft is called “infectious”). Some licenses explicitly try to prevent this, which would be in breach of the license (e.g. Intel’s MKL license), but most OSS licenses are GPL-compatible.

3 Likes

You say that, and then you go on to 100% agree with me, so I’m not quite sure whether we’re arguing or talking past each other (as usual :wink: ).

2 Likes

I’m preparing a PEP for partially dynamic metadata, and I think it might help here quite a bit. I’ve already added a special case for license (special in that it’s a string, not really in the way it’s handled).

4 Likes

People aren’t happy with it, but it’s not going to be solved by this already accepted PEP which wasn’t intended to solve it.

I won’t assign motivations as to why people keep replying on this PEP thread about an intent issue when the author has already stated the intent, but I would much rather that some intent language be added to the PEP and spec, and that we move on to a new PEP that solves the use cases brought up by these discussions, rather than relitigating past choices.

2 Likes

I would like to share that right now we in setuptools face the exact same “conceptual dilemma” as experienced by pip. Because of the vendored dependencies, it is non-trivial for us to derive and (maybe more importantly) maintain a composed complex license expression.

So what I am inclined to do is to omit the license expression, but maintain all the license files, which are all shipped inside the distribution. This way the license information is provided entirely in raw form and is accessible to all parties interested in recovering it.

(If we have more advancements in the standard and it is clarified that license-expression refers only to the “primary” project itself, then I would also adopt license = "mit")

There is also a 4th option, right? That is: distribution authors that have concerns with the SPDX expressions as per the existing PEP 639 don’t have to include it, but they can provide all the license texts in full form as license files, so the information is available for any interested consumer.

In this option we do lose the ability to have a “quick indicator” of the “primary” project’s license, but it looks like it is the safest option in the scary context of legal information.

6 Likes

Yes, I’ve updated since my previous opinion:

I now think that we should strongly recommend to users that the field must only be static if it accurately describes the contents of all distribution artifacts for a given release.

Yes, that is an option, although I feel that it’s worse than simply sticking with the legacy form. There’s no need to move to PEP 639, and for projects that don’t feel that the new spec meets their needs, sticking with what has worked for them for years seems like a sensible thing to do.

In what way is it worse than the legacy form?

There has been confusion in this thread that is now resolved. The conclusion is that the new spec does not change anything in respect to whether pip’s metadata should say MIT or something else in either the old or the new format.

1 Like

I was saying that @abravalheri’s suggestion of not including a SPDX expression at all is worse than retaining the legacy form - which at least records something in the metadata.

I agree that with the clarification, license = { text = "MIT" } can be simply changed to license = "MIT" with no change in meaning (and the same applies to any other cases where the legacy license value was a SPDX expression).

It looks like setuptools didn’t use the old license metadata, though, relying solely on the (now deprecated) License classifier. In that case, I can see why they would simply drop the deprecated field, and not add a PEP 639 license field - if it’s just as ambiguous as the legacy field, their reasons for not wanting to use the legacy form probably apply equally to the PEP 639 form.

1 Like

First, let me say a huge thanks to @ksurma for stepping up and getting this over the finish line, to @brettcannon for facilitating and reviewing it, and to the community for their input, support and implementation!

Now that what turned out to be serious medical issues plaguing me for the past year are (mostly) resolved, now seems as opportune a time as any to re-engage with the discussion, given that the questions raised surround the intent of portions of the text of which I happen to be the primary or sole author. I’ll address the main points raised here, with replies to more specific cases in one or more followups.

TL;DR:

  • The PEP always intended to and (clarity defects aside) does frame the license expression as that of the distribution packages, not the project.
  • Even in the couple instances where “project” is imprecisely mentioned, the therein-explicitly-referenced definition of such includes vendored dependencies (like Pip’s) and anything else checked in to the source tree.
  • For the edge cases where it does matter, there is no one single obvious definition of “project license” that can be useful or programmatically checkable for all or perhaps even most uses given all those same edge cases, without per-file license info i.e. duplicating SBOMs
  • Accordingly, I think we should:
    • Rectify this clarity defect in the PyPUG to make clear what the PEP intended
    • Also explicitly note the limitations in the lack of per-artifact license
    • Work toward a PEP defining per-artifact licenses (or perhaps solving the broader underlying problem of per-artifact metadata, viz. dynamic and PyPI as a whole, which is the real issue here)
    • If people are still interested in incorporating a notion of “project license” into the Python packaging metadata, separate from (or perhaps better just consuming) a SBOM, propose another PEP defining what that actually means
Sidenote: Some personal perspective

FWIW I’m a bit surprised and disappointed at the level of pessimism expressed by a couple of community members here toward the PEP already being a “failure” at its goals, the authors having negligently ignored crippling flaws in clarity, and the PEP being so small an improvement, or even a step backwards, for many projects relative to the previous deeply flawed and fragmented situation that it is better not to adopt it at all. Particularly given these folks include those who were heavily involved throughout the process and contributed a significant amount of constructive feedback that improved the final version.

For some context, the PEP was first officially posted for review and comment well over half a decade ago, and was finally approved after over 600 comments, dozens if not hundreds of substantive, feedback-motivated changes, as well as at least one full (Hatch) and one partial (Setuptools) production implementation people could test with years before its finalization.

Not once that I can recall during that time did anyone raise the issue of project vs. distribution license, and only once late in its lifecycle was even the issue of the differing licenses for sdists vs. wheels brought up, which was prominently acknowledged above the fold in the Non-Goals section in the first few paragraphs of the text (criticism of the Rejected Ideas section being in a separate linked file to the contrary, which was itself a response to criticism that the extensive Rejected Ideas section accumulated over this long history of feedback and modification made the substantive parts of the PEP hard to navigate).

Furthermore, given even the existing scope had already dragged out the standards process as long as it had and had for years been blocking solving a number of related issues, adding yet another order of complexity to provide even more fine-grained license information would have likely led to even more delays to the present day in not only further debating and iterating on the PEP but also implementing it and educating users on it, rather than already solving the fundamental problems for ~>95-99% of cases it does now with the potential to iterate further with the benefit of real-world experience.

As the (original or final substantive) author of most or all of the relevant bits of text here, I can confirm as correct the interpretation of the majority of people here (e.g. @brettcannon , @mikeshardmind , @steve.dower , @Liz , @notatallshaw , etc.) as to the intention of the text: that the license expression represents the license of the package, i.e. the distribution artifact(s), not whatever is (somewhat arbitrarily) considered the “project”. The edge case explicitly left to later PEPs to handle was whether it represented just the license(s) of the sdist, the union of all licenses of all distribution artifacts or something in between.

Specifically:

  • The intent, and in most cases the letter, of the PEP always was to define the license of the distributed package (distribution artifacts) built using the project’s pyproject.toml-specified build system, not the license of whatever one wants to define as the “project”.
  • The few places where the PEP imprecisely alludes to the license expression being of the “Project” are clarity defects in my writing that I didn’t get to fix before passing on the torch and for which I take full responsibility (although as detailed below, the explicitly referenced definition of “Project” does include vendored dependencies, at least those included in the source tree).
  • The original PEP didn’t include the pyproject.toml keys at all, only the Core Metadata fields for the packaging metadata; the former were only added later to provide a standardized way to populate the latter without relying on backend-specific config.
  • The primary intended consumer of this particular class of metadata was and always has been packaging-related tooling, which are fundamentally most concerned with the license of the distribution package (per the title of the PEP), which is what actually matters to the end users using them.
  • The project may be packaged differently with/without vendored dependencies by other third-party distribution systems not under the direct control of the project authors (something that PEP author @ksurma is intimately experienced with, being one of the people responsible for Python packaging at Red Hat).
    • However, the project’s authors have no direct control over that, nor can be expected to anticipate what level of vendoring might or might not be retained or stripped by any given packaging system.
    • And packagers must conduct their own due-diligence manual review of the project’s license files anyway in order to legally package it and set up their specific distribution system’s metadata accordingly regardless, so it is unclear that a “Project License” provides much or any meaningful value in those cases.

A fundamental issue that anyone trying to define a “project license” will have to contend with is trying to draw the somewhat arbitrary boundary between what is and isn’t included in a “project”, and how to handle the many edge cases—the very edge cases for which this distinction matters to begin with, yet which may in turn limit the usefulness of any single definition. For example:

  • Do vendored dependencies count if they are checked in to the source tree?
  • If they are part of the main repo vs. submodules?
  • If they are inline with the rest of the source instead of in a separate _vendor directory—inside or outside of src?
  • If they are modified/forked? To what degree?
  • If they are a single file, or multiple?
  • Or one or more functional unit(s) within a file?
  • What about tools or helper scripts vs modules?
  • Generated files?
  • Images/logo assets under non-code or other licenses?
  • Etc.

So, to define a single “project license”, either:

  • Each of those questions must be answered (which adds complexity and will necessarily limit the usefulness of the result for a substantial number of cases one way or another), or
  • Some or all left ambiguous (leaving us not that much better off than we are now), or
  • Discard the notion of a single “project license” completely and instead precisely define licenses per-path, file, etc…which you could add to Pyproject metadata, but is already handled much more thoroughly by existing SBOMs and the automated tooling surrounding them

FWIW, the “Project” used in the License Expression definition is explicitly referenced to be the PyPUG definition of the term, which then and now states:

Since most projects create Distributions using either PEP 518 build-system, distutils or Setuptools, another practical way to define projects currently is something that contains a pyproject.toml, setup.py, or setup.cfg file at the root of the project source directory.

By that definition, the vendored dependencies in Pip are part of the “project”, since they are part of the project source tree and checked in to the source repo.

7 Likes

Having thought about it over the course of the weekend, after rereading the history of the PEP and reading the new posts here, I’ve come to realize my previous statement was incorrect; I’ve got to retract it and apologize for the confusion. The PEP intended to clarify the distribution license. The difference between the project and the distribution isn’t as sharp as presented in some of the posts (Project being “A library, framework, script, plugin, application, or collection of data or other resources, or some combination thereof that is intended to be packaged into a Distribution.”)

The PEP cleanly covers the cases where the project’s “core” code matches the sdist, which in turn matches the wheels.

It attempts to cover the cases where the project contains vendored dependencies (describing them in the User scenarios and the Advanced example). In my PEP editing times I didn’t think there was a case for storing the project “core” code license separately. As it now appears, that’s material for a clarifying PEP (one that could tackle only this specific group of projects). Another, admittedly hacky, idea would be to leverage parentheses to group all of the vendored libraries’ licenses: MIT AND (Python-2.0.1 AND Apache-2.0 AND BSD-2-Clause) retains the semantics of MIT AND Python-2.0.1 AND Apache-2.0 AND BSD-2-Clause but looks different to the human eye.

It doesn’t cover more complicated cases when the sdist and wheel licenses differ. That is left for a future PEP.

Projects may abstain from migrating to PEP 639.


Debating what the right (= factually correct) license of an artifact is falls outside the scope of a Python technical standard. It’s a topic one step higher, spanning the whole of software creation, and the idea of declaring bundled/linked libraries is generally agreed upon (off the top of my head: the SPDX standard, downstream repackagers like Fedora Linux).
As is clear to everyone in this debate, declaring the correct licenses of the particular artifacts when they differ is a nontrivial task, and it’s out of scope of PEP 639. It also touches on the issue of the UX of PyPI (and other indexes), which must be taken into account when designing a more robust solution in the future.

I acknowledge for projects like pip it can be challenging to track the vendored dependencies, especially if they are often added/removed and I understand why it feels like a burden to have to revise them with each release (and reluctance to even start doing that). I know, because we do this. We definitely need improved detection tooling, but that’s out of the scope of this PEP.

Achieving machine readability was mentioned in the thread - as a Fedora package maintainer, I’m obliged to review the licensing information and inspect packages to make sure the resulting artifact correctly declares its licenses. I believe the spirit of “human in the loop” will remain in place even when there are more reliable machine-readable metadata available, solely because the licensing information is extremely important for actors like Linux distributions.

2 Likes

OK. Given this view, do you agree that the PyPI behaviour of displaying the license expression for an arbitrary one of many distribution packages for a project on the project summary page, as “the” project license is a bug? (And a rather dangerous one, because it could mislead human readers into thinking there are no license issues when there are).

What is the correct behaviour for PyPI?

And yes, I know that PyPI does this for other data as well, but I’m not aware of anything else that has the same legal implications.

2 Likes

Thank you, @CAM-Gerlach, for joining the discussion. I read your post after drafting mine, so I see there’s some duplication in our points, but in general the views match.

Given this view, do you agree that the PyPI behaviour of displaying the license expression for an arbitrary one of many distribution packages for a project on the project summary page, as “the” project license is a bug?

PyPI states on the project page:

Unverified details
These details have not been verified by PyPI

As I read it, it says: “don’t believe these details at face value, it’s up to you to verify them”. The license field below doesn’t say anything about what it refers to. So, technically speaking, nothing you can put there is a bug? (I’m not saying we shouldn’t improve it, but maybe it’s not that grave of a problem right now?).

2 Likes

I think this is a fine position to take on the PyPI issue.

Provided we’re clear to package authors that users (and indeed, the spec) are interpreting their license metadata as applying to any distributions they produce (and not necessarily redistributors), most of the time it’s going to be accurate enough for most people.

Those who are seriously concerned about the legal impact are not going to trust author-provided metadata anyway, no matter how fine-grain the display.

1 Like

Responding to a few specific comments up until my first post here:

@ksurma is an experienced Python developer and packager at Red Hat who IIRC also worked on their SPDX infrastructure and license tag conversion for Fedora, etc. packaging, and so is intimately familiar dealing with these sorts of real-world issues regularly (which among other reasons was why I invited her to succeed me as PEP author :slightly_smiling_face: ).

And I can’t speak for her on that latter question, but it seems to me it’s the same reason you’d be confident adding the right license(s) if building the expression up from scratch: manually reviewing the source licenses, which is a must anyway if you’re a packager (as I am with Conda-Forge). And including the full set in both the license expression and license files makes it easier to quickly reference them, provides a reasonable starting point and serves as an additional set of datapoints to notice a potential discrepancy.

Putting my PEP Editor hat on for a moment, while I’m not going to hard-block it I would echo @pf_moore here that the PEP is a historical document and updates should go in the canonical docs. We PEP editors have worked long and hard to guide users toward the normative document; if a large sticky banner at the top of the screen explaining this and linking to it doesn’t go far enough, we’re open to suggestions as to what is without being too invasive.

Once Setuptools’ deps have adopted license expressions, it should be relatively straightforward for an automated tool, e.g. a pre-commit hook, to construct and maintain the composed license expression.

However, while not completely trivial you’re presumably tracking your vendored dep licenses anyway to package them as you are legally required to (via the old or new license files approaches), and make sure you’re still compliant if they were to change, so at least in one sense if you’re already doing that this can be viewed as just documenting that. I’m happy to contribute a PR with the initial set, if that would help any.

FWIW, Setuptools was the example I used in the PEP

It’s pretty disappointing if, after nearly 6 years, 600 comments and hundreds, perhaps thousands, of person-hours by dozens of people, all we’ve managed to accomplish is to implement a license metadata scheme that is actively worse for the many projects that need it most than the deeply and fundamentally broken and fragmented set of metadata constructs that it was supposed to replace. And pretty unfortunate that none of the maintainers of such projects were able to speak up about this dealbreaking concern until after the PEP was reviewed, accepted, implemented and marked Final, despite their active involvement, helpful feedback and constructive contributions throughout the process (not that the honest feedback isn’t still appreciated now where it can at least motivate followup work).

I’m still not sure I understand the motivation for wanting the Python Package Index page for a package to not display the license expression that actually applies to consumers of the artifacts distributed on that package index. As a consumer of the package, I would feel rather misled if the license on the package index was not the actual license of (any of the) distribution artifacts, which is what actually matters for anyone consuming the package via that index. We’re talking about a package index after all, not a code contribution platform.

Sure, if the package vendors different dependencies in different artifacts for a release the displayed license expression won’t be fully precise (if using the union of all artifact licenses), comprehensive (if using the intersection of the same) or some combination of the two (if dynamically setting a license expression per-artifact, depending on which PyPI picks), but it is still less accurate for the expression to lack all vendored licenses completely. Just because PyPI’s handling isn’t fully correct yet doesn’t mean we should give up trying completely here, especially for the many cases where this edge case isn’t hit.

How does it work? Nowadays, the only step involved in creating a release is something like twine upload, which merely uploads some distribution files to PyPI. How does PyPI choose where to extract the project summary for a given project version from? What influence do package authors have over that?

1 Like

Yes indeed, I absolutely agree with you that it is a significant defect. The fact that it is a longstanding issue that would likely require significant re-architecting to fix was one of the stated motivations why I declared per-artifact license expressions out of scope for this PEP and a subject for a followup, to avoid blocking this PEP (and the other things it was blocking in turn) possibly indefinitely on a substantial PyPI rework.

However, given the inherent complexity here and a lack of an obvious mechanism to do so, the fact that each wheel would need its own license information, lack of support on PyPI for exposing license info on a per-distribution archive basis, and the relatively niche use case, it was determined to be out of scope for PEP 639, and left to a future PEP to resolve if sufficient need and interest exists and an appropriate mechanism can be found.

Indeed, I suggest (and would actively support to the extent I am able) such a followup PEP in my recommended next steps above, which could potentially address the underlying intersecting issues here (PyPI support, dynamic, etc) affecting per-artifact metadata in general.

I certainly don’t claim to be an expert or an authority on PyPI, but to a first approximation, my off-the-cuff initial suggestion pending further study would be to only display the license expression on the top-level PyPI project page if it matches for all artifacts (perhaps excepting those for auditwheel, etc.-injected blobs?) and otherwise have it say “See artifacts” or similar, linking to the page listing the per-release-artifact metadata, e.g. a table of release artifacts and their license expressions (not a UI/UX designer, this would require a lot more thought by a dedicated workgroup).

In theory, if they differed, PyPI could display that of the sdist (~the intersection), or the union of all artifact license expressions, but it would need to be made clear that’s what it is, with a link to see all of them per-artifact. And given that, as you say, it’s a potentially dangerous area, it may not be worth that at all if there’s an alternative to display them all per-artifact.

For now, though, we can either leave it unspecified whether for the projects that do inject additional dependencies into their wheels the (static) license expression should be that of the sdist or the union of all the licenses in the sdist and wheels (or something else?), or explicitly specify one or the other.

Not an expert on this like @pf_moore but AFAIK PyPI grabs the metadata from whatever is the first distribution artifact uploaded for a given release. See pypi/warehouse#8090 for more info.