Expressing project vs. distribution licenses post-PEP 639

I hear your frustration. Licensing has just developed over the course of years. I generally agree with @steve.dower’s point of view in the thread.
Speaking for myself, it was never my intent to bully open source maintainers into complying with corporate lawyers. The SPDX standard settles on joining the license identifiers without getting into the requirements of particular licenses, which, I believe, it does on purpose to increase the adaptability of the standard in the wider software community. Due diligence can be reduced to just listing the license identifiers.
SPDX has always been expected to become the standard for declaring licenses in Python (it’s expressed verbatim in PEP 621), and it was widely supported before the licensing PEP was drafted.

When pip is distributed, its currently declared license doesn’t match the contents of the distribution. It happens before anyone else is involved in the equation. I hear your discontent with people pointing this out. I hear it’s manual work. I hear that you care about your project license and want to be spared the other nonsense. I hear the PyPI page concern. I’m sympathetic to all that.
pip, being such an important package in the Python ecosystem, could use a precise declaration, if only because so many others use it.

If you decide not to do that, either with the legacy way of declaring licenses or with PEP 639, I have neither the means nor the will to force you. You may see occasional issues about the license, because of the community’s developing understanding of how to declare them better.


Since we will hardly move the discussion forward, let me propose:

  • The changes to the glossary entry on SPDX expressions, and the addition of a note on packaging.python.org about the PEP not covering the per-distribution-artifact license, will be made (I’ll submit the PRs)
  • The mentioned changes to licensing, including, among others, differentiating between the project and distribution license in the package metadata and proposing a solution for the per-distribution-artifact license, have to be tackled in new PEPs
  • Let’s close this thread and allow new issues to appear in new threads
3 Likes

To be clear, I wasn’t making a value judgement on the PR or your work, I appreciate all the hard work you’ve done on this topic.

IMO there’s an issue, in general, transitioning from PEP to spec. There seem to be several different reasons a PEP can’t be moved verbatim into the spec: it doesn’t belong in one place, or the spec isn’t designed to capture some things that are in the PEP, like intent. And when the wording needs to change, the PEP has been scrutinized by a wider audience than the changes worked out in the process of adding it to the spec.

So, I end up reading historical PEP documents, and I’m sure others do too.

3 Likes

This is a bit reassuring; however, there is a spectrum of downstream users out there and not all of them may take the same approach. We can already see some amount of pressure on maintainers to include an SPDX expression in their projects. There is the PR mentioned by Paul and, if needed, a quick check on the setuptools issue tracker and open PRs can provide further evidence.


What if we establish a convention among the few projects that have to deal with vendoring? A convention is something that we can implement fairly quickly and iterate on if necessary.

The PEP seems to allow custom license expressions using the LicenseRef- prefix.
We could use something like LicenseRef-Primary-MIT or LicenseRef-Core-MIT or LicenseRef-MIT-and-others to signify “the main body of work is licensed under MIT, but there are other dependencies distributed alongside it with different licenses, please carefully check all the license files”.

This does not go against the standard, expresses intent, and (because it is a custom identifier) will flag to any interested consumer that they need to look at the licensing more carefully.

(I am bad at naming, so the exact expression can be further debated/bikeshed. Now that I think about it, MIT AND LicenseRef-Others also seems viable.)
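To illustrate how a consumer could react to such a convention, here is a minimal sketch. The helper names `custom_identifiers` and `needs_manual_review` are hypothetical and not part of any packaging tool, and the tokenizer is a simplification rather than a full SPDX expression parser.

```python
import re

# Simplified tokenizer: splits an SPDX-style expression into identifiers
# and operators, ignoring parentheses. Not a full SPDX parser.
_TOKEN = re.compile(r"[A-Za-z0-9.+-]+")


def custom_identifiers(expression: str) -> list:
    """Return the LicenseRef- identifiers found in the expression."""
    return [t for t in _TOKEN.findall(expression) if t.startswith("LicenseRef-")]


def needs_manual_review(expression: str) -> bool:
    """Hypothetical policy: any custom identifier means a human should
    read the bundled license files before trusting the expression."""
    return bool(custom_identifiers(expression))


print(custom_identifiers("MIT AND LicenseRef-Others"))
print(needs_manual_review("MIT"))
```

A tool following this policy would treat the presence of any LicenseRef- identifier as a signal to stop automated processing and defer to the license files.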

@pf_moore would this approach be something you would be interested in?

1 Like

I doubt that is going to satisfy people who are asking for SPDX expressions. I think that pip and setuptools should provide accurate SPDX expressions that cover the whole contents of the distributions. The problem if I understand correctly is just that:

  • You don’t want to write license = A and B and ... in pyproject.toml if there is no way to record the project license separately.
  • You don’t want other things (e.g. GitHub/PyPI) to read the SPDX from License-Expression in the sdist and misrepresent that as the project license.

If projects like pip and setuptools (I can also think of other likely cases) will not be happy with using the metadata according to the specification in a way that satisfies what downstream consumers want, then there just needs to be another PEP to build on the good work of PEP 639. As @ksurma suggests, no PEP or specification etc. can force package authors to do things that they don’t want. The specification needs to be defined in a way that package authors are happy with, or otherwise they won’t go along with it.

The issue is that there is an ambiguity about how license metadata was being used with different people wanting different things from it. PEP 639 did not create that problem except in the sense that it provides a clear way that the full distribution license can be expressed so people now want that to be in the sdist. PEP 639 provides all the machinery that is needed to solve this though: if licenses are recorded in SPDX then it should be straight-forward for cases involving vendoring to be handled by tools that can combine different license expressions.

An additional field in pyproject.toml is all that is needed as suggested by @pitrou above:

license = "MIT"
license-vendored = "A AND B AND ..."

Most projects could omit the second field in which case it would be taken as empty. Then the sdist License-Expression can record the SPDX expression that combines these. I assume that both pip and setuptools would be happy with this provided GitHub and other things that display license information don’t conflate License-Expression with the project license.
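As a rough sketch of what a build backend could do with those two fields: the `combine` function below is hypothetical, and `license-vendored` is only the field name suggested in this thread, not an existing pyproject.toml key.

```python
def _wrap(expression: str) -> str:
    """Parenthesize compound expressions so that an inner OR keeps its
    meaning once the expression is joined with AND."""
    return f"({expression})" if " " in expression else expression


def combine(project_license: str, vendored: str = "") -> str:
    """Build the sdist License-Expression from the project license and
    the (possibly empty) expression covering vendored code."""
    project_license = project_license.strip()
    vendored = vendored.strip()
    if not vendored:
        # Most projects vendor nothing: the two licenses coincide.
        return project_license
    return f"{_wrap(project_license)} AND {_wrap(vendored)}"


print(combine("MIT"))
print(combine("MIT", "Apache-2.0 AND ISC"))
```

The sketch does no validation or deduplication of identifiers; a real tool would presumably normalize the result with an SPDX-aware library.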

The other case of wheels having different licenses is also easily handled with SPDX expressions, so although the PEP declared it out of scope it provides exactly what is needed. There just needs to be a clear understanding that the distribution license is per-distribution metadata, so the license for an sdist is not necessarily the same as the license for a built wheel, and different wheels can be different etc. Indexes such as PyPI should show license information on a per-distribution level and not take the license of any particular distribution and display it either as the project license or as if it applies to all the distributions.

I have considered whether there is some way to record the relationship between sdist vs wheel etc license but I don’t think there is a way to express the combinations that would capture all cases and still be useful to consumers. I think all that can be done is to say that distributions can have different licenses and that build backends and vendoring tools should use SPDX expressions to combine the licenses of vendored things when needed. The different licenses could still be listed in pyproject.toml metadata but in tool-specific sections.

1 Like

I wouldn’t object to it, but I suspect it would be more difficult to sell to the other pip maintainers. And I’m not sure that I like it enough to be willing to make the case for it over just omitting the license expression. So it would probably be worth hearing opinions from some of the other pip maintainers (and from anyone else who would use this).

1 Like

I would be somewhat interested in using it. check-jsonschema vendors schemas, sometimes under an upstream license and sometimes in the absence of one (I’ve reached out to some maintainers and they don’t consider their schemas code, which confuses me…).

Watching this discussion has left me most inclined towards no longer publishing a license and saying “you can fish it out of the dists”, since it’s not possible to render the division between my code and the vendored files both easily and with clarity.

I agree that the PyPI presentation of a license is important to me here. That probably requires a separate thread.

One thing which I dislike about the combined license declaration (e.g., MIT and (...)) is that it makes the license sensitive to what is vendored and even how it is mechanically distributed. If I move all of the vendored files into a data package, they’ll disappear from the license. I know some folks are saying “that’s fine, that’s just reflecting the reality of your licenses”, but as a maintainer it feels very strange. The license is part of the public face of my project, and changing that due to mechanical, internal details is not how I expect my manual declaration of metadata (in pyproject.toml) to work. I get the rationale, but it feels weird and that’s what I’m reporting/sharing.

I don’t have all that much insightful stuff to say on this. I’d like to be able to present the project license publicly. I’d like to be able to declare the complex expression for built artifacts. Right now the two are still combined, which is understandable but not really what I want as a maintainer.

4 Likes

This is the critical issue for me. Until that happens, I don’t see an acceptable solution for pip that allows us to record accurate distribution metadata in the License-Expression field (as I’ve said, having the complex distribution metadata displayed on the project page is a showstopper for me).

The other alternative is for PyPI to simply stop displaying license data until it can do so accurately.

It sucks that so much of this is contingent on the fact that PyPI presents distribution metadata as if it were project metadata. But to be fair, most metadata doesn’t really distinguish between project and distribution file (who has a wheel with a different author, or with a different description?) License is simply the first time when the two properties “is displayed by PyPI” and “is likely to have different values in different distribution files” have come together.

5 Likes

> It sucks that so much of this is contingent on the fact that PyPI presents distribution metadata as if it were project metadata. But to be fair, most metadata doesn’t really distinguish between project and distribution file (who has a wheel with a different author, or with a different description?) License is simply the first time when the two properties “is displayed by PyPI” and “is likely to have different values in different distribution files” have come together.

Not to be pedantic, but the pip project authors are not the authors of all the contents of the distribution file either, again because of the vendored dependencies.

2 Likes

That’s true, and not pedantic at all. Does anyone want the Author field to include the authors of all vendored dependencies? If not, then maybe we need to rethink how strongly we want to assert the “core metadata is distribution file metadata” principle. And if so, then I look forward to the issues that will get raised on pip and setuptools, asking us to change our metadata to reflect our vendored dependencies…

Go down this route too far, and “pip” isn’t the name of our dependencies, and 25.1 isn’t the version. But that is getting pedantic :slightly_smiling_face:

2 Likes

If I were rewriting the specification and the sites that display it from scratch without caring for historical context, I wouldn’t use “author” but “publisher”, referring to who is responsible for publishing the distribution, and that would be separate. For those who care, authorship can be derived down to per-line blames by going through the history of the project’s VCS, and it doesn’t generally have the same consequences here as license information.

I think there is a good point here somewhere about what is actually useful metadata and useful displayed information and we could likely get away with changing the author field to be more explicit here in referring to the project, without the same consequences that doing so for the license would have in terms of user confidence.

4 Likes

Can you explain why it is such a showstopper to you? For those that don’t care about license identifiers, it appears to be primarily a cosmetic issue (“ewww, long license expression”), but what harm does it actually do?

I agree that the healthiest way would be to distinguish the project license from what gets vendored (and pipe that through pyproject.toml, pip, PyPI, etc.), but given that this is not in place yet, why is it such a problem to express the actual license situation as well as possible, given the constraints of the available metadata fields? Cosmetics are hardly more important than accuracy here[1].


  1. it’s also inconsistent with the frequently appearing line of argument that pyproject.toml was primarily designed for wheels, i.e. explicitly for the distribution case.

3 Likes

Possibly not in a way that will sound convincing to you, as there appears to be a significant disconnect between people involved in this discussion over whether a “project license” is an important thing or not, and what people expect of the PyPI project page. But I’ll try.

I consider the project license (the one that the project authors chose when starting the project) to be important, and significant. In many ways, it defines the way the project maintainers feel about sharing their code, as well as their tolerance for viewing open source as a “matter of principle” or a means to an end. For me, therefore, it is crucially important for people viewing information about “the project” to see that license.

When looking for Python projects, PyPI is (in my experience) the key place that people look. And when they look there, they see a summary of the project, including the description, project statistics, a list of maintainers, links to key project web pages, and the license. I can confirm that when looking for a library for a particular task, I tend to look at all of this information when making my choice. I can’t speak for others, but I consider that to be a perfectly reasonable way of assessing a project.

Not seeing the license there would be a shame, although I’ll concede not a disaster (that’s why “don’t publish any license metadata” is an acceptable workaround for me).

Seeing a license expression like MIT AND (Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND ISC AND MIT AND MPL-2.0 AND PSF-2.0), though, would give completely the wrong impression. It says to me that this project is obsessive about licensing, and over-complicates matters. I would expect that if I were to get involved with this project, I’d be subjected to debates over license minutiae, possibly at the expense of the simple goal of sharing useful code.

Remember, I’m viewing the license on PyPI to be “the license the project authors chose for their project”. So it’s a statement of intent[1]. That’s what it means (in all practical senses) for projects that don’t vendor any 3rd party code. And no amount of claims that “it’s not how you should interpret what PyPI shows” will counteract the weight of the reality that this is how people have always interpreted the data in the past.

I don’t want pip to be the sort of project that gives that type of impression to new users, or worse, to potential contributors. Pip’s a foundational packaging tool, and I want us to come across as welcoming and easy to work with, to users and contributors.

I hope that helps. If you want another perspective, you should read @sirosen’s post above, which expresses a lot of what I think much more clearly than I just did.


  1. Even if it’s so complex that it’s unclear how anyone could have such a complex intent

4 Likes

Thanks for the response. I can see where you’re coming from, though I don’t necessarily agree with the conclusions.

IMO the key place to look at for a project is the source repo (wherever it is, which is generally reachable from the PyPI project page) – as the name Python Package Index implies, it’s primarily about packages, not projects. Of course I understand that it’s a convenient shortcut because it’s all in one place and for most cases, there’s no vendored dependencies to consider at all.

That’s… quite a stretch. I don’t deny that there are projects that are painfully obsessive about licensing, but I could make the same (spurious) argument that this kind of license implies that the maintainers must be professional, conscientious about high quality work, and a joy to collaborate with.

One way or another, this would be a pretty extreme case of judging a book by its cover, especially if the repo (with the actual project license, and lots of searchable history) is just a click away. And even more so once people would learn over time that PyPI shows distribution metadata, not project metadata.

3 Likes

Just to be clear for anyone who may be so motivated, you’re saying you want License-Expression to only be about the project’s metadata, and so another PEP that allowed for specifying the distribution metadata – both in pyproject.toml and core metadata – would be required/acceptable (we already have licenses in wheels, so at least the license files part is taken care of)?

Not exactly. I’d be equally happy with a new field that recorded the project license. In fact that would be better - I see no reason to change the interpretation of License-Expression if we don’t need to.

2 Likes

Sorry, yes it is. I’ve experienced far too many GPL zealots being utterly obnoxious, and as a result I am very strongly distrustful of people who focus on the letter of the law over intent, in the context of licensing.

A complex license expression, presented as a project license as opposed to a distribution file license, comes across that way to me.

1 Like

Per the request of the OP and PEP author @ksurma here I’ve split the followup discussion on how to better express project vs. distribution vs. per-artifact license to a new thread where we can focus on how to move forward given this PEP is Final.

Fully agree with you there.

Though, at least in theory, to avoid license field proliferation in core metadata, perhaps we could simply un-deprecate and (continue to) use the existing License core metadata field for this, since that’s what it’s generally been used for historically anyway, as you’ve mentioned. Given that this is purely for human consumption and advisory purposes, it doesn’t require the same degree of precision as the actual distribution license. And since the “project license” as conceptualized here is typically a single license for which further free-text clarification may be useful, it need not be SPDX (though it can be) or follow a rigid format, so our rationale for not re-using the License field for a license expression doesn’t really apply.

We’d still need new keys in pyproject.toml, but I’ve got a proposal for that, which keeps license = str as it is for the currently-supported simpler cases and uses license subtable keys to disambiguate the more complex cases.

But maybe it’s not worth reusing this field if current tools don’t really use it anyway, and one more field in the Core Metadata vs. reusing a mostly-identical existing one doesn’t matter as much as the gain in explicitness and lack of ambiguity (whereas key proliferation in pyproject.toml is more of a practical concern).

Taking into account everyone’s feedback here, here’s a proposal for a followup to handle the use cases presented and make the project/distribution/specific artifact license clear and explicit, balanced with trying to minimize additional complexity for the simple cases and further churn for projects already adopting PEP 639.

Pyproject metadata

Keep the semantics of license = str the same as now, i.e. this is the license of all distributions.

To handle the more complex cases discussed here (and also be more explicit), allow table subkeys containing license expressions under license, which may be included in any combination to express the specificity required:

  • project - License(s) under which code is contributed to the project, i.e. the “project license”
  • distribution - License(s) that apply to all distributions of the project
  • sdist - License(s) that apply specifically to the source distribution, in addition to those in distribution
  • wheels - License(s) that apply specifically to all wheels, in addition to those in distribution
  • wheeltag - Table whose subkeys are each a wheel tag, containing an additional license expression specific to that tag.

Keys other than project and distribution can be specified as empty strings if desired to be explicit that the relevant context has no additional licenses.

Making sdist, wheels, etc. additive minimizes extra verbosity and duplication, while still giving an equivalent result if users do duplicate the rest of the distribution license. It also allows specifying disjoint, complete sdist and wheel licenses, since distribution is not required.

Alternatively, we could have wheel(s) be a table with the tags as subkeys, with a special all key for what wheels is now and possibly still allowing wheels to be a string value if all wheels have the same licenses.

In distribution core metadata

License-Expression is clarified to mean specifically the license expression that applies to the distribution artifact containing said core metadata (currently it is the same for all artifacts). For an sdist, this is license.distribution + license.sdist; for a wheel, license.distribution + license.wheels + any matching license.wheeltag entries.
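To make the additive rule concrete, here is a minimal sketch. The dictionary stands in for the proposed license table with made-up values, `license_expression` is a hypothetical helper, and matching a wheel tag by exact string equality is an assumption, since the proposal does not say how tags would be matched.

```python
from typing import Optional

# Hypothetical data mirroring the proposed license table.
# Key names come from the proposal; the values are made up.
LICENSE_TABLE = {
    "project": "MIT",
    "distribution": "MIT",
    "sdist": "",
    "wheels": "Apache-2.0 AND ISC",
    "wheeltag": {"cp313-cp313-win_amd64": "Zlib"},
}


def _and_join(parts):
    """Join non-empty expressions with AND, parenthesizing compound ones."""
    kept = [p.strip() for p in parts if p and p.strip()]
    if len(kept) == 1:
        return kept[0]
    return " AND ".join(f"({p})" if " " in p else p for p in kept)


def license_expression(table, artifact, wheel_tag: Optional[str] = None):
    """Compute License-Expression for one artifact under the additive rule."""
    parts = [table.get("distribution", "")]
    if artifact == "sdist":
        parts.append(table.get("sdist", ""))
    elif artifact == "wheel":
        parts.append(table.get("wheels", ""))
        if wheel_tag is not None:
            # Assumption: tags are matched by exact string equality.
            parts.append(table.get("wheeltag", {}).get(wheel_tag, ""))
    else:
        raise ValueError(f"unknown artifact type: {artifact!r}")
    return _and_join(parts)


print(license_expression(LICENSE_TABLE, "sdist"))
print(license_expression(LICENSE_TABLE, "wheel", "cp313-cp313-win_amd64"))
```

The sketch leaves duplicates in place, so repeating the distribution license under sdist or wheels yields a longer but equivalent expression, which matches the additive design described above.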

Project-License is added, the license expression for new code contributed to the project/the root LICENSE.txt. (Or in theory we could reuse the old License field for this, loosening the constraint that it must be an SPDX expression, though upon further thought it is probably not worth it.)

Dynamic behavior

For the case described above where everything is specified statically in pyproject.toml, License-Expression would not need to be marked dynamic there. It would seem to need to be either marked dynamic or special-cased in the PEP 643 sdist metadata, since it cannot be copied verbatim into the wheel (it needs to be retrieved from the static pyproject.toml instead).

Ideally, tools would be able to automatically determine the licenses of code/binaries they vendor into wheels, or perhaps the sdist as well. License-Expression could be marked dynamic in these cases, and with @henryiii’s PEP for partially-dynamic metadata, conceivably baseline license(s) could be specified in pyproject.toml up to the desired degree of specificity, and/or licenses not detected by standard tools manually added, with the rest added automatically at build time in the appropriate contexts. Alternatively, tooling could be used pre-build to update the static values in pyproject.toml, without the need for dynamic.

PyPI UI/UX

On the project main page, PyPI would if provided display “project license” and “distribution license”. “Project license” is Project-License (project.license), while “distribution license” is License-Expression if it is identical between all artifacts for the latest release (or equivalently, if only either license = str or license.distribution is specified in the pyproject.toml with no further per-distribution specificity).

If distribution license varies between distributions, then the value of “Distribution license” is the link “See distributions” (or equivalent wording, e.g. “See per-file information”), which links to the Files page that would now list the license next to each artifact (or, less ideally, at least list this on the view details page).
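The display rule could be sketched roughly as follows. The function name and the filenames are hypothetical, this is not how PyPI is implemented, and comparing expressions as exact strings is an assumption (a real index would probably normalize them first).

```python
def distribution_license_display(expressions_by_file: dict) -> str:
    """Pick what to show as "Distribution license" for one release.

    expressions_by_file maps each artifact filename to the
    License-Expression found in that artifact's core metadata.
    """
    distinct = set(expressions_by_file.values())
    if not distinct:
        return ""  # nothing provided, nothing displayed
    if len(distinct) == 1:
        return next(iter(distinct))  # identical across all artifacts
    return "See distributions"  # would link to the per-file listing


same = {"example-1.0.tar.gz": "MIT", "example-1.0-py3-none-any.whl": "MIT"}
mixed = {"example-1.0.tar.gz": "MIT", "example-1.0-py3-none-any.whl": "MIT AND ISC"}
print(distribution_license_display(same))
print(distribution_license_display(mixed))
```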

5 Likes

I agree that it is unlikely that it would satisfy all people completely, but I was wondering if it could improve a little bit our lives for the time being until a better solution is available (changes in spec take time and can be emotionally taxing).

Currently, the absence of License: or License-Expression: metadata can lead to some confusion, with people considering it a bug. I am still not convinced about it, but for the sake of brainstorming, this is what I have in mind for the 2 possible approaches:

  1. Omitting the License-Expression:
  • We face two groups: (a) those confused by the lack of metadata, believing it to be a bug, and (b) those who want setuptools to provide a complete license expression for their automation tools to consume.
  2. Adding a valid but vague custom License-Expression:
  • We primarily deal with the second group and explicitly state our intention to be vague.

In an ideal world where there are conditions and tools available to automate the process of deriving such a complex expression, I wouldn’t mind calling them in the script that updates the vendored dependencies. But that would not be a priority for me, and it is unrealistic if we have to manually produce and maintain it.

All said, I am still not convinced about using a vague custom license expression. If users continue to report the absence of license metadata as a bug, I might create a PR adding a vague custom license expression. Otherwise, we may continue omitting it.

Some other personal opinions:

  • I agree with Paul that it would be beneficial to display the “primary/core” project license on PyPI instead of (only) the artifact’s license.
  • I agree that it would be great if other tools (like GitHub) displayed the “primary/core” project license. It wouldn’t be nice if tools misrepresented the project license by reading the “artifact’s” license expression from pyproject.toml.
  • Currently, I don’t feel the need to differentiate the licenses of sdists, wheels, or GitHub tarballs, etc.
  • I don’t have a particularly strong need to separate the license metadata for the artifacts from the license metadata for the project itself. Ideally I would prefer to only declare the license metadata for the project itself. The “read the license files shipped inside the artifact if you want to know more” approach is fine by me.
1 Like