Expressing project vs. distribution licenses post-PEP 639

The prerequisite for this would be to stop displaying a license on the PyPI project page, and instead show a license for each individual downloadable distribution.

The problem, then, is that most people only look at the PyPI project page and let pip install choose the right distribution for their system. Nobody wants to locate the right file manually and look up the license next to it.

But, really, I don’t understand what the problem is with PyPI showing the project license on the project’s page. It’s time to stop being dogmatically opposed to it, and admit that it’s a useful piece of metadata. You don’t like it? Don’t look at it, but don’t pretend that nobody should like it.


I don’t want this to sound like I’m suggesting that PEP 639 wasn’t worth doing, but do we have good information on who those people are? My original assumption was that companies and large organisations were intending to scan metadata to audit licenses for compliance reasons, but @steve.dower (and others) said that they use other sources for that information, and that we “shouldn’t worry” about those users.

So I’m genuinely unclear here - who are the actual consumers of the License-Expression metadata, and what are their needs? How much of a problem is it if a few edge cases[1] like pip and setuptools simply don’t provide a machine readable license expression?


  1. I’m ignoring the case of auditwheel adding vendored libraries for the moment ↩︎

I think it’s more a matter of practicality rather than principle. PyPI[1] doesn’t have a concept of “project metadata” - everything comes from one distribution file or another (usually the first uploaded, which is essentially arbitrary). Re-engineering that design is a non-trivial problem. The PyPI team know it needs doing, but they are so stretched on resources that it’s unlikely to happen in the immediate future :slightly_frowning_face:


  1. And in some senses, the whole packaging metadata ecosystem ↩︎

We can “pick up the vibe” in pypa/setuptools pull request #4957 (“Restore License Classifier” by WilliamRoyNelson) and pypa/setuptools issue #4956 (“[Docs] PyPI Meta tags no longer include license”) [1]

These discussions suggest organisations use some tools like license_scanner and liccheck.


  1. I apologise if this sounds like I am pointing fingers. This is not my intention, but I admit that it is difficult to write things in a way that does not sound like that. I understand that different people have different needs/conditions and that sometimes, e.g. in smaller companies, people don’t have the resources of a full legal team supporting them and rely on best effort. It is not my intention to be confrontational, but at the same time, I don’t want to absorb the responsibility of dealing with such complex 3rd party licensing concerns for myself. ↩︎

FWIW, I don’t see this as any better than not using it at all.

As has been mentioned a number of times on this thread, GitHub and other code hosts already automatically display the project’s license when possible, based on strict matching of the LICENSE.txt or equivalent file, as that is the canonical source of truth as far as the project license goes. It is particularly important for them that, by default, they concern themselves with the license the project is developed under (rather than that of some distribution of it), due to the implications for licensing contributions. This is even codified in the GH ToS. Therefore, I don’t see any plausible chance of anything like that happening.

Indeed for most projects, including Pip and Setuptools, there is no difference, and we believed the population of projects for which there is one was small enough, relative to the additional complexity, to defer it to a followup PEP. However, folks here have emphasized its importance and brought up a number of very high profile/widely used projects where there is a difference, in some cases a very significant one (e.g. FOSS/non-FOSS).

It may not matter as much to the folks contributing code, but for users of your packages the distribution license is what actually controls how they can and can’t use the package and what conditions are placed on them, not the project license.

I for one (and I suspect most others here) don’t object unconditionally to showing it at all, but rather to displaying it exclusively, without the distribution license(s) and without clearly specifying that it is the project license, i.e. merely the license under which (most) development of the project takes place rather than the actual license that applies to users of the project. That misleads the user into thinking it is the only license that applies to the package they are downloading and using (rather than merely one among many, in the cases at issue here). This is the Python Package Index, after all, rather than the Python Project Index (the project’s development site, code host, etc., where the project license is directly relevant) :slight_smile:

For what it’s worth, as one of “those people” and the primary author of the PEP for most of its life, I care about it as a career open source maintainer who wants to ensure I can accurately, portably and ethically express the license of the software I maintain and the work of others it includes, make sure I’m aware of and respecting the license of the software I consume and depend on, and make it easy for others to view and respect mine. And I’ve never worked a day in my life for a for-profit corporation or large organization, or on any proprietary software.

Nor do I qualify as the opposite, one of those “GPL zealots being utterly obnoxious”, as I use MIT for all the projects I create/maintain and am no particular fan of Stallman or the FSF’s overly strict and pedantic stance on many things. (Although whether I still qualify as the latter bit, or whether “I am very strongly distrustful of people who focus on the letter of the law over intent” still applies to me, I cannot say.)


Out of curiosity I downloaded one wheel each from 4664 of the top 5000 downloaded PyPI packages.
Of those 219 contained a License-Expression: in the METADATA file.
Of those 1 (boost_histogram) contained AND, and 2 (structlog and coincurve) contained OR.

| License-Expression: | Count |
| --- | --- |
| Apache-2.0 | 93 |
| MIT | 85 |
| BSD-3-Clause | 15 |
| BSD-2-Clause | 5 |
| ISC | 5 |
| MPL-2.0 | 4 |
| PSF-2.0 | 2 |
| MIT OR Apache-2.0 | 2 |
| Unlicense | 2 |
| GPL-2.0-or-later | 2 |
| LGPL-3.0-only | 1 |
| LGPL-3.0-or-later | 1 |
| BSD-3-Clause AND BSL-1.0 | 1 |
| CC0-1.0 | 1 |

The most common (out of 491 distinct) values of License: METADATA were:

| License: | Count |
| --- | --- |
| MIT | 1018 |
| MIT License | 637 |
| Apache-2.0 | 324 |
| BSD | 227 |
| Apache 2.0 | 216 |
| Apache License 2.0 | 141 |
| UNKNOWN | 114 |
| BSD-3-Clause | 112 |
| Apache License, Version 2.0 | 49 |
| BSD-2-Clause | 34 |
| Apache-2.0 license | 29 |
| Apache | 28 |
| MIT license | 25 |
| NVIDIA Proprietary Software | 23 |
| Apache Software License | 20 |

Only 29 contained AND (3) or OR (26) in the first 40 characters of License::

| License: | Count |
| --- | --- |
| MIT OR Apache-2.0 | 7 |
| LGPL-3.0-only OR GPL-2.0-only OR […] | 4 |
| Apache-2.0 or BSD | 2 |
| Apache-2.0 OR MIT | 2 |
| MPL-2.0 AND MIT | 1 |
| MIT and MPL-2.0 | 1 |
| Apache-2.0 OR BSD-3-Clause | 1 |
| Apachev2 or later or GPLv2 | 1 |
| BSD-3-Clause and Public-Domain | 1 |
| BSD 3-Clause License or Apache […] | 1 |
| BSD 2-Clause or Apache-2.0 | 1 |
| GNU General Public License v2 or […] | 1 |
| GPL-2.0-or-later OR Apache-2.0 | 1 |
| LGPL-2.1-only OR MPL-1.1 | 1 |
| EPL-2.0 OR GPL-2.0-or-later | 1 |
| EPL-2.0 OR BSD-3-Clause | 1 |
| MPL-1.1 OR GPL-2.0-only OR […] | 1 |
| CC0-1.0 OR Apache-2.0 | 1 |

(I didn’t check for “,”, “/” or “and/or” etc.)
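For anyone wanting to reproduce counts like these, here is a minimal sketch of one way to tally the two fields. It is not the script used for the numbers above. The function names (`read_wheel_metadata`, `parse_license_fields`, `has_operator`, `tally`) are made up for illustration, and it assumes each wheel has a single top-level `.dist-info/METADATA` file.

```python
import zipfile
from collections import Counter
from email.parser import HeaderParser


def read_wheel_metadata(wheel_path):
    """Return the text of the top-level *.dist-info/METADATA file in a wheel."""
    with zipfile.ZipFile(wheel_path) as zf:
        for name in zf.namelist():
            # Skip METADATA files of vendored packages nested deeper down.
            if name.endswith(".dist-info/METADATA") and name.count("/") == 1:
                return zf.read(name).decode("utf-8")
    raise ValueError(f"no METADATA found in {wheel_path}")


def parse_license_fields(metadata_text):
    """Return (License-Expression, first 40 chars of License); None if absent."""
    headers = HeaderParser().parsestr(metadata_text)
    expression = headers.get("License-Expression")
    legacy = headers.get("License")
    if expression is not None:
        expression = expression.strip()
    if legacy is not None:
        legacy = legacy.strip()[:40]
    return expression, legacy


def has_operator(value):
    """True if AND or OR appears as a separate word, in any case.

    Splitting on whitespace avoids counting "GPL-2.0-or-later" as an OR.
    """
    words = value.upper().split()
    return "AND" in words or "OR" in words


def tally(metadata_texts):
    """Count the values of both fields across many METADATA documents."""
    expressions, legacy = Counter(), Counter()
    for text in metadata_texts:
        expression, old = parse_license_fields(text)
        if expression is not None:
            expressions[expression] += 1
        if old is not None:
            legacy[old] += 1
    return expressions, legacy
```

Something like `tally(read_wheel_metadata(p) for p in wheel_paths)` followed by `Counter.most_common()` would then give tables of the same shape as the ones above.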

Maybe about 100-200 seemed to mention different license types in their License-File:

| Project | License: | License-File: mentions |
| --- | --- | --- |
| Bootstrap_Flask | MIT | MIT, Apache |
| ConfigUpdater | MIT | GPL, MIT |
| Django | BSD-3-Clause | BSD, GPL |
| EditorConfig | PSF-2.0 | BSD, GPL |
| Flask_Admin | BSD | BSD, MIT |
| PennyLane_Lightning | Apache License 2.0 | MIT, Apache |
| Pympler | Apache License, Version 2.0 | BSD, MIT, Apache |
| adbc_driver_manager | Apache-2.0 | BSD, MIT, Apache |
| ag2 | Apache Software License 2.0 | MIT, Apache |
| amqp | BSD | BSD, GPL |
| astropy | BSD-3-Clause | GPL, MIT |
| audioop_lts | PSF-2.0 | BSD, GPL |
| auditwheel | MIT | BSD, MIT |
| aws_cdk_lib | Apache-2.0 | BSD, MIT, Apache |
| azure_ai_ml | MIT License | BSD, GPL, MIT, Apache |
| azure_monitor_opentelemetry | MIT License | MIT, Apache |
| botocore | Apache License 2.0 | MIT, Apache |
| celery | BSD-3-Clause | BSD, GPL |
| certbot | Apache License 2.0 | MIT, Apache |
| contextlib2 | PSF License | GPL, Apache |
| coremltools | BSD | MIT, Apache |
| crc32c | LGPL-2.1-or-later | BSD, GPL |
| cryptography | Apache-2.0 OR BSD-3-Clause | BSD, Apache |
| datadog | BSD-3-Clause | BSD, Apache |
| datadog_api_client | BSD | BSD, MIT, Apache |
| ddtrace | LICENSE.BSD3 | BSD, Apache |
| django_celery_beat | BSD | BSD, GPL |
| django_celery_results | BSD | BSD, GPL |
| dulwich | Apachev2 or later or GPLv2 | GPL, Apache |
| et_xmlfile | MIT | BSD, GPL, MIT |
| fastavro | MIT | MIT, Apache |
| fixtures | Apache-2.0 or BSD | BSD, Apache |
| grpcio | Apache License 2.0 | BSD, Apache |
| immutables | Apache License, Version 2.0 | BSD, Apache |
| jaxlib | Apache-2.0 | BSD, MIT, Apache |
| markdown2 | MIT | BSD, GPL, MIT |
| mecab_python3 | BSD | BSD, GPL |
| mujoco | Apache License 2.0 | BSD, GPL, MIT, Apache |
| mypy | MIT | GPL, MIT |
| neo4j | Apache License, Version 2.0 | BSD, Apache |
| newrelic | Apache-2.0 | BSD, MIT, Apache |
| newrelic_telemetry_sdk | Apache-2.0 | MIT, Apache |
| nipype | Apache License, 2.0 | BSD, Apache |
| numba | BSD | BSD, GPL, MIT |
| opencv_contrib_python | Apache 2.0 | BSD, GPL, MIT, Apache |
| opencv_contrib_python_headless | Apache 2.0 | BSD, GPL, MIT, Apache |
| opencv_python | Apache 2.0 | BSD, GPL, MIT, Apache |
| opencv_python_headless | Apache 2.0 | BSD, GPL, MIT, Apache |
| openvino | OSI Approved :: Apache Software License | BSD, GPL, MIT, Apache |
| oracledb | Apache and/or UPL | BSD, GPL, MIT, Apache |
| outcome | MIT OR Apache-2.0 | MIT, Apache |
| pandas_market_calendars | MIT | MIT, Apache |
| pdbp | PSF | BSD, GPL |
| pillow | MIT-CMU | BSD, GPL, MIT |
| prometheus_client | Apache Software License 2.0 | BSD, Apache |
| psycopg2 | LGPL with exceptions | BSD, GPL |
| psycopg2_binary | LGPL with exceptions | BSD, GPL |
| pyahocorasick | BSD-3-Clause and Public-Domain | BSD, GPL, MIT |
| pyarrow | Apache Software License | BSD, MIT, Apache |
| pycurl | LGPL/MIT | GPL, MIT |
| pymc | Apache License, Version 2.0 | MIT, Apache |
| pynose | GNU LGPL | BSD, GPL |
| pypdfium2 | BSD-3-Clause, Apache-2.0, […] | BSD, Apache |
| pyramid_debugtoolbar | BSD | GPL, MIT |
| pyreadstat | Apache License Version 2.0 | MIT, Apache |
| pyroute2 | GPL-2.0-or-later OR Apache-2.0 | GPL, Apache |
| python_subunit | Apache-2.0 or BSD | BSD, GPL, Apache |
| pytype | Apache 2.0 | MIT, Apache |
| rdt | BSL-1.1 | GPL, MIT |
| ruff | MIT | BSD, GPL, MIT, Apache |
| shapely | BSD 3-Clause | BSD, GPL |
| smartsheet_python_sdk | Apache-2.0 | MIT, Apache |
| sniffio | MIT OR Apache-2.0 | MIT, Apache |
| snowflake_connector_python | Apache-2.0 | MIT, Apache |
| symengine | MIT | BSD, GPL, MIT, Apache |
| tableauhyperapi | Apache-2.0 | BSD, MIT, Apache |
| tb_nightly | Apache 2.0 | BSD, MIT, Apache |
| tensorboard | Apache 2.0 | BSD, MIT, Apache |
| torch | BSD-3-Clause | BSD, GPL, MIT, Apache |
| trio | MIT OR Apache-2.0 | MIT, Apache |
| uv | MIT OR Apache-2.0 | MIT, Apache |
| uvloop | MIT License | MIT, Apache |
| vine | BSD | BSD, GPL |
| yarg | MIT | MIT, Apache |

(But this is only approximate. Some might just mention things like “BSD is GPL-compatible” for example, or concatenate licenses without mentioning a license type name explicitly etc.)
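As a rough illustration of why this kind of check is only approximate, here is a sketch of a keyword scan over license file contents. It assumes plain substring matching on the four family names, which is a guess at the approach and not necessarily what was actually run. `FAMILIES` and `mentioned_families` are hypothetical names.

```python
# Family names, in the column order used in the table above.
FAMILIES = ("BSD", "GPL", "MIT", "Apache")


def mentioned_families(license_texts):
    """Return the families whose name appears anywhere in the given texts."""
    combined = "\n".join(license_texts)
    # Plain substring matching: "LGPL" counts as a GPL mention, and a phrase
    # such as "BSD is GPL-compatible" counts as both, so this over-reports.
    return [family for family in FAMILIES if family in combined]
```

A license file that never names its license (for example, an MIT text without the “MIT License” heading) would be missed entirely, which is the other direction of error mentioned above.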


Thanks for making this proposal. I’m going to make a more limited counter proposal but not because I want to argue about which proposal is best but rather I want to understand the reasoning behind your proposal and whether you think that it is more useful in some way. This is what I suggested above but I’ll state it more simply:

  • We have two fields in pyproject.toml with one being the project license and the other being the license of things that are vendored into the sdist.
  • These two fields can be combined to form License-Expression for the sdist and the project license gets recorded as some other metadata field that carries through to sdist, wheel, site-packages.
  • We say that License-Expression applies to all the contents of any particular distribution so in a wheel the License-Expression just means the contents of that wheel.
  • We don’t attempt to record anything in pyproject.toml about the license of the wheels but build backends and vendoring tools should generate complete license information as SPDX expressions when possible.
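To make the second bullet concrete, here is a minimal sketch of how a build backend might combine the two fields into the sdist’s License-Expression. `combine_and` is a hypothetical helper, not an existing API, and it does no validation or deduplication of SPDX identifiers.

```python
def combine_and(*expressions):
    """Join license expressions with AND, skipping missing ones."""
    parts = []
    for expression in expressions:
        if not expression:
            continue
        # AND binds more tightly than OR in SPDX expressions, so an operand
        # containing OR is parenthesised to keep its meaning.
        if "OR" in expression.upper().split():
            expression = f"({expression})"
        parts.append(expression)
    return " AND ".join(parts)
```

For example, `combine_and("MIT", "BSD-3-Clause")` gives `MIT AND BSD-3-Clause`, and a project with nothing vendored simply gets its project license back.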

The reason I suggest it this way is because I know that package authors can provide all of this although ideally better tooling would help them to do so. It is not clear to me whether you think that the fields you proposed would be more useful to anyone or if it is just an attempt to spell out some possible license combinations that seem well defined.

I’m not sure what “license of all distributions” means. Do you mean the intersection, so e.g. if the sdist is “A and B” and the wheel is “A and C” then here you would have “A”? Or do you mean the union, like “A and B and C”?
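The two readings can be spelled out with sets. This is only a sketch for flat AND-only expressions with placeholder license names. `and_terms` is a made-up helper and does not handle OR, WITH or parentheses.

```python
def and_terms(expression):
    """Split a flat 'A AND B AND C' expression into a set of terms."""
    return {term.strip() for term in expression.split(" AND ")}


sdist = and_terms("A AND B")
wheel = and_terms("A AND C")

# Intersection: only what applies to every distribution.
intersection = " AND ".join(sorted(sdist & wheel))  # "A"
# Union: everything that applies to at least one distribution.
union = " AND ".join(sorted(sdist | wheel))  # "A AND B AND C"
```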

Is distribution different from license above?

Again I am not sure if this means intersection or union but I am also not sure what wheels you are referring to:

  • Is this the wheels that the project uploads to PyPI specifically?
  • Does it apply to wheels in other indexes that are possibly built by other people?
  • Does it refer to the wheel that you would get if you built a wheel yourself from the sdist?

Firstly, note that the license can be different in each of these three situations. Secondly, I think that as project authors we can only tell you precisely the license of the wheels that we build ourselves and upload to PyPI, and we have already given that information in the form of the License-Expression field in each of the wheels.

Here are examples of why we can’t tell you the license of the other wheels:

Other people can build the wheels differently and bundle different things. For example, Christoph Gohlke provides NumPy wheels that use Intel’s MKL BLAS library, which has a different license from the OpenBLAS library that is shipped in NumPy’s PyPI wheels. So there can exist wheels with other licenses that the project has no control over.

If you build the wheel yourself then the build may use your C compiler and other things that package authors don’t really control and the compiled code may e.g. statically link against libraries that are in your system that we don’t necessarily know the license of. In general I don’t think that the build backend or the package authors can say exactly what the license would be for a wheel that is built in an environment that we have no control over.

It is not the case any more but as an example (I’m sure @steve.dower will correct me) I think Microsoft used to provide a free version of MSVC with the stipulation that you can use it for development but you may not redistribute the binaries it outputs. The build backend has no way to know whether your C compiler has restrictions like this so only you can say what the license is for the wheel you built.

Also if you build yourself then there can be build options so you can do e.g.

pip install -Csetup-args=-Dblas=/path/to/mkl .

and then the license of the built wheel might depend on the options you pass when building. It is also possible that the build backend auto-detects features in your environment, for example using pkgconfig to locate a BLAS library, so that in principle the license of the built wheel might depend on properties of your environment in a way that cannot be meaningfully captured in generic pyproject.toml metadata.

I think this only makes sense if you restrict yourself to the wheels that are on PyPI and again I’m not sure how this is more useful than just seeing the License-Expression for each of those particular wheels.

In general I think it is difficult to spell out in pyproject.toml what the license will be for the wheels in all cases. What can be done is just to have some metadata that indicates whether or not the common simple case applies i.e. that sdist license == wheel license == installed files license (provided no other person builds the wheels and bundles extra stuff into them). Capturing all cases where that does not apply is complicated while at the same time potentially not really providing any useful information.
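A check for that “common simple case” could look roughly like this, given the License-Expression values read from each artifact the project itself publishes. `is_simple_case` is a hypothetical name, and the comparison is purely textual, so `A AND B` and `B AND A` count as different.

```python
def is_simple_case(expressions):
    """True if every distribution carries the same License-Expression."""
    # Collapse whitespace and ignore case before comparing.
    normalised = {" ".join(e.split()).casefold() for e in expressions}
    return len(normalised) == 1
```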


Thank you for the response. I feel that there’s a subtext here suggesting that you think that I don’t fit that description. If that’s the case, can I categorically and strongly object to that characterisation. Just because I have different views on how license data should be recorded and published does not mean that I don’t care about behaving ethically and in line with the wishes of developers whose work I benefit from.

If I’m reading too much into your wording, I apologise and sorry for over-reacting.

Let’s accept that we all want what you say above, and work together on trying to achieve that in a way that we can all agree on.


I’d be concerned that the latter approach could be misleading/incorrect. Consider if B and C were both compatible with A, but not with each other.

Hi @CAM-Gerlach, I just wanted to clarify that I don’t want to disregard the effort in PEP 639. I think it brought a substantial improvement to the ecosystem and what we see now are edge cases.

I did see the previous comments in the thread, but I was referring to personal needs and concerns. Other people have different needs and concerns and my previous comment does not antagonise or invalidate that.

Actually I think it is useful if we identify which projects need what, so that we can have a more targeted discussion, which is easier/faster to handle. This goes into the direction of being pragmatic.

On a different topic, I think it is great you want to go the extra mile and provide detailed licensing information about the vendored pieces of code your project includes via license expression for your user base. But I wanted to emphasize that open-source maintainers are able to ethically do their activities without going this extra mile.

This does not mean that we don’t check what we distribute, only that checking in controlled circumstances is different (and in my opinion more straightforward) than producing accurate complex SPDX expressions, even if only at a cognitive level. I don’t believe that you think otherwise, I just wanted to be explicit.


Not to speak for C.A.M., but my own interpretation of what is meant by “this is the license of all distributions” is that license = str should only be used in the “easy” cases where the project code is the only thing being put into the sdist and wheel

So in your example where sdist is “A and B” and wheel is “A and C”, then it would be inaccurate to use the license = str form because there is no single license expression that accurately describes all of the distributions.

e.g. a pyproject.toml file that says

[project]
name = "example"
# ...
license = "MIT"
# ...

means that “example” is developed under an MIT license and does not vendor anything in any distribution artifact that is not also under an MIT license.


Again, not trying to speak for C.A.M., but I think that is indeed reading too much into the wording

My reading of it (and I hope an accurate distillation of the “opposing sides”) is that we all care about ethically following the requirements of licenses and we want those to be clearly expressed

As I understand, both of you default to using an MIT license. I do too because it’s the easiest way to say “I’m providing this to the world for free, what you do with it is your business, just please don’t sue me for anything, make sure my name stays attached to it, and I don’t want to get dragged into any more legal debates, I’m not a lawyer”

As it currently is, there is an incompatibility between “I want to clearly express the license for my project” and “I want to clearly express the license for everything that a downstream user might get with my project”

(Again, just my own interpretation of the comments, not trying to put words into anyone’s mouth if this is inaccurate)

C.A.M.'s position is: I really want to make it super easy for anyone to see what are all the possible license terms someone would get from any of my sdists or wheels and I’m ok with doing extra work up front to make that 100% right and, if necessary, I will sacrifice the clear expression (in the pyproject.toml file and project page on PyPI) of what the license is for just the code I or any contributor wrote. That is in the LICENSE file at the project root and you can see it there if you want to look.

Paul’s position is: I really don’t want to debate licenses. It’s already enough work maintaining these projects and the most important thing to clearly express is that “pip” is MIT licensed and anyone wanting to contribute does so under that license. It should be unambiguously stated as the “project.license” right next to the “project.name”, it should show up that way on PyPI, and if you really want to figure out what the license terms are for any single bit inside a wheel, you can crack it open yourself and look. If you have strict legal requirements, you won’t be able to trust my handwritten license expression and you’d have to do that anyway, so why should I sacrifice my clear intent that I care deeply about?

I hope[1] that something similar to the framework C.A.M. suggested yesterday can bring those viewpoints together in a compatible way and everyone can get exactly what they want.

There’s definitely been some heat in this thread over the past few days, but it definitely seems like it’s cooled down quite a bit and it’s been encouraging to see that the discussion has been able to continue with respect and appreciation


  1. and I think it would, but I’m not a packaging expert or a lawyer ↩︎


Do note that the “they” here is a nearly $3,000,000,000,000 company, so I think there’s some middle ground between that and the weekend hobbyist. :wink: Companies at this scale can also compile all sdists from scratch, use their own build systems, etc. So the resource availability is simply not comparable to e.g. a brand-new start-up or some university student making their first open source project where they are concerned about viral licenses accidentally creeping into their project.


Hi Paul, I apologize and I’m sorry for not being as careful as I should have been with my wording there to clarify that I didn’t intend to impugn your or others’ motivations or exclude them from that description (as opposed to describing and explaining my own for why this topic is important to me and what motivates the approach I take for my own work). Perhaps I was the one reading too much into some of your comments and overreacted with an overly-defensive tone in my own, in turn making it easier to do the same with mine. NB, by “ethics” I intended to mean my own personal code of ethics that I hold myself to, as opposed to any global objective sense of “ethical/unethical”; I apologize if that came across as implying that you or others aren’t holding themselves to their own ethical standards or weren’t also motivated toward the same just with a different set of constraints, objectives and perspective.

:100: FWIW, your comments from a very different perspective than my own have been a great help in understanding your needs and where you are coming from on this, and helped move things toward and motivate the concrete proposal above.

With that proposal, the license subkeys are optional and your and @abravalheri’s current license.text = "MIT" for the project license translates directly into license.project = "MIT", and explicitly indicates that you aren’t declaring a license.distribution and consumers should manually inspect your .dist-info/licenses instead. Does that achieve what you are looking for?

Once we have some lightweight automation in place (via parsing your license_files and/or via License-Expressions in vendored packages), provided either as a pre-commit hook/release automation script, etc. or as part of build backends (with dynamic = ["License-Expression"]), you could opt in to having a License-Expression that is assembled and kept updated automatically as part of the build and release process.

Sorry if it came across as too pointed; my comment “As has been mentioned a number of times on this thread” was only intended to address the specific concern that had been brought up previously about other development tools like GitHub displaying the artifact license rather than the project license, given GitHub and similar development platforms already do so via the canonical LICENSE file rather than try to infer it via ecosystem-bespoke packaging metadata and are particularly cautious to avoid misrepresenting it.

@oscarbenjamin maybe it would help to match the fields with the cases you describe above, as the four proposed license subkeys match 1:1 with each of your four cases (see []):

The final key, project, is the project license.

As alluded to before, I’ve consciously avoided unnecessarily restricting ourselves with the specific term “vendoring” here and potentially inviting another round of debates and ambiguity over what counts or not, and instead sidestep that by simply just speaking of the license of everything included in a given distribution, which is what actually matters here.

There can be plenty of types of differently-licensed content in a distribution that isn’t “vendored” per se; e.g. non-code assets (logos, icons, images etc) created by the project itself under a non-code license, code snippets or files that were originally adapted from other projects but have become a regular part of the codebase rather than separately vendored, non-code things like writing, images, sounds, etc. that we don’t necessarily speak of as being “vendored”, a non-CC0 template used to create the project or part of it, the project or parts of it also covered by a previous license, etc. You can find examples of many of these in Spyder’s third-party licenses.

Just to make sure we’re on the same page, this is equivalent to using just the license.project and license.sdist fields in my proposal.

I feel like I’m missing something here, as it seems to me this still leaves out of scope (at least to the extent they are with the current PEP) many if not most of the key use cases you outlined previously that “absolutely need to be handled”, and you used to argue the PEP is not “workable”, particularly those of “the most important Python packages” like NumPy, Matplotlib, Pandas, PyTorch, etc. But maybe that’s what you intended, to illuminate the rationale for doing more than what you initially proposed? Here are some of the issues as I see them:

  • It breaks backward compat with what we have now, as there is no key that corresponds to the current license nor is it possible to produce the same License-Expression as before for the majority of existing packages that can already use it (since your proposal completely excludes wheels).
  • This part is a major regression from what we have now for the simpler cases, i.e. the majority of projects (including pip and setuptools) that have the same distribution license for all artifacts, as it would mean that either the wheels would lack License-Expression entirely, projects would need bespoke build system config to duplicate license.sdist for wheels, or they would need to rely entirely on automation to reliably duplicate from scratch what they already provided statically.
  • Another regression from now: License-Expression must always be dynamic in both the pyproject metadata and sdist metadata with no way to provide or retrieve it statically
  • The one case you mentioned that this does handle is sdist-specific dependencies that may or may not be present in the wheel (e.g. build tooling, assets at the repo root, etc.).
  • However, to provide License-Expression for any wheels, this means that build systems must each develop bespoke mechanisms to construct a license expression themselves from scratch (aside from the project license), either using automation or another set of bespoke manually-entered user data, for the wheels and cannot rely on anything in this key as a baseline.
  • Requiring license automation adds substantial onus on backends up front, and it’s not clear how backends are supposed to do this to any meaningful degree of correctness in the absence of a standard for communicating distribution license information to begin with.
  • Alternatively, backends could each design and implement their own set of bespoke [tool.backendname] keys to cover these same cases and ask users to manually enter possibly duplicate license information in two different places in a non-standard, non-interoperable way, instead of adding a couple of extra subkeys to the existing table that authors could start using right now.
  • The currently-implemented spec already allows backends to do this anyway (with dynamic = License-Expression), and with Henry’s forthcoming partial-dynamic PEP would allow backends to use the existing license as a starting point and manual escape hatch rather than starting from nothing.

Sorry for the confusion. Neither is the case, for the reasons mentioned by @pf_moore; the answer is roughly what @jamesdow21 said. For both backward and forward compatibility, this is declared to be equal to both B (sdist) and C (all wheels) and means they are equal (and so is sufficient for the majority of projects that are pure-Python or don’t otherwise have different licenses for different artifacts), i.e. project.license = "MIT AND BSD-3-Clause" is exactly equivalent to

[project.license]
sdist = "MIT AND BSD-3-Clause"
wheel = "MIT AND BSD-3-Clause"

or more concisely but still nearly as explicitly

[project]
license.distribution = "MIT AND BSD-3-Clause"

While it would be nice to apply the same equality restriction to A also, this would break backward compatibility with the current spec and introduce significant ambiguity since it has no mechanism to indicate a separate project license. Instead, to avoid a repeat of the present confusion and ambiguity, this must be indicated explicitly.

[project.license]
project = "MIT"
distribution = "MIT"

Yup, this is for the common case where sdist license = wheel licenses; note that it does not and cannot imply anything about the project license, for backward compatibility with the current spec and to avoid the ambiguity brought up here. For explicitness, spelling this as license.distribution would be recommended, but by no means required, once tools all support the new spec.
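As a sketch of that backward-compatibility rule, a tool could expand the existing string form into the proposed subkeys like this. `normalise_license_field` is a hypothetical name, the subkeys are only proposed in this thread and not standardised, and the legacy `text`/`file` table forms are not handled here.

```python
def normalise_license_field(value):
    """Expand a [project] license value into the proposed subkey form."""
    if isinstance(value, str):
        # A bare string applies to every distribution (sdist and all wheels)
        # and deliberately says nothing about the project license.
        return {"distribution": value}
    # Already a table of subkeys: return a copy unchanged.
    return dict(value)
```

So `license = "MIT AND BSD-3-Clause"` and `license.distribution = "MIT AND BSD-3-Clause"` normalise to the same thing, while a project license only appears if it was written explicitly.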

Not for most projects (those accommodated by the current spec), for which distributions all have the same license. For those that don’t, license is still well-defined (as the common subset of all distribution licenses), with the more specific keys allowing users to specify as precisely as needed the licenses that only apply per-distribution.

The same as your first python-flint case:

I.e. any licenses present in all wheels but not the sdist.

Yes, exactly that (for the reasons you describe in detail). Only the wheels the project has direct control over; anything else is impossible to define precisely or comprehensively and is as much out of the direct scope here as any other third party distribution mechanism, and the onus is entirely on them as the distributor to follow the license(s), not at all on the project.

It avoids duplication in having to manually type out the redundant project, distribution and all wheels’ licenses across every single wheel, and minimizes as much as possible the length and complexity of the license expressions that users actually have to write. The wheels themselves will just have the concatenated License-Expression in core metadata for that particular wheel, but this is how that gets populated to begin with (either statically, or partial-dynamically combined with automation where possible).

Maybe an example will help? Suppose a project is MIT, has vendored code under BSD-3-Clause and ISC, vendors an sdist build dep under MPL-2.0, adds a BSD-2-Clause dep to all wheels, and in the Linux wheels adds a GPL-2.0-only dep. With my proposal, you’d write:

[project.license]
project = "MIT"
distribution = "BSD-3-Clause AND ISC"
sdist = "MPL-2.0"
wheels.all = "BSD-2-Clause"
wheels.manylinux = "GPL-2.0-Only"

If you had to write everything out manually (which you could still do with my proposal if you really wanted to, but which shouldn’t ever be necessary), you’d end up with:

project = "MIT"
sdist = "MIT AND BSD-3-Clause AND ISC AND MPL-2.0"
wheels.pp310-pypy310_pp73-win_amd64 = "MIT AND BSD-3-Clause AND ISC AND BSD-2-Clause"
wheels.pp310-pypy310_pp73-macosx_10_15_x86_64 = "MIT AND BSD-3-Clause AND ISC AND BSD-2-Clause"
wheels.cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64 = "MIT AND BSD-3-Clause AND ISC AND BSD-2-Clause AND GPL-2.0-Only"
# Etc, etc, etc times each python version, implementation, platform, arch, etc...

Instead, you just fill in the minimum necessary details in the [project.license] table, and your build tool does the tedious construction of the actual License-Expressions for you (and in the future could in theory also add detected licenses automatically, if you opt in via dynamic = License-Expression).

I agree. The metadata is equivalent but I suggested to list the license of the vendored things separately in pyproject.toml like:

license = MIT
license-vendored = BSD3
License-Expression (in sdist) = MIT and BSD3

rather than

license.project = MIT
license.sdist = MIT and BSD3
License-Expression = MIT and BSD3

In my mind the license-vendored key might be something that is computed by a vendoring tool whereas the project license is just something that is hard-coded. I don’t know whether this distinction matters to projects like pip but conceptually it feels a little different to me.

What I am saying is just that I am not sure that it is useful to put this in pyproject.toml rather than in the License-Expression in the wheels.

I think we misunderstand each other here because we implicitly understand the meaning of the project.license key differently. Let me try to clarify how I read this differently from you but let’s not argue about what the PEPs or specifications do or do not say or what the intent was because ultimately it doesn’t matter. What does matter is just that in future we should make these things clear so right now we need to understand where any confusion or disagreement might lie.

When I read the PEPs or specifications etc I don’t see anything anywhere that says that e.g. license or License-Expression applies to all distributions of a project. Maybe that was what was intended in PEP 639 and I just implicitly read it differently because I am thinking of projects where that would be impossible so it seems obvious to me that License-Expression could only refer to the contents of one particular distribution.

I guess this is why the PEP declared the case of wheel vs sdist having different licenses out of scope, and this is also why I was looking for a clear statement of what the license applies to, even though it seems implicitly obvious to some that it should be the license of all the distributions.

To me it didn’t seem backwards incompatible. PEP 639 described the case of sdist license being different from wheel license as out of scope but it seems obvious to me how that case fits in: each distribution needs to have a License-Expression that applies to its own contents. Clearly in that case the license key cannot apply to all of the distributions and so the obvious choice is that it applies to the sdist.

Let me clarify how this works for python-flint (the process is the same for NumPy etc):

  1. The sdist contains only MIT code (nothing vendored).
  2. If you build a wheel yourself then nothing is vendored and the wheel license is MIT.
  3. When preparing a wheel for PyPI we post-process the wheels to make them portable by vendoring things in and then the license for those wheels is MIT and A and …

Steps 1 and 2 above are covered by Python’s packaging specifications but step 3 is not except in the sense that e.g. the manylinux specification mandates that compliant wheels must not link to non-vendored shared libraries. The vendoring doesn’t happen as part of a PEP 517 build that turns the sdist into a wheel (although I have considered this for the future). Instead there is a process that can turn a built wheel into a portable wheel and in so doing that process must update the License-Expression field.
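To make step 3 concrete, here is a minimal sketch of how a post-processing tool could rewrite the License-Expression field in a wheel's METADATA text after vendoring. The function name, the package name and the vendored license identifiers are placeholders chosen for illustration; this is not how auditwheel or any other existing tool is known to behave, and it assumes each vendored license is a simple expression with no OR in it.

```python
from email import message_from_string


def add_vendored_licenses(metadata_text, vendored):
    """Return METADATA text whose License-Expression also covers vendored code.

    Hypothetical helper; `vendored` is a list of simple SPDX expressions.
    """
    msg = message_from_string(metadata_text)
    current = msg["License-Expression"]
    if current is None:
        raise ValueError("METADATA has no License-Expression to extend")
    parts = [current]
    for expr in vendored:
        if expr not in parts:  # avoid listing the same license twice
            parts.append(expr)
    msg.replace_header("License-Expression", " AND ".join(parts))
    return msg.as_string()


# Placeholder metadata for a wheel built straight from the sdist.
built = (
    "Metadata-Version: 2.4\n"
    "Name: example-pkg\n"
    "Version: 1.0\n"
    "License-Expression: MIT\n"
    "\n"
)
# Placeholder licenses for the libraries copied in while making the wheel portable.
portable = add_vendored_licenses(built, ["LGPL-3.0-or-later", "BSD-3-Clause"])
print(message_from_string(portable)["License-Expression"])
```

The point of the sketch is only that the updated value lives in the post-processed wheel, not in the sdist or in pyproject.toml.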

In any other packaging context but PyPI wheels there would be no ambiguity here, e.g. the conda source and built packages are MIT. I don’t know where else it is packaged, but if it were in Homebrew, a Linux distro, etc. then the license would be MIT. The piwheels package doesn’t work, but if it did then the situation would be that you first sudo apt-get install libflint-dev and then piwheels would host a wheel without vendoring anything and it would be MIT.

What you are looking for then is for there to be something in pyproject.toml that says what the license of the wheels is specifically on PyPI, which is different from what the package license is in literally every other context. Packaging specifications should not special-case PyPI though, and should be applicable to any index, so it would at least have to be URL-based, e.g. something like:

license.wheels.pypi.org = ...

Conceptually I don’t think it even really makes sense that pyproject.toml would list information about particular wheels that are hosted on particular indexes though.

What this means is that because we provide wheels on PyPI you want us not to write license = MIT even though that is unambiguously the license of all of the code and of all built packages in all other packaging situations (including if you just build from the PyPI sdist).

Can you see why that doesn’t make any sense to me and why I would have already just used license = "MIT" (if the build backend supported it yet)?

Just jumping off of this, in reading this thread and the thread it was split from it seems like this is one of the major issues lurking in the background. It appears that the only real “statements” of the scope of any metadata-specified license come from the fact that all the specs are under the heading “Package Distribution Metadata”, and the (conflicting!) fact that the pyproject.toml spec is under the heading “Declaring Project Metadata”.

My intuition tends to be that in this kind of situation the best interim solution is to put up an extremely broad warning at the top of the docs that tells everyone that certain use cases are totally unsupported and out of spec scope. In this case that would be something like “The metadata specifications given here only address the situation where a single license applies to the project itself as well as to every distribution (sdist and wheel) built from it. All other situations are out of scope and the meaning of any license-related metadata in any such situation is undefined.” I’m not sure how anyone else feels about that, but I think having something like that might help us catch this kind of issue before a late stage (which is not what seems to have happened with PEP 639), and might also help users understand the limitations of the information they get from PyPI or other metadata-derived sources.

I don’t think it was acceptable that PEP 639 left these cases out of scope in the first place. If people want to say now that they are not only out of scope from the PEP but out of scope from the specifications as well then this is completely unacceptable and the PEP should have been rejected if this was going to be the outcome. If instead we just want to acknowledge explicitly that license information is per-distribution metadata and make that clear in the specification then that is fine.

If anyone thinks that this only applies to a few packages then think again. The tools that do this wheel vendoring are fundamental to the fact that it is even possible to have platform-specific wheels on PyPI at all. The standard tool for this on Linux is auditwheel, which was introduced alongside the manylinux specification in PEP 513. Before PEP 513 PyPI would reject any wheels uploaded with any Linux platform tag: the fact that we can have Linux wheels on PyPI at all is because of this vendoring, which is why tooling to make wheels for PyPI does this as standard. The same is also true for Windows and macOS and everything else, although in those cases I don’t think that there is a PEP that I can link to that clearly explains why this is needed in the way that PEP 513 does.

This is not a temporary issue that can be resolved in the way that people would like in a future PEP. The idea of metadata that applies to “all the distributions” does not even make sense in the context of Python packaging so even asking for such a thing is misguided. If there were a place to record all-distribution metadata then PEP 639 would no doubt have specified that it should have an SPDX license expression field. There is no place for that metadata to go though because all metadata in Python packaging is per-distribution metadata.

It is fundamental to the design of Python packaging that the set of distributions is not closed in any sense. The packages can come from anywhere: PyPI, piwheels, an internal index, a local wheel-house, the current working directory, pip’s cache, a remote git repo etc. Everything is designed so that common use cases are supported by working entirely from per-distribution metadata gathered from distributions that can come from anywhere. The fact that wheel filenames are so convoluted is so that a tool like pip can take nothing but a list of distribution filenames sourced from anywhere and then select a file from that list. Having made that selection the metadata for the given distribution is simply contained in that distribution.
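As an illustration of "the metadata for the given distribution is simply contained in that distribution", the sketch below reads License-Expression straight out of a wheel archive using only the standard library. The helper names, the package name and the two in-memory wheels are made up for the example; real tools would more likely go through a packaging library.

```python
import io
import zipfile
from email import message_from_bytes


def read_license_expression(wheel_file):
    """Return the License-Expression recorded inside one wheel, or None."""
    with zipfile.ZipFile(wheel_file) as zf:
        # Every wheel carries its own metadata at <name>-<version>.dist-info/METADATA.
        name = next(n for n in zf.namelist() if n.endswith(".dist-info/METADATA"))
        return message_from_bytes(zf.read(name))["License-Expression"]


def make_fake_wheel(license_expression):
    """Build a tiny in-memory archive that stands in for a wheel (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "example_pkg-1.0.dist-info/METADATA",
            "Metadata-Version: 2.4\n"
            "Name: example-pkg\n"
            "Version: 1.0\n"
            f"License-Expression: {license_expression}\n",
        )
    buf.seek(0)
    return buf


# Two distributions of the same project and version can carry different licenses.
plain = read_license_expression(make_fake_wheel("MIT"))
vendored = read_license_expression(make_fake_wheel("MIT AND GPL-2.0-only"))
print(plain)
print(vendored)
```

Nothing outside the selected file is consulted, which is why a single project-wide value has no natural home in this model.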

The pyproject.toml file only exists in an sdist (or VCS checkout) and its metadata is only really there to describe building a wheel from that sdist. The only thing pyproject.toml can reasonably say about the license of the wheels is that it could say what the license would be for a wheel that is built from the sdist itself but that is not the information that people are asking for here.

I don’t actually think that this is a major problem though or that there is really much to be solved here in terms of changing the specification for the metadata. The issue is simply that few people understand how Python packaging works for non-pure-Python wheels on PyPI.

Let me quote myself from before and perhaps what I said then will make more sense to people reading it now:

My reading of PEP 639 is that it does not contradict what I just said above and it simply does not explain what a project is supposed to do in a case where the licenses of the distributions are different. Other people apparently read it differently and think that the PEP simply lacks an explicit statement of the clear intention that the license metadata must apply to “all the distributions”.

It doesn’t matter which of these two interpretations of the PEP is correct or what the intention of the authors was. What matters is that it is a fact that many important projects that upload distributions to PyPI do upload distributions that have different licenses. That fact is not changed by simply saying that those projects are “out of scope”. If you actually want to have the license metadata for all the PyPI packages then you simply have to accept that it is per distribution metadata because it cannot be anything else.

Strong +1 from me on this interpretation. Based largely on this, my proposal for the best way forward here is:

  1. Add clarifying language to the core metadata spec for License-Expression that the license expression applies only to the distribution file that it is contained in.
  2. Add a clarifying note to the specification of the license key on pyproject.toml noting that it should only be specified if the licenses of the sdist and all wheels built from the sdist are identical (if they might differ, the license key needs to be marked as dynamic, and the metadata added by the build backend in a backend-specific manner).
  3. Make it clear in the specs that the standardised license data is the distribution license, and not a “project” or “contributor” license (in situations where the distinction matters).

All of these are simple clarifications of intent, IMO, and can be done as PRs to the specification documents.

In addition, I think I should add a clarifying PR for the Dynamic core metadata field, making it clear that the guarantee that wheels built from the sdist will have the same metadata as the sdist unless the field is marked as dynamic only applies to wheels built from the sdist by a build backend. Wheels constructed from other wheels by tools such as auditwheel do not have to follow this constraint[1].

Finally, we need PyPI to fix its license display, so that it displays the per-distribution file data at the distribution file level.

Following on from that, we still need a new PEP to allow projects that want to record their project license to do so. But that will of necessity take longer. Until it happens, people may find it disappointing that the project page on PyPI no longer displays license information, but I think that’s necessary given that we’re taking a stricter approach on distribution licenses. I’d like to see such a PEP take a simple approach, recording just a license expression that states the “project license” (deliberately left to the project to interpret as they wish), which is required to be static and intended for display to users as part of project-level information. But this should be covered in the PEP rather than doing it now.

As for how this would affect pip, I’d advocate for pip to not switch over to PEP 639 until PyPI makes the change noted above, because I do not want pip’s distribution license to be what is presented on the project page. But once the PyPI change is made, I could live with putting the complex distribution license that includes vendored package licenses in our License-Expression (although I’d expect that to be done by the tool that handles our vendoring - I don’t think we should maintain the value manually, as I’m not comfortable with the risk that we get it wrong, for example when adding a new vendored library or removing an existing one).


  1. This could be a disappointment for people wanting consistent metadata in all built artifacts, notably resolvers, but we have to document reality, and not have our standards be wishful thinking ↩︎

It is possible that I don’t fully understand how dynamic works but in general I don’t think that the license for the sdist itself can be dynamic or at least I would never want to record it as dynamic. If we say that the license key is dynamic then I think that implies that it really means the license of the built wheel rather than the license of the sdist but then somehow we need a separate key for the license of the sdist.

The real examples I’m thinking of for this are not really dynamic but just like:

license-sdist = A and B
license-wheel = A

There just is no way to represent that without having more than one key somehow. Listing vendored things separately could also handle this:

license = A
license-vendored-sdist = B