PEP 639: Improving license clarity with better package metadata

pombredanne · August 16, 2019, 1:13pm

pf_moore:

On a more general note, the PEP reads as if the longer term intention is to become more strict on what licenses are valid and/or acceptable. This is something we need to be very careful about. It’s one thing for PyPI to change its rules, for example to only allow upload of open source packages (although that would be a separate proposal), but it’s quite another to lock down the metadata that expresses a package author’s intention (independent of where they plan to publish their code). I think the wording of the PEP needs to be improved to make it clearer that authors aren’t being expected to choose only SPDX-sanctioned licenses, and how they “opt out” if they so wish. (Or, it needs to be more open about the long term intent, and you need to be prepared to address any concerns and/or complaints).

An important point about proprietary licenses: there is a quasi-infinity of these, and they are often not public and therefore hard to catalog, so they can easily be handled under the “Proprietary” catch-all identifier, and this should not anger anyone IMHO. In contrast, there is a finite and slowly growing number of open source licenses, so using more precise ids makes sense.

Now, here is the intent of this PEP:

As a package consumer I would like this license thing to be crystal clear so I know quickly if I am dealing with a known open source license or a proprietary license.
As a package author, I would like this to be as simple and non-intrusive as possible.

So what about this alternative take:

1. We continue to use the License field and have tools warn against the usage of license-related Classifiers.
2. The License field is optional and can be one of two things:
2.1. a valid SPDX license expression using ids from the SPDX License List (with extra Proprietary and Public-Domain ids). Everything is fine and jolly.
2.2. an invalid license expression string, some other string, an empty string, or not present at all: the license is assumed to be “Proprietary”. Tools are encouraged to provide a warning.
3. At a later stage, in a new metadata 2.3 version, the license-related Classifiers are deprecated entirely.
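
A minimal sketch of how a tool might apply these rules. The function names, the tiny identifier set and the rough tokenizer are illustrative assumptions, not part of the draft; a real tool would load the full SPDX list and use a proper expression parser (this one ignores WITH exceptions, for instance):

```python
import re

# Assumption: a tiny sample of identifiers; a real tool would load the full
# SPDX license list plus the two extra ids proposed above.
KNOWN_IDS = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-2.0-only",
             "Proprietary", "Public-Domain"}
OPERATORS = {"AND", "OR"}


def is_valid_expression(value):
    """Rough check: known ids joined by AND/OR, with balanced parentheses."""
    tokens = re.findall(r"[()]|[^\s()]+", value)
    if not tokens:
        return False
    depth = 0
    expect_id = True  # True when the next token must be an id or "("
    for tok in tokens:
        if tok == "(":
            if not expect_id:
                return False
            depth += 1
        elif tok == ")":
            if expect_id or depth == 0:
                return False
            depth -= 1
        elif tok in OPERATORS:
            if expect_id:
                return False
            expect_id = True
        else:
            if not expect_id or tok not in KNOWN_IDS:
                return False
            expect_id = False
    return depth == 0 and not expect_id


def interpret(license_value, classifiers=()):
    """Return (effective_license, warnings) following the rules above."""
    warnings = []
    if any(c.startswith("License ::") for c in classifiers):
        warnings.append("license classifiers are discouraged; use License")
    if license_value and is_valid_expression(license_value):
        return license_value, warnings
    warnings.append("missing or invalid License; assuming Proprietary")
    return "Proprietary", warnings
```

With this sketch, interpret("MIT") yields the expression unchanged with no warnings, while interpret("GPL") or a missing value falls back to "Proprietary" with a warning.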

And entirely separately, I could later craft another PEP related to PyPI such that only non-empty and valid license expressions are allowed (TBD with or without proprietary licenses).

That’s a fair point and a topic of recent discussions @ SPDX to create a “namespace” concept and effectively allow any license. Today, they are effectively taking a stance towards open source (and in some cases at least some “source available” proprietary licenses). I agree that Python, the PSF and this PEP should not care about or impose any restrictions on which license can be used for a Python package in general.

pombredanne · August 16, 2019, 3:42pm

@dustin if the license-related “Classifiers” were to be deprecated, how would twine/warehouse behave? would warehouse still accept the deprecated license classifiers? or would they be rejected when you try to upload a package?

dustin · August 16, 2019, 5:28pm

@pombredanne There’s more details in https://github.com/pypa/warehouse/issues/2996, but essentially PyPI has the notion of a “deprecated” classifier, and if you try to publish a distribution that uses a deprecated classifier, you get a 400 error which is displayed by twine.

Twine does not have a mechanism right now for allowing an upload to succeed, but printing some warning/notification.

cjerdonek · August 16, 2019, 5:58pm

Yes, I think it would be better to follow the process. Otherwise, it gives people the impression that it’s okay to post drafts there while they’re still being prepared. You can still have the draft in a personal branch.

pombredanne · August 16, 2019, 6:23pm

I updated the topic with a link to https://github.com/pombredanne/spdx-pypi-pep/pull/2 instead and closed python/peps pull request #1148 (“[WIP: new PEP] Use SPDX license expressions in Core package metadata”) accordingly.

Steap · August 16, 2019, 7:46pm

Hello,

I originally reported issue #2996 after facing issues with license classifiers. I am mostly interested in making packaging easier for distribution developers (people who work on Fedora, Debian, *BSD, etc.).

The current situation (license classifiers + a “License” field that can be used as a fallback) is a bit confusing for package authors, and also makes it hard for automated tools to figure out what license is used. My workflow when trying to identify a license is the following:

Are the classifiers non-ambiguous? If yes, we are done.
Is the License field non-ambiguous? If yes, we are done.
Is there a LICENSE file in the project’s repo, and can we figure out what it is? If yes, we are done.

This gives the following crazy code in a tool I maintain. I would love to see non-ambiguous SPDX identifiers make their way into Python packaging.
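
The three-step fallback described above can be sketched roughly as follows. This is not the code from the tool mentioned in the post; the lookup tables, the function names and the idea of recognising a license text by a single marker phrase are all simplifying assumptions:

```python
# Assumption: small hand-written tables; a real tool needs far larger ones.
CLASSIFIER_TO_SPDX = {
    "License :: OSI Approved :: MIT License": "MIT",
    "License :: OSI Approved :: ISC License (ISCL)": "ISC",
    # "License :: OSI Approved :: BSD License" is deliberately absent:
    # it does not say which BSD variant is meant, so it is ambiguous.
}
KNOWN_IDS = {"MIT", "ISC", "BSD-3-Clause", "Apache-2.0"}
# Assumption: one distinctive phrase is enough to recognise a license text.
TEXT_MARKERS = {
    "Permission is hereby granted, free of charge": "MIT",
}


def from_classifiers(classifiers):
    """Step 1: usable only if the classifiers map to exactly one id."""
    ids = {CLASSIFIER_TO_SPDX[c] for c in classifiers if c in CLASSIFIER_TO_SPDX}
    return ids.pop() if len(ids) == 1 else None


def from_license_field(value):
    """Step 2: usable only if the field already holds a known id."""
    value = (value or "").strip()
    return value if value in KNOWN_IDS else None


def from_license_file(text):
    """Step 3: guess from the content of the LICENSE file."""
    for marker, spdx_id in TEXT_MARKERS.items():
        if marker in (text or ""):
            return spdx_id
    return None


def identify_license(classifiers, license_field, license_text):
    """Try each source in order and stop at the first unambiguous answer."""
    return (from_classifiers(classifiers)
            or from_license_field(license_field)
            or from_license_file(license_text))
```

Every step needs its own table of special cases, which is why a single non-ambiguous identifier in the metadata would remove most of this logic.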

Rest assured I do not want to turn all Python programmers into license lawyers. There are already lots of things people do not care about, but end up doing anyway: writing tests, writing a setup.py file, etc. Licensing should be no different. An unlicensed package is hard to include in a distribution, since maintainers have strict rules about what they may or may not include in their distro. As @pombredanne pointed out, people who do not care can select a license such as MIT and use it for all of their projects. I am also under the impression that most people choose a license for their packages, but cannot express it in a non-ambiguous way because of the current limitations. For instance, scanning ~180k packages on PyPI, I counted more than 4500 different values for the “license” field.

I like the idea of repurposing the “License” field. Some people already use SPDX identifiers in this field (I do for my BSD-3-Clause projects, and I know other projects do it too). Which means some projects would be compatible with the new semantic without changing a single line.

I understand that changing the semantic of a field might be frowned upon, so I do not have a strong opinion on this and would be fine with a new field as well.

We could probably be quite lenient in the beginning and only enforce stricter rules once we’re confident the new semantic works well enough. It would probably be great to have a fallback option (ie “License: dont-bother-me-your-new-field-is-buggy-as-hell”): this would let us analyze why the author was not able to specify the license they wanted to use and fix issues.

Warehouse can sometimes “refuse” an upload and return a 400 error. In the future, maybe it should do so when the license is not properly defined and the issue can easily be fixed. For instance, if the license is set to “GPL”, the upload could fail and the user could be shown a message listing all the possible GPL identifiers from the SPDX list.
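
As a rough illustration of that idea (this is not how Warehouse behaves today; the function names and the short identifier list are assumptions made for the example), an upload check could look like this:

```python
# Assumption: a hand-picked excerpt of SPDX identifiers, not the full list.
SPDX_IDS = [
    "MIT", "Apache-2.0",
    "GPL-2.0-only", "GPL-2.0-or-later", "GPL-3.0-only", "GPL-3.0-or-later",
    "LGPL-2.1-only", "LGPL-2.1-or-later",
]


def suggest_identifiers(value):
    """Return the SPDX ids that start with an ambiguous value such as 'GPL'."""
    prefix = value.strip().upper() + "-"
    return [i for i in SPDX_IDS if i.upper().startswith(prefix)]


def check_upload(license_value):
    """Return (status_code, message), mimicking a 400 rejection."""
    if license_value in SPDX_IDS:
        return 200, "OK"
    candidates = suggest_identifiers(license_value)
    if candidates:
        return 400, "Ambiguous license %r; did you mean one of: %s?" % (
            license_value, ", ".join(candidates))
    return 400, "Unknown license %r" % license_value
```

Here check_upload("GPL") is rejected with a message listing the four GPL identifiers from the excerpt, while check_upload("MIT") passes.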

Speaking as someone who is interested in distro packaging, what I really care about is having reliable and useful info from PyPI’s /json endpoint. I think setuptools/flit/poetry/etc. could probably tell the user “hey, by the way, you should probably drop these classifiers, and use XXX instead”. If I remember correctly, poetry already recommends using SPDX identifiers.

I would indeed like such tools to be community-maintained.

ncoghlan · August 17, 2019, 1:43am

Regarding the change process, the specifications section under packaging.python.org is intended to be like the Python language reference section under docs.python.org: clarifications and correction of errors and accidental omissions in existing specifications don’t require a PEP, but adding new fields or making significant changes to existing fields is likely to still need one in order to fully document the rationale for the related design decisions.

The old process (which didn’t work very well, hence the change in PEP 566) tried to use the PEPs themselves as both the reference document and to provide the rationale for change from the previous iteration, which made it hard to tell what was actually changed and what remained the same relative to the previous version.

https://www.pypa.io/en/latest/specifications/#handling-major-updates (and the preceding section on clarifications and minor updates) attempts to document that distinction, so if there’s wording that could be clarified there, suggestions would be appreciated.

dustin · August 17, 2019, 2:01am

Got it, thanks for the clarification @ncoghlan (and sorry for the confusion @pombredanne).

pombredanne · August 17, 2019, 7:05am

Based on the Poetry documentation for the pyproject.toml file, @sdispater is indeed recommending SPDX ids and is listing some. These are not yet expressions, but it is as close as it gets.

@takluyver’s Flit doc is mostly consistent with the Core metadata docs:

license
The name of a license, if you’re using one for which there isn’t a Trove classifier. It’s recommended to use Trove classifiers instead of this in most cases.

@jaraco’s Setuptools also lists the license_file which was originally introduced by the wheels tool. This doc section on metadata also lists using both the “license” and the “classifiers” fields, which is not aligned with the PEP 566 doc and the Core metadata license-related texts at packaging.python.org

Using a single license_file (singular) in wheels “metadata” has since been replaced by the plural license_files list by @agronholm with @njs support (based on some ticket I had entered on wheels originally @ bitbucket).

See also this doc in wheels: docs/user_guide.rst in pypa/wheel at commit b8b21a5720df98703716d3cd981d8886393228fa on GitHub:

There used to be an option called license_file (singular). As of wheel v1.0, this option has been deprecated in favor of the more versatile license_files option.

@pf_moore would you agree that handling license files also needs to be addressed in this PEP so we have a clean, consistent and properly documented one single way to handle licensing documentation in packages?

pradyunsg · August 17, 2019, 8:25am

Thanks for chiming in Nick! Much appreciated!

Let’s write this down somewhere in the PyPA Specifications page?

pf_moore · August 17, 2019, 9:01am

I think it would be acceptable for the proposed change (and the PEP) to not make any comment about license files, but I think that doing so would be better - as you say, it makes the proposal into a complete review of licensing, and a proposal to address the whole area.

If you’re happy to expand the scope to include license files, it seems like a good idea (in general, your approach with the PEP has been very good so far, so I’m happy to trust your judgement on questions like this).

Nick did point to here, but maybe that needs to be more discoverable or clearer? Both @di and I missed it, which implies the answer is “yes”, but I’m not sure what could be improved…

ncoghlan · August 17, 2019, 12:22pm

I think the problem is actually the opening paragraph on the PyPA specifications page of the Python Packaging User Guide, as the link to the process page is the subtle “pypa.io” one at the end.

It would probably be a lot clearer if the subsection titles were repeated as bullet points on the main spec page, with direct links to the relevant part of the process page.

dholth · August 17, 2019, 2:52pm

I have always wanted to add more categories to wheel, including a “docs” category. In the wheel you would have a *.data/docs or e.g. *.data/license which could be installed somewhere sensible.

+1 on SPDX metadata

agronholm · August 18, 2019, 9:49am

I haven’t read all the posts here, but hopefully, when this is all sorted out, someone will create a ticket against wheel with a link to the spec so I can then implement it. Thx

pombredanne · August 19, 2019, 6:44am

I pushed a new version of the draft at https://github.com/pombredanne/spdx-pypi-pep/pull/2

The main changes are:

reuse the License field
add the License-File field (which is already in use in wheel and setuptools)
add a section to survey how license is documented in Python and elsewhere
add a section wrt. a reference implementation for a license expression validation library
integrate the reviews and feedback to date

aixtools · August 19, 2019, 10:47am

I hope I am not off topic.

I would expect this license is filled in by whoever controls the src/* area. For someone like myself, generally only interested in fixing bugs in packages and/or re-wrapping (ultimately in wheels), is this anything I need to be concerned about? Or is it (or will it be) all magically resolved by ‘build format=“wheel”’?

cjerdonek · August 19, 2019, 11:09am

A few questions:

~~Is the License-File field for just one file or possibly more than one? The draft says it’s a string, but it also says “license file(s)” (with an “s”) in the text.~~ (I see now that “multiple use” means it can be multi-valued.)
For a project that might vendor many distinct libraries, each with possibly different licenses (e.g. pip), how would one indicate that with what’s being proposed?
When more than one license is being used, would it make sense to be able to map the license identifier to the corresponding License-File value(s)?
Also, if certain files are subject to certain licenses, I imagine the metadata isn’t attempting to reflect those distinctions.

It might make sense for the PEP to give examples of what its limits are (use cases like the above that it either can or can’t describe).

pombredanne · August 19, 2019, 12:00pm

Hi @aixtools !
in your case (e.g. porting wheels to AIX AFAICR ) this is not something you should be concerned with for now.

This is not something that is set automatically, either today or if this new PEP is approved and implemented. This is something that a package author would explicitly set in the package metadata.

pombredanne · August 19, 2019, 12:44pm

This is a mighty great example since there is so much variety there:

Looking at the current pip (19.2.2) setup.py we have these:

license='MIT',  # ← this would be technically correct as a valid license expression

classifiers=[
    …
    "License :: OSI Approved :: MIT License",  # ← this should trigger an informational warning that this is redundant (we could also be harsher)

But in reality you have many more licenses from the src/pip/_vendor directory, beyond the primary MIT.

appdirs: MIT
distro: Apache-2.0
ipaddress: Python-2.0
pyparsing: MIT
retrying: Apache-2.0
six: MIT
cachecontrol: Apache-2.0
certifi: MPL-2.0
chardet: LGPL-2.1-or-later
colorama: BSD-3-Clause
distlib: Python-2.0
distlib/_backport/misc.py: MIT
idna: BSD-3-Clause AND (Python-2.0 AND Unicode-DFS-2015)
lockfile: MIT
lockfile/pidlockfile.py: Python-2.0
msgpack: Apache-2.0
packaging: Apache-2.0 OR BSD-2-Clause
pep517: MIT
pep517/colorlog.py: Apache-2.0
pkg_resources: MIT
progress: ISC
pytoml: MIT-0
requests: Apache-2.0
urllib3: MIT
urllib3/packages/rfc3986: Apache-2.0
webencodings: BSD-3-Clause

Therefore you have two options IMHO:
A) keep only MIT as an expression in the license metadata, since you are already vendoring and including the license texts of all these. This is in the spirit of keeping this simple. Since you include all texts and source code, you are complying with most if not all license conditions AFAIK, and so would any redistributors downstream.

B) provide a proper license expression joining them all (with an AND) and a license_files list of all the license files that you are already including for full disclosure and details.

This expression would be either the full unsimplified expression:

MIT AND (MIT AND Apache-2.0 AND Python-2.0 AND MIT AND Apache-2.0 AND MIT AND Apache-2.0 AND MPL-2.0 AND LGPL-2.1-or-later AND BSD-3-Clause AND (Python-2.0 AND MIT) AND (BSD-3-Clause AND (Python-2.0 AND Unicode-DFS-2015)) AND (MIT AND Python-2.0) AND Apache-2.0 AND (Apache-2.0 OR BSD-2-Clause) AND (MIT AND Apache-2.0) AND MIT AND ISC AND MIT-0 AND Apache-2.0 AND (MIT AND Apache-2.0) AND BSD-3-Clause)

(I have used parentheses here purely for “cosmetic effect” to group the subexpressions of each package together and put MIT as the leftmost one since this is the main one)

or a simplified version: MIT AND (Apache-2.0 AND BSD-3-Clause AND ISC AND LGPL-2.1-or-later AND MIT AND MIT-0 AND MPL-2.0 AND Python-2.0 AND Unicode-DFS-2015), still using the extra MIT at the start to depict (strictly visually) that it is the main one. I have cheated for that and used this snippet to do a boolean simplification of the full long version:

$ pip install license-expression
$ python
>>> from license_expression import Licensing
>>> l = Licensing()
>>> expression = 'MIT AND (MIT AND Apache-2.0 AND Python-2.0 AND MIT AND Apache-2.0 AND MIT AND Apache-2.0 AND MPL-2.0 AND LGPL-2.1-or-later AND BSD-3-Clause AND (Python-2.0 AND MIT) AND (BSD-3-Clause AND (Python-2.0 AND Unicode-DFS-2015)) AND (MIT AND Python-2.0) AND Apache-2.0 AND (Apache-2.0 OR BSD-2-Clause) AND (MIT AND Apache-2.0) AND MIT AND ISC AND MIT-0 AND Apache-2.0 AND (MIT AND Apache-2.0) AND BSD-3-Clause)'
>>> parsed = l.parse(expression)
>>> parsed
AND(LicenseSymbol(u'MIT', is_exception=False), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'MPL-2.0', is_exception=False), LicenseSymbol(u'LGPL-2.1-or-later', is_exception=False), LicenseSymbol(u'BSD-3-Clause', is_exception=False), AND(LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'MIT', is_exception=False)), AND(LicenseSymbol(u'BSD-3-Clause', is_exception=False), AND(LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'Unicode-DFS-2015', is_exception=False))), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Python-2.0', is_exception=False)), LicenseSymbol(u'Apache-2.0', is_exception=False), OR(LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'BSD-2-Clause', is_exception=False)), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False)), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'ISC', is_exception=False), LicenseSymbol(u'MIT-0', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False)), LicenseSymbol(u'BSD-3-Clause', is_exception=False)))
>>> parsed.simplify()
AND(LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'BSD-3-Clause', is_exception=False), LicenseSymbol(u'ISC', is_exception=False), LicenseSymbol(u'LGPL-2.1-or-later', is_exception=False), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'MIT-0', is_exception=False), LicenseSymbol(u'MPL-2.0', is_exception=False), LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'Unicode-DFS-2015', is_exception=False))
>>> str(parsed.simplify())
'Apache-2.0 AND BSD-3-Clause AND ISC AND LGPL-2.1-or-later AND MIT AND MIT-0 AND MPL-2.0 AND Python-2.0 AND Unicode-DFS-2015'
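
For readers without the library at hand, the boolean simplification can be approximated with the standard library alone. The sketch below is an illustration under stated assumptions, not the license-expression implementation: each package is written by hand as a list of OR-clauses that are ANDed together (so no parser is needed), only an excerpt of the vendored list is included, and the names VENDORED and combine are made up. It also shows why the packaging term disappears: Apache-2.0 AND (Apache-2.0 OR BSD-2-Clause) reduces to Apache-2.0 by absorption.

```python
# Each package maps to a list of OR-clauses that are ANDed together.
# A single license is a one-element clause. Excerpt of the list above.
VENDORED = {
    "pip": [{"MIT"}],
    "distro": [{"Apache-2.0"}],
    "certifi": [{"MPL-2.0"}],
    "chardet": [{"LGPL-2.1-or-later"}],
    "idna": [{"BSD-3-Clause"}, {"Python-2.0"}, {"Unicode-DFS-2015"}],
    "packaging": [{"Apache-2.0", "BSD-2-Clause"}],  # Apache-2.0 OR BSD-2-Clause
    "progress": [{"ISC"}],
    "pytoml": [{"MIT-0"}],
}


def combine(packages):
    """AND all clauses together, dropping duplicates and absorbed clauses."""
    clauses = {frozenset(c) for cs in packages.values() for c in cs}
    # Absorption: A AND (A OR B) == A, so drop any clause that strictly
    # contains another clause.
    kept = [c for c in clauses if not any(o < c for o in clauses)]
    parts = []
    for clause in kept:
        ids = sorted(clause)
        parts.append(ids[0] if len(ids) == 1 else "(" + " OR ".join(ids) + ")")
    return " AND ".join(sorted(parts))


print(combine(VENDORED))
```

On this excerpt the sketch prints the same nine-license expression as the session above. It only mirrors the boolean algebra; whether a simplified expression is the right thing to publish remains the package author's call.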

pombredanne · August 19, 2019, 12:52pm

That’s a great point. I thought about it a bit, and that would be really nice, but IMHO to do it right you have to start using a mapping (two parallel lists are too prone to alignment errors), which would complicate things much more.

A mapping would be needed as you cannot guarantee that every expression has a license file, or that an expression does not have more than one (e.g. an Apache license LICENSE and its NOTICE file).

So all in all, I would keep things as the simpler one license, multiple license files. It is flexible and expressive in most cases. In the rarer and more complex cases where there are many licenses involved you can still use the proposed conventions at the cost of a slight loss of clarity by not implementing a change to support your suggestion. But you are not forcing the more complex data model (e.g. a mapping) on everyone that does not need it.
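
To make the “one license, multiple license files” model concrete, here is a small sketch that reads such metadata with the standard library email parser. The package name, version and file names are hypothetical, and the draft may still change how License-File is spelled or used:

```python
from email.parser import HeaderParser

# Hypothetical metadata for an Apache-licensed package that ships both a
# LICENSE and a NOTICE file; the project name and version are made up.
METADATA = """\
Name: example-package
Version: 1.0
License: Apache-2.0
License-File: LICENSE
License-File: NOTICE
"""

msg = HeaderParser().parsestr(METADATA)
license_expression = msg["License"]          # a single string
license_files = msg.get_all("License-File")  # a list, one entry per field

print(license_expression)
print(license_files)
```

The expression stays a plain string and the files a flat list; nothing in the metadata ties a given file to a given part of the expression, which is the trade-off described above.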

We could of course have a data field with multiple possible value types (it’s a string, it’s a list, it’s a mapping!) but personally I think this is the source of endless confusion. This is what has been done for instance in npm (historically) and in Rubygems (still today), and as a result you never know what you get and it is a mess.