Improving license clarity with better package metadata

Yes, I think it would be better to follow the process. Otherwise, it gives people the impression that it’s okay to post drafts there while they’re still being prepared. You can still have the draft in a personal branch.

I updated the topic with a link to instead and closed accordingly


I originally reported issue #2996 after facing issues with license classifiers. I am mostly interested in making packaging easier for distribution developers (people who work on Fedora, Debian, *BSD, etc.).

The current situation (license classifiers + a “License” field that can be used as a fallback) is a bit confusing for package authors, and also makes it hard for automated tools to figure out what license is used. My workflow when trying to identify a license is the following:

  • Are the classifiers non-ambiguous? If yes, we are done.

  • Is the License field non-ambiguous? If yes, we are done.

  • Is there a LICENSE file in the project’s repo, and can we figure out what it is? If yes, we are done.

This gives the following crazy code in a tool I maintain. I would love to see non-ambiguous SPDX identifiers make their way into Python packaging.
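For illustration, the three-step fallback described above might look roughly like this. This is a minimal sketch and not the code from the tool mentioned: the function name, the two tiny lookup tables and the crude LICENSE text check are all assumptions made for the example.

```python
from typing import Iterable, Optional

# Hypothetical lookup tables; a real tool would need far larger ones.
CLASSIFIER_TO_SPDX = {
    "License :: OSI Approved :: MIT License": "MIT",
    "License :: OSI Approved :: ISC License (ISCL)": "ISC",
}
LICENSE_FIELD_TO_SPDX = {
    "mit": "MIT",
    "isc": "ISC",
    "bsd-3-clause": "BSD-3-Clause",
}


def guess_license(
    classifiers: Iterable[str],
    license_field: Optional[str],
    license_text: Optional[str] = None,
) -> Optional[str]:
    """Return an SPDX identifier, or None when every step is ambiguous."""
    license_classifiers = [c for c in classifiers if c.startswith("License ::")]

    # Step 1: classifiers, only when they all map to one single identifier.
    if license_classifiers and all(c in CLASSIFIER_TO_SPDX for c in license_classifiers):
        hits = {CLASSIFIER_TO_SPDX[c] for c in license_classifiers}
        if len(hits) == 1:
            return hits.pop()

    # Step 2: the free-form License field, only when it matches a known value.
    key = (license_field or "").strip().lower()
    if key in LICENSE_FIELD_TO_SPDX:
        return LICENSE_FIELD_TO_SPDX[key]

    # Step 3: the LICENSE file. This marker check is deliberately crude and
    # only stands in for real license text detection.
    if license_text and "Permission is hereby granted, free of charge" in license_text:
        return "MIT"

    return None
```

A classifier such as "License :: OSI Approved :: BSD License" does not say which BSD variant is meant, so the sketch falls through to the License field in that case.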

Rest assured I do not want to turn all Python programmers into license lawyers :slight_smile: There are already lots of things people do not care about, but end up doing anyway: writing tests, writing a file, etc. Licensing should be no different. An unlicensed package is hard to include in a distribution, since maintainers have strict rules about what they may or may not include in their distro. As @pombredanne pointed out, people who do not care can select a license such as MIT and use it for all of their projects. I am also under the impression that most people choose a license for their packages, but cannot express it in a non-ambiguous way because of the current limitations. For instance, scanning ~180k packages on PyPI, I counted more than 4500 different values for the “license” field.

I like the idea of repurposing the “License” field. Some people already use SPDX identifiers in this field (I do for my BSD-3-Clause projects, and I know other projects do it too). Which means some projects would be compatible with the new semantics without changing a single line.

I understand that changing the semantics of a field might be frowned upon, so I do not have a strong opinion on this and would be fine with a new field as well.

We could probably be quite lenient in the beginning and only enforce stricter rules once we’re confident the new semantics work well enough. It would probably be great to have a fallback option (i.e. “License: dont-bother-me-your-new-field-is-buggy-as-hell”): this would let us analyze why the author was not able to specify the license they wanted to use and fix issues.

Warehouse can sometimes “refuse” an upload and return a 400 error. In the future, maybe it should do so when the license is not properly defined and the issue can easily be fixed. For instance, if the license is set to “GPL”, the upload could fail and the user could be shown a message listing all the possible GPL identifiers from the SPDX list.
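A rough sketch of what such a check could look like follows. It is not Warehouse code: the function name, the error wording and the short hand-picked list of SPDX identifiers are assumptions for illustration, and a real check would load the full SPDX license list.

```python
from typing import Tuple

# Small hand-picked subset of SPDX identifiers, for illustration only.
KNOWN_SPDX_IDS = [
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "GPL-2.0-only",
    "GPL-2.0-or-later",
    "GPL-3.0-only",
    "GPL-3.0-or-later",
    "LGPL-2.1-only",
    "LGPL-2.1-or-later",
    "MIT",
]


def check_license(value: str) -> Tuple[bool, str]:
    """Return (ok, message); message is empty when the value is accepted."""
    if value in KNOWN_SPDX_IDS:
        return True, ""

    # Treat the value as a family name ("GPL") and list the identifiers
    # that belong to that family ("GPL-2.0-only", ...).
    prefix = value.strip().upper() + "-"
    candidates = [i for i in KNOWN_SPDX_IDS if i.upper().startswith(prefix)]
    if candidates:
        return False, (
            f"400 Bad Request: license {value!r} is ambiguous; "
            f"did you mean one of: {', '.join(candidates)}?"
        )
    return False, f"400 Bad Request: license {value!r} is not a known SPDX identifier"
```

With this subset, `check_license("GPL")` is rejected with a message listing the four GPL identifiers, while the LGPL ones are left out because they belong to a different family.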

Speaking as someone who is interested in distro packaging, what I really care about is having reliable and useful info from I think setuptools/flit/poetry/etc. could probably tell the user “hey, by the way, you should probably drop these classifiers, and use XXX instead”. If I remember correctly, poetry already recommends using SPDX identifiers.

I would indeed like such tools to be community-maintained.

Regarding the change process, the specifications section under is intended to be like the Python language reference section under clarifications and correction of errors and accidental omissions in existing specifications don’t require a PEP, but adding new fields or making significant changes to existing fields is likely to still need one in order to fully document the rationale for the related design decisions.

The old process (which didn’t work very well, hence the change in PEP 566) tried to use the PEPs themselves as both the reference document and to provide the rationale for change from the previous iteration, which made it hard to tell what was actually changed and what remained the same relative to the previous version. (and the preceding section on clarifications and minor updates) attempts to document that distinction, so if there’s wording that could be clarified there, suggestions would be appreciated.


Got it, thanks for the clarification @ncoghlan (and for the confusion @pombredanne). :+1:


Based on @sdispater is indeed recommending SPDX ids and is listing some. This is not yet expressions but as close as it gets.

@takluyver’s Flit doc is mostly consistent with the Core metadata docs:

The name of a license, if you’re using one for which there isn’t a Trove classifier. It’s recommended to use Trove classifiers instead of this in most cases.

@jaraco’s Setuptools also lists the license_file which was originally introduced by the wheels tool. This doc section on metadata also lists using both the “license” and the “classifiers” fields, which is not aligned with the PEP 566 doc and the Core metadata license-related texts at

Using a single license_file (singular) in wheels “metadata” has since been replaced by the plural license_files list by @agronholm with @njs’s support (based on a ticket I had originally entered on wheels @ bitbucket).

See also this doc in wheels:

There used to be an option called license_file (singular). As of wheel v1.0, this option has been deprecated in favor of the more versatile license_files option.

@pf_moore would you agree that handling license files also needs to be addressed in this PEP so we have a clean, consistent and properly documented one single way to handle licensing documentation in packages?

Thanks for chiming in Nick! Much appreciated!

Let’s write this down somewhere in the PyPA Specifications page?

I think it would be acceptable for the proposed change (and the PEP) to not make any comment about license files, but I think that doing so would be better - as you say, it makes the proposal into a complete review of licensing, and a proposal to address the whole area.

If you’re happy to expand the scope to include license files, it seems like a good idea (in general, your approach with the PEP has been very good so far, so I’m happy to trust your judgement on questions like this :slightly_smiling_face:)

Nick did point to here, but maybe that needs to be more discoverable or clearer? Both @di and I missed it, which implies the answer is “yes”, but I’m not sure what could be improved…

I think the problem is actually the opening paragraph on, as the link to the process page is the subtle “” one at the end.

It would probably be a lot clearer if the subsection titles were repeated as bullet points on the main spec page, with direct links to the relevant part of the process page.

I have always wanted to add more categories to wheel, including a “docs” category. In the wheel you would have a *.data/docs or e.g. *.data/license which could be installed somewhere sensible.

+1 on SPDX metadata

I haven’t read all the posts here, but hopefully, when this is all sorted out, someone will create a ticket against wheel with a link to the spec so I can then implement it. Thx :+1:

I pushed a new version of the draft at

The main changes are:

  1. reuse the License field
  2. add the License-File field (which is already in use in wheel and setuptools)
  3. add a section to survey how license is documented in Python and elsewhere
  4. add a section wrt. a reference implementation for a license expression validation library
  5. integrate the reviews and feedback to date

I hope I am not off topic.

I would expect this license is filled in by whoever controls the src/* area. For someone like myself, generally only interested in fixing bugs in packages and/or re-wrapping (ultimately in wheels), is this anything I need to be concerned about? Or is (or will) it all be magically resolved by ‘build format=“wheel”’?

A few questions:

  • Is the License-File field for just one file or possibly more than one? The draft says it’s a string, but it also says “license file(s)” (with an “s”) in the text. (I see now that “multiple use” means it can be multi-valued.)

  • For a project that might vendor many distinct libraries, each with possibly different licenses (e.g. pip), how would one indicate that with what’s being proposed?

  • When more than one license is being used, would it make sense to be able to map the license identifier to the corresponding License-File value(s)?

  • Also, if certain files are subject to certain licenses, I imagine the metadata isn’t attempting to reflect those distinctions.

It might make sense for the PEP to give examples of what its limits are (use cases like the above that it either can or can’t describe).
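On the first question, a multiple-use field is simply repeated in the metadata file, once per value. The sketch below reads such a file with the standard library `email` parser, which handles the header-style format used by core metadata. The sample metadata (project name, file names, metadata version) is made up for the example and is not taken from the draft.

```python
from email.parser import Parser

# Hypothetical METADATA content; the values are assumptions for illustration.
METADATA = """\
Metadata-Version: 2.1
Name: example
Version: 1.0
License: MIT AND Apache-2.0
License-File: LICENSE
License-File: vendor/NOTICE
"""

msg = Parser().parsestr(METADATA)

# A single-use field: indexing returns its one value.
license_expression = msg["License"]

# A multiple-use field: get_all() returns every occurrence, in file order.
license_files = msg.get_all("License-File", [])

print(license_expression)
print(license_files)
```

Here `license_files` is a list with two entries, while `license_expression` stays a single string.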

Hi @aixtools !
In your case (e.g. porting wheels to AIX, AFAICR), this is not something you should be concerned with for now.

This is not something that is set automatically, either today or if this new PEP is approved and implemented. This is something that a package author would explicitly set in the package metadata.

This is a mighty great example since there is so much variety there:

Looking at the current pip (19.2.2) we have these:

license='MIT', <-- this would be technically correct as a valid license expression


"License :: OSI Approved :: MIT License", <-- this should trigger an informational warning that this is redundant (we could also be harsher)
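The informational warning mentioned above could be produced by a check along these lines. This is only a sketch: the function name and the message wording are assumptions, and no existing tool is claimed to behave this way.

```python
from typing import Iterable, List


def redundant_classifier_warnings(
    license_expression: str, classifiers: Iterable[str]
) -> List[str]:
    """Return one informational message per license classifier that is
    redundant because a license expression is already set."""
    if not license_expression:
        return []
    return [
        f"classifier {c!r} is redundant: the license is already "
        f"declared as {license_expression!r}"
        for c in classifiers
        if c.startswith("License ::")
    ]


messages = redundant_classifier_warnings(
    "MIT",
    [
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3",
    ],
)
for message in messages:
    print("warning:", message)
```

Being harsher would only mean turning the returned messages into errors instead of printing them.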

But in reality you have many more licenses from the src/pip/_vendor directory, beyond the primary MIT.

  • appdirs: MIT
  • distro: Apache-2.0
  • ipaddress: Python-2.0
  • pyparsing: MIT
  • retrying: Apache-2.0
  • six: MIT
  • cachecontrol: Apache-2.0
  • certifi: MPL-2.0
  • chardet: LGPL-2.1-or-later
  • colorama: BSD-3-Clause
  • distlib: Python-2.0
  • distlib/_backport/ MIT
  • idna: BSD-3-Clause AND (Python-2.0 AND Unicode-DFS-2015)
  • lockfile: MIT
  • lockfile/ Python-2.0
  • msgpack: Apache-2.0
  • packaging: Apache-2.0 OR BSD-2-Clause
  • pep517: MIT
  • pep517/ Apache-2.0
  • pkg_resources: MIT
  • progress: ISC
  • pytoml: MIT-0
  • requests: Apache-2.0
  • urllib3: MIT
  • urllib3/packages/rfc3986: Apache-2.0
  • webencodings: BSD-3-Clause

Therefore you have two options IMHO:
A) Keep only MIT as the expression in the license metadata, since you are already vendoring and including the license texts of all of these. This is in the spirit of keeping things simple. Since you include all texts and source code, you are complying with most if not all license conditions AFAIK, and so would any redistributors downstream.

B) Provide a proper license expression joining them all (with an AND) and a license_files list of all the license files that you are already including, for full disclosure and details.

This expression would be either the full unsimplified expression:

  • MIT AND (MIT AND Apache-2.0 AND Python-2.0 AND MIT AND Apache-2.0 AND MIT AND Apache-2.0 AND MPL-2.0 AND LGPL-2.1-or-later AND BSD-3-Clause AND (Python-2.0 AND MIT) AND (BSD-3-Clause AND (Python-2.0 AND Unicode-DFS-2015)) AND (MIT AND Python-2.0) AND Apache-2.0 AND (Apache-2.0 OR BSD-2-Clause) AND (MIT AND Apache-2.0) AND MIT AND ISC AND MIT-0 AND Apache-2.0 AND (MIT AND Apache-2.0) AND BSD-3-Clause)

(I have used parentheses here purely for “cosmetic effect” to group the subexpressions of each package together, and put MIT as the leftmost one since this is the main one.)

  • or a simplified version: MIT AND (Apache-2.0 AND BSD-3-Clause AND ISC AND LGPL-2.1-or-later AND MIT AND MIT-0 AND MPL-2.0 AND Python-2.0 AND Unicode-DFS-2015), still using the extra MIT at the start to depict (strictly visually) that it is the main one. I cheated for that and used this snippet to do a boolean simplification of the full long version:
$ pip install license-expression
$ python
>>> from license_expression import Licensing
>>> l = Licensing()
>>> expression = 'MIT AND (MIT AND Apache-2.0 AND Python-2.0 AND MIT AND Apache-2.0 AND MIT AND Apache-2.0 AND MPL-2.0 AND LGPL-2.1-or-later AND BSD-3-Clause AND (Python-2.0 AND MIT) AND (BSD-3-Clause AND (Python-2.0 AND Unicode-DFS-2015)) AND (MIT AND Python-2.0) AND Apache-2.0 AND (Apache-2.0 OR BSD-2-Clause) AND (MIT AND Apache-2.0) AND MIT AND ISC AND MIT-0 AND Apache-2.0 AND (MIT AND Apache-2.0) AND BSD-3-Clause)'
>>> parsed = l.parse(expression)
>>> parsed
AND(LicenseSymbol(u'MIT', is_exception=False), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'MPL-2.0', is_exception=False), LicenseSymbol(u'LGPL-2.1-or-later', is_exception=False), LicenseSymbol(u'BSD-3-Clause', is_exception=False), AND(LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'MIT', is_exception=False)), AND(LicenseSymbol(u'BSD-3-Clause', is_exception=False), AND(LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'Unicode-DFS-2015', is_exception=False))), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Python-2.0', is_exception=False)), LicenseSymbol(u'Apache-2.0', is_exception=False), OR(LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'BSD-2-Clause', is_exception=False)), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False)), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'ISC', is_exception=False), LicenseSymbol(u'MIT-0', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False), AND(LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'Apache-2.0', is_exception=False)), LicenseSymbol(u'BSD-3-Clause', is_exception=False)))
>>> parsed.simplify()
AND(LicenseSymbol(u'Apache-2.0', is_exception=False), LicenseSymbol(u'BSD-3-Clause', is_exception=False), LicenseSymbol(u'ISC', is_exception=False), LicenseSymbol(u'LGPL-2.1-or-later', is_exception=False), LicenseSymbol(u'MIT', is_exception=False), LicenseSymbol(u'MIT-0', is_exception=False), LicenseSymbol(u'MPL-2.0', is_exception=False), LicenseSymbol(u'Python-2.0', is_exception=False), LicenseSymbol(u'Unicode-DFS-2015', is_exception=False))
>>> str(parsed.simplify())
'Apache-2.0 AND BSD-3-Clause AND ISC AND LGPL-2.1-or-later AND MIT AND MIT-0 AND MPL-2.0 AND Python-2.0 AND Unicode-DFS-2015'
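The long unsimplified input expression does not have to be typed by hand. As a sketch, it can be assembled from a per-package mapping using only the standard library; the helper names are hypothetical and the mapping below holds just four of the vendored packages listed earlier to keep the example short.

```python
from typing import Dict

MAIN_LICENSE = "MIT"

# A small subset of the vendored packages and their license expressions.
VENDORED: Dict[str, str] = {
    "appdirs": "MIT",
    "distro": "Apache-2.0",
    "packaging": "Apache-2.0 OR BSD-2-Clause",
    "idna": "BSD-3-Clause AND (Python-2.0 AND Unicode-DFS-2015)",
}


def wrap(expression: str) -> str:
    """Parenthesize compound expressions so they keep their grouping
    once joined with AND; single identifiers are left untouched."""
    return f"({expression})" if " " in expression else expression


def combined_expression(main: str, vendored: Dict[str, str]) -> str:
    """Join the main license and every vendored expression with AND."""
    parts = [wrap(expression) for expression in vendored.values()]
    return f"{main} AND ({' AND '.join(parts)})"


full = combined_expression(MAIN_LICENSE, VENDORED)
print(full)
```

The resulting string can then be passed to a parser such as the license-expression library used above for validation or simplification.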

That’s a great point. I thought about it a bit and that would be really nice, but IMHO to do it right you would have to start using a mapping (two parallel lists are too prone to alignment errors), which would complicate things much more.

A mapping would be needed because you cannot guarantee that every expression has a license file, or that an expression has no more than one (e.g. an Apache license LICENSE file and its NOTICE file).

So all in all, I would keep things as the simpler one license, multiple license files. It is flexible and expressive in most cases. In the rarer and more complex cases where there are many licenses involved you can still use the proposed conventions at the cost of a slight loss of clarity by not implementing a change to support your suggestion. But you are not forcing the more complex data model (e.g. a mapping) on everyone that does not need it.
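To make the trade-off concrete, here is a small sketch of the three data shapes discussed above. The file names are made up, and the dictionary keys `license` and `license_files` are only illustrative stand-ins for the metadata fields.

```python
# Two parallel lists: the pairing silently breaks as soon as one license
# has two files (Apache-2.0: LICENSE.APACHE + NOTICE) or none (ISC here).
licenses = ["MIT", "Apache-2.0", "ISC"]
files = ["LICENSE", "LICENSE.APACHE", "NOTICE"]
naive_pairs = dict(zip(licenses, files))  # pairs ISC with NOTICE, which is wrong

# A mapping states the relation explicitly, at the cost of a heavier model.
mapping = {
    "MIT": ["LICENSE"],
    "Apache-2.0": ["LICENSE.APACHE", "NOTICE"],
    "ISC": [],
}

# The simpler shape: one expression and a flat list of files, with no
# attempt to say which file belongs to which license.
simple = {
    "license": "MIT AND Apache-2.0 AND ISC",
    "license_files": ["LICENSE", "LICENSE.APACHE", "NOTICE"],
}

# Flattening the mapping gives back the flat list of the simple shape.
flattened = [name for names in mapping.values() for name in names]
print(naive_pairs["ISC"], flattened)
```

The simple shape loses the per-license association, but nobody who has a single license and a single file pays for a structure they do not need.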

We could of course have a data field with multiple possible value types (it’s a string, it’s a list, it’s a mapping!) but personally I think this is the source of endless confusion. This is what has been done for instance in npm (historically) and in Rubygems (still today), and as a result you never know what you get and it is a mess.

Yes, this is entirely out of the scope of this PEP, though it is mentioned when discussing the documentation of “license in code files”. The best solution there is IMHO using the SPDX-License-Identifier approach, as I helped deploy in the Linux kernel.
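For readers unfamiliar with that convention: each source file carries a short comment of the form `SPDX-License-Identifier: <expression>` near its top. A minimal sketch of reading it back follows; the function name and the five-line window are assumptions, not part of any specification.

```python
import re
from typing import Optional

# Matches the tag and captures the rest of the line.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def find_spdx_identifier(source_text: str, max_lines: int = 5) -> Optional[str]:
    """Return the license expression declared near the top of a file,
    or None when no tag is found in the first max_lines lines."""
    for line in source_text.splitlines()[:max_lines]:
        match = SPDX_TAG.search(line)
        if match:
            # Drop a trailing C-style comment terminator, if any.
            return match.group(1).strip().rstrip("*/").strip()
    return None


python_file = "# SPDX-License-Identifier: GPL-2.0-only\nimport os\n"
c_file = "/* SPDX-License-Identifier: MIT */\n#include <stdio.h>\n"

print(find_spdx_identifier(python_file))
print(find_spdx_identifier(c_file))
```

This covers per-file licensing, which is a separate concern from the package-level metadata discussed in this thread.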

:+1: … would you prefer to see these examples inline, or in an appendix section?

Thanks for the replies!

The various examples built from the example of pip are the only things I would think should go in an appendix if you include them (because the pip example is so involved). In the abstract, the ideas themselves are pretty easy to describe and don’t need to be left to an appendix. Having said that, I think the pip examples might be too complicated. You could communicate the same thing with a package that vendors three or four libraries instead of pip’s more than a couple dozen.

Regarding the use cases in the abstract, I think it might be good to have a “scope” subsection towards the beginning describing what’s in and out of scope for the PEP, and then you can put more detailed info about out-of-scope use cases in a “Rejected ideas” section per PEP 1: The out-of-scope use cases can include the ideas of (1) mapping licenses used in the license expression to specific files in the license files (or vice versa), and (2) mapping licenses to specific source files and/or directories of source files (or vice versa).

That’s my concern with reusing the License field – it’s using metadata that was not written with the assumption of being a strict, unambiguous bit of information and treating it as a strong compliance metric.

But then again, pip’s almost a special snowflake and I basically have very little skin in this game of legal, so… someone who actually is affected by this needs to confirm how much false positives affect them. :man_shrugging:t2: