PEP 639, Round 2: Improving license clarity with better package metadata

That sounds awesome to me. If someone else chimes in with an approval of that new plan I’ll devote time this weekend to updating my implementation of this PEP.

edit: had time today, example usage:

[build-system]
requires = ['hatchling']
build-backend = 'hatchling.build'

[project]
name = 'foo'
license = 'MIT'
dynamic = ['version']

[tool.hatch.version]
path = 'foo/__about__.py'

[tool.hatch.build]
packages = ['foo']

[tool.hatch.build.targets.sdist]
include = ['/tests']

[tool.hatch.build.targets.wheel]
core-metadata-version = '2.3'

What are the next steps here?

Sorry, I’ve been on a trip visiting various family members and wanted to give others an opportunity to chime in before going ahead with the rest of the above plan, to ensure there was consensus before reworking everything again. But as it seems we are all in agreement, if no one has any further objections, and assuming it doesn’t conflict with anything @pradyunsg has already worked on, I can go ahead with the above proposed revisions in the next few days or so.


Yes I am. :slightly_smiling_face:

Sounds good!

Why this (I didn’t find any discussion about this in this topic and the PEP has enough of a different approach that I didn’t see how this decision was reached)? I could understand adding an expression or spdx key to go along with the other keys to support either approach (i.e. not only point to the license file but also classify it). While I fully support making it as easy as possible to state what license a project uses, I also prefer not to toss out information that can be useful if people are up for the extra work (speaking from experience where I wrote a tool to generate a 3rd-party notices file that gathered licenses into a file while also working at a company that supports reading SPDX expressions for legal compliance; darn you, Rust, for only supporting one or the other!).


Any interest in being a PEP delegate on this one? :slightly_smiling_face:

Good question. As you astutely point out, since the proposal reverts to a previous direction not taken in the current version of the PEP, there wasn’t a single canonical place (at least in the rejected ideas) where I explicitly and cohesively explain this particular element, requiring the reader to synthesize a substantial amount of previous discussion or a number of disparate bits from the specification and/or rejected ideas. I’ll make sure to address this in the revised version of the PEP.

This is actually a pretty complex question with several distinct parts (adding an expression key versus allowing just a flat table value, deprecating the text key, and deprecating the file key), but the TL;DR is that the existing license table subkeys were already mutually exclusive per PEP 621, and the core metadata/source metadata keys they map to are deprecated in favor of the much richer, more powerful and non-mutually-exclusive mechanisms in this PEP that cover the same ground and more, per the consensus in the previous discussion. Read on for the fully detailed answer, which I could abridge and include in the next revision of the PEP.

The current text and file table subkeys of the license key are stated in PEP 621 to be mutually exclusive, and map to metadata fields that (per the strong consensus on the previous thread) this PEP deprecates:

The table may have one of two keys. The file key has a string value that is a relative file path to the file which contains the license for the project. Tools MUST assume the file’s encoding is UTF-8. The text key has a string value which is the license of the project whose meaning is that of the License field from the core metadata. These keys are mutually exclusive, so a tool MUST raise an error if the metadata specifies both keys.

I couldn’t find an explicit justification for this in PEP 621, so the reasons are somewhat unclear, but the upshot is that at present, with those two PEP 621 keys, it is only possible to specify “one or the other”, as you say; more specifically, at most one of either:

  • Some free-form text describing the license, with unspecified syntax and semantics (text), or
  • A single license-related file, with unspecified semantics and mapping to core metadata or distribution archive contents (file).
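A minimal sketch of the PEP 621 rule quoted above, assuming the `license` table has already been parsed into a plain dict. The function name `check_pep621_license_table` is hypothetical, and the behavior for an empty table is an assumption rather than something PEP 621 spells out:

```python
def check_pep621_license_table(table):
    """Return the single key in use ('file' or 'text'), or None if the table is empty.

    Raises ValueError when both keys are present, mirroring the
    "tool MUST raise an error" wording in PEP 621.
    """
    has_file = "file" in table
    has_text = "text" in table
    if has_file and has_text:
        raise ValueError("license.file and license.text are mutually exclusive")
    if has_file:
        return "file"
    if has_text:
        return "text"
    # Assumption: an empty table is treated as "nothing specified".
    return None
```

So `{"file": "LICENSE.txt"}` and `{"text": "MIT"}` are each accepted on their own, but a table with both is rejected, which is the "one or the other" restriction discussed here.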

This existing mutual exclusivity seems to be undesirable and overly restrictive, and seems to be one of the core things that bothers you above (and me too, which is why this PEP dramatically improves upon this situation…but I’ll get to that in a minute!).

Furthermore, it appears to be intended that text, and possibly file, map to the License field in core metadata. The clear consensus in the previous discussion thread for this PEP, both before and after I became involved, was that the License core metadata field should be deprecated by, and certainly mutually exclusive with, the License-Expression field this PEP adds, to ensure there is one (and preferably only one) obvious way to concisely specify the license(s) of the project in the package metadata, avoiding user confusion, substantial legal ambiguity, and duplication, and to allow arbitrarily complex licenses, combinations and exceptions to be described using a standardized, unambiguous, machine-parsable format. Therefore, use of the text key (and the file key for this purpose) is correspondingly deprecated and replaced by (and mutually exclusive with) specifying a license expression (and specifying license-related files for special cases, as appropriate).
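To illustrate what "machine-parsable" buys a consumer, here is a rough sketch that pulls the license and exception identifiers out of an SPDX-style expression. It is not a validator: it only recognizes the uppercase `AND`, `OR` and `WITH` operators and parentheses, and it does not check identifiers against the SPDX license list. The function name is hypothetical:

```python
import re

# Operators used in SPDX license expressions (uppercase form only, by assumption).
OPERATORS = {"AND", "OR", "WITH"}


def expression_identifiers(expression):
    """Return the license/exception identifiers in an expression, in first-seen order."""
    # Split into parentheses and runs of non-space, non-parenthesis characters.
    tokens = re.findall(r"[()]|[^\s()]+", expression)
    identifiers = []
    for token in tokens:
        if token in {"(", ")"} or token in OPERATORS:
            continue
        if token not in identifiers:
            identifiers.append(token)
    return identifiers
```

For example, `expression_identifiers("MIT OR (Apache-2.0 WITH LLVM-exception)")` yields the three identifiers `MIT`, `Apache-2.0` and `LLVM-exception`, something that cannot be done reliably with the free-form License field.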

Similarly, for some time now, Setuptools, wheel (the library) and other packaging tools have deprecated mechanisms that allow specifying only a single license file (license_file), which is overly restrictive for many cases (including yours, when you have at least both a license file and notices file), and replaced them with ones that enable specifying multiple (license_files), which, per the previous consensus, was what this PEP specified on the core metadata side well prior to my revisions. Therefore, for similar reasons as above, file is deprecated and replaced by a nearly equally simple but much more flexible way of specifying any number of license files to include, which unlike it, can also be specified alongside a license expression, and has safe, sensible, and standardized defaults and semantics for including license files in distribution archives and listing them in core metadata.

So, in summary, while the project source metadata changes in PEP 639 (with the revisions above) allow the license to be stated as easily as practical, with a single SPDX short identifier for most cases (and common license-related files included automagically), this PEP also allows much greater richness with license metadata for those who, like you say, are up for the extra work. In particular, it allows them to specify both a full license expression with any number of licenses, exceptions, and relationships, and one or any number of license files that they choose, if the clearer and more sensible defaults don’t already cover their use case. Together, these are nearly a strict superset of the expressiveness of the previous two keys, which they would otherwise duplicate.

As for adding an expression table subkey to the license key, I actually not only carefully considered it, but (believe it or not!) had the same initial thought as you and in fact implemented exactly that in an earlier draft of the PEP. However, given the other two keys are to be deprecated and mutually exclusive with the new ones, being close to subsets of their functionality and mapping to deprecated metadata fields (for the reasons above); and there didn’t appear to be likely future keys that would be added, I opted not to add the extra complexity of an expression table subkey and making it mutually exclusive with the others, as opposed to just adding the string key (which neatly makes a license expression mutually exclusive with both as a natural and obvious consequence of the basic structure). As I discuss in the license expression as string value rejected idea:

If an expression subkey was added to the license table, it would retain the clarity of a new top-level key, but add additional complexity for no real benefit, with an extra level of nesting, and users and tools needing to deal with the mutual exclusivity of the subkeys, as before. And allowing both (as a table subkey and the string value) would inherit both’s downsides, while adding even more spec and tool complexity and making there more than “one obvious way to do it”, further potentially confusing users.
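As a rough sketch of why the string form makes the mutual exclusivity fall out of the structure: a TOML key can hold either a string or a table, never both, so a tool only has to dispatch on the parsed type. The function name and the returned labels below are hypothetical, and the input is assumed to be the already-parsed value of `project.license`:

```python
def interpret_license(value):
    """Classify a parsed [project] license value as (kind, payload)."""
    if isinstance(value, str):
        # New style: the whole value is a license expression.
        return ("expression", value)
    if isinstance(value, dict):
        # Legacy PEP 621 table: exactly one of 'file' or 'text'.
        if len(value) != 1 or next(iter(value)) not in ("file", "text"):
            raise ValueError("license table must have exactly one of 'file' or 'text'")
        key = next(iter(value))
        return ("legacy-" + key, value[key])
    raise TypeError("license must be a string or a table")
```

With an `expression` subkey instead, the tool would also need explicit checks that it is not combined with `file` or `text`, which is the extra complexity the quoted rejected idea refers to.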

EDIT: I meant to include this before, but skipped it. There are a couple of possible niche use cases of the existing License field that are arguably not completely equally handled by the new License-Expression and License-File fields: bespoke proprietary licenses, and other arbitrary license-related information. For the former, since there is no well understood, standardized meaning of such licenses, it seemed best to minimize ambiguity by covering this case with the LicenseRef-Proprietary license expression and including and specifying the license-file(s) that describes it; if custom identifiers for such cases are still desired by bespoke/proprietary tooling, the PEP does not prohibit them from allowing such, and if there’s sufficient need, we could (now or later) implement a LicenseRef-Custom value or allow arbitrary LicenseRef-{custom} identifiers. To cover the second case, the user can simply include any extra info in a new or existing License-File that can automatically or explicitly be included in archives and listed in the metadata, or include it in the short/long description; custom LicenseRef-s could also help cover that case if really needed. See discussion here and on the previous PR for more on that. END EDIT

In case you’re wondering why not add another files (and/or paths, globs, etc) subkey to the license table, see this rejected idea, and for the justification for the syntax and semantics of the license-files key, see the relevant rejected idea subsection.

Hopefully this clarifies things, and in case parts are still unclear, I’m happy to answer followups!

By the way, this is super cool; for the Spyder scientific environment/IDE I initially did that manually but in a strict machine-parsable format, which others later implemented tools to read, parse and update.

It’s up to @pf_moore . I can be, although I would probably ask the open source office here at work for input.

Sure, but that doesn’t mean all keys need to stay mutually exclusive.

My key point is I don’t want to lose the ability to specify the license file somehow (and to be clear, that doesn’t preclude embedding the license in the metadata like we do today, just as long as one can continue to programmatically get the license text from a wheel and sdist).

Sure, I was just making mention of the current status quo that you were comparing this PEP to.

Yup; this PEP actually greatly improves your ability to do so. Previously, with PEP 621, you could only specify a single license file, and only if you didn’t specify the license metadata; furthermore, it wasn’t explicitly specified what backends were supposed to do with file (include the path in license? include the full text? include the file in the distribution? some combination, or something else entirely?), and how metadata consumers could access it in a consistent, defined manner.

With this PEP, there is a new license-files key that allows adding multiple license-related files in addition to a license expression, either by full relative path or glob, and a clearly defined mechanism for storing both the path in the metadata and the full text in the .dist-info, per what Wheel and Setuptools have implemented, with additional tweaks to avoid conflicts, backward compat issues and clutter in .dist-info, and to allow including licenses from subdirs, e.g. vendored projects, rather than arbitrarily dropping them.
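A minimal sketch of the "full relative path or glob" part, assuming patterns are resolved relative to the project root with ordinary `pathlib` globbing. The function name is hypothetical, the exact glob semantics the PEP settles on may differ, and this covers only path resolution, not how a backend lays the files out in .dist-info:

```python
from pathlib import Path


def resolve_license_files(project_root, patterns):
    """Expand relative paths/globs into a sorted list of unique POSIX-style relative paths."""
    root = Path(project_root)
    matched = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            # Only regular files count; directories matched by a glob are skipped.
            if path.is_file():
                matched.add(path.relative_to(root).as_posix())
    return sorted(matched)
```

Keeping the relative path (for instance `vendor/dep/LICENSE` rather than just `LICENSE`) is what lets licenses from vendored subdirectories coexist without overwriting the top-level one.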

I said in a post above that I’d prefer someone else to volunteer to be PEP delegate for this, so feel free. But in general, the position as I understand it is that anyone can volunteer to be PEP delegate, it doesn’t need my approval for them to do so (although I would get a vote on approving them along with the other PyPA committers).

By the way, if anyone knows whether the process in PyPA Specifications — PyPA documentation

If their self-nomination is accepted by the other PyPA core reviewers, the lead PyPI maintainer and the default PEP-Delegate for package distribution metadata PEPs, then they will have the authority to approve (or reject) that PEP.

means that we need a PyPA vote, or if a simple call for anyone who objects to speak up is sufficient, please let me know!


Have you actually come across a need for this from a top-level perspective?

Huh, interesting use-case!

I volunteer then! Let the regret begin. :wink:

If you’re mirroring how the Python core team does it then you just appoint and see if anyone screams.


We’re not, quite. See the quote above, it needs PyPA committer approval. I’ve posted a request for any objections to the PyPA committers list. Assuming no-one objects, I’d say that the job’s yours :slightly_smiling_face:

Perhaps we can call a vote to make this change to the governance PEP, so in future we can do this. @brettcannon what timeline do you give people to scream? Do you announce PEP delegate nominations on some list?

We announce the delegate to python-dev so people know who they need to speak to or influence regarding the decision. We have actually never had to change a delegate, so it’s honestly a hypothetical we haven’t needed to be concerned about.

Just FYI, in case it was easy to miss, the <details> section in my original reply covers several of the followup questions in, well, detail (maybe too much, which is why I collapsed it) :laughing:

I’m not 100% sure if you’re asking about why it needs to be a top-level key under [project], or the need for multiple license files, so I’ll address both.

Regarding the former, if you’re wondering why not add another files (and/or paths , globs , etc) subkey to the license table instead of making license-files a top-level [project] key, that was what I originally implemented in the PEP, but see this rejected idea for why it proved unworkable in practice, and for the justification for the syntax and semantics of the license-files key, see the relevant rejected idea subsection.

As for the need for multiple license-related files, that is actually surprisingly common: for the many projects that contain code or vendored deps under other licenses (pip, Setuptools, Spyder, etc), for those that have a license and a notice file (such as the projects you mentioned, or several I’ve used), for projects under multiple licenses (packaging), and in a number of other cases, all of which are technically currently in violation of the relevant licenses by not doing so (unless they happen to match the current de-facto defaults of their tool, which this PEP also standardizes). This also matches the support of and for the reasons mentioned by Wheel, Setuptools and others.

Indeed, and it’s more common than you might think! Pip and Setuptools are but two examples (that are currently technically not following the licenses of their vendored deps by not doing so), the latter of which is included in the examples section of this PEP.

Thanks! :smirk:


Thanks! For reference, that discussion can be found at

https://mail.python.org/archives/list/pypa-committers@python.org/thread/TJXEXQHP7XXYC33I5EXCD3KTHWHX2VVO/

I opened a draft PR (with that discussion referenced) to make the change on the PEP, pending the outcome of the approval.

Just to note, it seems (as @pf_moore pointed out) that there have been several instances lately (all, somehow, instigated by me) of some ambiguity in the PyPA governance document: whether the PyPA GitHub org is the sole canonical PyPA project location or those on Bitbucket, GitLab, etc are equivalent (and whether a vote is needed to move a project from one to another); what specifically “approval” means for a PEP delegate change; and whether my minor change to update references to Python-Dev on the PyPA site to point here required a vote. It might be good to address those together, if so.

By the way, I should have mentioned that the vote is complete and Brett has been accepted as PEP delegate.


As a quick update on next steps here, I’ve been busy with holiday and conference travel, PEP editing and OSS maintainer-ing, but I’m going to make time to do at least the low-hanging-fruit length reduction steps in the next few days (with the rest if and when PEP 676 is accepted, which allows for easily splitting the ancillary, non-normative content into separate rendered auxiliary files linked from the PEP for readers interested, while greatly reducing the length of the PEP itself), followed by implementing the PEP 621 proposal we’ve all more or less agreed on; see my comment for a fuller summary.


And a quick update on my end is I have reached out to some folks internally here at Microsoft whose job is to deal with software licenses to see if there’s any key information we would be lacking that would make their lives easier if the PEP were accepted (i.e. SPDX and support for multiple license files).

My current question to everyone here is what are people’s thoughts on the tweaks to wheels in terms of where to store license files? And do people like storing the files instead of embedding them in METADATA?


Great, thanks Brett! To note, this also resolves a lot of issues that many users have brought up with the current license tag system, and for a couple years now seems to be the de-facto intended path forward from PyPI’s end rather than adding and maintaining lots of new license tags while deprecating many old ones.

I would appreciate feedback as well; the tweaks are intended to address some significant issues with the current implementation, as discussed on the relevant linked Wheel and Setuptools issues, and follow what was suggested there so as to only require small tweaks in the implementation and not pose meaningful backward compat concerns, but before standardizing it I’d certainly want to hear others’ thoughts as to any unforeseen issues with that approach and any other potentially viable alternatives.

I’d be curious to hear a concrete implementation proposal for the latter, in order to better compare it as a potential alternative. I considered something like that and explored a few speculative possibilities, but I didn’t find many benefits to that approach that would justify the spec and implementation complexity and other downsides over just making a couple small tweaks to the existing system that has been mostly implemented in many tools already.

In particular, we’d have to design, agree on, prototype and implement a mechanism to do this in existing metadata producers and consumers, which includes addressing questions like:
* How to store the original filenames/paths, or how else to identify each set of file contents?
* How to map the file identifiers to the license—some kind of embedded data structure?
* What would the API look like for accessing them? Just dump all the text? Get a particular license?
* How do we handle encoding, escaping and special characters?
* What do the PEP 621 keys look like? Co-opt what we have now, or design something new?
* What about in other tools? Do we suggest tools change the behavior of the existing license_file/license_files in wheel and setuptools, or add yet another config setting?
* Etc…

There’s also other potential concerns:

  • For projects with a lot of license files (e.g. Spyder, a medium to larger project, has just a NOTICE.txt file of 200 KB, and a number of other license-related files) this could potentially impose a small but non-trivial performance penalty reading and parsing METADATA (currently 15 KB, including a large readme) on every access.
  • Embedding the license texts within an existing machine-formatted data file, as opposed to leaving them in their original files, makes them more opaque and less easy and obvious to access, and could potentially conflict with license provisions that require explicitly preserving the files/their names (e.g. Apache with NOTICE) and with making them sufficiently user-visible/discoverable.
  • The current proposed approach doesn’t impose any significant further difficulties on users and tools currently adding and accessing license files from wheels, while any such new embedded approach could raise backward compat concerns, in addition to imposing an additional burden on tool authors who’ve already implemented the current approach.

However, I’d appreciate hearing more from anyone advocating an embedded approach as to the potential advantages I may be missing, so I can more fully consider a concrete proposal that tries to answer some of the questions above. I don’t want to get too hung up on this, as the original request in adding a formal specification for storing license files in wheels was to just formally codify and refine the existing implemented behavior, but I don’t want to just dismiss it before hearing from others who may have better ideas. Thanks!


The main goal I see with it is that license files are “installed” at all. Where doesn’t really matter as long as it’s sensible and easy to access for third-party libraries which may want to aggregate them. The first inclination would have been to use the package dir itself. Realistically, however, that would only work if the license files were stored there to begin with. That often isn’t the case. Thus we need a folder that packaging tools have full control over, e.g. dist-info. That’s also where setuptools, wheel, and others store them already. As for the dedicated subfolder, most cases would probably work just fine without it, but for the few which have multiple license files with similar names, we need a solution to avoid conflicts.

Although it might avoid the aforementioned conflicts, I don’t think that would be a good solution. Like I said earlier, the intention is only to have them available and accessible after a wheel is installed. If third-party tools want to use them, it’s enough if they can use something like importlib.metadata to read them themselves.
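A rough sketch of what "use something like importlib.metadata" could look like for a third-party aggregator, assuming the backend records each license file in a `License-File` metadata field and installs it somewhere under the `.dist-info` directory (either directly or in a subfolder). Both function names are hypothetical:

```python
from importlib.metadata import distribution


def pick_license_paths(declared, installed_files):
    """Return installed paths under .dist-info whose tail matches a declared License-File value."""
    picked = []
    for path in installed_files:
        posix = str(path).replace("\\", "/")
        if ".dist-info/" not in posix:
            continue
        inside = posix.split(".dist-info/", 1)[1]
        if any(inside == name or inside.endswith("/" + name) for name in declared):
            picked.append(path)
    return picked


def read_license_texts(dist_name):
    """Map each matched license path of an installed distribution to its text."""
    dist = distribution(dist_name)  # raises PackageNotFoundError if not installed
    declared = dist.metadata.get_all("License-File") or []
    files = dist.files or []  # files can be None when no RECORD is available
    return {str(p): p.read_text() for p in pick_license_paths(declared, files)}
```

The matching is deliberately loose (suffix-based) because where exactly the files land inside `.dist-info` is the open question being discussed here.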

Another aspect to consider: not all license files are created equal. Just off the top of my head, there are txt, json, xml, and likely others. If all of them are stored in text form, in addition to a filename / filepath we would also need to store information about the filetype. Having to deal with all of that is just unnecessary for packaging tools IMO. We should make it as simple as possible to include license files with wheels.


Probably more-or-less the same as today, but with some marker to delineate the separation between licenses.
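To make the "marker" idea concrete enough to compare, here is a purely hypothetical sketch: the header format, function names and storage scheme are all invented for illustration and are not part of any proposal in this thread. It joins several license texts into one blob with a header line per file and splits them back out:

```python
import re

# Hypothetical marker line; nothing in the PEP or existing tools defines this.
HEADER = "=== license-file: {name} ==="
HEADER_RE = re.compile(r"^=== license-file: (.+) ===$", re.MULTILINE)


def embed(licenses):
    """Join {name: text} into a single blob, one header line before each text."""
    parts = []
    for name, text in licenses.items():
        parts.append(HEADER.format(name=name))
        parts.append(text)
    return "\n".join(parts)


def extract(blob):
    """Split a blob produced by embed() back into {name: text}."""
    pieces = HEADER_RE.split(blob)
    # pieces[0] is whatever precedes the first header; then (name, body) pairs follow.
    return {
        name: body.strip("\n")
        for name, body in zip(pieces[1::2], pieces[2::2])
    }
```

Even this toy version shows the questions raised earlier: leading and trailing newlines of each text are not preserved, and a license text that happens to contain a line matching the marker would be split incorrectly unless some escaping scheme is added.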