PEP 639, Round 2: Improving license clarity with better package metadata

Thanks. As far as I’ve seen in current projects, the existing convention for including license files is license_file/license_files and similar in setuptools, wheel, and other tools, and using the license field and/or Trove classifiers to include license information in the metadata.

What I’m guessing you’re referring to, however, is projects dumping the full text of their license file(s) in the License core metadata field? I don’t think I’ve ever personally come across any projects (or guides/documentation) that do the latter, though I imagine there might be a few out there—could you point us to any mainstream examples you’re aware of?

Just dumping the file text in there with a separator has all the issues and limitations mentioned in my comment above. It simplifies a couple of the implementation questions, but only by not providing the indicated capabilities at all: original file paths, file names, and file types, which are important for understanding and making practical use of the text. It also creates further legal uncertainty as to whether this mechanism is an acceptable alternative to including the original files. The existing, implemented “include license files in the wheel” approach offers all of this right now, or with some small tweaks.

This would really require a whole new discussion, since up until this point in the past two threads, I can’t recall anyone requesting a radical departure like this, as opposed to codifying the existing, proven “include the files” approach, with a metadata key for the paths and perhaps other minor tweaks. And at this point, it’s hard to see us essentially throwing out the current proposal, implementations and format and starting over with a more complex, seemingly less functional alternative unless there are critical flaws in the former, or strongly compelling reasons to prefer the latter (and if so, I’d certainly want to hear them).

What would be particularly helpful to have additional feedback on, if anyone has it, is (as you mentioned above) the two proposed tweaks to the current mechanism of storing license files in wheels: rooting them in a subdir to avoid conflicts with current, future and implementation-specific .dist-info files, and storing them at their original path inside it to allow multiple files with the same name and provide additional context. Is there any current use case that might be affected that I’m not aware of? Are there any implementation difficulties that I haven’t foreseen?

Yes.

Nope, because this is all hypothetical. :grin:

The key thing I want to make sure is people are okay with updating:

  1. The core metadata spec
  2. The pyproject.toml spec
  3. The wheel spec

for licenses. Now I think people are since this is legal stuff none of us want to get wrong, but this isn’t a small ask either when you’re affecting pretty much the entire packaging stack from users to builders to installers. This isn’t about the difficulty of implementation, it’s a question of difficulty of change.

Right; I think we’re on the same page then, thanks :+1: I’d definitely appreciate that as well; it mostly codifies what existing tools are already doing, with some minor tweaks (that should be very low impact) to address issues identified in practice, but we definitely want to make sure we get things right, we’re recommending the best approach and we haven’t overlooked anything important. Looking forward to hearing from others on this.

In particular, regarding the core metadata spec, it should be noted that the spec makes no assumption that core metadata is located on a filesystem anywhere, so it’s not at all obvious how a filename should be interpreted. For example, I have an application which loads project metadata into a sqlite database. And I have another utility that downloads just the metadata file from a wheel (using partial HTTP reads). In both of these cases, under this proposal the metadata isn’t complete without additional information about where to find the license files.

As long as there’s a way to discover the license files somehow in the wheel file then I don’t see why the file paths need to be listed in the core metadata. Best I can think of for usability is as a check that the license files are there as expected, but RECORD should cover that. Otherwise it’s a tool bug and not something I think we need to guard against (since we don’t guard against this similarly for anything else).

So for me, the bigger question is where should the license files be stored? The PEP currently says there should be a licenses_files sub-directory in .dist-info. While I would drop the _files suffix, is that location okay for people? Would people prefer a top-level .licenses directory instead like .data (as covered by the PEP)? And if people do, would the files get copied to disk or not?

My usage of metadata is more to do with dependencies than license data, so I don’t really care except in theory, but what about sdists? With PEP 643, someone could be looking at metadata in a sdist. And in theory, PyPI could implement a new JSON API that returned a package’s metadata. My point is that the PEP can’t realistically define where to find licenses in all cases, so it needs to state a general approach, and maybe some specifics for common cases like wheels, installed packages, and sdists.

Maybe it could say that the license-file metadata should be a string (in the format of a filename, if we want to make that explicit) and the file format, API, or tool exposing the metadata must state how the field should be mapped to the actual license text. The PEP would then also propose the relevant additions to the wheel, sdist and database of installed packages specs, as these are the only 3 standards that deal with metadata.

Maybe I’m being overly theoretical here? But the current metadata spec is very careful to not restrict itself to specific formats (it actually carefully avoids stating the file format, limiting itself to purely “this name has this meaning”). We could make the concrete forms more explicit, but if we do that, I’d argue that’s a much bigger change to the spec.

I think this is why CAM had it in the metadata. In the PEP the path is anchored to the top directory of the sdist.

But maybe this isn’t a problem worth solving for sdists? The SPDX expression will be in the metadata, so at least license compatibility is discoverable. And as long as wheels have a structured place to put the files then the sdist will contain the license files somewhere. And since what code is installed is going to be dependent on what wheel is produced from the sdist it might be more important to worry about wheels and package specs than structured data in sdists. Plus pyproject.toml will have the paths, so tools can read that to try and get the license files in the sdist.

Thanks for the great insights and discussion!

I should have explicitly mentioned it before, and it looks like Brett already read them and pointed out the more important ones, but for the benefit of others: I discuss a number of the alternatives for storing license files in detail in the License File Paths section of the Rejected Ideas. These include continuing to flatten license files with name conflicts, continuing to dump them in .dist-info, using another scheme (such as embedding) to resolve conflicts, adding a new license category in the .data of the wheel, and naming the subdirectory licenses instead.

That’s a really good question, and one I didn’t fully consider myself initially, since I only added the proposed license_files subdir near the very end of the process of redrafting the PEP. It was discussed at some length and included in the author’s original draft of the PEP as without the License-File fields, there’s no way for consumers to reliably discover the license files present in a wheel or installed project, since they are dumped directly in .dist-info.

With the license_files subdir in wheels and installed projects, assuming RECORD is equally accessible alongside the metadata, consumers could determine the included license files and their paths as would be listed in License-File fields by simply inspecting RECORD or listing the contents of the license_files subdir, so License-File in the core metadata isn’t strictly required in those cases.
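As a rough illustration of the "just list the subdir" approach, here is a minimal sketch that assumes the proposed layout of `<name>-<version>.dist-info/license_files/<original path>` inside the wheel. The function name and the demo archive are made up for the example; nothing here comes from an existing tool.

```python
import io
import zipfile


def list_license_files(wheel_file):
    """Return License-File-style paths found under *.dist-info/license_files/."""
    found = []
    with zipfile.ZipFile(wheel_file) as wheel:
        for name in wheel.namelist():
            parts = name.split("/")
            # Assumed layout: <name>-<version>.dist-info/license_files/<original path>
            if (
                len(parts) >= 3
                and parts[0].endswith(".dist-info")
                and parts[1] == "license_files"
                and parts[-1]  # skip directory entries, which end in "/"
            ):
                found.append("/".join(parts[2:]))
    return sorted(found)


# Build a tiny in-memory stand-in for a wheel to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("demo-1.0.dist-info/METADATA", "Name: demo\n")
    demo.writestr("demo-1.0.dist-info/license_files/LICENSE.txt", "MIT")
    demo.writestr("demo-1.0.dist-info/license_files/vendor/LICENSE.txt", "BSD")
buffer.seek(0)
print(list_license_files(buffer))
```

Note that this needs the whole archive (or at least its central directory) to be available, which is exactly the limitation for metadata-only consumers.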

However, the same is not true of sdists (and, nominally, the source tree). As you mention, license files are stored right in the root sdist tree alongside the Python import package(s), the rest of the MANIFEST-included files, the package data, the data files, etc., so without License-File, there is no way for consumers to statically determine them per PEP 643 (even to package in a wheel, install a project to site-packages, etc.) from the PKG-INFO metadata. They instead would have to fall back to reading in the pyproject.toml (assuming it’s even there), parsing it, processing license-files (static paths or dynamic globs) via the specification in this PEP, finding the matching files and then compiling a de-duplicated list of relative paths. That’s a whole lot less trivial than simply reading a static list of relative paths directly from the PKG-INFO (or however else the metadata is exposed).
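To give a feel for the fallback path, here is a minimal sketch of the "match files, then de-duplicate" step. It assumes the pyproject.toml has already been parsed; the `paths` and `globs` parameters are hypothetical stand-ins for whatever the license-files key ends up containing, and the glob semantics are simply those of `pathlib`, which may not match what the PEP finally specifies.

```python
import tempfile
from pathlib import Path


def resolve_license_files(root, paths=(), globs=()):
    """Return a sorted, de-duplicated list of /-separated relative paths."""
    root = Path(root)
    matched = set()
    for literal in paths:
        # Literal paths are taken as-is, relative to the tree root.
        if (root / literal).is_file():
            matched.add(Path(literal).as_posix())
    for pattern in globs:
        for hit in root.glob(pattern):
            if hit.is_file():
                matched.add(hit.relative_to(root).as_posix())
    return sorted(matched)


# Throwaway source tree with one top-level and one vendored license.
with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp)
    (tree / "vendor").mkdir()
    (tree / "LICENSE.txt").write_text("MIT", encoding="utf-8")
    (tree / "vendor" / "LICENSE.txt").write_text("BSD", encoding="utf-8")
    result = resolve_license_files(
        tree, paths=["LICENSE.txt"], globs=["LICEN[CS]E*", "vendor/LICEN[CS]E*"]
    )
    print(result)
```

Even in this stripped-down form, a consumer needs a TOML parser, a glob implementation and filesystem access before it can produce the list.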

There are also cases where only the metadata is available but not the full RECORD, e.g. via the JSON API that @pf_moore suggests, where License-File still has value in exposing what license files are present and where they can be found, e.g. through partial HTTP reads, without downloading and introspecting the full archive. Other distribution formats, or other means of storing metadata, may not have the equivalent of a RECORD or may not store all license files in a defined directory, but the values of License-File still provide a standardized, abstract, unique string identifier for (and record of) each such legally-required file.

Good point. In other distribution formats, or other means of storing metadata, it is the responsibility of the format how (or, indeed, whether) the License-File values should be mapped to the corresponding license file contents, which may or may not be interpreted as a literal file path on disk. Those values can be viewed more abstractly as simply a unique, possibly /-separated string key, as well as a record of the required license file paths.

In the first case, the names listed in License-File could be mapped to their contents via, for example, a table with keys of the string path, or a hash of it (as a unique identifier for each file), and the values being the text file contents, or the full path to them. In the second, as mentioned above, License-File still refers to the path, relative to license_files in the wheel. That is useful for tools examining, validating, or collecting statistical data on metadata (e.g. vendor compliance, license name prevalence) without downloading every wheel, and perhaps could be used to allow reading the licenses directly via partial HTTP reads of the indicated file.
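For the database case, a minimal sketch of what such a table might look like, treating the License-File value purely as a string key. The schema, table name and helper functions are invented for illustration and are not taken from any existing application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the License-File value is just a key, not a disk path.
conn.execute(
    "CREATE TABLE license_files ("
    " project TEXT, path TEXT, content TEXT,"
    " PRIMARY KEY (project, path))"
)


def store_license(project, path, text):
    conn.execute(
        "INSERT INTO license_files VALUES (?, ?, ?)", (project, path, text)
    )


def load_license(project, path):
    row = conn.execute(
        "SELECT content FROM license_files WHERE project = ? AND path = ?",
        (project, path),
    ).fetchone()
    return row[0] if row else None


store_license("demo", "LICENSE.txt", "MIT License text")
store_license("demo", "vendor/LICENSE.txt", "BSD License text")
print(load_license("demo", "vendor/LICENSE.txt"))
```

Two files named LICENSE.txt coexist here only because the key keeps the original relative path rather than the bare file name.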

I personally preferred license_files to licenses due to it being a little more clear and explicit and matching the name of the PEP 621 key and core metadata field. But I’m happy to change it if others prefer licenses; it also avoids bikeshedding over license_files vs license-files and similar.

To clarify here, unless I’m misunderstanding (which is possible), a wheel category as suggested by @dholth maps to a directory inside .data, which in turn maps to a sysconfig prefix (purelib, platlib, include, etc). Defining a new top-level .licenses directory alongside data would be a major extension to the wheel spec, which is a much bigger incompatible change than a new category/sysconfig prefix (which in turn is a substantially larger, more incompatible change than a small tweak to the existing, non-standardized but fully compatible approach).

They would have to, or it would not satisfy the conditions of nearly all licenses, and be a major regression over the current implemented (and PEP 639-proposed) behavior.

Initially, the PEP tried to simply state that in all cases, the license file paths are relative to the directory containing the file with said metadata (the pyproject.toml in the source and sdist, the PKG-INFO in the sdist, and the METADATA in the wheel and installed project), which represents the de-facto status quo. However, besides not covering situations where the metadata is not stored as a file, it doesn’t allow for a license_files directory, to resolve the conflict and forward compatibility issues (as well as littering .dist-info with arbitrary license files) of the status quo approach.

As such, the current iteration of the PEP instead already intended to do close to what you suggest, with the License-File core metadata field (as opposed to the license-files PEP 621 project source metadata key) being defined in terms of each license file’s location relative to the designated root license directory of each format (the root of the source tree and sdist, and the license_files directory inside .dist-info of the wheel and installed projects). I initially included language recommending the same for future and non-standardized distribution formats, but removed it as insufficiently general guidance for formats that PEPs and the PyPA do not define or standardize; that is more appropriately determined if and when they are.
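Put as code, the per-format rule described here might look like the following minimal sketch. The function name and the `kind` strings are hypothetical; only the two root locations (tree root for source trees and sdists, `license_files` inside `.dist-info` for wheels and installed projects) come from the description above.

```python
from pathlib import PurePosixPath


def license_file_location(kind, license_file, dist_info=None):
    """Map a License-File value to a /-separated path for a given format."""
    if kind in ("source", "sdist"):
        # Relative to the root of the source tree or sdist.
        base = PurePosixPath(".")
    elif kind in ("wheel", "installed"):
        if dist_info is None:
            raise ValueError("dist_info is required for wheels and installed projects")
        # Relative to the license_files directory inside .dist-info.
        base = PurePosixPath(dist_info) / "license_files"
    else:
        # Other formats would have to define their own mapping.
        raise ValueError(f"no mapping defined for format {kind!r}")
    return (base / license_file).as_posix()


print(license_file_location("sdist", "vendor/LICENSE.txt"))
print(license_file_location("wheel", "vendor/LICENSE.txt", "demo-1.0.dist-info"))
```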

Per your suggestion, we could revise the text of the normative specification of the License-File field to say something a little more general, and add the caveat that its interpretation may depend on the format.

(Which it does already; I’m assuming you saw this, but just want to make sure we’re on the same page here.)

/ separated? As a Windows user, I’d naturally use \.

The format should be specified explicitly, otherwise we will get people messing this up (I speak from experience…)

I had actually missed it (sorry, I was sloppy in not fully reading the PEP before posting) but Brett pointed it out to me. So yes, this is covered.

I’m a little confused: is there a reason we would mandate \ instead of the standard / as the path delimiter for License-File? / is fully supported on Linux, Mac and Windows platforms, at least as far as the OS and correctly-written applications are concerned (outside of some niche contexts with legacy cmd commands). Whereas \ is not used on any platform but Windows (alongside /), and is the escape character in the *nix shell, in most programming languages (including Python) and in many file formats (including TOML), and / is the standard for both string and pathlib Paths on Python on all platforms. Therefore, it only makes sense to specify /, not \, as the standard path delimiter for License-File.

Already covered :wink:

From the specification for the License-File core metadata field in the PEP:

Path separators [sic] MUST be the forward slash character (/), and parent directory indicators (..) MUST NOT be used. License file content MUST be UTF-8 encoded text.

And in the specification for the license-file project source metadata key in the PEP:

Path separators [sic], if used, MUST be the forward slash character (/), and parent directory indicators (..) MUST NOT be used. Tools MUST assume that license file content is valid UTF-8 encoded text, and SHOULD validate this and raise an error if it is not.
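A minimal sketch of how a tool might enforce the two quoted rules. The function names are hypothetical, and rejecting every backslash outright is my own assumption (a backslash is technically a legal file name character on POSIX), not something the quoted text spells out.

```python
def validate_license_file_value(value):
    """Raise ValueError if value breaks the quoted path rules."""
    # Assumption: any backslash is treated as a wrong path delimiter.
    if "\\" in value:
        raise ValueError(f"backslash not allowed, use '/': {value!r}")
    # Parent directory indicators MUST NOT be used.
    if ".." in value.split("/"):
        raise ValueError(f"parent directory indicator not allowed: {value!r}")
    return value


def read_license_text(raw_bytes):
    """Decode license content, raising an error if it is not valid UTF-8."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("license file content is not valid UTF-8") from exc


print(validate_license_file_value("vendor/LICENSE.txt"))
print(read_license_text("Copyright © Example".encode("utf-8")))
```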

Technically, the language for both should read path delimiter, not separator, since the path delimiter (os.sep: / on POSIX and \ on Windows, where / is also accepted as os.altsep) separates components of a single path, while the path separator (os.pathsep, : on POSIX and ; on Windows) is used to separate multiple paths in the $PATH, though colloquially the former is often referred to as the latter. I’ll correct that in the next pass.
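For reference, the standard library exposes both constants for each platform flavor regardless of where the code runs, so the distinction is easy to check. This is just a quick sketch; the sample path is made up.

```python
import ntpath
import posixpath
from pathlib import PureWindowsPath

# Component delimiter vs. multi-path separator on POSIX.
print(posixpath.sep, posixpath.pathsep)

# On Windows, "\" is os.sep and "/" is the alternative delimiter (os.altsep).
print(ntpath.sep, ntpath.altsep, ntpath.pathsep)

# Normalizing a Windows-style path to the "/" form the PEP mandates.
normalized = PureWindowsPath(r"licenses\vendor\LICENSE.txt").as_posix()
print(normalized)
```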

Well, considering its length I think you can be forgiven for that, heh… I’m working on getting it down to about a third of its original length by moving non-critical sections out of the PEP itself and into supporting documents (which will be fully possible if and when PEP 676 is accepted), editing to reduce verbosity, re-organizing it, eliminating the huge references section in favor of inline links (already done) and cutting down the glossary to just the terms not defined in the PyPA one (and we can hopefully move the generally useful definitions there for the future).

No, sorry I was being a bit glib, trying to say that “it’s a pathname” isn’t sufficiently precise by itself. But as you say, the PEP has this covered properly.

Meh, I’d get cross at people making comments without checking, so I apologise anyway for not meeting my own standards, if nothing else :slightly_smiling_face: Thanks for your patience.

That’s true of a lot of things with sdists :sweat_smile:. But also note that technically reading the license from an sdist and not the built and installed wheel could lead to incorrect results as what gets ultimately installed may not match the licenses specified by an sdist (both by over- and under-fitting the licenses).

Yep, although I have not even gotten to the point of discussing globs for pyproject.toml yet. :grin:

Depends on whose triviality you’re optimizing for. It’s easier to read the file paths from PKG-INFO if you find reading that file easier than pyproject.toml; that’s not always true as your license processing tool may not have a library available to read its format as nicely as TOML (e.g. Microsoft’s tool is written in TypeScript). There’s also the concern of the complexity of having more to implement for tool authors who have to accept and implement this PEP, else users won’t even have a chance of using it. So from my perspective it is not clear-cut that one way is more trivial than the other.

Hehe, very true of course. But I don’t think this means we can’t improve things here.

Hmm, I’d be curious to hear more on this point. If the source license-file metadata is not labelled dynamic via PEP 643 in the sdist (otherwise it cannot be trusted regardless), and thus can be considered static, and the specification is followed, isn’t there a 1:1 correspondence between license files in the sdist and in the wheel? The license files at the specified paths or globs in the source tree MUST have been included in the sdist at those same paths, and there wouldn’t be any additional files that would match in the sdist but not the source tree.

Unless, I suppose, the build system injected arbitrary files into the sdist from which the wheel was built and happened to match a glob, and relied on the original source PEP 621 metadata (or tool-specific equivalent) instead of the declared static sdist License-File metadata and included exactly those files—in fact, it is License-File that allows them to statically do this. If this is a concern, couldn’t we address this in the specification, by requiring build tools use the static sdist metadata, mark License-File as dynamic and/or exclude their own generated files unique to the sdist when matching license files? Or is there another way this could happen that I’m overlooking?

That’s a good point, though for at least this field specifically, shouldn’t it be nearly as simple as

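FIELD_NAME = "License-File:"
license_files = []
with open("PKG-INFO", encoding="utf-8") as pkg_info:
    for line in pkg_info:
        # A blank line ends the header block; lines keep their "\n", so strip first.
        if not line.strip():
            break
        if line.startswith(FIELD_NAME):
            # str.removeprefix requires Python 3.9+.
            license_files.append(line.removeprefix(FIELD_NAME).strip())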
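Or, for tools that are in Python anyway, a minimal sketch using the standard library's email header parser, since PKG-INFO follows that header-style format. The sample metadata is hypothetical and the helper name is made up; the point is only that the parser stops at the blank line and handles repeated fields for us.

```python
from email.parser import HeaderParser

# Hypothetical PKG-INFO content; the body deliberately contains a look-alike line.
SAMPLE_PKG_INFO = """\
Name: demo
Version: 1.0
License-File: LICENSE.txt
License-File: vendor/LICENSE.txt

Long description body, which may itself contain lines like
License-File: not-a-real-field
"""


def license_files_from_metadata(text):
    """Return all License-File header values, ignoring the message body."""
    message = HeaderParser().parsestr(text)
    # get_all returns None when the field is absent.
    return message.get_all("License-File") or []


print(license_files_from_metadata(SAMPLE_PKG_INFO))
```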

There’s also something very important I missed: that assumes that metadata is stored in PEP 621 form in the pyproject.toml in the first place, which is likely not going to be the case for a while since major tools have not yet adopted it and user adoption is even further behind, whereas the License-File field will be available immediately with a simple tool update.

There’s also the concern of the complexity of having more to implement for tool authors who have to accept and implement this PEP, else users won’t even have a chance of using it.

Yeah, fair, but tools are already generally doing essentially all of that anyway in order to embed license files in wheels, typically via (at user option) either the default globs or explicit custom paths. It’s a much more constrained number of tools to wrangle, and I’ve examined several existing implementations and am willing to help tool authors make, test and verify the relatively minimal necessary tweaks to follow the refined specification here, both on the input (PEP 621, if supported, otherwise likely no changes needed) and output (wheel) sides.

Not necessarily; it’s up to us and what we put in this PEP (if it gets accepted). For me, the point of dynamic is saying a field’s value cannot be statically known upfront. Now the potential list of licenses can be known upfront in a source tree/sdist via pyproject.toml, but it may list extra licenses if the resulting wheel doesn’t actually need them all. Plus the licenses are accurate for the source tree/sdist.

Or put another way, I think it’s fine if the licenses listed in a source tree or sdist don’t all make it into a wheel, but it should accurately reflect what licenses are necessary for the sdist/source tree itself.

There’s also a point to be made that if there isn’t a license-file in core metadata that it doesn’t really fall under dynamic’s scope.

(Aside: maybe this complexity is why crates.io doesn’t let you specify both a license file and SPDX license expression?)

I have learned to never assume how “simple” something could be for someone else.

I’m fine with that. I care about not making anything worse now and making sure things are better in a decade.

Once again, be careful about making assumptions about how “minimal” a change is. I personally hate it when people come into my projects and claim something is “simple” and will “only take 10 minutes” when there’s way more to doing a change than updating some code.

Providing a mechanism to enable different licenses for sdists and wheels was discussed in the last thread, and while it would be useful for some specific cases, the consensus was that due to its niche applicability and the relatively high degree of complexity/major changes needed to implement it in tools and particularly on PyPI (which has no concept of distribution-specific metadata), it is considered out of scope for this PEP (as mentioned therein). Thus the specification in this PEP ensures that License-Expression and License-File (and the corresponding actual license files) are consistent between sdists, wheels and installed projects.

Indeed, but that would mean it cannot be relied upon at all, and there would be no way to indicate that it could be.

Actually, that’s more an orthogonal issue, that the license expression and license file(s) should match, and could be inconsistent if they don’t. We’re treating them a little differently here, in that License-Expression lists that project’s license expression according to the authors, while License-File lists the paths of the files to be included in order to conform to the requirements of said license.

The directly analogous issue is package indices not really providing for different distribution artifacts from the same release having different metadata (i.e. PyPI only exposes one set in the UI, API, etc, and would require a significant re-work and complexity to accommodate the proposal of having different licenses for different artifacts).

Well, unfortunately it’s not a foregone conclusion that tools and users widely adopt PEP 621, given Poetry’s resistance and how long it is taking to get it implemented in Setuptools, and that many/most packages are still stuck on a non-declarative setup.py and don’t have a pyproject.toml at all, long after PEP 518 was accepted and Setuptools added full support for declarative configuration in setup.cfg, whereas with License-File as proposed, users don’t have to do anything but update their packaging tools.

I’m also a little unclear how including static License-File metadata makes things any worse now—I know @pf_moore raised some points that should be clarified in the spec to abstract it a bit from being explicitly tied to a physical file path on disk, but I wasn’t aware of a way in which this was actively harmful—maybe I missed something in the back-and-forth?

Sure, you’re right—many/most of the changes won’t be tweaks to the code, but all the other things that go into making a change (validation/UI text, tests, reviews, docs, changelog, etc), and I also had more the wheel changes in mind than adding License-File, which indeed is not a huge change but is certainly not “minimal”. Someone already implemented this in Setuptools some time ago based on the previous draft of this PEP; while it has other changes conflated, the PR gives a decent practical idea of what it would take to implement in a large, well-established and fairly high-techdebt project. Of course, it’ll be different for everyone, but it at least gives us a useful real-world baseline.

NB, I’ve just pushed a new (much smaller :smile: ) PR, which:

  • Refines and clarifies the language around License-File per @brettcannon and @pf_moore 's suggestions and the related discussion
  • Removes the speculative “Future PEPs” section and cuts down the Terminology section as requested by @pradyunsg , particularly the introduction and the parts that overlap with the PyPUG, cutting down the PEP’s length by around a full page
  • Moves the User Scenarios section to an appendix as suggested by @pf_moore , reducing the length of the main PEP by another two pages

If and when PEP 676 is accepted, I will be able to move the appendices to separate linked auxiliary documents outside the PEP but still hosted in the PEP repo, which (combined with collapsing the ToC) further reduces the total PEP length by nearly two-thirds and the length (number of entries) of the ToC by over three-quarters.

I’ve tagged @brettcannon and @pf_moore for review. Once merged, a followup PR will make the PEP 621 changes decided upon above, namely removing the separate license-expression key and making that the flat string value of the license key instead, updating the Converting legacy metadata guidance section to reflect it cannot be converted during build if specified in project, and simplifying and updating the rest of the PEP accordingly (which should shorten its length by a fair bit more, particularly the normative Specification section).

And that’s fine if people choose not to inspect the wheel for a possible narrower set of license files. If this comes down to a bunch of files in a specific directory then there’s no real concern here.

:person_shrugging: Practicality beats purity.

Welcome to packaging specs, where you try to do the right thing while balancing for adoption and hoping for acceptance. :wink:

:tada:

As the PEP’s author and a PEP editor, you don’t need to wait on reviews from either of us (and honestly you should probably leave Paul off as he’s made it my problem instead of his :wink:).

I’m going to hold off on commenting further until the shorter PEP is in and I can give it a complete read. I also suggest starting another topic as this draft will be different enough to want to make sure it’s surfaced to anyone who has muted this topic already.

I don’t mind a reset either; I’m not entirely certain we aren’t debating the same side at this point and just don’t realize it, and I’m starting to get a bit lost myself as to the direct connection with the current PEP :smile:

Right, but I’m not sure how that’s practically possible, both in the current implementations (which only have one static license_files setting) and following the specification in this PEP (which has a static license-files PEP 621 key, and specifies that License-File must be consistent, and the files preserved, between sdists, wheels and installed projects). And if License-File is marked static in the sdist per PEP 643, tools shouldn’t be tinkering with it, and otherwise all bets are off anyway. I’m also getting a little lost myself on how this is a blocker to having License-File.

Indeed it does, which is why I’m struggling to understand the practical issue, when both current implementations and the proposed spec appear to preclude the inconsistency between License-File in the sdist and the files actually included in the wheel that could create a possible problem for tools relying on static metadata versus dynamic introspection.

Sorry; since it was based off his concern and attendant suggestion, as part of a discussion with you, I wanted to give you both a chance to weigh in if you wanted, but I went ahead and merged it (as you saw). Next up: the license-expression → license project source metadata key change.

FYI at this point I’m just waiting for @CAM-Gerlach to tell me he is done with edits and is ready for me to do a thorough review.

@CAM-Gerlach Is the PEP ready for review?