Improving license clarity with better package metadata

Yes, this is entirely out of scope for this PEP, though it is mentioned when discussing the documentation of “license in code files”. The best solution there is, IMHO, the SPDX-License-Identifier approach, which I helped deploy in the Linux kernel.
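For readers unfamiliar with the SPDX-License-Identifier convention: each source file carries a one-line comment tag naming its license. A minimal sketch of how a tool might read that tag is below. The function name `find_spdx_identifier` and the regular expression are illustrative assumptions, not part of the PEP or of any existing tool.

```python
import re

# Matches a tag line such as "# SPDX-License-Identifier: MIT".
# Illustrative pattern only; real scanners handle more comment styles.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:[ \t]*(.+)$", re.MULTILINE)


def find_spdx_identifier(source_text):
    """Return the license expression from the first SPDX tag, or None."""
    match = SPDX_TAG.search(source_text)
    if match is None:
        return None
    expression = match.group(1).strip()
    # Drop a trailing C-style comment terminator, as in "/* ... */".
    if expression.endswith("*/"):
        expression = expression[:-2].strip()
    return expression


python_file = "# SPDX-License-Identifier: MIT\nimport os\n"
c_header = "/* SPDX-License-Identifier: GPL-2.0 */\n#include <stdio.h>\n"
print(find_spdx_identifier(python_file))
print(find_spdx_identifier(c_header))
```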

:+1: … would you prefer to see these examples inline, or in an appendix section?

Thanks for the replies!

The various examples built from the example of pip are the only things I would think should go in an appendix if you include them (because the pip example is so involved). In the abstract, the ideas themselves are pretty easy to describe and don’t need to be left to an appendix. Having said that, I think the pip examples might be too complicated. You could communicate the same thing with a package that vendors three or four libraries instead of pip’s more than a couple dozen.

Regarding the use cases in the abstract, I think it might be good to have a “scope” subsection towards the beginning describing what’s in and out of scope for the PEP, and then you can put more detailed info about out-of-scope use cases in a “Rejected ideas” section per PEP 1: https://www.python.org/dev/peps/pep-0001/#what-belongs-in-a-successful-pep. The out-of-scope use cases can include the ideas of (1) mapping licenses used in the license expression to specific files in the license files (or vice versa), and (2) mapping licenses to specific source files and/or directories of source files (or vice versa).

That’s my concern with reusing the License field – it’s using metadata that was not written with the assumption of being a strict, unambiguous bit of information and treating it as a strong compliance metric.

But then again, pip’s almost a special snowflake and I basically have very little skin in this game of legal, so… someone who actually is affected by this needs to confirm how much false positives affect them. :man_shrugging:t2:

I also have concerns about reusing the License field: as long as it still supports free-form text, it would prevent us from being able to ever strictly validate this field, and it would make it impossible for us to discern whether a given string is a valid license expression with a small typo, or just some free-form description. IMO it would be better if we could strictly validate this field from day 1, and this requires a separate field.
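To make the validation concern concrete, here is a rough sketch of what strict validation could look like. It assumes a deliberately simplified grammar (identifiers, `AND`, `OR`, `WITH`, and parentheses) and a tiny hard-coded set of identifiers; a real validator would use the full SPDX license list and grammar, including the `+` operator, which this sketch does not handle. All function names are hypothetical.

```python
import re

# Tiny illustrative subsets; a real validator would load the full SPDX lists.
KNOWN_LICENSE_IDS = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-2.0-or-later"}
KNOWN_EXCEPTION_IDS = {"Classpath-exception-2.0"}
TOKEN = re.compile(r"\(|\)|[A-Za-z0-9.+-]+")


def tokenize(text):
    """Split text into tokens, or return None if stray characters remain."""
    tokens = TOKEN.findall(text)
    if "".join(tokens) != "".join(text.split()):
        return None  # e.g. a comma, which free-form text often contains
    return tokens


def _parse_or(tokens, pos):
    pos = _parse_and(tokens, pos)
    while pos != -1 and pos < len(tokens) and tokens[pos] == "OR":
        pos = _parse_and(tokens, pos + 1)
    return pos


def _parse_and(tokens, pos):
    pos = _parse_atom(tokens, pos)
    while pos != -1 and pos < len(tokens) and tokens[pos] == "AND":
        pos = _parse_atom(tokens, pos + 1)
    return pos


def _parse_atom(tokens, pos):
    """Parse a parenthesized expression or 'ID [WITH EXCEPTION]'."""
    if pos >= len(tokens):
        return -1
    if tokens[pos] == "(":
        pos = _parse_or(tokens, pos + 1)
        if pos == -1 or pos >= len(tokens) or tokens[pos] != ")":
            return -1
        return pos + 1
    if tokens[pos] not in KNOWN_LICENSE_IDS:
        return -1
    pos += 1
    if pos < len(tokens) and tokens[pos] == "WITH":
        if pos + 1 >= len(tokens) or tokens[pos + 1] not in KNOWN_EXCEPTION_IDS:
            return -1
        pos += 2
    return pos


def is_valid_license_expression(text):
    """True only if the whole string parses as a license expression."""
    tokens = tokenize(text)
    if not tokens:
        return False
    return _parse_or(tokens, 0) == len(tokens)


for value in ["MIT OR Apache-2.0", "MTI", "MIT license, see LICENSE.txt"]:
    print(repr(value), is_valid_license_expression(value))
```

Note that the validator rejects both `MTI` (a likely typo) and the free-form sentence in the same way, which is exactly the ambiguity described above when a single field accepts both kinds of content.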

In my experience, very few projects are using valid SPDX identifiers in the License field. I may try to aggregate the metadata to get some statistics on this if folks are interested.

Would we make this new field mandatory as well? If it were optional, I think most projects would end up not using it.

Here are the 4589 different values used in the ~180k projects on PyPI (I ran the numbers about 2 months ago). The “MIT” SPDX identifier appears almost 46k times, but it’s probably a bit of a coincidence, not something that was done on purpose. The vast majority of values seem to not be valid SPDX expressions, it is true.

We could eventually make it mandatory in later metadata versions, but I don’t think we can do that until we’ve given sufficient time to deprecate the license classifiers and inform authors that we will begin requiring this field.

This file seems to be password-protected.

OK. Should we encourage packaging tools (setuptools, flit, poetry…) to start adding warnings ASAP, and later have warehouse refuse uploads if the License-Expression is invalid/not present?

Sorry, I edited the link.

Could it be made mandatory if License-File is present, because that field is new? (Or maybe not, since some people are already providing it for a different purpose.)

Yes, it can be.

Further, these new fields could also be mutually exclusive with the license classifiers and the License metadata field (use either License + classifiers, or License-File + License-Specifier).

Then, we can mark License as deprecated with this PEP, recommending the use of License-File + License-Specifier, and suggest that tooling should start warning on the use of License or classifiers.
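As a rough illustration of this approach, a tool could check the two groups of fields roughly as follows. This is only a sketch: the metadata is modeled as a plain dict, `check_license_fields` is a hypothetical name, and the new expression field is called `License-Expression` here as elsewhere in the thread (this post calls it License-Specifier).

```python
import warnings

# Assumed field names; the expression field name is still under discussion.
NEW_FIELDS = ("License-File", "License-Expression")


def check_license_fields(metadata):
    """Enforce 'legacy fields or new fields, never both'; warn on legacy."""
    classifiers = metadata.get("Classifier", [])
    uses_legacy = bool(metadata.get("License")) or any(
        c.startswith("License ::") for c in classifiers
    )
    uses_new = any(metadata.get(field) for field in NEW_FIELDS)

    if uses_legacy and uses_new:
        raise ValueError(
            "Use either License + classifiers or the new license fields, not both"
        )
    if uses_legacy:
        warnings.warn(
            "License and license classifiers are deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
        return "legacy"
    return "new" if uses_new else "none"


print(check_license_fields({"License-File": ["LICENSE"], "License-Expression": "MIT"}))
```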


TBH, I like the approach I just outlined above. :slight_smile:

While I initially was leaning towards a separate field, it makes more sense to me to reuse the License field.

Yes, it would be hard to tell free text apart from a license expression with a typo, but this would still be possible in many cases (say you made a typo in a license id and not in an “and/or/with” keyword).

There are also several benefits besides this: there is immediate backward compatibility.
And this is much simpler for users and authors to understand: there is a single license metadata field.

We would first have soft validation and warnings, and later (say about one year after this has been adopted by major tools) we would start doing stricter validation, possibly rejecting incorrect values in PyPI and other publishing tools.
I foresee a lot of confusion if we end up with not two but three license metadata fields during a transition period.

That said, if there is a consensus that this is a better approach, I will update the draft PEP accordingly.

That could work, but I am concerned (perhaps unduly) by the introduction of a new field for the same concept. Looking at how things worked out with npm, which reused its license field and made it structured with a nag message, soft validation worked rather well and has been minimally invasive. npm packages now typically have a higher proportion of properly declared licenses in their manifests (per https://clearlydefined.io/stats and an unpublished study I made) than any other major application package type.

My preferred way would be simpler: warn users about an invalid License and about the use of a license classifier for as long as it makes sense (say a good year). Then start doing stricter validation.
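A minimal sketch of this two-phase idea, assuming a `strict` flag that tools would flip once the warning period ends. The function name and the toy validity check are hypothetical; any real check of license expressions would be plugged in as `is_valid_expression`.

```python
import warnings


def check_license_field(value, is_valid_expression, strict=False):
    """Warn on an invalid License value, or reject it once strict is enabled."""
    if is_valid_expression(value):
        return True
    message = f"License {value!r} is not a valid license expression"
    if strict:
        # Phase two: publishing tools reject the value outright.
        raise ValueError(message)
    # Phase one: nag the author but let the build or upload proceed.
    warnings.warn(message, UserWarning, stacklevel=2)
    return False


# Toy validity check for illustration only.
def toy_check(value):
    return value in {"MIT", "Apache-2.0"}


print(check_license_field("MIT", toy_check))
print(check_license_field("MIT license", toy_check))  # also emits a UserWarning
```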

Yet again, if there is a consensus emerging here that a new field is really a better solution, I will switch to that.

That would be the way forward alright. (Preferably reusing License :slightly_smiling_face: )

Actually license_file is already in use in wheels and setuptools and license_files is in use in wheels, so these are not new. They are just specified in the draft PEP because they are in use but were never specified.

Ah right. I notice now the PEP does mention this.

Thanks for the clarification.

@cjerdonek the latest push addressed this comment and more https://github.com/pombredanne/spdx-pypi-pep/pull/2/commits/0a22b857ea8b8649cc16dfed1b5ccdd9333f20dd

I also addressed your comments in the PR with this commit: https://github.com/pombredanne/spdx-pypi-pep/pull/2/commits/da5ea2708cca19f3dd75f01256e8e37d4bf8d7cc

Thanks! I did a quick skim but I don’t really have time right now to look into it in detail.

I posted an update to the draft PEP at https://github.com/pombredanne/spdx-pypi-pep/pull/2 after feedback from @ncoghlan

The key pending question is whether there is a consensus on using a separate, new License-Expression field or on reusing License. (I am very much in favor of the latter option, i.e. reusing License.)

@ncoghlan seems to like this too so far.

@pradyunsg @pf_moore @cjerdonek @dustin could you post your latest thinking on this?

I like the way the proposal is going. In terms of process, I think you’ve done a great job of responding to comments and concerns, and I don’t have any concerns of my own remaining at this point.

I don’t honestly have an answer on License-Expression vs License. I’ve no real feel for what consumers of this new metadata will look like, and it’ll mostly be them that will be affected by the choice. So I think you should be looking for input from people who expect to write code that uses the data - but I don’t know whether any of the current participants in the discussion will be doing that.

Good work on the proposal thus far!

I don’t have a strong opinion on this, so I’ll defer to others. (My earlier post was a suggestion rather than an expression of preference.)
