Improving license clarity with better package metadata

I am reaching out for feedback to:

I will be reaching out also to Linux distro Python package maintainers. They are in the first line as metadata consumers when packaging (massively) Python packages

I posted an update to the draft PEP at https://github.com/pombredanne/spdx-pypi-pep/pull/2 and https://github.com/pombredanne/spdx-pypi-pep/blob/metadata-2.2-spdx-license/pep-9999.rst and we now use LicenseRef-Public-Domain and LicenseRef-Proprietary that are SPDX supported rather Public-Domain and Proprietary as suggested by @ncoghlan and Oleg Grenrus (that kindly chimed in bringing his perspective and experience with the use of SPDX license ids and expressions in Haskell cabal )

At this stage I think we are almost there to move this to an official draft PEP. There are two last objections/concerns to resolve though:

  1. should we freeze the SPDX license list version used in the metadata (e.g. the list of license identifiers) See https://github.com/pombredanne/spdx-pypi-pep/pull/2#discussion_r330419938

  2. should we reuse the license field or add a new license-expression field in the metadata See https://github.com/pombredanne/spdx-pypi-pep/pull/2#discussion_r330425388

Comments on this are welcomed to reach a consensus on these two!

2 Likes

@pombredanne I’ll request you to create dedicated issues for those concerns on your repository and link to them from here – that’ll let us have much more scoped discussions and figure things out without adding to an already long thread. :slight_smile:

Here is is:

1 Like

@pf_moore At this stage I think the document at https://github.com/pombredanne/spdx-pypi-pep/pull/2 is ready to be promoted to a draft PEP.

It has gathered a good feedback and all objections and alternative have been generally well considered and discussed.

Any feedback or objections on your side?

Nope, I’m fine with that. I’ve been watching the discussions and the comments against the PEP and I’ve been very happy with the level of discussion and feedback. Please submit your PEP as a PR against the core PEPs repository and one of the PEP editors will assign it a number and merge it.

In practical terms, a lot of the discussion and review has already happened, as far as I can see, so I would expect that the next stage should be relatively straightforward. The PEP already has Discussions-to set to here, so I suggest updating this thread with the assigned PEP number, and requesting any further review comments or queries be posted here. You could either summarise the changes that have occurred since the original posting, or re-post the latest version of the PEP, whichever you feel is most reasonable. I’d expect there would be relatively few remaining comments to be made, but we can see.

Once there’s been some time for any further comments to be raised and addressed, and you’re happy with the state of the PEP, let me know it’s ready for a final decision.

3 Likes

I’m trying to catch up on OSS work, and it appears this PEP is still in the draft phase, so hopefully these comments are still useful here.

We already have a history with a field in packaging metadata that we attempted to interpret as a syntax and then had to fall back to reading as plaintext-- the long description. One of the problems with that field is that people would make minor syntax errors and nothing would fail, just they would get an unrendered text. We had no mechanism to make this better in the short term, because we had no way to differentiate between “This is invalid because I made a mistake, and this is invalid because it’s not in that actual syntax”. Even worse is when we would be able to parse the input without error, but it actually wasn’t meant to be that kind of input, so we would just end up with a nonsense result that wasn’t what anyone intended or wanted. We eventually added a field that would allow people to indicate how we should parse their content, to override this guess-and-fail fallback.

So learning a lesson from that, I would be really hesitant to add yet another field that would have all of the same problems (this time perhaps with legal ramifications if someone took a accidentally correct but not actually accurate license field at face value). So I would strongly encourage either adding a separate field for the license expression syntax, or another field that indicates that the license field should be interpreted as a license expression.

I’d also suggest that we have multiple levers of “required” we can leverage. The hardest one to lever is in the client side tooling, because changing those tools to require fields (new or otherwise) means we immediately invalidated every existing package that didn’t have that field already. However we have another lever in that we can make fields mandatory on PyPI, but at the metadata level optional. This means that for bulk of published packages, this field would be mandatory since most published packages get published on PyPI, but the entire corpus of existing packages, as well as private packages remain unaffected.

2 Likes

Hi Donald:
Thank you for chiming in!

I have the OK from @pf_moore to move this to an official draft and it only has been delayed by the holidays so your comments are still very much useful and timely. And even if they would have been coming later, this will still be a draft!
Your point about “This is invalid because I made a mistake, and this is invalid because it’s not in that actual syntax” which is furthermore informed by your experience on PyPI and the PyPA tools is important and valid. Let me reply in details tomorrow.

1 Like

@dstufft you wrote:

This is a serious concern but IMHO the long_description field issue came in part because the PyPI-accepted reStructuredText syntax was a less common subset of the main ReST spec and that the field was never validated by client tools or by PyPI (e.g. setuptools never warned that a long_description was not valid PyPI ReST syntax and a PyPI upload was never rejected nor triggered a warning because the field was not valid ReST). So long_description was a field with a structure (restricted ReST syntax) or no structure at all and both being valid and there never was any validation done or possible. In this context, the solution to use an extra field to specify its format makes perfect sense to me.

There are several differences with the license and using license expression in that field though: there is only one license expression syntax and with this PEP the license field would always be validated and an invalid field would always trigger a warning either because of invalid syntax or an invalid or unknown license identifier used in an expression.

Let me review the two cases:

  1. Say we have such a new field, for instance a new boolean field is_spdx_expression defaulting to False if absent.
    1.1. If False, a tool should validate the syntax and issue a warning if the syntax is invalid. If the syntax is correct and there are unknown license ids, it would issue a warning too. (e.g. it could reference a new SPDX license id that is not yet known to the tool doing the validation).
    1.2. If True, license field should be a valid SPDX license expression. If syntax is not correct a tool should report an error. Otherwise, a tool should report a warning if a license is unknown.
  2. Say I am instead not having such a field, then the process would be exactly the same as in 1.1

Therefore, a new is_spdx_expression-like field purpose would be to decide if an expression syntax error must be a warning or an error and I am not convinced that this difference would be enough to add such a field.

Applying validation on PyPI makes 100% sense. Do you think a specific reference in the PEP would be needed?

I would see things evolve this way once this PEP is accepted:

  1. PyPI is updated to provide some warning feedback on the license field and that for a period of one year (TBD). After that transition period, uploading new packages without a valid license expression would be rejected. The PyPI Web UI could also leverage the structured license expressions to provide more informative data.
  2. Separately, client tools are updated to support license expression validation and report warning on the license field as needed. They can decide to make it mandatory or optional as they like. I would expect that they would eventually align with PyPI validation requirements.

Would this make sense?

You also wrote:

This is not an issue IMHO. We are not making any legal claim that a license field – even if syntactically correct – has any legal value whatsoever. This is just a bit of user provided metadata which may be correct or not. If validated as a correct license expression, then users are free to attach it the SPDX-specified meaning. Or not. This is not different from the current classifiers that may be correct or not and correctly populated by a package author or not. We cannot control nor provide any guarantee for that.

1 Like

So I definitely think that the PyPI issue was exacerbated by the particulars of PyPI’s ReST rendering, though I still think it applies here. I don’t think you can assume a warning is going to be emitted and if it is, that people are even going to read the warning.

PyPI itself doesn’t have a mechanism to report warnings back about an upload, it can either accept the upload or it can fail the upload.

That leaves only client side warnings, however, even if people added those checks today, it would still be several years until they broadcast out to people. I think that distutils warned about missing fields (URL maybe? I forget what one) for like a decade and people were still shipping packages that had them missing and/or malformed. Some part of that is because people generally just ignore warnings, another part of that is that for setuptools at least the output is super verbose and things get lost easily, see for example a very simple test package:

$ python setup.py bdist_wheel
/Users/dstufft/.pyenv/versions/3.7.3/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.05.07.01' to '2019.5.7.1'
  normalized_version,
running bdist_wheel
running build
installing to build/bdist.macosx-10.14-x86_64/wheel
running install
running install_egg_info
running egg_info
writing dstufft.testpkg.egg-info/PKG-INFO
writing dependency_links to dstufft.testpkg.egg-info/dependency_links.txt
writing top-level names to dstufft.testpkg.egg-info/top_level.txt
reading manifest file 'dstufft.testpkg.egg-info/SOURCES.txt'
writing manifest file 'dstufft.testpkg.egg-info/SOURCES.txt'
Copying dstufft.testpkg.egg-info to build/bdist.macosx-10.14-x86_64/wheel/dstufft.testpkg-2019.5.7.1-py3.7.egg-info
running install_scripts
creating build/bdist.macosx-10.14-x86_64/wheel/dstufft.testpkg-2019.5.7.1.dist-info/WHEEL
creating 'dist/dstufft.testpkg-2019.5.7.1-py3-none-any.whl' and adding 'build/bdist.macosx-10.14-x86_64/wheel' to it
adding 'dstufft.testpkg-2019.5.7.1.dist-info/METADATA'
adding 'dstufft.testpkg-2019.5.7.1.dist-info/WHEEL'
adding 'dstufft.testpkg-2019.5.7.1.dist-info/top_level.txt'
adding 'dstufft.testpkg-2019.5.7.1.dist-info/RECORD'
removing build/bdist.macosx-10.14-x86_64/wheel

Ironically that has a warning and I didn’t even realize it until I was proof reading this post, because it just blended into the noise of producing a package-- that’s possibly a setuptools problem and we can decide that we don’t think it’s worth doing something different because of it, but I think it is worth at least thinking about the chances someone is even going to see said warning.

I think that if we actually want to ensure that people are putting well formed data in, then we need to validate that field in a way that will actually fail. Part of the transition of doing that can involve a period of time when we only warn, but ultimately I think well formed data needs to be hard validated somewhere in the pipeline, but we can’t really do that with the PEP as stands.

Personally I’d lean towards adding a new field for the data itself, something like License-Expression or even SPDX-Expression or something. I’m not a huge fan of boolean fields and I don’t see us ever having a second kind of license expression where we might want the ability to toggle between them like we do with long description. I think that gives us a very clean path forward:

  1. PyPI immediately starts validating this field, there will never be a published package that has an invalid field here.
  2. Packaging tools start warning if this field is omitted, and can also validate internally that it’s valid.

We probably could never make the field mandatory in the packaging spec itself (e.g. pip should be able to install a package without it, build tools should be able to build a package without it), but I could very easily see an argument that at some point PyPI should require it for uploads.

2 Likes

@dstufft @sumanah this would be another area where the ability to warn, as part of an async upload mechanism, would be useful.

1 Like

Gentle ping to @pombredanne on this - it looks like you never submitted this for a PEP number. Do you need help with the process, or is it just a matter of other things taking priority (quite likely in the current situation!)?

1 Like

Another gentle nudge to @pombredanne.

If you won’t have time to take this up anytime soon, please let us know. Also, if you are OK if someone else champions this going forward, please state that as well - it might help unblock this task. :slight_smile:

1 Like

This is just a matter of priorities. I will restart work on it to get closure and a proper PEP in the pipes

@pradyunsg same. Thank you guys for reminding me!

5 Likes

Any update on the PEP submission?

1 Like

I should be done this week!

4 Likes

Hello. I came here because I’d like to have an optional field License-Files exposed in core metadata. My understanding is that it would require writing a PEP, bumping the metadata version and adding the field, however given how related the change is to what @pombredanne is working on – can this be done together? I can work on the PEP with you if desired or just review it.

A suggestion: have you considered a repeated License-File field instead of License-Files? Almost all other “list” quantities like Classifier, Requires-Dist, etc use multiple-use fields rather than a single field with some delimiters (Keywords is not, but that’s a bit weird anyway, and I think setuptools is broken with respect to keywords and its delimiters anyway?).

Though, also, I do think it would be very unusual to have any sort of metadata that is a reference to a file in the package as opposed to metadata about the package. It seems like it might be a better idea to glom on to the sdist standardization discussion and suggest that all the license files go into a dedicated directory like .dist-info/license_files?

I have not yet give this much detailed thought to the actual proposal, I’m basically just checking if it would be possible. Your proposal makes sense, as it is more consistent with the rest of the fields. Thanks.

Not sure how is this related to sdists. My need to be able to identify the license files is for installed projects, not sdists.

1 Like