@pombredanne I’ll request you to create dedicated issues for those concerns on your repository and link to them from here – that’ll let us have much more scoped discussions and figure things out without adding to an already long thread.
Nope, I’m fine with that. I’ve been watching the discussions and the comments against the PEP and I’ve been very happy with the level of discussion and feedback. Please submit your PEP as a PR against the core PEPs repository and one of the PEP editors will assign it a number and merge it.
In practical terms, a lot of the discussion and review has already happened, as far as I can see, so I would expect that the next stage should be relatively straightforward. The PEP already has Discussions-to set to here, so I suggest updating this thread with the assigned PEP number, and requesting any further review comments or queries be posted here. You could either summarise the changes that have occurred since the original posting, or re-post the latest version of the PEP, whichever you feel is most reasonable. I’d expect there would be relatively few remaining comments to be made, but we can see.
Once there’s been some time for any further comments to be raised and addressed, and you’re happy with the state of the PEP, let me know it’s ready for a final decision.
I’m trying to catch up on OSS work, and it appears this PEP is still in the draft phase, so hopefully these comments are still useful here.
We already have a history with a field in packaging metadata that we attempted to interpret as a syntax and then had to fall back to reading as plaintext-- the long description. One of the problems with that field is that people would make minor syntax errors and nothing would fail, just they would get an unrendered text. We had no mechanism to make this better in the short term, because we had no way to differentiate between “This is invalid because I made a mistake, and this is invalid because it’s not in that actual syntax”. Even worse is when we would be able to parse the input without error, but it actually wasn’t meant to be that kind of input, so we would just end up with a nonsense result that wasn’t what anyone intended or wanted. We eventually added a field that would allow people to indicate how we should parse their content, to override this guess-and-fail fallback.
So learning a lesson from that, I would be really hesitant to add yet another field that would have all of the same problems (this time perhaps with legal ramifications if someone took a accidentally correct but not actually accurate license field at face value). So I would strongly encourage either adding a separate field for the license expression syntax, or another field that indicates that the license field should be interpreted as a license expression.
I’d also suggest that we have multiple levers of “required” we can leverage. The hardest one to lever is in the client side tooling, because changing those tools to require fields (new or otherwise) means we immediately invalidated every existing package that didn’t have that field already. However we have another lever in that we can make fields mandatory on PyPI, but at the metadata level optional. This means that for bulk of published packages, this field would be mandatory since most published packages get published on PyPI, but the entire corpus of existing packages, as well as private packages remain unaffected.
I have the OK from @pf_moore to move this to an official draft and it only has been delayed by the holidays so your comments are still very much useful and timely. And even if they would have been coming later, this will still be a draft!
Your point about “This is invalid because I made a mistake, and this is invalid because it’s not in that actual syntax” which is furthermore informed by your experience on PyPI and the PyPA tools is important and valid. Let me reply in details tomorrow.
This is a serious concern but IMHO the long_description field issue came in part because the PyPI-accepted reStructuredText syntax was a less common subset of the main ReST spec and that the field was never validated by client tools or by PyPI (e.g. setuptools never warned that a long_description was not valid PyPI ReST syntax and a PyPI upload was never rejected nor triggered a warning because the field was not valid ReST). So long_description was a field with a structure (restricted ReST syntax) or no structure at all and both being valid and there never was any validation done or possible. In this context, the solution to use an extra field to specify its format makes perfect sense to me.
There are several differences with the license and using license expression in that field though: there is only one license expression syntax and with this PEP the license field would always be validated and an invalid field would always trigger a warning either because of invalid syntax or an invalid or unknown license identifier used in an expression.
Let me review the two cases:
Say we have such a new field, for instance a new boolean field is_spdx_expression defaulting to False if absent.
1.1. If False, a tool should validate the syntax and issue a warning if the syntax is invalid. If the syntax is correct and there are unknown license ids, it would issue a warning too. (e.g. it could reference a new SPDX license id that is not yet known to the tool doing the validation).
1.2. If True, license field should be a valid SPDX license expression. If syntax is not correct a tool should report an error. Otherwise, a tool should report a warning if a license is unknown.
Say I am instead not having such a field, then the process would be exactly the same as in 1.1
Therefore, a new is_spdx_expression-like field purpose would be to decide if an expression syntax error must be a warning or an error and I am not convinced that this difference would be enough to add such a field.
Applying validation on PyPI makes 100% sense. Do you think a specific reference in the PEP would be needed?
I would see things evolve this way once this PEP is accepted:
PyPI is updated to provide some warning feedback on the license field and that for a period of one year (TBD). After that transition period, uploading new packages without a valid license expression would be rejected. The PyPI Web UI could also leverage the structured license expressions to provide more informative data.
Separately, client tools are updated to support license expression validation and report warning on the license field as needed. They can decide to make it mandatory or optional as they like. I would expect that they would eventually align with PyPI validation requirements.
Would this make sense?
You also wrote:
This is not an issue IMHO. We are not making any legal claim that a license field – even if syntactically correct – has any legal value whatsoever. This is just a bit of user provided metadata which may be correct or not. If validated as a correct license expression, then users are free to attach it the SPDX-specified meaning. Or not. This is not different from the current classifiers that may be correct or not and correctly populated by a package author or not. We cannot control nor provide any guarantee for that.
So I definitely think that the PyPI issue was exacerbated by the particulars of PyPI’s ReST rendering, though I still think it applies here. I don’t think you can assume a warning is going to be emitted and if it is, that people are even going to read the warning.
PyPI itself doesn’t have a mechanism to report warnings back about an upload, it can either accept the upload or it can fail the upload.
That leaves only client side warnings, however, even if people added those checks today, it would still be several years until they broadcast out to people. I think that distutils warned about missing fields (URL maybe? I forget what one) for like a decade and people were still shipping packages that had them missing and/or malformed. Some part of that is because people generally just ignore warnings, another part of that is that for setuptools at least the output is super verbose and things get lost easily, see for example a very simple test package:
$ python setup.py bdist_wheel
/Users/dstufft/.pyenv/versions/3.7.3/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.05.07.01' to '2019.5.7.1'
installing to build/bdist.macosx-10.14-x86_64/wheel
writing dependency_links to dstufft.testpkg.egg-info/dependency_links.txt
writing top-level names to dstufft.testpkg.egg-info/top_level.txt
reading manifest file 'dstufft.testpkg.egg-info/SOURCES.txt'
writing manifest file 'dstufft.testpkg.egg-info/SOURCES.txt'
Copying dstufft.testpkg.egg-info to build/bdist.macosx-10.14-x86_64/wheel/dstufft.testpkg-2019.5.7.1-py3.7.egg-info
creating 'dist/dstufft.testpkg-2019.5.7.1-py3-none-any.whl' and adding 'build/bdist.macosx-10.14-x86_64/wheel' to it
Ironically that has a warning and I didn’t even realize it until I was proof reading this post, because it just blended into the noise of producing a package-- that’s possibly a setuptools problem and we can decide that we don’t think it’s worth doing something different because of it, but I think it is worth at least thinking about the chances someone is even going to see said warning.
I think that if we actually want to ensure that people are putting well formed data in, then we need to validate that field in a way that will actually fail. Part of the transition of doing that can involve a period of time when we only warn, but ultimately I think well formed data needs to be hard validated somewhere in the pipeline, but we can’t really do that with the PEP as stands.
Personally I’d lean towards adding a new field for the data itself, something like License-Expression or even SPDX-Expression or something. I’m not a huge fan of boolean fields and I don’t see us ever having a second kind of license expression where we might want the ability to toggle between them like we do with long description. I think that gives us a very clean path forward:
PyPI immediately starts validating this field, there will never be a published package that has an invalid field here.
Packaging tools start warning if this field is omitted, and can also validate internally that it’s valid.
We probably could never make the field mandatory in the packaging spec itself (e.g. pip should be able to install a package without it, build tools should be able to build a package without it), but I could very easily see an argument that at some point PyPI should require it for uploads.