By leaving the expression field as optional, so that a custom license can be easily identified by the absence of the expression field.
No.
What would prevent this PEP from moving forward is the lack of an implementation of its design in a backend. And, to that end, my suggestion is to not specify a library for validation (e.g. in case a new/better/different-design library comes out) and instead leave it as an implementation detail for the build backends to figure out entirely. That said, I also don't feel strongly about this specific suggestion since (unless I'm misreading) the use of this library isn't written to be mandatory anyway.
Spit-balling here, could it be an empty string "" instead? The thing I am trying here is to nudge project owners to populate this field, even if they have to populate with empty. It indicates that the project owner has checked the license and available SPDX and confirmed it is one of custom/unknown/proprietary.
Other feedback from Fedora matrix forum:
it might be interesting to reference what Rust / cargo has ended up with:
both package.license and package.license-file fields in metadata exist.
the former is a free-form string but it's supposed to be a valid SPDX expression.
the latter is only supposed to be used if the project license is "special" and cannot be expressed with SPDX.
both values cannot be present in the same project.
(for reference: The Manifest Format - The Cargo Book )
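The mutual-exclusion rule described in the quoted feedback could be sketched as a small check. This is only an illustration in Python: check_license_fields is a hypothetical helper, the dict stands in for a parsed package table, and nothing here reflects how Cargo itself implements the rule.

```python
def check_license_fields(package: dict) -> list:
    """Return a list of problems found in a parsed package table.

    Assumption: the keys "license" and "license-file" mirror the two
    fields named in the feedback above.
    """
    problems = []
    has_expression = bool(package.get("license"))
    has_file = bool(package.get("license-file"))
    # Rule as quoted: both values cannot be present in the same project.
    if has_expression and has_file:
        problems.append("license and license-file cannot both be set")
    return problems
```

A project that sets neither field passes this check, which matches the idea that a custom license is identified by the absence of the expression.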
If I understand correctly, developers that opt in to this license are supposed to specify which version they are using in the "license notes" each file should contain (and/or in COPYING/NOTICE files).
Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.
GNU General Public License v2.0 - GNU Project - Free Software Foundation (FSF).
So probably searching for "any later" is a good indicator.
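A rough sketch of that heuristic, assuming the text being searched is the per-file license note (or a NOTICE file) and that it already names version 2. The function name is hypothetical. Note that the full GPL text itself contains "any later version", so running this over COPYING would always report "or later".

```python
import re

# Matches "any later" even when the note is wrapped across spaces or newlines.
_ANY_LATER = re.compile(r"any\s+later", re.IGNORECASE)


def guess_gpl2_identifier(notice_text: str) -> str:
    """Guess an SPDX identifier from a GPLv2 license note (heuristic only)."""
    if _ANY_LATER.search(notice_text):
        return "GPL-2.0-or-later"
    # Assumption: the note names version 2 and nothing else.
    return "GPL-2.0-only"
```

This is only an indicator: a note wrapped inside comment markers, or one that does not mention a version at all, would need more careful handling.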
To be honest my personal opinion is that it should not be mandatory...
I believe that the SPDX index is at best a proxy for the text of the license and license notes that the user includes in their software.
It is good if they decide to add an SPDX index, but it is also very good if they only add the correct license files and notes. We should strive to lower the barrier to entry for package authors, and this looks like a transfer of responsibility from consumers to authors.
I think the relaxation from MUST to SHOULD is not enough for build-backends[1]. MAY would be better. No need for validating would be the best.
[1] Unless the tools for making validation/parsing/etc are made available without need for extra dependencies (e.g. via stdlib or maybe packaging, with the caveat that even if they are in packaging some backends might not be able to use it).
FYI Hatchling does not allow for arbitrary text there but rather does this: hatch/backend/src/hatchling/licenses/parse.py at hatchling-v1.24.2 · pypa/hatch · GitHub
No need for validating would be the best.
I strongly disagree: not performing validation is the same as having an unstructured field for this purpose. If we're doing that, we might as well not introduce a new field here.
I am referring to validation in the build-backend only, because of the restrictions/difficulties in bringing dependencies in.
I am not referring to other tools like package indexes (I don't have an opinion about those, because I don't have experience implementing them).
I understand that. To elaborate a bit on what I'm saying: I disagree that this should not be marked as a thing that build-backends should do. We can pick between "MUST" (required), "SHOULD" (recommended) and "MAY" (optional) but we should not have it such that backends are not expected to do anything with this metadata.
The build-backend is the best place to provide immediate feedback to users, since it is the piece that converts arbitrary configuration provided by the user (via TOML, Python, what-have-you) into the Core Metadata format that is consumed downstream by all the other pieces. This means that the build-backend is best placed to provide explicit, clear and actionable error messages to users about what their misconfiguration is. Moving this to the package index effectively makes it significantly more likely that, for example, projects on package indexes other than PyPI don't actually end up having this (e.g. because not every index ends up enforcing this), which would mean that installers might end up being stricter than build-backends, and they can't provide particularly actionable errors at that point.
IMO, it would be better to have this be specified as SHOULD (i.e. recommend that backends implement it). If a backend has a strong reason to not do so, they're fine to skip it. IMO it'll just be clearer to end users to have the backends do this, and I expect it'll be a better UX overall as well.
I strongly disagree with creating this expectation from build backends, by using wording like SHOULD.
In an ideal world that would be nice to do. But we have to be realistic with what we can achieve and with the responsibilities that we can take[1].
[1] As I previously mentioned, I have no problem if this validation/parsing/processing etc. comes from stdlib or dependencies that are already vendored by all build backends.
No. Nothing should be mandatory. That's not what this ecosystem is about. You do as much as you can or want to do to help the people who are sharing in your code, but that's the extent of the obligation. PyPI has terms in place to ensure that all packages are sufficiently licensed for one level of distribution.
If this change makes it easier to do the right thing, great! But we should not make something mandatory without a solid practical reason (i.e. package is not installable by anyone for technical reasons, as opposed to merely being underinformed about licensing).
Thanks for clarifying. I'm sorry if my question/comment sounded like I want to force anything on this community. I replied too quickly and should have explained my thoughts a little bit more.
The way I see things right now is that there are two different audiences for the PEP. One is package producers (package maintainers) and the other is package consumers. Producing a package with a correct license is not too difficult today. Consuming packages and determining what licenses they use is a different story though.
The thing that makes consuming this information hard is how underspecified and unstructured the current core metadata spec is. PEP 639 would greatly improve the situation by supporting an SPDX expression.
Now my thinking is this: If the License-Expression field is optional, wouldn't it put us in the same situation as before? An SPDX expression is a well known structured format while license files aren't. So if a project was to decide that they don't want to set License-Expression, then users of the package would have to resort to parsing license files to know what the license is at a quick glance.
By not having a License-Expression on everything, tooling around license management will have a difficult time doing its job. For example, I can imagine that maybe pip or another installer could grow a functionality that would allow a user to limit what licenses are allowed to be installed. I believe having a License-Expression on each package would help manage supply chains more easily. It would obviously not solve all problems because there are still possible discrepancies between what a package says its license(s) is/are and what they actually are, but at least it would be a start.
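To make the installer idea a bit more concrete, here is a minimal sketch of an allow-list check. Everything in it is hypothetical (no installer offers this today as far as this thread states), and it is deliberately conservative: it does not evaluate AND/OR logic, it just requires every identifier mentioned in the expression, including exception identifiers after WITH, to be on the allow-list.

```python
import re
from typing import Optional, Set

_OPERATORS = {"AND", "OR", "WITH"}


def license_ids(expression: str) -> Set[str]:
    """Collect the identifiers mentioned in an SPDX-style expression."""
    tokens = re.findall(r"[A-Za-z0-9.+:\-]+", expression)
    return {token for token in tokens if token.upper() not in _OPERATORS}


def is_allowed(expression: Optional[str], allowed: Set[str]) -> Optional[bool]:
    """Return True/False for a declared expression, None when it is missing."""
    if not expression:
        # No License-Expression: the policy cannot decide either way.
        return None
    return license_ids(expression) <= allowed
```

With this sketch, a package without the field ends up in a third "unknown" bucket, which is where a human (or license-file scanning) would still be needed.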
If all packages have a License-Expression, PyPI and other indexes would be able to show it to users in their UI. Without an expression, these services would have to either omit the license entirely or display all the license files. Both options are not user friendly. Another alternative would be for these services to parse the license files and try to determine what license they really are, which is a not so simple thing to do and probably impossible to do accurately.
I hope this clarifies my initial reply!
What if the license file is for GPL, but the expression says MIT? Or if the license file states a license that isn't expressible as a SPDX expression? Do we want to get sucked into the legal implications if we say things like "expression takes precedence over license file"?
I might be naive (which is possible) and missing context (again possible), but why would we get sucked into legal obligations? This already happens today in lots of projects. I see this almost on a daily basis when packaging stuff at Anaconda. In what way would a different License-Expression and License-File be different than a currently differing classifier and license file? If it's not a concern today, why would it be a concern with PEP 639?
All of which brings up another point. The standard must allow for arbitrary licenses somehow. Python is used in corporate environments where all sorts of proprietary and restrictive licenses are used. We absolutely cannot have a standard that prohibits creating or publishing such packages - yes, they won't be published on PyPI (the PyPI rules already prohibit that, I believe) but that doesn't mean they can't be published on a custom index, for example explicitly built to distribute licensed software to paying customers.
This is already supported by the SPDX standard by using LicenseRef-MyCustomLicenseHere, see Clause 10: Other Licensing Information Detected - specification v2.3.0.
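As a small illustration of how a tool might spot such custom identifiers, here is a sketch that pulls LicenseRef- entries out of an expression. The helper name is made up, and the character set (letters, digits, "." and "-") is my reading of the SPDX idstring rule, so treat it as an assumption rather than a full parser.

```python
import re

# Assumption: an idstring is made of letters, digits, "." and "-".
_LICENSE_REF = re.compile(r"\bLicenseRef-[A-Za-z0-9.\-]+")


def custom_license_refs(expression: str) -> list:
    """Return the LicenseRef- identifiers found in an expression, in order."""
    return _LICENSE_REF.findall(expression)
```

For example, custom_license_refs("MIT OR LicenseRef-MyCustomLicenseHere") picks out only the custom part, which a tool could then map to a bundled license file.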
If it's not a concern today, why would it be a concern with PEP 639?
Because you are suggesting that we make a statement about which takes priority. If 2 parties get into a dispute because they disagree over which value takes precedence, how well we publicise our rules could be an important factor in the decision. As I say, IANAL, but I'd rather keep out of that.
I don't see any reason for why someone might not need such an escape hatch in Python projects
As a concrete example, I would really appreciate it if the package metadata for the nVidia CUDA packages and the Intel MKL packages could meaningfully specify their real licenses. "Other/Proprietary", while technically accurate, is not particularly useful. [1]
Since these are proprietary licenses they're never going to get actual SPDX identifiers (see New license request: Intel-SmpL-2018 [SPDX-Online-Tools] · Issue #838 · spdx/license-list-XML · GitHub for example), but nVidia and Intel could pick their own (hopefully meaningful) strings as long as custom licenses are allowed.
With the rules for acceptable PyPI uploads only requiring artifact distribution rights rather than full open source licensing, we can reasonably assume these aren't going to be the only packages where there's no applicable standard SPDX license identifier.
[1] For those that are curious the Intel MKL wheels appear to be covered by https://www.intel.com/content/www/us/en/developer/articles/license/end-user-license-agreement.html#inpage-nav-2 and the various nVidia CUDA wheels are covered by License Agreement for NVIDIA Software Development Kits - EULA
I strongly disagree with creating this expectation from build backends, by using wording like SHOULD.
Could I ask you to elaborate on the reasons for this? I don't quite understand what you're concerned about here.
I'll say that the current tentative implementation approach seems to be to end up implementing the parser for this in packaging, which does meet your "already vendored" mention in the footnote-rendered-inline thingie.
If all packages have a License-Expression, PyPI and other indexes would be able to show it to users in their UI. Without an expression, these services would have to either omit the license entirely or display all the license files. Both options are not user friendly.
Even if it were mandatory, I don't think it would be back-filled to existing packages on PyPI[1]. So there will be a mix of packages with and without this classifier in either scenario.
Your examples make sense: I can see how using the License-Expression classifier as a filter would be very useful for some users. I do think it would be useful even if it wasn't universally adopted, though: assuming that new packages gradually adopt usage, over time the majority of maintained packages will have such a classifier. It wouldn't entirely eliminate the work of verifying individual licenses but it would greatly reduce it.
[1] Packages usually aren't modified like that, and also who would do it?
I kind of summarised the reasons before in PEP 639, Round 3: Improving license clarity with better package metadata - #8 by abravalheri and PEP 639, Round 3: Improving license clarity with better package metadata - #29 by abravalheri.
But yes, if it is possible for a backend to implement the validation/parsing (or any kind of processing needed) without pulling in any extra dependency (or having to reimplement it from scratch), then the issue with feasibility is gone, and then I have no problem with using a SHOULD.
Even if it were mandatory, I don't think it would be back-filled to existing packages on PyPI[1]. So there will be a mix of packages with and without this classifier in either scenario.
If it were to be made "mandatory" then that would be in the context of "mandatory when using a newer metadata version" and so even new uploads for packages declaring older metadata versions would still lack it; it wouldn't just be those which were previously uploaded.
On the mandatory-or-optional front, publishing tools can't realistically enforce the new field being semantically correct, they can only enforce syntactic validity.
This means making it mandatory is likely to make the field SNR worse rather than better, since rather than simply omitting the field when they haven't fully thought through their licensing choices, publishers will be forced to pick a value that makes the tools happy.
By leaving the field as optional at a syntactic level, policy definitions can be left to the entity that has the most information on what matters for their use case: the folks actually downloading and installing packages (or the organisations they work for).
For the consumers most concerned about licensing details, even the new better specified field will only be viable as a first pass filter, since packages with acceptable nominal licences will need further analysis to check if their overall licence includes other terms (e.g. due to vendored libraries).
That doesn't make the proposal useless, it just means making the new field syntactically mandatory would make it less effective at its intended purpose (allowing publishers that have put thought into their licensing choices to make those choices explicit in their published metadata) rather than being helpful.
Thanks all! That's a lot of interesting points raised.
What I frankly don't understand is why making everything formally validated and normalized is so important for this field in particular. I support the idea of saying that License-Expression must be a valid SPDX expression, that seems sensible and in line with other languages. But I don't see why it's necessary to go into so much detail over who is responsible for validating, and normalizing the data.
How I see this: If there's an expectation of a specific format of a value, it makes sense to have it validated somewhere. The field can be validated for semantics and normalized automatically, so there's no need to leave that to project authors' eyes (it's easy to make a typo and carry it forever).
I lean towards @pradyunsg's opinion that it feels reasonable to create an expectation that build tools will perform that validation. They could choose not to (e.g. being a lightweight build backend by design may be a sufficient reason to opt out from it). If the tools detect an invalid expression, they may block the users via an error or they may warn that this could cause problems if wanting to distribute packages via public indices, and still let them produce such a package.
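The warn-or-block choice could look roughly like the sketch below. All names here are hypothetical; the actual validity check is passed in as a callable (is_valid), since which library performs it is exactly the open question in this thread.

```python
import warnings
from typing import Callable


class InvalidLicenseExpression(ValueError):
    """Raised in strict mode when the expression does not validate."""


def report_license_expression(
    expression: str,
    is_valid: Callable[[str], bool],
    strict: bool = False,
) -> bool:
    """Return True if valid; otherwise warn, or raise when strict is set."""
    if is_valid(expression):
        return True
    message = (
        f"License-Expression {expression!r} is not a valid SPDX expression; "
        "public indexes may reject this package"
    )
    if strict:
        # Blocking behaviour: the build stops with an actionable message.
        raise InvalidLicenseExpression(message)
    # Lenient behaviour: the build continues, the user is told why it matters.
    warnings.warn(message, UserWarning, stacklevel=2)
    return False
```

A lightweight backend could skip the call entirely, while a stricter one could default strict to True.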
I'll also take a second look at warning/error guidance in the PEP. Maybe not everything has to be specified upfront and it'd be better to leave a margin of freedom for the tools.
All of which brings up another point. The standard must allow for arbitrary licenses somehow. Python is used in corporate environments where all sorts of proprietary and restrictive licenses are used. We absolutely cannot have a standard that prohibits creating or publishing such packages - yes, they won't be published on PyPI (the PyPI rules already prohibit that, I believe) but that doesn't mean they can't be published on a custom index, for example explicitly built to distribute licensed software to paying customers.
At this point it seems to me that allowing custom identifiers as defined by SPDX would be the most straightforward way to achieve this.
I didn't mean it from a legal standpoint, but rather in terms of trust, i.e. when scraping the license, e.g. when creating a Fedora package from PyPI, it would be a trusted source. Still mandatory for the purpose that license files alone are not sufficient for determining the accurate license.
I believe that's not the role of PyPI as it is currently designed. PyPI should not be considered a blanket trusted source and I'd prefer this PEP doesn't play a role in making it appear so. It's still a responsibility of the project authors to make sure the declared metadata matches their intention and is coherent within a distribution (e.g. text in the license file matches the license expression). Packagers in Fedora are expected to validate that information in the process of including a package into the distribution and raise an issue with project authors if they find discrepancies.
By not having a License-Expression on everything, tooling around license management will have a difficult time to do their job.
Spit-balling here, could it be an empty string "" instead? The thing I am trying here is to nudge project owners to populate this field, even if they have to populate with empty. It indicates that the project owner has checked the license and available SPDX and confirmed it is one of custom/unknown/proprietary.
Currently, a project needs to define a name and a version (some build tools will initiate the version for the project author). I don't see a reason to mandate including licensing information there - it's not mandatory in the wide software world to license projects (this means no one can use and modify a project, nevertheless it's perfectly possible to do). I feel the best place to nudge project owners to license things properly is in their issue trackers, and I believe most would happily comply if approached by the downstream packagers.
I don't see a benefit of an empty string over not including the field at all.
With my Fedora packager hat on, I'm on the side of making the proposal as minimal as possible to achieve consensus over the implementation. The gist is that we need the SPDX standard and we need to be able to locate the license files. The feedback from folks being in that chain before us, like build and publishing tools authors, is much appreciated.
For whatever it's worth, I believe PyPI will almost certainly validate the syntax of a License-Expression (which includes validating that it contains valid identifiers), but will almost certainly not attempt to verify that a License-Expression clause is actually an accurate representation of license for a given project.
It was previously mentioned, but there's also not a mechanism for us to backfill this metadata on PyPI, the METADATA file in an archive is what it is, and we can't modify that.
We're also unlikely to require this information unless the PEP requires us to do it. I don't think we should require it though, as others have mentioned required fields tend to be filled in with garbage to "get things working", so we should be careful that our list of required fields is kept pretty minimal.
I think that having SHOULD for a build back end to validate that the expression is syntactically correct is a reasonable thing to include. SHOULD is explicitly there to allow implementers to opt out of doing something when they have a good reason for doing so, while suggesting that it is a good idea for them to do it if possible.