Hatchling now uses the PKG-INFO file to derive metadata in an effort to comply with the spec. My assumption, based on my reading of the standard, was that using that file was the way to achieve that, even though the file is not explicitly mentioned there but rather in the living document for the source distribution format.
I think Flit is compliant because metadata is static (or at least won’t change) but other backends like poetry-core are not because they allow for certain degrees of dynamism similar to Hatchling’s build hooks.
Am I not supposed to read the PKG-INFO file? If not, how in the world are we supposed to guarantee static content while allowing for dynamic metadata logic?
+1 to the question. pymsbuild also just loads the PKG-INFO if it’s there (though there’s still technically a chance for a wheel build to modify it again, but that only makes things further away from what this user seems to expect).
Is the summary of the issue here that when patching an sdist there can be a mismatch between pyproject.toml and PKG-INFO? Wheels will then use the “wrong” PKG-INFO values?
Would a solution be that if somebody is patching an sdist, they need to delete the PKG-INFO? Build backends would then regenerate the sdist archive (including PKG-INFO), and wheels could reference the updated values in PKG-INFO.
I commented to that effect on the issue, but unless they also change the package name or version (most likely by adding a local version identifier), I’d consider even this to be contrary to the spirit of the spec, even if it’s technically compliant. It violates the key property of static data: consumers can assume that all builds for a given name/version will have the same (static) metadata.
Suppose upstream says they require libfoo == 2.2.2, but we have tested that it works fine with any 2.x version and want to patch the metadata to allow it, in order to build an RPM package that can be installed together with the RPM-packaged libfoo 2.3.5 – are we violating the standard, and should we rename the project or add a local version identifier when we do that?
(This is getting offtopic, feel free to split it out to a separate topic if you know how.)
What I really struggle with is understanding which standard or which PEP mandates that if a PKG-INFO file is present in an unpacked Source Tree[1], a build backend MUST, SHOULD, or even MAY read all metadata from PKG-INFO instead of reading it from pyproject.toml when generating a wheel.
Source distribution format - Python Packaging User Guide only says the PKG-INFO must be present in the Source Distribution top-level directory. I cannot locate any sort of information telling build backends how to treat the file.
When found in the metadata of a source distribution, the following rules apply:
If a field is not marked as Dynamic, then the value of the field in any wheel built from the sdist MUST match the value in the sdist. If the field is not in the sdist, and not marked as Dynamic, then it MUST NOT be present in the wheel.
This does not itself explicitly require that backends read the sdist-built metadata from, e.g., a PKG-INFO rather than the source metadata in pyproject.toml, but I’m not sure how else one would meet that requirement in practice.
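For concreteness, here is a minimal sketch of what I mean, not any particular backend’s real code; compute_metadata_from_pyproject() is an assumed helper, and multi-valued fields (Classifier etc.) are glossed over:

```python
# Hypothetical sketch: satisfy the "non-Dynamic fields must match the sdist"
# rule by reusing PKG-INFO when building a wheel from an unpacked sdist.
from pathlib import Path
from email.parser import Parser


def wheel_metadata(source_dir: str) -> dict:
    """Return the core metadata fields to write into the wheel."""
    pkg_info = Path(source_dir) / "PKG-INFO"
    if pkg_info.exists():
        # Looks like an unpacked sdist: take the already-finalised metadata
        # verbatim, so non-Dynamic fields cannot drift from the sdist.
        msg = Parser().parsestr(pkg_info.read_text(encoding="utf-8"))
        return dict(msg.items())  # note: collapses repeated fields
    # Plain source tree: compute metadata from pyproject.toml, build hooks, etc.
    return compute_metadata_from_pyproject(source_dir)  # assumed helper
```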
I’m not sure that it specifically needs to be called out in a spec or definition anywhere, but I think there is a subtle difference between a spec compliant source distribution and an unpacked source distribution that has been patched.
I think it’s fair to say that any patches that modify a previously spec compliant source distribution can only be considered to be spec compliant if they are successfully repackaged by a spec-compliant build backend.
It surely isn’t possible to standardise what happens to spec compliant source distributions once they have been unpacked and patched.
Correct. The naming distinction between sdists and source trees was introduced in PEP 517, and is documented here. The precise implications of something “being a sdist” rather than just a packed source tree have been informally discussed over the years, but those discussions haven’t been captured in a spec.
Isn’t the concept of a source RPM similar? I know almost nothing about the RPM format, but is it legitimate to open up a source RPM, change it, and then install it as if it were the original distribution? And to be clear, by “legitimate” I mean “supported by the distributor”, not “allowed by the license”.
That is an interesting point. The problem is that a build backend can’t know if the content of a directory is an unpacked sdist or an arbitrary source tree except by the presence or absence of a PKG-INFO file. So using that as a heuristic is a reasonable approach.
It would make some sense to formally state that a source tree containing a PKG-INFO file should be treated by build backends as an unpacked sdist. But it’s simply formalising existing assumptions. And I can’t imagine it makes sense to specify anything else as a marker for that situation…
So the standard does not say a build backend must read the metadata from PKG-INFO, but it does say that the built wheel must have the same metadata as present in PKG-INFO. In practice that means it either needs to read it from PKG-INFO, or read both and validate. That’s quite hard to follow, but let’s say I understand it now. Thanks.
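If it helps, this is roughly what the “read both and validate” option could look like (a rough sketch only; the computed metadata dict and the handling of multi-valued fields like Classifier are hand-waved):

```python
# Compare backend-computed metadata against PKG-INFO, ignoring fields that
# the sdist marked as Dynamic.
from pathlib import Path
from email.parser import Parser


def check_against_pkg_info(source_dir: str, computed: dict) -> None:
    pkg_info = Path(source_dir) / "PKG-INFO"
    if not pkg_info.exists():
        return  # not an unpacked sdist, nothing to validate against
    sdist_meta = Parser().parsestr(pkg_info.read_text(encoding="utf-8"))
    dynamic = {v.lower() for v in sdist_meta.get_all("Dynamic", [])}
    for field, sdist_value in sdist_meta.items():
        if field.lower() == "dynamic" or field.lower() in dynamic:
            continue  # Dynamic fields are allowed to differ in the wheel
        if computed.get(field) != sdist_value:
            raise ValueError(
                f"{field!r} computed from pyproject.toml/build hooks "
                f"({computed.get(field)!r}) does not match PKG-INFO ({sdist_value!r})"
            )
```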
What I still don’t understand is how all this relates to unpacked source trees. The standard talks about the metadata of a source distribution; there is no such thing as metadata of a source tree. Is it explicitly spelled out anywhere that “a collection of files and directories … which contains a pyproject.toml file that can be used to build a source distribution” is a Source Tree? Do we also assume that if such a source tree also contains a PKG-INFO, it is a source distribution, and any modifications to the metadata in pyproject.toml are silently ignored as long as the metadata can be read from PKG-INFO?
EDIT: This was typed before @pf_moore’s comment above.
It is, but if you break your system by doing it, you are on your own. So it is legitimate to the point of “if you know what you are doing”.
I see it the same way with an sdist. If I open it up to relax a dependency and install it, and it no longer works with the relaxed dependency, I shoot myself in the foot, but it was a legitimate thing to do.
I think you are fully empowered here to be honest. If you delete the PKG-INFO in your patching system, do things work as expected?
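Something like this is what I have in mind for the patching side (just an illustration; the paths are placeholders and it assumes the `build` frontend is installed):

```python
# Sketch of the "delete PKG-INFO before rebuilding" workaround discussed here.
import subprocess
import tarfile
from pathlib import Path

sdist = Path("example-1.0.tar.gz")        # placeholder sdist
with tarfile.open(sdist) as tf:
    tf.extractall("work")
tree = next(Path("work").iterdir())       # the unpacked example-1.0/ directory

# ... patch pyproject.toml / sources here ...

# Remove the frozen metadata so the backend recomputes it from pyproject.toml.
(tree / "PKG-INFO").unlink(missing_ok=True)

subprocess.run(["python", "-m", "build", "--wheel", str(tree)], check=True)
```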
I’m not sure, but I can’t think of a way to recognize whether a source tree comes from an unpatched or a patched sdist. Recalculating the metadata via PEP 517 to verify that PKG-INFO matches pyproject.toml defeats the purpose to some extent. Do you have any recommendations, beyond perhaps documenting this behaviour loudly somewhere for people who patch sdists?
As far as I can see, there’s nothing prohibiting you from exercising your OSS rights here. You don’t need to use a local version modifier for example, it’s only a recommendation to make it more clear that the artifacts produced come from a patched upstream package.
The whole situation is unfortunate, I think. Everyone agrees with you that nicer error messages when this occurs would be nice; however, due to the nature of Dynamic, I’ve not seen any paths forward that:
enable you to patch without deleting PKG-INFO (deleting it is not great, but easy enough after the first time), and
don’t require invoking PEP 517 for every sdist or tree, even those that haven’t been patched, just to confirm that the values are deterministic (much worse, because it requires needless recomputation of data that has already been declared static in most cases).
Personally, and I’m not a build backend developer, I would say that it’s a reasonable optimisation to assume the presence of a PKG-INFO file means that the source tree is an unpacked sdist, and it’s ok to use the data from that file rather than recalculate it. That’s at least in part because in theory, it’s completely tool-defined what “building” a source tree even means…
The packages that were created before this standard will work as expected. But if somebody actually patches the PKG-INFO, we broke them.
We could educate our “patchers” to always delete PKG-INFO when they patch, but unless they get hard errors without doing it, it will be close to impossible.
Yeah, let’s get practical. Whether or not we agree the use case is legitimate, I think people will run into it more often in other scenarios, even if they just want to debug a problem by downloading, modifying, and installing an sdist.
Given:
download an sdist
unpack it
patch (the source of) metadata, but leave PKG-INFO unchanged
build a wheel (to install it) via a build frontend
What currently happens with hatchling:
Changes to the metadata are silently ignored; the metadata from PKG-INFO is used instead. I find this very unfortunate, error-prone, and outright confusing. That’s why I opened the issue in the first place.
What should happen?
Should the build frontend try to detect the situation (e.g. by checking whether PKG-INFO is the newest metadata-relevant[1] file) before calling PEP 517 hooks, and error out if it detects it?
Or should the build backend try to detect the situation (e.g. also by checking modification timestamps[2], or by calculating the metadata again and comparing[3])? A rough sketch of the timestamp idea follows the footnotes.
However, in principle every file in the tree is relevant, because a package version can be read from a .py file or from another source.
At least the build backend knows which files might be metadata-relevant.
However, that goes against some spirit of the standard.
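To illustrate option [2], a very rough sketch of such a timestamp check; which files count as metadata-relevant is backend-specific (the list below is made up), and mtimes may not survive every unpacking tool:

```python
# Heuristic: PKG-INFO looks stale if any metadata-relevant source file is
# newer than it, which suggests the tree was patched after the sdist was built.
from pathlib import Path


def pkg_info_looks_stale(source_dir: str) -> bool:
    root = Path(source_dir)
    pkg_info = root / "PKG-INFO"
    if not pkg_info.exists():
        return False  # plain source tree, nothing to compare against
    frozen_at = pkg_info.stat().st_mtime
    metadata_sources = [root / "pyproject.toml", root / "setup.cfg"]  # assumption
    return any(p.exists() and p.stat().st_mtime > frozen_at for p in metadata_sources)
```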
My suggestion doesn’t help the silent-failure scenarios, but instead of deleting the PKG-INFO, would you perhaps prefer to insert something into the [tool] section for hatchling to force it to recalculate metadata (thereby overwriting and ignoring any existing PKG-INFO)?
To me, it seems like the same amount of effort, but I thought I’d ask.
Our tooling is backend-agnostic. A solution that works for hatchling is not good enough, unfortunately.
Pragmatically, if there is a way to tell any build backend that we want it to force-recalculate the metadata, that would work for us (it wouldn’t prevent others from getting bitten, but I don’t have the energy to try to find a solution for everybody).
I.e. if we can pass an option via PEP 517 config_settings that says something like ignore_pkg_info=True or recalculate_metadata=True, and the backends are instructed by the standard to acknowledge it, I’m good.
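For illustration only, this is roughly what that could look like at the PEP 517 hook level; the recalculate-metadata key is hypothetical and no backend honours it today:

```python
# Call the project's declared build backend directly (run from the unpacked
# source tree), passing a hypothetical config setting through to it.
import hatchling.build as backend

backend.build_wheel(
    "dist",
    config_settings={"recalculate-metadata": "true"},  # hypothetical, not standardised
)
```

In practice a frontend would pass this through for you, e.g. pip’s --config-settings or `python -m build`’s -C/--config-setting options.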
That is indeed a possibility. It would require a new standard, and until now backend maintainers have seemed uninterested in trying to standardise any aspects of config_settings, so it might be difficult to get consensus on that. But if you need a tool-agnostic answer, that’s the way to go.
Probably; however, I am afraid it will take months to have such a standard.
Is there no chance we could solve this without that?
I guess we can just error whenever we find PKG-INFO. That way, packagers will be forced to remove it and we never end up deleting it for them if they patched it. I will need to figure out how many packages would be impacted by that.