Respecting Core Metadata 2.2 when building from source distributions

That’s not going to happen. I’m going to do this: Persistent dynamic core metadata breaks user assumptions · Issue #1334 · pypa/hatch · GitHub

Essentially, an option in pyproject.toml that will ignore the PKG-INFO file. I’m not going to document this option because this is a hack just for redistributors and reading from that file is the only way to guarantee static metadata/the spec when backends have functionality that allows for dynamic metadata.

The implementation you mentioned doesn’t make much sense to me. If I’m going to raise an error when the metadata doesn’t match then how is that different than requiring patchers to target the PKG-INFO file?

It makes perfect sense for our use case. It makes the patchers aware of the problem so they can patch both places or remove PKG-INFO explicitly.

Having this behavior would solve my problem entirely, and it would be much easier to explain to people (compared with an undocumented option in pyproject.toml).

2 Likes

Many more.

  • Distribution scanning tools (for example dependabot)
  • Data analysis scripts

Basically anything that wants to get metadata for a distribution artefact, but doesn’t want (or have the capability) to invoke a build backend, with all the complexity that involves.

That’s a discussion to be had with individual backends, but if you want a backend agnostic answer (which is what seems to be important here, as any suggestions that depend on knowing what the backend is have been rejected) then it would have to involve a new standard (or at least an amendment to an existing standard, and given how much debate this has caused, I’m not going to accept it as a clarification, or text-only change…)

I don’t think so, no. --no-build-isolation is still a standards-compliant build process (the standard doesn’t say anything about whether the build environment is constructed automatically or by hand). And how you construct the build environment has no impact on the metadata in the resulting wheel.

Absolutely it could. But @ofek doesn’t want to, and to be blunt, I think that turning the issue into a debate about the standards and their implications and intent, feels like an attempt to put undue pressure on him to change his mind. Personally, I will defend his right to make whatever choices he wants when maintaining his own project as long as the standards allow it.

If you can persuade @ofek to make such a recommendation, you’d have also persuaded him to change hatchling. And if you can’t, then a recommendation that doesn’t have the support of existing backend developers isn’t much use.

How will you ensure the guarantee that wheels built from a sdist will match the static metadata from the sdist? If the answer is that there’s no way for that not to be the case, then good for you - you’re one of the lucky backends that has nothing to worry about. Flit is another one. Setuptools, on the other hand, is likely to have some difficulties here, because they support arbitrary code execution - I can see them having to mark a lot of things as dynamic (which I’m not happy about, but at least they have a migration path to gradually make more things static).

Maybe. But I’d reject it unless it also said “unmodified”, precisely because that would be a change in behaviour that would invalidate hatchling’s current behaviour, and I’m not willing to accept that as a simple clarification - as I said above.

2 Likes

Sorry for not being perfectly clear. My question is more: can’t the patching happen on both files, so that backends don’t have to change anything?

1 Like

Yes, it can be done that way. The problem is that nobody expects that and the requirement is a breaking change. And when the patchers don’t do that (they don’t realize they need to do that), they silently get the unexpected (i.e. incorrect) result, rather than a failure.

2 Likes

But nobody expected people to be unpacking and patching sdists either. The problem here is a lack of understanding of each others’ positions, certainly, but there are misunderstandings on both sides, and a lack of representation in both directions (no-one reached out to distros to ask how metadata 2.2 would impact them, but it’s equally true that distros didn’t get involved in the PEP discussions to find out if the proposal would affect them).

I feel like this is simply the tip of a rather large iceberg. There’s clearly a lot of baked in expectations and practical knowledge within the distro maintainer community on what is and is not going to work when patching a sdist. That goes way beyond this issue. For example, I wouldn’t know how to patch the version of a package using setuptools_scm - my instinct to edit pyproject.toml clearly wouldn’t work. So saying that patchers shouldn’t need to know what build backends are involved is obviously an over-simplification.

There’s new knowledge to be passed onto the impacted users - that’s clear. That knowledge might be as simple as “when patching metadata, make sure you make the corresponding change to PKG-INFO if it’s present”. Or it might be more complex. But I don’t think it’s reasonable to require ecosystem changes to never need users to learn new processes - that’s going to make it virtually impossible to innovate. So we need to accept that users will be affected, and decide what advice to give to the affected users.

There are other aspects of the way distro maintainers patch sdists that I’m not clear on. My understanding is that they use --no-build-isolation to manage the build environment carefully, and ensure reproducible results. That seems like a great thing for them to do, but if that’s the case, how does the new version of hatchling get pulled in without review? If there was a difficult-to-fix bug in a build backend, which resulted in corruption of patched sdists, wouldn’t you just pin to the previous working version of the backend?

I’m not looking to blame anyone here. Nor am I trying to minimise the difficulties this is causing to distros like Fedora. But I think it’s important that we understand all of the possible avenues of approach here. My impression is that positions are getting entrenched, because we’re all looking for someone else to be responsible for dealing with the impact of metadata 2.2. And I don’t think we’ll find a good solution if that happens.

The fact that the issue is a silent behaviour change rather than a noisy error makes things harder. But would Fedora really have been happier if a whole bunch of builds had simply started failing without warning? We might have worked out what had happened a bit more quickly, but we’d still be in the same position.

Personally, I’m still not clear on why “nobody expects that” is such a huge hurdle. How many people are we talking about here when we say “nobody”? Do Fedora have thousands of people patching sdists? Hundreds? Tens? A couple? That’s not rhetoric, I genuinely have no idea. I also don’t know how other policy changes get communicated - when setuptools deprecated direct invocation of setup.py in favour of using a PEP 517 compatible build tool like build, how did that change get implemented? It was just as silent, so presumably people had to be told to change their processes in a similar way to what we’re talking about here. Again, I’m not trying to judge the situation, I genuinely don’t know how any of that is handled.

5 Likes

I don’t think that’s a good solution. Essentially, pyproject.toml is the source file here, and PKG-INFO is the output file. Expecting people to manually deal with the output when the source changes is against the principle of least surprise. To put it in Python terms, it’s as if .pyc files weren’t invalidated when the respective .py files changed — i.e. the user would end up editing sources, and then scratching their head over why the program behaved as it did before the change.

I can’t imagine anyone telling people they need to remove .pyc files when editing Python sources, nor telling them that they need to add a special marker to .py files to indicate that they’ve edited them. What happens instead is that Python detects that .pyc files are outdated and updates them. And I think that build systems should do the same thing wrt outdated PKG-INFO. Admittedly, it is nowhere near as simple, especially since the format doesn’t provide any means of tracking the source files used to create PKG-INFO — but even checking the mtime of pyproject.toml would be better than the current behavior.
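The mtime idea above can be sketched roughly like this (a hypothetical helper, not anything an existing backend implements; note that tar extraction often preserves mtimes, so this heuristic is far from bulletproof):

```python
# Sketch of the mtime heuristic: treat PKG-INFO as stale if pyproject.toml
# was modified after it was generated. Paths and the function name are
# assumptions for illustration only.
import os


def pkg_info_is_stale(source_dir: str) -> bool:
    """Return True if PKG-INFO is missing or older than pyproject.toml."""
    pkg_info = os.path.join(source_dir, "PKG-INFO")
    pyproject = os.path.join(source_dir, "pyproject.toml")
    if not os.path.exists(pkg_info):
        return True
    # A patched pyproject.toml will normally carry a newer mtime than the
    # PKG-INFO that was generated before the patch was applied.
    return os.path.getmtime(pyproject) > os.path.getmtime(pkg_info)
```

A backend could then fall back to regenerating metadata whenever this returns True, rather than silently preferring the stale PKG-INFO.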


Honestly, I think the biggest problem here is that Python packaging is infinitely complex and counter-intuitive, which means that anyone involved, on either end, runs into a very high barrier to entry. The Gentoo Python Guide already has 300 KiB of .rst files, and it is by no means comprehensive. At this point, distribution developers pretty much can’t package anything written in Python without special training and/or senior developer review, and even senior developers have a hard time following the constantly changing landscape.

At the same time, the Python framework in Gentoo already includes a bunch of safety checks to detect the most common pitfalls. Again, it is by no means comprehensive, and I keep extending it whenever we discover yet another counterintuitive pitfall. This thread makes me think that we will need to add another check to make sure that PKG-INFO is dealt with when pyproject.toml is patched.

1 Like

@pf_moore What are your thoughts on adding a recommendation that backends SHOULD give precedence to the PKG-INFO file for source distributions? As in, the recommended implementation of how to achieve the requirements set forth in the spec for metadata 2.2.

Where/how should we add this?

That’s in direct contradiction to what @hroncok wants, and it’s the opposite of what @rgommers suggested, to which you replied “that’s not going to happen”.

I’m not going to support the standards taking sides in this dispute. If we get a consensus, then it can be documented and can be part of the standard if appropriate, but we need to get to a satisfactory solution for all parties before we start making pronouncements in the standards.

At the moment, the behaviour you implemented in hatchling is allowable according to Metadata 2.2, but that depends on how you decide whether the directory given to the backend counts as “building from a sdist”, as opposed to simply building a standalone source tree.

My position is:

  1. I’m not willing to disallow hatchling’s behaviour unless backends (not just hatchling) confirm that they are OK with enforcing the requirements of metadata 2.2 some other way.
  2. I’m not willing to make any sort of stronger statement that the hatchling approach is “right”. It’s allowed, and everything else is between the backend and its users.
  3. I am willing to explore how we distinguish better between “building from a sdist” and “building from a source tree” given the existence of workflows that involve unpacking a sdist, patching it, and building a wheel, plus the fact that when a backend gets a directory it has no indication of where it came from (specifically, whether it was unpacked from a sdist). The intent here would be to find a way to classify patching as a build from a raw source tree rather than from a sdist (because the metadata 2.2 requirements don’t apply in that situation).

My personal view is that if PKG-INFO is present in a source tree presented to a build backend, the backend should treat the build as “building from a sdist”, and therefore metadata 2.2 rules apply. But I know that @hroncok isn’t happy with that constraint (which would imply patchers either delete or edit PKG-INFO), so we still need to explore what would satisfy all parties.

In my view, PKG-INFO is a built artifact that is included in a sdist, in the same way as binaries and metadata are built artifacts included in a wheel. To create a source tree from a sdist, therefore, you unpack the sdist and delete the built artifacts, giving you the clean sources[1]. Historically, there’s not been a need to do that, but that doesn’t mean it wasn’t conceptually the correct model.
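Under that model, “unpack the sdist and delete the built artifacts” might look something like this sketch (the helper name is invented; it assumes the conventional single `<name>-<version>/` top-level directory inside an sdist):

```python
# Sketch of the "sdist -> clean source tree" model: unpack the sdist, then
# strip the built metadata (PKG-INFO) so the result can be patched and
# treated as a raw source tree. Illustrative only, not a real tool.
import os
import tarfile


def sdist_to_source_tree(sdist_path: str, dest: str) -> str:
    """Unpack an sdist into dest and strip PKG-INFO; return the project dir."""
    with tarfile.open(sdist_path) as tf:
        tf.extractall(dest)
    # sdists conventionally unpack to a single "<name>-<version>/" directory.
    (root,) = os.listdir(dest)
    project_dir = os.path.join(dest, root)
    pkg_info = os.path.join(project_dir, "PKG-INFO")
    if os.path.exists(pkg_info):
        os.remove(pkg_info)  # now pyproject.toml can be patched freely
    return project_dir
```

With PKG-INFO gone, a subsequent build is unambiguously a build from a source tree, so the Metadata 2.2 “frozen metadata” rules no longer come into play.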


  1. The fly in the ointment then is PBR, which doesn’t consider a source directory which isn’t a VCS checkout to be a valid source tree. I don’t have a good answer there, except to suggest that PBR add support for non-VCS source trees, which they have so far chosen not to do, as is their right. ↩︎

3 Likes

Admittedly, it is nowhere near as simple, especially since the format doesn’t provide any means of tracking the source files used to create PKG-INFO — but even checking the mtime of pyproject.toml would be better than the current behavior.

More than that, even. In examples we discussed earlier, PKG-INFO might not even be generated only from source files but also from other sources of data (such as Git metadata, environment variables, specially-formatted directory names…). I think this goes back to the fundamental disagreements we’ve had over the years on whether an sdist is a source distribution or an installable package. Some people want it to be one or the other, but it’s both, depending on who you ask. To people who believe strongly that an sdist is a source distribution, the fact that it includes a file named PKG-INFO makes that file part of the source, not merely some ephemeral metadata that should be wholly reconstructable from the contents of other files also shipped alongside it in the same sdist.

6 Likes

Sure, I wasn’t asking you to take sides; I was merely asking whether we can provide a recommendation on how backends can guarantee identical metadata when building from source distributions.

1 Like

Yes, it does seem like classic Hyrum’s Law. It’s unfortunate. Nobody is intentionally “breaking” builds and nobody is intentionally trying to hide “silent failures”. If they were easy and cheap to detect, good error messages would be everyone’s preference.

Once the dust has settled, would it help to define the “roundtrip semantics” for an “identity build”:

  • sdist → unpacked sdist → sdist
  • sdist → unpacked sdist → wheel
  • wheel → unpacked wheel → wheel

I realise we have that in the various PEPs, but to me as a bit of an outsider, it doesn’t feel cohesive in one place. It could entirely be my fault though.

Basically, what should backends do when they encounter PKG-INFO (unpacked sdist case) and METADATA (and RECORD?) (unpacked wheel case) and for whatever reason, the build tool is “repacking” them. The repacking might seem contrived, but humor me, maybe it’s some verification steps to ensure integrity of the build artifacts, or some source code audits or something.

My expectation would be (reminder NOT in patching mode here):

  • sdist → unpacked sdist → sdist = Should be a noop. Repack should happen and that’s it. Hashes would match, mtimes might change.
  • sdist → unpacked sdist → wheel = Normal sdist build caveats, but the resulting wheel metadata should be the same as the first (sdist → sdist) case
  • wheel → unpacked wheel → wheel = This would be a little odd to do outside patching. I’d be ok with saying it’s not valid or possible for identity build. If it was permitted it should be a noop. Repack should happen and that’s it. Hashes would match, mtimes might change.

Once that’s understood, breaking those cycles would be the “patching cases”. That might be that PKG-INFO and METADATA (and RECORD?) must be deleted or some other mechanism to break out of the behaviour of handling “distributions” and treating things as raw source again.

Might even be agreed eventually that patching should occur from “raw source” VCS clones or tarballs without any metadata etc.

I just want to reiterate that all of those would break Hatchling (and, I assume, setuptools) by default when using data from the VCS, unless the backend reads from PKG-INFO or adds hacky options to derive the metadata.

1 Like

I think we’re in violent agreement there tbh. The “roundtrip semantics” in my mind would help clarify the relationship between named and versioned artifacts (sdist and wheel) that are distributed with metadata and how backends should process them. Metadata Preserving mode.

It would also help clarify how to break out of Metadata Preserving mode for patching and into Metadata Creating mode. Which would probably mean deleting metadata or some yet to be agreed mechanism.

Doesn’t seem like it’s helping though. So I’ll withdraw from the details.

Possibly we should consider standardising .dist-info inside sdist? We already have API hooks for generating the metadata directory independently from building the wheel, and the build step is supposed to leave an existing metadata directory untouched, so perhaps having a convention for “here is the unpacked sdist and its [patched] metadata directory” would address this need?

Basically, instead of PKG-INFO we put *.dist-info in the sdist. And if it’s found when we’re building the wheel, we use it instead of regenerating it.
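A rough sketch of what a backend might do under such a convention (entirely hypothetical, since no standard currently places a *.dist-info directory inside an sdist):

```python
# Sketch of the hypothetical "*.dist-info in the sdist" convention: if a
# single pre-generated (possibly patched) dist-info directory is present,
# use its METADATA instead of regenerating. Illustrative only.
import glob
import os


def find_prepared_metadata(source_dir: str):
    """Return the METADATA path inside a *.dist-info dir, or None."""
    matches = glob.glob(os.path.join(source_dir, "*.dist-info"))
    if len(matches) == 1:
        metadata = os.path.join(matches[0], "METADATA")
        if os.path.isfile(metadata):
            return metadata
    return None  # no (or ambiguous) prepared metadata: regenerate it
```

This mirrors the existing `prepare_metadata_for_build_wheel` contract, where a frontend may hand the backend a pre-built metadata directory that the build step is expected to leave untouched.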

1 Like

First, I want to stress that I came here to find a solution. I do not want to get entrenched and don’t want to fight you. I am glad that the conversation is happening and I will try to step back before I say nobody or everybody next time. Let me try to answer your questions/comments.

You are absolutely right. Thanks for saying this. I’ll do my best to try to follow emerging standards and participate in their design discussions before they are accepted rather than after. Sometimes, I only know things impact us when they do.

Agreed, the patchers need to know what they’re doing. In an ideal world, they would know about all the standards as well as all the particularities of the tools they are using and/or patching. However, the fact is they often don’t. And when things break in an obvious way, they are forced to go and learn. Or, when we can detect a recurring pattern, we prevent them from shooting themselves in the foot. E.g. when they use setuptools_scm wrongly and end up with version 0.0.0 (the setuptools default, rather than an error), we can detect that and error out.

Again, I agree. The problem (as I see it) in this particular instance is not that things are changing. The problem is that when they do, it’s easy to miss it and do the wrong thing. I agree that if this is the way things work now, we need to educate the patchers that PKG-INFO matters. However, I think they should not learn about this by getting unexpected results.

In fact, if the sole result of this entire discussion is that we need to educate our patchers, so be it. But I want to make sure that if they are not educated, they are not doing the wrong thing without knowing it.

The new version of hatchling does not get pulled in without review. That’s why this discussion even started, because the review happened, the issue was found, and the update is currently blocked.

Obviously, it could have happened that we would have noticed this only after it was updated. In that case, we might have reverted the update. If there was a bug, we would report it, try to fix it, and we would have reverted the new version if the fix was not delivered fast enough. That’s exactly how/why I opened the original issue – I thought this was a bug. Unfortunately, it is more complicated than that.

We don’t “pin” – we use one version of hatchling that has been tested and is known to be well integrated. This brings new challenges and problems when different packages require different versions of (for example) hatchling, but we usually deal with that sort of problem quite well.

I want to find a solution to a problem. That is my sole motivation here. And I am glad you see the problem, I really am. I am not sure you fully understood the problem and I know that if you did not, it is my fault. I am trying to explain it the best I can.

I (speaking with my Fedora fedora on my head) would really be happier if the builds had simply started failing rather than producing wrong results. If the build fails and says “the metadata in pyproject.toml and PKG-INFO is not consistent”, there is an obvious way out for the Fedora packager. If they have been patching pyproject.toml for a while and suddenly their patches are ignored (without warning as well), they don’t know where to look for a solution.

I will try to get back to you with a data-based number, but off the top of my head, I would guess around a hundred.

The change was indeed complicated and required changes in the infrastructure as well as individual packages. Not all Fedora packages have been migrated to the new way yet. The good thing is that they don’t have to until it happens. There is a deprecation period (unlike this change in hatchling that has just happened) and when the period is over, their packages will fail to build, rather than produce incomplete/surprising results.

4 Likes

@hroncok Do you have an idea in mind that would both satisfy your use case and make it so backends can freely copy static fields in source distributions’ PKG-INFO file to the METADATA file in wheels to satisfy the ability to retain e.g. VCS metadata?

If we have no pragmatic way forward then I’m afraid the latter use case would be deemed more important than the former distro use case.

I mean no disrespect by that, because I truly understand that the situation for you currently sucks, but in my opinion we must prioritize how project maintainers themselves intend for distribution metadata to work.

1 Like

I’m guessing it’s too late to add a new Metadata field to 2.3?

Adding a “Metadata-Inputs-Checksum”(or similar/better name) to PKG-INFO could be a cheap way for backends to detect a mismatch and fail early.
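As a sketch of how such a field might work (the field name and helpers are invented for illustration, not part of any metadata spec):

```python
# Sketch of the hypothetical "Metadata-Inputs-Checksum" idea: the backend
# records a hash of pyproject.toml when writing PKG-INFO, and later compares
# it to detect that the inputs were patched, failing early instead of
# silently using stale metadata.
import hashlib


def inputs_checksum(pyproject_bytes: bytes) -> str:
    """Hash of the metadata inputs (here, just pyproject.toml)."""
    return hashlib.sha256(pyproject_bytes).hexdigest()


def check_pkg_info(pkg_info_fields: dict, pyproject_bytes: bytes) -> None:
    """Raise if pyproject.toml no longer matches the recorded hash."""
    recorded = pkg_info_fields.get("Metadata-Inputs-Checksum")
    if recorded and recorded != inputs_checksum(pyproject_bytes):
        raise RuntimeError(
            "pyproject.toml was modified after PKG-INFO was generated; "
            "delete PKG-INFO or regenerate the sdist"
        )
```

In this scheme an unpatched sdist round-trips cleanly, while a patched pyproject.toml turns the silent behaviour change into a noisy, explainable error.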

My preference for Flit is that sdists aren’t special, i.e. building from an unpacked sdist should be like building from any other source tree, using pyproject.toml and ignoring the presence of PKG-INFO. I’d expect the metadata to end up the same by design. Technically the package author can dynamically provide a different version number, but hopefully this is vanishingly rare.

I’d be open to reading PKG-INFO, checking the metadata we’re creating, and erroring out if it would be different. I certainly wouldn’t silently override the values we get from pyproject.toml + source code with ones from PKG-INFO.

(N.B. As far as I know, building from an unpacked wheel is not expected. A wheel is already built, and doesn’t even have pyproject.toml. You could of course re-zip the unpacked files, but that’s not something build backends need to implement. So we can just focus on unpacked sdists.)

4 Likes

On reflection, I think that the issue here around unpacked sdists is somewhat orthogonal to the intent behind Metadata 2.2. As far as the spec is concerned, the key is that if a tool is given a sdist, and it builds a wheel from it, then the resulting wheel will contain metadata that matches what’s in the PKG-INFO of the sdist. The spec has no interest in what happens if you unpack a sdist, modify it, and then build - it’s an accident of implementation that the sdist is unpacked by one tool (the frontend) and built by another (the backend) and it’s only that separation that even allows patching[1].

So the behaviour you describe for flit is perfectly fine - and in practice, you don’t need to check PKG-INFO because it simply reflects the fact that you cannot get different metadata without changing the source code. As the only fields that are allowed by flit to be dynamic in pyproject.toml are version and description, and those are picked up from the source code, everything can be static in PKG-INFO without needing a check (if the check failed, it would be a bug in flit).

For other backends, this may not be true, as they may allow more dynamic generation of metadata. But there’s a statement in PEP 643 which is relevant here:

Backends MUST NOT mark a field as Dynamic if they can determine that it was generated from data that will not change at build time.

Data that is read from pyproject.toml (i.e., not in the dynamic list in that file) must therefore be static in the sdist metadata, and must match the pyproject.toml value. Any discrepancy between a non-dynamic value in pyproject.toml and a value in PKG-INFO must therefore indicate that we are not building from a sdist[2].

Where that leaves me, in how I think of this, is that if we ignore patching for a moment, backends can reliably use either pyproject.toml or PKG-INFO to get values for metadata that is marked as “static” in the PKG-INFO file, as both approaches are guaranteed, by design, to give the same result.

When we consider patching, we are not “building from a sdist” in the PEP 643 sense, and the rules from PEP 621 therefore apply, which say

Data specified using this PEP is considered canonical. Tools CANNOT remove, add or change data that has been statically specified. Only when a field is marked as dynamic may a tool provide a “new” value.

So pyproject.toml takes precedence.

So for static data (see later for dynamic) backends can always safely use pyproject.toml as the canonical source. They can only use PKG-INFO safely if they know the sdist hasn’t been patched. The most reliable way of checking for patching is to ensure that the values in pyproject.toml and PKG-INFO match.

(As an implementation note, I’ll comment that reading static data from pyproject.toml is likely to be no slower, and possibly marginally faster, than reading it from PKG-INFO just because TOML is an easier format to parse than RFC822, and there’s no calculation to do for static data).

Dynamic data is different, though. If a field is marked as dynamic in pyproject.toml and dynamic in PKG-INFO, it has to be recomputed at build time. It’s hard to see how anything else is even possible. The complicated case is when the field is marked as dynamic in pyproject.toml but static in PKG-INFO. In that case, PKG-INFO is acting as a “frozen” value holding the result of the dynamic calculation done at sdist build time, and Metadata 2.2 explicitly says that wheels must be built using that frozen value.

What that means is:

  1. If a field is marked as “dynamic” in pyproject.toml but “static” in PKG-INFO
  2. And the build backend cannot guarantee that recalculating will give the same value, assuming none of the source code has changed

Then, and only then, is the backend required to give PKG-INFO precedence.
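That decision procedure can be sketched per field as follows (an illustrative helper reflecting the reasoning above, not taken from any backend):

```python
# Sketch of the per-field precedence rule: the flags mirror the "dynamic"
# markers in pyproject.toml and PKG-INFO; the return value names which
# source of metadata the backend should use.
def metadata_source(dynamic_in_pyproject: bool,
                    dynamic_in_pkg_info: bool,
                    recomputation_is_stable: bool) -> str:
    if not dynamic_in_pyproject:
        # Static in pyproject.toml: PEP 621 makes that value canonical.
        return "pyproject.toml"
    if dynamic_in_pkg_info:
        # Dynamic in both places: must be recomputed at build time.
        return "recompute"
    # Dynamic in pyproject.toml but frozen static in PKG-INFO: only when the
    # backend cannot reproduce the value from sources alone does the frozen
    # PKG-INFO value take precedence (the Metadata 2.2 requirement).
    return "recompute" if recomputation_is_stable else "PKG-INFO"
```

The VCS-derived-version case mentioned below is exactly the last branch: the computation depends on data (tags) that isn’t part of the source tree, so the frozen PKG-INFO value wins.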

The only case I can see when this might happen is if a backend is calculating the version from VCS metadata like tags, as that data isn’t part of the source code. And that’s a fairly well-known case and tools seem to already have solutions in place (such as environment variables) to replace the VCS data if it’s not available.

The above is my analysis based on re-reading PEPs 643 and 621, and based on my intent as PEP 643 author. For now, I’m offering it simply as a personal interpretation, but I’m confident enough in it that I would be comfortable converting it into a formal pronouncement on the intended behaviour if people wanted me to (assuming, of course, that no-one was able to demonstrate a flaw in my reasoning to me).

I’m sorry, @ofek, but this means that while hatchling’s new behaviour isn’t in violation of Metadata 2.2, it is in violation of PEP 621 (something that Eli Schwartz mentioned on the issue that triggered this discussion, but which had never been brought up here, and which I hadn’t considered the implications of until now).

In terms of patching, it means that patching static metadata in pyproject.toml is allowed, and the patched data should be respected. But when patching dynamic data, patchers must take care to ensure that there isn’t a static value in PKG-INFO that would override the patching.

If anyone has any issues with the above analysis, please flag them - I’m not trying to stop the debate by posting this, just to collect my thoughts on where the discussion so far has led us.


  1. Unpacking a sdist, patching it and then repacking it into a sdist is a completely different matter, which isn’t relevant here ↩︎

  2. We might be building from a patched sdist, but that’s not “building from a sdist” in the sense PEP 643 uses the phrase ↩︎

5 Likes