For background and context, I have to say that I don’t recall the question of someone modifying or patching a sdist coming up in the discussions on PEP 643. If it had, I’m pretty sure my response would have been “we don’t typically support modifying sdists after they have been built, you should unpack the sdist and delete PKG-INFO, and then rebuild the sdist when you’re done (and build wheels from the rebuilt sdist)”. And that response would have been included in the PEP. Would the PEP have been rejected over that point? I can’t say. And it’s largely irrelevant anyway now, as the PEP as written is what got accepted.
But while I understand the point being made about this not matching people’s pre-existing expectations, I do think that “if you unpack a sdist to modify it, you should delete PKG-INFO” is a fairly simple change to how people should think about unpacking a sdist. The complicated part is getting people to understand that a sdist isn’t simply an archive of the source repository, but that’s been an issue for a long time[1].
And on the other point, I thought people debugging an issue just edited the installed code in place? I know I do.
See previous debates over whether sdists should contain tests, or documentation sources, or CI configuration files, or … ↩︎
Via standards, no. Standards are slow to produce, and even slower to get adopted. That’s pretty much by design, to ensure that all the potential issues are thought through (as much as possible - this thread demonstrates that even so, things can still get missed).
Whether to change how hatchling behaves is for @ofek to decide, but the existing standards (and established practices) support the current behaviour.
Honestly, the only solution that’s realistically available right now is for you to change your process. Would it help if I offered to write a small tool to validate whether an unpacked and patched sdist generated an unchanged PKG-INFO? It wouldn’t be complicated - little more than “read and then delete PKG-INFO, build a sdist, compare the new PKG-INFO with the old one” - but if it makes incorporating such a check into your process any easier, I’m willing to do so.
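For what it's worth, the comparison half of such a tool is only a few lines. Here is a rough sketch (hypothetical helper names, not an existing tool; it assumes you have already rebuilt the sdist and extracted both PKG-INFO files as text):

```python
from email.parser import HeaderParser

def pkg_info_fields(text):
    """Parse PKG-INFO text into {field name: sorted list of values}."""
    msg = HeaderParser().parsestr(text)
    # Fields like Classifier can repeat, so collect every occurrence.
    return {key: sorted(msg.get_all(key)) for key in set(msg.keys())}

def pkg_info_diff(original, rebuilt):
    """Return the names of metadata fields that differ between two PKG-INFO files."""
    old, new = pkg_info_fields(original), pkg_info_fields(rebuilt)
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
```

The "rebuild the sdist" half would just shell out to something like `python -m build --sdist` on the tree with PKG-INFO removed, then compare the old and new PKG-INFO with `pkg_info_diff`.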
That’s what I’ve been assuming, and I can’t think of any case in which that wouldn’t be true - unless a project literally has that filename alongside pyproject.toml and they choose to include it in the source distribution. Perhaps we can officially say somewhere that doing so is against the spec.
Our tooling is backend-agnostic. A solution that works for hatchling is not good enough, unfortunately.

Pragmatically, if there is a way to tell any build backend we want it to force-recalculate the metadata, that would work for us (it’d not prevent others from getting bitten, but I don’t have the energy to try to find a solution for everybody).
As a counterexample, PBR examines the tree looking for a PKG-INFO file or Git metadata in order to determine the package version. Attempting to rebuild the unpacked sdist of a PBR-using project after deleting its PKG-INFO would result in failure (unless a desired version string is supplied in an exported environment variable, the solution generally recommended to downstream distro package maintainers).
This whole discussion seems to drift in a bit of a worrying direction, based on what looks to me like a misunderstanding of the intent of PEP 643 and a technically compliant but unusual implementation choice in hatchling.
@pfmoore it looks like you wrote both this comment and all of PEP 643 having in mind distributed artifacts on PyPI only. Is that right? If so, the spec makes sense - requiring PKG-INFO and METADATA in wheels to be consistent and reliably marked as static/dynamic is clearly useful. And since PyPI exposes the metadata in its API, it matters.
If you really meant “all wheels”, then I’m not quite seeing why that’s useful or reasonable. For one, distros adding patches to make things work for them is common and expected - and should not require changing the version number. For another, disabling build isolation is widely used to build against dependencies other than what may be pulled in by setting up an isolated build env. There are many reasons for doing so, and it can result in different content of wheels - and I don’t think any backend appends local version identifiers in this case (pip install mypkg --no-build-isolation results in the same tags).
I don’t think that is necessary. The equivalence can be enforced by design. I think a common design (the most common?) is this:
Build wheel from either VCS checkout (source tree) or sdist: read pyproject.toml metadata, write METADATA
Since the code for reading in metadata is always the same, the metadata that ends up in PKG-INFO/METADATA will be consistent (this can be checked in a backend’s test suite) - unless you’ve modified pyproject.toml or done some other patching, of course, or the user used some backend-specific flag to change what ends up in METADATA. In all of those cases, I think you want the modified metadata. hatchling ignoring the modifications by reading PKG-INFO instead of pyproject.toml is what leads to the surprise here.
I think that having completely different code paths for building from sdist vs. building from source tree is (a) more complicated to implement, and (b) increases the chances of the source->sdist->wheel and source->wheel paths not giving the same end result when making releases to PyPI.
It seems useful to lay out what all the different backends do. So far there appear to be two designs:
Design A:
build_sdist: pyproject.toml → PKG-INFO
build_wheel: pyproject.toml → METADATA
Design B:
build_sdist: pyproject.toml → PKG-INFO
build_wheel from source tree: pyproject.toml → METADATA
build_wheel from sdist: PKG-INFO → METADATA
From the discussion so far, hatchling and pymsbuild seem to use B. meson-python uses A. My expectation was/is that most other backends use A, because the issue @hroncok is running into now would have come up a lot earlier otherwise.
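The difference between the two designs boils down to a single branch in the wheel build path. As a toy sketch (not any real backend's code; the PKG-INFO existence check is the heuristic backends appear to use for "this is an unpacked sdist"):

```python
import os

def choose_metadata_source(source_dir, design):
    """Return the file a backend would read wheel metadata from.

    Design "A": always recompute from pyproject.toml.
    Design "B": if PKG-INFO exists (heuristic for "unpacked sdist"),
                reuse it; otherwise fall back to pyproject.toml.
    """
    pkg_info = os.path.join(source_dir, "PKG-INFO")
    if design == "B" and os.path.exists(pkg_info):
        return pkg_info
    return os.path.join(source_dir, "pyproject.toml")
```

Under Design B, patching pyproject.toml in an unpacked sdist has no effect on the wheel's metadata, which is exactly the surprise being discussed.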
Agreed, that seems like a useful thing to add to the spec.
I think things would be a lot easier if the answer here was “no, do not read from PKG-INFO”. I’m not sure what potential issue that causes in hatchling when the dynamic metadata logic is used as part of a release process though. Do you have a description somewhere?
For context, the main reason for that PEP is so resolvers can use metadata from source distributions directly without having to go through the build process.
This isn’t actually relevant to the distro patching case, because distros set up their build envs separately, and I can’t think of a case where they’d want to invoke a resolver from a build frontend - they will always use --no-build-isolation for pip or --no-isolation for build.
The trouble with untangling that is that a build backend doesn’t know whether the build is isolated or not. If it knew, then “don’t read PKG-INFO for non-isolated builds” would be a potential solution here - but that can’t really work.
I don’t know what the best way forward is, but +1 for considering the need for patching explicitly in the design. This is important to distros, and we actually even have to do this in the release process to PyPI for numpy/scipy (we patch in license content for shared libraries vendored in by auditwheel & co; that doesn’t belong in the sdist but is required content for wheels - hoping to do that more cleanly once we have PEP 639).
Given that PEP 643 / Metadata 2.2 introduces “dynamic” for sdist AND the only field that can’t be dynamic in both pyproject.toml and an sdist is “name”, would it be possible to patch the name?
The build backend can then use this mismatch between name in pyproject.toml and PKG-INFO to trigger recalculation of metadata. It would be tidier if it was version, but alas, that value can be dynamic.
My comments (and the PEP) are framed in terms of distributed artifacts, yes. But not just on PyPI.
The key issue here is that the primary point of PEP 643 is to ensure that metadata consumers (specifically installers) can reliably optimise by not running a build step in order to know the metadata for a candidate. So the key point is that if you have a sdist for foo-1.0, and you read PKG-INFO and find that the dependencies are defined as static, then you can be sure that building that sdist won’t result in a wheel that has different dependencies (which would break the resolution you determined from the sdist).
A further wrinkle is that if pip is using a set of indexes where there are multiple sdists, all for foo-1.0, then it can (and does) choose one in an essentially arbitrary manner. This is what leads to the assumption that “all sdists for a given package name and version must be equivalent”. It’s not a standard as such, it’s simply a practical consequence of the fact that if you have two files, both called foo-1.0.tar.gz, with no rule to choose between them, then the only way to get deterministic results is if you ensure that those two files are functionally identical. (Some tools offer ways to distinguish, like index priorities, but this is not standardised, it’s just an implementation detail). If the two foo-1.0.tar.gz files will never appear in the same installer invocation, the question of whether they are functionally equivalent will never arise.
I meant “all wheels”, in the sense of “all wheels built from a given sdist following standards-compliant build processes”. That essentially means “running a build backend”. The problem here is whether the following process can be considered (as a whole) as “building a wheel from the sdist”:
Unpack the sdist.
Change the metadata stored in the sdist.
Build the wheel using a build backend.
Personally, I don’t think that counts as simply “building the wheel from the sdist”. It’s a different process (“building a patched wheel” if we want a term for it, maybe) and PEP 643 doesn’t have anything to say on that process. Maybe I was too strong in saying it was “against the spirit” of the PEP - let’s just say the PEP doesn’t apply in that situation. But as a consequence, just like any other process that doesn’t follow interoperability standards, it may not interoperate well with tools that do follow those standards.
This is somewhat off-topic, though, as it’s unrelated to whether hatchling is allowed to do what it’s doing.
Of course you can build wheels that don’t have the same metadata as the PKG-INFO file in the sdist specifies. Those wheels can’t have been “built from the sdist” in the sense of running a build backend directly against the sdist, though (that’s what the metadata 2.2 specification asserts). But if some tool ignores the wheel and builds the project from the sdist (for example, pip with --no-binary :all:) then you have no right to be surprised when you don’t get the metadata that’s in the wheel. Again, this isn’t anything to do with the spec. It’s just how installers work - they pick an artifact to install, and then install it. They don’t have any information about the provenance of the artifacts.
I don’t necessarily disagree here. But those are all implementation questions for build backends, and not something that existing standards cover. In particular, PEP 643 has nothing to say on the behaviour of source tree → wheel builds. There’s currently no specification for distinguishing between an “unpacked sdist” and a general source tree, so backends need to use heuristics to decide if something is an “unpacked sdist” to which the metadata 2.2 rules apply - and the presence of a PKG-INFO file seems like a reasonable, but not perfect[1] heuristic. Maybe there should be a new build backend hook which allows frontends to pass a sdist, still as a single archive file, to the backend and get the backend to do the unpacking. Installers could then use that to ensure PEP 643 semantics. But that’s speculation, as such a hook doesn’t exist right now.
I suspect this is actually because most other backends haven’t yet implemented metadata 2.2 support. Remember that up until a week or two ago, sdists containing metadata 2.2 couldn’t be uploaded to PyPI.
I’m not entirely sure what precisely is being proposed, but by all means suggest some words to add, and we can consider them. I assume the change would be to the “source trees” specification in PEP 517 (which hasn’t as yet been transferred to the specification area on packaging.python.org, as far as I know), to say that a “source tree” additionally must not have a PKG-INFO file alongside pyproject.toml, unless it’s an unpacked and unmodified copy of a sdist? Or something like that?
From the specification point of view, the key thing is that in the case of building from a sdist using pyproject.toml and using PKG-INFO must be equivalent, so the build backend is free to do whichever it chooses. I imagine “use PKG-INFO” is easier in terms of not having to separately enforce the consistency guarantees from the spec, but how complex that is to do will depend on the backend.
+1 from me on considering patching, not just in terms of metadata 2.2, but across the whole ecosystem. Like you say, it’s an important use case. But to do that means that people involved in patching scenarios need to raise the issues when the relevant standards are being discussed. We can fix things after the fact, but doing so will likely involve extra standards, transition plans, etc. So much better to get things right first time.
So you’re saying that PBR is unable to build a source tree that isn’t a sdist or a VCS checkout? I can imagine a number of cases where that might cause issues - obviously they aren’t things the PBR user base tend to do, otherwise you’d have seen bug reports by now, but it sounds like even without metadata 2.2, this has the potential to trigger the same issue with patching that we’re talking about here (maybe PBR only does this for the version, which is less likely to be something people want to patch?)
I’m not sure that “must not involve changing or deleting PKG-INFO” is a reasonable constraint to demand here. I understand and accept that the need to update PKG-INFO is surprising, but surely it’s no more so than any other case where you need to modify generated metadata in a file that you’re patching? If you patch files in a wheel, you need to update RECORD to reflect that - how is this so different? Other than being new, of course.
IMO, the recommendation for patching sdists which contain metadata version 2.2 or later should be that if you patch metadata content in the pyproject.toml file, you need to update PKG-INFO. You can either do this by changing the value to the patched value, or by removing the patched item and adding it to the Dynamic list. Either of these approaches will result in the patched data being used.
Alternatively, you can delete the PKG-INFO file, converting the unpacked sdist back into a bare source tree, but if you do that you must ensure that any backend-imposed constraints on the form of the resulting source tree are respected (the example here being that PBR only supports non-sdist source trees that are git checkouts).
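The second option (removing the patched item and declaring it Dynamic) could be scripted roughly like this using the stdlib email module, which handles the PKG-INFO header format (a hypothetical helper; note that under Metadata 2.2 fields like Name and Version cannot be declared Dynamic in a sdist, so this only applies to the other fields):

```python
from email.parser import Parser

def mark_dynamic(pkg_info_text, field):
    """Drop `field` from PKG-INFO and list it as Dynamic instead."""
    msg = Parser().parsestr(pkg_info_text)
    del msg[field]          # removes every occurrence of the header
    msg["Dynamic"] = field  # append a Dynamic entry for it
    return msg.as_string()
```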
> So you’re saying that PBR is unable to build a source tree that isn’t a sdist or a VCS checkout? I can imagine a number of cases where that might cause issues - obviously they aren’t things the PBR user base tend to do, otherwise you’d have seen bug reports by now, but it sounds like even without metadata 2.2, this has the potential to trigger the same issue with patching that we’re talking about here (maybe PBR only does this for the version, which is less likely to be something people want to patch?)
Well, yes, a (probably the) primary reason projects use PBR is so they can avoid committing a version number into a file in their project’s Git repository, allowing it to be inferred from the repository state at build time. In order to support building from unpacked source tarballs in situations where the Git repository itself is unavailable, PBR requires the unpacked tarball to be from an existing sdist (or technically to have the sdist’s metadata present) in order to find the stored version number.

Version numbers are, in fact, one of the most likely things to be patched by downstream distro package maintainers, usually because they want to adjust the versions to reflect changes made to their builds of packages. For this purpose, they can override all the above logic by exporting an environment variable which contains the version string they want to be used by the build backend.

And yes, we do get bug reports from time to time because someone is trying to build from a “GitHub tarball” of a project’s repository, which contains neither the Git state nor sdist metadata. We just make it very clear that workflow is wholly out of scope, and honestly I don’t see any way that could be made to work (aside from the envvar escape hatch of course), since using an unbuilt file tree from a Git repository without any Git context provides no usable information to base a version number on unless that version is committed to the repository’s file tree, which is exactly what projects relying on PBR are trying to avoid in the first place.
So I guess that distros who currently patch the version number of projects are already used to having to edit PKG-INFO when working with PBR-managed projects. That’s interesting to know.
> So I guess that distros who currently patch the version number of projects are already used to having to edit PKG-INFO when working with PBR-managed projects. That’s interesting to know.
To my knowledge, they just export an environment variable from their package building scripts telling PBR what version number they want their package to use, and any existing PKG-INFO version field or Git repository context will be ignored if present. Distro packages typically have their own metadata with separate version info in them anyway, and situationally-appropriate wrapper scripts which invoke the software’s build process in whatever way that project expects.

My original point was, there is no one-size-fits-all build process downstream distro package maintainers can assume; these sorts of details will depend on the build backend’s behavior. Just assuming you can unpack an arbitrary sdist, delete PKG-INFO, and rebuild solely from the contents of pyproject.toml is going to be wrong in at least some cases (more like hundreds or perhaps thousands), particularly when pyproject.toml doesn’t include all the necessary context for things like the package version.
FYI setuptools-scm/hatch-vcs does not have that problem because there is a parentdir_prefix_version option that takes care of such cases.
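For readers unfamiliar with it, my understanding is that the option lives in the `[tool.setuptools_scm]` table; the exact spelling below is from the setuptools-scm docs as I recall them, so double-check before relying on it:

```toml
[tool.setuptools_scm]
# With no VCS metadata and no PKG-INFO, fall back to the parent
# directory name: an unpacked "mypkg-1.2.3/" tree yields version "1.2.3".
parentdir_prefix_version = "mypkg-"
```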
That’s an interesting workaround I hadn’t seen before (introduced in 4.0.0 a few years ago), and should work as long as the distro’s package build process doesn’t also supply the name of the directory to unpack into or rename the directory before trying to build the package.

In the case of PBR, the lack of Git context or sdist metadata also serves as a canary indicating that other generated files are also likely to be missing (e.g. the AUTHORS file which the project’s license may mandate inclusion of in source tarballs).
You’ve convinced me. I agree. In my head I was trying to bend things around PBR sniffing PKG-INFO, but I think the parentdir_prefix_version approach that setuptools-scm/hatch-vcs is taking would probably make more sense. Or PBR can choose not to support Metadata 2.2 as well, I guess. No tools need to support the new standards, but it’s obviously better for the long-term health of the ecosystem if they do choose to do so.
It doesn’t have to not support metadata 2.2, it can simply not support patching sdists without making the necessary patches to PKG-INFO. The situation is no different than with hatchling (or any other backend that chooses to read PKG-INFO when building a wheel from the sdist).
The double negatives are a bit difficult for me to parse, but yes, same page I think.
Reframing:
PBR can decide what they want to support. Current state is they support Metadata < 2.2 and sdist patching without any modification or deletion of PKG-INFO. Regeneration just happens always.
If they decide to support Metadata >= 2.2 they’ll need to decide to what extent patching by distros is supported and what that looks like (deleting PKG-INFO and sniffing parentdirs, patching PKG-INFO fields, something else, etc).
Similar for the other backends. There isn’t a one-size-fits-all.

The closest to one-size-fits-all so far would be to patch PKG-INFO whenever you patch pyproject.toml, though that’s marginally more effort. The bluntest and quickest, though it won’t work for PBR at the moment, is to delete PKG-INFO. There’s also no way to detect when a patcher doesn’t do this, so silent “failures” can occur, but patching an sdist is a power-user operation.
Thanks for the detailed answer. This key point makes perfect sense and is valuable indeed.
Perhaps it would be useful to add the intended consumers, since that is currently left unspecified. I hadn’t considered build backends as consumers. The two types of consumers that really do need it are (I think):
PyPI, for serving this metadata through its API
Build frontends, for determining runtime dependencies without having to potentially build a wheel first
Are there more?
Sure, according to the standards as written this is allowed. The question is whether it’s a good idea, since it immediately leads to the problem that started this thread. If the implementation would be:
Read from pyproject.toml just like it would do for a source tree
Read PKG-INFO to validate equality of metadata only, and raise an error with an informative message if metadata isn’t matching.
Then that would be more robust and distro-packager friendly.
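As a sketch of what that validation step might look like inside a backend (hypothetical code; a real implementation would compare the complete metadata, including multi-value fields):

```python
from email.parser import HeaderParser

def validate_against_pkg_info(computed, pkg_info_text):
    """computed: dict of field name -> value the backend derived from pyproject.toml."""
    stored = HeaderParser().parsestr(pkg_info_text)
    # Fields listed as Dynamic in the sdist are allowed to differ.
    dynamic = {d.lower() for d in stored.get_all("Dynamic") or []}
    for field, value in computed.items():
        if field.lower() in dynamic:
            continue
        if stored.get(field) != value:
            raise ValueError(
                f"{field!r} in PKG-INFO ({stored.get(field)!r}) does not match "
                f"pyproject.toml ({value!r}); if you patched the sdist, "
                "update or delete PKG-INFO"
            )
```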
I suspect that you are excluding all uses of --no-build-isolation here as well?
Sure, agreed. We don’t need a standard for absolutely everything to make progress though. It seems like hatchling could revert its recent change in preference of the design I sketched above. That would avoid the problem, and we could write a recommendation for build backend implementers in line with that (can be but doesn’t have to be a standard).
That seems unnecessary, at least at this point. I’d strongly prefer to not add more hooks, the cost of that is quite high.
I’m not sure since we haven’t started implementing yet, but my expectation after reading PEP 643 is “ask pyproject-metadata to produce 2.3 metadata, write some tests to verify we’re already compliant, and be done” (and the open PR for support in pyproject-metadata has a very small diff too; PEP 643 really doesn’t ask for too much that is new). It didn’t even occur to me that a backend author would want to not read from pyproject.toml at all - which is why I asked for a concrete example higher up.
Only “unpacked” not “unmodified”, but yes indeed, something like that. Is a PR to the PEPs repo and a separate thread for visibility the way to go here?
A note on the “patching PKG-INFO too” suggestions: that is more difficult because it’s a generated file; there is a higher chance of the patch going out of date, and it makes it hard to apply a patch independently of whether the input to the build is an sdist or a VCS tag (those two are usually treated completely interchangeably by distro tooling). So I think it’s preferable not to have to do that; using PKG-INFO metadata for verification only would avoid the need.
> PBR can decide what they want to support. Current state is they support Metadata < 2.2 and sdist patching without any modification or deletion of PKG-INFO. Regeneration just happens always.

> If they decide to support Metadata >= 2.2 they’ll need to decide to what extent patching by distros is supported and what that looks like (deleting PKG-INFO and sniffing parentdirs, patching PKG-INFO fields, something else, etc).
I’m probably lost, but I don’t see how that’s a lack of support for Metadata >= 2.2. PBR supports building from a Git checkout, from an existing sdist, or from a bare source tree with version info supplied in an environment variable. Metadata 2.2 doesn’t place requirements on how you handle bare source trees anyway, does it?

You could say the same about setuptools-scm, except that it also supports building from a bare source tree with version info supplied by a specially-formatted parent directory name (which PBR could add too, though I’m unconvinced it would be a wise choice since it could lead to people unknowingly distributing legally incomplete source code).