Enforcing consistent metadata for packages

One of the longer-term goals in the packaging ecosystem has been a move towards statically defined package metadata, with the ultimate intention that tools can read package metadata without any need to execute Python code. With the adoption of PEP 621, for defining metadata in pyproject.toml, and PEP 643, for allowing projects to record static metadata in sdists, we are now in a position where most projects can reasonably expect to define metadata statically.

The next big step will be to ensure that all artifacts for a given version of a project have consistent metadata. Once we can do this, it will be possible to simplify processes around resolving sets of requirements quite significantly. In fact, even though it is not currently guaranteed, some tools like Poetry and PDM that produce lock files already assume consistent metadata, to make the problem tractable, and have encountered very few issues as a result.

Why bother?

Being able to assume that all files associated with a particular version of a package have the same metadata will simplify a lot of processes - resolution algorithms, generation of lockfiles, package analysis, etc. Many of these work right now, but they either have to do significant extra work to deal with the possibility of inconsistent metadata, or they simply fail if they find their assumption of consistency is invalid. The result is extra maintainer work, and unnecessarily fragile tools.

What’s the problem?

The reason we can’t simply declare that metadata must be consistent for a given project version is that in order to make a useful standard, we need to address the various edge cases that might come up. So the point of this post is to give people a chance to publicly describe possible situations where a rule that “all artifacts for a given version of a project must have the same metadata” would cause issues.

At the moment, there’s no plan or timescale for implementing a rule like this. The point of this post is simply to collect information to inform such a plan - it’s notoriously hard in the packaging ecosystem to find out how people are pushing the limits of “common practice”, except by implementing something and seeing what breaks. If we can get some discussion of this topic, my hope is that we can spot the issues in advance.

I’ll start with some cases that have come up recently.

Enforcing consistency

The first problem is probably the most fundamental. How do we even enforce such a consistency rule? With sdist metadata, we already have the means to state that every wheel built from a given sdist will have the same metadata (by marking every field as static). To make it universal would mean deprecating, and ultimately removing, the ability to have fields in a sdist marked as dynamic. Is this sufficient? Do we need a further rule that all sdists for a given project version must have the same metadata? Is it even meaningful to talk about multiple sdists for a project version?
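To make that a bit more concrete, here is a rough sketch (standard library only, and assuming the conventional <name>-<version>/PKG-INFO layout inside the sdist) of how a tool might check whether an sdist promises fully static metadata:

    import tarfile
    from email.parser import BytesHeaderParser

    def dynamic_fields(sdist_path):
        """Return the metadata fields this sdist declares as Dynamic (PEP 643).

        An empty list, together with Metadata-Version >= 2.2, means every wheel
        built from the sdist must carry exactly the metadata recorded in PKG-INFO.
        """
        with tarfile.open(sdist_path) as tf:
            member = next(m for m in tf.getmembers()
                          if m.name.count("/") == 1 and m.name.endswith("/PKG-INFO"))
            headers = BytesHeaderParser().parse(tf.extractfile(member))
        return headers.get_all("Dynamic") or []

Deprecating dynamic would then amount to requiring that this list always be empty.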

Tools like pip already assume that any two artifacts (wheels or sdists) with the same filename are functionally identical, for practical reasons. Maybe we should just make that assumption official?

Visibility of files

Consistency only matters in the context of a tool using the artifacts. So if I edit the metadata of a wheel, but never publish it, and never use it, my actions have no impact. What this means is that in practice, a standard for consistent metadata only applies to sets of files presented to a tool for consideration.

How do we state such a constraint without making it the user’s responsibility to check every file? It doesn’t seem unreasonable for a user to expect a package index like PyPI to only serve conforming packages, so do we need to make it a requirement for indexes to enforce consistency? But what about private directories (accessed via options like pip’s --find-links)? How do we make it reasonable for maintainers of such directories to ensure the rules are followed?
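As a very rough sketch of what "checking" such a directory could even mean (reading each wheel's METADATA straight out of the archive and grouping by name and version - the single field compared here is just an example):

    import zipfile
    from collections import defaultdict
    from email.parser import BytesParser
    from pathlib import Path

    def check_directory(path):
        """Group wheels by (name, version) and flag any that disagree on Requires-Dist."""
        seen = defaultdict(set)
        for wheel in Path(path).glob("*.whl"):
            with zipfile.ZipFile(wheel) as zf:
                metadata_name = next(n for n in zf.namelist()
                                     if n.endswith(".dist-info/METADATA"))
                md = BytesParser().parsebytes(zf.read(metadata_name))
            key = (md["Name"], md["Version"])
            seen[key].add(tuple(sorted(md.get_all("Requires-Dist") or [])))
        return {key: deps for key, deps in seen.items() if len(deps) > 1}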

Installed packages

When a package is installed, the metadata from the wheel is stored in the environment’s site-packages, as per the installed package metadata spec. This metadata needs to be consistent with other sources of metadata for that version of the project, for exactly the same reasons that sdist and wheel metadata need to be consistent.
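(For reference, the installed copy is as easy to inspect as the published files - a minimal sketch using the standard library, with requests purely as a hypothetical example package:

    from importlib.metadata import metadata

    md = metadata("requests")  # reads METADATA from the installed .dist-info directory
    print(md["Version"], md.get_all("Requires-Dist"))

so comparing installed metadata against published metadata is no harder than comparing two wheels.)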

Patching sdists

Linux distributors routinely patch sdists and build their system packages from the resulting wheels. This patching will, by design, violate the consistency rules we’re discussing. How do we handle this?

Specifically, an installed package must also have consistent metadata, and if that installed package is a distro-packaged version of a Python project, the distro’s patches could violate the consistent metadata requirement.

Is a statement of intent sufficient?

Given all of the above, and any other cases that may come up in the subsequent discussion, is it even worth trying to come up with an enforceable standard? As I already mentioned, tools like Poetry and PDM are currently managing fine assuming consistent metadata, and pip has made similar assumptions for years around interchangeability of files with the same name.

Maybe rather than trying to write a strict standard, all we really need is a formal statement that such assumptions are allowed and expected, and situations where they do not apply will no longer be required to be supported by the Python packaging ecosystem.

I’d be interested in hearing people’s views on this subject, particularly people who are working in areas where metadata consistency is already an issue, such as backend developers, maintainers of tools that produce lockfiles, Linux distro packagers, and maintainers of projects who cannot reasonably publish 100% consistent metadata.

15 Likes

Now that we have static metadata files available for all wheels on PyPI, maybe a good starting point would be to analyze them across releases/projects to see if/when they differ? That might reveal some use cases.

4 Likes

I had the same idea.

I find no interesting variation present in the top 100 wheels, per https://pastebin.com/6wbUnhWg

(The cases I looked at where the metadata files differ but show no “interesting” variation involve things like different line endings for metadata generated on different platforms.)

Obviously that is only 100 packages, and there is room to quibble about what counts as “interesting”; feel free to use this as a starting point for further exploration.

I planned on having a go at doing that, but downloading all of the files is a non-trivial exercise[1], which I haven’t had time for yet.

I’d look at semantic differences, parsing the metadata files into actual structures, and checking for differences in the parsed data. This is harder (because there is data on PyPI that isn’t cleanly parseable) but more useful. The key thing, though, is what @dustin said - we’re not just looking for inconsistencies (we know from the experience with PDM and Poetry that the data is basically consistent for practical purposes); we need to find out why any inconsistencies exist. This might simply be accidental (although these days it takes actual effort to generate inconsistent data, so I imagine that will be rare), or it might indicate a genuine use case that needs looking into. But such cases are unlikely to be in popular wheels, again because of the PDM/Poetry experience. (I may be putting too much reliance on PDM/Poetry here - the truth, and the key point of my post, is that I simply don’t know for sure).
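For what it’s worth, the sort of check I have in mind would be something along these lines (a sketch only - it relies on the PEP 658 metadata files that PyPI now serves via the JSON Simple API, and the handful of fields compared is just an example):

    import json
    from email.parser import BytesHeaderParser
    from urllib.request import Request, urlopen

    FIELDS = ["Requires-Dist", "Requires-Python", "Provides-Extra"]  # example selection

    def metadata_by_file(project):
        """Yield (filename, version, parsed headers) for files with published metadata."""
        request = Request(f"https://pypi.org/simple/{project}/",
                          headers={"Accept": "application/vnd.pypi.simple.v1+json"})
        with urlopen(request) as resp:
            index = json.load(resp)
        for f in index["files"]:
            if f.get("core-metadata"):  # PEP 658: metadata served at <file URL> + ".metadata"
                with urlopen(f["url"] + ".metadata") as md:
                    headers = BytesHeaderParser().parse(md)
                yield f["filename"], headers["Version"], headers

    def inconsistencies(project):
        """Return the versions whose files disagree on any of the tracked fields."""
        per_version = {}
        bad = set()
        for filename, version, headers in metadata_by_file(project):
            fields = {name: sorted(headers.get_all(name) or []) for name in FIELDS}
            if per_version.setdefault(version, fields) != fields:
                bad.add(version)
        return bad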

I will say that I don’t have much time for investigation right now, so I’m hoping that people with more experience than me with the issue will take this as a chance to get involved.


  1. not hard, just fiddly, especially if I want to avoid the situation where a failure in the middle means I have to start over ↩︎

1 Like

Yes, that is what I did.

Going deeper than 100, there are still very few cases of real variation, but one “real”-looking case comes from apache-beam. There they want to avoid requiring pyarrow on 32-bit Windows, which seems to be something that is hard - impossible? - to express with PEP 508 markers.

6 Likes

Going considerably deeper, per https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json, I find only the following packages with meaningful variation in their metadata files at the latest release:

tensorflow 2.16.1
apache-beam 2.55.0
onnxruntime 1.17.1
gensim 4.3.2
ray 2.10.0
lmdb 1.4.1
onnxruntime-gpu 1.17.1
mediapipe 0.10.11
pysam 0.22.0
tensorflow-cpu 2.16.1
open3d 0.18.0
embreex 2.17.7.post4
nagisa 0.2.11
nmslib 2.1.1
pybluez 0.23
pyodps 0.11.5.post0
cmreshandler 1.0.0
nlopt 2.7.1
py-sr25519-bindings 0.2.0
xdis 6.1.0
vispy 0.14.2
xpress 9.3.1
spark-parser 1.8.9
vowpalwabbit 9.9.0
pymem 1.13.1
magicinvoke 2.4.6
mosek 10.1.28
uncompyle6 3.9.1
aspose-words 24.3.0
py-bip39-bindings 0.1.11
py-ed25519-zebra-bindings 1.0.1
pyresample 1.28.2

One caveat: this approach does not catch inconsistency in packages that have only uploaded sdists - if the wheels built from those sdists would vary, we never see it, because those wheels are not published.

On a superficial pass, only the apache-beam one looked to me as though it might not be replaceable by static dependencies and PEP 508 markers - though making that change is surely easier in some projects than others, and perhaps I am wrong.

Reaching out directly to some of those projects probably would give a more reliable read on this, better anyway than my guesses.

9 Likes

Conda packages will probably also fall into this bucket.

Is there anyone involved in the conda ecosystem who could provide more details? I don’t know how conda handles standard metadata (as opposed to their own metadata). I presume the only point of intersection would be this: if someone installs foo-1.0 using conda, is there a possibility that the installed metadata will differ from that in the sdists and wheels published on PyPI? My understanding is that the conda installer doesn’t read standard metadata itself, so it doesn’t care directly about what we might do here.

Has conda been affected by the Metadata 2.2 changes, where sdist metadata can require that all wheels built from the sdist have the same metadata as the sdist?

You’re absolutely right, though: conda is an important ecosystem that we need to interoperate with, and we don’t currently have enough awareness of their processes.

3 Likes

I’m not an expert in conda packaging, but I did a quick test: two envs, one where I installed matplotlib and more-itertools via pip, and another where I installed them with conda.

The pure-python packages have the same METADATA file in both cases, but matplotlib is (unsurprisingly) different, because it was built differently.

If there’s something else I can check, let me know.

I had understood that in the conda ecosystem there is a separate index for package metadata, and that this metadata is possibly updated after package upload. If that is the case, I wonder whether the installed metadata is the one from inside the package or the one from the separate metadata index. But maybe this is not true at all, and I had only misunderstood things. I would not know how to check this. My knowledge of conda is very limited.

P.S.: I have now found a clue about this in the “Removing broken packages” section of the conda-forge documentation: “If the only issue is in the package metadata, we can directly patch it using the repo data patches feedstock.”

1 Like

To my understanding, the primary use case for dynamic is for version numbers, so that developers don’t have to update pyproject.toml to build a different version of the same codebase. I feel like the next step before trying to implement a change like you describe (which I have to agree sounds very nice) is to see what else people are commonly using dynamic for, and why.

My gut reaction is “no, of course not”, but I’d be very interested to hear arguments to the contrary.

IMO the cleanest way to do this is to have a single source for that metadata, which is a file separately available from PyPI and not part of the sdist/wheel. Of course, that is very much a breaking change. My overall feeling is, let’s break it, and do everything possible to make sure that this set of breaking changes fixes everything. (Like with that castle in the Holy Grail movie.) The way things currently stand, I really can’t see Python packaging getting to where everyone is looking for it to end up, without ever making a breaking change.

2 Likes

Also, a note regarding dynamic version numbers: it is also common to modify/generate _version.py files during the sdist or wheel builds, as do versioneer and setuptools-scm. Would enforcing consistent metadata also require consistency of hashes? If the version interpolating tool changes its output (as setuptools-scm has), then you could get inconsistent results across builds.

There are two subtly different contexts for “dynamic” here. In pyproject.toml, it’s anything that the backend might calculate somehow - whether from static data outside of pyproject.toml or in a genuinely dynamic way doesn’t matter. In sdists, it’s only things that the backend chooses to mark as having the potential to vary at wheel build time. Versions are not allowed to be dynamic in this sense.

The impression I get is that the only backend that can produce dynamic metadata in this second (metadata 2.2) sense is setuptools, and then only if the value is marked as dynamic in pyproject.toml, or there’s no [project] section in pyproject.toml.

For me, the questions are:

  1. Do any other backends have the ability to create metadata 2.2 dynamic data?
  2. What about other cases like building from a source tree?

That’s my instinct as well. Pip assumes that all sdists for a project/version are functionally identical, so I’d imagine a bunch of things would fail if this assumption were ever violated. What’s less clear-cut is what other sources people use to build wheels from, and how we reason about those. Or maybe I’m over-thinking the issue, and we should simply not worry (and say that it’s implicit that people shouldn’t try to present source code as version X.Y of project foo if it isn’t the same as the official sdist foo-X.Y.tar.gz).

The frustration here is that people use pip (and more generally, the packaging ecosystem) both as a distribution mechanism (where a consistent view of what a given version is goes without saying) and as part of a development workflow (where versions are fluid, and people routinely change the code without changing the version). Unfortunately, I think we’re long past the point where we could ever change that reality.

This is totally incompatible with the “development workflow” side of the issue. If I’m working on my project, working towards release 1.0, and I haven’t yet published any code, where would the “single source” of that metadata be? I know people use pip in situations like this (app/myapp depends on ../lib/mylib in a monorepo). I don’t know if they would use lockfiles similarly (but I suspect so).

This is what triggered the comments I made in the “Visibility of files” section of my original post. As long as a development workflow only ever sees a single, consistent, view of the in-development libraries and code, we can still reason about (consistent) metadata. But it’s definitely not as easy as the simple “sdists and wheels all published on PyPI” case - which IMO is the least interesting problem, precisely because it’s the simplest to solve (and it’s the one we’ve almost solved already).

2 Likes

As a humble user of a tool such as Poetry, I will say that inconsistent metadata has bitten my team more than a few times over the years. One current example is open3d (10k stars on GitHub), which uses different dependencies (ML dependencies) based on which platform the wheel is built for.

For various reasons, it’s not trivial to fix this with markers after the fact now.

So as a user, I would be thrilled if static metadata were required, so that tools like Poetry/PDM/future uv? could work properly across everything on PyPI.

2 Likes

I’ve only done pretty simple conda package builds myself, but my understanding is that this is correct. The conda build process can interact with standard metadata, for instance by referencing pyproject.toml (and this is sometimes done to, e.g., copy a version number from there so that the conda package version will match). But once it gets to the point of installing the conda package, only conda’s own metadata matters.

The metadata that can be altered in this way is (to my knowledge) conda’s own metadata, not the PyPI-standard metadata.

Like I say, I’ve only done simple conda builds, but from my understanding the big difference between conda and pip/PyPI is that there is a much stricter split between build and install, and so there is no analogue of sdists. The only things that conda install installs are built conda packages; there is no building at install time. This is good for users: it means when you try to install a package, either it is found and installed, or it isn’t and nothing is installed. Of course things can take a long time going down rabbit holes of dependency resolution, or a bug in the package/recipe can cause it to install and then fail. But you don’t get the pip behavior of inscrutable build errors at install time; those errors are forced back to build time, where the package (or recipe) author can debug them.

I continue to think that the best way forward is for the “official” Python packaging world to move towards a similar vision. That is, a strong separation between built-and-ready-to-install packages and “inchoate” things like sdists that still require build steps whose intricacies and potential failure modes are difficult to foresee.

This doesn’t mean things like sdists shouldn’t exist, it just means we shouldn’t expect to be able to install them directly (and certainly not in a default user-facing tool akin to pip). In my view, if it were expected and considered normal that you should never try to install an sdist without a separate, explicit build step, many of the complications raised in this thread would matter much less, or not at all.

To be honest this seems like the best solution to me. I don’t see what the use is in maintaining that the standards technically allow various edge cases if there’s no practical commitment from tools to support them. That could still be a standard in the sense that there could be a PEP that officially says “keep your metadata consistent or you’re on your own”, but maybe not “strict” in the sense that pip and PyPI wouldn’t enforce it.

It seems to me that cases like Linux distro packagers patching sdists are already outside the purview of PyPI standards and I don’t see much hope for bridging the gap there. “Patching an sdist of somepackage” is just another way of saying “modifying the source of somepackage”. Neither PyPI, nor the original author of somepackage, nor anyone else can make any guarantees to someone who modifies the source of that package. In effect, patching the sdist is creating a new, derived package.

4 Likes

Clarifying question: you use “consistent metadata” and “the same metadata” interchangeably. Can I assume you don’t actually mean the exact same - because that would prevent the use of environment markers completely - but rather that wheel metadata can be reliably deduced from the sdist metadata?

Related to the question above: I assume that they want to use only the sdist metadata, maybe retrieved through the PyPI JSON API? It’s not possible to start from wheel metadata even if everything is consistent I think (e.g., due to an environment marker a runtime dependency can be completely missing in the wheel metadata for some platforms or Python versions).

1 Like

There’s some discussion of the issues with immutable metadata (which would have to be a prerequisite of consistent metadata unless modifying files was an option) at Metadata handling on PyPI - pypackaging-native, which may be worth looking at/considering.

In terms of things that break your assumptions without involving outside packaging tools: as far as I know, on Windows a wheel which has been “fixed up” (e.g. with DLLs copied in) will differ from the original wheel, even if that original wheel never gets uploaded to PyPI, but it will still have the same filename. It’s also not obvious that this “wheel fixing up” (we need a better name for it) leaves the intended metadata unchanged - for example, vendoring a specific build of another package because of strong ABI coupling removes a dependency, and the inverse is possible too, with a particular build adding one. I think these cases won’t be an issue if the right fields are marked as dynamic (you’re still stuck with wheels with different metadata, but that has at least been conveyed by using dynamic).

But I think at best you can only make these assumptions about sdists/wheels which are pre-vetted. Maybe PyPI needs a flag in the JSON API to tell clients that the metadata has been checked to be consistent and can be relied on; that would let old packages be slowly backfilled, and clients would know early on whether they need a “handle legacy packages” codepath or not - which, if no packages require it, would be much faster.

No, I do mean exactly the same. Why do you think it would prevent the use of environment markers? Metadata 2.2 already demands that static metadata is identical in all wheels built from the sdist, so this isn’t new.

1 Like

Do you have a concrete example of this happening?

Ah, maybe I had a wrong impression here. I was thinking of a case like pyproject.toml containing:

dependencies = ["importlib-resources >= 3.2.0; python_version < '3.10'"]

and then building say a cp312 wheel from it. That wheel doesn’t depend on importlib-resources, so I thought it may not include a Requires-Dist: importlib-resources. I just did a quick check for one package (matplotlib) and its cp312 wheel does include:

Requires-Dist: importlib-resources >=3.2.0 ; python_version<"3.10"

I’m not sure if any build backend does the simplification here to omit the Requires-Dist - which I suspect was allowed before Metadata 2.2?
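(For what it’s worth, installers evaluate the marker against the target environment at install time, so leaving it in costs nothing - roughly, using the packaging library:

    from packaging.requirements import Requirement

    req = Requirement("importlib-resources >= 3.2.0; python_version < '3.10'")
    # The same Requires-Dist line can ship unchanged in every wheel; the installer
    # decides per-environment whether the dependency applies.
    print(req.marker.evaluate())  # False on Python 3.12, True on Python 3.9

so there is no real need for a backend to simplify the field away.)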

Either way, you are right that Metadata 2.2 already addresses this.

7 Likes