I want to put out this “food for thought” post, as I don’t believe any of this is written in the Python packaging standards; rather, it has naturally come out of resolver tools implementing those standards. And I think it’s going to become an important point when discussing new standards and the different ways tools interact with them.
First, some informal definitions:
A release is a specific version of a project, e.g. pandas 2.3.3
A distribution is a concrete file that contains the package, e.g. pandas-2.3.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
A release may have one to many distributions, but a distribution belongs to exactly one release. In any given environment there may be many valid distributions available for a release. There is no guarantee, outside what is specified in the filename, that the metadata of different distributions is consistent.
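As a rough illustration of what the filename alone guarantees, here’s a stdlib-only sketch that splits a wheel filename into its components (real tools should use `packaging.utils.parse_wheel_filename`, which also handles optional build tags and compressed tag sets; the naive split below does not):

```python
# Sketch: split a wheel filename into its components per the binary
# distribution format. Assumes no build tag is present; real tools
# should use packaging.utils.parse_wheel_filename instead.
def split_wheel_filename(filename: str):
    stem = filename.removesuffix(".whl")
    name, version, python_tag, abi_tag, platform_tag = stem.split("-", 4)
    return name, version, python_tag, abi_tag, platform_tag

parts = split_wheel_filename(
    "pandas-2.3.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl"
)
print(parts)
# ('pandas', '2.3.3', 'cp311', 'cp311',
#  'manylinux_2_24_x86_64.manylinux_2_28_x86_64')
```

Everything beyond these components, including the dependency metadata, lives inside the file and costs IO to discover.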
When a tool has to resolve dependencies it faces a problem: if a given distribution is incompatible with existing requirements, that does not mean other distributions of the same release are incompatible. For example, suppose a user has the requirements foo>=1 bar!=2, the latest release of foo is 2.0.0, and the first distribution checked specifies bar==2. The tool could check all other distributions of foo 2.0.0, or it could mark the foo 2.0.0 release as not compatible with the user’s requirements.
In practice, as far as I am aware, all Python package dependency resolvers mark the release, rather than the individual distribution, as not compatible. There are a number of practical reasons for not iterating through all compatible distributions for each release:
It saves a significant amount of IO
It saves having to build the source distribution for every incompatible release
It simplifies dependency resolution/satisfiability algorithm implementations
It reduces dependency resolution/satisfiability algorithm computational costs, as even a 2x increase in the number of possible choices for each project exponentially increases the possible solution space
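The release-level pruning described above can be sketched as follows. This is a deliberately toy model with hypothetical package data and a hard-coded conflict check, not how any real resolver is implemented; the point is only that one metadata fetch per release is enough to rule the whole release out:

```python
# Hypothetical index data: each release maps to its distributions.
# Once the first distribution fetched conflicts with the user's
# requirements (bar!=2), the whole release is marked incompatible,
# without fetching metadata for its other distributions.
releases = {
    ("foo", "2.0.0"): [
        {"file": "foo-2.0.0-py3-none-any.whl", "requires": ["bar==2"]},
        {"file": "foo-2.0.0.tar.gz", "requires": ["bar==2"]},  # never fetched
    ],
    ("foo", "1.5.0"): [
        {"file": "foo-1.5.0-py3-none-any.whl", "requires": ["bar>=1"]},
    ],
}

def conflicts(requires: list[str], forbidden: str) -> bool:
    # Toy check: the user pinned bar!=2, so any bar==2 requirement conflicts.
    return forbidden in requires

incompatible_releases = set()
for release, dists in releases.items():
    first_dist = dists[0]  # one IO call per release, not one per file
    if conflicts(first_dist["requires"], "bar==2"):
        incompatible_releases.add(release)

print(incompatible_releases)  # {('foo', '2.0.0')}
```

This only gives the right answer because all distributions of foo 2.0.0 agree on their dependencies, which is exactly the assumption at stake.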
As a consequence, it is good practice not to specify different dependency metadata for different distributions of the same release. Where some dependencies are platform- or API-specific, it is ideal if these can be represented using environment markers, so tools that statically resolve multiple platforms or APIs can make one IO call per release.
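For example, rather than shipping a Windows wheel whose metadata differs from the other wheels, every distribution of the release can carry the same Requires-Dist entries, with the platform-specific one guarded by a marker (package names here are illustrative):

```
Requires-Dist: numpy>=1.23
Requires-Dist: pywin32>=306; sys_platform == "win32"
```

A resolver targeting Linux simply evaluates the marker to false, and the metadata remains identical across all of the release’s distributions.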
It is also important that future standards not require resolver tools to make multiple IO calls for dependency metadata for each release. Further, future standards should avoid requirements that invalidate basic assumptions in dependency resolution/satisfiability algorithms, such as dependencies that are affected by the current internal state of the resolver, which, for example, would invalidate nogood learning.
This is helpful, and I’m definitely a fan of the packaging guide including more material of this type. Would it be useful to clarify that any given release can have one to many distributions but only one of them can be a source distribution?
A quick request: if you go with that example, could you please use distinct version ranges for the two packages? I kept WTF-ing why it would need to check multiple distributions of foo 2.0.0 to check if they’re compatible with foo!=2 before I realized we’re talking about two packages xD.
Overall, big +1 on this. Ideally we’d standardize on this, and while I can see it’s going to be hard to push through, making it an official recommendation is also an important step forward.
If, as @notatallshaw said, all existing implementations work on that assumption, then I don’t see why it would be hard to make it a formal requirement - after all, we know that nothing can rely on breaking the assumption, so compatibility is easy to establish.
The big problem is more that it’s hard to define precisely. As worded, the assumption is weaker than simply saying “all distribution files must have the same dependency metadata”. And phrasing the assumption as given in a standards-like form, something along the lines of “all distribution files in a release must…” could be tricky to get right. But if someone wants to try to do so, I’d support it.
Here’s my formulation, which would work with uv. I can’t speak for poetry, but from what I know from their resolver I expect something similar:
All distributions of the same package name and version on the same index page must have the same values for Requires-Dist in their {name}-{version}.dist-info/METADATA file, in the same order.
This formulation ensures that a clause-learning resolver can learn new clauses per (name, version) entry. I added the order here too because uv uses it as arbitrary-but-stable input for the resolver; this isn’t a strict requirement. I added the index requirement because indexes work as namespaces, and a resolver can’t assume that metadata from one index applies to another index too. It’s IMHO fine to just say “same” and ignore any questions about normalization; I don’t want to make this a capitalization discussion, the parsing and normalization rules do work and handle this.
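The invariant in that formulation is easy to check mechanically. Here’s a minimal stdlib-only sketch (the helper names and sample metadata are hypothetical) that compares the Requires-Dist entries, in order, across the METADATA of several distributions of one release:

```python
# Sketch: verify that all distributions of one (name, version) carry the
# same Requires-Dist values in the same order. METADATA uses an
# RFC 822-style header format, so the stdlib email parser can read it.
from email.parser import Parser

def requires_dist(metadata_text: str) -> list[str]:
    # Requires-Dist may repeat, so collect every occurrence in order.
    msg = Parser().parsestr(metadata_text)
    return msg.get_all("Requires-Dist") or []

def release_is_consistent(metadata_texts: list[str]) -> bool:
    reqs = [requires_dist(t) for t in metadata_texts]
    return all(r == reqs[0] for r in reqs[1:])

wheel_a = "Metadata-Version: 2.1\nName: foo\nVersion: 2.0.0\nRequires-Dist: bar==2\n"
wheel_b = "Metadata-Version: 2.1\nName: foo\nVersion: 2.0.0\nRequires-Dist: bar==2\n"
print(release_is_consistent([wheel_a, wheel_b]))  # True
```

In a real check you would pull `{name}-{version}.dist-info/METADATA` out of each wheel with `zipfile` first; the comparison itself is this simple.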
That’s the stricter version I alluded to above. I have a vague recollection that some people objected in the past to a proposal along those lines, because “some projects”[1] add platform-specific dependencies in wheels for particular platforms. Which conforms to the limitations in the OP, but not to the formulation you gave.
If we want to push for that formulation, it needs to be a separate thread. For this thread, we should focus on the looser constraint that @notatallshaw described in the OP. I fully support adding that as an informational page to the PUG. And I would be OK with someone formulating it as a required standard - but only if it didn’t constrain packages any more than the current informal behaviour does.
That’s the thing. There isn’t a specific formulation. Instead, there’s a description of practical choices that installers make (and in fact, pretty much have to make, to get usable performance), and the advice is “make sure that we don’t create standards or requirements that make it impossible to continue making these choices”.
That’s a very important thing to document. It would also be useful if we could make it a standard, but it’s not clear that we can convert the existing information into an enforceable standard.
In terms of putting something in the packaging guide about this, I see two options:
Best practice for users and tools to keep dependency metadata identical for a given release, with an explanation of the situation I described as the motivation
Or the opposite way around:
An informational page on how Python package resolvers work at a high level, with some points about best practice for users and tools because of this
I’m not inclined to write a standard on this, because while the high level approach of resolvers is the same, there’s a lot of nuance once you expand on individual resolvers, as @konstin mentions.
I think a standard would be in an odd place: it couldn’t be very specific if it avoided requiring existing tools to make any changes, and even if all tool authors were happy to make changes for a standard, it could end up too specific and prevent innovations in the resolver space.
Though if someone else wants to write a standard I’m not going to block it.
I think the former would be clearer, i.e. follow the usual “what to do” and “why” split. Some resolver details might be helpful, but you may also open a can of worms over differences between different resolvers.
Probably worth considering other tools, such as security scanners, that might be processing the first wheel available.
Dependencies can be marked as dynamic, so isn’t this assumption invalid (and will likely conflict with efforts to be able to properly express dependency metadata e.g. wheel variants)?
The goal of wheel variants is being able to express hardware-specific, system-library-specific and package-ABI-specific dependencies using static metadata, closing the gap where packages currently need to use tricks like local versions on an index and source distribution that need to be built without build isolation to determine their dependencies. The proposal intends to address the classes of problems where dynamic dependencies are currently required.
If a package with dynamic dependencies resolves to different requirements in different builds, note that this will break dependency locking, as the new dependencies may not have been considered in the lockfile and can be missing or conflicting. If you’re using a pylock.toml, you’re already encoding at least some assumption about immutable metadata into your deployment process.
I firmly believe that at some point we are going to have to address the issues that overly dynamic metadata causes. Whether we do that in a way that supports “legacy” distributions/releases, or whether we just declare that we no longer support certain types of variation, I don’t know. That’s something we will need to decide.
But I don’t know if people have the energy to tackle that topic now…
Wheel variants are one of the motivating reasons I wrote this up.
In general I see no issues with wheel variants as long as they don’t require, infer, or imply that a resolver needs to check multiple distributions for a given release, say in the event of a requirement conflict. In the world of wheel variants there may be hundreds of compatible distributions per release, and checking each distribution would have the problem I outlined in the original post, scaled up massively.
I would be happy writing up somewhere in the packaging guide that it is best practice to have identical dependency metadata for all distributions in a single release. It is then codified, and any new standard should not break that best practice, at least in expected common cases.