PEP 665: Specifying Installation Requirements for Python Projects

njs · August 12, 2021, 12:36pm

AFAIK, “NP-hardness” usually isn’t something you can split off and isolate from the rest of the problem, like, “oh it’s this dependency right here that makes the resolution problem NP-hard”. Resolution problems just are NP-hard as a consequence of combining all the pieces together. Or more concretely: can you give an example of an algorithm (maybe as pseudocode or whatever) that handles “easy” resolution cases but not “hard” ones? I just don’t understand what simplifications you think the installer can make, or how a lockfile generator cna tell whether it’s generating “NP-hard problems” or not.

Oh sure, I’m not saying it would be easy :-). Version resolution is an intrinsically hard problem! But I can see how we might address the “long marker string” problem in a way that’s good enough in practice, e.g. by hard-coding knowledge of common cases (os_name == "nt" and sys_platform == "win32" are the same thing), or by implementing a simple optimization pass to deduplicate things. Collapsing A or A → A is a trivial transformation on the marker AST.

OTOH I still haven’t seen any explanation of what the alternative even is. How does pdm actually generate its lockfiles? Can you explain the algorithm?

BTW, here some properties that I think we’d all agree would be nice:

Given a lock file + an environment, there should be at most one valid solution, i.e., all installers should produce the same result.
The needs entries in lock files should match those that appear in the actual packages; locking is mostly a matter of figuring out which subset of PyPI you need to include in your lockfile.

But in fact, it is impossible to have both of these properties simultaneously. Consider this counter-example:

The user requests packages A, B, and C.

Package A v1 depends on package B v2.

Package A v2 depends on package B v1.

(Note: this is just a simple way to create multiple valid solutions: you can either have A v1 + B v2, or A v2 + B v1, and different resolvers will pick different solutions based on heuristics.)

Package C is only available in one version, v1, which depends on:

package A v1 on macOS (A==1; sys_platform == "darwin")
package A v2 on Windows (A==2; sys_platform == "win32")
nothing at all on Linux

So on macOS, there is only one valid solution: A==1, B==2, C==1

And on Windows, there is only one valid solution: A==2, B==1, C==1

And on Linux, these are both valid solutions, and both solutions are valid using only packages and requirements that the locker will include in the lock file, because they’re required in at least some environments.

So if you encode this in the obvious way into a PEP 665 lockfile, you’ll end up with instability on Linux, where different installers can produce different results.

Upthread it I think it was claimed that poetry/pdm currently produce lockfiles that are guaranteed to only have a single solution, and that use the original package Requires-Dist metadata in the lockfile – is that correct? How do they handle this case? The only solution I can think of is to synthesize an extra top-level requirement like A==1; sys_platform != "win32" and sys_platform != "darwin", but I don’t know how you algorithmically figure out when an extra constraint like that is needed or what it should look like. (I’m not even sure how an algorithm would write down the needed constraint once it found it, given that it needs to be the negation of some other arbitrary constraints, and the marker language doesn’t have not.) Do poetry/pdm have a solution here?