PEP 665: Specifying Installation Requirements for Python Projects

Depends on your definition of a resolver. The installer will need to be able to recusively collect dependencies from package entries, evaluate environment markers, and match wheel tags to choose a valid set of files to install. Which is technically a resolver. But there is guaranteed to be one valid combination to install (assuming the lock file is valid), so the installer will never need to be able to perform complex NP-hard version selection stuff that most people mean when we talk about a resolver.

2 Likes

I’m fine with that if others are.

That’s assuming indexes were used, though. Lockers could record that in their tool table.

This doesn’t directly address build-time dependencies in any special way. Are you asking to pin what an sdist specifies in their pyproject.toml’s build-system table? Technically there is nothing preventing a locker from listing build dependencies and an installer using the lock file to satisfy the build requirements.

I’m on board with making this a “SHOULD” recommendation to directly record what Requires-Dist the locker saw in the needs array.

Same here.

1 Like

This doesn’t sound right to me at all. The proposed format absolutely lets you give the installer NP-hard problems. Maybe an installer could somehow only support a restricted set of needs setups, to exclude the combinations that create exponential blowup? But I don’t know how you’d do that – recognizing whether a given needs configuration is NP-hard seems like it might also be NP-hard :-). And it’s certainly not compatible with the idea of recording Requires-Dist lines directly in the lockfile, because those are sufficient to create NP-hard version selection all on their own.

If we want to ensure there’s exactly one resolution and that installers don’t need a full NP-hard resolver, then we should make the lockfile format less expressive, so it can’t describe NP-hard problems.

The most natural version would be to have the resolver to compile down the resolution solution to a single flat list of (package, version, marker). The markers would let you handle the case poetry raised of wanting a single lockfile to work with multiple environments, but the idea would be that the installer doesn’t look at dependencies at all, it jsut blindly installs all the entries in the list that match the current environment.

2 Likes

I am afraid it isn’t as trivial as you think. Both poetry and pdm have tried marker resolution(which means blind installer) but gave up to installer resolution. Say we have A that depends on B when os_name == 'nt' and C that depends on B when sys_platform == 'win32'. If we are to record the marker of B, a marker merging should be performed and the result would be os_name == 'nt' or sys_platform == 'win32'. As the number of dependants grows we can expect an extremely long marker string on B. And we would want a marker deduplication to avoid some rare failures so that os_name == 'nt' or os_name == 'nt' collapses to os_name == 'nt'. In one sentence, it would be another hard problem.

2 Likes

@frostming already explained this is not as doable as naturally conceived. I think the consensus from authors and other contributors to the current PEP 665 draft is that it is indeed possible to describe NP-hard problems with the current syntax, but the installer should not be worried about those NP-hard cases since the locker should not emit such lockfiles in practice. We could definitely restrict the syntax to eliminate those NP-hard cases in theory, but that’s require us to come up with an entirely new spec and new parser implementation that’s only going to be useful for a very limited number of people.

IMO it’s far more productive to make this an “uncodified” agreement between the locker and the installer. If this bothers you, we can definitely add something to PEP 665 that says the locker must not emit complex things (although I am not sure what terms we should use to describe this rule more clearly).

4 Likes

AFAIK, “NP-hardness” usually isn’t something you can split off and isolate from the rest of the problem, like, “oh it’s this dependency right here that makes the resolution problem NP-hard”. Resolution problems just are NP-hard as a consequence of combining all the pieces together. Or more concretely: can you give an example of an algorithm (maybe as pseudocode or whatever) that handles “easy” resolution cases but not “hard” ones? I just don’t understand what simplifications you think the installer can make, or how a lockfile generator cna tell whether it’s generating “NP-hard problems” or not.

Oh sure, I’m not saying it would be easy :-). Version resolution is an intrinsically hard problem! But I can see how we might address the “long marker string” problem in a way that’s good enough in practice, e.g. by hard-coding knowledge of common cases (os_name == "nt" and sys_platform == "win32" are the same thing), or by implementing a simple optimization pass to deduplicate things. Collapsing A or AA is a trivial transformation on the marker AST.

OTOH I still haven’t seen any explanation of what the alternative even is. How does pdm actually generate its lockfiles? Can you explain the algorithm?

BTW, here some properties that I think we’d all agree would be nice:

  1. Given a lock file + an environment, there should be at most one valid solution, i.e., all installers should produce the same result.
  2. The needs entries in lock files should match those that appear in the actual packages; locking is mostly a matter of figuring out which subset of PyPI you need to include in your lockfile.

But in fact, it is impossible to have both of these properties simultaneously. Consider this counter-example:

The user requests packages A, B, and C.

Package A v1 depends on package B v2.

Package A v2 depends on package B v1.

(Note: this is just a simple way to create multiple valid solutions: you can either have A v1 + B v2, or A v2 + B v1, and different resolvers will pick different solutions based on heuristics.)

Package C is only available in one version, v1, which depends on:

  • package A v1 on macOS (A==1; sys_platform == "darwin")
  • package A v2 on Windows (A==2; sys_platform == "win32")
  • nothing at all on Linux

So on macOS, there is only one valid solution: A==1, B==2, C==1

And on Windows, there is only one valid solution: A==2, B==1, C==1

And on Linux, these are both valid solutions, and both solutions are valid using only packages and requirements that the locker will include in the lock file, because they’re required in at least some environments.

So if you encode this in the obvious way into a PEP 665 lockfile, you’ll end up with instability on Linux, where different installers can produce different results.

Upthread it I think it was claimed that poetry/pdm currently produce lockfiles that are guaranteed to only have a single solution, and that use the original package Requires-Dist metadata in the lockfile – is that correct? How do they handle this case? The only solution I can think of is to synthesize an extra top-level requirement like A==1; sys_platform != "win32" and sys_platform != "darwin", but I don’t know how you algorithmically figure out when an extra constraint like that is needed or what it should look like. (I’m not even sure how an algorithm would write down the needed constraint once it found it, given that it needs to be the negation of some other arbitrary constraints, and the marker language doesn’t have not.) Do poetry/pdm have a solution here?

1 Like

One example would be pip’s resolver before 2020. It recursively visit requested dependencies (with matching environment marker) and pick the first thing it sees that works with what it currently knows. If anything ever comes up down the road that invalidates the previous solution, it simply ignores it and carries on (a more “robust” installer implementation can choose to error out immediately instead of ignore the conflicting constraints).

My previous comment was ambiguous and is likely the source of misunderstanding here. The lock file does not promise to contain exactly one solution for each environment, but at least one; in this situation there are two valid solutions, and the installer would be free to choose either. What the lock file promises, though, is that whichever solution the installer chooses at this point is going to be valid in the end, and the installer will never need to perform backtracking or conflict resolution or whatever, i.e. the “complex” part. The approach taken by existing resolvers is that since the user does not specify further, they should accept either solution, so the tool just picks one arbitrarily (but consistently to avoid confusion in practice).

1 Like

The Software Heritage project has developed the Software Heritage ID format, or SWHID. A SWHID is a persistent identifier. Looking forward, I suggest to replace the url field with a swhid field. In a SWHID the item of interest is separated from the url at which it can be found. In Nixpkgs we do this as well behind the scenes, because it should not matter where an artifact is fetched from, it’s just additional info.

Note Software Heritage only deals with source code, so I am not quite sure how it (the SWHID) would have to be adapted to deal with artifacts such as sdists and wheels.

1 Like

But that would require either hitting an external service to resolve what the ID resolves to in terms of a URL and a key design point of the PEP is that an installer does not need to contact any third parties to resolve what to download and install.

2 Likes

Why not both?

To go from just a core identifier to an object, a resolver is indeed needed. However, by also including the origin field (URI) that is not needed, as we effectively get the url as proposed now. Note I do find it a bit of a pity they use SHA1 only and not support SRI.

How do lockers guarantee that backtracking is never needed? E.g. a simple case:

Top-level requirements: A, B

A’s requirements:

  • B == 1; sys_platform == "win32"
  • B == 2; sys_platform == "darwin"

Now, suppose the installer happens to process the top-level requirements in the order B first, then A. Since both B == 1 and B == 2 have to be listed in the lockfile, the installer has to pick one. Presumably it will pick B == 2, because that’s the latest version. But on windows, this will later turn out to be a mistake, forcing it to backtrack…

Does that make this an invalid lockfile, or… what?

1 Like

(Disclaimer: I don’t actually work on a platform-agnostic locker, so this is only what I persume they’d do.) One way would be to split the top-level B requirement into ["B==1; sys_platform == 'win32'", "B; sys_platform == 'darwin'"], although this could be undesired if we’re going to use needs to record the “raw” requirements.

Also I recall a lockfile format (Poetry? don’t really remember) I researched has a platform marker field on individual package sections to indicate the package entry is only valid on certain platforms (kind of like the top-level marker field in PEP 665 but for each package), which feels like a good solution to this. Having marker (and perhaps tags?) in each package entry also makes it more similar to the top-level metadata field, and I like this kind of mirroring personally.

1 Like

If we propagate and combine all markers pertaining to a package in the lock file does that mean we don’t need to keep the markers in needs since the resolution would be at the package version level as to what to install?

If we were to do this, my questions would become:

  1. Do we then make needs just list package names?
  2. Does this change people wanting to record the original input to the locker’s resolver?

Wouldn’t hurt for symmetry.

Would this lower the computational overhead for the installer?

1 Like

PEP 665: add some open issues · python/peps@2147ddc · GitHub added some open issues:

  1. Record the creation date by @ncoghlan
  2. Idea to pin build dependencies by @kushaldas
  3. Record the original inputs to the locker’s resolver by @ncoghlan and @njs
  4. Make keeping a lock file in pyproject-lock.d optional by @steve.dower (and privately @eric.snow )
1 Like

I imagine this should only be used when there are multiple package entries under a name to “guide” installers to one of them directly, and in that case yes the installer can do less calculation.

1 Like

Add the idea of allowing marker and tags in [package] tables via Add an open issue about `marker` and `needs` per package version · python/peps@2d7c608 · GitHub (and w/ follow-up commits to fix the markup :sweat_smile:).

1 Like

Things have settled down here, so I think it’s time to drive towards closing out the open issues. Here they are. Any we can’t reach consensus around will be decided by the authors of the PEP.

Allow for Tool-Specific type Values

It has been suggested to allow for custom type values in the
code table. They would be prefixed with x- and followed by
the tool’s name and then the type, i.e. x-<tool>-<type> . This
would provide enough flexibility for things such as other version
control systems, innovative container formats, etc. to be officially
usable in a lock file.

Support Variable Expansion in the url field

This could include predefined variables like PROJECT_ROOT for the
directory containing pyproject-lock.d so URLs to local directories
and files could be relative to the project itself.

Environment variables could be supported to avoid hardcoding things
such as user credentials for Git.

Don’t Require Lock Files Be in a pyproject-lock.d directory

It has been suggested that since installers may very well allow users
to specify the path to a lock file that having this PEP say that
"MUST be kept in a directory named pyproject-lock.d " is pointless
as it is bound to be broken. As such, the suggestion is to change
“MUST” to “SHOULD”.

Record the Date of When the Lock File was Generated

Since the modification date is not guaranteed to match when the lock
file was generated, it has been suggested to record the date as part
of the file’s metadata. The question, though, is how useful is this
information and can lockers that care put it into their [tool]
table instead of mandating it be set?

Locking Build Dependencies

Thanks to PEP 518, source trees and sdists can specify what build
tools must be installed in order to build a wheel (or sdist in the
case of a source tree). It has been suggested that the lock file also
record such packages so to increase how reproducible an installation
can be.

There is nothing currently in this PEP, though, that prohibits a
locker from recording build tools thanks to metadata.needs acting
as the entry point for calculating what to install. There is also a
cost in downloading all potential sdists and source trees, reading
their pyproject.toml files, and then calculating their build
dependencies for locking purposes for which not everyone will want to
pay the cost for.

Recording the Requires-Dist Input to the Locker’s Resolver

While the needs key allows for recording dependency specifiers,
this PEP does not currently require the needs key to record the
exact Requires-Dist metadata that was used to calculate the
lock file. It has been suggested that recording the inputs would help
in auditing the outcome of the lock file.

If this were to be done, it would be an key named requested which
lived along side needs and would only be specified if it would
differ from what is specified in needs .

3 Likes

Thank you. No further objections, your honour :smiley:

1 Like

Thank you for this PEP, Brett.
I’m commenting here as a developer from a Platform Provider (Azure Functions, i.e serverless) so hopefully I can provide that perspective. I should also note that this is my first time commenting on this platform, so please let me know if I’m ignoring any best practices; thanks.

Recording the date of when a lock file is generated would be really useful to my job. When investigating user-reported incidents, especially those where a customer app suddenly starts acting “different” in some way, having this kind of metadata is key to providing a root-cause analysis.

However, I don’t have strong feeling as to where to store these timestamps; just that ideally they would be stored somewhere.

In Azure Functions, we structure Python apps in a slightly unconventional way (with respect to local apps/libraries) and so I support having flexibility in where and how to store these lock files.

and that’s all I had to say :wink: .
The rest of the PEP seems reasonable to me and I think it would be a positive tool if accepted.

3 Likes