PEP 665: Specifying Installation Requirements for Python Projects

kushaldas · August 11, 2021, 6:17am

We will know the exact versions of the build dependencies. Say project A depends on xyz as a build dependency, but
the latest version of xyz may simply break the build of A. Or say a build-time dependency for an extension is provided by an OS package.

I have been stuck on a build-time dependency update before, so I am hoping that this will provide a reproducible environment.

Note: Trying to reply via email to see how this works.

Kushal

njs · August 11, 2021, 7:16am

Fair! I think there are three separate things we’re talking about here with the needs metadata:

Telling the installer which packages to install. IIUC this is the intended purpose in the PEP. For this, the needs don’t necessarily have anything to do with the actual package metadata. In fact I guess the simplest way to generate a lockfile would be to list all the requirements directly in the top-level needs, and leave all the individual package needs declarations empty. That’s basically what you get right now from pip-compile --generate-hashes.
Recording the inputs to the locker, so that one can recreate it or check if it’s up to date. Unlike Nick, I think this is out of scope here. Different lockers will have different inputs that affect the locking process (pip has constraints, --allow-pre, and --allow-binary; poetry doesn’t even use PEP 508 syntax for inputs; target environments might affect locking; etc. etc.). And if the motivation for standardizing this is to allow lockfile consumers like IDEs and PaaS providers to create environments without knowing about locker-specific details, then the whole point is to decouple this lockfile format from whatever idiosyncratic input format a specific locker is using.
Recording what the locker saw while it was generating the lockfile, so that the installer can validate that the world hasn’t shifted around and broken reproducibility. This is the purpose of the artifact hashes, for example – it lets the installer confirm that it sees the same artifacts that the locker saw. Packages on PyPI or wherever shouldn’t change, but we can’t guarantee that, so we save the hashes. Yeah, it creates noise in the diff, but not much we can do about that.
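As a rough illustration of the third point, an installer-side hash check could look something like the sketch below. The function name `verify_artifact` and the choice of SHA-256 are assumptions made for this example; the PEP draft, not this snippet, defines how hashes are actually recorded.

```python
import hashlib


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True if the downloaded bytes match the hash the locker recorded."""
    actual = hashlib.sha256(data).hexdigest()
    return actual == expected_sha256.lower()


# Hypothetical artifact contents and the digest a locker would have stored.
artifact = b"example wheel bytes"
recorded = hashlib.sha256(artifact).hexdigest()

print(verify_artifact(artifact, recorded))          # same bytes as the locker saw
print(verify_artifact(b"changed bytes", recorded))  # artifact changed since locking
```

The check says nothing about which packages to install; it only confirms that the bytes have not changed since locking.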

This was why I raised the idea of also including a record of the specific requires-dist metadata that the locker saw while generating the lockfile. All wheels for a given (package, version) should have the same requires-dist metadata, but we can’t guarantee that, so maybe we should save the metadata the locker saw. It’s conceptually like the hashes, not like the needs metadata. It wouldn’t affect the actual environment; it’s just there to make sure things aren’t broken.

uranusjr · August 11, 2021, 9:45am

brettcannon:

uranusjr:

Or maybe we could just use install_requires (or more technically Requires-Dist) as needs entries directly? (Maybe with some name and version normalisation.)

What specifically would need to change in the PEP to allow for that since I feel it already is allowed? We say that PEP 508 specifiers are allowed and the PEP essentially mandates a resolver in the installer already, so what’s preventing using the needs field for this with the way it’s worded? Just a flat-out statement of “you don’t need to tighten the PEP 508 specifiers and can leave them as the original input to the locker”? Or are people explicitly concerned lockers may output different things for needs compared to their Requires-Dist input that was received and want to force lockers to record their original input?

Yes, this is already allowed, but from the feedback I feel we should make this more obvious, or even explicitly recommend that lockers put Requires-Dist strings in it, with a SHOULD or something.

If we do this…

The needs field would satisfy this recording specific Requires-Dist metadata feature. This needs field would not matter for the everything-top-level-needs usage since putting everything at the top-level makes the installer install them all no matter what the dependencies between those packages are, unless one of the locked versions is outside of the per-package needs’s version specifier, but that should not happen anyway unless your resolver is faulty in the first place.

ncoghlan · August 11, 2021, 10:05am

The needs field has the pinned dependencies in it. It can’t also be used to record the typically looser input dependencies without ambiguity as to which pins came from the input and which came from the locking process.

That said, I think Nathaniel makes a valid point that the only level where every locker will have PEP 508 install_requires info available is for each package version analysed - there’s no requirement that the top level inputs use that format, and at least one case where they definitely don’t (poetry). That means standardisation of the lock process inputs isn’t practical, and if a locker just wants to record its own inputs, it can put them in the tools table.

Any CI use cases like the ones I mentioned above would be tool specific, but that’s probably fine, since the CI setup for a project is typically at least somewhat coupled to your choice of locker anyway.

uranusjr · August 11, 2021, 12:44pm

It does not, at least that’s not what I had in mind when we discussed it. The needs fields only record loosely specified requirements (but the PEP does not currently define how loose, and I’m proposing we formally define it to match user intent and distribution metadata), and the pinned version is specified under the package.<name> table referenced by a needs entry.

From PEP 665’s example: (unrelated fields have been removed for brevity)

[metadata]
needs = ["mousebender"]

[[package.attrs]]
version = "21.2.0"

[[package.mousebender]]
version = "2.0.0"
needs = ["attrs>=19.3", "packaging>=20.3"]

[[package.packaging]]
version = "20.9"
needs = ["pyparsing>=2.0.2"]

[[package.pyparsing]]
version = "2.4.7"

None of the needs entries list pinned dependencies. The one in metadata, for example, only says mousebender is needed, but does not pin a version. Only [[package.mousebender]] contains the resolved version, 2.0.0. If the user instead specified mousebender>=2 (with an imaginary pip resolve 'mousebender>=2' command, for example), metadata.needs would become ["mousebender>=2"] to reflect that user intent, while the [[package.mousebender]] section stays exactly the same to reflect the actual resolved distribution info.

ofek · August 11, 2021, 5:42pm

Just to be clear, in what situations would installers be required to have a resolver?

uranusjr · August 11, 2021, 6:13pm

Depends on your definition of a resolver. The installer will need to be able to recursively collect dependencies from package entries, evaluate environment markers, and match wheel tags to choose a valid set of files to install. Which is technically a resolver. But there is guaranteed to be one valid combination to install (assuming the lock file is valid), so the installer will never need to be able to perform the complex NP-hard version selection stuff that most people mean when we talk about a resolver.
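To make the "recursively collect dependencies" step concrete, here is a minimal sketch that walks the lock file example quoted earlier in the thread. The dict layout, the `collect` function and the regex-based name extraction are all assumptions for illustration. It ignores environment markers, wheel tags and version checking, and assumes one entry per package name.

```python
import re

# Hypothetical in-memory form of the PEP 665 example shown earlier.
LOCK = {
    "metadata": {"needs": ["mousebender"]},
    "package": {
        "attrs": [{"version": "21.2.0"}],
        "mousebender": [
            {"version": "2.0.0", "needs": ["attrs>=19.3", "packaging>=20.3"]}
        ],
        "packaging": [{"version": "20.9", "needs": ["pyparsing>=2.0.2"]}],
        "pyparsing": [{"version": "2.4.7"}],
    },
}

# Simplified project-name pattern; real PEP 508 parsing is more involved.
_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_name(requirement: str) -> str:
    """Extract the project name from a requirement string (simplified)."""
    match = _NAME.match(requirement.strip())
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return match.group(0).lower()


def collect(lock: dict) -> dict:
    """Follow needs entries from the top level and return {name: version}."""
    selected = {}
    pending = list(lock["metadata"]["needs"])
    while pending:
        name = requirement_name(pending.pop())
        if name in selected:
            continue  # already visited
        entry = lock["package"][name][0]  # assumption: a single entry per name
        selected[name] = entry["version"]
        pending.extend(entry.get("needs", []))
    return selected


print(collect(LOCK))
```

With a single pinned entry per package there is nothing to choose between, so the walk never has to backtrack.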

brettcannon · August 11, 2021, 8:26pm

I’m fine with that if others are.

That’s assuming indexes were used, though. Lockers could record that in their tool table.

This doesn’t directly address build-time dependencies in any special way. Are you asking to pin what an sdist specifies in their pyproject.toml’s build-system table? Technically there is nothing preventing a locker from listing build dependencies and an installer using the lock file to satisfy the build requirements.

uranusjr:

If we do this…

njs:

Telling the installer which packages to install. IIUC this is the intended purpose in the PEP. For this, the needs don’t necessarily have anything to do with the actual package metadata. In fact I guess the simplest way to generate a lockfile would be to list all the requirements directly in the top-level needs, and leave all the individual package needs declarations empty. That’s basically what you get right now from pip-compile --generate-hashes.

njs:

This was why I raised the idea of also including a record of the specific requires-dist metadata that the locker saw while generating the lockfile. All wheels for a given (package, version) should have the same requires-dist metadata, but we can’t guarantee that, so maybe we should save the metadata the locker saw. It’s conceptually like the hashes, not like the needs metadata. It wouldn’t affect the actual environment; it’s just there to make sure things aren’t broken.

The needs field would satisfy this recording specific Requires-Dist metadata feature. This needs field would not matter for the everything-top-level-needs usage since putting everything at the top-level makes the installer install them all no matter what the dependencies between those packages are, unless one of the locked versions is outside of the per-package needs’s version specifier, but that should not happen anyway unless your resolver is faulty in the first place.

I’m on board with making this a “SHOULD” recommendation to directly record what Requires-Dist the locker saw in the needs array.

Same here.

njs · August 11, 2021, 10:00pm

This doesn’t sound right to me at all. The proposed format absolutely lets you give the installer NP-hard problems. Maybe an installer could somehow only support a restricted set of needs setups, to exclude the combinations that create exponential blowup? But I don’t know how you’d do that – recognizing whether a given needs configuration is NP-hard seems like it might also be NP-hard :-). And it’s certainly not compatible with the idea of recording Requires-Dist lines directly in the lockfile, because those are sufficient to create NP-hard version selection all on their own.

If we want to ensure there’s exactly one resolution and that installers don’t need a full NP-hard resolver, then we should make the lockfile format less expressive, so it can’t describe NP-hard problems.

The most natural version would be to have the resolver compile down the resolution solution to a single flat list of (package, version, marker). The markers would let you handle the case poetry raised of wanting a single lockfile to work with multiple environments, but the idea would be that the installer doesn’t look at dependencies at all, it just blindly installs all the entries in the list that match the current environment.
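A minimal sketch of what such a "blind" installer might do with a flat list, under some loud assumptions: the entries below are made-up data, and a marker is modelled as a plain dict of environment values that must all match (with None meaning "always install") rather than a real PEP 508 marker expression.

```python
# Each entry is (package, version, marker). Hypothetical data for illustration.
FLAT_LOCK = [
    ("attrs", "21.2.0", None),
    ("colorama", "0.4.4", {"sys_platform": "win32"}),
    ("pyparsing", "2.4.7", None),
]


def select(flat_lock, environment):
    """Return the (package, version) pairs whose marker matches the environment."""
    chosen = []
    for name, version, marker in flat_lock:
        if marker is None or all(
            environment.get(key) == value for key, value in marker.items()
        ):
            chosen.append((name, version))
    return chosen


print(select(FLAT_LOCK, {"sys_platform": "linux"}))
print(select(FLAT_LOCK, {"sys_platform": "win32"}))
```

The installer never inspects dependency edges here; all of the resolution work has been pushed into whatever produced the list.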

frostming · August 12, 2021, 1:19am

I am afraid it isn’t as trivial as you think. Both poetry and pdm have tried marker resolution (which means a blind installer) but gave up and went with installer resolution. Say we have A that depends on B when os_name == 'nt' and C that depends on B when sys_platform == 'win32'. If we are to record the marker of B, marker merging has to be performed and the result would be os_name == 'nt' or sys_platform == 'win32'. As the number of dependants grows we can expect an extremely long marker string on B. And we would want marker deduplication to avoid some rare failures, so that os_name == 'nt' or os_name == 'nt' collapses to os_name == 'nt'. In short, it would be another hard problem.
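For reference, the merging and the simplest form of deduplication described above can be sketched as follows. The helper `merge_markers` is hypothetical and treats markers as opaque strings, so it only removes textually identical alternatives. It does not recognise equivalent markers that are written differently, and it does nothing about the growth of the merged string.

```python
def merge_markers(markers):
    """OR together marker strings, dropping exact duplicates and keeping order."""
    unique = list(dict.fromkeys(marker.strip() for marker in markers))
    return " or ".join(unique)


# B is needed by A under one marker and by C under another.
print(merge_markers(["os_name == 'nt'", "sys_platform == 'win32'"]))

# Two dependants that use the same marker collapse to a single term.
print(merge_markers(["os_name == 'nt'", "os_name == 'nt'"]))
```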

uranusjr · August 12, 2021, 5:15am

@frostming already explained that this is not as doable as it might naturally seem. I think the consensus from authors and other contributors to the current PEP 665 draft is that it is indeed possible to describe NP-hard problems with the current syntax, but the installer should not be worried about those NP-hard cases since the locker should not emit such lockfiles in practice. We could definitely restrict the syntax to eliminate those NP-hard cases in theory, but that’d require us to come up with an entirely new spec and new parser implementation that’s only going to be useful for a very limited number of people.

IMO it’s far more productive to make this an “uncodified” agreement between the locker and the installer. If this bothers you, we can definitely add something to PEP 665 that says the locker must not emit complex things (although I am not sure what terms we should use to describe this rule more clearly).

njs · August 12, 2021, 12:36pm

AFAIK, “NP-hardness” usually isn’t something you can split off and isolate from the rest of the problem, like, “oh it’s this dependency right here that makes the resolution problem NP-hard”. Resolution problems just are NP-hard as a consequence of combining all the pieces together. Or more concretely: can you give an example of an algorithm (maybe as pseudocode or whatever) that handles “easy” resolution cases but not “hard” ones? I just don’t understand what simplifications you think the installer can make, or how a lockfile generator can tell whether it’s generating “NP-hard problems” or not.

Oh sure, I’m not saying it would be easy :-). Version resolution is an intrinsically hard problem! But I can see how we might address the “long marker string” problem in a way that’s good enough in practice, e.g. by hard-coding knowledge of common cases (os_name == "nt" and sys_platform == "win32" are the same thing), or by implementing a simple optimization pass to deduplicate things. Collapsing A or A → A is a trivial transformation on the marker AST.

OTOH I still haven’t seen any explanation of what the alternative even is. How does pdm actually generate its lockfiles? Can you explain the algorithm?

BTW, here are some properties that I think we’d all agree would be nice:

Given a lock file + an environment, there should be at most one valid solution, i.e., all installers should produce the same result.
The needs entries in lock files should match those that appear in the actual packages; locking is mostly a matter of figuring out which subset of PyPI you need to include in your lockfile.

But in fact, it is impossible to have both of these properties simultaneously. Consider this counter-example:

The user requests packages A, B, and C.

Package A v1 depends on package B v2.

Package A v2 depends on package B v1.

(Note: this is just a simple way to create multiple valid solutions: you can either have A v1 + B v2, or A v2 + B v1, and different resolvers will pick different solutions based on heuristics.)

Package C is only available in one version, v1, which depends on:

package A v1 on macOS (A==1; sys_platform == "darwin")
package A v2 on Windows (A==2; sys_platform == "win32")
nothing at all on Linux

So on macOS, there is only one valid solution: A==1, B==2, C==1

And on Windows, there is only one valid solution: A==2, B==1, C==1

And on Linux, these are both valid solutions, and both solutions are valid using only packages and requirements that the locker will include in the lock file, because they’re required in at least some environments.

So if you encode this in the obvious way into a PEP 665 lockfile, you’ll end up with instability on Linux, where different installers can produce different results.
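The counter-example is small enough to check by brute force. The sketch below hard-codes the constraints exactly as stated above (the function names are made up for illustration) and enumerates every candidate pair of A and B versions per platform.

```python
from itertools import product


def is_valid(a: int, b: int, platform: str) -> bool:
    """Check one candidate (A version, B version) against the stated constraints."""
    # Package A v1 depends on B v2; package A v2 depends on B v1.
    if a == 1 and b != 2:
        return False
    if a == 2 and b != 1:
        return False
    # Package C v1 depends on A==1 on macOS and A==2 on Windows, nothing on Linux.
    if platform == "darwin" and a != 1:
        return False
    if platform == "win32" and a != 2:
        return False
    return True


def solutions(platform: str) -> list:
    """Enumerate all valid installs; C only exists as v1."""
    return [
        {"A": a, "B": b, "C": 1}
        for a, b in product((1, 2), repeat=2)
        if is_valid(a, b, platform)
    ]


for platform in ("darwin", "win32", "linux"):
    print(platform, solutions(platform))
```

macOS and Windows each yield a single solution, while Linux yields two, which is the instability being described.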

Upthread I think it was claimed that poetry/pdm currently produce lockfiles that are guaranteed to only have a single solution, and that use the original package Requires-Dist metadata in the lockfile – is that correct? How do they handle this case? The only solution I can think of is to synthesize an extra top-level requirement like A==1; sys_platform != "win32" and sys_platform != "darwin", but I don’t know how you algorithmically figure out when an extra constraint like that is needed or what it should look like. (I’m not even sure how an algorithm would write down the needed constraint once it found it, given that it needs to be the negation of some other arbitrary constraints, and the marker language doesn’t have not.) Do poetry/pdm have a solution here?

uranusjr · August 12, 2021, 2:12pm

One example would be pip’s resolver before 2020. It recursively visits requested dependencies (with matching environment markers) and picks the first thing it sees that works with what it currently knows. If anything ever comes up down the road that invalidates the previous solution, it simply ignores it and carries on (a more “robust” installer implementation can choose to error out immediately instead of ignoring the conflicting constraints).

My previous comment was ambiguous and is likely the source of misunderstanding here. The lock file does not promise to contain exactly one solution for each environment, but at least one; in this situation there are two valid solutions, and the installer would be free to choose either. What the lock file promises, though, is that whichever solution the installer chooses at this point is going to be valid in the end, and the installer will never need to perform backtracking or conflict resolution or whatever, i.e. the “complex” part. The approach taken by existing resolvers is that since the user does not specify further, they should accept either solution, so the tool just picks one arbitrarily (but consistently to avoid confusion in practice).

FRidh · August 12, 2021, 7:40pm

The Software Heritage project has developed the Software Heritage ID format, or SWHID. A SWHID is a persistent identifier. Looking forward, I suggest replacing the url field with a swhid field. In a SWHID the item of interest is separated from the url at which it can be found. In Nixpkgs we do this as well behind the scenes, because it should not matter where an artifact is fetched from; the location is just additional info.

Note Software Heritage only deals with source code, so I am not quite sure how it (the SWHID) would have to be adapted to deal with artifacts such as sdists and wheels.

brettcannon · August 12, 2021, 9:08pm

But that would require hitting an external service to resolve the ID to a URL, and a key design point of the PEP is that an installer does not need to contact any third parties to resolve what to download and install.

EpicWink · August 12, 2021, 11:25pm

Why not both?

FRidh · August 13, 2021, 5:29am

To go from just a core identifier to an object, a resolver is indeed needed. However, by also including the origin field (URI) that is not needed, as we effectively get the url as proposed now. Note I do find it a bit of a pity that they use SHA1 only and do not support SRI.

njs · August 13, 2021, 12:19pm

How do lockers guarantee that backtracking is never needed? E.g. a simple case:

Top-level requirements: A, B

A’s requirements:

B == 1; sys_platform == "win32"
B == 2; sys_platform == "darwin"

Now, suppose the installer happens to process the top-level requirements in the order B first, then A. Since both B == 1 and B == 2 have to be listed in the lockfile, the installer has to pick one. Presumably it will pick B == 2, because that’s the latest version. But on Windows, this will later turn out to be a mistake, forcing it to backtrack…

Does that make this an invalid lockfile, or… what?
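The order dependence can be reproduced with a small greedy, no-backtracking installer. Everything here is a simplified assumption for illustration: versions are plain integers, A's conditional requirements on B are stored in a dict keyed by platform, and `greedy_install` raises an error where a real resolver would backtrack.

```python
# Versions listed in the hypothetical lock file.
AVAILABLE = {"A": [1], "B": [1, 2]}

# A's requirements on B, keyed by sys_platform (simplified markers).
A_NEEDS_B = {"win32": 1, "darwin": 2}


class Conflict(Exception):
    """Raised when an earlier pick turns out to be incompatible."""


def greedy_install(order, platform):
    """Pin packages depth-first in the given order, never revisiting a choice."""
    pinned = {}
    queue = [(name, None) for name in order]  # (package, exact version or None)
    while queue:
        name, wanted = queue.pop(0)
        if name in pinned:
            if wanted is not None and pinned[name] != wanted:
                raise Conflict(
                    f"{name}=={pinned[name]} already chosen, "
                    f"but {name}=={wanted} is required"
                )
            continue
        # No constraint known yet: take the latest listed version.
        pinned[name] = wanted if wanted is not None else max(AVAILABLE[name])
        if name == "A" and platform in A_NEEDS_B:
            # Visit A's dependency before the remaining top-level entries.
            queue.insert(0, ("B", A_NEEDS_B[platform]))
    return pinned


print(greedy_install(["A", "B"], "win32"))  # A first: B is pinned via A's requirement
try:
    greedy_install(["B", "A"], "win32")     # B first: latest B is picked too early
except Conflict as exc:
    print("conflict:", exc)
```

Processing A first happens to work on Windows, while processing B first does not, so a lock file like this is only safe for a backtracking-free installer if something else disambiguates B.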

uranusjr · August 13, 2021, 8:58pm

(Disclaimer: I don’t actually work on a platform-agnostic locker, so this is only what I presume they’d do.) One way would be to split the top-level B requirement into ["B==1; sys_platform == 'win32'", "B; sys_platform == 'darwin'"], although this could be undesired if we’re going to use needs to record the “raw” requirements.

Also I recall a lockfile format (Poetry? don’t really remember) I researched has a platform marker field on individual package sections to indicate the package entry is only valid on certain platforms (kind of like the top-level marker field in PEP 665 but for each package), which feels like a good solution to this. Having marker (and perhaps tags?) in each package entry also makes it more similar to the top-level metadata field, and I like this kind of mirroring personally.

brettcannon · August 14, 2021, 12:03am

If we propagate and combine all markers pertaining to a package in the lock file does that mean we don’t need to keep the markers in needs since the resolution would be at the package version level as to what to install?

If we were to do this, my questions would become:

Do we then make needs just list package names?
Does this change people wanting to record the original input to the locker’s resolver?

Wouldn’t hurt for symmetry.

Would this lower the computational overhead for the installer?