PEP 751: now with graphs!

After the discussion in PEP 751: lock files (again), I have updated PEP 751 in three key ways:

  1. It stores the dependency graph instead of a set of package versions
  2. It records the known entry points into the dependency graph in a [[groups]] array (which also eliminates the need to have multiple lock files and makes the lock file self-contained, supporting dependency groups in the process)
  3. The metadata is stored in a more separated/exploded fashion

There are two open issues in the PEP we can specifically discuss: the exploded format for files and whether a top-level requires-python makes sense.

I’m also assuming there will be other things discussed as well. :sweat_smile:

For the very long list of people whose input is particularly important:

Poetry: @radoering
PDM: @frostming
Hatch: @ofek
uv: @charliermarsh , @zanie , @konstin
pants: @anon62990384
pip: @pf_moore
security: @woodruffw , @dustin , @sethmlarson


Dependabot: @jeffwidman

(Deleting in favor of an expanded comment below.)

“Enough details SHOULD be provided such that the lock file from the details in this table”

… can be reproduced?

Okay, a longer reply with some examples after spending more time with the PEP. Apologies for anything I’m misunderstanding, and thanks for all your work here Brett.


First, I find the overload of [[groups]] to represent both dependency groups and entrypoints in the graph to be a little confusing. E.g., the fact that the project name is repeated between groups.name and groups.project seems like a sign that the schema is just a bit off? A few questions around this…

  1. Can you expand a bit on this piece? What is this constraint expressing?

Lockers MUST NOT allow for ambiguity by specifying multiple package versions of the same package under the same group name when a package is listed in any project key.

  2. Why is packages.groups necessary? Why the back-link?

  3. Am I right that every package listed as a groups.project should have exactly one matching [[packages]] entry?

  4. Would you express constraints “between” groups (e.g., workspace member A depends on member B) by creating two groups (one for A, one for B), then two packages (one for A, one for B), then adding a dependency from package A to package B? In this case, would B’s dependencies have groups = ["A", "B"]? Or groups = ["B"]?

  5. What’s the motivation for including “root packages” as [[groups]] in the first place? I think that once you have a graph like this, you could install from “any” root without them.

From my experience in uv, I think this top-level [[groups]] concept only seems necessary for non-[project] projects with a [dependency-groups] table. Otherwise, can’t “groups” just be expressed as part of [[packages.dependencies]]? (If that’s true, I might suggest making [[groups]] exclusively used for [dependency-groups] that aren’t associated with a project, and removing the concept of groups.project.)
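To make the repetition concrete, here’s a hypothetical fragment of what I understand the current schema to produce (my own sketch of the fields as I read the PEP, not an excerpt from it):

```toml
# My reading of the current schema; not an excerpt from the PEP.
[[groups]]
name = "spam"      # the entry point into the graph...
project = "spam"   # ...repeating the project's name

[[packages]]
name = "spam"
version = "1.0.0"
groups = ["spam"]  # the back-link I'm asking about above
```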


Second, how would this schema express a set of requirements like:

  • root depends on flask @ https://github.com/pallets/flask@main ; sys_platform == 'darwin'
  • root depends on flask @ https://github.com/pallets/flask@2778b7 ; sys_platform != 'darwin'

At time of writing, these refer to different commits, but when built, map to the same Flask version (3.0.3).

For clarity, what would the version field look like for root’s dependency on flask? And how would installers know which flask [[package]] to include?
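For illustration, the only encoding I can come up with keys the entries off their source, using invented fields (the PEP, as I read it, keys packages off name and version alone):

```toml
# Invented syntax: same name and same built version (3.0.3),
# distinguishable only by the Git reference they came from.
[[packages]]
name = "flask"
version = "3.0.3"
git = "https://github.com/pallets/flask"
rev = "main"

[[packages]]
name = "flask"
version = "3.0.3"
git = "https://github.com/pallets/flask"
rev = "2778b7"
```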


Third, how would this schema express a set of requirements like:

  • root depends on flask >= 3
  • root depends on child
  • child depends on flask==3.0.3 ; sys_platform == 'darwin'
  • child depends on flask==3.0.2 ; sys_platform != 'darwin'

In this case, there would be two [[package]] entries for flask. But when you’re walking the graph, and you start at root, you have to pick a flask version, only knowing that you need flask >=3. How would an installer know which to include? (They could figure it out by walking to child, but now you’re performing a resolution.)

I think this can be solved by writing the resolved versions on the edges, rather than just the specifiers. Is it solved in the PEP by using packages.groups?
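To show what I mean by “writing the resolved versions on the edges”, a sketch with invented fields:

```toml
# Hypothetical: root's edges record the versions that were resolved,
# not just the '>= 3' specifier, so the installer never has to choose.
[[packages]]
name = "root"
version = "0.1.0"

[[packages.dependencies]]
name = "flask"
version = "3.0.3"
marker = "sys_platform == 'darwin'"

[[packages.dependencies]]
name = "flask"
version = "3.0.2"
marker = "sys_platform != 'darwin'"
```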


Fourth, how would this schema express a scenario like:

  • Workspace has a root package root
  • root depends on workspace member child, another local package
  • child has a [dependency-groups] entry named dev
  • User wants to install the dev group from child – something like: uv sync --package child --group dev

It seems like dependency groups are only allowed at the top-level, and aren’t considered “attached” to a project – projects and groups seem somewhat mutually exclusive in the design. (In the uv lockfile, dependency groups are modeled similarly to optional dependencies, i.e., translating to the design you have here, I might’ve expected dependencies to have a group field in addition to a feature field.)
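In other words, I might’ve expected an edge to be able to name a dependency group, something like this (group is a field I’m inventing by analogy with how extras/features are recorded):

```toml
# Hypothetical: an edge from root to the 'dev' group of the local
# package 'child', mirroring how an extra would be attached.
[[packages.dependencies]]
name = "child"
version = "0.1.0"
group = "dev"   # invented field; not in the PEP as written
```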


Fifth, per your question, I don’t feel strongly about including a top-level requires-python. I would vote to include it, at least optionally, but it’s not a “blocker” in my view.

We use it to: (1) ensure that we can re-lock if the requires-python changes, (2) record the exact requires-python that the user provided, which could be stricter than whatever is computed from the dependencies if the user’s project has a higher requirement (this is especially important if you want to support installing without a pyproject.toml, since otherwise, you lose this information!), and (3) it lets us do some marker simplifications because we can make assumptions about the supported Python ranges.

I think (2) seems the most important, assuming that’s a PEP goal?


Sixth, again per your question, I would prefer not to extract wheel tags out of the filename (i.e., not parsing that data out and storing it in structured fields). It doesn’t seem very useful to me and it makes the lockfile less concise. (This is relatively minor, not a strong opinion.)


Pixi: @wolfv

I was trying to avoid having two separate arrays for specifying supported roots into the graph. I’m open to making them separate.

But that’s not required, it’s just a choice.

No ambiguity in following the graph in the same group. This allows for having e.g., a group for the newest versions of things and a group with the oldest version of things in the same lock file, but also not having a more complicated installer algorithm beyond “follow the edges whose markers you support” (i.e. how to guarantee no resolving in the installer by having to make any decisions).
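As a sketch of that “newest and oldest in one file” case (hypothetical package names and versions):

```toml
# Two groups segment the graph, so the same package can appear at two
# versions without the installer having to make any decisions.
[[groups]]
name = "newest"
project = "spam"

[[groups]]
name = "oldest"
project = "spam"

[[packages]]
name = "attrs"
version = "24.2.0"
groups = ["newest"]

[[packages]]
name = "attrs"
version = "20.1.0"
groups = ["oldest"]
```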

See my answer above; without groups you can’t segment the packages to avoid conflicts in versions and other things if you want everything in one lock file.

For that group, yes.

Sorry, this is hard to follow as you seem to be referring to packages and groups with the same letters (plus I don’t know what you mean by “workspace” as that’s not a defined term in the PEP or Python packaging in general).

But if I’m following correctly, if group A was to use the same packages as group B in some instances, then you would either do groups = ["A", "B"] or list the packages twice, once in each group.

Not if you skip extras that you didn’t lock for. Let’s say you lock for spam[fast], but skipped spam[compat]. This would be recorded in [[groups]] by only recording the one case. But if you left that detail out and tried to enter the graph at spam, you would have to walk the graph first to know that the lock file won’t succeed in your case.
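So, roughly (the group name here is just an illustrative label, not a mandated scheme):

```toml
# Only the locked entry point is recorded; spam[compat] was skipped,
# so nothing advertises it as a supported root.
[[groups]]
name = "spam[fast]"   # hypothetical naming
project = "spam"
```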

So you’re right, you don’t have to record the supported roots. But if lock files are meant to be self-contained, then you’re saying you always have to know what you want to install, and that you won’t know whether it’s even feasible until you try it, versus looking at the lock file and seeing what it recorded as what it’s meant for. So it’s a question around the ergonomics of treating lock files as self-contained, and how much they should try to make themselves easy to work with in total isolation.

Groups also help with ambiguity when the same package is listed multiple times for different versions.

I think it depends on whether you expect any and all extras for every package in the lock file to be supported? If the answer is “yes” then you’re right. If you say “no, but that’s okay to find out later”, then you’re also right. But if you say, “no, and you should know that upfront”, then treating a project as an explicit root makes sense to me.

This is why I hate sdists in lock files. :angry: Let me turn that around and ask how you support that in uv? The only way I see that working is not keying packages off of their name and version but off of the file source, but that’s a bit of a shift from how packaging has worked up to this point.

That’s where groups come in; different group for different needs.

The only way I can see that working is if you propagate the 'darwin' marker requirement up into two different requirements under root (or two separate groups). Otherwise, what’s your proposal to solve it without getting a resolver involved?

Create a group for that child’s dependency groups.

I think there might be a misunderstanding here: the lock file is not expected to cover any and all paths down all edges of the graph at any arbitrary node. That could lead to a massive graph and a long resolve time for locking (but probably not installing). So I don’t expect you to want all of the extra stuff some random package version has.

Separate from that, I don’t view dependency groups as something you would install from some arbitrary dependency, but from the project you’re directly working with. So going with your workspace example (which, once again, isn’t a defined concept, so I’m somewhat guessing here), if you had a monorepo where you wanted to support all dependency groups in all projects, then you would explicitly do that with separate groups.

Yes, they are a top-level thing, but I don’t think they are mutually exclusive.

Ah, is this what you mean by “writing the resolved versions on the edges” (even though groups are not versions directly)? As in you select the group, and instead of filtering at the [[packages]] level you filter at the requirements level? Or filter at both levels, since you still have to pick the correct version somehow (and the PEP currently handles that by saying, “there can only be one package version”)?

:+1:

:+1:


I’m going to ask the direct question: what design choice do you think I’m missing here? What’s your proposal for how the graph should be laid out in the file?

Ahh, I see – yes, in that sense, we do treat entrypoints differently than other packages in uv, since we fully resolve all of their extras (though we don’t write their lockfile entries any differently – we don’t record them as roots in any way).

Yeah that’s exactly what we do: we key off name, version, and source. The source could be a registry index, a Git URL, a direct HTTPS URL, or a local file path.

Yeah, I think the marker needs to make it up to root, but you also need to write the selected versions and not just the version specifiers. In other words, I think the [[package]] entry for root should have two separate flask entries: one that points to 3.0.3 with the Darwin marker, and one that points to 3.0.2 with the non-Darwin marker.

Yeah I don’t want that either.

Sure thing. My questions are attempting to do two things: (1) help me understand the [[groups]] concept, and (2) help me evaluate correctness. I have fewer opinions on (1) right now since I’m still trying to understand it (I think I need to sketch out a few examples on my own to map these to uv).

On (2), though, right now my feeling is that: (1) we should be writing the resolved versions on the “edges” (the package.dependencies entries), not just the requirements (otherwise, I don’t yet see how the multiple-flask-versions case works); and (2) we should be keying off of package source and not just version (otherwise, I don’t yet see how the multiple-flask-Git-URLs example works).


Can you expand on this a little bit? What is the “one case” that gets recorded? Is there a group for each extra, for each package? I’m just trying to square this with the pseudo-code.

@brettcannon – Here, I’m trying to model two local source trees, one of which depends on the other. Both projects define their own dependency groups. I want users to be able to install:

  • root
  • child
  • The test group in root (with or without root itself)
  • The lint group in child (with or without child itself)
  • root with the stubs extra enabled
  • child with the async extra enabled

Is this roughly the correct representation for that setup?

I may be mistaken, but this one looks to me, right now, like an impedance mismatch.

You denoted this in your sample as child~lint, but there is no standardized syntax for referring to the dependency groups of a package.[1]

It seems like the locking use-case you’ve described isn’t one which can be standardized without defining some syntax for that.

I’m not sure if it’s useful, but I’ll note that if I were trying to write a string which describes that dependency group, I’d do it based off of the path to the relevant pyproject.toml, since that’s where the data came from – not the package name. i.e. Something like ./child/pyproject.toml~lint.
That has the important benefit that it does not allow you to try to lock the dependency groups of a package off of pypi.

I don’t have any strong opinion (positive or negative) about giving syntax to “the dependency groups of the package in directory X”. I have a very strong opinion against giving syntax to “the dependency groups of package X”.


  1. This was very intentional in PEP 735. Among other things, it reduces the potential for confusion between Dependency Groups and Extras, and allows Extras to be the dedicated “public interface” part. ↩︎


I hear you, but part of what I’m trying to confirm here is that there are no rules around how groups.name is constructed and what it can contain (at least in the PEP as written). You can create a group with any name. I could’ve used any string there – I could’ve called them "Foo" and "Bar". So I’m trying to understand what the expected use-cases and outputs are, and how they would intersect with the CLIs of the tools that are going to implement this spec.


Oh, I had missed that name is not constrained! That means you can call it child~lint and users can ask an installer to “install child~lint” and there’s at least nothing which forbids such usage.

I’m not sure if I should think of that as a beneficial decision or a gap? It allows names to be non-standardized-strings, but that seems like it could also cause compatibility issues (at least in theory).

Some minor points to start with:

  1. The specification now only allows a single lockfile, named pylock.toml. I don’t have a problem with that from pip’s point of view (I expect pip to take an argument that’s the lockfile name, so it doesn’t matter to us) but I have a feeling someone (maybe @ofek?) intended to produce multiple lock files for different situations, so this might be problematic in that case.

  2. In the specification of [locker], there’s a statement

    Enough details SHOULD be provided such that the lock file from the details in this table (provided the same I/O data is available, e.g., Dependabot if only files from a repository is necessary to run the command).

    This seems incomplete. Should it be “the lock file can be reproduced from the details…”?

Looking at the pseudo-code for installers, it feels rather complex, but if I assume you’ve got it correct, I don’t have a problem with pip implementing it. I am a little concerned about maintainability, as it certainly wasn’t obvious to me that it’s correct, though. In particular, the lack of a “plain text” description of how to interpret a lockfile means there isn’t anything to check the pseudo-code against. And it means that someone who simply wants to audit a lockfile to check what’s going to be installed, will have to reverse-engineer the pseudo-code to do so.

Thinking about the SHOULD requirements for installers, I imagine pip would not make sdist usage opt-in. It doesn’t fit well with our current --no-binary, --only-binary, --prefer-binary options[1]. I also don’t think we’d want to commit to supporting syncing an existing environment. Doing that without a resolve step seems like it could be a tricky design problem (we don’t even cleanly handle installing into a non-empty environment correctly in some edge cases at the moment). I imagine we’d handle it like --target - the supported use case is installing into an empty environment, and while we don’t disallow installing into a non-empty environment, we don’t guarantee that the resulting environment will be consistent.

I’d be interested to know how other installers would handle these two cases. Unfortunately, the only other installer likely to care is uv, and I don’t know how much they can decouple the logic of the “installer only” uv pip interface from that of the uv project interface. @charliermarsh? Will uv even have a uv pip install --lockfile pylock.toml independently of uv sync?


  1. The discussion in Speculative: --only-binary by default? · Issue #9140 · pypa/pip · GitHub covers switching to only binary by default for normal installs. While that doesn’t preclude doing it for lockfiles, the discussion covers the changes to our option structures we’d need to make so that users still had the necessary choice ↩︎


From Poetry’s point of view, I fully agree: name and version alone are not sufficient to unambiguously identify a lockfile entry; sometimes you need the source.

That is interesting. If I understand correctly a split in the graph must be propagated not only upwards but to all other locked packages that depend on this dependency, e.g.

  • root depends on child1
  • root depends on child2
  • child1 depends on flask >= 3
  • child2 depends on flask==3.0.3 ; sys_platform == 'darwin'
  • child2 depends on flask==3.0.2 ; sys_platform != 'darwin'

Since the installer might visit child1 before child2, child1 must have two separate flask entries with the resolved versions.
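i.e., something like this (invented fields for the edges, as above):

```toml
# Hypothetical: child1 declared only 'flask >= 3', but its locked
# edges must mirror the split that child2 forced.
[[packages]]
name = "child1"
version = "0.1.0"

[[packages.dependencies]]
name = "flask"
version = "3.0.3"
marker = "sys_platform == 'darwin'"

[[packages.dependencies]]
name = "flask"
version = "3.0.2"
marker = "sys_platform != 'darwin'"
```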

@charliermarsh Does uv only lock the resolved versions or does it also lock the version specifiers?

I wonder if it is sufficient to lock resolved versions or if version specifiers are still required in addition. If I’m not missing anything, for Poetry the only reason for locking version specifiers in addition to resolved versions would be supporting something like this (without a cache or index access), i.e., printing why a package is required (including specifiers):

$ poetry show package-a
 name         : package-a
 version      : 0.8.0
 description  : ...

required by
 - package-b requires >=0.4.2
 - package-c requires >=0.6

Probably not that relevant, so version specifiers can presumably be neglected in favor of resolved versions.


Ah I missed this – I don’t think we’d do this either. Or at least, I find it surprising, and I don’t see how it would mesh with our interfaces.

This doesn’t raise any alarm bells for me in uv; I assumed we would support this.

Yeah I was hoping to support that.

In uv, we only lock the resolved versions. We actually do include the specifiers as a separate “metadata” table in the lockfile, but they’re not used during resolution or installation – they’re just used for cache invalidation (we want to know if a package’s dependencies changed so that we can discard the lockfile).

(I’m happy to talk more about how we solve this in uv if it’s relevant though by default I’ll try not to sidetrack too much.)


I haven’t had time to read the update so thank you very much for bringing this to my attention! Indeed, Hatch was going to have a lock file per environment. Would someone mind giving a TL;DR about why this is no longer supported?

How would you expect this to work? Consider the following situation:

  1. We have an environment currently containing A 1.0 and B 1.0. B 1.0 depends on A <= 1.0.
  2. We have a lockfile built from a single requirement, for (any version of) A. It locked A 2.0, because that’s the latest version.

If you try to sync the lockfile into the environment, the installer sees A 1.0 installed and a request to install A 2.0. Does it do so? If it does, it will end up with an inconsistent environment. But in order to reject the request with an error, it needs to check not only all the packages in the lockfile (only one in this case, but potentially hundreds) but also all the packages in the environment, whether or not they are mentioned in the lockfile (again, potentially hundreds of these).

At the moment, pip handles this with a post-install check, which will tell you that the environment is now inconsistent, but at that point it’s too late, the install has completed.

The reason I distinguished between uv pip install --lockfile and uv sync here is that uv sync has access to the complete definition of the environment, as it’s being managed by uv. But in the case of an arbitrary environment, there’s no way of knowing what’s in there except by scanning every .dist-info directory in the environment. And that’s potentially an extremely high-cost operation, which in 99% of cases will be wasted effort.


Yeah totally fair question. uv pip install would do basically the same thing: we perform the install, then we have a post-install check to validate that the environment is consistent and warn if not. I tend to view this as “okay” in the context of uv, because if you’re using uv pip install, you’re “opting in” to manipulating your environment at a lower level as opposed to using a declarative specification for what you want the environment to be. But I agree that it can leave users in a broken state… No argument there.

We also have uv pip install --exact which will bring the environment entirely in-sync with the provided requirements (so, in that example, it would uninstall B).