PEP 751: lock files (again)

ncoghlan · September 16, 2024, 11:57pm

I don’t think that assumption is true when the change affects an edge higher in the graph. The graph format would contain the change to the affected edge (that’s the point of the graph format), while the linear format would push it out to all of the affected nodes.

Marker syntax (following the way extras work). However, we could put it in the TOML either as you suggest (in the group list), or else by adding a separate conditional-markers TOML list (combined with the main marker with “and”) where the components are joined to each other with “or”, and may be either plain strings or objects with a group qualifier.

The second approach has the benefit that path dependent qualifiers could be specified as objects with a marker field for the marker fragment, and an installing field referencing back to the package that adds that branch to the marker expression.

anon62990384 · September 17, 2024, 12:25am

Speaking for Pex, a graph is definitely preferred and the way it has generated locks from day 1. It is more flexible as you say: you can arbitrarily subset a graph lock deterministically. This is very useful when you want to, say, lock a many-project repo with 1 lock and then subset from that lock for individual libraries, binaries or even tests. If you don’t care about sub-setting but do care about auditing / locality you can add a tool. You cannot do the inverse with a linear list and add a tool to get arbitrary sub-setting.

mikeshardmind · September 17, 2024, 1:13am

I view a graph here as more an intermediate step that exists between the top level requirements and how a solution was found. While tooling generated lock files usually should not be manually edited, being able to see at a glance the conditions in which a specific package is pulled in is invaluable, and saying that people should add yet more tooling to get that ability for review… what’s the gain here? Speaking only in my own experience, the flat output from pdm allowed review to identify a couple of transitive dependencies we really didn’t need to take on, and led to better cross-platform support in a few cases when instead of taking those on, we made something more portable.

The lock file is a specific outcome, the solution in this case. locking as a graph doesn’t seem to enable anything additional here as far as I can tell, at least not under the assumption that there still exists a record of what the original inputs were (those are in pyproject.toml + backend for anything dynamic right?) and that tools are not forbidden from having their own caches that are not part of the unified lock file specification (ie. no performance reason where a graph outperforms for some case, since tools could cache in that case)

ncoghlan · September 17, 2024, 2:37am

If a tool wants to support fast subsetting, a graph overlay on the linear list only needs to record the package-package edges (since in the linear list format, which packages to actually install in a given environment would be determined by the marker and conditional marker information recorded for each package).

PEP 751 already allows for that level of graph recording via the optional packages.dependents and packages.dependencies fields, so I don’t think the subsetting use case needs to be specifically considered when deciding how best to record conditional marker clauses (since it will be covered regardless).

brettcannon · September 18, 2024, 12:29am

Thanks to everyone who took the poll! If I subtract out one of the two uv devs who voted for the same outcome you end up w/ a tie! Makes my life a little harder.

For projects that try to handle monorepo scenarios, there seems to be a benefit (and uv has given the same explanation):

I think this is can be true as long as the data as a whole is expressive enough to capture choices made earlier in the graph that could influence which branches to take (e.g., extras and groups), but maybe @charliermarsh knows of situations where this isn’t true?

mikeshardmind · September 18, 2024, 3:40pm

I had to take some time to give this more thought, but I don’t see how retaining the edges explicitly in the lockfile helps here or that what it helps with is locking. It should be possible to either cache or recreate the exact same edges by constraining valid nodes to those already expressed in the lockfile.

If I have a directed graph: A → {B, C}, B → {D, E}, C → {E: version less than 3}
Then a subset of this compatible with A is going to be contained in the flat version. Subsetting for B is simply traversing B’s requirements while limiting to what’s already been recorded for A for solutions.

This would make the subsetting for B

B → {D, E: version less than 3}

a version of both D and E that is less than version 3 would be in the flat format for A, simply using the versions solved for already result in the correct output.

Explicitly keeping the edges seems to only cut out that traversal of B’s requirements, which I’m not sure how to put a value on when caching seems sufficient here, and I have first-hand experience of how the flat format is more useful to human review.

brettcannon · September 18, 2024, 8:25pm

I don’t think it helps with the locking itself, it just makes the lock file more versatile. I think some of the motivation for wanting the graph is what I have proposed up until this point has taken an approach where installers just walk from the top of the file down and install packages based on whether it applied base on environment markers. That meant you couldn’t install a subset of packages with what I have brought up.

But I think it should be possible to tweak things so that as long as you know which entry points into the graph you would want, you can still write out the dependencies in a flat list as long as you can specify which entry point you’re requesting.

Now, one thing that the flat file cannot do is support arbitrary entry points into the graph w/o recording the edges. So my question to @charliermarsh and @anon62990384 is whether they support arbitrary access to the dependency graph in their lock files from their respective tools or whether it is known upfront what the entry points will be (i.e. can the user choose any entry point into the graph when they call your tool, or do you control the entry points)?

anon62990384 · September 18, 2024, 8:50pm

@brettcannon Pex allows truly arbitrary entry points. The Pex lock file format is perhaps way simpler than what’s been described so far. Relevantly just:

The input requirement specifiers to lock as a list, i.e.: ["foo[bar, baz]; python_version >= '3.8'", "spam"]
The resulting list of locked dependencies, each of which includes:
a. The list of locked artifacts and their hashes
b. The locked dependency version
c. The dependency’s Requires-Python metadata
d. The dependency’s Requires-Dist metadata

(N.B.: That c & d use the “assume all dist metadata is equal” assumption which has been discussed previously)

Pex locks do not support forks like uv and others do, its guaranteed a lock locks exactly 1 version of a project (or fails locking), and so you can take an arbitrary entry point, say "eggs[bip]" in the example above which is an interior node not in the input requirements, and if there is a version of "eggs" in the lock, the graph is walked from there with the "bip" extra activated while walking the "eggs" Requires-Dists, and then recurse. If a multi-project repo diverges in requirements such that a forked lock would be required, you must create 2 (or more) locks. Generally use has shown the number of locks is small (1-3) in large multi-project repos and the social motivation to keep divergence down between projects is high.

Also, unlike other lockers here I think, Pex knows nothing about pyproject.toml - it deals in requirement specifiers 1st class and has no care where they come from or how they are arranged.

An example is here: pex/package/pex-scie.lock at main · pex-tool/pex · GitHub
That’s maybe a confusing example since it contains 4 separate locks and the lock just locks psutil, which has no further dependencies. Suffice it to say, which lock to use is determined at use time by a best-fit algorithm that takes the average rank of each lock that can satisfy the input requirements (maybe an arbitrary subset) where the rank is calculated from each used locked dependency’s best tag match to the consuming interpreter. I.E.: sdist ranks last, 1st tag output by packaging.tags.sys_tags ranks 1st. The "platform_tag" for the individual locks are vestigal and otherwise unused. Both the locking interpreter and the consuming interpreter can either be a real interpreter or a foreign platform description (we spoke about these several years ago now and I never raised a PEP for the idea), examples of which are here: pex/package/complete-platforms at main · pex-tool/pex · GitHub

charliermarsh · September 19, 2024, 1:17am

In uv: in theory, we could install from any root; in practice, we only allow installing from a workspace member, so we know the set of entrypoints upfront (it’s basically the set of local packages).

charliermarsh · September 19, 2024, 1:23am

I don’t quite follow how you would know which packages to include for (B) if you’re not recording B’s dependencies, and the dependencies of those dependencies, etc.

mikeshardmind · September 19, 2024, 3:21am

If the entry point is B, then when you go to install B, you have B’s metadata, and therefore dependencies. you pick those constrained to what was recorded in A’s flat version (already have the solution for that dep), and repeat for B’s dependencies (And so on), never needing to resolve a version or fetch new additional data, only pick the already solved version from the larger but flattened graph.

anon62990384 · September 19, 2024, 3:47am

Aha - yes. That makes sense. It does ~destroy the Pants / Pex use case where you subset just using metadata in the lock in ~O(10)ms and then parallel download and install after that. You’d instead need to download, find the next set to download, download them, etc. That performance difference would probably be a no-go, but agreed with what you point out - you can recover the dep graph from metadata not saved in the lock.

brettcannon · September 19, 2024, 9:20pm

No one is suggesting any lock file format would require fetching more data or do a version resolve as that leads into an actual resolver.

I’m not quite sure what you’re referring to which would cause this sort of install approach from a lock file?

anon62990384 · September 19, 2024, 10:31pm

@brettcannon perhaps I did not understand the flat vs graph poll. If the lock file contains the full Requires-Python & Requires-Dist metadata (I call that a graph) - then all is good no matter the format. If, instead, the flat format throws out that metadata, then, to form an arbitrary subset, you’d 1st need to fetch the subset roots from the flat list of locked projects to get those wheels / sdists locally so that you could then read the missing Requires-Python & Requires-Dist metadata from the just-fetched distributions, and then use that info to look up more projects in the flat list and then recurse.

mikeshardmind · September 19, 2024, 11:50pm

It’s also possible I musunderstood something in here, but for context, here’s what pdm records in what was described as a flat version currently for a random dependency within a lockfile I have locally.

[[package]]
name = "aiohttp"
version = "3.10.5"
requires_python = ">=3.8"
summary = "Async http client/server framework (asyncio)"
groups = ["default"]
dependencies = [
    "aiohappyeyeballs>=2.3.0",
    "aiosignal>=1.1.2",
    "async-timeout<5.0,>=4.0; python_version < \"3.11\"",
    "attrs>=17.3.0",
    "frozenlist>=1.1.1",
    "multidict<7.0,>=4.5",
    "yarl<2.0,>=1.0",
]
files = [
  ... # omitted
]

and this repeats in the lock file for each dependency. While I expressed the traversal of dependencies iteratively to rebuild the graph, this doesn’t actually mean that each dependency would need to be fetched iteratively, it’s in the lockfile still.

I can see how this can be seen as expressing a graph, which is why I personally thought the distinction was less that the graph still existed in the output and more in how the graph is actually presented to users. (a flat list of nodes aware only of their children, and with relevant constraints pushed down to them)

If there’s another format in question, and including pdm as an example of what you perceived as a flat format was unintentional @brettcannon , then I’d also want further clarification here

anon62990384 · September 20, 2024, 12:11am

Aha! Thanks @mikeshardmind. In that case Pex locks have been “flat” since day 1 and you can arbitrarily subset them. I find the “flat” / “graph” descriptors mighty odd then and clearly am out of my depth here.

brettcannon · September 20, 2024, 12:22am

In case it wasn’t obvious, I’m struggling deciding between encoding the graph or a set (I’m using “set” instead of “linear list” to emphasize that the ordering of packages doesn’t matter). So, I’m going to write out my thoughts and see if something obvious comes to mind (I doubt it since I have been thinking about this for a while now), or to see what you all have to say to see if we can reach consensus as a group.

Set

This is when the locker writes out a list of packages that an installer evaluates each listed package one-by-one on their own (this is what PEP 751 originally proposed, what PDM currently does, and what Poetry is slated to do). The way you control what to install is primarily via environment markers and requires-python. This isolation of information facilitates auditing as you can look at any one package and understand whether it would get installed or not in any scenario.

The way a locker would most likely work is you provide a set of requirements and then it works out which packages are necessary for various marker requirements (basically if a marker requirement is met of not). The locker then propagates those marker requirements through the dependency graph so that when you write it out you get a collection of markers for a package that encapsulate any and all ways you could have reached the package through the dependency graph.

One potential drawback to this approach so far is this propagation of markers can lead to very large markers for a package. That can hurt the benefit of being able to read the package details. It also simply leads to bigger files.

There’s also the question of how to support a single lock file having multiple entry points into the dependency graph. This comes up w/ extras, PEP 735 (aka dependency groups), and monorepos. Typically this is handled via group labels. That way, for each package, you check if it’s a part of the group and if it is whether it should be installed via more markers. If you imagine lockers creating synthetic groups for anything that isn’t necessarily user-specified, you seemingly can get a subset of the dependency graph via some group. This does require, though, knowing all of your entry points into the graph upfront to create a group for them.

It has been suggested you could record enough data for the set of packages to recreate the graph, but I will admit that starts to feel redundant. It also means you are recording more information than necessary, so someone is probably doing work that they don’t have to. It’s not drastically bad, but it then becomes a question of whether keeping the edge details is optional or a requirement (which affects whether installers can rely on it), in which case we have encoded the same information twice.

Graph

This is when the locker writes out each package like it’s a node, listing the direct dependencies per package (this is what uv does). For the locker this is effectively writing down the graph they came up with during resolution. Installers then evaluate the markers of the dependencies to decide what other package(s) to install. This is not a resolution as you wouldn’t have backtracking.

The pro of this is it directly encodes what the locker came up with. It’s also a simpler and smaller lock file. This also means you can enter the graph from anywhere without necessarily encoding groups down to each package. The drawback is you can’t tell if a package would be installed simply by looking at its entry in the lock file.

So it seems the question is how much any of this matters:

Simplicity
File size
Ability to enter the dependency graph arbitrarily
Knowing whether a package can get installed just by looking at it
Complexity entirely in the locker or shared a bit between locker and installer

mikeshardmind · September 20, 2024, 12:39am

…

I think we’re not on the same page here on what pdm’s behavior is or definitions then. pdm’s lock files seem to actually sit somewhere between these definitions, as it records each entry like a node, but also pushes marker data from packages that require that package in the graph to each node. This both retains enough info to allow arbitrary subsetting, but also records enough information that for pre-determined entrypoints to the graph (pdm’s groups) there’s enough info at each entry to just linearly look through the lock file.

I’d amend my vote above to a graph if I have to pick one or the other and this is how you are viewing the distinction, but I’m not following fully because I don’t think your examples fully follow your intended defintions.

anon62990384 · September 20, 2024, 1:55am

Ok, coming up to speed I think with @mikeshardmind anyway. So the “Graph” can be turned into the “Set” by an external tool - not saying this should be done!.

I just now locked torch on a linux machine for requires-python >=3.8 using pdm and get, for example, this entry:

[[package]]
name = "nvidia-nvjitlink-cu12"
version = "12.6.68"
requires_python = ">=3"
summary = "Nvidia JIT LTO Library"
groups = ["default"]
marker = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
files = [
    {file = "nvidia_nvjitlink_cu12-12.6.68-py3-none-manylinux2014_aarch64.whl", hash = "sha256:b3fd0779845f68b92063ab1393abab1ed0a23412fc520df79a8190d098b5cd6b"},
    {file = "nvidia_nvjitlink_cu12-12.6.68-py3-none-manylinux2014_x86_64.whl", hash = "sha256:125a6c2a44e96386dda634e13d944e60b07a0402d391a070e8fb4104b34ea1ab"},
    {file = "nvidia_nvjitlink_cu12-12.6.68-py3-none-win_amd64.whl", hash = "sha256:a55744c98d70317c5e23db14866a8cc2b733f7324509e941fc96276f9f37801d"},
]

So this is “Graph” plus 1 field - marker. That marker field, if not present in the lock, could be added after the fact by a tool that does the marker propagation IIUC looking only at the data in the “Graph” lock - no internet access, etc. Again - not saying this makes any sense to do, just trying to stress my understanding of the difference in the formats.

Do I have that right @mikeshardmind?

(I will note this particular lock is broken (see the Windows whl and the Linux only markers), not due to pdm doing anything wrong, but due to torch having inconsistent metadata in its wheels for a given version! A known issue and, unfortunately, in a very widely used project)

steve.dower · September 20, 2024, 9:58am

To summarise this distinction a different way, in case it’s helpful: a graph requires a recursive install algorithm, while a set only requires iteration.

In code, locked as a set:

for entry in lock_set:
    entry.install()

Locked as a graph:

def install(node):
    if node.entry.is_installed():
        return
    node.entry.install()
    for dep in node.dependencies:
        install(dep)

for entry in lock_graph:
    install(entry)

I’ve omitted markers/conditions that may be provided by the caller, but otherwise, I believe it’s a fair comparison.^[1] The set version is significantly simpler for an installer (and as John said, a graph can be converted to a set with a simple (recursive) algorithm - no special knowledge, metadata access or resolution is required).

The graph will have repeated references to the same package, so the “is_installed()” check is required even when installing into an otherwise empty environment. The set has already handled duplicates, so it isn’t. Synchronising an already-existing environment is left as an exercise to the reader. ↩︎