PEP 751: lock files (again)

To be fair, given that a graph can be converted to a set, a better installer algorithm (avoiding the need for the is_installed check when the target environment starts empty) is to convert to a set and use the set algorithm.
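The set algorithm mentioned here can be sketched in a few lines (illustrative Python, not any tool's actual code - the graph shape and names are assumptions):

```python
# Sketch: flattening a dependency graph to a set lets an installer skip
# per-node is_installed checks when the target environment starts empty.
# The graph representation (name -> list of dependency names) is made up.

def flatten(graph: dict[str, list[str]], root: str) -> set[str]:
    """Collect every package reachable from *root* into a set."""
    to_install: set[str] = set()
    stack = [root]
    while stack:
        pkg = stack.pop()
        if pkg in to_install:
            continue
        to_install.add(pkg)
        stack.extend(graph.get(pkg, []))
    return to_install

graph = {"app": ["requests"], "requests": ["certifi", "idna"]}
# With an empty target environment, the whole set is simply installed.
print(sorted(flatten(graph, "app")))  # → ['app', 'certifi', 'idna', 'requests']
```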

It’s a minor point, although it does make it clear that at some point, someone will probably convert to a set (either explicitly or implicitly). So the real question here is how far down the process from requirements to installed packages we stop and spit out a lockfile - which is a trade-off between flexibility for the format and complexity for the consumer (installer, human reader, audit tool, …).

I imagine lockfile producers prefer flexibility, whereas consumers prefer simplicity :wink:

5 Likes

The locker is going to have to convert to a set regardless, in order to detect conflicts between package dependencies.[1] So it should always have both formats handy.
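That locker-side conflict check amounts to collapsing the graph into a per-name map and flagging any name pinned at more than one version (a minimal sketch with made-up data shapes; a real locker would also account for per-platform forks):

```python
# Illustrative sketch of a locker detecting version conflicts before
# emitting the lockfile. The pin representation is hypothetical.

from collections import defaultdict

def find_conflicts(pins: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return names pinned to more than one version."""
    seen: dict[str, set[str]] = defaultdict(set)
    for name, version in pins:
        seen[name].add(version)
    return {name: vs for name, vs in seen.items() if len(vs) > 1}

pins = [("requests", "2.31.0"), ("idna", "3.4"), ("idna", "3.6")]
print(find_conflicts(pins))  # → {'idna': {'3.4', '3.6'}}
```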

As I said earlier, I’m a fan of having a set of files that are allowed to be installed, and then a graph of packages to install (using only the files in that set). That allows a “dumb” tool to cache all the files that may be needed and then an installer can work offline/air-gapped (including cross-platform), and also (in my opinion) makes auditing easier, while still preserving all the information.

But yes, the two formats can be converted without loss for the purposes of installation. (The graph format preserves more meta-information than the set format.)


  1. I’m assuming that we don’t want error messages about conflicting dependencies to be deferred until install time… ↩︎

2 Likes

I find having the edges in the lockfile is invaluable for understanding the lockfile, figuring out why a dependency was included and why at that version. Case in point: pip-compile adds the graph as # via marker annotations. Without this information, you need to either grep through the venv METADATA (if you know that this exists) or manually check each package.
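For readers who haven't seen it, pip-compile's annotated output looks roughly like this (package versions are illustrative):

```
certifi==2023.7.22
    # via requests
idna==3.6
    # via requests
requests==2.31.0
    # via -r requirements.in
```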

As for complexity, from the perspective of the installer both flat traversal and graph traversal are both so simple compared to everything else that an installer needs to do that the difference doesn’t matter. For reference, this is the optimized logic for converting the graph to flat markers in uv: uv/crates/uv-resolver/src/graph_ops.rs at b738b359103ae016d3746402e2b2f07d76a8d420 · astral-sh/uv · GitHub, while the PEP 440 and PEP 508 implementations alone (a requirement for evaluating markers) are 6k and 9k lines respectively.

5 Likes

The complexity of the graph model isn’t so much of a problem for installers as it is for other consumers, which don’t have install code to cope with - human readers and auditing tools, as well as any other “interesting” uses that people come up with.

I’m not arguing against the graph representation - I think it may end up being the correct choice. But I do want to ensure that we look at the full picture: lockers and installers are not the only components involved here, even if they are the main ones.

2 Likes

Should a theoretical pip freeze --lockfile be a consideration for this discussion? (in that producing a graph might not be trivial) or would it translate into a graph with no edges, since everything is unconditional anyways?

Also it seems like it shouldn’t really be important to encode locker-specific concerns into the format, because the aim here isn’t intercompatibility between lockers on the same lockfile, it’s to unify the parsing of these files at installation time, right?

For example, entering the graph arbitrarily for uv workspaces: it’s a uv-specific concept, and the lockfile only needs to provide tool-specific data locations so that uv can record any info it needs to support that feature. But it feels less important that some other installer can distinguish that info beyond selecting for extras or eventually dependency groups.

I would not expect pip to gain a --lockfile argument to the freeze command. On the other hand, introspecting an environment to produce a lockfile that reproduces that exact environment is a reasonable use case, and I’d support someone writing a standalone utility to do it.

Having said that, the content of an environment doesn’t contain enough information to know exactly which distribution files are needed to build it. At least, not unless PEP 710 covers that use case, which it might (it’s not a goal of that PEP to support locking, though).

I did experiment with a standalone utility to convert the pip install --report output into a lockfile. But I got discouraged when everyone started wanting multi-platform lockfiles, as that’s not something the pip output would ever support.

There has been a lot of discussion about what lockers and workflow management tools need, but much less about installer use cases. Whether entering the graph arbitrarily is important to include in the standard isn’t just about whether uv needs it (as you say, that can be handled by tool-specific data). It’s whether there’s a use case for supplying a consumer with a lockfile and an instruction to “install what’s in here starting at X”, where you don’t control what installer the user has available. The canonical example of this is cloud providers letting you ship a lockfile to install your app.

I don’t think we’ve had any indication so far that anyone needs arbitrary installers to support uv’s “enter the graph at any point” feature. That doesn’t mean the need isn’t there - just that no-one has asked yet.

So I think the main distinction here is whether the markers are written on the nodes, or the edges. Is that correct?

To clarify: in uv, we store both the raw metadata (requires-dist, mostly for debugging and cache invalidation for mutable dependency sources like local directories) and the list of resolved dependencies (edges in the graph) with their markers. When we install, we traverse the graph, filtering edges out whose markers don’t match the current environment. So yes: the TOML representation is a flat list of nodes, but the install operation is performed by traversing edges rather than via a linear scan.
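That traversal can be sketched as follows (illustrative Python, not uv's actual code; PEP 508 markers are simplified to plain predicates over an environment dict):

```python
# Sketch of the install traversal described above: edges carry markers,
# and we keep only the edges whose marker matches the current environment.
# A real implementation would evaluate PEP 508 marker expressions.

def packages_to_install(edges, root, env):
    """Walk the graph from *root*, keeping only edges whose marker matches."""
    keep = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for src, dst, marker in edges:
            if src == node and marker(env) and dst not in keep:
                keep.add(dst)
                stack.append(dst)
    return keep

always = lambda env: True
edges = [
    ("app", "requests", always),
    ("requests", "colorama", lambda env: env["sys_platform"] == "win32"),
]
print(sorted(packages_to_install(edges, "app", {"sys_platform": "linux"})))
# → ['app', 'requests']
```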

If uv were to write a marker on a node, it wouldn’t be “the set of platforms on which this package is installed”, it would be “the superset of platforms on which this package could be installed”. For example:

  • Root A could depend on anyio with sys_platform == 'win32'
  • Root B could depend on anyio with sys_platform == 'darwin'
  • So the marker on anyio would be sys_platform == 'win32' or sys_platform == 'darwin'
  • But it’s not true that anyio would always be installed on win32.

I think it is possible to do arbitrary subsetting without writing the markers on the resolved edges (and just depend on requires-dist being present). In the example above, if you install root B, you’d see anyio ; sys_platform == 'darwin'. If you’re on Darwin, you’d “keep” that dependency. Then you’d look for the anyio node in the graph that’s compatible with the current platform (which would have sys_platform == 'win32' or sys_platform == 'darwin'), and use that. (Note that there can be many anyio nodes in the graph at different versions, so you’re not just looking for “the anyio node”.) This relies on the assumption that you can only have a single version of a given package in the resolution for a single platform.
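A minimal sketch of that filter-then-traverse idea (illustrative Python, not a real implementation; node markers are simplified to predicates, and the single-version-per-platform assumption is what makes the by-name lookup after phase 1 unambiguous):

```python
# Phase 1: drop nodes whose superset node-marker excludes this platform.
# Phase 2: ordinary traversal via requires-dist over the filtered graph.
# All data shapes here are hypothetical.

def subset(nodes, root, env):
    """nodes: list of (name, version, node_marker, [(dep_name, dep_marker)])."""
    # After filtering, names are unique by the single-version-per-platform
    # assumption, so a plain name -> node dict is safe.
    live = {n: (v, reqs) for n, v, marker, reqs in nodes if marker(env)}
    keep, stack = {root}, [root]
    while stack:
        _, reqs = live[stack.pop()]
        for dep, dep_marker in reqs:
            if dep in live and dep_marker(env) and dep not in keep:
                keep.add(dep)
                stack.append(dep)
    return {(n, live[n][0]) for n in keep}

always = lambda env: True
darwin = lambda env: env["sys_platform"] == "darwin"
win = lambda env: env["sys_platform"] == "win32"
nodes = [
    ("root-b", "1.0", always, [("anyio", darwin)]),
    # Two anyio versions locked for different platforms:
    ("anyio", "4.0", darwin, []),
    ("anyio", "3.7", win, []),
]
print(sorted(subset(nodes, "root-b", {"sys_platform": "darwin"})))
# → [('anyio', '4.0'), ('root-b', '1.0')]
```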

I would need to do a lot more thinking and testing to confirm that. Though candidly I like it way less than just writing markers to the resolved edges. In my opinion, it actually makes the graph harder to follow, harder to resolve (I think you’d first need to filter the graph by node markers, then do a graph traversal), and harder to validate.

I can come up with more examples if helpful.

I may be misunderstanding the criteria but for what it’s worth, I think Hatch has also expressed a desire to support workspaces. Pex also supports this and so presumably it would be considered a need for that tool.

I sure hope so! That’s what Pex has done for a few years or so. It does run under the even tighter constraint that there can only be a single version of a given package in the lock file period - no fork support - but for the purposes of subsetting for this interpreter here now, that distinction doesn’t matter.

If the uv and pdm cases are good examples, I think Pex doesn’t care at this point, as long as Requires-Dist is included in the lock, whether or not the “flat” / “set” / markers-on-a-node data is added on top.

That’s a different aspect. Hatch, pex and uv could all support this via tool-specific data. That’s suboptimal in the sense that you’re replicating work, but it’s not a matter of interoperability, in the sense that uv wouldn’t (routinely) install from a lockfile produced by hatch.

What I’m talking about is a tool that only does installs, not workflow. For example, pip, but maybe just a custom tool built on top of installer. The question is whether anyone would need to ask such a tool to install just root X from a full uv lockfile (where roots like this are a uv-specific concept, so how would the UI for requesting it even look?)

I understand that distinction, but the distinction is also moot if the Requires-Dist info is simply not thrown away. If it survives in the lock format, like it does for pdm, pip can act as a flat installer, and Pex can subset, ignoring the redundant flat marker data already present in a Requires-Dist walk.

I think @konstin / @mikeshardmind had it right and the flat node marker data is really only a useful affordance for humans reviewing a lock diff in traditional text file diff format. You can reason at a glance locally. It may be that affordance is important so we add that extra denormalized data. If it’s additive though, a tool is free to ignore it and use the original normalized data IIUC.

2 Likes

The capability that’s being described, though, is being able to install any node in the graph (where install means, “install a package and its locked dependencies”), which isn’t uv-specific.

1 Like

I agree that we can get what we want in uv regardless of whether the markers are written on the nodes or on the edges (I think the latter assumes that you’re writing the edges to the nodes, and not just requires-dist).

We can always write the “resolved edges” as tool-specific metadata if needed.

What I would like to see when ready, though, is the full proposed installer algorithm, including support for extras, based on writing the resolved markers to each node.

4 Likes

Just FYI I don’t care much (I think, this is hard to grok with limited time :sweat_smile:) about this particular point because Hatch will have a separate lock file for every environment.

I think this is a good point. The discussion has moved a long way from being something I can link back to real world use cases (maybe because I’m not familiar with the relevant features of uv). With that in mind, I think it would be really helpful to ground this back in a real world use case. I’m going to stick with uv, as it’s clearly the tool with the strongest interest in this feature, but I’m going to use a non-uv consumer, as what matters here (i.e., in the context of a standard) is interoperability.

So let’s suppose someone has a uv project, and they are using whatever uv feature it is that involves the need for “entering the graph at an arbitrary point”. They create a lockfile for their project, and send it to me. I’m a hosting provider who offers support for Python standard lockfiles. Under the hood, I probably use pip to install them, but I don’t document that - I have a front end config screen that users can use to enter the name of the lockfile to install (and maybe some additional parameters, but I’m restricting them to “well known” things that have a clear definition based on the lockfile spec).

What does the application developer ask me to do, as the hosting provider, in order to install and run their application?

I know this gets very much into UI questions, but the key here is that we’re clear what user-level concepts the spec defines. Obviously, it defines a “lockfile” that can be “installed”. And with multi-environment lockfiles, the spec introduces a concept of some sort of environment definition. So far, so good.

But this latest subthread was triggered by the following comment:

And as far as I know, no-one has clearly articulated what a “root” is. Given that the only top level object in the current lockfile spec is a package, then the only meaning for “root” that I can see is “package”. And that seems weird, because if I create a lockfile for a project that just depends on requests, what possible use is there for saying “Install the lockfile starting from certifi”? Furthermore, I don’t see how extras fit in here - packages are identified by name, the extra isn’t included (and there’s no guarantee that the lockfile will contain all the needed packages for an extra that isn’t part of the original locking request). Could I install the above lockfile starting from idna[all] (note that requests depends on idna, but not on idna[all])?

All of this has got very abstract, and as I said, I think we need to go back to actual use cases. And in particular, use cases that emphasise interoperability, where the producer and consumer are different. Otherwise we risk designing a standard that tries to be a superset of every feature of every workflow management tool.

But Hatch will need to support this feature if it’s added, because the whole point of this being a standard is that Hatch will need to be able to read and install standard lockfiles - which may well have come from uv and hence could use all of these features!

In particular, are you saying that Hatch won’t support consuming multi-environment lockfiles? I can’t see anything in the PEP that suggests that a tool claiming to support the standard is allowed to do that. @brettcannon what’s your view here?

Actually, @brettcannon - with my PEP delegate hat on, I’d like to stress this last point. The reason for creating a standard is interoperability, and the more features we add, the more burden we add to tools that claim to support standard lockfiles. The PEP needs to be extremely clear on what features a tool is required to support, what is optional, and how interoperability will be handled in cases where different tools support different features (if that’s allowed).

3 Likes

And as far as I know, no-one has clearly articulated what a “root” is. Given that the only top level object in the current lockfile spec is a package, then the only meaning for “root” that I can see is “package”.

The point on extras here is good. You’re right – we don’t actually support installing “any package” because we don’t record their extras. Let me explain in a bit more detail using uv-specific concepts (ignoring the question of whether this is a supported use-case)…

In uv, we support workspaces, which are collections of local projects that can depend on one another and are locked together with a consistent set of dependencies. Every workspace has a root package, along with a bunch of member projects. (The root package itself is also a member). Members can depend on one another; the root can (but is not required to) depend on members; etc.

Users can run uv sync --package ${member} to sync their virtual environment to the dependencies for any workspace member, while sharing a lockfile so as to ensure that all members use a consistent resolution. When resolving, workspace members are resolved with all extras so as to enable activating arbitrary extras for any workspace member when installing. When installing, we look for the node in the graph corresponding to the specified member, and traverse the graph starting from there. Within the lockfile, there is no distinction between the root and any of the workspace members. The root is just another node in the graph. (We encode some information about the workspace (like the list of members), but this is for cache invalidation and not for correctness.) In this context, then, each “root” is a local project that is designated as a root at lock time; they are not arbitrary nodes in the graph.

If you run uv sync from a workspace member’s directory, we sync that member by default. So if you were just using uv, you could send your workspace to the hosting provider and ask them to sync from that member’s directory (i.e., given them a project and specify a current working directory). Of course, that’s an installer-specific abstraction that I wouldn’t expect to be part of the standard. But it would be enabled by the standard supporting multiple roots.

(There’s another way to support this from uv’s side, which is: create a separate lockfile for every member in the workspace. I think this is strictly worse from uv’s perspective (which is why we didn’t do it), since you no longer have a single source of truth, need to keep all of those files in-sync, and need mechanisms to ensure that every lockfile across the workspace is using a consistent set of resolved dependencies. But it would have the benefit that it doesn’t require the standard to support multiple entrypoints to the lockfile.)

Moving on from uv, to summarize my opinions:

  • Being able to designate multiple equivalent roots is a very useful property.
  • You can implement similar workflows without that property by managing multiple distinct lockfiles. (I dislike this, but I expressed that above.)
  • If “allow multiple entrypoints to the lockfile” is a supported use-case, I’d prefer to see the markers expressed on the edges rather than on the nodes. The marker on the node just isn’t very useful: it doesn’t tell you the set of platforms on which the package will be installed, and you can no longer do a linear scan of the lockfile anyway, you have to do some kind of graph traversal.

1 Like

Thanks. I was broadly aware of the uv concepts, but this doesn’t really address my main point, which is why do we need this in a standard? Specifically, in an interoperability standard - if these concepts don’t translate to other tools, why should we standardise them rather than just let you do what you want in uv?

There’s also the option of holding all the extra data you need in tool.uv. In fact, that’s precisely the point of the tool namespace, to hold data that is of use to one particular tool but which doesn’t translate well to other tools. I appreciate that using tool.uv might be less convenient for you compared to having first class support in the standard, but it’s less convenient for us to have to express highly uv-specific concepts in terms that apply to all users of the spec.

Can you demonstrate that for hatch, PDM, pip-tools or any other locking tool? I don’t dispute that it’s a useful property for the users uv is targeting, but are those use cases ones that we want to demand that the whole packaging ecosystem supports? Again my point here is that unless we make certain lockfile features optional, all tools need to support them. And if features are optional, interoperability is compromised.

Understood. But hatch seems content to use multiple lockfiles. Should we require them to handle this feature just so they can consume lockfiles that were produced by uv? The tool namespace is where you’re intended to put things that are specific to your tool, that aren’t expected to be supported by other tools.

I don’t have an opinion on this. I’m trying to establish (a) whether “allow multiple entrypoints to the lockfile” is something we should support, and (b) how we express that support in a usable, generic way.

To summarise my opinions here:

  • I don’t care about implementation details or what data structures end up in the lockfile.
  • I want the lockfile spec to be something that existing workflow tools can use with as little friction as possible.
  • I don’t want to claim we’re creating a lockfile standard if uv’s lockfiles can only be consumed by uv, or PDM’s can only be consumed by PDM, etc.
  • In particular, it’s critical for me that when handed a lockfile[1], no matter what tool produced it, a user should be able to create an environment from it using their tool of choice. This one’s pretty much a dealbreaker for me - both as PEP delegate and as a pip maintainer (because pip is likely to be the “tool of choice” for a lot of people).
  • I want the installer side to be as braindead simple as possible. That’s the “no resolve at install time” constraint, and we have that covered already.

Also, as a pip maintainer, I expect users to ask pip to support installing from a standard lockfile. I’m happy that the installation algorithm will be simple enough that this won’t cause us any problems. But I’m really struggling to work out what command line options we’ll need to support. pip install --lockfile pylock.toml is an obvious minimum. But how does the user specify a “root”? What is a “root”? Is it a package name? Is pkg[extra] valid? If so, we presumably don’t want to support the full requirement spec, so what is the syntax we support here? The PEP (and ultimately the lockfile spec) needs to answer these questions, or pip won’t be able to design a standards-compliant UI that supports this feature.


  1. I’m OK if the lockfile has to be created using a “only use portable features” flag of some form - it’s fine to support tool-specific features, but creating a lockfile for distribution should be easy and natural to do. And use of non-portable features like information in the tool section should be instantly recognisable by consumers, so they can report the problem. ↩︎

2 Likes

Really critical point(s), and they get back to the fact that the lockfile standard we’re discussing here is really attempting to do a few different things.

As an example, I actually don’t know the answer to this: do Hatch, PDM, etc. plan to support installing from lockfiles that aren’t produced by those tools? Or do they intend to use this format to facilitate interoperability with other installers, like pip?

3 Likes

To answer this concretely: for uv, at least given the scope that has been discussed this far, we can support creating an environment from a lockfile produced by any tool. But I doubt that we can reliably update that lockfile, since we may rely on tool-specific metadata in the lockfile to facilitate resolution, or something to that effect.