PEP 751: lock files (again)

charliermarsh · September 25, 2024, 12:52pm

I’m personally fine with the former. And even if Dependabot can’t put up version update PRs (or can only do so for certain package managers for which it implements support), there would already be value from the standard, since Dependabot could still give update notifications and do vulnerability analysis with read-only access to the lockfile IIUC.

sethmlarson · September 25, 2024, 2:15pm

From an SBOM perspective: being able to distinguish which dependencies are “direct” or “top-level” and which dependencies are transitive is useful. For example, someone remediating vulnerabilities in their application might prioritize those for direct dependencies. This information can be encoded in an SBOM using “describes” for the top-level packages and “depends-on” relationships to show dependency trees.

Whether that information is encoded with the “graph” option or some other mechanism (like listing top-level packages) would both work for this case, the graph option making the job slightly easier by not needing to install the packages to figure out dependency information.

sirosen · September 25, 2024, 5:31pm

Brett Cannon:

edmorley:

However, the full Dependabot functionality (as opposed to only security alerts about vulnerable packages) will require it being able to update the lockfile, rather than just needing to read it. If package managers have any tool specific config/state stored in pyproject.toml / elsewhere, that will presumably get out of sync with the lockfile for anything other than simplistic lockfile changes. And in fact, it seems the Dependabot use-case would still need to perform full package resolution given that the new version of a package could change the dependency graph? And as such, Dependabot will still probably need to support/run all of the individual package managers anyway?

What does Dependabot do when it updates a pyproject.toml file that has a [tool] section? Does it just leave it alone?

If this is a question about the present-day behavior (I think it is), the answer in my experience is “yes”.

Where I use Dependabot and Poetry, the behavior is that it will update the poetry.lock and won’t touch pyproject.toml. I wouldn’t be surprised if poetry update --lock <pkg>, a poetry command which has this effect, is part of the process, but it’s impossible to know merely by observing the behavior.

It would be great if we could hear from the Dependabot team (does anyone have a POC at GitHub who could maybe reach out?) about what they need. It’s reasonable to assume – absent any feedback to the contrary – that they will just invoke the relevant locking tool.

Having fields for some creation metadata is a good idea, but I suggest having more than the command which was used. The command is concrete, and the tool name, which is more abstract, would allow readers of this data to more consistently programmatically read locker configuration.

With pip-compile, you get the locking command as it was invoked, but you can set the CUSTOM_COMPILE_COMMAND to tweak this. So it’s common to have something like

CUSTOM_COMPILE_COMMAND='make freezedeps' pip-compile ...

in a makefile or tox.ini.

You then get a preamble on the generated requirements.txt of

#
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
#    make freezedeps
#

Note that pip-compile, being built on pip, is sensitive to which Python version was run, so it’s recording some of that information.

Dependabot will update pip-compile outputs, but it won’t touch that header comment. (So I suppose it makes it a lie if you apply those updates? )

Similarly, Poetry produces a comment line in poetry.lock, which reads

# This file is automatically @generated by Poetry 1.7.1 and should not be changed by hand.

(sub 1.7.1 for whatever version). It may jump up and down as different developers update the same lock, with different tool versions, and there are upsides and downsides to that behavior.

My gut instinct for what data would be useful:

name of what tool was used (abstract, “which tool is it?”)
locking command (concrete, “what was invoked?”)

Additional data, like the Python implementation and version used, could go into [tool] tables, but I think having the name of the tool, which probably matches the name of the [tool] subtable, is necessary in order for Dependabot and its ilk to choose which locker behavior to use. Having the locking command recorded would possibly lead to tools needing to parse that command, which I think is an undesirable outcome.

ofek · September 26, 2024, 5:26am

Yeah, I’ll reach out and ask them to comment today.

jeffwidman · September 26, 2024, 5:35pm

Hi there, I work on the Dependabot team at GitHub, and co-maintain several open source Python projects in my (limited) spare time.

I/We are very interested in this conversation–internally we’ve discussed multiple times reaching out to the Python community to try to make it easier for Dependabot to understand/update Python dependencies… the lack of a standardized lockfile format makes life difficult for us.

However, we’re on an extremely tight ship deadline of Oct 7th for a brand new Dependabot feature that we’ll be revealing at GitHub Universe.

Would it be okay if I set a cal reminder to circle back on this discussion after that? That way I have time to read through the thread and give thoughtful answers/opinions. Or do you need answers ASAP?

In the meantime a few cliff notes:

Philosophically, we view Dependabot as the “platform for running native package managers” rather than “re-implementing native package manager functionality in Ruby”. So in general we shell out to the native package manager to perform the actual update logic.
That abstraction breaks because we support restrictions on allowed versions (ie, don’t allow bump to vulnerable versions, and users can specify restrictions in their dependabot.yml file), and so we have to use Ruby code to introspect the resulting package version files. A lockfile would be very useful because we could easily inspect that to say “native package manager generated a candidate set that isn’t allowed”.
The actual logic controlling Python updates is entirely open source so you can inspect it if you have questions.
The file layout can be a little confusing, so it’s often helpful to first review the high-level architecture/flow diagram.

Happy to clarify if there are more questions around “How does Dependabot work?”

Otherwise, I can weigh in more with design thoughts/opinions after Oct 7th.

jeffwidman · September 26, 2024, 5:57pm

A Python lockfile would also be very useful to the Dependency Graph features (example) in GitHub. I’ve alerted the team internally that works on that feature, so they may swing by to comment as well.

brettcannon · September 26, 2024, 7:02pm

Yep, that’s works!

And you don’t need to read through the whole thread; it’s rather long . I think what we all would be after is what information would you want a lock file to record to make it work well with Dependabot? For instance, the idea has been floated to record the name of the tool and the command run to create the lock file. Would that be enough along w/ having a standard to begin with?

ncoghlan · September 29, 2024, 2:26am

This made me realise I actually have a CVE process question, both in general (@sethmlarson?) and for Dependabot in particular:

when CVEs are declared against a library, can they be marked as platform specific, or is the entire version always flagged, even when a vulnerability is platform specific?
if CVEs themselves can be marked as platform specific, do Dependabot scans (and other vulnerability checkers) actually use that info?

If the answer to both questions is yes, then this is another case where the flattened marker format would be preferable.

My other thought on that front was that we’ve been talking about recording the markers on edges or nodes, but we’ve only been talking about the outgoing edges where the marker applies to the far end of the edge.

If we instead (or additionally) recorded conditional markers on incoming edges, then the markers would always appear in the node they affect, regardless of whether they were conditional or not.

This approach would affect the dependents array on each node: rather than being an array of strings, it would also allow name/marker inline tables for conditional dependencies.

While technically redundant, the marker could also still be recorded in the dependencies table for the origin node, but then we’d have to define whether tools were required to check for inconsistencies (and what to do if they were found).

There would still be other details to work out (such as whether marker on a node should reflect all environments where the node might be installed, or just those where that node is a top level input to the lock resolution), but linking markers to incoming edges at the dependency end seems to me like it would eliminate the main downside of marking edges instead of nodes.

sethmlarson · September 30, 2024, 12:40pm

There are mechanisms for programmatically declaring platform-specific vulnerabilities, but tooling today doesn’t use that data by default. Optional metadata likely varies wildly in consistency so can’t be relied on… Maybe something our hosted OSV database could handle more easily?

konstin · October 2, 2024, 10:06am

I want to describe the two use cases i see for multiple entrypoints in the lockfile: workspaces and development dependencies.

I call it a workspace if someone has a git repository with multiple packages that they want to manage together. In other ecosystems such as javascript, Rust and PHP, large open source projects with a monorepo are generally split into packages, while in Python, this is rare (review). The current workspace support in Python are -e ./packages/member includes in requirements.in: They allow locking and editing multiple packages together in the same requirements.txt. With first-class workspace support, we get the ability to install individual or all workspace members while keeping the same set of consistent dependencies. At least currently, workspace member inclusions or exclusions can’t be described with marker edges.

Development dependencies/dependency groups (PEP 735) allow defining optional sets of dependencies that are similar to extras - they can be turned off and on by name - but unlike extras, they do not get published with the package and they are not bound to having a package in the first place. To ensure consistency between different developers’ machines and to avoid CI failures, these deps are also frozen in the lockfile. Since they are not published, they don’t have a marker expression to select for, instead, they are modeled as having a root for the package a root or multiple roots for the package with dev deps/dep groups, and cli flags that determine which roots we select.

I’m not trying to make the point that PEP 751 critically needs multiple entrypoints, but i want to describe why uv has a dependency graph that can’t be sufficiently described with a single root and markers on the edges.

brettcannon · October 4, 2024, 9:23pm

Just a quick update that I am starting a rewrite of PEP 751 taking a graph approach, keeping the collected data on point/focused, and leaning in a bit on the idea that installers will be run more than lockers (i.e. parsing out more structure into the TOML under the assumption that TOML parsers will be optimized more than any parser we have for our dependency specifier DSL).

But while I’m writing it I currently have 2 open issues we can start discussing now.

Do we want lock files to be self-contained?

I ask this because if you view the lock file as recording a dependency graph that could have multiple roots, how do you know what root you may want? With a pyproject.toml nearby you have that written down. Same goes for Pants or any other tool that views the lock file as an ancillary file. But if you want to be able to hand a lock file and just a lock file to someone, that gets a bit trickier. You could still have an installer require you specify the root somehow manually, but my guess is people wouldn’t want to make people guess at that.

So in a 1.0 of this do we want to punt on this and see if it’s necessary, to do people think it’s an issue that needs solving now? I am leaning toward not worrying about it, personally.

How do you know that an install will fail?

Right now the PEP is going to say you need to record all the dependency requirements for a package. By doing that you can see whether those requirements can be met and thus have a successful install (i.e. are the edges dangling?). But if people find listing all dependency requirements noisy, then we would need some other solution for this as a way to tell what conditions must be met if an install actually succeeded instead of waiting until runtime to see if any ImportError triggers. We could bring back specifying the conditions required to use the lock file as a possible solution.

I will say what I currently have in the PEP works nicely for a pip lock command since it can just read from METADATA for each package in the environment and just record it in the lock file.

mikeshardmind · October 4, 2024, 11:15pm

While it sounds like a nice property in theory, realistically, all of the places people use lock files should be sitting in the same tree as the project sourcing the lock file.

+1 on requiring recording this, but it’s probably worth clearing up that even if all software dependencies resolve, there are cases where import or use can fail, and that’s not a shortcoming of this pep but that we may need more markers with time for things like “has a GPU that can be targetted with sycl”, any failure should not be because of a missing declared dependency though.

In that vein of thought, pushing requirement markers to child nodes in the graph (ie. similarly to current pdm lockfiles) should make this easier from the side of consuming the lockfile, as this turns consumption into a linear^[1] scan that every needed dependency could be satisfied in the current environment.

two passes at most, in a node list that isn’t ordered to optimize consumption, which such an order may have other desirable properties, such as a lexically ordered one being easier for both human review and git diffs on additions not shuffling dep order ↩︎

anon62990384 · October 4, 2024, 11:41pm

I don’t have a strong opinion on this one, but to correct the record: In the Pants - really Pex - case, the lockfile contains all the information in one spot. The input requirements to the lockfile are stored in the lock file as a list (for example: pex/testing/data/locks/issue-2415.lock.json at 13826f15bf7f18aa949d891c1561a66fa5157cc5 · pex-tool/pex · GitHub). This was the original “root” if you want to think of it that way. When the lockfile is used and no requirements are specified, the original input requirements (contained in the lockfile) are assumed. Any requirements whatsoever can be requested explicitly from the lock of course, and those “roots” will or will not be satisfiable via the lock.

sirosen · October 5, 2024, 3:14am

The user facing experience that you’re requiring here is that attempting to install from a lock should fail under certain conditions.

Something about this is hard for me to follow – is the scenario that the installation completed, marker constraints were obeyed, and the resulting environment is missing one or more required packages?
If that’s the case, recording sufficient information to detect the failure immediately sounds like a strong plus, if not a requirement.

The requirements for each locked package version seem easy to record, and easy to process. Something more abstract may be less reliable.

zanie · October 5, 2024, 3:30pm

uv currently requires the pyproject.toml alongside the uv.lock file which has caused some confusion, i.e., in Docker builds. However, I think now that we have documentation with clear examples around that use-case we haven’t really gotten many complaints about requiring the pyproject.toml alongside the file. One reason to separate them is to allow installation from the lockfile without invalidating a Docker layer cache due to a pyproject.toml change (which could be entirely unrelated). However, in most cases, the pyproject.toml is needed for tool settings anyway so it’s a pretty niche use-case to want them to be usable separately. I think you should avoid trying to address this now.

pf_moore · October 5, 2024, 4:08pm

I personally think that the “send a lockfile to a hosting provider” use case is better served if the only thing you need to send is the lockfile. But I’m not against the idea that we defer the question for now - as long as the format allows tools to create a lockfile that can be installed with no other information needed, then I don’t think we need to make it mandatory.

Although thinking about it, I’d want the pip UI for lockfiles to be something like pip install --lockfile pylock.toml. Does that mean I’m in the “we must solve this now” camp, or do you expect that to be possible (maybe at the expense of pip having to say “this lockfile isn’t installable” for some lockfiles)?

Let me put it another way - what do you expect the pip command to install a lockfile to look like?

zanie · October 5, 2024, 4:35pm

I think the major use-case for various lockfile roots is local development in which the project source code should be around anyway. In this case, the lockfile will usually reference the project itself so it seems dubious that you wouldn’t have a pyproject.toml around.

I do think it’s important for lockfiles with a single root to be installable without additional files. This is critical for replacing the requirements.txt use-case where you define dependencies without a requirement on the project itself.

If you’re only shipping a lockfile to a provider, you’ll need to have packaged and published your project somewhere — in this case, I don’t think know of a use-case for multiple roots. I think you’d need to generate a separate lockfile that points to your published package anyway (since the one you’re using for development wouldn’t) which provides an opportunity for you to create a lockfile with a single root.

I guess it’s still an open question if the lockfile needs to define a default root for this purpose or if installers should bail if they encounter a lockfile with multiple roots.

sirosen · October 5, 2024, 11:34pm

This requires that there is some association, even if it’s implicit, between a lockfile and pyproject, doesn’t it?

I think we can foresee the following scenario:

$ ${lock_cmd} foo/pyproject.toml > bar/foo.pylock
Generated lockfile for 'foo'!

$ ${install_cmd} bar/foo.pylock
Error: bar/foo.pylock implicitly relies upon a pyproject.toml , but one was not found!

The PEP explicitly takes no stance on where lockfiles are located in a project directory.
But that means that if some tools decide that “the directory containing the lockfile should contain pyproject.toml” and other tools decide “the working directory should contain pyproject.toml”, it could be an issue.

Allowing for the lock to go anywhere but also allowing it to be reliant on project metadata leads naturally to a question of “how can users instruct an installer about that relationship?”

I’d be slightly concerned if usages for different installers diverge too much. If I need to associate a lock with pyproject metadata, it’s in scope for this PEP to discuss how that’s expected or recommended to happen.

aragilar · October 6, 2024, 12:51am

The majority of the requirements.txt files I have lying around have no pyproject.toml (or setup.py for that matter). This is because usually I’m creating an environment to run code interactively in, rather than installing per-se. I feel this is a very common workflow on the scientific side of the ecosystem, so I would hope such workflows would be supported at some point by the new lock file format (even if it’s not in the first version).

ncoghlan · October 6, 2024, 3:10pm

Specifying a top level default-install option feels like it wouldn’t impose a major burden for lockers, while making the expected behaviour of installers much easier to predict.

The field could be optional, with omitting it meaning that installers were expected to error out saying a starting point in the installation graph needed to be specified.

If there was a way for lockers to list a set of defined starting points, the installer UX could be even more consistent across tools.

My main point of concern is with lock files for multi-project workspaces and optional dependency groups, where it would be nice if their complex lock files could be used directly for deployment with simpler installation tools, rather than having to export a simpler lock file variant (similar to the way people export requirements.txt files now).

Edit: @zanie raises a good point that “dev” lock files and “deployment” lock files are likely to have another difference anyway: how they reference the project itself (often editable for dev, but versioned like everything else for deployment). So maybe my dream of being able to use dev lockfiles directly for deployment is a foolish one, and we should be designing on the assumption that “lock for dev” and “lock for deployment” are intrinsically different requests to make of a locking tool.

Edit 2: if we did make that distinction, then “lock for deployment” could imply not only “single installation root” but also “no tool table entries”.